[AusNOG] SPAM-MED: Re: Vocus international service outage

Matt Perkins matt at spectrum.com.au
Wed Aug 27 09:47:11 EST 2014


I appreciate James's honesty; it would make me want to buy from Vocus, 
not the other way around. If this were another big wholesale provider, 
which will remain nameless, they would still be denying the whole thing 
ever happened and asking the customer to reset their router.

What I am interested to know is what technically happened at the DC 
that fried stuff. I ask out of concern: if a big US data center can have 
something like this happen, perhaps others have the same vulnerability 
(me even :)

I have only ever heard of things like that happening when someone goes 
and removes a neutral from a 3 phase power system on a UPS, or when a 
changeover switch doesn't overlap the neutral before switching. I guess 
if someone touched the 3 phase buses together that might do it, but I 
doubt you would break line cards, just power supplies. Perhaps a floating 
telecommunications reference conductor. It could be a million things. 
Any more info would be great.
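
For anyone wondering why a lost neutral is so nasty, here is a rough 
sketch of the arithmetic. It is not a description of what happened at 
the DC, only an illustration of the open-neutral case using Millman's 
theorem. The 120 V phase-to-neutral figure, the purely resistive loads 
and the function name are all my own assumptions.

```python
import cmath
import math


def floating_neutral_load_voltages(v_phase, load_ohms):
    """Voltage magnitude across each wye-connected load with the neutral open.

    v_phase   -- nominal phase-to-neutral source voltage (assumed value)
    load_ohms -- three resistive loads, one per phase (assumed values)
    """
    # Ideal, balanced three-phase source: phases 120 degrees apart.
    angles_deg = (0.0, -120.0, 120.0)
    sources = [cmath.rect(v_phase, math.radians(a)) for a in angles_deg]

    # Millman's theorem: the floating star point settles at the
    # admittance-weighted average of the three source voltages.
    admittances = [1.0 / r for r in load_ohms]
    v_star = sum(v * y for v, y in zip(sources, admittances)) / sum(admittances)

    # Each load sees its source voltage minus the shifted star point.
    return [abs(v - v_star) for v in sources]


if __name__ == "__main__":
    for loads in ((10.0, 10.0, 10.0), (10.0, 10.0, 1000.0)):
        volts = floating_neutral_load_voltages(120.0, loads)
        print(loads, [round(v, 1) for v in volts])
```

With balanced loads every phase stays at nominal voltage. With 10, 10 
and 1000 ohm loads the lightly loaded phase rises to roughly 179 V, 
which is the sort of thing that cooks power supplies.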

Matt


  On 27/08/2014 8:21 am, James Spenceley wrote:
> Hi Wolfgang,
>
> RFP will be out today.
>
> The long and short of it is...
>
> We had an issue caused by the DC provider which took out our primary 
> pop (power surge or something similar which killed all the RPs and 
> line cards). In one of those one in a million chances the big upgrade 
> we'd completed across all our US sites left us with a code on the 
> secondary site routers that had a bug which has now been diagnosed by 
> Cisco (hence the delay in RFP), this bug only happens when large 
> traffic loads (and routes) hit the RP from a steady state and the bug 
> causes intermittent black holing of traffic passing that box.
>
> As a few people said, stuff breaks. This was one of the nightmare 
> events where we had two issues in parallel, one an undocumented Cisco 
> bug that ironically only shows up when the other site fails. As I 
> said, it's a pretty rare chain of events that had to happen to cause 
> the outage, and it took quite a bit of time for Cisco to 
> recreate/identify the bug (the other issue was complete hardware 
> failure, which was clearly easy to diagnose). That Cisco delay is what 
> has caused the delay in the RFP.
>
> James
>
>
> On 26 Aug 2014, at 23:58, "Wolfgang Nagele (AusRegistry)" 
> <wolfgang.nagele at ausregistry.com.au 
> <mailto:wolfgang.nagele at ausregistry.com.au>> wrote:
>
>> Hi,
>>
>> This is your choice. Not mine and I am sure there are others here 
>> that do not agree with your idea of brushing issues like this off the 
>> table. If this were Telstra you would be striking a different tone.
>>
>> It's irrelevant if you have multiple upstreams - there are minimum 
>> requirements that we put on suppliers in the year 2014. Having Vocus 
>> suffer a 4 hour degradation due to a single DC fault in SJC is not 
>> something that we accept in 2014. There is redundancy via LA (and 
>> Singapore) which didn't work. I would like to know why - not why it 
>> didn't fail over automatically there can be many reasons. Vocus 
>> engineers should have been able to re-route via LA and completely 
>> take SJC out of the equation. There are questions to be answered 
>> here. As well as delays in notifications - we have not received a 
>> notification for over an hour. Again not acceptable for an incident 
>> of that magnitude.
>>
>> We all learn based on mistakes - ignoring them gains nothing.
>>
>> As for multiple upstreams, yes we have them and yes we routed around 
>> the issue. To me that's irrelevant to the issue at hand. If you are 
>> happy with a supplier that has 4 hour degradation due to a single DC 
>> fault on its main international backhaul - yes - move along, nothing 
>> to see here.
>>
>> Cheers,
>> Wolfgang
>>
>> On 8/26/14, 11:40 PM, "Skeeve Stevens" 
>> <skeeve+ausnog at eintellegonetworks.com 
>> <mailto:skeeve+ausnog at eintellegonetworks.com>> wrote:
>>
>>     Yup.. move along, nothing to see here.
>>
>>     Once an outage is fixed, those who dwell on the cause that they
>>     can do nothing about, are focusing in the wrong place.
>>
>>     If your only transit was through a single upstream, that is where
>>     you should be focusing, not the provider.
>>
>>
>>     ...Skeeve
>>
>>     *Skeeve Stevens - *eintellego Networks Pty Ltd
>>     skeeve at eintellegonetworks.com
>>     <mailto:skeeve at eintellegonetworks.com> ;
>>     www.eintellegonetworks.com <http://www.eintellegonetworks.com/>
>>
>>     Phone: 1300 239 038; Cell +61 (0)414 753 383 ; skype://skeeve
>>
>>     facebook.com/eintellegonetworks
>>     <http://facebook.com/eintellegonetworks> ; linkedin.com/in/skeeve
>>     <http://linkedin.com/in/skeeve>
>>
>>     twitter.com/theispguy <http://twitter.com/theispguy> ; blog:
>>     www.theispguy.com <http://www.theispguy.com/>
>>
>>
>>     The Experts Who The Experts Call
>>
>>     Juniper - Cisco - Cloud- Consulting- IPv4 Brokering
>>
>>
>>     On 26 August 2014 23:19, Kristoffer Sheather @ CloudCentral
>>     <kristoffer.sheather at cloudcentral.com.au
>>     <mailto:kristoffer.sheather at cloudcentral.com.au>> wrote:
>>
>>         Shit broke, they fixed.
>>         <EOM />
>>         ------------------------------------------------------------------------
>>         *From*: "Wolfgang Nagele (AusRegistry)"
>>         <wolfgang.nagele at ausregistry.com.au
>>         <mailto:wolfgang.nagele at ausregistry.com.au>>
>>         *Sent*: Tuesday, August 26, 2014 11:17 PM
>>         *To*: "James Spenceley" <james at iroute.org
>>         <mailto:james at iroute.org>>
>>         *Cc*: "Ausnog at ausnog.net <mailto:Ausnog at ausnog.net>"
>>         <ausnog at ausnog.net <mailto:ausnog at ausnog.net>>
>>         *Subject*: SPAM-MED: Re: [AusNOG] Vocus international service
>>         outage
>>         Hi James,
>>         Still waiting for the RfO on this whole thing with the
>>         details. Neither seen one here nor as a follow-up to customer
>>         notifications.
>>         Cheers,
>>         Wolfgang
>>         On 8/24/14, 12:38 AM, "Wolfgang Nagele (AusRegistry)"
>>         <wolfgang.nagele at ausregistry.com.au
>>         <mailto:wolfgang.nagele at ausregistry.com.au>> wrote:
>>
>>             Hi James,
>>             Hmm - can understand that but would have expected that
>>             there is sufficient redundancy in the LA landing of your
>>             network. I would have expected that a re-route and taking
>>             SJC largely out of the equation would be possible.
>>             Surprised to say the least ...
>>             Cheers,
>>             Wolfgang
>>             On 8/24/14, 12:10 AM, "James Spenceley" <james at iroute.org
>>             <mailto:james at iroute.org>> wrote:
>>
>>                 Early mail is a power surge in a US DC has damaged
>>                 both core routers. Wouldn't surprise me if transport
>>                 from other providers out of that building will be
>>                 having similar issues.
>>                 Circuits are being moved directly to borders as we speak.
>>
>>
>>                 Sent from my iPhone
>>
>>                 On 24 Aug 2014, at 0:00, Jared Hirst
>>                 <jared.hirst at serversaustralia.com.au
>>                 <mailto:jared.hirst at serversaustralia.com.au>> wrote:
>>>                 WOW.... It has taken 2 hours to get remote hands to
>>>                 the DC with what seems to be a device with no
>>>                 redundancy?
>>>                 2014/08/23 13:55 UTC
>>>
>>>                 Engineers are currently awaiting remote support in
>>>                 the US. Links will be physically moved from the
>>>                 failed device in order to restore services on an
>>>                 alternate device.
>>>
>>>                 On Sat, Aug 23, 2014 at 11:29 PM, Andrew Yager
>>>                 <andrew at rwts.com.au <mailto:andrew at rwts.com.au>> wrote:
>>>
>>>                     [hijacking the thread...]
>>>                     They say there is a big rewrite coming on the
>>>                     way rpd and sampled interact in 14.2; and the
>>>                     slow convergence issues have been fixed in more
>>>                     releases than I care to remember right now...
>>>                     but people say they are pretty good in the
>>>                     12.3r6 train. Our MX80's are slated for upgrade
>>>                     to that at some stage in the next few months.
>>>                     Some noise about this on j-nsp again today.
>>>                     Andrew
>>>                     On 23 August 2014 23:23, Jonathan Thorpe
>>>                     <jthorpe at conexim.com.au
>>>                     <mailto:jthorpe at conexim.com.au>> wrote:
>>>
>>>                         True, but they otherwise work exceptionally
>>>                         well.
>>>
>>>                         I'm not sure what kind of PowerPC CPU is
>>>                         doing all the work on an MX80's RE, but I do
>>>                         sometimes wonder if the CPU on a <$40
>>>                         Raspberry Pi might be more up to the job :-P
>>>
>>>                         *From:*Tony Wicks [mailto:tony at wicks.co.nz
>>>                         <mailto:tony at wicks.co.nz>]
>>>                         *Sent:* Saturday, 23 August 2014 11:13 PM
>>>                         *To:* Jonathan Thorpe
>>>                         *Cc:* 'Ausnog at ausnog.net
>>>                         <mailto:Ausnog at ausnog.net>'
>>>                         *Subject:* RE: [AusNOG] Vocus international
>>>                         service outage
>>>
>>>                         Well, if you buy the big chassis boxes.....
>>>
>>>                         *From:*AusNOG
>>>                         [mailto:ausnog-bounces at lists.ausnog.net] *On
>>>                         Behalf Of *Jonathan Thorpe
>>>                         *Sent:* Sunday, 24 August 2014 1:10 a.m.
>>>                         *To:* Andrew Yager; Jared Hirst
>>>                         *Cc:* Ausnog at ausnog.net
>>>                         <mailto:Ausnog at ausnog.net>
>>>                         *Subject:* Re: [AusNOG] Vocus international
>>>                         service outage
>>>
>>>                         Glad I'm not the only one holding my breath
>>>                         on our MXs :)
>>>
>>>                         *From:*AusNOG
>>>                         [mailto:ausnog-bounces at lists.ausnog.net] *On
>>>                         Behalf Of *Andrew Yager
>>>                         *Sent:* Saturday, 23 August 2014 10:55 PM
>>>                         *To:* Jared Hirst
>>>                         *Cc:* Ausnog at ausnog.net
>>>                         <mailto:Ausnog at ausnog.net>
>>>                         *Subject:* Re: [AusNOG] Vocus international
>>>                         service outage
>>>
>>>                         We've done the same (about 20 minutes ago).
>>>
>>>                         Right now I hate how long Juniper MX's take
>>>                         to stabilise their routing table with
>>>                         sampling on.
>>>
>>>                         Andrew
>>>
>>>                         On 23 August 2014 22:41, Jared Hirst
>>>                         <jared.hirst at serversaustralia.com.au
>>>                         <mailto:jared.hirst at serversaustralia.com.au>> wrote:
>>>
>>>                             We have just turned Vocus off. Using
>>>                             other providers for now, as the flapping
>>>                             is causing it to go up and down.
>>>
>>>                             On Sat, Aug 23, 2014 at 10:37 PM, Daniel
>>>                             Watson <Daniel at glovine.com.au
>>>                             <mailto:Daniel at glovine.com.au>> wrote:
>>>
>>>                                 Indeed seeing some big drops in
>>>                                 gaming traffic at present, normally
>>>                                 we see above 60mbit on weekends at
>>>                                 evenings, but not even seeing 40mbit
>>>                                 at present :S
>>>
>>>                                 Regards,
>>>
>>>                                 Daniel Watson
>>>
>>>                                 Network Administrator / Network
>>>                                 Operations Manager
>>>
>>>                                 E Daniel at GloVine.com.au
>>>                                 <mailto:Daniel at GloVine.com.au>
>>>
>>>                                 W www.GloVine.com.au
>>>                                 <http://www.GloVine.com.au>
>>>
>>>                                 *From:*AusNOG
>>>                                 [mailto:ausnog-bounces at lists.ausnog.net
>>>                                 <mailto:ausnog-bounces at lists.ausnog.net>]
>>>                                 *On Behalf Of *Jared Hirst
>>>                                 *Sent:* Saturday, 23 August 2014
>>>                                 10:35 PM
>>>
>>>
>>>                                 *To:* Andrew Cox
>>>                                 *Cc:* Ausnog at ausnog.net
>>>                                 <mailto:Ausnog at ausnog.net>
>>>                                 *Subject:* Re: [AusNOG] Vocus
>>>                                 international service outage
>>>
>>>                                 Yeah we are seeing this! Everything
>>>                                 running via them is flapping, they
>>>                                 claim to have 'routed around it' but
>>>                                 that's not the case. Very
>>>                                 frustrating as it's been an hour and
>>>                                 no one there seems to know what's
>>>                                 going on....
>>>
>>>                                 On Sat, Aug 23, 2014 at 10:29 PM,
>>>                                 Andrew Cox <andrew.cox at bigair.net.au
>>>                                 <mailto:andrew.cox at bigair.net.au>>
>>>                                 wrote:
>>>
>>>                                     Hey All,
>>>
>>>                                     Just saw the dashboards light up
>>>                                     with connectivity issues
>>>                                     internationally for Vocus
>>>                                     services and thought I'd make
>>>                                     others aware.
>>>
>>>                                     Vocus outage report is saying:
>>>                                     "core network link between 59
>>>                                     Doody Street, Alexandria and 55
>>>                                     South Market Street, San Jose
>>>                                     has failed" which hopefully
>>>                                     isn't a Southern Cross fault!
>>>
>>>                                     Anyone else seeing this or have
>>>                                     more info?
>>>
>>>                                     Cheers,
>>>
>>>                                     Andrew
>>>
>>>
>>>                                     _______________________________________________
>>>                                     AusNOG mailing list
>>>                                     AusNOG at lists.ausnog.net
>>>                                     <mailto:AusNOG at lists.ausnog.net>
>>>                                     http://lists.ausnog.net/mailman/listinfo/ausnog
>>>
>>>
>>>                                 -- 
>>>
>>>
>>>
>>>
>>>                         --
>>>                         *Andrew Yager, Managing Director* /MACS
>>>                         (Snr) CP BCompSc MCP/
>>>                         Real World Technology Solutions Pty Ltd - IT
>>>                         people you can trust
>>>                         ph: 1300 798 718 <tel:1300%20798%20718> or
>>>                         (02) 9037 0500 <tel:%2802%29%209037%200500>
>>>                         fax: (02) 9037 0591 <tel:%2802%29%209037%200591>
>>>                         http://www.rwts.com.au/
>>>
>>>
>>>
>>>                     --
>>>                     *Andrew Yager, Managing Director* /MACS (Snr) CP
>>>                     BCompSc MCP/
>>>                     Real World Technology Solutions Pty Ltd - IT
>>>                     people you can trust
>>>                     ph: 1300 798 718 <tel:1300%20798%20718> or (02)
>>>                     9037 0500 <tel:%2802%29%209037%200500>
>>>                     fax: (02) 9037 0591 <tel:%2802%29%209037%200591>
>>>                     http://www.rwts.com.au/
>>>
>>>
>>
>>
>>
>>
>
>


-- 
/* Matt Perkins
         Direct 1300 137 379     Spectrum Networks Ptd. Ltd.
         Office 1300 133 299     matt at spectrum.com.au
                                 Level 6, 350 George Street Sydney 2000
         PGP/GNUPG Public Key can be found at  http://pgp.mit.edu
*/
