[AusNOG] SPAM-MED: Re: Vocus international service outage

James Spenceley james at iroute.org
Wed Aug 27 08:21:01 EST 2014


Hi Wolfgang,

RFO will be out today.

The long and short of it is...

We had an issue caused by the DC provider which took out our primary PoP (a power surge or something similar, which killed all the RPs and line cards). In one of those one-in-a-million chances, the big upgrade we'd completed across all our US sites left us with code on the secondary site routers that had a bug, which has now been diagnosed by Cisco (hence the delay in the RFO). This bug only happens when large traffic loads (and routes) hit the RP from a steady state, and it causes intermittent black-holing of traffic passing through that box.

As a few people said, stuff breaks. This was one of the nightmare events where we had two issues in parallel, one of them an undocumented Cisco bug that ironically only shows up when the other site has failed. As I said, it's a pretty rare set of events that had to happen to cause the outage, and it took quite a bit of time for Cisco to recreate and identify the bug (the other issue, a complete hardware failure, was clearly easy to diagnose). That Cisco delay is what has caused the delay in the RFO.

James


> On 26 Aug 2014, at 23:58, "Wolfgang Nagele (AusRegistry)" <wolfgang.nagele at ausregistry.com.au> wrote:
> 
> Hi,
> 
> This is your choice. Not mine and I am sure there are others here that do not agree with your idea of brushing issues like this off the table. If this were Telstra you would be striking a different tone.
> 
> It’s irrelevant if you have multiple upstreams - there are minimum requirements that we put on suppliers in the year 2014. Having Vocus suffer a 4-hour degradation due to a single DC fault in SJC is not something that we accept in 2014. There is redundancy via LA (and Singapore) which didn’t work. I would like to know why - not why it didn’t fail over automatically, as there can be many reasons for that. Vocus engineers should have been able to re-route via LA and completely take SJC out of the equation. There are questions to be answered here, as well as about delays in notifications - we did not receive a notification for over an hour. Again, not acceptable for an incident of that magnitude.
> 
> We all learn based on mistakes - ignoring them gains nothing.
> 
> As for multiple upstreams, yes we have them and yes we routed around the issue. To me that’s irrelevant to the issue at hand. If you are happy with a supplier that has a 4-hour degradation due to a single DC fault on its main international backhaul - yes - move along, nothing to see here.
> 
> Cheers,
> Wolfgang
> 
> On 8/26/14, 11:40 PM, "Skeeve Stevens" <skeeve+ausnog at eintellegonetworks.com> wrote:
> 
> Yup.. move along, nothing to see here.
> 
> Once an outage is fixed, those who dwell on a cause that they can do nothing about are focusing on the wrong place.
> 
> If your only transit was through a single upstream, that is where you should be focusing, not the provider.
> 
> 
> ...Skeeve
> 
> Skeeve Stevens - eintellego Networks Pty Ltd
> skeeve at eintellegonetworks.com ; www.eintellegonetworks.com
> Phone: 1300 239 038; Cell +61 (0)414 753 383 ; skype://skeeve
> facebook.com/eintellegonetworks ; linkedin.com/in/skeeve 
> twitter.com/theispguy ; blog: www.theispguy.com
> 
> The Experts Who The Experts Call
> Juniper - Cisco - Cloud - Consulting - IPv4 Brokering
> 
> 
>> On 26 August 2014 23:19, Kristoffer Sheather @ CloudCentral <kristoffer.sheather at cloudcentral.com.au> wrote:
>> Shit broke, they fixed.
>>  
>> <EOM />
>>  
>> From: "Wolfgang Nagele (AusRegistry)" <wolfgang.nagele at ausregistry.com.au>
>> Sent: Tuesday, August 26, 2014 11:17 PM
>> To: "James Spenceley" <james at iroute.org>
>> Cc: "Ausnog at ausnog.net" <ausnog at ausnog.net>
>> Subject: SPAM-MED: Re: [AusNOG] Vocus international service outage
>>  
>> Hi James,
>>  
>> Still waiting for the RfO on this whole thing with the details. Neither seen one here nor as a follow-up to customer notifications.
>>  
>> Cheers,
>> Wolfgang
>>  
>> On 8/24/14, 12:38 AM, "Wolfgang Nagele (AusRegistry)" <wolfgang.nagele at ausregistry.com.au> wrote:
>>  
>> Hi James,
>>  
>> Hmm - can understand that but would have expected that there is sufficient redundancy in the LA landing of your network. I would have expected that a re-route and taking SJC largely out of the equation would be possible. Surprised to say the least …
>>  
>> Cheers,
>> Wolfgang
>>  
>> On 8/24/14, 12:10 AM, "James Spenceley" <james at iroute.org> wrote:
>>  
>> Early mail is a power surge in a US DC has damaged both core routers. Wouldn't surprise me if transport from other providers out of that building will be having similar issues. 
>>  
>> Circuits are being moved directly to borders as we speak.
>>  
>> 
>> 
>> Sent from my iPhone
>> 
>>> On 24 Aug 2014, at 0:00, Jared Hirst <jared.hirst at serversaustralia.com.au> wrote:
>>>  
>>> WOW.... It has taken 2 hours to get remote hands to the DC with what seems to be a device with no redundancy?
>>>  
>>> 2014/08/23 13:55 UTC
>>> Engineers are currently awaiting remote support in the US. Links will be physically moved from the failed device in order to restore services on an alternate device.
>>> 
>>>  
>>>> On Sat, Aug 23, 2014 at 11:29 PM, Andrew Yager <andrew at rwts.com.au> wrote: 
>>>> [hijacking the thread…]
>>>>  
>>>> They say there is a big rewrite coming on the way rpd and sampled interact in 14.2; and the slow convergence issues have been fixed in more releases than I care to remember right now… but people say they are pretty good in the 12.3r6 train. Our MX80s are slated for upgrade to that at some stage in the next few months.
>>>>  
>>>> Some noise about this on j-nsp again today.
>>>>  
>>>> Andrew
>>>>  
>>>>  
>>>>  
>>>>> On 23 August 2014 23:23, Jonathan Thorpe <jthorpe at conexim.com.au> wrote:
>>>>> True, but they otherwise work exceptionally well.
>>>>> 
>>>>>  
>>>>> 
>>>>> I’m not sure what kind of PowerPC CPU is doing all the work on an MX80’s RE, but I do sometimes wonder if the CPU on a <$40 Raspberry Pi might be more up to the job :-P
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Tony Wicks [mailto:tony at wicks.co.nz]
>>>>> Sent: Saturday, 23 August 2014 11:13 PM
>>>>> To: Jonathan Thorpe
>>>>> Cc: 'Ausnog at ausnog.net'
>>>>> Subject: RE: [AusNOG] Vocus international service outage
>>>>> 
>>>>>  
>>>>> 
>>>>> Well, if you buy the big chassis boxes…..
>>>>> 
>>>>>  
>>>>> 
>>>>> From: AusNOG [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Jonathan Thorpe
>>>>> Sent: Sunday, 24 August 2014 1:10 a.m.
>>>>> To: Andrew Yager; Jared Hirst
>>>>> Cc: Ausnog at ausnog.net
>>>>> Subject: Re: [AusNOG] Vocus international service outage
>>>>> 
>>>>>  
>>>>> 
>>>>> Glad I’m not the only one holding my breath on our MXs :)
>>>>> 
>>>>>  
>>>>> 
>>>>> From: AusNOG [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Andrew Yager
>>>>> Sent: Saturday, 23 August 2014 10:55 PM
>>>>> To: Jared Hirst
>>>>> Cc: Ausnog at ausnog.net
>>>>> Subject: Re: [AusNOG] Vocus international service outage
>>>>> 
>>>>>  
>>>>> 
>>>>> We've done the same (about 20 minutes ago).
>>>>> 
>>>>>  
>>>>> 
>>>>> Right now I hate how long Juniper MX's take to stabilise their routing table with sampling on.
>>>>> 
>>>>>  
>>>>> 
>>>>> Andrew
>>>>> 
>>>>>  
>>>>> 
>>>>> On 23 August 2014 22:41, Jared Hirst <jared.hirst at serversaustralia.com.au> wrote:
>>>>> 
>>>>> We have just turned Vocus off. Using other providers for now, as the flapping is causing it to go up and down.
>>>>> 
>>>>>  
>>>>> 
>>>>> On Sat, Aug 23, 2014 at 10:37 PM, Daniel Watson <Daniel at glovine.com.au> wrote:
>>>>> 
>>>>> Indeed seeing some big drops in gaming traffic at present. Normally we see above 60mbit on weekend evenings, but we're not even seeing 40mbit at present :S
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Daniel Watson
>>>>> 
>>>>> Network Administrator / Network Operations Manager
>>>>> 
>>>>>  
>>>>> 
>>>>> E Daniel at GloVine.com.au
>>>>> 
>>>>> W www.GloVine.com.au
>>>>> 
>>>>>  
>>>>> 
>>>>> From: AusNOG [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Jared Hirst
>>>>> Sent: Saturday, 23 August 2014 10:35 PM
>>>>> 
>>>>> 
>>>>> To: Andrew Cox
>>>>> Cc: Ausnog at ausnog.net
>>>>> Subject: Re: [AusNOG] Vocus international service outage
>>>>> 
>>>>>  
>>>>> 
>>>>> Yeah, we are seeing this! Everything running via them is flapping. They claim to have 'routed around it' but that's not the case. Very frustrating, as it's been an hour and no one there seems to know what's going on....
>>>>> 
>>>>>  
>>>>> 
>>>>> On Sat, Aug 23, 2014 at 10:29 PM, Andrew Cox <andrew.cox at bigair.net.au> wrote:
>>>>> 
>>>>> Hey All,
>>>>> 
>>>>> Just saw the dashboards light up with connectivity issues internationally for Vocus services and thought I'd make others aware.
>>>>> 
>>>>> Vocus outage report is saying: "core network link between 59 Doody Street, Alexandria and 55 South Market Street, San Jose has failed" which hopefully isn't a Southern Cross fault!
>>>>> 
>>>>> Anyone else seeing this or have more info?
>>>>> 
>>>>>  
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Andrew
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> AusNOG mailing list
>>>>> AusNOG at lists.ausnog.net
>>>>> http://lists.ausnog.net/mailman/listinfo/ausnog
>>>>> --
>>>>> Andrew Yager, Managing Director   MACS (Snr) CP BCompSc MCP
>>>>> Real World Technology Solutions Pty Ltd - IT people you can trust
>>>>> ph: 1300 798 718 or (02) 9037 0500
>>>>> fax: (02) 9037 0591
>>>>> http://www.rwts.com.au/