[AusNOG] AAPT Ethernet outage

Joshua D'Alton joshua at railgun.com.au
Tue Jul 3 10:42:04 EST 2012


server$ sudo incident_report generate

Is that how it's done? :D  All good points. Do you have a blog?

On Tue, Jul 3, 2012 at 10:32 AM, Matt Perkins <matt at spectrum.com.au> wrote:

>  Hi Art,
>  Firstly thanks for showing some balls by using your email address. You
> may live to regret it. Excuse my hard tone; I have just come back from a
> meeting with one of my biggest customers explaining why they should not
> leave us due to your outage. Here are some suggestions off the bat.
>
> Frontier doesn't work. - When it's up (30 seconds to return a query is not
> up) the details on it are almost useless when there is a parent case. It
> only shows details from your case 90% of the time. Give us access to the
> parent case. What's the use of giving us the parent case number if we can
> only access it by waiting 30 minutes on hold? I have given up ringing most
> times, as you end up with someone with poor communication skills who has
> very little technical understanding. We are wholesale customers; don't ask
> us if we have reset the router.
>
> Twitter. - Twitter - Twitter - Twitter - Twitter and in case you missed it
> Twitter.
> Open a Twitter account right now. Put it on your 3G phone and key in every
> bit of info you have during a major outage at minimum 30 second intervals.
> Don't lie. We will know. Just tell us the truth; you will find it will be
> welcomed by your customers. We are wholesale customers. We understand
> problems happen. Remember, we have customers screeching at us while you just
> put up a firewall at AAPT. Better bad news than no news.  Have a look at
> https://twitter.com/#!/spectrumnet to see how it's done.
>
> Incident reports - An industry sudo standard today. You need a tick box in
> Frontier: "please send me an incident report when this case is complete". This
> needs to include root cause and resolution as well as what will be done to
> stop a recurrence. Here's a hint: don't blame the vendor. I can't blame
> you to my customers. They don't care; all they care about is that they were
> down and how it won't happen again. Take charge of it.  Here's a free technical
> hint for your last outage. You need a hard power watchdog on the switch: a
> device that will hard reboot the power on your switch when it can't be seen
> for 5 minutes. It needs to be in the PoP and self-contained.  I'm sure you
> could afford them after all the money you saved on those non-mainstream
> switches.
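
To make the watchdog suggestion concrete, here is a minimal sketch of the
logic only, in Python. It assumes two hypothetical callbacks that are not
from the original post: is_reachable (for example a ping to the switch) and
power_cycle (for example toggling a relay or PDU outlet). The 5 minute
timeout is the figure from the text; everything else is illustrative.

```python
import time
from typing import Callable


class PowerWatchdog:
    """Power-cycles a device once it has been unreachable for timeout_s seconds."""

    def __init__(
        self,
        is_reachable: Callable[[], bool],   # hypothetical probe, e.g. a ping
        power_cycle: Callable[[], None],    # hypothetical action, e.g. relay toggle
        timeout_s: float = 300.0,           # 5 minutes, as suggested in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.is_reachable = is_reachable
        self.power_cycle = power_cycle
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def poll(self) -> bool:
        """Run one check; return True if a power cycle was triggered."""
        now = self.clock()
        if self.is_reachable():
            self.last_seen = now
            return False
        if now - self.last_seen >= self.timeout_s:
            self.power_cycle()
            # Restart the timer so the device gets a full timeout to boot
            # before it can be cycled again.
            self.last_seen = now
            return True
        return False
```

A caller would invoke poll() on a fixed interval. The sketch uses a
monotonic clock by default so that the timeout is measured as elapsed time
and does not depend on the wall clock being stepped.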
>
> Major outages - Any outage that affects more than 2 customers. How about an
> RVA (recorded voice announcement) while we are waiting on hold? We need the
> following information: the service types that are affected, the location
> that is affected, and the estimated restoration time or the time that more
> information will be forthcoming. This needs to be automated and should be
> the first job during a large scale outage. Yes, even before starting to fix it.
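
As a rough illustration of automating that announcement, the sketch below
builds the RVA text from the three pieces of information listed above. The
OutageNotice type, the announcement function and the wording of the message
are all hypothetical; only the required fields come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OutageNotice:
    services: Sequence[str]                      # service types affected
    locations: Sequence[str]                     # locations affected
    estimated_restoration: Optional[str] = None  # ETA, if known
    next_update: Optional[str] = None            # else: when more info is due


def announcement(notice: OutageNotice) -> str:
    """Render the notice as text for a recorded voice announcement."""
    if not notice.services or not notice.locations:
        raise ValueError("services and locations are both required")
    if notice.estimated_restoration:
        timing = f"Estimated restoration time: {notice.estimated_restoration}."
    elif notice.next_update:
        timing = f"More information will be available at {notice.next_update}."
    else:
        raise ValueError("need an estimated restoration time or a next update time")
    return (
        f"We are experiencing an outage affecting {', '.join(notice.services)} "
        f"in {', '.join(notice.locations)}. {timing}"
    )
```

For example, announcement(OutageNotice(["Ethernet"], ["NSW", "VIC"],
next_update="17:00")) produces a single sentence naming the service, the
states and the time of the next update (the 17:00 value is made up). The
function refuses to produce a message that has neither an ETA nor a next
update time, since that is the part customers on hold need most.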
>
> Management Systems - Clearly there is a poor line of communication
> between your front line support and your back of house engineering. Well, I
> hope that's all it is. If it's not, then your engineering monitoring systems
> are substandard or your front of house are apathetic about customer
> support.  Let's go with a communication problem.  There need to be liaison
> officers in both departments.  Engineers don't like communicating when they
> are under pressure. It's part of the personality type. A designated
> communication officer who works in the team can make this happen.
>
> Finally - Wholesale customers are usually knowledgeable; in most cases
> they will know more about the systems than your front line support people.
> Don't assume they have the same skill set as retail customers. Telling
> someone presenting with 10, 20 or 100 AAPT services all off the air to
> "reboot your router" is not helpful.
>
> Matt.
>
>
>
>
>
>  On 3/07/12 8:33 AM, Art Cartwright wrote:
>
>  Hi, my name is Art and I run network operations at AAPT. I am new to the
> forum and I wanted to give everyone an update on the event that happened on
> Saturday in the AAPT network.
>
>
>
> On Saturday between the times of 12h00 and 14h30 AAPT experienced a large
> number of Ethernet switches (in both NSW and VIC) stop passing traffic and
> become unreachable.
>
>
>
> We know now that the problem was caused by a vendor’s equipment incorrectly
> handling the “Leap Second Insertion” by NTP.
>
>
>
> At 15h00 we mobilised our on-call Field Operations staff, who needed
> access to various PoPs in the CBD to power cycle the switches. We power
> cycled the first switch at 16h20 and all services on the affected switch
> were restored immediately.
>
>
>
> We then mobilised more field operations staff as we knew we had to reboot
> all devices manually.
>
>
>
> By 19h22 the majority of customer services were confirmed restored and
> by 01h15 99% of customer services were restored except for three sites
> where we had issues with site access.
>
>
>
> The vendor was able to simulate the issue in their lab in the early hours
> of Sunday morning and isolated it to the NTP “leap second insertion”.
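
For readers unfamiliar with how NTP signals a leap second: the warning is
carried in the 2-bit Leap Indicator (LI) field at the top of the first byte
of every NTP packet (RFC 5905). The sketch below only decodes that byte. It
says nothing about how the vendor's switches actually failed, which the
post does not describe; the function name is my own.

```python
from typing import Tuple

# Leap Indicator values as defined in RFC 5905.
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def parse_ntp_first_byte(b: int) -> Tuple[int, int, int]:
    """Split the first byte of an NTP packet into (leap, version, mode)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a single byte value")
    leap = (b >> 6) & 0x3     # top 2 bits: Leap Indicator
    version = (b >> 3) & 0x7  # next 3 bits: protocol version
    mode = b & 0x7            # low 3 bits: association mode
    return leap, version, mode
```

A server reply whose first byte is 0x64 decodes to leap=1, version=4,
mode=4, i.e. an NTPv4 server announcing that a second will be inserted at
the end of the day. A client has to handle that flag deliberately, which is
where implementations can go wrong.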
>
>
>
> I accept that during the event we did a poor job communicating with
> customers and the broader community as to what was happening. Our updates
> were infrequent and at times incorrect. This is something that we are
> looking at improving.
>
>
>
> I would welcome any suggestions as to the communication channels we should
> investigate.
>
>
>
> Thanks
>
>
>
> Art
>
>
>
> _______________________________________________
> AusNOG mailing list
> AusNOG at lists.ausnog.net
> http://lists.ausnog.net/mailman/listinfo/ausnog
>
>
>
> --
> /* Matt Perkins
>         Direct 1300 137 379     Spectrum Networks Ptd. Ltd.
>         Office 1300 133 299     matt at spectrum.com.au
>         Fax    1300 133 255     Level 6, 350 George Street Sydney 2000
>         SIP 1300137379 at sip.spectrum.com.au
>         PGP/GNUPG Public Key can be found at  http://pgp.mit.edu
> */
>
>