[AusNOG] AAPT Ethernet outage

Matt Perkins matt at spectrum.com.au
Tue Jul 3 10:32:43 EST 2012


Hi Art,
  Firstly, thanks for showing some balls by using your email address. You 
may live to regret it. Excuse my hard tone; I have just come back from a 
meeting with one of my biggest customers explaining why they should not 
leave us because of your outage. Here are some suggestions off the bat.

Frontier doesn't work. - When it's up (30 seconds to return a query is 
not up) the details on it are almost useless when there is a parent 
case. It only shows details from your case 90% of the time. Give us 
access to the parent case. What's the use of giving us the parent case 
number if we can only access it by waiting 30 minutes on hold? I have 
given up ringing most times, as you end up with someone with poor 
communication skills who has very little technical understanding. We 
are wholesale customers; don't ask us if we have reset the router.

Twitter. - Twitter - Twitter - Twitter - Twitter and in case you missed 
it, Twitter.
Open a Twitter account right now. Put it on your 3G phone and key in 
every bit of info you have during a major outage at minimum 30-second 
intervals. Don't lie. We will know. Just tell us the truth; you will find 
it will be welcomed by your customers. We are wholesale customers. We 
understand problems happen. Remember, we have customers screeching at us 
while you just put up a firewall at AAPT. Better bad news than no news.
Have a look at https://twitter.com/#!/spectrumnet to see how it's done.

Incident reports - An industry pseudo-standard today. You need a tick box 
in Frontier: "Please send me an incident report when this case is 
complete." The report needs to include root cause and resolution as well 
as what will be done to stop a recurrence. Here's a hint: don't blame the 
vendor. I can't blame you to my customers. They don't care; all they care 
about is that they were down and how it won't happen again. Take charge 
of it.  Here's a free technical hint for your last outage. You need a hard 
power watchdog on the switch: a device that will hard reboot the power 
on your switch when it can't be seen for 5 minutes. It needs to be in the 
PoP and self-contained.  I'm sure you could afford them after all the 
money you saved on those non-mainstream switches.
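
To make the watchdog idea concrete, here is a rough sketch of the logic in 
Python. It is only an illustration under assumptions: the class name, the 
probe and the power_cycle hook are all hypothetical, and a real unit would 
drive a relay or managed PDU inside the PoP rather than run on a server 
somewhere else. The 300 second timeout is the "5 minutes" above.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to the switch's management port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


class PowerWatchdog:
    """Power-cycle a device once it has been unreachable for timeout_s seconds."""

    def __init__(
        self,
        probe: Callable[[], bool],
        power_cycle: Callable[[], None],  # hypothetical hook: relay, PDU, etc.
        timeout_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.probe = probe
        self.power_cycle = power_cycle
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def tick(self) -> bool:
        """Run one check; return True if a power cycle was triggered."""
        now = self.clock()
        if self.probe():
            self.last_seen = now
            return False
        if now - self.last_seen >= self.timeout_s:
            self.power_cycle()
            # Restart the timer so the switch gets a full timeout to boot
            # before it can be cycled again.
            self.last_seen = now
            return True
        return False
```

In use you would call tick() every few seconds from a loop on the watchdog 
box itself, with something like probe=lambda: tcp_probe("192.0.2.10", 22) 
(a documentation address, not a real switch). Resetting the timer after a 
power cycle is a deliberate choice so that a slow-booting switch is not 
cycled over and over.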

Major outages - Any outage that affects more than 2 customers. How about 
an RVA (recorded voice announcement) while we are waiting on hold? We 
need the following information: the service types that are affected, the 
location that is affected, and the estimated restoration time or the time 
that more information will be forthcoming. This needs to be automated and 
should be the first job during a large-scale outage. Yes, even before 
starting to fix it.
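
As a sketch of what "automated" could look like, the announcement text can 
be generated from the three items above and handed to whatever 
text-to-speech or recording system is in place. The function name and 
wording below are made up for illustration; nothing here reflects an 
actual AAPT system.

```python
from typing import Optional, Sequence


def build_rva_script(
    service_types: Sequence[str],
    locations: Sequence[str],
    restoration_eta: Optional[str] = None,
    next_update: Optional[str] = None,
) -> str:
    """Build the text of a recorded voice announcement for a major outage."""
    if not service_types or not locations:
        raise ValueError("service types and locations are both required")

    parts = [
        "We are aware of a major outage.",
        "Affected services: " + ", ".join(service_types) + ".",
        "Affected locations: " + ", ".join(locations) + ".",
    ]
    # Callers must commit to either a restoration estimate or a time for
    # the next update; an announcement with neither is not allowed.
    if restoration_eta:
        parts.append(f"Estimated restoration time: {restoration_eta}.")
    elif next_update:
        parts.append(f"More information will be available at {next_update}.")
    else:
        raise ValueError("give a restoration estimate or a next-update time")
    return " ".join(parts)
```

For example, build_rva_script(["Ethernet"], ["NSW", "VIC"], 
next_update="17h00") gives a script naming the service, both states and the 
time of the next update. Refusing to build a message without a time is the 
point: it forces someone to commit to when customers will hear more.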

Management Systems - Clearly there is a poor line of communication 
between your front line support and your back of house engineering. Well, 
I hope that's all it is. If it's not, then your engineering monitoring 
systems are substandard or your front of house are apathetic about 
customer support.  Let's go with a communication problem.  There need 
to be liaison officers in both departments.  Engineers don't like 
communicating when they are under pressure. It's part of the personality 
type. A designated communication officer that works in the team can make 
this happen.

Finally - Wholesale customers are usually knowledgeable. In most cases 
they will know more about the systems than your front line support 
people. Don't assume they have the same skill set as retail customers. 
Telling someone presenting with 10, 20 or 100 AAPT services all off the 
air to "reboot your router" is not helpful.

Matt.




  On 3/07/12 8:33 AM, Art Cartwright wrote:
>
> Hi, my name is Art and I run network operations at AAPT. I am new to 
> the forum and I wanted to give everyone an update on the event that 
> happened on Saturday in the AAPT network.
>
> On Saturday between the times of 12h00 and 14h30 AAPT experienced a 
> large number of Ethernet switches (in both NSW and VIC) stop passing 
> traffic and become unreachable.
>
> We know now that the problem was caused by a vendor's equipment 
> incorrectly handling the "Leap Second Insertion" by NTP.
>
> At 15h00 we mobilized our on-call Field Operations staff who needed 
> to access various PoPs in the CBD to power cycle the switches. We power 
> cycled the first switch at 16h20 and all services on the affected 
> switch were restored immediately.
>
> We then mobilised more field operations staff as we knew we had to 
> reboot all devices manually.
>
> By 19h22 the majority of customer services were confirmed restored 
> and by 01h15 99% of customer services were restored except for three 
> sites where we had issues with site access.
>
> The vendor was able to simulate the issue in their lab in the early 
> hours of Sunday morning and isolated it to the NTP "leap second 
> insertion".
>
> I accept that during the event we did a poor job communicating with 
> customers and the broader community as to what was happening. Our 
> updates were infrequent and at times incorrect. This is something that 
> we are looking at improving.
>
> I would welcome any suggestions as to the communication channels we 
> should investigate.
>
> Thanks
>
> Art
>


-- 
/* Matt Perkins
         Direct 1300 137 379     Spectrum Networks Pty. Ltd.
         Office 1300 133 299     matt at spectrum.com.au
         Fax    1300 133 255     Level 6, 350 George Street Sydney 2000
         SIP 1300137379 at sip.spectrum.com.au
         PGP/GNUPG Public Key can be found at  http://pgp.mit.edu
*/
