[AusNOG] AAPT Ethernet outage
Matt Perkins
matt at spectrum.com.au
Tue Jul 3 10:32:43 EST 2012
Hi Art,
Firstly, thanks for showing some balls by using your email address. You
may live to regret it. Excuse my hard tone; I have just come back from a
meeting with one of my biggest customers, explaining why they should not
leave us because of your outage. Here are some suggestions off the bat.
Frontier doesn't work. - When it's up (30 seconds to return a query is
not up) the details on it are almost useless when there is a parent
case. It only shows details from your case 90% of the time. Give us
access to the parent case. What's the use of giving us the parent case
number if we can only access it by waiting 30 minutes on hold? I have
given up ringing most times, as you end up with someone with poor
communication skills who has very little technical understanding. We
are wholesale customers; don't ask us if we have reset the router.
Twitter. - Twitter - Twitter - Twitter - Twitter and, in case you missed
it, Twitter.
Open a Twitter account right now. Put it on your 3G phone and key in
every bit of info you have during a major outage at minimum 30-second
intervals. Don't lie. We will know. Just tell us the truth; you will find
it will be welcomed by your customers. We are wholesale customers. We
understand problems happen. Remember, we have customers screeching at us
while you just put up a firewall at AAPT. Better bad news than no news.
Have a look at https://twitter.com/#!/spectrumnet to see how it's done.
Incident reports - An industry pseudo-standard today. You need a tick box
in Frontier: "Please send me an incident report when this case is
complete." This needs to include root cause and resolution as well as
what will be done to stop a re-occurrence. Here's a hint. Don't blame the
vendor. I can't blame you to my customers. They don't care; all they care
about is that they were down and how it won't happen again. Take charge
of it. Here's a free technical hint for your last outage. You need a hard
power watchdog on the switch: a device that will hard reboot the power
on your switch when it can't be seen for 5 minutes. It needs to be in the
PoP and self-contained. I'm sure you could afford them after all the
money you saved on those non-mainstream switches.
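To make the watchdog idea concrete, here is a minimal sketch of the decision logic only. The class name `PowerWatchdog` and the `probe` and `power_cycle` callables are hypothetical placeholders (for example a ping check and a relay on the power feed); nothing here describes a real product or AAPT's equipment. The 5 minute timeout is the figure suggested above.

```python
import time
from typing import Callable


class PowerWatchdog:
    """Hard power-cycles a switch once it has been unreachable too long.

    `probe` and `power_cycle` are assumed, caller-supplied callables:
    `probe()` returns True when the switch answers, and `power_cycle()`
    interrupts and restores power to it.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        power_cycle: Callable[[], None],
        timeout_s: float = 300.0,  # 5 minutes, as suggested above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.probe = probe
        self.power_cycle = power_cycle
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def tick(self) -> bool:
        """Run one check; return True if the switch was power-cycled."""
        now = self.clock()
        if self.probe():
            self.last_seen = now
            return False
        if now - self.last_seen >= self.timeout_s:
            self.power_cycle()
            # Restart the timer so the switch gets a full window to boot
            # before it can be cycled again.
            self.last_seen = now
            return True
        return False
```

In use, something in the PoP would call `tick()` on a short fixed interval. Because it runs locally and only needs to see the switch itself, it keeps working when the management network is down.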
Major outages - Any outage that affects more than 2 customers. How about
an RVA (recorded voice announcement) while we are waiting on hold? We
need the following information: the service types that are affected, the
location that is affected, and the estimated restoration time or the time
that more information will be forthcoming. This needs to be automated and
should be the first job during a large-scale outage. Yes, even before
starting to fix it.
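As a rough illustration of how small the automated part could be, the sketch below holds the fields listed above and renders them as an announcement script. The `OutageAnnouncement` name, its fields, and the wording of the script are all assumptions made for the example, not anything AAPT uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutageAnnouncement:
    """The minimum a caller on hold needs to hear (hypothetical structure)."""

    service_types: List[str]
    locations: List[str]
    estimated_restoration: Optional[str] = None
    next_update: Optional[str] = None

    def script(self) -> str:
        """Render the announcement text to be recorded or synthesised."""
        if self.estimated_restoration:
            timing = f"Estimated restoration time is {self.estimated_restoration}."
        elif self.next_update:
            timing = f"More information will be available at {self.next_update}."
        else:
            # Refuse to publish an announcement with no time commitment.
            raise ValueError("need an estimated restoration time or a next update time")
        return (
            f"We are experiencing an outage affecting {', '.join(self.service_types)} "
            f"services in {', '.join(self.locations)}. {timing}"
        )
```

Requiring either a restoration estimate or a next-update time means an announcement can go out before anyone knows the cause.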
Management Systems - Clearly there is a poor line of communication
between your front line support and your back of house engineering. Well,
I hope that's all it is. If it's not, then your engineering monitoring
systems are substandard or your front of house staff are apathetic about
customer support. Let's go with a communication problem. There need
to be liaison officers in both departments. Engineers don't like
communicating when they are under pressure. It's part of the personality
type. A designated communication officer who works in the team can make
this happen.
Finally - Wholesale customers are usually knowledgeable; in most cases
they will know more about the systems than your front line support
people. Don't assume they have the same skill set as retail customers.
Telling someone presenting with 10, 20 or 100 AAPT services all off the
air to "reboot your router" is not helpful.
Matt.
On 3/07/12 8:33 AM, Art Cartwright wrote:
>
> Hi, my name is Art and I run network operations at AAPT. I am new to
> the forum and I wanted to give everyone an update on the event that
> happened on Saturday in the AAPT network.
>
> On Saturday between the times of 12h00 and 14h30 AAPT experienced a
> large number of Ethernet switches (in both NSW and VIC) stop passing
> traffic and become unreachable.
>
> We know now that the problem was caused by a vendor's equipment
> incorrectly handling the "Leap Second Insertion" by NTP.
>
> At 15h00 we mobilized our on-call Field Operations staff, who needed
> to access various PoPs in the CBD to power cycle the switches. We power
> cycled the first switch at 16h20 and all services on the affected
> switch were restored immediately.
>
> We then mobilised more field operations staff as we knew we had to
> reboot all devices manually.
>
> By 19h22 the majority of customer services were confirmed restored
> and by 01h15 99% of customer services were restored except for three
> sites where we had issues with site access.
>
> The vendor was able to simulate the issue in their lab in the early
> hours of Sunday morning and isolated it to the NTP "leap second
> insertion".
>
> I accept that during the event we did a poor job communicating with
> customers and the broader community as to what was happening. Our
> updates were infrequent and at times incorrect. This is something that
> we are looking at improving.
>
> I would welcome any suggestions as to the communication channels we
> should investigate.
>
> Thanks
>
> Art
>
> This communication, including any attachments, is confidential. If you are not the intended
> recipient, you should not read it - please contact me immediately, destroy it, and do not
> copy or use any part of this communication or disclose anything about it.
--
/* Matt Perkins
Direct 1300 137 379 Spectrum Networks Ptd. Ltd.
Office 1300 133 299 matt at spectrum.com.au
Fax 1300 133 255 Level 6, 350 George Street Sydney 2000
SIP 1300137379 at sip.spectrum.com.au
PGP/GNUPG Public Key can be found at http://pgp.mit.edu
*/