[AusNOG] RailCorp Sydenham signalling failure report

Mark Smith nanog at 85d5b20a518b8f6864949bd940457dc124746ddc.nosense.org
Sun May 1 10:17:43 EST 2011


On Fri, 29 Apr 2011 23:27:55 +1000
Tom Sykes <TomSykes at nbnco.com.au> wrote:

> Spanning tree. Awesome.
> 

Spanning tree isn't really at fault here. A flaky link was going up and
down, triggering constant spanning tree reconvergence. "Layer 3
switching" can be seen as the solution to "spanning tree problems", yet
similar effects would have occurred if OSPF were being used to
determine packet forwarding paths. OSPF doesn't like constant
reconvergence either.
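
As a rough illustration of why a flapping link hurts regardless of the
protocol, here is a toy model. All of it is assumption, not from the
report: the flap interval is made up, and the 30 second convergence
time is just two default 15 second forward-delay timers from classic
802.1D (the report excerpt doesn't say which spanning tree variant or
timers were in use). Every link state change restarts convergence, and
the network only counts as settled once a full convergence period has
passed without another change.

```python
def settled_seconds(transitions, convergence_s, end_s):
    """Seconds within [0, end_s] during which the topology is settled.

    transitions   -- sorted times (seconds) at which the link changed state,
                     all assumed to fall inside the window
    convergence_s -- assumed time to reconverge after a single change
    end_s         -- length of the observation window
    """
    settled = 0.0
    ready_at = 0.0  # the network is assumed stable at time 0
    for t in transitions:
        if t > ready_at:
            # Stable from the end of the last convergence until this change.
            settled += t - ready_at
        # Any change, even mid-convergence, restarts the clock.
        ready_at = t + convergence_s
    if end_s > ready_at:
        settled += end_s - ready_at
    return settled


WINDOW = 600        # hypothetical 10 minute observation window
CONVERGENCE = 30    # assumed: 2 x 15 s forward delay (classic 802.1D default)

# Hypothetical: link changes state every 20 seconds for the whole window.
flapping = settled_seconds(list(range(20, WINDOW, 20)), CONVERGENCE, WINDOW)

# Hypothetical: one change, after which the faulty element is removed.
isolated = settled_seconds([20], CONVERGENCE, WINDOW)

print(f"flapping link:  settled for {flapping:.0f} of {WINDOW} s")
print(f"isolated early: settled for {isolated:.0f} of {WINDOW} s")
```

Swap in an OSPF-ish convergence time and the shape of the result is the
same: as long as the changes arrive faster than the protocol can
settle, the network never gets to forward reliably.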

Their troubleshooting method wasn't very effective. From page 5:

"The faulty network switch was identified by local technical staff."

So it was switched off or isolated from the network.

"The disaster recovery process was initiated at 07:48:03. This involved
the staged restart of all signal control servers and workstations."

Oh, no it wasn't. 

"The first workstation area became functional at 08:10:15 and full
control on all workstations was restored at 08:52 with the faulty switch
powered off at 08.46."

So now they isolate the faulty element, 58 minutes after working out
what it is. If they'd removed the faulty switch immediately, spanning
tree would likely have reconverged and settled within no more than a
few minutes. Restarting servers and workstations should not have been
necessary at all, unless that is the only way to restart applications
that aren't tolerant of any level of packet loss.
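
For anyone checking the arithmetic, a quick sketch using only the
timestamps quoted above. The one assumption is that the switch had been
identified by the time the recovery process started at 07:48:03. The
report gives the power-off and full-restoration times to the minute
only, so the results are approximate.

```python
from datetime import datetime


def parse(hms):
    """Parse a same-day time given as HH:MM:SS or HH:MM."""
    fmt = "%H:%M:%S" if hms.count(":") == 2 else "%H:%M"
    return datetime.strptime(hms, fmt)


def minutes_between(start, end):
    """Elapsed minutes from start to end, as a float."""
    return (end - start).total_seconds() / 60


# Times as quoted from page 5 of the report.
recovery_started = parse("07:48:03")  # assumed: faulty switch known by now
first_area_up = parse("08:10:15")
switch_powered_off = parse("08:46")   # written as "08.46" in the report
full_control = parse("08:52")

print("recovery start -> first workstation area: "
      f"{minutes_between(recovery_started, first_area_up):.0f} min")
print("recovery start -> faulty switch powered off: "
      f"{minutes_between(recovery_started, switch_powered_off):.0f} min")
print("recovery start -> full control restored: "
      f"{minutes_between(recovery_started, full_control):.0f} min")
```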



> Might get a taxi from the airport next time I'm up there.
> 
> TS.
> 
> From: ausnog-bounces at lists.ausnog.net [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Skeeve Stevens
> Sent: Friday, 29 April, 2011 10:31 PM
> To: ausnog at ausnog.net
> Subject: [AusNOG] RailCorp Sydenham signalling failure report
> 
> The link to the full report is @ http://www.railcorp.info/__data/assets/pdf_file/0008/9719/110429-Signal_System_Report.pdf
> 
> The Cisco's involved look to be 3550's.
> 
> The report is very interesting and describes the networking issues in reasonable detail.
> 
> 
> ...Skeeve
> 
> 
> 
> --
> 
> Skeeve Stevens, CEO - eintellego Pty Ltd - The Networking Specialists
> 
> skeeve at eintellego.net ; www.eintellego.net
> 
> Phone: 1300 753 383 ; Fax: (+612) 8572 9954
> 
> Cell +61 (0)414 753 383 ; skype://skeeve
> 
> facebook.com/eintellego or eintellego at facebook.com
> 
> twitter.com/networkceoau ; www.linkedin.com/in/skeeve
> 
> PO Box 7726, Baulkham Hills, NSW 1755 Australia
> 
> 
> 
> --
> 
> eintellego - The Experts that the Experts call
> 
> - Juniper - HP Networking - Cisco - Brocade - Arista - Allied Telesis
> 
> On 29/04/11 9:41 PM, "Andy Linton" <asjl at lpnz.org<mailto:asjl at lpnz.org>> wrote:
> 
> Time to check your electrolytic caps
> 
> -------- Original Message --------
> Subject: [Disconnect] RailCorp Sydenham signalling failure report
> From: Andrew McNamara <andrewm at object-craft.com.au<mailto:andrewm at object-craft.com.au>>
> To: disconnect at object-craft.com.au<mailto:disconnect at object-craft.com.au>
> CC:
> 
> RailCorp has released their report into the 12 April Sydenham signalling
> failure (which crippled the whole metro network). They found it to be
> due to a Cisco switch with leaky electrolytic caps repeatedly bouncing
> (see the linked PDF):
> 
>     http://www.railcorp.info/publications/sydenham
> 
> 
> --
> Andrew McNamara, Senior Developer, Object Craft
> http://www.object-craft.com.au/
> _______________________________________________
> Disconnect mailing list
> Disconnect at object-craft.com.au<mailto:Disconnect at object-craft.com.au>
> https://www.object-craft.com.au/cgi-bin/mailman/listinfo/disconnect
> _______________________________________________
> NZNOG mailing list
> NZNOG at list.waikato.ac.nz<mailto:NZNOG at list.waikato.ac.nz>
> http://list.waikato.ac.nz/mailman/listinfo/nznog
> 


