[AusNOG] RailCorp Sydenham signalling failure report

Edwin Groothuis edwin at mavetju.org
Sun May 1 10:33:21 EST 2011


On 01/05/2011, at 10:17 AM, Mark Smith wrote:
> 
> "The first workstation area became functional at 08:10:15 and full
> control on all workstations was restored at 08:52 with the faulty switch
> powered off at 08.46."
> 
> So now they isolate the faulty element, 58 minutes after working out
> what it is. If they'd removed the faulty switch immediately, spanning
> tree is likely to have reconverged and settled within no more than a few
> minutes. Restarting servers and workstations should not have been
> necessary at all, unless that is the only way to restart applications
> that aren't tolerant to any level of packet loss.

Also don't forget the possibility of a very short DHCP lease time, which could have caused everything to fall back to a 169.254.x.x address once the leases expired while the DHCP server was unreachable.
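
To make the lease-time point concrete, here is a rough sketch in Python. The function names and all the numbers are made up for illustration; nothing here comes from the RailCorp report. It assumes a client that simply gives up its address when the lease runs out and then self-assigns a link-local one, which depends on the operating system.

```python
import ipaddress


def lease_expired_during_outage(lease_seconds, seconds_since_renewal, outage_seconds):
    """Return True if the lease runs out before the outage ends.

    Hypothetical helper: assumes the client cannot reach the DHCP
    server at all for the whole outage.
    """
    remaining = lease_seconds - seconds_since_renewal
    return outage_seconds >= remaining


def is_link_local(addr):
    """Return True for addresses in 169.254.0.0/16 (IPv4 link-local)."""
    return ipaddress.ip_address(addr).is_link_local


if __name__ == "__main__":
    # Hypothetical values: 10 minute lease, renewed 5 minutes ago, 1 hour outage.
    print(lease_expired_during_outage(600, 300, 3600))
    # Hypothetical values: 24 hour lease, renewed 12 hours ago, 1 hour outage.
    print(lease_expired_during_outage(86400, 43200, 3600))
    print(is_link_local("169.254.10.20"))
```

Since clients normally try to renew halfway through the lease, a healthy client has at least about half its lease left when the network breaks. With a lease of a few minutes that is not much headroom; with a lease of a day, an hour-long outage would not matter.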

If the policy for these machines is that everything starts up at boot and that users are not allowed to restart the application, then yes, you are back to restarting the computers. In that case I wouldn't call them workstations anymore; I'd refer to them as appliances.

Edwin
