[AusNOG] Outage that costs Millions

Narelle narellec at gmail.com
Thu Jul 1 12:22:33 EST 2010


On Thu, Jul 1, 2010 at 10:20 AM, Matt Carter <matt at iseek.com.au> wrote:
>
> As others have pointed out a "spanning tree issue" doesn't tear down your
> network for 90 minutes, it *prevents* it from being torn down for 90 minutes,
> it could be thought of as a "last resort safety" so to assert a spanning tree
> issue caused this problem, in my mind, is to assert a lack of spanning tree,
> meaning the required last resort safety mechanisms were either not in place,
> or not configured properly. (if they were, how could it be a spanning tree issue??)


I have seen failures of this duration in large spanning tree networks
before. The reasons for the lengthy time to restore are: a) it can be
really tricky to find the 'root' of the problem, b) people have
forgotten what life was like before routers were everywhere (and over
the last few years have been relearning all this), and c) in large
carriers people get nervous when restoring traffic, because the
rerouting gets complex: ports need to be identified, records
updated, etc.

It just ain't as simple as it looks...

<disclaimer>
and this in no way is meant to explain or justify any of Telstra's
actions, indeed I know virtually nothing about the specific
configuration of that network...
</disclaimer>

-- 


Narelle
narellec at gmail.com


