[AusNOG] Outage that costs Millions

Andrew Fort afort at choqolat.org
Wed Jun 30 13:00:07 EST 2010


On Wed, Jun 30, 2010 at 12:16 PM, Lincoln Dale <ltd at cisco.com> wrote:
>
> On 30/06/2010, at 11:57 AM, Andrew Fort wrote:
>
>> On Wed, Jun 30, 2010 at 11:46 AM, Lincoln Dale <ltd at cisco.com> wrote:
>>> On 30/06/2010, at 10:43 AM, Daniel Hood wrote:
>>>> The issue the outage was facing was a Spanning Tree Loop that knocked over all of the
>>>
>>> it is the _absence_ of Spanning Tree that means that a network cannot _recover_ from someone causing a loop.
>>> common misconception is that Spanning Tree causes loops.  that is incorrect at best.
>>
>> Sure.  If it were due to a customer or operator created loop, the
>> question for me becomes: was l2 traffic suppression configured, and
>> did it work?
>
> certainly it is best practice to make use of the features that are available (e.g. storm control) that help mitigate "bad things" that can happen at L2 (e.g. host going mad generating broadcast frames).
>
> but if there is a loop at L2:
>  (a) STP's role is to build a topology in a loop free manner.  it does that well enough but perhaps not in an optimal manner.
>  (b) 'BPDU Guard' operates on edge ports sending out periodic BPDUs in the expectation that they never come back - and if they do - the edge port that receives that BPDU is errdisabled.
>
> best practice is that (b) is most certainly enabled too.

Right, which is essential to ensure that a customer doesn't form a
loop and kill your network before *STP has a chance to reconverge.
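
To make the edge-port behaviour concrete, here is a toy Python model of
the idea. It is only a sketch under my own assumptions, not how any
switch implements it: the EdgePort class, the cable() helper and the
port names are made up for illustration, and a "BPDU" is reduced to a
method call on the peer port.

```python
class EdgePort:
    """Hypothetical edge port; not modelled on any vendor's implementation."""

    def __init__(self, name, bpdu_guard=True):
        self.name = name
        self.bpdu_guard = bpdu_guard
        self.errdisabled = False
        self.peer = None  # whatever is plugged into the far end

    def send_bpdu(self):
        # A disabled or unplugged port sends nothing.
        if self.errdisabled or self.peer is None:
            return
        self.peer.receive_bpdu()

    def receive_bpdu(self):
        # An edge port should never hear a BPDU. With the guard on,
        # hearing one shuts the port down instead of forwarding.
        if self.bpdu_guard:
            self.errdisabled = True


def cable(a, b):
    """Simulate a customer patching two edge ports together (a loop)."""
    a.peer = b
    b.peer = a


# Guarded ports: the first BPDU that comes back breaks the loop.
p1, p2 = EdgePort("edge-1"), EdgePort("edge-2")
cable(p1, p2)
p1.send_bpdu()
p2.send_bpdu()  # no effect, edge-2 is already errdisabled

# Unguarded ports: nothing is shut down, so the loop stays up
# until *STP reconverges.
u1 = EdgePort("edge-3", bpdu_guard=False)
u2 = EdgePort("edge-4", bpdu_guard=False)
cable(u1, u2)
u1.send_bpdu()
```

In the guarded case one port going down is enough to break the loop,
so edge-1 stays up in this model.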


> no idea what happened in this scenario, but my experience is that L2 loops attributed to "STP" are rarely due to STP bugs or issues but rather operational issues or misconfiguration.

Right.  I've only ever seen one true STP bug on what I believe is the
same hardware as involved here; a 7609 "7200 router" card with a
PA-A3-OC3 adapter doing sub-rate services.  When a 2684-bridged .1d
BPDU (from an 800 series on the edge, where STP was enabled on the 800's
bridge interface) arrived at a boundary port on our MSTP rings (and
was bridged to a VLAN on the fabric), this BPDU was also bridged
(against the MSTP boundary rules).  The 7600s saw the MSTP boundary
ports as continually flapping state, because they kept receiving this
.1d BPDU they should not have seen, and loops formed as a result.

> certainly there are aspects of STP that could be 'better'.  incidentally i talked about those at AusNOG last year. :)

The sooner we can get rid of break after make protocols, the happier
I'll be :-).

What's your view on TRILL?

cheers,

-- 
Andreux Fort (afort at choqolat.org)


