<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><br></div><div><div>On 04/03/2013, at 12:08 AM, Damian Guppy &lt;<a href="mailto:the.damo@gmail.com">the.damo@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr"><div style="">They have now put up an incident report. The cause was a combination of a bad rule applied to all edge routers across all 23 global datacenters using flowspec and a bug in Junos that caused the routers to leak memory and crash when they processed the rule. To top things off, their automated recovery tools couldn't reboot/recover the vast majority of the routers automatically, and the ones they could got flooded with all the traffic the rest of them would normally handle. They ended up having to get people onsite at all datacenters to physically hard reboot the routers.</div>
<div style=""><br></div><div style="">Poor guys</div><div><br></div><a href="http://blog.cloudflare.com/todays-outage-post-mortem-82515">http://blog.cloudflare.com/todays-outage-post-mortem-82515</a><br><div><br></div><div style="">
--Damian</div></div></blockquote><div><br></div><div><div>Bug looks to be pretty easy to reproduce too (in an arbitrary version):</div><div><br></div><div>bdale@mx80-bng1> show route table inetflow.0</div><div><br></div><div><div>inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)</div><div>Restart Complete</div><div>+ = Active Route, - = Last Active, * = Both</div><div><br></div><div>173.2.3.4,*,port=53,len=99971,=99985/term:N/A </div><div> *[Flow/5] 00:01:37</div><div> Fictitious</div></div><div><br></div><div>bdale@mx80-bng1> show chassis routing-engine | match Mem</div><div> Memory utilization 37 percent</div><div><br></div><div>... after taking a swig of beverage </div><div><br></div><div><div>bdale@mx80-bng1> show chassis routing-engine | match Mem</div><div> Memory utilization 97 percent</div></div><div><div>bdale@mx80-bng1> show chassis routing-engine | match Mem</div><div> Memory utilization 99 percent</div></div><div><br></div><div>It also pegs the CPU at maximum during this time.</div><div><br></div><div>Anyone using Flowspec out there might want to take a good hard look at their validation until this is addressed (a commit script would do the trick). That goes especially for those receiving Flowspec via BGP from external sources, e.g. Team Cymru (should be just prefixes) or Arbor (Roland may have more insight on sizing validation), as there appears to be no way to filter/validate specific rules (just the sources you learnt them from).</div><div><br></div><div>I've tried a few other "illegal" values (e.g. 65537, 65555) for packet length, but nothing kicks it off like the Cloudflare sizes (the rate at which memory is consumed *may* be proportional to the size of the packet described). Removing the route prior to topping out doesn't reclaim the memory either : (</div></div><div><br></div><div>Ben</div><div><br></div></div></body></html>