[AusNOG] Cloudflare offline

Ben Dale bdale at comlinx.com.au
Tue Mar 5 10:42:06 EST 2013


Just to close this one off for everyone who hit me up offline: it looks like they hit an existing PR, 734453.

After upgrading to the release suggested, lab tests confirm the issue is no longer happening (though abnormally high values are still accepted).
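
Since out-of-range values are still accepted, it may be worth keeping some validation in front of Flowspec regardless. Below is a minimal sketch of the kind of check I mean, written in Python purely for illustration. It is not a Junos commit script (those are typically written in SLAX or XSLT), and the function names are hypothetical. The only hard fact it relies on is that the IPv4 Total Length field is 16 bits, so no real IPv4 packet can be longer than 65535 bytes.

```python
# Hypothetical sanity check for Flowspec packet-length values.
# Names and structure are illustrative only, not any vendor's API.

MAX_IPV4_TOTAL_LENGTH = 65535  # IPv4 Total Length is a 16-bit field


def invalid_packet_lengths(lengths):
    """Return the values that could never match a real IPv4 packet."""
    return [n for n in lengths if not 0 <= n <= MAX_IPV4_TOTAL_LENGTH]


def rule_is_sane(lengths):
    """True only if every packet-length value in the rule is plausible."""
    return not invalid_packet_lengths(lengths)
```

Fed the lengths from the rule I reproduced with, `invalid_packet_lengths([99971, 99985])` returns both values, and `rule_is_sane([65537])` is false, so a check like this would reject those rules before they were ever advertised.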

You may now return to your regular arguments over historical decisions.

Cheers,

Ben

On 04/03/2013, at 4:35 PM, Ben Dale <bdale at comlinx.com.au> wrote:

> 
> On 04/03/2013, at 12:08 AM, Damian Guppy <the.damo at gmail.com> wrote:
> 
>> They have now put up an incident report. The cause was a combination of a bad rule being applied via flowspec to all edge routers across all 23 global datacenters, and a bug in Junos that caused the routers to leak memory and crash when they processed the rule. To top things off, their automated recovery tools couldn't reboot/recover the vast majority of the routers automatically, and the ones they could got flooded with all the traffic the rest of them would normally handle. They ended up having to get people onsite at all datacenters to physically hard reboot the routers.
>> 
>> Poor guys
>> 
>> http://blog.cloudflare.com/todays-outage-post-mortem-82515
>> 
>> --Damian
> 
> Bug looks to be pretty easy to reproduce too (in an arbitrary version):
> 
> bdale at mx80-bng1> show route table inetflow.0
> 
> inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
> Restart Complete
> + = Active Route, - = Last Active, * = Both
> 
> 173.2.3.4,*,port=53,len=99971,=99985/term:N/A          
>                    *[Flow/5] 00:01:37
>                       Fictitious
> 
> bdale at mx80-bng1> show chassis routing-engine | match Mem
>     Memory utilization          37 percent
> 
> ... after taking a swig of beverage 
> 
> bdale at mx80-bng1> show chassis routing-engine | match Mem
>     Memory utilization          97 percent
> bdale at mx80-bng1> show chassis routing-engine | match Mem
>     Memory utilization          99 percent
> 
> It also pegs the CPU up to maximum during this time.
> 
> Anyone using Flowspec out there might want to take a good hard look at their validation until this is addressed (a commit script would do the trick). This applies especially to those receiving Flowspec via BGP from external sources, e.g. Team Cymru (should be just prefixes) or Arbor (Roland may have more insight on sizing validation), as there appears to be no way to filter/validate specific rules (only the sources you learnt them from).
> 
> I've tried a few other "illegal" values (e.g. 65537, 65555) for packet length, but nothing kicks it off like the Cloudflare sizes (the rate at which memory is consumed *may* be proportional to the size of the packet described). Removing the route prior to topping out doesn't reclaim the memory either :(
> 
> Ben
> 
