[AusNOG] Cloudflare offline

Ben Dale bdale at comlinx.com.au
Mon Mar 4 17:35:58 EST 2013


On 04/03/2013, at 12:08 AM, Damian Guppy <the.damo at gmail.com> wrote:

> They have now put up an incident report. The cause was a combination of a bad rule being applied via flowspec to all edge routers across all 23 global datacenters, and a bug in Junos that caused the routers to leak memory and crash when they processed the rule. To top things off, their automated recovery tools couldn't reboot/recover the vast majority of the routers, and the ones they could recover got flooded with all the traffic the rest of them would normally handle. They ended up having to get people onsite at all datacenters to physically hard reboot the routers.
> 
> Poor guys
> 
> http://blog.cloudflare.com/todays-outage-post-mortem-82515
> 
> --Damian

The bug looks to be pretty easy to reproduce too (on an arbitrary Junos version):

bdale at mx80-bng1> show route table inetflow.0

inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

173.2.3.4,*,port=53,len=99971,=99985/term:N/A          
                   *[Flow/5] 00:01:37
                      Fictitious

bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          37 percent

... after taking a swig of beverage 

bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          97 percent
bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          99 percent

It also pegs the CPU up to maximum during this time.

Anyone using Flowspec out there might want to take a good hard look at their validation until this is addressed (a commit script would do the trick). That goes especially for those receiving Flowspec via BGP from external sources, e.g. Team Cymru (should be just prefixes) or Arbor (Roland may have more insight on sizing validation), as there appears to be no way to filter/validate specific rules, only the sources you learnt them from.
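
To give a rough idea of the kind of sanity check meant here, below is a minimal sketch in Python. It is not a Junos commit script, just the logic one might enforce: the IPv4 Total Length field is 16 bits, so a packet-length match above 65535 can never be legitimate. The function names, the rule name "dns-drop" and the way a rule is represented (a name plus a list of lengths) are all made up for illustration.

```python
# Hypothetical sanity check for flowspec packet-length matches.
# The IPv4 Total Length field is 16 bits, so no real packet exceeds 65535.
MAX_IP_PACKET_LENGTH = 65535


def invalid_packet_lengths(lengths):
    """Return the values in `lengths` that fall outside 0..65535."""
    return [n for n in lengths if not 0 <= n <= MAX_IP_PACKET_LENGTH]


def check_rule(name, lengths):
    """Return a list of human-readable problems for one rule (empty if OK)."""
    return [
        f"{name}: packet-length {n} is outside 0-{MAX_IP_PACKET_LENGTH}"
        for n in invalid_packet_lengths(lengths)
    ]


if __name__ == "__main__":
    # The two lengths from the route above, plus one sane value for contrast.
    for problem in check_rule("dns-drop", [99971, 99985, 512]):
        print(problem)
```

Run against the lengths in the route above, this flags 99971 and 99985 and lets 512 through. The same bound could be applied wherever rules are generated, before they ever reach a router.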

I've tried a few other "illegal" values (e.g. 65537, 65555) for packet length, but nothing kicks it off like the Cloudflare sizes (the rate at which memory is consumed *may* be proportional to the size of the packet described). Removing the route before memory tops out doesn't reclaim it either :(

Ben
