[AusNOG] census issues tonight
Mark Delany
g2x at juliet.emu.st
Wed Aug 10 14:28:14 EST 2016
On 10Aug16, James Braunegg allegedly wrote:
> No need for Geo Blocking.. that's hard work
>
> Just only advertise the route locally within Australia i.e... to Optus, Telstra and on peering exchanges... Job done..
Nope. Job not done. This sort of single-bullet approach is probably
why they failed.
If you want scale and resiliency there are many many things you do to
ensure success. For example how would an AU-only route announcement
protect against a DDOS initiated here? Australians love their ancient
Windows boxen so there are plenty of locally available bots for rent.
It's hard to know where to even begin with the census site as they got
it wrong in so many ways. It's obvious they never even did a mental
walk thru of what-ifs.
Based on HTTP responses with failure text, we can guess that they
had a coupled system when a de-coupled one would have been more
resilient. They relied on physical scaling which is obviously
impossible to augment in any reasonable time frame. They did not do a
trial run of anything to try and get a sense of the traffic profile so
they were completely guessing. Why not get everyone to register a week
beforehand to get a feel for the traffic and load? Their servers were
centralized, which is an obvious no-no. Even their DNS setup was such
that they couldn't swing traffic quickly if they had to.
Their efforts at switching routing during the evening suggest that
they thought it was some sort of traffic-based DOS, but as others
observed, there is not a lot of evidence that that was actually the
case. It looks like all they knew was that their service was failing
and they were scrambling to deal with it. Did they do a practice run
with an actual DDOS? Their 6h DNS TTL suggests not, as that's one of
the first things you want to be able to change rapidly.
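To put rough numbers on the TTL point, here is a minimal Python
sketch. Only the 6h figure comes from the observation above; the
function name, the 60-second comparison TTL and the uniform-caching
model are illustrative assumptions, not anything known about their
setup.

```python
def stale_fraction(ttl_seconds: float, elapsed_seconds: float) -> float:
    """Estimated share of caching resolvers still serving the old record.

    Assumption (illustrative only): resolvers cached the record at times
    spread uniformly over the last TTL window, and all honour the TTL.
    """
    if ttl_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - elapsed_seconds / ttl_seconds)


SIX_HOURS = 6 * 60 * 60   # the 6h TTL observed on the census site
ONE_MINUTE = 60           # hypothetical short TTL, for comparison
HALF_HOUR = 30 * 60       # time elapsed since the record was changed

for label, ttl in (("6h TTL", SIX_HOURS), ("60s TTL", ONE_MINUTE)):
    share = stale_fraction(ttl, HALF_HOUR)
    print(f"{label}: ~{share:.0%} of caches still stale after 30 minutes")
```

Under that model, half an hour after a change about 92% of caches
would still hold the old address with a 6h TTL, versus none with a
60s TTL.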
I also saw no evidence of their ability to gracefully degrade. Either
they were up or they were down. No ability to redistribute the
traffic, nor to have the browser-based JS reach for an alternative
site or for the site to do less work when it got too busy, such as
dump and defer validation.
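As a rough illustration of "dump and defer validation", here is a
minimal Python sketch. Everything in it is hypothetical: the Intake
class, the busy threshold, the household_id field and the trivial
validate() stand-in are assumptions for the example, not a
description of how the census site worked.

```python
from collections import deque

BUSY_THRESHOLD = 100  # hypothetical: concurrent requests at which we shed work


def validate(form: dict) -> bool:
    """Stand-in for expensive validation; here it only checks one field."""
    return bool(form.get("household_id"))


class Intake:
    def __init__(self, busy_threshold: int = BUSY_THRESHOLD):
        self.busy_threshold = busy_threshold
        self.stored = []         # submissions that passed validation
        self.deferred = deque()  # raw submissions awaiting validation

    def submit(self, form: dict, current_load: int) -> str:
        if current_load >= self.busy_threshold:
            # Too busy: dump the raw form now and validate it later.
            self.deferred.append(form)
            return "deferred"
        if validate(form):
            self.stored.append(form)
            return "accepted"
        return "rejected"

    def drain(self) -> int:
        """Validate deferred forms once load drops; return how many passed."""
        passed = 0
        while self.deferred:
            form = self.deferred.popleft()
            if validate(form):
                self.stored.append(form)
                passed += 1
        return passed
```

The point of the design is that the busy path does almost no work:
the user gets an immediate answer and the expensive checks run
after the peak has passed. The cost is that bad submissions are
only discovered later, so you need a way to follow up on them.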
Their one bullet seems to be to have provisioned twice as much
front-end server capacity as they thought they'd need. A mere 2x
margin for a completely new system with an unknown traffic profile?
That's pretty scandalous for such a high-profile site right there.
Mark.