[AusNOG] Optus downtime chat + affecting SMS verification to Telstra?

Ben Buxton bb.ausnog at bb.cactii.net
Fri Nov 17 15:56:37 AEDT 2023


It looks like the outage was largely due to a max-prefix issue then (or
lack thereof). And their change management processes don't seem to come
into play (except perhaps during restoration?). Given that this was from
prefixes received over an exchange, I'm curious to know why no-one else
seems to have suffered as it's unlikely just 1 peer would be affected.
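
For anyone who hasn't watched a max-prefix limit trip, here is a minimal Python sketch of the bookkeeping involved. It is a toy model, not any vendor's implementation: the class and method names are hypothetical, the 524288 limit is simply the v6 default quoted further down this thread, and the 75% warning threshold is my assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class PeerSession:
    """Toy model of one BGP session with a max-prefix guard (hypothetical names)."""
    max_prefixes: int            # hard limit; 524288 is the v6 default quoted in this thread
    warn_percent: int = 75       # assumed warning threshold, not taken from the submission
    accepted: int = 0
    state: str = "established"

    def receive(self, count: int) -> str:
        """Account for `count` newly received prefixes and return the session state."""
        if self.state != "established":
            return self.state    # a torn-down session accepts nothing further
        self.accepted += count
        if self.accepted > self.max_prefixes:
            # Limit exceeded: drop the session and discard what was learned from this peer.
            self.state = "idle (max-prefix exceeded)"
            self.accepted = 0
        elif self.accepted * 100 >= self.max_prefixes * self.warn_percent:
            print(f"warning: {self.accepted}/{self.max_prefixes} prefixes received")
        return self.state


session = PeerSession(max_prefixes=524288)
print(session.receive(400_000))   # about 76% of the limit: warns, stays established
print(session.receive(200_000))   # 600,000 > 524,288: session is torn down
```

The point of the sketch is that a tripped limit on its own should only cost you the session to that peer. Whether the affected routers behaved like this, or failed harder, is not something the submission settles.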

Something glaringly missing from the Senate submission is information about
why the restoration took so long. 6 hours is an embarrassingly long time to
fix what was essentially a max-prefix trip. I would really like to know
more details about:

- OOB access
- Remote power / reboot capability
- Potential issues about comms between engineers and otherwise accessing a
downed network - I bet it took a long time to contact some key engineers.

Again it looks like they explained what happened (max prefix trip and then
engineers working + onsite for 6 hours to mitigate). But not why they feel
6 hours was an acceptable duration - the submission seems to imply 6 hours
is a normal investigation time. This aspect really needs to be picked apart
further.

Outages happen - it's a fact of life. But prevention only goes so far; you
need to build and test robust mitigation strategies and incident management
plans.


On Fri, 17 Nov 2023 at 13:36, Christopher O'Shea <casper.oshea at gmail.com>
wrote:

> I wouldn't be so quick to blame it on a single thing. We have all been
> there. An incident always comes down to many things not going the way you
> think.
>
> Reading between the lines, I see that a peer's network announced more routes
> than "normal" and, seeing they called out IPv6 in their submission to the
> Senate [1], a lack of filtering of v6 for that peer due to an oversight or
> misunderstanding of the template/group between v4 and v6.
>
> Then, when it was shared with their PE routers (which seem to be Cisco), on
> the ASR9K (not sure what they use) the default limit of 524288 [2] for v6
> could lead to the session's termination by default.
>
> We should read these reports and understand whether the same thing could
> happen to our own networks, what protection we have to stop it, and our
> devices' default behaviour.
>
> I would like to know more about their out-of-band and why it had issues.
> (Could it be that DNS broke, that there was an issue getting to internal
> documentation, that password vault access was broken, or that the IP limit
> of the OOB device was too tight?)
>
> Chris O'Shea
>
> [1]
> https://www.aph.gov.au/DocumentStore.ashx?id=2ed95079-023d-49d5-87fd-d9029740629b&subId=750333
>  reports of the Optus outage
> [2]
> https://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/routing/command/reference/b-routing-cr-asr9000/bgp-commands.html#wp3192417938
>
>
>
> On Fri, Nov 17, 2023 at 2:02 AM Tony Wicks <tony at wicks.co.nz> wrote:
>
>> To be fair, assuming there were config issues (i.e. the lack of
>> maximum-prefixes and the lack of filtering preventing large route tables
>> hitting devices that cannot carry full tables), the behaviour of a network
>> device when its RIB/FIB or memory is exceeded also significantly comes into
>> play. Dropping BGP is fine; crashing the router so it requires a hard reset
>> is another case entirely. In my experience (I have not used Ciscos in a
>> telco environment for many years, however) Cisco devices have been much more
>> predisposed to crash catastrophically than other vendor devices like Nokia
>> or Juniper.
>>
>>
>>
>> -----Original Message-----
>> From: AusNOG <ausnog-bounces at lists.ausnog.net> On Behalf Of DaZZa
>> Sent: Friday, November 17, 2023 2:38 PM
>> To: Andrew Oakeley <andrew at oakeley.com.au>
>> Cc: michael.bethune at australiaonline.au; Luke Thompson <
>> luke.t at tncrew.com.au>; ausnog at lists.ausnog.net
>> Subject: Re: [AusNOG] Optus downtime chat + affecting SMS verification
>> to Telstra?
>>
>> What a load of crap.
>>
>> The root cause was they're morons, and configured the routers incorrectly.
>>
>> Cisco had nothing to do with it. I'll bet the routers behaved exactly as
>> they were intended to behave.
>>
>>
>> _______________________________________________
>> AusNOG mailing list
>> AusNOG at lists.ausnog.net
>> https://lists.ausnog.net/mailman/listinfo/ausnog
>>