<div dir="ltr"><div><br></div>Blaming routing updates from peers is scapegoating; they are never the root cause of an outage - public BGP is the wild west and you're always getting broken information - it's your responsibility to filter those updates and (unless it's a zero-day poison packet bug) you only have yourself to blame if you fall over from them.<div><br>If I were an Optus business customer, reading that outage page would just make me even more determined to move elsewhere.<br><br></div><div>They vaguely categorised the "what" of the outage into a big bucket (software upgrade related), but gave absolutely no useful information and did not explain the "why", which is what would regain my confidence.<br><br></div><div>Why did this upgrade trigger an outage?</div><div> - Was there a behaviour/feature change they neglected to take into account?<br> - Did the upgrade require a config change that broke?<br> - Were they neglectful in following config best practices? (filtering, prefix limits, restarts, etc?)</div><div> - Did the new software have an unidentified bug?</div><div> - Why did testing not catch this problem? (they do test changes...right?)</div><div> - How did progressive rollout still lead to this impact? (they do progressive rollouts over N days/weeks...right?)</div><div><br>Why did mitigation take so long?<br> - What detection/telemetry measures led them to realise the scope of the outage? (news reports don't count)</div><div><div> - Were they dependent on the downed network for on-call paging &amp; comms?</div><div></div></div><div> - Why did their rollback plan fail? (they had a rollback plan...right?)</div><div> - Why was remote console/power access not working? (they have both...right?)</div><div> - Were they dependent on the downed network for said access?</div><div> - Were their playbooks/credential access dependent on the downed network?</div><div><br>"We have made changes to the network to address this issue so that it cannot occur again." ... 
this smells like "whoops forgot to set max-prefix (with restart!)".<br><br>Bugs, config stuff-ups, etc. happen, and they will continue to happen - it is a lie to state that outages will never happen again. This was the culmination of monumental failures in trigger, prevention and mitigation measures, which cannot be fixed in a couple of days; it sounds like much deeper architectural and organisational issues need addressing.</div><div><br></div><div>Many of the above failures are things that a young network will experience and learn from, but for Optus these should all be well planned for already.</div><div><br>I suspect any government investigation will simply add more bureaucracy and boxes to tick rather than effect meaningful change, but one can always be hopeful...<br><br></div><div>BB<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 14 Nov 2023 at 13:02, Michael Bethune <<a href="mailto:mike@ozonline.com.au">mike@ozonline.com.au</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">"Optus network received changes to routing information from an <br>
international peering network following a software upgrade"<br>
<br>
I note they are very careful to avoid nominating whose software upgrade.<br>
<br>
I also note that when they say they received routing updates,<br>
don't they limit the number of prefixes accepted by their BGP from<br>
any given peer?<br>
<br>
Sounds like a carefully crafted statement to enable them to point fingers<br>
elsewhere, not unexpected.<br>
<br>
- Michael.<br>
<br>
Quoting <a href="mailto:francisfides@mailup.net" target="_blank">francisfides@mailup.net</a>:<br>
<br>
> Looks like it was a software upgrade:<br>
> <a href="https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902" rel="noreferrer" target="_blank">https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902</a><br>
><br>
> Nothing in their media centre, just appears as a new box on their <br>
> outage response page: <a href="https://www.optus.com.au/notices/outage-response" rel="noreferrer" target="_blank">https://www.optus.com.au/notices/outage-response</a><br>
><br>
> Cheers<br>
><br>
> ----<br>
> Text:<br>
><br>
> "We have been working to understand what caused the outage on <br>
> Wednesday, and we now know what the cause was and have taken steps <br>
> to ensure it will not happen again. We apologise sincerely for <br>
> letting our customers down and the inconvenience it caused.<br>
><br>
> At around 4.05am Wednesday morning, the Optus network received <br>
> changes to routing information from an international peering network <br>
> following a software upgrade. These routing information changes <br>
> propagated through multiple layers in our network and exceeded <br>
> preset safety levels on key routers. This resulted in those routers <br>
> disconnecting from the Optus IP Core network to protect themselves.<br>
><br>
> The restoration required a large-scale effort of the team and in <br>
> some cases required Optus to reconnect or reboot routers physically, <br>
> requiring the dispatch of people across a number of sites in <br>
> Australia. This is why restoration was progressive over the afternoon.<br>
><br>
> Given the widespread impact of the outage, our investigations into <br>
> the issue took longer than we would have liked as we examined <br>
> several different paths to restoration. The restoration of the <br>
> network was at all times our priority and we subsequently <br>
> established the cause working together with our partners. We have <br>
> made changes to the network to address this issue so that it cannot <br>
> occur again.<br>
><br>
> We are committed to learning from what has occurred and continuing <br>
> to work with our international vendors and partners to increase the <br>
> resilience of our network. We will also support and fully cooperate <br>
> with the reviews being undertaken by the Government and the Senate.<br>
><br>
> We continue to invest heavily to improve the resiliency of our <br>
> network and services."<br>
><br>
> --<br>
><br>
> <a href="mailto:francisfides@mailup.net" target="_blank">francisfides@mailup.net</a><br>
><br>
> On Thu, Nov 9, 2023, at 07:15, DaZZa wrote:<br>
>> I have all three you're asking about.<br>
>><br>
>> But I'm very small potatoes compared to most of the members of this<br>
>> list, and my required remote footprint is correspondingly small, so<br>
>> it's easy to maintain.<br>
>><br>
>> D<br>
>><br>
>> On Thu, 9 Nov 2023 at 06:18, Phillip Grasso <br>
>> <<a href="mailto:phillip.grasso@gmail.com" target="_blank">phillip.grasso@gmail.com</a>> wrote:<br>
>>>><br>
>>>> I mean come on, it's nearly 2024 and a [major] telco does not <br>
>>>> have remote console access?<br>
>>><br>
>>><br>
>>> If we send a poll out to this community, how many would be able to <br>
>>> genuinely honestly answer:<br>
>>><br>
>>> Do you have a console or appropriate control plane access into all <br>
>>> your critical infrastructure?<br>
>>> Do you have independent out of band that does not share any <br>
>>> infrastructure with your current system(s) - with exemption for <br>
>>> physical location and power.<br>
>>> Do you have the ability to remote power control your devices?<br>
>>><br>
>>> We know from the Facebook outage in 2021 that they probably didn't <br>
>>> have the above, so it's not entirely uncommon for folks to have <br>
>>> *proper independent* console and remote access.<br>
>>><br>
>>><br>
>>> I empathize with the Optus team and their customers who have been <br>
>>> negatively impacted by this incident. I sincerely hope that some <br>
>>> positive outcomes can emerge from this situation, including:<br>
>>><br>
>>> - Attention to critical infrastructure resilience<br>
>>> - BGP clue increases<br>
>>> - Incident management improves<br>
>>> (I'm sure there's more).<br>
>>><br>
>>> Network is a black box to most people and I think a large chunk of <br>
>>> Australia now knows what it feels like to not have it.<br>
>>><br>
>>><br>
>>> On Wed, 8 Nov 2023 at 11:06, Ben Buxton <<a href="mailto:bb.ausnog@bb.cactii.net" target="_blank">bb.ausnog@bb.cactii.net</a>> wrote:<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On Wed, 8 Nov 2023 at 10:14, DaZZa <<a href="mailto:dazzagibbs@gmail.com" target="_blank">dazzagibbs@gmail.com</a>> wrote:<br>
>>>>><br>
>>>>> Yeah, I'd be willing to bet that it's a change which wasn't thoroughly<br>
>>>>> tested before being rolled out, and which had an inadequate backout<br>
>>>>> plan.<br>
>>>><br>
>>>><br>
>>>> Also, "Our on-site technician is actively prioritising <br>
>>>> establishing a console connection.".<br>
>>>><br>
>>>> I mean come on, it's nearly 2024 and a [major] telco does not <br>
>>>> have remote console access? Whilst I'm<br>
>>>> looking forward to enthusiastically reading the PM, I'll have to <br>
>>>> book a physio appointment in advance due to<br>
>>>> neck strain from all the head shaking it'll likely induce.<br>
>>>><br>
>>>> BB<br>
>>>><br>
>>>><br>
>>>>><br>
>>>>><br>
>>>>> Interestingly, my Optus mobile actually had a valid connection for a<br>
>>>>> short time - wasn't able to actually DO anything, but was connected to<br>
>>>>> the Optus network - but it's now gone to "SOS" mode.<br>
>>>>><br>
>>>>> D<br>
>>>>><br>
>>>>> On Wed, 8 Nov 2023 at 10:01, John Edwards <<a href="mailto:jaedwards@gmail.com" target="_blank">jaedwards@gmail.com</a>> wrote:<br>
>>>>> ><br>
>>>>> > The 4am Wednesday morning outage start looks suspiciously like <br>
>>>>> a firmware upgrade window.<br>
>>>>> ><br>
>>>>> > I note that Optus devices where I am are showing "SoS" which <br>
>>>>> indicates the tower is unable to reach the location register, <br>
>>>>> which presumably is on a private network and indicative of a <br>
>>>>> pretty major fault rather than just IP.<br>
>>>>> ><br>
>>>>> > John<br>
>>>>> ><br>
>>>>> ><br>
>>>>> > On Wed, 8 Nov 2023 at 09:10, DaZZa <<a href="mailto:dazzagibbs@gmail.com" target="_blank">dazzagibbs@gmail.com</a>> wrote:<br>
>>>>> >><br>
>>>>> >> The Optus hamster finally died of old age.<br>
>>>>> >><br>
>>>>> >> I would suggest your SMS issues would be caused by whoever is issuing<br>
>>>>> >> the SMS using Optus - not so much by the Telstra end receiving it.<br>
>>>>> >><br>
>>>>> >> Anecdotally, Optus enterprise/wholesale appears to be still functional<br>
>>>>> >> - at least my link appears to be working fine - and my BGP<br>
>>>>> >> advertisements are still being seen overseas - seems to be only NBN<br>
>>>>> >> and mobile based services which are busted<br>
>>>>> >><br>
>>>>> >> D<br>
>>>>> >><br>
>>>>> >> On Wed, 8 Nov 2023 at 09:27, <<a href="mailto:francisfides@mailup.net" target="_blank">francisfides@mailup.net</a>> wrote:<br>
>>>>> >> ><br>
>>>>> >> > Morning all,<br>
>>>>> >> > Hope the chaos isn't too hard on your work/family.<br>
>>>>> >> > I have had trouble with a couple of SMS verifications <br>
>>>>> coming through to me, my Telstra number. Is this related?<br>
>>>>> >> ><br>
>>>>> >> > Any general banter around the downtime would be fine too - <br>
>>>>> looks like it all began at 4.07am AEDT?<br>
>>>>> >> ><br>
>>>>> >> > Cheers<br>
>>>>> >> ><br>
>>>>> >> > --<br>
>>>>> >> ><br>
>>>>> >> > <a href="mailto:francisfides@mailup.net" target="_blank">francisfides@mailup.net</a><br>
>>>>> >> > _______________________________________________<br>
>>>>> >> > AusNOG mailing list<br>
>>>>> >> > <a href="mailto:AusNOG@lists.ausnog.net" target="_blank">AusNOG@lists.ausnog.net</a><br>
>>>>> >> > <a href="https://lists.ausnog.net/mailman/listinfo/ausnog" rel="noreferrer" target="_blank">https://lists.ausnog.net/mailman/listinfo/ausnog</a><br>
>>>>> >><br>
>>>>> >><br>
>>>>> >><br>
>>>>> >> --<br>
>>>>> >> veg·e·tar·i·an:<br>
>>>>> >> Ancient tribal slang for the village idiot who can't hunt, <br>
>>>>> fish or ride<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> --<br>
>>>>> veg·e·tar·i·an:<br>
>>>>> Ancient tribal slang for the village idiot who can't hunt, fish or ride<br>
>>>><br>
>><br>
>><br>
>><br>
>> --<br>
>> veg·e·tar·i·an:<br>
>> Ancient tribal slang for the village idiot who can't hunt, fish or ride<br>
><br>
</blockquote></div>