[AusNOG] Optus downtime chat + affecting SMS verification to Telstra?

Andrew Oakeley andrew at oakeley.com.au
Fri Nov 17 11:31:47 AEDT 2023


And in the Senate inquiry this morning they both blamed Cisco.

"The trigger was the Singtel outage, but the root cause was Cisco."

https://www.abc.net.au/news/2023-11-17/asx-markets-business-live-news-optus-outage-senate-inquiry/103115518

-----Original Message-----
From: AusNOG <ausnog-bounces at lists.ausnog.net> On Behalf Of DaZZa
Sent: Friday, November 17, 2023 8:15 AM
To: Luke Thompson <luke.t at tncrew.com.au>
Cc: michael.bethune at australiaonline.au; ausnog at lists.ausnog.net
Subject: Re: [AusNOG] Optus downtime chat + affecting SMS verification to Telstra?

And now Singtel have returned serve and are denying it was them.

https://www.zdnet.com/article/singtel-refutes-reports-that-its-system-upgrade-caused-optus-outage/

It's like watching kids trying to blame each other for who broke the window with the cricket ball.

D

On Wed, 15 Nov 2023 at 11:01, Luke Thompson <luke.t at tncrew.com.au> wrote:
>
> They've blamed Singtel Internet Exchange (STiX) for the international peering route updates, at least going by anonymous sources cited by SMH.
>
> https://www.smh.com.au/technology/identity-of-third-party-who-brought-down-optus-network-revealed-20231114-p5ejy1.html
>
> Luke
>
> On 14 November 2023 12:37:30 pm Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>>
>>
>> Blaming routing updates from peers is a scapegoat and never is the cause of an outage - public BGP is the wild west and you're always getting broken information - it's your responsibility to filter those updates and (unless it's a zero-day poison packet bug) you only have yourself to blame if you fall over from them.
>>
>> If I were an Optus business customer, reading that outage page would just make me even more determined to move elsewhere.
>>
>> They vaguely categorised the "what" of the outage into a big bucket (software upgrade related), but gave absolutely no useful information and did not explain the "why", which is what would regain my confidence.
>>
>> Why did this upgrade trigger an outage?
>>   - Was there a behaviour/feature change they neglected to take into account?
>>   - Did the upgrade require a config change that broke?
>>   - Were they neglectful in following config best practices? (filtering, prefix limits, restarts, etc?)
>>   - Did the new software have an unidentified bug?
>>   - Why did testing not catch this problem (they do test changes...right?)
>>   - How did progressive rollout still lead to this impact? (they do progressive rollouts over N days/weeks...right?)
>>
>> Why did mitigation take so long?
>>   - What detection/telemetry measures led them to realise the scope of the outage? (news reports don't count)
>>   - Were they dependent on the downed network for oncall paging & comms?
>>   - Why did their rollback plan fail? (they had a rollback plan...right?)
>>   - Why was remote console/power access not working? (they have both...right?)
>>   - Were they dependent on the downed network for said access?
>>   - Were their playbooks/credential access dependent on the downed network?
>>
>> "We have made changes to the network to address this issue so that it cannot occur again." ... this smells like "whoops forgot to set max-prefix (with restart!)".
>>
>> Bugs, config stuff-ups, etc happen, and they will continue to happen - it is a lie to state that outages will never happen again. This is the culmination of monumental failures in the trigger, prevention and mitigation measures, which cannot be fixed in a couple of days; it sounds like much deeper architectural and organisational issues need addressing.
>>
>> Many of the above failures are things that a young network will experience and learn from, but for Optus these should all be well planned for already.
>>
>> I suspect any government investigation will simply add more bureaucracy and boxes to tick rather than effect meaningful change, but one can always be hopeful...
>>
>> BB
>>
>> On Tue, 14 Nov 2023 at 13:02, Michael Bethune <mike at ozonline.com.au> wrote:
>>>
>>> "Optus network received changes to routing information from an 
>>> international peering network following a software upgrade"
>>>
>>> I note they are very careful to avoid nominating whose software upgrade.
>>>
>>> I also note that when they say they received routing updates, don't 
>>> they limit the number of prefixes accepted by their BGP from any 
>>> given peer?
>>>
>>> Sounds like a carefully crafted statement to enable them to point 
>>> fingers elsewhere, not unexpected.
>>>
>>> - Michael.
>>>
>>> Quoting francisfides at mailup.net:
>>>
>>> > Looks like it was a software upgrade:
>>> > https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902
>>> >
>>> > Nothing in their media centre, just appears as a new box on their 
>>> > outage response page: 
>>> > https://www.optus.com.au/notices/outage-response
>>> >
>>> > Cheers
>>> >
>>> > ----
>>> > Text:
>>> >
>>> > "We have been working to understand what caused the outage on 
>>> > Wednesday, and we now know what the cause was and have taken steps 
>>> > to ensure it will not happen again.  We apologise sincerely for 
>>> > letting our customers down and the inconvenience it caused.
>>> >
>>> > At around 4.05am Wednesday morning, the Optus network received 
>>> > changes to routing information from an international peering 
>>> > network following a software upgrade. These routing information 
>>> > changes propagated through multiple layers in our network and 
>>> > exceeded preset safety levels on key routers. This resulted in 
>>> > those routers disconnecting from the Optus IP Core network to protect themselves.
>>> >
>>> > The restoration required a large-scale effort of the team and in 
>>> > some cases required Optus to reconnect or reboot routers 
>>> > physically, requiring the dispatch of people across a number of 
>>> > sites in Australia. This is why restoration was progressive over the afternoon.
>>> >
>>> > Given the widespread impact of the outage, our investigations into 
>>> > the issue took longer than we would have liked as we examined 
>>> > several different paths to restoration. The restoration of the 
>>> > network was at all times our priority and we subsequently 
>>> > established the cause working together with our partners. We have 
>>> > made changes to the network to address this issue so that it 
>>> > cannot occur again.
>>> >
>>> > We are committed to learning from what has occurred and continuing 
>>> > to work with our international vendors and partners to increase 
>>> > the resilience of our network. We will also support and fully 
>>> > cooperate with the reviews being undertaken by the Government and the Senate.
>>> >
>>> > We continue to invest heavily to improve the resiliency of our 
>>> > network and services."
>>> >
>>> > --
>>> >
>>> >   francisfides at mailup.net
>>> >
>>> > On Thu, Nov 9, 2023, at 07:15, DaZZa wrote:
>>> >> I have all three you're asking about.
>>> >>
>>> >> But I'm very small potatoes compared to most of the members of 
>>> >> this list, and my required remote footprint is correspondingly 
>>> >> small, so it's easy to maintain.
>>> >>
>>> >> D
>>> >>
>>> >> On Thu, 9 Nov 2023 at 06:18, Phillip Grasso 
>>> >> <phillip.grasso at gmail.com> wrote:
>>> >>>>
>>> >>>> I mean come on, it's nearly 2024 and a [major] telco does not 
>>> >>>> have remote console access?
>>> >>>
>>> >>>
>>> >>> If we send a poll out to this community, how many would be able
>>> >>> to genuinely, honestly answer:
>>> >>>
>>> >>> Do you have a console or appropriate control plane access into
>>> >>> all your critical infrastructure?
>>> >>> Do you have independent out of band that does not share any
>>> >>> infrastructure with your current system(s) - with exemption for
>>> >>> physical location and power?
>>> >>> Do you have the ability to remote power control your devices?
>>> >>>
>>> >>> We know from the Facebook outage in 2021 that they probably
>>> >>> didn't have the above, so it's not entirely uncommon for folks
>>> >>> to lack *proper independent* console and remote access.
>>> >>>
>>> >>>
>>> >>> I empathize with the Optus team and their customers who have 
>>> >>> been negatively impacted by this incident. I sincerely hope that 
>>> >>> some positive outcomes can emerge from this situation, including:
>>> >>>
>>> >>> - Attention to critical infrastructure resilience
>>> >>> - BGP clue increases
>>> >>> - Incident management improves
>>> >>> (I'm sure there's more).
>>> >>>
>>> >>> The network is a black box to most people and I think a large chunk
>>> >>> of Australia now knows what it feels like to not have it.
>>> >>>
>>> >>>
>>> >>> On Wed, 8 Nov 2023 at 11:06, Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, 8 Nov 2023 at 10:14, DaZZa <dazzagibbs at gmail.com> wrote:
>>> >>>>>
>>> >>>>> Yeah, I'd be willing to bet that it's a change which wasn't 
>>> >>>>> thoroughly tested before being rolled out, and which had an 
>>> >>>>> inadequate backout plan.
>>> >>>>
>>> >>>>
>>> >>>> Also, "Our on-site technician is actively prioritising 
>>> >>>> establishing a console connection.".
>>> >>>>
>>> >>>> I mean come on, it's nearly 2024 and a [major] telco does not 
>>> >>>> have remote console access? Whilst I'm looking forward to 
>>> >>>> enthusiastically reading the PM, I'll have to book a physio 
>>> >>>> appointment in advance due to neck strain from all the head 
>>> >>>> shaking it'll likely induce.
>>> >>>>
>>> >>>> BB
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> Interestingly, my Optus mobile actually had a valid connection 
>>> >>>>> for a short time - wasn't able to actually DO anything, but 
>>> >>>>> was connected to the Optus network - but it's now gone to "SOS" mode.
>>> >>>>>
>>> >>>>> D
>>> >>>>>
>>> >>>>> On Wed, 8 Nov 2023 at 10:01, John Edwards <jaedwards at gmail.com> wrote:
>>> >>>>> >
>>> >>>>> > The 4am Wednesday morning outage start looks suspiciously
>>> >>>>> > like a firmware upgrade window.
>>> >>>>> >
>>> >>>>> > I note that Optus devices where I am are showing "SoS" which
>>> >>>>> > indicates the tower is unable to reach the location register,
>>> >>>>> > which presumably is on a private network and indicative of a
>>> >>>>> > pretty major fault rather than just IP.
>>> >>>>> >
>>> >>>>> > John
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > On Wed, 8 Nov 2023 at 09:10, DaZZa <dazzagibbs at gmail.com> wrote:
>>> >>>>> >>
>>> >>>>> >> The Optus hamster finally died of old age.
>>> >>>>> >>
>>> >>>>> >> I would suggest your SMS issues would be caused by whoever 
>>> >>>>> >> is issuing the SMS using Optus - not so much by the Telstra end receiving it.
>>> >>>>> >>
>>> >>>>> >> Anecdotally, Optus enterprise/wholesale appears to be still
>>> >>>>> >> functional - at least my link appears to be working fine - and
>>> >>>>> >> my BGP advertisements are still being seen overseas - seems to
>>> >>>>> >> be only NBN and mobile based services which are busted.
>>> >>>>> >>
>>> >>>>> >> D
>>> >>>>> >>
>>> >>>>> >> On Wed, 8 Nov 2023 at 09:27, <francisfides at mailup.net> wrote:
>>> >>>>> >> >
>>> >>>>> >> > Morning all,
>>> >>>>> >> > Hope the chaos isn't too hard on your work/family.
>>> >>>>> >> > I have had trouble with a couple of SMS verifications
>>> >>>>> >> > coming through to me, my Telstra number. Is this related?
>>> >>>>> >> >
>>> >>>>> >> > Any general banter around the downtime would be fine too -
>>> >>>>> >> > looks like it all began at 4.07am AEDT?
>>> >>>>> >> >
>>> >>>>> >> > Cheers
>>> >>>>> >> >
>>> >>>>> >> > --
>>> >>>>> >> >
>>> >>>>> >> >   francisfides at mailup.net 
>>> >>>>> >> > _______________________________________________
>>> >>>>> >> > AusNOG mailing list
>>> >>>>> >> > AusNOG at lists.ausnog.net
>>> >>>>> >> > https://lists.ausnog.net/mailman/listinfo/ausnog
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> --
>>> >>>>> >> veg·e·tar·i·an:
>>> >>>>> >> Ancient tribal slang for the village idiot who can't hunt, fish or ride