[AusNOG] Optus downtime chat + affecting SMS verification to Telstra?
Luke Thompson
luke.t at tncrew.com.au
Wed Nov 15 11:01:18 AEDT 2023
They've blamed Singtel Internet Exchange (STiX) for the international
peering route updates, at least going by anonymous sources cited by SMH.
https://www.smh.com.au/technology/identity-of-third-party-who-brought-down-optus-network-revealed-20231114-p5ejy1.html
Luke
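
For anyone following along, the "max-prefix (with restart!)" point in the quoted reply below can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name, the tick-based timer and the numbers are all hypothetical, and it does not represent any vendor's implementation or what Optus actually ran. It contrasts a session that stays down after exceeding its prefix limit with one that re-establishes by itself after an idle period.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeerSession:
    """Toy model of one BGP session guarded by a prefix limit (hypothetical)."""
    max_prefixes: int
    # Idle ticks before the session retries on its own; None means it
    # stays down until someone intervenes manually.
    restart_after: Optional[int] = None
    state: str = "established"
    down_since: Optional[int] = None

    def receive(self, tick: int, prefix_count: int) -> str:
        if self.state == "down":
            # Without a restart timer, or before it expires, stay down.
            if self.restart_after is None or tick - self.down_since < self.restart_after:
                return self.state
            self.state = "established"
            self.down_since = None
        # Exceeding the limit tears the session down to protect the router.
        if prefix_count > self.max_prefixes:
            self.state = "down"
            self.down_since = tick
        return self.state


def simulate(session: PeerSession, feed: List[int]) -> List[str]:
    """Feed one prefix count per tick and record the session state."""
    return [session.receive(tick, count) for tick, count in enumerate(feed)]


# Assumed scenario: a peer briefly floods far more prefixes than expected.
feed = [100, 5000, 5000, 100, 100, 100]
no_restart = simulate(PeerSession(max_prefixes=1000), feed)
with_restart = simulate(PeerSession(max_prefixes=1000, restart_after=2), feed)
print(no_restart)
print(with_restart)
```

In this model the session without a restart timer never comes back once the flood stops, which is the situation that ends in people being dispatched to sites, while the one with a timer recovers once the peer's feed is sane again.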
On 14 November 2023 12:37:30 pm Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>
> Blaming routing updates from peers is a scapegoat and never is the cause of
> an outage - public BGP is the wild west and you're always getting broken
> information - it's your responsibility to filter those updates and (unless
> it's a zero-day poison packet bug) you only have yourself to blame if you
> fall over from them.
>
> If I were an optus business customer, reading that outage page would just
> make me even more determined to move elsewhere.
>
> They vaguely categorised the "what" of the outage into a big bucket
> (software upgrade related), but gave absolutely no useful information and
> did not explain the "why", which is what would regain my confidence.
>
> Why did this upgrade trigger an outage?
> - Was there a behaviour/feature change they neglected to take into account?
> - Did the upgrade require a config change that broke?
> - Were they neglectful in following config best practices? (filtering,
> prefix limits, restarts, etc?)
> - Did the new software have an unidentified bug?
> - Why did testing not catch this problem (they do test changes...right?)
> - How did progressive rollout still lead to this impact? (they do
> progressive rollouts over N days/weeks...right?)
>
> Why did mitigation take so long?
> - What detection/telemetry measures led them to realise the scope of the
> outage? (news reports don't count)
> - Were they dependent on the downed network for oncall paging & comms?
>
> - Why did their rollback plan fail? (they had a rollback plan...right?)
> - Why was remote console/power access not working? (they have both...right?)
> - Were they dependent on the downed network for said access?
> - Were their playbooks/credential access dependent on the downed network?
>
> "We have made changes to the network to address this issue so that it
> cannot occur again." ... this smells like "whoops forgot to set max-prefix
> (with restart!)".
>
> Bugs, config stuff-ups, etc happen, and they will continue to happen - it
> is a lie to state that outages will never happen again. This is the
> culmination of monumental failures in the trigger, prevention and
> mitigation measures which cannot be fixed in a couple of days; it sounds
> like much deeper architectural and organisational issues need addressing.
> Many of the above failures are things that a young network will experience
> and learn from, but for Optus these should all be well planned for already.
>
> I suspect any government investigation will simply add more bureaucracy and
> boxes to tick rather than effect meaningful change, but one can always be
> hopeful...
>
> BB
>
> On Tue, 14 Nov 2023 at 13:02, Michael Bethune <mike at ozonline.com.au> wrote:
> "Optus network received changes to routing information from an
> international peering network following a software upgrade"
>
> I note they are very careful to avoid nominating whose software upgrade.
>
> I also note that they say they received routing updates. Don't they
> limit the number of prefixes accepted by their BGP from any given
> peer?
>
> Sounds like a carefully crafted statement to enable them to point fingers
> elsewhere, not unexpected.
>
> - Michael.
>
> Quoting francisfides at mailup.net:
>
>> Looks like it was a software upgrade:
>> https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902
>>
>> Nothing in their media centre, just appears as a new box on their
>> outage response page: https://www.optus.com.au/notices/outage-response
>>
>> Cheers
>>
>> ----
>> Text:
>>
>> "We have been working to understand what caused the outage on
>> Wednesday, and we now know what the cause was and have taken steps
>> to ensure it will not happen again. We apologise sincerely for
>> letting our customers down and the inconvenience it caused.
>>
>> At around 4.05am Wednesday morning, the Optus network received
>> changes to routing information from an international peering network
>> following a software upgrade. These routing information changes
>> propagated through multiple layers in our network and exceeded
>> preset safety levels on key routers. This resulted in those routers
>> disconnecting from the Optus IP Core network to protect themselves.
>>
>> The restoration required a large-scale effort of the team and in
>> some cases required Optus to reconnect or reboot routers physically,
>> requiring the dispatch of people across a number of sites in
>> Australia. This is why restoration was progressive over the afternoon.
>>
>> Given the widespread impact of the outage, our investigations into
>> the issue took longer than we would have liked as we examined
>> several different paths to restoration. The restoration of the
>> network was at all times our priority and we subsequently
>> established the cause working together with our partners. We have
>> made changes to the network to address this issue so that it cannot
>> occur again.
>>
>> We are committed to learning from what has occurred and continuing
>> to work with our international vendors and partners to increase the
>> resilience of our network. We will also support and fully cooperate
>> with the reviews being undertaken by the Government and the Senate.
>>
>> We continue to invest heavily to improve the resiliency of our
>> network and services."
>>
>> --
>>
>> francisfides at mailup.net
>>
>> On Thu, Nov 9, 2023, at 07:15, DaZZa wrote:
>>> I have all three you're asking about.
>>>
>>> But I'm very small potatoes compared to most of the members of this
>>> list, and my required remote footprint is correspondingly small, so
>>> it's easy to maintain.
>>>
>>> D
>>>
>>> On Thu, 9 Nov 2023 at 06:18, Phillip Grasso
>>> <phillip.grasso at gmail.com> wrote:
>>>>>
>>>>> I mean come on, it's nearly 2024 and a [major] telco does not
>>>>> have remote console access?
>>>>
>>>>
>>>> If we send a poll out to this community, how many would be able to
>>>> genuinely honestly answer:
>>>>
>>>> Do you have a console or appropriate control plane access into all
>>>> your critical infrastructure?
>>>> Do you have independent out of band that does not share any
>>>> infrastructure with your current system(s) - with an exception for
>>>> physical location and power?
>>>> Do you have the ability to remote power control your devices?
>>>>
>>>> We know from the Facebook outage in 2021 that they probably didn't
>>>> have the above, so it's not entirely uncommon for folks to lack
>>>> *proper independent* console and remote access.
>>>>
>>>>
>>>> I empathize with the Optus team and their customers who have been
>>>> negatively impacted by this incident. I sincerely hope that some
>>>> positive outcomes can emerge from this situation, including:
>>>>
>>>> - Attention to critical infrastructure resilience
>>>> - BGP clue increases
>>>> - Incident management improves
>>>> (I'm sure there's more).
>>>>
>>>> Network is a black box to most people and I think a large chunk of
>>>> Australia now knows what it feels like to not have it.
>>>>
>>>>
>>>> On Wed, 8 Nov 2023 at 11:06, Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 8 Nov 2023 at 10:14, DaZZa <dazzagibbs at gmail.com> wrote:
>>>>>>
>>>>>> Yeah, I'd be willing to bet that it's a change which wasn't thoroughly
>>>>>> tested before being rolled out, and which had an inadequate backout
>>>>>> plan.
>>>>>
>>>>>
>>>>> Also, "Our on-site technician is actively prioritising
>>>>> establishing a console connection.".
>>>>>
>>>>> I mean come on, it's nearly 2024 and a [major] telco does not
>>>>> have remote console access? Whilst I'm
>>>>> looking forward to enthusiastically reading the PM, I'll have to
>>>>> book a physio appointment in advance due to
>>>>> neck strain from all the head shaking it'll likely induce.
>>>>>
>>>>> BB
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Interestingly, my Optus mobile actually had a valid connection for a
>>>>>> short time - wasn't able to actually DO anything, but was connected to
>>>>>> the Optus network - but it's now gone to "SOS" mode.
>>>>>>
>>>>>> D
>>>>>>
>>>>>> On Wed, 8 Nov 2023 at 10:01, John Edwards <jaedwards at gmail.com> wrote:
>>>>>> >
>>>>>> > The 4am Wednesday morning outage start looks suspiciously like
>>>>>> > a firmware upgrade window.
>>>>>> >
>>>>>> > I note that Optus devices where I am are showing "SoS" which
>>>>>> > indicates the tower is unable to reach the location register,
>>>>>> > which presumably is on a private network and indicative of a
>>>>>> > pretty major fault rather than just IP.
>>>>>> >
>>>>>> > John
>>>>>> >
>>>>>> >
>>>>>> > On Wed, 8 Nov 2023 at 09:10, DaZZa <dazzagibbs at gmail.com> wrote:
>>>>>> >>
>>>>>> >> The Optus hamster finally died of old age.
>>>>>> >>
>>>>>> >> I would suggest your SMS issues would be caused by whoever is issuing
>>>>>> >> the SMS using Optus - not so much by the Telstra end receiving it.
>>>>>> >>
>>>>>> >> Anecdotally, Optus enterprise/wholesale appears to be still functional
>>>>>> >> - at least my link appears to be working fine - and my BGP
>>>>>> >> advertisements are still being seen overseas - seems to be only NBN
>>>>>> >> and mobile based services which are busted
>>>>>> >>
>>>>>> >> D
>>>>>> >>
>>>>>> >> On Wed, 8 Nov 2023 at 09:27, <francisfides at mailup.net> wrote:
>>>>>> >> >
>>>>>> >> > Morning all,
>>>>>> >> > Hope the chaos isn't too hard on your work/family.
>>>>>> >> > I have had trouble with a couple of SMS verifications
>>>>>> >> > coming through to me, my Telstra number. Is this related?
>>>>>> >> >
>>>>>> >> > Any general banter around the downtime would be fine too -
>>>>>> >> > looks like it all began at 4.07am AEDT?
>>>>>> >> >
>>>>>> >> > Cheers
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> >
>>>>>> >> > francisfides at mailup.net
>>>>>> >> > _______________________________________________
>>>>>> >> > AusNOG mailing list
>>>>>> >> > AusNOG at lists.ausnog.net
>>>>>> >> > https://lists.ausnog.net/mailman/listinfo/ausnog
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> veg·e·tar·i·an:
>>>>>> >> Ancient tribal slang for the village idiot who can't hunt, fish or ride