[AusNOG] Optus downtime chat + affecting SMS verification to Telstra?
DaZZa
dazzagibbs at gmail.com
Fri Nov 17 11:14:38 AEDT 2023
And now Singtel have returned serve and are denying it was them.
https://www.zdnet.com/article/singtel-refutes-reports-that-its-system-upgrade-caused-optus-outage/
It's like watching kids trying to blame each other for who broke the
window with the cricket ball.
D
On Wed, 15 Nov 2023 at 11:01, Luke Thompson <luke.t at tncrew.com.au> wrote:
>
> They've blamed Singtel Internet Exchange (STiX) for the international peering route updates, at least going by anonymous sources cited by SMH.
>
> https://www.smh.com.au/technology/identity-of-third-party-who-brought-down-optus-network-revealed-20231114-p5ejy1.html
>
> Luke
>
> On 14 November 2023 12:37:30 pm Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>>
>>
>> Blaming routing updates from peers is scapegoating; they are never the real cause of an outage. Public BGP is the wild west and you're always getting broken information. It's your responsibility to filter those updates, and (unless it's a zero-day poison-packet bug) you only have yourself to blame if you fall over from them.
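As a rough illustration of the inbound filtering described above, here is a minimal Python sketch. The function name `accept_prefix`, the short bogon list and the /24 length cap are assumptions made for the example, not anyone's actual policy; in practice this logic would live in router policy rather than a script.

```python
import ipaddress

# Hypothetical, deliberately incomplete bogon list for illustration only.
BOGONS = [ipaddress.ip_network(n) for n in (
    "0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8",
    "172.16.0.0/12", "192.168.0.0/16",
)]

MAX_LEN_V4 = 24  # assumed policy: drop IPv4 announcements longer than /24


def accept_prefix(prefix: str) -> bool:
    """Return True if an announced IPv4 prefix passes the basic filters."""
    try:
        net = ipaddress.ip_network(prefix, strict=True)
    except ValueError:
        return False  # malformed prefix, or host bits set
    if net.version != 4:
        return False  # this sketch only covers IPv4
    if net.prefixlen > MAX_LEN_V4 or net.prefixlen == 0:
        return False  # too specific, or a default route from a peer
    # Reject anything that falls inside a bogon range.
    return not any(net.subnet_of(b) for b in BOGONS)
```

Calling `accept_prefix` on each announcement before installing it is the whole idea: anything malformed, too specific, or inside a bogon range is dropped at the edge.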
>>
>> If I were an optus business customer, reading that outage page would just make me even more determined to move elsewhere.
>>
>> They vaguely categorised the "what" of the outage into a big bucket (software upgrade related), but gave absolutely no useful information and did not explain the "why", which is what would regain my confidence.
>>
>> Why did this upgrade trigger an outage?
>> - Was there a behaviour/feature change they neglected to take into account?
>> - Did the upgrade require a config change that broke?
>> - Were they neglectful in following config best practices? (filtering, prefix limits, restarts, etc?)
>> - Did the new software have an unidentified bug?
>> - Why did testing not catch this problem? (they do test changes...right?)
>> - How did progressive rollout still lead to this impact? (they do progressive rollouts over N days/weeks...right?)
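For readers unfamiliar with the progressive rollout mentioned in the last question, a minimal sketch follows. The function `progressive_rollout` and its `upgrade` and `healthy` callbacks are hypothetical names for this example; it only shows the general pattern of halting at the first unhealthy stage and says nothing about how Optus actually deploys changes.

```python
from typing import Callable, Iterable, List


def progressive_rollout(
    stages: Iterable[List[str]],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade devices stage by stage; stop at the first unhealthy stage.

    Returns the devices upgraded so far, so a caller knows what to roll back.
    """
    done: List[str] = []
    for stage in stages:
        for device in stage:
            upgrade(device)
            done.append(device)
        # Check the whole stage before moving on to a wider blast radius.
        if not all(healthy(d) for d in stage):
            break  # halt: later stages are never touched
    return done
```

The point of the pattern is that a bad change is caught while it affects one small stage, and the returned list tells you exactly which devices need rolling back.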
>>
>> Why did mitigation take so long?
>> - What detection/telemetry measures led them to realise the scope of the outage? (news reports don't count)
>> - Were they dependent on the downed network for oncall paging & comms?
>> - Why did their rollback plan fail? (they had a rollback plan...right?)
>> - Why was remote console/power access not working? (they have both...right?)
>> - Were they dependent on the downed network for said access?
>> - Were their playbooks/credential access dependent on the downed network?
>>
>> "We have made changes to the network to address this issue so that it cannot occur again." ... this smells like "whoops forgot to set max-prefix (with restart!)".
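To make the "max-prefix (with restart!)" remark concrete, here is a toy Python model of a session that is torn down when a prefix limit is exceeded and optionally comes back after a timer. The class `PeerSession` and its fields are invented for this sketch; it is not any vendor's implementation and not a claim about what Optus had configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerSession:
    """Toy model of one BGP session with a max-prefix limit (hypothetical)."""
    max_prefixes: int
    # Seconds before an automatic restart; None means the session stays
    # down until an operator intervenes.
    restart_after: Optional[float] = None
    prefixes: int = 0
    up: bool = True
    down_since: Optional[float] = None

    def receive(self, count: int, now: float) -> None:
        """Account for `count` newly announced prefixes at time `now`."""
        if not self.up:
            return
        self.prefixes += count
        if self.prefixes > self.max_prefixes:
            # Limit exceeded: tear the session down and discard its routes.
            self.up = False
            self.prefixes = 0
            self.down_since = now

    def tick(self, now: float) -> None:
        """Bring the session back up once the restart timer has expired."""
        if self.up or self.restart_after is None or self.down_since is None:
            return
        if now - self.down_since >= self.restart_after:
            self.up = True
            self.down_since = None
```

With `restart_after=None` the session never recovers on its own, which is the situation where someone ends up driving to site; with a restart timer the same event is a temporary loss of one peer.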
>>
>> Bugs, config stuff-ups, etc happen, and they will continue to happen - it is a lie to state that outages will never happen again. This is the culmination of monumental failures in the trigger, prevention and mitigation measures, which cannot be fixed in a couple of days; it sounds like much deeper architectural and organisational issues need addressing.
>>
>> Many of the above failures are things that a young network will experience and learn from, but for Optus these should all be well planned for already.
>>
>> I suspect any government investigation will simply add more bureaucracy and boxes to tick rather than effect meaningful change, but one can always be hopeful...
>>
>> BB
>>
>> On Tue, 14 Nov 2023 at 13:02, Michael Bethune <mike at ozonline.com.au> wrote:
>>>
>>> "Optus network received changes to routing information from an
>>> international peering network following a software upgrade"
>>>
>>> I note they are very careful to avoid nominating whose software upgrade.
>>>
>>> I also note that they say they received routing updates. Don't they
>>> limit the number of prefixes accepted by their BGP from any given
>>> peer?
>>>
>>> Sounds like a carefully crafted statement to enable them to point fingers
>>> elsewhere, not unexpected.
>>>
>>> - Michael.
>>>
>>> Quoting francisfides at mailup.net:
>>>
>>> > Looks like it was a software upgrade:
>>> > https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902
>>> >
>>> > Nothing in their media centre, just appears as a new box on their
>>> > outage response page: https://www.optus.com.au/notices/outage-response
>>> >
>>> > Cheers
>>> >
>>> > ----
>>> > Text:
>>> >
>>> > "We have been working to understand what caused the outage on
>>> > Wednesday, and we now know what the cause was and have taken steps
>>> > to ensure it will not happen again. We apologise sincerely for
>>> > letting our customers down and the inconvenience it caused.
>>> >
>>> > At around 4.05am Wednesday morning, the Optus network received
>>> > changes to routing information from an international peering network
>>> > following a software upgrade. These routing information changes
>>> > propagated through multiple layers in our network and exceeded
>>> > preset safety levels on key routers. This resulted in those routers
>>> > disconnecting from the Optus IP Core network to protect themselves.
>>> >
>>> > The restoration required a large-scale effort of the team and in
>>> > some cases required Optus to reconnect or reboot routers physically,
>>> > requiring the dispatch of people across a number of sites in
>>> > Australia. This is why restoration was progressive over the afternoon.
>>> >
>>> > Given the widespread impact of the outage, our investigations into
>>> > the issue took longer than we would have liked as we examined
>>> > several different paths to restoration. The restoration of the
>>> > network was at all times our priority and we subsequently
>>> > established the cause working together with our partners. We have
>>> > made changes to the network to address this issue so that it cannot
>>> > occur again.
>>> >
>>> > We are committed to learning from what has occurred and continuing
>>> > to work with our international vendors and partners to increase the
>>> > resilience of our network. We will also support and fully cooperate
>>> > with the reviews being undertaken by the Government and the Senate.
>>> >
>>> > We continue to invest heavily to improve the resiliency of our
>>> > network and services."
>>> >
>>> > --
>>> >
>>> > francisfides at mailup.net
>>> >
>>> > On Thu, Nov 9, 2023, at 07:15, DaZZa wrote:
>>> >> I have all three you're asking about.
>>> >>
>>> >> But I'm very small potatoes compared to most of the members of this
>>> >> list, and my required remote footprint is correspondingly small, so
>>> >> it's easy to maintain.
>>> >>
>>> >> D
>>> >>
>>> >> On Thu, 9 Nov 2023 at 06:18, Phillip Grasso
>>> >> <phillip.grasso at gmail.com> wrote:
>>> >>>>
>>> >>>> I mean come on, it's nearly 2024 and a [major] telco does not
>>> >>>> have remote console access?
>>> >>>
>>> >>>
>>> >>> If we send a poll out to this community, how many would be able to
>>> >>> genuinely honestly answer:
>>> >>>
>>> >>> - Do you have a console or appropriate control plane access into all
>>> >>>   your critical infrastructure?
>>> >>> - Do you have independent out of band that does not share any
>>> >>>   infrastructure with your current system(s), with an exception for
>>> >>>   physical location and power?
>>> >>> - Do you have the ability to remote power control your devices?
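The independence question in the poll above can be checked mechanically if you list what each path depends on. A minimal sketch, assuming hypothetical component names and the helper names `shared_dependencies` and `is_independent` (neither comes from the thread):

```python
from typing import Set

# Components the poll explicitly allows both paths to share.
ALLOWED_SHARED = {"site", "power"}


def shared_dependencies(production: Set[str], out_of_band: Set[str]) -> Set[str]:
    """Return components both paths rely on, ignoring the allowed exemptions."""
    return (production & out_of_band) - ALLOWED_SHARED


def is_independent(production: Set[str], out_of_band: Set[str]) -> bool:
    """True when the out-of-band path shares nothing beyond the exemptions."""
    return not shared_dependencies(production, out_of_band)
```

For example, if both the production path and the out-of-band path list a shared DNS service or the same core router, `shared_dependencies` returns it, and that component is a candidate for taking down both paths at once.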
>>> >>>
>>> >>> We know from the Facebook outage in 2021 that they probably didn't
>>> >>> have the above, so it's not entirely uncommon for folks to lack
>>> >>> *proper independent* console and remote access.
>>> >>>
>>> >>>
>>> >>> I empathize with the Optus team and their customers who have been
>>> >>> negatively impacted by this incident. I sincerely hope that some
>>> >>> positive outcomes can emerge from this situation, including:
>>> >>>
>>> >>> - Attention to critical infrastructure resilience
>>> >>> - BGP clue increases
>>> >>> - Incident management improves
>>> >>> (I'm sure there's more).
>>> >>>
>>> >>> Network is a black box to most people and I think a large chunk of
>>> >>> Australia now knows what it feels like to not have it.
>>> >>>
>>> >>>
>>> >>> On Wed, 8 Nov 2023 at 11:06, Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, 8 Nov 2023 at 10:14, DaZZa <dazzagibbs at gmail.com> wrote:
>>> >>>>>
>>> >>>>> Yeah, I'd be willing to bet that it's a change which wasn't thoroughly
>>> >>>>> tested before being rolled out, and which had an inadequate backout
>>> >>>>> plan.
>>> >>>>
>>> >>>>
>>> >>>> Also, "Our on-site technician is actively prioritising
>>> >>>> establishing a console connection.".
>>> >>>>
>>> >>>> I mean come on, it's nearly 2024 and a [major] telco does not
>>> >>>> have remote console access? Whilst I'm
>>> >>>> looking forward to enthusiastically reading the PM, I'll have to
>>> >>>> book a physio appointment in advance due to
>>> >>>> neck strain from all the head shaking it'll likely induce.
>>> >>>>
>>> >>>> BB
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> Interestingly, my Optus mobile actually had a valid connection for a
>>> >>>>> short time - wasn't able to actually DO anything, but was connected to
>>> >>>>> the Optus network - but it's now gone to "SOS" mode.
>>> >>>>>
>>> >>>>> D
>>> >>>>>
>>> >>>>> On Wed, 8 Nov 2023 at 10:01, John Edwards <jaedwards at gmail.com> wrote:
>>> >>>>> >
>>> >>>>> > The 4am Wednesday morning outage start looks suspiciously like
>>> >>>>> > a firmware upgrade window.
>>> >>>>> >
>>> >>>>> > I note that Optus devices where I am are showing "SoS" which
>>> >>>>> > indicates the tower is unable to reach the location register,
>>> >>>>> > which presumably is on a private network and indicative of a
>>> >>>>> > pretty major fault rather than just IP.
>>> >>>>> >
>>> >>>>> > John
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > On Wed, 8 Nov 2023 at 09:10, DaZZa <dazzagibbs at gmail.com> wrote:
>>> >>>>> >>
>>> >>>>> >> The Optus hamster finally died of old age.
>>> >>>>> >>
>>> >>>>> >> I would suggest your SMS issues would be caused by whoever is issuing
>>> >>>>> >> the SMS using Optus - not so much by the Telstra end receiving it.
>>> >>>>> >>
>>> >>>>> >> Anecdotally, Optus enterprise/wholesale appears to be still functional
>>> >>>>> >> - at least my link appears to be working fine - and my BGP
>>> >>>>> >> advertisements are still being seen overseas - seems to be only NBN
>>> >>>>> >> and mobile-based services which are busted.
>>> >>>>> >>
>>> >>>>> >> D
>>> >>>>> >>
>>> >>>>> >> On Wed, 8 Nov 2023 at 09:27, <francisfides at mailup.net> wrote:
>>> >>>>> >> >
>>> >>>>> >> > Morning all,
>>> >>>>> >> > Hope the chaos isn't too hard on your work/family.
>>> >>>>> >> > I have had trouble with a couple of SMS verifications
>>> >>>>> >> > coming through to me, my Telstra number. Is this related?
>>> >>>>> >> >
>>> >>>>> >> > Any general banter around the downtime would be fine too -
>>> >>>>> >> > looks like it all began at 4.07am AEDT?
>>> >>>>> >> >
>>> >>>>> >> > Cheers
>>> >>>>> >> >
>>> >>>>> >> > --
>>> >>>>> >> >
>>> >>>>> >> > francisfides at mailup.net
>>> >>>>> >> > _______________________________________________
>>> >>>>> >> > AusNOG mailing list
>>> >>>>> >> > AusNOG at lists.ausnog.net
>>> >>>>> >> > https://lists.ausnog.net/mailman/listinfo/ausnog
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> --
>>> >>>>> >> veg·e·tar·i·an:
>>> >>>>> >> Ancient tribal slang for the village idiot who can't hunt, fish or ride
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
--
veg·e·tar·i·an:
Ancient tribal slang for the village idiot who can't hunt, fish or ride