[AusNOG] Optus downtime chat + affecting SMS verification to Telstra?
Ben Buxton
bb.ausnog at bb.cactii.net
Tue Nov 14 13:36:46 AEDT 2023
Blaming routing updates from peers is scapegoating; they are never the real
cause of an outage. Public BGP is the wild west and you're always getting
broken information - it's your responsibility to filter those updates and
(unless it's a zero-day poison packet bug) you only have yourself to blame
if you fall over from them.
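To make "filter those updates" concrete, here is a rough sketch of an inbound sanity check. It is illustrative only: the function name, the short bogon list and the /24 cutoff are my own assumptions, not anything Optus or any vendor is known to run, and a real import policy would be much longer (IRR/RPKI checks, AS-path filters and so on).

```python
import ipaddress

# Hypothetical policy values for illustration; real filters are far longer.
BOGONS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8",
)]
MAX_V4_PREFIX_LEN = 24  # assumed cutoff for IPv4 routes learned from peers


def accept_route(prefix: str) -> bool:
    """Return True if an announced IPv4 prefix passes the basic checks."""
    try:
        # strict=True rejects prefixes with host bits set, e.g. 8.8.8.1/24
        net = ipaddress.ip_network(prefix, strict=True)
    except ValueError:
        return False  # malformed announcement
    if net.version != 4:
        return False  # this sketch only handles IPv4
    if net.prefixlen == 0 or net.prefixlen > MAX_V4_PREFIX_LEN:
        return False  # no default route, no overly specific routes
    # Reject anything that falls inside a bogon range
    return not any(net.subnet_of(b) for b in BOGONS)


if __name__ == "__main__":
    for p in ("8.8.8.0/24", "10.20.0.0/16", "8.8.8.0/25", "0.0.0.0/0"):
        print(p, "accept" if accept_route(p) else "reject")
```

The point is simply that every received update is judged against local policy before it is allowed anywhere near the routing table.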
If I were an optus business customer, reading that outage page would just
make me even more determined to move elsewhere.
They vaguely categorised the "what" of the outage into a big bucket
(software upgrade related), but gave absolutely no useful information and
did not explain the "why", which is what would regain my confidence.
Why did this upgrade trigger an outage?
- Was there a behaviour/feature change they neglected to take into
account?
- Did the upgrade require a config change that broke?
- Were they neglectful in following config best practices? (filtering,
prefix limits, restarts, etc?)
- Did the new software have an unidentified bug?
- Why did testing not catch this problem (they do test changes...right?)
- How did progressive rollout still lead to this impact? (they do
progressive rollouts over N days/weeks...right?)
Why did mitigation take so long?
- What detection/telemetry measures led them to realise the scope of the
outage? (news reports don't count)
- Were they dependent on the downed network for oncall paging & comms?
- Why did their rollback plan fail? (they had a rollback plan...right?)
- Why was remote console/power access not working? (they have
both...right?)
- Were they dependent on the downed network for said access?
- Were their playbooks/credential access dependent on the downed network?
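The "dependent on the downed network" questions above can be checked mechanically if you keep even a crude dependency inventory. A minimal sketch follows; the inventory contents and all names in it are made up for illustration and say nothing about how Optus is actually built.

```python
def depends_on(deps, start, target):
    """Return True if `start` transitively depends on `target`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue  # already explored; also guards against cycles
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False


# Hypothetical inventory: each tool maps to the things it needs to work.
DEPS = {
    "console_server": ["oob_4g_modem"],
    "oob_4g_modem": ["own_mobile_network"],
    "own_mobile_network": ["ip_core"],
    "oncall_paging": ["sms_gateway"],
    "sms_gateway": ["own_mobile_network"],
    "playbook_wiki": ["corp_vpn"],
    "corp_vpn": ["third_party_isp"],
}

if __name__ == "__main__":
    for tool in ("console_server", "oncall_paging", "playbook_wiki"):
        status = "DEPENDS on" if depends_on(DEPS, tool, "ip_core") else "independent of"
        print(f"{tool}: {status} ip_core")
```

In this made-up inventory the console server and paging both quietly ride on the core they are meant to rescue, which is exactly the trap the questions are probing for.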
"We have made changes to the network to address this issue so that it
cannot occur again." ... this smells like "whoops forgot to set max-prefix
(with restart!)".
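For anyone unfamiliar with what max-prefix with restart buys you, here is a toy model of the behaviour I mean. It is a sketch under my own assumptions: the class, field names and timer values are hypothetical, and real implementations differ between vendors. The idea is that only the offending session is torn down when the limit is exceeded, and it is retried automatically after a timer rather than staying down until someone intervenes by hand.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PeerSession:
    max_prefixes: int          # limit on prefixes accepted from this peer
    restart_after_s: float     # hypothetical hold-down before retrying
    prefixes: Set[str] = field(default_factory=set)
    down_until: Optional[float] = None

    def receive(self, prefix: str, now: float) -> bool:
        """Process one announcement; return True if it was accepted."""
        if self.down_until is not None:
            if now < self.down_until:
                return False       # session is still held down
            self.down_until = None  # restart timer expired, session is back
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop everything learned from this peer and
            # hold only this session down until the restart timer expires.
            self.prefixes.clear()
            self.down_until = now + self.restart_after_s
            return False
        return True


if __name__ == "__main__":
    s = PeerSession(max_prefixes=2, restart_after_s=60)
    for t, p in enumerate(["a/24", "b/24", "c/24", "d/24"]):
        print(t, p, s.receive(p, now=float(t)))
    print(62, "d/24", s.receive("d/24", now=62.0))
```

The rest of the router keeps forwarding on its other sessions throughout, which is the difference between a blip on one peer and a box that takes itself off the network.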
Bugs, config stuff-ups, etc happen, and they will continue to happen - it
is a lie to state that outages will never happen again. This is the
culmination of monumental failures in the trigger, prevention and
mitigation measures, which cannot be fixed in a couple of days. It sounds
like much deeper architectural and organisational issues need addressing.
Many of the above failures are things that a young network will experience
and learn from, but for Optus these should all be well planned for already.
I suspect any government investigation will simply add more bureaucracy and
boxes to tick rather than effect meaningful change, but one can always be
hopeful...
BB
On Tue, 14 Nov 2023 at 13:02, Michael Bethune <mike at ozonline.com.au> wrote:
> "Optus network received changes to routing information from an
> international peering network following a software upgrade"
>
> I note they are very careful to avoid nominating whose software upgrade.
>
> I also note that when they say they received routing updates,
> don't they limit the number of prefixes accepted by their BGP from
> any given peer?
>
> Sounds like a carefully crafted statement to enable them to point fingers
> elsewhere, not unexpected.
>
> - Michael.
>
> Quoting francisfides at mailup.net:
>
> > Looks like it was a software upgrade:
> >
> > https://www.abc.net.au/news/2023-11-13/optus-identifies-cause-of-nationwide-outage-software-upgrade/103099902
> >
> > Nothing in their media centre, just appears as a new box on their
> > outage response page: https://www.optus.com.au/notices/outage-response
> >
> > Cheers
> >
> > ----
> > Text:
> >
> > "We have been working to understand what caused the outage on
> > Wednesday, and we now know what the cause was and have taken steps
> > to ensure it will not happen again. We apologise sincerely for
> > letting our customers down and the inconvenience it caused.
> >
> > At around 4.05am Wednesday morning, the Optus network received
> > changes to routing information from an international peering network
> > following a software upgrade. These routing information changes
> > propagated through multiple layers in our network and exceeded
> > preset safety levels on key routers. This resulted in those routers
> > disconnecting from the Optus IP Core network to protect themselves.
> >
> > The restoration required a large-scale effort of the team and in
> > some cases required Optus to reconnect or reboot routers physically,
> > requiring the dispatch of people across a number of sites in
> > Australia. This is why restoration was progressive over the afternoon.
> >
> > Given the widespread impact of the outage, our investigations into
> > the issue took longer than we would have liked as we examined
> > several different paths to restoration. The restoration of the
> > network was at all times our priority and we subsequently
> > established the cause working together with our partners. We have
> > made changes to the network to address this issue so that it cannot
> > occur again.
> >
> > We are committed to learning from what has occurred and continuing
> > to work with our international vendors and partners to increase the
> > resilience of our network. We will also support and fully cooperate
> > with the reviews being undertaken by the Government and the Senate.
> >
> > We continue to invest heavily to improve the resiliency of our
> > network and services."
> >
> > --
> >
> > francisfides at mailup.net
> >
> > On Thu, Nov 9, 2023, at 07:15, DaZZa wrote:
> >> I have all three you're asking about.
> >>
> >> But I'm very small potatoes compared to most of the members of this
> >> list, and my required remote footprint is correspondingly small, so
> >> it's easy to maintain.
> >>
> >> D
> >>
> >> On Thu, 9 Nov 2023 at 06:18, Phillip Grasso
> >> <phillip.grasso at gmail.com> wrote:
> >>>>
> >>>> I mean come on, it's nearly 2024 and a [major] telco does not
> >>>> have remote console access?
> >>>
> >>>
> >>> If we send a poll out to this community, how many would be able to
> >>> genuinely honestly answer:
> >>>
> >>> Do you have a console or appropriate control plane access into all
> >>> your critical infrastructure?
> >>> Do you have independent out of band that does not share any
> >>> infrastructure with your current system(s) - with an exception for
> >>> physical location and power?
> >>> Do you have the ability to remote power control your devices?
> >>>
> >>> We know from the Facebook outage in 2021 that they probably didn't
> >>> have the above, so it's not entirely uncommon for folks to lack
> >>> *proper independent* console and remote access.
> >>>
> >>>
> >>> I empathize with the Optus team and their customers who have been
> >>> negatively impacted by this incident. I sincerely hope that some
> >>> positive outcomes can emerge from this situation, including:
> >>>
> >>> - Attention to critical infrastructure resilience
> >>> - BGP clue increases
> >>> - Incident management improves
> >>> (I'm sure there's more).
> >>>
> >>> Network is a black box to most people and I think a large chunk of
> >>> Australia now knows what it feels like to not have it.
> >>>
> >>>
> >>> On Wed, 8 Nov 2023 at 11:06, Ben Buxton <bb.ausnog at bb.cactii.net> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Wed, 8 Nov 2023 at 10:14, DaZZa <dazzagibbs at gmail.com> wrote:
> >>>>>
> >>>>> Yeah, I'd be willing to bet that it's a change which wasn't thoroughly
> >>>>> tested before being rolled out, and which had an inadequate backout
> >>>>> plan.
> >>>>
> >>>>
> >>>> Also, "Our on-site technician is actively prioritising
> >>>> establishing a console connection.".
> >>>>
> >>>> I mean come on, it's nearly 2024 and a [major] telco does not
> >>>> have remote console access? Whilst I'm
> >>>> looking forward to enthusiastically reading the PM, I'll have to
> >>>> book a physio appointment in advance due to
> >>>> neck strain from all the head shaking it'll likely induce.
> >>>>
> >>>> BB
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>> Interestingly, my Optus mobile actually had a valid connection for a
> >>>>> short time - wasn't able to actually DO anything, but was connected to
> >>>>> the Optus network - but it's now gone to "SOS" mode.
> >>>>>
> >>>>> D
> >>>>>
> >>>>> On Wed, 8 Nov 2023 at 10:01, John Edwards <jaedwards at gmail.com> wrote:
> >>>>> >
> >>>>> > The 4am Wednesday morning outage start looks suspiciously like
> >>>>> > a firmware upgrade window.
> >>>>> >
> >>>>> > I note that Optus devices where I am are showing "SoS" which
> >>>>> > indicates the tower is unable to reach the location register,
> >>>>> > which presumably is on a private network and indicative of a
> >>>>> > pretty major fault rather than just IP.
> >>>>> >
> >>>>> > John
> >>>>> >
> >>>>> >
> >>>>> > On Wed, 8 Nov 2023 at 09:10, DaZZa <dazzagibbs at gmail.com> wrote:
> >>>>> >>
> >>>>> >> The Optus hamster finally died of old age.
> >>>>> >>
> >>>>> >> I would suggest your SMS issues would be caused by whoever is issuing
> >>>>> >> the SMS using Optus - not so much by the Telstra end receiving it.
> >>>>> >>
> >>>>> >> Anecdotally, Optus enterprise/wholesale appears to be still functional
> >>>>> >> - at least my link appears to be working fine - and my BGP
> >>>>> >> advertisements are still being seen overseas - seems to be only NBN
> >>>>> >> and mobile based services which are busted
> >>>>> >>
> >>>>> >> D
> >>>>> >>
> >>>>> >> On Wed, 8 Nov 2023 at 09:27, <francisfides at mailup.net> wrote:
> >>>>> >> >
> >>>>> >> > Morning all,
> >>>>> >> > Hope the chaos isn't too hard on your work/family.
> >>>>> >> > I have had trouble with a couple of SMS verifications
> >>>>> >> > coming through to me, my Telstra number. Is this related?
> >>>>> >> >
> >>>>> >> > Any general banter around the downtime would be fine too -
> >>>>> >> > looks like it all began at 4.07am AEDT?
> >>>>> >> >
> >>>>> >> > Cheers
> >>>>> >> >
> >>>>> >> > --
> >>>>> >> >
> >>>>> >> > francisfides at mailup.net
> >>>>> >> > _______________________________________________
> >>>>> >> > AusNOG mailing list
> >>>>> >> > AusNOG at lists.ausnog.net
> >>>>> >> > https://lists.ausnog.net/mailman/listinfo/ausnog
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >> --
> >>>>> >> veg·e·tar·i·an:
> >>>>> >> Ancient tribal slang for the village idiot who can't hunt, fish or ride