[AusNOG] Thought experiment: How would you manage 11 (or 100) million devices?

Wed Oct 6 11:57:26 EST 2010

On 05/10/2010, at 2:20 PM, Andrew Fort wrote:

> How would you go about managing (lets assume via IPvSomething, SNMP,
> TL1, CLIs, et al) 11 million devices? *   Assume that you need to
> gather performance stats, alerts, and configure the devices to do
> stuff.

For starters, I'd want to optimise for aggregation benefits. Many SNMP tools request a single OID per query, even though SNMP is quite happy to carry multiple OIDs in one request. I once wrote a tool to poll all of the counters for a particular VC on a 288-port DSLAM simultaneously every 5 minutes, something that would have created significant noise on the management interface if polled on a per-port basis.
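To make the bundling idea concrete, here is a minimal sketch of grouping OIDs into batches so that each batch can go out as one request. It only builds the batches and does not talk SNMP. The function name `batch_oids`, the limit of 40 OIDs per request, and the use of the standard IF-MIB octet counters as stand-ins for the per-VC counters are all assumptions for illustration; the real limit depends on the agent and the maximum message size.

```python
from typing import Iterator, List, Sequence


def batch_oids(oids: Sequence[str], max_per_request: int = 40) -> Iterator[List[str]]:
    """Yield lists of OIDs, each intended to be sent as a single SNMP request.

    max_per_request is an assumed limit; tune it to what the agent accepts.
    """
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(oids), max_per_request):
        yield list(oids[start:start + max_per_request])


# Stand-in counters (IF-MIB ifInOctets / ifOutOctets) for each of 288 ports.
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"
IF_OUT_OCTETS = "1.3.6.1.2.1.2.2.1.16"

oids = [
    f"{base}.{port}"
    for port in range(1, 289)
    for base in (IF_IN_OCTETS, IF_OUT_OCTETS)
]
batches = list(batch_oids(oids))

# Compare the number of requests with one-OID-per-query polling.
print(len(oids), "OIDs in", len(batches), "requests")
```

Each batch would then be handed to whatever SNMP library is in use as the variable list of one request, rather than issuing one query per OID.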

In a network of this scale where something is always changing, the individual change itself becomes uninteresting except as a historical event for troubleshooting. The tools and alerts need to be focussed on rates of change rather than change itself.

If each change increments counters for the attributes related to it (i.e. type of change, POI location, RSP, OLT attached, ONT hardware version), then you have a system that scales well as an aggregate AND provides insight into events.
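As a rough sketch of the per-attribute counters, assuming each change arrives as a simple dictionary: the field names (`change_type`, `poi`, `rsp`, `olt`, `ont_hw_version`) and the sample values are hypothetical, chosen only to mirror the attributes listed above.

```python
from collections import Counter
from typing import Dict, Mapping

# Attribute names are illustrative, mirroring the list in the text.
ATTRIBUTES = ("change_type", "poi", "rsp", "olt", "ont_hw_version")


def new_counters() -> Dict[str, Counter]:
    """Create one Counter per attribute."""
    return {attr: Counter() for attr in ATTRIBUTES}


def record_change(counters: Dict[str, Counter], event: Mapping[str, str]) -> None:
    """Increment the counter for every attribute value present on the event."""
    for attr in ATTRIBUTES:
        value = event.get(attr)
        if value is not None:
            counters[attr][value] += 1


counters = new_counters()
record_change(counters, {"change_type": "ont_down", "poi": "POI-3", "olt": "OLT-17",
                         "rsp": "RSP-A", "ont_hw_version": "2.1"})
record_change(counters, {"change_type": "ont_down", "poi": "POI-3", "olt": "OLT-18",
                         "rsp": "RSP-B", "ont_hw_version": "2.1"})
record_change(counters, {"change_type": "ont_up", "poi": "POI-9", "olt": "OLT-40",
                         "rsp": "RSP-A", "ont_hw_version": "1.4"})

# The individual events are gone, but the aggregate still shows where they cluster.
print(counters["poi"].most_common(1))
print(counters["ont_hw_version"].most_common(1))
```

The storage cost grows with the number of distinct attribute values rather than with the number of events, which is what lets this scale as an aggregate.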

Humans spot rates of change fairly easily with graphs ("Whoa - big spike, better check it out"), but in a system with millions of events, large rates of change could also be used to select which graphs are presented to operators for attention.
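One way to pick graphs automatically, sketched under simple assumptions: each graph is a list of per-interval event counts, and a graph is "interesting" when its latest interval is several times the mean of the earlier ones. The scoring rule, the threshold of 3.0, and the series names are assumptions for illustration; a real system would likely want something more robust to seasonality.

```python
from typing import Dict, List, Sequence, Tuple


def spike_score(history: Sequence[float], current: float) -> float:
    """Ratio of the current interval's count to the mean of the earlier intervals."""
    baseline = sum(history) / len(history) if history else 0.0
    # A floor of 1.0 avoids division by zero and damps very quiet series.
    return current / max(baseline, 1.0)


def graphs_needing_attention(
    series: Dict[str, Sequence[float]],
    threshold: float = 3.0,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (name, score) pairs at or above threshold, biggest first."""
    scored = []
    for name, counts in series.items():
        if len(counts) < 2:
            continue  # not enough history to compare against
        score = spike_score(counts[:-1], counts[-1])
        if score >= threshold:
            scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


series = {
    "olt-17 link flaps": [4, 5, 6, 5, 60],
    "poi-3 auth failures": [100, 110, 90, 100, 105],
    "ont-hw-2.1 reboots": [0, 0, 1, 0, 9],
}
for name, score in graphs_needing_attention(series):
    print(f"{name}: {score:.1f}x baseline")
```

Note that the busy but steady series is not selected, while the quiet series with a sudden jump is.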

Once you're treating events like flows of information, you can then manage them like flows as well. Rate limit what goes to the central management systems, and make hitting the limit an event in itself. If the rate limit is ever hit, you probably already know what's going on and don't need to be concerned with logging every individual event!
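A minimal sketch of that idea, using a token bucket as the rate limiter (the choice of token bucket is mine, not something specified above). Events within the limit are forwarded; when the limit is first hit, a single marker event is forwarded instead and later drops are only counted. The class name `EventRateLimiter`, the marker string `RATE_LIMIT_HIT`, and the rate and burst values are all hypothetical.

```python
import time
from typing import Callable, List


class EventRateLimiter:
    """Token bucket that forwards events and reports once when the limit is hit."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()
        self.limiting = False  # True while we are inside a rate-limited episode
        self.dropped = 0

    def offer(self, event: str) -> List[str]:
        """Return the events to forward upstream (possibly none)."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= self.capacity:
            self.limiting = False  # bucket fully refilled: the episode is over
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return [event]
        self.dropped += 1
        if not self.limiting:
            self.limiting = True
            return ["RATE_LIMIT_HIT"]  # one marker per episode, not one per drop
        return []


# Demo with a fake clock so the behaviour is repeatable.
fake_now = [0.0]
limiter = EventRateLimiter(rate_per_sec=1.0, burst=2, clock=lambda: fake_now[0])
for name in ("a", "b", "c", "d"):
    print(name, "->", limiter.offer(name))
print("dropped:", limiter.dropped)
```

The marker is sent in addition to the rate-limited flow, so the upstream system sees slightly more than the configured rate, but only once per episode.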

Another optimisation is to figure out ways to prevent flip-flopping devices from outshining other network faults. The rate-of-change analysis mostly filters this out, but at a lower level it still gets in the way of debugging faults. In an NBN-style network with controlled hardware this might be less of an issue, as they can specify devices that retry with a decay algorithm. In a network with uncontrolled CPE some horrors can emerge. One particular DSL modem that never made it onto the Telstra approved list appeared to be missing a timer routine - which resulted in any failed authentication attempt being re-issued immediately, as fast as the little CPU could transmit it.
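For the well-behaved case, a retry timer with decay might look something like the sketch below. I'm reading "decay algorithm" as capped exponential backoff, which is an assumption; the function name `retry_delay` and the base, factor and cap values are hypothetical too.

```python
def retry_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0 = first retry).

    The delay grows by `factor` each attempt and never exceeds `cap`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    delay = base
    for _ in range(attempt):
        delay *= factor
        if delay >= cap:
            return cap  # stop early so huge attempt counts cannot overflow
    return min(delay, cap)


# Delays for the first ten consecutive authentication failures.
print([retry_delay(n) for n in range(10)])
```

A production version would probably add some random jitter so that a large population of devices does not retry in lockstep after a shared outage.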

Above all, expect the unexpected - with 11 million nearly identical CPE possibly running a cut-down Linux kernel, a compromise in the platform could be disastrous. An ideal NMS could play a role in detecting and mitigating a doomsday scenario.

John