[AusNOG] Network Management and Tools

Ben Cornish benc at brennanit.com.au
Mon Jul 5 14:26:41 EST 2010


We do something similar to you, Tom.
We also had a scaling issue with client end-node monitoring.
There is something to be said for pulling all this information together in one place, and for how you can use and benefit from it.

Because we had a home-grown Customer Management System (CMS) that stores the technical source of truth for everything, we chose to tack both solutions onto it and try to drive everything out of one hierarchical tool.
We also had a liking for Nagios/Cacti/Rancid/Ipplan and wanted central configuration of them.

We store all our core devices in the Customer Management System.
We tie back Rancid/Nagios/Cacti/Ipplan/Whois/Radiator/Netflow/LDAP/DNS/Reverse DNS and some other home-grown tools.
We also store location data and device role / termination demographics for each device.
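
As a rough illustration of driving a tool's configuration from a central device store, here is a minimal sketch that renders Nagios host definitions from device records. The record fields (name, address, role, site), the sample devices and the "generic-host" template name are assumptions made for the example, not the actual CMS schema.

```python
# Hypothetical device records, standing in for rows pulled from the CMS.
DEVICES = [
    {"name": "pe1-syd", "address": "192.0.2.1", "role": "pe", "site": "Sydney"},
    {"name": "core1-mel", "address": "192.0.2.2", "role": "core", "site": "Melbourne"},
]


def nagios_host(device, template="generic-host"):
    """Render one Nagios 'define host' object from a device record."""
    lines = [
        "define host {",
        f"    use         {template}",  # assumed template name
        f"    host_name   {device['name']}",
        f"    alias       {device['role']} at {device['site']}",
        f"    address     {device['address']}",
        f"    hostgroups  {device['role']}",
        "}",
    ]
    return "\n".join(lines)


def render_all(devices):
    """Render every device as a host object, separated by blank lines."""
    return "\n\n".join(nagios_host(d) for d in devices) + "\n"
```

Writing the output of render_all(DEVICES) to a file under the Nagios object directory and reloading Nagios would be the remaining glue; the same records could feed generators for the other tools.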

By putting all this data together we can and do:

*         Pre-population of PEs for VRFs/Loops - giving rapid, easy and accurate PE deployment and replacement.

*         Configuration of all the core tools automatically for frequently changing components - e.g. client provisioning.

*         Combined with client monitoring data, we can correlate and pinpoint Supplier/Device/InterConnect issues much faster.

*         When an outage occurs we can track which client was where before it, and update only those clients directly throughout the outage.

*         Track who did what on each core device and play it backwards for issue post-mortems.

*         Telemetry information - the CMS solution can pull up where the service terminated - saving time looking for where a user is.

*         Automatically log help desk tickets into the internal job queues to follow up down nodes, whether client nodes or core devices.

*         Interface/device templating. The system can scan configurations collected by Rancid and find interfaces that don't comply with standards or are missing rate limits or other vital configuration information.
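
The interface templating check in the last item could look something like the sketch below. It assumes IOS-style configuration text (an "interface" line followed by indented sub-commands) and a made-up standard where every interface that is not shut down needs a description and a rate limit; the patterns and the sample config are illustrative only.

```python
import re

# Hypothetical standard: each active interface needs a description and a rate limit.
REQUIRED = {
    "description": re.compile(r"^description\s+\S"),
    "rate limit": re.compile(r"^(rate-limit|service-policy)\s"),
}


def interface_blocks(config_text):
    """Yield (interface_name, [sub-commands]) from IOS-style config text."""
    name, body = None, []
    for line in config_text.splitlines():
        if line.startswith("interface "):
            if name is not None:
                yield name, body
            name, body = line.split(None, 1)[1], []
        elif name is not None and line.startswith(" "):
            body.append(line.strip())
        elif name is not None:
            # Any unindented line ends the current interface block.
            yield name, body
            name, body = None, []
    if name is not None:
        yield name, body


def non_compliant(config_text):
    """Return {interface: [missing items]} for interfaces that are not shut down."""
    problems = {}
    for name, body in interface_blocks(config_text):
        if "shutdown" in body:
            continue
        missing = [label for label, pattern in REQUIRED.items()
                   if not any(pattern.match(cmd) for cmd in body)]
        if missing:
            problems[name] = missing
    return problems


SAMPLE = """\
interface GigabitEthernet0/1
 description Customer A
 service-policy input LIMIT-10M
!
interface GigabitEthernet0/2
 description Customer B
!
interface GigabitEthernet0/3
 shutdown
!
"""
```

Running non_compliant() over each config file Rancid has collected gives a per-device list of interfaces to fix, which could then be raised as a ticket or a passive check result.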

Nagios alone - configured right with all the extras - is gold. Some people only like to use the polling feature (active checks), but passive checks are also good sometimes for non-critical things.
Extra Nagios elements like these are gold:

*         Rancid hook for alerting when configurations are not updated for longer than 48 hours

*         Dual Power Supply Checks

*         Hooks into Cacti for thresholds on aggregation links over X time period.

*         Interfaces that are down but not admin down

*         HSRP standby changes

*         Device interface errors exceeded

*         Device has been rebooted

*         BGP/OSPF session changed state

*         Large routing Table changes

*         MTU checks

*         Checking TOS bits over the network are working and QoS is being honoured

*         List goes on.....
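
To give a feel for how small these checks can be, here is a sketch of the first one in the list (configurations not updated for longer than 48 hours), written as a Nagios-style plugin function. It assumes the modification time of the saved config file is a usable proxy for the last successful collection, which may not hold in every Rancid setup; the function names and the 48-hour default are illustrative.

```python
import os
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_age(path, max_age_hours=48, now=None):
    """Return (exit_code, message) based on how old the file at `path` is."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc.strerror}"
    age_hours = (now - mtime) / 3600.0
    if age_hours > max_age_hours:
        return CRITICAL, f"CRITICAL - {path} last updated {age_hours:.1f}h ago"
    return OK, f"OK - {path} updated {age_hours:.1f}h ago"


def main(argv):
    """Plugin entry point: print one status line and return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_config_age.py <config-file>")
        return UNKNOWN
    code, message = check_age(argv[1])
    print(message)
    return code
```

To use it as an actual plugin, finish the script with sys.exit(main(sys.argv)) so Nagios sees the exit code.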

We found that by doing all this, efficiency and accuracy are much better...
It can also make you lazy to some degree...

Anything that is monkey work should, and generally can, be automated.


From: ausnog-bounces at lists.ausnog.net [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Tom Wright
Sent: Monday, 5 July 2010 11:39 AM
To: phil colbourn
Cc: ausnog at ausnog.net
Subject: Re: [AusNOG] Network Management and Tools

Hi Phil,

We have a custom-built SQL database of our devices (including
some meta-data about them) that allows us to drive (with some glue)
whatever third party packages we wish to use - commercial or not.

Having that sort of fluidity is nice, because if you get sick of tool 'X'
you can supplement or replace it in a fairly isolated fashion.

I'm not necessarily against having multiple systems to collect data from
the network: e.g., you wouldn't want your monitoring system to go down
when you took your RRDs offline for maintenance, etc.

Backing up configs is a simple, and yet potentially frustrating exercise
when you're running a heterogeneous environment - there are usually
many ways to do it but a key outcome should be to have a mechanism to
diff them over time.
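
Whatever collects the configs, the diff step itself can be very small. A minimal sketch using Python's standard difflib, assuming two snapshots of the same device's config are already available as text (the labels are placeholders):

```python
import difflib


def config_diff(old_text, new_text, old_label="previous", new_label="current"):
    """Return a unified diff between two config snapshots as a string."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(diff)


OLD = "hostname r1\ninterface Gi0/1\n mtu 1500\n"
NEW = "hostname r1\ninterface Gi0/1\n mtu 9000\n"
print(config_diff(OLD, NEW))
```

An empty result means nothing changed between the two snapshots; anything else can be mailed out or stored alongside the backup.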

Don't forget that if you're a Cisco shop you should also backup your VLAN
databases (if anyone's interested in a convenient way of doing this via
SNMP, contact me off-list).

We feed our syslog and command accounting records back into an SQL
database. Having authentication, accounting and syslog records in a
'correlate-able' format is truly sexy, and makes troubleshooting much
easier when you can point at a device (or group of devices) and ask "what
just happened?".
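
A rough sketch of that kind of "what just happened?" query, using an in-memory SQLite database. The table layout, column names and sample rows are assumptions for illustration, not the schema described above.

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical syslog and accounting tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE syslog (ts INTEGER, device TEXT, message TEXT);
        CREATE TABLE accounting (ts INTEGER, device TEXT, username TEXT, command TEXT);
    """)
    return db


def what_just_happened(db, device, since_ts):
    """Return syslog and command-accounting events for a device, oldest first."""
    return db.execute("""
        SELECT ts, 'syslog' AS source, message AS detail
          FROM syslog WHERE device = ? AND ts >= ?
        UNION ALL
        SELECT ts, 'acct' AS source, username || ': ' || command AS detail
          FROM accounting WHERE device = ? AND ts >= ?
        ORDER BY ts
    """, (device, since_ts, device, since_ts)).fetchall()


db = build_db()
db.execute("INSERT INTO accounting VALUES (100, 'pe1', 'alice', 'interface Gi0/1')")
db.execute("INSERT INTO accounting VALUES (101, 'pe1', 'alice', 'shutdown')")
db.execute("INSERT INTO syslog VALUES (102, 'pe1', 'Interface Gi0/1 changed state to administratively down')")
db.execute("INSERT INTO syslog VALUES (102, 'pe2', 'unrelated event')")
for row in what_just_happened(db, "pe1", 100):
    print(row)
```

With both record types on one timeline, the command that preceded a log message sits right next to it.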

We use MRTG/RRDTool for a large chunk of our performance
management, and it runs flawlessly for many thousands of targets.


-- Tom


On 04/07/2010, at 9:30 PM, phil colbourn wrote:

What do you use for

 *   Alarm/event management (SNMP traps, syslog)
 *   Performance management (SNMP Polls)
 *   Backup config files
 *   AAA (local, RADIUS)
 *   Configuration management



--
Kind Regards,

Tom Wright
Internode Network Operations
P: +61 8 8228 2999
W: http://www.internode.on.net
