Saturday, April 4, 2015

Network monitoring is not for amateurs

So, another blogger recently posted that network monitoring was easy. All you needed to do was deploy off the shelf network monitoring tools X, Y, and Z, and your network would be monitored and you would be able to sleep well at night knowing that your network was working well.

At which point I have to stop and say... what... the.... bleep?

I've been doing this a *long* time. In this or prior jobs (where I was a staff engineer at a firewall company in particular) I've deployed Nagios, MRTG, Zenoss, OpenNMS, Big Brother, and a variety of proprietary network monitoring tools such as PRTG and combined firewall/IDS tools such as Checkpoint, as well as specialty tools such as smartd, arpwatch and snort. And what I have to say is that monitoring networks is *hard*. Pretty much any of the big network monitoring tools (other than Nagios, whose configuration is utterly non-automated) will take you 80% of the way to effective monitoring of your network with a few mouse-clicks. But to go to that extra 20% needed to make sure that you're immediately notified if anything your client relies on is not providing services, you end up having to install plugins, write sensors, and combine multiple tools together. It's a lengthy task that can require days of research to solve just 1% of that last 20%.

Here's an example. I was having issues where one of my iSCSI servers was dropping off the network for thirty seconds or so. I'd learn about it because my users would start howling that their Linux virtual machines were no longer functioning properly. I investigated, and found that when Linux cannot write to a volume for a certain amount of time, it then marks the volume read-only. Which, if the volume is the root volume on a Linux virtual machine, means things go to bleep in a handbasket.

And Nagios reported not a problem during all this time. All the normal Nagios sensors for system load, disk space free, memory free, and so forth showed that the virtual machine was operating completely normally. The virtual machine was pingable, it was responding to SNMP queries, it looked completely healthy to any network monitoring tool in the universe. So I ended up writing a Nagios sensor that 'touched' a file in /, and if it couldn't because the file system had been marked read-only, reported an error. I then deployed it to my virtual machines and added it to my list of sensors to monitor, and when my iSCSI server dropped offline and the filesystems went read-only, I'd get a Nagios alert. (Said iSCSI server, BTW, is not an Intransa iSCSI server, it's a generic SuperMicro box running Linux with the LIO iSCSI stack, but that's irrelevant to the conversation). After some time of getting alerted when things went AWOL I managed to zone in on exactly the set of log file entries on the iSCSI server that indicated what was going on at the time things went to bleep, and discovered that I had a failing disk in the iSCSI system that was not triggering the smartmond monitoring tool. Unfortunately figuring out which of those 24 disks were causing the SAS bus reset loop was non-trivial given that smartctl was showing no problems when I looked at them, but once I tracked it down and replaced the disk, the problem was solved.

So let's recap. I had a serious problem with my storage network. smartmond, the host-based disk monitoring tool on the iSCSI box, couldn't detect it, all the disks looked fine when you looked at them with the tool it uses, smartctl. Nagios and all its prepackaged sensors couldn't detect it until I wrote a custom sensor. And this other blogger says that monitoring complex networks including clients, storage boxes, video cameras, and so forth is *easy*? Just a matter of installing packages X, Y, and Z and sitting back and relaxing?

Crack. That's my only explanation for anybody making such a statement. Either that, or utter inexperience at monitoring complex heterogeneous networks with multiple points of failure, of which only 80% or so are covered by off-the-shelf network monitoring tools, and the remainder requiring writing custom sensors to detect domain-specific points of failure. Either way, anybody who listens to that blogger and believes that monitoring a network effectively is easy is a fool. And that's all I will say on that subject.

-ELG