Network monitoring and getting actionable alerts

I’ve always worked in the IT/telecommunications industry, and no matter the company, everyone seems to struggle with network and infrastructure monitoring and getting actionable alerts. There are plenty of NMS platforms out there, but to date I haven’t found one that gives targeted, actionable alerts without a major initial and ongoing investment in managing the network management software itself.

Options available

When choosing network monitoring software, most people lean towards open source solutions (such as Nagios, MRTG, Zenoss, etc.) because of the lack of upfront cost. For some, this is a viable solution; for most, however, it becomes a headache to keep up with (usually due to complex configurations and a lack of product documentation).

Larger enterprises and service providers who need reliable, vendor-supported products usually turn to the industry leaders (Solarwinds, WhatsUp Gold, PRTG, etc.), but they incur significant upfront costs and annual maintenance fees.

Noise in network monitoring

Whether it’s open source or a paid product, every network monitoring platform has one goal: alert when something is no longer working. While this sounds like an easy task, most NMS platforms stumble over two major hurdles: interconnected devices and alerting thresholds. Both generate a flood of alerts and ultimately lead to continual tweaking to get things right.

[iconbox icon=”lightbulb-o” iconColor=”#3b5998″ title=”Scenario #1″ type=”left”]You’re monitoring a remote site with 20 devices and the link to that site goes down. You’ll receive a minimum of 20 alerts, one for each device that is “down.” In reality, only the connectivity to the site is down; the 20 devices are actually up, and your NMS just can’t reach them.[/iconbox]

[iconbox icon=”lightbulb-o” iconColor=”#3b5998″ title=”Scenario #2″ type=”left”]You’re monitoring a remote site with 20 devices. The ISP has informed you of a brief maintenance window (under 1 minute) while they perform configuration changes. Connectivity to the remote site is interrupted for less than a minute, yet most NMS platforms would send you one alert when it went down and another when it was restored.[/iconbox]

Dependency tracking and why it fails

Most NMS platforms have some ability to associate upstream and downstream dependencies with a polled device. These dependencies allow admins to specify that one or more [downstream] devices are connected through an [upstream] device/interface.

[iconbox icon=”lightbulb-o” iconColor=”#3b5998″ title=”Scenario #3″ type=”left”]You’re monitoring 20 devices at a remote site plus the router at the edge of the network. You’ve specified that the 20 devices are downstream of the router. If the link/connectivity to the router goes down, the NMS will understand that connectivity to the router is lost and will only alert that the router is down. The downstream devices would show as down due to an upstream device issue, and you wouldn’t receive alerts for every device.[/iconbox]
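To make Scenario #3 concrete, here is a minimal Python sketch of parent-based alert suppression. It is not how any particular NMS implements dependencies; the `PARENTS` map, the device names, and the `alerts_to_send` function are all made up for illustration, and it assumes every device has exactly one upstream parent.

```python
from typing import Dict, Optional, Set

# Hypothetical topology: each device maps to its single upstream parent.
# The edge router has no parent from the NMS's point of view.
PARENTS: Dict[str, Optional[str]] = {"router": None}
PARENTS.update({f"device-{i}": "router" for i in range(1, 21)})


def alerts_to_send(unreachable: Set[str],
                   parents: Dict[str, Optional[str]]) -> Set[str]:
    """Return only the unreachable devices whose parent is still reachable."""
    to_alert = set()
    for device in unreachable:
        parent = parents.get(device)
        # If the parent is also unreachable, treat this device as
        # "down due to upstream" and suppress its alert.
        if parent is None or parent not in unreachable:
            to_alert.add(device)
    return to_alert


# The site link fails, so the router and all 20 devices stop responding.
print(alerts_to_send(set(PARENTS), PARENTS))
```

With all 21 hosts unreachable, only the router is reported. If a single downstream device fails on its own, it is still reported because its parent is reachable.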

Dependency associations seem like a solution for getting targeted alerts for the specific device/interface that is down; however, they fail miserably once you try to scale.

Here’s why: Traditional hub and spoke networks grow in ways where there are multiple upstream and downstream dependencies through many different devices (including switches, routers, wireless bridges/access points, transceivers, MUXes, etc.). Service provider and large enterprise networks get even more complex when you add BGP, OSPF and spanning-tree to provide alternate paths for connectivity. Even though a link may actually be down, connectivity is automatically re-routed over alternate paths, probably with different upstream/downstream dependencies than the ones you configured.
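A small sketch shows why a single static parent breaks down once redundant paths exist. Instead of a fixed parent per device, it treats the network as a graph and asks whether any path from the NMS still reaches the device. The `LINKS` list, the node names, and `reachable_from` are hypothetical; a real deployment would also need to learn the live topology (from routing protocols or neighbor discovery, for example), which this sketch does not attempt.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str]

# Hypothetical topology: the site router is dual-homed to two core routers.
LINKS = [
    ("nms", "core-a"),
    ("nms", "core-b"),
    ("core-a", "site-router"),
    ("core-b", "site-router"),
    ("site-router", "switch-1"),
]


def reachable_from(source: str, links: Iterable[Link],
                   failed_links: Iterable[Link]) -> Set[str]:
    """Breadth-first search over the links that are still up."""
    failed = {frozenset(link) for link in failed_links}
    neighbors: Dict[str, Set[str]] = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue  # skip links that are currently down
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# One uplink fails: traffic re-routes via core-b, so the site stays reachable.
print("switch-1" in reachable_from("nms", LINKS, [("core-a", "site-router")]))
```

A static rule such as "site-router is downstream of core-a" describes only one of these paths. After the re-route, the real dependency runs through core-b, which is exactly the situation that hand-maintained dependency maps fail to track.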

Threshold alerting as a means to quiet noise

Getting an alert every time a device/interface goes down and comes back up desensitizes a NOC to important issues and makes it possible for a tech to miss a critical alert in the middle of a bunch of noise. Some NMS platforms allow you to customize the alert engine to send alerts only when a threshold has been crossed.

[iconbox icon=”lightbulb-o” iconColor=”#3b5998″ title=”Scenario #4″ type=”left”]Only send a down alert when the device/interface has been down for >1 minute. Only send a restoration alert when the device/interface has been back in the up state for >5 minutes.[/iconbox]

This alert logic eliminates alerts for “quick hit” outages and only sends a restoration alert once the device/interface has been stable for a period of time.
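The thresholds from Scenario #4 can be sketched as a small state machine. This is an illustration only, not any vendor's alert engine: the `DebouncedAlerter` class, its parameter names, and the way poll results are fed in are assumptions, with times expressed in seconds.

```python
from typing import Optional


class DebouncedAlerter:
    """Alert DOWN after >60 s down, and UP after >300 s of stable uptime."""

    def __init__(self, down_after: float = 60.0, up_after: float = 300.0):
        self.down_after = down_after
        self.up_after = up_after
        self.alerted_down = False          # True once a DOWN alert was sent
        self.last_state: Optional[bool] = None
        self.state_since = 0.0             # when the current state began

    def observe(self, now: float, is_up: bool) -> Optional[str]:
        """Feed one poll result; return "DOWN", "UP", or None."""
        if is_up != self.last_state:
            # State changed, so restart the timer for the new state.
            self.last_state = is_up
            self.state_since = now
        held = now - self.state_since

        if not is_up and not self.alerted_down and held > self.down_after:
            self.alerted_down = True
            return "DOWN"
        if is_up and self.alerted_down and held > self.up_after:
            self.alerted_down = False
            return "UP"
        return None


alerter = DebouncedAlerter()
# A 40-second blip: down at t=10, back up at t=50. No alert is produced.
for t, up in [(0, True), (10, False), (40, False), (50, True)]:
    print(t, alerter.observe(t, up))
```

Because the timer restarts on every state change, a device that flaps during recovery does not trigger a restoration alert until it has stayed up for the full five minutes.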

Is there a solution?

The NMS market has been stagnant, with the same dozen or so players, for the past 10+ years. There isn’t a solution that provides targeted, actionable alerting for large enterprise and service provider networks. Some of the legacy players (WhatsUp Gold, for example) have very impressive tools for Layer-2 mapping (which helps visualize the upstream and downstream dependencies); however, no single product seems to eliminate noise and give actionable alerts.

A few notable startups have entered the market (LogicMonitor is a favorite of mine), bridging the gap between the functionality of open source and paid support, wrapped up in a nice cloud-based package. Unfortunately, no one has been able to conquer automatic dependency association, and thus managing the network monitoring platform remains a critical task for any enterprise or service provider.