Adding alerting infrastructure to Observium, a network management platform designed to be both highly functional and a joy to use.
Observium is an Open Source, auto-discovering network monitoring platform written in PHP which supports a wide range of devices and operating systems. We collect data and status via SNMP and an optional Agent and present the information in a useful-to-engineers manner.
To keep things simple to manage, we try to discover everything that can be graphed or monitored on a device automatically. You usually don't know you need to graph something until after the event or outage! We even try to automatically discover neighbouring devices seen via CDP and LLDP tables or OSPF neighbour tables.
We believe that one of the key purposes of an NMS is to help engineers understand their networks. One of the very first features implemented was the ability to visualise a device's place in the network based on the devices it connects to directly and the devices its interfaces share subnets with.
We originally began the project in 2006 with the intention of replacing labour-intensive monitoring tools like MRTG, Cacti and Nagios. We started out as network engineers with very little programming or development knowledge, but a definite idea of how we wanted to present the information so that it would make our day-to-day work easier, especially during an outage.
Since then we believe that we've succeeded in creating a unique network and server status visualization platform which is providing thousands of organisations with an easy to manage and pleasant to use platform for managing their network and server estates.
There is a limited live demonstration of the software on our Demo Site.
In the past 6 months our user-base has increased dramatically. We're usually one of the first suggestions on reddit and other sites when people ask for monitoring software suggestions. We were featured in Linux Format in 2010 and have appeared on the TWiT podcast show, FLOSS Weekly.
More screenshots and information can be seen on the Project Site.
By far and away our number one requested feature is up/down and threshold alerting. It's the natural companion to the metrics and status visualisation, as we already collect all of the data we need.
Until now we've been hesitant as it's a fairly mammoth task which needs to be planned and implemented properly.
We now feel that the rest of the project has reached a state where we can turn our focus to adding a real alerting system to Observium.
We've helped a lot of people kick their Cacti habit; now we want to help them get off the Nagios for good.
We want to design the alerting aspect of the project along the same lines as the rest of the platform. We want as much autodiscovery and as many sane defaults as possible, so that new devices can be monitored and alerted on with the minimum of human intervention.
We all know that when a new device is deployed it can take a few weeks before anyone gets around to braving the alerting system to add it. We want to make that less tiresome. Using Observium's existing auto-discovery features, a correctly configured device would be automatically discovered and added to the alerting system.
We've decided on some basic parameters about how an Observium-style alerting system should work:
- Use the existing Observium database for host and entity information (an entity is a port, a drive, a sensor, etc)
- Use the existing Observium pollers to collect metrics, no separate poller
- Follow the spirit of Observium’s automation ethos and require minimum configuration with sane defaults
- No other alerting system treats different “types” of entity in the way we do. Most have a generic list of entities that they check, whereas we have a dozen different database tables in different formats
- We need to know what to monitor and have sane defaults. We need to monitor almost everything someone would need to monitor automatically, out of the box
- We need to have some method of easily defining general conditions that apply to an entire network of similar devices
- We need to be able to override these general conditions both per-device and per-entity
We plan to have each poller module build and pass an array of metrics and states to a metric/state checker which checks the values against a series of conditions for that entity generated from the database.
This checker will put alerts into a queue which will be sent out via a separately executed alert dispatcher.
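As a rough illustration of the flow described above, here is a minimal Python sketch. Observium itself is written in PHP, and every name below (`Condition`, `Alert`, `check_entity`, `dispatch`, the metric keys) is hypothetical rather than part of Observium. It shows a poller module handing an array of metrics to a checker, which queues alerts for a dispatcher that runs separately.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Condition:
    metric: str                    # key in the metrics array built by the poller
    test: Callable[[float], bool]  # returns True when the value should alert
    message: str


@dataclass
class Alert:
    entity: str
    metric: str
    value: float
    message: str


# Stand-in for a persistent queue (assumption: the real one might be a DB table).
alert_queue: deque = deque()


def check_entity(entity, metrics, conditions):
    """Compare polled metrics with the entity's conditions and queue any alerts."""
    for cond in conditions:
        value = metrics.get(cond.metric)
        if value is not None and cond.test(value):
            alert_queue.append(Alert(entity, cond.metric, value, cond.message))


def dispatch(send):
    """Run separately from the poller: drain the queue, passing alerts to `send`."""
    sent = 0
    while alert_queue:
        send(alert_queue.popleft())
        sent += 1
    return sent


# Example: a hypothetical port poller module passes its metrics to the checker.
port_metrics = {"in_errors_per_sec": 12.0, "oper_up": 1}
port_conditions = [
    Condition("in_errors_per_sec", lambda v: v > 10, "input error rate too high"),
    Condition("oper_up", lambda v: v == 0, "port is down"),
]
check_entity("router1/Gi0/1", port_metrics, port_conditions)

delivered = []
dispatch(delivered.append)  # a real dispatcher would send email, SMS and so on
```

Keeping the dispatcher separate from the checker means a slow or failing notification channel would not hold up polling.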
The entity conditions will be generated from a series of database tables at poll-time, allowing the creation of checks with host, entity or global scope.
The intention is to allow checks to also be limited to entities with specific attributes. For example, we could limit link-speed and duplex checks to only Ethernet interfaces.
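One way the scoping and attribute filtering might work is sketched below in Python. This is an assumption about a possible design, not the planned schema: the rule dictionaries, the `applies` and `effective_conditions` helpers, and the precedence order (entity overrides device, device overrides global) are all hypothetical.

```python
# Narrower scopes are applied last so that they override broader ones.
PRECEDENCE = {"global": 0, "device": 1, "entity": 2}


def applies(rule, entity):
    """A rule applies if its scope matches and all required attributes match."""
    scope = rule["scope"]
    if scope == "device" and rule["device"] != entity["device"]:
        return False
    if scope == "entity" and (rule["device"], rule["entity"]) != (
        entity["device"],
        entity["name"],
    ):
        return False
    required = rule.get("attrs", {})
    return all(entity["attrs"].get(k) == v for k, v in required.items())


def effective_conditions(rules, entity):
    """Return {check: threshold} for one entity after applying overrides."""
    matching = [r for r in rules if applies(r, entity)]
    result = {}
    for rule in sorted(matching, key=lambda r: PRECEDENCE[r["scope"]]):
        result[rule["check"]] = rule["threshold"]
    return result


rules = [
    # Duplex check limited to Ethernet interfaces via an attribute filter.
    {"scope": "global", "check": "duplex", "threshold": "full",
     "attrs": {"type": "ethernet"}},
    {"scope": "global", "check": "util_pct", "threshold": 90},
    {"scope": "device", "device": "sw1", "check": "util_pct", "threshold": 80},
    {"scope": "entity", "device": "sw1", "entity": "Gi0/1",
     "check": "util_pct", "threshold": 95},
]

uplink = {"device": "sw1", "name": "Gi0/1", "attrs": {"type": "ethernet"}}
access = {"device": "sw1", "name": "Gi0/2", "attrs": {"type": "ethernet"}}
serial = {"device": "rtr1", "name": "Se0/0", "attrs": {"type": "serial"}}

uplink_checks = effective_conditions(rules, uplink)
access_checks = effective_conditions(rules, access)
serial_checks = effective_conditions(rules, serial)
```

Here the serial interface inherits only the global utilisation threshold, because the duplex rule is restricted to Ethernet.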
Some examples of checks for the 'port' entity type might include:
- Bits/sec in/out
- Bits/sec in/out as percentage of interface speed
- Errors/sec in/out
- Unicast/nonunicast/broadcast packets in/out
- ADSL SNR/noise margin/sync speed
- Interface link speed
- Duplex mode
- Promiscuous mode
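To make the "percentage of interface speed" check concrete, the Python sketch below derives a bits/sec rate from two octet counter samples and expresses it as a percentage of link speed. The function names and sample values are hypothetical, and we assume SNMP-style octet counters that may wrap at most once between polls.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Bits per second between two counter samples, allowing for one wrap."""
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around between the two polls.
        delta += 2 ** counter_bits
    return delta * 8 / interval_s


def utilisation_pct(bps, if_speed_bps):
    """Rate as a percentage of interface speed, or None if speed is unknown."""
    if if_speed_bps <= 0:
        return None
    return 100.0 * bps / if_speed_bps


# Example: 3,375,000,000 octets received in a 300 second poll interval
# on a 100 Mbit/s interface.
in_bps = rate_bps(1_000_000, 1_000_000 + 3_375_000_000, 300)
in_pct = utilisation_pct(in_bps, 100_000_000)  # 90.0
```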
We also intend to allow an alert to be delayed for a set period of time. For example, you might not want to be alerted if an interface is above 90% utilization unless it's been that way for 30 minutes.
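A delayed alert of this kind could be sketched as follows in Python. The `DelayedAlert` class is hypothetical and only illustrates the idea: the alert fires once the value has stayed above the threshold for the whole delay period, and any poll at or below the threshold resets the timer.

```python
class DelayedAlert:
    """Fire only after a value has stayed above a threshold for `delay_s` seconds."""

    def __init__(self, threshold, delay_s):
        self.threshold = threshold
        self.delay_s = delay_s
        self.breach_started = None  # time of the first poll above the threshold

    def update(self, value, now):
        """Record one poll result; return True if the alert should fire."""
        if value <= self.threshold:
            self.breach_started = None  # back to normal, so reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.delay_s


# Example: 90% utilisation threshold, 30 minute delay, polled every 5 minutes.
check = DelayedAlert(threshold=90, delay_s=30 * 60)
results = [check.update(95, t) for t in range(0, 2100, 300)]
# Only the final poll, 1800 seconds after the first breach, fires the alert.
```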
What are we funding?
The funding goal will pay for 3 months of our time to implement the alerting framework and configuration interface, and to hook them into as many of Observium's polling modules as possible.
Until now, Observium development has been ad-hoc, squeezed in between paying jobs. To properly implement the alerting system we need to be able to spend a decent block of time working on it.
To do this, expenses will have to be reduced to the bare minimum, and only ramen and the occasional piece of roadkill will be consumed. It'll be tough, but it'll be worth it!
Now that we've been very generously funded to our initial goal amount by a single contributor in the first hour, we need to start thinking beyond the alerting system.
Other things we have on the drawing board include:
- A defined plugin system to replace the existing "apps" system to allow graphing and alerting of *nix applications
- A Cacti-a-like system for graphing arbitrary OIDs and data from scripts
- Better data collection support for more complex device types like load balancers, storage arrays and firewalls
- A daemon to proxy SNMP and Agent requests to reach hosts within private networks
- Expansion of the ISP-specific feature set including VRFs and Pseudowires
- Better support for routing protocol data collection from OSPF, EIGRP and IS-IS
- iOS and Android notification via Pushover
- Cisco QoS graphing from CISCO-CLASS-BASED-QOS-MIB
- Per-user VPN statistics from CISCO-IPSEC-FLOW-MONITOR-MIB and CISCO-REMOTE-ACCESS-MONITOR-MIB
Once the campaign is completed we'll allow backers to vote on which features they'd like to be prioritized after the alerting system.
Risks and challenges
The primary risk is that we don't manage to fully implement the alerting system within the time afforded to us by the funding.
Even in this situation, nothing will be wasted: any development work we've done will bring us closer to having a finished, usable alerting system. We'll get there; it might just take a little longer!
Pledge £5 or more
Thanks for supporting Observium!
Pledge £10 or more
Double the thanks, and double the kudos.
Pledge £25 or more
You'll be a trailblazer of open-source funding, as well as being top of my list of people to buy drinks for.
Pledge £50 or more
Fantastic. We'll credit you or your organisation on an acknowledgements page on the Observium web site.
Pledge £100 or more
In addition to credit, we'll include a logo and a link, as well as a warm fuzzy feeling!
Pledge £250 or more
I'll answer email queries about the system and consider requests for features, plus a logo and credit on the site.
Pledge £500 or more
I'll answer email queries about the system and consider requests for features. I'll also give half a day of consulting to help implement Observium on your network, plus a logo and credit on the site.
Pledge £1,000 or more
I'll answer email queries about the system and consider requests for features. I'll also give a full day of consulting to help implement Observium on your network, plus a logo and credit on the site.
Pledge £5,000 or more
Woah! You funded the whole thing! You get all of the above things, plus extra help with whatever you need. Development of a new device type? Assistance building custom dashboards?