Nagios made easy
In depth: What's the best way to monitor multiple Linux servers for configuration errors, high load or other problems? The answer is Nagios, a fantastic (and free!) network monitoring system that lets you track multiple services (HTTP, SMTP, SSH and more) across multiple machines, all backed by a neat user interface.
Nagios gives you an unbeatable overview of all your machines, meaning that you can fix upcoming problems before they turn critical and be certain that you're not missing anything about your network. The basic structure of Nagios is pretty simple: you set up one machine as your Nagios server, and it gathers information on the client machines you point it at, then displays it in a neat web page format. Read on to learn how to get started with Nagios on your own network!
You can get basic service information on the client machines (whether they're pingable, whether SSH is OK, whether HTTP is OK) without needing to install any software on them; the server alone will do the work. Or, if you install client-side software, you can get centralised reports on further information such as disk space or CPU usage. The server can react to alert situations by sending an email or text message to a human, or by doing more or less anything else that you can write a script to make happen.
It's incredibly versatile, configurable, and powerful. The flipside to this, inevitably, is that it can also be difficult to set up correctly. This tutorial will walk you through getting the central server and a single client machine set up (all other client machines can be set up exactly the same) and will touch briefly on the other possibilities you can investigate once you have it running. It's complicated, but it's worth the effort you put in to get it going, as once Nagios is running it needs very little maintenance.
Initial server setup
First of all we'll tackle getting the central Nagios server set up, and enable reporting via the web front-end. You can install from source from www.nagios.org, or there should be a package available for your system. For Debian, this is the nagios2 package (not nagios, which is the old version), and you should also install nagios-plugins (we'll talk about plugins later). On setup, it'll ask you for the admin password - remember this! - and whether you want version 1 backwards compatibility (there's no need for this if you're performing a new setup).
You'll also need to install apache2 if you don't already have it installed (we're not going to cover this in any detail here). Once both Nagios and Apache are installed, go to /etc/nagios2 and copy the config sections from the apache2.conf file into the config file belonging to your installation of Apache.
This will provide a basic configuration that should work with the default install to get the web reporting working. If you've personalised your Apache or Nagios install or moved any files around, you'll need to check that the various directives and aliases are pointing to the correct places. Restart Apache, and that's the web side of things set up. Of course, now you need to get the server set up so that there's information being delivered to that webpage.
Note: from here on in, assume that all files or directories live in /etc/nagios2 unless otherwise stated.
The Debian configuration setup - which we'd recommend as an approach worth taking even if you're installing on another system or from source - is to create a conf.d directory in which the majority of the config files are put. Within this directory, configuration can be broken out into separate files as much as you like. We find this far more manageable than using one enormous file, although you can do that if you prefer!
The Nagios screen from the basic initial install.
Within the main config file, nagios.cfg, you refer to this config file directory with this line:
cfg_dir=/etc/nagios2/conf.d
You can add any other directories with a similar line - use as many as you like.
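As a rough sketch of how several such lines might sit together in nagios.cfg, assuming a Debian-style layout (the servers directory is a hypothetical example, not something the package creates for you):

```
# Every .cfg file found in a cfg_dir directory is read
cfg_dir=/etc/nagios2/conf.d
# Hypothetical extra directory for your own host files
cfg_dir=/etc/nagios2/servers
# A single file can be pulled in with cfg_file instead
cfg_file=/etc/nagios2/commands.cfg
```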
For the basic server setup, monitoring only the server itself (localhost), from a package install, most of the default settings should be OK. All you need to do is to edit the conf.d/contacts_nagios2.cfg file to put your email address in under the first defined contact.
This means that if an alert is triggered, Nagios will send you an email. Before you start Nagios though, take a look at the defaults defined for the localhost in conf.d/localhost_nagios2.cfg . The host definition and the first service definition should look a bit like this:
define host{
use generic-host
host_name localhost
alias localhost
address 127.0.0.1
}
# Define a service to check the disk space of the root partition
# Warning if < 20% free, critical if < 10% free space on partition.
define service{
use generic-service ; Name of service template to use
host_name localhost
service_description Disk Space
check_command check_all_disks!20%!10%
}
After that there will be another couple of service definitions.
Templates
The use keyword in that code snippet refers to a very useful function of Nagios: the ability to set up templates. Templates mean that you can set your service or host defaults in a single place rather than having to retype them constantly. Config reuse, like any other code reuse, is always a better bet - less hassle, more maintainability!
The generic-host template lives in the conf.d/generic-host_nagios2.cfg file. It sets a bunch of defaults, including notification enablement, various aspects of dealing with notifications and events, and so on. However, in the individual host definition you can override any of these. generic-service works as a per-service (rather than per-host) template in exactly the same way (check out conf.d/generic-service_nagios2.cfg to see what this looks like and what defaults are set).
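To illustrate overriding, here is a minimal sketch of a host that inherits from generic-host but changes two of its defaults. The host name, alias, address and chosen values are made up for illustration, and this assumes your generic-host template sets both directives, as the Debian one should:

```
define host{
use generic-host
host_name testbox
alias Test machine
address 10.0.0.50
# These two lines override the template's values for this host only
max_check_attempts 3
notifications_enabled 0
}
```

Anything not mentioned in the host definition still comes from the template.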
The check_command keyword refers to a command found in /etc/nagios-plugins/config/disk.cfg. This plugin directory contains all the commands for checking various services, and there are an enormous number of plugins that you can add, on top of the ones that a default package install will give you (if installing from source you may need to get more of these by hand).
For the moment, leave our basic install's defaults as they are. We'll add another host to the various groups in a moment. Restart Nagios, and go look at http://server.example.com/nagios2 (you'll be asked for that nagiosadmin password that you set at install). You'll get a welcome page - click on 'Tactical Overview' in the left-hand menu, and the report screen will come up, as in the screenshot over the page.
The tactical overview shows the hosts, services, and their states. For Nagios, hosts and services can be in one of several states. Pending means that it's not done checking yet. OK means, unsurprisingly, that all is well. Then there are various levels of not-OK state: yellow is Warning, and red is Critical. You can set these levels yourself in the config files. If you click on a service, you'll get a detailed page up of information about that service (or host).
Change the front page
To make the front page of the Nagios web interface show the tactical overview rather than a home page, edit /usr/share/nagios2/htdocs/index.html and replace this line:
<FRAME SRC="main.html" NAME="main">
with this one:
<FRAME SRC="/cgi-bin/nagios2/tac.cgi" NAME="main">
Hosts and services
So far, all Nagios is doing is monitoring itself and the default gateway that the packaged configuration also defines. Next, you need to add a host (client). The best way to do this is to create a file in conf.d named for your client machine (for example, conf.d/host-client1.cfg), which should initially look a bit like this:
define host{
use generic-host
host_name client1
address 10.0.0.2
}
As you can see, this can be very basic - almost all of the information is taken from the host template. Reload Nagios (/etc/init.d/nagios2 reload), then give it a minute or two to run the various checks before you check out the web page.
Hostgroups, and defining services
Nagios can see your client now, but right at the moment you don't have any actual checks defined for this host. You could add them individually to the host config file - in the same way that checks are defined in the localhost config file that we looked at initially. However, it would be a lot of hassle to have to go through for every host.
As with the templates, you want to type things once only. This means using host groups, which enable you to define a service per hostgroup rather than per host.
Edit conf.d/hostgroups_nagios2.cfg to add the short name of the host to the relevant groups. You can, if you want to, create a new group. For example, we have this group setup:
define hostgroup {
hostgroup_name debian-servers
alias Debian GNU/Linux Servers
members localhost,webserver,ldapserver
}
There are many more options than this that you can set for the host group - the Nagios documentation page is very helpful. However, the minimal setup above will do the job just fine.
Now you have your client host, and you've added it to a hostgroup - let's call it debian-servers, as per the code snippet above. Next, edit the conf.d/services_nagios2.cfg file to set up a couple of service checks for that host group. Let's check for pingability and for SSH connectability:
define service {
hostgroup_name debian-servers
service_description SSH
check_command check_ssh
use generic-service
notification_interval 0
}
define service {
hostgroup_name debian-servers
service_description PING
check_command check_ping!100.0,20%!500.0,60%
use generic-service
notification_interval 0
}
You will probably want to define a group of machines that you expect to be SSH-accessible, and a group that you expect to be ping-accessible. For example, the default setup has the gateway pingable but you don't necessarily expect it to be SSH-accessible.
But here we'll just use the single group, as we currently only have one client machine. Nagios will now check the machines in those groups, for the services in the check_command statement, with the generic-service settings, and complain if a problem is found.
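If you do later split things up as suggested, a sketch might look like the following. The group names, and the host name gateway, are assumptions based on a typical Debian default config, so check them against your own files. The SSH service here would replace the debian-servers SSH check rather than sit alongside it, since the same host can't carry two services with the same description:

```
# Hosts that should answer SSH
define hostgroup {
hostgroup_name ssh-servers
alias SSH servers
members localhost,client1
}
# Hosts that only need to answer ping (the gateway, for instance)
define hostgroup {
hostgroup_name ping-servers
alias Pingable servers
members localhost,client1,gateway
}
# The SSH check now applies to the SSH group only
define service {
hostgroup_name ssh-servers
service_description SSH
check_command check_ssh
use generic-service
notification_interval 0
}
```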
Here's the feedback from Nagios showing a happy service...
...and here's a less happy service, in a warning state.
Checking multiple websites
If you just want to check that a webserver is responding to HTTP requests, the existing check_http command is fine. However, you may have more than one domain running on a single server and want to check them all separately. To do this, first edit commands.cfg and add the following lines:
define command{
command_name check_http-website1
command_line /usr/lib/nagios/plugins/check_http -H website1.example.com
}
Create similar commands for as many websites as you have. Then edit the webserver's host config file (eg conf.d/host-webserver.cfg) to include a service for each command:
define service{
host_name webserver
service_description website1
check_command check_http-website1
use generic-service
notification_interval 1440
}
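If you have a lot of sites, one alternative is a single command that takes the site name as an argument. This is only a sketch: the command name check_http_vhost and the site website2.example.com are hypothetical. It uses check_http's -H option for the virtual host name and -I for the address to connect to:

```
define command{
command_name check_http_vhost
# $ARG1$ is the site name; the request goes to the host's own address
command_line /usr/lib/nagios/plugins/check_http -H $ARG1$ -I $HOSTADDRESS$
}
define service{
host_name webserver
service_description website2
check_command check_http_vhost!website2.example.com
use generic-service
notification_interval 1440
}
```

The trade-off is one command definition instead of many, at the cost of slightly longer check_command lines.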
Alerts
Now you have Nagios set up so you can monitor your systems from a centralised web page. The next step is to set up email alerts, so that you get told if anything's wrong without having to go looking. You've already put your address in the contacts_nagios2.cfg file above.
This file also defines an 'admins' contact group. It's better to use contact groups than individual users, as again this increases maintainability. If there's a change in personnel, you need change only the group membership, not the references in all the other config files. The default case is for admins to contain only the root user, which we set up already, so we'll stick with that.
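For reference, the contact and contact group in that file should look roughly like this. Treat it as a sketch from a typical Debian install rather than an exact copy; the email address is a placeholder for your own:

```
define contact{
contact_name root
alias Root
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,r
service_notification_commands notify-by-email
host_notification_commands host-notify-by-email
# Placeholder: put your own address here
email you@example.com
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members root
}
```

Adding a colleague later means defining one more contact and appending their name to the members line.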
The generic service definition is once again your friend. Our preferred default is for everything to drop us an email if it goes wrong in some way, as we check our email pretty often and certainly more often than we would remember to hit a web page. So that's what we'll set up here. In the conf.d/generic-service_nagios2.cfg file, add the following to the service definition:
notification_interval 1440
is_volatile 0
check_period 24x7
normal_check_interval 5
retry_check_interval 1
max_check_attempts 10
notification_period 24x7
notification_options c,r
contact_groups admins
The notification interval defines how often you get reminded (in minutes) - here it's every 24 hours. Time periods are defined in conf.d/timeperiods_nagios2.cfg. check_period defines when the service is expected to run - here, all the time.
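A time period definition is just a name plus a set of hours per day. As a sketch, the 24x7 period and a working-hours period might look like this (the workhours times are an assumption; adjust them to suit):

```
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
# Days that are not listed are simply excluded from the period
define timeperiod{
timeperiod_name workhours
alias Standard Work Hours
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}
```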
The normal_check_interval and retry_check_interval are in minutes: the service here is set to be checked every five minutes, but if a check fails, it is retried every minute instead. Up to ten check attempts (max_check_attempts) will be made before Nagios concludes that there's something wrong with the service, though you can of course reduce this number if you prefer.
The notification_period sets when alerts should be sent - again, we've set this to all the time - and notification_options sets when you should receive an alert. For hosts, d = notify on down states; u = notify on unreachable states; r = notify on host recoveries; and f = notify when host starts and stops flapping. For services, w = notify on warning states; u = unknown states; c = critical states; and again r = recovery and f = start/stop of flapping. Finally, contact_groups defines who to contact when a notification is required.
Once you have all that in place, reload Nagios, then try turning off SSH on your test client. You should receive a message to the address you set in the contacts file, telling you that the client is not SSH accessible. Turn it back on, and you'll get another alert, telling you it's OK again.
Configuring the From line
The default From: line in the email alerts is the Nagios user - this may not be good if you have a mail server that wants a registered address before it will send. If you're using Exim 4, you need to set the 'untrusted user' option, and then add
-- -f address@example.com
to the end of the host-notify-by-email and notify-by-email commands in the commands.cfg file.
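As a sketch of where the extra text goes, here is a shortened notify-by-email command. The message body is abbreviated for readability and is not the packaged default; keep whatever your commands.cfg already contains and only append the final part. The whole command_line must stay on a single line:

```
define command{
command_name notify-by-email
# Only the trailing "-- -f address@example.com" is new
command_line /usr/bin/printf "%b" "$NOTIFICATIONTYPE$: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$ -- -f address@example.com
}
```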
Plugins
So, now you have a basic Nagios system up and running, and you have it set up to be easy to add further hosts and services at the same fairly basic level. However, there's a lot more that it can do.
As an example, let's look at the plugin that enables you to monitor disk usage, CPU usage, and other similar things on remote hosts. At present, on your remote/client hosts, you can only monitor whether or not they're up. Ideally you want more than that - you want to know if they're about to run out of disk space, for example, or if the mail delivery is down.
What you want for this is the NRPE plugin. Install the NRPE plugin on your Nagios server (it's the nagios-nrpe-plugin package on Debian), and install the NRPE server on your remote host/client (the nagios-nrpe-server package on Debian). The NRPE server will collect information from the machine, and pass it on to the plugin when it is contacted by the main server.
Check it's working...
To test the communication between server and client, run
/usr/lib/nagios/plugins/check_nrpe -H client -c check_users
...on the server, and it should tell you how many users are logged in on the client. Next, check /etc/nagios-plugins/config/check_nrpe.cfg on the server, if necessary - on Debian this is already set up, so no need to edit it. It should look like this:
define command {
command_name check_nrpe
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
# this command runs a program $ARG1$ with only one argument
define command {
command_name check_nrpe_1arg
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Edit conf.d/services_nagios2.cfg on the Nagios server to add the services you want to monitor on the remote host. Our configuration looks like this:
define service {
service_description SMTP
use generic-service
hostgroup_name nrpe
check_command check_nrpe_1arg!check_smtp
}
define service {
service_description LOAD
use generic-service
hostgroup_name nrpe
check_command check_nrpe_1arg!check_load
}
define service {
service_description DISK
use generic-service
hostgroup_name nrpe
check_command check_nrpe!check_disk!/
}
(Note that this requires an nrpe hostgroup to be set up.)
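A minimal sketch of such a hostgroup, which would go in conf.d/hostgroups_nagios2.cfg on the Nagios server (the alias is made up, and client1 stands in for whichever hosts run the NRPE server):

```
define hostgroup {
hostgroup_name nrpe
alias NRPE-monitored hosts
members client1
}
```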
Checks that take only one argument - the service to check - use the check_nrpe_1arg command (see /etc/nagios-plugins/config/check_nrpe.cfg). If you want to give further arguments, you need to edit /etc/nagios/nrpe.cfg on the client machine so that the dont_blame_nrpe keyword has the value 1. Then use the check_nrpe command.
In the code above, we use it to pass the disk mount point to be checked. Restart Nagios, and you should be able to start seeing information from your client.
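The relevant lines in the client's nrpe.cfg might look like this sketch. The server address 10.0.0.1 is an assumption, so substitute your Nagios server's real address. Be aware that the NRPE documentation treats argument passing as a security risk, so it makes sense to restrict which hosts may connect:

```
# /etc/nagios/nrpe.cfg on the client
# Only accept connections from these addresses (server address assumed)
allowed_hosts=127.0.0.1,10.0.0.1
# Allow the Nagios server to pass arguments to NRPE commands
dont_blame_nrpe=1
```

The NRPE daemon on the client needs restarting before the change takes effect.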
The available plugins live in /usr/lib/nagios/plugins on the client, and you can define your own NRPE commands in /etc/nagios/nrpe_local.cfg. We've created a couple of commands that look like this:
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p $ARG1$
command[check_smtp]=/usr/lib/nagios/plugins/check_smtp -w 1 -c 2
...as this means that we can check whichever local disk we want, rather than being restricted to / as with the default check_disk command; and can make sure that SMTP is running happily on all machines. You can create other commands in a very similar way. (One thing to watch for: the check_disk command warns on percentage of disk space free, rather than percentage used.)
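As one more sketch along the same lines, here is a swap-space check. The command name, thresholds and service description are our own choices for illustration; check_swap, like check_disk, takes its thresholds as percentage free:

```
# On the client, alongside your other NRPE commands:
# warn below 20% free swap, critical below 10%
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 20% -c 10%

# On the Nagios server, in conf.d/services_nagios2.cfg:
define service {
service_description SWAP
use generic-service
hostgroup_name nrpe
check_command check_nrpe_1arg!check_swap
}
```

Because this command takes no arguments from the server, it works through check_nrpe_1arg and doesn't need dont_blame_nrpe enabled.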
So, that's your first very basic Nagios setup done - checking itself and an external client, and alerting you to any problems. Adding extra hosts and services is straightforward, and you can check out the plugin directory if you want to do more. Running /usr/lib/nagios/plugins/plugin_name -h will get you help output for that plugin. For now though, sit back and enjoy as your network monitors itself!
First published in Linux Format magazine
Copyright 2012 Future Publishing Limited (company
registered number 2008885), a company registered
in England and Wales whose registered office is at
Beauford Court, 30 Monmouth Street, Bath, BA1 2BW, UK
Your comments
Anonymous Penguin (not verified) - March 31, 2009 @ 6:41pm
Great tutorial, thank you.
Nagios is awesome for system monitoring.
Anonymous Penguin (not verified) - April 11, 2009 @ 12:01am
The tutorial is great. I would recommend deploying the solution within your enterprise to see how it works. I have been configuring Nagios for a major retailer. Thus far I have configured over 400 servers and 2800 services.
Nagios is straightforward in appearance, so you will need to think outside the box for solutions. However, due to the simplicity of Nagios almost any solution will work.
Mike Kniaziewicz, MIS
Anonymous Penguin (not verified) - May 20, 2010 @ 10:20am
Thanks. It seems, though, that the -- -f trick doesn't work on FreeBSD. It's crazy that it's this hard to change the From address in Nagios, and crazy that the FreeBSD mail binary has no way of changing the From address. Exim itself can change it, but the exim binary doesn't allow the subject to be set, so I can't use that for Nagios.