Nagios made easy

In depth: What's the best way to monitor multiple Linux servers for configuration errors, high load or other problems? The answer is Nagios, a fantastic (and free!) network monitoring system that lets you track multiple services (HTTP, SMTP, SSH and more) across multiple machines, all backed by a neat user interface.

Nagios gives you an unbeatable overview of all your machines, meaning that you can fix upcoming problems before they turn critical and be certain that you're not missing anything about your network. The basic structure of Nagios is pretty simple: you set up one machine as your Nagios server, and it gathers information on the client machines you point it at, then displays it in a neat web page format. Read on to learn how to get started with Nagios on your own network!

You can get basic service information on the client machines (whether they're pingable, whether SSH is OK, whether HTTP is OK) without needing to install any software on them; the server alone will do the work. Or you can get centralised reports on further information (such as disk space or CPU usage), if you install client-side software. The server can react to alert situations by sending an email or text message to alert a human, or by doing more or less anything else that you can write a script to make happen.

It's incredibly versatile, configurable, and powerful. The flipside to this, inevitably, is that it can also be difficult to set up correctly. This tutorial will walk you through getting the central server and a single client machine set up (all other client machines can be set up exactly the same) and will touch briefly on the other possibilities you can investigate once you have it running. It's complicated, but it's worth the effort you put in to get it going, as once Nagios is running it needs very little maintenance.

Initial server setup

First of all we'll tackle getting the central Nagios server set up, and enable reporting via the web front-end. You can install from source from www.nagios.org, or there should be a package available for your system. For Debian, this is the nagios2 package (not nagios, which is the old version), and you should also install nagios-plugins (we'll talk about plugins later). On setup, it'll ask you for the admin password - remember this! - and whether you want version 1 backwards compatibility (there's no need for this if you're performing a new setup).

You'll also need to install apache2 if you don't already have it installed (we're not going to cover this in any detail here). Once both Nagios and Apache are installed, go to /etc/nagios2 and copy the config sections from the apache2.conf file into the config file belonging to your installation of Apache.

This will provide a basic configuration that should work with the default install to get the web reporting working. If you've personalised your Apache or Nagios install or moved any files around, you'll need to check that the various directives and aliases are pointing to the correct places. Restart Apache, and that's the web side of things set up. Of course, now you need to get the server set up so that there's information being delivered to that webpage.

Note: from here on in, assume that all files or directories live in /etc/nagios2 unless otherwise stated.

The Debian configuration setup - which we'd recommend as an approach worth taking even if you're installing on another system or from source - is to create a conf.d directory in which the majority of the config files are put. Within this directory, configuration can be broken out into separate files as much as you like. We find this far more manageable than using one enormous file, although you can do that if you prefer!

The Nagios screen from the basic initial install.

Within the main config file, nagios.cfg, you refer to this config file directory with this line:

cfg_dir=/etc/nagios2/conf.d

You can add any other directories with a similar line - use as many as you like.

For the basic server setup, monitoring only the server itself (localhost), from a package install, most of the default settings should be OK. All you need to do is to edit the conf.d/contacts_nagios2.cfg file to put your email address in under the first defined contact.

This means that if an alert is triggered, Nagios will send you an email. Before you start Nagios though, take a look at the defaults defined for the localhost in conf.d/localhost_nagios2.cfg. The host definition and the first service definition should look a bit like this:

define host{
 use  generic-host 
 host_name localhost
 alias localhost
 address 127.0.0.1
}
# Define a service to check the disk space of the root partition
# Warning if < 20% free, critical if < 10% free space on partition.
define service{
 use   generic-service  ; Name of service template to use
 host_name  localhost 
 service_description Disk Space
 check_command check_all_disks!20%!10%
}

After that there will be another couple of service definitions.

Templates

The use keyword in that code snippet refers to a very useful function of Nagios: the ability to set up templates. Templates mean that you can set your service or host defaults in a single place rather than having to retype them constantly. Config reuse, like any other code reuse, is always a better bet - less hassle, more maintainability!

The generic-host template lives in the conf.d/generic-host_nagios2.cfg file. It sets a bunch of defaults, including notification enablement, various aspects of dealing with notifications and events, and so on. However, in the individual host definition you can override any of these. generic-service works as a per-service (rather than per-host) template in exactly the same way (check out conf.d/generic-service_nagios2.cfg to see what this looks like and what defaults are set).

The check_command keyword refers to a command found in /etc/nagios-plugins/config/disk.cfg. This plugin directory contains all the commands for checking various services, and there are an enormous number of plugins that you can add, on top of the ones that a default package install will give you (if installing from source you may need to get more of these by hand).
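
The exclamation marks in a check_command value separate the command name from its arguments, which Nagios substitutes for $ARG1$, $ARG2$ and so on in the command's command_line. As a rough illustration only (split_check_command is a name made up for this sketch, not part of Nagios), this shell snippet splits a value in the same way:

```shell
#!/bin/sh
# Split a check_command value into its command name and its arguments.
# Usage: split_check_command 'name!arg1!arg2'
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -f              # no filename globbing while we split
    set -- $1           # word-split the value on '!'
    set +f
    IFS=$old_ifs
    printf '%s\n' "command_name=$1"
    shift
    n=1
    for arg in "$@"; do
        printf '%s\n' "ARG$n=$arg"
        n=$((n + 1))
    done
}

split_check_command 'check_all_disks!20%!10%'
# prints:
#   command_name=check_all_disks
#   ARG1=20%
#   ARG2=10%
```

So in the Disk Space service above, 20% ends up as the first argument and 10% as the second.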

For the moment, leave our basic install's defaults as they are. We'll add another host to the various groups in a moment. Restart Nagios, and go look at http://server.example.com/nagios2 (you'll be asked for that nagiosadmin password that you set at install). You'll get a welcome page - click on 'Tactical Overview' in the left-hand menu, and the report screen will come up, as shown in the screenshot of the basic initial install.

The tactical overview shows the hosts, services, and their states. For Nagios, hosts and services can be in one of several states. Pending means that it's not done checking yet. OK means, unsurprisingly, that all is well. Then there are various levels of not-OK state: yellow is Warning, and red is Critical. You can set these levels yourself in the config files. If you click on a service, you'll get a detailed page up of information about that service (or host).

Change the front page

To make the front page of the Nagios web interface show the tactical overview rather than a home page, edit /usr/share/nagios2/htdocs/index.html and replace this line:

<FRAME SRC="main.html" NAME="main">

with this one:

<FRAME SRC="/cgi-bin/nagios2/tac.cgi" NAME="main">

Hosts and services

So far all Nagios is doing is monitoring itself and the default gateway that the packaged setup includes. Next, you need to add a host (client). The best way to do this is to create a conf.d/host-client1.cfg file, named for your client machine, which should initially look a bit like this:

define host{
 use  generic-host  
 host_name client1
 address 10.0.0.2
}

As you can see, this can be very basic - almost all of the information is taken from the host template. Reload Nagios (/etc/init.d/nagios2 reload), then give it a minute or two to run the various checks before you check out the web page.

Hostgroups, and defining services

Nagios can see your client now, but right at the moment you don't have any actual checks defined for this host. You could add them individually to the host config file - in the same way that checks are defined in the localhost config file that we looked at initially. However, it would be a lot of hassle to have to go through for every host.

As with the templates, you want to type things once only. This means using host groups, which enable you to define a service per hostgroup rather than per host.

Edit conf.d/hostgroups_nagios2.cfg to add the short name of the host to the relevant groups. You can, if you want to, create a new group. For example, we have this group setup:

define hostgroup {
 hostgroup_name debian-servers
 alias  Debian GNU/Linux Servers
 members  localhost,webserver,ldapserver
}

There are many more options than this that you can set for the host group - the Nagios documentation page is very helpful. However, the minimal setup above will do the job just fine.

You now have your client host, and you've added it to a host group - let's call it debian-servers, as per the code snippet above. Next, edit the conf.d/services_nagios2.cfg file to set up a couple of service checks for that host group. Let's check for pingability and for SSH connectability:

define service {
 hostgroup_name   debian-servers
 service_description  SSH
 check_command   check_ssh
 use    generic-service
 notification_interval  0 
}

define service {
 hostgroup_name   debian-servers
 service_description  PING
 check_command   check_ping!100.0,20%!500.0,60%
 use    generic-service
 notification_interval  0 
}

You will probably want to define a group of machines that you expect to be SSH-accessible, and a group that you expect to be ping-accessible. For example, the default setup has the gateway pingable but you don't necessarily expect it to be SSH-accessible.

But here we'll just use the single group, as we currently only have one client machine. Nagios will now check the machines in those groups, for the services in the check_command statement, with the generic-service settings, and complain if a problem is found.
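
The two check_ping arguments in the PING service are the warning and critical thresholds, each written as a round-trip average in milliseconds followed by a packet loss percentage. Here is a sketch of how such a pair could be evaluated. ping_state is a hypothetical helper, and treating a value equal to the threshold as a breach is our assumption; the real logic lives inside the check_ping plugin.

```shell
#!/bin/sh
# Classify a ping result against the thresholds used in the PING service:
# warning at 100.0 ms or 20% loss, critical at 500.0 ms or 60% loss.
# Usage: ping_state <round-trip average in ms> <packet loss in %>
ping_state() {
    awk -v rta="$1" -v pl="$2" 'BEGIN {
        rta += 0; pl += 0    # force numeric comparison
        if (rta >= 500.0 || pl >= 60)      print "CRITICAL"
        else if (rta >= 100.0 || pl >= 20) print "WARNING"
        else                               print "OK"
    }'
}

ping_state 12.5 0    # prints OK
ping_state 150 0     # prints WARNING
ping_state 20 75     # prints CRITICAL
```

Either measurement on its own is enough to trip a threshold, which is why a fast link that drops most of its packets still comes out as critical.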

Here's the feedback from Nagios showing a happy service...

...and here's a less happy service, in a warning state.

Checking multiple websites

If you just want to check that a webserver is responding to HTTP requests, the existing check_http command is fine. However, you may have more than one domain running on a single server and want to check them all separately. To do this, first edit commands.cfg and add the following lines:

define command{
 command_name check_http-website1
 command_line /usr/lib/nagios/plugins/check_http -H website1.example.com
}

Create similar commands for as many websites as you have. Then edit the webserver's host config file (eg conf.d/host-webserver.cfg) to include a service for each command:

define service{
 host_name  webserver
 service_description website1
 check_command  check_http-website1
 use   generic-service
 notification_interval 1440
}
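
If you have more than a handful of sites, typing these pairs out by hand gets tedious. A minimal sketch of a generator follows. website_checks is a made-up helper, the domains are placeholders, and numbering the sites website1, website2 and so on is just our choice. It prints the definitions, which you could redirect into a .cfg file under conf.d.

```shell
#!/bin/sh
# Print a command and a matching service definition for each website.
# Usage: website_checks <host_name> <website>...
website_checks() {
    host=$1
    shift
    n=1
    for site in "$@"; do
        cat <<EOF
define command{
 command_name check_http-website$n
 command_line /usr/lib/nagios/plugins/check_http -H $site
}
define service{
 host_name  $host
 service_description website$n
 check_command  check_http-website$n
 use   generic-service
 notification_interval 1440
}
EOF
        n=$((n + 1))
    done
}

# Hypothetical domains; replace them with your own.
website_checks webserver website1.example.com website2.example.com
```

Nagios reads every .cfg file in a cfg_dir directory, so it doesn't matter that the commands and services end up in the same file here.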

Alerts

Now you have Nagios set up so you can monitor your systems from a centralised web page. Now we want to set up an email alert, so that you get told if anything's wrong, without having to go looking. You've already set up your contacts_nagios2.cfg file above.

This file also defines an 'admins' contact group. It's better to use contact groups than individual users, as again this increases maintainability. If personnel change, you need change only the group membership, not the references in all the other config files. By default, admins contains only the root user, which we set up already, so we'll stick with that.

The generic service definition is once again your friend. Our preferred default is for everything to drop us an email if it goes wrong in some way, as we check our email pretty often and certainly more often than we would remember to hit a web page. So that's what we'll set up here. In the conf.d/generic-service_nagios2.cfg file, add the following to the service definition:

notification_interval  1440 
is_volatile   0
check_period   24x7
normal_check_interval  5
retry_check_interval  1
max_check_attempts  10
notification_period  24x7
notification_options  c,r
contact_groups   admins

The notification_interval defines how often you get reminded about an ongoing problem (in minutes) - here it's every 24 hours. Time periods are defined in conf.d/timeperiods_nagios2.cfg. check_period defines when the service is checked - here, all the time.

The normal_check_interval and retry_check_interval are in minutes: the service here is checked every five minutes, but once a check fails, rechecks happen every minute. Nagios makes up to ten check attempts in total (max_check_attempts) before concluding that there's something wrong with the service, though you can of course reduce this number if you prefer.
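
To get a feel for how long it takes before an alert goes out with these settings, here is a back-of-the-envelope calculation. It assumes the default interval_length of 60 seconds (so one unit is one minute) and ignores check execution time and scheduling jitter, so treat the result as a rough guide rather than a guarantee.

```shell
#!/bin/sh
# Values from the generic-service settings (all in minutes).
normal_check_interval=5
retry_check_interval=1
max_check_attempts=10

# The first failed check counts as attempt 1, so there are
# (max_check_attempts - 1) rechecks before the problem is confirmed.
confirm_delay=$(( (max_check_attempts - 1) * retry_check_interval ))

# A fault can sit unnoticed for up to one normal interval before
# that first failed check happens.
worst_case=$(( normal_check_interval + confirm_delay ))

echo "Confirmed $confirm_delay minutes after the first failed check"
echo "Worst case from fault to alert: about $worst_case minutes"
```

With the values above that works out at 9 minutes and about 14 minutes respectively. Lowering max_check_attempts shortens the wait, at the cost of more alerts for brief blips.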

The notification_period sets when alerts should be sent - again, we've set this to all the time - and notification_options sets which events trigger an alert. For hosts, d = notify on down states; u = notify on unreachable states; r = notify on host recoveries; and f = notify when host starts and stops flapping. For services, w = notify on warning states; u = unknown states; c = critical states; and again r = recovery and f = start/stop of flapping. Here, c,r means you're notified when a service goes critical and when it recovers. Finally, contact_groups defines who to contact when a notification is required.

Once you have all that in place, reload Nagios, then try turning off SSH on your test client. You should receive a message to the address you set in the contacts file, telling you that the client is not SSH accessible. Turn it back on, and you'll get another alert, telling you it's OK again.

Configuring the From line

The default From: line in the email alerts is the Nagios user - this may not be good if you have a mail server that wants a registered address before it will send. If you're using Exim 4, you need to set the 'untrusted user' option, and then add

-- -f address@example.com

to the end of the host-notify-by-email and notify-by-email commands in the commands.cfg file.

Plugins

So, now you have a basic Nagios system up and running, and you have it set up to be easy to add further hosts and services at the same fairly basic level. However, there's a lot more that it can do.

As an example, let's look at the plugin that enables you to monitor disk usage, CPU usage, and other similar things on remote hosts. At present, on your remote/client hosts, you can only monitor whether or not they're up. Ideally you want more than that - you want to know if they're about to run out of disk space, for example, or if the mail delivery is down.

What you want for this is the NRPE plugin. Install the NRPE plugin on your Nagios server (it's the nagios-nrpe-plugin package on Debian), and install the NRPE server on your remote host/client (the nagios-nrpe-server package on Debian). The NRPE server will collect information from the machine, and pass it on to the plugin when it is contacted by the main server.

Check it's working...

To test the communication between server and client, run

/usr/lib/nagios/plugins/check_nrpe -H client -c check_users 

...on the server, and it should tell you how many users are logged in on the client. Next, check /etc/nagios-plugins/config/check_nrpe.cfg on the server, if necessary - on Debian this is already set up, so no need to edit it. It should look like this:

define command {
 command_name check_nrpe
 command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
# this command runs a program $ARG1$ with only one argument
define command {
 command_name check_nrpe_1arg
 command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Edit conf.d/services_nagios2.cfg on the Nagios server to add the services you want to monitor on the remote host. Our configuration looks like this:

define service {
 service_description SMTP
 use   generic-service
 hostgroup_name  nrpe
 check_command  check_nrpe_1arg!check_smtp
}
define service {
 service_description LOAD
 use   generic-service
 hostgroup_name  nrpe
 check_command  check_nrpe_1arg!check_load
}
define service {
 service_description DISK
 use   generic-service
 hostgroup_name  nrpe
 check_command  check_nrpe!check_disk!/
}

(Note that this requires an nrpe hostgroup to be set up.)
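
One way to produce that hostgroup is sketched below. make_hostgroup is a hypothetical helper and the alias text is our own invention; the output follows the same layout as the debian-servers group earlier and can be pasted into conf.d/hostgroups_nagios2.cfg on the server.

```shell
#!/bin/sh
# Print a hostgroup definition.
# Usage: make_hostgroup <name> <alias> <member>...
make_hostgroup() {
    name=$1
    alias_text=$2
    shift 2
    # Join the remaining arguments with commas for the members line.
    members=$(IFS=,; printf '%s' "$*")
    cat <<EOF
define hostgroup {
 hostgroup_name $name
 alias  $alias_text
 members  $members
}
EOF
}

make_hostgroup nrpe "NRPE-monitored hosts" client1
```

Add further client names as extra arguments as you install the NRPE server on more machines.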

Commands that take only one argument - the service to check - use the check_nrpe_1arg command (see /etc/nagios-plugins/config/check_nrpe.cfg). If you want to give further arguments, you need to edit /etc/nagios/nrpe.cfg on the client machine so that the dont_blame_nrpe keyword has the value 1 (restart the NRPE server afterwards so that it picks up the change). Then use the check_nrpe command.

In the code above, we use it to pass the disk mount point to be checked. Restart Nagios, and you should be able to start seeing information from your client.

The available plugins live in /usr/lib/nagios/plugins on the client - or you can create your own commands in /etc/nagios/nrpe-local.cfg. We've created a couple of commands that look like this:

command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p $ARG1$
command[check_smtp]=/usr/lib/nagios/plugins/check_smtp -w 1 -c 2

...as this means that we can check whichever local disk we want, rather than being restricted to / as with the default check_disk command; and can make sure that SMTP is running happily on all machines. You can create other commands in a very similar way. (One thing to watch for: the check_disk command warns on percentage of disk space free, rather than percentage used.)
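
To make the free-versus-used point concrete, here is a small sketch of how the -w 10% -c 5% thresholds map onto states. disk_state is a made-up function and its output text is only illustrative, not what check_disk really prints. The exit codes do follow the usual plugin convention: 0 for OK, 1 for WARNING, 2 for CRITICAL (and 3 for UNKNOWN, not used here).

```shell
#!/bin/sh
# Map a free-space percentage to a plugin-style status, using the
# "-w 10% -c 5%" thresholds: the numbers refer to space FREE, not used.
# Usage: disk_state <whole-number percent free>
disk_state() {
    free=$1
    if [ "$free" -lt 5 ]; then
        echo "DISK CRITICAL - $free% free"
        return 2
    elif [ "$free" -lt 10 ]; then
        echo "DISK WARNING - $free% free"
        return 1
    fi
    echo "DISK OK - $free% free"
    return 0
}

disk_state 40                          # OK: plenty of space left
disk_state 8 || echo "exit code $?"    # WARNING, then "exit code 1"
```

Nagios decides the state of a service from the plugin's exit code, so any script that follows this convention can be used as a check.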

So, that's your first very basic Nagios setup done - checking itself and an external client, and alerting you to any problems. Adding extra hosts and services is straightforward, and you can check out the plugin directory if you want to do more. Running /usr/lib/nagios/plugins/plugin_name -h will get you help output for that plugin. For now though, sit back and enjoy as your network monitors itself!

First published in Linux Format

Your comments

Great tutorial, thank you.

Nagios is awesome for system monitoring

The tutorial is great. I would recommend deploying the solution with your enterprise to see how it works. I have been configuring Nagios for a major retailer. Thus far I have configured over 400 servers and 2800 services.

Nagios is straightforward in appearance, so you will need to think outside the box for solutions. However, due to the simplicity of Nagios almost any solution will work.

Mike Kniaziewicz, MIS

thanks, it seems tho the -- -f trick doesn't work on FreeBSD. Crazy it's this hard to change the from address on Nagios, and crazy that the FreeBSD mail binary has no way of changing the from address. Exim itself can change it, but the exim binary doesn't allow the subject to be set, so I cannot use that for Nagios.
