Nagios monitoring and exception reporting on a myriad of systems and their variables
The most basic setup
Every time I install Nagios in a new environment (three times so far), it begins as an effort to capture the most basic information and fulfil a best-practice requirement. A basic howto covers "is this server, and some of its services, UP". A basic setup provides a false sense of security.
Ah, actually UP doesn't mean "OK"
Almost as soon as the first fault occurs that Nagios does not pick up (say, an email is not delivered or a website is "down"), someone realises that basic monitoring of a host with ICMP, or of a service with a TCP connect, is not sufficient and is close to pointless. Users don't care that the service is there; it needs to work.
Any monitoring needs to be done as the service is consumed.
Beyond TCP Connect AKA check_tcp
- RDP
check_x224 - this neatly checks that, beyond a TCP connection, there is something on the remote end that can hold a basic X.224 conversation, and it allows you to specify a latency threshold. This has been a good indication that a Windows host is alive and that RDP access is usable, in that it both works and is interactive. If you have a number of Windows hosts behind a firewall, RDP can easily be forwarded on non-standard ports, allowing you to monitor all of them without installing NRPE.
Even this is not really good enough!
Checking through the service end to end
Websites (Apache)
Apache (or any webserver) is rarely the cause of an issue, since it is seldom deployed differently from the way it has been extensively tested by the millions of websites that use it.
The webserver is, however, the visible front end of a number of separate points of failure...
- Links and routing. The webserver might serve content that comes from any number of places. The time taken to render and serve it, measured against an acceptable latency, is a good metric.
- Dynamic content is the core of most websites. It is often bespoke per site, and is thus a more likely cause of failure than the webserver.
- Database access is necessary for the content of many websites, and content management systems (like WordPress) are intolerant of a lack of database access.
To monitor a system set up like this properly and concisely, check_http can be used to ensure that a 200 response is returned for a specific URL, containing known text that is served from the database, within an acceptable timeout. That should be confirmation that every layer is working, and working acceptably.
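The three conditions in that check (status, known text, response time) can be sketched in plain shell. This is not check_http itself, only an illustration of the same logic: the function name evaluate_http, the URL and the thresholds are all assumptions, and curl is used here purely to fetch the page.

```shell
#!/bin/sh
# Sketch only: evaluate one HTTP fetch the way an end-to-end check would.
# evaluate_http is a hypothetical helper, not part of Nagios or its plugins.
# Arguments: STATUS SECONDS BODY_FILE EXPECTED_TEXT MAX_SECONDS
evaluate_http() {
    status=$1
    seconds=$2
    body_file=$3
    expected=$4
    max_seconds=$5

    # Layer 1: the webserver and application answered successfully
    if [ "$status" != "200" ]; then
        echo "CRITICAL: HTTP status $status"
        return 2
    fi
    # Layer 2: the page contains text that can only come from the database
    if ! grep -qF -- "$expected" "$body_file"; then
        echo "CRITICAL: expected text not found"
        return 2
    fi
    # Layer 3: it arrived quickly enough (awk, because sh arithmetic is integer-only)
    if awk -v t="$seconds" -v m="$max_seconds" 'BEGIN { exit !(t > m) }'; then
        echo "WARNING: response took ${seconds}s|time=${seconds}s"
        return 1
    fi
    echo "OK: HTTP 200 with expected text in ${seconds}s|time=${seconds}s"
    return 0
}

# Usage (hypothetical URL and text):
#   body=$(mktemp)
#   set -- $(curl -s -m 10 -o "$body" -w '%{http_code} %{time_total}' "http://www.example.com/status.php")
#   evaluate_http "$1" "$2" "$body" "known text from the database" 2
```

The stock plugin covers the same ground with its own options, along the lines of check_http -H host -u /status.php -s "known text" -w 2 -c 5, so a wrapper like this is only worth having where the plugin is not available.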
SIP Trunks (Asterisk)
check_ast_reg - has anyone noticed this showing
check_ast_reg UNKNOWN - Error in communication to asterisk
when there clearly isn't an issue and tcpdump shows that the correct data is returned? Somewhere in AMI.pm or in the AnyEvent Perl library the response from the Asterisk AMI is being lost or truncated, and the on_error handler is triggered.
I've re-written this in PHP (check_ast_reg) as a drop-in replacement (for my basic use case) without the need for external libraries, and I no longer have the problem.
check_ast_reg!user:host!passwd
TO DO: Place a call from a SIP client on the Nagios server to a "weasels" message and have Nagios check to see that the call is connected and in progress (say) with
core show channels
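As a starting point for that TO DO, the check side could look something like the sketch below. It assumes the output of core show channels ends with a summary line of the form "N active call" or "N active calls", and that the test call is the only traffic on the box; the function names are made up for illustration and the call placement itself is not covered.

```shell
#!/bin/sh
# Sketch only: decide from "core show channels" output whether a call is in progress.
# Assumes a summary line such as "1 active call" / "2 active calls" in the output.

# Print the number from the "N active call(s)" summary line, or nothing if absent
active_calls() {
    sed -n 's/^\([0-9][0-9]*\) active call.*/\1/p'
}

# Read "core show channels" output on stdin and report in Nagios plugin style
check_call_in_progress() {
    calls=$(active_calls)
    case $calls in
        '') echo "UNKNOWN: no active call summary in output"; return 3 ;;
        0)  echo "CRITICAL: no call in progress|calls=0"; return 2 ;;
        *)  echo "OK: $calls active call(s)|calls=$calls"; return 0 ;;
    esac
}

# Usage, on a host where the asterisk CLI is available:
#   asterisk -rx "core show channels" | check_call_in_progress
```

On a busy system a plain call count would not prove that the test call is the one connected, so matching on the channel or the dialled extension would be the next refinement.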
Email (Sendmail, Dovecot, Exchange, whatever)
Despite the ubiquity of social media, email is still an important form of contact. Non-delivery due to misconfiguration, blacklisting or load is, for all intents and purposes, as bad as an outage in users' eyes.
check_mailq - if there are too many messages in the queue, some endpoint is probably not getting its email.
check_email_delivery - this plugin will send an email to a mailbox (via SMTP) and then confirm (via IMAP) that it has arrived.
check_smtp_send & check_imap_receive - it may seem somewhat pointless to run these checks, since they are implied by check_email_delivery; however, there is still a gap. Being able to receive external email is a good test, but being able to send it via your smarthost and confirm that it has arrived in an INBOX at an external service (like Gmail) is a good test too.
in ./objects/hostfile.cfg
define service{
        use                     local-service
        host_name               A.B.C.D
        service_description     email_smtp_toGmail
        check_command           check_smtp_send!60!120!nagios@yourcorrectlysetupdomain!SMTPAuthPasswd
        }
define service{
        use                     slow-service
        host_name               A.B.C.D
        service_description     email_imap_fromGmail
        check_command           check_imap_receive!60!120
        }
in ./objects/commands.cfg
define command{
        command_name    check_smtp_send
        command_line    $USER1$/check_smtp_send -H $HOSTADDRESS$ --mailto YourAccount@gmail.com --mailfrom $ARG3$ --notls --nossl --auth PLAIN -U $ARG3$ -P $ARG4$ --body '$ARG3$ to YourAccount@gmail.com test' --header 'Subject:
        }
check_rbl - to warn if/when the smarthost is added to an RBL
MySQL Replication
Some people insist on checking the output of show slave status. Much like checking for a TCP ACK on a port, this really isn't good evidence that the service is working. I learned from a DBM in the 90s to either count the records in a frequently changing table, or sum the IDs. If there is a difference, you have a problem.
This check can be run anywhere there is a mysql CLI with access to both master and slave.
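in ./check_mysql_replication
#!/bin/sh
# Compare the same aggregate on master and slave; any difference means replication has drifted
exitstatus=2
QUERY="select sum(NASPortId) from radius.radacct;"
M=`echo "$QUERY" | mysql -N -u radius -ppasswd -h master`
S=`echo "$QUERY" | mysql -N -u radius -ppasswd -h slave`
R=`echo "$M-$S" | bc`
# Quote $R and use = so the test works in plain sh, and fails safely if a query returns nothing
if [ "$R" = "0" ]; then
    echo "OK, MySQL SELECT Delta is $R|delta=$R"
    exitstatus=0
else
    echo "CRITICAL: MySQL SELECT Delta is $R|delta=$R"
fi
# Hand the result back to Nagios as the exit code
exit $exitstatus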
Backups
It is a universal truth that backups are always the last thought. In early 2014 I worked on a project where ensuring that automated backups were current was a requirement of the DR plan, for which we were responsible.
check_file_age returns age and size. The check passes if the file meets a minimum size and a maximum age. I'd used this, passing in the directory of the backups, since the directory carries the last write as its timestamp. I ran into two issues with that:
- If only 1 of n backups succeeds the directory carries the new timestamp and thus the check incorrectly passes.
- If n of n empty backup files are written the check incorrectly passes.
I modified the remote script to return the total size and the time of the oldest file.
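A minimal sketch of that idea follows. It is not the actual remote script: the function name check_backup_dir, the thresholds and the path in the usage line are assumptions, and it relies on GNU find and stat (Linux). Because it takes the total size and the age of the oldest file, a single fresh backup no longer hides stale ones, and a set of empty files no longer passes.

```shell
#!/bin/sh
# Sketch only: check a backup directory by total size and by the age of its OLDEST file.
# check_backup_dir is a hypothetical name; assumes GNU find/stat.
# Arguments: DIRECTORY MIN_TOTAL_BYTES MAX_AGE_SECONDS
check_backup_dir() {
    dir=$1
    min_bytes=$2
    max_age=$3

    now=$(date +%s)
    # One line per file: "<size in bytes> <mtime as epoch seconds>"
    stats=$(find "$dir" -type f -exec stat -c '%s %Y' {} +)

    if [ -z "$stats" ]; then
        echo "CRITICAL: no backup files in $dir"
        return 2
    fi

    # Sum of all sizes; printf avoids exponent notation for large totals
    total=$(echo "$stats" | awk '{ s += $1 } END { printf "%.0f\n", s }')
    # Smallest mtime, i.e. the file written longest ago
    oldest=$(echo "$stats" | awk 'NR == 1 || $2 < m { m = $2 } END { printf "%.0f\n", m }')
    age=$((now - oldest))

    if [ "$total" -lt "$min_bytes" ] || [ "$age" -gt "$max_age" ]; then
        echo "CRITICAL: total=${total}B oldest=${age}s|size=${total}B age=${age}s"
        return 2
    fi
    echo "OK: total=${total}B oldest=${age}s|size=${total}B age=${age}s"
    return 0
}

# Usage (hypothetical path; at least 1 MB in total, nothing older than 26 hours):
#   check_backup_dir /var/backups/nightly 1048576 93600
```

A per-file minimum size would be stricter still, since one large backup can mask an empty one in the total.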