Nagios monitoring and exception reporting on a myriad of systems and their variables
The most basic setup
Every time I install Nagios in a new environment (3 times so far) it begins as an effort to know the most basic information and fulfill a best practice requirement. A basic howto covers "is this server and some of its services UP". A basic setup provides a false sense of security.
Ah, actually UP doesn't mean "OK"
Almost as soon as the first fault occurs that Nagios does not pick up (say, an email is not delivered), someone realises that basic monitoring of a host with ICMP or a service with a TCP connect is really not sufficient and basically pointless. Users don't care that the service is there; they care that it works.
Beyond TCP Connect AKA check_tcp
I've never yet actually had an issue with Apache, so whilst in theory fetching a specific URL and checking for the existence of specific text seems to be a fix, I've never bothered to implement this.
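For anyone who does want the URL-and-text check, here is a minimal sketch of the idea rather than something I run in production. The helper name check_body, the URL and the marker text are all placeholders I've made up for illustration; the function just reads a page body on stdin and returns the usual plugin exit codes (0 for OK, 2 for CRITICAL).

```shell
# Hypothetical helper (not a stock Nagios plugin): read a page body on
# stdin and look for a fixed marker string.
# Prints a Nagios-style status line; returns 0 (OK) or 2 (CRITICAL).
check_body() {
    expected="$1"
    # -F: treat the marker as a literal string, -q: no output, just a status
    if grep -qF -- "$expected"; then
        echo "OK: found '$expected'"
        return 0
    else
        echo "CRITICAL: '$expected' not found"
        return 2
    fi
}

# Usage (placeholder URL and text). If the fetch fails the body is empty,
# so the check goes CRITICAL as well:
#   curl -fsS --max-time 10 "http://www.example.com/" | check_body "Welcome"
```

The stock check_http plugin can do much the same with its -u (URL) and -s (expected string) options, so this is only worth having where you want to control the fetch yourself.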
check_x224 - this fairly neatly checks that, beyond a TCP connection, there is something on the remote end that can hold a basic X.224 conversation, and it allows you to specify a latency. This has been a good indication that a Windows host is alive and that RDP access is usable, both in that it works and that it is interactive. If you have a number of Windows hosts behind a firewall, RDP can easily be forwarded on non-standard ports, allowing you to monitor all of them without installing NRPE.
Checking through the service end to end
These are my notes on Nagios (or general system monitoring) best practice, with specific examples.
Asterisk - SIP Trunks
check_ast_reg - has anyone noticed this showing
check_ast_reg UNKNOWN - Error in communication to asterisk
when there clearly isn't an issue and tcpdump shows the correct data is returned? Somewhere in AMI.pm or in the AnyEvent perl library the response from Asterisk AMI is being lost/truncated and the on_error is triggered.
I've re-written this in PHP (check_ast_reg) as a drop in replacement (for my basic use case) without the need for external libraries and I no longer have the problem.
TO DO: Place a call from a SIP client on the Nagios server to a "weasels" message and have Nagios check to see that the call is connected and in progress (say) with
core show channels
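A rough sketch of how the checking half of that TO DO might look, with the caveat that I haven't built it. It assumes the output of "core show channels" ends with a summary line of the form "N active call" or "N active calls"; the helper name check_active_calls is made up, and placing the test call itself is not covered here.

```shell
# Hypothetical helper: read "core show channels" output on stdin and
# report whether at least one call is in progress.
# Assumes a summary line such as "1 active call" / "2 active calls".
check_active_calls() {
    # Pick the number from the first "<N> active call(s)" line
    calls=$(awk '$2 == "active" && ($3 == "call" || $3 == "calls") { print $1; exit }')
    case "$calls" in
        ''|*[!0-9]*)
            echo "UNKNOWN: could not parse active call count"
            return 3
            ;;
    esac
    if [ "$calls" -ge 1 ]; then
        echo "OK: $calls active call(s)|calls=$calls"
        return 0
    fi
    echo "CRITICAL: no active calls|calls=0"
    return 2
}

# Usage on the Asterisk host (or via NRPE):
#   asterisk -rx "core show channels" | check_active_calls
```

Returning UNKNOWN when the summary line can't be found keeps a parsing problem from being mistaken for a dead trunk.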
Email
Despite the ubiquity of Social Media, email is still an important form of contact. Non-delivery due to misconfiguration, blacklisting or load is, for all intents and purposes, as bad as a failure in users' eyes.
check_mailq - if there are too many messages in the queue, some endpoint is probably not getting its email.
check_email_delivery - this plugin will send an email to a mailbox (via SMTP) and then confirm (via IMAP) that it's arrived.
check_smtp_send & check_imap_receive - it may seem somewhat pointless running these checks since they are implied in check_email_delivery; however, there is still a gap. Being able to receive external email is a good test, but being able to send it via your smarthost and confirm it has arrived in an INBOX at an external service (like Gmail) is also a good test.
check_rbl - to warn if/when the smarthost is added to an RBL
command_line $USER1$/check_smtp_send -H $HOSTADDRESS$ --mailto YourAccount@gmail.com --mailfrom $ARG3$ --notls --nossl --auth PLAIN -U $ARG3$ -P $ARG4$ --body '$ARG3$ to YourAccount@gmail.com test' --header 'Subject:
MySQL Replication
Some people insist on checking the output of show slave status. Much like checking a TCP port ACK, this really isn't good evidence that the service is working. I learned from a DBM in the 90s to either count the records in a frequently changing table, or sum the IDs. If there is a difference, you have a problem.
This check can be done anywhere there is a mysql CLI that has access to both master and slave.
# Sum the same column on master and slave; any difference means replication has drifted.
M=`echo "select sum(NASPortId) from radius.radacct;" | mysql -N -u radius -ppasswd -h master`
S=`echo "select sum(NASPortId) from radius.radacct;" | mysql -N -u radius -ppasswd -h slave`
R=`echo "$M-$S" | bc`
if [ "$R" = "0" ]; then
echo "OK, MySQL SELECT Delta is $R|delta=$R"
exit 0
else
echo "CRITICAL: MySQL SELECT Delta is $R|delta=$R"
exit 2
fi
Backups
As is universally true, backups are always the last thought. In early 2014 I worked on a project where ensuring automated backups were current was a requirement of the DR plan, which we were responsible for.
check_file_age returns age and size. The check passes if the file meets a minimum size and maximum age. I'd used this, passing in the directory of the backups since it carries the last write as the timestamp. I ran into 2 issues with that:
- If only 1 of n backups succeeds, the directory carries the new timestamp and thus the check incorrectly passes.
- If n of n empty backup files are written, the check incorrectly passes.
I modified the remote script to return the total size and the time of the oldest file.
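The idea of returning the total size and the time of the oldest file can be sketched roughly as below. This is not the script I actually deployed; the function name check_backups, its arguments and the thresholds are assumptions for illustration, and it relies on GNU stat (the -c option) and a flat directory of backup files.

```shell
# Hypothetical helper: judge a backup directory by its OLDEST file and
# its TOTAL size, rather than by the directory timestamp.
# Assumes GNU stat. Args: directory, max age (seconds), min total size (bytes).
check_backups() {
    dir="$1"; max_age="$2"; min_total="$3"
    now=$(date +%s)
    total=0; oldest=$now; count=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        size=$(stat -c %s "$f")      # size in bytes
        mtime=$(stat -c %Y "$f")     # last modification, epoch seconds
        total=$((total + size))
        count=$((count + 1))
        if [ "$mtime" -lt "$oldest" ]; then oldest=$mtime; fi
    done
    age=$((now - oldest))
    perf="age=${age}s size=${total}B files=$count"
    if [ "$count" -eq 0 ]; then
        echo "CRITICAL: no backup files in $dir|$perf"; return 2
    fi
    if [ "$age" -gt "$max_age" ]; then
        echo "CRITICAL: oldest backup is ${age}s old|$perf"; return 2
    fi
    if [ "$total" -lt "$min_total" ]; then
        echo "CRITICAL: backups total only ${total}B|$perf"; return 2
    fi
    echo "OK: $count backups, oldest ${age}s old, total ${total}B|$perf"
    return 0
}

# Usage (placeholder path and thresholds: 26 hours, 1 MB):
#   check_backups /var/backups/nightly 93600 1000000
```

Using the oldest file's age catches the "1 of n succeeded" case, and the total size catches the "n empty files" case.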