How not to DDOS your own distributed web application

The title is slightly flippant.  This is not a discussion of a real DDOS, but rather the journey taken to discover and correct a self-induced DDOS (more along the lines of Slowloris than a raw traffic flood) of a distributed web application, created by its own developers and exacerbated by the managed hosting company.

Background / Preamble

The subject of this analysis is a distributed PBX system in which a central LAMP "manager" communicates with around 1500 Asterisk PBXs for configuration and control, and the "many" regularly send Call Data Records (CDRs), voicemails and call recordings to the "one".

Soon after a Zend / Doctrine rewrite was rolled out, grumblings about a lack of responsiveness in the central web application UI started (or rather became more significant).  The regular answer given by the main developer was that the application needed parts of the load separated onto different servers.

Coupled with this, specific application errors were escalated by support with the key text:

SQLSTATE[08004] [1040] Too many connections

The server hosting the Apache and MySQL applications was "fully managed" - so there was no access to the OS, Apache or MySQL config.

The issue

Apache coped with the load itself, but eventually there were too many Apache children for the number of allowed MySQL client connections, and requests began failing with the "Too many connections" error above.

Investigation

The issue was investigated first by looking at a 5-minute server graph. This didn't yield a discernible pattern or cause, but in digging further into the high load average I hit upon the idea of monitoring Apache's /server-status/?auto with something like:

]# while [ 1 ]; do echo -n `date "+%Y-%m-%d %H:%M:%S |"` >> /tmp/log ;  wget -q -O - "http://127.0.0.1/server-status/?auto" | grep Scoreboard | cut -f 2 -d " " >> /tmp/log; sleep 1; done

It was clear from this log that every minute the process receiving the CDRs was triggered (by the many PBXs), consuming all of the Apache children and, in so doing, hitting the MySQL connection limit; this lasted for the first 10-15 seconds of every minute.

Findings

The process on the 1500 remote / distributed servers that transmitted the CDRs was set to run from cron, sending any CDRs that existed every minute, and for the most part the servers are time-synced. The problem was suddenly VERY much apparent - every PBX was phoning home at the top of the same minute (see the sketch below).
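A minimal sketch of the kind of crontab entry involved, with hypothetical file and script names - the schedule is the point, since cron starts jobs at second zero of each minute, and NTP-synced clocks make that the same instant everywhere:

  # /etc/cron.d/cdr-upload on each PBX (file and script names hypothetical)
  # "* * * * *" = every minute, always at second 0 of the minute
  * * * * *  root  /usr/local/bin/send_cdrs.sh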

The 1500 servers only send CDRs when there has been a call, and they are more-or-less spread evenly over the inhabited timezones, so whilst there was never demand for 1500 concurrent connections, the demand was frequently bouncing off the Apache and MySQL limits (even a modest fraction of 1500 requests landing in the same second swamps 360 Apache children, never mind 270 MySQL connections) and was clearly a big contributor to the high load average.

Solution

Everything above points at one fix: stop 1500 time-synced servers from sending in the same second of every minute, and spread the uploads across the minute instead.
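One simple way to do that is to put a random delay in front of each upload. A minimal sketch, assuming the CDRs are pushed by a bash script run from cron (the script and the 50-second bound are illustrative):

  #!/bin/bash
  # Wait a random 0-49 seconds so 1500 time-synced PBXs no longer hit
  # the manager in the same second; uploads still happen every minute.
  sleep $(( RANDOM % 50 ))

  # ...then transmit any queued CDRs exactly as before...

Each server still sends within a minute of a call, but the manager sees a fairly even trickle across each minute instead of a once-a-minute stampede.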

This graph represents the before and after of the number of Apache processes in the Waiting state:

  • Apache Max children is set to 360
  • MySQL Max connections is set to 270

Apache Server Status - Waiting graph
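The per-state counts behind a graph like this can be collected by sampling the same scoreboard from PHP. Apache's scoreboard letters are: "_" waiting, "S" starting, "R" reading request, "W" sending reply, "K" keepalive, "D" DNS lookup, "C" closing, "L" logging, "G" gracefully finishing, "I" idle cleanup and "." an open slot: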

  // Grab the machine-readable status page and keep just the scoreboard string.
  $output = shell_exec('wget -q -O - "http://192.168.0.219/server-status/?auto" | grep Scoreboard | cut -f2 -d" "');
  $output = trim($output);
  // One counter per scoreboard state ('K' is keepalive).
  $arr = array('_'=>0, 'S'=>0, 'R'=>0, 'W'=>0, 'K'=>0, 'D'=>0, 'C'=>0, 'L'=>0, 'G'=>0, 'I'=>0, '.'=>0);
  for ($i = 0; $i < strlen($output); $i++) {
    $arr[$output[$i]]++;
  }
  // Emit the counts as one colon-separated line for the grapher.
  echo $arr['_'] . ':' . $arr['S'] . ':' . $arr['R'] . ':' . $arr['W'] . ':' . $arr['K'] . ':' . $arr['D'] . ':' . $arr['C'] . ':' . $arr['L'] . ':' . $arr['G'] . ':' . $arr['I'] . ':' . $arr['.'];
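Polled every minute, this prints a single line such as 250:2:4:95:3:0:2:1:0:0:3 (values illustrative) - the eleven counts in scoreboard order, ready for whichever graphing tool is collecting them.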
