The title is slightly flippant. This is not a discussion of a real DDOS, but rather the journey taken to discover and correct a self-induced DDOS (more along the lines of slowloris than a raw traffic issue) of a distributed web application, created by its own developers and exacerbated by the Managed Hosting company.
This graph represents the before and after of the number of Apache processes in the Waiting state:
The issue was investigated first by looking at a 5 minute server graph. This didn't yield a discernible pattern or cause, but while digging further into a high load average I hit upon the idea of monitoring Apache's /server-status/ with something like:
]# while [ 1 ]; do echo -n `date "+%Y-%m-%d %H:%M:%S |"` >> /tmp/log ; wget -q -O - "http://127.0.0.1/server-status/?auto" | grep Scoreboard | cut -f 2 -d " " >> /tmp/log; sleep 1; done
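To make that log easier to read after the fact, a one-liner along these lines (just an illustration, assuming the log format produced above) turns each sample into a timestamp and a count of non-idle workers, where "_" and "." are the idle / open-slot states in the scoreboard:

]# awk -F'|' '{ board = $2; busy = gsub(/[^_.]/, "", board); print $1, busy }' /tmp/log

Eyeballing (or plotting) those per-second counts is what makes the start-of-minute spike jump out.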
It was clear from this log that every minute the process receiving the CDRs was triggered (by the many PBXs), consuming all of the Apache children and, in so doing, hitting the MySQL connection limit. This lasted for the first 10-15 seconds of every minute.
The process on the remote / distributed 1500 servers that transmitted the CDRs was set to run from a cron job, sending any CDRs that existed every minute, and for the most part the servers were time sync'd. Problem suddenly VERY much apparent.
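In effect, every remote box had a crontab entry roughly like the following (the script name is hypothetical; the schedule is the important part):

# fires at the top of every minute on ~1500 time-synchronised servers
* * * * * /usr/local/bin/send_cdrs.sh

With the clocks in sync, "every minute" really means "all at the same second of every minute".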
The 1500 servers only send CDRs when there has been a call and are more-or-less spread evenly over the inhabited timezones, so whilst there was never demand for 1500 concurrent connections, the demand was frequently bouncing off the Apache and MySQL limits.
This was clearly a big contributor to the high load average.
Apache coped with this, but eventually there were too many Apache children for the number of allowed MySQL client connections.
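The two limits in play are easy enough to compare; on a stock prefork Apache / MySQL setup it looks something like the following (the path and the values shown are the usual defaults, purely for illustration, not the production figures):

]# grep -i maxclients /etc/httpd/conf/httpd.conf
MaxClients 256
]# mysql -N -e "SHOW VARIABLES LIKE 'max_connections'"
max_connections 151

If every busy Apache child holds a database connection, 256 children chasing 151 connections guarantees a run of "Too many connections" errors during each start-of-minute burst.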