Continuing the “quick&dirty” series, here’s a solution I implemented on a server where MySQL occasionally (once a week or so) just dies.

The cause is still unknown and needs investigating. The problem appears to happen when the Apache server is under heavy load: such spikes are unusual and are probably caused by malicious scanners. Even so, the MySQL server shouldn’t die under heavy load; it should just respond more slowly, since it powers several web sites on that server.

In the meantime, the MySQL server has to be up and Apache has to be responsive. Without implementing a “professional” solution with all the bells and whistles, one thing that can be done is to create a simple watchdog script in Perl.

The following script is written for Gentoo Linux, but it should be trivial to adapt it to any Unix variant.

use Email::Address;
use Email::MIME;
use Email::Sender::Transport::Sendmail;
use Email::Sender::Simple qw/sendmail/;
use version;
use Arthas::Defaults::520;

our $VERSION = qv(1.0);
our $BASE = '/root/bin';
our $transport;
our $config = { };

lock_session();
configure();
chh();
unlock_session();

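sub lock_session {
    # Refuse to run if a previous run is still active (or died without unlocking)
    die "Existing lock file" if -e "$BASE/checkhealth.lock";
    open (my $lfh, '>', "$BASE/checkhealth.lock") or die "Can't lock";
    # Record our process ID so a stale lock can be traced
    print $lfh "$$\n";
    close $lfh;
}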

sub configure {
    $transport = Email::Sender::Transport::Sendmail->new;
}

sub chh {
    # Get all MySQL and Apache processes
    my @processes = split m/\n/xs, `/bin/ps ax`;
    my @p_mysql = grep { $_ =~ m/mysqld/xs } @processes;
    my @p_apache = grep { $_ =~ m/apache/xs } @processes;
    
    # Parse uptime command output, and get system load in last minute, last 5 minutes and last 15 minutes.
    # The decimal separator depends on the locale, so accept both "." and ","
    my $uptime = `/usr/bin/uptime`;
    my ($ul1, $ul5, $ul15) = (0, 0, 0);
    if ( $uptime =~ m/(\d+)[.,](\d+),\s*(\d+)[.,](\d+),\s*(\d+)[.,](\d+)/xs ) {
        ($ul1, $ul5, $ul15) = ("$1.$2", "$3.$4", "$5.$6");
    }

    # Server name is used when reporting via e-mail, because if we run the same script on more than one
    # server we want to know which server had a problem.
    my $uname = `/usr/bin/uname -a`;
    (undef, my $servername) = split /\s+/, $uname;

    my $body = '';

    # No MySQL process active? Restart!
    if ( !@p_mysql ) {
        $body .= "MYSQL server not active, restarting...\n\n";

        my $rsout = `/etc/init.d/mysql restart`;
        $body .= $rsout."\n\n";
    }

    # No Apache process active? Restart! Unlikely, never happened, but I don't need to spare 5 lines of code
    if ( !@p_apache ) {
        $body .= "Apache server not active, restarting...\n\n";

        my $rsout = `/etc/init.d/apache2 restart`;
        $body .= $rsout."\n\n";
    }

    # If load in the past minute is over 6 (it should be 2 at most, it's a 2-core VPS), restart Apache
    if ( $ul1 > 6 ) {
        $body .= "Server under heavy load, restarting Apache...\n\n";

        my $rsout = `/etc/init.d/apache2 restart`;
        $body .= $rsout."\n\n";
    }

    # Notify, but only in case of a problem - we surely don't want to get a useless e-mail every 5 minutes or so
    if ( $body ) {
        # Add uname and uptime output, which can be useful
        $body .= "SERVER DATA:\n" . $uname . $uptime . "\n";

        my @parts = (Email::MIME->create(
            attributes => {
                encoding        => 'quoted-printable',
                content_type    => 'text/plain',
                charset         => 'UTF-8',
            },
            body_str => $body,
        ));

        my $msg = Email::MIME->create(
            header_str  => [
                From    => 'domain@cattlegrid.info',
                To      => 'mb@cattlegrid.info',
                Subject => "$servername: health notice",
            ],
            parts       => \@parts,
        );
        sendmail($msg, { transport => $transport });
    }
}

sub unlock_session {
    unlink "$BASE/checkhealth.lock" or die "Can't unlock";
}

The script is designed to be run as a cron job, which is the safest option: there is no need to ensure the script itself is always running, as it’s called every few minutes. How many minutes “a few” means depends on how often you want to check the health of the system.

The locking system, just a simple lock file, prevents two copies of the script from running at the same time. That is actually an extremely unlikely event, which could happen only if you schedule the script to run very often, like every minute. Note that if a run dies before unlocking, the lock file stays behind and later runs will refuse to start until it’s removed.
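
If a leftover lock file ever becomes a nuisance, one possible alternative is an advisory lock taken with Perl's built-in flock, which the operating system releases by itself when the process exits, however it exits. This is only a sketch and not what the script above does: the try_lock helper and the /tmp path are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper (not part of the original script): returns an open,
# locked filehandle, or undef if another process already holds the lock.
sub try_lock {
    my ($path) = @_;
    # Append mode creates the file if needed without truncating it
    open(my $fh, '>>', $path) or die "Can't open $path: $!";
    # LOCK_NB makes flock return immediately instead of waiting
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# Example path, an assumption; the script above would use "$BASE/checkhealth.lock"
my $lock = try_lock('/tmp/checkhealth-example.lock');
if ($lock) {
    print "Lock acquired, running checks\n";
    # ... the health checks would go here ...
    close $lock;    # releases the lock; exiting the process does too
}
else {
    print "Another instance is running, skipping this run\n";
}
```

With this approach the file itself is never deleted; only the lock on it matters, so there is nothing stale to clean up after a crash.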

Here’s an example of the notification e-mail sent when there is a problem.

MYSQL server not active, restarting...

* Checking mysqld configuration for mysql ... [ ok ]
* Stopping mysql ... [ ok ]
* Starting mysql ... [ ok ]

SERVER DATA:
Linux cattlegrid 5.14.17-x86_64-linode150 #1 SMP Thu Nov 11 13:17:05 EST 2021 x86_64 AMD EPYC 7601 32-Core Processor AuthenticAMD GNU/Linux
17:47:28 up 11 days, 18:06,  2 users,  load average: 0,09, 0,19, 0,23

The rest of the script is explained in the comments. It’s actually very simple, maybe even silly, but it works like a charm.
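
One detail worth a closer look is the load average parsing. The uptime line in the e-mail above prints the load with a comma as decimal separator ("0,09"), which depends on the server's locale; elsewhere it would be a dot. A minimal sketch of a parser that accepts both could look like this. The parse_load name is made up for illustration, and the second sample line is an invented example of the dot format.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the 1, 5 and 15 minute load averages from
# a line of uptime output, accepting "." or "," as decimal separator.
sub parse_load {
    my ($line) = @_;
    my $num = qr/(\d+)[.,](\d+)/;    # integer part, separator, decimals
    if ( $line =~ m/load\s+averages?:\s*$num,?\s+$num,?\s+$num/ ) {
        # Rebuild each value with a dot so Perl can treat it as a number
        return ("$1.$2", "$3.$4", "$5.$6");
    }
    return;    # empty list: the format was not recognised
}

for my $sample (
    # Taken from the notification e-mail above
    '17:47:28 up 11 days, 18:06,  2 users,  load average: 0,09, 0,19, 0,23',
    # Invented sample using a dot as decimal separator
    '17:47:28 up 11 days, 18:06,  2 users,  load average: 7.50, 3.20, 1.10',
) {
    my ($l1, $l5, $l15) = parse_load($sample);
    print "1 min: $l1, 5 min: $l5, 15 min: $l15\n";
}
```

Keeping the decimals means a threshold like "over 6" also triggers at 6.5, not only from 7 upwards.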