Nagless
Monitoring software often suffers from the "Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf! Wolf!" (W) problem, wherein issues are raised repeatedly to the point that some folks ignore the alerts. Other folks complicate the monitoring by putting a layer between the "things that generate alerts" (monitoring software) and "targets of those alerts" (the poor sods who are on-call); this new layer only allows some alerts through based on various rules, which can range from a simple "do not send another alert if the prior alert was within the last 30 minutes" to highly complicated schemes where there is a hierarchy, whereby if, say, DNS is down all the things that depend on DNS will have their alerts suppressed. Larger sites are probably the only ones that need the complexity of a hierarchical alarm suppression system, or escalation should the on-call(s) not respond in a suitable amount of time, etc. Still other folks are okay with W, but I really do not get their thinking, especially given that all of the rest of the team ended up ignoring all of the alerts. Too much noise.
Alerts tend to go out via email, or did back in the day, or still do for folks who have email infrastructure, so a good place to put a declutter layer is on a mail server. Some account will accept all the monitoring alerts and determine whether to forward the alerts, or not, based on whatever logic is deemed necessary to balance the need for timely notification of important events versus not drowning the on-call in noise. Very important alerts may also be sent to everyone via a pager system in addition to email. The declutter layer could also redirect alerts to, say, an IRC channel, or might have logic that varies where the alerts go depending on whether someone is near a computer, or not. Too much logic of course is bloat and may well come with errors that require time to debug and fix, or may cause too many alerts to be missed. "…and that amount of Wu Wei is just right," said Goldilocks.
OpenSMTPD
The mail server is a good place to route messages to a program, though this could also be done via a .forward file, or a program could process messages via IMAP depending on the level of access to the system. Mail from spammers and other such malicious sources should ideally not be able to reach the nagless address, as besides the time wasted in cleaning up their little messes there is also the risk of a security vulnerability, or, less likely, the ability of a remote attacker to somehow mask system alerts. An inside attacker would be more likely to break an alert system, as they likely have access to relevant code, documentation, or systems involved. Some duplication of monitoring may be necessary in larger sites, as those duplicate checks may notice something that has been suppressed by an insider in another system?
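For the .forward route, a single pipe entry in the relevant account's forward file would hand each message to the program, assuming the MTA permits delivery to commands from forward files; the path and arguments here are hypothetical, and match the invocation shown further below:
$ cat ~/.forward
"|/usr/local/libexec/nagless /var/run/nagless/cpf 1800 user@example.org"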
$ grep nagless /etc/mail/smtpd.conf
action "nagless" mda "/usr/bin/true"
match from local for rcpt-to regex "^nagless@" action "nagless"
match from any for rcpt-to regex "^nagless@" reject
$ echo foo | mail -s test nagless
$ grep nagless /var/log/maillog
...
Sometimes you just want to get your foot in the door, so a delivery to true(1) should not fail, though it will throw the messages away. Monitoring zero. Next up is something that actually parses the messages and does suitable things with them. Or you could stop right here?
Dedup
The MDA protocol is fairly simple; the email shows up on standard input, line by line. We do not need much of a parser here, maybe only to look for the "Subject" header. Even this can have complications; I had an MDA once that would extract the message subject and post it to syslog, where it would show up in a tailed logfile, and foolish me assumed that the subject would appear somewhere in a 4096-byte buffer of the message. Efficient block read, search the block, done. Right? However, certain messages from Microsoft Exchange somehow managed to stuff somewhere north of 8192 bytes into the headers before the subject line. Why does a message with a few sentences in it need orders of magnitude more metadata along for the ride, and how many sanity points would it cost to learn what the heck Microsoft was doing there?
You may need a more complicated parser if there is information you need to dig out from somewhere within a MIME part. The following only looks for a line starting with "Subject: " in the headers and may do something with it; otherwise, not much happens. Some may want better logs of what is going on.
// nagless - mail delivery agent (MDA) that forwards but suppresses
// messages for some amount of time, to cut down on monitoring spam
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <ctype.h>
#include <err.h>
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <syslog.h>
#include <time.h>
#include <unistd.h>
// PORTABILITY some OS put this at /usr/lib/sendmail or you may instead
// need to use something like msmtp to send the mail off elsewhere
#ifndef SENDMAIL_PATH
#define SENDMAIL_PATH "/usr/sbin/sendmail"
#endif
// NOTE syslog and other things have limits if you want to increase the
// maximum Subject length sent
#define MAX_BLATHER 70
#define NAGFAIL EX_TEMPFAIL
enum { PIPE_R, PIPE_W };
char *Checkpoint_File;
char *Contact_Address;
int Nagless_Sec;
int Flag_Syslog; // -s
static void emit_help(void);
static void handle_subject(char *subj, ssize_t subjlen);
static void maybe_notify(char *subj, ssize_t subjlen);
int
main(int argc, char *argv[])
{
int ch;
const char *errstr;
char *line = NULL;
size_t linesize = 0;
ssize_t linelen;
if (!setlocale(LC_ALL, "C")) err(1, "setlocale");
while ((ch = getopt(argc, argv, "s")) != -1) {
switch (ch) {
case 's':
Flag_Syslog = 1;
break;
default:
emit_help();
}
}
argc -= optind;
argv += optind;
if (argc != 3 || !argv[0] || !argv[0][0] || !argv[1] || !argv[1][0] ||
!argv[2] || !argv[2][0])
emit_help();
Checkpoint_File = argv[0];
// PORTABILITY use libbsd or strtoul(3) or hopefully not atoi(3)
Nagless_Sec = (int) strtonum(argv[1], 0, INT_MAX, &errstr);
if (errstr) err(NAGFAIL, "strtonum '%s': %s", argv[1], errstr);
Contact_Address = argv[2];
if (Flag_Syslog) openlog("nagless", LOG_NDELAY, LOG_USER);
#ifdef __OpenBSD__
if (pledge("exec fattr proc rpath stdio unveil", NULL) == -1)
err(1, "pledge");
if (unveil(SENDMAIL_PATH, "x") == -1) err(1, "unveil");
if (unveil(Checkpoint_File, "rw") == -1) err(1, "unveil");
if (unveil(NULL, NULL) == -1) err(1, "unveil");
#endif
while ((linelen = getline(&line, &linesize, stdin)) != -1) {
		// blank line, end of headers; also accept \r\n in case the
		// message arrives with CRLF line endings
		if (linelen == 1 || (linelen == 2 && *line == '\r')) goto DONE;
if (!strncmp(line, "Subject: ", 9)) {
handle_subject(line + 9, linelen - 10);
goto DONE;
}
}
	free(line);
	if (ferror(stdin)) err(NAGFAIL, "getline");
DONE:
exit(EXIT_SUCCESS);
}
static void
emit_help(void)
{
fputs(
"Usage: nagless [-s] checkpoint-file window-seconds notify-address\n",
stderr);
exit(NAGFAIL);
}
inline static void
handle_subject(char *subj, ssize_t subjlen)
{
char *cp;
if (subjlen <= 0) return;
if (subjlen > MAX_BLATHER) subjlen = MAX_BLATHER;
cp = subj;
for (size_t i = 0; i < (size_t) subjlen; ++i, ++cp)
		if (!isprint((unsigned char) *cp)) *cp = '.';
if (Flag_Syslog) syslog(LOG_NOTICE, "%.*s", (int) subjlen, subj);
maybe_notify(subj, subjlen);
}
inline static void
maybe_notify(char *subj, ssize_t subjlen)
{
char *failure = "unknown";
int ret, status, tochild[2];
pid_t child, pid;
struct stat sb;
time_t now;
if (time(&now) == (time_t) -1) err(NAGFAIL, "time");
if (stat(Checkpoint_File, &sb) == -1)
err(NAGFAIL, "stat '%s'", Checkpoint_File);
if (llabs(now - sb.st_mtim.tv_sec) < Nagless_Sec) return;
	if (utimes(Checkpoint_File, NULL) == -1)
		warn("utimes '%s'", Checkpoint_File);
if (pipe(tochild) == -1) {
failure = "pipe";
goto NOTIFY_FAIL;
}
pid = fork();
if (pid < 0) {
failure = "fork";
goto NOTIFY_FAIL;
} else if (pid == 0) { // child
close(tochild[PIPE_W]);
if (dup2(tochild[PIPE_R], STDIN_FILENO) != STDIN_FILENO)
err(1, "dup2");
close(tochild[PIPE_R]);
execl(SENDMAIL_PATH, "sendmail", "-t", (char *) 0);
err(1, "execl");
}
close(tochild[PIPE_R]);
ret = dprintf(tochild[PIPE_W], "To: %s\nSubject: %.*s\n\nfyi",
Contact_Address, (int) subjlen, subj);
if (ret < 0)
warn("dprintf");
else if (ret == 0)
warn("dprintf zero write??");
if (close(tochild[PIPE_W]) == -1) warn("close");
// TODO may need timeout here, in the event the mail program gets stuck
child = wait(&status);
if (child == -1) warn("wait");
if (status) warnx("non-zero exit (%d)", status);
return;
NOTIFY_FAIL:
if (Flag_Syslog) syslog(LOG_ERR, "notification failure: %s", failure);
warn("notification failure: %s", failure);
}
Back in smtpd.conf, the action line should look something like
action "nagless" mda "/usr/local/libexec/nagless -s /var/run/nagless/cpf 1800 user@example.org" user nagless
and you'll need to touch and chown and chmod the /var/run/nagless/cpf file so that the nagless user can peek and poke at it (example commands below). Then, point monit or whatever at the local nagless account instead of directly at your user account:
# grep set\ alert /etc/monitrc
set alert nagless@example.org
set alert nagless@example.org not on { instance, action }
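Creating the checkpoint file might look something like the following; the ownership and mode here are assumptions, the point being that the nagless user must be able to read the file and update its timestamp (and on systems that clear /var/run at boot this would need to be repeated at startup):
# mkdir -p /var/run/nagless
# touch /var/run/nagless/cpf
# chown nagless /var/run/nagless/cpf
# chmod 600 /var/run/nagless/cpf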
In theory, if all goes well, only some alerts will make their way through. This is pretty barebones so probably does need more complications on what gets through, how events are logged for review, etc.
Improvements?
- Sending mail directly to "sendmail" is a bit low-level and may get you in trouble if you miss out on essential headers that mail(1) or similar would add, especially if your mail goes off to some other company that checks for those headers and then drops your alerts into a spam bucket. On the other hand, piping to mail(1), which in turn runs sendmail, means more processes, so it is slightly more likely to fail if the system is really struggling for resources.
- There is not a lot of testing there, it's really a quick prototype, so ideally there would be tests that check that various conditions work the right way.
- Using "touch" on a checkpoint file is not very robust; if a lot of messages come in at the same time a few of them could get through before the filesystem mtime change is noticed. Rows could instead be put into a database and something else pushes one or more unseen messages elsewhere, if necessary, and the rows would give you something in the future to review what happened.
- Syslog may not be ideal if "too many" messages are dumped into it. OpenBSD tends not to have this problem, while other systems do.
- If email or DNS is having trouble then the notification may not go through. There are services you can send messages to that then page you, but those tend to cost money, or might be broken if your link to them is down. Or you could devise something over a serial line, something that should work even if DNS and email are all broken.
- The alerts could also be sent off to the side to an IRC bot that has a FIFO ready to read from for just that purpose (some previous blog post here).
- Other...
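One way to narrow the checkpoint race without bringing in a database is to hold an exclusive lock while checking and updating the timestamp. A minimal sketch of what maybe_notify could do instead of the stat plus utimes calls, assuming flock(2) is available (this would also need <fcntl.h>, <sys/file.h>, and the "flock" and "wpath" pledge promises added):
	// sketch: serialize concurrent deliveries around the mtime check so
	// that only one process at a time decides whether to notify
	int fd = open(Checkpoint_File, O_RDWR);
	if (fd == -1) err(NAGFAIL, "open '%s'", Checkpoint_File);
	if (flock(fd, LOCK_EX) == -1) err(NAGFAIL, "flock");
	if (fstat(fd, &sb) == -1) err(NAGFAIL, "fstat");
	if (llabs(now - sb.st_mtim.tv_sec) < Nagless_Sec) {
		close(fd); // drops the lock, nothing to send
		return;
	}
	if (futimes(fd, NULL) == -1) warn("futimes");
	close(fd); // drops the lock; carry on and send the notification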
Placid
If your alerts fire too infrequently then the on-call and everyone else may be out of practice on what to do, and who knows whether the monitoring code has bitrotted. A not very exciting environment may need practice alerts, much as folks test a building's fire alarm system even though there has not been a fire for who knows how long. Computer systems, being both new and typically riddled with errors, tend not to (but can) have this problem.
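A minimal sketch of such a practice alert, assuming cron and the setup above; the schedule and wording are made up, and routing it through the nagless address exercises the full delivery path:
# crontab -l | grep drill
0 9 1 * * echo "this is a drill" | mail -s "drill: pretend the disk is full" nagless@example.org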
Message Reduction
Turning off httpd generates a bunch of syslog messages, too many perhaps, but the number of emails generated for this event was just one:
Nov 20 01:56:37 thrig httpd[86929]: parent terminating, pid 86929
Nov 20 01:57:09 thrig monit[83927]: 'httpd' process is not running
Nov 20 01:57:09 thrig nagless: monit:restart|httpd|Does not exist
Nov 20 01:57:09 thrig monit[83927]: 'httpd' trying to restart
Nov 20 01:57:09 thrig monit[83927]: 'httpd' start: '/usr/sbin/rcctl start httpd'
Nov 20 01:57:09 thrig nagless: monit:restart|httpd|Does not exist
Nov 20 01:57:09 thrig httpd[96273]: startup
Nov 20 01:58:10 thrig monit[83927]: 'httpd' process is running with pid 96273
Nov 20 01:58:10 thrig nagless: monit:alert|httpd|Exists
Nov 20 01:58:10 thrig nagless: monit:alert|httpd|Exists