So at work today, as usual, I had a lot of trouble getting productive. This whole "burnt out" thang's been hitting me hard. Anyways, after last night (where I had to disable some of the checks in one of my "monitoring" scripts because the on-call admin was going numb from his cell phone going off all night) I sat down to come up with a better way. It's been an interesting learning experience, this project. More importantly - for the first time I felt myself really getting into working, after plunking away at it for a few hours. Once I finally figured out what I wanted to do, I was able to slowly ramp up until I was actually working at a decent pace towards the end and accomplishing something. I even stayed an hour later than usual (after making sure my dad was positive he'd be back in time to pick up my sons) to finish it and test it and make sure it worked.
The "project" is a log checker. We have a centralized syslog server, so all of the syslog messages from all of our machines hit one box.
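Getting everything onto one box is just standard syslog forwarding - roughly something like this on each client machine (we happen to use classic syslogd; the loghost name here is made up):

```
# /etc/syslog.conf on each client machine
# forward every facility/priority to the central log server
*.*    @loghost.example.com
```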
I wrote a script that reads the new syslog entries every 5 minutes, parses through them, dumping most and sorting the rest by type and the machine that sent them, and then inserts them into a db table. It will also send email notifications, either individual emails (for the more critical matches) or aggregate emails (i.e. one big email summarizing the past x minutes' worth of matches) for the less important ones. You can also "squelch" notifications, so if a certain machine is paging out a lot you can tell it to temporarily stop sending the emails (it will still record the entries into the database).
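The match-and-classify logic is roughly this shape - a Python sketch for illustration (the real script is Perl), with the rules, severities, and hostnames all made up for the example:

```python
import re

# Each rule maps a regexp to how its matches get mailed out:
# 'critical' -> individual email right away, 'aggregate' -> batched summary.
RULES = [
    (re.compile(r'kernel: .*(panic|oops)', re.I), 'critical'),
    (re.compile(r'sshd.*Failed password'), 'aggregate'),
]

# Hosts whose notifications are temporarily squelched
# (their entries are still recorded, just not mailed).
squelched = {'web3'}

def classify(line):
    """Return (host, severity, line) for lines we keep, None for ones we dump."""
    # naive syslog split: "Mon DD HH:MM:SS host daemon: message"
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None
    host, message = parts[3], parts[4]
    for pattern, severity in RULES:
        if pattern.search(message):
            return (host, severity, line)
    return None  # dumped

def process(lines):
    """Sort matches into immediate mails, aggregate mails, and db rows."""
    immediate, batch, db_rows = [], [], []
    for line in lines:
        hit = classify(line)
        if hit is None:
            continue
        host, severity, raw = hit
        db_rows.append(hit)                 # always goes into the database
        if host in squelched:
            continue                        # squelched: stored but not mailed
        (immediate if severity == 'critical' else batch).append(raw)
    return immediate, batch, db_rows
```

The real thing obviously has a much longer rule list and writes to a database rather than returning lists, but the dump-most/sort-the-rest flow is the same.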
Every night it summarizes that day's worth of entries and puts them into a long-term history table, so we can see longer historical trends. It keeps 7 days of detailed reports and the historical report table grows indefinitely (we can trim it by hand as needed).
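The nightly rollup is just collapsing the detail rows into per-day counts - again a Python sketch of the idea (the real version is Perl talking to the db, and the tuple fields here are invented):

```python
from collections import Counter

def summarize(detail_rows):
    """Collapse detail rows into history rows.

    detail_rows: (day, host, msg_type) tuples from the 7-day detail table.
    Returns (day, host, msg_type, count) rows for the long-term history table.
    """
    counts = Counter(detail_rows)
    return sorted((day, host, t, n) for (day, host, t), n in counts.items())
```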
It matches with full Perl regexps, and as I get around to it, it will support more than just email. It already does "heartbeats" - i.e. every machine logs a certain entry to the syslog server, and if it hasn't seen one of those heartbeats from a machine in an hour (they're stored in their own db table, and updated rather than having dupes inserted), it will page out about it so we know the machine is either down or its syslog or cron daemons are having problems.
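The heartbeat check boils down to "update one row per host, flag anything stale" - sketched in Python here (the real one is a Perl db table; the timeout matches the hour mentioned above, everything else is illustrative):

```python
HEARTBEAT_TIMEOUT = 3600  # one hour, in seconds

def record_heartbeat(table, host, now):
    """One row per host: update the timestamp instead of inserting a dupe."""
    table[host] = now

def stale_hosts(table, now, timeout=HEARTBEAT_TIMEOUT):
    """Hosts whose last heartbeat is older than `timeout` seconds -
    either the machine is down or its syslog/cron daemon is wedged."""
    return sorted(h for h, last in table.items() if now - last > timeout)
```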
It's got some nice features, and because it's all stored in a db it's easy to get to the data and do things with it. I also designed it to be fairly fault tolerant - there's a lot of error checking, and in case something goes wrong it errs on the side of safety. For instance if there's an error in the parser, or let's say it's killed halfway through parsing the latest entries, it will stop running and email out errors until it's fixed. Same thing if the db can't be reached. That way nothing will be lost - once the problem has been fixed, it'll catch up inserting the matches into the db from there.
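The err-on-the-side-of-safety part is basically checkpointing: only remember how far you got once the batch is safely in the db, so a crash mid-parse loses nothing. A Python sketch of that idea (the state dict and injected insert function are inventions for the example):

```python
def run_once(entries, state):
    """Process everything past the last committed offset.

    state['offset'] is only advanced after the whole batch is in the db,
    so a failure leaves it untouched and the next fixed run catches up.
    """
    start = state['offset']
    batch = entries[start:]
    try:
        insert_into_db = state['insert']   # injected for the example
        insert_into_db(batch)              # may raise: db down, bad entry, etc.
    except Exception as exc:
        state['error'] = str(exc)          # keep erroring out until it's fixed
        return False                       # offset untouched: nothing lost
    state['offset'] = len(entries)         # commit progress only on success
    state.pop('error', None)
    return True
```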
Anyways, it's getting a little bigger and bloatier and more confusing as time goes on, as features are requested and added, but it still works pretty well and isn't that messy yet. I worked hard on commenting everything and trying to keep things neat (everything uses strict and -w, for instance, and I pull any repeating blocks of code out into subroutines). It's far from perfect but I'm pretty happy with it.
Anyways... just felt like rambling about that for no reason. G'nite, all.