|
In my office we have a script on our log servers that monitors the hosts sending logs and alerts us if a machine starts pumping out an inordinate amount of logs. I'm trying to figure out whether this can be moved into Splunk so we can get rid of yet another hand-rolled script. My concern, though, is that this would have to be a batch job run once or twice a day, which would lose the real-time alerting we get now. So I'm wondering: is there a way to set things up so that, as new logs come in, the count of logs from that host is checked against a threshold that I set?
|
You'd have to more or less roll this yourself, e.g., run a search every 5 minutes that looks at the last few minutes to see whether the number has changed drastically. However, in the Splunk world you often run your indexing system somewhere near capacity, so during a spike it can go over capacity. That causes lag in the indexed data, which might make your volume appear to go down rather than up. Options:
For example, a search of sourcetype=foo host=bar | stats count as event_count | where event_count>50000 OR event_count<200 would emit one result only if the count is outside the thresholds, so you could make your alert condition "more than 0 events".
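To watch every host at once rather than one named host, the same idea can be sketched with a by clause. This is only an illustration: the sourcetype name, the five-minute window, and both thresholds are placeholders, not values taken from this thread.

```spl
sourcetype=foo earliest=-5m@m latest=@m
| stats count AS event_count BY host
| where event_count > 50000 OR event_count < 200
```

Scheduled every 5 minutes with an alert condition of "more than 0 results", this returns one row per host that is outside the thresholds. Note that a host sending nothing at all produces no row, so the lower bound only catches hosts that are quiet, not hosts that are silent.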
|
I've found that a good way to track this kind of information is to leverage the indexing metrics on the Splunk indexer instance. The indexing metrics are captured every 30 seconds for the top 10 sources, sourcetypes, indexes, and hosts. (You can raise the number of series in the ...)

We have an email-alerting saved search set up to run every 5 minutes (from -6m@m to -1m@m) that uses the 'source' metrics to point out any log files that are becoming too chatty. That may lead to exceeding our license usage, which is the primary scenario this search was set up to alert us about. Here is a slimmed down version of our search:
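A rough sketch of a search in this style follows. It is illustrative only and is not the search referred to above. It assumes the default metrics.log in the _internal index and its per_source_thruput group, and the 10240 KB threshold is an arbitrary placeholder.

```spl
index=_internal source=*metrics.log group=per_source_thruput earliest=-6m@m latest=-1m@m
| stats sum(kb) AS total_kb BY series
| where total_kb > 10240
| sort - total_kb
```

In these metrics events, series holds the source name and kb the volume indexed in each sampling interval, so the sum approximates the volume per source over the five-minute window. Only the top series are recorded in each interval, so a source that never makes the top list will not show up.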
The list of ...

PLEASE NOTE: The search provided here is only an example, so don't just copy it, run it, and expect to get sane results on your system. This is one possible way of tracking your indexing/usage patterns; it is a starting point, not a solution.

I've been meaning to overhaul this alert. We take aggregate snapshots of the indexing metrics info that end up in the ...

This is a pretty efficient and informative approach (we've already collected the data), under the presumption that you have sufficient indexing capacity to accept at least some part of an unusual rise. In a tightly run shop, that's probably true, but I'm hesitant to prescribe it as a general approach for all Splunk instances.
(05 Apr '10, 18:11)
jrodman ♦
That's a good point. I've updated the post to add a disclaimer. I certainly did not intend to present this as a reusable solution; I was simply pointing out a starting point that leverages the indexing metrics. Hopefully the post is clearer about that now.
(05 Apr '10, 20:01)
Lowell ♦
|
