|
I'm trying to write instructions for some people to set up an app while onsite, and one of the steps involves backfilling a lot of summary index data. I've followed the steps to use the script Splunk provides for this (fill_summary_index.py): http://www.splunk.com/base/Documentation/4.2.1/Knowledge/Managesummaryindexgapsandoverlaps. But this process is incredibly slow, much slower than I would expect. One big 'stats count by foo bar' over my entire test dataset takes only about 30 seconds, but running this backfill script against the same data is going to take an hour or more for each saved search at this rate, which is crazy. I expected the backfill to take a little longer than one giant search, but not over a hundred times longer. This is a big problem, because if it takes hours on this tiny dataset it'll take days on bigger data, which isn't OK at all. So now I'm thinking maybe advanced users aren't supposed to use the Python script? Maybe the backfill is better done with the old-school collect command instead: http://www.splunk.com/base/Documentation/latest/SearchReference/Collect. At this point I'm sure someone's way ahead of me, which is what brings me here. Does anyone have an emerging best practice they care to share? Or have I just completely missed a piece of documentation? Thanks. |
|
Great questions and observations, Nick. I still use the shipped script and experience the challenges you mention. I get slightly quicker results by making the machine do as much work as it can: I set the script's concurrency flag (usually to 8). When the order matters, I list the saved searches in a text file. Most of my summaries have weeklies which are built on dailies, which in turn are built on hourlies. The dailies and weeklies are quick, but the hourlies definitely take some time. I like your approach if it adds speed. I'm trying to think of how dedup would work with that method, as I rely on the script's dedup flag to avoid re-summarizing what has already been summarized. Basically, I run a command like this...
where summary.jobs looks like this:
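A minimal Python sketch of this kind of setup, for illustration only and not the exact command or file referred to above. It writes an ordered name file (hourlies first, then dailies, then weeklies) and assembles a fill_summary_index.py invocation with concurrency and dedup turned on. The saved search names, app name, time range and credentials are all hypothetical. The flags (`-app`, `-namefile`, `-et`, `-lt`, `-j`, `-dedup`, `-auth`) are assumed from the script's documented usage, so check them against the version you have installed.

```python
import tempfile
from pathlib import Path


def write_name_file(path, saved_searches):
    """Write saved search names one per line, in the order they should run."""
    Path(path).write_text("\n".join(saved_searches) + "\n")


def build_backfill_command(app, name_file, earliest, latest,
                           max_jobs=8, dedup=True, auth="admin:changeme"):
    """Return the argument list for a backfill run.

    Assumes the command is launched from $SPLUNK_HOME/bin and that the
    flags below match the installed fill_summary_index.py.
    """
    return [
        "./splunk", "cmd", "python", "fill_summary_index.py",
        "-app", app,
        "-namefile", str(name_file),   # ordered list of saved searches
        "-et", earliest,               # earliest time to backfill
        "-lt", latest,                 # latest time to backfill
        "-j", str(max_jobs),           # maximum concurrent searches
        "-dedup", "true" if dedup else "false",  # skip already-summarized spans
        "-auth", auth,                 # hypothetical credentials
    ]


if __name__ == "__main__":
    # Hypothetical saved search names: hourlies feed dailies feed weeklies.
    searches = ["summary - hourly", "summary - daily", "summary - weekly"]
    with tempfile.TemporaryDirectory() as tmp:
        name_file = Path(tmp) / "summary.jobs"
        write_name_file(name_file, searches)
        cmd = build_backfill_command("myapp", name_file, "-30d@d", "@d")
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True, cwd=<SPLUNK_HOME/bin>)
```

Building the argument list separately from running it makes it easy to review the command before a long backfill starts. Whether the searches in the name file strictly finish in listed order once concurrency is above 1 is something to verify on your own install.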
|
