I'm trying to write instructions for some people to set up an app while onsite, and one of the steps involves backfilling a lot of summary index data.

I've followed the documented steps for the script Splunk provides for this (fill_summary_index.py):

http://www.splunk.com/base/Documentation/4.2.1/Knowledge/Managesummaryindexgapsandoverlaps

But this process is incredibly slow, much slower than I would expect. One big 'stats count by foo bar' over my entire test dataset takes only about 30 seconds, but at this rate the backfill script will take an hour or more per saved search against the same data, which is crazy. I expected the backfill to take a little longer than one giant search, not a hundred or more times longer. This is a big problem: if it takes hours on this tiny dataset, it'll take days on bigger data, which isn't OK at all.

So now I'm thinking maybe advanced users aren't supposed to use the Python script at all. With the old-school collect command, a bit of stats count by foo bar, a dash of bin to bucket the timestamps, maybe a dash of addinfo to add the search time, and a backgrounded search, I could probably generate the entire run of backfilled events with one long-running search.

http://www.splunk.com/base/Documentation/latest/SearchReference/Collect
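
For illustration, the single-search idea above might look roughly like the following. This is an untested sketch rather than a recommended recipe: the helper name `build_backfill_search`, the base search `index=main sourcetype=test`, the one-hour span, and the target index `summary` are all made-up placeholders. Only `bin`, `stats count by foo bar`, `addinfo`, and `collect` come from the idea described above.

```python
def build_backfill_search(base_search, by_fields, span="1h", summary_index="summary"):
    """Assemble one SPL search that backfills a whole time range in a single pass.

    Every argument value here is a placeholder for illustration; substitute
    your own base search, split-by fields, bucket span, and summary index.
    """
    if not by_fields:
        raise ValueError("need at least one split-by field")
    pipeline = [
        base_search,
        # Snap each event's timestamp to the start of its time bucket.
        "bin _time span={0}".format(span),
        # Produce one summary row per bucket and field combination.
        "stats count by _time {0}".format(" ".join(by_fields)),
        # Attach the search's time range and search time (info_* fields).
        "addinfo",
        # Write the resulting rows into the summary index.
        "collect index={0}".format(summary_index),
    ]
    return " | ".join(pipeline)


if __name__ == "__main__":
    print(build_backfill_search("index=main sourcetype=test", ["foo", "bar"]))
```

The printed string could then be run as a backgrounded search over the full backfill time range.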

At this point, though, I'm sure someone's way ahead of me, which is what brings me here. Does anyone have an emerging best practice they care to share? Or have I just completely missed a piece of documentation? Thanks.

asked 24 May '11, 11:50

nick ♦

edited 24 May '11, 12:00


One Answer:

Great questions and observations, Nick.

I still use the shipped script and run into the same challenges you mention. I get slightly quicker results by making the machine do as much work as it can, setting the concurrency flag (-j, usually to 8). I use a namefile (a text file listing the saved searches) when order is important. Most of my summaries have weeklies, which are built on dailies, which are built on hourlies. The dailies and weeklies are quick, but the hourlies definitely take some time.

I like your approach if it adds speed. I'm trying to think of how dedup would work with that method, as I rely on the -dedup flag to avoid re-summarizing what has already been summarized.

Basically, I run a command like this...

   $SPLUNK_HOME/bin/splunk cmd python $SPLUNK_HOME/bin/fill_summary_index.py -app test_app -namefile $SPLUNK_HOME/etc/apps/test_app/bin/summary.jobs -et -90d -lt now -j 8 -dedup true

where summary.jobs looks like this:

dashboard_a_base_summary-1h
dashboard_b_base_summary-1h
dashboard_c_base_summary-1h
dashboard_d_base_summary-1h
dashboard_a_base_summary-1d
dashboard_b_base_summary-1d
dashboard_c_base_summary-1d
dashboard_d_base_summary-1d
dashboard_a_base_summary-1w
dashboard_b_base_summary-1w
dashboard_c_base_summary-1w
dashboard_d_base_summary-1w
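
As a rough sketch of how these pieces fit together, the namefile contents can be generated span by span, so the hourlies are listed before the dailies and the dailies before the weeklies, and the command line can be assembled from the same flags. The function names, the dashboard list, and the `/opt/splunk` fallback are assumptions for illustration. The flags themselves are the ones in the command above.

```python
import os


def ordered_job_names(dashboards, spans=("1h", "1d", "1w")):
    """List saved-search names span by span, finest span first."""
    return ["{0}_base_summary-{1}".format(d, s) for s in spans for d in dashboards]


def backfill_argv(splunk_home, app, namefile, earliest="-90d", latest="now", jobs=8):
    """Build the argument list for fill_summary_index.py (same flags as above)."""
    return [
        splunk_home + "/bin/splunk", "cmd", "python",
        splunk_home + "/bin/fill_summary_index.py",
        "-app", app,
        "-namefile", namefile,
        "-et", earliest,
        "-lt", latest,
        "-j", str(jobs),       # number of concurrent searches
        "-dedup", "true",      # skip ranges that are already summarized
    ]


if __name__ == "__main__":
    # Assumed fallback location; use your real SPLUNK_HOME.
    home = os.environ.get("SPLUNK_HOME", "/opt/splunk")
    dashboards = ["dashboard_a", "dashboard_b", "dashboard_c", "dashboard_d"]
    # These lines are what would go into summary.jobs, one name per line.
    print("\n".join(ordered_job_names(dashboards)))
    namefile = home + "/etc/apps/test_app/bin/summary.jobs"
    print(" ".join(backfill_argv(home, "test_app", namefile)))
```

Generating the list rather than typing it makes it harder to put a daily ahead of the hourly it depends on when dashboards are added later.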

answered 24 May '11, 12:15

bwooden ♦



Copyright © 2005-2012 Splunk, Inc. All rights reserved.