I have a large archive of old data I want to load while also loading new real-time data.
What is the most efficient way to load archived data? I see batch, oneshot, and monitor. I want to make sure that I don't impact the loading of new real-time data.
Splunk has a configuration-free input type called oneshot that's ideal for this task.
From the UI, it is labeled as "Manager >> Data inputs >> Files & Directories >> Add New >> Index a file on the Splunk server" and from the CLI it's invoked as "splunk add oneshot [-source sourcename] [-sourcetype sourcetype]"
When added, the input begins immediately regardless of whether Splunk has seen this particular file before.
The inputs can be tracked via the REST management API like:
wget https://localhost:8089/services/data/inputs/oneshot --no-check-certificate --user admin --password changeme -O -
Note that oneshot input can only load files (including archives). To load full directories, oneshot should be called per file in the directory.
Let's say you have a directory full of large log files that you want to feed to the Splunk oneshot command with a delay between files. The bash script below is a clean way to do this.
For example, if you had someapp2011-01-01.log through someapp2011-12-31.log in /some/directory and you wanted to feed one file every 5 minutes:
#!/bin/bash
SPLUNK_HOME=/your/path/to/splunk
# The glob expands to full paths, so "$f" is passed as-is (no directory prefix).
for f in /some/directory/*.log
do
    echo "Processing $f file..."
    "$SPLUNK_HOME/bin/splunk" add oneshot "$f" -index someindex -sourcetype sometype -host somehost -auth admin:changeme
    echo "Finished feeding $f... pausing 5 minutes."
    sleep 300
done
NOTE: This approach works with Splunk 3.4.x (possibly earlier) and in Splunk 4.0+. The oneshot mode may be preferable if you are only working with Splunk 4.0+.
I've had success by copying log files into the $SPLUNK_HOME/var/spool/splunk folder, which is the default batch mode input. This works best if your historical log files are already on an indexer or forwarding instance.
You can add a special ***SPLUNK*** header to the start of your file to give it the original path, if that is important to you. (You can also set index, sourcetype, and host.)
You could, for example, load old mail logs from your old directory with a command like so:
(echo '***SPLUNK*** source=/var/log/mail'; zcat /var/log/old/mail*.gz) > $SPLUNK_HOME/var/spool/splunk/mail.log
This uses your shell to match files with similar names and combine them into a single file in the spool directory. You can use other simple shell scripting techniques to make this as simple or complicated as you need. You can also throttle the data with some sleep commands to keep from overwhelming your indexer. (Yeah, not a very high-tech solution, but it can work.)
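The header-plus-throttling idea above could be sketched roughly like this. It is only a sketch under assumptions: the function names, the SPOOL_DIR and PAUSE_SECONDS variables, and the choice of one spool file per archive are mine, not anything Splunk ships. It uses gzip -dc, which behaves like zcat for .gz files.

```shell
#!/bin/sh
# Hypothetical settings: adjust to your install. Both can be overridden
# from the environment.
SPOOL_DIR="${SPOOL_DIR:-/your/path/to/splunk/var/spool/splunk}"
PAUSE_SECONDS="${PAUSE_SECONDS:-300}"

# Write one gzip archive into the spool directory, prefixed with a
# ***SPLUNK*** header line that sets the source.
spool_archive() {
    gz_file="$1"
    header_source="$2"
    out_name=$(basename "$gz_file" .gz)
    {
        echo "***SPLUNK*** source=$header_source"
        gzip -dc "$gz_file"
    } > "$SPOOL_DIR/$out_name"
}

# Spool several archives, pausing after each one to limit indexer load.
spool_all() {
    wanted_source="$1"
    shift
    for next_archive in "$@"; do
        spool_archive "$next_archive" "$wanted_source" || return 1
        sleep "$PAUSE_SECONDS"
    done
}

# Example call (not run automatically):
#   spool_all /var/log/mail /var/log/old/mail*.gz
```

Spooling one file per archive, rather than concatenating everything into a single file as in the one-liner, is what makes the pause between files meaningful.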
Be careful when loading events that are more than a year old if the year is missing from the event's timestamp. (For example, syslog files often don't show the year.) One way around this problem is to put a timestamp in the filename (or at least the 4-digit year portion of the timestamp). Most often, Splunk will recognize this and load the events with the correct date, but it's best to keep an eye out for this problem.
You may also want to consider using a separate index for loading historical data. Splunk 4.0+ handles wide date ranges much better than earlier versions, but there could still be advantages to using this approach even with Splunk 4.0. For example, if you aren't sure you have your props.conf indexing settings set up correctly yet (timestamp parsing, sourcetype matching, etc.), then loading into an independent (and potentially throw-away) index could pay off big time in terms of cleanup time (especially compared to manually finding and deleting incorrectly indexed events). You can always merge buckets from your temporary index into your desired destination index.
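As a rough sketch of that throw-away index workflow from the CLI: the index name history_trial and the helper function names are made up for illustration, -auth is omitted, and the assumption that splunkd must be stopped before cleaning an index should be checked against your version.

```shell
#!/bin/sh
# Hypothetical settings: adjust to your install.
SPLUNK_CMD="${SPLUNK_CMD:-/your/path/to/splunk/bin/splunk}"
TRIAL_INDEX="${TRIAL_INDEX:-history_trial}"

# Create the scratch index.
create_trial_index() {
    "$SPLUNK_CMD" add index "$TRIAL_INDEX"
}

# Load one file into the scratch index with the sourcetype under test.
load_into_trial() {
    "$SPLUNK_CMD" add oneshot "$1" -index "$TRIAL_INDEX" -sourcetype "$2"
}

# Throw the trial data away. Assumes splunkd has to be stopped first;
# the clean command asks for confirmation when run interactively.
discard_trial() {
    "$SPLUNK_CMD" stop &&
    "$SPLUNK_CMD" clean eventdata -index "$TRIAL_INDEX" &&
    "$SPLUNK_CMD" start
}

# Example session (not run automatically):
#   create_trial_index
#   load_into_trial /some/directory/someapp2011-01-01.log sometype
#   ...search the trial index, fix props.conf if needed...
#   discard_trial
```

Note that stopping splunkd also pauses your real-time inputs for a moment, so the cleanup step is best run in a quiet window.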
To oneshot an entire directory recursively on Windows, the following worked for me in PowerShell:
forfiles /p D:\tutorialdata /s /c "cmd /c if @isdir==FALSE D:\Splunk\bin\splunk.exe add oneshot @PATH"
It may be useful to note that oneshot will also let you specify -host and -index parameters.
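On Linux or other Unix-like systems, a rough equivalent of the recursive forfiles one-liner can be built on find. This is a sketch only: oneshot_tree and the SPLUNK_CMD variable are names invented here, and any extra arguments (such as -index or -host) are simply passed through to each oneshot call.

```shell
#!/bin/sh
# Hypothetical setting: adjust to your install.
SPLUNK_CMD="${SPLUNK_CMD:-/your/path/to/splunk/bin/splunk}"

# Run "splunk add oneshot" once for every regular file under a directory.
# Arguments after the directory are appended to each oneshot call.
oneshot_tree() {
    top_dir="$1"
    shift
    find "$top_dir" -type f -exec "$SPLUNK_CMD" add oneshot {} "$@" \;
}

# Example call (not run automatically):
#   oneshot_tree /some/directory -index someindex -host somehost
```

Using find with -exec avoids parsing ls output, so file names with spaces are handled correctly.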