Solved: Some of my data does not have the correct sourcetype. Can I change it?

Jaci · 04-16-2010

Is there a way to export the data that isn't correct and then re-import it using the correct sourcetype? If not, is there another way to change the sourcetype after the data has been indexed?

jrodman · 04-16-2010

The easiest method is to wipe the data and reindex.

Wiping the data can be global (splunk clean eventdata -index myindex) or more focused (splunk search "some data | delete"). The wrinkles of both methods are discussed elsewhere.
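As a rough illustration of the global wipe, here is a minimal sketch that prints the command sequence instead of running it. It assumes, to the best of my knowledge, that splunkd has to be stopped before clean will run. SPLUNK_BIN, DRY_RUN, run and wipe_index are names made up for this sketch, and myindex is the placeholder index from above.

```shell
#!/bin/sh
# Sketch only: SPLUNK_BIN, DRY_RUN, run and wipe_index are hypothetical names.
SPLUNK_BIN=${SPLUNK_BIN:-splunk}
DRY_RUN=${DRY_RUN:-1}     # default: print the commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop Splunk, wipe all events from one index, start Splunk again.
# Assumption: clean refuses to run while splunkd is up.
wipe_index() {
    index=$1
    run "$SPLUNK_BIN" stop
    run "$SPLUNK_BIN" clean eventdata -index "$index"
    run "$SPLUNK_BIN" start
}

plan=$(wipe_index myindex)
echo "$plan"
```

Set DRY_RUN=0 only after reading the printed plan. The clean step removes every event in the index, so the original data must still be available for reindexing.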

Another means is sourcetype renaming. If you want to alias an entire sourcetype to another one, you can do so in props.conf, e.g.:

[wrong_sourcetype]
rename = right_sourcetype

This clearly doesn't work if [wrong_sourcetype] is also a valid sourcetype in its own right, since the rename applies to every event carrying that sourcetype.

It's also possible to dump a bucket to CSV, manipulate that, and then generate a new bucket from the modified or filtered CSV data. This is sort of 'for wizards'.
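The "manipulate" step could be as small as a sed substitution. The sketch below is illustrative only: it assumes the exported CSV carries the sourcetype as a quoted "sourcetype::<name>" value, which you should verify against a real export before relying on it. The fix_sourcetype helper and the sample rows are made up for this example.

```shell
#!/bin/sh
# ASSUMPTION: the CSV holds the sourcetype as a quoted "sourcetype::<name>"
# value. Inspect a real export and adjust the pattern if it differs.

# Hypothetical helper: $1 = wrong name, $2 = right name.
# Reads CSV on stdin, writes the rewritten CSV to stdout.
# Only the first match on each line is replaced.
fix_sourcetype() {
    sed "s/\"sourcetype::$1\"/\"sourcetype::$2\"/"
}

# Made-up sample rows, not real exported data.
sample='1271400000,"source::/var/log/app.log","host::web01","sourcetype::wrong_sourcetype","first event"
1271400001,"source::/var/log/sys.log","host::web01","sourcetype::syslog","second event"'

fixed=$(printf '%s\n' "$sample" | fix_sourcetype wrong_sourcetype right_sourcetype)
printf '%s\n' "$fixed"
```

The names are pasted straight into the sed pattern, so this only suits simple sourcetype names without slashes or regex metacharacters. A raw event that happens to contain the same quoted string earlier on the line would be rewritten instead of the real field.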

The command to emit a bucket to CSV is:

splunk cmd exporttool bucketname filename.csv -csv

To generate a new bucket from the CSV, you can use:

splunk cmd importtool new_bucket_dir filename.csv

You will have to give new_bucket_dir the correct Splunk bucket name, either manually (for example by naming it the same as the original) or with some kind of script. I used the following shell fragment, where $bucket was the old bucket:

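# The bucket id is whatever follows the last underscore in the old bucket name.
bucket_id=$(echo "$bucket" | sed 's/.*_//')
# Strip the trailing -<number>.tsidx from each file name, leaving a "high low" pair.
(cd "$NEW_BUCKET"; ls *.tsidx | sed 's/-[0-9]\+\.tsidx$//' | sed 's/-/ /') | {
global_low=0
global_high=0
# Track the largest "high" and the smallest "low" across all tsidx files.
while read high low; do
    if [ $global_high -eq 0 ] || [ $high -gt $global_high ]; then
        global_high=$high
    fi
    if [ $global_low -eq 0 ] || [ $low -lt $global_low ]; then
        global_low=$low
    fi
done
REAL_BUCKET_NAME=db_${global_high}_${global_low}_${bucket_id}
# $bucket_dir is the directory that holds the index's buckets.
mv "$NEW_BUCKET" "$bucket_dir/$REAL_BUCKET_NAME"
}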
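The naming logic in the fragment above can be tried in isolation before touching a real index. This is a minimal sketch that uses shell parameter expansion instead of sed. The bucket_name_for helper and the file names are invented for the demonstration, and it assumes the .tsidx names have the <high>-<low>-<number>.tsidx shape that the fragment's sed expressions imply.

```shell
#!/bin/sh
# Hypothetical helper: print the bucket directory name implied by the
# .tsidx files in directory $1, reusing bucket id $2.
bucket_name_for() {
    dir=$1
    id=$2
    high=0
    low=0
    for f in "$dir"/*.tsidx; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}                # drop the directory part
        name=${name%.tsidx}          # drop the extension
        name=${name%-*}              # drop the trailing -<number>
        h=${name%%-*}                # first timestamp
        l=${name#*-}                 # second timestamp
        if [ "$high" -eq 0 ] || [ "$h" -gt "$high" ]; then high=$h; fi
        if [ "$low" -eq 0 ] || [ "$l" -lt "$low" ]; then low=$l; fi
    done
    echo "db_${high}_${low}_${id}"
}

# Demonstration on made-up file names in a scratch directory.
demo=$(mktemp -d)
touch "$demo/1271400000-1271300000-111.tsidx" \
      "$demo/1271450000-1271350000-222.tsidx"
demo_name=$(bucket_name_for "$demo" 7)
echo "$demo_name"    # prints db_1271450000_1271300000_7
rm -rf "$demo"
```

Running it against a copy of the new bucket directory and comparing the printed name with the old bucket's name is a cheap sanity check before the mv.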

Once you have a newly constructed, duplicated bucket, you can remove the old one from your index and insert the new one.

The main problem with exporttool/importtool is that they're not well optimized, so they consume a significant amount of RAM and CPU for a significant amount of time. We'll be making them faster, but for now you should make sure you have plenty of headroom on the box where you run them.

If you want to go down that path, the full script (treat it as an example) is on the wiki here: http://www.splunk.com/wiki/Community:Modifying_indexed_data_via_export_and_import

Mick · 04-16-2010

No and no. Once data has been indexed, that's the state it's going to stay in. An export/import capability has been requested on a number of occasions, but it's not built yet. If you want to change the 'sourcetype' value, all you can really do is re-index the data.

If that's not possible, then the next best solution is to just use tags - http://docs.splunk.com/Documentation/Splunk/5.0/Knowledge/Defineandusetags
