The documentation says that Splunk computes a CRC hash of the first and last 256 bytes of a file in order to detect whether the file's content has already been processed (e.g. after log file rotation). Is this true? Recent observations make me believe that only the first 256 bytes and the file size are relevant. How exactly does this similar-file detection work?
What are the options to override or tune this behavior other than crcSalt=<SOURCE>? Is there a way to increase this 256-byte window (e.g. let Splunk use the first 512 bytes to detect similar files)?
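To make the suspicion concrete, here is a minimal sketch of the kind of check I assume is happening: compare a CRC over only the first N bytes of two files. This is an illustration of my hypothesis, not Splunk's actual implementation; `head_crc` and `same_head` are hypothetical helper names, and `cksum` merely stands in for whatever CRC Splunk uses internally.

```shell
# head_crc FILE [N]: print a CRC of the first N bytes of FILE (N defaults to 256).
# Assumption: cksum is only a stand-in for Splunk's internal CRC.
head_crc() {
    head -c "${2:-256}" "$1" | cksum | cut -d ' ' -f 1
}

# same_head FILE_A FILE_B [N]: succeed if both files have the same CRC
# over their first N bytes, i.e. they would look "similar" under my hypothesis.
same_head() {
    [ "$(head_crc "$1" "$3")" = "$(head_crc "$2" "$3")" ]
}
```

For example, `same_head timings1_0.csv timings1_1.csv && echo "looks like a duplicate"` reports a match whenever the first 256 bytes agree, regardless of what follows; passing a larger window as the third argument (e.g. 512) shows whether a longer prefix would tell the files apart.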
EDIT:
Here is an example, to illustrate what I mean:
The first 256 bytes of every file in the directory are the same:
sp@locutus:test_input$ for f in $(ls -1 .); do echo "head -c 256 $f | md5 = $(head -c 256 $f | md5)"; done
head -c 256 timings1_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_2.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_3.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_4.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_5.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
...
The last 256 bytes are different:
sp@locutus:test_input$ for f in $(ls -1 .); do echo "tail -c 256 $f | md5 = $(tail -c 256 $f | md5)"; done
tail -c 256 timings1_0.csv | md5 = de07cfe6f9b7209cbfdc3c63b5e45f66
tail -c 256 timings1_1.csv | md5 = b17470e217afcb23017596a569ce759a
tail -c 256 timings1_2.csv | md5 = 3aa94dfeb5014537e33bdd67ab7d16d0
tail -c 256 timings1_3.csv | md5 = 290d8c33f80a79a83bd02d10417ee8af
tail -c 256 timings1_4.csv | md5 = 292a292f17b01a4d4483712b70eddc68
tail -c 256 timings1_5.csv | md5 = 102566f80f0fb29a1ed8d5db5b26cce6
tail -c 256 timings2_0.csv | md5 = 61caa775c378b1c8887f2a442b546758
tail -c 256 timings2_1.csv | md5 = fd097acdbbb32391a4e0d9bccc37bc68
...
The file sizes differ as well:
sp@locutus:test_input$ for f in $(ls -1 .); do echo "du -h $f $(du -h $f)"; done
du -h timings1_0.csv 2,3M timings1_0.csv
du -h timings1_1.csv 8,6M timings1_1.csv
du -h timings1_2.csv 3,4M timings1_2.csv
du -h timings1_3.csv 3,1M timings1_3.csv
du -h timings1_4.csv 2,8M timings1_4.csv
du -h timings1_5.csv 2,8M timings1_5.csv
du -h timings2_0.csv 2,3M timings2_0.csv
du -h timings2_1.csv 7,3M timings2_1.csv
...
I added the directory to Splunk (it had not been monitored on this instance before), targeting an empty index "test":
sp@locutus:test_input$ splunk add monitor . -index test -sourcetype splunk_dup_test
Your session is invalid. Please login.
Splunk username: admin
Password:
Added monitor of '/Users/sp/temp/test_input'.
After waiting a fair amount of time (Splunk had finished indexing), I ran a search:
splunk search "index=test | stats count by source"
source count
---------------------------------------- -----
/Users/sp/temp/test_input/timings1_0.csv 11662
(Only one file was indexed.)
asked 13 Jul '10, 21:17 by ziegfried