|
I suspect that I may have duplicate events indexed by Splunk. Either my originating files contain dupes, or my Splunk configuration is indexing some events two or more times. To be sure, what search can I run to find all the duplicate events currently in my Splunk index?
|
I think it's safe to assume that if an event is duplicated (same value for
I'm not sure about Gerald's comment about multi-line events, since my de-dup checking was limited to single-line events, but it seems to me that some kind of
BTW, I found the
Also, in my case I was trying not only to get a count of duplicate events but also to figure out the extra volume (in bytes) that could have been avoided if the data had been de-duped externally before being loaded. I used a search like this:
This shows you the impact in megabytes per day.
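To make the "de-duped externally before being loaded" idea concrete, here is a minimal Python sketch of a pre-load pass over single-line events. It is an illustration, not part of Splunk: the function name `dedupe_lines` and the sample lines are made up, and it assumes each event is one line that already contains its own timestamp, so an exact repeat of a line is treated as a duplicate. It counts UTF-8 bytes, which can differ from a character count for non-ASCII data.

```python
def dedupe_lines(lines):
    """Split single-line events into first occurrences and duplicates.

    Returns (unique_lines, extra_events, extra_bytes), where extra_events is
    the number of repeated lines and extra_bytes is their total UTF-8 size.
    """
    seen = set()
    unique = []
    extra_events = 0
    extra_bytes = 0
    for line in lines:
        if line in seen:
            # An exact repeat: this is volume that could have been avoided.
            extra_events += 1
            extra_bytes += len(line.encode("utf-8"))
        else:
            seen.add(line)
            unique.append(line)
    return unique, extra_events, extra_bytes


# Hypothetical sample data, for illustration only.
sample = [
    "2010-02-16 00:23:01 login user=alice",
    "2010-02-16 00:23:01 login user=alice",
    "2010-02-16 00:23:02 logout user=alice",
]
unique, extra_events, extra_bytes = dedupe_lines(sample)
print(extra_events, extra_bytes)
```

Note that this keeps every distinct line in memory, so it only suits files that fit comfortably in RAM.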
Lowell is absolutely right that this transaction will be MUCH, MUCH faster than anything involving stats because of its favorable eviction policy. Transaction, especially with maxspan set, will only keep data for the current second in memory, as search scans backwards through time.
(25 Aug '10, 03:46)
Stephen Sorkin ♦
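The memory argument in the comment above can be pictured with a small Python sketch. This is only an illustration of the idea of evicting state once the scan leaves the current second; it is not how Splunk implements transaction, and the function name and input shape (newest-first pairs of epoch second and raw text) are assumptions made for the example.

```python
def count_duplicates_streaming(events):
    """Count duplicate events while holding only one second of state.

    events: iterable of (epoch_second, raw) pairs, sorted newest first.
    Returns (extra_events, max_tracked), where max_tracked is the largest
    number of distinct raw values held in memory at any one time.
    """
    current_second = None
    counts = {}
    extra_events = 0
    max_tracked = 0
    for second, raw in events:
        if second != current_second:
            # The scan moved to an earlier second: drop everything held so far.
            counts = {}
            current_second = second
        counts[raw] = counts.get(raw, 0) + 1
        if counts[raw] > 1:
            extra_events += 1
        max_tracked = max(max_tracked, len(counts))
    return extra_events, max_tracked


# Hypothetical events, newest first.
events = [(3, "a"), (3, "a"), (2, "b"), (2, "c"), (2, "b"), (1, "a")]
print(count_duplicates_streaming(events))
```

A grouping that has to remember every distinct (_time, _raw) pair until the end of the search would instead grow with the whole time range, which is the contrast the comment is drawing.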
Stephen, it would be nice if there was a search command that could remove duplicates -1; I'm not sure what the impact would be. Something like: * | tag_dupes | delete
(25 Aug '10, 07:37)
Marinus
|
|
Try appending this search string to your current search to find duplicates:
| transaction fields="_time,_raw" connected=f keepevicted=t | search linecount > 1
This won't work if the original data is multiline. But you could fix that with
(16 Feb '10, 00:23)
gkanapathy ♦
Actually now that I think about it:
(16 Feb '10, 00:47)
gkanapathy ♦
Agreed. showdupes filter=all|latest would be very beneficial, especially when debugging input configs.
(01 Apr '10, 13:10)
maverick ♦
+1, needed. Has it been filed? (Don't forget to accept your current answer, unless it doesn't satisfy.)
(05 Apr '10, 19:20)
jrodman ♦
|
|
The original search, with some typos fixed:
sourcetype=* | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | timechart span=1s sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb
To show the number of extra events and their size by host and sourcetype:
sourcetype=* | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | stats sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb by host,sourcetype
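For anyone who wants to check the arithmetic in the second search outside Splunk, here is a rough Python sketch of the same calculation. The function name and the event dictionaries are hypothetical, and it makes two simplifying assumptions: events are grouped into fixed one-second buckets (transaction's maxspan=1s is a span, not a fixed bucket), and host and sourcetype are part of the grouping key.

```python
from collections import defaultdict


def extra_volume_by_source(events):
    """Sum duplicate events and their size in MB per (host, sourcetype).

    events: iterable of dicts with keys "host", "sourcetype",
    "time" (epoch seconds) and "raw" (the event text).
    """
    # Count identical raw text seen in the same second from the same source.
    groups = defaultdict(int)
    for e in events:
        key = (e["host"], e["sourcetype"], int(e["time"]), e["raw"])
        groups[key] += 1

    totals = defaultdict(lambda: {"extra_events": 0, "extra_mb": 0.0})
    for (host, sourcetype, _second, raw), eventcount in groups.items():
        if eventcount > 1:
            extra_events = eventcount - 1          # all copies after the first
            extra_bytes = extra_events * len(raw)  # mirrors len(raw) in the search
            entry = totals[(host, sourcetype)]
            entry["extra_events"] += extra_events
            entry["extra_mb"] += extra_bytes / 1024.0 / 1024.0
    return dict(totals)


# Hypothetical events: three copies of one 1 KB event, plus one unique event.
demo_events = [
    {"host": "web1", "sourcetype": "syslog", "time": 100.0, "raw": "x" * 1024},
    {"host": "web1", "sourcetype": "syslog", "time": 100.2, "raw": "x" * 1024},
    {"host": "web1", "sourcetype": "syslog", "time": 100.5, "raw": "x" * 1024},
    {"host": "web2", "sourcetype": "syslog", "time": 100.0, "raw": "unique"},
]
print(extra_volume_by_source(demo_events))
```

Sources with no duplicates do not appear in the result, which matches the effect of the eventcount>1 filter.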