|
Events are going missing from our search results. The "scanned events" total during the search is correct, but the "matched events" count is much smaller, even though we are running a simple "source=foo" type of search, which normally does not filter out any events from a source.

The events are missing from contiguous timestamp ranges. Out of 9 similar sources, they are only missing from the 2 sources bigger than 500MB, and they only went missing after we started forwarding new data for the same "source" and "host" (most of the data for the source comes from a massive uncompressed backlog archive). Before we redid everything on the 18th, the missing events were all those before 14:32 on 13 Nov, plus a big missing patch for 13-14 Nov, with only a few scattered patches of "matching" events since 14 Nov. Currently, events on the 2 sources are not "matching" for times before 17:23 on 17 Nov.

We redid our Splunk setup from scratch and encountered exactly the same problem. This is what we did to cause the problem:

Forward massive back-logs: On 18 Nov 2010 at 11:00, forward events from 9 uncompressed log archives totalling 1.7GB in a single monitored directory on a local Linux Box A to the "livelogs" index on the Windows Splunk server. These logs span 2009-11-01 00:00 to 2010-11-18 00:00. Besides the "license violation #1" warning on the 19th, this step goes smoothly. Searching for "*" showed "scanned events" equal to "matched events" the whole way, and every last line of the logs is accounted for. NO MISSING DATA.

Forward real-time events: Around 18 Nov 16:45, start forwarding real-time events, starting from 2010-11-18 00:00, from a monitored directory on a different Linux Box B (in a different country) to the "livelogs" index on the Windows Splunk server. These events have the same "source" and "host" as the imported backlogs, but come from the live environment and trickle in in real time. These log files are of course much smaller and roll at midnight (foo.log->foo.log.1).
Some back-logs were >500MB: The largest archives are a 540MB log starting from 1 Nov 2009 and a 777MB request log starting from 1 Oct 2010. There are also some smaller logs (180MB, 2MB, etc.), all starting from 1 Nov 2009.

Events go missing: We come in the next morning to discover that searching "source=" for the two largest sources (540MB and 777MB) scans all the events, but only events since 17:23 yesterday are "matching" and show up in the results. For the 180MB source, the 2MB source and the other sub-500MB sources, all events come back (matched events equal to scanned events).

The livelogs index: I notice that the "main" index only has hot_v1_0 and about 50MB of misc. logs. The "livelogs" index has hot_v1_1 (28MB) and db_1290031199_1257053531_0 (846MB). The "Sources.data" therein shows the 9 logs for timestamps up to 2010-11-17 23:59:59.

Search errors: Sometimes we get this:

Environment:
Related questions:

Clarifications for @gkanapathy

extract from system/local/indexes.conf:
extract from system/local/props.conf:
extract from system/local/transforms.conf:
|
|
I have a possible solution that I'm going to test. I was reading HowSplunkStoresIndexes and think the problem may be that the "livelogs" index, being a "custom" index, has only one hot bucket by default, and the maximum time range of a hot bucket is 90 days. So it makes sense that indexing 1.7GB of archival data spanning 12 months into "livelogs" would run into problems. Also, in indexes.conf the default quarantinePastSecs is 300 days, but part of each archival data source is as much as 360 days old, so I'm raising it to 420 days to be safe. After the archival data is indexed, I'm going to manually rotate the hot buckets to make sure there are fresh hot buckets for the new data. Here is my new system/local/indexes.conf:
It worked. The lesson is: when adding archival data, do not merely use a separate index, but ALSO configure it similarly to the settings for the "main" index found in default/indexes.conf. A single hot bucket combined with more than 90 days' worth of data arriving all at once is a recipe for disaster. Also, when your archival data is very old, increase quarantinePastSecs to avoid your oldest data being quarantined.
http://answers.splunk.com/questions/712/put-data-in-separate-index-based-on-timestamp
(23 Nov '10, 10:43)
grahampoulter
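As a rough illustration of the arithmetic behind the settings discussed above, the sketch below converts the day counts mentioned in this thread (the 90-day hot bucket range, the 300-day quarantine default as quoted above, and the 420-day replacement) into seconds, which is the unit that maxHotSpanSecs and quarantinePastSecs take in indexes.conf. The helper names are hypothetical, and this is not the poster's actual configuration, only a way to sanity-check the numbers before editing the file.

```python
# Sketch only: helper names are made up for illustration.
# Day counts come from the thread; nothing here reads or writes Splunk files.

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_secs(days: int) -> int:
    """Convert a number of days to seconds, the unit used by the *Secs settings."""
    return days * SECONDS_PER_DAY


def exceeds_hot_span(span_days: float, max_hot_span_days: float = 90) -> bool:
    """True if a data set spans more time than a single hot bucket is meant to cover."""
    return span_days > max_hot_span_days


def would_be_quarantined(age_days: float, quarantine_past_days: float) -> bool:
    """True if an event of the given age is older than the quarantine threshold."""
    return age_days > quarantine_past_days


if __name__ == "__main__":
    print("90 days  =", days_to_secs(90), "seconds")
    print("420 days =", days_to_secs(420), "seconds")
    # A 12-month backlog against a 90-day hot bucket range.
    print("backlog exceeds hot span:", exceeds_hot_span(365))
    # The oldest archival events are about 360 days old.
    print("quarantined at 300 days:", would_be_quarantined(360, 300))
    print("quarantined at 420 days:", would_be_quarantined(360, 420))
```

The point of the check is simply that a backlog roughly 360 days old falls outside both the 90-day and 300-day figures quoted in the thread, and inside the 420-day value chosen as a replacement.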
|
|
Hmm. Would you mind posting your indexes.conf settings by updating your post? Can you also clarify, for the logs that are "missing" data, whether you ever saw them in Splunk? That is, was the data there and searchable but then gone, or has it always (as far as you know) been unsearchable? Also, do you possibly have large logs either without timestamps, or with many hundreds of thousands of entries sharing identical timestamps (down to the second)?

I have added the requested clarifications to the post under "Clarifications for @gkanapathy".
(21 Nov '10, 07:29)
grahampoulter
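For anyone checking their own index directories for the same symptom, here is a minimal sketch that decodes a bucket directory name such as the db_1290031199_1257053531_0 mentioned in the question. It assumes the usual db_<newest>_<oldest>_<id> layout, where the two numbers are epoch seconds; the function names are hypothetical, and times are printed in UTC rather than the server's local time.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def parse_bucket_name(name: str) -> tuple[int, int, int]:
    """Split 'db_<newest>_<oldest>_<id>' into (oldest, newest, id) as integers."""
    parts = name.split("_")
    if len(parts) != 4 or parts[0] != "db":
        raise ValueError(f"unexpected bucket name: {name!r}")
    newest, oldest, bucket_id = (int(p) for p in parts[1:])
    return oldest, newest, bucket_id


def bucket_span_days(name: str) -> float:
    """Time range covered by the bucket, in days."""
    oldest, newest, _ = parse_bucket_name(name)
    return (newest - oldest) / SECONDS_PER_DAY


def format_utc(epoch_secs: int) -> str:
    """Render epoch seconds as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch_secs, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )


if __name__ == "__main__":
    bucket = "db_1290031199_1257053531_0"
    oldest, newest, _ = parse_bucket_name(bucket)
    print("oldest event (UTC):", format_utc(oldest))
    print("newest event (UTC):", format_utc(newest))
    print("span in days:", round(bucket_span_days(bucket), 1))
```

A bucket whose span is far longer than the 90-day hot bucket range discussed in the answer is a hint that a whole backlog was written through a single hot bucket.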
|
