We have a configuration that has been idling for over two days: instead of processing the locations that the tailing processor has acknowledged, it continues to loop over previously processed locations and its own internal logs.
Does this mean there's a practical/hard limit on the number of directories that can be absorbed? It seems other monitor inputs are being neglected somewhat. The tailing processor acknowledged these directories 2 days ago, but had not yet processed down to the bottommost level (the files themselves).
Are there any good commands to inspect what the tailing processor is up to? What's on the queues etc?
The 4.1.x implementation of file input leans heavily on stat() to get information about which files should be opened and read. For files on network storage, it's quite possible for the total latency of a very large number of stat() calls to become unreasonably large.
I understand you're seeing that the NFS system is not overloaded, but what sort of latency are we seeing? I'm not familiar with troubleshooting with nfsstat; I would think to look at the I/O picture with iostat.
What is your total number of files in the monitored hierarchies? Essentially, a number something like this:
find /the /directories /you /monitor | wc -l
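If you want a rough sense of how much wall-clock time one pass of stat() calls costs on your storage, a quick sketch like the following can help (the path is a hypothetical stand-in for your monitored directories; this is an illustration, not how Splunk itself walks the tree):

```python
import os
import time

def stat_latency(root):
    """Walk a directory tree, stat() every file, and report
    the file count and total wall-clock time spent in stat()."""
    count = 0
    total = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            start = time.perf_counter()
            try:
                os.stat(path)
            except OSError:
                continue  # file vanished between readdir() and stat()
            total += time.perf_counter() - start
            count += 1
    return count, total

# Example against a hypothetical monitored hierarchy:
# n, t = stat_latency("/the/directories/you/monitor")
# if n:
#     print(f"{n} files, {t:.2f}s total, {t / n * 1000:.3f} ms per stat()")
```

On local disk the per-call figure is usually negligible; on NFS it is exactly the number you want to multiply by your file count.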
The specifics of the monitor lines might be useful, as well as any information about subdirectories of these you might not actually be interested in.
answered 14 Jun '10, 21:41
We've seen a number of customer issues recently with large numbers of files on relatively slow storage (NFS, UNC, anything non-local). This appears to have been caused by a subtle bug in the 4.1.x line of monitor:// code, in which large numbers of slow stat() calls end up starving out readdir() calls - meaning new files don't get picked up.
The next scheduled maintenance release will resolve this issue.
There is a bug in the 4.1.3 and previous builds that is triggered by the slow NFS stat() calls, that much we know for sure. If your Splunk instance just stops indexing data, or stops picking up new files, then it's highly likely you are hitting this bug. Splunk should never stop indexing data from monitored files and discovering new files as long as new data is available.
Theoretically, there should be no limit to the number of files that Splunk can monitor. We depend on the OS to tell us which files have changed, so if the OS knows then Splunk will know too. An important distinction here is a 'live' file vs a 'static' file. Live files are currently being updated/written to, and Splunk will pick up new data from here as long as it keeps being added. Static files should be indexed and then ignored indefinitely.
The concept of 'real-time' can also play a major role here. How quickly do you want Splunk to pick up data once it's written to a file? If your requirement is that Splunk should display the data as quickly as possible, then you want to limit the number of live files that Splunk is monitoring. If you have 20,000 live files then Splunk will not be able to keep up to date with every single file at the same time. However, if you have 20,000 files and only 50 of them are live, then Splunk should be easily able to keep up - are we starting to make sense yet?
Our testing has shown that when tailing 10,000 live files, you can expect somewhere between 30 seconds and 1 minute lag. Those timings should increase linearly as you increase the number of files, so at 20,000 files you can expect 1 - 2 minutes delay.
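Taking those numbers at face value, the lag is a simple linear extrapolation from the measured range. A toy model (the 30-60 second figures at 10,000 files come from the testing above; everything else is illustrative):

```python
def expected_lag_seconds(live_files, sec_per_10k=(30, 60)):
    """Linear extrapolation of tailing lag from the measured
    30s-60s range observed at 10,000 live files."""
    lo, hi = sec_per_10k
    scale = live_files / 10_000
    return lo * scale, hi * scale

lo, hi = expected_lag_seconds(20_000)
print(f"20,000 live files: {lo:.0f}s - {hi:.0f}s lag")  # 60s - 120s
```

Which matches the 1-2 minute estimate for 20,000 files, assuming the scaling really does stay linear on your hardware.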
There's no hard and fast rule here, the speed of Splunk is dependent on the hardware resources available and how the instance is tuned - number of CPU cores, number of indexing threads, number of FD's, speed of the disk Splunk is writing to, data-segmentation etc. The faster Splunk can write to the index, the faster it can pull new data from files. Testing is the only true way to gauge the expected performance of your hardware, with your data.
answered 30 Jun '10, 22:46
I have had a similar problem (4.1.1, 4.1.2 and 4.1.3), although I wasn't on a particularly slow disk. I had tuned the max_fd for the lightweight forwarder up as high as the system would let me in order to pick up as many files as possible. (On that server, which has a 32,000 max, I could get the forwarder to 16,000.)
The forwarder would run for anywhere from 15 minutes to several hours before it would stop indexing anything. Sometimes it would continue to index one active file, but as soon as that file was quiet for longer than the time_before_close value, it would stop indexing that too.
The forwarder would continue to consume a lot of cpu and memory, it just didn't seem to be doing anything.
I throttled the max_fd back to 1024 last night, and now it seems to be keeping up just fine. Last week I had cut back the number of files it was traversing, so I didn't really need the max_fd = 16000 but that didn't seem to help the stability or latency.
I suspect there is a practical limit to the number of threads a forwarder can juggle internally. It is somewhere between 1024 and 16000 (at least on Solaris 10).
I have some forwarders with max_fd = 8192 and they seem to be running ok. (I need to look at them more closely now that I have something to study.) The instability threshold may actually be between 8192 and 16000. If I had a lab, and some time I could probably pin down the threshold more precisely.
My experience thus far is if you need to scan more than 8,000 files, you definitely need another forwarder (on the same system) - regardless of how fast your disk is. In fact I'd be inclined to recommend a forwarder for every 2,000 - 3,000 files. There was another thread on this here: http://answers.splunk.com/questions/3727/performance-of-forwarder-in-high-volume-environment/3742#3742 . I think we are stuck figuring out a rule of thumb by trial and error and experience.
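Turning that rule of thumb into a sizing estimate is straightforward arithmetic (2,500 is just a midpoint of the 2,000-3,000 range above, not any documented Splunk limit):

```python
import math

def forwarders_needed(total_files, files_per_forwarder=2_500):
    """Rough forwarder count from the one-forwarder-per-
    2,000-3,000-files rule of thumb; purely an estimate."""
    return max(1, math.ceil(total_files / files_per_forwarder))

print(forwarders_needed(16_000))  # 7
```

So a hierarchy like the 16,000-file case above would suggest six or seven forwarders on that box rather than one with a huge max_fd.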
answered 29 Jun '10, 14:25
The key to the final resolution is that our use case involves a lot of small files, and we have a notable latency (~5ms) between filer and indexer.
Splunk uses stat() and access() a fair bit during its various uptake cycles. With lots of small files (as opposed to a few big ones), Splunk is spending expensive, uncached iops to stat() the files as it traverses the inputs.
Had the situation been reversed (a few big files), the readahead cache would've kicked in, and the effect of the latency would've been negligible.
To mitigate this a little, we added forwarders closer to the source (<1ms), to take advantage of the lower RTT on the uncached iops. Curiously, we've observed NFS caching being drastically less effective on access() calls at higher latencies, but we're still investigating some of these interesting side-effects.
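The RTT effect is easy to see on the back of an envelope: every uncached metadata call costs at least one round trip, so the scan time floor scales with file count times latency. A sketch (the 50,000-file count is illustrative; two calls per file assumes a stat() plus an access(), and real NFS attribute caching and pipelining will change the constants):

```python
def scan_floor_seconds(n_files, rtt_ms, calls_per_file=2):
    """Lower bound on one scan pass when every file needs
    uncached metadata round trips (e.g. stat() plus access());
    ignores NFS attribute caching and request pipelining."""
    return n_files * calls_per_file * rtt_ms / 1000

print(scan_floor_seconds(50_000, rtt_ms=5))  # 500.0 (seconds at ~5ms RTT)
print(scan_floor_seconds(50_000, rtt_ms=1))  # 100.0 (seconds at <=1ms RTT)
```

Which is roughly why moving the forwarders from a ~5ms hop to a <1ms hop made such a large difference for a many-small-files workload.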
answered 17 Sep '10, 13:20