When Splunk monitors hundreds/thousands of files, there seems to be a long lag between the time the event is generated and the time Splunk indexes the event and makes it searchable. In the worst cases, this lag can be many minutes, 15 minutes or more. What can I do to increase indexing throughput in this scenario?

asked 29 Jan '10, 21:31

hulahoop ♦


2 Answers:

Splunk's default settings may not account for usage outside the norm, and monitoring hundreds of active files falls into this category.

There are 2 settings you can adjust in limits.conf to increase the indexing throughput when a large number of active files is involved:

[inputproc]

max_fd = <integer>
* Maximum number of file descriptors that Splunk can use in the Select Processor.
* The maximum value honored is half the current number of allowed file descriptors per process. (ulimit -n /setrlimit NOFILES)
* If a value chosen is higher than the maximum allowed value, the maximum value is used instead.
* Defaults to 32.

time_before_close = <integer>
* Modtime delta required before Splunk can close a file on EOF.
* Tells the system not to close files that have been updated in past <integer> seconds.
* Defaults to 5.

For example, these settings increase the number of files Splunk monitors at once and shorten the wait before an idle file descriptor can be closed and recycled (2 seconds instead of the default 5):

[inputproc]
max_fd = 256
time_before_close = 2
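To see what value of max_fd would actually be honored on a given box, the cap described in the spec text above (half of the per-process file descriptor limit) can be worked out by hand. Here is a rough Python sketch of that arithmetic. The function names are made up for illustration and are not part of Splunk; the only rule used is the one quoted above.

```python
def effective_max_fd(requested: int, nofile_limit: int) -> int:
    """Apply the documented cap: at most half the per-process fd limit."""
    return min(requested, nofile_limit // 2)


def current_nofile_limit() -> int:
    """Soft per-process fd limit (what `ulimit -n` reports). Unix only."""
    import resource  # standard library, not available on Windows

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# With a per-process limit of 1024, max_fd = 256 is honored as-is,
# while max_fd = 1024 would be cut back to 512.
print(effective_max_fd(256, 1024))
print(effective_max_fd(1024, 1024))
```

On a Unix host, `effective_max_fd(256, current_nofile_limit())` gives the value for the current shell's limit. If the result is lower than what you asked for, raise `ulimit -n` for the Splunk user first.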

A more in-depth discussion on Splunk’s file monitoring system follows.

In order to understand Splunk file monitoring it is useful to know:

  • Each Splunk instance has a single monitoring thread
  • One file descriptor is used per source
  • File descriptors are recycled once EOF is reached
  • The default number of file descriptors used by Splunk is 32 (in limits.conf: max_fd = 32)
  • On most Unix systems, the default maximum number of fds a single process can open is 1024

Splunk monitors files using a sliding window. At startup, Splunk creates the configured number of file descriptors up front to save some overhead in opening and closing fds. From this pool of fds, Splunk begins monitoring the configured data inputs. When an fd reaches EOF, it is returned to the pool and immediately begins monitoring the next source in the queue.
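The sliding window is easier to picture with a toy model. The Python sketch below is only a simulation under simplifying assumptions, not Splunk's actual code: each source is just a count of unread bytes, the single thread reads one fixed-size chunk from every open source per pass, and a freed descriptor picks up the next queued source on the following pass. All names are hypothetical.

```python
from collections import deque


def simulate_monitoring(pending_bytes: dict, max_fd: int, chunk: int = 1) -> list:
    """Return the order in which sources reach EOF in this toy model."""
    queue = deque(pending_bytes.items())  # sources waiting for a descriptor
    open_slots = []                       # [name, bytes_left] holding an fd
    finished = []
    while queue or open_slots:
        # Hand free descriptors to the next sources in the queue.
        while queue and len(open_slots) < max_fd:
            name, size = queue.popleft()
            open_slots.append([name, size])
        # One pass of the single thread: read a chunk from each open source.
        still_open = []
        for name, left in open_slots:
            left -= chunk
            if left <= 0:
                finished.append(name)     # EOF: descriptor goes back to the pool
            else:
                still_open.append([name, left])
        open_slots = still_open
    return finished


backlog = {"a.log": 3, "b.log": 1, "c.log": 1}
print(simulate_monitoring(backlog, max_fd=2))  # b.log, c.log, a.log
print(simulate_monitoring(backlog, max_fd=1))  # a.log, b.log, c.log
```

With only one descriptor, the small files have to wait behind the large one, which is the kind of lag the question describes. With two, they get picked up while the large file is still being read.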

In past versions, Splunk created one thread per source. The overhead of managing the threads and context switching defeated the performance gains of monitoring files in parallel. Ultimately, Splunk is still constrained by I/O. By using a single thread, the context switching can be avoided and Splunk can better maximize the I/O throughput.

The number of file descriptors and the throughput per file descriptor are inversely proportional: the higher the number of fds, the lower the throughput each one gets. Therefore, increasing max_fd beyond a certain point yields diminishing returns. We believe this point to be about 256.
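As a back-of-the-envelope illustration of that inverse relationship, assume the total I/O throughput is a fixed budget shared evenly across the open descriptors. This is a simplification, and the 10240 KB/s figure below is made up, not a measured Splunk number.

```python
def per_fd_throughput(total_kb_per_s: float, open_fds: int) -> float:
    """Even split of a fixed I/O budget across open descriptors (assumption)."""
    return total_kb_per_s / open_fds


TOTAL_KB_PER_S = 10240.0  # hypothetical total read throughput of the box

for fds in (32, 64, 128, 256, 512):
    print(fds, per_fd_throughput(TOTAL_KB_PER_S, fds))
```

Under these assumptions, going from 32 to 256 descriptors drops each file's share from 320 KB/s to 40 KB/s, so more files are watched at once but each one drains more slowly.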

Please Note: File monitoring improvements in Splunk 4.1 will deliver a significant performance increase. It is not clear if this tuning will be required in 4.1.

Also Note: This tuning does not affect indexing of gzip files, which Splunk handles sequentially. If you have many gzip files, consider uncompressing them first so they can be monitored concurrently like regular files.

answered 29 Jan '10, 21:47

hulahoop ♦

edited 29 Jan '10, 23:13

The "ignoreOlderThan" inputs.conf parameter introduced in 4.2 deserves a mention:

ignoreOlderThan = <time window>
* Causes the monitored input to stop checking files for updates if their modtime has passed this threshold. This improves the speed of file tracking operations when monitoring directory hierarchies with large numbers of historical files (for example, when active log files are colocated with old files that are no longer being written to).
* A file whose modtime falls outside this time window when seen for the first time will not be indexed at all.

See inputs.conf.spec for more.
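To make the effect of the time window concrete, here is a small Python sketch of the filtering idea. It is an illustration of the description above, not Splunk's implementation; the function names, the file paths and the exact boundary comparison are assumptions.

```python
DAY = 86400  # seconds


def is_ignored(modtime: float, now: float, window_seconds: float) -> bool:
    """True if the file's modtime is older than the window (assumed strict >)."""
    return (now - modtime) > window_seconds


def files_to_check(modtimes: dict, now: float, window_seconds: float) -> list:
    """Paths that would still be checked for updates in this toy model."""
    return sorted(
        path for path, mtime in modtimes.items()
        if not is_ignored(mtime, now, window_seconds)
    )


now = 100 * DAY
modtimes = {
    "/var/log/app/current.log": now - 3600,      # written an hour ago
    "/var/log/app/archive-old.log": now - 30 * DAY,  # untouched for a month
}
print(files_to_check(modtimes, now, window_seconds=7 * DAY))
```

With a seven-day window only the recently written file stays in the tracking set, which is where the speedup comes from when a directory holds mostly historical files.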

(18 May '11, 08:25) hexx ♦

In my case, we have about 1600 actively written files for a syslog archive. About 30GB / day to disk.

I think it may be considered a "bad" practice, but I avoid the extra disk read I/O and CPU overhead by sending that data to Splunk over a single TCP syslog pipe. I use a transform to extract the host from the standard message itself, and other transforms to assign sourcetypes as needed.
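For anyone wondering what the host extraction amounts to, here is a Python sketch of the idea. This is not transforms.conf syntax and not the exact regex used in that setup. It assumes traditional BSD-style syslog lines (optional priority, timestamp, then hostname), and the pattern and function name are illustrative only.

```python
import re
from typing import Optional

# Optional <PRI>, then "Mmm dd hh:mm:ss", then the hostname as the next token.
SYSLOG_HOST = re.compile(
    r"^(?:<\d+>)?[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s(\S+)\s"
)


def extract_host(line: str) -> Optional[str]:
    """Return the hostname field of a syslog line, or None if it does not match."""
    match = SYSLOG_HOST.match(line)
    return match.group(1) if match else None


print(extract_host("<34>Oct 11 22:14:15 mymachine su: 'su root' failed"))
print(extract_host("Jan  5 03:04:05 web01 sshd[123]: Accepted publickey"))
print(extract_host("not a syslog line"))
```

The same capture-group approach is what a regex-based transform would rely on: the first group becomes the host, and lines that do not match keep whatever host the input assigned.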

I think there are two main caveats with this approach.

  1. You lose ability to auto-sourcetype by individual file source (for syslog sources)
  2. If Splunk goes down or restarts, you only have as much buffer as your syslog forwarder can handle. This may be no buffer at all, a queue in RAM, or some other automatic spooling mechanism.

I don't mind assigning sourcetypes as needed because I got tired of the cryptic and inconsistent auto-sourcetype names for sources that had low log volumes. We still collect non-syslog files too.

And also in my environment, nobody cries if we miss a few events here or there.

I hope this answer also helps someone.

answered 20 Apr '11, 10:08

gfriedmann

Asked: 29 Jan '10, 21:31

Seen: 2,101 times

Last updated: 18 May '11, 08:26

Copyright © 2005-2012 Splunk, Inc. All rights reserved.