Recently I migrated the Windows Splunk server in our QA environment to Ubuntu 10.04.
Things were working well for a week, but today splunkd crashed. I suspect the cause is that the open file limit was set too low on the Linux server. I have since increased the open file limit and restarted splunkd.
Looking at the logs below, can anyone confirm whether this theory is correct? If not, any thoughts on why this happened? Thanks.
Tailing splunkd.log, the last error messages before the crash (after a bunch of INFO messages) are:
08-18-2011 15:40:31.187 +0000 ERROR JournalSlice - Cannot create new journal slice file: Too many open files, file="/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_42901/rawdata/0"
08-18-2011 15:40:31.188 +0000 ERROR JournalSlice - Failed to write header for rawdata
08-18-2011 15:40:31.188 +0000 INFO HotDBManager - no hot found for event ts=1303840686, closest match=id=42900 [et,lt,span,flush,lru]=[1303676631,1303676631,14400,9223372036854775807,1313682031] [expanded span=164055]
08-18-2011 15:40:31.188 +0000 FATAL HotDBManager - hot dir with id already exists in createDir: /opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_42901
08-18-2011 15:40:31.357 +0000 WARN EventLoop - Main Thread: about to throw a EventLoopException: error from PolledSocket write: Broken pipe
Tailing /var/log/messages:
Aug 18 06:41:13 QAIFSPLUNK02 rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="790" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Aug 18 15:40:31 QAIFSPLUNK02 kernel: [781860.711631] __ratelimit: 3 callbacks suppressed
Aug 18 15:40:31 QAIFSPLUNK02 kernel: [781860.711641] splunkd[12731]: segfault at 157a000 ip 0000000000f33280 sp 00007fb9ff7b90d0 error 4 in splunkd[400000+1017000]