Getting Data In

Why is the monit process sometimes restarting Splunk?

mataharry
Communicator

I have Linux servers running Splunk, and I use monit to monitor my processes.

But sometimes I see an issue where monit restarts Splunk unexpectedly.

1 Solution

yannK
Splunk Employee

Sometimes monit fails to read Splunk's pid file and decides too quickly that Splunk is down.

There are several common scenarios:
- Splunk restarted for no reason (the pid file was being updated by a search process or child processes, and monit gave up too quickly).
- Splunk started twice during a restart (Splunk deletes the pid file when shutting down, and monit can restart it too quickly, ending up with 2 Splunk processes and a port conflict).

Please tune your monit logic to retry / wait more cycles before jumping the gun.
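
If the check is done from a script, the "retry before jumping the gun" idea could look roughly like the sketch below. The helper name 'retry_check' and its arguments are hypothetical, not part of monit or Splunk; it only reports failure after several consecutive failed probes.

```shell
# retry_check ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: succeeds as soon as COMMAND succeeds; fails with
# COMMAND's last exit status only after ATTEMPTS consecutive failures.
retry_check() {
    local attempts=$1
    local delay=$2
    shift 2
    local i=1
    local status=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        status=$?
        # Pause between probes, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return "$status"
}
```

For example, 'retry_check 3 10 /opt/splunk/bin/splunk status' would treat Splunk as down only after three failed probes spaced 10 seconds apart (the numbers are illustrative).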


awyszkowski
Splunk Employee

Here are some pointers for a "real world" Splunk process monitor in Monit that will bring Splunk back up when 'splunk status' reports it down.

First, we want better downtime detection. Do away with pid checking and port checking: they often lead to confusion, because pids can be somewhat fluid with Splunk. Also, "normal" operation of Splunk can involve a restart (a rolling restart, an administrative restart invoked from the GUI after installing an app, etc.). It is best to use Splunk's own 'splunk status' command and exploit the fact that its exit status is meaningful: 0 means Splunk is running; any other status means it is not running, or that there was an issue determining its state.

Second, monit tends to want to shut the service down before restarting it. This can get ugly if Splunk was actually running. So rather than using restart logic, just use 'splunk start' to get it going again ('splunk start' is effectively a no-op if Splunk is already running, as opposed to a stop-start).

Note: this does no alerting; it merely starts Splunk when it is detected down in two consecutive checks 5 minutes apart (you might have to tweak these settings if your global monit polling frequency is different).
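
The start-only idea can also be sketched as a standalone shell function, outside monit. This is a minimal illustration, not a replacement for the monit setup: the function name 'ensure_splunk_running' and the 'SPLUNK_BIN' variable are hypothetical, and the default path assumes Splunk is installed under /opt/splunk.

```shell
# Path to the splunk binary; assumed default, override via the environment.
SPLUNK_BIN=${SPLUNK_BIN:-/opt/splunk/bin/splunk}

# Start Splunk only if 'splunk status' exits non-zero; never issue a stop.
ensure_splunk_running() {
    if "$SPLUNK_BIN" status >/dev/null 2>&1; then
        echo "already running"
        return 0
    fi
    echo "starting"
    "$SPLUNK_BIN" start
}
```

Because the function never calls 'splunk stop', a false "down" reading costs at most a redundant start rather than an unwanted stop-start.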


Assuming you have the following setting in /etc/monitrc

# Polling frequency
set daemon 20

In /etc/monit/splunk_health.sh (new file)

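#!/bin/bash
# Capture stdout and stderr of 'splunk status', then its exit code.
TEXT=$(/opt/splunk/bin/splunk status 2>&1)
STATUS=$?
# Send the status text to stderr.
>&2 echo "$TEXT"
# Propagate the exit code: 0 means running, anything else means down or undetermined.
exit $STATUS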

In /etc/monit/conf.d/splunk.monitrc (probably a new file)

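# Run the health script every 15 cycles (15 x 20 seconds = 5 minutes).
check program splunkd with path "/etc/monit/splunk_health.sh" every 15 cycles
    start program = "/usr/sbin/service splunk start"
    stop program = "/usr/sbin/service splunk stop"
    # Start (never restart) Splunk after two consecutive failed checks.
    if status != 0 for 2 cycles then start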
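
The 5-minute window comes from the 20-second polling interval multiplied by 'every 15 cycles'. If your polling interval differs, the cycle count can be recomputed with a small helper like the one below. The function name 'cycles_for_window' is hypothetical; it rounds up so the window is never shorter than requested.

```shell
# cycles_for_window INTERVAL_SECONDS WINDOW_SECONDS
# Print how many monit cycles cover WINDOW_SECONDS at the given polling interval.
cycles_for_window() {
    # Integer ceiling division: (window + interval - 1) / interval.
    echo $(( ($2 + $1 - 1) / $1 ))
}
```

With the settings above, 'cycles_for_window 20 300' prints 15; with 'set daemon 30' it would print 10.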
