
Search to find Agent/Server with No "Reconnect" Event

Johnstone234
Loves-to-Learn

Hi,


I'm hoping to get some help creating a search that will be turned into an alert. I'm working with system logs from a monitoring device: a log is submitted when any one of ~600 servers goes down, a new log is dropped every ~10 minutes while the server stays down, and a "Reconnect" log is submitted when the server comes back up.

I would like the search to return the name of any server/agent that has had at least one "disconnect" but no "reconnect" event within a given time period; once a reconnect is received, that server should no longer be listed.

I'm not very experienced with Splunk, and so far I only have a search that returns counts of both event types (connect/disconnect):

index="XXXlogs" sourcetype="systemlog" eventid="*connectserver" devicename="device1" logdescription="Agent*" | stats count by win_server, eventid

Any help is appreciated.


gcusello
SplunkTrust

Hi @Johnstone234,

If you want an alert when a server (with a Splunk Universal Forwarder installed) is down, you should create a lookup (e.g., "perimeter.csv") listing all the servers to monitor, with at least one field (e.g., "host"), and then run a simple search like:

| metasearch index=_internal
| eval host=lower(host)
| stats count BY host
| append [ | inputlookup perimeter.csv | eval host=lower(host), count=0 | fields host count ]
| stats sum(count) AS total BY host
| where total=0

You could schedule this alert to run, e.g., every 5 minutes, so that it fires whenever a server on your list hasn't sent any logs in the last 5 minutes.

The perimeter.csv lookup can be managed manually or maintained by a scheduled search (e.g., every night); I prefer to manage the list manually to keep more control over the perimeter being monitored.
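If you go the scheduled-search route, a minimal sketch could rebuild the lookup from the hosts that reported into _internal. It assumes the lookup keeps the single "host" column used above and that the saved search runs over a window long enough for every server to report at least once (e.g., the last 7 days); the window is a hypothetical choice:

```spl
| metasearch index=_internal
| eval host=lower(host)
| stats count BY host
| fields host
| outputlookup perimeter.csv
```

Note that outputlookup overwrites perimeter.csv on every run, so a server that stays silent for the whole window drops out of the perimeter and is no longer alerted on; that is one reason to prefer a manually managed list.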

Ciao.

Giuseppe


Johnstone234
Loves-to-Learn

Hi @gcusello 

The alert I want to set up is specifically for detecting when our monitoring device loses its connection to a server. I should have mentioned in my original post that each of these servers runs an agent that our device connects to; the alert would prompt us to check the device and find out why a disconnect occurred without a reconnect.

I do appreciate the response, though!


gcusello
SplunkTrust

Hi @Johnstone234,

Let me make sure I understand: do your devices send logs even when they are disconnected?

The problem is that if you don't receive any logs from a device, it will never appear in your stats results, so you need a perimeter lookup. You can take my approach and modify your search along these lines:

index="XXXlogs" sourcetype="systemlog" eventid="*connectserver" devicename="device1" logdescription="Agent*" 
| eval win_server=lower(win_server)
| stats count BY win_server
| append [ | inputlookup perimeter.csv | eval win_server=lower(win_server), count=0 | fields win_server count ]
| stats sum(count) AS total BY win_server
| where total=0

If instead you do get logs when a device is disconnected, what do the connected and disconnected messages look like?

Ciao.

Giuseppe


Johnstone234
Loves-to-Learn

@gcusello - Yes, the device I'm using sends me logs when the servers are "down" or when the device cannot reach them.

When the device is disconnected:

Agent - "Servername" - Error: Unable to reach Agent at "IP Address"

When the device is connected:

Agent - "Servername" - Info: Connected to Agent at "IP Address"


gcusello
SplunkTrust

Hi @Johnstone234,

In this case, you have to recognize whether, in the monitoring period (e.g., the last five minutes), you have only the disconnection event but not a following connection event. So please try something like this (adapt my hint, because I cannot test it):

index="XXXlogs" sourcetype="systemlog" eventid="*connectserver" devicename="device1" logdescription="Agent*" 
| eval Status=if(like(logdescription,"%Unable to reach Agent%"),"Down","Up")
| stats dc(Status) AS dc_status values(Status) AS Status count BY win_server
| where dc_status=1 AND Status="Down"
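The if() above labels every event that is not an "Unable to reach Agent" message as "Up", so any other Agent message would count as a reconnect. A minimal sketch that classifies the two message formats explicitly and discards everything else, assuming both texts appear in the logdescription field exactly as quoted earlier in the thread:

```spl
index="XXXlogs" sourcetype="systemlog" devicename="device1" logdescription="Agent*"
| eval Status=case(like(logdescription, "%Unable to reach Agent%"), "Down", like(logdescription, "%Connected to Agent%"), "Up")
| where isnotnull(Status)
| stats dc(Status) AS dc_status values(Status) AS Status BY win_server
| where dc_status=1 AND Status="Down"
```

case() returns null when no condition matches, which is what lets the where isnotnull(Status) line drop unrelated messages before the stats.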

Ciao.

Giuseppe


ITWhisperer
SplunkTrust

Rather than starting a match string with a wildcard (*), you might be better off using an OR condition:

eventid="connectserver" OR eventid="disconnectserver"

This assumes you can precisely define the eventids.
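An equivalent way to write the same filter, assuming a Splunk version whose search command supports the IN operator, is:

```spl
index="XXXlogs" sourcetype="systemlog" devicename="device1" logdescription="Agent*" eventid IN ("connectserver", "disconnectserver")
```

It matches the same events as the two OR'd conditions and stays readable if more event IDs are added later.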

Also, would it be sufficient to just get the latest eventid for each win_server/devicename and filter on that?

| stats latest(eventid) as eventid by win_server
| where eventid!="connectserver"


Johnstone234
Loves-to-Learn

@ITWhisperer - Thanks, I can specify the events precisely, so I have added that to the search:


index="XXXlogs" sourcetype="systemlog" (eventid="connectserver" OR eventid="disconnectserver") devicename="device1" logdescription="Agent*"
| stats latest(eventid) AS LatestState BY win_server
| rename win_server AS Server


From this, is there a way to have an alert fired when the "LatestState" has been "disconnectserver" for over a certain amount of time?


ITWhisperer
SplunkTrust

If you capture the times when it connects and when it disconnects, you can compare the latest of each to work out how long it has been disconnected. This run-anywhere example demonstrates the principle using random connect and disconnect events over one day:

| gentimes start=-1 increment=1h 
| eval state=if(random()%2=0,"connected","disconnected")
| rename starttime as _time 
| table _time state
| eval connecttime=if(state="connected",_time,null())
| eval disconnecttime=if(state="disconnected",_time,null())
| stats latest(connecttime) as lastconnect latest(disconnecttime) as lastdisconnect
| eval disconnectperiod=if(lastdisconnect>lastconnect,tostring(lastdisconnect-lastconnect,"duration"),"")
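Applied to the real events, the same idea can drive the alert directly. A minimal sketch, assuming the eventid values from the earlier posts, a search window long enough to include each server's most recent reconnect, and a hypothetical 30-minute threshold (1800 seconds):

```spl
index="XXXlogs" sourcetype="systemlog" devicename="device1" logdescription="Agent*" (eventid="connectserver" OR eventid="disconnectserver")
| eval connecttime=if(eventid="connectserver", _time, null())
| eventstats max(connecttime) AS lastconnect BY win_server
| fillnull value=0 lastconnect
| where eventid="disconnectserver" AND _time > lastconnect
| stats min(_time) AS down_since max(_time) AS last_down_event count AS down_events BY win_server
| eval down_seconds=now() - down_since
| where down_seconds > 1800
| eval down_for=tostring(down_seconds, "duration")
| convert ctime(down_since) ctime(last_down_event)
| table win_server down_since last_down_event down_events down_for
```

Because the device logs a new disconnect roughly every 10 minutes while a server stays down, this measures the outage from the first disconnect after the most recent reconnect, rather than as lastdisconnect - lastconnect, which would also include any time the server was up between reconnecting and going down again. A server drops out of the results as soon as a newer connectserver event arrives, and the alert can be set to trigger when the number of results is greater than 0.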
