I started using the pingstatus command in our application. It works great, but I do have a couple of problems:
My first problem is that it runs just one ping. Our networks are somewhat shaky, so running several pings and getting a packet loss percentage would be great. The ideal would be to add a parameter to the command such as count=... (so that I can run it like this: pingstatus url as IP count=5), and get back both pingdelay and pingloss or similar. I understand that it will create delays, but I can live with that.
My other problem is the pingdelay values that I'm getting: when I tested it on my Windows machine, I got all kinds of delays, usually quite small, similar to 0.000015 or even lower. However, when I moved it to the main network, which consists of many (dozens to hundreds of) Linux machines, most of the pingdelay values I'm getting are 0.0. Yes, that's right - a plain zero! Those which are not zero look like 0.00136 or so, but only a couple of percent of the records are non-zero. What's worse - I know that I'm dealing with quite a wide network there, with multiple locations and routers, so I would expect values in the 20-100 ms range, not 1 ms or below.
Now, for the first problem, I should probably look into pingstatus.py - but I'm far from being a Python expert, let alone Splunk Python. A push in the right direction is all I would like.
For the second one: is there a big difference between Windows and Linux ping.py behavior?
Since you don't know Python, I'm going to give you some sample code to change the pingstatus.py
    count = 1
    if len(sys.argv) > 1 and len(sys.argv) != 4 and len(sys.argv) != 5:
        print "Usage |pingstatus url as <local-field> (or have url field name in data) <optional-count>"
        sys.exit()
    elif len(sys.argv) == 4:
        urlfield = sys.argv[3]
    elif len(sys.argv) == 5:
        urlfield = sys.argv[3]
        # sys.argv values are strings; range() below needs an integer
        count = int(sys.argv[4])
That will get you your count argument as a number added to your pingstatus command. Don't use count=5 as input as you'll have to parse that. Just put in 5. For example: |pingstatus url as ip 5|table ip pingstatus*
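If you do want the count=5 form later, the parsing is small. The following is only a sketch: parse_count is a hypothetical helper, not part of pingstatus.py, and it assumes the count arrives as a single token such as 5 or count=5.

```python
def parse_count(token, default=1):
    """Return the ping count from either '5' or 'count=5'.

    Hypothetical helper for illustration; not part of pingstatus.py.
    """
    if token is None:
        return default
    # Split off an optional 'count=' prefix.
    key, sep, value = token.partition("=")
    if sep:
        if key.strip().lower() != "count":
            raise ValueError("unknown option: %r" % key)
        token = value
    count = int(token)  # raises ValueError for non-numeric input
    if count < 1:
        raise ValueError("count must be at least 1")
    return count
```

With this, both parse_count("5") and parse_count("count=5") give the integer 5, and anything else raises ValueError that the caller can turn into a usage message.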
Next, for the pingdelay field, you can use this approach.
    if urlfield in r:
        for i in range(1, count + 1):
            try:
                delay = ping.do_one(r[urlfield], timeout=2)
                if count == 1:
                    # single ping: keep the original field name
                    r["pingdelay"] = delay
                    continue
                else:
                    pingdelay = "pingdelay" + str(i)
                    r[pingdelay] = delay
            except socket.error, e:
                # mark a failed ping with a very large delay
                if count == 1:
                    r["pingdelay"] = 10000000
                else:
                    pingdelay = "pingdelay" + str(i)
                    r[pingdelay] = 10000000
This will create fields pingdelay1, pingdelay2, etc. if your count is greater than 1. This has not been tested, so you'll have to play with it. Also, don't just copy and paste from this Answers post, as the formatting may be wrong. In Python, proper indentation matters. In Splunk, to print your results, do:
|table pingdelay*
As for Windows vs Linux, I'm not sure why this is different, as I used a public domain ping.py program to get my results. For Windows you may have to find a version that is better suited for it. Keep in mind this is a reference implementation to give you an idea of how to do this. It is provided as is.
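One way to narrow down the zero delays is to time the call yourself and compare that with the value the ping module reports. The sketch below is not taken from ping.py: timed_call and fake_ping are hypothetical names, fake_ping stands in for ping.do_one so the example runs without raw-socket privileges, and it assumes the reported delay is in seconds (which would make 0.00136 about 1.36 ms). time.perf_counter needs Python 3.3 or later; on an older interpreter time.time() could be substituted.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_ms) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

def fake_ping(host, timeout=2):
    """Stand-in for ping.do_one (assumption): sleeps 20 ms, reports seconds."""
    time.sleep(0.02)
    return 0.02

# Compare what the ping function reports with what we measured around it.
reported, measured_ms = timed_call(fake_ping, "192.0.2.1", timeout=2)
print("reported %.3f ms, measured %.3f ms" % (reported * 1000.0, measured_ms))
```

If the measured time is tens of milliseconds while the reported delay is 0.0, the problem is likely in how the module takes its timestamps rather than in the network.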
To summarize our discussion and provide back a modified pingstatus command:
Attached is a pingstatus.py which adds an optional count parameter to ping more than once, and outputs the number of unsuccessful and successful pings as well as computes the average time. It's very crude (you can see I'm no Python programmer), but it gets its job done. pingsuccess and pingfail are the counts of successful/unsuccessful pings.
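The packet loss percentage asked for originally follows directly from these two counts. A minimal sketch, where ping_loss_percent is a hypothetical helper name and not a field the command emits:

```python
def ping_loss_percent(pingsuccess, pingfail):
    """Return the percentage of failed pings, or None if nothing was sent."""
    total = pingsuccess + pingfail
    if total == 0:
        return None  # no pings attempted, so the loss rate is undefined
    return 100.0 * pingfail / total
```

For example, 4 successful and 1 failed ping give 20.0.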
My version of pingstatus has four invocation formats:
pingstatus (the expected field containing the IP/hostname is url, count defaults to 1)
pingstatus count (uses the url input field and the provided count; example: pingstatus 5)
pingstatus url as ipfield (uses the provided field for IP/hostname, count is 1; example: pingstatus url as IP)
pingstatus url as ipfield count (uses ipfield as IP/hostname, pings count times; example: pingstatus url as IP 10)
See whether you like it and make any changes you feel are needed. I guess def usage would be in order: it could be used in try blocks surrounding count = int(sys.argv[?]) to catch ValueError.
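Something along these lines could work for the usage idea. It is an untested-in-Splunk sketch: parse_args and the USAGE text are hypothetical, and it only covers the four invocation formats listed above.

```python
import sys

USAGE = "Usage: | pingstatus [url as <local-field>] [count]"

def usage(message=None):
    """Print an optional reason plus the usage string, then exit."""
    if message:
        print(message)
    print(USAGE)
    sys.exit(1)

def parse_args(argv):
    """Return (urlfield, count) for the four invocation formats."""
    urlfield, count_token = "url", None
    if len(argv) == 1:
        pass                      # pingstatus
    elif len(argv) == 2:
        count_token = argv[1]     # pingstatus count
    elif len(argv) == 4:
        urlfield = argv[3]        # pingstatus url as ipfield
    elif len(argv) == 5:
        urlfield, count_token = argv[3], argv[4]
    else:
        usage()
    if count_token is None:
        return urlfield, 1
    try:
        count = int(count_token)
    except ValueError:
        usage("count must be an integer, got %r" % count_token)
    if count < 1:
        usage("count must be at least 1")
    return urlfield, count
```

The script would then start with urlfield, count = parse_args(sys.argv) instead of the if/elif chain.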
Edit: I attached the file, but I don't see how it can be downloaded. Let me know if you want the file and I'll paste the source somewhere.
Here's what we'll do. Paste the source somewhere, with comments in the file that credit you as the author of the changes. I'll add it as experimental_pingstatus.py to the bin directory of the distribution, so that those who want to explore the extra options can do so. This way, as you said, they can use the plain vanilla pingstatus as provided and look into the more advanced stuff as needed. Let me know where you pasted it. If there is enough character space, you can even paste it here on Answers.
Here it goes (part one - I hope it formats correctly):
    # Copyright (C) 2005-2011 Splunk Inc. All Rights Reserved. Version 4.x
    # Author: Nimish Doshi
    # Modified by: Arkady Zilberberg, 2015-03-31
    # Change history:
    #   Added an optional count of pings (see Usage: comment below)
    #   Adds fields:
    #     pingdelay (just as the original, though now averaged between successful pings)
    #     pingsuccess - the count of successful pings
    #     pingfail - the count of failed pings
    #     pingdelay1 through pingdelay<n> - actual pingdelays for each ping
    import sys,splunk.Intersplunk
    import string
    import ping
    import socket

    urlfield = "url"
    count = 1

    # Usage:
    #   pingstatus (ping once, generate pingdelay)
    #   pingstatus count (ping count of times, generate pingdelay - average, pingloss - tally the losses)
    #   pingstatus url as local-field (ping once, getting url from local-field)
    #   pingstatus url as local-field count (ping count of times, generate pingdelay - average, pingloss, get url from local-field)
    if len(sys.argv) == 1:
        pass
    elif len(sys.argv) == 2:
        count = int(sys.argv[1])
    elif len(sys.argv) == 4:
        urlfield = sys.argv[3]
    elif len(sys.argv) == 5:
        urlfield = sys.argv[3]
        count = int(sys.argv[4])
    else:
        print "Usage | pingstatus [url as <local-field>] [count] (or have field named 'url' in data)"
        sys.exit()

    results = []
    try:
        results,dummyresults,settings = splunk.Intersplunk.getOrganizedResults()
...and part two:
        for r in results:
            if urlfield in r:
                total_delay = 0
                pingsuccess = 0
                pingfail = 0
                for i in range(1, count + 1):
                    try:
                        delay = ping.do_one(r[urlfield], timeout=2)
                        total_delay += delay
                        pingsuccess += 1
                        pingdelay = "pingdelay" + str(i)
                        r[pingdelay] = delay
                        del delay
                    except NameError:
                        pingfail += 1
                    except TypeError:
                        pingfail += 1
                    except socket.error, e:
                        pingfail += 1
                r["pingdelay"] = 10000000 if pingsuccess == 0 else total_delay / pingsuccess
                r["pingsuccess"] = pingsuccess
                r["pingfail"] = pingfail
    except:
        import traceback
        stack = traceback.format_exc()
        results = splunk.Intersplunk.generateErrorResults("Error : Traceback: " + str(stack))

    splunk.Intersplunk.outputResults( results )
Loaded version 1.2.1 with your code and release notes.
If you feel like improving the code, please do so and send back the result.
Looking closer at your pingstatus.py:
I wonder why there is an if "_raw" in r part in the code. Does it mean that any filtered search without _raw (say, someone has | fields - _raw somewhere in the chain before piping it to pingstatus) will not ping at all? It's not an idle question - when one is dealing with summary searches, it is often necessary to remove the original _raw (and sometimes remove or replace _time as well) and just use some fields (one of which might be pingdelay generated by pingstatus). If that _raw removal happens earlier in the chain, then pingstatus will not work (I guess).
Also, as a potential future improvement - ping.py does not return anything from do_one in a few cases, definitely when the pinged host is unreachable. I think it would be good to catch NameError separately (as delay might not be defined at all after the do_one call) and del delay in between do_one invocations.
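An alternative to catching NameError and TypeError: a Python function that falls off the end without a return statement returns None, so the assignment still binds delay and the result can be checked explicitly. The sketch below assumes that is how the ping function signals "no reply"; ping_many and FAILED_DELAY are hypothetical names, and the ping function is passed in as a parameter so that any ping module can be used.

```python
import socket

FAILED_DELAY = 10000000  # same "no successful ping" value used in this thread

def ping_many(ping_fn, host, count, timeout=2):
    """Ping host count times; return the fields to merge into a result row."""
    fields = {}
    total_delay = 0.0
    pingsuccess = 0
    pingfail = 0
    for i in range(1, count + 1):
        try:
            delay = ping_fn(host, timeout=timeout)
        except socket.error:
            delay = None  # treat a socket error like a lost ping
        if delay is None:
            pingfail += 1
            continue
        total_delay += delay
        pingsuccess += 1
        fields["pingdelay" + str(i)] = delay
    # Average over successful pings only, or the failure value if none succeeded.
    fields["pingdelay"] = total_delay / pingsuccess if pingsuccess else FAILED_DELAY
    fields["pingsuccess"] = pingsuccess
    fields["pingfail"] = pingfail
    return fields
```

Inside the results loop this could be used as r.update(ping_many(ping.do_one, r[urlfield], count)).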
Again, this was a reference implementation. It was meant to test against _raw to see if a machine responds to a ping given an address to check. Nothing more. You are free to remove that if statement and simply look for the presence of the field that represents the host you are pinging (URL, hostname, IP, etc). Since I didn't want this to be tied to ping.py as users are free to add their own ping module, I didn't catch any specific exceptions from it, other than simply timing out.
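Dropping the _raw test could look roughly like this. It is only a sketch: rows_to_ping is a hypothetical helper, and it assumes each result row behaves like a dictionary of field names to values. Unlike a plain "urlfield in r" test, it also skips rows where the field is present but empty.

```python
def rows_to_ping(results, urlfield="url"):
    """Yield the rows that carry a non-empty host field, ignoring _raw."""
    for r in results:
        if r.get(urlfield):
            yield r
```

A row such as {"IP": "10.0.0.1"} with no _raw field would then still be pinged when the command is invoked as pingstatus url as IP.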
First of all let me tell you that it was a very useful "reference" implementation - thank you! I will try my changes and let you know how they work.
What I like about your pingstatus command is that it is absolutely minimalistic and gives the user full freedom to put it anywhere in the chain and modify the event as necessary. There are no forms or dashboards to manage, no permissions to give or take - just the functionality, pure and simple :).