Why does transaction group irrelevant events together with relevant ones? What am I doing wrong?
Sample Postfix log event:
Jan 20 14:04:57 smtp2 postfix/local[14880]: 0EA1B1027961: to=<REDACTED>, orig_to=<root>, relay=local, delay=0.04, delays=0.03/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
...Search:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction threadId host maxspan=24h
| search threadId=0EA1B1027961
... returns this result:
Jan 20 13:14:36 smtp2 postfix/smtpd[8618]: connect from <somehost>
Jan 20 13:14:36 smtp2 postfix/smtpd[8618]: disconnect from <somehost>
Jan 20 13:19:36 smtp2 postfix/smtpd[9238]: connect from <somehost>
Jan 20 13:19:36 smtp2 postfix/smtpd[9238]: disconnect from <somehost>
... <repeated ~ 20 times>
Jan 20 13:24:36 smtp2 postfix/smtpd[9857]: disconnect from <somehost>
Jan 20 14:04:57 smtp2 postfix/pickup[6583]: 0EA1B1027961: uid=0 from=<root>
Jan 20 14:04:57 smtp2 postfix/cleanup[14874]: 0EA1B1027961: message-id=<REDACTED>
Jan 20 14:04:57 smtp2 postfix/qmgr[1436]: 0EA1B1027961: from=<REDACTED>, size=1343, nrcpt=1 (queue active)
Jan 20 14:04:57 smtp2 postfix/local[14880]: 0EA1B1027961: to=<REDACTED>, orig_to=<root>, relay=local, delay=0.04, delays=0.03/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Jan 20 14:04:57 smtp2 postfix/qmgr[1436]: 0EA1B1027961: removed
Only the bottom few events include the search keyword 0EA1B1027961; the rest don't.
The above is the result of my attempt to use transaction on a standard Postfix log on our SMTP relays. The log is fairly simple: each email passing through (or deferred or rejected) produces several log entries that share a common session ID. I'd like to group them together and run various stats commands on them, yet my results are way off.
Original field extraction and transaction statement displaying results in a table:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction threadId host maxspan=24h
| search threadId=*
| table threadId status duration eventcount
| sort -duration
Results:
threadId status duration eventcount
0EA1B1027961 sent 3300 29
AB0711005181 sent 233 11
7BF211005181 sent 202 11
5CE8B1005181 sent 178 11
244971005181 sent 173 10
Clicking on an event and then "view events" brings up a "transaction" event like the one above, which includes irrelevant events.
Appreciate any help and apologies if my question is not clear.
Incidentals:
P.S. Excluding the host field from the transaction command returns better results (though not perfect, since I do want to isolate the results to a single host):
| transaction threadId maxspan=24h
So how do I ensure that including the host field in the transaction statement does not result in irrelevant events being included? In other words, that only events with both the relevant threadId and host values are grouped by transaction?
P.P.S. This may be related (from "Search Reference - Transaction"):
The events are grouped into transactions based on the values of this field. If a quoted list of fields is specified, events are grouped together if they have the same value for each of the fields.
What is "a quoted list of fields"?
Running transaction with maxspan=24h is going to bring your search head to its knees for any appreciable volume of search results. At first glance, transaction might appear to be the function you're looking for, but it really isn't. Stats is what you want.
I mocked up your data with just the meaningful fields like this:
| makeresults count=50
| eval raw=split("0EA1B1027961,2EB1F149F9CD,1D7620F38E4C,0EA1B1027961,DFC2597BE73E",",")
| eval statuschoices=split("sent,deferred",",")
| eval threadId=mvindex(raw,random()%5)
| eval alphabet=split("abcdefg","")
| eval status=mvindex(statuschoices,random()%2)
| eval host=mvindex(alphabet,random()%7)
`comment("Mocked-up sample data above with credit to to4kawa")`
`comment("The stats line immediately below is important since it does all the heavy lifting for you")`
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId
`comment("You need to compute the duration below by subtracting Beginning time from Ending time, both of which are in epoch format")`
`comment("I added some random value to Ending just to give some variation in duration")`
| eval Ending = Ending+random()%100000
| eval Duration = Ending - Beginning
`comment("Neither the Beginning nor Ending fields are important at this point, so remove them from the results")`
| fields - Beginning,Ending
| sort -Duration
It gives what I think you're looking for.
Adapting your search to this approach looks something like this:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId
| search threadId=*
| eval Duration = Ending - Beginning
| fields - Beginning,Ending
| sort -Duration
Hope that helps!
rmmiller
References:
https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Stats
https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Timefunctions
Thank you!
Is it possible to modify your search to include additional fields in the table? ("to", "from", "relay", "client", etc.)
Also, is it possible to do something similar to the maxpause and maxspan clauses in transaction with your search, i.e. if a group of events with the same threadId spans more than 1h, or there's a pause longer than 10 min, break them up into separate groups (logical events) for further analysis?
Overall - thank you! What you did is awesome and hopefully one day I'll wrap my mind around this and be able to craft similar searches. Regretfully I am not there yet. 🙂
(Back to transaction for a moment: would you know if it's possible to force an "AND" relationship between fields in transaction, to prevent the behavior I described in my OP?)
Regarding the cost of transaction: I think we can manage it, at least with respect to Postfix logs.
Running transaction with maxspan=24h is going to bring your search head to its knees for any appreciable volume of search results.
Running my original search with transaction over 30 days:
This search has completed and has returned 36,025 results by scanning 394,663 events in 12.567 seconds
This works for me, and it's unlikely I'll run it over 30 days often. A similar search over 24h: "968 results by scanning 12,090 events in 0.494 seconds". Past year: "6,087 results by scanning 2,957,109 events in 245.953 seconds".
Postfix is the simplest of the transactional logs we have; some others (Aspera, Tomcat) may run for hours using the same thread or session ID, and contain dozens of fields we need to run stats on. The key is to group events with the same session ID into a single transaction (per host) and then treat it as a single event for further analysis, even if it's expensive. I'm far from sure I'll be able to use stats for the purpose.
Possible to modify your search to include additional fields in the table? ("to", "from", "relay", "client", etc.)
Yes, just add additional items to the by clause of stats like this:
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId,relay,client
See my latest comment under your OP regarding what you're seeing with transaction.
One thing to keep in mind about whether your use of transaction is acceptable in terms of performance: there are memory limits (in limits.conf) that can affect the accuracy of searches using transaction. Stats lets you work around this limitation and is faster as well.
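For reference, the transaction-related limits live in a limits.conf stanza along these lines (setting names from memory; the values below are illustrative, not recommendations - check the limits.conf reference for your Splunk version):

```ini
# limits.conf (illustrative values only)
[transactions]
# Maximum number of not-yet-closed transactions kept in the open pool
maxopentxn = 5000
# Maximum total number of events held across all open transactions
maxopenevents = 100000
```

When these limits are exceeded, transaction evicts open transactions to stay under them, which is one way silently incomplete groupings can appear in large searches.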
There's a fantastic presentation from one of the previous Splunk conferences that explains the benefits of using stats.
https://conf.splunk.com/files/2016/slides/let-stats-sort-them-out-building-complex-result-sets-that-...
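Regarding your maxpause question: here's a sketch (untested on your data) of one way to approximate it without transaction, using streamstats to start a new session whenever the gap between consecutive events with the same threadId on the same host exceeds 10 minutes. The sessionNum and prev_time field names are ones I made up for this example:

```spl
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| sort 0 host threadId _time
| streamstats current=f last(_time) as prev_time by host,threadId
| eval newsession=if(isnull(prev_time) OR (_time - prev_time) > 600, 1, 0)
| streamstats sum(newsession) as sessionNum by host,threadId
| stats count, values(status) as Status, earliest(_time) as Beginning, latest(_time) as Ending by host,threadId,sessionNum
| eval Duration = Ending - Beginning
| fields - Beginning,Ending
| sort -Duration
```

Each (host, threadId, sessionNum) group then behaves like one transaction. A maxspan-style cut could be layered on the same way, by also setting newsession=1 when an event falls too far after the start of its current session.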
Hope that helps!
rmmiller
That produces no results... I'll dig into it...
Yes, just add additional items to the by clause of stats like this:
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId,relay,client
Hi @mitag
My apologies. I was in a rush. You are looking for the values of "relay" and "client" rather than splitting them out into different groups for stats, which is what my erroneous command would give you.
I was actually thinking this instead:
| stats count, values(status) AS Status, values(relay) AS Relay, values(client) AS client, earliest(_time) AS Beginning, latest(_time) AS Ending by host,threadId
The quoted list of fields looks like transaction "threadId host" maxspan=24h.
From Splunk doc team:
Turns out that the text refers to an old way that the “fields” argument worked. You do not specify fields in a quoted list. You simply list the fields immediately after the “transaction” command. Several of the Extended Examples show this.
Thanks - that's what I thought - but it must be something other than an "AND" between fields in transaction: I get no results when I use it. Would you know what Splunk means by "If a quoted list of fields is specified, events are grouped together if they have the same value for each of the fields"?
"The same value for each of the fields", i.e. events are grouped into a transaction based on threadId == host?
No, "the same value for each of the fields" should be interpreted as (threadId is the same in event A and event B) AND (host is the same in event A and event B) --> merge event A and event B into the same transaction.
Is it possible that my Splunk instance has a bug or something else is wrong with it? Else I don't understand why I'd get no results on this - when it's clear there should be:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction "threadId host" maxspan=24h
| search threadId=0EA1B1027961
I don't think you should have quotes around those 2 fields unless it was actually a single field named "threadId host". In this particular case, you have 2 distinct fields, i.e., one named "threadId" and another named "host". Neither one really needs quotes around it in order to function with transaction.
When using multiple fields with transaction, you'll get your expected grouping if each event includes both of the fields you specify. If some events are missing "threadId", they will probably not end up in the same transaction as those that have matching values in both the "threadId" and "host" fields.
Transaction is also really picky about the order of events. Events must be in descending time order, otherwise it can get confused and include some events that shouldn't be part of a transaction.
See https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Transaction#Usage for further explanations of both.
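A practical workaround that follows from this (a sketch, untested on your data): filter out events that have no threadId before running transaction, so they can never be merged into a group on host alone:

```spl
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| search threadId=*
| transaction threadId host maxspan=24h
| search threadId=0EA1B1027961
```

The smtpd connect/disconnect lines in your sample carry no queue ID, so the rex never extracts threadId for them; dropping them up front should keep them out of the transactions.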
Thank you. Perhaps transaction gets confused by the presence of several events with the exact same timestamp and takes it out on me by including unrelated events? 🙂
This though may be the answer to my OP:
You might see the following events grouped into a transaction:
event=1 host=a
event=2 host=a cookie=b
event=3 cookie=b
This effectively means there's no way to force the fields in transaction to behave in an AND fashion, which in turn means we can't use it for our purposes: it'll be bound to include unrelated events.
Unless that mysterious "quoted list of fields" Splunk refers to in their documentation (but gives no example of) is still a possibility?