Why does transaction group irrelevant events together with relevant ones? What am I doing wrong?
Sample Postfix log event:
Jan 20 14:04:57 smtp2 postfix/local[14880]: 0EA1B1027961: to=<REDACTED>, orig_to=<root>, relay=local, delay=0.04, delays=0.03/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
...Search:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction threadId host maxspan=24h
| search threadId=0EA1B1027961
... returns this result:
Jan 20 13:14:36 smtp2 postfix/smtpd[8618]: connect from <somehost>
Jan 20 13:14:36 smtp2 postfix/smtpd[8618]: disconnect from <somehost>
Jan 20 13:19:36 smtp2 postfix/smtpd[9238]: connect from <somehost>
Jan 20 13:19:36 smtp2 postfix/smtpd[9238]: disconnect from <somehost>
... <repeated ~ 20 times>
Jan 20 13:24:36 smtp2 postfix/smtpd[9857]: disconnect from <somehost>
Jan 20 14:04:57 smtp2 postfix/pickup[6583]: 0EA1B1027961: uid=0 from=<root>
Jan 20 14:04:57 smtp2 postfix/cleanup[14874]: 0EA1B1027961: message-id=<REDACTED>
Jan 20 14:04:57 smtp2 postfix/qmgr[1436]: 0EA1B1027961: from=<REDACTED>, size=1343, nrcpt=1 (queue active)
Jan 20 14:04:57 smtp2 postfix/local[14880]: 0EA1B1027961: to=<REDACTED>, orig_to=<root>, relay=local, delay=0.04, delays=0.03/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Jan 20 14:04:57 smtp2 postfix/qmgr[1436]: 0EA1B1027961: removed
Only the bottom few events include the search keyword 0EA1B1027961; the rest don't.
The above is the result of my attempt to use transaction on a standard Postfix log on our SMTP relays. The log is fairly simple: each email passing through (or deferred or rejected) produces several log entries that share a common session ID. I'd like to group them together and run various stats commands on them, yet my results are way off.
Original field extraction and transaction statement displaying results in a table:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction threadId host maxspan=24h
| search threadId=*
| table threadId status duration eventcount
| sort -duration
Results:
threadId status duration eventcount
0EA1B1027961 sent 3300 29
AB0711005181 sent 233 11
7BF211005181 sent 202 11
5CE8B1005181 sent 178 11
244971005181 sent 173 10
Clicking on an event and then "view events" brings up a "transaction" event like the one above, which includes irrelevant events.
Appreciate any help and apologies if my question is not clear.
Incidentals:
P.S. Excluding the host field from the transaction command returns better results (though not perfect, since I do want to isolate the results to a single host):
| transaction threadId maxspan=24h
So how do I ensure that including the host field in the transaction statement does not result in irrelevant events being included? In other words, that only events with both the relevant threadId and host values are grouped by transaction?
P.P.S. This may be related (from "Search Reference - Transaction"):
The events are grouped into transactions based on the values of this field. If a quoted list of fields is specified, events are grouped together if they have the same value for each of the fields.
What is "a quoted list of fields"?
Running transaction with maxspan=24h is going to bring your search head to its knees for any appreciable volume of search results. At first glance, transaction might appear to be the function you're looking for, but it really isn't. Stats is what you want.
I mocked up your data with just the meaningful fields like this:
| makeresults count=50
| eval raw=split("0EA1B1027961,2EB1F149F9CD,1D7620F38E4C,0EA1B1027961,DFC2597BE73E",",")
| eval statuschoices=split("sent,deferred",",")
| eval threadId=mvindex(raw,random()%5)
| eval alphabet=split("abcdefg","")
| eval status=mvindex(statuschoices,random()%2)
| eval host=mvindex(alphabet,random()%7)
`comment("Mocked-up sample data above with credit to to4kawa")`
`comment("The stats line immediately below is important since it does all the heavy lifting for you")`
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId
`comment("You need to compute the duration below by subtracting Beginning time from Ending time, both of which are in epoch format")`
`comment("I added some random value to Ending just to give some variation in duration")`
| eval Ending = Ending+random()%100000
| eval Duration = Ending - Beginning
`comment("Neither the Beginning nor Ending fields are important at this point, so remove them from the results")`
| fields - Beginning,Ending
| sort -Duration
It gives what I think you're looking for.
Adapting your search to this approach looks something like this:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId
| search threadId=*
| eval Duration = Ending - Beginning
| fields - Beginning,Ending
| sort -Duration
Hope that helps!
rmmiller
References:
https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Stats
https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Timefunctions
Thank you!
Is it possible to modify your search to include additional fields in the table? ("to", "from", "relay", "client", etc.)
Also, is it possible to do something similar to the maxpause and maxspan clauses in transaction with your search, i.e. if a group of events with the same threadId spans more than 1h, or there's a pause longer than 10 min, break them up into separate groups (logical events) for further analysis?
Overall - thank you! What you did is awesome and hopefully one day I'll wrap my mind around this and be able to craft similar searches. Regretfully I am not there yet. 🙂
(Back to transaction for a moment: would you know if it's possible to force an "AND" relationship between fields in transaction, to prevent the behavior I described in my OP?)
Regarding the cost of transaction: I think we can manage it, at least with respect to Postfix logs.
Running transaction with maxspan=24h is going to bring your search head to its knees for any appreciable volume of search results.
Running my original search with transaction over 30 days:
This search has completed and has returned 36,025 results by scanning 394,663 events in 12.567 seconds
This works for me, and it's unlikely I'll run it over 30 days often. A similar search over 24h: "968 results by scanning 12,090 events in 0.494 seconds". Past year: "6,087 results by scanning 2,957,109 events in 245.953 seconds".
Postfix is the simplest of the transactional logs we have; some others (Aspera, Tomcat) may run for hours using the same thread or session ID, and contain dozens of fields we need to run stats on. The key is to group events with the same session ID into a single transaction (per host) and then treat it as a single event for further analysis, even if it's expensive. I'm far from sure I'll be able to use stats for the purpose.
Possible to modify your search to include additional fields in the table? ("to", "from", "relay", "client", etc.)
Yes, just add additional items to the by clause of stats like this:
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId,relay,client
See my latest comment under your OP regarding what you're seeing with transaction.
One thing to keep in mind about whether your use of transaction is acceptable in terms of performance: there are memory limits (in limits.conf) that can affect the accuracy of searches using transaction. Stats lets you work around this limitation and is faster as well.
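For reference, the transaction-related limits live in a limits.conf stanza along these lines (setting names from memory; the values below are illustrative, not recommendations - check the limits.conf reference for your Splunk version):

```ini
# limits.conf (illustrative values only)
[transactions]
# Maximum number of not-yet-closed transactions kept in the open pool
maxopentxn = 5000
# Maximum total number of events held across all open transactions
maxopenevents = 100000
```

When these limits are exceeded, transaction evicts open transactions to stay under them, which is one way silently incomplete groupings can appear in large searches.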
There's a fantastic presentation from one of the previous Splunk conferences that explains the benefits of using stats.
https://conf.splunk.com/files/2016/slides/let-stats-sort-them-out-building-complex-result-sets-that-...
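Regarding your maxpause question: here's a sketch (untested on your data) of one way to approximate it without transaction, using streamstats to start a new session whenever the gap between consecutive events with the same threadId on the same host exceeds 10 minutes. The sessionNum and prev_time field names are ones I made up for this example:

```spl
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| sort 0 host threadId _time
| streamstats current=f last(_time) as prev_time by host,threadId
| eval newsession=if(isnull(prev_time) OR (_time - prev_time) > 600, 1, 0)
| streamstats sum(newsession) as sessionNum by host,threadId
| stats count, values(status) as Status, earliest(_time) as Beginning, latest(_time) as Ending by host,threadId,sessionNum
| eval Duration = Ending - Beginning
| fields - Beginning,Ending
| sort -Duration
```

Each (host, threadId, sessionNum) group then behaves like one transaction. A maxspan-style cut could be layered on the same way, by also setting newsession=1 when an event falls too far after the start of its current session.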
Hope that helps!
rmmiller
That produces no results... I'll dig into it...
Yes, just add additional items to the by clause of stats like this:
| stats count,values(status) as Status,earliest(_time) as Beginning,latest(_time) as Ending by host,threadId,relay,client
Hi @mitag
My apologies. I was in a rush. You are looking for the values of "relay" and "client" rather than splitting them out into different groups for stats, which is what my erroneous command would give you.
I was actually thinking this instead:
| stats count, values(status) AS Status, values(relay) AS Relay, values(client) AS client, earliest(_time) AS Beginning, latest(_time) AS Ending by host,threadId
The quoted list of fields looks like transaction "threadId host" maxspan=24h.
From Splunk doc team:
Turns out that the text refers to an old way that the “fields” argument worked. You do not specify fields in a quoted list. You simply list the fields immediately after the “transaction” command. Several of the Extended Examples show this.
Thanks - that's what I thought - but it must be something other than an "AND" between fields in transaction: I get no results when I use it. Would you know what Splunk means by "If a quoted list of fields is specified, events are grouped together if they have the same value for each of the fields"?
"The same value for each of the fields", i.e. events are grouped into a transaction based on threadId == host?
No, "the same value for each of the fields" should be interpreted as (threadId is the same in event A and event B) AND (host is the same in event A and event B) --> merge event A and event B into the same transaction.
Is it possible that my Splunk instance has a bug or something else is wrong with it? Else I don't understand why I'd get no results on this - when it's clear there should be:
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| transaction "threadId host" maxspan=24h
| search threadId=0EA1B1027961
I don't think you should have quotes around those 2 fields unless it was actually a single field named "threadId host". In this particular case, you have 2 distinct fields, i.e., one named "threadId" and another named "host". Neither one really needs quotes around it in order to function with transaction.
When using multiple fields with transaction, you'll get your expected grouping if each event includes both of the fields you specify. If some events are missing "threadId", they will probably not end up in the same transaction as those that have matching values in both the "threadId" and "host" fields.
Transaction is also really picky about the order of events. Events must be in descending time order, otherwise it can get confused and include some events that shouldn't be part of a transaction.
See https://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Transaction#Usage for further explanations of both.
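A practical workaround that follows from this (a sketch, untested on your data): filter out events that have no threadId before running transaction, so they can never be merged into a group on host alone:

```spl
sourcetype="postfix_syslog" host="*-smtp*"
| rex field=_raw "\:\s(?<threadId>[0-9A-F]{6,24})\:"
| search threadId=*
| transaction threadId host maxspan=24h
| search threadId=0EA1B1027961
```

The smtpd connect/disconnect lines in your sample carry no queue ID, so the rex never extracts threadId for them; dropping them up front should keep them out of the transactions.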
Thank you. Perhaps transaction gets confused by the presence of several events with the exact same timestamp and takes it out on me by including unrelated events? 🙂
This though may be the answer to my OP:
You might see the following events grouped into a transaction:
event=1 host=a
event=2 host=a cookie=b
event=3 cookie=b
This effectively means there's no way to force the fields in transaction to behave in an AND fashion, which in turn means we can't use it for our purposes: it'll be bound to include unrelated events.
Unless that mysterious "quoted list of fields" Splunk refers to in their documentation (but gives no example of) is still a possibility?