Solved: Transaction problems with lots of events and multi...

johnnymc · ‎08-18-2011

Hello,
i would like to construct a complete transaction of a mail session, starting from the customer smtp connection, until the final delivery of the message.

I have some ironport mail logs, with inside some fields. They should compare in logical order in every session,starting from the first:

ICID: smtp connection ID
MID: message internal ID
DCID: delivery connection ID, alias the final delivery (local or remote)

Basically, i have to find the start of the connection (which starts always with "New SMTP ICID", then (and ONLY IF the message composing is going to happen, sometimes clients drop the connection) there's a line with MID and ICID in the same event, then use the MID until i find another line with both MID and DCID.
Finally, correlate them all together.

Here are example sessions to make thing clearer(anonimized,sorry):

example1. Full session, with message delivery:

Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: New SMTP ICID 1513562731 interface smtpout (1.2.3.4) address 123.2.2.2 reverse dns host host.domain.it verified yes
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: ICID 1513562731 RELAY SG Servers_bypass_AntiSpam match 195.110.129.0/24 SBRS not enabled
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: SMTP Auth: (ICID 1513562731) succeeded for user: xxx@yyy.it using AUTH mechanism: PLAIN with profile: SmtpAuth_CGP_LDAP
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: Start MID 182684958 ICID 1513562731
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 ICID 1513562731 From: <xxx@yyy.it>
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 ICID 1513562731 RID 0 To: <zzz@yyy.it>
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 Message-ID '<4E4CD693.3090603@yyy.it>'
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 Subject 'something'
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 ready 15669 bytes from <xxx@yyy.it>
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: ICID 1513562731 close
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: MID 182684958 rewritten to MID 182684959 by LDAP rewrite
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: LDAP: Reroute query Produzione.routing MID 182684958 RID 0 address zzz@yyy.it to [('zzz@yyy.it', 'mailhost.it')]
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: LDAP: Mailhost query Produzione.routing MID 182684958 address xxx@yyy.it to mailhost.it
Aug 18 11:08:32 splunk_mail_logs_indexer1: Info: Message finished MID 182684958 done

example 2. SMTP session, mail server stops client from sending message:

Aug 18 11:09:20 splunk_mail_logs_indexer1: Info: New SMTP ICID 675888601 interface smtpout (1.2.3.4) address 123.44.2.21 reverse dns host host.domain.it verified yes
Aug 18 11:09:20 splunk_mail_logs_indexer1: Info: ICID 675888601 RELAY SG Servers_bypass_AntiSpam match 195.110.129.0/24 SBRS not enabled
Aug 18 11:09:32 splunk_mail_logs_indexer1: Info: Start MID 396231626 ICID 675888601
Aug 18 11:09:32 splunk_mail_logs_indexer1: Info: MID 396231626 ICID 675888601 From: <user@domain.eu>
Aug 18 11:09:40 splunk_mail_logs_indexer1: Info: MID 396231626 ICID 675888601 RID 0 To: <user2@domain.eu>
Aug 18 11:09:42 splunk_mail_logs_indexer1: Info: Message aborted MID 396231626 Receiving aborted by sender
Aug 18 11:09:42 splunk_mail_logs_indexer1: Info: Message finished MID 396231626 aborted
Aug 18 11:09:42 splunk_mail_logs_indexer1: Info: ICID 675888601 close

I have a huge turnover of sessions on this mail frontend server, in just 1 second i can have thousand of sessions. Since i'm searching for sessions originated from a particular IP, the others are not relevant to me.

Problem is, that i know that session ends with "Message finished" when a message exists, and "ICID close" when not. ICID close, however happens every time, since is when the client disconnects (after leaving the server to take care of the message) and happens BEFORE the "message finished (see example 1)
This leads to incorrect parsing of transaction when I have a lot of events related to IP not relevant to my search.
In fact, if i "purge" logs from these transactions - using nested subsearches - transaction command starts to behave correctly.

To summarize, purging unrelevant events BEFORE transaction works (but I think this is not efficient as it should be):

earliest=-1h@m index=sistemi_legale sourcetype=ironport_mail_logs [search earliest=-1h@m index="sistemi_legale" host=ironportout* sourcetype=ironport_mail_logs Start MID [ search  earliest=-1h@m index="sistemi_legale" host=ironportout* sourcetype=ironport_mail_logs New SMTP ICID  RemoteIP=1.2.3.4|fields host ICID | format maxresults=50000 ] | fields host ICID MID | eval query="host=".host." (ICID=".ICID.") OR (MID=".MID.") OR (DCID MID=".MID.")"|fields query]|transaction ICID MID DCID host startswith="New SMTP ICID"

This instead, leads to wrong transactions (but is easier to read, and IMHO SHOULD work):

earliest=-1h@m index="sistemi_legale" host=ironportout* sourcetype=ironport_mail_logs RemoteIP=1.2.3.4 (MID=* OR ICID=* OR DCID=*)|transaction MID ICID DCID host maxspan=300s startswith="New SMTP ICID"

I've found that the transaction command is not as effective with multiple fields as when using one field only; on the other hand, during my test, using "startswith" and "endswith" helps dramatically in identified complete and correct transactions.

Could anyone help?
Thanks in advance.

bbingham · ‎08-18-2011

To start with, if you're going to be dealing with specific data sets, you'll actually find your sub-searches are probably faster then using the transaction command on the entire dataset. Sub-searches can disperse the work to several different indexers while the transaction command resides entirely on the search head and is not map-reducible.

That being said, I'll talk about questions on the transaction command.

There's a lot to go over with the transaction command. When using multiple fields in the transaction command, Splunk assumes there is a "transitive" relationship and there is no repeat of the first instantiated fields. Meaning, if MID is created in the first line and has ICID on the second, splunk will then assume all events that have the same MID or the same ICID are supposed to be together. To make this explanation simpler, lets take a generic sample:

event=1 host=a
event=2 host=a cookie=b
event=3 cookie=b

In the sample above, |transaction host, cookie, would behave as you'd expect. You'll have 1 output that combines all three events. But what does splunk do when host changes but keeps the same cookie? This is the same type of concept that happens in mail, the mail id never changes, but all of the device identifiers change.

you might see the following events grouped into a transaction:

event=1 host=a
event=2 host=a cookie=b
event=3 host=b
event=4 host=b cookie=b

These 4 events have 1 field that links them, the same cookie. However when splunk looks at this "transaction", it sees 2 distinct transactions, here's why (simplified):

event=1 host=a

Splunk takes the first line and asks, do I have one of my fields in the field list? If the answer is yes, it knows the event is the start of a transaction. Next event,

event=2 host=a cookie=b

Splunk then asks the same question, does this event have any fields from the field list? Yes, it has 2. Does one field match a already defined field? Yes! Is it the same value? Yes! So this event must be part of the same transaction. Then splunk says, the 2nd field hasn't been set yet, so lets creates the transitive link between host and cookie.

event=3 host=b

Does this event already have a field that's been instantiated? Yes. Does it match the value of the first transaction? No. Therefore this must be a new transaction.

event=4 host=b cookie=b

Lastly, does it have an instantiated field? Yes. Does it match? Yes, it matches event 3. Does the 2nd field match anything in transaction2? No, cookie be has not been instantiated yet for the transaction, so lets instantiate it for the 2nd transaction. The outcome:

------------------------
event=1 host=a
event=2 host=a cookie=b
------------------------
event=3 host=b
event=4 host=b cookie=b

Clearly not what we want. I want ALL 4 events in 1 transaction because they have the same cookie. This is where the transaction profiling app comes into play. We make the assumption that instead of a breadcrumb style transaction, you're actually going to want to build a transaction on top of a transaction. So instead of |transaction host cookie you'd really build this as | transaction host| transaction cookie

There are some unique challenges with chaining the transaction command together, but the short answer, to make sure you don't lose any lines, prepend an eval expression:

eval $uni_field$=if(isnull($uni_field$), (splunk_server+"#"+index+"#"+_cd), $uni_field$) | transaction $uni_field$

this will make sure that any event that does not contain the unifying field is not thrown away. You could then do a starts with ends with transaction for each piece of the transaction, and combine it at the end. Make sure if you chain together transaction commands that have startswith and endswith, you add keepevicted=t so you don't discard lines you may want to use later in the transaction.

By using this approach, you can chain as many transactions together as you'd like, and it will capture all the events you're looking to capture. There is a very similar example to this question in the transaction profiling application that deals with multi-tier email systems.

Now in your example, the clientip is only printed on 1 line, meaning you would have to use something like this for your search:

earliest=-1h@m index="sistemi_legale" host=ironportout* sourcetype=ironport_mail_logs (MID=* OR ICID=* OR DCID=*)
  |eval eid=splunk_server+"#"+index+"#"+_cd
  |eval ICID=if(isnull(ICID), eid, ICID) | transaction ICID maxspan=300s startswith="New SMTP ICID" keepevicted="t"
  |eval DCID=if(isnull(DCID), eid, DCID) | transaction DCID maxspan=300s
  |eval MID=if(isnull(MID), eid, MID) | transaction MID  maxspan=300s 
  |search RemoteIP=1.2.3.4

Splunk will then have to combine every event over the last hour into a transaction and then search for the IP. This is highly inefficient. Using the subsearches will limit the data and let the indexers do all the work so that at the end you know you have only the events you need, and then run the transaction command on the outcome.

We are currently working on some new commands to make this process easier, but I hope i've answered your questions on the transaction command.

View solution in original post

bbingham · ‎08-18-2011

To start with, if you're going to be dealing with specific data sets, you'll actually find your sub-searches are probably faster then using the transaction command on the entire dataset. Sub-searches can disperse the work to several different indexers while the transaction command resides entirely on the search head and is not map-reducible.

That being said, I'll talk about questions on the transaction command.

There's a lot to go over with the transaction command. When using multiple fields in the transaction command, Splunk assumes there is a "transitive" relationship and there is no repeat of the first instantiated fields. Meaning, if MID is created in the first line and has ICID on the second, splunk will then assume all events that have the same MID or the same ICID are supposed to be together. To make this explanation simpler, lets take a generic sample:

event=1 host=a
event=2 host=a cookie=b
event=3 cookie=b

In the sample above, |transaction host, cookie, would behave as you'd expect. You'll have 1 output that combines all three events. But what does splunk do when host changes but keeps the same cookie? This is the same type of concept that happens in mail, the mail id never changes, but all of the device identifiers change.

you might see the following events grouped into a transaction:

event=1 host=a
event=2 host=a cookie=b
event=3 host=b
event=4 host=b cookie=b

These 4 events have 1 field that links them, the same cookie. However when splunk looks at this "transaction", it sees 2 distinct transactions, here's why (simplified):

event=1 host=a

Splunk takes the first line and asks, do I have one of my fields in the field list? If the answer is yes, it knows the event is the start of a transaction. Next event,

event=2 host=a cookie=b

Splunk then asks the same question, does this event have any fields from the field list? Yes, it has 2. Does one field match a already defined field? Yes! Is it the same value? Yes! So this event must be part of the same transaction. Then splunk says, the 2nd field hasn't been set yet, so lets creates the transitive link between host and cookie.

event=3 host=b

Does this event already have a field that's been instantiated? Yes. Does it match the value of the first transaction? No. Therefore this must be a new transaction.

event=4 host=b cookie=b

Lastly, does it have an instantiated field? Yes. Does it match? Yes, it matches event 3. Does the 2nd field match anything in transaction2? No, cookie be has not been instantiated yet for the transaction, so lets instantiate it for the 2nd transaction. The outcome:

------------------------
event=1 host=a
event=2 host=a cookie=b
------------------------
event=3 host=b
event=4 host=b cookie=b

Clearly not what we want. I want ALL 4 events in 1 transaction because they have the same cookie. This is where the transaction profiling app comes into play. We make the assumption that instead of a breadcrumb style transaction, you're actually going to want to build a transaction on top of a transaction. So instead of |transaction host cookie you'd really build this as | transaction host| transaction cookie

There are some unique challenges with chaining the transaction command together, but the short answer, to make sure you don't lose any lines, prepend an eval expression:

eval $uni_field$=if(isnull($uni_field$), (splunk_server+"#"+index+"#"+_cd), $uni_field$) | transaction $uni_field$

this will make sure that any event that does not contain the unifying field is not thrown away. You could then do a starts with ends with transaction for each piece of the transaction, and combine it at the end. Make sure if you chain together transaction commands that have startswith and endswith, you add keepevicted=t so you don't discard lines you may want to use later in the transaction.

By using this approach, you can chain as many transactions together as you'd like, and it will capture all the events you're looking to capture. There is a very similar example to this question in the transaction profiling application that deals with multi-tier email systems.

Now in your example, the clientip is only printed on 1 line, meaning you would have to use something like this for your search:

earliest=-1h@m index="sistemi_legale" host=ironportout* sourcetype=ironport_mail_logs (MID=* OR ICID=* OR DCID=*)
  |eval eid=splunk_server+"#"+index+"#"+_cd
  |eval ICID=if(isnull(ICID), eid, ICID) | transaction ICID maxspan=300s startswith="New SMTP ICID" keepevicted="t"
  |eval DCID=if(isnull(DCID), eid, DCID) | transaction DCID maxspan=300s
  |eval MID=if(isnull(MID), eid, MID) | transaction MID  maxspan=300s 
  |search RemoteIP=1.2.3.4

Splunk will then have to combine every event over the last hour into a transaction and then search for the IP. This is highly inefficient. Using the subsearches will limit the data and let the indexers do all the work so that at the end you know you have only the events you need, and then run the transaction command on the outcome.

We are currently working on some new commands to make this process easier, but I hope i've answered your questions on the transaction command.

johnnymc · ‎08-22-2011

Thanks a lot for the very exhaustive answer. I think i'll stick with the subsearch approach for this.
The unique eid approach will be useful for another scenarios, though, where data is not specific as in this example.

aljohnson_splun · ‎06-02-2015

@bbingham what were the new commands that you mention in the last sentence ?

bbingham · ‎06-02-2015

They were custom search commands that were only released in an old app named "transaction profiling". The app never made it out of a preview step. However, there were several enhancements to the core transaction command that will now handle most of the issues that I discuss in this old post.

Transaction problems with lots of events and multiple fields

Splunk Custom Visualizations App End of Life

Introducing Splunk Enterprise 9.2

Adoption of RUM and APM at Splunk