Splunk Search

Replace join with a more performant query

fabry
Observer

So far I have created this join:

 

index="index" "mysearchtext"
| rex field=message ", request_id: \\\"(?<request_id>[^\\\"]+)"
| fields _time request_id
| eval matchfield=request_id
| join matchfield [ search index="index"
    | spath request.id
    | rename request.id as id
    | eval matchfield=id
    | fields matchfield mynewfield ]
| table _time request_id mynewfield

 

 

Basically I want to join two sets of logs where request_id = id.

The join works as expected but, as you would expect, it is not efficient.

I'd like to replace it with a more efficient search, leveraging the fact that the events of the subsearch (where I extract the field "mynewfield") are always indexed a few milliseconds after the events of the main search (where I extract the field request_id).

Another useful piece of info: the logs that match "mysearchtext" are far fewer than the logs in the subsearch.

Here a sample of the data

{"AAA": "XXX","CCC":"DDD","message":{"request":{ "id": "MY_REQUEST_ID"} } }
{"AAA": "XXX","CCC":"DDD","message":"application logs in text format e.g. 2024/04/26 06:35:21 mysearchtext headers: [], client: clientip, server, host, request_id=\"MY_REQUEST_ID\" "}

The first event contains the message field as a JSON string --> we have thousands of these logs.

The second one is an "alert" and we have just a few of them; the format of its "message" field is plain text.

Both contain the value MY_REQUEST_ID, which is the field I have to use to correlate the two logs.

The output should be a TABLE of ONLY the events with "mysearchtext" (the second event), with some additional fields coming from the first event.

The events above are sorted by time (reverse order): the second event happens just a few milliseconds before the first one (basically the second one is just a log message from the same REST request as the first event; the first event is the REST response sent to the customer).


bowesmana
SplunkTrust

As @PickleRick says, without knowing your data, it's not totally clear what may work in your case.

Are you actually searching the same index in both cases? From your search example, the subsearch will contain the events from the outer search.

You can generally combine the two data sets into a single search and do some data munging before doing stats values(), e.g.

(index="index" "mysearchtext") OR (index="index")
``` This will get a field called mr_id from any event that has a field 
    called message that has the pattern match ```
| rex field=message ", request_id: \\\"(?<mr_id>[^\\\"]+)"
``` This statement will extract the request.id from a JSON payload ```
| spath request.id
``` This creates a common field called id, which is either taken from a 
    non null request.id or from the mr_id. Depending on the event type the 
    id is extracted from, you should hopefully end up with a single id.

    Note that if your source index really is "index" then this assumes that
    the id coming from events with mysearchtext will NOT also have a JSON
    payload with the request id
```
| eval id=coalesce('request.id', mr_id)
``` This then aggregates the fields by the common id.
    Note it uses values() so you don't know which event came at which time
    but that can be solved if needed
```
| stats values(mynewfield) as mynewfield values(_time) as times by id
``` This just formats the multivalue times field to be human readable ```
| eval times=mvmap(times, strftime(times, "%F %T.%Q"))
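To reduce the result to ONLY the ids that actually had a "mysearchtext" alert (the requirement in the question), a flag can be set before the stats and filtered on afterwards. This is just a sketch building on the search above; is_alert and has_alert are made-up field names:

```
| eval is_alert=if(isnotnull(mr_id), 1, 0)
| stats values(mynewfield) as mynewfield values(_time) as times max(is_alert) as has_alert by id
| where has_alert=1
```

Rows whose id never appeared in an alert event will have has_alert=0 and get dropped, so only the correlated alerts survive into the final table.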

 


fabry
Observer

I have added more info about the data in the main comment


PickleRick
SplunkTrust

Ok. If you have two searches with significantly different performance (one takes much longer to complete, or returns way more events, than the other) and you deal with subsearches, you want to put the "smaller" one into the subsearch. (Subsearches have their limitations, and with a smaller search it's less probable that the subsearch will get silently finalized, returning wrong/incomplete results.)

As to replacing join, the typical way is to use stats.

In a case similar to yours (we don't know your events, and the search you presented is incorrect) it would be something like:

<first search conditions>
| fields a b c d
| rename d as common_field
| append [
    <second search conditions>
    | fields e f g h
    | rename g as common_field ]
| stats values(*) as * by common_field
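Applied to the searches from the question, the pattern might look like this (a sketch; it puts the alert search on the outside since it is the smaller one, and reuses the rex/spath extractions from the original post):

```
index="index" "mysearchtext"
| rex field=message ", request_id: \\\"(?<request_id>[^\\\"]+)"
| fields _time request_id
| append [
    search index="index"
    | spath request.id
    | rename request.id as request_id
    | fields request_id mynewfield ]
| stats values(*) as * by request_id
```

Note that append is still subject to subsearch limits (event count and runtime), so if the JSON side of the data is very large, a single combined search with no subsearch at all is safer.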

This is more or less the only way if you need to transform your data in both searches separately (for example have different stats aggregations in them).

If you have only streaming commands, you can try to use multisearch to overcome the limitations of the append command. Alternatively, search with both sets of conditions at once and conditionally set/choose/transform the fields you need depending on which result set the event came from. (This is probably the most effective way in some cases, but it's not always applicable and can be quite messy to write.)
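A multisearch version of the same idea might look like this (a sketch; both legs must contain only streaming commands, which rex, spath and rename are):

```
| multisearch
    [ search index="index" "mysearchtext"
      | rex field=message ", request_id: \\\"(?<request_id>[^\\\"]+)" ]
    [ search index="index"
      | spath request.id
      | rename request.id as request_id ]
| stats values(mynewfield) as mynewfield values(_time) as times by request_id
```

Unlike append, multisearch runs both legs as part of the main search, so it is not constrained by the subsearch event/runtime limits.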


fabry
Observer

I have added more info about the data in the main comment.

Is stats the right Splunk command to use even if I need a table as the result?


PickleRick
SplunkTrust

It's the other way around - you would need table if you didn't have stats. If you do stats, it produces a results table from your summarized events.

And yes, stats will probably be the way to go. As you seem to have different sets of fields, you simply extract them _before_ doing

stats values(interesting_field1) as interesting_field1 values(interesting_field2) as interesting_field2 [...] by common_field
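Putting it together for the two event types from the question, a complete search might look like this. It is only a sketch: is_alert/has_alert are invented names, the rex pattern is copied from the original post, and the spath path assumes the JSON extraction works as in the original subsearch:

```
(index="index" "mysearchtext") OR (index="index")
| rex field=message ", request_id: \\\"(?<request_id>[^\\\"]+)"
| eval is_alert=if(isnotnull(request_id), 1, 0)
| spath request.id
| eval request_id=coalesce(request_id, 'request.id')
| stats values(mynewfield) as mynewfield max(is_alert) as has_alert min(_time) as _time by request_id
| where has_alert=1
| table _time request_id mynewfield
```

min(_time) is used because, per the question, the alert is logged a few milliseconds before the JSON response event, so the earlier timestamp is the alert's. The where clause then keeps only the request ids that actually had a "mysearchtext" alert.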