Filtering o365 data with a search query

Abass42
Path Finder

I have a question about filtering data. We have a customer who is requesting a set of fields to be sent in from O365. The issue is that we can't modify what we pull in, because we are using an API, not the universal forwarder. Currently I am trying to test a search query to confirm that I am only pulling in the events that contain those fields.

The o365 data pulls in about 400+ fields. We want about 40 of those fields for a specific use case. My question is, what is the correct Splunk syntax to search for only those fields?

Original query that brings in about 400+ fields:

index=o365

New query for about 35 fields:

index=o365 "Operation"="*" OR "LabelAction"="*" OR "LabelAppliedDateTime"="*" OR "LabelIid"="*" OR "abelName"="*" OR "DlpAuditEventMetadata.DlpPolicyMatchId"="*" OR "DlpAuditEventMetadata.EvaluationTime"="*" OR "DlpOriginalFilePath"="*" OR "IrmContentId"="*" OR "PolicyMatchInfo.PolicyId"="*" OR "PolicyMatchInfo.PolicyName"="*" OR "PolicyMatchInfo.RuleId"="*" OR  "PolicyMatchInfo.RuleName"="*" OR  "ProtectionEventData.IsProtected"="*" OR "ProtectionEventData.IsProtectedBefore"="*" OR "ProtectionEventData.ProtectionEventType"="*" OR "ProtectionEventData.ProtectionOwner"="*" OR "ProtectionEventData.ProtectionType"="*" OR "ProtectionEventData.TemplateId"="*" OR "ProtectionEventType"="*" OR "RMSEncrypted"="*" OR "SensitiveInfoTypeData{}.Confidence"="*" OR "SensitiveInfoTypeData{}.Count"="*" OR "SensitiveInfoTypeData{}.SensitiveInfoTypeId"="*" OR "SensitiveInfoTypeData{}.SensitiveInfoTypeName"="*" OR "SensitiveInfoTypeData{}.SensitiveInformationDetailedClassificationAttributes{}.Confidence"="*" OR "SensitiveInfoTypeData{}.SensitiveInformationDetailedClassificationAttributes{}.Count"="*" OR "SensitivityLabelEventData.ActionSource"="*" OR "SensitivityLabelEventData.ActionSourceDetail"="*" OR "SensitivityLabelEventData.ContentType"="*" OR "SensitivityLabelEventData.JustificationText"="*" OR "SensitivityLabelEventData.LabelEventType"="*" OR "SensitivityLabelEventData.OldSensitivityLabelId"="*" OR "SensitivityLabelEventData.SensitivityLabelId"="*" OR "SensitivityLabelEventData.SensitivityLabelPolicyId"="*" OR "LabelName"="*" | fields 
Operation,LabelAction,LabelAppliedDateTime,LabelIid,abelName,DlpAuditEventMetadata.DlpPolicyMatchId,DlpAuditEventMetadata.EvaluationTime,DlpOriginalFilePath,IrmContentId,PolicyMatchInfo.PolicyId,PolicyMatchInfo.PolicyName,PolicyMatchInfo.RuleId,PolicyMatchInfo.RuleName,ProtectionEventData.IsProtected,ProtectionEventData.IsProtectedBefore,ProtectionEventData.ProtectionEventType,ProtectionEventData.ProtectionOwner,ProtectionEventData.ProtectionType,ProtectionEventData.TemplateId,ProtectionEventType,RMSEncrypted,SensitiveInfoTypeData{}.Confidence,SensitiveInfoTypeData{}.Count,SensitiveInfoTypeData{}.SensitiveInfoTypeId,SensitiveInfoTypeData{}.SensitiveInfoTypeName,SensitiveInfoTypeData{}.SensitiveInformationDetailedClassificationAttributes{}.Confidence,SensitiveInfoTypeData{}.SensitiveInformationDetailedClassificationAttributes{}.Count,SensitivityLabelEventData.ActionSource,SensitivityLabelEventData.ActionSourceDetail,SensitivityLabelEventData.ContentType,SensitivityLabelEventData.JustificationText,SensitivityLabelEventData.LabelEventType,SensitivityLabelEventData.OldSensitivityLabelId,SensitivityLabelEventData.SensitivityLabelId,SensitivityLabelEventData.SensitivityLabelPolicyId,LabelName

 

 

Basically, from my understanding and my research, if you just append a specific string, in quotes or outside of quotes, Splunk searches all events for that string and pulls them in. Such as:

index=Test field1 field2 field3 

That would bring in only events with field1 or field2 or field3 within them. Adding quotes to it, such as

index=Test "field1"="*" "field2"="*" "field3"="*"

should filter the same way.

I have tested it both ways, with double quotes surrounding the field names as well as with no quotes. I'm also using | fields, which should only bring those fields in, but I don't know whether it is only showing those fields while still bringing in ALL of the events.

 

My question is, is this correct? With the base search I've been testing, searching all of the events in o365 for one full 24-hour day brings in 23,410,064 events. Filtering with the query I pasted above, for the same 24 hours, brings in 23,409,887 events. I've tested this a couple of ways, and each time, over the same time period, the filtering query brings in about 1k fewer events. But I can still only view the first 1k events, 20 pages' worth. That may be another question, though.

My long-winded question boils down to: am I searching this data correctly? I know it's a heavy index with millions of events, but filtering down to only 40 or so fields, some of which appear only 0.6% of the time, still brings in millions of events. Is there a way to fully validate it?


bowesmana
SplunkTrust

As @yuanliu points out, under certain circumstances, the following are functionally the same

index=Test field1 field2 field3 
index=Test "field1"="*" "field2"="*" "field3"="*"

However, from Splunk's point of view they are very different.

In the first case, the search is looking for pieces of TEXT in the _raw event, 'field1' (and likewise 'field2' and 'field3'), whereas in the second it's looking for a field called field1 that is extracted and has some value. So consider these two example _raw events:

2023-08-31T08:00:00 field1="Hello" 
2023-08-31T08:00:00 Hello="field1" 

The first search will find both events, whereas the second search will only find the FIRST event, because only that event has an extracted field named field1.

Here's an example to demonstrate

| makeresults
| eval x=split("2023-08-31T08:00:00 field1=\"Hello\",2023-08-31T08:00:00 Hello=\"field1\"", ",")
| mvexpand x
| eval _time=strptime(x,"%FT%T")
| rename x as _raw 
| extract
| search field1

This finds both events, but if you change the last line to search field1=* you will only get one event.

As for validating your data, you clearly cannot go through 24m events one by one, so you would have to do aggregations and check the numbers, and you can only validate if you know what to expect.
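
One way to sketch that kind of aggregate check in SPL is to count, in a single pass, how many events carry each field of interest. This is only an illustration: it uses four of the field names from the question, the total_events and has_* aliases are made up for readability, and it assumes the search is run over the same time range as the original tests. On an index of this size it may be slow.

```spl
index=o365
| stats count AS total_events
        count(Operation) AS has_Operation
        count(LabelName) AS has_LabelName
        count(RMSEncrypted) AS has_RMSEncrypted
        count(IrmContentId) AS has_IrmContentId
```

If has_Operation turns out to be close to total_events, that would suggest why the OR filter removes so few events: any event with an Operation value passes it regardless of the other fields.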

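Making all those wildcard searches is not particularly performant, and as that picks up most events, you may want to turn it into a NOT search instead. Because your field tests are joined with OR, the complement needs the OR inside the parentheses:

index=o365 NOT (f1=* OR f2=*...)

which should return the 1k or so events that were not found.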
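
As a rough sketch of that check with real field names, the following counts the events that have none of a handful of the requested fields. Only three fields are written out here; the full list from the question would need to go inside the parentheses, and the alias events_without_any_field is just an illustrative name.

```spl
index=o365 NOT (Operation=* OR LabelAction=* OR LabelName=*)
| stats count AS events_without_any_field
```

With the complete field list, and run over the same time range, the result should equal the difference between the unfiltered and filtered event counts. If it does, the filter is dropping exactly the events that lack all of those fields.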

Don't forget that Splunk is returning you _raw events and doing field extraction, so when you say you only want 40 fields, then just deal with all the events and, after doing any data processing you need, exclude the events you don't want by filtering at a later point in the Splunk pipeline.

For example, if your events you do NOT want do not have an Operation field, then 

| stats count by Operation

will actually filter out the events that don't have the Operation field anyway, and would be much faster than your complex wildcard search.

 

yuanliu
SplunkTrust

As you must be aware by now, the answer to any data analysis question depends on the data. If the strings "field1", "field2", etc., appear in raw data AND signify the existence of fields of the same names, you are correct that index=Test field1 field2 field3 and index=Test "field1"="*" "field2"="*" "field3"="*" are functionally equivalent. (Even in such cases, semantic differences can still cause performance differences depending on the inner workings of the search engine.) In some cases, however, a field can exist without the field name appearing in raw data, or the field name may exist in raw data but not as a term in the SPL sense. In such cases, the two are functionally different.

For example, Splunk may extract field1=abcd from the raw data "field1_abcd". The search "index=Test field1" will not find this one.

Hope this helps.
