Splunk Search

Count mismatch for fields after using the mvzip, mvexpand and mvindex commands

SureshkumarD
Loves-to-Learn Lots

Hi Team,

I need to extract the values of fields that have multiple values, so I used commands like mvzip, mvexpand, mvindex and eval. However, the output of my SPL query does not match the count of the interesting field. Could you please assist? Here are my SPL query and output screenshots below.

index="xxx" sourcetype="xxx" source=xxx events{}.application="xxx" userExperienceScore=FRUSTRATED
| rename userActions{}.application as Application, userActions{}.name as Action, userActions{}.targetUrl as Target_URL, userActions{}.duration as Duration, userActions{}.type as User_Action_Type, userActions{}.apdexCategory as useractions_experience_score
| eval x=mvzip(mvzip(Application,Action),Target_URL), y=mvzip(mvzip(Duration,User_Action_Type),useractions_experience_score)
| mvexpand x
| mvexpand y
| dedup x
| eval x=split(x,","), y=split(y,",")
| eval Application=mvindex(x,0), Action=mvindex(x,1), Target_URL=mvindex(x,2), Duration=mvindex(y,0), User_Action_Type=mvindex(y,1), useractions_experience_score=mvindex(y,2)
| eval Duration_in_Mins=Duration/60000
| eval Duration_in_Mins=round(Duration_in_Mins,2)
| table _time, Application, Action, Target_URL,Duration_in_Mins,User_Action_Type,useractions_experience_score
| sort - _time
| search useractions_experience_score=FRUSTRATED
| search Application="*"
| search Action="*"


Query Output with the statistics count:

[screenshot]

Expected Count:

[screenshot]


yuanliu
SplunkTrust

mvzip, mvexpand and mvindex are simply the wrong tools for your data structure. (Well, mvexpand will be needed, but only after you properly handle the array in your data.)  As everybody in this post has pointed out: you need to post sample or precise mock data to reveal the structure. (In text, never screenshots.) This is extremely important when asking questions about data analytics in a forum.  When you force volunteers to read your mind, not only will they get FRUSTRATED, but even if they are willing, most of the time their mind reading will be incorrect.

This said, based on your code, I can piece together a rough structure of your data.  I will use JSON to illustrate.  Something like

{
  "userActions": [
    {
      "application": "app1",
      "name": "action1",
      "targetUrl": "url1",
      "duration": 1234,
      "type": "actiontype1",
      "apdexCategory": "SATISFIED"
    },
    {
      "application": "app1",
      "name": "action2",
      "targetUrl": "url1",
      "duration": 2345,
      "type": "actiontype1",
      "apdexCategory": "DISATISFIED"
    },
    {
      "application": "app1",
      "name": "action3",
      "targetUrl": "url2",
      "duration": 3456,
      "type": "actiontype2",
      "apdexCategory": "FRUSTRATED"
    }
  ],
  "userExperienceScore": "FRUSTRATED",
  "events": [
    {"application": "xxx", "irrelevant": "aaa"},
    {"application": "yyy", "irrelevant": "bbb"}
  ]
}

Your event could be in JSON or it could be in XML, but it contains at least two arrays, events[] and userActions[].  Is this correct?  The array events[] is not what frustrates you, because its elements are no longer needed after the initial search.  Your end goal from the above three elements of userActions[] is to pick out

    {
      "application": "app1",
      "name": "action3",
      "targetUrl": "url2",
      "duration": 3456,
      "type": "actiontype2",
      "apdexCategory": "FRUSTRATED"
    }

and display it in this format:

_time                Application  Action   Target_URL  Duration_in_Mins  User_Action_Type  useractions_experience_score
2024-04-18 22:45:22  app1         action3  url2        0.06              actiontype2       FRUSTRATED

If the above looks close, the first thing you need to do is to forget all about Splunk's flattened fields userActions{}.*; in fact, discard them all.  Use spath to reach elements of this array, then mvexpand over the elements, no funny mvzip business.  After that, everything becomes trivial.

Using my speculated data, I can reconstruct your SPL into the following to obtain my illustrated output:

index="xxx" sourcetype="xxx" source=xxx
 events{}.application="xxx" userExperienceScore=FRUSTRATED

| fields - userActions{}.*
| spath path=userActions{}
| mvexpand userActions{}
| spath input=userActions{}
| dedup application name targetUrl
| search apdexCategory = FRUSTRATED application = * name = *
| sort - _time
| rename application as Application, name as Action, targetUrl as Target_URL,
  type as User_Action_Type, apdexCategory as useractions_experience_score
| eval Duration_in_Mins = round(duration / 60000, 2)
| table _time, Application, Action, Target_URL,Duration_in_Mins,User_Action_Type,useractions_experience_score

Hope this helps.

Here is an emulation of my speculated data.  Play with it and compare with your real data:

| makeresults
| eval _raw = "{
  \"userActions\": [
    {
      \"application\": \"app1\",
      \"name\": \"action1\",
      \"targetUrl\": \"url1\",
      \"duration\": 1234,
      \"type\": \"actiontype1\",
      \"apdexCategory\": \"SATISFIED\"
    },
    {
      \"application\": \"app1\",
      \"name\": \"action2\",
      \"targetUrl\": \"url1\",
      \"duration\": 2345,
      \"type\": \"actiontype1\",
      \"apdexCategory\": \"DISATISFIED\"
    },
    {
      \"application\": \"app1\",
      \"name\": \"action3\",
      \"targetUrl\": \"url2\",
      \"duration\": 3456,
      \"type\": \"actiontype2\",
      \"apdexCategory\": \"FRUSTRATED\"
    }
  ],
  \"userExperienceScore\": \"FRUSTRATED\",
  \"events\": [
    {\"application\": \"xxx\", \"irrelevant\": \"aaa\"},
    {\"application\": \"yyy\", \"irrelevant\": \"bbb\"}
  ]
}"
| spath
``` data speculation for
index="xxx" sourcetype="xxx" source=xxx events{}.application="xxx" userExperienceScore=FRUSTRATED
```

SureshkumarD
Loves-to-Learn Lots

Hi @yuanliu ,

Thanks much for your response. I tried the SPL query you shared. If I dedup only application, name and targetUrl, then I see duplicate values in the _time field (refer to screenshot 1); screenshot 2 shows the real data's timestamps. This in turn gives a mismatch between the query output count (8) and the real data logs count (12).

Screenshot 1:

[screenshot 1]
Screenshot 2:

[screenshot 2]

yuanliu
SplunkTrust

You are avoiding answering the question everybody asks: what is your actual data structure?  Is it close to what I speculated?  Further, you have never illustrated the expected output, so the two screenshots with only timestamps mean nothing to volunteers.  How can we help further?

Specifically, WHY should the output NOT have multiple timestamps after deduping application, name and targetUrl?  In your original SPL, the only dedup is on x, which is Application+Action+Target_URL.  How is this different?  Anything after mvexpand in my search is based on my reading of your intent from that complex SPL sample alone.  Instead of making volunteers read your mind, how about expressing the actual dataset, the result you are trying to get from the data, and the logic to derive the desired result, in plain language (without SPL)?
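
For reference, here is a minimal run-anywhere sketch (all values made up) of why multiple timestamps are expected to survive a dedup on a field combination: dedup keeps the first event for each unique combination, and each surviving event keeps its own _time.

| makeresults count=6
| streamstats count as n
| eval _time = _time - n*60
| eval app="app1", action="action".(n%3)
| dedup app action

Three rows remain, one per distinct (app, action) pair, each with its own _time.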


SureshkumarD
Loves-to-Learn Lots

Hi @yuanliu ,

Here are my responses below. Please let me know if I missed anything.

1) What is your actual data structure? - It is in JSON format.

userActions: [
  {
    apdexCategory: FRUSTRATED
    application: xxx
    cdnBusyTime: null
    cdnResources: 0
    cumulativeLayoutShift: 0.0006
    customErrorCount: 0
    dateProperties: [...]
    documentInteractiveTime: 13175
    domCompleteTime: 15421
    domContentLoadedTime: 14261
    domain: xxx
    doubleProperties: [...]
    duration: 15430
    endTime: 1714043044710
    firstInputDelay: 1
    firstPartyBusyTime: null
    firstPartyResources: 0
    frontendTime: 14578
    internalApplicationId: APPLICATION-B9BADE8D75A35E32
    internalKeyUserActionId: APPLICATION_METHOD-E3EF5284923A6BA3
    javascriptErrorCount: 0
    keyUserAction: true
    largestContentfulPaint: 15423
    loadEventEnd: 15430
    loadEventStart: 15430
    longProperties: [...]
    matchingConversionGoals: [...]
    name: loading of page /home
    navigationStart: 1714043029280
    networkTime: 804
    requestErrorCount: 0
    requestStart: 630
    responseEnd: 852
    responseStart: 678
    serverTime: 48
    speedIndex: 13866
    startTime: 1714043029280
    stringProperties: [...]
    targetUrl: xxx
    thirdPartyBusyTime: null
    thirdPartyResources: 0
    totalBlockingTime: null
    type: Load
    userActionPropertyCount: 0
    visuallyCompleteTime: 15416
  }
]



2) Is it close to what I speculated? Further, you have never illustrated what is expected output. So, the two screenshots with only timestamps mean nothing to volunteers. How can we help further? - Yes, it is close to what you speculated, and here is the expected output:

_time                Application  Action            Target_URL   Duration_in_Mins  User_Action_Type
2024-04-25 07:39:53  xxx          loading /home     www.abc.com  0.26              Load
2024-04-25 06:25:50  abc          loading /wcc/ui/  www.xyz.com  3.00              Load
2024-04-24 19:00:57  xyz          keypress policy   www.bdc.com  3.00              Load
2024-04-24 17:05:11  abc          loading /home     www.xyz.com  0.53              Xhr
2024-04-24 10:14:47  bcd          loading /prod     www.rst.com  0.02              Load


3) Specifically, WHY should the output NOT have multiple timestamps after deduping application, name and targetUrl? - I do see multiple timestamps after deduping application, name and targetUrl. Here is the screenshot below; see the yellow highlights.

[screenshot]
4) In your original SPL, the only dedup is on x, which is Application+Action+Target_URL. How is this different? - My bad; here is the correct SPL query that I tried:

index="dynatrace" sourcetype="dynatrace:usersession" source=saas_prod events{}.application="participantmanagement.thehartford.com" userExperienceScore=FRUSTRATED
| rename userActions{}.application as Application, userActions{}.name as Action, userActions{}.targetUrl as Target_URL, userActions{}.duration as Duration, userActions{}.type as User_Action_Type, userActions{}.apdexCategory as useractions_experience_score
| eval x=mvzip(mvzip(Application,Action),Target_URL), y=mvzip(mvzip(Duration,User_Action_Type),useractions_experience_score)
| mvexpand x
| mvexpand y
| eval x=split(x,","), y=split(y,",")
| eval Application=mvindex(x,0), Action=mvindex(x,1), Target_URL=mvindex(x,2), Duration=mvindex(y,0), User_Action_Type=mvindex(y,1), useractions_experience_score=mvindex(y,2)
| eval Duration_in_Mins=Duration/60000
| eval Duration_in_Mins=round(Duration_in_Mins,2)
| table _time, Application, Action, Target_URL,Duration_in_Mins,User_Action_Type,useractions_experience_score
| search useractions_experience_score=FRUSTRATED
| sort - _time
| dedup _time
| search Application="*"
| search Action="*"
| fields - useractions_experience_score


5) Anything after mvexpand in my search is based on my reading of your intent based only on that complex SPL sample. Instead of making volunteers read your mind, how about expressing the actual dataset, the result you are trying to get from the data, and the logic to derive the desired result, in plain language (without SPL)? - Here is the actual dataset below:

4/25/24 7:39:53.000 AM
{
  applicationType: WEB_APPLICATION
  bounce: true
  browserFamily: Microsoft Edge
  browserMajorVersion: Microsoft Edge 122
  browserType: Desktop Browser
  clientType: Desktop Browser
  connectionType: UNKNOWN
  dateProperties: [...]
  displayResolution: HD
  doubleProperties: [...]
  duration: 15430
  endReason: TIMEOUT
  endTime: 1714043044710
  errors: [...]
  events: [
    {
      application: xxx
      internalApplicationId: APPLICATION-B9BADE8D75A35E32
      name: Page change
      page: /index.html
      pageGroup: /index.html
      startTime: 1714043029280
      type: PageChange
    }
    {
      application: xxx
      internalApplicationId: APPLICATION-B9BADE8D75A35E32
      name: Page change
      page: /home
      pageGroup: /home
      pageReferrer: /index.html
      pageReferrerGroup: /index.html
      startTime: 1714043043405
      type: PageChange
    }
    {
      application: xxx
      internalApplicationId: APPLICATION-B9BADE8D75A35E32
      name: Page change
      page: /employee/details
      pageGroup: /employee/details
      pageReferrer: /employee/coverage-list
      pageReferrerGroup: /employee/coverage-list
      startTime: 1714043088821
      type: PageChange
    }
    {
      application: xxx
      internalApplicationId: APPLICATION-B9BADE8D75A35E32
      name: Page change
      page: /employee/coverage-list
      pageGroup: /employee/coverage-list
      pageReferrer: /employee/details
      pageReferrerGroup: /employee/details
      startTime: 1714043403199
      type: PageChange
    }
  ]
  hasCrash: false
  hasError: false
  hasSessionReplay: false
  internalUserId: 17140430425327CQRENIQATV2OT5DV5BTJ2UB3MQF2ALH
  ip: 10.215.67.0
  longProperties: [...]
  matchingConversionGoals: [...]
  matchingConversionGoalsCount: 0
  newUser: true
  numberOfRageClicks: 0
  numberOfRageTaps: 0
  osFamily: Windows
  osVersion: Windows 10
  partNumber: 0
  screenHeight: 720
  screenOrientation: LANDSCAPE
  screenWidth: 1280
  startTime: 1714043029280
  stringProperties: [...]
  syntheticEvents: [...]
  tenantId: vhz76055
  totalErrorCount: 0
  totalLicenseCreditCount: 0
  userActionCount: 1
  userActions: [
    {
      apdexCategory: FRUSTRATED
      application: xxx
      cdnBusyTime: null
      cdnResources: 0
      cumulativeLayoutShift: 0.0006
      customErrorCount: 0
      dateProperties: [...]
      documentInteractiveTime: 13175
      domCompleteTime: 15421
      domContentLoadedTime: 14261
      domain: xxx
      doubleProperties: [...]
      duration: 15430
      endTime: 1714043044710
      firstInputDelay: 1
      firstPartyBusyTime: null
      firstPartyResources: 0
      frontendTime: 14578
      internalApplicationId: APPLICATION-B9BADE8D75A35E32
      internalKeyUserActionId: APPLICATION_METHOD-E3EF5284923A6BA3
      javascriptErrorCount: 0
      keyUserAction: true
      largestContentfulPaint: 15423
      loadEventEnd: 15430
      loadEventStart: 15430
      longProperties: [...]
      matchingConversionGoals: [...]
      name: loading of page /home
      navigationStart: 1714043029280
      networkTime: 804
      requestErrorCount: 0
      requestStart: 630
      responseEnd: 852
      responseStart: 678
      serverTime: 48
      speedIndex: 13866
      startTime: 1714043029280
      stringProperties: [...]
      targetUrl: xxx
      thirdPartyBusyTime: null
      thirdPartyResources: 0
      totalBlockingTime: null
      type: Load
      userActionPropertyCount: 0
      visuallyCompleteTime: 15416
    }
  ]
  userExperienceScore: FRUSTRATED
  userSessionId: UEFUBRDAPRDHURTCPUKFKKPJVORTPPJA-0
  userType: REAL_USER
}

PickleRick
SplunkTrust

OK. First things first.

As @bowesmana already pointed out, you're using dedup. This is a very tricky command. It leaves you with just the first event having a unique value of a given field (or set of fields). You lose all information about the data in all other fields if they differ across multiple events.

A simple run-anywhere example

| makeresults count=10
| eval constant="b"
| eval random=random()
| eventstats values(random) as all_values
| dedup constant

Here we explicitly created a summarized field called all_values so that we can see at the end what events were generated, but after the dedup command we're left with just one result having just one value in the random field.

That's one thing that definitely affects your results.

Another thing is that multivalued fields can be a bit unintuitive sometimes.

Another run-anywhere example

| makeresults count=10 
| eval a=split("a,b,c,d",",")
| stats count count(a)

Since each of the generated result rows has a multivalued field a with four separate values, the overall count, which returns just the count of results, will return 10, but count(a), which counts the values in the a field, will return 40 because there are 40 separate values in that field.


SureshkumarD
Loves-to-Learn Lots

Hi @PickleRick,

Not sure if I explained my requirement the right way. I would like to expand the multiple values present in each row into separate rows containing only the "Frustrated"-related details, like the expected output below.

[screenshot]

Expected output:

URL          Duration  Type  Status
www.cde.com  88647     Load  Frustrated
www.fge.com  6265      Load  Frustrated
www.abc.com  500       Load  Frustrated

bowesmana
SplunkTrust

Judging from the SPL, where you have two searches for FRUSTRATED, it seems like you have data where multiple userExperienceScores can exist for the same event, hence all the mvexpanding.

As @PickleRick points out, it's quite tricky to deal with multivalue fields, particularly when you have 6 MV fields that you are zipping up into 2 groups (x and y). I assume you are using 2 groups because there is not a 1:1 correlation between the MVs in each of the groups.
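
As an aside, mvzip pairs values strictly by position, so the trick only holds when the zipped fields have identical length and ordering. A minimal run-anywhere sketch (made-up values):

| makeresults
| eval Application=split("app1,app2",","), Action=split("act1,act2",",")
| eval x=mvzip(Application, Action)
| mvexpand x
| eval fields=split(x,","), Application=mvindex(fields,0), Action=mvindex(fields,1)

If the fields ever differ in length, or a value itself contains a comma, the positional pairing silently breaks, which is one way counts drift.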

What I would suggest is to find a reasonably complex SINGLE event (or two) that exhibits the problem. This will make it much easier to diagnose the issue. Then we can help explain what is going on.

If you are able to share an example of the raw data (sanitised and preferably NOT a screen image - so we can produce a working example of a solution) that would be good.

PickleRick
SplunkTrust

We don't know your raw data, but the main question is why you go to all this trouble of mvzipping and joining all those values into multivalued fields when the next thing you want to do is mvexpand.

Why not just filter on raw data in the initial search?
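
For example, a sketch only, reusing the placeholder index/sourcetype from this thread and @yuanliu's speculated structure: push the FRUSTRATED condition into the initial search and re-apply it per array element after expansion.

index="xxx" sourcetype="xxx" source=xxx
  events{}.application="xxx"
  userExperienceScore=FRUSTRATED
  userActions{}.apdexCategory=FRUSTRATED
| spath path=userActions{}
| mvexpand userActions{}
| spath input=userActions{}
| where apdexCategory="FRUSTRATED"

The initial filter keeps only events containing at least one FRUSTRATED action; the where after mvexpand then drops the remaining non-FRUSTRATED array elements.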


bowesmana
SplunkTrust

You are deduping 'x', so you need to understand the consequences of that.

Your search is not doing any aggregations, so without knowing what combinations of Application, Action and Target_URL you have, it's impossible to know what's going on here.

These 3 lines are most likely the source of your problem

| mvexpand x
| mvexpand y
| dedup x
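
To see why, here is a run-anywhere sketch (made-up values): two consecutive mvexpand commands over parallel multivalue fields produce a cross-product, not paired rows.

| makeresults
| eval x=split("a,b,c",","), y=split("1,2,3",",")
| mvexpand x
| mvexpand y

This returns 9 rows (3 × 3) instead of the 3 paired rows presumably intended, so any count taken afterwards is inflated.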

KendallW
Path Finder

Hi @SureshkumarD, would it be possible to provide some sample data to go with the search?


SureshkumarD
Loves-to-Learn Lots

Hi @KendallW @bowesmana ,

Can I share the raw data with you to go with the search? Please let me know.
