Getting Data In

How to clean duplicates from an index?

delly_fofie
Engager

Hello,

We have a use case.

Using Splunk DB Connect, we ingest data from various systems, especially from our ERP.

Every change to an article in the ERP is pushed into a temp DB, which is monitored by Splunk DB Connect.

There are millions of data movements each day.

But at the end of the day, we only need to work with the latest unique data in the system for each article. Each event has some 10-30 fields.

What is the best way to get rid of all the duplicates that are coming into the system?
Delete? How?
Skip? A lookup? A summary index? How?

What ideas do you have, or is there maybe one I'm missing?


gcusello
SplunkTrust

Hi @delly_fofie,

Deleting duplicates in Splunk is possible, but in this way you only make a logical deletion; in other words, you don't save either disk space or license usage.

My hint is to optimize your extraction query to avoid indexing data twice.
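For example, a DB Connect rising-column input re-reads only rows added since the previous run, and the query itself can keep just the newest row per article within each batch. A minimal sketch, assuming a hypothetical erp_temp_changes table with article_id and updated_at columns (DB Connect substitutes the saved checkpoint value for the ? placeholder in a rising-column input):

    -- keep only the newest change per article among rows
    -- added since the last checkpoint
    SELECT article_id, article_data, updated_at
    FROM (
        SELECT article_id, article_data, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY article_id
                   ORDER BY updated_at DESC) AS rn
        FROM erp_temp_changes
        WHERE updated_at > ?   -- rising-column checkpoint
    ) batch
    WHERE rn = 1
    ORDER BY updated_at ASC

Note that this only avoids re-reading rows already extracted; if an article changes again later, a new event for it is still indexed.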

Ciao.

Giuseppe

0 Karma

delly_fofie
Engager

Hello @gcusello, let's assume I go with your idea.

But even if on day 1 I manage to index only unique records, the next day I will get new entries as well as changes to articles that already exist in Splunk, so duplicate data for those articles will still be created in Splunk.


gcusello
SplunkTrust

Hi @delly_fofie ,

the only way to avoid indexing a log twice is to run an SQL query that checks whether the data is duplicated before indexing it.

This action can be performed on SQL, not in Splunk.

In Splunk you could also ingest duplicated events and then, using the Search Processing Language (SPL), remove the duplicates in search results, but not at indexing time.

In other words, it isn't possible to check whether data is already indexed before indexing it; the only way to do this is in the generating SQL query that you use to extract events with DB Connect.
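On the SPL side, something like the following would keep only the most recent event per article at search time. A minimal sketch, assuming a hypothetical erp index and an article_id field (dedup keeps the first event it encounters, which in the default reverse-time order is the latest one):

    index=erp sourcetype=erp:article
    | dedup article_id

An alternative with the same effect is | stats latest(*) as * by article_id, which can be more efficient over very long time ranges.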

Ciao.

Giuseppe


gcusello
SplunkTrust

Hi @delly_fofie ,

good for you, see you next time!

Ciao and happy splunking

Giuseppe

P.S.: Karma Points are appreciated by all the contributors 😉


ITWhisperer
SplunkTrust

Deleting events from an index is tricky, as it is easy to accidentally delete all the events in the index. This is why deletion is protected by its own level of security and is usually granted only to specific, isolated users, to minimise the likelihood of accidental deletions.

So, assuming you aren't going to delete events from the index, and that your DB Connect input is potentially retrieving events which are already in the index, you should consider comparing the retrieved events with those already in the index and only adding the new or updated events.

Another possibility is to have a "summary" index which you refresh (delete and insert) with the latest events for each article.
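A scheduled search along these lines could populate such a summary index. A minimal sketch, assuming hypothetical index names erp and erp_latest and an article_id field (collect only appends, so either searches over erp_latest still dedup by article_id, or older events are cleaned out separately):

    index=erp earliest=-24h@h latest=@h
    | stats latest(_time) as _time, latest(_raw) as _raw by article_id
    | collect index=erp_latest

Bear in mind that the "delete" half of a delete-and-insert refresh needs the | delete command and the can_delete role, which is exactly the capability described above as risky.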
