Getting Data In

How to clean duplicates from an index?

delly_fofie
Engager

Hello,

We have a use case.

Using Splunk DB Connect, we ingest data from various systems, especially from our ERP.

Every change to an article in the ERP is pushed into a temp DB, which is monitored by Splunk DB Connect.

There are millions of data movements each day.

But at the end of the day, we only need to work with the latest unique data in the system for each article. Each event has some 10-30 fields.

What is the best way to get rid of all the duplicates that are coming into the system?
Delete? How?
Skip? A lookup? A summary index? How?

What ideas do you have, or is there maybe one I'm missing?


gcusello
SplunkTrust

Hi @delly_fofie,

Deleting duplicates in Splunk is possible, but in this way you only make a logical deletion; in other words, you don't save either disk space or license usage.

My hint is to optimize your extraction query to avoid indexing data twice.
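For example, a DB Connect rising-column input re-reads only rows added since the previous run, and the query itself can keep just the newest row per article within each batch. A minimal sketch, assuming a hypothetical erp_temp_changes table with article_id and updated_at columns (DB Connect substitutes the saved checkpoint value for the ? placeholder in a rising-column input):

    -- keep only the newest change per article among rows
    -- added since the last checkpoint
    SELECT article_id, article_data, updated_at
    FROM (
        SELECT article_id, article_data, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY article_id
                   ORDER BY updated_at DESC) AS rn
        FROM erp_temp_changes
        WHERE updated_at > ?   -- rising-column checkpoint
    ) batch
    WHERE rn = 1
    ORDER BY updated_at ASC

Note that this only avoids re-reading rows already extracted; if an article changes again later, a new event for it is still indexed.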

Ciao.

Giuseppe

0 Karma

delly_fofie
Engager

Hello @gcusello, let's assume I go with your idea.

But even if on day 1 I manage to index only unique records, the next day I will get new entries as well as changes to articles that already exist in Splunk, so duplicate data for those articles will still be created in Splunk.


gcusello
SplunkTrust

Hi @delly_fofie ,

the only way to avoid indexing a log twice is to run an SQL query that checks whether the data is duplicated before indexing it.

This action can be performed on SQL, not in Splunk.

In Splunk you could also ingest duplicated events and then, using the Search Processing Language (SPL), remove the duplicates in search results, but not at indexing time.

In other words, it isn't possible to check whether data is already indexed before indexing it; the only way to do this is in the generating SQL query that you use to extract events with DB Connect.
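On the SPL side, something like the following would keep only the most recent event per article at search time. A minimal sketch, assuming a hypothetical erp index and an article_id field (dedup keeps the first event it encounters, which in the default reverse-time order is the latest one):

    index=erp sourcetype=erp:article
    | dedup article_id

An alternative with the same effect is | stats latest(*) as * by article_id, which can be more efficient over very long time ranges.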

Ciao.

Giuseppe


gcusello
SplunkTrust

Hi @delly_fofie ,

good for you, see you next time!

Ciao and happy splunking

Giuseppe

P.S.: Karma Points are appreciated by all the contributors 😉


ITWhisperer
SplunkTrust

Deleting events from an index is tricky, as it is easy to accidentally delete all the events in the index. This is why deletion is protected by its own level of security and is usually granted only to specific, isolated users, to minimise the likelihood of accidental deletions.

So, assuming you aren't going to delete events from the index, and that your DB Connect input is potentially retrieving events which are already in the index, you should consider comparing the retrieved events with those already in the index and only adding the new or updated events.

Another possibility is to have a "summary" index which you refresh (delete and insert) with the latest events for each article.
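A scheduled search along these lines could populate such a summary index. A minimal sketch, assuming hypothetical index names erp and erp_latest and an article_id field (collect only appends, so either searches over erp_latest still dedup by article_id, or older events are cleaned out separately):

    index=erp earliest=-24h@h latest=@h
    | stats latest(_time) as _time, latest(_raw) as _raw by article_id
    | collect index=erp_latest

Bear in mind that the "delete" half of a delete-and-insert refresh needs the | delete command and the can_delete role, which is exactly the capability described above as risky.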
