Deployment Architecture

Lost events on HF HEC

MichalG1
Explorer

Hello Team,

Deployment with:

- HF with ACK when sending to Indexer

- HEC on HF with ACK

- application sending events via HEC on HF with ACK

Still, in this model there is a chance that some events will be lost. The application might get an ACK from HEC, but if the event is still in the HF output queue (not yet sent to the indexer) and the HF has a non-graceful reboot, it cannot flush its output queue. Can you confirm? What would be the best way to address this, so that once the application receives an ACK we have an end-to-end guarantee that the event is indexed?

Thanks,

Michal

 


isoutamo
SplunkTrust

Hi

Have you read https://docs.splunk.com/Documentation/Splunk/9.2.1/Data/AboutHECIDXAck? And have you implemented ack response handling in your HEC client?

Are you using a separate channel value for every HEC client instance?

How many HEC receivers do you have, and are they behind a load balancer?
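For context, the ack response handling mentioned above means capturing the ackId that HEC returns per event (when useACK is enabled) and later polling the /services/collector/ack endpoint with the same channel. A minimal sketch of the two request shapes, assuming placeholder token and channel values (no real network calls are made here):

```python
import json
import uuid

# Placeholder token; in a real deployment this comes from your HEC token config.
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

def build_event_request(channel, event):
    """Build the HEC event POST. With useACK enabled, the response body
    contains an ackId, e.g. {"text":"Success","code":0,"ackId":0}."""
    headers = {
        "Authorization": "Splunk " + HEC_TOKEN,
        # Each HEC client instance must use its own channel GUID.
        "X-Splunk-Request-Channel": channel,
    }
    return "/services/collector/event", headers, json.dumps({"event": event})

def build_ack_request(channel, ack_ids):
    """Build the ack status poll. The response maps each ackId to true/false,
    e.g. {"acks":{"0":true}}; true means the event was indexed/replicated."""
    headers = {
        "Authorization": "Splunk " + HEC_TOKEN,
        "X-Splunk-Request-Channel": channel,  # must match the sending channel
    }
    return "/services/collector/ack", headers, json.dumps({"acks": ack_ids})

channel = str(uuid.uuid4())
path, headers, body = build_ack_request(channel, [0, 1, 2])
```

The important detail is that the ack poll must go to the same HF (same channel) that handed out the ackIds, which is exactly why load balancers complicate this.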

r. Ismo


MichalG1
Explorer

Thank you @isoutamo for the help here.

I have not yet implemented it, because I want to understand how resilient the whole solution is.

As far as I understand, we have two solutions:

- forwarder ACK configured in outputs.conf (useACK=true)

- HEC ACK configured in inputs.conf (useACK=true)

Both solutions are independent. But when I enable only HEC ACK, it effectively also enables forwarder-level acknowledgment (because HEC can only return an ACK to the client based on the information returned from the indexer).
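For reference, my understanding of how the two settings would look on the HF (the output group name, server list, and token name below are placeholders):

```
# outputs.conf on the HF: forwarder-to-indexer acknowledgment
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
useACK = true

# inputs.conf on the HF: HEC indexer acknowledgment (can also be set per token)
[http://my_hec_token]
useACK = true
```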

In the HEC doc we have:

"HEC responds with the status information to the client (4). The body of the reply contains the status of each of the requests that the client queried. A true status only indicates that the event that corresponds to that ackID was replicated at the desired replication factor"

So effectively I need to enable useACK=true in inputs.conf, correct?

Also, what happens when the HEC server (my HF) has a hardware crash before it receives the ACK from the indexer (or even before it flushes its output queue)? Will it be able to recover after that crash? To do that it would need some kind of journaled persistence. Without that, if the event is lost, my HEC client will query the HEC server indefinitely...

Thanks,

Michal


isoutamo
SplunkTrust

If you haven't implemented reading and queuing HEC acks, it cannot work. You will definitely lose some events without that implementation. And even if you have implemented it, with an LB deployed you will probably still get some duplicate events, as it is not guaranteed that you will check the ack from the same individual HF/HEC where you sent the original event.

I'm not sure if HEC ack also brings forwarder-level (HF to indexer) ack into use; personally I would enable it manually.

As I said, if I use HEC ack I also enable useACK in outputs.conf on the whole path from the HEC node to all indexers.

If your HF crashes before the HEC client has read the ack, your client should send those events again, and you will get duplicates. The same applies if you have many HFs behind an LB and sticky sessions don't work, or any HF crashes or stops serving.

You should implement your HEC client with a timeout so it cannot wait forever. Once the timeout is reached, it sends the event again. There will be some situations where you never get the ack for an individual event!
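The timeout-and-resend logic described above can be sketched as a small tracker on the client side. This is an illustrative assumption of one way to structure it (the names and the 30 s default are made up, and the actual HTTP send/poll calls are left out):

```python
import time
from typing import Optional

class AckTracker:
    """Tracks events sent to HEC until their ackId is confirmed.
    Events whose ack has not arrived within `timeout` seconds are
    handed back for resending (which may produce duplicates downstream)."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.pending = {}  # ackId -> (event, sent_at)

    def sent(self, ack_id: int, event: dict, now: Optional[float] = None) -> None:
        """Record an event right after HEC returned its ackId."""
        self.pending[ack_id] = (event, now if now is not None else time.time())

    def acked(self, ack_statuses: dict) -> None:
        """Apply a /services/collector/ack response: drop confirmed events.
        A false status just means "not indexed yet", so those stay pending."""
        for ack_id, ok in ack_statuses.items():
            if ok:
                self.pending.pop(int(ack_id), None)

    def to_resend(self, now: Optional[float] = None) -> list:
        """Return timed-out events and remove them from pending; the caller
        re-sends them, and they get fresh ackIds on the new request."""
        now = now if now is not None else time.time()
        expired = [aid for aid, (_, t) in self.pending.items()
                   if now - t > self.timeout]
        return [self.pending.pop(aid)[0] for aid in expired]
```

Resending after a timeout trades possible duplicates for the guarantee that nothing is silently dropped, which matches the behaviour described above: duplicates are the price of at-least-once delivery.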
