Slow indexer/receiver detection capability

hrawat_splunk

Starting with 9.1.3/9.2.1, the slow indexer/receiver detection capability is fully functional (SPL-248188, SPL-248140).

https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/Fixedissues
You can enable it on the forwarding side in outputs.conf:

maxSendQSize = <integer>
* The size of the tcpout client send buffer, in bytes.
  If the tcpout client (indexer/receiver connection) send buffer is full,
  a new indexer is randomly selected from the list of indexers provided
  in the server setting of the target group stanza.
* This setting allows the forwarder to switch to a new indexer/receiver if
  the current indexer/receiver is slow.
* A non-zero value sets the maximum send buffer size.
* 0 means no limit on max send buffer size.
* Default: 0
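To make the spec text concrete, here is a toy Python model of the switching rule it describes: with maxSendQSize = 0 the forwarder stays on its current target, and with a non-zero value it moves to a randomly chosen indexer once the current connection's outstanding bytes reach the limit. This only illustrates the documented behavior and is not Splunk's implementation; the function name, the host names, and the choice to exclude the current indexer from the random pick are assumptions.

```python
import random


def pick_target(servers, current, outstanding_bytes, max_send_q_size, rng=random):
    """Toy model of maxSendQSize (not Splunk's code).

    servers: indexers from the target group's server setting.
    outstanding_bytes: per-connection bytes still waiting to be sent.
    max_send_q_size: 0 means no limit, otherwise a byte limit.
    """
    # 0 = no limit: never switch because of the send buffer.
    if max_send_q_size == 0:
        return current
    # Buffer not full yet: keep using the current indexer.
    if outstanding_bytes.get(current, 0) < max_send_q_size:
        return current
    # Buffer full: pick another indexer at random (assumption: exclude current).
    others = [s for s in servers if s != current]
    return rng.choice(others) if others else current


servers = ["idx1:9997", "idx2:9997", "idx3:9997"]  # hypothetical hosts
outstanding = {"idx1:9997": 2_500_000, "idx2:9997": 0, "idx3:9997": 0}

print(pick_target(servers, "idx1:9997", outstanding, 0))          # idx1:9997
print(pick_target(servers, "idx1:9997", outstanding, 2_000_000))  # idx2 or idx3
```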

Additionally, 9.1.3/9.2.1 and above correctly log the target IP address that is causing tcpout blocking, for example:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

Note: This config works correctly starting with 9.1.3/9.2.1. Do not use it with 9.1.0/9.1.1/9.1.2/9.2.0, which calculate the value incorrectly (https://community.splunk.com/t5/Getting-Data-In/Current-dest-host-connection-is-using-18446603427033...).

gjanders

This setting definitely looks useful for slow receivers, but how would I determine when to use it, and an appropriate value?

For example, you mentioned:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

I note that your example has Warningcount=20, while a quick check in my environment shows Warningcount=1. If I'm only seeing the occasional warning, I assume tweaking this setting would be of minimal benefit?

Furthermore, how would I appropriately set the bytes value?

I'm assuming it's per pipeline, and that the relevant variables include volume per second per pipeline. Are there any other variables?

Any example of how this would be tuned and when?

Thanks

-
Alerts for Splunk Admins, Version Control for Splunk, Decrypt2 VersionControl For SplunkCloud

hrawat_splunk

If the warning count is 1, it's not a big issue.
The warning indicates that, out of the maxQueueSize bytes of the tcpout queue, a single connection has occupied a large share, so TcpOutputProcessor will experience pauses. maxQueueSize is per pipeline and is shared by all target connections in that pipeline.
You may want to increase maxQueueSize (for example, double it).
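A quick way to apply this across many forwarder logs is to pull the numbers out of the WARN line and compute what share of the tcpout queue the one connection is holding. The Python sketch below assumes the message text matches the example quoted in this thread; the regular expression, function name, and field names are illustrative, not an official parser.

```python
import re

# Matches the fields of the AutoLoadBalancedConnectionStrategy WARN line quoted in this thread.
WARN_RE = re.compile(
    r"Current dest host connection (?P<dest>\S+?),.*?"
    r"is using (?P<used>\d+) bytes\. "
    r"Total tcpout queue size is (?P<total>\d+)\. "
    r"Warningcount=(?P<count>\d+)"
)


def parse_warning(line):
    """Return dest, bytes used, total queue size, share of queue, and Warningcount (or None)."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    used, total = int(m["used"]), int(m["total"])
    return {
        "dest": m["dest"],
        "used": used,
        "total": total,
        "share": used / total,  # fraction of the per-pipeline tcpout queue
        "warning_count": int(m["count"]),
    }


line = ("WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host "
        "connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, "
        "_waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 "
        "is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20")

info = parse_warning(line)
print(info["dest"], f"{info['share']:.0%}", info["warning_count"])  # nn.nn.nn.nnn:9997 77% 20
```

In this example one connection holds about 77% of a 26,214,400-byte (25 MB) queue and the warning has fired 20 times; a single occurrence is, as noted in the reply, usually not a big issue.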

gjanders

Thanks, I'll review maxQueueSize.

If the warning count were higher, such as the 20 in your example, what would be the best way to determine a good value (in bytes) for maxSendQSize to avoid the slow indexer scenario?

hrawat_splunk

If Warningcount is high, I would first check whether the target receiver/indexer is applying back-pressure. Check whether queues are blocked on the target. If they are not blocked, check the target using netstat:

netstat -an|grep <splunktcp port>

and look at Recv-Q to see whether it's high. If the receiver queues are not blocked but netstat shows Recv-Q is full, the receiver needs additional pipelines.
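If you need to check Recv-Q on many receivers, the netstat output can be filtered programmatically. The sketch below assumes the Linux `netstat -an` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); macOS/BSD print addresses as host.port and Windows netstat has no Recv-Q column, so it would need adapting there. The sample output and the 1,000,000-byte threshold are made-up values for illustration; the thread does not define a number for "high".

```python
def recv_q_for_port(netstat_output, port):
    """Return (local, foreign, recv_q) for TCP sockets whose local port is `port`."""
    rows = []
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and non-TCP lines (udp, unix sockets, ...).
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, foreign = fields[1], fields[3], fields[4]
        if local.endswith(suffix):
            rows.append((local, foreign, int(recv_q)))
    return rows


# Hypothetical `netstat -an` output from a receiver listening on 9997.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:9997            0.0.0.0:*               LISTEN
tcp  4194304      0 10.0.0.5:9997           10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:9997           10.0.0.10:40112         ESTABLISHED
tcp        0     36 10.0.0.5:22             10.0.0.20:60001         ESTABLISHED
"""

HIGH_RECV_Q = 1_000_000  # hypothetical threshold in bytes
for local, foreign, recv_q in recv_q_for_port(SAMPLE, 9997):
    if recv_q > HIGH_RECV_Q:
        print(f"{foreign} -> {local}: Recv-Q={recv_q}")
```

A persistently large Recv-Q on the splunktcp port while the receiver's own queues are not blocked is the case described above where additional pipelines help.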

If Warningcount is high because of a rolling restart of the indexing tier, set maxSendQSize to roughly 5% of maxQueueSize.
Example:

maxSendQSize=2000000
maxQueueSize=50MB
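The 5% rule of thumb is simple arithmetic, but maxQueueSize is usually written with a unit suffix. The helper below assumes KB/MB/GB are 1024-based, which matches the log line in this thread (26214400 bytes is exactly 25 × 1,048,576); it does not handle maxQueueSize=auto, and the function names are illustrative.

```python
SUFFIXES = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(value):
    """Convert a size such as '50MB', '512KB' or '2000000' to bytes (1024-based units)."""
    text = str(value).strip().upper()
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def suggest_max_send_q_size(max_queue_size, percent=5):
    """Roughly `percent`% of maxQueueSize, using integer math to avoid float rounding."""
    return to_bytes(max_queue_size) * percent // 100


print(to_bytes("50MB"))                 # 52428800
print(suggest_max_send_q_size("50MB"))  # 2621440
```

For maxQueueSize=50MB this gives 2,621,440 bytes; the maxSendQSize=2000000 in the example is about 3.8% of 52,428,800, so "roughly 5%" is a ballpark rather than an exact ratio.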

If you use autoLBVolume, make sure that:

maxQueueSize > 5 x autoLBVolume
autoLBVolume > maxSendQSize
Example

maxQueueSize=50MB
autoLBVolume=5000000
maxSendQSize=2000000
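The two autoLBVolume rules can be checked mechanically before rolling out a new outputs.conf. This Python sketch takes all three values in bytes (50MB is taken as 50 × 1024 × 1024 bytes, an assumption consistent with the 26214400-byte queue size in the log line earlier in the thread) and simply encodes the two inequalities from this reply; the function name and messages are illustrative.

```python
def check_outputs_sizes(max_queue_size, auto_lb_volume, max_send_q_size):
    """Return the rules of thumb that are violated (empty list = all satisfied).

    All arguments are in bytes.
    """
    problems = []
    # Rule 1: maxQueueSize > 5 x autoLBVolume
    if not max_queue_size > 5 * auto_lb_volume:
        problems.append("maxQueueSize should be more than 5 x autoLBVolume")
    # Rule 2: autoLBVolume > maxSendQSize
    if not auto_lb_volume > max_send_q_size:
        problems.append("autoLBVolume should be larger than maxSendQSize")
    return problems


# Values from the example: maxQueueSize=50MB, autoLBVolume=5000000, maxSendQSize=2000000.
print(check_outputs_sizes(50 * 1024 * 1024, 5_000_000, 2_000_000))  # []
```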

maxSendQSize is the total outstanding raw size of events/chunks in the connection queue that still need to be sent to the TCP Send-Q. This data generally builds up when the TCP Send-Q is already full.

autoLBVolume is the minimum total raw size of events/chunks to be sent to a connection.
