I've been researching this topic for a while and am surprised I haven't found much information on it. I need to come up with a disaster recovery option in case an indexer goes down, for instance because of a hardware failure.
In the case of a failed indexer, questions I have are:
- What should be done about forwarders that are sending data to an indexer that goes down? Where does the data go, and is there an easy way to tell all these forwarders to send it somewhere else?
- Who should pick up the slack if an indexer goes down? How do you sync the databases so that they all have the same data? Can you index the same data on more than one indexer?
Has anyone else given this thought? Comments? Suggestions? I'm aware of the Splunk page regarding backup but didn't really see one for DR.
Disaster recovery for Splunk is not as complicated as it would seem on paper.
Check out these two links:
https://conf.splunk.com/files/2017/recordings/architecting-splunk-for-high-availability-and-disaster...
https://conf.splunk.com/files/2017/slides/architecting-splunk-for-high-availability-and-disaster-rec...
This is from my understanding of Splunk HA from a mostly app point of view. Sorry if this seems confusing and I might not have all the information in the right order. Also Splunk 5.x makes this a lot easier.
Q: What to do about forwarders who are forwarding their data to an indexer that goes down? Where does the data go and is there an easy way to tell all these forwarders to go somewhere else?
A: If you only have a single indexer, I would configure two things: indexer acknowledgment, to prevent loss of in-flight data, and a larger maxQueueSize in outputs.conf. If you are monitoring file or log data, Splunk will continue from its last known position, which is stored in the fishbucket. You will drop streamed TCP events if your queue is not large enough. Queue sizes can be increased in both inputs.conf and outputs.conf.
Indexer acknowledgment protects against loss of in-flight data when an indexer is in a failed or unusable state. This setting does have performance implications.
Increasing maxQueueSize allows your forwarder to hold more events in memory. This could help if you are streaming raw TCP events to a forwarder.
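As a rough sketch of the two settings described above, assuming a forwarder that sends to a single indexer: the group name primary_indexers, the host indexer1, the TCP input port 5140, and all sizes are placeholders for illustration, not recommendations.

```ini
# outputs.conf on the forwarder
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# indexer1 is a hypothetical host name
server = indexer1:9997
# Indexer acknowledgment: the forwarder holds on to data until the indexer confirms receipt
useACK = true
# Larger in-memory output queue (illustrative value)
maxQueueSize = 100MB

# inputs.conf on the forwarder, for a raw TCP input (port 5140 is an assumption)
[tcp://:5140]
# In-memory input queue for this input (illustrative value)
queueSize = 10MB
# Optional on-disk queue that takes over when the in-memory queue is full (illustrative value)
persistentQueueSize = 500MB
```

Larger queues only buy time. If the indexer stays down longer than the queues can absorb, streamed events are still dropped.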
If you have multiple indexers, you can configure Splunk's automatic load balancing. The forwarder then switches indexers at a set time interval, choosing only among the indexers that are still responding.
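A minimal sketch of automatic load balancing in outputs.conf, assuming two indexers named indexer1 and indexer2 (hypothetical host names) that listen on port 9997:

```ini
# outputs.conf on the forwarder
[tcpout]
defaultGroup = lb_group

[tcpout:lb_group]
# Listing several receivers in one group makes the forwarder balance across them
server = indexer1:9997, indexer2:9997
# Seconds between switches to another indexer in the list; 30 is the documented default
autoLBFrequency = 30
```

If indexer1 stops responding, the forwarder simply keeps sending to indexer2, so nothing has to be changed on the forwarders during the outage.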
Q: Who should pick up the slack if an indexer goes down?
A: If you are using Splunk’s auto load balancing, the remaining indexers will pick up the slack.
Q: How do you synch the databases so that they all have the same data? Can you index the same data to more than one indexer?
A: Keep in mind that Splunk isn't your standard relational database. There are a few answers to this problem, and yes, you can index the same raw data multiple times. To accomplish this you will have to use a combination of data cloning, load balancing, and data routing on your forwarders, all of which are configured in outputs.conf.
The problem with indexing the same data multiple times is storage and licensing cost (the Splunk license is based on indexing volume, in MB or GB per day).
You could use data cloning to send copies of the events to multiple receiving indexers by configuring outputs.conf on the forwarder. Keep in mind that cloned events will produce similar search results on each indexer, but the copies are NOT always exact.
You can also install multiple instances of Splunk on a single server if you have extra headroom on your servers. Using data distribution, you could have a forwarder send events to two physical servers that each contain two Splunk instances. The first physical server would contain splunk_index1_primary and splunk_index2_secondary, and the second physical server would contain splunk_index1_secondary and splunk_index2_primary. On the forwarder you would configure two data cloning groups.
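outputs.conf – data cloning with load balancing.
# Every group listed in defaultGroup receives its own full copy of the data (cloning)
[tcpout]
defaultGroup = cloned_group1,cloned_group2
# Within a group, the forwarder load balances across the listed servers
[tcpout:cloned_group1]
server = splunk_index1_primary:9997, splunk_index2_primary:9997
[tcpout:cloned_group2]
server = splunk_index1_secondary:9997, splunk_index2_secondary:9997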
Additional reading:
Install multiple Splunk instances on a single machine
I hope this gets you started or at least helps.
Great info, and I do want to emphasize your comment that the new index replication feature of Splunk 5.0 makes this much easier!
Hi
emiller42's suggestion is an excellent starting point.
It depends on what DR requirements you have. Do you need to recover after a disaster, or do you have to be disaster tolerant? Do you need historical data at all times, or is it enough to keep alerting on the new data that is still being indexed? Or maybe you are OK with a little downtime once in a while. Forwarders will not lose any data if an indexer goes down. They will start sending data again when the indexer becomes available.
These are the first steps I'd take to enhance the resilience of a Splunk installation:
Mirror the disks Splunk is indexing to -> a (single) disk failure won't hurt anymore
If you have data that is sent to your indexers via syslog, or any other data that is not handled by a forwarder (or by a solution which makes sure no data is lost in transit -> listening on UDP ports is bad), write that data to files and index the files. That way you can safely update your indexers.
Set up more than one indexer and configure the forwarders to do auto load balancing between them (this is easy to set up). If one indexer goes down, only the historical data on that indexer will be unavailable. Indexing will carry on, and alerting on and searching the recent data still works.
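For the syslog suggestion above, here is a minimal sketch of the file-based approach. It assumes a syslog daemon (rsyslog or syslog-ng, for example) already writes incoming messages to one directory per sending host under /var/log/remote. Both the path and the directory layout are assumptions for illustration.

```ini
# inputs.conf on a forwarder running on the syslog server
# Assumed layout: /var/log/remote/<sending-host>/<file>.log
[monitor:///var/log/remote]
sourcetype = syslog
# Take the host name from the 4th path segment (var=1, log=2, remote=3, <sending-host>=4)
host_segment = 4
```

Because the messages land on disk first, the forwarder can resume from where it left off after an indexer restart or upgrade instead of missing whatever arrived on the network port in the meantime.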