I want to pull hundreds of millions of events out of billions, but each search takes more than an hour.
I'm using the simplest possible search: index="test" name=jack
But it's very slow.
I checked memory and CPU usage: each search uses only 200-300 MB of memory.
So I modified the max_mem_usage_mb, search_process_memory_usage_percentage_threshold, and search_process_memory_usage_threshold parameters in $SPLUNK_HOME/etc/apps/search/local/limits.conf, but they didn't seem to make a significant difference.
Is there an effective way to improve the speed of my search?
Thanks! 🙂
You'll find Jack's company much faster if your search also specifies how to find a company. What that looks like depends on your data, which you didn't share with us; knowing your data would help.
That could look like one of these:
index=foo sourcetype=company_register name=jack
index=foo category=employees name=jack
etc.
If you have an accelerated datamodel, it could look like this:
| tstats summariesonly=t values(your_model.company) as companies from datamodel=your_model where your_model.name=jack
To chain that, you could build a dashboard with in-page drilldowns that step through the tree you expect in your data.
See my previous answer on tuning your SHC for performance.
I tried, but it didn't work.
:(
Thanks, I'll try more indexers.
You improve (the speed of) your search by DOING SOMETHING with your millions of events: pipe them into more SPL by adding | <other SPL here>. What is your SPL now? Show us your sample events and a mockup of your desired FINAL OUTPUT. Make it work first, then worry about optimizing it.
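For example, instead of returning raw events, aggregate them on the indexers (a sketch; the grouping field source is just a stand-in for whatever field matters in your data):

index="test" name=jack | stats count by source

A stats like this ships only a small summary back to the search head instead of every matching raw event.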
Splunk is not an ETL tool (it is a needle-in-the-haystack tool, not a forklift-the-haystack tool). There is no way to make it perform acceptably when the final output is millions of rows/events. It can process billions down to millions, and millions down to hundreds or maybe thousands, but that's it. You need to figure out what you really need to do; you probably don't need the millions as the final output, but rather as input to some other calculation, which can probably be done in Splunk. Otherwise, use another, more appropriate tool. Splunk is not a part-of-the-pipeline tool, it is an end-of-the-pipeline tool, there to give the final conclusions.
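As a sketch of keeping that downstream calculation inside Splunk rather than exporting millions of rows (the grouping field company here is hypothetical; substitute whatever your calculation actually needs):

index="test" name=jack
| stats count as events dc(host) as hosts by company
| sort - events

The millions of matching events stay on the indexers; only the per-company summary reaches the search head.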
Because of the large amount of data, the events my keyword search returns are highly correlated, so I have to extract a lot of data; I can't add restrictions and discard part of it, and that is what makes the search slow. I am eager to solve this problem.
Thanks :)
Increase the resources available to Splunk at the search head level.
Modify the settings below (based on your environment) in $SPLUNK_HOME/etc/system/local/limits.conf and cycle the search head(s).
[defaults]
max_mem_usage_mb = 16000
[search]
# With 14 CPUs, the total system-wide number of concurrent searches this
# machine can handle is base_max_searches + max_searches_per_cpu x num_cpus
# = 6 + 16 x 14 = 230.
base_max_searches = 6
max_searches_per_cpu = 16
[scheduler]
# Percent of the total search concurrency available to the scheduler:
# total concurrency x max_searches_perc = 230 x 80% = 184 scheduled searches.
# Per-user default (needed only if different from the system default) when
# no max_searches_perc.<n>.when stanza (if any) below matches.
max_searches_perc = 80
The search currently provided, index=foo field=value, does not consume search-head memory at all. It is purely CPU-bound on the indexers, dominated by uncompressing the raw data.
All searches consume memory on the search head (assuming you are not running your query directly from an indexer's UI). The indexers perform the work of the query and then pass the results back to the SH for any additional processing and display to the end user. Depending on the size of your search artifacts, this can produce tremendous resource consumption on both sides.
You may have a look at this slide deck, especially slide 27. This is a pure streaming-command search using some field extraction. That kind of search runs on the indexers and depends on uncompressing all the packed raw data files, which, as we now know, are all on one disk => slow.
Of course, displaying the events will take some small amount of memory, but that won't be the bottleneck in this scenario.
just added the missing link...
The search currently provided does not do any additional work on the SH, it's all map and no significant reduce.
In order to help this person we first need to understand their goals, not throw around tons of deep-dive tuning.
In order to help this person we should also not provide inaccurate information, such as the idea that searches do not consume memory on a search head.
This search doesn't; there is no command running on the SH.
It would be different if there were a high-cardinality stats, a transaction, etc.
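For illustration, a search like this (session_id is a hypothetical field name) would push real work onto the search head, because transaction is a non-streaming command that must hold and correlate events in memory:

index=foo | transaction session_id maxspan=30m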
While you're defining what information to provide, I wouldn't recommend max_searches_per_cpu = 16. It's a good way to thrash your indexing tier.
@adonio♦ Could you help me ?
Thanks 🙂
@qazwsxe
For a faster search, you need to be specific: use source or sourcetype in your search, and use the time range picker to search only the time range you need.
Also, Splunk has the following search modes: Fast, Smart, and Verbose. Fast mode skips the non-essential field discovery the other modes perform.
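For example (a sketch; the sourcetype value is a placeholder for your own):

index="test" sourcetype=your_sourcetype name=jack earliest=-24h@h latest=now

Restricting the time range and sourcetype lets the indexers skip whole buckets instead of scanning everything.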
I also used sourcetype and specified Fast mode, but the speed is still very slow. I don't know how to solve it.
Be more specific in describing what goal you want your search to achieve. I doubt it's "list millions of events on screen" because there's no value in that.
I want to extract hundreds of millions of events from billions with a simple keyword search, but it is too slow. No matter how much data is searched, CPU and memory usage do not change significantly. Is there something wrong with my usage? I just want to speed up my search.
Okay, what do you want to do with those hundreds of millions of data?
Because of the large amount of data, the events my keyword search returns are highly correlated, so I have to extract a lot of data; I can't add restrictions and discard part of it, and that is what makes the search slow. I am eager to solve this problem.
Thanks :)