How to Avoid Skipped Searches in Splunk Cloud

By Kamal Dorairaj, Senior Splunk Consultant

When working in Splunk, scheduled searches are sometimes skipped and never execute. When the concurrency limit is reached, a scheduled job waits in line for a search slot and sits in deferred status. If the scheduling window expires before a slot opens, the search does not run and the job moves to skipped status.


Reports, alerts, data models, and many app dashboards, such as those in Enterprise Security (ES), depend on your scheduled searches. When these searches do not run in their scheduled time window, the overall Splunk environment becomes unreliable because of missing or incorrect data in dashboards and reports. This can cause you to miss crucial alerts and can slow dashboard performance.

How to Find Skipped Searches in Splunk Cloud

The SPL searches below show any scheduled searches that are being skipped, the reason why, and the saved search name. A time range of the last 7 days is a good place to start, and you can go back up to 30 days.

index=_internal sourcetype=scheduler status="skipped" | stats values(savedsearch_name) count by reason
index=_internal sourcetype=scheduler savedsearch_name=* status=skipped | stats count by reason
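Because jobs sit in deferred status before the scheduler gives up on them, a small variant of the same search surfaces deferrals before they turn into skips. A sketch, using the same scheduler log fields:

index=_internal sourcetype=scheduler status=deferred | stats count by savedsearch_name, reason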

Skipped Search Reasons and Solutions

These are the most common reasons for skipped searches in any Splunk Cloud stack:

Reason 1: The maximum number of concurrent running jobs for this historical scheduled search on this instance or cluster has been reached.

Explanation: By default, only one instance of a given saved search runs at a time. If the previous scheduled job for that search is still running when the next cron-scheduled run time comes due, the scheduler does not dispatch the new job, and it is skipped. Example: the cron schedule is configured to run the search every 5 minutes, but the search itself takes 10 minutes to complete, so runs that come due while the previous one is still executing are skipped.

Solution:
a. Tune the SPL of the search job so it completes within the configured time window. If possible, shorten the search time range (the earliest time) for the job.
b. If you cannot improve the search runtime, lengthen the cron interval so it is longer than the search execution time.
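To see which searches are at risk, you can compare each saved search's typical runtime against its schedule interval. Here is a minimal sketch against the scheduler logs; it assumes the status=success and run_time (seconds) fields that the scheduler logs in recent Splunk versions:

index=_internal sourcetype=scheduler status=success | stats avg(run_time) AS avg_runtime_sec max(run_time) AS max_runtime_sec count BY savedsearch_name | sort - avg_runtime_sec

Any search whose average runtime approaches or exceeds its cron interval is a candidate for tuning under (a) or rescheduling under (b).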

Reason 2: The maximum number of concurrent historical scheduled searches on this cluster has been reached, or the maximum number of concurrent auto-summarization searches has been reached.

Explanation:
Each Splunk Cloud stack is configured to run a maximum number of concurrent searches. You can find the relevant settings under Settings -> Search Preferences -> Limits set for scheduled searches and summarization searches. On the backend, these correspond to values in limits.conf, which Splunk Cloud manages for you. When the percentage is raised from 50% to 75%, the number of search slots for scheduled searches and summarization searches (such as data model accelerations) increases. Before increasing this concurrency limit, please make sure all the best practices listed in the Solution section below are implemented.

Skipped summarization searches also compound. Example: if a data model acceleration covering the last 5 minutes is skipped, the next run tries to cover the last 10 minutes of data. If that run is also skipped, the following one tries to cover the last 15 minutes, and the backlog keeps accumulating.
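For illustration, the UI percentages map to scheduler settings in limits.conf along these lines. This is a sketch; in Splunk Cloud these values are changed through the UI or by Splunk support, not by editing the file directly:

[scheduler]
# max scheduler searches, as a percentage of total concurrent search slots
max_searches_perc = 75
# max auto-summarization searches, as a percentage of the scheduler's slots
auto_summary_perc = 75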

Solution:
a. Avoid scheduling search jobs at common boundary times like 10:00 am, 11:00 am or 10:30 pm, 11:30 pm, etc. Doing so fires too many searches at the same moment. Tune the cron schedules of the scheduled jobs to fire on different minutes, which spreads the load through the hour and day (see the cron sketch after this list).
b. Disable unused scheduled searches. Remove unused apps and TAs.
c. Not all data models need to be accelerated. Acceleration is generally only useful when the summary data is more than 1 GB.
d. Consider workload management (WLM). The following admission rule limits the number of concurrent ad hoc searches to 50% of the total search concurrency limit, which in turn frees search slots for scheduled searches and summarization searches: search_type=adhoc AND adhoc_search_percentage > 50.
e. Create a WLM rule that places prioritized searches in a high-priority pool, giving them maximum resources so they complete faster.
f. Tune each data model to search only the indexes it needs.
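As a sketch of tip (a), with hypothetical search names, staggering cron schedules across the hour might look like this:

Before (all fire at the top of the hour):
  Search_A: 0 * * * *
  Search_B: 0 * * * *
  Search_C: 0 * * * *

After (staggered across the hour):
  Search_A: 7 * * * *
  Search_B: 23 * * * *
  Search_C: 41 * * * *

Each search still runs hourly, but the three jobs no longer compete for the same search slots at minute 0.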

Setting up search jobs according to best practices can eliminate skipped searches and incomplete data. Following the quick tips here will ensure more thorough alerting and faster dashboards.