ITSI Rules Engine: Understanding and Troubleshooting

matthews
February 11, 2022
03:35 pm

In ITSI, the Rules Engine is the system responsible for processing notable events into episodes and executing actions. For this reason, it’s important to understand how it works and how to troubleshoot it when episodes are not being created consistently.

ITSI Rules Engine workflow.

First, a correlation search is built and scheduled to create notable events. Correlation searches typically search the itsi_summary index for service health scores or KPI values. The results of the correlation search are stored in itsi_tracked_alerts, the index that contains all notable events from every correlation search.

Next is the ITSI rules engine saved search. This search runs in real time, and the default Notable Event Aggregation Policy (NEAP) groups the notable events into episodes based on source. This is where the actual Rules Engine process begins, and by creating custom NEAPs, we can filter, group and act upon notable events from a correlation search.

Finally, the episodes created by the NEAP are stored in itsi_grouped_alerts, and any actions defined should execute based on the criteria defined.

Creating a NEAP.

To begin creating episodes, ensure that your Correlation Search is enabled, and that notable events are being populated in the itsi_tracked_alerts index. The source field in this index will be your correlation search name.

Once you confirm that events are flowing into itsi_tracked_alerts, create a NEAP and ensure that the condition “source” matches “your correlation search” is set under the “include the events if” section on the filtering criteria page.

The goal of this section is to include all notable events that should be grouped and alerted on in the NEAP. Under “split episodes by”, you can select how your notable events are split into individual episodes. This can be any field in the itsi_tracked_alerts data. I typically split by service_name.

For break conditions, you can configure when an episode should end. I typically use “if the flow of events is paused for 600 seconds”. The idea is that if no notable events have flowed into the episode for 10 minutes, the problem has been resolved. The right value ultimately depends on the service.
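To make the interaction between the split-by field and the break condition concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how ITSI implements grouping: the function name, the event dictionaries, and the sample timestamps are all hypothetical, and it assumes each event carries a numeric _time in seconds.

```python
from collections import defaultdict


def group_into_episodes(events, split_field="service_name", pause_seconds=600):
    """Illustrative only: group events per split_field value and start a new
    episode whenever the gap since the previous event exceeds pause_seconds."""
    episodes = defaultdict(list)  # split value -> list of episodes (lists of events)
    for event in sorted(events, key=lambda e: e["_time"]):
        stream = episodes[event.get(split_field)]
        # Continue the current episode only if the flow has not paused too long.
        if stream and event["_time"] - stream[-1][-1]["_time"] <= pause_seconds:
            stream[-1].append(event)
        else:
            stream.append([event])
    return dict(episodes)


# Hypothetical notable events (times in seconds).
events = [
    {"_time": 0, "service_name": "web"},
    {"_time": 300, "service_name": "web"},
    {"_time": 1000, "service_name": "web"},  # 700 s after the previous web event
    {"_time": 100, "service_name": "db"},
]

episodes = group_into_episodes(events)
print({name: [len(ep) for ep in eps] for name, eps in episodes.items()})
# → {'web': [2, 1], 'db': [1]}
```

In this sketch the third web event arrives 700 seconds after the previous one, so it opens a second episode for that service, while the db event forms its own episode because of the split.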

Once filled out, you can test your configuration by selecting preview on the same page. This should return all notable events from your correlation search, grouped into episodes based on your split by selection. If you are not seeing results here, ensure that your correlation search is indeed writing events to itsi_tracked_alerts, and that your filtering criteria are valid.

Troubleshooting episodes not being created.

If episodes are not being created, make sure that the service, correlation search, and NEAP are all enabled. Additionally, ensure that the itsi_event_grouping search is enabled, and check the jobs panel to confirm that it is running. To restart the rules engine, you can disable this search for a few minutes and then re-enable it.

Ensure that your correlation search is generating notable events. Without notable events, there is nothing for the Rules Engine to group. To do this, search the itsi_tracked_alerts index and filter on source=<your correlation search>. If there are notable events here, then the issue is most likely with the Rules Engine.

Next, check that events are being written to the itsi_grouped_alerts index. If you have notable events in itsi_tracked_alerts, but no episode info exists in the itsi_grouped_alerts index, then ensure the NEAP is correctly picking up events on the filtering criteria page of the NEAP. Use the preview option in the NEAP editor to confirm that events exist and should be grouped. If there are no results from preview, then check your filtering criteria. If there are results in preview, but no episodes are being created, then there is most likely an issue with the rules engine itself.
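The checks above form a small decision tree, which can be sketched in Python as follows. This is only a summary aid under stated assumptions: the function name and the three counts are hypothetical inputs that you would gather by hand from itsi_tracked_alerts, itsi_grouped_alerts, and the NEAP preview.

```python
def likely_cause(tracked_count, grouped_count, preview_count):
    """Map the three manual checks to the most likely place to look next."""
    if tracked_count == 0:
        # No notable events at all: nothing for the Rules Engine to group.
        return "correlation search"
    if grouped_count > 0:
        # Episodes are being written, so grouping is working.
        return "no grouping problem found"
    if preview_count == 0:
        # Notable events exist, but the NEAP is not picking them up.
        return "NEAP filtering criteria"
    # Preview shows results, yet no episodes are created.
    return "rules engine"


print(likely_cause(tracked_count=120, grouped_count=0, preview_count=120))
# → rules engine
```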

Troubleshooting steps

Ensure the ITSI rules engine search (itsi_event_grouping) is enabled and running. If needed, you can restart the rules engine by disabling this search for a few minutes and then re-enabling it.
Check the ITSI Event Analytics Monitoring dashboard for any glaring issues, and confirm that only one Java process is running. Multiple Java processes running at once can cause issues.
In the SA-ITOA app, there is an itsi_rules_engine.properties file that holds configs specific to the functionality of the rules engine. One setting in particular, exit_condition_messages_contain, is important to check.

exit_condition_messages_contain = unable to distribute to the peer,might have returned partial results,search results might be incomplete,search results may be incomplete,unable to distribute to peer,search process did not exit cleanly,streamed search execute failed,failed to create a bundles setup with server
This setting checks for specific error messages in the internal logs and, if any are found, stops the rules engine. It’s important to make sure that none of your ITSI search heads is generating these messages. If they are present, the Rules Engine will stop, and episode creation and alerting will be affected.
To check if any of these messages are present, the following search can be run (make sure to filter on your ITSI SH hosts):
index="_internal" source="*splunkd*" ("unable to distribute to the peer" OR "might have returned partial results" OR "search results might be incomplete" OR "search results may be incomplete" OR "unable to distribute to peer" OR "search process did not exit cleanly" OR "streamed search execute failed" OR "failed to create a bundles setup with server")
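If your exit_condition_messages_contain value differs from the default, the search needs to be rebuilt to match it. The following Python sketch shows one way to generate the search string from the setting value. The function name and the host name are hypothetical, and it assumes the messages contain no double quotes.

```python
def build_exit_condition_search(setting_value, hosts=()):
    """Turn the comma-separated setting value into an _internal search string."""
    messages = [m.strip() for m in setting_value.split(",") if m.strip()]
    terms = " OR ".join(f'"{m}"' for m in messages)
    host_filter = ""
    if hosts:
        # Restrict the search to the ITSI search heads.
        host_filter = " (" + " OR ".join(f'host="{h}"' for h in hosts) + ")"
    return f'index="_internal" source="*splunkd*"{host_filter} ({terms})'


# Shortened setting value and a hypothetical search head name.
value = "unable to distribute to the peer,search results may be incomplete"
print(build_exit_condition_search(value, hosts=["itsi-sh1"]))
# → index="_internal" source="*splunkd*" (host="itsi-sh1") ("unable to distribute to the peer" OR "search results may be incomplete")
```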

Using custom ITSI index names?

If you’re using custom index names instead of the defaults (for example, my_org_itsi_tracked_alerts instead of itsi_tracked_alerts), ensure that all macros have been updated to include these index names.

If you’re on ITSI 4.11 or lower and are using custom index names, you will need to take additional steps. The default index names are hardcoded in the Java build of ITSI, which can cause delays in episode creation or duplicate episodes, so the itsi_rules_engine.properties file in the SA-ITOA app needs to be updated to include the custom index names. This has been raised with Splunk and will be updated to use a macro in a future release. Until then, the following stanza can be added in SA-ITOA/local/itsi_rules_engine.properties to include the custom index names.

[default]
index_name = <your custom itsi_grouped_alerts index name>

# Search to retrieve missed events to backfill
backfill_events_search = search (`itsi_event_management_index_with_close_events` ) OR ( `itsi_event_management_group_index`) NOT orig_sourcetype=snow:incident | stats first(_time) AS _time first(_raw) AS _raw first(source) AS source first(sourcetype) AS sourcetype count(eval(index="<your custom itsi_grouped_alerts index name>")) AS c_grouped by event_id | where c_grouped=0 | fields _time, _raw, source, sourcetype | sort 0 _time

grouping_missed_events_search = search (`itsi_event_management_index_with_close_events` ) OR ( `itsi_event_management_group_index`) NOT orig_sourcetype=snow:incident | stats first(_time) AS _time first(_raw) AS _raw first(source) AS source first(sourcetype) AS sourcetype count(eval(index="<your custom itsi_grouped_alerts index name>")) AS c_grouped by event_id | where c_grouped=0 | fields _time, _raw, source, sourcetype

Additional troubleshooting steps can be found in the Splunk documentation.