Make Unstructured Data Searchable with a Multikv Command

Jeff Rabine | Splunk Consultant

Splunk works best with structured data. This blog post will cover how to make unstructured data searchable. In this real-world example, the customer wanted to use data in the second table of an unstructured log file. Changing the log format was not an option, and access to .conf files was not available, so all changes needed to happen at search time.

Raw log sample:

[Screenshot: raw log sample containing the unstructured data]

As you can see, there are two tables of data in the log. The first step is to remove the top table from the results, since it’s unnecessary for this search. We will do that by using the rex command to overwrite _raw, capturing only the data that we need.

[Screenshot: the two tables of data in the raw log]
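The exact pattern depends on your log layout, but as a rough sketch, assuming the second table begins after a header line containing the word DETAIL (an assumed marker, not taken from the actual log), something like this keeps only that portion of the event:

index="fruitfactory" sourcetype="fruitfactory"
| rex field=_raw "(?s)DETAIL(?<second_table>.*)$"
| eval _raw=second_table
| fields - second_table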


The next step is to use the multikv command to break the tables into separate events. This command attempts to create fields and values from the table; however, the formatting of our table was not clean, which kept multikv from working properly, so we removed the headers from the table. Since we removed the headers, we set noheader=t.

[Screenshot: search using multikv with noheader=t]

Now, the last thing we need to do is create our field extractions, and then we can use the data however we please.

[Screenshot: final search with multikv and the field extractions]
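Continuing the sketch, after multikv splits each remaining table row into its own event, a rex against the row text can create the fields. The field names and pattern below are purely illustrative, since the real table layout isn’t reproduced here:

| multikv noheader=t
| rex field=_raw "^\s*(?<fruit>\w+)\s+(?<quantity>\d+)"
| table fruit quantity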


As you can see, we now have nice clean data!

Other uses of the multikv command:

Depending on your data, there are other ways to use the multikv command. Neither of these examples was able to make unstructured data searchable for our customer, but I recommend trying them with your data. Your success with the following examples will depend on how cleanly formatted your logs are.

In our example, we stripped out the headers of the table to make the unstructured data searchable. You may be able to leave the headers in place, which would save you from extracting the fields with the rex command. Also, by default, the command attempts to process multiple tables within the log, so you might only need to run the multikv command on its own. After running this search, check whether the correct fields were extracted.

index="fruitfactory" sourcetype="fruitfactory"
| multikv

You can also tell the command which row contains the headers of the table. This is useful when the headers always appear on the first, second, or some other known row of the event. Again, check whether the correct fields were extracted after running this command.

index="fruitfactory" sourcetype="fruitfactory"
| multikv forceheader=1

Want to learn more about unstructured data or using the multikv command? Contact us today!

Monitor Splunk Alerts for Errors

Zubair Rauf | Senior Splunk Consultant – Team Lead

In the past few years, Splunk has become a very powerful tool that helps teams proactively analyze their log data across a plethora of use cases. Nearly every Splunker I have worked with monitors Splunk alerts for errors. A Splunk alert uses a saved search to look for events, either in real time (if enabled) or on a schedule; scheduled alerts are by far the more common choice. An alert triggers when the search meets the conditions specified by the alert owner.

Triggered alerts call alert actions, which help owners respond to them. Standard alert actions include sending an email, adding to triggered alerts, and so on. Other Splunk TAs let users integrate external alerting tools such as PagerDuty or create JIRA tickets, and users can also build custom alert actions to respond to alerts or integrate with external alerting or MoM tools. These actions can fail for a variety of reasons, leaving a user without the alert they set up. At best this is inconvenient; if the alerts monitor critical services, missed notifications can also have a financial impact and prove costly.

The following two searches can help users understand if any triggered alerts are not sending emails or the alert action is failing. Alert actions can fail because of multiple reasons, and Splunk internal logs will be able to capture most of those reasons as long as proper logging is set in the alert action script.

Please note that the user running the searches needs to have access to the _internal index in Splunk.

The first search looks at email alerts and will tell you, by subject, which alerts did not go through. You can use the information in the results to track down the failing alert and the reason the email was not delivered.

index=_internal host=<search_head> sourcetype=splunk_python ERROR

| transaction startswith="Sending email." endswith="while sending mail to"

| rex field=_raw "subject=\"(?P<subject>[^\"]+)\""

| rex field=_raw "\-\s(?<error_message>.*)\swhile\ssending\smail\sto\:\s(?P<rec_mail>.*)"

| stats count values(host) as host by subject, rec_mail, error_message

Note: Please replace <search_head> with the name of your search head(s); wildcards will also work.

Legend:

host - The host the alert is saved/run on

subject - Subject of the email - by default it is Splunk Alert: <name_of_alert>

rec_mail - Recipients of the email alert

error_message - Message describing why the alert failed to send email

The second search (below) looks through the internal logs to find errors from alert actions that send alerts to external alerting tools and integrations.
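As a starting point for the base search (an assumption, not part of the original search; adjust for your environment), the splunkd sendmodalert component logs the "Invoking ..." and "exit code" messages that the transaction below keys on:

index=_internal host=<search_head> sourcetype=splunkd component=sendmodalert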

| transaction action date_hour date_minute startswith="Invoking" endswith="exit code"

| eval alert_status = if(code=0, "success", "failed")

| table _time search action alert_status app owner code duration event_message

| eval event_message = mvjoin(event_message, " -> ")

| bin _time span=2h

| stats values(action) as alert_action count(eval(alert_status="failed")) as failed_count count(eval(alert_status="success")) as success_count latest(event_message) as failure_reason by search, _time

| search failed_count>0

Note: Please replace <search_head> with the name of your search head(s); wildcards will also work.

These two searches can be set up as alerts in their own right, but I would recommend putting them on an alert-monitoring dashboard instead. Splunk administrators can then monitor Splunk alerts periodically to see whether any alerts are failing to send emails or whether any external alerting tool integrations are not working. Splunk puts a variety of tools in your hands, but without proper knowledge, every tool becomes a hammer.

To learn more and have our consultants help you with your Splunk needs, please feel free to reach out to us.

Splunk Issue: Indexers Failing to Rejoin the Cluster (Solved)

Yetunde Awojoodu | Splunk Consultant

 

Indexers failing to rejoin the cluster can cause serious issues, but this blog will provide simple steps to help resolve the issue in your environment. First, indexers in a Splunk environment can be clustered or non-clustered. An indexer cluster contains two or more indexers, also called peer nodes. Each peer node indexes external data, stores it, and simultaneously sends and receives replicated data. These indexing and replication activities are coordinated by a cluster master, which is also responsible for managing the configuration of peer nodes, coordinating searches across peer nodes, and carrying out remedial activities if a peer goes offline.

A peer node needs to be connected to the cluster master, and stay connected, to receive instructions. There are, however, situations in which a peer node can become disconnected from the cluster master. A peer can go offline intentionally, by issuing the CLI offline command, or unintentionally, as when a server crashes or intermittent or recurring network issues hit the environment.

When a peer gets disconnected from the cluster master and the master does not receive a heartbeat after a set period, the master begins bucket fixing activities to ensure the defined replication factor is met so that the cluster remains in a healthy state. Refer to Splunk docs if you are interested in learning more about what happens when a peer goes down.

Depending on the reason for the disconnection, indexers in a cluster may go offline for a few minutes or hours. In my situation, a partial datacenter outage had caused multiple appliances to fail and resulted in failed connections among the Splunk servers. This caused about half of the indexers to lose connection to the cluster master for several hours.

The datacenter issue was later determined to be the result of storage shortfalls and was resolved within 24 hours. Once the datacenter was restored, however, we noticed the indexers failing to rejoin the cluster; the master would not allow them back in.

Problem Indicators

There were multiple errors in Splunk that pointed to issues with the indexers:

i. Messages indicating failed connection from Cluster Master to the Indexers

ii. Messages indicating indexers being rejected from joining the cluster

iii. Cluster Master in an unhealthy state – Replication and Search factor not met

iv. Most Splunk servers including search heads in red status

v. No replication activities as seen on the Indexer Clustering Dashboard on the Cluster Master

vi. Indexers with “BatchAdding” status on the Indexer Clustering dashboard

Below is a screenshot of some of the error messages seen in Splunk:

[Screenshot: “Failed to add peer” error messages seen in Splunk]

What Happened?

We found that during the datacenter outage, the disconnected indexers continued to perform activities such as rolling hot buckets to warm, but using the standalone bucket naming convention rather than the naming convention for buckets in a clustered environment. Since the cluster master did not recognize the standalone buckets, it rejected those indexers when they tried to rejoin the cluster.

The screenshot below shows examples of standalone nonreplicated buckets:

[Screenshot: examples of standalone, non-replicated bucket names]

Below is another screenshot of what buckets originating from the local indexer in a cluster should look like:

[Screenshot: clustered bucket names originating from the local indexer]

*Note that the buckets prefixed with “rb” are replicated buckets from other indexers in the cluster.

Solution

The solution is to rename each standalone bucket to follow the clustered bucket naming convention. This was a manual effort that took us a few hours to complete, but it may be possible to script it, especially if several indexers are failing to rejoin the cluster. In my scenario, we had a total of 9 indexers, of which 5 were disconnected.

Below are steps to resolve the issue:

i. Identify the indexers that are unable to join the cluster

ii. Put the Cluster Master in maintenance mode

iii. Stop Splunk on one indexer at a time

iv. Locate the standalone buckets in the indexed data directories on each indexer. The default location is $SPLUNK_HOME/var/lib/splunk/*/db/

v. Append the local indexer GUID to the standalone bucket names.

vi. Start Splunk

This process requires great attention to detail to avoid messing up the bucket names. I recommend having other members of your team on a video conference to watch for errors and validate the changes made.

How to Locate and Rename Each Bucket

To locate the erroneous buckets, we developed a regex to match the expected format and issued a find command on the CLI to identify them.

➣ find /opt/splunk/var/lib/splunk/* -type d -name "db*" | grep -P "db_\d*_\d*_\d*$"

*Make sure to replace “/opt/splunk/var/lib/splunk/*” with your specific indexed data directory path

On each indexer, go to each data directory in the results from the “find” command and scroll through to identify the bucket names missing the local indexer GUID. Once found, append the correct local indexer GUID to the bucket name so it matches the other bucket names. Note that the GUID to be appended is different for each indexer: it is the GUID of the local indexer, so make sure to copy the correct GUID from another bucket on that indexer, or check the $SPLUNK_HOME/etc/instance.cfg file on that indexer for the correct GUID. As a precaution, feel free to back up the buckets (or simply rename them to .old) before appending the GUID, and delete the old directories once the indexers are restored.
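For example, using the bucket and GUID shown in the screenshots above, a rename on one indexer might look like the following sketch (the index directory is illustrative; stop Splunk on that indexer first and substitute that indexer’s own GUID):

cd /opt/splunk/var/lib/splunk/main/db
mv db_1640917434_1639072107_161 db_1640917434_1639072107_161_E56FC9B8-EACF-4D96-8B76-4E28FCF41819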

Bucket Name Format without GUID – db_<newest_time>_<oldest_time>_<bucketid>

Bucket Name Format with GUID – db_<newest_time>_<oldest_time>_<bucketid>_<guid>

 

Once you have appended the correct guid to each standalone bucket, the bucket name will look like this as shown in a screenshot above:

db_1640917434_1639072107_161_E56FC9B8-EACF-4D96-8B76-4E28FCF41819

Remember to restart each indexer after the changes have been made and watch out for status changes and replication activities on the indexer clustering dashboard. Note that you may need to wait for a few hours to get the cluster back to a healthy state but not much should be required once the indexers have been restarted.

If you experienced similar errors, as shown above, these simple steps should help resolve the issue in your environment.

If you need additional assistance wrangling your Splunk indexer storage, you can always contact one of our Splunk specialists here.

Deleting the Unsupported Splunk Windows Universal Forwarder

By: Aaron Dobrzeniecki | Splunk Consultant

 

Have you ever encountered an issue when trying to install a new version of Splunk over an older, unsupported version? Having an older, unsupported version of the Splunk Universal Forwarder installed can cause issues with your data ingesting into your Splunk indexers. Please see the Splunk Compatibility Matrix here. In the situation below, we encountered compatibility issues because we were in the middle of upgrading our Splunk environment to 8.x. The Splunk Universal Forwarder 6.5.2 is not compatible with 8.x indexers, so the data from those forwarders would not ingest into Splunk. Without deleting the unsupported Splunk Windows Universal Forwarder, you’ll have huge issues!

We ran into an issue with a Windows Universal Forwarder that was on Splunk version 6.5.2, and we were trying to upgrade it to 7.3.3. (Since Splunk 7.3.3 has also reached end of support, you would now want to install the latest version of the Splunk Universal Forwarder.) When we tried to install the new version over the current version, we received an error that the installation package for the current version was missing. Since Splunk 6.5.2 has reached end of support and was removed from the website, we could not download the older installer from Splunk.com, so we were not able to get the 6.5.2 installation package needed to uninstall the Universal Forwarder from our machine. NOTE: If you have Splunk VPN access, you can still retrieve older versions of Splunk.

That left us with two options: delete the unsupported Splunk Windows Universal Forwarder by hand, or grab the old installer from behind the Splunk VPN. For this exercise, we are going to rip the Splunk agent off the Windows box because we do not have Splunk VPN access. Please follow the steps below:

1. Open the Command Prompt as an Administrator

2. Run: sc stop SplunkUniversalForwarder

3. Run: sc delete SplunkUniversalForwarder

This stops, then deletes, the Splunk Windows service. You may have to do this for a second Splunk service. Older Splunk Universal Forwarder software had two services, although when tested with 7.1.x, it installed only one service.

4. You can find the internal service name by right-clicking on it in the Services Control Panel, select “Properties”, and look for the “Service Name” at the top of the dialog box.

5. Run: rmdir /s /q "C:\Program Files\SplunkUniversalForwarder"

6. From the same Command Prompt, run:

regedit

This will open the Registry Editor.

7. Search for “Splunk”. You should find an item under “HKEY_CLASSES_ROOT\Installer\Products\<SOME GUID>”

8. In the details, you’ll see a key for “ProductName”, and the value will be “Universal Forwarder”.

By seeing this, you know you have got the right item.

9. Right-click on the GUID in the left-hand pane and select “Delete” to delete the entire entry from the registry.
Now, the Splunk installer should see your host as a new install.

From here, you can install the latest version of the Splunk Universal Forwarder for Windows. The reason we were unable to remove the Splunk program was that the registry could not find the installation file for Splunk. (The 6.5.2 installer no longer existed on the box, and we did not have a backup copy.) After deleting the unsupported Splunk Windows Universal Forwarder from the registry, we were able to move forward and make our forwarder compatible with the rest of the Splunk environment.

 

If you are looking for a Splunk Managed Service Provider, check out our Evaluation Checklist or fill out the form below to speak with one of our expert Splunk consultants.

Slice And Dice: Comparing Values Over Specific Times

Part 1: Day Over Week

 

Marvin Martinez | Splunk Consultant

If you’re lost, you can look and you will find it: Time OVER time. Sometimes, data can’t just be visualized sequentially. Often, it is most useful to compare data across specific points in time. In this 4-part series, I will outline some interesting ways to help visualize data points across specific points in time, namely day over week (i.e., this past Thursday versus the last three Thursdays) and hour over week (this past Thursday at 1 p.m. versus the last three Thursdays at 1 p.m.). The first installment focuses on the day-over-week visualization, which allows the user to quickly see the last four Thursdays (or any other day of the week) overlaid on top of each other and quickly spot any discrepancies between them.

For example, let’s say you’re monitoring sales throughout the day. You notice that sales seemed to spike up at 8 p.m. yesterday, but you’re curious if it is spiking up at that same hour every Thursday or if the peaks are happening at different times and how big those discrepancies are.

Splunk has some very handy, albeit underrated, commands that can greatly assist with these types of analyses and visualizations. Additionally, it can be difficult to clearly display the appropriate context and intent of the visualizations, so it is imperative to clearly delineate what the data points represent on the charts themselves. We’ll explore both situations in this article, including some sample SPL to help you get where you need to get with your own data.

To achieve this, we’ll use the timewrap command, along with the xyseries/untable commands, to help set up the proper labeling for our charts for ease of interpretation.

In the end, our Day Over Week Comparison chart will look like this. The chart shows each Thursday across the past four weeks, overlaid on top of each other. (Note that you could go back as far as your data lets you go. This is just how far back we went for the purpose of this article.)

[Screenshot: Day over week comparison chart]

This is the search that was used for the panel shown above. Each of our events has a TotalSales field that we are using as our value to chart.

index="test" ….
| eval Date = strftime(_time,"%m/%d"), Day = strftime(_time,"%A"), Hour=strftime(_time,"%H")
| fields _time, "TotalSales", Hour, Day, Date
| where Day="Thursday"
| timechart span=1h count(TotalSales) as TotalSales
| timewrap 1w series=short
| rename TotalSales_s0 as LatestDay, TotalSales_s1 as WeeksBack1, TotalSales_s2 as WeeksBack2, TotalSales_s3 as WeeksBack3, TotalSales_s4 as WeeksBack4
| untable _time FieldName FieldValue
| eventstats latest(_time) as LatestDate
| eval WeekNum = substr(FieldName,-1)
| eval FieldName = CASE(FieldName = "LatestDay", "Latest Thursday", 1=1, "Thursday (" . strftime(relative_time(LatestDate,"-" . WeekNum . "w@h"),"%m/%d") . ")")
| xyseries _time FieldName FieldValue
| table _time "Thursday*" "Latest *"

Let’s break this search down and explain what is going on here. The first four lines are setting up our data. We create fields to store the specific Date, Day, and Hour for our values as well as filter out only the information for the specific day we are looking to compare.

The timewrap command requires a timechart command before it, so execute a timechart command to get the data the way you need it (in this case, spanned across an hour). Note that you are already only looking at a specific day across multiple weeks.

Now execute the timewrap command, spanning one-week intervals and using the series=short option. This creates the resulting fields as “TotalSales_s0…” and so on. This is helpful since they can then be easily renamed in the next command to make it easier to understand which date each series represents. At this point, if you ran the search just up to the rename command after the timewrap, your visualized result should look something like this:

[Screenshot: chart after the timewrap and rename steps]

This is almost there, but still not intuitive when viewing. Let’s work on that!

The “| untable _time FieldName FieldValue” command will reformat the data as shown below, with a FieldName and FieldValue column, containing the field names and values, respectively. We can use this to rename our fields to make them easier to understand.

[Screenshot: results in FieldName/FieldValue form after untable]

The following eventstats and eval commands do just that. Eventstats will determine what the latest date is in the remaining events. We will use this to help determine the actual date of the other line values. The eval commands grab the last number from the FieldName and then use a CASE statement to update the name of the field. If the FieldName is “LatestDay,” then we know it is the most recent day in our data set. Otherwise, we will update the field name to be the name of the day followed by a relative_time offset going back the WeekNum number of weeks.

| eval WeekNum = substr(FieldName,-1)
| eventstats latest(_time) as LatestDate
| eval FieldName = CASE(FieldName = "LatestDay", "Latest Thursday", 1=1, "Thursday (" . strftime(relative_time(LatestDate,"-" . WeekNum . "w@h"),"%m/%d") . ")")

After these commands, your results should look something like this. Note that the FieldName is now intuitively named.

[Screenshot: results with intuitively named FieldName values]

From here, all we need to do is use the xyseries command to turn these field names back into columns for our chart.

| xyseries _time FieldName FieldValue

At this point, the chart will look pretty much the way we need it to.

[Screenshot: chart after xyseries]

However, we need this to be as easy to read as possible! Looking at the legend, the fields aren’t quite in the right order. Note how “Latest Thursday” is at the top but the other series do not follow sequentially in the right time order. We can fix that with the table command to control the ordering of our chart. Order the table command with _time, followed by the “Thursday” entries (use the * to denote a wildcard) and, finally, the “Latest *” field at the very end.

| table _time "Thursday*" "Latest *"

Now, your chart looks like this. Much better! The dates in the legend now show in ascending order and are easier to follow.

[Screenshot: final day-over-week chart]
Figure 1: Dotted lines were added via dashboard panel options for fieldDashStyles and lineDashStyle

In the next installment, we will create a panel that compares one specific day from a prior week to the latest day and visualizes the percentage change between the two for every hour of that day. Until next time!

 

Visualizing data doesn’t have to be complicated.
Contact one of our exceptional Splunk Consultants today so we can put your data to work for you!

 

ITSI Rules Engine: Understanding and Troubleshooting

Brent Mckinney | Splunk Consultant

In ITSI, the Rules Engine is the system responsible for processing notable events into episodes and executing actions. For this reason, it’s important to understand how it works, and how to troubleshoot if episodes are not being created in a consistent manner.

ITSI Rules Engine workflow.

First, a correlation search is built and scheduled, to create notable events. Correlation searches typically search the itsi_summary index for service health scores or KPI values. The results of the correlation search are stored in itsi_tracked_alerts. This index contains all notable events from any correlation searches.

Next is the ITSI rules engine saved search. This search runs in real time, and the default Notable Event Aggregation Policy (NEAP) groups the notable events into episodes based on source. This is where the actual Rules Engine process begins, and by creating custom NEAPs, we can filter, group and act upon notable events from a correlation search.

Finally, the episodes created by the NEAP are stored in itsi_grouped_alerts, and any actions defined should execute based on the criteria defined.

Creating a NEAP.

To begin creating episodes, ensure that your Correlation Search is enabled, and that notable events are being populated in the itsi_tracked_alerts index. The source field in this index will be your correlation search name.

Once you confirm that events are flowing into itsi_tracked_alerts, you’ll want to create a NEAP and ensure that you have “source” matches “your correlation search” under the “include the events if” section on the filtering criteria page.

The goal of this section is to include all notable events that should be grouped and alerted on in the NEAP. Under “split episodes by,” you can select how you wish your notable events to be split into individual episodes. This can be any field in the itsi_tracked_alerts data; I typically split by service_name.

For break conditions, you can configure when an episode should end. I typically use “if the flow of events is paused for 600 seconds”; the idea being that if no notable events have been created and flowed into the episode for 10 minutes, the problem has been resolved. Ultimately, though, it depends on the service.

Once filled out, you can test your configurations by selecting preview on the same page. This should return all notable events from your correlation search, grouped into episodes based on your split by selection. If you are not seeing results here, ensure that your correlation search is indeed writing events to itsi_tracked_alerts, and that your filtering criteria is valid.

Troubleshooting episodes not being created.

If episodes are not being created, make sure that the service, correlation search, and NEAP are all enabled. Additionally, ensure that the itsi_event_grouping search is enabled and, in the jobs panel, that it is running. To restart the Rules Engine, you can disable this search for a few minutes and then re-enable it.

Ensure that your correlation search is picking up notable events. Without notable events, there is nothing for the Rules Engine to group. To do this, search the itsi_tracked_alerts index and filter on source=<your correlation search>. If there are notable events here but still no episodes, then the issue most likely lies with the Rules Engine.

Next, check that events are being written to the itsi_grouped_alerts index. If you have notable events in itsi_tracked_alerts but no episode info in itsi_grouped_alerts, ensure that the NEAP is correctly picking up events on its filtering criteria page. Use the preview option in the NEAP editor to confirm that events exist and should be grouped. If there are no results from preview, check your filtering criteria. If there are results in preview but no episodes are being created, then there is most likely an issue with the Rules Engine itself.
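As a quick illustration (the correlation search name is a placeholder; narrow the time range to your testing window), these two checks can be run straight from the search bar:

index=itsi_tracked_alerts source="<your correlation search name>"

index=itsi_grouped_alerts

If the first search returns notable events for the window but the second returns nothing, the problem sits between the NEAP filtering criteria and the Rules Engine, as described above.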

Troubleshooting steps

  1. Ensure the ITSI rules engine search is enabled and running (itsi_event_grouping). If needed, you can restart the rules engine by disabling this search for a few minutes, and reenabling it.
  2. Check the ITSI Event Analytics Monitoring dashboard for any glaring issues, including ensuring that only 1 Java process is running. Multiple java processes running can cause issues.
  3. In the SA-ITOA app, there is an itsi_rules_engine.properties file that holds configs specific to the functionality of the Rules Engine. One setting in particular, exit_condition_messages_contain, is important to check.
  • exit_condition_messages_contain = unable to distribute to the peer,might have returned partial results,search results might be incomplete,search results may be incomplete,unable to distribute to peer,search process did not exit cleanly,streamed search execute failed,failed to create a bundles setup with server
  • This setting checks for specific error messages in the internal logs, and if found, stops the rules engine. It’s important to make sure that any ITSI Search Head is NOT generating any of these messages. If these messages are found, then the Rules Engine will stop, and thus episode creation and alerting will be affected.
  • To check if any of these messages are present, the following search can be run (make sure to filter on your ITSI SH hosts):
  • index="_internal" source="*splunkd*" "unable to distribute to the peer" OR "might have returned partial results" OR "search results might be incomplete" OR "search results may be incomplete" OR "unable to distribute to peer" OR "search process did not exit cleanly" OR "streamed search execute failed" OR "failed to create a bundles setup with server"

Using custom ITSI index names?

If you’re using custom index names outside of itsi_tracked_alerts, like my_org_itsi_tracked_alerts, ensure that all macros have been updated to include these index names.

If you’re on ITSI 4.11 or lower and are using custom index names, you will need to take additional steps. In the SA-ITOA app, the itsi_rules_engine.properties file will need to be updated to include these custom index names, as the default index names are hardcoded in the java build of ITSI, and can cause delays in episode creations, or duplicate episodes. This has been raised to Splunk and will be updated to use a macro in a future release. Until then, the following stanza can be added in SA-ITOA/local/itsi_rules_engine.properties to include the custom index names.

[default]
index_name = <your custom itsi_grouped_alerts index name>

# Search to retrieve missed events to backfill
backfill_events_search = search (`itsi_event_management_index_with_close_events` ) OR ( `itsi_event_management_group_index`) NOT orig_sourcetype=snow:incident | stats first(_time) AS _time first(_raw) AS _raw first(source) AS source first(sourcetype) AS sourcetype count(eval(index=<your custom itsi_grouped_alerts index name>)) AS c_grouped by event_id | where c_grouped=0 | fields _time, _raw, source, sourcetype | sort 0 _time

grouping_missed_events_search = search (`itsi_event_management_index_with_close_events` ) OR ( `itsi_event_management_group_index`) NOT orig_sourcetype=snow:incident | stats first(_time) AS _time first(_raw) AS _raw first(source) AS source first(sourcetype) AS sourcetype count(eval(index=<your custom itsi_grouped_alerts index name>)) AS c_grouped by event_id | where c_grouped=0 | fields _time, _raw, source, sourcetype

Additional troubleshooting steps can be found in Splunk documentation here or learn more about our Splunk Consulting Services.

 

Format JSON Data at Search Time

By Forrest Lybarger | Splunk Consultant

JSON data is a very common format in Splunk, and users like to have control over that data. Splunk’s collection of json_* commands (technically eval functions) helps users format JSON data at search time so that it can be presented and used without any permanent changes to the indexed data. This guide will help users learn how to use these commands themselves, so they can have full control of their JSON data. It is important to note that the commands often expect the name of a field containing a whole JSON document, so if the whole event is one JSON document, _raw can be used to reference the event as one field.

First is the json_valid command. Using this command, strings can be validated as using proper JSON. It has simple syntax where you input the field containing the JSON data and it returns true or false. If you want to store the value in a field, you could use an if statement.

    • Example:
    • | eval valid = if(json_valid(_raw), 1, 0)
    • Result:
    • valid=1

Second is json_array. This command lets you create an array with JSON formatting. You can use an array field or multi-value field in this command instead of hard coded values. This can be useful in cases like if you need a sub-search to return an array.

    • Example:
    • | eval array=json_array(field_name)
    • Result:
    • array=["string1", "string2", "string3"]

Third is json_object. This command creates JSON objects from the inputs given to it. Other JSON commands can be used as inputs, including another json_object command to create nested objects. The first input is the key name; the next input is its value, which can be a nested JSON object, an array, or a hard-coded value.

    • Example:
    • | eval obj=json_object("object1", json_object("object2", json_array("item1", "item2")))
    • Result:
    • obj={"object1":{"object2":["item1","item2"]}}

Lastly is json_extract. This command essentially lets you do a field extraction on the fly with JSON data. You just give the command the field containing the JSON data (if the event is one big JSON event, then do _raw) then tell it the path you want extracted.

    • Example:
    • | eval ext=json_extract(_raw, "array{0}")
    • Event:
    • {
    •   "array": [
    •     {
    •       "name": "item1",
    •       "subarray1": [
    •         { "name": "subitem1" },
    •         { "name": "subitem2" }
    •       ]
    •     },
    •     {
    •       "name": "item2",
    •       "subarray2": [
    •         { "name": "subitem1" },
    •         { "name": "subitem2" },
    •         { "name": "subitem3" }
    •       ]
    •     },
    •     {
    •       "name": "item3",
    •       "subarray3": [
    •         { "name": "subitem1" },
    •         { "name": "subitem2" }
    •       ]
    •     }
    •   ]
    • }
    • Result:
    • ext={"name": "item1", "subarray1": [{ "name": "subitem1" },{ "name": "subitem2" }]}

These commands are very useful for users that deal with a lot of web data or API related events. Some notable data that these commands could be useful for are AWS Cloudtrail events. Users might want to extract whole arrays as fields or do other JSON manipulation and these commands make all that possible within Splunk’s search bar.

Monitoring Windows Event Logs in Splunk

By: Karl Cepull  | Senior Director, Operational Intelligence

 

Splunk is a widely accepted tool for log aggregation and analysis in both security and IT Ops use cases. Splunk’s add-ons for Microsoft Windows, including Exchange and Active Directory, rely on Windows Event Logs being available and a forwarder to send those logs into Splunk. Windows logs provide a wealth of information with every action taken. The problem is that this volume of information means ingesting a large amount of non-relevant data into Splunk. Looking at a few general use cases, here are lists of Windows Event IDs to allow when looking for specific information.

Use Case 1: Security

Windows Security can include several of the other use cases listed below. These are Event IDs that indicate suspicious or unusual activity.

EventID | Priority | Description | Sub-Codes
1102 | High | The audit log was cleared. Probably want to investigate why.
4767 | High | Account Unlocked
4740 | High | Account Locked Out
4771 | High | Kerberos pre-authentication failed
4772 | High | A Kerberos authentication ticket request failed
4820 | High | Kerberos Ticket-Granting-Ticket was denied because the device does not meet the access control restrictions. | 0x12 (account disabled), 0x18 (bad password), 0x6 (bad username)
4625 | High | Logon Failure. Sub-codes begin with 0xC00000. | 64 (user doesn’t exist), 6A (bad password), 234 (user currently locked out), 72 (account disabled), 6F (logon outside of permitted times), 193 (account expiration)
4719 | Medium | System audit policy was changed.
4728 | Medium | A user was added to a privileged global group
4732 | High | A user was added to a privileged local group
4756 | High | A user was added to a privileged universal group
4782 | High | The password hash of an account was accessed

Use Case 2: IT Operations

While there are several different Event IDs to monitor for all aspects of IT Operations, a few important ones are listed here. These are events a system administrator should pay special attention to.

EventID | Priority | Description
1101 | Medium | Audit events have been dropped by transport. Possibly dirty shutdown.
1104 | High | The security log is now full. There will be holes in your logs if not fixed.
4616 | High | System time was changed.
4657 | High | A registry value was changed.
4697 | Medium | An attempt was made to install a service.

Use Case 3: Monitor User Accounts

A typical user may appear in Windows logs for logging on and off a system. Other user account events should not appear regularly for any one user. Without a larger planned event, where planned account activity is occurring, most of these Event IDs should remain low. Log-on and log-off events are listed here as low priority. Others are higher.

EventID | Priority | Description
4720 | High | User account created
4723 | Medium | User changed own password
4724 | Medium | Privileged user changed their password
4725 | High | Account disabled
4738 | High | Account changed
4726 | High | Account deleted
4781 | Medium | Account name changed
4624 | Low | Successful logon
4647 | Low | User-initiated logoff

Use Case 4: Scheduled Tasks

In a recent security scare, the threat was seen creating scheduled tasks to perform actions that compromised data security. Like all other actions, scheduled tasks are logged in Windows Events, and can be added to Splunk.

EventID | Priority | Description
4698 | Medium | A scheduled task was created
4699 | Medium | A scheduled task was deleted
4700 | Medium | A scheduled task was enabled
4701 | Medium | A scheduled task was disabled
4702 | Medium | A scheduled task was updated

Use Case 5: Windows Firewall

Windows has a built-in firewall. While this may be disabled by system administrators, environments where the firewall is active can use the event logs to monitor for suspicious activity. Unexpected and unauthorized rules and policies changes are strong indicators of threat, along with unapproved stopping of firewall services.

EventID | Priority | Description
4946 | High | A rule was added to the Windows Firewall exception list
4947 | High | A rule was modified in the Windows Firewall exception list
4950 | Medium | A setting was changed in Windows Firewall
4954 | Medium | Group Policy settings for Windows Firewall were changed
5025 | High | The Windows Firewall service was stopped
5031 | High | Windows Firewall blocked an application from accepting incoming traffic

Use Case 6: Windows Filtering Platform

Windows Filtering Platform is a set of API and system services that provide a platform for creating network filtering applications. It’s important to keep an eye on these events to make sure any unexpected or unapproved actions are captured.

EventID | Priority | Description
5146 | High | The Windows Filtering Platform has blocked a packet.
5147 | High | A more restrictive Windows Filtering Platform filter has blocked a packet.
5148 | High | The Windows Filtering Platform has detected a DoS attack and entered a defensive mode; packets associated with this attack will be discarded.
5149 | High | The DoS attack has subsided and normal processing is being resumed.
5150 | High | The Windows Filtering Platform has blocked a packet.
5151 | High | A more restrictive Windows Filtering Platform filter has blocked a packet.
5152 | High | The Windows Filtering Platform blocked a packet.
5153 | High | A more restrictive Windows Filtering Platform filter has blocked a packet.
5154 | Medium | The Windows Filtering Platform has permitted an application or service to listen on a port for incoming connections.
5155 | High | The Windows Filtering Platform has blocked an application or service from listening on a port for incoming connections.
5156 | Medium | The Windows Filtering Platform has allowed a connection.
5157 | High | The Windows Filtering Platform has blocked a connection.
5158 | Medium | The Windows Filtering Platform has permitted a bind to a local port.
5159 | Medium | The Windows Filtering Platform has blocked a bind to a local port.
5447 | High | A Windows Filtering Platform filter was changed.

This is not an exhaustive list of Windows Event Codes, nor is it a complete list for each use case. It’s a starting point for observation and can help to limit the number of events ingested by Splunk from Windows. Start by allowing the Event IDs listed above. As specific use cases develop, a deeper exploration of other Event IDs can help expand Splunk’s scope and effectiveness.

Adding Event IDs to Splunk

The easiest way to monitor Windows Event Logs in Splunk is to use the Splunk Add-On for Microsoft Windows. After installing the app, create a folder named “local” inside the app. Then, copy inputs.conf from the app’s “Default” folder and paste it in the local folder. Within each of the input stanzas, an allowed list can be added based on the pre-defined categories within the add-on.

[WinEventLog://Application]
disabled = 0
start_from = oldest
current_only = 0
checkpointInterval = 5
renderXml = true

[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 5
blacklist1 = EventCode="4662" Message="Object Type:(?!\s*groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:(?!\s*groupPolicyContainer)"
renderXml = true

[WinEventLog://System]
disabled = 0
start_from = oldest
current_only = 0
checkpointInterval = 5
renderXml = true

Based on the use cases above, add the setting “whitelist=” to the stanza, followed by a comma-separated list of Event IDs. For example, to monitor “System,” use the IT Ops Event IDs. The stanza would then look like this:

[WinEventLog://System]
disabled = 0
start_from = oldest
current_only = 0
checkpointInterval = 5
renderXml = true
whitelist = 1101, 1104, 4616, 4657, 4697

The same can be done with the other input stanzas for more comprehensive coverage of Windows Event Logs.
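For example, a Security stanza that keeps only the Use Case 1 Event IDs from above might look like the following sketch (keep or adjust the blacklist lines to suit your environment):

[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 5
blacklist1 = EventCode="4662" Message="Object Type:(?!\s*groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:(?!\s*groupPolicyContainer)"
renderXml = true
whitelist = 1102, 4767, 4740, 4771, 4772, 4820, 4625, 4719, 4728, 4732, 4756, 4782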

Contact us for more tips and tricks on monitoring Windows Event Logs with Splunk!

Back to the Present: Fixing Incorrect Timestamps in Splunk

  By: Jay Young  | Senior Splunk Consultant

 

It is not uncommon, in both large and small Splunk Enterprise environments, to have events with future or past timestamps. With time being a critical component of Splunk, incorrect timestamps can severely impact the hot and warm buckets on the indexers; hot buckets may roll too early, before they reach the size set by the maxDataSize attribute (default 750MB), creating non-uniformly sized warm buckets.

The excellent news is that Splunk provides two index attributes in indexes.conf, quarantinePastSecs and quarantineFutureSecs, to support the inspection of event time at the indexing tier. These two attributes help quarantine events to better manage the flow of time throughout all indexes.

The quarantine constraints detect future and past events as they get indexed. If the indexers encounter events with timestamps that exceed these boundaries, they send them to a separate hot bucket called the hot quarantine bucket; this bucket lives in the same location as the primary hot bucket and is identified by “hot_quan_vx_xxx.”

The quarantinePastSecs and quarantineFutureSecs attributes both have default settings in Splunk and should not be altered in the default indexes.conf file; they do, however, allow for control at the individual index level and can be adjusted to fit a single index’s allowable time range.

The two indexes.conf index attributes:

  • quarantinePastSecs = <positive integer>
  • quarantineFutureSecs = <positive integer>

 

By default, quarantineFutureSecs is set to 30 days and quarantinePastSecs to 900 days. Both attributes take their values in seconds. These values determine the acceptable range for future and past events.
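For example, to tighten the window for a single index, an override like the following sketch can be placed in that index’s stanza (the index name and values are only illustrations; 7776000 seconds is 90 days and 86400 seconds is one day):

[my_index]
quarantinePastSecs = 7776000
quarantineFutureSecs = 86400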

Common Issues That Cause Future and Past Timestamps

  1. Improperly configured attributes in props.conf.
  2. Having different timezone events sent to the indexers.
  3. Events are delayed and then get sent to the indexers.
  4. System turned off for extended period with no time server configured on bootup.
  5. When the time changes in the spring and fall, excluding Arizona. 😊

How to Check for Future and Past Timestamps

Example: This is a quick way to identify indexes with future or past timestamps.

Name | Action | Type | App | Current Size | Max Size | Event Count | Earliest Event | Latest Event
main | Edit Delete Disable | Events | Search | 204 MB | 500 GB | 22.6k | In 3 months | In 13 hours

Splunk > Settings > Indexes >

The Latest Event column on the Splunk index administration page shown in the example above indicates that the newest events will only become current “in 13 hours,” meaning that right now they are events in the future. For example, if the present time is midnight, then at 1 p.m. the events in the hot quarantine bucket would become eligible to roll to warm buckets because their timestamps have passed the present time; until then, the future events are kept in the hot quarantine bucket on the indexer.

The Earliest Event column shows the oldest event currently indexed in that index.

Example Queries:

  • Query for small environments: index=* earliest=+5m latest=+5y
  • Query for larger environments: index=(name) sourcetype=(name) earliest=+5m latest=+5y

Understanding _indextime vs. _time (the Parsed Time)

In Splunk, there are two different times in play. Events are not generally received at exactly the time indicated by the event timestamp; the difference is usually a few seconds between arrival at the indexer and the event timestamp. The actual arrival time is written to _indextime, and the timestamp embedded in the event is parsed and stored in _time. When searching in Splunk, 99.999% of the time you will be searching against the _time parsed from the event. Future and past timestamps are very unlikely to show up in _indextime; if they did, it would point to a server date and time issue rather than an event time issue.

Example Query to find Indexing Latency:

  • index=(name) | eval time=_time | eval indextime=_indextime | eval latency=(indextime-time) | stats count avg(latency) min(latency) max(latency) by sourcetype

Bucket Naming Examples Normal and Future

In the example below, the latest timestamp is the newest event in the warm bucket; this can also be described as the last event to get indexed before the hot bucket rolls to warm. The earliest timestamp is the oldest event in the warm bucket; this is the first event that went into the newly created hot bucket. When the hot bucket rolls, the warm bucket directory gets created and is named with epoch timestamps representing those two values, earliest and latest.

Epoch timestamp converter: https://www.epochconverter.com/

Example: Normal warm bucket.

Current warm bucket with epoch timestamps — db_1629961138_1629132144_95

  • Latest timestamp (1629961138, the first epoch value): Thursday, August 26, 2021 1:58:58 AM GMT-05:00
  • Earliest timestamp (1629132144, the second epoch value): Monday, August 16, 2021 11:42:24 AM GMT-05:00 DST

Example: What would happen if a future timestamped event is allowed to roll into a warm bucket?

Warm bucket containing a future event — db_1756166400_1629823344_97

  • Latest timestamp (1756166400, the first epoch value): Tuesday, August 26, 2025 12:00:00 AM GMT-05:00 DST
  • Earliest timestamp (1629823344, the second epoch value): Tuesday, August 24, 2021 11:42:24 AM GMT-05:00 DST

The above example is a warm bucket with a future timestamp rolled into it. This bucket now has a timestamp that is four years in the future. This future timestamp would cause this warm bucket to not roll into a cold bucket or be removed for four years plus the current retention policy. This type of future bucket would continuously be searched for the next four years, affecting search performance.

Using Splunk without these two index attributes would cause warm buckets to sit on the indexers waiting to age out until the present time passed the latest event timestamp, and those buckets would be continuously searched by scheduled and ad-hoc searches because their latest time is in the future. If this were allowed to continue, your Splunk indexers could eventually reach the default warm bucket limit (maxWarmDBCount = 300) and have hundreds of warm buckets sitting on the indexers holding data that should have aged out long ago.

Indexes.conf indexAttribute reference:

https://docs.splunk.com/Documentation/Splunk/latest/Admin/Indexesconf


Contact us for more help on managing your Splunk environment!