Troubleshooting In Splunk

      By: David Allen  |  Senior Splunk Consultant

 

Like many things in life, having the right tools to fix a problem is what separates the novice from the expert. The novice has almost no tools in their toolbox, and the tools that they have are large and clunky. Whereas the expert has many tools which have been refined over many years of experience fixing many problems. Without the correct tools for the job, many tasks simply could not be accomplished — or at least would become much more difficult and time-consuming. Understanding what tools are available to fix and troubleshoot common Splunk Enterprise problems and how to use those tools — at a macro level — is the purpose of this blog.

The first tool you will need as you begin your Splunk troubleshooting journey is some basic knowledge of how to troubleshoot and how to narrow down the possibilities, peeling back layers of the onion until you eventually find the root cause of the problem. Some problems come up over and over, and we have learned a couple of simple checks that solve them. How many times have you run a search and said to yourself, “Where are all my fields?” We quickly learned to check the search mode and make sure it is set to Verbose mode (or, to a lesser extent, Smart mode) if you want all fields extracted.

Search Job Inspector

The first tool you will need for troubleshooting basic searching problems is the Splunk Search Job Inspector. If you suspect that your search is not working properly, then using the Search Job Inspector may shed some light on the issue. The Search Job Inspector is a tool that lets you take a closer look at what your search is doing and see where the Splunk software is spending most of its time.

You can access the Job Inspector by clicking the dropdown to the left of the search mode.

A screenshot of the Splunk app showing options for the Search Job Inspector dropdown.

From the dropdown, select “Inspect Job.” You will see another screen containing Execution Costs details and Search Job Properties details. Also note the amount of time it took to complete the search, which may be a symptom of a problem you are not even aware of yet!

A screenshot of the Splunk Search Job Inspector menu.

Execution Costs

The Execution Costs section lists information about the components of the search and how much impact each component has on the overall performance of the search.

  • The component durations in seconds.
  • How many times each component was invoked while the search ran.
  • The input and output event counts for each component.

With the information in the Execution Costs section, you can troubleshoot the efficiency of your search. You can narrow down which processing components are impacting the search performance.

Search Job Properties

The Search Job Properties section contains a long list of fields with additional search parameters. The most important fields for troubleshooting are the following (a quick way to pull them across recent jobs is sketched after the list):

  • eventCount: The number of events returned by the search.
  • eventFieldCount: The number of fields found in the search results.
  • eventIsTruncated: Indicates that the events have not been stored and are not available for search.
  • isDone: Indicates if the search has completed.
  • isFailed: Indicates if there was a fatal error executing the search.
  • isFinalized: Indicates if the search was finalized (stopped before completion).
  • runDuration: Time in seconds that the search took to complete.
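
If you want to review these properties for several recent jobs at once, rather than inspecting them one at a time, a search along the lines of the sketch below can help. It assumes you have permission to run the rest command against the local search head, and the exact set of returned fields can vary by version.

| rest /services/search/jobs splunk_server=local
| table sid, label, runDuration, eventCount, isDone, isFailed, isFinalized
| sort - runDuration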

You also have access to the search.log link on the far right, which opens search.log. There you can look for errors and warnings that may give a clue about the search issue you are experiencing.

But for more complex SPL (Search Processing Language) problems, ask yourself some basic questions:

  • When did this start happening?
  • What is the exact thing I am searching for, and what is the time range?
  • Have I added a lookup file or event type incorrectly?

Then remove lines of SPL and add them back one at a time until you find the line where the problem appears. Then unravel that problematic line piece by piece until you find the root cause.

The above example works fine for basic search problems, but for Enterprise problems, you are going to need some more powerful tools in your toolbox.

BTOOL

Splunk uses configuration files for almost all of its settings. These .conf files are scattered all over the directory structure in many similarly named files. Splunk combines these similarly named files lexicographically and by a defined precedence order. To complicate things further, Splunk generally needs to be restarted to pick up the latest changes to the .conf files. If you think that Splunk is not using the configurations that it should, then BTOOL is your tool of choice.

BTOOL is a Splunk CLI command which shows what the settings actually are on disk (or, most likely, SSD these days) – not what is in memory and perhaps not what Splunk is currently running, so beware of this subtle fact. To be sure you are seeing the config settings Splunk is actually running, you will need to restart Splunk. To run this command, be sure to be in the Splunk bin directory: $SPLUNK_HOME/bin.

BTOOL comes with Splunk Enterprise software, so no additional purchase, download, or installation is needed.

Below is the typical BTOOL command syntax (Note: All CLI commands in this document are typical for *nix OS):

./splunk btool <conf_file> list [options]

Here are some examples of the most common BTOOL commands:

To display all the merged settings of the various inputs.conf files, run this command:

./splunk btool inputs list --debug

Or you may want to see all input configurations contained in the search app:

./splunk btool --app=search inputs list

Or you may want to see all props configurations set in the search app, and in what context they are set:

./splunk btool props list --app=search --debug

Lastly, you may want to find an input stanza for which you know the name:

./splunk btool inputs list | grep splunktcp
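
One more check worth knowing: btool can also validate configuration syntax and flag typos such as misspelled stanza or attribute names. Run it from $SPLUNK_HOME/bin as above:

./splunk btool check --debug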

Splunk Log Files

Under the hood, Splunk is running a lot of processes, from ingesting data to searching data and a lot more. All of these processes, and many of the steps in between, generate data that the Splunk software records into log files. Analyzing these log files can give clues to help solve your Splunk problem. The most common logs used for troubleshooting in Splunk are the internal logs located in: $SPLUNK_HOME/var/log. This path is monitored by default, and the contents are sent to various indexes based upon the type of log file. The most common internal indexes are _introspection, _internal, and _audit.

The _introspection index collects data about the impact of the Splunk software on the host system.
It specifically collects OS resource usage for Splunk processes, which can be broken down by Splunk process, along with host-level dynamic CPU utilization and paging information. This index also contains disk input/output usage statistics. This can be very useful in diagnosing Splunk performance issues.

For example, use this search to find the median CPU usage of the main Splunkd process for one host over the last hour:

index=_introspection component=PerProcess host=<host> data.process=splunkd
(data.args="-p * start" OR data.args="service") earliest=-1h
| timechart median(data.pct_cpu) AS "CPU Usage (%)"

_internal: This index includes Splunk Enterprise internal logs. This index can be used to check the flow of data through the various pipeline processes, data about license usage, the search scheduler, various Splunk crash logs, various search information, and more.

For instance, to search for the size of the search artifacts, use this search:

index=_internal sourcetype=splunkd_access method=GET jobs
| stats sum(bytes) by uri

The _audit index contains information about user activities such as failed or successful user logins, modified settings, and updated lookup files. Running searches, capability checks, and configuration changes all generate audit events.

For example, to audit user access use this search:

index="_audit" action=log* action="login attempt"

One of the most common log files used for troubleshooting is splunkd.log, which uses the splunkd source type and is indexed to the _internal index. The splunkd source type is further broken down by component, so you can refine your search by subcategory. Using the splunkd source type, you can run a search like this to check for any data quality warnings or errors:

index=_internal sourcetype=splunkd (component=AggregatorMiningProcessor OR
component=LineBreakingProcessor OR component=DateParserVerbose)
(log_level=WARN OR log_level=ERROR)

Or to check for potential index issues, you can use a search like this one:

index=_internal sourcetype=splunkd host=idx*.yourcompany.splunkcloud.com
component=metrics group=per_*_regex_cpu

Splunk generates many internal log files, and searching those internal logs is a good way to find or isolate many common Splunk Enterprise problems.

Health Report Manager – New in Splunk Version 8.0.0

The Health Report Manager is a high-level overview of your Enterprise and lets you view the status of various Splunk Enterprise components. Individual components report their health status every 30 seconds and results are displayed through a tree structure that provides a continuous, real-time view of the health of your deployment.
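
You can also pull the same health information programmatically over the management port, which is handy when Splunk Web is unavailable. The sketch below is only an example: the credentials are placeholders, the endpoint path is an assumption based on recent Splunk versions, and the response shape can vary by release.

curl -k -u admin:changeme "https://localhost:8089/services/server/health/splunkd/details?output_mode=json"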

The Health Report Manager can be accessed here:

A screenshot of Splunk Enterprise showing where users can access the Health Report Manager feature.

Once selected, the dropdown appears as shown below and displays the health of the various components. Check this if you suspect there may be an issue with your Enterprise or if the indicator is not green.

A screenshot showing the Health Status of Splunkd report.

How the Splunk Health Report Works

The health report records the health status of Splunk features in a tree structure, where leaf nodes represent particular features, and intermediary nodes categorize the various features. Feature health status is color-coded in four states as follows:

  • Green: The feature is functioning properly.
  • Yellow: The feature is experiencing a problem.
  • Red: The feature has a severe issue and is negatively impacting the functionality of your deployment.
  • Grey: Health report is disabled for the feature.

Let’s run through a scenario of how we could find the problem if the Searches Skipped indicator is red.

  1. First, select the Searches Skipped feature to view diagnostic information about the current health status of the feature.
  2. Review the information under Root Cause. In this case, the percentage of high priority searches skipped is 44% over the last 24 hours, which exceeds the red threshold of 10% and causes the feature’s health status to change to red.
  3. Review the Last 50 Related Messages. These log entries include warning messages showing that some scheduled searches cannot be executed. For example: 09-15-2020 16:11:00.324 +0000 WARN SavedSplunker - cannot execute scheduled searches that live at the system level (need an app context).

One explanation for this type of warning message is the possibility that the number of high-priority searches running exceeds the maximum concurrent search limit, which can cause searches to be skipped.

After you review the root cause and log file information, which suggest that maximum search concurrency limits caused the Searches Skipped feature’s status change, you can use the Cloud Monitoring Console to check search scheduler activity and confirm if the suspected cause is correct.

  • In Splunk Web, click Apps > Cloud Monitoring Console.
  • Click Search > Scheduler Activity. The Count of Scheduler Executions panel shows that 43.62% of searches have been skipped over the last 4 hours, which approximates the percentage of skipped searches reported under root cause in the health report.
    A screenshot of the Splunk App showing number of skipped searches.
  • Click Search > Skipped Scheduled Searches. The Count of Skipped Scheduled Searches panel shows that 756 searches have been skipped over the last 4 hours because “the maximum number of concurrent historical searches on this instance has been reached.” This confirms that the cause of the Skipped Searches status change is that the maximum concurrent search limit has been reached on the system.
    A screenshot of the Count of Skipped Scheduled Searches report showing that the maximum number of scheduled searches has been reached.
  • You can now take steps to remedy this issue by decreasing the total number of concurrent scheduled searches running and increasing the relative concurrency limit for scheduled searches (a sample limits.conf adjustment is sketched below), which can bring the number of concurrent searches below the maximum concurrent search limit and return the Searches Skipped feature to the green state.
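
The scheduler's share of the overall search concurrency limit is controlled in limits.conf. The snippet below is only a sketch and the value shown is an assumption; on Splunk Cloud these limits are typically adjusted through Splunk Support rather than edited directly.

[scheduler]
# Percentage of the total search concurrency limit the scheduler may use (default is 50)
max_searches_perc = 75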

DIAG

A diag file provides a snapshot of the configurations and logs from the Splunk software along with select information about the platform instance. The diag collection process gathers information such as server specifications, operating system (OS) version, file system information, internal logs, configuration files, and current network connections. No customer data is included in the diag file.

In your troubleshooting quest using the CLI from the $SPLUNK_HOME/bin folder, run the following command for each instance that you are troubleshooting:

./splunk diag

If you do contact Splunk regarding an issue, they will often request a diag file for their analysis. You can even index the diag output file and “Splunk it” and create your own troubleshooting dashboards!

To generate and upload a diag, the CLI syntax is:

./splunk diag --upload

This command interactively prompts for values such as a Splunk username and password, choice of open cases for that user, and a description of the upload.

You can also limit the diag to specific components with a command like this:

./splunk diag --collect=<component_list>

Likewise, components can be excluded with this command:

./splunk diag --disable=<component_list>

Using Telnet to Test Open Ports

When it comes to checking if a network port is open or closed on a remote computer, there’s no easier way than to use Telnet. Ports that are left open for no reason are a security risk that can be exploited by malicious programs and viruses. At the same time, if a legitimate application communicates through a certain port, having that port closed will make the program throw errors and malfunction. Telnet allows you to test individual ports and see whether they are open or not.

On a *nix OS box, you can run the Telnet command through the terminal as shown below:

telnet [domain name or IP] [port], e.g., telnet 192.168.1.1 443

When a port is open, a blank screen will show up, meaning that the connection has been successful. An unsuccessful connection will be accompanied by an error message.

Many times when you try to use Telnet, you may find that your own network is blocking your connection. It’s quite common for users to be running a firewall, which blocks connection to outbound ports. A basic way to test whether your firewall is interrupting your Telnet is to disable your firewall and run a Telnet test.

Network Toolkit

This is an app on Splunkbase. The Network Toolkit app provides a series of tools for troubleshooting networks. It includes tools for evaluating internet connection bandwidth, performing pings, traceroutes, DNS lookups, whois record checks, and waking sleeping computers (via wake-on-lan).

Additional Online Resources for Splunk Troubleshooting

The Splunk How-To YouTube Channel
Splunk Online Documentation
Online Troubleshooting Manual
Splunk Answers
Submitting a case to Splunk

Want to learn more about troubleshooting in Splunk? Contact us today!

Share and Share Alike: Using a Data Model Job Server for Shared Data Model Accelerations

      By: Jon Walthour  |  Senior Splunk Consultant, Team Lead

 

  • Common Information Model (CIM)
  • DBX Health Dashboards
  • Palo Alto app
  • Splunk Global Monitoring Console
  • Infosec
  • CIM Validator
  • CIM Usage Dashboards
  • ArcSight CEF data models add-on
  • SA-Investigator
  • Threat hunting

All these Splunk apps and add-ons and many others use data models to power their searches. For data model-powered searches to run at peak performance, the data models are often accelerated. This means that at regular, frequent intervals, the searches that define these data models are run by Splunk and the results are summarized and stored on the indexers. And, because of the design of data models and data model accelerations, this summarized data stored on the indexers is tied to the search head or search head cluster that created it.

So, imagine it: You’re employing many different apps and add-ons in your Splunk deployment that all require these data models. Many times you need the same data models accelerated on several different search heads for different purposes. All these data models on all these search heads running search jobs to maintain and keep their summarized data current. All this summarized data is stored again and again on the indexers, each copy of a bucket’s summary data identical, but tied to a different search head.

In a large distributed deployment with separate search heads or search head clusters for Enterprise Security, IT Service Intelligence, adhoc searching, etc., you end up accelerating these data models everywhere you want to use them—on each search head or search head cluster, on your Monitoring Console instance, on one or more of your heavy forwarders running DB Connect, and more. That’s a lot of duplicate searches consuming CPU and memory on both your search heads and your indexers and duplicate accelerated data-consuming storage on those indexers.

There is a better way, though. Beginning with version 8.0, you can now share data models across instances—run once, use everywhere in your deployment that uses the same indexers. You accelerate the data models as usual on Search Head 1. Then, on Search Head 2, you direct Splunk to use the accelerated data created by the searches run on Search Head 1. You do this in datamodel.conf on Search Head 2 under the stanzas for each of the data models you want to share by adding the setting “acceleration.source_guid” like this:

[<data model name>]
acceleration.source_guid = <GUID of Search Head 1>

You get the GUID from one of two places. If a standalone search head created the accelerated data, the GUID is in $SPLUNK_HOME/etc/instance.cfg. If the accelerated data was created by data model searches run on a search head cluster, you will find the GUID for the cluster in server.conf on any cluster member in the [shclustering] stanza.
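
On a standalone search head, a quick way to read the GUID from the path referenced above:

grep guid $SPLUNK_HOME/etc/instance.cfg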

That’s it, but there are a few “gotchas” to keep in mind.

First, keep in mind that everything in Splunk exists in the context of an app, also known as a namespace. So, the data models you’re accelerating are defined in the context of an app. Thus, the datamodel.conf you’re going to have on the other search heads with the “acceleration.source_guid” setting must be defined in the same namespace (the same app) as the one in which the data model accelerations are generated on the originating search head.

Second, once you set up this sharing, you cannot edit the data models on the search heads sharing the accelerated data (Search Head 2, in our example above) via Splunk web. You have to set up this sharing via the command line, and you can only edit it via the command line. You will also not be able to rebuild the accelerated data on the sharing search heads for obvious reasons, as they did not build the accelerated data in the first place.

Third, as with all other things in multisite indexer clusters, sharing data model accelerations gets more complicated. Because the summary data hitches a ride with the primary buckets, which end up spread across the sites, while search heads get “assigned” to particular sites, you want to set “summary_replication” to “true” in the [clustering] stanza in server.conf. This ensures that every searchable copy of a bucket, not just the primary copy, has a copy of the accelerated data and that searches of summary data are complete. There are other ways to deal with this issue, but I’ve found that simply replicating the accelerated data to all searchable copies is the best way to ensure no missing data and no duplicate data.
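
For reference, that setting lives in server.conf, typically managed on the cluster manager; a minimal sketch:

[clustering]
summary_replication = true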

Finally, when you’re running a tstats search against a shared data model, always use summariesonly=true. Again, this ensures a consistent view of the data as unsummarized data could introduce differing sources and thus incorrect results. One way to address this is to ensure the definition of the indexes that comprise the sources for the data models in the CIM (Common Information Model) add-on are consistent across all the search heads and search head clusters.
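
For example, a tstats search against a shared CIM data model might look like the sketch below. The Authentication data model and its fields are assumptions based on the standard CIM add-on, so substitute the data model you are actually sharing:

| tstats summariesonly=true count from datamodel=Authentication where Authentication.action=failure by Authentication.src
| sort - count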

And this leads us to the pièce de résistance, the way to take this feature to a whole new level: Install a separate data model acceleration search head built entirely for the purpose of running the data model accelerations. It does nothing else, as in a large deployment, accelerating all the data models will keep it quite busy. Now, this means this search head will need plenty of memory and plenty of CPU cores to ensure the acceleration search jobs run smoothly, quickly, and do not queue up waiting for CPU resources or, worse yet, get skipped altogether. The data models for the entire deployment are managed on this job server. They are all accelerated by this instance and every other search head and search head cluster has a datamodel.conf where all data model stanzas have an “acceleration.source_guid” setting pointing to this data model job search head.

This gives you two big advantages. First, all the other search heads and clusters are freed up to use the accelerated data models without having to expend the resources to maintain them. It separates the maintenance of the data model accelerations from the use of them. Even in an environment where only one search head or search head cluster is utilizing these accelerated data models, this advantage alone can be significant.

So often in busy Enterprise Security implementations, you can encounter significant skipped search ratios because regularly run correlation searches collide with regularly run acceleration jobs and there just aren’t enough host resources to go around. By offloading the acceleration jobs to a separate search head, this risk of data model content loss because of skipped accelerations or missed notable events because of skipped correlation searches is greatly diminished.

Second, since only one instance creates all the data models, there is only one copy of the summary data on the indexers, not multiple duplicate copies for various search heads, saving potentially gigabytes of disk space. And, since the accelerations are only run once on those indexers, indexer resources are freed up to handle more search load.

In the world of medium and large distributed Splunk deployments, Splunk instances get specialized—indexers do indexing, search heads do searching. We also often have specialized instances for the Monitoring Console, the Cluster Manager, the Search Head Cluster Deployer, and complex modular inputs like DBConnect, Splunk Connect for Syslog, and the AWS add-ons. The introduction of Splunk Cloud has brought us the “Inputs Data Manager,” or IDM, instance for these modular inputs. I offer to you that we should add another instance type to this repertoire—the DMA instance to handle all the data model accelerations. No decently-sized Splunk deployment should be without one.

Want to learn more about data model accelerations? Contact us today!

Using Child Playbooks in Splunk Phantom

      By: Joe Wohar  |  Senior Splunk Consultant

 

Splunk Phantom is an amazing SOAR platform that can really help your SOC automate your incident response processes. It allows you to build playbooks, which are Python scripts under the covers, that act on security events that have been ingested into the platform. If you have a well-defined process for handling your security events and incidents, you can build a Splunk Phantom playbook to run through that entire process, saving your security analysts time and allowing them to work on more serious incidents.

A common occurrence with Splunk Phantom users is that they create a playbook that they want to use in conjunction with other playbooks. For example, a security analyst created three playbooks: a phishing playbook, an unauthorized admin access playbook, and a retrieve user information playbook. In both a phishing event and an unauthorized admin access event, they’d like to retrieve user information. Therefore, the analyst decides to have each of those playbooks call the “retrieve user information” playbook as a child playbook. However, when calling another playbook as a child playbook, there are a few gotchas that you need to consider.

Calling Playbooks Synchronously vs. Asynchronously

When adding a playbook block to a playbook, there are only two parameters: Playbook and Synchronous. The Playbook parameter is simple: choose the playbook you’d like to run as a child playbook. The Synchronous option allows you to choose whether or not you’d like to run the child playbook synchronously.

A screenshot from Splunk Phantom showing options for retrieving user information from a playbook.

Choosing “OFF” will cause the child playbook to run asynchronously. This means that the child playbook is called to run on the event the parent playbook is running against, and then the parent playbook continues down the path. If you’ve called the child playbook at the end of the parent playbook, then the parent playbook will finish running and the child playbook will continue running separately.

Choosing “ON” means that the parent playbook will call the child playbook and wait for it to finish running before moving on to the next block. So when a child playbook is called, you have two playbooks running at the same time on the event. This means that every synchronous child playbook is a performance hit to your Splunk Phantom instance. It is best to avoid running child playbooks synchronously unless absolutely necessary due to the performance impact.

Since there are cases where you might need the child playbook to be synchronous, there are a few tips to avoid causing too much of a performance impact.

  1. Keep your child playbooks short and simple. You want your child playbook to finish running quickly so that the parent playbook can resume.
  2. Avoid adding prompts into child playbooks. Prompts wait for a user to take an action. If you put a prompt into a child playbook, the parent playbook has to wait for the child playbook to finish running and the child playbook has to wait for user input.
  3. Avoid using “no op” action blocks from the Phantom app. The “no op” action causes the playbook to wait for a specified number of seconds before moving on to the next block in the path. The “no op” block causes the child playbook to take longer to run, which you usually want to avoid, but there are instances where you may need to run a “no op” action in a child playbook (covered later).
  4. When using multiple synchronous child playbooks, run them in series, not parallel. Running synchronous child playbooks in series ensures that at any given time during the parent playbook’s run, only two playbooks are running at the same time: the parent playbook and one child playbook.

Sending Data Between Parent and Child Playbooks in Splunk Phantom

When calling a child playbook, the only thing that is carried over to the child playbook is the event id number. None of the block outputs from the parent playbook are carried into the child playbook. This creates the problem of how to get data from a parent playbook into a child playbook. There are two main ways of doing this: add the data to a custom list or add data to an artifact in the container. The first option, adding data to a custom list, is a very inconvenient option due to how difficult it is to get data out of a custom list. Also, custom lists are really designed to be a list for checking values against, not storing data to be pulled later.

Adding data to an artifact in the container can be done in two different ways: update an artifact or create a new artifact. Adding data to an artifact is also much easier than adding and updating data in a custom list because there are already actions created to do both tasks in the Phantom app for Phantom: update artifact and add artifact. “Update artifact” will require you to have an artifact id as a reference so it knows which artifact in the container to update. Adding an artifact is simpler because you can always add an artifact, but you can only update an artifact if one exists.

When adding an artifact, there can be a slight delay between the time the action runs and when the artifact is actually added to the container. My advice here is when you add an artifact to a container that you want to pull data from in the child or parent playbook, add a short wait action (you only need it to wait 2 seconds) immediately after the “add artifact” action. You can have the playbook wait by adding a “no op” action block from the Phantom app for Phantom (which you should already have installed if you’re using the add artifact and update artifact actions).

Documentation Tips for Parent and Child Playbooks in Splunk Phantom

When creating a child playbook that you plan to use in multiple parent playbooks, documentation will really help you manage your playbooks in the long run. Here are a couple of quick tips for making your life easier.

  1. Use a naming convention for the child playbooks at least. I’d definitely recommend using a naming convention for all of your playbooks, but if you don’t want to use a naming convention for parent playbooks, at the very least use one for the child playbooks. Adding something like “ – [Child]” will really make it easier to find child playbooks and manage them.
  2. Put the required fields for the child playbook into the playbook’s description. Calling a child playbook is very easy, but if your parent playbook isn’t using the same CEF fields as the child playbook, you’re going to have a problem. Adding this list to the description will help let you know if you need to update your container artifact to add those needed fields or not.

Follow these tips and tricks and you’ll be setting yourself up for a performant and easy-to-manage Splunk Phantom instance for the long term.

Want to learn more about using playbooks in Splunk Phantom? Contact us today!

Deep Freeze Your Splunk Data in AWS, Part 2

      By: Zubair Rauf  |  Senior Splunk Consultant, Team Lead

In Part 1 of this blog post, we touched on the need to freeze Splunk data in AWS S3. In that post, we described how to do this using a script to move the Splunk bucket into S3. In this post, we will describe how to accomplish the same result by mounting the S3 bucket on every indexer using S3FS-Fuse, then telling Splunk to just move the bucket to that mountpoint directly. S3FS is a package available in the EPEL Repository. EPEL (Extra Packages for Enterprise Linux) is a repository that provides additional packages for Linux from the Fedora sources.

High Level Process

  1. Install S3FS-Fuse
  2. Mount S3 Bucket using S3FS to a chosen mountpoint
  3. Make the mountpoint persistent by updating the rc.local script
  4. Update the index to use ColdToFrozenDir when freezing data
  5. Verify frozen data exists in S3 bucket

Dependencies and Installation

The following packages are required to make this work:

  • S3FS-Fuse (repository: epel)
  • dependency: fuse (repository: amzn2-core)
  • dependency: fuse-lib (repository: amzn2-core)

Note: This test was done on instances running CentOS 7 that did not have the EPEL repo added for yum. Therefore, we had to install that as well before proceeding with the S3FS-Fuse installation.

Install S3FS-Fuse

The following commands (can also be scripted once they are tested to work in your environment) were used to install EPEL and S3FS-fuse on test indexers. You have to run these as root on the indexer hosts.

cd /tmp
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ./epel-release-latest-7.noarch.rpm
rpm --import http://download.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-6
yum -y install s3fs-fuse

Mounting the S3 Bucket

Use the following commands to mount the S3 Bucket to a new folder called frozen-s3 in /opt/splunk/data.

Note: This method uses the passwd-s3fs file to access the S3 bucket. Please ensure that the AWS credentials you use have access to the S3 bucket. The credentials need to belong to a user with an access key and secret key generated. I created a user, ‘splunk_access,’ which has a role, ‘splunk-s3-archival,’ attached to it. This role has explicit permissions to access my test S3 bucket.

The S3 bucket has the following JSON policy attached to it, which gives the ‘splunk-s3-archival’ role full access to the bucket. The <account_id> in the policy is your 12-digit account number.

{
  "Version": "2012-10-17",
  "Id": "Policy1607555060391",
  "Statement": [
    {
      "Sid": "Stmt1607555054806",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_id>:role/splunk-s3-archival"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::splunk-to-s3-frozen-demo"
    }
  ]
}

The following commands should be run as root on the server. Please make sure to update the following variables listed inside < > in the commands with your respective values:

  • <splunk_user> → The user that Splunk runs as
  • <aws_region> → The AWS region your S3 bucket was created in
  • <bucket_name> → The S3 bucket name
  • <mount_point> → Path to the directory where the S3 bucket will be mounted

These commands can be run on one indexer manually to test in your environment and scripted for the remaining indexers.


cd /opt/splunk/data
mkdir frozen-s3
cd /opt/splunk/data/frozen-s3
sudo vi /home/<splunk_user>/.s3fs/passwd-s3fs   ## Add AWS Access_key:Secret_key in this file
sudo chmod 600 /home/<splunk_user>/.s3fs/passwd-s3fs
su <splunk_user> -c 's3fs -d -o passwd_file=/home/<splunk_user>/.s3fs/passwd-s3fs,allow_other,endpoint=<aws_region> <bucket_name> <mount_point>'
echo "su <splunk_user> -c 's3fs -d -o passwd_file=/home/<splunk_user>/.s3fs/passwd-s3fs,allow_other,endpoint=<aws_region> <bucket_name> <mount_point>'" >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local

Adding the mount command to the rc.local script ensures that the rc.local script mounts the S3 bucket on boot.
Once you have manually mounted the S3 bucket, you can use the following command to verify the bucket has mounted successfully.


df -h
## I get the following output from df -h

[centos@s3test ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 485M 0 485M 0% /dev
tmpfs 495M 0 495M 0% /dev/shm
tmpfs 495M 6.8M 488M 2% /run
tmpfs 495M 0 495M 0% /sys/fs/cgroup
/dev/xvda2 10G 3.3G 6.8G 33% /
tmpfs 99M 0 99M 0% /run/user/1001
s3fs 256T 0 256T 0% /opt/splunk/data/frozen-s3
tmpfs 99M 0 99M 0% /run/user/1000
tmpfs 99M 0 99M 0% /run/user/0

The mount with filesystem type s3fs (/opt/splunk/data/frozen-s3) is the S3 bucket mounted using S3FS.

Setting Up Test Index

Create a test index with the following settings and push it out to all indexers through the Cluster Master:

[s3-test]
homePath = $SPLUNK_DB/s3-test/db
coldPath = $SPLUNK_DB/s3-test/colddb
thawedPath = $SPLUNK_DB/s3-test/thaweddb
frozenTimePeriodInSecs = 600
maxDataSize = 10
maxHotBuckets = 1
maxWarmDBCount = 1
coldToFrozenDir = $SPLUNK_HOME/data/frozen-s3/_index_name

The coldToFrozenDir parameter in the above stanza defines where Splunk will freeze the data for this index. This needs to be set for every index you wish to freeze. Splunk will automatically replace the _index_name variable in the coldToFrozenDir parameter with the index name (s3-test in this case), which makes it easier to copy the parameter to multiple individual index stanzas.

For testing purposes, the frozenTimePeriodInSecs, maxDataSize, maxHotBuckets, and maxWarmDBCount have been set very low. This is to ensure that the test index rolls data fast. In production, these values should either be left as the default or changed in consultation with a Splunk Architect.

The index needs to be set up or updated on the Cluster Master and the configurations pushed out to all indexers.

Adding and Freezing Data

Once the indexers are restarted with the correct settings for the index, upload sample data from the UI. Keep adding sample data files to the index until it starts to roll hot and warm buckets to cold and, eventually, frozen. At that point, you will start to see your frozen data appear in your S3 bucket.
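
To verify, check the mountpoint on an indexer or list the bucket itself. These commands are a sketch; the second one assumes the AWS CLI is installed and configured with credentials that can read the bucket:

ls -lR /opt/splunk/data/frozen-s3/s3-test/
aws s3 ls s3://splunk-to-s3-frozen-demo/s3-test/ --recursive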

Want to learn more about Freezing Data to S3 with Splunk? Contact us today!

How to Configure Splunk DB Connect & Export Splunk Data to an External Database

      By: Quinlan Ferris  | Splunk Consultant

 

At some point you may find the need to export data from Splunk to an external database for data collection or to keep important information pulled from searches in a safe and secure place along with other assets. You do have a few options to achieve this; for example, downloading a .csv file from a search and manually copying it over or using a script to append to the existing database. However, the most streamlined and efficient way is to use a Splunk-supported app, Splunk DB Connect. Not only does this app allow you to write to an existing or newly created database and run on a cron schedule, but you can also run searches against your data in an existing database. For now, we will cover exporting Splunk data to an external database. Splunk supports a wide variety of databases ranging from MySQL to MongoDB; while some are not directly supported, there are ways to configure SQLite and more if needed.

 

Prerequisites: install Splunk DB Connect, install the DB drivers, and secure the DB connection.

First, we will want to download Splunk DB Connect.

There are a few different ways to install it depending on your environment and how it is configured. If you are running a single instance, install it on the search head. In a distributed environment with search head clustering, you will need to install it on the deployer and on heavy forwarders.

Next, make sure you install a Java Database Connectivity (JDBC) database driver. Optionally, you can download more drivers based on the database in use. Database drivers should be extracted into the folder $SPLUNK_HOME/etc/apps/splunk_app_db_connect/drivers/.

Finally, once the app is installed, click “new connection” and simply configure desired connection type, identity, and JDBC settings for your database.

Use Case: Customer has Splunk DB Connect installed with connection setup and would like to write to an existing database.

Data Tab & Outputs

Navigate to the Splunk DB Connect app, then to the Data Lab tab in the top left, and click on “outputs.” This section should be empty unless outputs have already been configured. Click on “add new outputs” in the top right to begin setting up our output.

Search

This is where the possibilities are endless for what you can write to your database. You can either create a search based on specific fields you would like to map to the database – for example, a custom search that maps specific field-value pairs to an existing table and either replaces current values or appends new ones – or use a saved search job that already contains the information you want to push to the database. Either way, you can take the search results and put them in the existing table.

Choose Table

*Must have a connection already established in the app with an external database.*

In this section, you will choose the connection to the database you created when initially setting up Splunk DB Connect. Next, select the “catalog” you want; this is essentially the database you want to connect to and append results to. In some rare cases, and for testing purposes, you may have a database with only one table; in that case, the schema will not be an option since there is only one schema. If your database has multiple tables, you will be given the option to select a schema to map the fields to. Once the schema is selected, Splunk will query and show a preview of the database you will be working in and the existing data in the table. Confirm this is the correct database with the correct values, and click next.

Field Mappings

Depending on your field names and table names, you will map accordingly. Click on “add new search field” and map based on search and table names.

For example:

The search query has a field named product_Id that shows the product name with the last 4 digits.
The table has a field named productId that displays the productId ESIN number.

Mapping these will allow you to push the new field, product_Id, to the database and append it.

Upsert: There is a box at the bottom that, if selected, will update any table rows that have the same unique ID specified in the dropdown. This means that if there is an ID # 22 and it is constantly changing, having this box selected will change ID # 22 each time. If this box is not selected, it will simply add ID # 22 to a new row each time it changes.

Set Properties

This section finalizes the output, where you set basic information like the name and description.

Under “parameter settings,” you can increase the query timeout or schedule this output to run on a frequency based on a cron schedule. If you use cron to schedule it every two minutes, it will pick up any new data created in Splunk by your search and update the database accordingly.
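
For reference, a cron expression that runs the output every two minutes looks like this:

*/2 * * * *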

Complete!

You have set up your first output with Splunk DB Connect. If you are running a scheduled output, check your database to see the updated results from your Splunk search in your database.

To learn more, or if you have any questions, contact us by filling out the form below:

Getting Started with the Splunk App for Infrastructure and Using it as the Basis for your ITSI Entity Strategy

      By: Brent Mckinney  | Splunk Consultant

 

As the name implies, the Splunk App for Infrastructure (SAI) allows you to monitor and troubleshoot your business’s infrastructure with pre-built and highly customizable displays. SAI provides insight into all layers of your organization’s infrastructure. Splunk’s IT Service Intelligence (ITSI) is a premium offering that allows you to further monitor your infrastructure by setting and analyzing KPIs across all entities in your organization. ITSI utilizes machine learning to baseline normal behavior and help identify anomalies in your data.

 

Getting started with Splunk App for Infrastructure

There are three major components of SAI: getting data in, investigating your data, and setting up alerts.

The first step is to decide what you want to monitor. SAI offers a handful of options to bring in data, from a variety of sources. Once the SAI is installed in your Splunk environment, you can visit the “Add Data” page within the app to explore these options. Some options utilize other Splunk-built add-ons like the add-on for AWS, which you’re most likely already using if you’ve onboarded AWS data in the past. Other options allow you to utilize OS daemons like collectd for Linux machines. There are options for Windows, OSX, Kubernetes, and more.

Once you’ve begun onboarding data through one of these avenues, you can immediately start investigating. In the “Investigate” tab of SAI, you can see all entities that were onboarded in the previous step. Splunk uses the term “entity” when describing your individual data sources. This could be a physical server at your organization, a cloud instance in AWS, or any source that you define as a center for monitoring. The Investigate tab shows you a list of all entities, their current status, the last time data was collected, and any associated tags. You can also use this page to assign entities to specific groups that can be used to further classify all components of your infrastructure. From here you can click directly on an entity in the list and see an overview of key metrics that are being collected. This typically shows the OS and current version, network I/O, memory and CPU utilization, and disk space free % by default. You can drill down further by visiting the analysis tab and building custom views by dragging and dropping available metrics to the dashboard. This allows different users to easily build views to monitor their systems.

Finally, you can set up alerts for your entities and specific metrics, based on conditions that you define. You can alert if a certain metric exceeds or falls below a certain value, and assign severities. The “Alert” tab is a great way to view and manage all alerts in SAI, as it shows you which entities triggered alerts, each entity’s severity, how many times the alert was triggered, and the timestamps that they occurred.

 

Integrating with ITSI

Similar to SAI, one of the first steps in bringing ITSI to life is defining the entities you want to begin analyzing. Luckily, ITSI offers seamless integration with what’s already defined in SAI. Under the “Configuration > Entities” tab in ITSI, you can select “Manage Integrations” and you have the option to integrate both entities and alerts from SAI into ITSI.

The key to why this is important is this: SAI offers a way to get data in from a large variety of sources and makes it easy to choose what you want to monitor from each of them, while ITSI specializes in painting a detailed picture of how all of these sources operate and contribute to the health of the overall environment. Using SAI to define entities, bring in data, and validate metrics makes setting up services and getting value out of ITSI seamless. ITSI relies on many different data sources if you’ve got a wide range of KPIs you want to monitor, so it can be confusing trying to set up services and monitor KPIs when you’re not sure the data is even there to support them. SAI solves that problem by allowing you to easily add data to Splunk, verify that the metrics you’re interested in are indeed coming in, and then import these entities directly into ITSI for analytics.

ITSI’s power is analyzing KPI behavior over time, by learning and analyzing the behavior of entities in your organization. ITSI was not built to collect data in the way that SAI does. So using SAI to create and manage entities in your infrastructure makes it easier to simply assign entities in ITSI and begin analyzing, rather than using ITSI as a means of creating entities and validating data.

 

Want to learn more about the Splunk App for Infrastructure and using it as the basis for your ITSI entity strategy? Contact us today!

Enhanced Troubleshooting for Blocked Queues

      By: Eric Howell | Splunk Consultant

 

Overview

Splunk’s method of ingesting and storing data follows a specific set of actions. These actions (e.g., event parsing, timestamping, indexing, etc.) are separated logically and performed in different pipelines. Please refer to the Splunk-provided breakdown below of which queue/pipeline each of these activities is performed in.

 

As you can see, the actual ingestion of data into Splunk involves many different processes before the data finally finds its resting place within a bucket. At each stage, a configuration file (props.conf or transforms.conf for the majority of activities) dictates what is done, often driven by regex or other attributes provided within the .conf file. Since there is no one-size-fits-all method for ingesting the expansive myriad of data types and formats that can be brought into Splunk, to say nothing of formatting issues that can happen to a log or data stream, custom ingestion settings are frequently configured to ensure proper data integrity and consistency.

This also creates a number of places where a configuration or attribute can create conflict (e.g. greedy regex, improper implementation of the “Big 8” props.conf configurations, etc) and cause Splunk to spend more time than absolutely necessary to parse, type, aggregate, or index quickly. These issues may go unnoticed for a long period of time until a larger volume data stream is ingested or during a period of greater activity on the servers providing their logs to Splunk, at which point you can experience significant slow-down in your ingestion process. This can lead to events becoming searchable seconds, minutes, hours (or worse) later than the actual inciting time frame. This could have heavy impact during an outage when teams are attempting to triage or troubleshoot and return an environment to full function.

Quickly and effectively identifying which queues are being blocked, or causing slow down, is relatively straightforward, but there are configurations that can be implemented to enrich your internal logs to help you pinpoint specific sourcetypes that are your pain points.

Initial Troubleshooting

You have determined that events are arriving later than expected, and you suspect that queues are being blocked. The next step is to determine which specific queues are affected. The search below will show which queues are experiencing blocks:

index=_internal source=$SPLUNK_HOME/var/log/splunk/metrics.log  blocked=true

(replace $SPLUNK_HOME with wherever your Splunk install is located on your server)

 

Your results will provide something like this:

 

In this example, the typing queue is the blocked queue in question. Now you’ll be wondering: what exactly am I to do with that? Without additional configuration, metrics.log is missing some of the detail that will help you pinpoint the specific sourcetype causing the block, but you can reference this information against your knowledge of the ingestion pipeline (the image included at the top of this document) to troubleshoot settings that could be causing slowness during this stage of the process.

It is important to remember the order of queues in the pipeline, as queues later in the pipeline experiencing blocks will cause upstream queues to fill up as well.

Additionally, the Monitoring Console can provide visualizations to help identify blocked queues by navigating to Indexing -> Performance -> Indexing Performance: Instance (or Advanced).
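
If you prefer SPL over the Monitoring Console panels, a search along these lines charts how full each queue is over time. It is a sketch based on the standard queue metrics in metrics.log, so verify the field names against your own data:

index=_internal source=*metrics.log group=queue
| eval fill_pct=round((current_size_kb/max_size_kb)*100, 2)
| timechart span=5m avg(fill_pct) by name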

Expanding Functionality

You can enrich your visualizations by making a change to limits.conf on your indexers and forwarders. This will add CPU time information for the RegexProcessor to metrics.log, broken out by host, source, sourcetype, and index.

This configuration change needs to be made under the [default] stanza. Implemented correctly, the configuration will look something like this:

[default]

regex_cpu_profiling = true

With this change configured, and your metrics.log providing more robust information, several additional panels in the Monitoring Console will populate with data. These are found here:

Indexing -> Performance -> Indexing Performance: Advanced

With this information, you should be able to identify the specific sourcetype that is causing your queue blockages and troubleshoot the configurations specific to that data so as to resolve your issue.
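
For example, once regex_cpu_profiling is enabled, a search like the one below surfaces the sourcetypes consuming the most regex CPU time. The group name follows Splunk's per_*_regex_cpu metrics naming, while the field names used here (series and cpu) are assumptions to verify against your own metrics.log:

index=_internal source=*metrics.log group=per_sourcetype_regex_cpu
| stats sum(cpu) as regex_cpu_seconds by series
| sort - regex_cpu_seconds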

 

Want to learn more troubleshooting tips for blocked queues? Contact us today!

TekStream Wins Prestigious Splunk Partner of the Year 2021 Awards

By: Karl Cepull | Senior Director, Operational Intelligence

 

Splunk recently announced their partner awards for 2021, and TekStream took home not one, but two, significant Splunk partner awards:

  • 2021 Professional Services Partner of the Year – AMERICAS Region
  • 2021 Global Services Partner of the Year

These awards are given to recognize top partner excellence in post-sale and professional services implementations and commitment to technical excellence, certifications, and customer satisfaction. To be singled out for these awards, from among more than 2,000 Splunk partners worldwide, is an incredible achievement.

So, what does that mean?

It reflects TekStream’s commitment to do what is right for the customer. Period. We are hyper-focused on making customers happy, whether those are our own “direct” customers or those that we have the honor of helping on behalf of Splunk through our partnership.

Our commitment starts long before the project starts. Often, our sales team will engage with Splunk’s account team to do “co-selling.” This is where both teams engage cooperatively during the sales process to ensure the account fully understands our recommendations and the benefits of a Splunk/TekStream solution. Together we provide technical demos, answer questions, and craft the Statement of Work (SOW) for the project.

“I always enjoy having TekStream on my projects. I know I can trust their consultants to be professional and take great care of Splunk customers. I have had several TekStream consultants on very difficult projects. Each time they come out the other side with flying colors. Thank you TekStream and congratulations on 2021 Global Services Partner of the Year and 2021 AMER Professional Services Partner of the Year,” said Doug Searcy, Senior Professional Services Engagement Manager at Splunk.

We have built a reputation for being very “creative” in how we can structure deals – providing options to the customer that may not otherwise be available to them. Some of these options include flexibility on resource availability and timelines, access to non-Splunk resources (such as AWS engineers), and even financing options.

Of course, there’s the execution of the project itself. TekStream is able to assign a strong bench of resources, all of whom are Splunk Core Certified Consultants, which represents months of training and multiple other certifications. Additionally, all of our consultants spend time “manning the help desk” before they go out into the field. This gives them a unique opportunity to practice what they’ve learned in class. We call this “getting their street smarts to go along with their book smarts”. That way their first project isn’t the first time they’ve worked with Splunk or a customer; it could be their hundredth!

While on a Splunk project, our team members are never alone, even though they may be the only one on the project. Our Splunk consultants are highly collaborative and are constantly chatting with one another, asking questions and seeking and offering guidance, all to ensure that the solutions we provide are top-notch. We like to say that “when you hire one of us, you get all of us.”

All team members also have access to senior resources, whether that be in Splunk Enterprise, or any number of premium applications, such as Enterprise Security, IT Service Intelligence, Phantom, User Behavior Analytics, SIEM Replacements, CMMC, or Splunk Observability Suite.

Finally, TekStream stands behind our people and is committed to their success. If we discover that there may be a mismatch between the needs of a project and the skills of a consultant, we will quickly work to get them the assistance that they need, whether that is another resource providing free oversight, or swapping out the resource for a better fit. With TekStream, you are assured success in your Splunk projects.

“It’s a great accomplishment that not only highlights the commitment to delivery excellence but also TekStream’s investment into the Global PS practice which has translated in expanding to other PS programs such as ODS. Your team provides excellent leadership and support for our AMER PS practice while always demonstrating the willingness to lean in on our new product enablement and learning side by side with our top internal technical talent to provide positive customer experiences leading to adoption,” said Clint Locker, Director, Professional Services, South at Splunk.

We are honored and delighted to be recognized by Splunk with these two prestigious awards! We look forward to our continued mutual success in serving our clients.

Interested in working with an award-winning Splunk Partner? Contact us today!

 

Hide Rows or Panels in Splunk Dashboards

By: Yetunde Awojoodu  | Splunk Consultant

 

Imagine a use case requirement to present metrics for multiple application groups in a single Splunk dashboard for the NOC, while the individual application groups are also interested in viewing the same metrics for their own hosts. Rather than creating a separate dashboard for each application group in addition to the consolidated one for the NOC, one dashboard can address the requests from all the teams. Note, however, that in this situation the data presented in the dashboard can be viewed by all the groups.

These requests can be met by hiding panels or rows in one dashboard. There are perhaps multiple ways to hide panels or rows within a dashboard, but I will be demonstrating how to do so using XML. Your scenario may be quite different from mine but if you need to hide panels or rows, this article should show you how to do so. For simplicity of understanding, my sample data includes only three groups. I have also provided the hypothetical data I used in case you would like to get hands-on in your test lab to understand the way this works. It includes average CPU and memory values for each application group.

I will demonstrate this concept using two scenarios. In the first scenario, the application groups would like to view both CPU and memory data for their groups only and the panels for each group appear on a single row in the dashboard. In the second scenario, the NOC wants to see only panels for memory data for all the groups.

 

Scenario I – View Only a Selected Row

First, create your dashboard panels, one panel per metric type with a suitable visualization. At this stage, my dashboard looks like this with average CPU and memory for each application group.

Once you have created the panels needed, create an input with a value for each application group. I have selected the dropdown input because it appears to work best when hiding panels or rows due to the setting and unsetting of tokens. My input configuration looks like this:
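
A minimal Simple XML sketch of such a dropdown is shown below; the token name here is just a placeholder, and the <change> handler covered next goes inside this input element:

<fieldset submitButton="false">
  <input type="dropdown" token="group_token" searchWhenChanged="true">
    <label>Application Group</label>
    <choice value="All">All</choice>
    <choice value="CMM">CMM</choice>
    <choice value="BBC">BBC</choice>
    <choice value="MNS">MNS</choice>
    <default>All</default>
    <!-- the <change> section shown below goes here -->
  </input>
</fieldset>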

To hide rows or panels, you will need to include a “change” section in your XML as shown below, making sure to create a token for each group and set or unset tokens based on the panels you would like to see when a particular group is selected.  Also remember to include a depends attribute for each row or panel you are referencing. Here is the change section in my dashboard XML:

<change>
  <condition label="All">
    <set token="CMM_token">true</set>
    <set token="BBC_token">true</set>
    <set token="MNS_token">true</set>
  </condition>
  <condition label="CMM">
    <set token="CMM_token">true</set>
    <unset token="BBC_token"></unset>
    <unset token="MNS_token"></unset>
  </condition>
  <condition label="BBC">
    <set token="BBC_token">true</set>
    <unset token="CMM_token"></unset>
    <unset token="MNS_token"></unset>
  </condition>
  <condition label="MNS">
    <set token="MNS_token">true</set>
    <unset token="CMM_token"></unset>
    <unset token="BBC_token"></unset>
  </condition>
</change>

 

The snapshot below shows a section of my XML including how the depends attribute is specified for each row or panel that may be hidden.
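
In Simple XML, the row-level depends attribute looks like the sketch below; the panel titles and contents are placeholders:

<row depends="$BBC_token$">
  <panel>
    <title>BBC - Average CPU</title>
    <!-- chart and search for the BBC CPU metric go here -->
  </panel>
  <panel>
    <title>BBC - Average Memory</title>
    <!-- chart and search for the BBC memory metric go here -->
  </panel>
</row>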

I have selected BBC only in the screenshot below to show you what the dashboard will look like when only one row is selected after configuring hidden rows.

 

Scenario II – View Only Selected Panels

As in the first scenario, create your panels and configure an input suitable for your use case. I have selected the dropdown input and my configuration looks like this:

In this case, after creating tokens for the values in the change section, the “depends” attribute will be specified with each panel rather than each row as in the first scenario. The CPU_token was created for the CPU panels and MEM_token for memory panels. Below is a snapshot of a section of the XML:
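
Here the depends attribute moves from the row to each panel; again, a sketch with placeholder titles:

<row>
  <panel depends="$CPU_token$">
    <title>CMM - Average CPU</title>
    <!-- CPU visualization goes here -->
  </panel>
  <panel depends="$MEM_token$">
    <title>CMM - Average Memory</title>
    <!-- memory visualization goes here -->
  </panel>
</row>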

In the screenshot below, I have selected only panels with memory metrics.

In summary, hiding panels or rows in a dashboard can be achieved by simply setting and unsetting tokens and specifying the “depends” attribute in the XML for each panel or row. In my opinion, this method is quite straightforward compared to a few others I have come across.

Have fun Splunking!

Want to learn more about hiding rows or panels in Splunk dashboards? Contact us today!