How to Combine Multiple Data Sources in Splunk SPL

Caroline Lea
September 9, 2021
09:03 am

By: Yetunde Awojoodu | Splunk Consultant

Depending on your use case or what you are looking to achieve with your Splunk Processing Language (SPL), you may need to query multiple data sources and merge the results. The most intuitive command to use when these situations arise is the “join” command, but it tends to consume a lot of resources – especially when joining large datasets. I will be describing a few other commands or functions that can be applied when combining data from multiple sources in Splunk, including their benefits and limitations.

“OR” Boolean Operator

The most common use of the “OR” operator is to find multiple values in event data, e.g. “foo OR bar.” This tells the program to find any event that contains either word. However, the “OR” operator is also commonly used to combine data from separate sources, e.g. (sourcetype=foo OR sourcetype=bar OR sourcetype=xyz). Additional filtering can also be added to each data source, e.g., (index=ABC loc=Ohio) OR (index=XYZ loc=California). When used in this manner, Splunk runs a single search, looking for any events that match any of the specified criteria in the searches. The required events are identified earlier in the search before calculations and manipulations are applied.

Syntax for “OR” Operator:

(<search1>) OR (<search2>) OR (<search3>)

Pros:

– Merges fields and event data from multiple data sources
– Saves time since it does only a single search for events that match specified criteria and returns only the applicable events before any other manipulations

Cons:

– Only used with base searches. Does not allow calculations or manipulations per source, so any further calculations or manipulations will need to be performed on all returned events

Example: In the example below, the OR operator is used to combine fields from two different indexes and grouped by the customer_id, which is common to both data sources.

Splunk search interface displaying table command output that joins web logs and call center data based on customer ID.

Append Command

Append is a streaming command used to add the results of a secondary search to the results of the primary search. The results from the append command are usually appended to the bottom of the results from the primary search. After the append, you can use the table command to display the results as needed. Note that the secondary search must begin with a generating command. It is important to also note that append searches are not processed like subsearches where the subsearch is processed first. They are run at the point they are encountered in the SPL.

Syntax for Append:

<primary search> ... | append [<secondary search>]

Pros:

– Displays fields from multiple data sources

Cons:

– Subject to a maximum result rows limit of 50,000 by default
– The secondary search must begin with a generating command
– It can only run over historical data, not real-time data

Example: In the example below, the count of web activities on the Splunk User Interface is displayed from _internal index along with count per response from the _audit index.

Splunk query showing how to combine internal access logs with audit logs using append command, followed by statistics output grouped by method and info fields.

The last four rows are the results of the appended search. Both result sets share the count field. You can see that the append command just tacks on the results of the subsearch to the end of the previous search, even though the results share the same field values.

Multisearch Command

Multisearch is a generating command that runs multiple streaming searches at the same time. It requires at least two searches and should only contain purely streaming operations such as eval, fields, or rex within each search. One major benefit of the multisearch command is that it runs multiple searches simultaneously rather than sequentially as with the append command. This could save you some runtime especially when running more complex searches that include multiple calculations and/or inline extractions per data source. Results from the multisearch command are interleaved, not added to the end of the results as with the append command.

Syntax for the Multisearch Command:

| multisearch [<search1>] [<search2>] [<search3>] ...

Since multisearch is a generating command, it must be the first command in your SPL. It is important to note that the searches specified in square brackets above are not actual subsearches. They are full searches that produce separate sets of data that will be merged to get the expected results. A subsearch is a search within a primary or outer search. When a search contains a subsearch, Splunk processes the subsearch first as a distinct search job and then runs the primary search.

Pros:

– Merges data from multiple data sources
– Multisearch runs searches simultaneously, thereby saving runtime with complex searches
– There is no limit to the number of result rows it can produce
– Results from the multisearch command are interleaved allowing for a more organized view

Cons:

– Requires that the searches are entirely distributable or streamable
– Can be resource-intensive due to multiple searches running concurrently. This needs to be taken into consideration since it can cause search heads to crash.

Example: In the example shown below, the multisearch command is used to combine the action field from the web_logs index and queue field from the tutorial_games index using eval to view the sequence of events and identify any roadblocks in customer purchases. The results are interleaved using the _time field.

Example of a Splunk multisearch command that merges data from different indexes and assigns a unified call_sequence field before aggregating by customer ID and time.

Union Command

Union is a generating command that is used to combine results from two or more datasets into one large dataset. The behavior of the union command depends on whether the dataset is a streaming or non-streaming dataset. Centralized streaming or non-streaming datasets are processed the same as append command while distributable streaming datasets are processed the same as multisearch command.

Syntax for Union Command:

| union [<search2>] [<search2>] … OR … | union [<search>]

However, with streaming datasets, instead of this syntax:
<streaming_dataset1> | union <streaming_dataset2>

Your search is more efficient with this syntax:
... | union <streaming_dataset1>, <streaming_dataset2>

Pros:

– Merges data from multiple data sources
– Can process both streaming and non-streaming commands, though behavior will depend on the command type
– As an added benefit of the max out argument, which specifies the maximum number of results to return from the subsearch. The default is 50,000 results. This value is the maxresultrows setting in the [searchresults] stanza in the limits.conf file.

Example: The example below is similar to the multisearch example provided above and the results are the same. Both searches are distributable streaming, so they are “unioned” by using the same processing as the multisearch command.

Splunk query using the union command to merge datasets from web logs and call center sources, with results grouped by customer ID and call_sequence.

In the second example below, because the head command is a centralized streaming command rather than a distributable streaming command, any subsearches that follow the head command are processed using the append command. In other words, when a command forces the processing to the search head, all subsequent commands must also be processed on the search head.

Simplified union example in Splunk combining events for two specific customers across web logs and call center sources, displaying the merged call_sequence over time.

Comparing the Four Options

The table below shows a comparison:

#	OR	Append	Multisearch	Union
1.	Boolean Operator	Streaming command	Generating command	Generating Command
2.	Used in between searches	Used in between searches	Must be the first command in your SPL	Can be either the first command or used in between searches. Choose the most efficient method based on the command types needed
3.	Results are interleaved	Results are added to the bottom of the table	Results are interleaved	Results are interleaved based on the time field
4.	No limit to the number of rows that can be produced	Subject to a maximum of 50,000 result rows by default	No limit to the number of rows that can be produced	Default of 50,000 result rows with non-streaming searches. Can be changed using maxout argument.
5.	Requires at least two base searches	Requires a primary search and a secondary one	Requires at least two searches	Requires at least two searches that will be “unioned”
6	Does not allow use of operators within the base searches	Allows both streaming and non-streaming operators	Allows only streaming operators	Allows both streaming and non-streaming operators
7.	Does only a single search for events that match specified criteria	Appends results of the “subsearch” to the results of the primary search	Runs searches simultaneously	Behaves like multisearch with streaming searches and like append with non-streaming
8.	Transforming commands such as chart, timechart or stats cannot be used within the searches but can be specified after		Transforming commands such as chart, timechart or stats cannot be used within the streaming searches	Behaves like multisearch with streaming searches and like append with non-streaming

I hope you now have a better understanding of the different multisearch command options presented and will make the most optimized choice for your use case.

Want to learn more about combining data sources in Splunk? Contact us today!