How to Combine Multiple Data Sources in Splunk SPL

      By: Yetunde Awojoodu  |  Splunk Consultant

Depending on your use case or what you are looking to achieve with your Splunk Processing Language (SPL), you may need to query multiple data sources and merge the results. The most intuitive command to use when these situations arise is the “join” command, but it tends to consume a lot of resources – especially when joining large datasets. I will be describing a few other commands or functions that can be applied when combining data from multiple sources in Splunk, including their benefits and limitations.

“OR” Boolean Operator

The most common use of the “OR” operator is to find multiple values in event data, e.g. “foo OR bar.” This tells the program to find any event that contains either word. However, the “OR” operator is also commonly used to combine data from separate sources, e.g. (sourcetype=foo OR sourcetype=bar OR sourcetype=xyz). Additional filtering can also be added to each data source, e.g., (index=ABC loc=Ohio) OR (index=XYZ loc=California). When used in this manner, Splunk runs a single search, looking for any events that match any of the specified criteria in the searches. The required events are identified earlier in the search before calculations and manipulations are applied.

Syntax for “OR” Operator:

(<search1>) OR (<search2>) OR (<search3>)

Pros:

  • – Merges fields and event data from multiple data sources
  • – Saves time since it does only a single search for events that match specified criteria and returns only the applicable events before any other manipulations

Cons:

  • – Only used with base searches. Does not allow calculations or manipulations per source, so any further calculations or manipulations will need to be performed on all returned events

Example: In the example below, the OR operator is used to combine fields from two different indexes and grouped by the customer_id, which is common to both data sources.

Splunk search interface displaying table command output that joins web logs and call center data based on customer ID.

Append Command

Append is a streaming command used to add the results of a secondary search to the results of the primary search. The results from the append command are usually appended to the bottom of the results from the primary search. After the append, you can use the table command to display the results as needed. Note that the secondary search must begin with a generating command. It is important to also note that append searches are not processed like subsearches where the subsearch is processed first. They are run at the point they are encountered in the SPL.

Syntax for Append:

<primary search> ... | append [<secondary search>]

Pros:

  • – Displays fields from multiple data sources

Cons:

  • – Subject to a maximum result rows limit of 50,000 by default
  • – The secondary search must begin with a generating command
  • – It can only run over historical data, not real-time data

Example: In the example below, the count of web activities on the Splunk User Interface is displayed from _internal index along with count per response from the _audit index.

Splunk query showing how to combine internal access logs with audit logs using append command, followed by statistics output grouped by method and info fields.

The last four rows are the results of the appended search. Both result sets share the count field. You can see that the append command just tacks on the results of the subsearch to the end of the previous search, even though the results share the same field values.

Multisearch Command

Multisearch is a generating command that runs multiple streaming searches at the same time. It requires at least two searches and should only contain purely streaming operations such as eval, fields, or rex within each search. One major benefit of the multisearch command is that it runs multiple searches simultaneously rather than sequentially as with the append command. This could save you some runtime especially when running more complex searches that include multiple calculations and/or inline extractions per data source. Results from the multisearch command are interleaved, not added to the end of the results as with the append command.

Syntax for the Multisearch Command:

| multisearch [<search1>] [<search2>] [<search3>] ...

Since multisearch is a generating command, it must be the first command in your SPL. It is important to note that the searches specified in square brackets above are not actual subsearches. They are full searches that produce separate sets of data that will be merged to get the expected results. A subsearch is a search within a primary or outer search. When a search contains a subsearch, Splunk processes the subsearch first as a distinct search job and then runs the primary search.

Pros:

  • – Merges data from multiple data sources
  • – Multisearch runs searches simultaneously, thereby saving runtime with complex searches
  • – There is no limit to the number of result rows it can produce
  • – Results from the multisearch command are interleaved allowing for a more organized view

Cons:

  • – Requires that the searches are entirely distributable or streamable
  • – Can be resource-intensive due to multiple searches running concurrently. This needs to be taken into consideration since it can cause search heads to crash.

Example: In the example shown below, the multisearch command is used to combine the action field from the web_logs index and queue field from the tutorial_games index using eval to view the sequence of events and identify any roadblocks in customer purchases. The results are interleaved using the _time field.

Example of a Splunk multisearch command that merges data from different indexes and assigns a unified call_sequence field before aggregating by customer ID and time.

Union Command

Union is a generating command that is used to combine results from two or more datasets into one large dataset. The behavior of the union command depends on whether the dataset is a streaming or non-streaming dataset. Centralized streaming or non-streaming datasets are processed the same as append command while distributable streaming datasets are processed the same as multisearch command.

Syntax for Union Command:

| union [<search2>] [<search2>] … OR … | union [<search>]

However, with streaming datasets, instead of this syntax:
<streaming_dataset1> | union <streaming_dataset2>

Your search is more efficient with this syntax:
... | union <streaming_dataset1>, <streaming_dataset2>

Pros:

  • – Merges data from multiple data sources
  • – Can process both streaming and non-streaming commands, though behavior will depend on the command type
  • – As an added benefit of the max out argument, which specifies the maximum number of results to return from the subsearch. The default is 50,000 results. This value is the maxresultrows setting in the [searchresults] stanza in the limits.conf file.

Example: The example below is similar to the multisearch example provided above and the results are the same. Both searches are distributable streaming, so they are “unioned” by using the same processing as the multisearch command.

Splunk query using the union command to merge datasets from web logs and call center sources, with results grouped by customer ID and call_sequence.

In the second example below, because the head command is a centralized streaming command rather than a distributable streaming command, any subsearches that follow the head command are processed using the append command. In other words, when a command forces the processing to the search head, all subsequent commands must also be processed on the search head.

Simplified union example in Splunk combining events for two specific customers across web logs and call center sources, displaying the merged call_sequence over time.

Comparing the Four Options

The table below shows a comparison:

#ORAppendMultisearchUnion
1.Boolean OperatorStreaming commandGenerating commandGenerating Command
2.Used in between searchesUsed in between searchesMust be the first command in your SPLCan be either the first command or used in between searches. Choose the most efficient method based on the command types needed
3.Results are interleavedResults are added to the bottom of the tableResults are interleavedResults are interleaved based on the time field
4.No limit to the number of rows that can be producedSubject to a maximum of 50,000 result rows by defaultNo limit to the number of rows that can be producedDefault of 50,000 result rows with non-streaming searches. Can be changed using maxout argument.
5.Requires at least two base searchesRequires a primary search and a secondary oneRequires at least two searchesRequires at least two searches that will be “unioned”
6Does not allow use of operators within the base searchesAllows both streaming and non-streaming operatorsAllows only streaming operatorsAllows both streaming and non-streaming operators
7.Does only a single search for events that match specified criteriaAppends results of the “subsearch” to the results of the primary searchRuns searches simultaneouslyBehaves like multisearch with streaming searches and like append with non-streaming
8.Transforming commands such as chart, timechart or stats cannot be used within the searches but can be specified after Transforming commands such as chart, timechart or stats cannot be used within the streaming searchesBehaves like multisearch with streaming searches and like append with non-streaming

I hope you now have a better understanding of the different multisearch command options presented and will make the most optimized choice for your use case.

Want to learn more about combining data sources in Splunk? Contact us today!