Re-Index Raw Splunk Events to a New Index

      By: Zubair Rauf  |  Splunk Consultant, Team Lead

 

A few days ago, I came across a very rare use case in which a user had to reindex a specific subset of raw Splunk events into another index in their data. This was historical data and could not be easily routed to a new index at index-time.

After much deliberation on how to move this data over, we settled on the summary index method, using the collect command. This would enable us to search for the specific event we want and reindex them in a separate index.

When re-indexing raw events using the “collect” command, Splunk automatically assigns the source as search_id (for ad-hoc search) and saved search name (for scheduled searches), sourcetype as stash, and host as the Splunk server that runs the search. To change these, you can specify these values as parameters in the collect command. The method we were going to follow was simple – build the search to return the events we cared about, use the collect command with the “sourcetype,” “source,” and “host” values to get the original values of source, sourcetype, and host to show up in the new event.

To our dismay, this was not what happened, as anything added to those fields is treated as a literal string and doesn’t take the dynamic values of the field being referenced. For example, host = host would literally change the host value to “host” instead of the host field in the original event. We also discovered that when summarizing raw events, Splunk will not add orig_source, orig_sourcetype, and orig_index to the summarized/re-indexed events.

To solve this problem, I had to get creative, as there was no easy and direct way to do that. I chose props and transforms to solve my problem. This method is by no means perfect and only works with two types of events:

  • – Single-line plaintext events
  • – Single-line JSON events

Note: This method is custom and only works if all steps are followed properly. The transforms extract field values based on regex, therefore everything has to be set up with care to make sure the regex works as designed. This test was carried out on a single-instance Splunk server, but if you are doing it in a distributed environment, I will list the steps below on where to install props and transforms.

The method we employed was simple:

  1. Make sure the target index is created on the indexers.
  2. Create three new fields in search for source, sourcetype, and host.
  3. Append the new fields to the end of the raw event (for JSON, we had to make sure they were part of the JSON blob).
  4. Create Transforms to extract the values of those fields.
  5. Use props to apply those transforms on the source.

I will outline the process I used for both types of events using the _internal and _introspection index in which you can find both single-line plain text events and single-line JSON events.

Adding the Props and Transforms

The best way to add the props and transforms is to package them up in a separate app and push them out to the following servers:

  1. Search Head running the search to collect/re-index the data in a new index (If using a cluster, use the deployer to push out the configurations).
  2. Indexers that will be ingesting the new data (If using a clustered environment, use the Indexer cluster Manager Node, previously Master Server/Cluster Master to push out the configuration).

Splunk TA

I created the following app and pushed it out to my Splunk environment:

TA-collect-raw-events/
└── local
├── props.conf
└── transforms.conf

The files included the following settings:

Transforms.conf

[setDynamicSource]
FORMAT = source::$1
REGEX = ^.*myDynamicSource=\"([^\"]+)\"
DEST_KEY= MetaData:Source

[setDynamicSourcetype]
FORMAT = sourcetype::$1
REGEX = ^.*myDynamicSourcetype=\"([^\"]+)\"
DEST_KEY= MetaData:Sourcetype

[setDynamicHost]
FORMAT = host::$1
REGEX = ^.*myDynamicHost=\"([^\"]+)\"
DEST_KEY= MetaData:Host

[setDynamicSourceJSON]
FORMAT = source::$1
REGEX = ^.*\"myDynamicSourceJSON\":\"([^\"]+)\"
DEST_KEY= MetaData:Source

[setDynamicSourcetypeJSON]
FORMAT = sourcetype::$1
REGEX = ^.*\"myDynamicSourcetypeJSON\":\"([^\"]+)\"
DEST_KEY= MetaData:Sourcetype

[setDynamicHostJSON]
FORMAT = host::$1
REGEX = ^.*\"myDynamicHostJSON\":\"([^\"]+)\"
DEST_KEY= MetaData:Host

Props.conf

[source::myDynamicSource]
TRANSFORMS-set_source = setDynamicSource, setDynamicSourcetype, setDynamicHost

[source::myDynamicSourceJSON]
TRANSFORMS-set_source = setDynamicSourceJSON, setDynamicSourcetypeJSON, setDynamicHostJSON

Once the TA is successfully deployed on the Indexers and Search Heads, you can use the following searches to test this solution.

Single-line Plaintext Events

The easiest way to test this is by using the _internal, splunkd.log data as it is always generating when your Splunk instance is running. I used the following search to take ten sample events and re-index them using the metadata of the original event.

index=_internal earliest=-10m@m latest=now
| head 10
| eval myDynamicSource= source
| eval myDynamicSourcetype= sourcetype
| eval myDynamicHost= host
| eval _raw = _raw." myDynamicSource=\"".myDynamicSource."\" myDynamicSourcetype=\"".myDynamicSourcetype."\" myDynamicHost=\"".myDynamicHost."\""
| collect testmode=true index=collect_test source="myDynamicSource" sourcetype="myDynamicSourcetype" host="myDynamicHost"

Note: Change testmode=false when you want to index the new data as testmode=true is to test your search to see if it works.

The search does append some metadata fields (created with eval) in the original search to the newly indexed event. This method will use license to index this data again as the sourcetype is not stash.

Single-line JSON Events

To test JSON events, I am using the Splunk introspection logs from the _introspection index. This search also extracts ten desired events and re-indexes them in the new index. This search inserts metadata fields into the JSON event:

index=_introspection sourcetype=splunk_disk_objects earliest=-10m@m latest=now
| eval myDynamicSourceJSON=source
| eval myDynamicSourcetypeJSON=sourcetype
| eval myDynamicHostJSON=host
| rex mode=sed s/.$//g
| eval _raw = _raw.",\"myDynamicSourceJSON\":\"".myDynamicSourceJSON."\",\"myDynamicSourcetypeJSON\":\"
".myDynamicSourcetypeJSON."\":\",\"myDynamicHostJSON\":\"".myDynamicHostJSON."\
"}"
| collect testmode=true index=collect_test source="myDynamicSourceJSON" sourcetype="myDynamicSourcetypeJSON" Host="myDynamicHostJSON"

The data in the original events does not pretty-print as JSON, but all fields are extracted in the search as they are in the raw Splunk event.

The events are re-indexed into the new index, and with the | spath command, all the fields from the JSON are extracted as well as visible under Interesting Fields.

One thing to note here is that this is not a mass re-indexing solution. This is good for a very specific use case where there are not a lot of variables involved.

To learn more about this or if you need help with implementing a custom solution like this, please feel free to reach out to us.