Tsidx Reduction for Storage Savings

Caroline Lea
September 13, 2019
04:56 pm

By: Yetunde Awojoodu | Splunk Consultant

Introduction

Tsidx Reduction was introduced in Splunk Enterprise v6.4 to provide users with the option of reducing the size of index files (tsidx files) primarily to save on storage space. The tsidx reduction process transforms full size index files into minified versions which will contain only essential metadata. A few scenarios to consider tsidx reduction include:

Consistently running out of disk space or nearing storage limits but not ready to incur additional storage costs
Have older data that are not searched regularly
Can afford a tradeoff between storage costs and search performance

How it works

Each bucket contains a tsidx file (time series index data) and a journal.gz file (raw data). A tsidx file associates each unique keyword in your data with location references to events, which are stored in the associated rawdata file. This allows for fast full text searches. By default, an indexer retains tsidx files for all its indexed data for as long as it retains the data itself.

When buckets are tsidx reduced, they still contain a smaller version of the tsidx files. The reduction applies mainly to the lexicon of the bucket which is used to find events matching any keywords in the search. The bloom filters, tsidx headers, and metadata files are still left in place. This means that for reduced buckets, search terms will not be checked against the lexicon to see where they occur in the raw data.

Once a bucket is identified as potentially containing a search term, the entire raw data of the bucket that matches the time range of the search will need to be scanned to find the search term rather than first scanning the lexicon to find a pointer to the term in the raw data. This is where the tradeoff with search performance occurs. If a search hits a reduced bucket, the resulting effect will be slower searches. By reducing tsidx files for older data, you incur little performance hit for most searches while gaining large savings in disk usage.

The process can decrease bucket size by one-third to two-thirds depending on the type of data. For example, a 1GB bucket would decrease in size between 350MB – 700MB. The exact amount depends on the type of data. Data with many unique terms require larger tsidx files. To make a rough estimate of a bucket’s reduction potential, look at the size of its merged_lexicon.lex file. The merged_lexicon.lex file is an indicator of the number of unique terms in a bucket’s data. Buckets with larger lexicon files have tsidx files that reduce to a greater degree.

When a search hits the reduced buckets, a message appears in Splunk Web to warn users of a potential delay in search completion: “Search on most recent data has completed. Expect slower search speeds as we search the minified buckets.” Once you enable tsidx reduction, the indexer begins to look for buckets to reduce. Each indexer reduces one bucket at a time, so performance impact should be minimal.

Benefits

Savings in disk usage due to reduced tsidx files
Extension of data lifespan by permitting data to be kept longer (and searchable) in Splunk
Longer term storage without the need for extra architectural steps like adding S3 archival or rolling to Hadoop.

Configuration

The configuration is pretty straight forward and you can perform a trial by starting with one index and observing the results before taking further action on any other indexes. You will need to specify a reduction age on a per-index basis:

1. On Splunk UI:

Go to Settings > Indexes > Select an Index
Set tsidx reduction policy.

2. Splunk Configuration File:

indexes.conf
[<indexname>]
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = <NumberOfSeconds>

The attribute “timePeriodInSecBeforeTsidxReduction” is the amount of time, in seconds, that a bucket can age before it becomes eligible for tsidx reduction. When this time difference is exceeded, a bucket becomes eligible for tsidx reduction. Default Is 604800

To check whether a bucket is reduced, run the dbinspect search command:

| dbinspect index=_internal
The tsidxState field in the results specifies “full” or “mini” for each bucket.

To restore reduced buckets to their original state, refer toSplunk Docs

A few notes

Tsidx reduction should be used on old data and not on frequently searched data. You can continue to search across the aged data, if necessary, but such searches will exhibit significantly worse performance. Rare term searches, in particular, will run slowly.
A few search commands do not work with reduced buckets. These include ‘tstats’ and ‘typeahead’. Warnings will be included in search.log

Reference Links

https://docs.splunk.com/Documentation/Splunk/7.2.6/Indexer/Reducetsidxdiskusage

https://conf.splunk.com/files/2016/slides/behind-the-magnifying-glass-how-search-works.pdf

https://conf.splunk.com/files/2017/slides/splunk-data-life-cycle-determining-when-and-where-to-roll-data.pdf

Want to learn more about Tsidx Reduction for Storage Savings? Contact us today!