Masking Important Data in Your Splunk Environment

By: Aaron Dobrzeniecki | Splunk Consultant

 

If you have questions about masking sensitive data as it is ingested into Splunk, this is the blog for you. Common use cases include masking credit card numbers, Social Security numbers, passwords, account IDs, or anything else that should not be visible to the public. Before you mask data at index time in production, test your configuration in a dev environment if you have one. A great website for testing your regular expressions is www.regex101.com.

Both approaches described below rely on the correctness of your regular expression. Splunk looks for strings that match the defined regex pattern. You can then tell Splunk to strip out the matching string, replace it, or replace only part of it. Both methods do the same thing (match a regex and replace the values), but each does it in a slightly different manner.

In the example data below, I will mask the account IDs so that only the last four digits are shown. There are two ways you can mask data before it gets indexed into Splunk.

Method 1:

Use props.conf and transforms.conf to modify the data so that the first 12 characters of the account ID turn into x’s.

One sample event:

[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=9999999999999999

When ingested into Splunk using the props.conf and transforms.conf below, the event will be indexed as follows:

[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=xxxxxxxxxxxx9999

props.conf

[mysourcetype]

TRANSFORMS-data_mask=data_masking

 

transforms.conf

[data_masking]

SOURCE_KEY=_raw

REGEX=(^.*)(\sAcctID=)\d{12}(\d*)

FORMAT=$1$2xxxxxxxxxxxx$3

DEST_KEY=_raw

Use the SOURCE_KEY parameter to specify the field in which Splunk should look for matching data. Splunk will attempt to match the regex specified in the REGEX setting. If it matches, Splunk builds a new value from FORMAT and writes it to the field specified in DEST_KEY (which is the same field, _raw, in this example). The values for FORMAT work as follows. Each dollar sign followed by a digit refers to a capture group. In the example above there are three capture groups: (^.*) is the first; (\sAcctID=) is the second; and (\d*) is the third (I included the third capture group to keep any digits that follow the first 12, whether or not they exist in the event). Notice that \d{12} is not inside a capture group. This is because THAT part of the match is what we want to mask, so FORMAT puts the literal xxxxxxxxxxxx in its place.

The key to masking your important data is making sure that you have created the correct regex. In the example above I wrote a regex that encompasses the entire event. Doing so lets us rebuild the entire event from the capture groups while leaving out the data to be masked.
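To check the capture-group logic outside of Splunk, the transform can be approximated in a few lines of Python. This is only a sketch: it assumes Python's re module treats this particular pattern the same way Splunk's regex engine does, and the function name mask_event is made up for illustration. It is not how Splunk itself applies the transform.

```python
import re

# Same pattern as the REGEX setting in transforms.conf above.
REGEX = re.compile(r"(^.*)(\sAcctID=)\d{12}(\d*)")


def mask_event(raw: str) -> str:
    """Rebuild the event from the capture groups, mimicking FORMAT=$1$2xxxxxxxxxxxx$3.

    mask_event is a hypothetical helper, not part of Splunk.
    """
    match = REGEX.search(raw)
    if match is None:
        # No account ID with at least 12 digits: leave the event unchanged.
        return raw
    # $1 = everything before " AcctID=", $2 = " AcctID=", $3 = digits after the first 12.
    return f"{match.group(1)}{match.group(2)}xxxxxxxxxxxx{match.group(3)}"


event = "[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=9999999999999999"
masked = mask_event(event)
print(masked)
# prints: [02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=xxxxxxxxxxxx9999
```

Running the sample event through a check like this before deploying the configuration is a quick way to confirm that the 12 masked digits sit outside every capture group.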

Another way to mask important data before it is indexed into Splunk is to use SEDCMD to replace the desired text with x’s, or with whatever you want to use to show that the data has been masked. Using the same sample event, we will get the same result as above with a different method.

Method 2:

props.conf

[mysourcetype]

SEDCMD-replace=s/AcctID\=\d{12}/AcctID=xxxxxxxxxxxx/g

The above props.conf will mask the data as desired. The key here is to make sure that your replacement string (the third segment of the SEDCMD) includes the part that you want to keep and does not include the part that you want to get rid of. With SEDCMD, Splunk replaces every piece of text that matches the regex in the second segment with the replacement string in the third segment; the trailing g applies the substitution to every match in the event.
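The same substitution can be sketched in Python with re.sub, which replaces every match by default, much like the g flag. This is an approximation for testing the pattern, not Splunk's implementation; the function name sed_mask and the extra Status field in the sample are assumptions added for illustration, and the unnecessary backslash before = is dropped.

```python
import re

# Pattern and replacement taken from the SEDCMD above (without the optional "\=" escape).
PATTERN = re.compile(r"AcctID=\d{12}")
REPLACEMENT = "AcctID=xxxxxxxxxxxx"


def sed_mask(raw: str) -> str:
    """Mimic s/AcctID\\=\\d{12}/AcctID=xxxxxxxxxxxx/g (sed_mask is a hypothetical helper)."""
    # re.sub replaces all non-overlapping matches, like sed's g flag.
    return PATTERN.sub(REPLACEMENT, raw)


# The trailing Status=OK field is invented to show that surrounding text is untouched.
event = "[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=9999999999999999 Status=OK"
sed_masked = sed_mask(event)
print(sed_masked)
# prints: [02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=xxxxxxxxxxxx9999 Status=OK
```

Because only the matched text is replaced, the pattern does not need to describe the rest of the event, which is part of why this method is quicker to configure.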

In conclusion, there are two ways to anonymize data with Splunk Enterprise:

Use SEDCMD (method 2), which works like a sed script, to do replacements and substitutions. The sed script method is easier to do, takes less time to configure, and is slightly faster than a transform. But there are limits to how many times you can invoke SEDCMD and what it can do.

Use a regular expression transform (method 1). This method takes longer to configure, but is easier to modify after the initial configuration and can be assigned to multiple data inputs more easily.

Want to learn more about masking important data in your Splunk environment? Contact us today!