Splunk Phantom Workbooks

By: Joe Wohar | Senior Splunk Consultant

 

Splunk Phantom is best known as a platform for automating cybersecurity processes; however, many companies do not realize they can also use Phantom for case management. Arguably the most powerful, yet least-known, case management feature of Phantom is the ability to create and use workbooks.

If you’re familiar with Phantom, then you know that Phantom playbooks are repeatable processes that Phantom runs against events. Phantom workbooks are the analyst-side counterpart: repeatable, defined processes that analysts work through against events, and they typically come into play when an analyst needs to get involved. When an event is determined to be a threat that requires investigation by an analyst, the event should be promoted to a case. This can be done manually on the event (by clicking the toolbox icon button) or by specifying conditions in a playbook that then promotes the event to a case.

Image 1: A workbook must be selected when converting an event to a case.

 

One of the biggest advantages of workbooks is that they ensure your analysts, whether new or seasoned, follow the same set of steps when working cases. SOPs define processes for your analysts to follow, but workbooks put those processes right into the case and make the work easily trackable. Workbooks are made up of two trackable components: phases and tasks.

 

Phases

Phases split the investigation into different sections, such as identification, acquisition, analysis, and reporting. Individual SLAs can be set for each phase of a workbook. When SLAs are missed/breached, there is a panel on the Phantom home page for tracking that:

Image 2: Home page SLA breach tracker.

 

Phases are made up of tasks, which are where the specific steps for investigations are listed.

Image 3: Adding new phases/tasks to a workbook.

 

Tasks

Tasks are very customizable: they can be fairly general with few trackable requirements or very specific with many tracked steps. First, a task can have a default owner assigned to it, which is useful if you want a “review” task so that a more experienced analyst can check a newer analyst’s work; most often, though, you’ll want to leave the owner blank so the task can be assigned to the analyst working the case. The task’s description section is where you describe the specific things that should be done in the task. If you don’t want to track specific steps, you can simply use this section to create a list of steps for analysts to follow. If a task does involve very specific steps, however, you may want to use the description only to describe the process and list the steps as actions or playbooks. This brings us to the next part of tasks: adding actions and playbooks.

Actions and playbooks bring Phantom’s automation into the human-driven process. The actions and playbooks you can add to a task are limited to the actions available in your configured apps and the playbooks available in your Phantom instance. When an analyst goes to run an action from the investigation screen, the action is already pulled up and they just need to enter the details.

Image 4: Workbook opened in a case with 2 tasks.

 

Image 5: Pop-up window from clicking the “run query” action in the workbook.

 

Running a playbook from a workbook is even simpler. Just click the playbook and click the “Run Playbook” button.

Image 6: Pop-up window from clicking the “Disabled User” playbook in the workbook.

 

As analysts work through and complete the tasks, the phase’s tracker updates to show progress, whether each task was completed on time, and whether the phase as a whole was completed on time.

Images 7 & 8: First task complete and then both tasks completed.

 

If you’re not using Phantom for case management, then you’re likely using Phantom to create tickets and add details to them in another product, which costs you extra hardware and licensing. By using Phantom for case management, you avoid the cost of that additional product and its hardware while getting more out of software you’ve already bought.

Not sure how to get started with workbooks? Try taking one of your best-defined SOPs and turning it into a workbook. If you’re not currently a Phantom customer and would like to try it out, you can download the OVA by registering here: https://my.phantom.us/

 

Want to learn more about Phantom workbooks? Contact us today!

 

The Bin Command

By: Forrest Lybarger | Splunk Consultant

 

The bin command is a relatively uncommon but incredibly useful tool in Splunk. You give it a numeric field, and Splunk groups events into buckets based on that field’s values. It’s also worth knowing that the timechart command calls the bin command behind the scenes, so only use bin directly when timechart can’t perform the task. Below I will go through the options for the bin command with examples of how to use them.

At its most basic level, the bin command groups events into buckets based on value intervals. For example, “…| bin GB as bin_size” creates the bin_size field and assigns each event a range such as 0-10 or 10-20. The result looks like the screenshot below.

Without the “as bin_size” part of the command, the range value is assigned to the GB field instead. As long as the field is numeric, the bin command can group events into ranges, but the command’s options give users much more control over the results.

 

Bins

The bins option is simple in application: it limits the number of buckets the command can create by establishing a maximum. For example, “…| bin bins=20 linecount as bin_size” limits bin_size to at most 20 different values, though it may produce fewer. The results could look something like the screenshot below.

Splunk determines the size of the buckets on its own here, but there is a way to control the bucket size with other options.

 

Minspan

The minspan option lets a user set the minimum size of buckets. This means that you can prevent a bucket from being too granular for your use case. For example: “…| bin minspan=100 linecount as bin_size” prevents the results from grouping into anything smaller than a 0-100 bucket.

 

 

Span

The span option is by far the most useful option in the bunch. It allows you to control the size of the buckets, which, when combined with the other options, gives users much more control over the bin command’s results. For example: “…| bin span=5 linecount as bin_size” creates buckets with a size of 5. The span value can be numeric, time-based, or logarithmic.

 

 

Start/End

The end option also controls the size of buckets, but in an indirect way. When given a value, end changes how the bin command automatically calculates bucket sizes by treating the end value as the highest value in the results. For example, “…| bin end=1000 linecount as bin_size” makes the bin command assume the results range from 0 to 1000, so it auto-sizes the buckets to a width of 100; if the actual linecount values are all below 100, every event lands in a single large 0-100 bucket.

The span option overrides end.

The start option does a similar operation, but on the beginning value and is also overridden by the span option.

 

Aligntime

The aligntime option is last and is only valid when binning events by _time. It offsets where the bucket boundaries fall and is ignored if the span is in days, months, or years. Aligntime is almost always used in conjunction with span, which sets the bucket size. For example, “…| bin span=2h aligntime=@d+1h _time as bucket” builds two-hour buckets offset by one hour, so each bucket starts and ends on an odd hour (01:00-03:00, 03:00-05:00, and so on).

 

Conclusion

While the bin command isn’t the most common search command in SPL, it is very powerful in specific circumstances. If you ever need to group data without reaching for a transforming command, or you run into one of many other niche cases, bin can group the events for you. Since it is also a very underutilized command, this knowledge may save someone else a lot of time as well.

Want to learn more about the bin command? Contact us today!

Masking Important Data in Your Splunk Environment

By: Aaron Dobrzeniecki | Splunk Consultant

 

If you have problems or questions about masking important data as it gets ingested into Splunk, this is the blog for you. Common use cases include masking credit card numbers, SSNs, passwords, account IDs, or anything else that should not be broadly visible. When masking data before it gets indexed into Splunk, test the configuration in a dev environment first whenever possible. A great website for building and testing regular expressions is www.regex101.com.

Both approaches depend on the correctness of your regular expression. Splunk looks for strings that match the defined regex pattern, and you then tell Splunk to strip out or replace the matching string, or to replace only part of it. The two methods below do exactly the same thing – match a regex and replace the values – but each does it in a slightly different manner.

In the example data below, I will be masking the account IDs to only show the last four digits of the account ID. There are two ways you can mask data before it gets ingested into Splunk.

Method 1:

Use props.conf and transforms.conf to modify the data so that the first 12 digits of the account ID are replaced with “x” characters.

One sample event:

[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=9999999999999999

When ingested into Splunk using the props.conf and transforms.conf below, the event will be indexed like so:

[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=xxxxxxxxxxxx9999

props.conf

[mysourcetype]

TRANSFORMS-data_mask=data_masking

 

transforms.conf

[data_masking]

SOURCE_KEY=_raw

REGEX=(^.*)(\sAcctID=)\d{12}(\d*)

FORMAT=$1$2xxxxxxxxxxxx$3

DEST_KEY=_raw

The SOURCE_KEY parameter specifies the field in which Splunk searches for the matching data. Splunk attempts to match the regex specified in the REGEX setting. If it matches, Splunk replaces the matching portion with the value from FORMAT and writes the transformed value to the field specified in DEST_KEY (the same field as SOURCE_KEY in this example). The FORMAT value works as follows: each dollar-sign placeholder refers to a capture group. In the example above there are three capture groups: (^.*) is the first capture group, (\sAcctID=) is the second, and (\d*) is the third (included to catch any extra digits, whether or not they exist in the event). Notice that the \d{12} is not wrapped in a capture group – that is exactly the part of the string we want to mask.

The basis of masking your important data is making sure you have created the correct regex. In the example above, the regex spans the entire event; that way, the capture groups bring back the whole event while dropping the data to be masked.
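Before deploying the transform, it can help to sanity-check the regex and the replacement outside of Splunk (regex101.com works, but so does a couple of lines of scripting). Here is a minimal Python sketch, purely illustrative and not part of the Splunk configuration, that applies the same pattern and format string to the sample event:

import re

# Sample event from above and the same pattern used in transforms.conf.
event = "[02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=9999999999999999"
pattern = r"(^.*)(\sAcctID=)\d{12}(\d*)"

# $1$2xxxxxxxxxxxx$3 in FORMAT corresponds to \1\2xxxxxxxxxxxx\3 here.
masked = re.sub(pattern, r"\1\2xxxxxxxxxxxx\3", event)
print(masked)
# [02/Nov/2019:16:05:20] VendorID=9999 Code=D AcctID=xxxxxxxxxxxx9999

If the printed output matches what you expect to see in the index, the REGEX and FORMAT pair is ready to drop into transforms.conf.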

Another way to mask important data before it is ingested into Splunk is to use SEDCMD to replace the sensitive text with x’s, or whatever marker you want to use to show that the data has been masked. Using the same sample event, we will get the same result as above with a different method.

Method 2:

props.conf

[mysourcetype]

SEDCMD-replace=s/AcctID\=\d{12}/AcctID=xxxxxxxxxxxx/g

The above props.conf masks the data as desired. The key here is to make sure that the replacement string (the third segment of the SEDCMD) includes the part you want to keep and omits the part you want to get rid of. With SEDCMD, Splunk replaces whatever the regex in the second segment matches with the replacement string in the third segment.

In conclusion, there are two ways to anonymize data with Splunk Enterprise:

Use the SEDCMD like a sed script to do replacements and substitutions. The sed script method is easier to do, takes less time to configure, and is slightly faster than a transform. But there are limits to how many times you can invoke SEDCMD and what it can do.

Use a regular expression transform (method 1). This method takes longer to configure, but is easier to modify after the initial configuration and can be assigned to multiple data inputs more easily.

Want to learn more about masking important data in your Splunk environment? Contact us today!

Create Splunk Indexes and HEC Inputs with Ansible

By: Brandon Mesa | Splunk Consultant

Managing Splunk .conf files is a day-to-day routine for most, if not all, Splunk admins. As your Splunk environment matures, you’ll find yourself making constant .conf changes to improve operational efficiency. For example, as new data sources are onboarded, new indexes and parsing settings are implemented to maintain efficiency and keep the appropriate data-segregation controls in place. To expose the new data or index, you might also have to create a new role, or manage an existing one, in order to grant the appropriate permissions to a specific set of users. You may also explore alternate data inputs, such as the HTTP Event Collector (HEC).

Manually completing these tasks can become time-consuming and error-prone. While you can’t automate every change on the back end, you can standardize some of the common configuration changes, such as creating a new index, a new role, or a new HEC token. A variety of automation tools can manage your .conf files and reduce the time spent making manual .conf changes. This blog will show you how to use Ansible playbooks to automate common Splunk tasks, including index and HEC input creation.

To keep this blog simple, examples will be applied to a local standalone instance in the $SPLUNK_HOME/etc/system/local path. The location of .conf changes will vary depending on your specific environment.

The following Ansible playbooks are used in this blog:

create_index.yaml

 

create_hec_token.yaml

 

Create an Index

To create a new index with Ansible playbooks, run the following command:

% ansible-playbook create_index.yaml -e '{"index_name":"ansible_index"}'

Shown below, you can see that the new index “ansible_index” has been created in indexes.conf.

 

If you run the playbook again with an index name that already exists, the playbook stops execution and reports it. For example, if we try to create the “ansible_index” index a second time, the playbook skips the change and returns the following message:

“ansible_index – Index string already found in indexes.conf”

 

Take a look at the returned message for the “Confirm if index already exists” task. The playbook reads the indexes.conf file and looks for the index_name variable passed at the time the CLI command is run. If the string is found in the file, the playbook skips the stanza creation.
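To make the idempotency concrete, the logic the playbook performs can be sketched in a few lines of Python. This is only an illustration of the check-and-append approach described above, not the actual Ansible playbook, and the file path and stanza settings are assumptions:

import sys

INDEXES_CONF = "/opt/splunk/etc/system/local/indexes.conf"  # assumed path for this standalone example

def create_index(index_name):
    # Read the existing indexes.conf (treat a missing file as empty).
    try:
        with open(INDEXES_CONF) as f:
            contents = f.read()
    except FileNotFoundError:
        contents = ""

    # Skip stanza creation if the index string is already present.
    if f"[{index_name}]" in contents:
        print(f"{index_name} - Index string already found in indexes.conf")
        return

    # Append a minimal stanza; real path settings should follow your own standards.
    with open(INDEXES_CONF, "a") as f:
        f.write(f"\n[{index_name}]\n")
        f.write(f"homePath = $SPLUNK_DB/{index_name}/db\n")
        f.write(f"coldPath = $SPLUNK_DB/{index_name}/colddb\n")
        f.write(f"thawedPath = $SPLUNK_DB/{index_name}/thaweddb\n")

if __name__ == "__main__":
    create_index(sys.argv[1] if len(sys.argv) > 1 else "ansible_index")

The Ansible version simply expresses the same check and append as tasks, which is what makes the playbook safe to run repeatedly.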

 

Create a HEC Token

We’ve created a new index for all the Ansible-related data. Now let’s create a new HEC input that constrains incoming data to the new index. To create a new HEC token, run the following Ansible playbook:

% ansible-playbook create_hec_token.yaml -e '{"username":"admin","password":"Pa$$w0rd","token_name":"ansible_token","index":"ansible_index","indexes":"ansible_index"}'

Playbook execution will look something like this:

 

Now let’s validate our token has been created:
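One way to confirm the token end to end, in addition to checking inputs.conf, is to post a test event to the new token and make sure it lands in ansible_index. A minimal Python check (the hostname and token value are placeholders, HEC is assumed to be on its default port 8088, and certificate verification is relaxed for a lab instance):

import requests

HEC_URL = "https://localhost:8088/services/collector/event"
HEC_TOKEN = "<token value created by the playbook>"  # placeholder

payload = {
    "event": "hello from ansible_token",
    "sourcetype": "ansible:test",
    "index": "ansible_index",
}

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json=payload,
    verify=False,  # lab instance with a self-signed certificate
)
print(resp.status_code, resp.text)
# A working token returns {"text":"Success","code":0}

A quick search for index=ansible_index should then show the test event.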

 

Automation tools can facilitate day-to-day operations related to your Splunk infrastructure. Not every .conf change will be automated in your environment, as you’ll come across unique use cases that require specific configurations. However, you can automate some of the common manual tasks, like the ones shown above, to reduce time spent and avoid careless mistakes.

Want to learn more about creating Splunk indexes and HEC inputs with Ansible? Contact us today!

 

 

Press Release: TekStream Makes INC. 5000 List for Sixth Consecutive Year

For the 6th Time, Atlanta-based Technology Company Named One of the Fastest-growing Private Companies in America with Three-Year Sales Growth of 131%

Atlanta-based technology company, TekStream Solutions, is excited to announce that for the sixth time in a row, it has made the Inc. 5000 list of the fastest-growing private companies in America. Only 2.15% of companies have made the list six times. This prestigious recognition comes again just nine years after Rob Jansen, Judd Robins, and Mark Gannon left major firms and pursued a dream of creating a strategic offering to provide enterprise technology software, services, solutions, and sourcing. Now, they’re a part of an elite group that, over the years, has included companies such as Microsoft, Timberland, Vizio, Intuit, Chobani, Oracle, and Zappos.com.

“Being included in the Inc. 5000 for the sixth straight year is something we are truly proud of as very few organizations in the history of the Inc. 5000 list since 2007 can sustain the consistent and profitable growth year over year needed to be included in this prestigious group of companies,” said Chief Executive Officer, Rob Jansen. “Continued adoption by our clients for cloud-based technologies, Security, and Big Data solutions to solve complex business problems has been truly exciting. We are helping our clients take advantage of today’s most advanced recruiting and technology solutions to digitally transform their businesses and address the ever-changing market.”

This year’s Inc. 5000 nomination comes after TekStream has seen a three-year growth of over 131%, and 2020 is already on pace to continue this exceptional growth rate even amidst the impact of COVID-19 and the global pandemic.

“The pandemic has moved the demand for digital transformation to cloud technologies from high priority to absolutely critical. Overnight, customers have been forced to establish new channels of remote collaboration just to maintain normal business functions. Preserving revenue streams and managing high operational costs is more important than ever,” said Judd Robins, Executive Vice President. “The economic pause has created a window for companies to take another look at legacy technology debt, evaluate more cost-effective cloud options, and retool their platforms ahead of the rebound. TekStream’s continued growth is being fueled by those efforts and we’re happy to take the lead position with our customers.”

To qualify for the award, companies had to be privately owned, have been established in the first quarter of 2015 or earlier, have experienced two-year sales growth of more than 50 percent, and have garnered revenue between $2 million and $300 million in 2019.

“The prestigious recognition in trying times speaks to our team’s commitment to adapt Recruiting, RPO, and Technology solutions to our client needs and the many relationships we service on both the candidate and client side of our business. As economic conditions change, we look forward to the challenge of raising our level of service to meet the expectations of our internal and consulting staff, as well as positively impacting client and candidate hiring experiences,” said TekStream Executive Vice President of Talent Management and Recruiting Services, Mark Gannon.

TekStream accelerates clients’ digital transformation by navigating complex technology environments with a combination of technical expertise and staffing solutions. We guide clients’ decisions, quickly implement the right technologies with the right people, and keep them running for sustainable growth. Our battle-tested processes and methodology help companies with legacy systems get to the cloud faster, so they can be agile, reduce costs, and improve operational efficiencies. And with 100s of deployments under our belt, we can guarantee on-time and on-budget project delivery. That’s why 97% of clients are repeat customers. For more information visit https://www.tekstream.com/

Auditing Apps for Splunk 8.0

By: Eric Howell | Splunk Consultant

Introduction

The release of Splunk 8.0 marked a pivotal change in the functional workings of Splunk: the platform transitioned from Python 2 to Python 3. This shift came about because Python 2 reached its end of life upstream on January 1, 2020, and is no longer supported. Due to this change, administrators working to maintain a supported, healthy environment will need to perform a comprehensive review of app/Splunk version compatibility and app upgrades in addition to the usual upgrade procedures.

Audit Existing Applications

Compile List of Installed Apps and TAs

In each environment, make an inventory of the deployed Splunk architecture components:

  • Search Heads – taking into account any stand-alone instances that might not be in the primary cluster (example: Enterprise Security)
  • Indexers
  • Monitoring Console
  • Deployer
  • Deployment Server
  • License Master
  • Cluster Master
  • Heavy Forwarders – if applicable

For each of these components, compile a list of the installed and active applications (apps) and Technology Add-Ons (TAs). This can be done by running the following SPL query in the Search UI:

| rest /services/apps/local
| stats count(title) by splunk_server label version eai:acl.app author
| rename label AS App, version AS Version, eai:acl.app AS Base, author AS Author
| table splunk_server, App, Base, Author, Version
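If you would rather collect this inventory outside the Search UI (for example, to feed a spreadsheet), the same information is available from splunkd’s REST API. A hedged Python sketch, assuming the default management port 8089, an admin-level account, and a lab certificate:

import requests

BASE_URL = "https://splunk.example.com:8089"  # placeholder host, default management port
AUTH = ("admin", "changeme")                  # placeholder credentials

resp = requests.get(
    f"{BASE_URL}/services/apps/local",
    params={"output_mode": "json", "count": 0},
    auth=AUTH,
    verify=False,  # only acceptable against lab instances with self-signed certificates
)
resp.raise_for_status()

# Each entry carries the app's label, version, and enabled/disabled state.
for entry in resp.json()["entry"]:
    content = entry["content"]
    print(f"{entry['name']}: label={content.get('label')}, "
          f"version={content.get('version')}, disabled={content.get('disabled')}")

Run this against each instance (or point it at each management URI in turn) to build the per-server app list.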

Review Installed Apps on SplunkBase

The above search provides the information needed to compare the installed version of an app against what is available on Splunkbase, including the app name, author, version, and installation status. Compare this information against each app’s listing on Splunkbase at https://Splunkbase.Splunk.com.

As an example, using the Splunk Add-on for Microsoft Windows:

App: Splunk Add-on for Microsoft Windows
Author: Splunk
Version: 4.8.2
Installed: Yes

Searching this add-on in SplunkBase leads us to the following link: https://splunkbase.splunk.com/app/742/

The SplunkBase page for each app contains information regarding the app version (adjustable via the dropdown indicated in the next image) and the compatible versions of Splunk Enterprise. The Details tab often contains app-specific information (frequently linking to Splunk-supported documentation) and can provide insight into the appropriate upgrade path for the app. These upgrade paths are critical to follow because newer iterations of an app often include major adjustments. If not accounted for, these changes can cause negative impacts such as loss of data, non-functional commands, and new formats for dashboards.

If your environment includes apps that are not found in SplunkBase (most likely custom-built apps), use your best judgment. The Upgrade Readiness App, discussed later in this document, will provide further insight into likely XML- or Python-related complications in any apps that are scanned. As advised later in this document, test these upgrades in a Dev environment before releasing them to Prod; custom apps are perfect candidates for additional testing outside of Prod.

Fig 1. SplunkBase page breakdown

 

Compile the list of installed apps and TAs in the environment and cross-reference them with SplunkBase to provide insight:

  • Does the current version of the App support the version of Splunk you are upgrading to?
  • What is the upgrade path for the App?
  • Does the app still benefit from ongoing development?
  • What Apps can be removed from the environment or will cause conflict once the upgrade has been performed?

Run Splunk Platform Upgrade Readiness App

SplunkBase link: https://SplunkBase.Splunk.com/app/4698/

Running the Upgrade Readiness App will provide further insight into whether your apps are ready for the upgrade to Splunk 8. As Python 2 is no longer supported, continued use of apps reliant on Python 2 can leave your environment vulnerable to intrusion or bad actors. The Upgrade Readiness App will advise which of your apps contain Python files that are neither dual-compatible nor compatible with Python 3, and it will also indicate XML files that use Splunk’s Advanced XML, which has been sunset and replaced by Simple XML. Additional details can be found here:

https://docs.Splunk.com/Documentation/UpgradeReadiness/latest/Use/About

After running the Readiness App, each installed app on the Splunk instance should return a result value, such as Passed, Warning, or Skipped.

Please note that this app will need to be installed on each instance of Splunk for comprehensive review.

Preparing Next Steps

After documenting the findings from the steps above, the path forward should emerge:

  • Identify which apps/TAs can be upgraded and which require an upgrade
  • Identify Apps that are no longer supported
  • Remove and/or disable Apps that are no longer relevant in the environment or will cause issues post-upgrade.
  • Identify Apps that have not been thoroughly documented and will require additional testing (ideally in a Dev environment).

Once the plan has been developed with the steps above, separate the apps by the appropriate configuration management tool/server (Cluster Master, Deployer, etc.).

Performing App Upgrades

The major contributing factor to lost functionality when upgrading to Splunk 8.0+ is apps that rely heavily on Python files that are not dual-compatible with Python 2 and Python 3. This is discussed in greater detail in the documentation linked above.

To maintain a functional, supported version of the Enterprise Security app throughout the upgrade process, it will likely be necessary to upgrade Apps as you upgrade Splunk. Several apps are heavily Python-dependent in their operation and will feature a Python-version change between app versions.

These Python version-specific apps, if they are being leveraged in your environment, should be upgraded during the same scheduled change window as the Splunk Enterprise upgrade to 8.0+. Otherwise, they will cease to function correctly (or at all) due to their reliance on Python 3. The apps that require this specific process are listed here:

  • Splunk Enterprise Security App ver 6.1
  • Splunk Machine Learning Toolkit ver 5.0
  • Deep Learning Toolkit ver 3.0
  • Python for Scientific Computing ver 2.0

 

Want to learn more about auditing apps in Splunk 8.0? Contact us today!

How to Connect AWS and Splunk to Ingest Log Data

By: Don Arnold | Splunk Consultant

 

Though a number of cloud solutions have popped up over the past 10 years, Amazon Web Services, better known simply as AWS, seems to be taking the lead in cloud infrastructure. Companies that use AWS have either migrated their entire infrastructure or run on-premises systems alongside some AWS services in a hybrid solution. Either way, the AWS environment sits within the security boundary, should be part of the System Security Plan (SSP), and needs to be covered by Continuous Monitoring, which is a requirement in most security frameworks. Splunk meets the Continuous Monitoring requirements, including for instances and services within AWS.

Data push

There are two separate ways to get data from AWS into Splunk. The first is to “push” data from AWS to Splunk using Kinesis Firehose. This requires IP connectivity between AWS and a Splunk Heavy Forwarder, an HTTP Event Collector (HEC) token, and the “Splunk Add-on for Amazon Kinesis Firehose” from Splunkbase.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from AWS to the Splunk Heavy Forwarder over HTTPs.
  2. Go to Splunkbase (https://splunkbase.splunk.com) and download/install the “Splunk Add-on for Amazon Kinesis Firehose”, then restart the Splunk Heavy Forwarder.
  3. Create an HTTP Event Collector token:
    1. Go to Settings > Data Inputs > HTTP Event Collector
    2. Select New Token
    3. Enter a name for your token. Example:  “AWS”.  Select Next
    4. For Source type, click Select > Structured and choose “aws:firehose:json”. For App Context choose “Add-on for Kinesis Firehose”. Select Review
    5. Verify the settings and select Submit.
    6. Go back to Settings > Data Inputs > HTTP Event Collector and select Global Settings
    7. For “All Tokens” select Enabled, ensure “Enable SSL” is selected, and the “HTTP port number” is set to 8088. Select Save.
    8. Copy the “Token Value” for setup in AWS Kinesis Firehose.

AWS Kinesis Firehose Setup

  1. Log in to AWS and go to the Kinesis service and select the “Get Started” button.
  2. On the top right you will see “Deliver Streaming data with Kinesis Firehose Delivery Streams.” Select the “Create Delivery Stream” button.
  3. Give your delivery stream a name. Under Source, choose “Direct PUT or other sources”. Select the “Next” button.
  4. Select “Disabled” for both Data transformation and Record format conversion.
  5. For Destination, select “Splunk”. For Splunk cluster endpoint, enter the URL of your Splunk Heavy Forwarder with port 8088. For Splunk endpoint type, select “Raw endpoint”. For Authentication token, enter the Splunk HTTP Event Collector token value created in the Splunk Heavy Forwarder setup.
  6. For S3 backup, select an S3 bucket. If one does not exist, you can create one by selecting “Create New”. Select Next.
  7. Scroll down to Permissions and click the “Create new or choose” button. Choose an existing IAM role or create one. Click Allow to return to the previous menu. Select Next.
  8. Review the settings and select Create Delivery Stream.
  9. You will see a message stating “Successfully created delivery stream…”.

Test the Connection

  1. It is recommended that test data be used to verify the new connection by choosing the delivery stream and selecting “Test with Demo Data”. Go to step 2 and select “Start sending demo data”.  You will see the delivery stream sending demo data to Splunk.
  2. Log into Splunk and enter index=main sourcetype=aws:firehose:json to verify events are streaming into Splunk.
  3. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow AWS HTTPs events through to the Splunk Heavy Forwarder.

Send Production Data

  1. Go to AWS Kinesis and select the delivery stream you set up. The status for the delivery stream should display “Active”.
  2. Go to Splunk and verify events are ingesting: index=main sourcetype=aws:firehose:json. Also verify that the timestamps on the events are correct.

Data pull

The second way to get data into Splunk from AWS is to have Splunk “pull” data via a REST API call.

AWS Prerequisites Setup

  1. There are AWS service prerequisites that must be set up prior to performing REST API calls from the Splunk Heavy Forwarder. The prerequisites can be found in this document: https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS
  2. Ensure all prerequisites are configured in AWS prior to configuring the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from the Splunk Heavy Forwarder to AWS.
  2. Go to Splunkbase (https://splunkbase.splunk.com) and install the “Splunk Add-on for AWS”, then restart the Splunk Heavy Forwarder.
  3. Launch the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.
  4. Go to the Configurations tab.
    1. Account tab: Select Add. Give the connection a name, enter the Key ID and Secret Key from the AWS IAM user account and select Add.

(To get the Key ID and Secret Key, go to AWS IAM > Access management > Users > (select user) > Security credentials > Create access key > Access Key ID and Secret Access key)

    2. IAM Role tab: Select Add. Give the Role a name, enter the Role ARN, and select Add.

(To get the Role ARN, go to AWS IAM > Access management > Roles > (select role).  At the top you will see the Role ARN)

  5. Go to the Inputs tab. Select Create New Input and choose the type of AWS data input to ingest. Each selection is different, and all of them use the account and role created in the previous steps. Go through the setup, select the AWS region, source type, and index, and select Save.

Test the Connection

  1. Log into Splunk and enter index=main sourcetype=aws* to verify events are streaming into Splunk. Verify the sourcetype matches the one you selected in the input.
  2. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow AWS HTTPs events through to the Splunk Heavy Forwarder.

With the popularity of AWS, more environments are starting to host hybrid solutions for a myriad of reasons. With that, using Splunk to maintain Continuous Monitoring over the expanded, cloud-inclusive security boundary is easily achieved with either of these two approaches. TekStream Solutions has Splunk and AWS engineers on staff with years of experience who can assist you in connecting your AWS environment to Splunk.

References

https://docs.splunk.com/Documentation/AddOns/released/Firehose/About

https://docs.splunk.com/Documentation/AddOns/released/Firehose/ConfigureFirehose

https://docs.splunk.com/Documentation/AddOns/released/AWS/Description

https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS

 

Want to learn more about connecting AWS and Splunk to ingest log data? Contact us today!

Splunk, AWS, and the Battle for Burst Balance

By: Karl Cepull | Senior Director, Operational Intelligence

 

Splunk and AWS: two of the most widely adopted tools of our time. Splunk provides fantastic insight into your company’s data at an incredible pace, and AWS offers an affordable alternative to on-premises or even other cloud environments. Together, these tools make one of the best combinations for showing the value in your data. But there are many systems that need to work together to make all of this happen.

In AWS, you have multiple storage options available for your Splunk servers through the Elastic Block Store (EBS) offering. There are multiple volume types that you can use – e.g., “io1”, “gp2”, and others. The “gp2” volume type is perhaps the most common one, particularly because it is usually the cheapest. However, when using this volume type, you need to be aware of Burst Balance.

Burst Balance can be a wonderful system. At its core, Burst Balance allows your volume’s disk IOPS to burst higher when needed, without you paying for guaranteed IOPS all of the time (as you do with the “io1” volume type). What are IOPS? The term stands for Input/Output Operations Per Second and represents the number of reads and writes that can occur over time. Allowing the IOPS to burst can come in handy when there is a spike in traffic to your Splunk Heavy Forwarder or Indexer, for example. However, this system has a downside that can actually cause the volume to stop completely!

The way Burst Balance works is on a ‘credit’ system. Every second, the volume earns 3 ‘credits’ for every GB of configured size. For example, if the volume is 100GB, you would earn 300 credits every second. These credits are then used for reads and writes – 1 credit for each read or write. When the volume isn’t being used heavily, it will store up these credits (up to a cap of 5.4 million), and when the volume gets a spike of traffic, the credits are then used to handle the spike.

However, if your volume is constantly busy, or sees a lot of frequent spikes, you may not earn credits at a quick enough rate to keep up with the number of reads and writes. Using our above example, if you had an average of more than 300 reads and writes per second, you wouldn’t earn credits fast enough to keep up. What happens when you run out of credits? The volume stops. Period. No reads or writes occur until you earn more credits (again 3/GB/sec). So, all you can do is wait. That can be a very bad thing, so it is something you need to avoid!
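To make the arithmetic concrete, here is a small sketch that applies the credit model described above (3 credits earned per GB per second, 1 credit spent per read or write, a bucket capped at 5.4 million credits) to a steady workload. The volume size and IOPS figures are purely illustrative:

def seconds_until_empty(volume_gb, sustained_iops, starting_credits=5_400_000):
    """Roughly how long a gp2 volume can sustain a given IOPS load
    before its burst credits run out, using the model described above."""
    earn_rate = 3 * volume_gb   # credits earned per second
    burn_rate = sustained_iops  # credits spent per second
    if burn_rate <= earn_rate:
        return None             # the baseline keeps up indefinitely
    return starting_credits / (burn_rate - earn_rate)

# A 100 GB volume earns 300 credits per second; a sustained 500 IOPS workload
# drains a full credit bucket in roughly 7.5 hours.
print(seconds_until_empty(100, 500) / 3600)  # ~7.5
print(seconds_until_empty(100, 250))         # None - baseline covers the load

In other words, the smaller the volume and the spikier the indexing load, the faster the bucket drains.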

The good news is that AWS has tools you can use to monitor your Burst Balance and alert you if it gets low. You can use CloudWatch to track the Burst Balance percentage and set up an alarm on it. One way to view the Burst Balance percentage is to click on the volume in the AWS console, go to the Monitoring tab, and open the Burst Balance Percentage metric in a bigger view:

As you can see in the above example, the Burst Balance has been at 100% for most of the last 24 hours, with the exception of around 9pm on 3/19, when it briefly dropped to about 95% before returning to 100%. You can also set up an alarm to alert you if the Burst Balance percentage drops below a certain threshold.
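If you prefer to script the alarm rather than click through the console, a boto3 sketch along the following lines can create it. The volume ID, SNS topic ARN, region, and 20% threshold are placeholders; BurstBalance is the metric name in the AWS/EBS namespace:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

cloudwatch.put_metric_alarm(
    AlarmName="ebs-burst-balance-low-vol-0123456789abcdef0",
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
    Statistic="Average",
    Period=300,                 # evaluate 5-minute averages
    EvaluationPeriods=1,
    Threshold=20.0,             # alert when the balance drops below 20%
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:splunk-alerts"],  # placeholder SNS topic
    AlarmDescription="EBS burst balance is running low on a Splunk volume",
)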

So, what can you do if the Burst Balance is constantly dipping dangerously low (or running out!)? There are three main solutions:

  1. You can switch to another volume type that doesn’t use the Burst Balance mechanism, such as the “io1” volume type. That volume type has guaranteed, consistent IOPS, so you don’t need to worry about “running out”. However, it is around twice the cost of the “gp2” volume type, so your storage costs could double.
  2. Since the rate that you earn Burst Balance credits is based on the size of the volume (3 credits/GB/second), if you increase the size of the volume, you will earn credits faster. For example, if you increase the size of the volume by 20%, you will earn credits 20% faster. If you are coming up short, but only by a little, this may be the easiest/most cost-effective option, even if you don’t actually need the additional storage space.
  3. You can modify your volume usage patterns to either reduce the number of reads and writes, or perhaps reduce the spikes and spread out the traffic more evenly throughout the day. That way, you have a better chance that you will have enough credits when needed. This may not be an easy thing to do, however.

In summary, AWS’s Burst Balance mechanism is a very creative and useful way to give you performance when you need it, without having to pay for it when you don’t. However, if you are not aware of how it works and how it could impact your environment, it can suddenly become a crippling issue. It pays to understand how this works, how to monitor and alert on it, and options to avoid the problem. This will help to ensure your Splunk environment stays running even in peak periods.

Want to learn more? Contact us today!

Textract – The Key to Better Solutions

By: Troy Allen | Vice President of Emerging Technologies

 

Businesses thrive on information, but good data can be difficult to collect, sort, and utilize due to the vast variety of sources and formats in which information is created and disseminated. As organizations are inundated with documents, forms, data streams, and more, it becomes harder to extract meaningful information efficiently and funnel it into the systems that need it, or to present it in a fashion that drives better business decisions. Textract, part of AWS’s ever-growing set of Machine Learning services, can play a critical part in how businesses process documents and collect vital data for use in their critical solutions and operations.

While Optical Character Recognition (OCR) has been around for many years, many organizations tend to overlook its strengths and ability to improve data processing.  Textract, while it does provide OCR functionality as a Cloud-based service, is much more thorough in its ability to bring Machine Learning based models to your business applications.  In order for data to be useful, it must first be collected; Textract provides OCR capabilities to ensure text is recognized from paper-scanned documents to electronic forms.

For data to be really useful, it needs to have organization and structure; Textract provides the ability to automatically detect content layout and recognize key elements and the relationship of the text and the elements it discovers.  And finally, for data to not only be useful, but actually utilized, it needs to be accessed; Textract can easily share the data, in its context, with other applications and data stores through well-formatted data streams to applications, databases, and other services.  Textract is designed to collect and filter data from documents and files so that you don’t have to.  Solutions utilizing Textract naturally benefit from an automated flow of information from capture to storage, to retrieval.

Textract is more than just OCR

In 1914, Emanuel Goldberg developed a machine that could read characters and convert them into telegraph code. Goldberg also applied for a patent in 1927 for his “Statistical Machine,” which was designed to retrieve individual records from spools of microfilm by using a movie projector and a photoelectric cell to perform pattern recognition and find the right record. In many ways, Goldberg’s inventions are credited as the beginning of Optical Character Recognition (OCR) technology. Over the following 92 years, OCR became one of the most critical, if least heard of, elements in building business solutions.

OCR moved beyond the business world to enable sight-impaired people to read printed materials. Ray Kurzweil and the National Federation of the Blind announced a new product in 1976, based on newly developed charge-coupled device (CCD) flatbed scanners and text-to-speech synthesizers, which fundamentally changed the way we work with information. It was no longer about reports, statistics, or data; it was about sharing information with anyone, in a format that was easily accessible. By 1978, OCR had moved into the digital world as a computer program.

Like all new technologies, OCR has had its issues and limitations.  In the beginning, text had to be very clear and created only in certain fonts to be recognized.  Scan quality of physical pages also plays a major factor in how well OCR engines extract the text from pages and in most cases, only a portion of the text is captured on poor scans.  Even today, with so many advancements in OCR, there are challenges to accurately collecting and organizing data from images.

Most OCR engines collect all the text from documents and make the words available for search engines, but very few OCR engines take it any further without requiring additional tools and applications.  Textract by Amazon Web Services goes beyond OCR by not only collecting the content but understanding where the content came from.

Textract not only performs standard character recognition but is also designed to understand formatting and how content is laid out on a page. It accomplishes this by recognizing and creating bounding boxes around key information and text areas, which supports text, table, and form extraction.

Item Location on a Document Page

The example image displays content that is separated by columns and has header information.

Figure 1 – Two Column Document Example

Most OCR applications will collect all the words on the page, but do not provide a reference to lines of text or location.  Amazon’s Textract retrieves multiple blocks of information from each page of the image it investigates:

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page

As the following illustration demonstrates, Textract is able to identify that there are two columns of information on the page.  It then recognizes that for each column, there are multiple lines of text which are made up of multiple words.

Figure 2 – Textract Line and Word Recognition

Textract outputs its findings as standard JSON so that the results can be utilized easily by other services or applications. The example above would be represented in the JSON as follows:

Figure 3 – Sample JSON

Table Extraction

Amazon’s Textract is well equipped to locate table data within documents as well.  Textract recognizes the table construct and can establish key-value pairs with the cells by referencing the row and column information.  The following table represents 20 distinct cells, including the header row that will be evaluated by Textract:

Figure 4 – Sample table data

The output JSON from the Textract service creates a mapping between the rows and columns and intelligently identifies the key-value pairs in the table. This recognition can also be performed against vertical table data as well as horizontal tables. The following illustrates the key-value pair matching:

Figure 5 – Table Key-Value Pair

In addition to detecting text, Textract has the ability to recognize selection elements such as checkboxes and radio buttons. An empty checkbox or radio button is represented with a status of NOT_SELECTED, whereas a checked or filled one is represented as SELECTED, and either can be tied to a key-value pair as well. This can be extremely helpful in finding values in both tables and forms.

 

Form Extraction

Businesses have been interacting with their clients and vendors for decades through forms. Textract provides the ability to read form data and clearly define key-value pairs of information from it. Many organizations struggle with the fact that forms change over time, and it can be difficult to train tools to find data when those tools were built for one particular form layout. Textract removes that complexity by reading the actual text rather than a fixed location on the form, and by analyzing documents and forms for relationships between the detected text.

Figure 6 – Sample form image

In the example above, Textract will create the following Key-value pairs:

Traditional OCR tools will extract all the available text from an image or document, but gathering key-value pairs from forms, recognizing text as words and lines, and understanding how content is blocked out on the page all require additional tools. Textract does all of this for you, providing data that can then be further analyzed as needed.
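To give a sense of what this looks like in practice, the sketch below calls Textract’s AnalyzeDocument API with the FORMS feature and walks the returned KEY_VALUE_SET blocks to print the detected key-value pairs. Treat it as a minimal illustration rather than a production integration; the file name is a placeholder and error handling is omitted:

import boto3

def extract_key_values(document_bytes):
    """Call Textract's AnalyzeDocument API with FORMS analysis and return
    the detected key-value pairs as a dictionary of strings."""
    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS"],
    )
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def text_of(block):
        # Concatenate the child WORD blocks (and selection statuses) of a block.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = blocks[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
                    elif child["BlockType"] == "SELECTION_ELEMENT":
                        words.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
        return " ".join(words)

    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = text_of(block)
            # Follow the VALUE relationship to the paired VALUE block.
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        pairs[key_text] = text_of(blocks[value_id])
    return pairs

if __name__ == "__main__":
    with open("form.png", "rb") as f:  # placeholder file name
        for key, value in extract_key_values(f.read()).items():
            print(f"{key} -> {value}")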

Textract Considerations

Textract is specifically designed to perform OCR against image-based files such as the JPG, PNG, and PDF formats.  Most text-based document formats created electronically today do not require additional OCR, since they already embed an index that is accessible to search engines.  With the proliferation of mobile device and tablet use, however, many images are still created with no inherent index available.  We use our phones to take pictures of everything, including people, scenes, receipts, presentations, and much more.  It is quick and easy to capture the world around us, but it is more difficult to have a computer application capture the important information held in those photographs.  Textract extracts the data from those images so that you don’t have to do it by hand.

As with all technologies, there are limits to what Textract can do and should be recognized before introducing it into a solution.  AWS maintains detailed information about the Amazon Textract service and its limitations and can be found here, https://docs.aws.amazon.com/textract/latest/dg/limits.html.

Putting Textract to Work

While OCR is important and can be a critical part of any business process, it is an engine that retrieves information from sources that could not be accessed except through human intervention.  In many ways, it is like an important element within a car’s engine.  A fuel injector is critical for a car to run, but may not have much value as an entity unto itself.  It’s when you bring various parts together that your car takes you where you need to go or your application drives your business.

To create a basic OCR application with Textract, you will need:

  • A place to store the images that need to be processed; in many situations this may be Amazon S3 (Simple Storage Service), Amazon WorkDocs (a secure content creation, storage, and collaboration service), or even a relational database like Amazon Aurora.
  • An application or service to call the Textract services. Many organizations are creating Cloud-first applications and may choose to use AWS Lambda to run their code without having to worry about the servers where the code runs.
  • A place to store the results of the Textract services. The options are nearly limitless: the text and details uncovered by Textract could be stored back into an Amazon S3 bucket, a database like Aurora, or even a data warehouse like Amazon Redshift.
  • And finally, you need to do something with the information you have collected. This all depends on what your goals are for the information, but at a minimum, most people want to search for information.  Using Amazon Elasticsearch Service is an easy way to allow people to find the new information Textract was able to gather for you.

The following outlines this simple Textract solution:

Figure 7 – Simple Textract solution with Amazon Elasticsearch Service
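As a concrete starting point, the following sketch shows the kind of Lambda-style glue code that sits between S3 and Textract in Figure 7: read an uploaded image from S3, run Textract’s DetectDocumentText API, and hand back the detected lines. Indexing the lines into Amazon Elasticsearch Service (or writing them to a database) is left out, and the bucket and key come from the S3 event notification:

import boto3

textract = boto3.client("textract")

def handler(event, context):
    # Triggered by an S3 upload; pull the bucket and key from the event notification.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Textract can read the object directly from S3 for supported image formats.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Keep just the LINE blocks; WORD blocks and bounding-box geometry are also in the response.
    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]

    # In the full solution these lines would be indexed (for example, into
    # Amazon Elasticsearch Service) or stored in a database for later search.
    return {"bucket": bucket, "key": key, "lines": lines}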

Practical Applications for Textract

While being able to search for information that was extracted from images is useful, it isn’t all that compelling from a business perspective.  Information needs to be meaningful and applied to a task so that its value can be recognized.  The following examples illustrate common business processes and the role that Textract can play in them.

Human Resource Document Management

Every organization has employees and/or volunteers to support their efforts.  There are many state, county, and country regulations that drive what information we need to keep about our employees as well as operational documents about the employees that help us to keep our businesses running.  The following are some examples of common documents that most organizations need to collect and retain:

  • Employment applications
  • Employee resumes
  • Interview notes, references, and background information
  • Employee offer letters
  • Benefit elections
  • Employee appraisals
  • Wage garnishments
  • State and Federal Employee documentation
  • Employee disciplinary actions
  • Termination decisions and disclosures
  • Promotion recommendations
  • Employee complaints and investigations
  • Leave request documentation

While there are many applications and services available on the market today which will help organizations capture, index, and retain this information, they can sometimes be costly and may not be able to completely capture information held in non-text-based file formats.  As discussed earlier, more and more people are using mobile and tablet technologies because of their accessibility and ease of use.  In many cases, an employee may use their phone to take a picture of a signed employee document and send it in to the company.  This photographed document can cause issues in capturing the information in it, or even classifying it properly in an automated fashion.  This is where Textract can easily be integrated into an existing solution, or incorporated as part of a newly constructed solution, to ensure vital information isn’t missed.

The following illustrates how a solution designed for the Cloud-based on Amazon services can facilitate common Human Resource document management activities:

 

Amazon Services Utilized:

Amazon S3|Amazon WorkDocs|AWS Lambda|Amazon Textract|Amazon API Gateway

Non-Amazon Application examples:

Workday|Oracle Human Capital Management|Oracle PeopleSoft

 

In this example, a newly hired employee is granted access to the company’s Amazon WorkDocs environment to upload documentation that will be required during the hiring process.  While most of the documents being uploaded will be easily indexed and searchable through the Amazon WorkDocs service, the employee has been asked to upload a copy of their driver’s license.  The employee uses the Amazon WorkDocs mobile application on their phone to take a picture of their driver’s license and upload it to the appropriate folder.  Behind the scenes, the company has configured a workflow in Amazon WorkDocs to inform HR managers when new documents have been submitted, and a Human Resources representative reviews the uploaded driver’s license.  The HR representative launches an action in Amazon WorkDocs (a special feature provided by the company’s IT department) that triggers an operation running on AWS Lambda, which initiates Textract to capture OCR and key-value information from the driver’s license.  That information is then sent to the company’s ERP system (such as Workday, Oracle Human Capital Management, Oracle PeopleSoft, or another similar application) along with the Amazon WorkDocs reference to where the actual image is stored.

This illustrates a very simple method to directly engage with employees to capture critical HR information through a combination of out-of-the-box Amazon services and some light-weight customizations to create a streamlined process for document storage and data capture.  It only took the employee a few seconds to take the picture of the driver’s license and upload it and the HR representative a few seconds to review and process the new document.  In fact, the solution could be configured to automatically extract the required details and send it to the ERP without even having to have the HR representative involved for a truly automated solution.  Imagine each new hire having ten to twenty documents they need to upload and how much time HR spends processing each document manually for every new employee.  Automating this process can amount to several hours a month of time savings, especially when dealing with non-text-based file formats that require someone to manually read the documents to key in the information contained in them.  By introducing Amazon Textract into the overall solution, data can be collected, stored, processed, and shared easily and more efficiently.

Business Document Processing and Information Automation

While the Human Resource Document Management example above focused on capturing documents individually as they come in, there are many situations where companies need to process documents in bulk.  Using similar AWS services as the previous example, solutions can be designed to allow for batch uploading of documents for processing.  As an example, procurement procedures for large purchases can incorporate a wide variety of documentation which may have vastly different processes associated with them.  By providing a simple way for files to be uploaded in bulk, AWS services can be utilized to sort through the file formats for processing.  Non-text-based image files like JPG, PNG, and PDF files can then be automatically processed by Amazon Textract to capture OCR information, Table data, and Key-value information from forms and then shared with back-office applications, stored in data warehouse services, and/or shared with Amazon Elasticsearch services.  Processing hundreds or even thousands of documents and images a month becomes much easier through automation.  Incorporating Textract into business process work streams ensures that critical information is identified and captured from structured and semi-structured documents reducing the need for manual classification of information to facilitate business operations like Insurance Claims, Legal Processes, Partner Management, Purchasing, and more.

Litigation is disruptive to normal business operations for any company.  Thousands of documents, images, and artifacts have to be reviewed and collected to share with attorneys and the courts during a legal process, which can be time-consuming.  While there are many discovery tools available on the market today to help speed up the process of finding the desired information, they rely on the information being in a format that the discovery services can handle.  In many cases, important information is stored in pictures and scanned documents that these discovery services cannot easily process.  Amazon’s Textract becomes a valuable tool in the discovery process by allowing organizations to quickly filter through image files and capture OCR information so that it can be indexed and searched.

Litigation isn’t only a headache for companies; it is a headache for the legal teams associated with the litigation process as well.  Imagine a law firm receiving millions of electronic files from a company and having to read through each document to find pertinent information regarding the case they are working on.  This can take months and many resources, time that most lawyers don’t have during a case.  Files may be images, documents, spreadsheets, audio files, and even video files.  All of these need to be processed so that key information can be selected to support the case.  The expense of a large legal process can be staggering due to the sheer amount of manual labor required to gather information.  In the following example, Amazon’s Artificial Intelligence and Machine Learning services, including Textract, are utilized to greatly reduce the processing time for legal discovery.

Amazon Services Utilized:

AWS Transfer for SFTP|Amazon S3|AWS Lambda|Amazon Textract|Amazon API Gateway|Amazon Rekognition|Amazon Comprehend|Amazon Transcribe|Amazon Elasticsearch Service

In this example, a legal firm uses the power of AWS Transfer for SFTP to allow clients and opposing counsel to quickly upload all of their discovery files and documents, which are then automatically stored in Amazon S3.  Files are then sorted by file type for processing.  Amazon Textract captures OCR information from image files, including table and form data, while Amazon Rekognition analyzes photos and videos to identify objects, people, text, scenes, and activities, perform facial recognition, and detect any inappropriate content.  Audio and video files are processed through Amazon Transcribe to capture speech-to-text information.  As files are processed, the information is captured and indexed in Amazon Elasticsearch Service to enable rich search functionality for the litigators, and it is also processed by Amazon Comprehend to quickly find relationships and insights across all the data collected.

What would have taken months to sort through and comprehend becomes manageable information in hours or days providing more time for the legal team to focus on winning their case while saving thousands of dollars on the personnel required to manually process all the discovery information.

The tool you didn’t know you needed

Technology is advancing at incredible speeds and new solutions and services are becoming available every day.  Services like Amazon Textract are critical tools in document processing and are rarely thought about but imperative for success.  Of all the services Amazon provides, Amazon Textract is one of the hidden gems that can be easily overlooked but deserves to be part of your processing arsenal.

You are not alone

Business solutions can be complex, but making them work for your requirements doesn’t have to be.  Clearly defining your goals and objectives is half of the battle; the other half is knowing what tools will help you achieve those goals.  Are your off-the-shelf solutions and applications collecting all the information you have?  Do you need a business solution to manage all of your documents and data, but don’t know where to start?  Are you looking to move off of an outdated legacy application that no longer supports your business direction?  You are not alone.  Thousands of companies are facing the same questions and are finding the best answers by engaging with experts from Amazon and from solution service providers.  TekStream Solutions, along with AWS, is excited to speak with you about your Information Processing needs and how the right tools and solutions can have a positive impact on how you conduct business.  TekStream Solutions is offering a free Digital Transformation assessment where we will work with you to identify your document processing needs and provide process and technology recommendations to help you transform your business with ease.

 

Want to learn more about Textract? Contact us today!