How to Connect AWS and Splunk to Ingest Log Data

By: Don Arnold | Splunk Consultant

 

Though a number of cloud solutions have emerged over the past 10 years, Amazon Web Services (AWS) seems to be taking the lead in cloud infrastructure. Companies that use AWS have either migrated their entire infrastructure or run on-premises systems alongside some AWS services in a hybrid solution. In either case, the AWS environment falls within the security boundary, so it should be part of the System Security Plan (SSP) and covered by Continuous Monitoring, which most security frameworks require. Splunk can meet the Continuous Monitoring requirement, including for instances and services within AWS.

Data push

There are two ways to get data from AWS into Splunk. The first is to “push” data from AWS to Splunk using “Kinesis Firehose”. This requires IP connectivity between AWS and a Splunk Heavy Forwarder, an HTTP Event Collector token, and the “Splunk Add-on for Amazon Kinesis Firehose” from Splunkbase.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from AWS to the Splunk Heavy Forwarder over HTTPS.
  2. Go to Splunkbase.com and download/install the “Splunk Add-on for Amazon Kinesis Firehose” – Restart the Splunk Heavy Forwarder
  3. Create an HTTP Event Collector token:
    1. Go to Settings > Data Inputs > HTTP Event Collector
    2. Select New Token
    3. Enter a name for your token. Example:  “AWS”.  Select Next
    4. For Source type, click Select > Structured and choose “aws:firehose:json”. For App Context choose “Add-on for Kinesis Firehose”. Select Review
    5. Verify the settings and select
    6. Go back to Settings > Data Inputs > HTTP Event Collector and select Global Settings
    7. For “All Tokens” select Enabled, ensure “Enable SSL” is selected, and the “HTTP port number” is set to 8088. Select Save.
    8. Copy the “Token Value” for setup in AWS Kinesis Firehose.
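
Before moving on to AWS, it can help to confirm that the new token is accepted. The following is a minimal sketch, not part of the add-on, that builds a test event for the HTTP Event Collector using only the Python standard library. The host name and token in the usage note are placeholders, and the default port and source type are the ones chosen in the steps above.

```python
import json
import urllib.request


def build_hec_request(host, token, event,
                      sourcetype="aws:firehose:json", port=8088):
    """Build (but do not send) an HTTPS POST to the HEC event endpoint."""
    url = f"https://{host}:{port}/services/collector/event"
    body = json.dumps({"event": event, "sourcetype": sourcetype}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            # HEC expects the token in a "Splunk <token>" Authorization header.
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

For example, `send(build_hec_request("hf.example.com", "<token value>", {"message": "HEC smoke test"}))` would post one event to a hypothetical forwarder at hf.example.com. If the forwarder presents a self-signed certificate, the call will fail certificate verification.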

AWS Kinesis Firehose Setup

  1. Log in to AWS and go to the Kinesis service and select the “Get Started” button.
  2. On the top right you will see “Deliver Streaming data with Kinesis Firehose Delivery Streams.” Select the “Create Delivery Stream” button.
  3. Give your delivery stream a name. Under Source, choose “Direct PUT or other sources”. Select the “Next” button.
  4. Select “Disabled” for both Data transformation and Record format conversion.
  5. For Destination select “Splunk”. For Splunk cluster endpoint, enter the URL of your Splunk Heavy Forwarder, including port 8088. For Splunk endpoint type select “Raw endpoint”. For Authentication token, enter the Splunk HTTP Event Collector token value created in the Splunk Heavy Forwarder setup.
  6. For S3 backup select an S3 bucket. If one does not exist you can create one by selecting “Create New”. Select Next.
  7. Scroll down to Permissions and click “Create new or choose” button. Choose an existing IAM role or create one.  Click Allow to return to the previous menu.  Select Next.
  8. Review the settings and select Create Delivery Stream.
  9. You will see a message stating “Successfully created delivery stream…”.
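
The same delivery stream can also be created through the API rather than the console. The sketch below assumes the third-party boto3 SDK is installed and that AWS credentials are already configured. It builds the arguments for the Firehose `create_delivery_stream` call with a Splunk destination. The stream name, endpoint, token and ARNs are placeholders, and backing up only failed events to S3 is an assumption rather than something the steps above specify.

```python
def splunk_delivery_stream_args(name, hec_endpoint, hec_token,
                                role_arn, bucket_arn):
    """Arguments for firehose.create_delivery_stream with a Splunk destination."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",        # "Direct PUT or other sources"
        "SplunkDestinationConfiguration": {
            "HECEndpoint": hec_endpoint,          # Heavy Forwarder URL with port 8088
            "HECEndpointType": "Raw",             # "Raw endpoint" in the console
            "HECToken": hec_token,
            "S3BackupMode": "FailedEventsOnly",   # assumption; "AllEvents" also exists
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": bucket_arn,
            },
        },
    }


def create_stream(args):
    """Create the stream; requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the argument builder works without boto3
    return boto3.client("firehose").create_delivery_stream(**args)
```

Keeping the argument builder separate from the API call makes it possible to review or version-control the configuration before anything is created in AWS.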

Test the Connection

  1. It is recommended that test data be used to verify the new connection by choosing the delivery stream and selecting “Test with Demo Data”. Go to step 2 and select “Start sending demo data”.  You will see the delivery stream sending demo data to Splunk.
  2. Log into Splunk and enter index=main sourcetype=aws:firehose:json to verify events are streaming into Splunk.
  3. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow AWS HTTPS events through to the Splunk Heavy Forwarder.

Send Production Data

  1. Go to AWS Kinesis and select the delivery stream you set up. The status for the delivery stream should display “Active”.
  2. Go to Splunk and verify events are ingesting with index=main sourcetype=aws:firehose:json, and verify that the timestamps on the events are correct.

Data pull

The second way to get data into Splunk from AWS is to have Splunk “pull” data via a REST API call.

AWS Prerequisites Setup

  1. There are AWS service prerequisites that require set up prior to performing REST API calls from the Splunk Heavy Forwarder. The prerequisites can be found in this document:  https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS
  2. Ensure all prerequisites are configured in AWS prior to configuring the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from the Splunk Heavy Forwarder to AWS.
  2. Go to Splunkbase.com and install the “Splunk Add-on for AWS” – Restart the Splunk Heavy Forwarder.
  3. Launch the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.
  4. Go to the Configurations tab.
    1. Account tab: Select Add. Give the connection a name, enter the Key ID and Secret Key from the AWS IAM user account and select Add.

(To get the Key ID and Secret Key, go to AWS IAM > Access management > Users > (select user) > Security credentials > Create access key > Access Key ID and Secret Access key)

    2. IAM Role tab: Select Add.  Give the Role a name, enter the Role ARN and select Add.

(To get the Role ARN, go to AWS IAM > Access management > Roles > (select role).  At the top you will see the Role ARN)

  5. Go to the Inputs tab. Select Create New Input and select the type of data input from AWS to ingest.  Each selection is different, and all will use the User and Role created in the previous step.  Go through the setup, select the AWS region, source type, and index, and select Save.
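
A mistyped Role ARN is an easy mistake to make when copying values between consoles. The following is a small, hypothetical helper (not part of the Splunk Add-on for AWS) that checks whether a string has the general shape of an IAM role ARN before it is pasted into the IAM Role tab. The partitions it accepts and the sample account number are assumptions for illustration.

```python
import re

# arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
_ROLE_ARN = re.compile(
    r"^arn:(aws|aws-us-gov|aws-cn):iam::(\d{12}):role/([\w+=,.@/-]+)$"
)


def parse_role_arn(arn):
    """Split an IAM role ARN into its parts, or raise ValueError."""
    match = _ROLE_ARN.match(arn.strip())
    if not match:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    partition, account_id, path_and_name = match.groups()
    return {
        "partition": partition,
        "account_id": account_id,
        # The role name is the last path segment.
        "role": path_and_name.rsplit("/", 1)[-1],
    }
```

This only validates the format. It does not confirm that the role exists or that the IAM user is allowed to assume it.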

Test the Connection

  1. Log into Splunk and enter index=main sourcetype=aws* to verify events are streaming into Splunk. Verify the sourcetype matches the one you selected in the input.
  2. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow the Splunk Heavy Forwarder to reach AWS over HTTPS.

With the popularity of AWS, more environments are hosting hybrid solutions for a myriad of reasons. With that, Splunk makes Continuous Monitoring of the expanded security boundary achievable through two different approaches for bringing cloud data in. TekStream Solutions has Splunk and AWS engineers on staff with years of experience and can assist you in connecting your AWS environment to Splunk.

References

https://docs.splunk.com/Documentation/AddOns/released/Firehose/About

https://docs.splunk.com/Documentation/AddOns/released/Firehose/ConfigureFirehose

https://docs.splunk.com/Documentation/AddOns/released/AWS/Description

https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS

 

Want to learn more about connecting AWS and Splunk to ingest log data? Contact us today!

Textract – The Key to Better Solutions

By: Troy Allen | Vice President of Emerging Technologies

 

Businesses thrive on information, but good data can be difficult to collect, sort, and utilize because of the vast variety of sources and forms in which information is created and disseminated. As organizations are inundated with documents, forms, data streams, and more, it is becoming harder to extract meaningful information efficiently and funnel it into the systems that need it, or to present it in a fashion that drives better business decisions. Textract, part of AWS’s ever-growing solutions for Machine Learning, can play a critical part in how businesses process documents and collect vital data for use in their critical solutions and operations.

While Optical Character Recognition (OCR) has been around for many years, many organizations tend to overlook its strengths and ability to improve data processing.  Textract, while it does provide OCR functionality as a Cloud-based service, is much more thorough in its ability to bring Machine Learning based models to your business applications.  In order for data to be useful, it must first be collected; Textract provides OCR capabilities to ensure text is recognized from paper-scanned documents to electronic forms.

For data to be really useful, it needs to have organization and structure; Textract provides the ability to automatically detect content layout and recognize key elements and the relationship of the text and the elements it discovers.  And finally, for data to not only be useful, but actually utilized, it needs to be accessed; Textract can easily share the data, in its context, with other applications and data stores through well-formatted data streams to applications, databases, and other services.  Textract is designed to collect and filter data from documents and files so that you don’t have to.  Solutions utilizing Textract naturally benefit from an automated flow of information from capture to storage, to retrieval.

Textract is more than just OCR

In 1914, Emanuel Goldberg developed a machine that could read characters and convert them into telegraph code. Goldberg also applied for a patent in 1927 for his “Statistical Machine”, which was designed to retrieve individual records from spools of microfilm by using a movie projector and a photoelectric cell to do pattern recognition to find the right record. In many ways, Goldberg’s inventions are credited as the beginning of Optical Character Recognition (OCR) technology. Over the next 92 years, OCR has become one of the most critical elements in building business solutions, even though few have heard of it.

OCR moved beyond the business world to enabling sight-impaired people to read printed materials.  Ray Kurzweil and the National Federation of the Blind announced a new product in 1976, based on newly developed charge-coupled device (CCD) flatbed scanners and text-to-speech synthesizers, which has fundamentally changed the way we work with information.  It was no longer about reports, statistics, or data; it was about sharing information with anyone, in a format that could easily be accessible.  By 1978, OCR had moved into the digital world as a computer program.

Like all new technologies, OCR has had its issues and limitations.  In the beginning, text had to be very clear and created only in certain fonts to be recognized.  Scan quality of physical pages also plays a major factor in how well OCR engines extract the text from pages and in most cases, only a portion of the text is captured on poor scans.  Even today, with so many advancements in OCR, there are challenges to accurately collecting and organizing data from images.

Most OCR engines collect all the text from documents and make the words available for search engines, but very few OCR engines take it any further without requiring additional tools and applications.  Textract by Amazon Web Services goes beyond OCR by not only collecting the content but understanding where the content came from.

Textract provides the ability to not only perform standard character recognition but is designed to understand the formatting and how content is aligned within a page.  This is accomplished by recognizing and creating Bounding Boxes around key information and text areas to support the content, table extraction, and form extraction.

Item Location on a Document Page

The example image displays content that is separated by columns and has header information.

Figure 1 – Two Column Document Example

Most OCR applications will collect all the words on the page, but do not provide a reference to lines of text or location.  Amazon’s Textract retrieves multiple blocks of information from each page of the image it investigates:

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page

As the following illustration demonstrates, Textract is able to identify that there are two columns of information on the page.  It then recognizes that for each column, there are multiple lines of text which are made up of multiple words.

Figure 2 – Textract Line and Word Recognition

Textract outputs its findings in standard JSON files so that they can be utilized easily by other services or applications.  The example above would be represented in the JSON as follows:

Figure 3 – Sample JSON
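
To make the block structure concrete, the following sketch walks a Textract-style response and lists each detected line with its words and bounding box. The sample response is hand-written and heavily abbreviated for illustration. It is not the output shown in the figure, and real responses carry more fields, such as confidence scores and polygons.

```python
# Hand-written, abbreviated sample in the shape of a Textract response.
SAMPLE_LINES = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Hello world", "Page": 1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                      "Width": 0.3, "Height": 0.05}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Hello", "Page": 1},
        {"Id": "w2", "BlockType": "WORD", "Text": "world", "Page": 1},
    ]
}


def lines_with_words(response):
    """Return each LINE block with its page, bounding box, and child words."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # CHILD relationships on a LINE point at its WORD blocks.
        word_ids = [i for rel in block.get("Relationships", [])
                    if rel["Type"] == "CHILD" for i in rel["Ids"]]
        result.append({
            "text": block["Text"],
            "page": block.get("Page", 1),
            "box": block["Geometry"]["BoundingBox"],
            "words": [blocks[i]["Text"] for i in word_ids
                      if blocks[i]["BlockType"] == "WORD"],
        })
    return result
```

Bounding box values are ratios of the page width and height, so they can be compared across scans of different resolutions.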

Table Extraction

Amazon’s Textract is well equipped to locate table data within documents as well.  Textract recognizes the table construct and can establish key-value pairs with the cells by referencing the row and column information.  The following table represents 20 distinct cells, including the header row that will be evaluated by Textract:

Figure 4 – Sample table data

The output JSON from the Textract service creates a mapping between the rows and columns and intelligently identifies the key-value pairs in the table.  This recognition can also be performed against vertical table data rather than horizontal tables.  The following illustrates the key-value pair matching:

Figure 5 – Table Key-Value Pair

In addition to detecting text, Textract can recognize selection elements such as checkboxes and radio buttons. An unselected element, such as an empty checkbox or radio button, is reported with a status of NOT_SELECTED, whereas a checked box or filled radio button is reported as SELECTED; either can be tied to a key-value pair as well. This can be extremely helpful in finding values in both tables and forms.
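
As a rough illustration of how table output can be consumed, the sketch below rebuilds rows from TABLE and CELL blocks using their row and column indexes, then pairs each body row with the header row. Selection elements are represented by their status. The sample data is hand-written and abbreviated. It is an assumption for illustration, not the table from the figures.

```python
def _cell(cell_id, row, col, child):
    """Build an abbreviated CELL block for the sample below."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [child]}]}


SAMPLE_TABLE = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
    _cell("c11", 1, 1, "w1"), _cell("c12", 1, 2, "w2"),
    _cell("c21", 2, 1, "w3"), _cell("c22", 2, 2, "s1"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Active"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Ann"},
    {"Id": "s1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
]}


def child_ids(block):
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def cell_text(cell, blocks):
    """Join a cell's words; selection elements contribute their status."""
    parts = []
    for i in child_ids(cell):
        child = blocks[i]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def table_rows(response):
    """Return one list of rows (each a list of cell texts) per TABLE block."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for table in (b for b in response["Blocks"] if b["BlockType"] == "TABLE"):
        grid = {}
        for i in child_ids(table):
            cell = blocks[i]
            if cell["BlockType"] == "CELL":
                grid.setdefault(cell["RowIndex"], {})[cell["ColumnIndex"]] = \
                    cell_text(cell, blocks)
        tables.append([[grid[r][c] for c in sorted(grid[r])] for r in sorted(grid)])
    return tables


def rows_as_records(rows):
    """Treat the first row as headers and pair them with each later row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

Treating the first row as the header is a simplification. A vertical table would instead need the first column used as the keys.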

 

Form Extraction

Businesses have been interacting with their clients and vendors for decades through forms.  Textract provides the ability to read form data and clearly define key-value pairs of information from them.  Many organizations struggle with the fact that forms change over time and it can be difficult to train tools to find data when those tools were specific for a particular form layout.  Textract removes that complexity by reading the actual text rather than a location on a form to get its information and analyzes documents and forms for relationships between the detected text.

Figure 6 – Sample form image

In the example above, Textract will create the following Key-value pairs:

Traditional OCR tools will provide all the available text out of an image or document, but additional tools are required to gather Key-value pairs from forms and data, to recognize text as words and lines, and to understand the blocking of content.  Textract does all of this for you, providing data that can then be further analyzed as needed.
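
Form output follows the same block model. The sketch below collects key-value pairs from KEY_VALUE_SET blocks, following the VALUE relationship from each key to its value and the CHILD relationships to the underlying words. The sample response is hand-written and abbreviated for illustration only, and its field name and value are hypothetical.

```python
# Hand-written, abbreviated sample with one form field ("Name:" -> "Jane Doe").
SAMPLE_FORM = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]}


def related_ids(block, relationship_type):
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == relationship_type for i in rel["Ids"]]


def block_text(block, blocks):
    """Join the words (or selection statuses) that a block points at."""
    parts = []
    for i in related_ids(block, "CHILD"):
        child = blocks[i]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def form_fields(response):
    """Map each detected form key to the text of its value."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        if (block["BlockType"] == "KEY_VALUE_SET"
                and "KEY" in block.get("EntityTypes", [])):
            values = [block_text(blocks[i], blocks)
                      for i in related_ids(block, "VALUE")]
            fields[block_text(block, blocks)] = " ".join(values)
    return fields
```

Because the keys come from the text itself rather than from fixed page coordinates, the same function keeps working when a form layout changes, as long as the field labels stay recognizable.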

Textract Considerations

Textract is specifically designed to perform OCR against files in formats such as JPG, PNG, and PDF.  Most text-based document formats created electronically today do not require additional OCR since they are already embedded with an index that is accessible by search engines.  With the proliferation of mobile device and tablet use, however, many images are still created with no inherent index available.  We use our phones to take pictures of everything including people, scenes, receipts, presentations, and much more.  It is quick and easy to capture the world around us, but it is more difficult to have a computer application capture important information that may be held in those photographs.  Textract enables the extraction of data from images so that you don’t have to.

As with all technologies, there are limits to what Textract can do, and these should be understood before introducing it into a solution.  AWS maintains detailed information about the Amazon Textract service and its limitations at https://docs.aws.amazon.com/textract/latest/dg/limits.html.

Putting Textract to Work

While OCR is important and can be a critical part of any business process, it is an engine that retrieves information from sources that could not be accessed except through human intervention.  In many ways, it is like an important element within a car’s engine.  A fuel injector is critical for a car to run, but may not have much value as an entity unto itself.  It’s when you bring various parts together that your car takes you where you need to go or your application drives your business.

To create a basic OCR application with Textract, you will need:

  • A place to store the images that need to be processed. In many situations this may be Amazon S3 (Simple Storage Service), Amazon WorkDocs (secure content creation, storage, and collaboration service), or even a relational database like Amazon Aurora.
  • An application or service to call the Textract services. Many organizations are creating Cloud-first applications and may choose to use AWS Lambda to run their code without having to worry about the servers where the code runs.
  • A place to store the results of the Textract services. There are many options for where to store the text and details uncovered by Textract: they could be stored back in Amazon S3, in a database like Aurora, or even in a data warehouse like Amazon Redshift.
  • And finally, you need to do something with the information you have collected. This all depends on what your goals are for the information, but at a minimum, most people want to search for information.  Using Amazon Elasticsearch Service is an easy way to allow people to find the new information Textract was able to gather for you.

The following outlines this simple Textract solution:

Figure 7 – Simple Textract solution with Amazon Elasticsearch Service
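
One way the pieces listed above could fit together is an AWS Lambda function triggered by S3 uploads. The sketch below is a minimal outline under stated assumptions: the third-party boto3 SDK is available (as it is in the Lambda Python runtime), the results bucket name is hypothetical, and the search-indexing step is left out. The clients can be passed in, which makes the handler testable without AWS access.

```python
import json
import urllib.parse


def handler(event, context, textract=None, s3=None,
            results_bucket="textract-results-example"):  # hypothetical bucket
    """For each uploaded image, run Textract and store the detected lines."""
    if textract is None or s3 is None:
        import boto3  # only needed when real AWS clients are used
        textract = textract or boto3.client("textract")
        s3 = s3 or boto3.client("s3")

    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [b["Text"] for b in response["Blocks"]
                 if b["BlockType"] == "LINE"]

        out_key = f"{key}.json"
        s3.put_object(
            Bucket=results_bucket,
            Key=out_key,
            Body=json.dumps({"source": f"s3://{bucket}/{key}",
                             "lines": lines}).encode("utf-8"),
        )
        written.append(out_key)
    return {"written": written}
```

Writing results to a separate bucket avoids the function being re-triggered by its own output. Indexing the stored JSON into a search service would be a further step that this sketch does not cover.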

Practical Applications for Textract

While being able to search for information that was extracted from images is useful, it isn’t all that compelling from a business perspective.  Information needs to be meaningful and applied to a task so that its value can be recognized.  The following examples illustrate common business processes and the role that Textract can play in them.

Human Resource Document Management

Every organization has employees and/or volunteers to support their efforts.  There are many state, county, and country regulations that drive what information we need to keep about our employees as well as operational documents about the employees that help us to keep our businesses running.  The following are some examples of common documents that most organizations need to collect and retain:

  • Employment applications
  • Employee resumes
  • Interview notes, references, and background information
  • Employee offer letters
  • Benefit elections
  • Employee appraisals
  • Wage garnishments
  • State and Federal Employee documentation
  • Employee disciplinary actions
  • Termination decisions and disclosures
  • Promotion recommendations
  • Employee complaints and investigations
  • Leave request documentation

While there are many applications and services available on the market today which will help organizations capture, index, and retain this information, they can sometimes be costly and may not be able to completely capture information held in non-text-based file formats.  As discussed earlier, more and more people are using mobile and tablet technologies because of their accessibility and ease of use.  In many cases, an employee may use their phone to take a picture of a signed employee document and send it in to the company.  This photographed document can cause issues in capturing the information in it, or even classifying it properly in an automated fashion.  This is where Textract can easily be integrated into an existing solution, or incorporated as part of a newly constructed solution, to ensure vital information isn’t missed.

The following illustrates how a Cloud solution based on Amazon services can facilitate common Human Resource document management activities:

 

Amazon Services Utilized:

Amazon S3|Amazon WorkDocs|AWS Lambda|Amazon Textract|Amazon API Gateway

Non-Amazon Application examples:

Workday|Oracle Human Capital Management|Oracle PeopleSoft

 

In this example, a newly hired employee is granted access to the company’s Amazon WorkDocs environment to upload documentation required during the hiring process. While most of the uploaded documents will be easily indexed and searchable through the Amazon WorkDocs service, the employee has also been asked to upload a copy of their driver’s license. The employee uses the Amazon WorkDocs mobile application on their phone to take a picture of the license and uploads it to the appropriate folder. Behind the scenes, the company has configured a workflow in Amazon WorkDocs to inform HR managers when new documents have been submitted, and a Human Resources representative reviews the uploaded driver’s license. The representative then launches an action in Amazon WorkDocs (a special feature provided by the company’s IT department) that starts an operation running on AWS Lambda. That operation calls Textract to capture OCR information from the driver’s license and to create Key-value information, which is sent to the company’s ERP system (such as Workday, Oracle Human Capital Management, Oracle PeopleSoft, or a similar application) along with the Amazon WorkDocs reference for where the actual image is stored.

This illustrates a very simple method to directly engage with employees to capture critical HR information through a combination of out-of-the-box Amazon services and some light-weight customizations to create a streamlined process for document storage and data capture.  It only took the employee a few seconds to take the picture of the driver’s license and upload it and the HR representative a few seconds to review and process the new document.  In fact, the solution could be configured to automatically extract the required details and send it to the ERP without even having to have the HR representative involved for a truly automated solution.  Imagine each new hire having ten to twenty documents they need to upload and how much time HR spends processing each document manually for every new employee.  Automating this process can amount to several hours a month of time savings, especially when dealing with non-text-based file formats that require someone to manually read the documents to key in the information contained in them.  By introducing Amazon Textract into the overall solution, data can be collected, stored, processed, and shared easily and more efficiently.

Business Document Processing and Information Automation

While the Human Resource Document Management example above focused on capturing documents individually as they come in, there are many situations where companies need to process documents in bulk.  Using similar AWS services as the previous example, solutions can be designed to allow for batch uploading of documents for processing.  As an example, procurement procedures for large purchases can incorporate a wide variety of documentation which may have vastly different processes associated with them.  By providing a simple way for files to be uploaded in bulk, AWS services can be utilized to sort through the file formats for processing.  Non-text-based image files like JPG, PNG, and PDF files can then be automatically processed by Amazon Textract to capture OCR information, Table data, and Key-value information from forms and then shared with back-office applications, stored in data warehouse services, and/or shared with Amazon Elasticsearch services.  Processing hundreds or even thousands of documents and images a month becomes much easier through automation.  Incorporating Textract into business process work streams ensures that critical information is identified and captured from structured and semi-structured documents reducing the need for manual classification of information to facilitate business operations like Insurance Claims, Legal Processes, Partner Management, Purchasing, and more.

Litigation is disruptive to normal business operations for any company.  Thousands of documents, images, and artifacts have to be reviewed and collected to share with attorneys and the courts during a legal process, which can be time-consuming.  While there are many discovery tools available on the market today to help speed up the process of finding the desired information, they are reliant on information being in a format that the discovery services can handle.  In many cases, important information is stored in pictures and scanned documents that these discovery services cannot easily process.  Amazon’s Textract becomes a valuable tool in the discovery process by allowing organizations to quickly filter through image files and capture OCR information so that it can be indexed and searched.

Litigation isn’t only a headache for companies; it is a headache for the legal teams associated with the litigation process as well.  Imagine a law firm receiving millions of electronic files from a company and having to read through each document to find pertinent information regarding the case they are working on.  This can take months and many resources to complete, time that most lawyers don’t have during a case.  Files may be images, documents, spreadsheets, audio files, and even video files.  All of these need to be processed so that key information can be selected to support the case.  The expense of a large legal process can be staggering due to the sheer amount of manual labor required to gather information.  In the following example, Amazon’s Artificial Intelligence and Machine Learning services, including Textract, are utilized to greatly reduce the processing time for legal discovery.

Amazon Services Utilized:

AWS Transfer for SFTP|Amazon S3|AWS Lambda|Amazon Textract|Amazon API Gateway|Amazon Rekognition|Amazon Comprehend|Amazon Transcribe|Amazon Elasticsearch Service

In this example, a legal firm utilizes AWS Transfer for SFTP to allow clients and opposing counsel to quickly upload all of their discovery files and documents, which are then automatically stored in Amazon S3.  Files are then sorted by file type for processing.  Amazon Textract captures OCR information from image files, including table and form data, while Amazon Rekognition analyzes photos and videos to identify objects, people, text, scenes, and activities, perform facial recognition, and detect any inappropriate content.  Audio and video files are processed through Amazon Transcribe to capture speech-to-text information.  As files are processed, the information is captured and indexed in Amazon Elasticsearch Service to give the litigators rich search functionality, and it is also processed by Amazon Comprehend to quickly find relationships and insights in all the data collected.
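
The sorting step can be as simple as a lookup on the file extension. The sketch below is one hypothetical way to decide which services should process an uploaded file. The extension lists are assumptions for illustration and would need to match what each service actually accepts in a real deployment.

```python
from pathlib import PurePosixPath

# Assumed extension-to-service routing, following the flow described above.
ROUTES = {
    ".jpg": ("textract", "rekognition"),
    ".jpeg": ("textract", "rekognition"),
    ".png": ("textract", "rekognition"),
    ".pdf": ("textract",),
    ".mp3": ("transcribe",),
    ".wav": ("transcribe",),
    ".mp4": ("rekognition", "transcribe"),
}


def route(key):
    """Return the services that should process the S3 object key."""
    extension = PurePosixPath(key).suffix.lower()
    # Unknown types (for example, text documents) need no extraction step.
    return ROUTES.get(extension, ())
```

For instance, `route("case-42/deposition.mp4")` returns both the video-analysis and speech-to-text services, while a word-processing document returns an empty tuple and can be indexed directly.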

What would have taken months to sort through and comprehend becomes manageable information in hours or days providing more time for the legal team to focus on winning their case while saving thousands of dollars on the personnel required to manually process all the discovery information.

The tool you didn’t know you needed

Technology is advancing at incredible speeds and new solutions and services are becoming available every day.  Services like Amazon Textract are critical tools in document processing and are rarely thought about but imperative for success.  Of all the services Amazon provides, Amazon Textract is one of the hidden gems that can be easily overlooked but deserves to be part of your processing arsenal.

You are not alone

Business solutions can be complex, but making them work for your requirements doesn’t have to be.  Clearly defining your goals and objectives is half of the battle, the other half is knowing what tools will help you achieve those goals.  Are your off-the-shelf solutions and applications collecting all the information you have?  Do you need a business solution to manage all of your documents and data, but don’t know where to start?  Are you looking to move off of an outdated legacy application that no longer supports your business direction?  You are not alone.  Thousands of companies are facing the same questions and are finding the best answers by engaging with experts from Amazon and experts from solution service providers.  TekStream Solutions, along with AWS, is excited to speak with you about your Information Processing needs and how the right tools and solutions can have a positive impact on how you conduct business.  TekStream Solutions is offering a free Digital Transformation assessment where we will work with you to identify your document processing needs and provide process and technology recommendations to help you transform your business with ease.

 

Want to learn more about Textract? Contact us today!

Why Splunk on AWS?

AWS is the world’s most comprehensive and widely adopted cloud platform, offering over 165 services from data centers all over the globe. AWS allows you to build sophisticated applications with increased flexibility, scalability and reliability. The platform serves everyone from government agencies and Fortune 1,000 companies to small businesses and entrepreneurial startups.

Should your business consider using AWS? Changing databases to AWS is easy with the AWS database migration service, or by using AWS managed services or an AWS consulting agency.  Here’s why AWS is amongst the leading cloud computing services.

Reason #1: It’s Flexible 

Anyone can sign up for AWS and use the services without advanced programming language skills. AWS prioritizes consumer-centered design thinking, allowing users to select their preferred operating system, programming language, database, and other vital preferences. They also provide comprehensive developer resources and informative tools available to help maintain AWS’s ease of use and keep it up-to-date. 

Whether your team has the time to learn AWS or has access to AWS consulting, training users is simple. AWS offers their services with a no-commitment approach. Many software solutions use this as a way to market monthly subscriptions, but AWS services are charged on an hourly basis. As soon as you terminate a server, billing won’t include the next hour.   

With AWS, you can spin up a new server within a matter of minutes compared to the hours or days it takes to procure a new traditional server. Plus, there’s no need to buy separate licenses for the new server. 

Reason #2: It’s Cost-Effective 

With AWS, you pay based on your actual usage. AWS’s cloud solution makes paying for what you use the standard for database storage, content delivery, compute power, and other services. No more fixed server cost or on-prem monitoring fees. Your cost structure scales as your business scales, providing your company with an affordable option that correlates with its current needs. This results in lower capital expenditure and faster time to value without sacrificing application performance or user experience. Amazon also has a strong focus on reducing infrastructure cost for buyers. 

Reason #3: It’s Secure 

Cloud security is the highest priority at AWS. Global banks, military, and other highly-sensitive organizations rely on AWS, which is backed by a deep set of cloud security tools. Maintaining high standards of security without managing your own facility allows companies to focus on scaling their business, with the help of:  

  • AWS multiple availability zones (which consist of one or more data centers around the globe with redundant power, networking, and connectivity) that aid your business in remaining resilient to most failure modes, like natural disasters or system failures. 
  • Configured, built-in firewall rules that allow you to transition from completely public to private or somewhere in between to control access according to circumstance.

Multiple Migration Options

Depending on your unique business and tech stack needs, AWS offers companies multiple options for realizing its host of benefits. For Splunk, those options include:

  1. Migrate your Splunk On-Prem directly to AWS
  2. Migrate your Splunk On-Prem to Splunk Cloud (which sits on AWS)

Migrating to the cloud can be a business challenge, but AWS makes it simpler. While on the journey towards stronger digital security and efficiency, AWS can save time and resources. With its flexibility, cost-effectiveness, and security, you can easily deploy a number of software-based processes to an inclusive cloud-based solution.

Deep Freeze Your Splunk Data in AWS

By: Zubair Rauf | Splunk Consultant

 

Storage has become a commodity, but even now, reliable high-speed storage comes at a substantial cost. For on-premises Splunk deployments, Splunk recommends RAID 0 or 1+0 disks capable of at least 1200 IOPS, and this requirement increases in high-volume environments. Similarly, in bring-your-own-license cloud deployments, customers prefer to use SSD storage with at least 1200 IOPS.

Procuring these disks and maintaining them can carry a hefty recurring price tag. Aged data that no longer needs to be accessed on a daily basis, but has to be retained because of corporate governance policies or regulatory requirements, can significantly increase storage costs if it is kept on these high-performance disks.

This data can securely be moved to Amazon Web Services (AWS) S3, S3 Glacier, or other inexpensive storage options of the Admin’s choosing.

In this blog post, we will dive into a script that we have developed at TekStream which can move buckets from Indexer Clusters to AWS S3 seamlessly, without duplication. It will only move one good copy of the bucket and ignore any duplicates (replicated buckets).

When setting up indexes, Splunk Admins can set data retention on a per-index basis with the ‘frozenTimePeriodInSecs’ setting in indexes.conf. This allows the admins to be flexible on their retention levels based on the type of data. Once the data is older than that period, Splunk either deletes it or moves it to frozen storage.
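Because ‘frozenTimePeriodInSecs’ is expressed in seconds, a retention policy stated in days has to be converted first. The snippet below is only a small sketch of that arithmetic; the helper name is hypothetical and not part of Splunk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retention_days_to_secs(days):
    """Convert a retention period in days to a frozenTimePeriodInSecs value."""
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    return int(days * SECONDS_PER_DAY)


# Example: 90 days of retention expressed in seconds.
print(retention_days_to_secs(90))  # 7776000
```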

Splunk achieves this by referring to the coldToFrozenScript setting in indexes.conf. If a coldToFrozenScript is defined, Splunk will run that script; once it successfully executes without problems, Splunk will go ahead and delete the aged bucket from the indexer.
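To make the mechanism concrete, here is a minimal sketch of what a coldToFrozenScript could look like. It is not the TekStream script. It assumes Splunk passes the path of the bucket being frozen as the first argument, that the AWS CLI lives at /usr/bin/aws, and that the bucket sits under an index directory laid out as <index dir>/colddb/<bucket>. The script name and the S3 key layout are hypothetical.

```python
import os
import subprocess
import sys

# Hypothetical destination; replace with your own S3 bucket name.
S3_BUCKET = "splunk-to-s3-frozen-demo"
AWS_CLI = "/usr/bin/aws"


def build_copy_command(bucket_path, s3_bucket=S3_BUCKET):
    """Build the AWS CLI command that copies one Splunk bucket directory to S3."""
    bucket_path = bucket_path.rstrip("/")
    bucket_name = os.path.basename(bucket_path)
    # Assumed layout: <index dir>/<db or colddb>/<bucket directory>
    index_dir = os.path.basename(os.path.dirname(os.path.dirname(bucket_path)))
    destination = "s3://{}/{}/{}".format(s3_bucket, index_dir, bucket_name)
    return [AWS_CLI, "s3", "cp", bucket_path, destination, "--recursive"]


def main(argv):
    """Copy the bucket named in argv[1]; return the exit code for Splunk."""
    if len(argv) != 2:
        sys.stderr.write("usage: coldToFrozenS3.py <bucket_path>\n")
        return 1
    # A non-zero exit code signals failure, so Splunk keeps the bucket.
    return subprocess.call(build_copy_command(argv[1]))


# When installed as a script, finish with:
#     sys.exit(main(sys.argv))
```

Returning the AWS CLI exit code unchanged means a failed upload is reported to Splunk as a failed script run, so the aged bucket is not deleted before a copy exists in S3.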

The dependencies for this script include the following:

–   Python 2.7 – Installed with Splunk

–   AWS CLI tools – with credentials already working.

–   AWS Account, Access Key and Secret Key

–   AWS S3 Bucket

Testing AWS Connectivity

After you have installed AWS CLI and set it up with the Secret Key and Access Key for your account, test connectivity to S3 by using the following command:

Note: Please ensure that the AWS CLI commands are installed under /usr/bin/aws and the AWS account you are using should have read and write access to S3 artifacts.

If AWS CLI commands are set-up correctly, this should return a list of all the S3 buckets in your account.
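One command that behaves this way is `aws s3 ls`, which prints one line per bucket (creation date, time, and name). The sketch below wraps it in Python so the check can be scripted. The function names are hypothetical, and the CLI path is the one assumed in the note above.

```python
import subprocess

AWS_CLI = "/usr/bin/aws"  # path assumed from the note above


def parse_bucket_names(ls_output):
    """Extract bucket names from `aws s3 ls` output (date, time, name per line)."""
    names = []
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            names.append(parts[2])
    return names


def list_s3_buckets(aws_cli=AWS_CLI):
    """Run `aws s3 ls` and return the bucket names it prints."""
    output = subprocess.check_output([aws_cli, "s3", "ls"])
    return parse_bucket_names(output.decode("utf-8"))
```

If the credentials are wrong or the CLI is missing, `list_s3_buckets()` raises an exception instead of returning a list, which is usually enough to tell that the setup is not ready.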

I have created a bucket titled “splunk-to-s3-frozen-demo”.

Populate the Script with Bucket Name

Once the S3 bucket is ready, you can copy the script to your $SPLUNK_HOME/bin folder. After copying the script, edit it and change the name of your S3 Bucket where you wish to freeze your buckets.

Splunk Index Settings

After you have made the necessary edits to the script, it is time to update the settings on your index in indexes.conf.

Depending on where your index is defined, we need to set the indexes.conf accordingly. On my demo instance, the index is defined in the following location:

In indexes.conf, my index settings are defined as follows:

 

Note: These settings are only for a test index that will roll any data off to frozen (or delete it if a coldToFrozenScript is not present) after 600 seconds.
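For illustration only, the sketch below renders a stanza of the kind described in the note: a test index with a 600-second retention and a coldToFrozenScript. The index name, script name, and paths are hypothetical and are not the exact settings from the demo instance; a real index may need additional settings.

```python
def render_index_stanza(index_name, frozen_secs, script_name):
    """Render a minimal indexes.conf stanza for a test index."""
    lines = [
        "[{}]".format(index_name),
        "homePath = $SPLUNK_DB/{}/db".format(index_name),
        "coldPath = $SPLUNK_DB/{}/colddb".format(index_name),
        "thawedPath = $SPLUNK_DB/{}/thaweddb".format(index_name),
        "frozenTimePeriodInSecs = {}".format(frozen_secs),
        # Splunk runs this command with the bucket path appended as an argument.
        'coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/{}"'.format(script_name),
    ]
    return "\n".join(lines) + "\n"


print(render_index_stanza("s3_test", 600, "coldToFrozenS3.py"))
```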

Once you have your settings complete in indexes.conf, please restart your Splunk instance. Splunk will read the new settings at restart.

After the restart, I can see my index on the Settings > Indexes page.

Once the index is set up, I use the Add Data Wizard to add some sample data to my index. Ideally, this data should roll over to warm, and the script should move it to my AWS S3 bucket after 10 minutes.

The remote path on S3 will be set up in the following order:

If you are running this on an indexer cluster, the script will not copy duplicate buckets. It will only copy the first copy of a bucket and ignore the rest. This helps manage storage costs and does not keep multiple copies of the same buckets in S3.
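One way such de-duplication could work (a sketch under stated assumptions, not necessarily how the TekStream script does it) relies on the naming of clustered buckets: the original copy is named db_<newest>_<oldest>_<local id>_<origin GUID>, and replicated copies carry the same name with an rb_ prefix. Stripping the prefix gives a key that is identical for every copy. The function names are hypothetical, and in practice the set of archived keys would come from listing what is already in S3.

```python
import re

# Matches db_/rb_ bucket directories, with an optional origin GUID suffix.
BUCKET_RE = re.compile(r"^(?:db|rb)_(\d+)_(\d+)_(\d+)(?:_([0-9A-Fa-f-]+))?$")


def bucket_identity(index_name, bucket_dir):
    """Return a key that is the same for every copy of one clustered bucket."""
    match = BUCKET_RE.match(bucket_dir)
    if match is None:
        raise ValueError("unrecognized bucket directory: {}".format(bucket_dir))
    newest, oldest, local_id, guid = match.groups()
    return "{}/{}_{}_{}_{}".format(
        index_name, newest, oldest, local_id, guid or "standalone"
    )


def should_upload(index_name, bucket_dir, already_archived):
    """True only for the first copy of a bucket seen; later copies are skipped."""
    key = bucket_identity(index_name, bucket_dir)
    if key in already_archived:
        return False
    already_archived.add(key)
    return True
```

Keying on the name without its db_/rb_ prefix, rather than simply skipping rb_ buckets, means a replicated copy can still be archived when the peer holding the original copy is unavailable.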

Finally, once the script runs successfully, I can see my frozen Splunk bucket in AWS S3.

Note: This demo test was done on Splunk Enterprise 8.0 using the native Python 2.7.1 that ships with Splunk Enterprise. If you wish to use any other distribution of Python, you will have to modify the script to be compatible.

If there is an error and the bucket does not transfer to S3, or it is not deleted from the source folder, then you can troubleshoot it with the following search:

This search will show you the stdout error that is thrown when the script runs into an error.

To wrap up, I highly recommend that you implement this in a dev/sandbox environment before rolling it out into production. Doing so will confirm that it is robust for your environment and make you comfortable with the setup.

To learn more about how to set up AWS CLI Tools for your environment, please refer to the following link: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

If you have any questions or are interested in getting the script, contact us today!

Sneak Peek Into Our Approach to Migrating Splunk On-Prem to AWS

Splunk on AWS offers a special kind of magic. Splunk makes it simple to collect, analyze and act on data of all kinds. AWS applications for Splunk allow users to ingest that data quickly, with visualization tools, dashboards, and alerts. Together, they help organizations see through the noise. When a notable event (such as a potential breach) occurs, you can find and act on it quickly, making the combo a powerful tool for risk management. 

Running Splunk in the cloud gives organizations resiliency, with all the advantages of scalability, flexibility, cost optimization, and security. Yet migrating to the cloud poses many challenges, and implementing the new system alone can be intimidating, costly, and time-consuming if not done correctly. AWS database migration services are available to mitigate the impact of this necessary shift for your business.

Finding an experienced managed services partner with combined AWS and Splunk expertise can ease the process. Here’s a quick look into how to handle the change from Splunk On-Prem to AWS.

How Splunk Licensing Works

Each of your Splunk Enterprise instances requires a license, which specifies how many gigabytes per day that instance can index and which features you have access to. Multiple types of licenses are available, and distributed deployments, consisting of multiple instances, require a few extra steps.

Choosing the correct Splunk licensing option can be confusing. It requires outlining the types of business problems you wish to solve with Splunk, then estimating how much data usage you will need to perform this work over time.

Finding a Partner with Licensing Expertise

Non-compliance with licensing can lead to overages and penalties. As your advisory partner, TekStream can work with you to ensure that your Splunk and AWS licenses are in order.

TekStream has extensive experience navigating the specifics of complex license structures and contracts. Our Splunk Enterprise consultants will leverage their years of experience to help you assess your needs, accurately estimate your data usage, and determine the optimal license types and quantities for your unique needs.

In addition to Splunk licensing for new implementations, TekStream will also help your organization save money on licensing renewals. We will examine your Splunk usage to date, pinpoint areas where you may be overpaying, and provide you with viable alternatives to reduce your costs without sacrificing efficiency.

The 4 Most Common Licensing Structures

When selecting a licensing structure for Splunk on AWS, there are 4 main options. The best option will vary depending on the organization. Through careful analysis of your current licensing structures and your desired future state, we will work with you to determine the optimal licensing structure.  

Option 1: Migrate Your Existing Perpetual or Term License to AWS

Option 2: Convert Your Current License to Splunk Cloud (which would run on AWS)

Option 3: Convert to a Term or Infrastructure License (if on a Perpetual License)

Option 4: Pay-As-You-Go as part of a 3rd-Party Hosted MSP Solution

Each option has its pros and cons depending on an organization’s goals and usage. A partner can help you select the best option. TekStream’s deep experience overseeing complex data migrations empowers us to act as true consultative partners. We have the experience needed to quickly scope challenges and present solutions for your unique situation.

Take the risk out of your Splunk migration to AWS. We are so confident in our battle-tested strategy and proven database migration process that we guarantee that your database migration will be completed on-time and on-budget (when using TekStream’s Proven Process). We also guarantee optimal and cost-effective license and cloud subscriptions. 

 


To find out more specifics about our proven process and get an in-depth look into our services, read The Ultimate Guide to Migrating Your Splunk On-Prem to Amazon Web Services. 

Take your Traditional OCR up a notch

By: Greg Moler | Director of Imaging Solutions

While the baseline OCR landscape has not changed much, AWS aims to correct that. Traditional OCR engines are quite limited in the details they can provide. Being able to detect the characters is only half the battle; getting meaningful data out of them is the real challenge. Traditional OCR follows the ‘what you see is what you get’ mantra, meaning that once you run your document through, the blob of seemingly unnavigable text is all you are left with. What if we could enhance this output with other meaningful data elements useful in extraction confidence? What if we could improve the navigation of the traditional OCR block of text?

Enter Textract from AWS, a public web service that aims to improve your traditional OCR experience in an easily scalable, integrable, and low-cost package. Textract is built upon an OCR extraction engine that is optimized by AWS’ advanced machine learning. It has been taught how to extract thousands of different types of forms so you don’t have to worry about it. The ‘template’ days are over. It also provides a number of useful advanced features that other engines simply do not offer: confidence ratings, word block identification, word and line object identification, table extraction, and key-value output. Let’s take a quick look at each of these:

  • Confidence Ratings: Ability to intelligently choose whether to accept results or require human intervention, based on your own thresholds. Building this into your workflow or product can greatly improve data accuracy.
  • Word Blocks: Textract will identify word blocks, allowing you to navigate through them to help identify things like address blocks or known blocks of text in your documents. The ability to identify grouped wording rather than sifting through a massive blob of OCR output can help you find the information you are looking for faster.
  • Word and Line Objects: Rather than getting a block of text from a traditional OCR engine, having code-navigable objects to parse your documents will greatly improve your efficiency and accuracy. Paired with location data, you can use the returned coordinates to pinpoint where each value was extracted from. This becomes useful when you know your data is found in specific areas or ranges of a given document, to further improve accuracy and filter out false positives.
  • Table Extraction: Using AWS AI-backed extraction technology, Table extraction will intelligently identify and extract tabular data to pipe into whatever your use case may need, allowing you to quickly calculate and navigate these table data elements.
  • Key-value Output: AWS, again using AI-backed extraction technology, will intelligently identify key-value pairs found on the document without having to write custom engines to parse the data programmatically. Optionally, send these key-value pairs to your favorite key-value engine like Splunk or Elasticsearch (Elastic Stack) for easily searchable, trigger-able, and analytical actions for your document’s data.
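As a rough illustration of the confidence and location features, the sketch below filters the LINE blocks of a Textract-style response by a confidence threshold and keeps each line’s bounding box. It assumes the response is a dictionary with a "Blocks" list whose entries carry "BlockType", "Text", "Confidence", and "Geometry" fields; the function name, threshold, and sample data are made up for the example.

```python
def confident_lines(response, min_confidence=90.0):
    """Return (text, confidence, bounding box) for LINE blocks above a threshold."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        box = block.get("Geometry", {}).get("BoundingBox", {})
        results.append((block.get("Text", ""), block["Confidence"], box))
    return results


# Hand-written sample shaped like a Textract response (values are made up).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 1001", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.3, "Height": 0.02}}},
        {"BlockType": "LINE", "Text": "T0tal due", "Confidence": 62.4,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.80,
                                      "Width": 0.2, "Height": 0.02}}},
    ]
}

for text, confidence, box in confident_lines(sample_response):
    print(text, confidence, box["Top"])
```

Lines that fall below the threshold, like the misread "T0tal due" in the sample, are the ones you would route to a human reviewer instead of accepting automatically.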

Contact us today to find out how Textract from AWS can help streamline your OCR based solutions to improve your data’s accuracy!

Considerations for Moving From On-Prem to Cloud

Professionals today are continually bombarded with information about the cloud. It seems as if every other business is using cloud-based software, leaving those still using on-premises solutions wondering whether they, too, should switch. Businesses are flocking to cloud solutions because they offer many more advantages than on-prem solutions. Here are some of the most commonly cited reasons cloud setups are better.

COST EFFECTIVE

Cloud solution providers generally charge some kind of monthly fee. This rate may be paid annually or monthly and can be either a per-user cost or a cost that covers a set range of accounts. In exchange for this fee, you’ll be able to set up accounts until you reach the maximum, managing password resets and account removals and additions through an administrative portal.

Instead of relying on CDs or a website download to install the software on each device, you’ll have software that is ready to use. Licensing fees are included in the price, so your IT team will no longer have to keep track of your software licenses to make sure all of your installed software has been purchased.

TECHNICAL SKILLS

With so many small businesses and startups in the business world today, in-house technical support is often not an option. An SMB usually can’t afford a full-time IT support person, let alone the high cost of a server administrator. This means relying on local companies to provide help on an as-needed basis, which can come with a hefty per-hour price tag. As a result, the businesses that do have on-premises software will often rely on remote support, which is outsourced via the cloud.

With cloud software, technical support is usually handled by the provider, whether by phone, email, or a help desk ticket. These providers have the revenue base to pay the high salaries commanded by today’s top IT professionals, both at the server level and at the customer support level. Most smaller businesses simply couldn’t afford this kind of expertise on an ongoing basis.

SCALABILITY

Every business plans to grow over time, and cloud software offers the scalability needed to handle that growth. When a new employee joins the staff, a business using cloud software can simply add another user through its account management. When a company maxes out its logins, a higher-tier account can usually be ordered with minimal effort on the part of the business.

Another benefit of cloud solutions is that they usually add new features that typical on-premises setups do not. As customers express an interest in being able to do more with their software, providers add these features, making them available either automatically or with an optional account upgrade. Cloud solutions are also constantly working to integrate with other software applications, and these integrations make it easier for businesses to manage everything in one place.

AVAILABILITY

Today’s workforce is increasingly mobile, working from home, hotel rooms, coffee shops, and airports. Cloud software means that these workers can access their files wherever they are, using a laptop, smartphone, or tablet. This means that even while on vacation, teams can stay in touch, keeping projects moving forward through the cloud.

One of the best things about cloud solutions is that professionals no longer have to remember to bring files with them when they leave the office. A presentation can be delivered directly from a user’s smartphone. Applications that handle billing, cost estimating, and project management can be accessed during meetings, allowing attendees to get the information they need without making everyone wait until the meeting is over and everyone has returned to their offices.

RELIABILITY

If you’ve ever suffered a server outage, you know how damaging it can be on a variety of levels. Your employees are forced to either wait around for the situation to be resolved or go home for the day and leave your workplace unstaffed. If this happens too often, you’ll lose customers and even employees, as well as hurt your hard-earned reputation as a business that has its act together.

Cloud providers see reliability as an essential part of their business models. They make it their mission to ensure customers have access to the files and applications they need at all times. If an outage ever occurs, many cloud providers have built-in backups to take over, with customers never aware an issue has occurred. If no such backup exists, a cloud provider still has access to experts who can get systems back up far more quickly than an SMB could with an on-premises server.

SECURITY

Security is an ongoing concern for businesses, with reports of breaches becoming commonplace. Cloud software typically provides a high level of security, including data encryption and strong password requirements. These small things help protect a business’s data, reducing the risk of a breach that could cost money and harm customer trust in the company.

Businesses that store sensitive data, such as medical records or bank account information, should look for a cloud provider that offers these protections. There are now cloud providers that specialize in HIPAA compliance, for example, so a medical practice could benefit from the specialists on staff at one of those providers who can ensure that health data stays safe.

DISASTER RECOVERY

What would happen to your business if a natural disaster struck your building or data center? What if you came in one morning to discover a fire had rendered your offices unusable? Would you be forced to shut everything down for the duration, or would your employees be able to start working immediately?

Cloud software helps disaster-proof your business, ensuring your employees can work from home or a temporary office if for some reason they can’t work in the office. Cloud providers usually have backup plans for their own servers to protect against disasters, so the software and files you use every day will be accessible even if a problem strikes one of their data centers. Before you choose a provider, feel free to ask questions about the company’s disaster plans to make sure you’ll be covered.

Thinking about moving from On-Prem to Cloud? Contact us today!
