Driving Growth by Leveraging AWS and Document Understanding

Your company is sitting on a potential gold mine of stored data. Tucked away on servers and cloud-based drives are the answers and insights you need to take your business to the next level of growth. Advancements in machine learning and artificial intelligence have made it easier (and less expensive) to analyze this data through Document Understanding. For companies that already rely on the Amazon Web Services (AWS) platform, tying in a Document Understanding initiative can have a fundamental impact on driving growth and securing a more profitable bottom line.

What is Document Understanding?

Historically, the chief hurdle to analyzing this data has been that much of it is unstructured – composed of text-based files, reports, survey results, social media posts, notes, and assorted PDFs. Sifting through this quagmire was expensive and inefficient because it had to be done by hand.

That was the old way.

Document Understanding systems, fueled by natural language processing (NLP) and machine learning (ML), analyze text-based documentation (PDFs, notes, reports) to uncover insights. The machine-learning capabilities allow you to “teach” the AI how to read your specific documentation and guide its insight discovery.

How Document Understanding Can Benefit Your Enterprise Corporation

Enterprise companies are already tapping the power of AWS’s Document Understanding solution to garner essential insights into critical business functions.  Regardless of industry vertical, businesses are using Document Understanding to:

  • – Instantly search for information across multiple scanned documents, PDFs, images, reports, and stored text files.
  • – Redact critical information from documents and identify compliance threats in real-time.
  • – Digitize, store, and analyze customer feedback and request forms.
  • – Identify overarching communication trends and isolate specific messaging that can be used to improve the customer experience or marketing campaign.

And this is just the proverbial tip of the benefits iceberg. Through the machine learning aspect of Document Understanding, you can tailor your use of this technology to identify and analyze the data sets that have the most impact on your business and bottom line.

Driving Document Understanding through Intelligence with AWS Content Process Automation

As a certified AWS Advanced Consulting Partner, we are excited to announce the launch of our new AWS Content Process Automation (CPA) offering. Our new CPA tool integrates with the AWS platform to provide a structured process and streamlined toolset for implementing and managing an ongoing Document Understanding initiative.

Through our new AWS CPA offering, brands can:

  • – Make previously inaccessible data actionable at scale.
  • – Automate tedious but necessary business processes.
  • – Improve compliance and risk management.
  • – Identify opportunities to increase operational efficiency and reduce costs.

How TekStream CPA Works

Historically, analyzing sizeable unstructured data sets for actionable information has been time-consuming and costly because most of the work had to be done manually.

Our new CPA offering leverages artificial intelligence and machine learning, along with defined scope and direction, to increase the speed and accuracy of data discovery while eliminating much of the manual work of data mining.

Using machine-learning services like Amazon Textract and Amazon Rekognition, TekStream CPA inspects documents, images, and video (collectively called “files”), gathers key information and insights, and automatically stores these files logically to ensure easier access to critical information. Amazon Augmented AI (Amazon A2I) routes files requiring further review to content specialists and information managers to edit associated information, take corrective actions, and approve files for storage.
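The review-routing step described above can be pictured with a small sketch. This is not TekStream's or Amazon A2I's actual logic: the threshold, the `ExtractedField` type, and the `route_file` function are hypothetical names chosen for illustration, and the only assumption carried over from the AWS services is that extraction confidence is reported on a 0 to 100 scale.

```python
from dataclasses import dataclass

# Hypothetical threshold: fields extracted with lower confidence go to a person.
REVIEW_THRESHOLD = 90.0


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed 0-100 scale


def route_file(fields: list[ExtractedField], threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "human_review" if any field falls below the threshold, else "auto_store"."""
    if any(f.confidence < threshold for f in fields):
        return "human_review"
    return "auto_store"


# Illustrative data only: one field was read with low confidence.
invoice = [
    ExtractedField("invoice_number", "INV-1042", 99.1),
    ExtractedField("total", "1,250.00", 72.4),
]
print(route_file(invoice))
```

In practice the threshold would be set per document type by the business rules an administrator defines.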

TekStream CPA relentlessly and automatically investigates content to find key insights and associations that might not be easily discovered by the naked eye. Users and administrators establish business rules defining what information is important, how it will be managed, and the storage rules for documents, images, forms, video files, and unstructured data. This ensures critical business facts and figures are available for business operations.

Built for Growth

Use Document Understanding systems to gain a deeper understanding of internal and consumer audience sentiment around your brand or a specific product.

Analyzing your unstructured data sets is only part of the business growth equation. To achieve a true return on your investment and drive a noticeable impact on your bottom line, you also need to transform your insights into actions. By leveraging serverless technologies like AWS Lambda through our Content Process Automation tool, administrators can create functions that call their own services for file conversions, reformatting, and more to meet specific business criteria.
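As a rough sketch of what such a function might look like, the handler below reacts to an S3 "object created" notification and hands each file to a conversion step. The `convert_file` helper, the bucket name, and the object key are hypothetical placeholders; only the handler signature and the shape of the S3 event record follow the standard Lambda conventions.

```python
import urllib.parse


def convert_file(bucket: str, key: str) -> dict:
    """Hypothetical placeholder for a call to your own conversion service."""
    return {"bucket": bucket, "key": key, "status": "queued"}


def lambda_handler(event, context):
    """Entry point that Lambda invokes for an S3 "object created" notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append(convert_file(bucket, key))
    return {"processed": len(results), "results": results}


# Local dry run with a hand-written event; no AWS account is needed for this.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-intake"}, "object": {"key": "forms/new+hire.pdf"}}}
    ]
}
print(lambda_handler(sample_event, None))
```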

Start driving business growth today. TekStream has deep experience helping clients across multiple industries accelerate their digital transformation and begin leveraging the power of Document Understanding to push their business forward. Reach out to us today to learn more about what CPA and Document Understanding can do for your business.

Want to learn more about unlocking value from your unstructured data? Download our latest eBook, “9 Steps to Unlocking Value from Your Unstructured Data and Content.”

8 Benefits to Using Document Understanding to Mine Unstructured Data

What if we told you that your business was sitting on a mountain of untapped business intelligence or that hidden away in archived emails, documents, and customer survey results are the very insights you need to drive growth and improve your bottom line? These types of text-based documents are a form of “unstructured data” and (alongside image libraries, data streams, and similar data deposits) account for nearly 80% of all the data that an enterprise company generates and stores.

How do you analyze all of this data to identify the specific insights that can drive change and improve performance in your organization? Through Document Understanding.

Understanding Document Understanding

Document Understanding is one of the three core AI capabilities fueling the unstructured data analysis industry (the other two being Computer Vision and IoT analysis). This system leverages the power of natural language processing and machine learning to analyze text-based documents (PDFs, notes, reports) to uncover actionable business insights.

The machine-learning capabilities of these systems allow your organization to “teach” the AI to read your specific documentation and discover insights that are specific to your brand and audience.

8 Benefits of Analyzing Your Company’s Unstructured Data

The fact that the market size for natural language processing is estimated to reach over $16B by 2021 suggests that organizations large and small are investing in tools and systems that analyze their unstructured data. These companies are confident that the benefits of this work will outweigh the costs of the new systems.

While these benefits differ between industries, some of the key benefits of mining unstructured data include:

1. Find Opportunities to Improve Your Customer Experience

Retain more customers (and win over new fans) by using Document Understanding to analyze customer surveys and reviews to identify where your company can provide better customer service.

2. Discover New Opportunities in the Market

What is the “next big thing” in your industry? How will you ensure your company will stay relevant to consumers over the next 20 years? Turn your data lake into a blue ocean by mining your unstructured data for relevant insights and consumer trends.

3. Know Your Audience Better with Sentiment Analysis

Use these systems to gain a deeper understanding of internal and consumer audience sentiment around your brand or a specific product.

4. Make Key Decisions Faster and More Accurately

Quit getting bogged down with analysis paralysis. Get the data you need to identify and take action on the “right” decision when it counts most.

5. Improve Team Productivity and Reduce/Remove Outdated Data Processing Techniques

Through automation, you can eliminate data processing bottlenecks and instead focus your employees on more high-value tasks.

6. Identify and Eliminate Unnecessary Cost Centers

Get a handle on your waste by understanding which areas of your business are costing you money (without providing a corresponding ROI).

7. Gain a Better Understanding of Your Customer Behavior and Buying Triggers

Improve the performance of your marketing campaigns and customer retention efforts by gaining more in-depth insight into what makes your customers your customers in the first place.

8. Avoid Costly Regulatory or Compliance Issues

Uncover regulatory or compliance issues before they negatively impact your company.

Start with The End in Mind

Ready to get started analyzing your unstructured data, but not sure where to begin? We recommend starting with the end goal in mind. What is your highest unstructured data analysis priority? Are you sitting on a mountain of customer surveys? Are you curious about where your hidden cost centers are?

Understand which aspects of your unstructured data analysis will have an immediate impact on your business’s bottom line. Then work backward to develop the tools and systems you need to discover this intelligence.

If you are not sure where to begin, we can help. We’ve helped companies across a myriad of industries turn their unstructured data into business growth rocket fuel. Contact us today to learn how we can do the same for you.

Would you like to learn more about how to unlock value from your unstructured data? Download our free eBook, “9 Steps to Unlocking Value from Your Unstructured Data and Content.”

Leveraging Machine Learning for Document Understanding

By: Troy Allen | VP Cloud Services

Businesses thrive on information, but because of the complexity and wide variety of data available within an organization, finding usable information can be challenging and time-consuming.  As organizations are inundated with documents, forms, data streams, and more, it’s becoming increasingly difficult to extract meaningful information efficiently, funnel that information into the systems that need it, and present it in a fashion that drives better business decisions.  Machine Learning (ML) and Artificial Intelligence (AI) tools are helping solve those challenges. These tools have rapidly become more sophisticated, allowing organizations to gather critical information from their content, rich media files, and data to facilitate better Document Understanding (interpreting unstructured documents into recognizable sets of information).

Document Understanding focuses primarily on the information companies commonly mine: text-based information, videos, and graphics. Amazon has recognized the importance of Document Understanding and developed services to provide the visibility and analysis that companies need. Amazon Textract and Amazon Rekognition are machine learning services designed to extract meaning from documents, rich media files, and data.

Getting More Out of Text

While Optical Character Recognition (OCR) has been around for many years, most organizations tend to overlook its strengths and its ability to improve data processing.  Amazon Textract provides OCR functionality as a cloud-based service, but it offers much more than one might expect because it brings machine learning-based models to business applications.  For data to be useful, it must first be collected. Textract goes beyond simple OCR by distinguishing key-value pairs of information, extracting table data, and recognizing checkboxes and radio buttons.  Amazon Textract makes it easy to export the extracted data into a database or into off-the-shelf or custom applications. Traditional OCR solutions require additional tools to provide this level of data recognition and extraction.

More than OCR

Textract by Amazon Web Services goes beyond OCR by not only collecting the content but also understanding where the content came from. In addition to performing standard character recognition, Textract is designed to understand formatting and how content is aligned within a page.  It accomplishes this by recognizing key information and text areas and creating Bounding Boxes around them to support content, table, and form extraction.

Amazon’s Textract retrieves multiple blocks of information from each page of the image it investigates:

– The lines and words of detected text
– The relationships between the lines and words of detected text
– The page that the detected text appears on
– The location of the lines and words of text on the document page
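The four kinds of information listed above map onto fields of the blocks Textract returns. The sketch below walks a small hand-written dictionary shaped like a Textract response (it is illustrative data, not real service output) and collects each line's text, its child words, its page, and its bounding box. The `summarize_lines` function name is an assumption made for this example.

```python
# Hand-written sample shaped like a Textract response; not real service output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {
            "BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Invoice 1042",
            "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05, "Width": 0.30, "Height": 0.02}},
            "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}],
        },
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Invoice"},
        {"BlockType": "WORD", "Id": "w2", "Page": 1, "Text": "1042"},
    ]
}


def summarize_lines(response):
    """Collect text, child words, page, and location for every LINE block."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # CHILD relationships link a line to the words it contains.
        word_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        lines.append({
            "page": block.get("Page", 1),
            "text": block["Text"],
            "words": [by_id[i]["Text"] for i in word_ids],
            "box": block["Geometry"]["BoundingBox"],
        })
    return lines


for line in summarize_lines(sample_response):
    print(line)
```

Bounding box values are ratios of the page width and height, so they can be scaled to any rendering of the page.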

Table Data Exposed

Amazon’s Textract is well equipped to locate table data within documents. Textract recognizes the table construct and can establish key-value pairs with the cells by referencing the row and column information.

In addition to detecting text, Textract can recognize selection elements such as checkboxes and radio buttons.  An unselected checkbox or radio button, such as ☐ or ○, is reported with a status of NOT_SELECTED, whereas checked boxes and filled circles are reported as SELECTED. Either status can be tied to a key-value pair as well.  This can be extremely helpful in finding values in both tables and forms.
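To make the row, column, and selection-status idea concrete, here is a minimal sketch that rebuilds a two-by-two table from blocks shaped like Textract table output. The block list is hand-written for illustration, and the helper names (`child_ids`, `cell_text`, `table_to_rows`) are assumptions of this example rather than part of any AWS SDK.

```python
# Hand-written blocks shaped like Textract table output; illustrative only.
table_blocks = [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["s1"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Item"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Approved"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Laptop"},
    {"BlockType": "SELECTION_ELEMENT", "Id": "s1", "SelectionStatus": "SELECTED"},
]


def child_ids(block):
    """Return the Ids listed under a block's CHILD relationships."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def cell_text(cell, by_id):
    """Join a cell's words; a checkbox contributes its selection status."""
    parts = []
    for cid in child_ids(cell):
        child = by_id[cid]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def table_to_rows(table, by_id):
    """Place each cell in a grid using its 1-based row and column index."""
    cells = [by_id[cid] for cid in child_ids(table)
             if by_id[cid]["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, by_id)
    return grid


by_id = {b["Id"]: b for b in table_blocks}
rows = table_to_rows(by_id["t1"], by_id)
print(rows)
```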

The Power of Key-Value Pairs

Businesses have been interacting with their clients and vendors for decades through forms. Textract provides the ability to read form data and clearly define key-value pairs of information from them.  Many organizations struggle with the fact that forms change over time, and it can be difficult to train legacy OCR tools to find data when those tools are tied to a particular form layout. Textract removes that limitation by reading the actual text rather than a fixed location on a form, and by analyzing documents and forms for relationships between detected text.
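A short sketch shows how key-value pairs can be read from form output without relying on field positions. The blocks below are hand-written in the shape Textract uses for form data, where a KEY block points to its VALUE block through a relationship; the helper names (`related_ids`, `block_text`, `key_value_pairs`) are assumptions made for this example.

```python
# Hand-written blocks shaped like Textract form output; illustrative only.
form_blocks = [
    {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Full"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Name:"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Jane"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Doe"},
]


def related_ids(block, rel_type):
    """Return the Ids a block links to through the given relationship type."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type for i in rel["Ids"]]


def block_text(block, by_id):
    """Join the text of a block's child WORD blocks."""
    return " ".join(by_id[i]["Text"] for i in related_ids(block, "CHILD")
                    if by_id[i]["BlockType"] == "WORD")


def key_value_pairs(blocks):
    """Map each KEY block's text to the text of the VALUE block it points to."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            value = " ".join(block_text(by_id[v], by_id)
                             for v in related_ids(block, "VALUE"))
            pairs[block_text(block, by_id)] = value
    return pairs


print(key_value_pairs(form_blocks))
```

Because the pairing comes from relationships between blocks, the same code works when the "Full Name" field moves elsewhere on a redesigned form.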

Getting More from Images

Amazon Rekognition makes it easy to analyze image and video files using proven, highly scalable, deep learning technology that requires no machine learning expertise to use. Amazon Rekognition provides the ability to identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content. It provides those capabilities while also delivering highly accurate facial analysis and facial search capabilities that can be used to detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.

Using AI to See More

With Amazon Rekognition Custom Labels, objects and scenes in images can be identified for specific business requirements and actions. Models can be trained to classify specific machine parts, identify employees’ use of Personal Protective Equipment (PPE) in surveillance videos, capture model numbers in images, and detect persons of interest, to name a few use cases. Amazon Rekognition Custom Labels allows organizations to quickly identify objects and images that have value to their specific business and processes.

Uncovering Hidden Data

As with Amazon Textract, Amazon Rekognition provides a way for companies to identify key information that can be stored, processed, and shared with other applications enabling greater Document Understanding across files and data.  Context of information is critical to assigning value and defining how it can be best utilized.

Amazon Rekognition helps companies realize value in their images and videos through many different use cases across the enterprise:

    • – Discover inappropriate content – filter images and videos for objects and scenes containing inappropriate content such as nudity, weapons, graphic violence, and even inappropriate text in the videos or images.
    • – Identify key objects – Amazon Rekognition can be utilized to filter Social Media video and image files to identify products, brands, people, and even landmarks.
    • – Help improve workplace safety – with the support of video, Amazon Rekognition can be utilized to inspect surveillance videos and identify issues such as people not wearing Personal Protective Equipment (PPE) and obstructive objects in the workplace.
    • – Support identity verification – facial recognition and person recognition can be accomplished through Amazon Rekognition by detecting humans, identifying facial features, and even comparing those to documented photographs of people for identifying people in images and video files.
    • – Capture text information – Amazon Rekognition also provides the ability to perform text capture and recognition in video and image files. This can help an organization gather data and information contextual to a video or image such as the model number of a part from a photograph of a manufacturer’s plate or even identify names of streets from street signs in a video to assist in determining the location of the event filmed.
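The text-capture use case in the last bullet can be sketched as follows. The response dictionary is hand-written in the shape of Rekognition's text detection output rather than taken from a real call, and the `MODEL_PATTERN` regular expression, the confidence cutoff, and the `find_model_numbers` function are all hypothetical choices for this illustration.

```python
import re

# Hand-written sample shaped like a Rekognition text detection response.
sample_detections = {
    "TextDetections": [
        {"DetectedText": "ACME PUMP CO", "Type": "LINE", "Id": 0, "Confidence": 98.7},
        {"DetectedText": "MODEL XR-2040", "Type": "LINE", "Id": 1, "Confidence": 96.2},
        {"DetectedText": "MODEL", "Type": "WORD", "Id": 2, "ParentId": 1, "Confidence": 97.0},
        {"DetectedText": "XR-2040", "Type": "WORD", "Id": 3, "ParentId": 1, "Confidence": 95.4},
    ]
}

# Hypothetical format for a model number on a manufacturer's plate.
MODEL_PATTERN = re.compile(r"MODEL\s+([A-Z]+-\d+)")


def find_model_numbers(response, min_confidence=90.0):
    """Return model numbers found in confidently detected lines of text."""
    found = []
    for detection in response["TextDetections"]:
        # Work on whole lines so the label and the number stay together.
        if detection["Type"] != "LINE" or detection["Confidence"] < min_confidence:
            continue
        match = MODEL_PATTERN.search(detection["DetectedText"])
        if match:
            found.append(match.group(1))
    return found


print(find_model_numbers(sample_detections))
```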

You Are Not Alone

Business solutions can be complex, but making them work for your requirements doesn’t have to be.  Clearly defining your goals and objectives is half of the battle; the other half is knowing what tools will help you achieve those goals.  Are your off-the-shelf solutions and applications collecting all the information you have?  Do you need a business solution to manage all of your documents and data, but don’t know where to start?  Are you looking to move off an outdated legacy application that no longer supports your business direction?  You are not alone.

Thousands of companies are facing the same questions and are finding the best answers by engaging with experts from Amazon and experts from solution service providers. TekStream, along with Amazon, is excited to speak with you about your Document Understanding needs and how the right tools and solutions can have a positive impact on how you conduct business. TekStream is offering a free Digital Transformation assessment where we will work with you to identify your document processing needs and provide process and technology recommendations to help you transform your business with ease.  Reach out to us at info@TekStream.com for more details or call 1-844-TEK-STRM.

The Current State of Unstructured Data Analysis and What It Means for Your Business

The data analytics industry is continually evolving. What seemed like science fiction only years ago has become a business fact as new data analyzing tools and techniques come to market. And while larger organizations (with deeper pockets) have been the driving force of much of this change, advancements in artificial intelligence and machine learning have democratized big data analysis. Nowhere is this more apparent than the evolution of the modern, unstructured data analysis landscape.

Accounting for nearly 80% of all data generated and stored by an organization and growing at a rate of 55%-65% each year, unstructured data is one of the largest untapped and continuous sources of business intelligence. We’ve put this blog post together as an introduction to unstructured data and some of the tools and techniques that companies are using to analyze and mine their unstructured data for actionable insights to improve their organization and bottom line.

Understanding the 3 Different Types of Big Data

To understand what is meant by the term “unstructured data,”  you first need to know where it falls within the broadest categories of business data – structured data, semi-structured data, and unstructured data.

The comparison below outlines the overarching differences between these data sets.

Structured Data
  • Historically used for data analysis and mining.
  • Designed for data capture, data input, data analysis, search, etc. within the document.
  • Pre-defined structured format with standardized columns and rows:
    • Databases, Google Sheets, CSV, Excel, etc.

Semi-Structured Data
  • Data that is loosely organized by its source and delivery channel:
    • Email
    • Tweets
    • Folders
  • Has some basic search/discovery:
    • Inbox Search
    • Hashtags
    • Folder Names
  • Specific data within these channels is unstructured and text-heavy.

Unstructured Data
  • Data that is not organized in any way, which makes it difficult to process and analyze using traditional methods:
    • Surveys
    • External Industry Reports
    • Data Analysis
  • Tends to be text-heavy but can include voice recordings, images, video, etc.:
    • Notes (handwritten or typed)
    • Documents (POs, Resumes, Invoices)
    • Rich Media (Geo-Spatial, Security, etc.)
    • Analytics/Performance Data
    • Internet of Things Usage Reports/Data Streams
    • Customer Communications (Surveys, Live Chat, Automated Messaging)
  • Also known as qualitative data.
  • Tends to be utilized within an organization according to data type.

As you can see, the definition of unstructured data is broad. There is no consistent medium or format, but most unstructured data is unstructured text: documents, social media posts, emails, surveys, etc.

So, how does a large organization mine swathes of unstructured data for nuggets of actionable gold?

The Current Environment and Capabilities for Mining Unstructured Data

As the importance of mining unstructured data grows, new discovery and intelligence tools are introduced to the market. Simultaneously, as analysis technology improves, more companies build systems and tools that have integrated data logging capabilities capturing and generating more unstructured data than in previous years.

This means that not only do companies of all sizes have access to more advanced data mining tools – they also have more extensive data sets to analyze.

In general, there are three core AI capabilities that are empowering unstructured data analysis:

Document Understanding: Fueled by natural language processing (NLP) and machine learning (ML), these systems analyze text-based documentation (PDFs, notes, reports) to uncover insights. The machine-learning capabilities allow you to “teach” the AI how to read your specific documentation and guide its insight discovery.

Computer Vision: This is used to analyze image and video content through digital imaging technologies, pattern recognition, and ML in order to process your visual data and uncover actionable intelligence.

Internet of Things: Here, data is generated from machines. AI relies on real-time analytics, ML, and smart systems to analyze the data for performance-improvement insight.

Document Understanding and Text-Based Data Analysis

We stated this earlier, but the vast majority of an organization’s unstructured data is text-based. More organizations are leveraging the power of Document Understanding systems to drive data mining and identify impactful findings.

Here are just a few examples of how companies are leveraging document understanding to fuel insight gathering:

  • – Sentiment Analysis: Automatically classify text by sentiment and pull together trend reports.
  • – Keyword Extraction: Identify which keywords recur throughout a data set.
  • – Regulatory and Compliance Support: Identify regulatory or compliance issues before they impact your business.
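To give a feel for the keyword extraction item, here is a deliberately naive sketch that counts recurring words across a few pieces of feedback. It is a plain frequency count using the Python standard library, not the NLP or ML technique a Document Understanding product would apply; the stopword list, the sample feedback, and the `top_keywords` function are invented for illustration.

```python
import re
from collections import Counter

# Small, hypothetical stopword list; real systems use far larger ones.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "to", "of", "it", "for", "in", "my", "but"}


def top_keywords(documents, n=3):
    """Count non-stopword tokens across all documents and return the n most common."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z']+", doc.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Invented customer feedback used only to exercise the function.
feedback = [
    "Shipping was slow and the shipping box was damaged.",
    "Great product, but shipping took two weeks.",
    "The product works well.",
]
print(top_keywords(feedback, n=2))
```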

Getting Started with Unstructured Data Analysis

Ready to get started mining your unstructured data? TekStream has deep experience deploying both pre-built and custom unstructured data analysis solutions that empower teams with the insights they need to take action and improve their bottom line. Contact TekStream today to learn more about how we can assist your company with its unstructured data analysis goals.

Are you looking for more insights and best practices for unlocking value from your unstructured data? Download our free eBook, “9 Steps to Unlocking Value from Your Unstructured Data and Content.”

TekStream Launches AWS Content Process Automation Solution

TekStream, an Atlanta-based digital transformation technology firm, is excited to announce the launch of its new Content Process Automation (CPA) Solution. This new CPA solution enables users to quickly process and manage critical business documents, images, forms, video files, and unstructured data from a wide variety of sources through the power of AWS Integrated Services.

TekStream’s CPA solution automatically investigates content to find key insights and associations that might not be easily discovered by the naked eye. Users and administrators establish business rules defining what information is important, how it will be managed, and the final destination of documents, images, forms, video files, and unstructured data to ensure critical business facts and figures are available for business operations. Each department within an organization can benefit from TekStream’s CPA solution, but it may be of most use to departments including:

– Legal
– Human Resources
– Information Technology
– Marketing
– Sales

“TekStream has always focused on understanding our clients’ needs and providing solutions that ensure their success in reaching and exceeding expectations. Our new Content Process Automation offering is based on years of helping customers gain better insights from their managed content. With the power of AWS Artificial Intelligence and Machine Learning, we were able to create a solution that makes obtaining operational success that much easier to reach. We also ensured that our design is future-proofed by allowing more AWS services to be incorporated based on client needs or as new and more powerful services are offered by AWS,” said Troy Allen, Vice President of Cloud Services at TekStream.

Content Process Automation is a platform to build upon. As your business needs grow, TekStream’s CPA capabilities will grow to meet your requirements. As Amazon continues its investment in machine learning and artificial intelligence, TekStream’s CPA solution will incorporate those services to provide more options to make document understanding easier and more efficient.

TekStream

TekStream accelerates clients’ digital transformation by navigating complex technology environments with a combination of technical expertise and staffing solutions. We guide clients’ decisions, quickly implement the right technologies with the right people, and keep them running for sustainable growth. Our battle-tested processes and methodology help companies with legacy systems get to the cloud faster, so they can be agile, reduce costs, and improve operational efficiencies. And with hundreds of deployments under our belt, we can guarantee on-time and on-budget project delivery. That’s why 97% of clients are repeat customers. For more information visit https://www.tekstream.com/.

How Robotic Process Automation Fuels Document Understanding

By: Troy Allen | VP Cloud Services

RPA, or Robotic Process Automation, is becoming more prevalent in the world of business process automation. In its simplest form, RPA is the practice of utilizing bots or Artificial Intelligence to analyze information and make recommendations for actions based on that analysis.  RPA is not a static set of rules that a process follows. It is an evolving model that learns and adapts by examining recommendations and the actual actions taken and adjusts future actions based on what it has “learned.”  In short, RPA is business process automation driven by Artificial intelligence that learns as it goes to provide increasingly accurate actions and results.

As with business process automation, RPA can be utilized for many common tasks in an organization.  Insurance companies leverage RPA to automate the onboarding of new clients and validate insurance claims, reducing the amount of human-based processing while increasing accuracy.  Manufacturing companies utilize RPA to process Bill of Materials documents and manage purchase orders.  Both examples focus on automatically processing documents and associated information that normally require a high level of human interaction and decision making.

As an essential part of RPA, Document Understanding provides critical insights into the information being processed, helping to make the automation more accurate.

Document Understanding, a subset and critical function of RPA, interprets unstructured documents into a recognizable set of information that can be analyzed and acted upon with high levels of confidence.  Using specialized Artificial Intelligence tools, Document Understanding allows for the recognition of critical details and associations that would normally require human review to identify. For example, with forms processing and extracting key information from tables, Document Understanding enables RPA processes to perform highly accurate analysis and actions based on that information.

RPA and Document Understanding in Action

Onboarding new employees requires a large amount of information to be collected and processed.  Typical employers accept and track candidate applications, compensation details, candidate/employee profiles, onboarding documentation, performance management documentation, and various state and federal employee documents.  This can result in 15 to 20 documents being processed for each employee being hired.

Imagine an organization that hires thousands of employees for seasonal work. This can result in 2,000 or more documents that have to be reviewed, processed, and acted upon in a very short timeframe.  If a single document takes a human resources worker 2 minutes to review and up to 2 hours to carry through its full process, 2,000 documents can require 4,000 hours of processing or more.  Assuming the average resource cost for the participants involved in processing new employees is $25 an hour, the company could be looking at $100,000 or more just to onboard these new employees.  This cost is most likely higher, considering that not every candidate is a fit for the role or company and more candidates must be screened, interviewed, and processed to meet hiring goals.

With RPA and Document Understanding, automated processes could be deployed to help minimize the amount of time each processor has to interact with the various documents and the actual process.  In many cases, documents can be automatically reviewed, analyzed, categorized, and routed for action based on a well-defined business process.  As an example, this can reduce the overall processing of those 2,000 documents from 4,000 hours to 2,000 hours, resulting in a $50,000 reduction in onboarding and hiring costs for seasonal employees.
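The arithmetic behind this example can be checked in a few lines. The figures are the ones quoted in the text (2,000 documents, up to 2 hours each, $25 an hour, and 2,000 hours after automation); the variable names are only for illustration.

```python
# Figures taken from the onboarding example in the text.
DOCUMENTS = 2_000
HOURS_PER_DOCUMENT = 2
HOURLY_COST = 25  # dollars per hour

manual_hours = DOCUMENTS * HOURS_PER_DOCUMENT
manual_cost = manual_hours * HOURLY_COST

# The example assumes automation cuts processing time to 2,000 hours.
automated_hours = 2_000
automated_cost = automated_hours * HOURLY_COST

savings = manual_cost - automated_cost

print(f"Manual:    {manual_hours:,} hours, ${manual_cost:,}")
print(f"Automated: {automated_hours:,} hours, ${automated_cost:,}")
print(f"Savings:   ${savings:,}")
```

Changing any one input (document count, hours per document, or hourly cost) shows how sensitive the savings estimate is to that assumption.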

What are the savings with RPA and Document Understanding?

As with any process, it takes time to establish, configure, test, and update to make the process as efficient and accurate as possible.  This is true with Robotic Process Automation and Document Understanding.  Many RPA tools provide a baseline of processes and intelligence based on business processes, but no two organizations operate the same way.

These baseline processes need to be modified to meet specific organizational operations.  In many cases, RPA and Document Understanding platforms provide a solid foundation to build upon, which can save significant operational costs right out of the box.  RPA and Document Understanding are designed to learn as more information is processed, which means that speed and efficiency improve over time.

Over time, organizations that use RPA can see a 40% to 50% increase in efficiency and a reduction of 50% or more in processing costs.  The following chart outlines the potential return on investment (ROI) of an RPA solution with Document Understanding that automates 2 processes which traditionally take 4 full-time employees 35% of their time to perform, at a salary of $55,000 annually per employee:
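As a rough check of the baseline in this scenario, the labor currently spent on the two processes can be computed directly. The 50% reduction applied below is the lower bound quoted above and is used here as an assumption. Platform licensing and implementation costs are not given in this article, so they are left out, which means this is a sketch of gross savings rather than a full ROI figure.

```python
EMPLOYEES = 4
ANNUAL_SALARY = 55_000
PERCENT_OF_TIME_ON_PROCESSES = 35

# Salary cost of the time spent on the two manual processes each year.
baseline_labor_cost = EMPLOYEES * ANNUAL_SALARY * PERCENT_OF_TIME_ON_PROCESSES / 100

# Assumption: automation removes 50% of that processing cost.
ASSUMED_REDUCTION_PERCENT = 50
gross_annual_savings = baseline_labor_cost * ASSUMED_REDUCTION_PERCENT / 100

print(baseline_labor_cost, gross_annual_savings)
```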

How to Learn More

Contact us to learn more about our RPA and Document Understanding solution – Content Process Automation (CPA) by TekStream. Through our experience and hundreds of implementations, we help companies streamline business processes and improve decision-making with a structured approach to unstructured content and data.

Using TekStream CPA, organizations can enable their users to quickly process and manage critical business documents, images, forms, video files, and unstructured data from a wide variety of sources. Fill out the form below to see how we can help you understand how Robotic Process Automation and Document Understanding can be leveraged within your organization. You’ll improve your processing efficiency, reduce overhead, and see a return on your RPA investment quickly so you can focus on driving your business to even greater heights of success.

TekStream Helps to Support the Launch of Professional Services in AWS Marketplace

TekStream, a digital transformation company and Amazon Web Services (AWS) Advanced Consulting Partner, announced today that it is participating in the launch of Professional Services in AWS Marketplace. AWS customers can now find and purchase professional services from TekStream in AWS Marketplace, a curated digital catalog of software, data, and services that makes it easy to find, test, buy, and deploy software and data products that run on AWS. As a participant in the launch, TekStream is one of the first AWS Consulting Partners to quote and contract services in AWS Marketplace to help customers implement, support, and manage their software on AWS.

As organizations migrate to the cloud, they want to use their preferred software solutions on AWS. AWS customers often rely on professional services from TekStream to implement, migrate, support, and manage their software in the cloud. Until now, AWS customers had to find and contract professional services outside of AWS Marketplace and could not identify software and associated services in a single procurement experience. With professional services from TekStream available in AWS Marketplace, customers have a simplified way to purchase and be billed for both software and related services in a centralized place. Customers can further streamline their purchase of software with standard contract terms to simplify and accelerate procurement cycles.

“TekStream views AWS Marketplace as a strategic channel for our services to be discovered and procured,” said Judd Robins, Executive Vice President. “Complete solutions generally have a technology and a human component to make them work successfully. AWS Marketplace has always been a great catalog of technical solutions. With the addition of Professional Services in AWS Marketplace, customers now have a broader range of options to get those solutions launched and managed.”

• Database Migration QuickStart – Jumpstart your Database migration to AWS with a 1-week process to analyze and assess Oracle, Microsoft, and open-source database migrations to AWS purpose-built database solutions.
• Splunk Cloud QuickStart – Get your Top 3 IT Operations and/or Security use cases implemented leveraging Splunk with 2 weeks of services, training, and 3 months of go-live support provided by TekStream.
• Splunk CMMC QuickStart – A practical, proven, and effective solution to get you compliant in under 30 days.
• Oracle License Optimization Plan – 1 week to analyze and assess your Oracle licenses and contracts to reduce costs – paving your way to Database Freedom on AWS.
• CloudEndure Cloud Migration QuickStart – 1 week to migrate a development, QA, or testing on-premises workload to AWS.
• CloudEndure Cloud Disaster Recovery QuickStart – 1 week to implement and test disaster recovery for up to 3 on-premises workloads to AWS.

TekStream accelerates clients’ digital transformation by navigating complex technology environments with a combination of technical expertise and staffing solutions. We guide clients’ decisions, quickly implement the right technologies with the right people, and keep them running for sustainable growth. Our battle-tested processes and methodology help companies with legacy systems get to the cloud faster, so they can be agile, reduce costs, and improve operational efficiencies. And with 100s of deployments under our belt, we can guarantee on-time and on-budget project delivery. That’s why 97% of clients are repeat customers.

How to Connect AWS and Splunk to Ingest Log Data

By: Don Arnold | Splunk Consultant

 

Though a number of cloud solutions have popped up over the past 10 years, Amazon Web Services, better known as simply AWS, seems to be taking the lead in cloud infrastructure.  Companies that are using AWS have either migrated their entire infrastructure or are using on-premises systems with some AWS services in a hybrid solution.  Whichever may be the case, the AWS environment is within the security boundary, should be a part of the System Security Plan (SSP), and needs to be covered by Continuous Monitoring, which is a requirement in most security frameworks.  Splunk meets the Continuous Monitoring requirements, including for instances and services within AWS.

Data push

There are 2 separate ways to get data from AWS into Splunk.  The first is to “push” data from AWS to Splunk using “Kinesis Firehose”.  This requires IP connectivity between AWS and a Splunk Heavy Forwarder, an HTTP Event Collector token, and the “Splunk Add-on for Amazon Kinesis Firehose” from Splunkbase.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from AWS to the Splunk Heavy Forwarder over HTTPS.
  2. Go to Splunkbase.com and download/install the “Splunk Add-on for Amazon Kinesis Firehose” – Restart the Splunk Heavy Forwarder
  3. Create an HTTP Event Collector token:
    1. Go to Settings > Data Inputs > HTTP Event Collector
    2. Select New Token
    3. Enter a name for your token. Example:  “AWS”.  Select Next
    4. For Source type, click Select > Structured and choose “aws:firehose:json”. For App Context choose “Add-on for Kinesis Firehose”. Select Review
    5. Verify the settings and select
    6. Go back to Settings > Data Inputs > HTTP Event Collector and select Global Settings
    7. For “All Tokens” select Enabled, ensure “Enable SSL” is selected, and the “HTTP port number” is set to 8088. Select Save.
    8. Copy the “Token Value” for setup in AWS Kinesis Firehose.
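Before moving on to AWS, it can help to confirm that the new token is accepted. The sketch below builds a request for the HTTP Event Collector's `/services/collector/event` endpoint using only the Python standard library. The host name and token value are placeholders, not values from this article, and the request is only built, not sent.

```python
import json
import urllib.request


def build_hec_request(host, token, event, port=8088):
    """Build (but do not send) a POST request for the Splunk HTTP Event Collector."""
    url = f"https://{host}:{port}/services/collector/event"
    body = json.dumps({"event": event, "sourcetype": "aws:firehose:json"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Splunk {token}",  # HEC expects the "Splunk <token>" scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token: substitute your Heavy Forwarder and Token Value.
request = build_hec_request(
    "splunk-hf.example.com",
    "00000000-0000-0000-0000-000000000000",
    {"message": "HEC connectivity test"},
)
print(request.get_method(), request.full_url)
```

To actually send the event, pass the request to `urllib.request.urlopen(request, timeout=10)` from a machine that can reach the Heavy Forwarder. A certificate error at that point usually means the forwarder is using a certificate the client does not trust.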

AWS Kinesis Firehose Setup

  1. Log in to AWS and go to the Kinesis service and select the “Get Started” button.
  2. On the top right you will see “Deliver Streaming data with Kinesis Firehose Delivery Streams.” Select the “Create Delivery Stream” button.
  3. Give your delivery stream a name. Under Source, choose “Direct PUT or other sources”.  Select the “Next” button.
  4. Select “Disabled” for both Data transformation and Record format conversion.
  5. For Destination select “Splunk”. For Splunk cluster endpoint, enter the URL with port 8088 of your Splunk Heavy Forwarder.  For Splunk endpoint type select “Raw endpoint”.  For Authentication token, enter the Splunk HTTP Event Collector token value created in the Splunk Heavy Forwarder setup.
  6. For S3 backup select an S3 bucket. If one does not exist you can create one by selecting “Create New”.  Select Next.
  7. Scroll down to Permissions and click “Create new or choose” button. Choose an existing IAM role or create one.  Click Allow to return to the previous menu.  Select Next.
  8. Review the settings and select Create Delivery Stream.
  9. You will see a message stating “Successfully created delivery stream…”.
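The same console choices can be expressed as parameters for the Firehose `create_delivery_stream` API call, which is useful if the stream needs to be recreated in several accounts. This is a minimal sketch, not a complete configuration: the stream name, endpoint URL, token, and ARNs are placeholders, and the backup mode shown (failed events only) is an assumption about what you want in S3.

```python
def splunk_delivery_stream_config(name, hec_url, hec_token, role_arn, bucket_arn):
    """Return keyword arguments mirroring the console steps above."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",  # "Direct PUT or other sources"
        "SplunkDestinationConfiguration": {
            "HECEndpoint": hec_url,         # Heavy Forwarder URL with port 8088
            "HECEndpointType": "Raw",       # "Raw endpoint"
            "HECToken": hec_token,          # Token Value from the Splunk setup
            "S3BackupMode": "FailedEventsOnly",  # assumption: back up failures only
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": bucket_arn},
        },
    }


# All values below are placeholders.
config = splunk_delivery_stream_config(
    name="aws-to-splunk",
    hec_url="https://splunk-hf.example.com:8088",
    hec_token="00000000-0000-0000-0000-000000000000",
    role_arn="arn:aws:iam::123456789012:role/firehose-to-splunk",
    bucket_arn="arn:aws:s3:::firehose-splunk-backup",
)
print(sorted(config))
```

With boto3 installed and AWS credentials configured, the dictionary could be passed as `boto3.client("firehose").create_delivery_stream(**config)`. Check the current Firehose API reference before relying on it, since optional settings such as retry behavior and logging are omitted here.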

Test the Connection

  1. It is recommended that test data be used to verify the new connection by choosing the delivery stream and selecting “Test with Demo Data”. Go to step 2 and select “Start sending demo data”.  You will see the delivery stream sending demo data to Splunk.
  2. Log into Splunk and enter index=main sourcetype=aws:firehose:json to verify events are streaming into Splunk.
  3. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow AWS HTTPS events through to the Splunk Heavy Forwarder.

Send Production Data

  1. Go to AWS Kinesis and select the delivery stream you set up. The status for the delivery stream should display “Active”.
  2. Go to Splunk and verify events are ingesting with index=main sourcetype=aws:firehose:json, and verify the timestamps on the events are correct.

Data pull

The second way to get data into Splunk from AWS is to have Splunk “pull” data via a REST API call.

AWS Prerequisites Setup

  1. There are AWS service prerequisites that require set up prior to performing REST API calls from the Splunk Heavy Forwarder. The prerequisites can be found in this document:  https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS
  2. Ensure all prerequisites are configured in AWS prior to configuring the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.

Splunk Heavy Forwarder Setup

  1. Ensure the organization firewall has a rule to allow connectivity from the Splunk Heavy Forwarder to AWS.
  2. Go to Splunkbase.com and install the “Splunk Add-on for AWS” – Restart the Splunk Heavy Forwarder.
  3. Launch the “Splunk Add-on for AWS” on the Splunk Heavy Forwarder.
  4. Go to the Configurations tab.
    1. Account tab: Select Add. Give the connection a name, enter the Key ID and Secret Key from the AWS IAM user account, and select Add. (To get the Key ID and Secret Key, go to AWS IAM > Access management > Users > (select user) > Security credentials > Create access key > Access Key ID and Secret Access key.)
    2. IAM Role tab: Select Add.  Give the Role a name, enter the Role ARN, and select Add. (To get the Role ARN, go to AWS IAM > Access management > Roles > (select role).  At the top you will see the Role ARN.)
  5. Go to the Inputs tab. Select Create New Input and select the type of data input from AWS to ingest.  Each selection is different and all will use the User and Role created in the previous step.  Go through the setup, select the AWS region, source type, and index, and select Save.

Test the Connection

  1. Log into Splunk and enter index=main sourcetype=aws* to verify events are streaming into Splunk. Verify the sourcetype matches the one you selected in the input.
  2. If no events show up, go back and verify all steps have been configured properly and firewall rules are set to allow HTTPS requests from the Splunk Heavy Forwarder out to AWS.

With the popularity of AWS, more environments are starting to host hybrid solutions for a myriad of reasons.  With that, using Splunk to maintain Continuous Monitoring is easily achieved with 2 different approaches for monitoring the expanded security boundary into the cloud.  TekStream Solutions has Splunk and AWS engineers on staff with years of experience and can assist you in connecting your AWS environment to Splunk.

References

https://docs.splunk.com/Documentation/AddOns/released/Firehose/About

https://docs.splunk.com/Documentation/AddOns/released/Firehose/ConfigureFirehose

https://docs.splunk.com/Documentation/AddOns/released/AWS/Description

https://docs.splunk.com/Documentation/AddOns/released/AWS/ConfigureAWS

 

Want to learn more about connecting AWS and Splunk to ingest log data? Contact us today!

Textract – The Key to Better Solutions

By: Troy Allen | Vice President of Emerging Technologies

 

Businesses thrive on information, but good data can be difficult to collect, sort, and utilize due to the vast variety of sources and forms by which information is created and disseminated.  As organizations are inundated with documents, forms, data streams, and more, it’s becoming more difficult to extract meaningful information efficiently and funnel that information into the systems that need it or present it in a fashion that drives better business decisions.  Textract, part of AWS’s ever-growing solutions for Machine Learning, can play a critical part in how businesses process documents and collect vital data for use in their critical solutions and operations.

While Optical Character Recognition (OCR) has been around for many years, many organizations tend to overlook its strengths and ability to improve data processing.  Textract, while it does provide OCR functionality as a Cloud-based service, is much more thorough in its ability to bring Machine Learning based models to your business applications.  In order for data to be useful, it must first be collected; Textract provides OCR capabilities to ensure text is recognized from paper-scanned documents to electronic forms.

For data to be really useful, it needs to have organization and structure; Textract provides the ability to automatically detect content layout and recognize key elements and the relationship of the text and the elements it discovers.  And finally, for data to not only be useful, but actually utilized, it needs to be accessed; Textract can easily share the data, in its context, with other applications and data stores through well-formatted data streams to applications, databases, and other services.  Textract is designed to collect and filter data from documents and files so that you don’t have to.  Solutions utilizing Textract naturally benefit from an automated flow of information from capture to storage, to retrieval.

Textract is more than just OCR

In 1914, Emanuel Goldberg developed a machine that could read characters and convert them into telegraph code. Goldberg also applied for a patent in 1927 for his “Statistical Machine”.  Goldberg’s statistical machine was designed to retrieve individual records from spools of microfilm by using a movie projector and a photoelectric cell to do pattern recognition to find the right record on microfilm. Goldberg’s inventions are often credited as the beginning of Optical Character Recognition (OCR) technology.  Over the next 92 years, OCR has become one of the most critical elements, which few have heard of, in building business solutions.

OCR moved beyond the business world to enabling sight-impaired people to read printed materials.  Ray Kurzweil and the National Federation of the Blind announced a new product in 1976, based on newly developed charge-coupled device (CCD) flatbed scanners and text-to-speech synthesizers, which has fundamentally changed the way we work with information.  It was no longer about reports, statistics, or data; it was about sharing information with anyone, in a format that could easily be accessible.  By 1978, OCR had moved into the digital world as a computer program.

Like all new technologies, OCR has had its issues and limitations.  In the beginning, text had to be very clear and created only in certain fonts to be recognized.  Scan quality of physical pages also plays a major factor in how well OCR engines extract the text from pages and in most cases, only a portion of the text is captured on poor scans.  Even today, with so many advancements in OCR, there are challenges to accurately collecting and organizing data from images.

Most OCR engines collect all the text from documents and make the words available for search engines, but very few OCR engines take it any further without requiring additional tools and applications.  Textract by Amazon Web Services goes beyond OCR by not only collecting the content but understanding where the content came from.

Textract provides the ability to not only perform standard character recognition but is designed to understand the formatting and how content is aligned within a page.  This is accomplished by recognizing and creating Bounding Boxes around key information and text areas to support the content, table extraction, and form extraction.

Item Location on a Document Page

The example image displays content that is separated by columns and has header information.

Figure 1 – Two Column Document Example

Most OCR applications will collect all the words on the page, but do not provide a reference to lines of text or location.  Amazon’s Textract retrieves multiple blocks of information from each page of the image it investigates:

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page

As the following illustration demonstrates, Textract is able to identify that there are two columns of information on the page.  It then recognizes that for each column, there are multiple lines of text which are made up of multiple words.

Figure 2 – Textract Line and Word Recognition

Textract outputs its findings in standard JSON files so that they can be utilized easily by other services or applications.  The example above would be represented in the JSON as follows:

Figure 3 – Sample JSON
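To give a feel for how an application might consume that JSON, here is a minimal sketch that walks the `Blocks` list of a Textract response and rebuilds each line with its position and its child words. The sample response is hand-written for illustration and heavily trimmed (real responses also carry confidence scores and polygon coordinates), and the function name is our own.

```python
def lines_with_words(response):
    """Return each detected LINE with its page, position, and child WORD texts."""
    blocks = {block["Id"]: block for block in response["Blocks"]}
    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        word_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        box = block["Geometry"]["BoundingBox"]  # values are ratios of the page size
        lines.append({
            "page": block.get("Page", 1),
            "text": block["Text"],
            "left": box["Left"],
            "top": box["Top"],
            "words": [blocks[word_id]["Text"] for word_id in word_ids],
        })
    return lines


# Hand-written sample: one line in each of two columns.
sample_response = {"Blocks": [
    {"BlockType": "LINE", "Id": "line-1", "Text": "Left column",
     "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.10, "Width": 0.40, "Height": 0.03}},
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"BlockType": "WORD", "Id": "word-1", "Text": "Left"},
    {"BlockType": "WORD", "Id": "word-2", "Text": "column"},
    {"BlockType": "LINE", "Id": "line-2", "Text": "Right column",
     "Geometry": {"BoundingBox": {"Left": 0.55, "Top": 0.10, "Width": 0.40, "Height": 0.03}},
     "Relationships": [{"Type": "CHILD", "Ids": ["word-3", "word-4"]}]},
    {"BlockType": "WORD", "Id": "word-3", "Text": "Right"},
    {"BlockType": "WORD", "Id": "word-4", "Text": "column"},
]}

for line in lines_with_words(sample_response):
    column = "left" if line["left"] < 0.5 else "right"
    print(column, line["text"], line["words"])
```

Splitting columns at the horizontal midpoint is a simplification for this two-column sample. A real layout would need a more careful grouping of bounding boxes.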

Table Extraction

Amazon’s Textract is well equipped to locate table data within documents as well.  Textract recognizes the table construct and can establish key-value pairs with the cells by referencing the row and column information.  The following table represents 20 distinct cells, including the header row that will be evaluated by Textract:

Figure 4 – Sample table data

The output JSON from the Textract service creates a mapping between the rows and columns and intelligently identifies the key-value pairs in the table.  This recognition can also be performed against vertical table data rather than horizontal table.  The following illustrates the key-value pair matching:

Figure 5 – Table Key-Value Pair
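The row and column mapping can be turned into records with a few lines of code. The sketch below assumes a single table whose first row is the header, and it uses a hand-written, trimmed set of `CELL` and `WORD` blocks with made-up values rather than real Textract output.

```python
def table_records(blocks):
    """Turn CELL blocks into one dict per data row, keyed by the header row."""
    by_id = {block["Id"]: block for block in blocks}

    def cell_text(cell):
        child_ids = [
            child_id
            for rel in cell.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        return " ".join(
            by_id[i]["Text"] for i in child_ids if by_id[i]["BlockType"] == "WORD"
        )

    # Map (row, column) to cell text; Textract indexes both from 1.
    grid = {
        (block["RowIndex"], block["ColumnIndex"]): cell_text(block)
        for block in blocks
        if block["BlockType"] == "CELL"
    }
    last_row = max(row for row, _ in grid)
    last_col = max(col for _, col in grid)
    headers = [grid.get((1, col), "") for col in range(1, last_col + 1)]
    return [
        {headers[col - 1]: grid.get((row, col), "") for col in range(1, last_col + 1)}
        for row in range(2, last_row + 1)
    ]


# Hand-written sample: a header row plus one data row (illustrative values).
sample_blocks = [
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Name"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Role"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Ana"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Seasonal"},
    {"BlockType": "WORD", "Id": "w5", "Text": "Associate"},
]

records = table_records(sample_blocks)
print(records)
```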

In addition to detecting text, Textract has the ability to recognize selection elements such as checkboxes and radio buttons.  A checkbox or radio button that has not been selected, such as ☐ or ○, is represented with a status of NOT_SELECTED, whereas a selected one, such as ☑ or ●, is represented as SELECTED and can be tied to a key-value pair as well.  This can be extremely helpful in finding values in both tables and forms.

 

Form Extraction

Businesses have been interacting with their clients and vendors for decades through forms.  Textract provides the ability to read form data and clearly define key-value pairs of information from them.  Many organizations struggle with the fact that forms change over time and it can be difficult to train tools to find data when those tools were specific for a particular form layout.  Textract removes that complexity by reading the actual text rather than a location on a form to get its information and analyzes documents and forms for relationships between the detected text.

Figure 6 – Sample form image

In the example above, Textract will create the following Key-value pairs:

Traditional OCR tools will provide all the available text out of an image or document, but to gather Key-value pairs from forms and data, as well as recognizing text based on words, lines, and understanding the blocking of content, additional tools are required.  Textract does all of this for you providing data that can then be further analyzed as needed.
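As an illustration of how those relationships are followed in practice, the sketch below pairs each `KEY` block with its `VALUE` block and also reports checkbox state from `SELECTION_ELEMENT` blocks. The sample blocks and field values are invented for the example and trimmed to the fields the code reads.

```python
def form_fields(blocks):
    """Return {key text: value text} from KEY_VALUE_SET blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        return [
            related_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for related_id in rel["Ids"]
        ]

    def text_of(block):
        parts = []
        for child in (by_id[i] for i in related_ids(block, "CHILD")):
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED or NOT_SELECTED
        return " ".join(parts)

    fields = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [text_of(by_id[i]) for i in related_ids(block, "VALUE")]
            fields[text_of(block)] = " ".join(values)
    return fields


# Hand-written sample: one text field and one checkbox (illustrative values).
sample_form_blocks = [
    {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "k2", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v2"]}, {"Type": "CHILD", "Ids": ["w4"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v2", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["s1"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Name:"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Jane"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Doe"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Full-time"},
    {"BlockType": "SELECTION_ELEMENT", "Id": "s1", "SelectionStatus": "SELECTED"},
]

fields = form_fields(sample_form_blocks)
print(fields)
```

Because the pairing follows relationships between blocks rather than fixed page coordinates, the same code keeps working when a form's layout changes.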

Textract Considerations

Textract is specifically designed to perform OCR against image files such as JPG, PNG, and PDF file formats.  Most text-based document formats created electronically today do not require additional OCR since they are already embedded with an index that is accessible by search engines.  With the proliferation of mobile device and tablet use, there are still many times that images are created in which there is no inherent index available.  We use our phones to take pictures of everything including people, scenes, receipts, presentations, and much more.  It is quick and easy to capture the world around us, but it is more difficult to have a computer application capture important information that may be held in those photographs.  Textract enables the extraction of data from images so that you don’t have to.

As with all technologies, there are limits to what Textract can do, and these should be understood before introducing it into a solution.  AWS maintains detailed information about the Amazon Textract service and its limitations at https://docs.aws.amazon.com/textract/latest/dg/limits.html.

Putting Textract to Work

While OCR is important and can be a critical part of any business process, it is an engine that retrieves information from sources that could not be accessed except through human intervention.  In many ways, it is like an important element within a car’s engine.  A fuel injector is critical for a car to run, but may not have much value as an entity unto itself.  It’s when you bring various parts together that your car takes you where you need to go or your application drives your business.

To create a basic OCR application with Textract, you will need:

  • A place to store the images that need to be processed. In many situations this may be Amazon S3 (Simple Storage Service), Amazon WorkDocs (a secure content creation, storage, and collaboration service), or even a relational database like Amazon Aurora.
  • An application or service to call the Textract services. Many organizations are creating Cloud-first applications and may choose to use AWS Lambda to run their code without having to worry about the servers where the code runs.
  • A place to store the results of the Textract services. There are many options for where to store the text and details uncovered by Textract: an Amazon S3 bucket, a database like Aurora, or even a data warehouse like Amazon Redshift.
  • And finally, you need to do something with the information you have collected. This all depends on what your goals are for the information, but at a minimum, most people want to search for information.  Using Amazon Elasticsearch Service is an easy way to allow people to find the new information Textract was able to gather for you.

The following outlines this simple Textract solution:

Figure 7 – Simple Textract solution with Amazon Elasticsearch Service
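A minimal sketch of the Lambda piece of this solution is shown below. It assumes the function is triggered by S3 upload notifications, that results are written as JSON to a separate bucket named in a `RESULTS_BUCKET` environment variable (a hypothetical setting), and that indexing into a search service happens elsewhere. It uses Textract's synchronous `detect_document_text` call, which suits single images; multi-page documents would need the asynchronous API instead.

```python
import json
import os
from urllib.parse import unquote_plus


def extract_lines(textract_client, bucket, key):
    """Run text detection on one S3 object and return its lines of text."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]


def lambda_handler(event, context):
    """Entry point for S3 upload notifications (bucket names are assumptions)."""
    import boto3  # imported here so extract_lines can be tested without AWS

    textract = boto3.client("textract")
    s3 = boto3.client("s3")
    results_bucket = os.environ["RESULTS_BUCKET"]  # hypothetical configuration

    processed = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        lines = extract_lines(textract, bucket, key)
        s3.put_object(
            Bucket=results_bucket,
            Key=key + ".json",
            Body=json.dumps({"source": key, "lines": lines}).encode("utf-8"),
        )
        processed.append(key)
    return {"processed": processed}
```

Writing results to a different bucket than the one that triggers the function avoids the function re-triggering itself on its own output.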

Practical Applications for Textract

While being able to search for information that was extracted from images is useful, it isn’t all that compelling from a business perspective.  Information needs to be meaningful and applied to a task so that its value can be recognized.  The following examples illustrate common business processes and the role that Textract can play in them.

Human Resource Document Management

Every organization has employees and/or volunteers to support their efforts.  There are many state, county, and country regulations that drive what information we need to keep about our employees as well as operational documents about the employees that help us to keep our businesses running.  The following are some examples of common documents that most organizations need to collect and retain:

  • Employment applications
  • Employee resumes
  • Interview notes, references, and background information
  • Employee offer letters
  • Benefit elections
  • Employee appraisals
  • Wage garnishments
  • State and Federal Employee documentation
  • Employee disciplinary actions
  • Termination decisions and disclosures
  • Promotion recommendations
  • Employee complaints and investigations
  • Leave request documentation

While there are many applications and services available on the market today which will help organizations capture, index, and retain this information, they can sometimes be costly and may not be able to completely capture information held in non-text-based file formats.  As discussed earlier, more and more people are using mobile and tablet technologies because of their accessibility and ease of use.  In many cases, an employee may use their phone to take a picture of a signed employee document and send it in to the company.  This photographed document can cause issues in capturing the information in it, or even classifying it properly in an automated fashion.  This is where Textract can easily be integrated into an existing solution, or incorporated as part of a newly constructed solution, to ensure vital information isn’t missed.

The following illustrates how a Cloud-based solution built on Amazon services can facilitate common Human Resource document management activities:

 

Amazon Services Utilized:

Amazon S3|Amazon WorkDocs|AWS Lambda|Amazon Textract|Amazon API Gateway

Non-Amazon Application examples:

Workday|Oracle Human Capital Management|Oracle PeopleSoft

 

In this example, a newly hired employee is granted access to the company’s Amazon WorkDocs environment to upload documentation that will be required during the hiring process.  While most of the documents being uploaded will be easily indexed and searchable through the Amazon WorkDocs service, the employee has been asked to upload a copy of their driver’s license.  The employee uses the Amazon WorkDocs mobile application on their phone to take a picture of their driver’s license and uploads it to the appropriate folder.  Behind the scenes, the company has configured a workflow in Amazon WorkDocs to inform HR managers when new documents have been submitted, and a Human Resources representative reviews the uploaded driver’s license.  The representative then launches an action in Amazon WorkDocs (a special feature provided by the company’s IT department) that starts an operation running on AWS Lambda.  That operation calls Textract to capture OCR information from the driver’s license and create Key-value information, which is sent to the company’s ERP system (such as Workday, Oracle Human Capital Management, Oracle PeopleSoft, or a similar application) along with the Amazon WorkDocs reference for where the actual image is stored.

This illustrates a very simple method to directly engage with employees to capture critical HR information through a combination of out-of-the-box Amazon services and some light-weight customizations to create a streamlined process for document storage and data capture.  It only took the employee a few seconds to take the picture of the driver’s license and upload it and the HR representative a few seconds to review and process the new document.  In fact, the solution could be configured to automatically extract the required details and send it to the ERP without even having to have the HR representative involved for a truly automated solution.  Imagine each new hire having ten to twenty documents they need to upload and how much time HR spends processing each document manually for every new employee.  Automating this process can amount to several hours a month of time savings, especially when dealing with non-text-based file formats that require someone to manually read the documents to key in the information contained in them.  By introducing Amazon Textract into the overall solution, data can be collected, stored, processed, and shared easily and more efficiently.

Business Document Processing and Information Automation

While the Human Resource Document Management example above focused on capturing documents individually as they come in, there are many situations where companies need to process documents in bulk.  Using similar AWS services as the previous example, solutions can be designed to allow for batch uploading of documents for processing.  As an example, procurement procedures for large purchases can incorporate a wide variety of documentation which may have vastly different processes associated with them.  By providing a simple way for files to be uploaded in bulk, AWS services can be utilized to sort through the file formats for processing.  Non-text-based image files like JPG, PNG, and PDF files can then be automatically processed by Amazon Textract to capture OCR information, Table data, and Key-value information from forms and then shared with back-office applications, stored in data warehouse services, and/or shared with Amazon Elasticsearch services.  Processing hundreds or even thousands of documents and images a month becomes much easier through automation.  Incorporating Textract into business process work streams ensures that critical information is identified and captured from structured and semi-structured documents reducing the need for manual classification of information to facilitate business operations like Insurance Claims, Legal Processes, Partner Management, Purchasing, and more.

Litigation is disruptive to normal business operations for any company.  Thousands of documents, images, and artifacts have to be reviewed and collected to share with attorneys and the courts during a legal process, which can be time-consuming.  While there are many discovery tools available on the market today to help speed up the process of finding the desired information, they are reliant on information being in a format that the discovery services can handle.  In many cases, important information is stored in pictures and scanned documents that these discovery services cannot easily process.  Amazon’s Textract becomes a valuable tool in the discovery process by allowing organizations to quickly filter through image files and capture OCR information so that it can be indexed and searched.

Litigation isn’t only a headache for companies; it is a headache for the legal teams associated with the litigation process as well.  Imagine a law firm receiving millions of electronic files from a company and having to read through each document to find pertinent information regarding the case they are working on.  This can take months and many resources to complete, time that most lawyers don’t have during a case.  Files may be images, documents, spreadsheets, audio files, and even video files.  All of these need to be processed so that key information can be selected to support the case.  The expense of a large legal process can be staggering due to the sheer amount of manual labor required to gather information.  In the following example, Amazon’s Artificial Intelligence and Machine Learning services, including Textract, are utilized to greatly reduce the processing time for legal discovery.

Amazon Services Utilized:

AWS Transfer for SFTP|Amazon S3|AWS Lambda|Amazon Textract|Amazon API Gateway|Amazon Rekognition|Amazon Comprehend|Amazon Transcribe|Amazon Elasticsearch Service

In this example, a legal firm utilizes the power of AWS Transfer for SFTP services to allow clients and opposing counsel to quickly upload all of their discovery files and documents, which are then automatically stored in Amazon S3.  Files are then sorted based on file types for processing.  Amazon Textract captures OCR information from image files, including table and form data, while Amazon Rekognition analyzes photos and videos to identify the objects, people, text, scenes, and activities, perform facial recognition, and detect any inappropriate content.  Audio and video files are processed through Amazon Transcribe to capture speech-to-text information.  As files are processed, the information is captured and indexed in Amazon Elasticsearch Service to enable rich search functionality for the litigators, and is also processed by Amazon Comprehend to quickly find relationships and insights in all the data collected.
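The sorting step in this flow can be as simple as a lookup on file extension. The sketch below is one hypothetical way to decide which services each uploaded file is sent to. The extension lists and the service labels are assumptions chosen for illustration, not a statement of which formats each AWS service accepts.

```python
from pathlib import PurePosixPath

# Assumed groupings for illustration; adjust to the formats you actually receive.
IMAGE_TYPES = {".jpg", ".jpeg", ".png"}
SCANNED_DOCUMENT_TYPES = {".pdf"}
AUDIO_TYPES = {".mp3", ".wav"}
VIDEO_TYPES = {".mp4", ".mov"}


def services_for(key):
    """Return the processing services an uploaded S3 object should be routed to."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix in IMAGE_TYPES:
        return ("textract", "rekognition")    # text extraction plus image analysis
    if suffix in SCANNED_DOCUMENT_TYPES:
        return ("textract",)
    if suffix in AUDIO_TYPES:
        return ("transcribe",)
    if suffix in VIDEO_TYPES:
        return ("rekognition", "transcribe")  # visual analysis plus speech-to-text
    return ()                                 # already text-based: index directly


for uploaded in ("case-1/photo.JPG", "case-1/deposition.mp4", "case-1/notes.docx"):
    print(uploaded, services_for(uploaded))
```

In a deployed version, a function like this would typically run inside the Lambda function triggered by each upload, with the extracted text from every service then sent on for indexing.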

What would have taken months to sort through and comprehend becomes manageable information in hours or days providing more time for the legal team to focus on winning their case while saving thousands of dollars on the personnel required to manually process all the discovery information.

The tool you didn’t know you needed

Technology is advancing at incredible speeds and new solutions and services are becoming available every day.  Services like Amazon Textract are critical tools in document processing and are rarely thought about but imperative for success.  Of all the services Amazon provides, Amazon Textract is one of the hidden gems that can be easily overlooked but deserves to be part of your processing arsenal.

You are not alone

Business solutions can be complex, but making them work for your requirements doesn’t have to be.  Clearly defining your goals and objectives is half of the battle, the other half is knowing what tools will help you achieve those goals.  Are your off-the-shelf solutions and applications collecting all the information you have?  Do you need a business solution to manage all of your documents and data, but don’t know where to start?  Are you looking to move off of an outdated legacy application that no longer supports your business direction?  You are not alone.  Thousands of companies are facing the same questions and are finding the best answers by engaging with experts from Amazon and experts from solution service providers.  TekStream Solutions, along with AWS, is excited to speak with you about your Information Processing needs and how the right tools and solutions can have a positive impact on how you conduct business.  TekStream Solutions is offering a free Digital Transformation assessment where we will work with you to identify your document processing needs and provide process and technology recommendations to help you transform your business with ease.

 

Want to learn more about Textract? Contact us today!