Making the Invisible Visible: Tracking Errors and Debugging in Splunk SOAR
By David Burns, Team Lead, Automation Engineering
When you’re automating security workflows with Splunk SOAR (used to be Phantom, for those of y’all who’ve been around a while), stuff’s bound to go sideways now and then. That’s just part of the game.
Whether it’s a bad API token, a mistyped field, or an app that just decides to take the day off, knowing when things fail (and why) is what helps you get better. Lucky for us, SOAR gives you some solid tools to dig into those errors and make things right.
1. Action Run History: For When It’s One Particular Thing Causing Trouble
Sometimes your playbook as a whole looks like it ran okay, but one of the actions inside quietly failed. That’s where the Action Run History comes in, found under Administration -> System Health -> Action Run History.
You can pull up:
- All failed actions, filtered by action name, along with their container IDs and playbook names
- The exact error messages (straight from the app or integration)
- The inputs that were passed in when the action failed
Say your hunt_ip action works fine on VirusTotal but fails for Palo Alto — this view will tell you exactly what went wrong and where, no guesswork needed.
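If you’d rather script that lookup than click through the page, the same data is exposed over SOAR’s REST API. Here’s a minimal sketch; the hostname, token, exact status string, and the field names (action, container, message) are assumptions you should spot-check against what your version actually returns.

import requests

SOAR_BASE = "https://soar.example.com"              # hypothetical hostname
HEADERS = {"ph-auth-token": "<automation-token>"}   # an automation user's API token

# _filter_* parameters use the REST API's Django-style filter syntax;
# string values get wrapped in double quotes.
params = {
    "_filter_status": '"failed"',   # exact status string may differ by version
    "page_size": 50,
    "sort": "start_time",
    "order": "desc",
}
resp = requests.get(f"{SOAR_BASE}/rest/action_run", headers=HEADERS,
                    params=params, verify=False)    # verify=False only for self-signed lab boxes
resp.raise_for_status()

for run in resp.json().get("data", []):
    # 'action', 'container', and 'message' are the fields I'd expect; verify on your instance.
    print(run.get("action"), run.get("container"), run.get("message"))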
2. Playbook Run History: Your First Clue When Things Go Sideways
Every time a playbook runs, SOAR keeps a record of what happened, whether it works or blows up. That’s what your Playbook Run History is for, and it’s about the best place to start when something ain’t behaving like it should. You’ll find the list, sorted with the most recent runs first, under Administration -> System Health -> Playbook Run History.
With the run id you can:
- Step through each block to see where it choked
- Check out the input and output of every little piece
- Even replay the whole thing with the same data
It’s especially handy when your custom code is acting up; it’s a lot easier to catch a NoneType error here than to chase it through the logs.
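And if you want the raw record behind a run id (say, to pull the traceback out of its message field), a quick REST call does the trick. As above, the hostname, token, and field names here are assumptions to double-check against your own instance.

import requests

SOAR_BASE = "https://soar.example.com"              # hypothetical hostname
HEADERS = {"ph-auth-token": "<automation-token>"}   # automation user's API token
RUN_ID = 12345                                      # the run id from Playbook Run History

resp = requests.get(f"{SOAR_BASE}/rest/playbook_run/{RUN_ID}",
                    headers=HEADERS, verify=False)
resp.raise_for_status()
run = resp.json()

# status, message, container, and start_time are the fields I'd expect;
# when custom code blows up, 'message' usually carries the traceback.
for key in ("status", "message", "container", "start_time"):
    print(f"{key}: {run.get(key)}")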
3. Common Gotchas (and How to Stay Ahead of ’Em)
Some errors pop up more often than others; we’ve all seen ’em:
- Timeouts or app servers that just ghost
- Bad credentials or expired tokens
- “Expected string but got int” kind of errors from JSON parsing
A few handy tips (with a quick code sketch after the list):
- Use phantom.debug() generously in your custom code. Future you will thank past you
- Wrap fragile stuff in conditionals or error handling branches
- Tag containers when something fails so you can come back and give ‘em some love
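To make those tips concrete, here’s a rough sketch of how a custom code block might wrap fragile parsing. The action name (hunt_ip_1) and the datapaths are made up for illustration; phantom.debug() and phantom.error() are the standard playbook API calls.

import json

import phantom.rules as phantom


def parse_lookup_results(action=None, success=None, container=None, results=None, handle=None, **kwargs):
    phantom.debug("parse_lookup_results() called")

    # collect2() pulls values out of a prior action's results; this datapath is hypothetical.
    lookup_data = phantom.collect2(
        container=container,
        datapath=["hunt_ip_1:action_result.data.*.response", "hunt_ip_1:action_result.parameter.ip"],
        action_results=results,
    )

    for response, ip in lookup_data:
        phantom.debug(f"raw response for {ip}: {response!r}")   # future you will thank past you
        try:
            parsed = json.loads(response) if response else {}
            phantom.debug(f"verdict for {ip}: {parsed.get('verdict', 'unknown')}")
        except (TypeError, ValueError) as err:
            # Don't let one bad record kill the whole run; log it and keep going.
            phantom.error(f"could not parse response for {ip}: {err}")
            # This is also a good spot to tag the container for follow-up,
            # if your version exposes a tagging call (e.g. phantom.add_tags).
    return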
4. Automating Error Visibility: Set It and Forget It (Mostly)
Now this part right here’s where things really start to shine. Sure, you can poke around in the UI all day to see what went wrong, but if you’re like most folks I know, you’d rather have the system keep an eye on itself.
So let’s talk about setting up a scheduled playbook that’ll do just that:
A little automation to watch your automation.
Here’s how the playbook could flow:
Step 1: Look back at recent Playbook runs
Set a time window (say, the last 24 hours) and pull in playbook runs that ended in error (a rough sketch follows the list below).
For each one, grab:
- The playbook name
- Container ID
- Timestamp
- The top-level error message
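Here’s what that query could look like as a standalone script (or the guts of a custom function). The hostname, token, status string, and field names are assumptions, so verify them against your instance before trusting the output.

import requests
from datetime import datetime, timedelta, timezone

SOAR_BASE = "https://soar.example.com"              # hypothetical hostname
HEADERS = {"ph-auth-token": "<automation-token>"}   # automation user's API token


def failed_playbook_runs(hours_back=24):
    since = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "_filter_status": '"failed"',               # exact status string may vary by version
        "_filter_start_time__gt": f'"{since}"',     # Django-style lookup on start_time
        "page_size": 200,
        "sort": "start_time",
        "order": "desc",
    }
    resp = requests.get(f"{SOAR_BASE}/rest/playbook_run", headers=HEADERS,
                        params=params, verify=False)
    resp.raise_for_status()

    report = []
    for run in resp.json().get("data", []):
        report.append({
            "playbook": run.get("playbook"),        # numeric id; resolve the name via /rest/playbook/<id> if you need it
            "container": run.get("container"),
            "start_time": run.get("start_time"),
            "message": run.get("message"),          # top-level error, often the traceback
        })
    return report


if __name__ == "__main__":
    for row in failed_playbook_runs():
        print(row)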
Step 2: Check the Action Runs too
Sometimes the playbook might limp along even if an action inside it failed.
So next, you query action runs over the same time range, filtering for runs with a failed status (there’s a matching sketch after the list below).
For each of those, collect:
- App and action names
- Error message (this one’s usually straight from the app or integration)
- Parameters that were passed in (helps with troubleshooting)
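Here’s the matching sketch for action runs; same caveats as before, and the "targets" field in particular is an assumption about where the app, asset, and parameter detail lives, so check what your version actually returns.

import requests
from datetime import datetime, timedelta, timezone

SOAR_BASE = "https://soar.example.com"              # hypothetical hostname
HEADERS = {"ph-auth-token": "<automation-token>"}   # automation user's API token


def failed_action_runs(hours_back=24):
    since = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "_filter_status": '"failed"',
        "_filter_start_time__gt": f'"{since}"',
        "page_size": 200,
    }
    resp = requests.get(f"{SOAR_BASE}/rest/action_run", headers=HEADERS,
                        params=params, verify=False)
    resp.raise_for_status()

    report = []
    for run in resp.json().get("data", []):
        report.append({
            "action": run.get("action"),
            "container": run.get("container"),
            "message": run.get("message"),          # usually comes straight from the app
            "targets": run.get("targets"),          # assumption: app/asset/parameter detail lives here
        })
    return report


if __name__ == "__main__":
    for row in failed_action_runs():
        print(row)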
Step 3: Correlate and clean it up for human eyes
Bundle the results together in a simple table or CSV (sketched out after this list):
- Group by playbook or app if you want to spot trends
- Add container links so folks can jump straight into triage
- If possible, include a short note like “Token expired” or “Timeout after 60s” to save folks the trouble of reading a mile of logs
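Here’s one way that cleanup step might look, sketched as a small CSV writer. The /mission/<container_id> link format and the keyword-based notes are assumptions and rough heuristics, not anything official.

import csv

SOAR_BASE = "https://soar.example.com"   # hypothetical hostname


def short_note(message):
    """Boil a long error message down to something a human can skim."""
    msg = (message or "").lower()
    if "token" in msg or "unauthorized" in msg or "401" in msg:
        return "Likely bad or expired credentials"
    if "timeout" in msg or "timed out" in msg:
        return "Timeout"
    return (message or "")[:80]          # first 80 characters beats a mile of logs


def write_failure_report(rows, path="soar_failures.csv"):
    """rows: dicts with source, name, container, start_time, and message keys."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source", "name", "container_link", "start_time", "note", "message"])
        for r in sorted(rows, key=lambda r: (str(r.get("source", "")), str(r.get("name", "")))):
            writer.writerow([
                r.get("source"),
                r.get("name"),
                f"{SOAR_BASE}/mission/{r.get('container')}",   # assumed container URL format
                r.get("start_time"),
                short_note(r.get("message")),
                r.get("message"),
            ])


if __name__ == "__main__":
    sample = [{"source": "playbook", "name": "enrich_ip", "container": 101,
               "start_time": "2024-05-01T12:00:00Z", "message": "Request timed out after 60s"}]
    write_failure_report(sample)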
Step 4: Send it somewhere useful
Now you’ve got options (one of them sketched below):
- Email it to your SOC’s distro list
- Post it to Slack or Teams with some friendly wording like:
“Here’s the list of failed playbook and action runs from the last day. Please review and isolate before they pile up.”
- Push it into Splunk Core as a custom log event for trending over time
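As an example of the Slack option, here’s a bare-bones sketch using a plain incoming webhook (the webhook URL is a placeholder); an email or a Splunk HTTP Event Collector post would follow the same basic shape.

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"   # placeholder webhook URL


def post_failure_summary(rows):
    """rows: dicts with source, name, container, and note/message keys from the earlier steps."""
    if not rows:
        return
    lines = [f"*{len(rows)} failed SOAR runs in the last day.* "
             "Please review and isolate before they pile up."]
    for r in rows[:20]:                              # cap it so the message stays readable
        lines.append(f"- {r.get('source')} `{r.get('name')}` on container {r.get('container')}: "
                     f"{r.get('note') or r.get('message')}")
    resp = requests.post(SLACK_WEBHOOK, json={"text": "\n".join(lines)}, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    post_failure_summary([{"source": "action", "name": "hunt_ip",
                           "container": 101, "note": "Timeout after 60s"}])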
Bonus tip:
Add logic to flag repeat offenders: apps or actions that keep failing more than a couple of times a day. That’s usually a sign something needs tuning (or rebooting, let’s be honest).
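A tiny sketch of that repeat-offender check, assuming the rows carry the source and name keys built up in the earlier steps; the threshold is just illustrative.

from collections import Counter


def repeat_offenders(rows, threshold=3):
    """rows: dicts with 'source' and 'name' keys, as built in the earlier steps."""
    counts = Counter((r.get("source"), r.get("name")) for r in rows)
    return [(source, name, n) for (source, name), n in counts.items() if n >= threshold]


if __name__ == "__main__":
    sample = [{"source": "action", "name": "hunt_ip"}] * 4 + [{"source": "playbook", "name": "enrich_ip"}]
    for source, name, n in repeat_offenders(sample):
        print(f"{source} '{name}' failed {n} times; probably needs tuning (or rebooting)")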
Conclusion: Failures Ain’t the End. They’re the Start of Getting Better
Errors don’t mean your automation’s broken; they just mean your system’s trying to tell you something. The trick is listening.
By keeping tabs on playbook and action run histories and setting up a little auto-monitoring on the side, you can spot problems before they snowball, fix what needs fixin’, and build a system that only gets smarter with time.
Treat your errors like feedback, not setbacks. That’s how you go from running a few playbooks to running a whole dang SOC on cruise control.
About the Author
David Burns is a security engineer with experience working with Splunk Enterprise Security and Splunk SOAR (formerly Phantom) for a large Fortune 200 bank. Before that, he was a System Security Engineer working on automating security testing for OT systems. He brings his 20+ years of programming experience and SDLC practices to the rapid development of playbooks, custom functions, and more, leading to modularity, design re-use, and better long-term maintenance; examples include building deeper Slack integration for escalation and EDL management for multiple clients. At TekStream, he developed the Slack escalation methodology that notifies customers of events needing their attention, as well as a process for generating and updating EDLs within Splunk.
