Splunk Migration: On-Prem to Cloud
You’re starting fresh! You’re signing up for OPC (other people’s computers) and putting your admin burden on Splunk to manage. You’re freeing your Splunk staff to focus on the business of Splunk: hunting threats, monitoring infrastructure, and the real-time examination of critical business processes. Good for you. Now you have to lift five years of Splunk usage into the cloud, hmmm. This is a high-level overview of the steps that we take when migrating on-premises applications to a Splunk Cloud environment. The process is broken into two major categories: migrating functionality (applications) and migrating data. Our process assumes Unix scripting utilities; analogs exist in PowerShell, but we did not do the translation.
You know there’s an app for that… That is to say, there is an app for migrating your data from your existing on-prem indexers to a cloud environment. It’s part of a professional services solution that allows you to asynchronously replicate data from your indexes into Splunk Cloud. The approach moves your data up to Splunk Cloud in the background, ensuring that there are no gaps in the transition. It’s actually pretty elegant: it leverages SmartStore to copy all of your data to S3 (encrypted using server-side keys), and then it bootstraps the Cloud environment that shares those S3 buckets to wake up and recognize all that S3 goodness in your Splunk Cloud instance. Basically, you drain your data until you’re caught up, then flash cut your forwarders to point all new data to your Splunk Cloud environment, and boom, you’re in the cloud! Expect it to take several days if you have more than a few terabytes, and leverage a big pipe for the upload if you have one. Also, make sure that you validate the keys you were given for S3 by running the AWS CLI with the newly supplied keys and confirming that you have visibility to the targeted S3 environment. For more on SmartStore see here. If you’re moving your data to your own AWS environment (a non-Splunk cloud target), you can use the same approach by creating S3 volumes, granting access to those volumes by generating AWS server-side keys, and altering your indexes to be SmartStore indexes. Our focus here is specifically on Splunk Cloud, so we will assume you’ve engaged services and continue on to discuss what we can do to package applications.
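The key check described above could be scripted along these lines. This is only a minimal sketch: the helper name check_s3_access and the bucket name are made up for illustration, and it assumes the supplied keys have already been exported as the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables that the AWS CLI reads.

```shell
#!/usr/bin/env bash
# Sketch: confirm the supplied keys can list the target S3 bucket.
# check_s3_access and the bucket argument are hypothetical names; the AWS CLI
# picks up the keys from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
check_s3_access() {
  local bucket="$1"
  if [ -z "$bucket" ]; then
    echo "usage: check_s3_access <bucket-name>" >&2
    return 2
  fi
  # "aws s3 ls" exits non-zero when the keys are rejected or the bucket is not visible.
  if aws s3 ls "s3://${bucket}/" >/dev/null 2>&1; then
    echo "OK: keys can list s3://${bucket}/"
  else
    echo "FAILED: cannot list s3://${bucket}/" >&2
    return 1
  fi
}
```

Calling check_s3_access with the bucket name you were given, before you touch indexes.conf, is a cheap way to catch a mistyped or expired key early.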
Now, what about your apps? What about your users? How do you move all of the years of effort your team has built into searches and dashboards and knowledge objects? Moreover, some of those searches and alerts are stale; what is worth saving? Migrating searches from defunct users will result in orphaned objects. The objects still exist in the system, but anything reliant on those knowledge objects will most likely break. User grouping ensures that these orphaned objects are always accessible to users that need them and thus prevents anything dependent on them from breaking.
Typical Migration Challenges
Some of those older on-prem users might not be a part of our Splunk team now, but we might still be dependent upon the alerts they built. For that matter, we are not moving everyone all at once to Splunk Cloud, so not everyone will have access on day one. We can’t require every user to recreate every search or dashboard that they’ve built in the new environment; that is going to make for some very unhappy users, and some of them are people that we presumably don’t like making unhappy. In a typical on-prem-only migration, if you move your user directories over and the users don’t exist in the new environment, you’ll have to clean up orphaned search errors. You can reassign orphaned knowledge objects en masse through the UI to an admin user, but you may still have to deal with some of them on a case-by-case basis. More on resolving orphaned objects here.
If you didn’t consolidate applications and decided to move users wholesale into the new target environment, you run a small risk of having a knowledge object or search with the same name and a different scope (e.g. a user creates a search in their directory and then creates a search with the same name in an application context). This can happen in the migration or merge of multiple environments, especially if the users are the same. Obviously, moving users to Splunk Cloud is only an option for Splunk Cloud ops or consultants with command line access. In this case, there is the potential for overlay or confusion in a migration. We’ve seen a search get deleted in the GUI in a migrated environment because it was orphaned (a defunct user) and throwing errors, but the errors remained because it was deleted in apps but it still existed in another scope. In Splunk Cloud there is only the UI, so there is no ability to look for assets defined in conf files behind the scenes. In any case, we are not moving user directories, as the whole point of our approach is to consolidate user assets into applications.
User Directories Don’t Migrate to Cloud
Applications are the container for promotion of functionality to Splunk Cloud. User directories are typically not migrated. Anything custom in an on-prem environment is packaged into an app and imported into the cloud environment. As a result, many people take the approach of not migrating the user directories at all; if you want to save any of the work that you’ve accumulated in Splunk, hand jam away: copy up searches and recreate them, copy the XML behind your dashboards, etc. The other extreme is to use “splunk btool … | grep local” and append everything in a local directory into a single .conf file – the not so pretty sledgehammer. Our approach is somewhere in the middle.
Approach: User Grouping and Consolidation
We take the approach of consolidating user knowledge objects into a “migration” application for promotion. It’s a pretty simple matter of copying user directories from your search head(s) and recursively appending all of the relevant .conf files (eventtypes, tags, savedsearches, etc.) and dashboards into a shared application structure for promotion. There is some up-front work that needs to happen with stakeholders in preparation and some expectation setting.
If you’re in a small environment, and all users are equal, then you simply have a single migration application for the purposes of bundling user-specific assets. If you’ve got multiple constituents that can be logically grouped, particularly if you’ve got sensitive data (e.g. PCI, PII, PHI, etc.), then grouping becomes important. Organizing users into groups can help democratize information across a pre-existing organization. Private dashboards, for example, will become visible to the rest of a group where before they were private – specific to a user. This approach to permissions also alleviates stress when maintaining permissions later and setting them up in the migration (i.e. 10 group permissions vs. 1000 individuals). Permissions for specific individuals can always be changed later and permission inheritance can help with that process. It is worth communicating with the users that moving and cloning searches from the shared application context is a part of the cleanup process, particularly with sensitive searches.
You’ll need to spend some time logically grouping users together. You’ll need to define local permissions to ensure that dashboards only get seen by users within that group/role. The same holds true for index permissions. Moving to cloud is a great opportunity to refine permissions and roles, so the exercise has value beyond the migration itself. Moving to a new environment also provides an opportunity to clean up any number of historical bad practices, from removing unnecessary real-time searches (disabled by default in Splunk Cloud, but you can ask support to change that if necessary) to establishing new naming conventions.
If you get the inkling that nobody knows what’s in their user directories and a lot of the knowledge objects are not in use, then put a Time To Live on your consolidated migration application(s). Meaning, the consolidated applications targeted to house all user objects have a life expectancy before they will be retired. Users must move the functionality to another application or it will be removed. It’s a good way to methodically get rid of deadwood.
First and foremost – backups. Make sure that you take backups according to whatever method is most effective and that they’re valid. This migration process is one-way. After the process is complete, you’ll be decommissioning the old site. Splunk Cloud only allows access to the keys to the S3 environment for a certain duration, and then they expire. There are challenges inherent in having two sets of indexers (on-prem and Splunk Cloud) writing to the S3 store for a prolonged period of time. When you have multiple clusters managing the same remote storage, none of them acts as the official system of record, and they are both applying rules that can affect the handling of that S3 storage. One cluster may freeze buckets from underneath the other one, causing a state mismatch on the other cluster and raising S2 problems (i.e. an indexer may think that a bucket is “stable”, meaning it thinks the bucket is on S3, while suddenly it is not).
If possible, you should disable freezing of buckets, by pushing the following settings onto all indexes of the indexers (in indexes.conf):
maxGlobalDataSizeMB = 0
frozenTimePeriodInSecs = 0
After the migration is complete and you are left with only one cluster pointing to the remote storage, you should remove the config settings above. So, you are essentially changing your indexes into SmartStore indexes and then allowing the Splunk Cloud environment to own them. It’s not destructive, but it’s also not reversible. You don’t want to turn off SmartStore once it has been enabled. Once you commit to the migration, you are committed. Data will not be searchable from on-prem once the migration has been completed and the AWS S3 keys have been revoked. I haven’t seen the method fail, and if you trust Splunk Cloud with your data generally speaking, the migration process uses the same safe and well-proven storage mechanisms that Splunk SmartStore uses in general.
Make sure that before the process begins, you have gone through the process of obtaining AWS keys through Splunk Professional Services or Splunk Cloud Support with an expiration that takes you a couple of weeks beyond your targeted cutover date (accounting for the actual time that it will take to copy up all of your indexer buckets into S3). Times will vary so make sure you conservatively estimate your upload speeds. Estimating the upload time is a simple matter of taking your total data volume (sum total of space allocated to your index primary buckets) divided by your upload speed. If you are going through an intermediary, say with uploading an AWS environment to Splunk Cloud, then you are constrained by the upload speed of that single instance.
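The estimate above is simple arithmetic, and a rough sketch of it follows. The helper name estimate_upload_days is hypothetical, and it assumes you supply total volume in decimal gigabytes and a sustained upload rate in megabits per second. It ignores protocol overhead and contention, so treat the result as a floor, not a promise.

```shell
#!/usr/bin/env bash
# Sketch: estimate upload time in days from data volume and sustained throughput.
# estimate_upload_days is a hypothetical helper; inputs are total data in GB
# and sustained upload speed in megabits per second (Mbps).
estimate_upload_days() {
  local total_gb="$1" mbps="$2"
  awk -v gb="$total_gb" -v mbps="$mbps" 'BEGIN {
    if (mbps <= 0) { print "upload speed must be positive" > "/dev/stderr"; exit 1 }
    megabits = gb * 1000 * 8        # GB -> megabits (decimal units)
    seconds  = megabits / mbps      # transfer time at the sustained rate
    printf "%.1f\n", seconds / 86400
  }'
}
```

For example, estimate_upload_days 5000 100 works out to a little over four and a half days, which is the kind of number you want in hand before you ask for a key expiration date.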
Set up some baseline searches. You want historical counts by sourcetype or source, current counts for a specific time period at some point in the past so that you can replicate that search after the move. You’ll also want counts of forwarders to verify after the cutover. Enlist key stakeholders in validation of dashboards and knowledge objects prior to cutover. There is always the risk that your Splunk users are not as familiar with the current state of Splunk data as they should be. You want to get their buy-in on the move and they need to trust the data post-migration. You might also look for any error messages or warnings you are getting in your environment by searching _internal (there are always a few). You can verify whether any new error messages you’re getting after the move are truly new errors.
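One way to make the before/after comparison repeatable is to export the baseline counts to CSV from both environments and diff them with a small script. The sketch below assumes a hypothetical export format of a header row followed by sourcetype,count lines; the function name compare_counts is also made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare per-sourcetype event counts exported before and after cutover.
# Assumes two CSV files with a header row and columns "sourcetype,count"
# (hypothetical export format; sourcetype names must not contain commas).
compare_counts() {
  local before="$1" after="$2"
  awk -F',' '
    FNR == 1 { next }                       # skip the header row of each file
    NR == FNR { base[$1] = $2; next }       # first file: remember baseline counts
    { seen[$1] = 1
      if (!($1 in base))        print "NEW " $1 " " $2
      else if (base[$1] != $2)  print "MISMATCH " $1 " " base[$1] " " $2 }
    END { for (s in base) if (!(s in seen)) print "MISSING " s " " base[s] }
  ' "$before" "$after" | LC_ALL=C sort
}
```

Empty output means the counts line up for the chosen time window. Anything reported as MISSING or MISMATCH is worth walking through with the stakeholders who own that data.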
Create a barebones app target as a container for your grouped user apps – you might have to include additional object permissions if dealing with sensitive searches. This is the shell for your application targets. You can use this template as the basis or target for migration app(s) in the consolidation of user assets.
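A barebones app shell of this kind might be generated as follows. This is a sketch under assumptions: the helper name create_migration_app, the app name, and the role names are hypothetical, and the app.conf and default.meta contents are a minimal starting point that you would adjust to your own roles. It writes to default and metadata/default.meta because, as far as we know, AppInspect objects to packaged apps that ship local directories.

```shell
#!/usr/bin/env bash
# Sketch: create an empty "migration" app shell for one user group.
# create_migration_app, the app name, and the role name are hypothetical.
create_migration_app() {
  local app_root="$1" app_name="$2" role="$3"
  local app_dir="${app_root}/${app_name}"
  mkdir -p "${app_dir}/default/data/ui/views" "${app_dir}/metadata"

  # Minimal app.conf so the app shows up with a label and version.
  cat > "${app_dir}/default/app.conf" <<EOF
[install]
is_configured = 0

[ui]
is_visible = 1
label = ${app_name}

[launcher]
description = Consolidated user objects for migration
version = 1.0.0

[package]
id = ${app_name}
EOF

  # Restrict read access to the group's role and keep objects app-scoped.
  cat > "${app_dir}/metadata/default.meta" <<EOF
[]
access = read : [ ${role} ], write : [ admin, sc_admin ]
export = none
EOF
}
```

Running it once per user group, for example create_migration_app /tmp/apps migration_finance finance_users, gives you one empty container per group to consolidate into.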
Now timing for the application migration is important. If you have kicked off your data migration, you’ll want to time the app migration to be at the end of that process. Your users are still using the on-prem environment and happily creating searches, knowledge objects, and dashboards. You want to capture as much of those assets as you can so you should plan to migrate apps after the data move is complete or close to complete.
Create directories for the purposes of staging/copying user directories for each logical grouping of users (or all users if no grouping/categorization is necessary). You can loop through a list of usernames and copy each user’s directory in turn. Something like:
for user in $(cat users.txt); do sleep 1; echo "copying $user"; cp -rf "/opt/splunk/etc/users/$user" /target_user_directory/; done
You can do a find command to get all directories with specific conf files (could run over all directories if you looked for the file in the context of the script). The same can be done to copy over all the xml files from dashboards. Of course if an asset from one user (e.g. saved search, knowledge object, dashboard) is named the same as one from another, there is the potential for overlay or replacement. If you are appending the results of all of these files to one another (below), you may find you have a duplicate stanza that will result in unpredictable behavior. You might look for duplicates in the resulting .conf files for this eventuality. Grepping for the stanza names (anything starting with “\[“) is the easiest way to identify duplicates.
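The find, append, and duplicate check described above might look something like this. It is a sketch, not our exact script: the helper names consolidate_conf and find_duplicate_stanzas are made up, and it assumes the staged user directories keep the usual user/app/local layout.

```shell
#!/usr/bin/env bash
# Sketch: append one conf file from every staged user directory into the
# migration app, then list stanza names that occur more than once.
# consolidate_conf and find_duplicate_stanzas are hypothetical helper names.
consolidate_conf() {
  local staging_dir="$1" conf_name="$2" target_file="$3"
  : > "$target_file"                                    # start with an empty target
  find "$staging_dir" -type f -name "$conf_name" -path '*/local/*' | sort |
  while IFS= read -r conf; do
    printf '# source: %s\n' "$conf" >> "$target_file"   # keep provenance as a comment
    cat "$conf" >> "$target_file"
    printf '\n' >> "$target_file"
  done
}

find_duplicate_stanzas() {
  # Stanza headers are lines that start with "[".
  grep -h '^\[' "$1" | sort | uniq -d
}
```

The source comments make it easy to trace a duplicate stanza back to the two users who own it, so you can rename one of the objects before import.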
In order to move your apps over to Splunk Cloud, you will want to run them through AppInspect to ensure they are compatible with the environment. The AppInspect tool will go through your apps’ contents and will return any failures/warnings that it found (complete with line numbers and stanza headings for easy debugging). You can either download the tool as a python script to run locally on the command line or you can use the Halifax website to verify your apps without installing anything (one at a time). Links are provided below. You’re going to run appinspect for cloud validation with a format like:
splunk-appinspect inspect path/to/splunk/splunk_app.tgz --mode precert --included-tags cloud
Correcting errors from the results of appinspect is usually pretty straightforward as the error message itself is somewhat self-explanatory and most warnings are not going to keep you from import. Once errors are resolved, you can try your import of the app directly in the Splunk GUI. If you’re having challenges at that point, it will require help from the Cloud Support team.
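AppInspect takes a packaged app, so the consolidated app directory has to be turned into a .tgz first. A minimal sketch of that packaging step follows; package_app is a hypothetical helper and the excluded file patterns are just common offenders, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Sketch: package an app directory as a .tgz suitable for AppInspect.
# package_app is a hypothetical helper; paths are examples only.
package_app() {
  local app_dir="${1%/}" out_dir="$2"
  local app_name parent
  app_name="$(basename "$app_dir")"
  parent="$(dirname "$app_dir")"
  # Archive from the parent so the tarball has a single top-level app folder.
  # COPYFILE_DISABLE keeps macOS tar from adding ._ resource-fork files.
  COPYFILE_DISABLE=1 tar -czf "${out_dir}/${app_name}.tgz" \
    -C "$parent" --exclude='.DS_Store' --exclude='*.pyc' "$app_name"
  echo "${out_dir}/${app_name}.tgz"
}
```

The function prints the path of the tarball it created, which can then be handed to splunk-appinspect.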
Additional Areas to Look Out For
- After you migrate your data over to the new environment, you might encounter “localized bucket” errors.
This error is most likely caused by the bucket rolling to frozen during the migration (if you haven’t disabled freezing), and the longer the process takes, the more buckets will roll to frozen. If you run across this error, check the epoch time range in the bucket’s name against your retention settings to confirm that it was rolled to frozen. You can also search for the freeze event in the _internal index by running:
index=_internal sourcetype=splunkd component=BucketMover "Will attempt to freeze"
You might also check that the results for the index in question are identical on both the new and old environments. You will get the error on the new Splunk Cloud environment even if your results are the same because it doesn’t recognize the frozen rules from your on-premises environment – it’s looking for that remote bucket that just disappeared. After confirming that the bucket can be removed, contact Splunk support to request that they remove the bucket.
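The bucket name check can be done by hand, but a small helper makes it less error-prone. This sketch assumes bucket directory names of the form db_<newest_epoch>_<oldest_epoch>_<id>, optionally followed by a GUID, and compares the newest event time against the index’s frozenTimePeriodInSecs. The function name bucket_past_retention is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide from a bucket directory name whether retention would have frozen it.
# Assumes names of the form db_<newest_epoch>_<oldest_epoch>_<id>[_<guid>];
# bucket_past_retention is a hypothetical helper.
bucket_past_retention() {
  local bucket="$1" retention_secs="$2" now="${3:-$(date +%s)}"
  local newest
  newest="$(printf '%s\n' "$bucket" | awk -F'_' '{ print $2 }')"
  case "$newest" in
    ''|*[!0-9]*) echo "cannot parse bucket name: $bucket" >&2; return 2 ;;
  esac
  # A bucket is eligible to freeze once its newest event is older than the retention period.
  if [ $(( now - newest )) -gt "$retention_secs" ]; then
    echo "frozen-eligible"
  else
    echo "within-retention"
  fi
}
```

The optional third argument lets you evaluate the bucket as of a past moment, such as the time the error first appeared, instead of now.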
- Time it takes to migrate and allocating adequate upload.
The time it takes for your data to migrate will vary depending on network speed, bandwidth, data volume, number of indexers, and index replication factors. The progress can be monitored with SPL searches in the GUI:
index=_internal sourcetype=splunkd source=*splunkd.log "CacheManager - action=upload, cacheId=" action=upload status=succeeded
You can calculate how long the migration will take after about an hour. If network traffic is a concern, you can lower the upload speed at the cost of time. You can check the logs to see if the
- Plan ahead – share indexer IPs and your stack name with your services counterparts to get the process rolling. They will be creating keys specific to the shared AWS/S3 environment, and those keys will need to be applied to your indexers to initiate the process. Make sure you upgrade your environment to a post-7.3 version of Splunk.
Preparation will save time when migrating. It is best to start migrating the data as early as possible because you can work on other tasks while the data migrates (there is only a down time of 45 minutes when the process begins), but if you do it later you might be stuck waiting without other tasks to tackle. The old environment can be bootstrapped later to collect recent data. Bootstrap the old environment after cutting over the forwarders to guarantee all your data is accounted for. If all of this is done ahead of time, the data can begin migration immediately upon project inception, while you address application migration in parallel.
Communicate to end users in preparation for the move:
- What to expect during the migration (e.g. 45-minute lapse in complete search results during initial setup), and how your key Splunk users will be involved to validate the results.
- Changes for moving to Splunk Cloud that may affect them (e.g., no more real-time searches, no command line access).
- Relative timeframes for the move and the types of error messages they may encounter in the new environment temporarily (e.g., “localized bucket” errors).
Migration takes a little planning but it really isn’t terribly complex and it is highly reliable using this methodology.