AWS SQL Server Backup and Restore Automation using SSM and CFN

Harini Muralidharan
7 min readMay 13, 2021

SQL Server is a widely used relational database management system. It is used to store and retrieve data. This article is about how to perform AWS MSSQL Backup and Restore operations (same or cross-account). When I was searching for an article to implement this in one of my use cases, I could not find any source related to AWS and hence I wanted to write this article to share the process followed, issues/errors faced during the implementation.

SQL Server backups include all database objects: tables, indexes, stored procedures, and triggers. These backups are commonly used to migrate databases between different SQL Server instances running either on-premises or in the cloud. They can also be used for data ingestion, disaster recovery, etc.

We use backups to:

  • Migrate databases
  • Refresh development or lower environments from production.
  • Recover from accidental deletes.
  • Import and export data
  • Handle single or multiple database instances
  • Create database copies for testing, training, and demonstrations
  • Disaster recovery enforcement between on-prem and cloud storage.

Architecture:

The above architecture shows the various AWS account segregation and the workflow that occurs across these accounts to achieve backup and restore operation. We can see the utilization of each account and the resources created on those with more details below.

Shared Services Account:

The Shared Services account serves as the central hub to create and manage common AWS resources for processes implemented across multiple AWS accounts. The common resources like the S3 bucket, KMS will be created in this account so that both the production account and workload account can access these resources.

Production Account:

For the use case considered, it is assumed that the primary RDS instance is created in this account from which the backup process will be performed. The backup operation configuration will be done using SSM which will directly upload the backup file to the S3 account in the Shared Services account. The production account also has a lambda function that will update the ACL of backup objects uploaded in real-time based on the EventBridge rule. If the ACL is not updated, the workload account lambda function will throw an “Access Denied” error. Also, this has to be done from the production account and not the shared services account because only the object owner account will have permission to perform this change. The CloudFormation template to create these resources has been mentioned below.

Workload Account:

The workload account is where we are going to perform restoration operations. This can be one or multiple accounts and we will be creating the resources fresh considering it does not exist earlier. The CloudFormation template to create resources related to this is mentioned below.

Assumptions:

  1. There is already an existing primary MSSQL Instance in on-premise or in an AWS account (Production account in this article) and a secondary MSSQL Instance in an AWS account (Workload account in this article).
  2. The primary MSSQL database is accessed using SSM (SQL Server Management Studio) either from local windows machine or remote windows instance like EC2.
  3. The RDS access details and credentials are saved in a Secrets Manager of the respective account, which is encrypted with AWS KMS from the Shared Service account. In case this is not followed, please modify the RDS credentials access function based on your architecture.

Backup Process Configuration in SSM:

The backup process is created as a scheduled event using SSM in the production account database. This is accomplished using SQL Server agent in SSM. Please find below screenshots of step by step by the configuration of the scheduled job.

Step 1: Create a new job for the primary MSSQL server.

Step 2: Name the new job in the General tab.

Step 3: Navigate to the steps tab. Initially, the list will be empty, click on “New” to create a step.

Step 3a: Name the new step and add the backup script which will be executed during the job.

exec msdb.dbo.rds_backup_database
@source_db_name='DBName', @s3_arn_to_backup_to='arn:aws:s3:::s3-bucket/path/file_name.bak',
@overwrite_S3_backup_file=1;

Note: For details regarding the parameters and respective descriptions in the above commands, please visit here.

Step 4: Create a new schedule in the “Schedules” tab.

Step 4a: Configure the scheduling details as per your requirement. In the following screenshot, we have configured the job to happen once every week on Sunday at 12.00 AM.

NOTE: The other options (Notifications and Targets) have not been considered for our use case. But if required, please configure based on your requirement.

Restoration Process and related CFN templates:

Shared Services Account:

The following CFN template is to create resources in the Shared Services account which includes creating theS3 bucket to store all backup files, EventBridge rule to send PutObject, CompleteMultipartUpload, CopyObject events of the above S3 bucket to production and workload accounts, and cross-account EventBridge IAM Role.

Production Account:

The following CFN template is to create the EventBridge rule, which will listen to PutObject, CompleteMultipartUpload, CopyObject events of the Shared Services S3 bucket where the backup object is being stored and to create a lambda function that will update the object ACL. The template does not include updating the default EventBus policy to allow receiving events from the Shared Services account. Please modify the policy change as required.

Following is the db-backup-acl-change.jslambda function which will be compressed as db-backup-acl-change.zip and uploaded to lambda-source-codes-s3 bucket.

Workload Accounts:

The CFN template attached below is for one or more workload accounts that create resources for the restoration process. The resources include EventRule which will listen to PutObject, CompleteMultipartUpload, CopyObject events of the Shared Services S3 bucket where the backup object is being stored and lambda function to perform restoration operation in real-time.

Below is the db-migration-restore.js which is used in the lambda function.

Hola! The automation of an MSSQL database backup and restore is complete. You can see the process happening the next Sunday provided all the permissions are configured as expected.

Errors I came across during this implementation:

  1. Issue: Executing the backup query in the primary database was giving an “Access Denied” exception.
    Possible Reason and Solution:
    i) The SQL RDS Server does not have access to the S3 bucket or associated KMS Key to be able to upload the backup file. For this, please look into the “Options Group” attached to the RDS instance and see if there is an option with a name SQLSERVER_BACKUP_RESTORE . If so, modify the IAM Role associated with it to have required permission and if not, please add a new option to have this name and an IAM Role with the necessary permission.
    ii) As mentioned above, we have common KMS Keys in the Shared Services account. Because of this, I tried to have kms_master_key_arn a parameter in the backup query. It was automatically replacing the account number with the current account number irrespective of what we give in the execution query. With help of the AWS Support team, we have identified that this parameter works only if it is in the same account and if it is a cross-account KMS, we should not have that parameter.
  2. Issue: The EventBridge did not work only with the PutObject event listener.
    Solution: Since the backup file will be large, the backup file upload will happen in parts. Because of this, we have to listen to the CompleteMultipartUpload event rather than PutObject as this event is for one full object upload. For a safe case, I have considered all the possible events in the above CFN templates.
  3. Issue: The EventBridge in the Shared Services account is able to receive the real-time events but not the production or workload accounts.
    Solution: Check the default EventBus policy of all the accounts involved and the cross-account IAM Role created in the Shared Services account to see if all the necessary permissions with correct account numbers are added.
  4. Issue: During the execution of the restoration operation, an error stating “the database already exists”.
    Solution: We cannot perform restoration operations on the existing database. We will have to restore it to a new database and once that is successful, we have to rename the new database by dropping the old one. This logic is covered in the restoration lambda function above.
  5. Issue: Lambda not able to connect to the database instance.
    Solution: The RDS instance might be in a separate or private VPC. Configure the lambda VPC and security group to make sure it has a connection to the database.
  6. Initially, I tried to have the implementation of backup operation using a lambda function. As the maximum time limit of a lambda function is 15 minutes, the backup operation was not successfully happening within the time frame. Hence the usage of SSM (SQL Server Management Studio) for the backup process.

--

--