By Matthew Crockett, Synthesis Digital and Emerging Tech Principal Projects Custodian
Modernising a Microsoft workload and deploying it to AWS will often include migrating the database from an on-prem Microsoft SQL Server to Amazon RDS, with a wide choice of database engines to pick from.
Monitoring these databases can be hard for non-DBA folks such as myself, but I have picked up a few tips and tricks through trial by fire, and I would like to share them so that others don't face the same pain that I have.
What is RDS?
Amazon RDS is a managed service provided by AWS to host relational databases in the public cloud, with a choice of DB engines, found here. Some of note are Amazon Aurora, Oracle and Microsoft SQL Server. Amazon Aurora is a relational database built for the cloud, with support for MySQL and PostgreSQL, that is highly performant and reliable. Microsoft SQL Server is not new to the database world, but its availability inside Amazon RDS means that workloads making use of SQL Server-specific features can be migrated to AWS and still benefit from a managed DB layer.
There are many considerations when picking the instance type and size for your RDS instance, and the choices made have a bigger impact on the performance of the database than meets the eye; more on this later.
Monitoring tools in AWS
By default, CloudWatch is all you have to work with, and it is a very good place to start: you can monitor most, if not all, of the metrics you would be interested in, including CPU utilization, free memory, DB connections, IOPS and many more. These metrics can be plotted on graphs and added to a CloudWatch dashboard, allowing quick access to the data. At a minimum, this dashboard should include the metrics mentioned above, as well as the latency and throughput metrics that will highlight any issues.
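To make this concrete, below is a minimal sketch of building such a dashboard definition programmatically. The metric list and layout are my own choices, and the instance identifier `my-sqlserver-db` is hypothetical; the dashboard body would be passed to CloudWatch's `put_dashboard` API (shown commented out, since it needs AWS credentials).

```python
import json

# Metrics worth tracking for an RDS instance (CloudWatch namespace AWS/RDS).
RDS_METRICS = [
    "CPUUtilization", "FreeableMemory", "DatabaseConnections",
    "ReadIOPS", "WriteIOPS", "ReadLatency", "WriteLatency",
    "ReadThroughput", "WriteThroughput",
]

def build_rds_dashboard(db_instance_id: str, region: str = "us-east-1") -> dict:
    """Build a CloudWatch dashboard body with one graph widget per metric."""
    widgets = []
    for i, metric in enumerate(RDS_METRICS):
        widgets.append({
            "type": "metric",
            "x": (i % 3) * 8, "y": (i // 3) * 6,  # lay out 3 widgets per row
            "width": 8, "height": 6,
            "properties": {
                "metrics": [["AWS/RDS", metric,
                             "DBInstanceIdentifier", db_instance_id]],
                "region": region,
                "title": metric,
                "stat": "Average",
                "period": 60,
            },
        })
    return {"widgets": widgets}

body = build_rds_dashboard("my-sqlserver-db")  # hypothetical instance id
# To create it for real (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="rds-monitoring", DashboardBody=json.dumps(body))
print(len(body["widgets"]))
```

Keeping the dashboard in code like this also means it can live in version control alongside the rest of the infrastructure.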
Enhanced Monitoring gathers metrics in real time from the OS (operating system) that your DB instance runs on, whereas CloudWatch gathers metrics from the hypervisor. The two can therefore differ, because the hypervisor layer performs a small amount of work of its own; on smaller instance sizes this matters more, as there are fewer resources available in general. The benefit of enabling Enhanced Monitoring is that it reports at the OS layer rather than the hypervisor layer, letting you know what is actually available to the RDS database.
Enhanced Monitoring also provides additional metrics that give a different view into the health of the RDS instance. For example, the 'Physical Devices Write IO/s' metric below shows all the physical drives the RDS instance makes use of, as well as their usage.
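Under the hood, Enhanced Monitoring publishes its OS metrics as JSON events to a CloudWatch Logs group (`RDSOSMetrics`), which you can parse yourself. The sketch below summarises a trimmed sample event; the field names follow the RDS documentation but should be treated as an assumption and verified against a real payload, and the instance name is hypothetical.

```python
import json

# A trimmed sample of the JSON that Enhanced Monitoring writes to the
# RDSOSMetrics log group. Field names are based on the RDS docs; check a
# real event from your own log group before relying on this exact shape.
sample_event = json.dumps({
    "instanceID": "my-sqlserver-db",            # hypothetical instance
    "timestamp": "2024-01-01T12:00:00Z",
    "cpuUtilization": {"user": 41.2, "system": 6.3, "idle": 52.5},
    "memory": {"total": 8388608, "free": 524288},  # kilobytes
})

def summarise_os_metrics(raw: str) -> dict:
    """Pull the OS-level numbers that the hypervisor view can't show."""
    m = json.loads(raw)
    return {
        "instance": m["instanceID"],
        "cpu_busy_pct": round(100.0 - m["cpuUtilization"]["idle"], 1),
        "mem_free_mib": m["memory"]["free"] // 1024,
    }

print(summarise_os_metrics(sample_event))
```

Parsing the log events like this is handy when you want to alert on an OS-level number that the standard CloudWatch RDS metrics don't expose.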
RDS Performance Insights monitors the load on the RDS instance and provides the ability to analyse and troubleshoot the performance of the database. Its main focus is the performance and health of the database itself, allowing analysis of the issues that affect it.
The Performance Insights dashboard visualises the current or historic load on the database, with filtering by waits, SQL statements, hosts or users. In addition, it shows the maximum capacity of the database and how close the current load is getting to that maximum.
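The load number Performance Insights charts is "average active sessions" (AAS): it samples which sessions are active roughly once a second and averages the counts, sliced by a dimension such as wait event. Below is a toy recreation of that calculation over hypothetical samples, just to show what the stacked chart is computing; the wait event names are SQL Server examples I chose for illustration.

```python
from collections import Counter

# Each inner list is one per-second sample of active sessions, labelled by
# what each session was waiting on (hypothetical SQL Server wait events).
samples = [
    ["CPU", "CPU", "LCK_M_X"],       # second 1: 3 active sessions
    ["CPU", "PAGEIOLATCH_SH"],       # second 2: 2 active sessions
    ["LCK_M_X", "LCK_M_X", "CPU"],   # second 3: 3 active sessions
    [],                              # second 4: database idle
]

def load_by_wait(samples):
    """Average active sessions per wait event over the sampling window."""
    totals = Counter()
    for second in samples:
        totals.update(second)
    return {wait: count / len(samples) for wait, count in totals.items()}

aas = load_by_wait(samples)
print(aas)  # total DB load is the sum; compare it against the vCPU count
```

When the summed AAS sits persistently above the instance's vCPU count, sessions are queuing rather than running, which is exactly the "above maximum capacity" condition described next.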
I once found myself monitoring an application in a production environment where I was not in control of the source code. This meant that any issue I found I could raise but could not directly fix myself.
While setting up some dashboards to monitor the performance of the application, it became clear that the RDS instance was not handling the load being asked of it. Below, I look at the issues I found and the techniques I used to find them.
DB Instance Type and Size
With the help of Performance Insights, I was able to identify that the database instance size needed to be increased: there is a maximum number of supported connections based on the available memory, which is tied to the DB instance size.
As seen in the image below, this database was running far above the maximum capacity for an extended period of time, thereby causing poor performance for any services trying to make use of it.
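To see how memory drives the connection ceiling, here is a sketch of the default formulas RDS documents for deriving `max_connections` from `DBInstanceClassMemory` (MySQL and PostgreSQL; SQL Server instead exposes a 'user connections' setting, where 0 means unlimited). The instance memory figures below are nominal; real `DBInstanceClassMemory` is slightly lower because it excludes OS overhead.

```python
# Default max-connections formulas from the RDS parameter group docs:
#   MySQL:      {DBInstanceClassMemory/12582880}
#   PostgreSQL: LEAST({DBInstanceClassMemory/9531392}, 5000)
GIB = 1024 ** 3

def default_max_connections(engine: str, memory_bytes: int) -> int:
    if engine == "mysql":
        return memory_bytes // 12582880
    if engine == "postgres":
        return min(memory_bytes // 9531392, 5000)
    raise ValueError(f"no default formula implemented for {engine}")

# Doubling the instance memory roughly doubles the connection ceiling:
print(default_max_connections("mysql", 4 * GIB))   # e.g. a 4 GiB db.t3.medium
print(default_max_connections("mysql", 8 * GIB))   # e.g. an 8 GiB db.m5.large
```

This is why scaling up the instance class, rather than just tuning queries, was the right fix once the connection limit itself was the bottleneck.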
GP2 SSD vs Provisioned IOPS
By default, IOPS for AWS EBS General Purpose SSD (gp2) volumes is calculated at a ratio of 3 IOPS per GiB of volume size.
If the RDS volume size is 50GB, this works out to (50×3) 150 IOPS (more info here). AWS also has the concept of a burst balance, which accumulates over the time the volume is available and can be used to burst above the volume's baseline IOPS for a period of time.
The image below depicts the burst balance (blue line) decreasing from full at 08:00 to empty at 13:00, because the WriteIOPS (green line) goes above the available IOPS using the bursting capability. As you can see, when the burst capacity is depleted, the WriteIOPS flat-lines and stays there for the rest of the period. Without an effective CloudWatch dashboard, this would have been near impossible to troubleshoot.
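The gp2 burst maths can be sketched in a few lines: the baseline is 3 IOPS per GiB (with a 100 IOPS floor), volumes can burst up to 3,000 IOPS, and the credit bucket holds 5.4 million I/O credits, refilling at the baseline rate. The demand figure of 1,000 IOPS below is a hypothetical load, not the one from the incident.

```python
# gp2 bursting, per the EBS docs: spend credits at the actual IOPS rate,
# earn them back at the baseline rate, flat-line at baseline when empty.
BURST_BUCKET = 5_400_000   # I/O credits
BURST_IOPS = 3_000         # gp2 burst ceiling

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline is 3 IOPS per GiB, with a floor of 100 IOPS."""
    return max(100, 3 * size_gib)

def seconds_until_depleted(size_gib: int, demand_iops: int) -> float:
    """How long the bucket lasts under constant demand (inf if sustainable)."""
    baseline = gp2_baseline_iops(size_gib)
    drain = min(demand_iops, BURST_IOPS) - baseline  # net credits spent per second
    return float("inf") if drain <= 0 else BURST_BUCKET / drain

# A 50 GiB volume (150 baseline IOPS) pushed at a constant 1,000 IOPS:
hours = seconds_until_depleted(50, 1_000) / 3600
print(f"burst exhausted after ~{hours:.1f} hours, then capped at "
      f"{gp2_baseline_iops(50)} IOPS")
```

Running the numbers like this for your own workload tells you whether a bigger volume (more baseline IOPS) or Provisioned IOPS is the right answer.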
With the help of Performance Insights, I was able to find queries running on the DB that were not optimised. Further to that, they were poorly written and resulted in much higher usage on the DB than they should have. I was able to let the developers of the application know exactly which queries were causing the problem, so they could narrow their focus and remedy them immediately.
As you can see from the images below, the blue stacked area represents the aspects causing extra load on the database, and by looking into the 'SQL' section of the table below that, we can see that an 'INSERT' and a 'DELETE' query are mostly responsible for the spike.
I’m certain that many, if not all, the monitoring elements I mentioned have always been available in the on-prem world, but since I come from a software engineering background rather than a database administrator one, they were new to me.
Hopefully, this highlights the importance of effective monitoring and alerting in enabling DevOps teams to catch these types of errors early, before they are reported by a user of the system.