As anyone who developers software for a living knows, MySQL is a popular Open Source, high performance database system that is used by millions of active web sites every day. Our web site has been running on MySQL since the day it went live, and until recently the performance of MySQL has been excellent. However a number of things happened over the last 12 months that caused our web site to have noticeable performance problems, most of which was caused by the growth in the size of the database. So for the last 12 months we have been on the hunt to solve the performance problems.
When we first started noticing performance issues we were running our entire web site all on one server, running under Windows Server 2008. The web site was running PHP for most of the front end, ASP.NET MVC for a lot of the back end and WCF web services for client/server warehousing applications. The initial thinking after examining the code was that the web server was just getting overloaded and we simply needed more horse power. So we ended up moving the database off the Windows machine that hosted the web site and the database, and onto a stand alone Linux machine just running MySQL. We also made sure the new machine had significantly improved hardware with more memory and more CPU. We also invested in an enterprise license for MySQL with Oracle, to get their reporting tools and technical support. After moving MySQL to the new hardware, things seemed a little better, but the problem was not solved. We still had terrible worst case performance issues where the site would stall.
Recompile PHP code with Phalanger
Around the same time we were still having performance problems with the web site we had also been experimenting with a fantastic tool called Phalanger, a PHP compiler for .NET. Since we had already been implementing all our new back end web site functionality in C# using ASP.NET MVC for the web site and WCF for our handheld warehousing computers, it made a lot of sense to work on recompiling all the PHP code into .NET so it could all run on the same platform. Not to mention I wanted to get off PHP and move towards .NET anyway for performance reasons, which I documented in one of my previous blog posts. So around March this year we finished up the project to port to Phalanger and launched the new compiled web site. I was expecting to see big performance gains since on our development and testing machines we could see Phalanger was much faster than regular PHP.
But the problem was still not solved! Although the Windows server has much lower CPU usage than it did running on real PHP, we were still having worst case performance problems where the site would slow down for no apparent reason.
Problems with Nightly Scripts
Once we launched the web site live with Phalanger, we started running into another issue that had not cropped up before. Something was going on such that the web site would basically crash and need some CPR in the middle of the night when our nightly scripts fired up (big thanks to Rackspace Rackers for restarting the web site at 3am multiple times when it died over the last 6 months!). To help us figure these issues out we got the Oracle MySQL monitoring tools installed and could see that during the night when the nightly scripts started, we would see the number of connections from the web site to MySQL start to sky rocket, and the CPU load on the MySQL server would go nuts.
One of the biggest differences between running real PHP on Windows and running PHP compiled under Phalanger, is how IIS handles scaling the load. With real PHP it runs under FastCGI and there is a limit on the number of PHP processes that will be spawned to handle the load based on how you configure FastCGI. We found our site worked well with up to about 100 PHP processes running, but things would fall apart if we let it go much past that value. However with IIS running an ASP.NET web site, it will automatically keep adding application instances to the application pool based on how slow the site is responding to incoming page requests until it is maxing out the CPU and memory on the machine. There is no way for us to control the maximum number of ‘applications’ that IIS will have running in the application pool.
The problem with that is all those new application instances then start requesting connections to MySQL and executing more and more SQL queries for the pages they are trying to serve! This then causes MySQL gets even more overloaded and starts responding even slower, which then causes IIS to spin up even more application instances. This feedback loop continues until eventually you get a site crash as the IIS applications can no longer get MySQL connections and MySQL is working so hard nothing can get done. We never had this problem with real PHP because we would only ever have a maximum of 100 PHP instances running so although the web site was really slow in the middle of the night, neither MySQL nor IIS would completely overload and cause the site to crash.
This was all pretty obvious once we had the MySQL Enterprise Monitor tools installed, but we just did not know what to do about it. With the graphs below you can see the problem clearly on April 25th, where the nightly process ran from 03:05 to 03:19. You can see the number of connections to the MySQL server start to head steadily towards the maximum, the CPU load on the server get higher and higher and the query cache start to constantly miss. It is clear from the graph that the MySQL machine is completely overloaded, as the load average goes up to around 30, when the machine only has 12 physical cores (24 with hyper threading). Pretty much any load past about 20 means the machine is running full tilt. The end result of this is aborted SQL connections and/or timeouts. We tried playing around the maximum number of connections, but all that did is cause more load on the MySQL server and eventually IIS would fall over.
So for the night of April 26th we optimized the code somewhat and nearly halved the number of iterations for the driving loop (~45K down to ~23K), and added a very small micro sleep to each iteration (about 1/40th of a second). This actually produced some completely odd results which was not what we were expecting. From the graphs you can see the nightly script in question ran from 03:07 to 03:39. Not only did it take twice as long, but the load average went all the way up to 55. Clearly adding a sleep in there simply slowed everything down, but did nothing to lower the server load and in fact made everything worse, not better.
Solution: MySQL Low Priority Updates
After doing some more thinking after the failure of our changes on the night of April 26th it occurred to me that part of the problem was that the nightly scripts run full tilt and were very update heavy. We needed some way to be able to ‘slow down’ the MySQL server, like a feature similar to the UNIX nice command but for our background MySQL process. The solution to this problem came in the form of the LOW_PRIORITY keyword for the MySQL update operation, or the use of the low_priority_updates variable for all queries on a connection. The idea behind this feature is that if you issue an update operation that is considered a ‘low priority’ and mark it as such MySQL will schedule the update operation to happen ONLY when all pending select operations have completed. This essentially gives the effect of making the updates happen in the background, so that all the foreground ‘web page’ tasks get to complete and the server won’t get overload.
So for the night of April 27th, we removed the sleeps and added in the low_priority support for the critical queries. The nightly process in question ran from 03:04 to 03:16. You can see that this time the CPU utilization stayed really low the whole time, the load average barely got over 1, the cache hit ratio never dipped below 60% and the total number of connections never went above 75. The entire process actually finished faster than the original code now, with no adverse load on the MySQL server (although the iterations of the loop are effectively half what they used to be, so the low priority update versions of the nightly scripts took almost 2x as long as before).
So in conclusion it should be pretty clear that if you are using MySQL and you need to run nightly processes that hit the database pretty hard, it is a really good idea to make sure all that code runs with low priority updates to ensure any select operations for web pages will get priority and not get blocked out. Not to mention that enabling low priority updates keeps the load on the MySQL server down which then allows it to do the same amount of work much more efficiently. Just be aware that if your MySQL server is totally overloaded already and you enable low priority updates, the update operation may get blocked indefinitely and never complete. If you have that problem you may need to consider moving select traffic off the database server onto a dedicated slave server.