As mentioned in my previous blog post, we had been having site performance problems for a while now. The previous blog post detailed how we solved the performance problems we had at night when our nighty scripts would run, but that was not the final solution. We still had more performance problems that would crop up during the day, and this blog post details the steps we took to finally resolve the problem for good.
Faster web site == slower MySQL!
One of the things we did in our quest for performance was to re-compile all our PHP code into .NET using Phalanger. This definitely made the web site faster and lowered the CPU load on the Windows web server substantially, but we quickly discovered that having faster code on the web server had the side effect of causing more performance problems with MySQL, not less. Clearly the MySQL server was not able to keep up with the amount of traffic being sent it’s way during normal operation, which was odd because when these issues were happening there was really not a whole lot going of activity on the MySQL server.
Caching won’t solve core DB issues
The first thing we did in an attempt to solve these performance problems was to revisit our caching system, and re-implement it using Couchbase. This produced significant performance improvements in the best case when items were in the cache, but the worst case performance problems remained.
Part of our caching architecture is to ensure that only one web server thread is actively generating the cached pages at a time to ensure that the load on the MySQL server is kept as low as possible. It is much faster to have only one thread generating the SQL select traffic to the MySQL server and to have the other threads wait, rather than to have them all send the same SQL select traffic at the same time. In the best case all the threads would run at the same speed anyway, however if you have 100 threads all asking the MySQL server to do the same thing in parallel, it is going to run about 100X slower than just one thread making those calls! So when another thread tries to get a cache item that is expired and another thread needs the same cache entry, the second thread will end up blocking with small sleeps in a loop until either the cache entry is valid again, or we hit a timeout after 10 seconds. Ten seconds is a LONG time for a web page, so it seemed like a reasonable timeout that should never happen in practice.
In theory if the MySQL server was acting up and behaving really slow, we might be be getting cache timeouts. So we went ahead and added some logging to the code to keep track of any timeouts that did occur. Once we enabled the logging on the live site we could clearly see the caching code was timing out hundreds of times a day! Which means that hundreds of times a day, pages were taking longer than 10 seconds to render, which is totally unacceptable. Clearly when the web site code had to reach out to MySQL to get the data to generate the web page content, it was slow. This has to be our performance bottleneck, so we kept looking.
Slow performance when DB server seems idle?
The real question was why? Looking at the first graph below taken while the web site was responding slowly, you can see that the MySQL server is not really doing a whole lot. The CPU usage was very low sitting around 5% with spikes up to 10%. The number of MySQL connections was also low hovering around 150 connections on average, and the database activity was not really all that high at an average of maybe 180 statements per second. By all counts the MySQL server should be blazing fast, but it was not! And we knew that the web server was quite slow during the peaks in the CPU activity from 13:00 to 14:00 and again from about 15:10 to 15:40.
Excessive table locks will KILL performance!
There had to be something else going on so we kept looking at the data with the MySQL Enterprise Monitor. One thing that caught my eye was the number of table locks going on, and here is where things start to get interesting. If you look at the graphs below you can see that there were a lot of table locks that were all grabbed immediately, but creeping in at the bottom of the graph are table locks that caused a wait condition. Then if you take a look at the table lock to wait ratio you can see big spikes in the graph that correspond to times when the web site performance was suffering! Aha! You will also notice that at the same time this happens we see spikes in CPU usage where it doubles from the low 5% to about 10% (visible in the graph above).
So the core problem we had was the excessive amount of table locking going on, or more importantly, the fact that way too many threads were getting stuck waiting to get a table lock! This clearly explains why the DB server appeared slow, when in fact the CPU load indicated not much was going on. Clearly this was the cause of our MySQL server acting like it was overloaded, when nothing of significance was really going on.
MyISAM tables don’t handle update load well
So we spoke to Oracle support, and they pretty much told us the problem we had is that the MyISAM table type we are using just does not work well when you have large tables with a mix of selects and updates going on. One of the core issues with the MyISAM table type is that it does not support row level locking, but rather will lock the entire table when you need to do an update on any record in a table. Or worse, if you have deleted a record from a table which leaves a hole in the table file on disk where the deleted record was, it will also lock the table for inserts also as it will try to fill in that hole. You can work around the insert problem by setting the concurrent_insert feature to ALWAYS, which will always insert new records at the end of the MyISAM table. We had already implemented that based on Oracle’s recommendation previously, however this simply does not solve the problem of table row updates. Some of our tables are really large so the updates can take a while due to indexing, so select traffic was getting blocked by updates on tables that also get lots of select traffic. Clearly we had reached the end of the road for the venerable MyISAM table type!
InnoDB to the rescue!
One of the biggest advantages of the InnoDB table type over the MyISAM table table (other than support for transactions) is that it can properly handle row level locking. This means that if you need to update a record in the database, the only time another thread that needs to read from the same table will block is if that thread needs to read the exact same record. Otherwise there is no lock contention and all of the requests can be handled nicely in parallel.
We had been planning for some time to convert our entire database over to the InnoDB table type format but this was a significant undertaking that would require our site to be down for an extended period of time, and it would require us to re-work all our existing database backup strategies. But it was sounding like we really needed to make the change, so we set about putting together a plan to make it happen. One of the first things we discovered was how to enable a file per table for the InnoDB databases, so each table would end up in a separate file just like our MyISAM tables. This makes it much, much easier to manage backups and deal with smaller parts of the database at a time for development purposes.
The next part was to schedule the downtime and run the upgrade. It ended up taking an entire hour of downtime on the web site to complete the upgrade on all the applications that needed it. Once the upgrade was done, it was really clear the next day during normal business hours that this was the right move and we should have done this a long, LONG time ago! As you can see from the graphs below, we started the upgrade process around 18:45, and it was all finished around 19:45 and performance normalized around 21:00 or so. The total CPU usage was much lower than it was prior to the upgrade, but more importantly it was fairly constant without the odd peaks we were seeing before. The total MySQL connections dropped off to about 100 connections after the upgrade, and the total database activity saw a small drop after the upgrade.
The big difference can be seen in the table lock to wait ratio. The total number of table locks did not really drop that much, but the table lock to wait ratio went to ZERO. More importantly, the web site performance was fantastic all day long! Now all the work we had done previously to improve the performance by compiling the web site with Phalanger and implementing better caching with Couchbase was finally paying off.
So in conclusion, if you are still using the MyISAM table type and you still believe (as we did) that MyISAM is one of the fastest table types for most web sites where there is a mostly lots of select activity going on, think again! Clearly the InnoDB table type is significantly faster when you have large amounts of table data and even have a small amount of update traffic going on. And I would suspect that for small web sites, the performance difference between the two table tables would be negligible at best. One thing I know for sure is that I won’t build another database driven web site using MyISAM – our default table table will now always be InnoDB.