Performance impact of MySQL query cache on modern hardware

Recently, Morgan has been writing on deprecating some MySQL features and inspired by that while working on MySQL on POWER, I wondered “What is the impact of the MySQL query cache on modern hardware?”

We’ve known for over six years (since before we started Drizzle) that the query cache hurt performance. It was for that reason that the query cache was one of the early things to be removed from Drizzle, it just didn’t scale on multi core systems that  we were targeting.

So what about modern hardware? While working on MySQL 5.6 on POWER8, I enabled the query cache and ran a benchmark. Enabling the query cache reduced performance by an order of magnitude. I suspect the performance impact is even higher on MySQL 5.7.

My vote? Deprecate it. Deprecate it now, print a giant warning if it’s enabled and at some point just remove it. A single global mutex just doesn’t scale to 4 cores, let alone 24 cores at 8 threads per core.

1 million SQL Queries Per Second: MySQL 5.7 on POWER8

I’ve previously covered MySQL 5.6 on POWER (with patch), MySQL 5.6 Performance on POWER8 (spoiler: new performance record) and MySQL 5.7 on POWER.

Of course, The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions. Also, these numbers should be considered preliminary, but trust me – I did get them and it’s not April 1st.

From my last post, you saw that with my preliminary patch for MySQL 5.7 to work on POWER, we could easily match the previous record for sysbench point select queries per second (i.e. key lookups). In fact, we could exceed the published record by a little bit which is kind of nice. At around 630kQPS, one could be rather happy.

But we still had 30-40% idle CPU on POWER8. This led me to file the following bug report:

  • Bug 72829: LOCK_grant is major contention point, leaves 30-40% idle CPU.

What’s going on is that there’s a rwlock in the MySQL Server that ensures that writers don’t collide with readers to the data structures describing the GRANTs (i.e. who has access to what). If you run a GRANT statement, it gets a writer lock, and nobody can read (i.e. check permissions) while everything is being updated. If you run a normal SQL statement, you get a read lock (non-exclusive) and can check permissions appropriately.

It’s been known for a long time that LOCK_grant was a bottleneck. Typically, some people have run with skip-grant-tables to help shorten the time the lock as held (as in MySQL you still take the mutex even though you’ve started the server with skip-grant-tables).

In Drizzle, we fixed that – moving authentication and authorization completely behind plugin APIs and if you didn’t load plugins for them, you executed near enough to zero instructions that it didn’t matter.

In my experiments, enabling skip-grant-tables actually hurt performance rather than helped. More investigation is needed, but it seems that simply the act of acquiring and releasing the rdlock is now a major bottleneck in some benchmarks (such as sysbench point select).

It turns out that this is a well known problem in other pieces of software (e.g. Linux kernel) and is pretty much what RCU (Read Copy Update) is best at. As far back as 2006 I remember attempting to get my head around RCU so that one day we could use it in MySQL or MySQL Cluster.

Another simpler method is simply splitting the mutex, with readers able to acquire any one of N mutexes and writers needing to acquire them all. This penalizes writers, but unless you’re executing a lot of GRANTs, you’re probably safe.

So… what is the theoretical maximum performance if this bottleneck went away?

I wrote a quick patch that just commented out the rdlock acquisition of LOCK_grant in the hot codepath of sysbench point selects. I wasn’t running GRANT statements at runtime so this was “safe”.

This patch is not production ready, it’s merely useful for demonstrating where we could be with MySQL 5.7 on POWER8 if one last bottleneck is fixed.

My results? Slightly over ONE MILLION QUERIES PER SECOND!

This is roughly twice the previous record.

This is with a dual socket 24 core POWER8 with SMT8 and DSCR=1 on 8 tables with sysbench 0.4.8. Sysbench itself is using a non-trivial amount of CPU and I could probably decently beat this number if I rewrote sysbench using the nonblocking API in libdrizzle (back when me made the Drizzle performance regression tests use a libdrizzle-ified sysbench we got double digit percentage improvement in our sysbench numbers).

There’s still around 7-10% idle CPU time… so there’s more room to grow.

Lacking a physical gauntlet to throw down, I’ll just have to submit a conference paper somewhere so that I can do that in person.

I really hope that we’re able to fix this bottleneck in MySQL 5.7 so that MySQL 5.7 will ship being able to do over a million queries per second. From SQL.

MySQL 5.7 on POWER

In a previous post, I covered porting MySQL 5.6 to POWER and subsequently, some new record performance numbers with MySQL 5.6.17 on POWER8.

Well, those following at home will be aware that not only is the next sentence sponsored by IBM Legal, but that MySQL 5.7 alleviates a bunch of the mutex contention that we saw with MySQL 5.6. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

In looking at MySQL performance on POWER, it’s inevitable that I should look at MySQL 5.7 and what’s coming up in the next stable release of MySQL.

Surprisingly, a bunch of the core code in InnoDB and MySQL dealing with mutexes has changed in MySQL 5.7 when compared to MySQL 5.6. Enough that I actually had to post a few bug reports about the changes that apply to any CPU architecture:

  • Bug 72805: mutex_delay() creating excess memory traffic, GCC mem barrier needed
    • This is now more generic mutex code, so it’s even more important to get it right. There’s a bunch of tricks that have been learned in other places (e.g. Linux kernel) in getting these things right. We need to get them right in MySQL too.
    • One of these tricks is in ensuring that the compiler doesn’t compile down spinloops to nothing.
  • Bug 72806: mutex_delay() missing x86 pause instruction optimization
    • This is actually a regression over 5.6.
    • On x86, there is an instruction (PAUSE) that tells the CPU that you’re in a spin loop and that it should yield resources in the CPU core to other threads (or thread, as HT CPUs only have 2 threads per core).
    • We have a different way of doing things on POWER, and I’ve got a patch for that too.
    • What’s interesting is reading the Intel CPU manual about the PAUSE instruction and how even if you went and benchmarked it, it depends on the CPU on if this is a NO-OP or not.
    • I suspect that with this bug fixed, performance on Hyper Threaded Intel systems will improve.
  • Bug 72807: Set thread priority in my_pthread_fastmutex_lock
    • This is the POWER equivalent of the x86 PAUSE instruction.
    • I’ve found this patch to have a quite decent positive impact on sysbench point select performance.

There were also the bugs I mentioned in my MySQL 5.6 on POWER blog post. Notably, I had to port Yasufumi’s memory barrier patch from 5.6 to 5.7. My port is incomplete (I can still crash mysqld without too much trying) but I’ve deemed it currently “good enough for benchmarking” and it’s attached to bug 47213 (I hope to spend some time fixing it up soon too). I don’t think I’m missing anything that’s going to have a major performance impact – so while not suitable for production use, it’s good enough to poke some benchmarks at.

So… I’m close to the point where I’ll share my patch for MySQL 5.7, but I’m really wanting to solve the last couple of issues before doing so. The majority of patches are attached to bug reports and get 99% of the way.

Amazingly enough, MySQL 5.7 works fairly well on POWER “out of the box”, and with sysbench point selects, I could quite easily get 320kQPS on a 24 core POWER8 with SMT8 mode without changing a single line of code or doing anything special. This alone is an impressive result when compared to the previous record on both POWER and other CPU architectures with MySQL 5.6 that had been optimized for POWER (while out-of-the-box MySQL 5.7 has not)
For my benchmarks, I’m doing the same procedure, workload and basic my.cnf settings that Dimitri has used and written about, so I won’t repeat that here.With my preliminary patch for MySQL 5.7.4-m14 to have it work well on POWER, on the same system I was using for my MySQL 5.6 benchmarks, I could easily match and indeed exceed the previous published maximum sysbench point select results (I got ~630kQPS). Consider this number a bit preliminary as my patch isn’t completely solid, but it does mean that we’re in the right ballpark for MySQL 5.7 performance, which is great news!So, you might just say “Mission Accomplished” and be done with it. Well… there was one issue:  with the maximum numbers I was getting there was still 30-40% idle CPU on the POWER 8 machine.Now… you could just use that idle 30-40% of total CPU to do other things (solving Sudoku in SQL for example) but that’s no fun.

MySQL 5.6 Performance on POWER8

The following sentence is brought to you by IBM Legal: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

My previous post covered the work needed to get MySQL 5.6.17 running reliably on modern POWER systems. The patch to MySQL 5.6.17 that’s needed is available here.

For those who don’t know, POWER8 is the latest Power Architecture processors from IBM (my employer). These chips will be available in systems from IBM in June 2014 (i.e. Real Soon Now(TM)). There’s some fairly impressive specs and numbers (see Wikipedia and elsewhere) – but what could this mean for actual applications?

Well, it turns out that MySQL is a pretty big thing in some target markets for POWER8, and inspired by Dimitri’s impressive benchmark numbers, I thought we should have a go on POWER8.

Firstly, I focused on MySQL 5.6 as it is the current stable release. MySQL 5.7 will be the subject of a future blog post.

The first step was to ensure that MySQL 5.6 worked correctly on POWER. My previous blog post covered the few bugs I ran into and filed (often  with patches). This wasn’t too hard and I’m fairly confident the bug fixes are simple enough to get into MySQL 5.6 – I can’t comment on what would be/could be “officially supported”, that’s a business discussion :)

In order to ensure that my patch was not only correct but performing well, I needed a benchmark. For my initial benchmark. I chose sysbench point selects (i.e. read only key lookups), which should show the theoretical maximum queries per second you could pump through the MySQL Server as well as really stressing the mutex code, helping ensure it was not only correct, but performing well.

A simple comparison of my early patch that used heavyweight memory barriers versus Yasufumi’s patch that used more lightweight ones showed that using heavyweight barriers could be as much as a 50% performance hit – so getting this code right is important.

To add to the fun, the POWER8 processor has a few parameters you can tweak. There is the SMT mode, which dictates how many threads per core there are. This can be changed at runtime. You can be in SMT=off, SMT=2, SMT=4 or SMT=8. Typically, only some workloads can benefit from SMT8 rather than SMT4. There is also DSCR, which is data prefetching. For sysbench point selects, I’ve found we do slightly better (around 10%) when DSCR is set to 1 rather than zero – but YMMV on other benchmarks.

In my experiments, I’ve found that SMT4 or SMT8 seems to be the best bang for buck for MySQL workloads on POWER8. With SMT=2 rather than off, I’ve seen a ~50% performance boost in sysbench point select results. With SMT=4 I’ve seen another 50% boost (i.e. roughly double SMT=off performance). The benefit of SMT8 for MySQL 5.6 (and the 5.6 part is crucial here) may be minimal, especially for this benchmark. This is mostly due to hitting heavy mutex contention inside the MySQL server rather than anything else.

POWER8 systems come in either single or dual socket, with the number of cores being a total of 4, 6, 8, 10, 12, 16, 20 or 24 depending on configuration of the system (go check IBM web site for specifics of what’s available in what model). This means with SMT8, a dual socket, 24 core POWER8 system has 192 hardware threads – the system I was using for these benchmarks.With this number of cores and hardware threads, those familiar with MySQL on multi core systems may already have an inkling that using the full capacity of such a system may be hard for MySQL.

Certainly for old versions of MySQL (such as 5.0 or 5.1) you’re going to get nowhere near full system utilization on POWER8. For MySQL 5.6 (and in the future, 5.7) you have a much better hope.

Before anyone asks, yes, I used jemalloc for most of my benchmarks and it helps by giving a single digit percent performance increase (around 3-4%).

The bottlenecks inside MySQL 5.6 for sysbench point select workload are fairly well documented, so at best we may be striving to equal the performance of other CPU architectures rather than get too much higher simply due to hitting mutex contention in creating read views inside InnoDB. So the maximum performance will be a function of individual core CPU speed and the speed at which a lock can be acquired (i.e. related to how quick you can bounce a cacheline with a lock between cores).

This is exactly what I found on POWER8 with MySQL 5.6 – you hit the same bottleneck on POWER8 as you do everywhere else – creating read views in InnoDB.

That being said, my maximum sysbench point select results on POWER8 was 344kQPS. This not only matches but exceeds the previous record holder by quite a decent amount.

This number was across 8 tables with mysqld bound to a single NUMA node (6 cores) and sysbench bound to another NUMA node (6 cores) on the same socket. For this benchmark, due to the mutex contention, bringing the second socket into play didn’t improve performance. For other benchmarks, (e.g. standard sysbench read only) it seems to scale with more CPU cores much better (no doubt the subject of a future blog post).

Single table sysbench point select was also impressive at 335kQPS – you only got an additional 10kQPS by going to 8 tables! All of these results were with SMT4 and DSCR=1, which seems to be the best configuration for this type of workload.

Up next: MySQL 5.7 on POWER8.

MySQL 5.6 on POWER (patch available)

The following sentence is brought to you by IBM Legal. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

Okay, now that is out of the way….

If you’re the kind of person who follows the MySQL bugs database closely or subscribes to the MySQL Internals mailing list, you may have worked out that I’ve spent a small amount of time poking at MySQL on modern POWER systems.

Unlike Intel CPUs, POWER CPUs require explicit memory barriers to synchronize memory state between different CPUs. This means that when you’re implementing synchronization primitives, you have one extra thing to get right.

Luckily, if you use straight pthread mutexes, this is already taken care of. Unluckily, there are some optimizations in MySQL that don’t use straight pthread mutexes and so may be problematic on non-Intel CPUs. A few of these issues have sneaked into MySQL over the past few years. The most problematic area was around the optimized mutexes in InnoDB (you can use the pthread_mutex fallback code, but it’s less performant).

Luckily, I both knew where to look and there are good asserts throughout InnoDB code to help spot any other areas that I may not have initially thought of to look at. Coding defensively with a good amount of asserts is a good thing.

After not too much work, I have a set of patches that I’m fairly confident is correct and performs near as well as possible. Initially, I had a different patch that used heavyweight memory barriers in a lot of places, but big kudos to Yasufumi for posting a better patch than mine to bug 47213 – using the lighter weight barriers gives a decent performance boost.

One of the key patches is in the InnoDB mutex code to change the thread priority – i.e. a POWER equivalent to the x86 pause instruction. These are hints to the CPU that the thread being executed is in a spinloop and CPU resources should be allocated to other threads to make betterr forward progress.

After dragging Anton in to have a look and a think, this code may have motivated him to have a go at getting kernel support for adaptive mutexes, thus removing the need for this spin/sleep/yield/eep loop in InnoDB (at least on Linux).

So… I’ve spent the appropriate time filing bugs in the MySQL bug tracker for the things I’ve found. Feel free to track them yourself, they are:

  • Bug 72715: character set code endianness dependent on CPU type rather than endianness of CP
    • I don’t think this is an issue for us… or it could be that this is actually just incredibly untested code in the MySQL Server. It’s also not POWER specific, although was caught by the Migration Assistant which is part of the Advanced Toolchain from IBM.
  • Bug 72718: CACHE_LINE_SIZE in innodb should be 128 on POWER
    • I contributed a patch that’s a simple #ifdef for CPU type. Those who care about other CPU architectures should chime in with the correct value for them.
    • There’s other places in InnoDB where there’s some padding that don’t use this define, I need to file a bug for that.
  • Bug 72754: Set thread priority in InnoDB mutex spinloop
    • This makes a big difference when you have mutex contention and SMT (Symmetric Multi-Threading) enabled (on POWER, you can dynamically change SMT levels at runtime).
    • I’ve contributed a preliminary patch that isn’t generic. I should go and fix that.
  • Bug 72755: InnoDB mutex spin loop is missing GCC barrier
    • This also applies to x86 (and indeed all platforms). If GCC gets a bit smarter, the current code could compile down to nothing, which is exactly what you don’t want from a spinloop. The correct thing to do is to have a GCC memory barrier (not CPU one) to ensure that the compiler doesn’t optimize away the spinning.
    • I’ve contributed a patch, may need #ifdef GCC added.
  • Bug 72809: InnoDB Linux native aio setup missing barrier after setup
    • This appears to be a “POWER8 is fast” related bug :)
    • Patch contributed.
  • Bug 72811: Set NUMA mempolicy for optimum mysqld performance
    • Not POWER specific.
    • I’ve contributed a patch that sets NUMA memory allocation policy inside mysqld rather than having to run “numactl” manually
  • Bug 47213: InnoDB mutex/rw_lock should be conscious about memory ordering other than Intel
    • Originally filed by Yasufumi back in 2009.
    • Some good discussion going on here to ensure the patch is correct. This is the kind of patch that requires more review  than it takes to write it.
    • This patch would fix the majority of problems for non-Intel CPU architectures.
    • Thanks to Yasufumi for providing an updated patch, it helped a lot!
  • Bug 72544: Incorrect locking for global_query_id
    • I found a bug. Rather benign and not POWER specific.

Want to run MySQL 5.6.17 on POWER? Get my MySQL 5.6.17 patch here: https://flamingspork.com/mysql/mysql-5.6.17-POWER.patch

My accumulation of 5.6 patches seems fairly reliable. I’d test before putting into production, and I’d certainly love to know any problems you hit.

Get the quilt series of patches here: https://flamingspork.com/mysql/mysql-5.6.17-POWER-patches.tar.gz

I have, of course, done the legal wrangling for the Oracle Contributor Agreement (remarkably painless) and am working on making the patches completely acceptable to be merged into MySQL.