Good news everyone! There’s video up for the talk I gave at Percona Live in April 2016 up: Why would I run MySQL/MariaDB on POWER anyway?
The talk is a general overview of POWER and why MySQL/MariaDB may be a good fit.
Earlier on in benchmarking MySQL and MariaDB on POWER8, we noticed that on write workloads (or read workloads involving a lot of IO) we were spending a bunch of time computing InnoDB page checksums. This is a relatively well known MySQL problem and has existed for many years and Percona even added innodb_fast_checksum to Percona Server to help alleviate the problem.
In MySQL 5.6, we got the ability to use CRC32 checksums, which are great in that they’re a lot faster to compute than tho old InnoDB “new” checksum. There’s code inside InnoDB to use the x86 SSE2 crc32q instruction to accelerate performing the checksum on compatible x86 CPUs (although oddly enough, the CRC32 checksum in the binlog does not use this acceleration).
However, on POWER, we’d end up using the software implementation of CRC32, which used a lot more CPU than we’d like. Luckily, CRC32 is really common code and for POWER8, we got some handy instructions to help computing it. Unfortunately, this required brushing up on vector polynomial math in order to understand how to do it all quickly. The end result was Anton coming up with crc32-vpmsum code that we could drop into projects that embed a copy of crc32 that was about 41 times faster than the best non-vpmsum implementation.
Recently, Daniel Black took the patch that had passed through both Daniel Axten‘s and my hands and worked on upstreaming it into MariaDB and MySQL. We did some pretty solid benchmarking on the improvement you’d get, and we pretty much cannot notice the difference between innodb_checksum=off and having it use the POWER8 accelerated CRC32 checksum, which frees up maybe 30% of CPU time to be used for things like query execution! My original benchmark showed a 30% improvement in sysbench read/write workloads.
The excellent news? Two days ago, MariaDB merged POWER8 accelerated crc32! This means that IO heavy workloads on MariaDB on POWER8 will get much faster in the next release.
MySQL bug 74776 is open, with patch attached, so hopefully MySQL will merge this soon too (hint hint).
A couple of days ago, MariaDB announced that MariaDB 10.1 is stable GA – around 19 months since the GA of MariaDB 10.0. With MariaDB 10.1 comes some important scalabiity improvements, especially for POWER8 systems. On POWER, we’re a bit unique in that we’re on the higher end of CPUs, have many cores, and up to 8 threads per core (selectable at runtime: 1, 2, 4 or 8/core) – so a dual socket system can easily be a 160 thread machine.
Recently, we (being IBM) announced availability of a couple of new POWER8 machines – machines designed for Linux and cloud environments. They are very much OpenPower machines, and more info is available here: http://www.ibm.com/marketplace/cloud/commercial-computing/us/en-us
Combine these two together, with Axel Schwenke running some benchmarks and you get 1 Million SQL Queries per second with MariaDB 10.1 on POWER8.
Having worked a lot on both MySQL for POWER and the firmware that ships in the S882LC, I’m rather happy that 1 Million queries per second is beyond what it was in June 2014, which was a neat hack on MySQL 5.7 that showed the potential of MySQL on POWER8 but wasn’t yet a product. Now, you can run a GA release of MariaDB on GA POWER8 hardware designed for scale-out cloud environments and get 1 Million SQL queries/second (with fewer cores than my initial benchmark last year!)
What’s even more impressive is that this million queries per second is in a KVM guest!
PAPR is the Power Architecture Platform Reference document. It’s a short read at only 890 pages and defines the virtualised environment that guests run in on PowerKVM and PowerVM (i.e. what is referred to as ‘pseries’ platform in the Linux kernel).
As part of the OpenPower Foundation, we’re looking at ensuring this is up to date, documents KVM specific things as well as splitting out the bits that are common to OPAL and PAPR into their own documents.
Previously, I blogged on how to Run skiboot (OPAL) on the POWER8 Simulator. If you want to build the full Open Power firmware environment, including the Petitboot bootloader and kernel, you can now do so!
My pull request for an op-build target for the simulator has been merged, so you can now do the following three steps to compile a kernel+initramfs to use with your built skiboot for development purposes:
git clone --recursive firstname.lastname@example.org:open-power/op-build.git cd op-build . op-build-env op-build mambo_defconfig && op-build
Then you wait for a whole bunch of time while everything compiles! Afterwards, you should be left with a zImage.epapr in output/images/ that you can copy into your skiboot directory.
With zImage.epapr in your skiboot directory, when you run “make check”, the skiboot test suite will actually launch the simulator to verify that your skiboot code boots all the way to the petitboot prompt!
We now have two boot tests as part of “make check” for skiboot!
While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.
Memory barriers are hard.
Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.
Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.
On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.
If you want up to date reads (and not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load with reserve works).
What the volatile keyword does is generate load and store instructions. It is useful when talking to hardware, as the load and store instructions are actually doing something there that the compiler doesn’t know about and thus shouldn’t optimize away.
The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.
Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.
Good news everyone!
MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven’t yet gone and tried it all out, but since I’m me, I look at bugs database and git/bzr history first.
On Intel CPUs, when you’re spinning on a spin lock, you’re meant to execute the PAUSE CPU instruction. This tells the CPU that other execution threads in the same core should be given priority as you are currently not doing anything productive. Without this, you’re likely going to hurt on hyperthreaded CPUs.
In MySQL, there are custom spinlocks in order to do interesting adaptive mutex things to attempt to squeeze the most performance possible out of modern systems.
One of the (not 100% ready, but close) bugs with patches I submitted against MySQL 5.7 was for using the equivalent of the PAUSE instruction for POWER CPUs. On POWER, we’re a bit different, you can actually set priorities of threads (which may matter more, as POWER8 CPUs can be in SMT8 mode – where there are *eight* executing threads per core).
So, the good news is that in MySQL 5.7.5, the magic instructions for setting thread priority are in! This should mean great things for performance on POWER systems with any of the SMT modes enabled.
The next interesting part of this is how it interacts with other KVM guests on a system. At least on POWER (and on x86 as well, although I won’t go into details here) there’s a hypervisor call that a guest can make saying “hey, I’m spinning here, perhaps you want to make sure other vcpus execute so that at some point I can continue”. On POWER, this is the H_CONFER hcall, where you can basically do a directed yield to another vcpu (the one that holds the lock you’re trying to get is a good idea).
Generally though, it’s only the guest kernel that does this, not userspace. You can see the H_CONFER call in __spin_yield(arch_spinlock_t*) and __rw_yield(arch_rwlock_t*) in arch/powerpc/lib/locks.c in the kernel.
It would be interesting to see what extra we could get out of a system running multiple guests with MySQL servers if InnoDB/MySQL could properly yield to the right vcpu (well, thread I guess).
Good news everyone! Tyan has announced the availability of their first OpenPOWER system! They call this a Customer Reference System, which means it’s an excellent machine to start poking at OpenPower and POWER8 (or deploying applications on).
Because it’s an OpenPower machine, it runs the open source Open Power firmware (all up on github) and will happily run Linux (feel free to port your other operating system kernels). I’ll be writing more on the OpenPower firmware soon as, well, technical details are fun!
Ubuntu 14.10 is listed as recommended as not only have they been building for POWER8 but have spent some time ensuring things work fairly well out-of-the-box (both as a KVM guest and running native on the bare metal). Or, you can always just boot whatever the mainline kernel is at – build for the POWERNV (POWER non-virtualized) platform (be sure to include all the required drivers) and have fun!
Good news for those wanting to run MariaDB on POWER systems, the latest 10.0 bzr tree (as of a couple of weeks ago) builds and runs well!
I recently pulled the latest MariaDB 10.0 from BZR and built it on a POWER8 system in the lab to run some quick tests. The MariaDB team has done some work on getting MariaDB to run on POWER recently, a bunch of which is based off my work on MySQL on POWER.
There’s obviously still some work in progress going on, but my initial results show performance within around 10% of MySQL, so with a bit of work we will hopefully see MariaDB reach performance parity.
One interesting find was the code to account for thread memory usage uses a single atomic variable: this does not scale and does end up showing up on profiles.
I’ll comment more on the code in a future post, but it looks like we will have MariaDB being functional on POWER in an upcoming release.
It’s been a little while since I blogged on MySQL on POWER (last time was thinking that new releases would be much better for running on POWER). Well, I recently grabbed the MySQL 5.6.20 source tarball and had a go with it on a POWER8 system in the lab. There is good news: I now only need one patch to have it function pretty flawlessly (no crashes). Unfortunately, there’s still a bit of an odd thing with some of the InnoDB mutex code (bug filed at some point soon).
But, with this one patch applied, I was getting okay sysbench results and things are looking good.
Now just to hope the MySQL team applies my other patches that improve things on POWER. To be honest, I’m a bit disappointed many of them have sat there for this long… it doesn’t help build a development community when patches can sit for months without either “applied” or “fix these things first”. That being said, just as I write this, Bug 72809 which I filed has been closed as the fix has been merged into 5.6.22 and 5.7.6, so there is hope… it’s just that things can just be silent for a while. It seems I go back and forth on how various MySQL variants are going with fostering an external development community.
About 1.5 months ago I blogged on MySQL 5.6 on POWER andtalked about what I had to poke at to make modern MySQL versions run and run well on shiny POWER8 systems.
One of those bugs, MySQL bug 47213 (InnoDB mutex/rw_lock should be conscious of memory ordering other than Intel) was recently marked as CLOSED by the Oracle MySQL team and the upcoming 5.6.20 and 5.7.5 releases should have the fix!
This is excellent news for those wanting to run MySQL on SMP systems that don’t have an Intel-like memory model (e.g. POWER and MIPS64).
This was the most major and invasive patch in the patchset for MySQL on POWER. It’s absolutely fantastic that this has made it into 5.6.20 and 5.7.5 and may mean that these new versions will work out-of-the-box on POWER (I haven’t checked… but from glancing back at my patchset there was only one other patch that could be related to correctness rather than performance).
Of course, The postings on this site are my own and donâ€™t necessarily represent IBMâ€™s positions, strategies or opinions. Also, these numbers should be considered preliminary, but trust me – I did get them and it’s not April 1st.
From my last post, you saw that with my preliminary patch for MySQL 5.7 to work on POWER, we could easily match the previous record for sysbench point select queries per second (i.e. key lookups). In fact, we could exceed the published record by a little bit which is kind of nice. At around 630kQPS, one could be rather happy.
But we still had 30-40% idle CPU on POWER8. This led me to file the following bug report:
- Bug 72829: LOCK_grant is major contention point, leaves 30-40% idle CPU.
What’s going on is that there’s a rwlock in the MySQL Server that ensures that writers don’t collide with readers to the data structures describing the GRANTs (i.e. who has access to what). If you run a GRANT statement, it gets a writer lock, and nobody can read (i.e. check permissions) while everything is being updated. If you run a normal SQL statement, you get a read lock (non-exclusive) and can check permissions appropriately.
It’s been known for a long time that LOCK_grant was a bottleneck. Typically, some people have run with skip-grant-tables to help shorten the time the lock as held (as in MySQL you still take the mutex even though you’ve started the server with skip-grant-tables).
In Drizzle, we fixed that – moving authentication and authorization completely behind plugin APIs and if you didn’t load plugins for them, you executed near enough to zero instructions that it didn’t matter.
In my experiments, enabling skip-grant-tables actually hurt performance rather than helped. More investigation is needed, but it seems that simply the act of acquiring and releasing the rdlock is now a major bottleneck in some benchmarks (such as sysbench point select).
It turns out that this is a well known problem in other pieces of software (e.g. Linux kernel) and is pretty much what RCU (Read Copy Update) is best at. As far back as 2006 I remember attempting to get my head around RCU so that one day we could use it in MySQL or MySQL Cluster.
Another simpler method is simply splitting the mutex, with readers able to acquire any one of N mutexes and writers needing to acquire them all. This penalizes writers, but unless you’re executing a lot of GRANTs, you’re probably safe.
So… what is the theoretical maximum performance if this bottleneck went away?
I wrote a quick patch that just commented out the rdlock acquisition of LOCK_grant in the hot codepath of sysbench point selects. I wasn’t running GRANT statements at runtime so this was “safe”.
This patch is not production ready, it’s merely useful for demonstrating where we could be with MySQL 5.7 on POWER8 if one last bottleneck is fixed.
My results? Slightly over ONE MILLION QUERIES PER SECOND!
This is roughly twice the previous record.
This is with a dual socket 24 core POWER8 with SMT8 and DSCR=1 on 8 tables with sysbench 0.4.8. Sysbench itself is using a non-trivial amount of CPU and I could probably decently beat this number if I rewrote sysbench using the nonblocking API in libdrizzle (back when me made the Drizzle performance regression tests use a libdrizzle-ified sysbench we got double digit percentage improvement in our sysbench numbers).
There’s still around 7-10% idle CPU time… so there’s more room to grow.
Lacking a physical gauntlet to throw down, I’ll just have to submit a conference paper somewhere so that I can do that in person.
I really hope that we’re able to fix this bottleneck in MySQL 5.7 so that MySQL 5.7 will ship being able to do over a million queries per second. From SQL.
In a previous post, I covered porting MySQL 5.6 to POWER and subsequently, some new record performance numbers with MySQL 5.6.17 on POWER8.
Well, those following at home will be aware that not only is the next sentence sponsored by IBM Legal, but that MySQL 5.7 alleviates a bunch of the mutex contention that we saw with MySQL 5.6. The postings on this site are my own and donâ€™t necessarily represent IBMâ€™s positions, strategies or opinions.
In looking at MySQL performance on POWER, it’s inevitable that I should look at MySQL 5.7 and what’s coming up in the next stable release of MySQL.
Surprisingly, a bunch of the core code in InnoDB and MySQL dealing with mutexes has changed in MySQL 5.7 when compared to MySQL 5.6. Enough that I actually had to post a few bug reports about the changes that apply to any CPU architecture:
- Bug 72805: mutex_delay() creating excess memory traffic, GCC mem barrier needed
- This is now more generic mutex code, so it’s even more important to get it right. There’s a bunch of tricks that have been learned in other places (e.g. Linux kernel) in getting these things right. We need to get them right in MySQL too.
- One of these tricks is in ensuring that the compiler doesn’t compile down spinloops to nothing.
- Bug 72806: mutex_delay() missing x86 pause instruction optimization
- This is actually a regression over 5.6.
- On x86, there is an instruction (PAUSE) that tells the CPU that you’re in a spin loop and that it should yield resources in the CPU core to other threads (or thread, as HT CPUs only have 2 threads per core).
- We have a different way of doing things on POWER, and I’ve got a patch for that too.
- What’s interesting is reading the Intel CPU manual about the PAUSE instruction and how even if you went and benchmarked it, it depends on the CPU on if this is a NO-OP or not.
- I suspect that with this bug fixed, performance on Hyper Threaded Intel systems will improve.
- Bug 72807: Set thread priority in my_pthread_fastmutex_lock
- This is the POWER equivalent of the x86 PAUSE instruction.
- I’ve found this patch to have a quite decent positive impact on sysbench point select performance.
There were also the bugs I mentioned in my MySQL 5.6 on POWER blog post. Notably, I had to port Yasufumi’s memory barrier patch from 5.6 to 5.7. My port is incomplete (I can still crash mysqld without too much trying) but I’ve deemed it currently “good enough for benchmarking” and it’s attached to bug 47213 (I hope to spend some time fixing it up soon too). I don’t think I’m missing anything that’s going to have a major performance impact – so while not suitable for production use, it’s good enough to poke some benchmarks at.
So… I’m close to the point where I’ll share my patch for MySQL 5.7, but I’m really wanting to solve the last couple of issues before doing so. The majority of patches are attached to bug reports and get 99% of the way.
Amazingly enough, MySQL 5.7 works fairly well on POWER “out of the box”, and with sysbench point selects, I could quite easily get 320kQPS on a 24 core POWER8 with SMT8 mode without changing a single line of code or doing anything special. This alone is an impressive result when compared to the previous record on both POWER and other CPU architectures with MySQL 5.6 that had been optimized for POWER (while out-of-the-box MySQL 5.7 has not)
For my benchmarks, I’m doing the same procedure, workload and basic my.cnf settings that Dimitri has used and written about, so I won’t repeat that here.With my preliminary patch for MySQL 5.7.4-m14 to have it work well on POWER, on the same system I was using for my MySQL 5.6 benchmarks, I could easily match and indeed exceed the previous published maximum sysbench point select results (I got ~630kQPS). Consider this number a bit preliminary as my patch isn’t completely solid, but it does mean that we’re in the right ballpark for MySQL 5.7 performance, which is great news!So, you might just say “Mission Accomplished” and be done with it. Well… there was one issue:Â with the maximum numbers I was getting there was still 30-40% idle CPU on the POWER 8 machine.Now… you could just use that idle 30-40% of total CPU to do other things (solving Sudoku in SQL for example) but that’s no fun.
The following sentence is brought to you by IBM Legal. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
Okay, now that is out of the way….
If you’re the kind of person who follows the MySQL bugs database closely or subscribes to the MySQL Internals mailing list, you may have worked out that I’ve spent a small amount of time poking at MySQL on modern POWER systems.
Unlike Intel CPUs, POWER CPUs require explicit memory barriers to synchronize memory state between different CPUs. This means that when you’re implementing synchronization primitives, you have one extra thing to get right.
Luckily, if you use straight pthread mutexes, this is already taken care of. Unluckily, there are some optimizations in MySQL that don’t use straight pthread mutexes and so may be problematic on non-Intel CPUs. A few of these issues have sneaked into MySQL over the past few years. The most problematic area was around the optimized mutexes in InnoDB (you can use the pthread_mutex fallback code, but it’s less performant).
Luckily, I both knew where to look and there are good asserts throughout InnoDB code to help spot any other areas that I may not have initially thought of to look at. Coding defensively with a good amount of asserts is a good thing.
After not too much work, I have a set of patches that I’m fairly confident is correct and performs near as well as possible. Initially, I had a different patch that used heavyweight memory barriers in a lot of places, but big kudos to Yasufumi for posting a better patch than mine to bug 47213 – using the lighter weight barriers gives a decent performance boost.
One of the key patches is in the InnoDB mutex code to change the thread priority – i.e. a POWER equivalent to the x86 pause instruction. These are hints to the CPU that the thread being executed is in a spinloop and CPU resources should be allocated to other threads to make betterr forward progress.
After dragging Anton in to have a look and a think, this code may have motivated him to have a go at getting kernel support for adaptive mutexes, thus removing the need for this spin/sleep/yield/eep loop in InnoDB (at least on Linux).
So… I’ve spent the appropriate time filing bugs in the MySQL bug tracker for the things I’ve found. Feel free to track them yourself, they are:
- Bug 72715: character set code endianness dependent on CPU type rather than endianness of CP
- I don’t think this is an issue for us… or it could be that this is actually just incredibly untested code in the MySQL Server. It’s also not POWER specific, although was caught by the Migration Assistant which is part of the Advanced Toolchain from IBM.
- Bug 72718: CACHE_LINE_SIZE in innodb should be 128 on POWER
- I contributed a patch that’s a simple #ifdef for CPU type. Those who care about other CPU architectures should chime in with the correct value for them.
- There’s other places in InnoDB where there’s some padding that don’t use this define, I need to file a bug for that.
- Bug 72754: Set thread priority in InnoDB mutex spinloop
- This makes a big difference when you have mutex contention and SMT (Symmetric Multi-Threading) enabled (on POWER, you can dynamically change SMT levels at runtime).
- I’ve contributed a preliminary patch that isn’t generic. I should go and fix that.
- Bug 72755: InnoDB mutex spin loop is missing GCC barrier
- This also applies to x86 (and indeed all platforms). If GCC gets a bit smarter, the current code could compile down to nothing, which is exactly what you don’t want from a spinloop. The correct thing to do is to have a GCC memory barrier (not CPU one) to ensure that the compiler doesn’t optimize away the spinning.
- I’ve contributed a patch, may need #ifdef GCC added.
- Bug 72809: InnoDB Linux native aio setup missing barrier after setup
- This appears to be a “POWER8 is fast” related bug :)
- Patch contributed.
- Bug 72811: Set NUMA mempolicy for optimum mysqld performance
- Not POWER specific.
- I’ve contributed a patch that sets NUMA memory allocation policy inside mysqld rather than having to run “numactl” manually
- Bug 47213: InnoDB mutex/rw_lock should be conscious about memory ordering other than Intel
- Originally filed by Yasufumi back in 2009.
- Some good discussion going on here to ensure the patch is correct. This is the kind of patch that requires more reviewÂ than it takes to write it.
- This patch would fix the majority of problems for non-Intel CPU architectures.
- Thanks to Yasufumi for providing an updated patch, it helped a lot!
- Bug 72544: Incorrect locking for global_query_id
- I found a bug. Rather benign and not POWER specific.
Want to run MySQL 5.6.17 on POWER? Get my MySQL 5.6.17 patch here: https://flamingspork.com/mysql/mysql-5.6.17-POWER.patch
My accumulation of 5.6 patches seems fairly reliable. I’d test before putting into production, and I’d certainly love to know any problems you hit.
Get the quilt series of patches here: https://flamingspork.com/mysql/mysql-5.6.17-POWER-patches.tar.gz
I have, of course, done the legal wrangling for the Oracle Contributor Agreement (remarkably painless) and am working on making the patches completely acceptable to be merged into MySQL.
Having joined IBM now and working on Linux on Power, I’m allowed to be all happy and gleeful about a non x86 CPU architecture again, and one where Linux and free software really is a big deal.
Some of my now colleagues talked about some things related to Power 8 at Linux.conf.au so you should go and check out their talks!
Most recently, I was at Percona which was a wonderful journey where over my nearly three years there the company at least doubled in size, launched several new software products and greatly improved the quality and frequency of releases.
However the time has come for something completely different. The MySQL world is rather mature, the future of Percona software is bright and, well, I could do with poking into something rather different.
So a couple of weeks ago I started at IBM in the Linux Technology Centre working on KVM on POWER and related things. No doubt there’ll be interesting things to blog about as time goes on, but it’s about time I posted my change of employment :)