Running skiboot (OPAL) on the POWER8 Simulator

skiboot is open source boot and runtime firmware for OpenPOWER. On real POWER8 hardware, you will also need HostBoot to do this (basically, to make the chip work) but in a functional simulator (such as this one released by IBM) you don’t need a bunch of hardware procedures to make hardware work, so we can make do with just skiboot.

The POWER8 Functional Simulator is free to use but not open source and is only supported on limited platforms. But you can always run it all in a VM! I have it running this way on my laptop right now.

To go from a bare Ubuntu 14.10 VM on x86_64 to running skiboot in the simulator, I did the following:

  • apt-get install vim git emacs wget xterm # xterm is needed by the simulator. wget and editors are useful things.
  • (download systemsim-p8…deb from above URL)
  • dpkg -i systemsim-p8*deb # now the simulator is installed
  • git clone https://github.com/open-power/skiboot.git # get skiboot source
  • wget https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.8.0/x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz # get a compiler to build it with
  • apt-get install make gcc valgrind # get build tools (skiboot unittests run on the host, so get a gcc and valgrind)
  • tar xfJ x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz
  • mkdir -p /opt/cross
  • mv gcc-4.8.0-nolibc /opt/cross/ # now you have a powerpc64 cross compiler
  • export PATH=/opt/cross/gcc-4.8.0-nolibc/powerpc64-linux/bin/:$PATH # add cross compiler to path
  • cd skiboot
  • make # this should build a bunch of things, leaving you with skiboot.lid (and other things). If you have many CPUs, feel free to make -j128.
  • make check # run the unit tests. Everything should pass.
  • cd external/mambo
  • /opt/ibm/systemsim-p8/run/pegasus/power8 -f skiboot.tcl # run the simulator

The last step there will barf as you unlikely have a /tmp/zImage.epapr sitting around that’s suitable. If you use op-build to build a full set of OpenPower foo, you’ll likely be able to extract it from there. Basically, the skiboot.tcl script is adding a payload for skiboot to execute. On real hardware, this ends up being a Linux kernel with a small userspace and petitboot (link is to IBM documentation for IBM POWER8 systems). For the simulator, you could boot any tiny zImage.epapr you like, it should detect OPALv3 and boot!

Even if you cannot be bothered building a kernel or petitboot environment, if you comment out the associated lines in skiboot.tcl, you should be able to run the simulator and see the skiboot console message come up that says we couldn’t load a kernel.

At this point, congratulations, you can now become an OpenPower firmware hacker without even possessing any POWER8 hardware!

skiboot/OPAL versioning

skiboot is boot and runtime firmware for OpenPower systems. There are other components that make up all the firmware you need, but if you’re, say, a Linux kernel, you’re going to be interacting with skiboot.

I recently committed doc/versioning.txt to skiboot to try and explain our current thoughts on versioning releases.

It turns out that picking version numbers is a bit harder than you’d expect, especially when you want to construct a version string to display in places that has semantic meaning. In fact, the writing on Semantic Versioning influenced us heavily.

Since we’re firmware, making incompatible API changes is something we should basically never, ever do. Old kernels should must boot and work on new firmware and new kernels should boot and function on old firmware (and if they don’t, it plainly be a kernel bug). So, ignore the Major version parts of Semantic Versioning for us :)

For each new release, we plan to bump the minor version for mostly bug fix releases, while bump the major version for added functionality. Any additional information is to describe the version on that particular platform – as everybody shipping OPAL is likely to build it themselves with possibly some customizations (e.g. YOUR COMPANY NAME HERE, support for some on board RAID card or on-board automated coffee maker). See doc/versioning.txt for details.

You may wonder why we started at 4.0 for our first real version number. Well… this is purely a cunning plan to avoid confusion with other things, the details of which will only be extracted out of my when plied with a suitable amount of excellent craft beer (because if I’m going to tell a boring story, I may as well have awesome craft beer).

C bitfields considered harmful

In C (and C++) you can specify that a variable should take a specific number of bits of storage by doing “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct -e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; }” the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. In this case, sizeof(struct foo) isn’t defined and may vary based on compiler, platform, compiler version, phases of the moon or if you’ve washed your hands recently.

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831 which relates to InnoDB performance on POWER8. InnoDB uses C bitfields which are themselves bitfields (urgh) for things like “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        Use DICT_TF_GET_COMPACT(), DICT_TF_GET_ZIP_SSIZE(),
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

volatile considered harmful

While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.

Memory barriers are hard.

Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.

Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.

On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.

If you want up to date reads (and not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load with reserve works).

What the volatile keyword does is generate load and store instructions. It is useful when talking to hardware, as the load and store instructions are actually doing something there that the compiler doesn’t know about and thus shouldn’t optimize away.

The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.

Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.

Further reading:

Preliminary MySQL Cluster benchmark results on POWER8

Yesterday, I got the basics going for MySQL Cluster on POWER. Today, I finished up a couple more patches to improve performance and ran some benchmarks.

This is on a 3.7Ghz POWER8 machine with non-balanced memory (only 2 of the 4 NUMA nodes have memory, so we have less total memory bandwidth than we could have, plus I’m going to bind ndbmtd to the CPUs in these NUMA nodes)

With a setup of a single replica and two data nodes on the one machine (each bound to a specific NUMA node), running the flexAsync benchmark on MySQL Cluster 7.3.7, I could get around:

  • 3.2 million reads/sec
  • 2.6 million deletes/sec
  • 2.4 million updates/sec
  • 2.4 million inserts/sec.

So, that’s at least in the right ballpark for a first go.

(I’m running this on a big endian host kernel, some random kernel I booted on the box and built with gcc 4.8 with whatever build options the MySQL Cluster cmake foo chooses by default)

MySQL Cluster on POWER8

So, I’ve written previously on MySQL on POWER, and today is a quick bit of news about MySQL Cluster on POWER – specifically MySQL Cluster 7.3.7.

I ran into three main issues in getting some flexAsync benchmark results. One of them was the fact that I wanted to do this in the middle of all the POWER8 machines I usually use moving buildings (hard to run benchmarks when computers are packed up in boxes on a truck).

The next issue was that ndbmtd (the multi-threaded data node) needs memory barriers for the magic message passing stuff between threads. So, that’s pretty easy (about an eight line patch).

The next issue was in the results from flexAsync, it turns out 32bit math is a bad idea with results from my POWER8 box.

My preliminary performance numbers are fairly promising (actually… what is the world record for a single machine and NDB these days? Single data node?). I think there’s a bit more low hanging fruit and a couple more things that are a bit more involved.

Bugs with patches:

  • Bug 74782 – compile fix (memory barriers for POWER)
  • Bug 74781 – flexAsync uses 32bit math, leading to incorrect summary on POWER8

MariaDB Foundation board

There seems be a bit of an exodus from the MariaDB Foundation board recently… I’m not sure exactly what to make of it all, but the current members according to https://mariadb.org/en/foundation/ are:

  • Rasmus Johansson (chair)
  • Michael “Monty” Widenius
  • Jeremy Zawodny
  • Sergei Golubchik

With Jeremy Zawodny being the only non-MariaDB Corp member.

Recently, Jeremy Cole asked some people about their membership:

I’m a little worried for the project, the idea of a foundation around it and for people I count as friends who work on MariaDB.

MySQL 5.7.5 on POWER – thread priority

Good news everyone!

MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven’t yet gone and tried it all out, but since I’m me, I look at bugs database and git/bzr history first.

On Intel CPUs, when you’re spinning on a spin lock, you’re meant to execute the PAUSE CPU instruction. This tells the CPU that other execution threads in the same core should be given priority as you are currently not doing anything productive. Without this, you’re likely going to hurt on hyperthreaded CPUs.

In MySQL, there are custom spinlocks in order to do interesting adaptive mutex things to attempt to squeeze the most performance possible out of modern systems.

One of the (not 100% ready, but close) bugs with patches I submitted against MySQL 5.7 was for using the equivalent of the PAUSE instruction for POWER CPUs. On POWER, we’re a bit different, you can actually set priorities of threads (which may matter more, as POWER8 CPUs can be in SMT8 mode – where there are *eight* executing threads per core).

So, the good news is that in MySQL 5.7.5, the magic instructions for setting thread priority are in! This should mean great things for performance on POWER systems with any of the SMT modes enabled.

The next interesting part of this is how it interacts with other KVM guests on a system. At least on POWER (and on x86 as well, although I won’t go into details here) there’s a hypervisor call that a guest can make saying “hey, I’m spinning here, perhaps you want to make sure other vcpus execute so that at some point I can continue”. On POWER, this is the H_CONFER hcall, where you can basically do a directed yield to another vcpu (the one that holds the lock you’re trying to get is a good idea).

Generally though, it’s only the guest kernel that does this, not userspace. You can see the H_CONFER call in __spin_yield(arch_spinlock_t*) and __rw_yield(arch_rwlock_t*) in arch/powerpc/lib/locks.c in the kernel.

It would be interesting to see what extra we could get out of a system running multiple guests with MySQL servers if InnoDB/MySQL could properly yield to the right vcpu (well, thread I guess).

Tyan OpenPower

Good news everyone! Tyan has announced the availability of their first OpenPOWER system! They call this a Customer Reference System, which means it’s an excellent machine to start poking at OpenPower and POWER8 (or deploying applications on).

Because it’s an OpenPower machine, it runs the open source Open Power firmware (all up on github) and will happily run Linux (feel free to port your other operating system kernels). I’ll be writing more on the OpenPower firmware soon as, well, technical details are fun!

Ubuntu 14.10 is listed as recommended as not only have they been building for POWER8 but have spent some time ensuring things work fairly well out-of-the-box (both as a KVM guest and running native on the bare metal). Or, you can always just boot whatever the mainline kernel is at – build for the POWERNV (POWER non-virtualized) platform (be sure to include all the required drivers) and have fun!

MariaDB 10.0 on POWER

Good news for those wanting to run MariaDB on POWER systems, the latest 10.0 bzr tree (as of a couple of weeks ago) builds and runs well!

I recently pulled the latest MariaDB 10.0 from BZR and built it on a POWER8 system in the lab to run some quick tests. The MariaDB team has done some work on getting MariaDB to run on POWER recently, a bunch of which is based off my work on MySQL on POWER.

There’s obviously still some work in progress going on, but my initial results show performance within around 10% of MySQL, so with a bit of work we will hopefully see MariaDB reach performance parity.

One interesting find was the code to account for thread memory usage uses a single atomic variable: this does not scale and does end up showing up on profiles.

I’ll comment more on the code in a future post, but it looks like we will have MariaDB being functional on POWER in an upcoming release.

Quick MySQL 5.7.5 thoughts

It was great to see the recent announcement of MySQL 5.7.5 over at the MySQL Server Team blog. I’m looking forward to throwing this release at some of the POWER8 systems we have for a couple of really good reasons: 1) Does it work better than previous MySQL 5.7 releases “out of the box” on POWER? 2) What do the scalability improvements in 5.7.5 mean for peak QPS on POWER (and can I set a new record?).

Looking through the list of changes, I’m (casually not) surprised as to the number of features and the amount of work that echoes what we were working on in Drizzle a few years ago.

A closer look at the source for 5.7.5 may also prove enlightening, I wonder how the MySQL team is coping with a lot of the code rot legacy and the absolutely atrocious internal APIs they inherited…

MariaDB & Trademarks, and advice for your project

I want to emphasize this for those who have not spent time near trademarks: trademarks are trouble and another one of those things where no matter what, the lawyers always win. If you are starting a company or an open source project, you are going to have to spend a whole bunch of time with lawyers on trademarks or you are going to get properly, properly screwed.

MySQL AB always held the trademark for MySQL. There’s this strange thing with trademarks and free software, where while you can easily say “use and modify this code however you want” and retain copyright on it (for, say, selling your own version of it), this does not translate too well to trademarks as there’s a whole “if you don’t defend it, you lose it” thing.

The law, is, in effect, telling you that at some point you have to be an arsehole to not lose your trademark. (You can be various degrees of arsehole about it when you have to, and whenever you do, you should assume that people are acting in good faith and just have not spent the last 40,000 years of their life talking to trademark lawyers like you have).Basically, you get to spend time telling people that they have to rename their product from “MySQL Headbut” to “Headbut for MySQL” and that this is, in fact, a really important difference.

You also, at some point, get to spend a lot of time talking about when the modifications made by a Linux distribution to package your software constitute sufficient changes that it shouldn’t be using your trademark (basically so that you’re never stuck if some arse comes along, forks it, makes it awful and keeps using your name, to the detriment of your project and business).

If you’re wondering why Firefox isn’t called Firefox in Debian, you can read the Mozilla trademark policy and probably some giant thread on debian-legal I won’t point to.

Of course, there’s ‘ MySQL trademark policy and when I was at Percona, I spent some non-trivial amount of time attempting to ensure we had a trademark policy that would work from a legal angle, a corporate angle, and a get-our-software-into-linux-distros-happily angle.

So, back in 2010, Monty started talking about a draft MariaDB trademark policy (see also, Ubuntu trademark policy, WordPress trademark policy). If you are aiming to create a development community around an open source project, this is something you need to get right. There is a big difference between contributing to a corporate open source product and an open source project – both for individuals and corporations. If you are going to spend some of your spare time contributing to something, the motivation goes down when somebody else is going to directly profit off it (corporate project) versus a community of contributors and companies who will all profit off it (open source project). The most successful hybrid of these two is likely Ubuntu, and I am struggling to think of another (maybe Fedora?).

Linux is an open source project, RedHat Enterprise Linux is an open source product and in case it wasn’t obvious when OpenSolaris was no longer Open, OpenSolaris was an open source product (and some open source projects have sprung up around the code base, which is great to see!). When a corporation controls the destiny of the name and the entire source code and project infrastructure – it’s a product of that corporation, it’s not a community around a project.

From the start, it seemed that one of the purposes of MariaDB was to create a developer community around a database server that was compatible with MySQL, and eventually, to replace it. MySQL AB was not very good at having an external developer community, it was very much an open source product and not a an open source project (one of the downsides to hiring just about anyone who ever submitted a patch). Things struggled further at Sun and (I think) have actually gotten better for MySQL at Oracle – not perfect, I could pick holes in it all day if I wanted, but certainly better.

When we were doing Drizzle, we were really careful about making sure there was a development community. Ultimately, with Drizzle we made a different fatal error, and one that we knew had happened to another open source project and nearly killed it: all the key developers went to work for a single company. Looking back, this is easily my biggest professional regret and one day I’ll talk about it more.

Brian Aker observed (way back in 2010) that MariaDB was, essentially, just Monty Program. In 2013, I did my own analysis on the source tree of MariaDB 5.5.31 and MariaDB 10.0.3-ish to see if indeed there was a development community (tl;dr; there wasn’t, and I had the numbers to prove it).If you look back at the idea of the Open Database Alliance and the MariaDB Foundation, actually, I’m just going to quote Henrik here from his blog post about leaving MariaDB/Monty Program:

When I joined the company over a year ago I was immediately involved in drafting a project plan for the Open Database Alliance and its relation to MariaDB. We wanted to imitate the model of the Linux Foundation and Linux project, where the MariaDB project would be hosted by a non-profit organization where multiple vendors would collaborate and contribute. We wanted MariaDB to be a true community project, like most successful open source projects are – such as all other parts of the LAMP stack.

….

The reality today, confirmed to me during last week, is that:

Those in charge at Monty Program have decided to keep ownership of the MariaDB trademark, logo and mariadb.org domain, since this will make the company more valuable to investors and eventually to potential buyers.

Now, with Monty Program being sold to/merged into (I’m really not sure) SkySQL, it was SkySQL who had those things. So instead of having Monty Program being (at least in theory) one of the companies working on MariaDB and following the Hacker Business Model, you now have a single corporation with all the developers, all of the trademarks, that is, essentially a startup with VC looking to be valuable to potential buyers (whatever their motives).

Again, I’m going to just quote Henrik on the us-vs-them on community here:

Some may already have observed that the 5.2 release was not announced at all on mariadb.org, rather on the Monty Program blog. It is even intact with the “us vs them” attitude also MySQL AB had of its community, where the company is one entity and “outside community contributors” is another. This is repeated in other communication, such as the recent Recently in MariaDB newsletter.

This was, again, back in 2010.

More recently, Jeremy Cole, someone who has pumped a fair bit of personal and professional effort into MySQL and MariaDB over the past (many) years, asked what seemed to be a really simple question on the maria-discuss mailing list. Basically, “What’s going on with the MariaDB trademark? Isn’t this something that should be under the MariaDB foundation?”

The subsequent email thread was as confusing as ever and should be held up as a perfect example about what not to do. Some of us had by now, for years, smelt something fishy going on around the talk of a community project versus the reality. At the time (October 2013), Rasmus Johansson (VP of Engineering at SkySQL and Board Member of MariaDB foundation) said this:

The MariaDB Foundation and SkySQL are currently working on the trademark issue to come up with a solution on what rights to the trademark each entity should have. Expect to hear more about this in a fairly near future.

 

MariaDB has from its beginning been a very community friendly project and much of the success of MariaDB relies in that fact. SkySQL of course respects that.

(and at the same time, there were pages that were “Copyright MariaDB” which, as it was pointed out, was not an actual entity… so somebody just wasn’t paying attention). Also, just to make things even less clear about where SkySQL the corporation, Monty Program the corporation and the MariaDB Foundation all fit together, Mark Callaghan noticed this text up on mariadb.com:

The MariaDB Foundation also holds the trademark of the MariaDB server and owns mariadb.org. This ensures that the official MariaDB development tree<https://code.launchpad.net/maria> will always be open for the MariaDB developer community.

So…. there’s no actual clarity here. I can imagine attempting to get involved with MariaDB inside a corporation and spending literally weeks talking to a legal department – which thrills significantly less than standing in lines at security in an airport does.

So, if you started off as yay! MariaDB is going to be a developer community around an open source project that’s all about participation, you may have even gotten code into MariaDB at various times… and then started to notice a bit of a shift… there may have been some intent to make that happen, to correct what some saw as some of the failings of MySQL, but the reality has shown something different.

Most recently, SkySQL has renamed themselves to MariaDB. Good luck to anyone who isn’t directly involved with the legal processes around all this differentiating between MariaDB the project, MariaDB Foundation and MariaDB the company and who owns what. Urgh. This is, in no way, like the Linux Foundation and Linux.

Personally, I prefer to spend my personal time contributing to open source projects rather than products. I have spent the vast majority of my professional life closer to the corporate side of open source, some of which you could better describe as closer to the open source product end of the spectrum. I think it is completely and totally valid to produce an open source product. Making successful companies, products and a butt-ton of money from open source software is an absolutely awesome thing to do and I, personally, have benefited greatly from it.

MariaDB is a corporate open source product. It is no different to Oracle MySQL in that way. Oracle has been up front and honest about it the entire time MySQL has been part of Oracle, everybody knew where they stood (even if you sometimes didn’t like it). The whole MariaDB/Monty Program/SkySQL/MariaDB Foundation/Open Database Alliance/MariaDB Corporation thing has left me with a really bitter taste in my mouth – where the opportunity to create a foundation around a true community project with successful business based on it has been completely squandered and mismanaged.

I’d much rather deal with those who are honest and true about their intentions than those who aren’t.

My guess is that this factored heavily into Henrik’s decision to leave in 2010 and (more recently) Simon Phipps’s decision to leave in August of this year. These are two people who I both highly respect, never have enough time to hang out with and I would completely trust to do the right thing and be honest when running anything in relation to free and open source software.

Maybe WebScaleSQL will succeed here – it’s a community with a purpose and several corporate contributors. A branch rather than a fork may be the best way to do this (Percona is rather successful with their branch too).

Some current MySQL Architecture writings

So, I’ve been looking around for a while (and a few times now) for any good resources that cover a bunch of MySQL architecture and technical details aimed towards the technically proficient but not MySQL literate audience. I haven’t really found anything. I mean, there’s the (huge and very detailed) MySQL manual, there’s the MySQL Internals manual (which is sometimes only 10 years out of date) and there’s various blog entries around the place. So I thought I’d write something explaining roughly how it all fits together and what it does to your system (processes, threads, IO etc).(Basically, I’ve found myself explaining this enough times in the past few years that I should really write it down and just point people to my blog). I’ve linked to things for more reading. You should probably read them at some point.

Years ago, there were many presentations on MySQL Architecture. I went to try and find some on YouTube and couldn’t. We were probably not cool enough for YouTube and the conferences mostly weren’t recorded. So, instead, I’ll just link to Brian on NoSQL – because it’s important to cover NoSQL as well.

So, here is a quick overview of executing a query inside a MySQL Server and all the things that can affect it. This isn’t meant to be complete… just a “brief” overview (of a few thousand words).

MySQL is an open source relational database server, the origins of which date back to 1979 with MySQL 1.0 coming into existence in 1995. It’s code that has some history and sometimes this really does show. For a more complete history, see my talk from linux.conf.au 2014: Past, Present and Future of MySQL (YouTube, Download).

At least of writing, everything here applies to MariaDB and Percona Server too.

The MySQL Server runs as a daemon (mysqld). Users typically interact with it over a TCP or UNIX domain socket through the MySQL network protocol (of which multiple implementations exist under various licenses). Each connection causes the MySQL Server (mysqld) to spawn a thread to handle the client connection.

There are now several different thread-pool plugins that instead of using one thread per connection, multiplex connections over a set of threads. However, these plugins are not commonly deployed and we can largely ignore them. For all intents and purposes, the MySQL Server spawns one thread per connection and that thread alone performs everything needed to service that connection. Thus, parallelism in the MySQL Server is gotten from executing many concurrent queries rather than one query concurrently.

The MySQL Server will cache threads (the amount is configurable) so that it doesn’t have to have the overhead of pthread_create() for each new connection. This is controlled by the thread_cache_size configuration option. It turns out that although creating threads may be a relatively cheap operation, it’s actually quite time consuming in the scope of many typical MySQL Server connections.

Because the MySQL Server is a collection of threads, there’s going to be thread local data (e.g. connection specific) and shared data (e.g. cache of on disk data). This means mutexes and atomic variables. Most of the more advanced ways of doing concurrency haven’t yet made it into MySQL (e.g. RCU hasn’t yet and is pretty much needed to get 1 million TPS), so you’re most likely going to see mutex contention and contention on cache lines for atomic operations and shared data structures.

There are also various worker threads inside the MySQL Server that perform various functions (e.g. replication).

Until sometime in the 2000s, more than one CPU core was really uncommon, so the fact that there were many global mutexes in MySQL wasn’t really an issue. These days, now that we have more reliable async networking and disk IO system calls but MySQL has a long history, there’s global mutexes still and there’s no hard and fast rule about how it does IO.

Over the past 10 years of MySQL development, it’s been a fight to remove the reliance on global mutexes and data structures controlled by them to attempt to increase the number of CPU cores a single mysqld could realistically use. The good news is that it’s no longer going to max out on the number of CPU cores you have in your phone.

So, you have a MySQL Client (e.g. the mysql client or something) connecting to the MySQL Server. Now, you want to enter a query. So you do that, say “SELECT 1;”. The query is sent to the server where it is parsed, optimized, executed and the result returns to the client.

Now, you’d expect this whole process to be incredibly clean and modular, like you were taught things happened back in university with several black boxes that take clean input and produce clean output that’s all independent data structures. At least in the case of MySQL, this isn’t really the case. For over a decade there’s been lovely architecture diagrams with clean boxes – the code is not like this at all. But this probably only worries you once you’re delving into the source.

The parser is a standard yacc one – there’s been attempts to replace it over the years, none of which have stuck – so we have the butchered yacc one still. With MySQL 5.0, it exploded in size due to the addition of things like SQL2003 stored procedures and it is of common opinion that it’s rather bloated and was better in 4.1 and before for the majority of queries that large scale web peeps execute.

There is also this thing called the Query Cache – protected by a single global mutex. It made sense in 2001 for a single benchmark. It is a simple hash of the SQL statement coming over the wire to the exact result to send(2) over the socket back to a client. On a single CPU system where you ask the exact same query again and again without modifying the data it’s the best thing ever. If this is your production setup, you probably want to think about where you went wrong in your life. On modern systems, enabling the query cache can drop server performance by an order of magnitude. A single global lock is a really bad idea. The query cache should be killed with fire – but at least in the mean time, it can be disabled.

Normally, you just have the SQL progress through the whole process of parse, optimize, execute, results but the MySQL Server also supports prepared statements. A prepared statement is simply this: “Hi server, please prepare this statement for execution leaving the following values blank: X, Y and Z” followed by “now execute that query with X=foo, Y=bar and Z=42”. You can call execute many times with different values. Due to the previously mentioned not-quite-well-separated parse, optimize, execute steps, prepared statements in MySQL aren’t as awesome as in other relational databases. You basically end up saving parsing the query again when you execute it with new parameters. More on prepared statements (from 2006) here. Unless you’re executing the same query many times in a single connection, server side prepared statements aren’t worth the network round trips.

The absolute worst thing in the entire world is MySQL server side prepared statements. It moves server memory allocation to be the responsibility of the clients. This is just brain dead stupid and a reason enough to disable prepared statements. In fact, just about every MySQL client library for every programming language ever actually fakes prepared statements in the client rather than trust every $language programmer to remember to close their prepared statements. Open many client connections to a MySQL Server and prepare a lot of statements and watch the OOM killer help you with your DoS attack.

So now that we’ve connected to the server, parsed the query (or done a prepared statement), we’re into the optimizer. The optimizer looks at a data structure describing the query and works out how to execute it. Remember: SQL is declarative, not procedural. The optimizer will access various table and index statistics in order to work out an execution plan. It may not be the best execution plan, but it’s one that can be found within reasonable time. You can find out the query plan for a SELECT statement by prepending it with EXPLAIN.

The MySQL optimizer is not the be all and end all of SQL optimizers  (far from it). A lot of MySQL performance problems are due to complex SQL queries that don’t play well with the optimizer, and there’s been various tricks over the years to work around deficiencies in it. If there’s one thing the MySQL optimizer does well it’s making quick, pretty good decisions about simple queries. This is why MySQL is so popular – fast execution of simple queries.

To get table and index statistics, the optimizer has to ask the Storage Engine(s). In MySQL, the actual storage of tables (and thus the rows in tables) is (mostly) abstracted away from the upper layers. Much like a VFS layer in an OS kernel, there is (for some definition) an API abstracting away the specifics of storage from the rest of the server. The API is not clean and there are a million and one layering violations and exceptions to every rule. Sorry, not my fault.

Table definitions are in FRM files on disk, entirely managed by MySQL (not the storage engines) and for your own sanity you should not ever look into the actual file format. Table definitions are also cached by MySQL to save having to open and parse a file.

Originally, there was MyISAM (well, and ISAM before it, but that’s irrelevant now). MyISAM was non-transactional but relatively fast, especially for read heavy workloads. It only allowed one writer although there could be many concurrent readers. MyISAM is still there and used for system tables. The current default storage engine is called InnoDB. It’s all the buzzwords like ACID and MVCC. Just about every production environment is going to be using InnoDB. MyISAM is effectively deprecated.

InnoDB originally was its own independent thing and has (to some degree) been maintained as if it kind of was. It is, however, not buildable outside a MySQL Server anymore. It also has its own scalability issues. A recent victory was splitting the kernel_mutex, which was a mutex that protected far too much internal InnoDB state and could be a real bottleneck where NRCPUs > 4.

So, back to query execution. Once the optimizer has worked out how to execute the query, MySQL will start executing it. This probably involves accessing some database tables. These are probably going to be InnoDB tables. So, MySQL (server side) will open the tables, looking up the MySQL Server table definition cache and creating a MySQL Server side table share object which is shared amongst the open table instances for that table. See here for scalability hints on these (from 2009). The opened table objects are also cached – table_open_cache. In MySQL 5.6, there is table_open_cache_instances, which splits the table_open_cache mutex into table_open_cache_instances mutexes to help reduce lock contention on machines with many CPU cores (> 8 or >16 cores, depending on workload).

Once tables are opened, there are various access methods that can be employed. Table scans are the worst (start at the start and examine every row). There’s also index scans (often seeking to part of the index first) and key lookups. If your query involves multiple tables, the server (not the storage engine) will have to do a join. Typically, in MySQL, this is a nested loop join. In an ideal world, this would all be really easy to spot when profiling the MySQL server, but in reality, everything has funny names like rnd_next.

As an aside, any memory allocated during query execution is likely done as part of a MEM_ROOT – essentially a pool allocator, likely optimized for some ancient libc on some ancient linux/Solaris and it just so happens to still kinda work okay. There’s some odd server configuration options for (some of the) MEM_ROOTs that get exactly no airtime on what they mean or what changing them will do.

InnoDB has its own data dictionary (separate to FRM files) which can also be limited in current MySQL (important when you have tens of thousands of tables) – which is separate to the MySQL Server table definitions and table definition cache.

But anyway, you have a number of shared data structures about tables and then a data structure for each open table. To actually read/write things to/from tables, you’re going to have to get some data to/from disk.

InnoDB tables can be stored either in one giant table space or file-per-table. (Even though it’s now configurable), InnoDB database pages are 16kb. Database pages are cached in the InnoDB Buffer Pool, and the buffer-pool-size should typically be about 80% of system memory. InnoDB will use a (configurable) method to flush. Typically, it will all be O_DIRECT (it’s configurable) – which is why “just use XFS” is step 1 in IO optimization – the per inode mutex in ext3/ext4 just doesn’t make IO scale.

InnoDB will do some of its IO in the thread that is performing the query and some of it in helper threads using native linux async IO (again, that’s configurable). With luck, all of the data you need to access is in the InnoDB buffer pool – where database pages are cached. There exists innodb_buffer_pool_instances configuration option which will split the buffer pool into several instances to help reduce lock contention on the InnoDB buffer pool mutex.

All InnoDB tables have a clustered index. This is the index by which the rows are physically sorted by. If you have an INT PRIMARY KEY on your  InnoDB table, then a row with that primary key value of 1 will be physically close to the row with primary key value 2 (and so on). Due to the intricacies of InnoDB page allocation, there may still be disk seeks involved in scanning a table in primary key order.

Every page in InnoDB has a checksum. There was an original algorithm, then there was a “fast” algorithm in some forks and now we’re converging on crc32, mainly because Intel implemented CPU instructions to make that fast. In write heavy workloads, this used to show up pretty heavily in profiles.

InnoDB has both REDO and UNDO logging to keep both crash consistency and provide consistent read views to transactions. These are also stored on disk, the redo logs being in their own files (size and number are configurable). The larger the redo logs, the longer it may take to run recovery after a crash. The smaller the redo logs, the more trouble you’re likely to run into with large or many concurrent transactions.

If your query performs writes to database tables, those changes are written to the REDO log and then, in the background, written back into the table space files. There exists configuration parameters for how much of the InnoDB buffer pool can be filled with dirty pages before they have to be flushed out to the table space files.

In order to maintain Isolation (I in ACID), InnoDB needs to assign a consistent read view to a new transaction. Transactions are either started explicitly (e.g. with BEGIN) or implicitly (e.g. just running a SELECT statement). There has been a lot of work recently in improving the scalability of creating read views inside InnoDB. A bit further in the past there was a lot of work in scaling InnoDB for greater than 1024 concurrent transactions (limitations in UNDO logging).

Fancy things that make InnoDB generally faster than you’d expect are the Adaptive Hash Index and change buffering. There are, of course, scalability challenges with these too. It’s good to understand the basics of them however and (of course), they are configurable.

If you end up reading or writing rows (quite likely) there will also be a translation between the InnoDB row format(s) and the MySQL Server row format(s). The details of which are not particularly interesting unless you’re delving deep into code or wish to buy me beer to hear about them.

Query execution may need to get many rows from many tables, join them together, sum things together or even sort things. If there’s an index with the sort order, it’s better to use that. MySQL may also need to do a filesort (sort rows, possibly using files on disk) or construct a temporary table in order to execute the query. Temporary tables are either using the MEMORY (formerly HEAP) storage engine or the MyISAM storage engine. Generally, you want to avoid having to use temporary tables – doing IO is never good.

Once you have the results of a query coming through, you may think that’s it. However, you may also be part of a replication hierarchy. If so, any changes made as part of that transaction will be written to the binary log. This is a log file maintained by the MySQL Server (not the storage engines) of all the changes to tables that have occured. This log can then be pulled by other MySQL servers and applied, making them replication slaves of the master MySQL Server.

We’ll ignore the differences between statement based replication and row based replication as they’re better discussed elsewhere. Being part of replication means you get a few extra locks and an additional file or two being written. The binary log (binlog for short) is a file on disk that is appended to until it reaches a certain size and is then rotated. Writes to this file vary in size (along with the size of transactions being executed). The writes to the binlog occur as part of committing the transaction (and the crash safety between writing to the binlog and writing to the storage engine are covered elsewhere – basically: you don’t want to know).

If your MySQL Server is a replication slave, then you have a thread reading the binary log files from another MySQL Server and then another thread (or, in newer versions, threads) applying the changes.

If the slow query log or general query log is enabled, they’ll also be written to at various points – and the current code for this is not optimal, there be (yes, you guess it) global mutexes.

Once the results of a query have been sent back to the client, the MySQL Server cleans things up (frees some memory) and prepares itself for the next query. You probably have many queries being executed simultaneously, and this is (naturally) a good thing.

There… I think that’s a mostly complete overview of all the things that can go on during query execution inside MySQL.

MySQL 5.6.20 on POWER

It’s been a little while since I blogged on MySQL on POWER (last time was thinking that new releases would be much better for running on POWER). Well, I recently grabbed the MySQL 5.6.20 source tarball and had a go with it on a POWER8 system in the lab. There is good news: I now only need one patch to have it function pretty flawlessly (no crashes). Unfortunately, there’s still a bit of an odd thing with some of the InnoDB mutex code (bug filed at some point soon).

But, with this one patch applied, I was getting okay sysbench results and things are looking good.

Now just to hope the MySQL team applies my other patches that improve things on POWER. To be honest, I’m a bit disappointed many of them have sat there for this long… it doesn’t help build a development community when patches can sit for months without either “applied” or “fix these things first”. That being said, just as I write this, Bug 72809 which I filed has been closed as the fix has been merged into 5.6.22 and 5.7.6, so there is hope… it’s just that things can just be silent for a while. It seems I go back and forth on how various MySQL variants are going with fostering an external development community.

OpenPower firmware up on github!

With the whole OpenPower thing, a lot of low level firmware is being open sourced, which is really exciting for the platform – the less proprietary code sitting in memory the better in my books.

If you go to https://github.com/open-power you’ll see code for a bunch of the low level firmware for OpenPower and POWER8.

Hostboot is the bit of code that brings up the CPU and skiboot both sets up hardware and provides runtime services to Linux (such as talking to the service processor, if one is present).

Patches to https://github.com/open-power/skiboot/blob/master/doc/overview.txt are (of course) really quite welcome. It shouldn’t be too hard to get your head around the basics.

To see the Linux side of the OPAL interface, go check out linux/arch/powerpc/platforms/powernv -there you can see how we ask OPAL to do things for us.

If you buy a POWER8 system from IBM running PowerKVM you’re running this code.

Update on MySQL on POWER8

About 1.5 months ago I blogged on MySQL 5.6 on POWER andtalked about what I had to poke at to make modern MySQL versions run and run well on shiny POWER8 systems.

One of those bugs, MySQL bug 47213 (InnoDB mutex/rw_lock should be conscious of memory ordering other than Intel) was recently marked as CLOSED by the Oracle MySQL team and the upcoming 5.6.20 and 5.7.5 releases should have the fix!

This is excellent news for those wanting to run MySQL on SMP systems that don’t have an Intel-like memory model (e.g. POWER and MIPS64).

This was the most major and invasive patch in the patchset for MySQL on POWER. It’s absolutely fantastic that this has made it into 5.6.20 and 5.7.5 and may mean that these new versions will work out-of-the-box on POWER (I haven’t checked… but from glancing back at my patchset there was only one other patch that could be related to correctness rather than performance).

Performance impact of MySQL query cache on modern hardware

Recently, Morgan has been writing on deprecating some MySQL features and inspired by that while working on MySQL on POWER, I wondered “What is the impact of the MySQL query cache on modern hardware?”

We’ve known for over six years (since before we started Drizzle) that the query cache hurt performance. It was for that reason that the query cache was one of the early things to be removed from Drizzle, it just didn’t scale on multi core systems that  we were targeting.

So what about modern hardware? While working on MySQL 5.6 on POWER8, I enabled the query cache and ran a benchmark. Enabling the query cache reduced performance by an order of magnitude. I suspect the performance impact is even higher on MySQL 5.7.

My vote? Deprecate it. Deprecate it now, print a giant warning if it’s enabled and at some point just remove it. A single global mutex just doesn’t scale to 4 cores, let alone 24 cores at 8 threads per core.

1 million SQL Queries Per Second: MySQL 5.7 on POWER8

I’ve previously covered MySQL 5.6 on POWER (with patch), MySQL 5.6 Performance on POWER8 (spoiler: new performance record) and MySQL 5.7 on POWER.

Of course, The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions. Also, these numbers should be considered preliminary, but trust me – I did get them and it’s not April 1st.

From my last post, you saw that with my preliminary patch for MySQL 5.7 to work on POWER, we could easily match the previous record for sysbench point select queries per second (i.e. key lookups). In fact, we could exceed the published record by a little bit which is kind of nice. At around 630kQPS, one could be rather happy.

But we still had 30-40% idle CPU on POWER8. This led me to file the following bug report:

  • Bug 72829: LOCK_grant is major contention point, leaves 30-40% idle CPU.

What’s going on is that there’s a rwlock in the MySQL Server that ensures that writers don’t collide with readers to the data structures describing the GRANTs (i.e. who has access to what). If you run a GRANT statement, it gets a writer lock, and nobody can read (i.e. check permissions) while everything is being updated. If you run a normal SQL statement, you get a read lock (non-exclusive) and can check permissions appropriately.

It’s been known for a long time that LOCK_grant was a bottleneck. Typically, some people have run with skip-grant-tables to help shorten the time the lock as held (as in MySQL you still take the mutex even though you’ve started the server with skip-grant-tables).

In Drizzle, we fixed that – moving authentication and authorization completely behind plugin APIs and if you didn’t load plugins for them, you executed near enough to zero instructions that it didn’t matter.

In my experiments, enabling skip-grant-tables actually hurt performance rather than helped. More investigation is needed, but it seems that simply the act of acquiring and releasing the rdlock is now a major bottleneck in some benchmarks (such as sysbench point select).

It turns out that this is a well known problem in other pieces of software (e.g. Linux kernel) and is pretty much what RCU (Read Copy Update) is best at. As far back as 2006 I remember attempting to get my head around RCU so that one day we could use it in MySQL or MySQL Cluster.

Another simpler method is simply splitting the mutex, with readers able to acquire any one of N mutexes and writers needing to acquire them all. This penalizes writers, but unless you’re executing a lot of GRANTs, you’re probably safe.

So… what is the theoretical maximum performance if this bottleneck went away?

I wrote a quick patch that just commented out the rdlock acquisition of LOCK_grant in the hot codepath of sysbench point selects. I wasn’t running GRANT statements at runtime so this was “safe”.

This patch is not production ready, it’s merely useful for demonstrating where we could be with MySQL 5.7 on POWER8 if one last bottleneck is fixed.

My results? Slightly over ONE MILLION QUERIES PER SECOND!

This is roughly twice the previous record.

This is with a dual socket 24 core POWER8 with SMT8 and DSCR=1 on 8 tables with sysbench 0.4.8. Sysbench itself is using a non-trivial amount of CPU and I could probably decently beat this number if I rewrote sysbench using the nonblocking API in libdrizzle (back when me made the Drizzle performance regression tests use a libdrizzle-ified sysbench we got double digit percentage improvement in our sysbench numbers).

There’s still around 7-10% idle CPU time… so there’s more room to grow.

Lacking a physical gauntlet to throw down, I’ll just have to submit a conference paper somewhere so that I can do that in person.

I really hope that we’re able to fix this bottleneck in MySQL 5.7 so that MySQL 5.7 will ship being able to do over a million queries per second. From SQL.

MySQL 5.7 on POWER

In a previous post, I covered porting MySQL 5.6 to POWER and subsequently, some new record performance numbers with MySQL 5.6.17 on POWER8.

Well, those following at home will be aware that not only is the next sentence sponsored by IBM Legal, but that MySQL 5.7 alleviates a bunch of the mutex contention that we saw with MySQL 5.6. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

In looking at MySQL performance on POWER, it’s inevitable that I should look at MySQL 5.7 and what’s coming up in the next stable release of MySQL.

Surprisingly, a bunch of the core code in InnoDB and MySQL dealing with mutexes has changed in MySQL 5.7 when compared to MySQL 5.6. Enough that I actually had to post a few bug reports about the changes that apply to any CPU architecture:

  • Bug 72805: mutex_delay() creating excess memory traffic, GCC mem barrier needed
    • This is now more generic mutex code, so it’s even more important to get it right. There’s a bunch of tricks that have been learned in other places (e.g. Linux kernel) in getting these things right. We need to get them right in MySQL too.
    • One of these tricks is in ensuring that the compiler doesn’t compile down spinloops to nothing.
  • Bug 72806: mutex_delay() missing x86 pause instruction optimization
    • This is actually a regression over 5.6.
    • On x86, there is an instruction (PAUSE) that tells the CPU that you’re in a spin loop and that it should yield resources in the CPU core to other threads (or thread, as HT CPUs only have 2 threads per core).
    • We have a different way of doing things on POWER, and I’ve got a patch for that too.
    • What’s interesting is reading the Intel CPU manual about the PAUSE instruction and how even if you went and benchmarked it, it depends on the CPU on if this is a NO-OP or not.
    • I suspect that with this bug fixed, performance on Hyper Threaded Intel systems will improve.
  • Bug 72807: Set thread priority in my_pthread_fastmutex_lock
    • This is the POWER equivalent of the x86 PAUSE instruction.
    • I’ve found this patch to have a quite decent positive impact on sysbench point select performance.

There were also the bugs I mentioned in my MySQL 5.6 on POWER blog post. Notably, I had to port Yasufumi’s memory barrier patch from 5.6 to 5.7. My port is incomplete (I can still crash mysqld without too much trying) but I’ve deemed it currently “good enough for benchmarking” and it’s attached to bug 47213 (I hope to spend some time fixing it up soon too). I don’t think I’m missing anything that’s going to have a major performance impact – so while not suitable for production use, it’s good enough to poke some benchmarks at.

So… I’m close to the point where I’ll share my patch for MySQL 5.7, but I’m really wanting to solve the last couple of issues before doing so. The majority of patches are attached to bug reports and get 99% of the way.

Amazingly enough, MySQL 5.7 works fairly well on POWER “out of the box”, and with sysbench point selects, I could quite easily get 320kQPS on a 24 core POWER8 with SMT8 mode without changing a single line of code or doing anything special. This alone is an impressive result when compared to the previous record on both POWER and other CPU architectures with MySQL 5.6 that had been optimized for POWER (while out-of-the-box MySQL 5.7 has not)
For my benchmarks, I’m doing the same procedure, workload and basic my.cnf settings that Dimitri has used and written about, so I won’t repeat that here.With my preliminary patch for MySQL 5.7.4-m14 to have it work well on POWER, on the same system I was using for my MySQL 5.6 benchmarks, I could easily match and indeed exceed the previous published maximum sysbench point select results (I got ~630kQPS). Consider this number a bit preliminary as my patch isn’t completely solid, but it does mean that we’re in the right ballpark for MySQL 5.7 performance, which is great news!So, you might just say “Mission Accomplished” and be done with it. Well… there was one issue:  with the maximum numbers I was getting there was still 30-40% idle CPU on the POWER 8 machine.Now… you could just use that idle 30-40% of total CPU to do other things (solving Sudoku in SQL for example) but that’s no fun.