2016 Kernel miniconf CFP

Why yes, it’s another long URL thanks to Google Docs:

Got a kernel topic you want to talk about? Got a kernel topic you want to start discussion on? Or a Q&A? Submit NOW! We’re going for part sessions, part unconference.

Questions? Contact me at

POWER8 Accelerated CRC32 merged in MariaDB 10.1

Earlier on in benchmarking MySQL and MariaDB on POWER8, we noticed that on write workloads (or read workloads involving a lot of IO) we were spending a bunch of time computing InnoDB page checksums. This is a relatively well known MySQL problem and has existed for many years and Percona even added innodb_fast_checksum to Percona Server to help alleviate the problem.

In MySQL 5.6, we got the ability to use CRC32 checksums, which are great in that they’re a lot faster to compute than tho old InnoDB “new” checksum. There’s code inside InnoDB to use the x86 SSE2 crc32q instruction to accelerate performing the checksum on compatible x86 CPUs (although oddly enough, the CRC32 checksum in the binlog does not use this acceleration).

However, on POWER, we’d end up using the software implementation of CRC32, which used a lot more CPU than we’d like. Luckily, CRC32 is really common code and for POWER8, we got some handy instructions to help computing it. Unfortunately, this required brushing up on vector polynomial math in order to understand how to do it all quickly. The end result was Anton coming up with crc32-vpmsum code that we could drop into projects that embed a copy of crc32 that was about 41 times faster than the best non-vpmsum implementation.

Recently, Daniel Black took the patch that had passed through both Daniel Axten‘s and my hands and worked on upstreaming it into MariaDB and MySQL. We did some pretty solid benchmarking on the improvement you’d get, and we pretty much cannot notice the difference between innodb_checksum=off and having it use the POWER8 accelerated CRC32 checksum, which frees up maybe 30% of CPU time to be used for things like query execution! My original benchmark showed a 30% improvement in sysbench read/write workloads.

The excellent news? Two days ago, MariaDB merged POWER8 accelerated crc32! This means that IO heavy workloads on MariaDB on POWER8 will get much faster in the next release.

MySQL bug 74776 is open, with patch attached, so hopefully MySQL will merge this soon too (hint hint).

An update on using Tor on Android

Back in 2012 I wrote a blog post on using Tor on Android which has proved quite popular over the years.

These days, there is the OrFox browser, which is from The Tor Project and is likely the current best way to browse the web through Tor on your Android device.

If you’re still using the custom setup Firefox, I’d recommend giving OrFox a try – it’s been working quite well for me.

TianoCore (UEFI) ported to OpenPower

Recently, there’s been (actually two) ports of TianoCore (the reference implementation of UEFI firmware) to run on POWER on top of OPAL (provided by skiboot) – and it can be run in the Qemu PowerNV model.

More details:

1 Million SQL Queries per second: GA MariaDB 10.1 on POWER8

A couple of days ago, MariaDB announced that MariaDB 10.1 is stable GA – around 19 months since the GA of MariaDB 10.0. With MariaDB 10.1 comes some important scalabiity improvements, especially for POWER8 systems. On POWER, we’re a bit unique in that we’re on the higher end of CPUs, have many cores, and up to 8 threads per core (selectable at runtime: 1, 2, 4 or 8/core) – so a dual socket system can easily be a 160 thread machine.

Recently, we (being IBM) announced availability of a couple of new POWER8 machines – machines designed for Linux and cloud environments. They are very much OpenPower machines, and more info is available here:

Combine these two together, with Axel Schwenke running some benchmarks and you get 1 Million SQL Queries per second with MariaDB 10.1 on POWER8.

Having worked a lot on both MySQL for POWER and the firmware that ships in the S882LC, I’m rather happy that 1 Million queries per second is beyond what it was in June 2014, which was a neat hack on MySQL 5.7 that showed the potential of MySQL on POWER8 but wasn’t yet a product. Now, you can run a GA release of MariaDB on GA POWER8 hardware designed for scale-out cloud environments and get 1 Million SQL queries/second (with fewer cores than my initial benchmark last year!)

What’s even more impressive is that this million queries per second is in a KVM guest!

PAPR spec publicly available to download

PAPR is the Power Architecture Platform Reference document. It’s a short read at only 890 pages and defines the virtualised environment that guests run in on PowerKVM and PowerVM (i.e. what is referred to as ‘pseries’ platform in the Linux kernel).

As part of the OpenPower Foundation, we’re looking at ensuring this is up to date, documents KVM specific things as well as splitting out the bits that are common to OPAL and PAPR into their own documents.

Running OPAL in qemu – the powernv platform

Ben has a qemu tree up with some work-in-progress patches to qemu to support the PowerNV platform. This is the “bare metal” platform like you’d get on real POWER8 hardware running OPAL, and it allows us to use qemu like my previous post used the POWER8 Functional Simulator – to boot OpenPower firmware.

To build qemu for this, follow these steps:

apt-get -y install gcc python g++ pkg-config libz-dev libglib2.0-dev \
  libpixman-1-dev libfdt-dev git
git clone
cd qemu
./configure --target-list=ppc64-softmmu
make -j `grep -c processor /proc/cpuinfo`

This will leave you with a ppc64-softmmu/qemu-system-ppc64 binary. Once you’ve built your OpenPower firmware to run in a simulator, you can boot it!

Note that this qemu branch is under development, and is likely to move/change or even break.

I do it like this:

cd ~/op-build/output/images;  # so skiboot.lid is in pwd
~/qemu/ppc64-softmmu/qemu-system-ppc64 -m 1G -M powernv \
-kernel zImage.epapr -nographic \
-cdrom ~/ubuntu-vivid-ppc64el-mini.iso

and this lets me test that we launch the Ubunut vivid installer correctly.

You can easily add other qemu options such as additional disks or networking and verify that it works correctly. This way, you can do development on some skiboot functionality or a variety of kernel and op-build userspace (such as the petitboot bootloader) without needing either real hardware or using the simulator.

This is useful if, say, you’re running on ppc64el, for which the POWER8 functional simulator is currently not available on.

doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
        ulint   i, j;
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
        if (ut_always_false) {
                ut_always_false = (ibool) j;

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

MySQL on NUMA machines just got better!

A followup to my previous entry , my patch that was part of Bug #72811 Set NUMA mempolicy for optimum mysqld performance has been merged!

I hope it’s enabled by default so that everything “just works”.

I also hope it filters down through MariaDB and Percona Server fairly quickly.

Also, from the release notes on that bug, I think we can expect 5.7.8 any day now.

The sad state of MySQL and NUMA

Way back in 2010, MySQL Bug 57241 was filed, pointing out that the “swap insanity” problem was getting serious on x86 systems – with NUMA being more and more common back then.

The swapping problem is due to running out of memory on a NUMA node and having to swap things to other nodes (see Jeremy Cole‘s blog entry also from 2010 on the topic of swap insanity). This was back when 64GB and dual quad core CPUs was big – in the past five years big systems have gotten bigger.

Back then there were two things you could do to have your system be usable: 1) numa=off as kernel boot parameter (this likely has other implications though) and 2) “numactl –interleave all” in mysqld_safe script (I think MariaDB currently has this built in if you set an option but I don’t think MySQL does, otherwise perhaps the bug would have been closed).

Anyway, it’s now about 5 years since this bug was opened and even when there’s been a patch in the Twitter MySQL branch for a while (years?) and my Oracle Contributor Agreement signed patch attached to bug 72811 since May 2014 (over a year) we still haven’t seen any action.

My patch takes the approach of you want things allocated at server startup to be interleaved across nodes (e.g. buffer pool) while runtime allocations are probably per connection and are thus fine (in fact, better) to do node local allocations.

Without a patch like this, or without running mysqld with the right numactl incantation, you end up either having all your memory on one NUMA node (potentially not utilising full memory bandwidth of the hardware), or you end up with swap insanity, or you end up with some other not exactly what you’d expect situation.

While we could have MySQL be more NUMA aware and perhaps do a buffer pool instance per NUMA node or some such thing, it’s kind of disappointing that for dedicated database servers bought in the past 7+ years (according to one comment on one of the bugs) this crippling issue hasn’t been addressed upstream.

Just to make it even more annoying, on certain workloads you end up with a lot of mutex contention, which can end up meaning that binding MySQL to fewer NUMA nodes (memory and CPU) ends up increasing performance (cachelines don’t have as far to travel) – this is a different problem than swap insanity though, and one that is being addressed.

Update: My patch as part of has been merged! MySQL on NUMA machines just got a whole lot better. I just hope it’s enabled by default…

OPAL firmware specification, conformance and documentation

Now that we have an increasing amount of things that run on top of OPAL:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

and that the OpenPower ecosystem is rapidly growing (especially around people building OpenPower machines), the need for more formal specification, conformance testing and documentation for OPAL is increasing rapidly.

If you look at the documentation in the skiboot tree late last year, you’d notice a grand total of seven text files. Now, we’re a lot better (although far from complete).

I’m proud to say that I won’t merge new code that adds/modifies an OPAL API call or anything in the device tree that doesn’t come with accompanying documentation, and this has meant that although it may not be perfect, we have something that is a decent starting point.

We’re in the interesting situation of starting with a working system, with mainline Linux kernels now for over a year (maybe even 18 months) being able to be booted by skiboot and run on powernv hardware (the more modern the kernel the better though).

So…. if anyone loves going through deeply technical documentation… do I have a project you can contribute to!

FreeBSD on OpenPower

There’s been some work on porting FreeBSD over to run natively on top of OPAL, that is, on bare metal OpenPower machines (not just under KVM).

This is one of four possible things to run natively on an OPAL system:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

It’s great to see that another fully featured OS is getting ported to POWER8 and OPAL. It’s not yet at a stage where you could say it was finished or anything (PCI support is pretty preliminary for example, and fancy things like disks and networking live on PCI).

hello world as ppc66le OPAL payload!

While the in-tree hello-world kernel (originally by me, and Mikey managed to CUT THE BLOAT of a whole SEVENTEEN instructions down to a tiny ten) is very, very dumb (and does one thing, print “Hello World” to the console), there’s now an alternative for those who like to play with a more feature-rich Hello World rather than booting a more “real” OS such as Linux. In case you’re wondering, we use the hello world kernel as a tiny test that we haven’t completely and utterly broken things when merging/developing code. is a wonderful example of a small (INTERACTIVE!) starting point for a PowerNV (as it’s called in Linux) or “bare metal” (i.e. non-virtualised) OS on POWER.

What’s more impressive is that this was all developed using the simulator rather than real hardware (although I think somebody has tried it on some now).

Kind of neat!

gcov code coverage for OpenPower firmware

For skiboot (which provides the OPAL boot and runtime firmware for OpenPower machines), I’ve been pretty interested at getting some automated code coverage data for booting on real hardware (as well as in a simulator). Why? Well, it’s useful to see that various test suites are actually testing what you think they are, and it helps you be able to define more tests to increase what you’re covering.

The typical way to do code coverage is to make GCC build your program with GCOV, which is pretty simple if you’re a userspace program. You build with gcov, run program, and at the end you’re left with files on disk that contain all the coverage information for a tool such as lcov to consume. For the Linux kernel, you can also do this, and then extract the GCOV data out of debugfs and get code coverage for all/part of your kernel. It’s a little bit more involved for the kernel, but not too much so.

To achieve this, the kernel has to implement a bunch of stub functions itself rather than link to the gcov library as well as parse the GCOV data structures that GCC generates and emit the gcda files in debugfs when read. Basically, you replace the part of the GCC generated code that writes the files out. This works really nicely as Linux has fancy things like a VFS and debugfs.

For skiboot, we have no such things. We are firmware, we don’t have a damn file system interface. So, what do we do? Write a userspace utility to parse a dump of the appropriate region of memory, easy! That’s exactly what I did, a (relatively) simple user space app to parse out the gcov gcda files from a skiboot memory image – something we can easily dump out of the simulator, relatively easily (albeit slower) from the FSP on an IBM POWER system and even just directly out of a running system (if you boot a linux kernel with the appropriate config).

So, we can now get a (mostly automated) code coverage report simply for the act of booting to petitboot: along with our old coverage report which was just for the unit tests ( My current boot-coverage-report is just on POWER7 and POWER8 IBM FSP based systems – but you can see that a decent amount of code both is (and isn’t) touched simply from the act of booting to the bootloader.

The numbers we get are only approximate for any code run on more than one CPU as GCC just generates code that does a load/add/store rather than using an atomic increment.

One interesting observation was that (at least on smaller systems, which are still quite large by many people’s standards), boot time was not really noticeably increased.

For more information on running with gcov, see the in-tree documentation:

Going beyond 1.3 MILLION SQL Queries/second

So, on a large IBM POWER8 system I was recently running the newly coined “yesmark” benchmark, which is best translated as this:

Benchmark (N for concurrency): for i in {1..N}; do yes "DO 0;" | mysql > /dev/null & done
Live results: mysqladmin -ri 1 extended-status | grep Questions

Which sounds all fun until you realize that it’s *amazingly* close in results to a sysbench point select benchmark these days (well, with MySQL 5.7.7).

Since yesmark doesn’t use InnoDB though, MariaDB is back in the game.

I don’t think it matters between MariaDB and MySQL at this point for yesbench. With MySQL in a KVM guest on a shared 2 socket POWER8 I could get 754kQPS and on a larger system, I could get 1.3 million / sec.

1.3 Million queries / sec is probably the highest number anybody has ever seen out of MySQL or MariaDB, so that’s fairly impressive in itself.

What’s also impressive is that on this workload, mysqld was still only using 50% of CPU in the system. The mysql command line client was really heavy user.

Other users are: 8% completely idle, another 12% in linux scheduler (alarmingly high really). So out of all execution time, only about 44% spent in mysqld, 29% in mysql client.

It seems that the current issues scaling to two socked POWER8 machines are the same as with scaling to other large systems, when we go beyond about 20 POWER8 cores (SMT8), we start to find new and interesting challenges.

Towards (and beyond) ONE MILLION queries per second

At Percona Live MySQL Conference 2015 next week I’ll be presenting on “Towards One MILLION queries per second” on 14th April at 4:50pm in Ballroom A.

This is the story of work I’ve been doing to get MySQL executing ONE MILLION SQL queries per second. It involves tales of MySQL, tales of the POWER8 Processor and a general amount of fun in extracting huge amounts of performance.

As I speak, I’m working on some even more impressive benchmark results! New hardware, new MySQL versions and really breaking news on MySQL scalability.

Preliminary results from POWER8 optimized CRC32 for MySQL

So, Anton got some useful code working that I could patch into a MySQL server for testing purposes – a POWER8 optimized CRC32 implementation.

I went with a pretty stock MySQL 5.6.22 (one patch) with sysbench preparing a single 2GB table (10,000,000 rows). I then hacked up innochecksum so that it would only do the correct CRC32 (rather than trying each checksum type). Using the standard CRC32 algorithm it took around three seconds to verify all of the checksums. With a POWER8 optimized CRC32: 0.4-0.5 seconds. Useful speed-up!

I then ran sysbench read/write with 16 threads with oltp-table-size=10000 (on the larger table) to see if there would be an improvement in a “real world” workload. I got about 30% better performance on read/write operations!

Using perf to see where CPU was going, CPU time spent doing CRC32 calculations went down from ~2.5% to ~0.25%!

In theory, we should be able to get about 52GiB/sec of CRC32 out of a 4.1Ghz POWER8 core. I don’t think we’ll be hitting this in MySQL any time soon.

Give us another week or two and we’ll likely have a patch that’s ready to merge.

Initial benchmarks look promising though!

Amounts of RAM for devices so that I no longer have to worry about it.

I think this is my current “okay, I don’t have to worry about RAM” list currently:

  • Phone/Tablet: 2GB
  • Laptop: 8GB (although 8GB is better… 4GB is tolerable IFF SSD)
  • Development server: 16GB (32GB if shared) (emacs+gcc)
  • Box for testing things: 128GB (virtualization, databases)

This is… kind of mind bending.

Building OpenPower firmware for use in POWER8 Simulator

Previously, I blogged on how to Run skiboot (OPAL) on the POWER8 Simulator. If you want to build the full Open Power firmware environment, including the Petitboot bootloader and kernel, you can now do so!

My pull request for an op-build target for the simulator has been merged, so you can now do the following three steps to compile a kernel+initramfs to use with your built skiboot for development purposes:

git clone --recursive
cd op-build
. op-build-env
op-build mambo_defconfig && op-build

Then you wait for a whole bunch of time while everything compiles! Afterwards, you should be left with a zImage.epapr in output/images/ that you can copy into your skiboot directory.

With zImage.epapr in your skiboot directory, when you run “make check”, the skiboot test suite will actually launch the simulator to verify that your skiboot code boots all the way to the petitboot prompt!

We now have two boot tests as part of “make check” for skiboot!