PAPR spec publicly available to download

PAPR is the Power Architecture Platform Reference document. It’s a short read at only 890 pages and defines the virtualised environment that guests run in on PowerKVM and PowerVM (i.e. what is referred to as ‘pseries’ platform in the Linux kernel).

https://members.openpowerfoundation.org/document/dl/469

As part of the OpenPower Foundation, we’re looking at ensuring this is up to date, documents KVM specific things as well as splitting out the bits that are common to OPAL and PAPR into their own documents.

Running OPAL in qemu – the powernv platform

Ben has a qemu tree up with some work-in-progress patches to qemu to support the PowerNV platform. This is the “bare metal” platform like you’d get on real POWER8 hardware running OPAL, and it allows us to use qemu like my previous post used the POWER8 Functional Simulator – to boot OpenPower firmware.

To build qemu for this, follow these steps:

apt-get -y install gcc python g++ pkg-config libz-dev libglib2.0-dev \
  libpixman-1-dev libfdt-dev git
git clone https://github.com/ozbenh/qemu.git
cd qemu
./configure --target-list=ppc64-softmmu
make -j `grep -c processor /proc/cpuinfo`

This will leave you with a ppc64-softmmu/qemu-system-ppc64 binary. Once you’ve built your OpenPower firmware to run in a simulator, you can boot it!

Note that this qemu branch is under development, and is likely to move/change or even break.

I do it like this:

cd ~/op-build/output/images;  # so skiboot.lid is in pwd
~/qemu/ppc64-softmmu/qemu-system-ppc64 -m 1G -M powernv \
-kernel zImage.epapr -nographic \
-cdrom ~/ubuntu-vivid-ppc64el-mini.iso

and this lets me test that we launch the Ubunut vivid installer correctly.

You can easily add other qemu options such as additional disks or networking and verify that it works correctly. This way, you can do development on some skiboot functionality or a variety of kernel and op-build userspace (such as the petitboot bootloader) without needing either real hardware or using the simulator.

This is useful if, say, you’re running on ppc64el, for which the POWER8 functional simulator is currently not available on.

doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
{
        ulint   i, j;
        UT_LOW_PRIORITY_CPU();
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }
        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }
        UT_RESUME_PRIORITY_CPU();
        return(j);
}

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

The sad state of MySQL and NUMA

Way back in 2010, MySQL Bug 57241 was filed, pointing out that the “swap insanity” problem was getting serious on x86 systems – with NUMA being more and more common back then.

The swapping problem is due to running out of memory on a NUMA node and having to swap things to other nodes (see Jeremy Cole‘s blog entry also from 2010 on the topic of swap insanity). This was back when 64GB and dual quad core CPUs was big – in the past five years big systems have gotten bigger.

Back then there were two things you could do to have your system be usable: 1) numa=off as kernel boot parameter (this likely has other implications though) and 2) “numactl –interleave all” in mysqld_safe script (I think MariaDB currently has this built in if you set an option but I don’t think MySQL does, otherwise perhaps the bug would have been closed).

Anyway, it’s now about 5 years since this bug was opened and even when there’s been a patch in the Twitter MySQL branch for a while (years?) and my Oracle Contributor Agreement signed patch attached to bug 72811 since May 2014 (over a year) we still haven’t seen any action.

My patch takes the approach of you want things allocated at server startup to be interleaved across nodes (e.g. buffer pool) while runtime allocations are probably per connection and are thus fine (in fact, better) to do node local allocations.

Without a patch like this, or without running mysqld with the right numactl incantation, you end up either having all your memory on one NUMA node (potentially not utilising full memory bandwidth of the hardware), or you end up with swap insanity, or you end up with some other not exactly what you’d expect situation.

While we could have MySQL be more NUMA aware and perhaps do a buffer pool instance per NUMA node or some such thing, it’s kind of disappointing that for dedicated database servers bought in the past 7+ years (according to one comment on one of the bugs) this crippling issue hasn’t been addressed upstream.

Just to make it even more annoying, on certain workloads you end up with a lot of mutex contention, which can end up meaning that binding MySQL to fewer NUMA nodes (memory and CPU) ends up increasing performance (cachelines don’t have as far to travel) – this is a different problem than swap insanity though, and one that is being addressed.

Update: My patch as part of https://bugs.mysql.com/bug.php?id=72811 has been merged! MySQL on NUMA machines just got a whole lot better. I just hope it’s enabled by default…

OPAL firmware specification, conformance and documentation

Now that we have an increasing amount of things that run on top of OPAL:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

and that the OpenPower ecosystem is rapidly growing (especially around people building OpenPower machines), the need for more formal specification, conformance testing and documentation for OPAL is increasing rapidly.

If you look at the documentation in the skiboot tree late last year, you’d notice a grand total of seven text files. Now, we’re a lot better (although far from complete).

I’m proud to say that I won’t merge new code that adds/modifies an OPAL API call or anything in the device tree that doesn’t come with accompanying documentation, and this has meant that although it may not be perfect, we have something that is a decent starting point.

We’re in the interesting situation of starting with a working system, with mainline Linux kernels now for over a year (maybe even 18 months) being able to be booted by skiboot and run on powernv hardware (the more modern the kernel the better though).

So…. if anyone loves going through deeply technical documentation… do I have a project you can contribute to!

FreeBSD on OpenPower

There’s been some work on porting FreeBSD over to run natively on top of OPAL, that is, on bare metal OpenPower machines (not just under KVM).

This is one of four possible things to run natively on an OPAL system:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

It’s great to see that another fully featured OS is getting ported to POWER8 and OPAL. It’s not yet at a stage where you could say it was finished or anything (PCI support is pretty preliminary for example, and fancy things like disks and networking live on PCI).

hello world as ppc66le OPAL payload!

While the in-tree hello-world kernel (originally by me, and Mikey managed to CUT THE BLOAT of a whole SEVENTEEN instructions down to a tiny ten) is very, very dumb (and does one thing, print “Hello World” to the console), there’s now an alternative for those who like to play with a more feature-rich Hello World rather than booting a more “real” OS such as Linux. In case you’re wondering, we use the hello world kernel as a tiny test that we haven’t completely and utterly broken things when merging/developing code.

https://github.com/andreiw/ppc64le_hello is a wonderful example of a small (INTERACTIVE!) starting point for a PowerNV (as it’s called in Linux) or “bare metal” (i.e. non-virtualised) OS on POWER.

What’s more impressive is that this was all developed using the simulator rather than real hardware (although I think somebody has tried it on some now).

Kind of neat!

gcov code coverage for OpenPower firmware

For skiboot (which provides the OPAL boot and runtime firmware for OpenPower machines), I’ve been pretty interested at getting some automated code coverage data for booting on real hardware (as well as in a simulator). Why? Well, it’s useful to see that various test suites are actually testing what you think they are, and it helps you be able to define more tests to increase what you’re covering.

The typical way to do code coverage is to make GCC build your program with GCOV, which is pretty simple if you’re a userspace program. You build with gcov, run program, and at the end you’re left with files on disk that contain all the coverage information for a tool such as lcov to consume. For the Linux kernel, you can also do this, and then extract the GCOV data out of debugfs and get code coverage for all/part of your kernel. It’s a little bit more involved for the kernel, but not too much so.

To achieve this, the kernel has to implement a bunch of stub functions itself rather than link to the gcov library as well as parse the GCOV data structures that GCC generates and emit the gcda files in debugfs when read. Basically, you replace the part of the GCC generated code that writes the files out. This works really nicely as Linux has fancy things like a VFS and debugfs.

For skiboot, we have no such things. We are firmware, we don’t have a damn file system interface. So, what do we do? Write a userspace utility to parse a dump of the appropriate region of memory, easy! That’s exactly what I did, a (relatively) simple user space app to parse out the gcov gcda files from a skiboot memory image – something we can easily dump out of the simulator, relatively easily (albeit slower) from the FSP on an IBM POWER system and even just directly out of a running system (if you boot a linux kernel with the appropriate config).

So, we can now get a (mostly automated) code coverage report simply for the act of booting to petitboot: https://open-power.github.io/skiboot/boot-coverage-report/ along with our old coverage report which was just for the unit tests (https://open-power.github.io/skiboot/coverage-report/). My current boot-coverage-report is just on POWER7 and POWER8 IBM FSP based systems – but you can see that a decent amount of code both is (and isn’t) touched simply from the act of booting to the bootloader.

The numbers we get are only approximate for any code run on more than one CPU as GCC just generates code that does a load/add/store rather than using an atomic increment.

One interesting observation was that (at least on smaller systems, which are still quite large by many people’s standards), boot time was not really noticeably increased.

For more information on running with gcov, see the in-tree documentation: https://github.com/open-power/skiboot/blob/master/doc/gcov.txt

Going beyond 1.3 MILLION SQL Queries/second

So, on a large IBM POWER8 system I was recently running the newly coined “yesmark” benchmark, which is best translated as this:

Benchmark (N for concurrency): for i in {1..N}; do yes "DO 0;" | mysql > /dev/null & done
Live results: mysqladmin -ri 1 extended-status | grep Questions

Which sounds all fun until you realize that it’s *amazingly* close in results to a sysbench point select benchmark these days (well, with MySQL 5.7.7).

Since yesmark doesn’t use InnoDB though, MariaDB is back in the game.

I don’t think it matters between MariaDB and MySQL at this point for yesbench. With MySQL in a KVM guest on a shared 2 socket POWER8 I could get 754kQPS and on a larger system, I could get 1.3 million / sec.

1.3 Million queries / sec is probably the highest number anybody has ever seen out of MySQL or MariaDB, so that’s fairly impressive in itself.

What’s also impressive is that on this workload, mysqld was still only using 50% of CPU in the system. The mysql command line client was really heavy user.

Other users are: 8% completely idle, another 12% in linux scheduler (alarmingly high really). So out of all execution time, only about 44% spent in mysqld, 29% in mysql client.

It seems that the current issues scaling to two socked POWER8 machines are the same as with scaling to other large systems, when we go beyond about 20 POWER8 cores (SMT8), we start to find new and interesting challenges.

Building OpenPower firmware for use in POWER8 Simulator

Previously, I blogged on how to Run skiboot (OPAL) on the POWER8 Simulator. If you want to build the full Open Power firmware environment, including the Petitboot bootloader and kernel, you can now do so!

My pull request for an op-build target for the simulator has been merged, so you can now do the following three steps to compile a kernel+initramfs to use with your built skiboot for development purposes:

git clone --recursive git@github.com:open-power/op-build.git
cd op-build
. op-build-env
op-build mambo_defconfig && op-build

Then you wait for a whole bunch of time while everything compiles! Afterwards, you should be left with a zImage.epapr in output/images/ that you can copy into your skiboot directory.

With zImage.epapr in your skiboot directory, when you run “make check”, the skiboot test suite will actually launch the simulator to verify that your skiboot code boots all the way to the petitboot prompt!

We now have two boot tests as part of “make check” for skiboot!

More OpenPower Firmware code released: OCC

Inside the IBM POWER8 chip there’s another processor! That’s right folks, you get another CPU for no extra cost (It’s a lot funnier if you say these previous two sentences as if you were presenting an informercial for a special TV offer).

It is, however, not what you’d consider a general purpose processor. It is, in fact, a PowerPC 405 – so your POWER8 processor also has another PowerPC chip in it. What’s the purpose of this chip? It’s named the On Chip Controller and it has the job of helping make the main processor (the POWER8) work.

It has two jobs:

  • Monitor temperature and keep the system thermally safe
  • Monitor power usage and keep the system power safe

It runs a hard Real Time OS which has just been released up on github.com/open-power/occ

There’s more complete documentation on OCC here.

It’s fairly exciting to see more of the software that runs on every POWER8 system make it out into the world.

skiboot-4.1

I just posted this to the mailing list, but I’ve tagged skiboot-4.1, so we have another release! There’s a good amount of changes since 4.0 nearly a month ago and this is the second release since we hit github back in July.

For the full set of changes, “git log” is your friend, but a summary of them follows:

  • We now build with -fstack-protector and -Werror
  • Stack checking extensions when built with STACK_CHECK=1
  • Reduced stack usage in some areas, -Wstack-usage=1024 now.
    • Some functions could use 2kb stack, now all are <1kb
  • Unsafe libc functions such as sprintf() have been removed
  • Symbolic backtraces
  • expose skiboot symbol map to OS (via device-tree)
  • removed machine check interrupt patching in OPAL
  • occ/hbrt: Call stopOCC() for implementing reset OCC command from FSP
  • occ: Fix the low level ACK message sent to FSP on receiving {RESET/LOAD}_OCC
  • hardening to errors of various FSP code
    • fsp: Avoid NULL dereference in case of invalid class_resp bits
    • abort if device tree parsing fails
    • FSP: Validate fsp_msg in fsp_queue_msg
    • fsp-elog: Add various NULL checks
  • Finessing of when to use error log vs prerror()
  • More i2c work
  • Can now run under Mambo simulator (see external/mambo/skiboot.tcl) (commonly known as “POWER8 Functional Simulator”)
  • Document skiboot versioning scheme
  • opal: Handle more TFAC errors.
    • TB_RESIDUE_ERR, FW_CONTROL_ERR and CHIP_TOD_PARITY_ERR
  • ipmi: populate FRU data
  • rtc: Add a generic rtc cache
  • ipmi/rtc: use generic cache
  • Error Logging backend for bmc based machines
  • PSI: Drive link down on HIR
  • occ: Fix clearing of OCC interrupt on remote fix

So, who worked on this release? We had 84 csets from 17 developers. A total of 3271 lines were added, 1314 removed (delta 1957).

Developers with the most changesets
Stewart Smith 24 28.6%
Benjamin Herrenschmidt 17 20.2%
Alistair Popple 8 9.5%
Vasant Hegde 6 7.1%
Ananth N Mavinakayanahalli 5 6.0%
Neelesh Gupta 4 4.8%
Mahesh Salgaonkar 4 4.8%
Cédric Le Goater 3 3.6%
Wei Yang 3 3.6%
Anshuman Khandual 2 2.4%
Shilpasri G Bhat 2 2.4%
Ryan Grimm 1 1.2%
Anton Blanchard 1 1.2%
Shreyas B. Prabhu 1 1.2%
Joel Stanley 1 1.2%
Vaidyanathan Srinivasan 1 1.2%
Dan Streetman 1 1.2%
Developers with the most changed lines
Benjamin Herrenschmidt 1290 35.1%
Alistair Popple 963 26.2%
Stewart Smith 344 9.4%
Mahesh Salgaonkar 308 8.4%
Ananth N Mavinakayanahalli 198 5.4%
Neelesh Gupta 186 5.1%
Vasant Hegde 122 3.3%
Shilpasri G Bhat 39 1.1%
Vaidyanathan Srinivasan 24 0.7%
Joel Stanley 21 0.6%
Wei Yang 20 0.5%
Anshuman Khandual 15 0.4%
Cédric Le Goater 12 0.3%
Shreyas B. Prabhu 9 0.2%
Ryan Grimm 3 0.1%
Anton Blanchard 2 0.1%
Dan Streetman 2 0.1%
Developers with the most lines removed
Mahesh Salgaonkar 287 21.8%
Developers with the most signoffs (total 54)
Stewart Smith 44 81.5%
Vasant Hegde 4 7.4%
Benjamin Herrenschmidt 4 7.4%
Vaidyanathan Srinivasan 2 3.7%
Developers with the most reviews (total 2)
Vasant Hegde 2 100.0%

Running skiboot (OPAL) on the POWER8 Simulator

skiboot is open source boot and runtime firmware for OpenPOWER. On real POWER8 hardware, you will also need HostBoot to do this (basically, to make the chip work) but in a functional simulator (such as this one released by IBM) you don’t need a bunch of hardware procedures to make hardware work, so we can make do with just skiboot.

The POWER8 Functional Simulator is free to use but not open source and is only supported on limited platforms. But you can always run it all in a VM! I have it running this way on my laptop right now.

To go from a bare Ubuntu 14.10 VM on x86_64 to running skiboot in the simulator, I did the following:

  • apt-get install vim git emacs wget xterm # xterm is needed by the simulator. wget and editors are useful things.
  • (download systemsim-p8…deb from above URL)
  • dpkg -i systemsim-p8*deb # now the simulator is installed
  • git clone https://github.com/open-power/skiboot.git # get skiboot source
  • wget https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.8.0/x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz # get a compiler to build it with
  • apt-get install make gcc valgrind # get build tools (skiboot unittests run on the host, so get a gcc and valgrind)
  • tar xfJ x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz
  • mkdir -p /opt/cross
  • mv gcc-4.8.0-nolibc /opt/cross/ # now you have a powerpc64 cross compiler
  • export PATH=/opt/cross/gcc-4.8.0-nolibc/powerpc64-linux/bin/:$PATH # add cross compiler to path
  • cd skiboot
  • make # this should build a bunch of things, leaving you with skiboot.lid (and other things). If you have many CPUs, feel free to make -j128.
  • make check # run the unit tests. Everything should pass.
  • cd external/mambo
  • /opt/ibm/systemsim-p8/run/pegasus/power8 -f skiboot.tcl # run the simulator

The last step there will barf as you unlikely have a /tmp/zImage.epapr sitting around that’s suitable. If you use op-build to build a full set of OpenPower foo, you’ll likely be able to extract it from there. Basically, the skiboot.tcl script is adding a payload for skiboot to execute. On real hardware, this ends up being a Linux kernel with a small userspace and petitboot (link is to IBM documentation for IBM POWER8 systems). For the simulator, you could boot any tiny zImage.epapr you like, it should detect OPALv3 and boot!

Even if you cannot be bothered building a kernel or petitboot environment, if you comment out the associated lines in skiboot.tcl, you should be able to run the simulator and see the skiboot console message come up that says we couldn’t load a kernel.

At this point, congratulations, you can now become an OpenPower firmware hacker without even possessing any POWER8 hardware!

skiboot/OPAL versioning

skiboot is boot and runtime firmware for OpenPower systems. There are other components that make up all the firmware you need, but if you’re, say, a Linux kernel, you’re going to be interacting with skiboot.

I recently committed doc/versioning.txt to skiboot to try and explain our current thoughts on versioning releases.

It turns out that picking version numbers is a bit harder than you’d expect, especially when you want to construct a version string to display in places that has semantic meaning. In fact, the writing on Semantic Versioning influenced us heavily.

Since we’re firmware, making incompatible API changes is something we should basically never, ever do. Old kernels should must boot and work on new firmware and new kernels should boot and function on old firmware (and if they don’t, it plainly be a kernel bug). So, ignore the Major version parts of Semantic Versioning for us :)

For each new release, we plan to bump the minor version for mostly bug fix releases, while bump the major version for added functionality. Any additional information is to describe the version on that particular platform – as everybody shipping OPAL is likely to build it themselves with possibly some customizations (e.g. YOUR COMPANY NAME HERE, support for some on board RAID card or on-board automated coffee maker). See doc/versioning.txt for details.

You may wonder why we started at 4.0 for our first real version number. Well… this is purely a cunning plan to avoid confusion with other things, the details of which will only be extracted out of my when plied with a suitable amount of excellent craft beer (because if I’m going to tell a boring story, I may as well have awesome craft beer).

C bitfields considered harmful

In C (and C++) you can specify that a variable should take a specific number of bits of storage by doing “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct -e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; }” the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. In this case, sizeof(struct foo) isn’t defined and may vary based on compiler, platform, compiler version, phases of the moon or if you’ve washed your hands recently.

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831 which relates to InnoDB performance on POWER8. InnoDB uses C bitfields which are themselves bitfields (urgh) for things like “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        Use DICT_TF_GET_COMPACT(), DICT_TF_GET_ZIP_SSIZE(),
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

volatile considered harmful

While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.

Memory barriers are hard.

Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.

Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.

On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.

If you want up to date reads (and not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load with reserve works).

What the volatile keyword does is generate load and store instructions. It is useful when talking to hardware, as the load and store instructions are actually doing something there that the compiler doesn’t know about and thus shouldn’t optimize away.

The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.

Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.

Further reading:

Preliminary MySQL Cluster benchmark results on POWER8

Yesterday, I got the basics going for MySQL Cluster on POWER. Today, I finished up a couple more patches to improve performance and ran some benchmarks.

This is on a 3.7Ghz POWER8 machine with non-balanced memory (only 2 of the 4 NUMA nodes have memory, so we have less total memory bandwidth than we could have, plus I’m going to bind ndbmtd to the CPUs in these NUMA nodes)

With a setup of a single replica and two data nodes on the one machine (each bound to a specific NUMA node), running the flexAsync benchmark on MySQL Cluster 7.3.7, I could get around:

  • 3.2 million reads/sec
  • 2.6 million deletes/sec
  • 2.4 million updates/sec
  • 2.4 million inserts/sec.

So, that’s at least in the right ballpark for a first go.

(I’m running this on a big endian host kernel, some random kernel I booted on the box and built with gcc 4.8 with whatever build options the MySQL Cluster cmake foo chooses by default)

MySQL Cluster on POWER8

So, I’ve written previously on MySQL on POWER, and today is a quick bit of news about MySQL Cluster on POWER – specifically MySQL Cluster 7.3.7.

I ran into three main issues in getting some flexAsync benchmark results. One of them was the fact that I wanted to do this in the middle of all the POWER8 machines I usually use moving buildings (hard to run benchmarks when computers are packed up in boxes on a truck).

The next issue was that ndbmtd (the multi-threaded data node) needs memory barriers for the magic message passing stuff between threads. So, that’s pretty easy (about an eight line patch).

The next issue was in the results from flexAsync, it turns out 32bit math is a bad idea with results from my POWER8 box.

My preliminary performance numbers are fairly promising (actually… what is the world record for a single machine and NDB these days? Single data node?). I think there’s a bit more low hanging fruit and a couple more things that are a bit more involved.

Bugs with patches:

  • Bug 74782 – compile fix (memory barriers for POWER)
  • Bug 74781 – flexAsync uses 32bit math, leading to incorrect summary on POWER8

MySQL 5.7.5 on POWER – thread priority

Good news everyone!

MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven’t yet gone and tried it all out, but since I’m me, I look at bugs database and git/bzr history first.

On Intel CPUs, when you’re spinning on a spin lock, you’re meant to execute the PAUSE CPU instruction. This tells the CPU that other execution threads in the same core should be given priority as you are currently not doing anything productive. Without this, you’re likely going to hurt on hyperthreaded CPUs.

In MySQL, there are custom spinlocks in order to do interesting adaptive mutex things to attempt to squeeze the most performance possible out of modern systems.

One of the (not 100% ready, but close) bugs with patches I submitted against MySQL 5.7 was for using the equivalent of the PAUSE instruction for POWER CPUs. On POWER, we’re a bit different, you can actually set priorities of threads (which may matter more, as POWER8 CPUs can be in SMT8 mode – where there are *eight* executing threads per core).

So, the good news is that in MySQL 5.7.5, the magic instructions for setting thread priority are in! This should mean great things for performance on POWER systems with any of the SMT modes enabled.

The next interesting part of this is how it interacts with other KVM guests on a system. At least on POWER (and on x86 as well, although I won’t go into details here) there’s a hypervisor call that a guest can make saying “hey, I’m spinning here, perhaps you want to make sure other vcpus execute so that at some point I can continue”. On POWER, this is the H_CONFER hcall, where you can basically do a directed yield to another vcpu (the one that holds the lock you’re trying to get is a good idea).

Generally though, it’s only the guest kernel that does this, not userspace. You can see the H_CONFER call in __spin_yield(arch_spinlock_t*) and __rw_yield(arch_rwlock_t*) in arch/powerpc/lib/locks.c in the kernel.

It would be interesting to see what extra we could get out of a system running multiple guests with MySQL servers if InnoDB/MySQL could properly yield to the right vcpu (well, thread I guess).

Tyan OpenPower

Good news everyone! Tyan has announced the availability of their first OpenPOWER system! They call this a Customer Reference System, which means it’s an excellent machine to start poking at OpenPower and POWER8 (or deploying applications on).

Because it’s an OpenPower machine, it runs the open source Open Power firmware (all up on github) and will happily run Linux (feel free to port your other operating system kernels). I’ll be writing more on the OpenPower firmware soon as, well, technical details are fun!

Ubuntu 14.10 is listed as recommended as not only have they been building for POWER8 but have spent some time ensuring things work fairly well out-of-the-box (both as a KVM guest and running native on the bare metal). Or, you can always just boot whatever the mainline kernel is at – build for the POWERNV (POWER non-virtualized) platform (be sure to include all the required drivers) and have fun!