API, ABI and backwards compatibility are a hard necessity

Recently, I was reading a thread on LKML on a proposal to change the behavior of the open system call when confronted with unknown flags. The thread is worth a read as the topic of augmenting things that exist probably by accident to be “better” is always interesting, as is the definition of “better”.

Keeping API and/or ABI compatibility is something that isn’t a new problem, and it’s one that people are pretty good at sometimes messing up.

This problem does not go away just because “we have cloud now”. In any distributed system, in order to upgrade it (or “be agile” as the kids are calling it), you by definition are going to have either downtime or at least two versions running concurrently. Thus, you have to have your interfaces/RPCs/APIs/ABIs/protocols/whatever cope with changes.

You cannot instantly upgrade the world, it happens gradually. You also have to design for at least three concurrent versions running. One is the original, the second is your upgrade, your third is the urgent fix because the upgrade is quite broken in some new way you only discover in production.

So, the way you do this? Never ever EVER design for N-1 compatibility only. Design for going back a long way, much longer than you officially support. You want to have a design and programming culture of backwards compatibility to ensure you can both do new and exciting things and experiment off to the side.

It’s worth going and rereading Rusty’s API levels posts from 2008:

MySQL removes the FRM (7 years after Drizzle did)

The new MySQL 8.0.0 milestone release that was recently announced brings something that has been a looooong time coming: the removal of the FRM file. I was the one who implemented this in Drizzle way back in 2009 (July 28th 2009 according to Brian)- and I may have had a flashback to removing the tentacles of the FRM when reading the MySQL 8.0.0 announcement.

As an idea for how long this has been on the cards, I’ll quote Brian from when we removed it in Drizzle:

We have been talking about getting rid of FRM since around 2003. I remember a drive up to northern Finland with Kaj Arnö, where we spent an hour talking about this. I, David, and MontyW have talked about this for years.


Soo… it was a known problem for at least thirteen years. One of the issues removing it was how pervasive all of the FRM related things were. I shudder at the mention of “pack_flag” and Jay Pipes probably does too.

At the time, we tried a couple of approaches as to how things should look. Our philosophy with Drizzle was that it should get out of the way at let the storage engines be the storage engines and not try to second guess them or keep track of things behind their back. I still think that was the correct architectural approach: the role of Drizzle was to put SQL on top of a storage engine, not to also be one itself.

Looking at the MySQL code, there’s one giant commit 31350e8ab15179acab5197fa29d12686b1efd6ef. I do mean giant too, the diffstat is amazing:

 786 files changed, 58471 insertions(+), 25586 deletions(-)

How anyone even remotely did code review on that I have absolutely no idea. I know the only way I could get it to work in Drizzle was to do it incrementally, a series of patches that gradually chiseled out what needed to be taken out so I could put it an API and the protobuf code.

Oh, and in case you’re wondering:

- uint offset,pack_flag;
+ uint offset;

Thank goodness. Now, you may not appreciate that as much as I might, but pack_flag was not the height of design, it was… pretty much a catchalll for some kind of data about a field that wasn’t something that already had a field in the FRM. So it may include information on if the field could be null or not, if it’s decimal, how many bytes an integer takes, that it’s a number and how many oh, just don’t ask.

Also gone is the weird interval_id and a whole bunch of limitations because of the FRM format, including one that I either just discovered or didn’t remember: if you used all 256 characters in an enum, you couldn’t create the table as MySQL would pick either a comma or an unused character to be the separator in the FRM!?!

Also changed is how the MySQL server handles default values. For those not aware, the FRM file contains a static copy of the row containing default values. This means the default values are computed once on table creation and never again (there’s a bunch of work arounds for things like AUTO_INCREMENT and DEFAULT NOW()). The new sql/default_values.cc is where this is done now.

For now at least, table metadata is also written to a file that appears to be JSON format. It’s interesting that a SQL database server is using a schemaless file format to describe schema. It appears that these files exist only for disaster recovery or perhaps portable tablespaces. As such, I’m not entirely convinced they’re needed…. it’s just a thing to get out of sync with what the storage engine thinks and causes extra IO on DDL (as well as forcing the issue that you can’t have MVCC into the data dictionary itself).

What will be interesting is to see the lifting of these various limitations and how MariaDB will cope with that. Basically, unless they switch, we’re going to see some interesting divergence in what you can do in either database.

There’s certainly differences in how MySQL removed the FRM file to the way we did it in Drizzle. Hopefully some of the ideas we had were helpful in coming up with this different approach, as well as an extra seven years of in-production use.

At some point I’ll write something up as to the fate of Drizzle and a bit of a post-mortem, I think I may have finally worked out what I want to say…. but that is a post for another day.

Compiling your own firmware for Barreleye (OpenCompute OpenPOWER system)

Aaron Sullivan announced on the Rackspace Blog that you can now get your own Barreleye system! What’s great is that the code for the Barreleye platform is upstream in the op-build project, which means you can build your own firmware for them (just like garrison, the “IBM S822LC for HPC” system I blogged about a few days ago).

Remarkably, to build an image for the host firmware, it’s eerily similar to any other platform:

git clone --recursive https://github.com/open-power/op-build.git
cd op-build
. op-build-env
op-build barreleye_defconfig

…and then you wait. You can cross compile on x86.

You’ve been able to build firmware for these machines with upstream code since Feb/March (I wouldn’t recommend running with builds from then though, try the latest release instead).

Hopefully, someone involved in OpenBMC can write on how to build the BMC firmware.

Using Smatch static analysis on OpenPOWER OPAL firmware

For Skiboot, I’m always looking at new automated systems to find bugs in the code. A little while ago, I read about the Smatch tool developed by some folks at Oracle (they also wrote about using it on the Linux kernel).

I was eager to try it with skiboot to see if it could find anything.

Luckily, it was pretty easy. I built Smatch according to their documentation and then built skiboot:

make CHECK="/home/stewart/smatch/smatch" C=1 -j20 all check

Due to some differences in how we implement abort() and assert() in skiboot, I added “_abort”, “abort” and “assert_fail” to smatch_data/no_return_funcs in the Smatch source tree to silence some false positives.

It seems that there’s a few useful warnings there (some of which I’ve fixed in skiboot master already), along with some false positives around the preprocessor/complier tricks we do to ensure at compile time that an OPAL call definition has the correct number of arguments specified.

So far, so good though. Try it on your project!

Fuzzing Firmware – afl-fuzz + skiboot

In what is likely to be a series on how firmware makes some normal tools harder to use, first I’m going to look at american fuzzy lop – a tool for fuzz testing that if you’re not using then you most certainly have bugs it’ll find for you.

I first got interested in afl-fuzz during Erik de Castro Lopo’s excellent linux.conf.au 2016 in Geelong earlier this year: “Fuzz all the things!“. In a previous life, the Random Query Generator managed to find a heck of a lot of bugs in MySQL (and Drizzle). For randgen info, see Philip Stoev’s talk on it from way back in 2009, a recent (2014) blog post on how Tokutek uses it and some notes on how it was being used at Oracle from 2013. Basically, the randgen was a specialized fuzzer that (given a grammar) would randomly generate SQL queries, and then (if the server didn’t crash), compare the result to some other database server (e.g. your previous version).

The afl-fuzz fuzzer takes a different approach – it’s a much more generic fuzzer rather than a targeted tool. Also, while tools such as the random query generator are extremely powerful and find specialized bugs, they’re hard to get started with. A huge benefit of afl-fuzz is that it’s really, really simple to get started with.

Basically, if you have a binary that takes input on stdin or as a (relatively small) file, afl-fuzz will just work and find bugs for you – read the Quick Start Guide and you’ll be finding bugs in no time!

For firmware of course, we’re a little different than a simple command line program as, well, we aren’t one! Luckily though, we have unit tests. These are just standard binaries that include a bunch of firmware code and get run in user space as part of “make check”. Also, just like unit tests for any project, people do send me patches that break tests (which I reject).

Some of these tests act on data we get from a place – maybe reading other parts of firmware off PNOR or interacting with data structures we get from other bits of firmware. For testing this code, it can be relatively easy to (for the test), read these off disk.

For skiboot, there’s a data structure we get from the service processor on FSP machines called HDAT. Basically, it’s just like the device tree, but different. Because yet another binary format is always a good idea (yes, that is laced with a heavy dose of sarcasm). One of the steps in early boot is to parse the HDAT data structure and convert it to a device tree. Luckily, we structured our code so that creating a unit test that can run in userspace was relatively easy, we just needed to dump this data structure out from a running machine. You can see the test case here. Basically, hdat_to_dt is a binary that reads the HDAT structure out of a pair of files and prints out a device tree. One of the regression tests we have is that we always produce the same output from the same input.

So… throwing that into AFL yielded a couple of pretty simple bugs, especially around aborting out on invalid data (it’s better to exit the process with failure rather than hit an assert). Nothing too interesting here on my simple input file, but it does mean that our parsing code exits “gracefully” on invalid data.

Another utility we have is actually a userspace utility for accessing the gard records in the flash. A GARD record is a record of a piece of hardware that has been deconfigured due to a fault (or a suspected fault). Usually this utility operates on PNOR flash through /dev/mtd – but really what it’s doing is talking to the libflash library, that we also use inside skiboot (and on OpenBMC) to read/write from flash directly, via /dev/mtd or just from a file. The good news? I haven’t been able to crash this utility yet!

So I modified the pflash utility to read from a file to attempt to fuzz the partition reading code we have for the partitioning format that’s on PNOR. So far, no crashes – although to even get it going I did have to fix a bug in the file handling code in pflash, so that’s already a win!

But crashing bugs aren’t the only type of bugs – afl-fuzz has exposed several cases where we act on uninitialized data. How? Well, we run some test cases under valgrind! This is the joy of user space unit tests for firmware – valgrind becomes a tool that you can run! Unfortunately, these bugs have been sitting in my “todo” pile (which is, of course, incredibly long).

Where to next? Fuzzing the firmware calls themselves would be nice – although that’s going to require a targeted tool that knows about what to pass each of the calls. Another round of afl-fuzz running would also be good, I’ve fixed a bunch of the simple things and having a better set of starting input files would be great (and likely expose more bugs).

MySQL Contributions status

This post is an update to the status of various MySQL bugs (some with patches) that I’ve filed over the past couple of years (or that people around me have). I’m not looking at POWER specific ones, as there are these too, but each of these bugs here deal with general correctness of the code base.

Firstly, let’s look at some points I’ve raised:

  • Incorrect locking for global_query_id (bug #72544)
    Raised on May 5th, 2014 on the internals list. As of today, no action (apart from Dimitri verifying the bug back in May 2014). There continues to be locking that perhaps only works by accident around query IDs.Soon, this bug will be two years old.
  • Endian code based on CPU type rather than endian define (bug #72715)
    About six-hundred and fifty days ago I filed this bug – back in May 2014, which probably has a relatively trivial fix of using the correct #ifdef of BIG_ENDIAN/LITTLE_ENDIAN rather than doing specific behavior based on #ifdef __i386__
    What’s worse is that this looks like somebody being clever for a compiler in the 1990s, which unlikely ends up with the most optimal code today.
  • mysql-test-run.pl –valgrind-all does not run all binaries under valgrind (bug #74830)
    Yeah, this should be a trivial fix, but nothing has happened since November 2014.
    I’m unlikely to go provide a patch simply because it seems to take sooooo damn long to get anything merged.
  • MySQL 5.1 doesn’t build with Bison 3.0 (bug #77529)
    Probably of little consequence, unless you’re trying to build MySQL 5.1 on a linux distro released in the last couple of years. Fixed in Maria for a long time now.

Trivial patches:

  • Incorrect function name in DBUG_ENTER (bug #78133)
    Pull request number 25 on github – a trivial patch that is obviously correct, simply correcting some debug only string.
    So far, over 191 days with no action. If you can’t get trivial and obvious patches merged in about 2/3rds of a year, you’re not going to grow contributions. Nearly everybody coming to a project starts out with trivial patches, and if a long time contributor who will complain loudly on the internet (like I am here) if his trivial patches aren’t merged can’t get it in, what chance on earth does a newcomer have?
    In case you’re wondering, this is the patch:

    --- a/sql/rpl_rli_pdb.cc
    +++ b/sql/rpl_rli_pdb.cc
    @@ -470,7 +470,7 @@ bool Slave_worker::read_info(Rpl_info_handler *from)
     bool Slave_worker::write_info(Rpl_info_handler *to)
    -  DBUG_ENTER("Master_info::write_info");
    +  DBUG_ENTER("Slave_worker::write_info");
  • InnoDB table flags in bitfield is non-optimal (bug #74831)
    With a patch since I filed this back in November 2014, it’s managed to sit idle long enough for GCC 4.8 to practically disappear from anywhere I care about, and 4.9 makes better optimization decisions. There are other reasons why C bitfields are an awful idea too.

Actually complex issues:

  • InnoDB mutex spin loop is missing GCC barrier (bug #72755)
    Again, another bug filed back in May 2014, where InnoDB is doing a rather weird trick to attempt to get the compiler to not optimize away a spinloop. There’s a known good way of doing this, it’s called a compiler barrier. I’ve had a patch for nearly two years, not merged :(
  • buf_block_align relies on random timeouts, volatile rather than memory barriers (bug #74775)
    This bug was first filed in November 2014 and deals with a couple of incorrect assumptions about memory ordering and what volatile means.
    While this may only exhibit a problem on ARM and POWER processors (as well as any other relaxed memory ordering architectures, x86 is the notable exception), it’s clearly incorrect and very non-portable.
    Don’t expect MySQL 5.7 to work properly on ARM (or POWER). Try this:

    ./mysql-test-run.pl rpl.rpl_checksum_cache --repeat=10

    You’ll likely find MySQL > 5.7.5 still explodes.
    In fact, there’s also Bug #79378 which Alexey Kopytov filed with patch that’s been sitting idle since November 2015 which is likely related to this bug.

Not covered here: universal CRC32 hardware acceleration (rather than just for innodb data pages) and other locking issues (some only recently discovered). I also didn’t go into anything filed in December 2015… although in any other project I’d expect something filed in December 2015 to have been looked at by now.

Like it or not, MySQL is still upstream for all the MySQL derivatives active today. Maybe this will change as RocksDB and TokuDB gain users and if WebScaleSQL, MariaDB and Percona can foster a better development community.

linux.conf.au 2016 Kernel miniconf CFP

Why yes, it’s another long URL thanks to Google Docs: https://docs.google.com/forms/d/148SieC6vmAxJZ3R5Lz5e1Mb0IM06LNSCt6WNVEwYFcs/viewform

Got a kernel topic you want to talk about? Got a kernel topic you want to start discussion on? Or a Q&A? Submit NOW! We’re going for part sessions, part unconference.

Questions? Contact me at stewart@linux.vnet.ibm.com

doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
        ulint   i, j;
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
        if (ut_always_false) {
                ut_always_false = (ibool) j;

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

OPAL firmware specification, conformance and documentation

Now that we have an increasing amount of things that run on top of OPAL:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

and that the OpenPower ecosystem is rapidly growing (especially around people building OpenPower machines), the need for more formal specification, conformance testing and documentation for OPAL is increasing rapidly.

If you look at the documentation in the skiboot tree late last year, you’d notice a grand total of seven text files. Now, we’re a lot better (although far from complete).

I’m proud to say that I won’t merge new code that adds/modifies an OPAL API call or anything in the device tree that doesn’t come with accompanying documentation, and this has meant that although it may not be perfect, we have something that is a decent starting point.

We’re in the interesting situation of starting with a working system, with mainline Linux kernels now for over a year (maybe even 18 months) being able to be booted by skiboot and run on powernv hardware (the more modern the kernel the better though).

So…. if anyone loves going through deeply technical documentation… do I have a project you can contribute to!

FreeBSD on OpenPower

There’s been some work on porting FreeBSD over to run natively on top of OPAL, that is, on bare metal OpenPower machines (not just under KVM).

This is one of four possible things to run natively on an OPAL system:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

It’s great to see that another fully featured OS is getting ported to POWER8 and OPAL. It’s not yet at a stage where you could say it was finished or anything (PCI support is pretty preliminary for example, and fancy things like disks and networking live on PCI).

hello world as ppc66le OPAL payload!

While the in-tree hello-world kernel (originally by me, and Mikey managed to CUT THE BLOAT of a whole SEVENTEEN instructions down to a tiny ten) is very, very dumb (and does one thing, print “Hello World” to the console), there’s now an alternative for those who like to play with a more feature-rich Hello World rather than booting a more “real” OS such as Linux. In case you’re wondering, we use the hello world kernel as a tiny test that we haven’t completely and utterly broken things when merging/developing code.

https://github.com/andreiw/ppc64le_hello is a wonderful example of a small (INTERACTIVE!) starting point for a PowerNV (as it’s called in Linux) or “bare metal” (i.e. non-virtualised) OS on POWER.

What’s more impressive is that this was all developed using the simulator rather than real hardware (although I think somebody has tried it on some now).

Kind of neat!

gcov code coverage for OpenPower firmware

For skiboot (which provides the OPAL boot and runtime firmware for OpenPower machines), I’ve been pretty interested at getting some automated code coverage data for booting on real hardware (as well as in a simulator). Why? Well, it’s useful to see that various test suites are actually testing what you think they are, and it helps you be able to define more tests to increase what you’re covering.

The typical way to do code coverage is to make GCC build your program with GCOV, which is pretty simple if you’re a userspace program. You build with gcov, run program, and at the end you’re left with files on disk that contain all the coverage information for a tool such as lcov to consume. For the Linux kernel, you can also do this, and then extract the GCOV data out of debugfs and get code coverage for all/part of your kernel. It’s a little bit more involved for the kernel, but not too much so.

To achieve this, the kernel has to implement a bunch of stub functions itself rather than link to the gcov library as well as parse the GCOV data structures that GCC generates and emit the gcda files in debugfs when read. Basically, you replace the part of the GCC generated code that writes the files out. This works really nicely as Linux has fancy things like a VFS and debugfs.

For skiboot, we have no such things. We are firmware, we don’t have a damn file system interface. So, what do we do? Write a userspace utility to parse a dump of the appropriate region of memory, easy! That’s exactly what I did, a (relatively) simple user space app to parse out the gcov gcda files from a skiboot memory image – something we can easily dump out of the simulator, relatively easily (albeit slower) from the FSP on an IBM POWER system and even just directly out of a running system (if you boot a linux kernel with the appropriate config).

So, we can now get a (mostly automated) code coverage report simply for the act of booting to petitboot: https://open-power.github.io/skiboot/boot-coverage-report/ along with our old coverage report which was just for the unit tests (https://open-power.github.io/skiboot/coverage-report/). My current boot-coverage-report is just on POWER7 and POWER8 IBM FSP based systems – but you can see that a decent amount of code both is (and isn’t) touched simply from the act of booting to the bootloader.

The numbers we get are only approximate for any code run on more than one CPU as GCC just generates code that does a load/add/store rather than using an atomic increment.

One interesting observation was that (at least on smaller systems, which are still quite large by many people’s standards), boot time was not really noticeably increased.

For more information on running with gcov, see the in-tree documentation: https://github.com/open-power/skiboot/blob/master/doc/gcov.txt

Preliminary results from POWER8 optimized CRC32 for MySQL

So, Anton got some useful code working that I could patch into a MySQL server for testing purposes – a POWER8 optimized CRC32 implementation.

I went with a pretty stock MySQL 5.6.22 (one patch) with sysbench preparing a single 2GB table (10,000,000 rows). I then hacked up innochecksum so that it would only do the correct CRC32 (rather than trying each checksum type). Using the standard CRC32 algorithm it took around three seconds to verify all of the checksums. With a POWER8 optimized CRC32: 0.4-0.5 seconds. Useful speed-up!

I then ran sysbench read/write with 16 threads with oltp-table-size=10000 (on the larger table) to see if there would be an improvement in a “real world” workload. I got about 30% better performance on read/write operations!

Using perf to see where CPU was going, CPU time spent doing CRC32 calculations went down from ~2.5% to ~0.25%!

In theory, we should be able to get about 52GiB/sec of CRC32 out of a 4.1Ghz POWER8 core. I don’t think we’ll be hitting this in MySQL any time soon.

Give us another week or two and we’ll likely have a patch that’s ready to merge.

Initial benchmarks look promising though!

Building OpenPower firmware for use in POWER8 Simulator

Previously, I blogged on how to Run skiboot (OPAL) on the POWER8 Simulator. If you want to build the full Open Power firmware environment, including the Petitboot bootloader and kernel, you can now do so!

My pull request for an op-build target for the simulator has been merged, so you can now do the following three steps to compile a kernel+initramfs to use with your built skiboot for development purposes:

git clone --recursive git@github.com:open-power/op-build.git
cd op-build
. op-build-env
op-build mambo_defconfig && op-build

Then you wait for a whole bunch of time while everything compiles! Afterwards, you should be left with a zImage.epapr in output/images/ that you can copy into your skiboot directory.

With zImage.epapr in your skiboot directory, when you run “make check”, the skiboot test suite will actually launch the simulator to verify that your skiboot code boots all the way to the petitboot prompt!

We now have two boot tests as part of “make check” for skiboot!

More OpenPower Firmware code released: OCC

Inside the IBM POWER8 chip there’s another processor! That’s right folks, you get another CPU for no extra cost (It’s a lot funnier if you say these previous two sentences as if you were presenting an informercial for a special TV offer).

It is, however, not what you’d consider a general purpose processor. It is, in fact, a PowerPC 405 – so your POWER8 processor also has another PowerPC chip in it. What’s the purpose of this chip? It’s named the On Chip Controller and it has the job of helping make the main processor (the POWER8) work.

It has two jobs:

  • Monitor temperature and keep the system thermally safe
  • Monitor power usage and keep the system power safe

It runs a hard Real Time OS which has just been released up on github.com/open-power/occ

There’s more complete documentation on OCC here.

It’s fairly exciting to see more of the software that runs on every POWER8 system make it out into the world.


I just posted this to the mailing list, but I’ve tagged skiboot-4.1, so we have another release! There’s a good amount of changes since 4.0 nearly a month ago and this is the second release since we hit github back in July.

For the full set of changes, “git log” is your friend, but a summary of them follows:

  • We now build with -fstack-protector and -Werror
  • Stack checking extensions when built with STACK_CHECK=1
  • Reduced stack usage in some areas, -Wstack-usage=1024 now.
    • Some functions could use 2kb stack, now all are <1kb
  • Unsafe libc functions such as sprintf() have been removed
  • Symbolic backtraces
  • expose skiboot symbol map to OS (via device-tree)
  • removed machine check interrupt patching in OPAL
  • occ/hbrt: Call stopOCC() for implementing reset OCC command from FSP
  • occ: Fix the low level ACK message sent to FSP on receiving {RESET/LOAD}_OCC
  • hardening to errors of various FSP code
    • fsp: Avoid NULL dereference in case of invalid class_resp bits
    • abort if device tree parsing fails
    • FSP: Validate fsp_msg in fsp_queue_msg
    • fsp-elog: Add various NULL checks
  • Finessing of when to use error log vs prerror()
  • More i2c work
  • Can now run under Mambo simulator (see external/mambo/skiboot.tcl) (commonly known as “POWER8 Functional Simulator”)
  • Document skiboot versioning scheme
  • opal: Handle more TFAC errors.
  • ipmi: populate FRU data
  • rtc: Add a generic rtc cache
  • ipmi/rtc: use generic cache
  • Error Logging backend for bmc based machines
  • PSI: Drive link down on HIR
  • occ: Fix clearing of OCC interrupt on remote fix

So, who worked on this release? We had 84 csets from 17 developers. A total of 3271 lines were added, 1314 removed (delta 1957).

Developers with the most changesets
Stewart Smith 24 28.6%
Benjamin Herrenschmidt 17 20.2%
Alistair Popple 8 9.5%
Vasant Hegde 6 7.1%
Ananth N Mavinakayanahalli 5 6.0%
Neelesh Gupta 4 4.8%
Mahesh Salgaonkar 4 4.8%
Cédric Le Goater 3 3.6%
Wei Yang 3 3.6%
Anshuman Khandual 2 2.4%
Shilpasri G Bhat 2 2.4%
Ryan Grimm 1 1.2%
Anton Blanchard 1 1.2%
Shreyas B. Prabhu 1 1.2%
Joel Stanley 1 1.2%
Vaidyanathan Srinivasan 1 1.2%
Dan Streetman 1 1.2%
Developers with the most changed lines
Benjamin Herrenschmidt 1290 35.1%
Alistair Popple 963 26.2%
Stewart Smith 344 9.4%
Mahesh Salgaonkar 308 8.4%
Ananth N Mavinakayanahalli 198 5.4%
Neelesh Gupta 186 5.1%
Vasant Hegde 122 3.3%
Shilpasri G Bhat 39 1.1%
Vaidyanathan Srinivasan 24 0.7%
Joel Stanley 21 0.6%
Wei Yang 20 0.5%
Anshuman Khandual 15 0.4%
Cédric Le Goater 12 0.3%
Shreyas B. Prabhu 9 0.2%
Ryan Grimm 3 0.1%
Anton Blanchard 2 0.1%
Dan Streetman 2 0.1%
Developers with the most lines removed
Mahesh Salgaonkar 287 21.8%
Developers with the most signoffs (total 54)
Stewart Smith 44 81.5%
Vasant Hegde 4 7.4%
Benjamin Herrenschmidt 4 7.4%
Vaidyanathan Srinivasan 2 3.7%
Developers with the most reviews (total 2)
Vasant Hegde 2 100.0%

Running skiboot (OPAL) on the POWER8 Simulator

skiboot is open source boot and runtime firmware for OpenPOWER. On real POWER8 hardware, you will also need HostBoot to do this (basically, to make the chip work) but in a functional simulator (such as this one released by IBM) you don’t need a bunch of hardware procedures to make hardware work, so we can make do with just skiboot.

The POWER8 Functional Simulator is free to use but not open source and is only supported on limited platforms. But you can always run it all in a VM! I have it running this way on my laptop right now.

To go from a bare Ubuntu 14.10 VM on x86_64 to running skiboot in the simulator, I did the following:

  • apt-get install vim git emacs wget xterm # xterm is needed by the simulator. wget and editors are useful things.
  • (download systemsim-p8…deb from above URL)
  • dpkg -i systemsim-p8*deb # now the simulator is installed
  • git clone https://github.com/open-power/skiboot.git # get skiboot source
  • wget https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.8.0/x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz # get a compiler to build it with
  • apt-get install make gcc valgrind # get build tools (skiboot unittests run on the host, so get a gcc and valgrind)
  • tar xfJ x86_64-gcc-4.8.0-nolibc_powerpc64-linux.tar.xz
  • mkdir -p /opt/cross
  • mv gcc-4.8.0-nolibc /opt/cross/ # now you have a powerpc64 cross compiler
  • export PATH=/opt/cross/gcc-4.8.0-nolibc/powerpc64-linux/bin/:$PATH # add cross compiler to path
  • cd skiboot
  • make # this should build a bunch of things, leaving you with skiboot.lid (and other things). If you have many CPUs, feel free to make -j128.
  • make check # run the unit tests. Everything should pass.
  • cd external/mambo
  • /opt/ibm/systemsim-p8/run/pegasus/power8 -f skiboot.tcl # run the simulator

The last step there will barf as you unlikely have a /tmp/zImage.epapr sitting around that’s suitable. If you use op-build to build a full set of OpenPower foo, you’ll likely be able to extract it from there. Basically, the skiboot.tcl script is adding a payload for skiboot to execute. On real hardware, this ends up being a Linux kernel with a small userspace and petitboot (link is to IBM documentation for IBM POWER8 systems). For the simulator, you could boot any tiny zImage.epapr you like, it should detect OPALv3 and boot!

Even if you cannot be bothered building a kernel or petitboot environment, if you comment out the associated lines in skiboot.tcl, you should be able to run the simulator and see the skiboot console message come up that says we couldn’t load a kernel.

At this point, congratulations, you can now become an OpenPower firmware hacker without even possessing any POWER8 hardware!

C bitfields considered harmful

In C (and C++) you can specify that a variable should take a specific number of bits of storage by doing “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct -e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; }” the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. In this case, sizeof(struct foo) isn’t defined and may vary based on compiler, platform, compiler version, phases of the moon or if you’ve washed your hands recently.

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831 which relates to InnoDB performance on POWER8. InnoDB uses C bitfields which are themselves bitfields (urgh) for things like “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

volatile considered harmful

While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.

Memory barriers are hard.

Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.

Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.

On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.

If you want up to date reads (and not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load with reserve works).

What the volatile keyword does is generate load and store instructions. It is useful when talking to hardware, as the load and store instructions are actually doing something there that the compiler doesn’t know about and thus shouldn’t optimize away.

The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.

Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.

Further reading: