A (simplified) view of OpenPOWER Firmware Development

I’ve been working on trying to better document the whole flow of code that goes into a build of firmware for an OpenPOWER machine. This is partially to help those not familiar with it get a better grasp of the sheer scale of what goes into that 32/64MB of flash.

I also wanted to convey the components that we heavily re-used from other Open Source projects, what parts are still “IBM internal” (as they relate to the open source workflow) and which bits are primarily contributed to by IBMers (at least at this point in time).

As such, let’s start with the legend of the diagram:

Now, the diagram:

Simplified development flow for OpenPOWER firmware

The end thing that a user with a machine will download and apply (or that comes shipped with a box) is the purple “Installable Firmware Release” nodes (bottom center). In this diagram, there are 4 of them. One for POWER9 systems such as the just-announced AC922 system (this is the “OP910 Release” node, which is the witherspoon_defconfig in the op-build tree); one for the p9dsu platform (p9dsu_defconfig in op-build) and one is for IBM FSP based systems such as the S812L and S822L systems (or S812/S822 in OPAL mode).

There are more platforms out there, but this diagram is meant to be simplified. The key difference with the p9dsu platform is that this is produced by somebody other than IBM.

All of these releases are based off the upstream op-build project, op-build is the light blue box in the center of the diagram. We do regular X.Y releases and sometimes do X.Y.Z releases. It’s primarily a pull request based workflow currently, so everything goes via a pull request. The op-build project brings together all the POWER specific firmware components (pretty much everything in every other light blue/blue box) along with a Linux kernel and buildroot.

The kernel and buildroot are the two big yellow boxes on the top right. Buildroot brings together a lot of open source components that are in our firmware image (including some power specific ones that we get through upstream buildroot).

For Linux, this is a pretty simplified view of the process, but we primarily ship the stable tree (with maybe up to half a dozen patches).

The skiboot and petitboot components both use a mailing list based workflow (similar to kernel) as well as X.Y and X.Y.Z releases (again, similar to the linux kernel).

On the far left of the diagram, we have Hostboot, SBE and OCC. These are three firmware components that come from the traditional IBM POWER Firmware group, and are shared with the IBM non-OpenPOWER POWER systems (“traditional” POWER). These components have part of their code from from an (internal) repository called “ekb” which also goes into a (very) low level debug tool and the FSP based systems. There’s also an (internal) gerrit instance that’s the primary place where code review/development discussions are for these components.

In future posts, I’ll probably delve into more specifics of the current development process, and how we may try and change things for the better.

How I do email (at work)

Recently, I blogged on my home email setup and in that post, I hinted that my work setup was rather different. I have entirely separate computing devices for work and personal, a setup I strongly recommend. This also lets me “go home” from work even when working from home, I use a different physical machine!

Since I work for IBM I have (at least) two email accounts for work: a Lotus Notes one and a internet standards compliant one. It’s “easy” enough to get the Notes one to forward to the standards compliant one, from which I can just use fetchmail or similar to pull down mail.

I run mail through a rather simple procmail script: it de-mangles some URL mangling that can happen in the current IBM email infrastructure, runs things through SpamAssassin and deliver to a date based Maildir (or one giant pile for spam).

My ~/.procmailrc looks something like this:


DATE=`date +"%Y%m"`

| magic_script_to_unmangle_things

| spamc

* ^X-Spam-Status: Yes

I use tail -f mail_log as a really dumb kind of biff replacement.

Now, what do I read and write mail with? Notmuch! It is the only thing that even comes close to being able to deal with a decent flow of mail. I have a couple of saved searches just to track how much mail I pull in a day/week. Today (on Monday), it says 442 today and 10,403 over the past week.

For the most part, my workflow is kind of INBOX-ZERO like, except that I currently view victory as INBOX 2000. Most mail does go into my INBOX, the notable exceptions are two main mailing lists I’m subscribed to mostly as FYI and to search/find things when needed. Those are the Linux Kernel Mailing List (LKML) and the buildroot mailing list. Why notmuch rather than just searching the web for mailing list archives? Notmuch can return the result of a query in less time it takes light to get to and from the United States in ideal conditions.

For work, I don’t sync my mail anywhere. It’s just on my laptop. Not having it on my phone is a feature. I have a notmuch post-new hook that does some initial tagging of mail, and as such I have this in my ~/.notmuch-config:


My post-new hook looks like this:


# immediately archive all messages from "me"
notmuch tag -new -- tag:new and from:stewart@linux.vnet.ibm.com

# tag all message from lists
notmuch tag +devicetree +list -- tag:new and to:devicetree@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:devicetree and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread +list -new -- tag:new and tag:devicetree and not to:stewart@linux.vnet.ibm.com

notmuch tag +linuxppc +list -- tag:new and to:linuxppc-dev@lists.ozlabs.org
notmuch tag +linuxppc +list -- tag:new and cc:linuxppc-dev@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:linuxppc
notmuch tag +openbmc +list -- tag:new and to:openbmc@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:openbmc

notmuch tag +lkml +list -- tag:new and to:linux-kernel@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:lkml and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:lkml and not tag:linuxppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemuppc +list -- tag:new and to:qemu-ppc@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemuppc and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemuppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemu +list -- tag:new and to:qemu-devel@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemu and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemu and not to:stewart@linux.vnet.ibm.com

notmuch tag +buildroot +list -- tag:new and to:buildroot@buildroot.org
notmuch tag +buildroot +list -- tag:new and to:buildroot@busybox.net
notmuch tag +buildroot +list -- tag:newa nd to:buildroot@uclibc.org
notmuch tag +inbox +unread -new -- tag:new and tag:buildroot and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:buildroot and not to:stewart@linux.vnet.ibm.com

notmuch tag +ibmbugzilla -- tag:new and from:bugzilla@us.ibm.com

# finally, retag all "new" messages "inbox" and "unread"
notmuch tag +inbox +unread -new -- tag:new

This leaves me with both an inbox and a listinbox. I do not look at the overwhelming majority of mail that hits the listinbox – It’s mostly for following up on individual things. If I started to need to care more about specific topics, I’d probably add something in there for them so I could easily find them.

My notmuch emacs setup has a bunch of saved searches, so my notmuch-hello screen looks something like this:

This gets me a bit of a state-of-the-world-of-email-to-look-at view for the day. I’ll often have meetings first thing in the morning that may reference email I haven’t looked at yet, and this generally lets me quickly find mail related to the problems of the day and start being productive.

ZMODEM saves the day! Or, why my firmware for a machine with a CPU from 2017 contains a serial file transfer protocol from the 1980s

Recently, I added the package lrzsz to op-build in this commit. This package provides the rz and sz commands – for receive zmodem and send zmodem respectively. For those who don’t know, op-build builds a firmware image for OpenPOWER machines, and adding this package adds the commands to the petitboot shell (the busybox environment you get when you “exit to shell” from the boot menu).

For those who aren’t familiar with ZMODEM, you should first get off my lawn, and secondly, know that it’s a method for sending/receiving files over something akin to a serial port, e.g. a modem. The basic protocol is “I want to send you this file named FOO”, “okay, I would like to receive it”, “here’s some data and a checksum, did you get it and does it match the checksum?”, “yes!”, “okay, great, here’s the next bit” until the file is transferred. Importantly, it has a provision for “no, I didn’t get that right” and for bits to be resent.

The one thing that pretty much always somewhat works on a computer is a serial port (or something that looks like a serial port to software). When you connect to the IPMI console (“ipmitool sol activate”), the host sees this as a serial port that it pumps bytes over. With OpenBMC, you can actually connect to this serial port via SSH.

When diagnosing weird problems during firmware bringup such as “why doesn’t PCI work” or “why does my network adapter not work” (or, perhaps, somebody helpfully didn’t plug the network cable in), it can be useful to copy off a bunch of debug information from the machine.

You may say “can’t you just print the log file to the screen and save it?” and you’d be right, you can do that for text – it’s really annoying for binary data though. Plus, there have been bugs in the console implementations on pretty much every BMC I’ve ever used that makes them not as reliable as you’d like.

So, how could we transfer a file over the serial connection we have to the machine? The same way we did on a BBS! Enter ZMODEM. The error recovery properties are perfect in this situation.

So, how do you use it? I’ve found two ways that work well: GNU screen and zssh. For GNU Screen, you’ll want to configure it to catch zmodem traffic by doing “control-a:zmodem catch<ENTER>” (you need the colon there). After that if you execute “sz” on the host and the rest should be obvious! If you wanted to send a file to the host, run “rz” rather than “sz”.

Fast Reset, Trusted Boot and the security of /sbin/reboot

In OpenPOWER land, we’ve been doing some work on Secure and Trusted Boot while at the same time doing some work on what we call fast-reset (or fast-reboot, depending on exactly what mood someone was in at any particular time…. we should start being a bit more consistent).

The basic idea for fast-reset is that when the OS calls OPAL reboot, we gather all the threads in the system using a combination of patching the reset vector and soft-resetting them, then cleanup a few bits of hardware (we do re-probe PCIe for example), and reload & restart the bootloader (petitboot).

What this means is that typing “reboot” on the command line goes from a ~90-120+ second affair (through firmware to petitboot, linux distros still take ages to shut themselves down) down to about a 20 second affair (to petitboot).

If you’re running a (very) recent skiboot, you can enable it with a special hidden NVRAM configuration option (although we’ll likely enable it by default pretty soon, it’s proving remarkably solid). If you want to know what that NVRAM option is… Use the source, Luke! (or git history, but I’ve yet to see a neat Star Wars reference referring to git commit logs).

So, there’s nothing like a demo. Here’s a demo with Ubuntu running off an NVMe drive on an IBM S822LC for HPC (otherwise known as Minsky or Garrison) which was running the HTX hardware exerciser, through fast-reboot back into Petitboot and then booting into Ubuntu and auto-starting the exerciser (HTX) again.

Apart from being stupidly quick when compared to a full IPL (Initial Program Load – i.e. boot), since we’re not rebooting out of band, we have no way to reset the TPM, so if you’re measuring boot, each subsequent fast-reset will result in a different set of measurements.

This may be slightly confusing, but it’s not really a problem. You see, if a machine is compromised, there’s nothing stopping me replacing /sbin/reboot with something that just prints things to the console that look like your machine rebooted but in fact left my rootkit running. Indeed, fast-reset and a full IPL should measure different values in the TPM.

It also means that if you ever want to re-establish trust in your OS, never do a reboot from the host – always reboot out of band (e.g. from a BMC). This, of course, means you’re trusting your BMC to not be compromised, which I wouldn’t necessarily do if you suspect your host has been.

Workaround for opal-prd using 100% CPU

opal-prd is the Processor RunTime Diagnostics daemon, the userspace process that on OpenPower systems is responsible for some of the runtime diagnostics. Although a userspace process, it memory maps (as in mmap) in some code loaded by early firmware (Hostboot) called the HostBoot RunTime (HBRT) and runs it, using calls to the kernel to accomplish any needed operations (e.g. reading/writing registers inside the chip). Running this in user space gives us benefits such as being able to attach gdb, recover from segfaults etc.

The reason this code is shipped as part of firmware rather than as an OS package is that it is very system specific, and it would be a giant pain to update a package in every Linux distribution every time a new chip or machine was introduced.

Anyway, there’s a bug in the HBRT code that means if there’s an ECC error in the HBEL (HostBoot Error Log) partition in the system flash (“bios” or “pnor”… the flash where your system firmware lives), the opal-prd process may get stuck chewing up 100% CPU and not doing anything useful. There’s https://github.com/open-power/hostboot/issues/67 for this.

You will notice a problem if the opal-prd process is using 100% CPU and the last log messages are something like:

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(the parameters to readFromDevice may differ)

Luckily, there’s a simple workaround to fix it all up! You will need the pflash utility. Primarily, pflash is meant only for developers and those who know what they’re doing. You can turn your computer into a brick using it.

pflash is packaged in Ubuntu 16.10 and RHEL 7.3, but you can otherwise build it from source easily enough:

git clone https://github.com/open-power/skiboot.git
cd skiboot/external/pflash

Now that you have pflash, you just need to erase the HBEL partition and write (ECC) zeros:

dd if=/dev/zero of=/tmp/hbel bs=1 count=147456
pflash -P HBEL -e
pflash -P HBEL -p /tmp/hbel

Note: you cannot just erase the partition or use the pflash option to do an ECC erase, you may render your system unbootable if you get it wrong.

After that, restart opal-prd however your distro handles restarting daemons (e.g. systemctl restart opal-prd.service) and all should be well.

Compiling your own firmware for Barreleye (OpenCompute OpenPOWER system)

Aaron Sullivan announced on the Rackspace Blog that you can now get your own Barreleye system! What’s great is that the code for the Barreleye platform is upstream in the op-build project, which means you can build your own firmware for them (just like garrison, the “IBM S822LC for HPC” system I blogged about a few days ago).

Remarkably, to build an image for the host firmware, it’s eerily similar to any other platform:

git clone --recursive https://github.com/open-power/op-build.git
cd op-build
. op-build-env
op-build barreleye_defconfig

…and then you wait. You can cross compile on x86.

You’ve been able to build firmware for these machines with upstream code since Feb/March (I wouldn’t recommend running with builds from then though, try the latest release instead).

Hopefully, someone involved in OpenBMC can write on how to build the BMC firmware.

Compiling your own firmware for the S822LC for HPC

IBM (my employer) recently announced  the new S822LC for HPC POWER8+NVLINK NVIDIA P100 GPUs server (press release, IBM Systems Blog, The Register). The “For HPC” suffix on the model number is significant, as the S822LC is a different machine. What makes the “for HPC” variant different is that the POWER8 CPU has (in addition to PCIe), logic for NVLink to connect the CPU to NVIDIA GPUs.

There’s also the NVIDIA Tesla P100 GPUs which are NVIDIA’s latest in an SXM2 form factor, but instead of delving into GPUs, I’m going to tell you how to compile the firmware for this machine.

You see, this is an OpenPOWER machine. It’s an OpenPOWER machine where the vendor (in this case IBM) has worked to get all the needed code upstream, so you can see exactly what goes into a firmware build.

To build the latest host firmware (you can cross compile on x86 as we use buildroot to build a cross compiler):

git clone --recursive https://github.com/open-power/op-build.git
cd op-build
. op-build-env
op-build garrison_defconfig

That’s it! Give it a while and you’ll end up with output/images/garrison.pnor – which is a firmware image to flash onto PNOR. The machine name is garrison as that’s the code name for the “S822LC for HPC” (you may see Minsky in the press, but that’s a rather new code name, Garrison has been around for a lot longer as a name).

Using Smatch static analysis on OpenPOWER OPAL firmware

For Skiboot, I’m always looking at new automated systems to find bugs in the code. A little while ago, I read about the Smatch tool developed by some folks at Oracle (they also wrote about using it on the Linux kernel).

I was eager to try it with skiboot to see if it could find anything.

Luckily, it was pretty easy. I built Smatch according to their documentation and then built skiboot:

make CHECK="/home/stewart/smatch/smatch" C=1 -j20 all check

Due to some differences in how we implement abort() and assert() in skiboot, I added “_abort”, “abort” and “assert_fail” to smatch_data/no_return_funcs in the Smatch source tree to silence some false positives.

It seems that there’s a few useful warnings there (some of which I’ve fixed in skiboot master already), along with some false positives around the preprocessor/complier tricks we do to ensure at compile time that an OPAL call definition has the correct number of arguments specified.

So far, so good though. Try it on your project!

Building OPAL firmware for POWER9

Recently, we merged into the op-build project (the build scripts for OpenPOWER Firmware) a defconfig for building OPAL for (certain) POWER9 simulators. I won’t bother linking over to articles on the POWER9 chip or schedule (there’s search engines for that), but with this commit – if you happen to be able to get your hands on a POWER9 simulator, you can now boot to the petitboot bootloader on it!

We’re using upstream Linux 4.7.0-rc3 and upstream skiboot (master), so all of this code is already upstream!

Now, by no means is this complete. There’s some fairly fundamental things that are missing (e.g. PCI) – but how many other platforms can you build open source firmware for before you can even get your hands on a simulator?

Fuzzing Firmware – afl-fuzz + skiboot

In what is likely to be a series on how firmware makes some normal tools harder to use, first I’m going to look at american fuzzy lop – a tool for fuzz testing that if you’re not using then you most certainly have bugs it’ll find for you.

I first got interested in afl-fuzz during Erik de Castro Lopo’s excellent linux.conf.au 2016 in Geelong earlier this year: “Fuzz all the things!“. In a previous life, the Random Query Generator managed to find a heck of a lot of bugs in MySQL (and Drizzle). For randgen info, see Philip Stoev’s talk on it from way back in 2009, a recent (2014) blog post on how Tokutek uses it and some notes on how it was being used at Oracle from 2013. Basically, the randgen was a specialized fuzzer that (given a grammar) would randomly generate SQL queries, and then (if the server didn’t crash), compare the result to some other database server (e.g. your previous version).

The afl-fuzz fuzzer takes a different approach – it’s a much more generic fuzzer rather than a targeted tool. Also, while tools such as the random query generator are extremely powerful and find specialized bugs, they’re hard to get started with. A huge benefit of afl-fuzz is that it’s really, really simple to get started with.

Basically, if you have a binary that takes input on stdin or as a (relatively small) file, afl-fuzz will just work and find bugs for you – read the Quick Start Guide and you’ll be finding bugs in no time!

For firmware of course, we’re a little different than a simple command line program as, well, we aren’t one! Luckily though, we have unit tests. These are just standard binaries that include a bunch of firmware code and get run in user space as part of “make check”. Also, just like unit tests for any project, people do send me patches that break tests (which I reject).

Some of these tests act on data we get from a place – maybe reading other parts of firmware off PNOR or interacting with data structures we get from other bits of firmware. For testing this code, it can be relatively easy to (for the test), read these off disk.

For skiboot, there’s a data structure we get from the service processor on FSP machines called HDAT. Basically, it’s just like the device tree, but different. Because yet another binary format is always a good idea (yes, that is laced with a heavy dose of sarcasm). One of the steps in early boot is to parse the HDAT data structure and convert it to a device tree. Luckily, we structured our code so that creating a unit test that can run in userspace was relatively easy, we just needed to dump this data structure out from a running machine. You can see the test case here. Basically, hdat_to_dt is a binary that reads the HDAT structure out of a pair of files and prints out a device tree. One of the regression tests we have is that we always produce the same output from the same input.

So… throwing that into AFL yielded a couple of pretty simple bugs, especially around aborting out on invalid data (it’s better to exit the process with failure rather than hit an assert). Nothing too interesting here on my simple input file, but it does mean that our parsing code exits “gracefully” on invalid data.

Another utility we have is actually a userspace utility for accessing the gard records in the flash. A GARD record is a record of a piece of hardware that has been deconfigured due to a fault (or a suspected fault). Usually this utility operates on PNOR flash through /dev/mtd – but really what it’s doing is talking to the libflash library, that we also use inside skiboot (and on OpenBMC) to read/write from flash directly, via /dev/mtd or just from a file. The good news? I haven’t been able to crash this utility yet!

So I modified the pflash utility to read from a file to attempt to fuzz the partition reading code we have for the partitioning format that’s on PNOR. So far, no crashes – although to even get it going I did have to fix a bug in the file handling code in pflash, so that’s already a win!

But crashing bugs aren’t the only type of bugs – afl-fuzz has exposed several cases where we act on uninitialized data. How? Well, we run some test cases under valgrind! This is the joy of user space unit tests for firmware – valgrind becomes a tool that you can run! Unfortunately, these bugs have been sitting in my “todo” pile (which is, of course, incredibly long).

Where to next? Fuzzing the firmware calls themselves would be nice – although that’s going to require a targeted tool that knows about what to pass each of the calls. Another round of afl-fuzz running would also be good, I’ve fixed a bunch of the simple things and having a better set of starting input files would be great (and likely expose more bugs).

First POWER9 bits merged into skiboot master

I just merged in some base POWER9 support patches into skiboot. While this is in no way near complete or really enough to be interesting to anyone that isn’t heavily involved in POWER9 development, it’s nice to take upstream first and open source first so seriously that this level of base enablement patches is easy to merge in.

Other work that has gone on for POWER9 in open source projects include way back in November 2015 where work for the updated PowerISA 3.0 was merged into binutils and this year there’s been kernel work too.

Video of my Percona Live Talk: Why would I run MySQL/MariaDB on POWER anyway?

Good news everyone! There’s video up for the talk I gave at Percona Live in April 2016 up: Why would I run MySQL/MariaDB on POWER anyway?

The talk is a general overview of POWER and why MySQL/MariaDB may be a good fit.

OpenPOWER, OpenCompute and fostering a firmware development community

Recently, I was at the OpenPOWER Summit in San Jose where people could see the Barreleye server (specs and design here, initial Rackspace blog post here). Barreleye is an OpenCompute form factor POWER8 server. It’s not only an OpenPOWER machine, which means all of the host firmware is free and open source software, but there’s also OpenBMC meaning that the source to the OS and userspace running on the BMC (service processor) is also open!

In addition, the firmware enablement came from Foxconn (see this skiboot commit), which means we’re being successful in enabling people who aren’t part of IBM to join the development community for OpenPOWER firmware and get the changes needed to support their machines accepted upstream.

Granted, the size of a firmware development community is always likely to be relatively small, but I really like how Foxconn has shown leadership to other ODMs on interacting with and becoming part of the OpenPOWER firmware community.

“Toy” database on mainframes

While much less common than 10 or 15 (err… even 20) years ago, you still sometimes hear MySQL being called a “toy” database. The good news is, when somebody says that, they’re admitting ignorance and you can help educate them!

Recently, a fellow IBMer submitted a pull request (and bug) to start having MySQL support on Z Series (s390x).

Generally speaking, when there’s effort being spent on getting something to run on Z, it is in no way considered a toy by those who’ll end up using it.

POWER8 Accelerated CRC32 merged in MariaDB 10.1

Earlier on in benchmarking MySQL and MariaDB on POWER8, we noticed that on write workloads (or read workloads involving a lot of IO) we were spending a bunch of time computing InnoDB page checksums. This is a relatively well known MySQL problem and has existed for many years and Percona even added innodb_fast_checksum to Percona Server to help alleviate the problem.

In MySQL 5.6, we got the ability to use CRC32 checksums, which are great in that they’re a lot faster to compute than tho old InnoDB “new” checksum. There’s code inside InnoDB to use the x86 SSE2 crc32q instruction to accelerate performing the checksum on compatible x86 CPUs (although oddly enough, the CRC32 checksum in the binlog does not use this acceleration).

However, on POWER, we’d end up using the software implementation of CRC32, which used a lot more CPU than we’d like. Luckily, CRC32 is really common code and for POWER8, we got some handy instructions to help computing it. Unfortunately, this required brushing up on vector polynomial math in order to understand how to do it all quickly. The end result was Anton coming up with crc32-vpmsum code that we could drop into projects that embed a copy of crc32 that was about 41 times faster than the best non-vpmsum implementation.

Recently, Daniel Black took the patch that had passed through both Daniel Axten‘s and my hands and worked on upstreaming it into MariaDB and MySQL. We did some pretty solid benchmarking on the improvement you’d get, and we pretty much cannot notice the difference between innodb_checksum=off and having it use the POWER8 accelerated CRC32 checksum, which frees up maybe 30% of CPU time to be used for things like query execution! My original benchmark showed a 30% improvement in sysbench read/write workloads.

The excellent news? Two days ago, MariaDB merged POWER8 accelerated crc32! This means that IO heavy workloads on MariaDB on POWER8 will get much faster in the next release.

MySQL bug 74776 is open, with patch attached, so hopefully MySQL will merge this soon too (hint hint).

TianoCore (UEFI) ported to OpenPower

Recently, there’s been (actually two) ports of TianoCore (the reference implementation of UEFI firmware) to run on POWER on top of OPAL (provided by skiboot) – and it can be run in the Qemu PowerNV model.

More details:

1 Million SQL Queries per second: GA MariaDB 10.1 on POWER8

A couple of days ago, MariaDB announced that MariaDB 10.1 is stable GA – around 19 months since the GA of MariaDB 10.0. With MariaDB 10.1 comes some important scalabiity improvements, especially for POWER8 systems. On POWER, we’re a bit unique in that we’re on the higher end of CPUs, have many cores, and up to 8 threads per core (selectable at runtime: 1, 2, 4 or 8/core) – so a dual socket system can easily be a 160 thread machine.

Recently, we (being IBM) announced availability of a couple of new POWER8 machines – machines designed for Linux and cloud environments. They are very much OpenPower machines, and more info is available here: http://www.ibm.com/marketplace/cloud/commercial-computing/us/en-us

Combine these two together, with Axel Schwenke running some benchmarks and you get 1 Million SQL Queries per second with MariaDB 10.1 on POWER8.

Having worked a lot on both MySQL for POWER and the firmware that ships in the S882LC, I’m rather happy that 1 Million queries per second is beyond what it was in June 2014, which was a neat hack on MySQL 5.7 that showed the potential of MySQL on POWER8 but wasn’t yet a product. Now, you can run a GA release of MariaDB on GA POWER8 hardware designed for scale-out cloud environments and get 1 Million SQL queries/second (with fewer cores than my initial benchmark last year!)

What’s even more impressive is that this million queries per second is in a KVM guest!

PAPR spec publicly available to download

PAPR is the Power Architecture Platform Reference document. It’s a short read at only 890 pages and defines the virtualised environment that guests run in on PowerKVM and PowerVM (i.e. what is referred to as ‘pseries’ platform in the Linux kernel).


As part of the OpenPower Foundation, we’re looking at ensuring this is up to date, documents KVM specific things as well as splitting out the bits that are common to OPAL and PAPR into their own documents.

Running OPAL in qemu – the powernv platform

Ben has a qemu tree up with some work-in-progress patches to qemu to support the PowerNV platform. This is the “bare metal” platform like you’d get on real POWER8 hardware running OPAL, and it allows us to use qemu like my previous post used the POWER8 Functional Simulator – to boot OpenPower firmware.

To build qemu for this, follow these steps:

apt-get -y install gcc python g++ pkg-config libz-dev libglib2.0-dev \
  libpixman-1-dev libfdt-dev git
git clone https://github.com/ozbenh/qemu.git
cd qemu
./configure --target-list=ppc64-softmmu
make -j `grep -c processor /proc/cpuinfo`

This will leave you with a ppc64-softmmu/qemu-system-ppc64 binary. Once you’ve built your OpenPower firmware to run in a simulator, you can boot it!

Note that this qemu branch is under development, and is likely to move/change or even break.

I do it like this:

cd ~/op-build/output/images;  # so skiboot.lid is in pwd
~/qemu/ppc64-softmmu/qemu-system-ppc64 -m 1G -M powernv \
-kernel zImage.epapr -nographic \
-cdrom ~/ubuntu-vivid-ppc64el-mini.iso

and this lets me test that we launch the Ubunut vivid installer correctly.

You can easily add other qemu options such as additional disks or networking and verify that it works correctly. This way, you can do development on some skiboot functionality or a variety of kernel and op-build userspace (such as the petitboot bootloader) without needing either real hardware or using the simulator.

This is useful if, say, you’re running on ppc64el, for which the POWER8 functional simulator is currently not available on.