Fast Reset, Trusted Boot and the security of /sbin/reboot

In OpenPOWER land, we’ve been doing some work on Secure and Trusted Boot while at the same time doing some work on what we call fast-reset (or fast-reboot, depending on exactly what mood someone was in at any particular time…. we should start being a bit more consistent).

The basic idea for fast-reset is that when the OS calls OPAL reboot, we gather all the threads in the system using a combination of patching the reset vector and soft-resetting them, then cleanup a few bits of hardware (we do re-probe PCIe for example), and reload & restart the bootloader (petitboot).

What this means is that typing “reboot” on the command line goes from a ~90-120+ second affair (through firmware to petitboot, linux distros still take ages to shut themselves down) down to about a 20 second affair (to petitboot).

If you’re running a (very) recent skiboot, you can enable it with a special hidden NVRAM configuration option (although we’ll likely enable it by default pretty soon, it’s proving remarkably solid). If you want to know what that NVRAM option is… Use the source, Luke! (or git history, but I’ve yet to see a neat Star Wars reference referring to git commit logs).

So, there’s nothing like a demo. Here’s a demo with Ubuntu running off an NVMe drive on an IBM S822LC for HPC (otherwise known as Minsky or Garrison) which was running the HTX hardware exerciser, through fast-reboot back into Petitboot and then booting into Ubuntu and auto-starting the exerciser (HTX) again.

Apart from being stupidly quick when compared to a full IPL (Initial Program Load – i.e. boot), since we’re not rebooting out of band, we have no way to reset the TPM, so if you’re measuring boot, each subsequent fast-reset will result in a different set of measurements.

This may be slightly confusing, but it’s not really a problem. You see, if a machine is compromised, there’s nothing stopping me replacing /sbin/reboot with something that just prints things to the console that look like your machine rebooted but in fact left my rootkit running. Indeed, fast-reset and a full IPL should measure different values in the TPM.

It also means that if you ever want to re-establish trust in your OS, never do a reboot from the host – always reboot out of band (e.g. from a BMC). This, of course, means you’re trusting your BMC to not be compromised, which I wouldn’t necessarily do if you suspect your host has been.

Workaround for opal-prd using 100% CPU

opal-prd is the Processor RunTime Diagnostics daemon, the userspace process that on OpenPower systems is responsible for some of the runtime diagnostics. Although a userspace process, it memory maps (as in mmap) in some code loaded by early firmware (Hostboot) called the HostBoot RunTime (HBRT) and runs it, using calls to the kernel to accomplish any needed operations (e.g. reading/writing registers inside the chip). Running this in user space gives us benefits such as being able to attach gdb, recover from segfaults etc.

The reason this code is shipped as part of firmware rather than as an OS package is that it is very system specific, and it would be a giant pain to update a package in every Linux distribution every time a new chip or machine was introduced.

Anyway, there’s a bug in the HBRT code that means if there’s an ECC error in the HBEL (HostBoot Error Log) partition in the system flash (“bios” or “pnor”… the flash where your system firmware lives), the opal-prd process may get stuck chewing up 100% CPU and not doing anything useful. There’s https://github.com/open-power/hostboot/issues/67 for this.

You will notice a problem if the opal-prd process is using 100% CPU and the last log messages are something like:

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(the parameters to readFromDevice may differ)

Luckily, there’s a simple workaround to fix it all up! You will need the pflash utility. Primarily, pflash is meant only for developers and those who know what they’re doing. You can turn your computer into a brick using it.

pflash is packaged in Ubuntu 16.10 and RHEL 7.3, but you can otherwise build it from source easily enough:

git clone https://github.com/open-power/skiboot.git
cd skiboot/external/pflash
make

Now that you have pflash, you just need to erase the HBEL partition and write (ECC) zeros:

dd if=/dev/zero of=/tmp/hbel bs=1 count=147456
pflash -P HBEL -e
pflash -P HBEL -p /tmp/hbel

Note: you cannot just erase the partition or use the pflash option to do an ECC erase, you may render your system unbootable if you get it wrong.

After that, restart opal-prd however your distro handles restarting daemons (e.g. systemctl restart opal-prd.service) and all should be well.

Compiling your own firmware for Barreleye (OpenCompute OpenPOWER system)

Aaron Sullivan announced on the Rackspace Blog that you can now get your own Barreleye system! What’s great is that the code for the Barreleye platform is upstream in the op-build project, which means you can build your own firmware for them (just like garrison, the “IBM S822LC for HPC” system I blogged about a few days ago).

Remarkably, to build an image for the host firmware, it’s eerily similar to any other platform:

git clone --recursive https://github.com/open-power/op-build.git
cd op-build
. op-build-env
op-build barreleye_defconfig
op-build

…and then you wait. You can cross compile on x86.

You’ve been able to build firmware for these machines with upstream code since Feb/March (I wouldn’t recommend running with builds from then though, try the latest release instead).

Hopefully, someone involved in OpenBMC can write on how to build the BMC firmware.

Compiling your own firmware for the S822LC for HPC

IBM (my employer) recently announced  the new S822LC for HPC POWER8+NVLINK NVIDIA P100 GPUs server (press release, IBM Systems Blog, The Register). The “For HPC” suffix on the model number is significant, as the S822LC is a different machine. What makes the “for HPC” variant different is that the POWER8 CPU has (in addition to PCIe), logic for NVLink to connect the CPU to NVIDIA GPUs.

There’s also the NVIDIA Tesla P100 GPUs which are NVIDIA’s latest in an SXM2 form factor, but instead of delving into GPUs, I’m going to tell you how to compile the firmware for this machine.

You see, this is an OpenPOWER machine. It’s an OpenPOWER machine where the vendor (in this case IBM) has worked to get all the needed code upstream, so you can see exactly what goes into a firmware build.

To build the latest host firmware (you can cross compile on x86 as we use buildroot to build a cross compiler):

git clone --recursive https://github.com/open-power/op-build.git
cd op-build
. op-build-env
op-build garrison_defconfig
op-build

That’s it! Give it a while and you’ll end up with output/images/garrison.pnor – which is a firmware image to flash onto PNOR. The machine name is garrison as that’s the code name for the “S822LC for HPC” (you may see Minsky in the press, but that’s a rather new code name, Garrison has been around for a lot longer as a name).

Using Smatch static analysis on OpenPOWER OPAL firmware

For Skiboot, I’m always looking at new automated systems to find bugs in the code. A little while ago, I read about the Smatch tool developed by some folks at Oracle (they also wrote about using it on the Linux kernel).

I was eager to try it with skiboot to see if it could find anything.

Luckily, it was pretty easy. I built Smatch according to their documentation and then built skiboot:

make CHECK="/home/stewart/smatch/smatch" C=1 -j20 all check

Due to some differences in how we implement abort() and assert() in skiboot, I added “_abort”, “abort” and “assert_fail” to smatch_data/no_return_funcs in the Smatch source tree to silence some false positives.

It seems that there’s a few useful warnings there (some of which I’ve fixed in skiboot master already), along with some false positives around the preprocessor/complier tricks we do to ensure at compile time that an OPAL call definition has the correct number of arguments specified.

So far, so good though. Try it on your project!

Building OPAL firmware for POWER9

Recently, we merged into the op-build project (the build scripts for OpenPOWER Firmware) a defconfig for building OPAL for (certain) POWER9 simulators. I won’t bother linking over to articles on the POWER9 chip or schedule (there’s search engines for that), but with this commit – if you happen to be able to get your hands on a POWER9 simulator, you can now boot to the petitboot bootloader on it!

We’re using upstream Linux 4.7.0-rc3 and upstream skiboot (master), so all of this code is already upstream!

Now, by no means is this complete. There’s some fairly fundamental things that are missing (e.g. PCI) – but how many other platforms can you build open source firmware for before you can even get your hands on a simulator?

Fuzzing Firmware – afl-fuzz + skiboot

In what is likely to be a series on how firmware makes some normal tools harder to use, first I’m going to look at american fuzzy lop – a tool for fuzz testing that if you’re not using then you most certainly have bugs it’ll find for you.

I first got interested in afl-fuzz during Erik de Castro Lopo’s excellent linux.conf.au 2016 in Geelong earlier this year: “Fuzz all the things!“. In a previous life, the Random Query Generator managed to find a heck of a lot of bugs in MySQL (and Drizzle). For randgen info, see Philip Stoev’s talk on it from way back in 2009, a recent (2014) blog post on how Tokutek uses it and some notes on how it was being used at Oracle from 2013. Basically, the randgen was a specialized fuzzer that (given a grammar) would randomly generate SQL queries, and then (if the server didn’t crash), compare the result to some other database server (e.g. your previous version).

The afl-fuzz fuzzer takes a different approach – it’s a much more generic fuzzer rather than a targeted tool. Also, while tools such as the random query generator are extremely powerful and find specialized bugs, they’re hard to get started with. A huge benefit of afl-fuzz is that it’s really, really simple to get started with.

Basically, if you have a binary that takes input on stdin or as a (relatively small) file, afl-fuzz will just work and find bugs for you – read the Quick Start Guide and you’ll be finding bugs in no time!

For firmware of course, we’re a little different than a simple command line program as, well, we aren’t one! Luckily though, we have unit tests. These are just standard binaries that include a bunch of firmware code and get run in user space as part of “make check”. Also, just like unit tests for any project, people do send me patches that break tests (which I reject).

Some of these tests act on data we get from a place – maybe reading other parts of firmware off PNOR or interacting with data structures we get from other bits of firmware. For testing this code, it can be relatively easy to (for the test), read these off disk.

For skiboot, there’s a data structure we get from the service processor on FSP machines called HDAT. Basically, it’s just like the device tree, but different. Because yet another binary format is always a good idea (yes, that is laced with a heavy dose of sarcasm). One of the steps in early boot is to parse the HDAT data structure and convert it to a device tree. Luckily, we structured our code so that creating a unit test that can run in userspace was relatively easy, we just needed to dump this data structure out from a running machine. You can see the test case here. Basically, hdat_to_dt is a binary that reads the HDAT structure out of a pair of files and prints out a device tree. One of the regression tests we have is that we always produce the same output from the same input.

So… throwing that into AFL yielded a couple of pretty simple bugs, especially around aborting out on invalid data (it’s better to exit the process with failure rather than hit an assert). Nothing too interesting here on my simple input file, but it does mean that our parsing code exits “gracefully” on invalid data.

Another utility we have is actually a userspace utility for accessing the gard records in the flash. A GARD record is a record of a piece of hardware that has been deconfigured due to a fault (or a suspected fault). Usually this utility operates on PNOR flash through /dev/mtd – but really what it’s doing is talking to the libflash library, that we also use inside skiboot (and on OpenBMC) to read/write from flash directly, via /dev/mtd or just from a file. The good news? I haven’t been able to crash this utility yet!

So I modified the pflash utility to read from a file to attempt to fuzz the partition reading code we have for the partitioning format that’s on PNOR. So far, no crashes – although to even get it going I did have to fix a bug in the file handling code in pflash, so that’s already a win!

But crashing bugs aren’t the only type of bugs – afl-fuzz has exposed several cases where we act on uninitialized data. How? Well, we run some test cases under valgrind! This is the joy of user space unit tests for firmware – valgrind becomes a tool that you can run! Unfortunately, these bugs have been sitting in my “todo” pile (which is, of course, incredibly long).

Where to next? Fuzzing the firmware calls themselves would be nice – although that’s going to require a targeted tool that knows about what to pass each of the calls. Another round of afl-fuzz running would also be good, I’ve fixed a bunch of the simple things and having a better set of starting input files would be great (and likely expose more bugs).

First POWER9 bits merged into skiboot master

I just merged in some base POWER9 support patches into skiboot. While this is in no way near complete or really enough to be interesting to anyone that isn’t heavily involved in POWER9 development, it’s nice to take upstream first and open source first so seriously that this level of base enablement patches is easy to merge in.

Other work that has gone on for POWER9 in open source projects include way back in November 2015 where work for the updated PowerISA 3.0 was merged into binutils and this year there’s been kernel work too.

Video of my Percona Live Talk: Why would I run MySQL/MariaDB on POWER anyway?

Good news everyone! There’s video up for the talk I gave at Percona Live in April 2016 up: Why would I run MySQL/MariaDB on POWER anyway?

The talk is a general overview of POWER and why MySQL/MariaDB may be a good fit.

OpenPOWER, OpenCompute and fostering a firmware development community

Recently, I was at the OpenPOWER Summit in San Jose where people could see the Barreleye server (specs and design here, initial Rackspace blog post here). Barreleye is an OpenCompute form factor POWER8 server. It’s not only an OpenPOWER machine, which means all of the host firmware is free and open source software, but there’s also OpenBMC meaning that the source to the OS and userspace running on the BMC (service processor) is also open!

In addition, the firmware enablement came from Foxconn (see this skiboot commit), which means we’re being successful in enabling people who aren’t part of IBM to join the development community for OpenPOWER firmware and get the changes needed to support their machines accepted upstream.

Granted, the size of a firmware development community is always likely to be relatively small, but I really like how Foxconn has shown leadership to other ODMs on interacting with and becoming part of the OpenPOWER firmware community.

“Toy” database on mainframes

While much less common than 10 or 15 (err… even 20) years ago, you still sometimes hear MySQL being called a “toy” database. The good news is, when somebody says that, they’re admitting ignorance and you can help educate them!

Recently, a fellow IBMer submitted a pull request (and bug) to start having MySQL support on Z Series (s390x).

Generally speaking, when there’s effort being spent on getting something to run on Z, it is in no way considered a toy by those who’ll end up using it.

POWER8 Accelerated CRC32 merged in MariaDB 10.1

Earlier on in benchmarking MySQL and MariaDB on POWER8, we noticed that on write workloads (or read workloads involving a lot of IO) we were spending a bunch of time computing InnoDB page checksums. This is a relatively well known MySQL problem and has existed for many years and Percona even added innodb_fast_checksum to Percona Server to help alleviate the problem.

In MySQL 5.6, we got the ability to use CRC32 checksums, which are great in that they’re a lot faster to compute than tho old InnoDB “new” checksum. There’s code inside InnoDB to use the x86 SSE2 crc32q instruction to accelerate performing the checksum on compatible x86 CPUs (although oddly enough, the CRC32 checksum in the binlog does not use this acceleration).

However, on POWER, we’d end up using the software implementation of CRC32, which used a lot more CPU than we’d like. Luckily, CRC32 is really common code and for POWER8, we got some handy instructions to help computing it. Unfortunately, this required brushing up on vector polynomial math in order to understand how to do it all quickly. The end result was Anton coming up with crc32-vpmsum code that we could drop into projects that embed a copy of crc32 that was about 41 times faster than the best non-vpmsum implementation.

Recently, Daniel Black took the patch that had passed through both Daniel Axten‘s and my hands and worked on upstreaming it into MariaDB and MySQL. We did some pretty solid benchmarking on the improvement you’d get, and we pretty much cannot notice the difference between innodb_checksum=off and having it use the POWER8 accelerated CRC32 checksum, which frees up maybe 30% of CPU time to be used for things like query execution! My original benchmark showed a 30% improvement in sysbench read/write workloads.

The excellent news? Two days ago, MariaDB merged POWER8 accelerated crc32! This means that IO heavy workloads on MariaDB on POWER8 will get much faster in the next release.

MySQL bug 74776 is open, with patch attached, so hopefully MySQL will merge this soon too (hint hint).

TianoCore (UEFI) ported to OpenPower

Recently, there’s been (actually two) ports of TianoCore (the reference implementation of UEFI firmware) to run on POWER on top of OPAL (provided by skiboot) – and it can be run in the Qemu PowerNV model.

More details:

1 Million SQL Queries per second: GA MariaDB 10.1 on POWER8

A couple of days ago, MariaDB announced that MariaDB 10.1 is stable GA – around 19 months since the GA of MariaDB 10.0. With MariaDB 10.1 comes some important scalabiity improvements, especially for POWER8 systems. On POWER, we’re a bit unique in that we’re on the higher end of CPUs, have many cores, and up to 8 threads per core (selectable at runtime: 1, 2, 4 or 8/core) – so a dual socket system can easily be a 160 thread machine.

Recently, we (being IBM) announced availability of a couple of new POWER8 machines – machines designed for Linux and cloud environments. They are very much OpenPower machines, and more info is available here: http://www.ibm.com/marketplace/cloud/commercial-computing/us/en-us

Combine these two together, with Axel Schwenke running some benchmarks and you get 1 Million SQL Queries per second with MariaDB 10.1 on POWER8.

Having worked a lot on both MySQL for POWER and the firmware that ships in the S882LC, I’m rather happy that 1 Million queries per second is beyond what it was in June 2014, which was a neat hack on MySQL 5.7 that showed the potential of MySQL on POWER8 but wasn’t yet a product. Now, you can run a GA release of MariaDB on GA POWER8 hardware designed for scale-out cloud environments and get 1 Million SQL queries/second (with fewer cores than my initial benchmark last year!)

What’s even more impressive is that this million queries per second is in a KVM guest!

PAPR spec publicly available to download

PAPR is the Power Architecture Platform Reference document. It’s a short read at only 890 pages and defines the virtualised environment that guests run in on PowerKVM and PowerVM (i.e. what is referred to as ‘pseries’ platform in the Linux kernel).

https://members.openpowerfoundation.org/document/dl/469

As part of the OpenPower Foundation, we’re looking at ensuring this is up to date, documents KVM specific things as well as splitting out the bits that are common to OPAL and PAPR into their own documents.

Running OPAL in qemu – the powernv platform

Ben has a qemu tree up with some work-in-progress patches to qemu to support the PowerNV platform. This is the “bare metal” platform like you’d get on real POWER8 hardware running OPAL, and it allows us to use qemu like my previous post used the POWER8 Functional Simulator – to boot OpenPower firmware.

To build qemu for this, follow these steps:

apt-get -y install gcc python g++ pkg-config libz-dev libglib2.0-dev \
  libpixman-1-dev libfdt-dev git
git clone https://github.com/ozbenh/qemu.git
cd qemu
./configure --target-list=ppc64-softmmu
make -j `grep -c processor /proc/cpuinfo`

This will leave you with a ppc64-softmmu/qemu-system-ppc64 binary. Once you’ve built your OpenPower firmware to run in a simulator, you can boot it!

Note that this qemu branch is under development, and is likely to move/change or even break.

I do it like this:

cd ~/op-build/output/images;  # so skiboot.lid is in pwd
~/qemu/ppc64-softmmu/qemu-system-ppc64 -m 1G -M powernv \
-kernel zImage.epapr -nographic \
-cdrom ~/ubuntu-vivid-ppc64el-mini.iso

and this lets me test that we launch the Ubunut vivid installer correctly.

You can easily add other qemu options such as additional disks or networking and verify that it works correctly. This way, you can do development on some skiboot functionality or a variety of kernel and op-build userspace (such as the petitboot bootloader) without needing either real hardware or using the simulator.

This is useful if, say, you’re running on ppc64el, for which the POWER8 functional simulator is currently not available on.

doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
{
        ulint   i, j;
        UT_LOW_PRIORITY_CPU();
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }
        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }
        UT_RESUME_PRIORITY_CPU();
        return(j);
}

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

The sad state of MySQL and NUMA

Way back in 2010, MySQL Bug 57241 was filed, pointing out that the “swap insanity” problem was getting serious on x86 systems – with NUMA being more and more common back then.

The swapping problem is due to running out of memory on a NUMA node and having to swap things to other nodes (see Jeremy Cole‘s blog entry also from 2010 on the topic of swap insanity). This was back when 64GB and dual quad core CPUs was big – in the past five years big systems have gotten bigger.

Back then there were two things you could do to have your system be usable: 1) numa=off as kernel boot parameter (this likely has other implications though) and 2) “numactl –interleave all” in mysqld_safe script (I think MariaDB currently has this built in if you set an option but I don’t think MySQL does, otherwise perhaps the bug would have been closed).

Anyway, it’s now about 5 years since this bug was opened and even when there’s been a patch in the Twitter MySQL branch for a while (years?) and my Oracle Contributor Agreement signed patch attached to bug 72811 since May 2014 (over a year) we still haven’t seen any action.

My patch takes the approach of you want things allocated at server startup to be interleaved across nodes (e.g. buffer pool) while runtime allocations are probably per connection and are thus fine (in fact, better) to do node local allocations.

Without a patch like this, or without running mysqld with the right numactl incantation, you end up either having all your memory on one NUMA node (potentially not utilising full memory bandwidth of the hardware), or you end up with swap insanity, or you end up with some other not exactly what you’d expect situation.

While we could have MySQL be more NUMA aware and perhaps do a buffer pool instance per NUMA node or some such thing, it’s kind of disappointing that for dedicated database servers bought in the past 7+ years (according to one comment on one of the bugs) this crippling issue hasn’t been addressed upstream.

Just to make it even more annoying, on certain workloads you end up with a lot of mutex contention, which can end up meaning that binding MySQL to fewer NUMA nodes (memory and CPU) ends up increasing performance (cachelines don’t have as far to travel) – this is a different problem than swap insanity though, and one that is being addressed.

Update: My patch as part of https://bugs.mysql.com/bug.php?id=72811 has been merged! MySQL on NUMA machines just got a whole lot better. I just hope it’s enabled by default…

OPAL firmware specification, conformance and documentation

Now that we have an increasing amount of things that run on top of OPAL:

  1. Linux
  2. hello_world (in skiboot tree)
  3. ppc64le_hello (as I wrote about yesterday)
  4. FreeBSD

and that the OpenPower ecosystem is rapidly growing (especially around people building OpenPower machines), the need for more formal specification, conformance testing and documentation for OPAL is increasing rapidly.

If you look at the documentation in the skiboot tree late last year, you’d notice a grand total of seven text files. Now, we’re a lot better (although far from complete).

I’m proud to say that I won’t merge new code that adds/modifies an OPAL API call or anything in the device tree that doesn’t come with accompanying documentation, and this has meant that although it may not be perfect, we have something that is a decent starting point.

We’re in the interesting situation of starting with a working system, with mainline Linux kernels now for over a year (maybe even 18 months) being able to be booted by skiboot and run on powernv hardware (the more modern the kernel the better though).

So…. if anyone loves going through deeply technical documentation… do I have a project you can contribute to!