gcov code coverage for OpenPower firmware

For skiboot (which provides the OPAL boot and runtime firmware for OpenPower machines), I’ve been pretty interested at getting some automated code coverage data for booting on real hardware (as well as in a simulator). Why? Well, it’s useful to see that various test suites are actually testing what you think they are, and it helps you be able to define more tests to increase what you’re covering.

The typical way to do code coverage is to make GCC build your program with GCOV, which is pretty simple if you’re a userspace program. You build with gcov, run program, and at the end you’re left with files on disk that contain all the coverage information for a tool such as lcov to consume. For the Linux kernel, you can also do this, and then extract the GCOV data out of debugfs and get code coverage for all/part of your kernel. It’s a little bit more involved for the kernel, but not too much so.

To achieve this, the kernel has to implement a bunch of stub functions itself rather than link to the gcov library as well as parse the GCOV data structures that GCC generates and emit the gcda files in debugfs when read. Basically, you replace the part of the GCC generated code that writes the files out. This works really nicely as Linux has fancy things like a VFS and debugfs.

For skiboot, we have no such things. We are firmware, we don’t have a damn file system interface. So, what do we do? Write a userspace utility to parse a dump of the appropriate region of memory, easy! That’s exactly what I did, a (relatively) simple user space app to parse out the gcov gcda files from a skiboot memory image – something we can easily dump out of the simulator, relatively easily (albeit slower) from the FSP on an IBM POWER system and even just directly out of a running system (if you boot a linux kernel with the appropriate config).

So, we can now get a (mostly automated) code coverage report simply for the act of booting to petitboot: https://open-power.github.io/skiboot/boot-coverage-report/ along with our old coverage report which was just for the unit tests (https://open-power.github.io/skiboot/coverage-report/). My current boot-coverage-report is just on POWER7 and POWER8 IBM FSP based systems – but you can see that a decent amount of code both is (and isn’t) touched simply from the act of booting to the bootloader.

The numbers we get are only approximate for any code run on more than one CPU as GCC just generates code that does a load/add/store rather than using an atomic increment.

One interesting observation was that (at least on smaller systems, which are still quite large by many people’s standards), boot time was not really noticeably increased.

For more information on running with gcov, see the in-tree documentation: https://github.com/open-power/skiboot/blob/master/doc/gcov.txt

MySQL 5.6 on POWER (patch available)

The following sentence is brought to you by IBM Legal. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

Okay, now that is out of the way….

If you’re the kind of person who follows the MySQL bugs database closely or subscribes to the MySQL Internals mailing list, you may have worked out that I’ve spent a small amount of time poking at MySQL on modern POWER systems.

Unlike Intel CPUs, POWER CPUs require explicit memory barriers to synchronize memory state between different CPUs. This means that when you’re implementing synchronization primitives, you have one extra thing to get right.

Luckily, if you use straight pthread mutexes, this is already taken care of. Unluckily, there are some optimizations in MySQL that don’t use straight pthread mutexes and so may be problematic on non-Intel CPUs. A few of these issues have sneaked into MySQL over the past few years. The most problematic area was around the optimized mutexes in InnoDB (you can use the pthread_mutex fallback code, but it’s less performant).

Luckily, I both knew where to look and there are good asserts throughout InnoDB code to help spot any other areas that I may not have initially thought of to look at. Coding defensively with a good amount of asserts is a good thing.

After not too much work, I have a set of patches that I’m fairly confident is correct and performs near as well as possible. Initially, I had a different patch that used heavyweight memory barriers in a lot of places, but big kudos to Yasufumi for posting a better patch than mine to bug 47213 – using the lighter weight barriers gives a decent performance boost.

One of the key patches is in the InnoDB mutex code to change the thread priority – i.e. a POWER equivalent to the x86 pause instruction. These are hints to the CPU that the thread being executed is in a spinloop and CPU resources should be allocated to other threads to make betterr forward progress.

After dragging Anton in to have a look and a think, this code may have motivated him to have a go at getting kernel support for adaptive mutexes, thus removing the need for this spin/sleep/yield/eep loop in InnoDB (at least on Linux).

So… I’ve spent the appropriate time filing bugs in the MySQL bug tracker for the things I’ve found. Feel free to track them yourself, they are:

  • Bug 72715: character set code endianness dependent on CPU type rather than endianness of CP
    • I don’t think this is an issue for us… or it could be that this is actually just incredibly untested code in the MySQL Server. It’s also not POWER specific, although was caught by the Migration Assistant which is part of the Advanced Toolchain from IBM.
  • Bug 72718: CACHE_LINE_SIZE in innodb should be 128 on POWER
    • I contributed a patch that’s a simple #ifdef for CPU type. Those who care about other CPU architectures should chime in with the correct value for them.
    • There’s other places in InnoDB where there’s some padding that don’t use this define, I need to file a bug for that.
  • Bug 72754: Set thread priority in InnoDB mutex spinloop
    • This makes a big difference when you have mutex contention and SMT (Symmetric Multi-Threading) enabled (on POWER, you can dynamically change SMT levels at runtime).
    • I’ve contributed a preliminary patch that isn’t generic. I should go and fix that.
  • Bug 72755: InnoDB mutex spin loop is missing GCC barrier
    • This also applies to x86 (and indeed all platforms). If GCC gets a bit smarter, the current code could compile down to nothing, which is exactly what you don’t want from a spinloop. The correct thing to do is to have a GCC memory barrier (not CPU one) to ensure that the compiler doesn’t optimize away the spinning.
    • I’ve contributed a patch, may need #ifdef GCC added.
  • Bug 72809: InnoDB Linux native aio setup missing barrier after setup
    • This appears to be a “POWER8 is fast” related bug :)
    • Patch contributed.
  • Bug 72811: Set NUMA mempolicy for optimum mysqld performance
    • Not POWER specific.
    • I’ve contributed a patch that sets NUMA memory allocation policy inside mysqld rather than having to run “numactl” manually
  • Bug 47213: InnoDB mutex/rw_lock should be conscious about memory ordering other than Intel
    • Originally filed by Yasufumi back in 2009.
    • Some good discussion going on here to ensure the patch is correct. This is the kind of patch that requires more review  than it takes to write it.
    • This patch would fix the majority of problems for non-Intel CPU architectures.
    • Thanks to Yasufumi for providing an updated patch, it helped a lot!
  • Bug 72544: Incorrect locking for global_query_id
    • I found a bug. Rather benign and not POWER specific.

Want to run MySQL 5.6.17 on POWER? Get my MySQL 5.6.17 patch here: https://flamingspork.com/mysql/mysql-5.6.17-POWER.patch

My accumulation of 5.6 patches seems fairly reliable. I’d test before putting into production, and I’d certainly love to know any problems you hit.

Get the quilt series of patches here: https://flamingspork.com/mysql/mysql-5.6.17-POWER-patches.tar.gz

I have, of course, done the legal wrangling for the Oracle Contributor Agreement (remarkably painless) and am working on making the patches completely acceptable to be merged into MySQL.

Caring about stack usage

It may not be surprising that there’s been a few projects over the years that I’ve worked on where we’ve had to care about stack usage (to varying degrees).

For threaded userspace applications (e.g. MySQL, Drizzle) you get a certain amount of stack per thread – and you really don’t want to bust that. For a great many years now, there’s been both a configuration parameter in MySQL to set how much stack each thread (connection) gets as well as various checks in the source code to ensure there’s enough free stack to do a particular operation (IIRC open_table is the most hairy one of this in MySQL).

For the Linux Kernel, stack usage is a relatively (in)famous problem… although by now just about every real problem has been fixed and merely mentioning it is probably just the influence of the odd grey beard hairs I’m pretending not to notice.

In a current project I’m working on, it’s also something we have to care about.

It turns out that GCC has a few nice things to help you prevent unbounded stack usage or runaway stack usage. There’s two warnings you can enable.

There’s -Wstack-usage=len which will throw warnings on unbounded stack usage (e.g. array on stack sized based on an argument to the function), where stack usage is greater than len and when stack usage may exceed len.

There’s also -Wframe-larger-than=len which is based on calculation for a particular stack frame, as opposed to -Wstack-usage=len, which could be based on several stack frames.

Odds are, you may get some warnings in your project if you set this to what you would consider “conservative” values. Now, if this is every going to explode at runtime is something that’s left as an exercise for the reader, but enabling these warnings is pretty easy and a simple way to help find and prevent some issues.

After all, having your software explode for running off the end of the stack is just a tad embarrassing.

Limiting functions to 32k stack in Drizzle (and scoped_ptr)

I wonder if this comes under “Code Style” or not…

Anyway, Monty and I finished getting Drizzle ready for adding “-Wframe-larger-than=32768” as a standard compiler flag. This means that no function within the Drizzle source tree can use greater than 32kb stack – it’s a compiler warning – and with -Werror, it means that it’s a build error.

GCC is not perfect at detecting stack usage, but it’s pretty good.

Why have we done this?

Well, there is a little bit of recursion in the server… and we can craft queries to blow a small stack (not so good). On MacOS X, the default thread stack size is only 512kb. This gives not many frames if 32kb stack is a even remotely common.

I found some interesting places to throw a lot of things on the stack too – that would be rather far down on a callchain – leading to the possibility of blowing up in really strange ways.

We’d love to make it 16kb…. but that’s a fair bit more work, so something for the future.

We’ve used the Boost scoped_ptr to address a bunch of these situations as it provides pretty much minimal code change for the same effect (except that memory is dynamically allocated instead of as part of the stack frame).

Fun with the 387

Filed  GCC bug 39228:

#include <stdio.h>
#include <math.h>
int main()
{
        double a= 10.0;
        double b= 1e+308;
        printf("%d %d %dn", isinf(a*b), __builtin_isinf(a*b), __isinf(a*b));
        return 0;
}

mtaylor@drizzle-dev:~$ gcc -o test test.c
mtaylor@drizzle-dev:~$ ./test
0 0 1
mtaylor@drizzle-dev:~$ gcc -o test test.c -std=c99
mtaylor@drizzle-dev:~$ ./test
1 0 1
mtaylor@drizzle-dev:~$ gcc -o test test.c   -mfpmath=sse -march=pentium4
mtaylor@drizzle-dev:~$ ./test
1 1 1
mtaylor@drizzle-dev:~$ g++ -o test test.c
mtaylor@drizzle-dev:~$ ./test
1 0 1

Originally I found the simple isinf() case to be different on x86 than x86-64, ppc32 and sparc (32 and 64).

After more research, I found that x86-64 uses the sse instructions to do it (and using sse is the only way for __builtin_isinf() to produce correct results). For the g++ built version, it calls __isinf() instead of inlining (and as can be seen, the __isinf() version is always correct).

Specifically, it’s because the optimised 387 code is doing the math in double extended precision inside the FPU. 10.0*1e308 fits in 80bits but not in 64bit. Any code that forces it to be stored and loaded gets the correct result too. e.g.

mtaylor@drizzle-dev:~$ cat test-simple.c

#include <stdio.h>
#include <math.h>
int main()
{
        double a= 10.0;
        double b= 1e+308;
    volatile    double c= a*b;
        printf("%dn", isinf(c));
        return 0;
}

mtaylor@drizzle-dev:~$ gcc -o test-simple test-simple.c
mtaylor@drizzle-dev:~$ ./test-simple
1

With this code you can easily see the load and store:

 8048407:       dc 0d 18 85 04 08       fmull  0x8048518 804840d:       dd 5d f0                fstpl  -0x10(%ebp)
 8048410:       dd 45 f0                fldl   -0x10(%ebp)
 8048413:       d9 e5                   fxam

While if you remove volatile, the load and store doesn’t happen (at least on -O3, on -O0 it hasn’t been optimised away):

 8048407:       dc 0d 18 85 04 08       fmull  0x8048518
 804840d:       c7 44 24 04 10 85 04    movl   $0x8048510,0x4(%esp)
 8048414:       08
 8048415:       c7 04 24 01 00 00 00    movl   $0x1,(%esp)
 804841c:       d9 e5                   fxam

This is also a regression from 4.2.4 as it just calls isinf() and doesn’t expand the 387 code inline. My guess is the 387 optimisation was added in 4.3.

Recommended fix: store and load in the 387 version so to operate on same precision as elsewhere.

Now I just have to make a patch I like that makes Drizzle behave because of this (showed up as a failure in the SQL func_math test) and then submit to MySQL as well… as this may happen there if “correctly” built.

floating point is not fun

#include <stdio.h>
#include <math.h>

int main()
{
        double a= 10.0;
        double b= 1e+308;
        printf("%dn",isinf(a * b));
        return 0;
}

Prints 1 on: 64bit intel, 32bit PowerPC, 32bit SPARC, 64bit Sparc. But prints zero on 32bit intel.

Oh, but if you build that with g++ instead of gcc on 32bit Intel, you get 1.

VirtualBox 2.1.0 (and OpenSolaris 2008.11)

Upgraded VirtualBox and booted up my OpenSolaris VM. VirtualBox 2.1.0 finally fixes the bug where if 127.0.0.1 was in resolv.conf on the host – no DNS for you in the guest (unless in the guest you were running a DNS server).

Haven’t tried it yet… but OpenGL Accelleration makes at least a checkbox appearance in VirtualBox 2.1…. so that could be rather awesome.

Going a lot better with OpenSolaris 2008.11 than previous releases.. It looks like it might be quite easy to get to the stage of building Drizzle on it.

Just figured out how to change to Dvorak! Yay, I can type again! (Go to Input Methods preference panel and add US/DVORAK as a language, move it to the top, and enable the input method application and do it that way).

Currently installing sunstudioexpress. Why not gcc? I’m pretty sure the version in OpenSolaris is still ancient (so won’t build drizzle) and Sun Studio does produce different warnings (which indicate real bugs in a bunch of cases).

Things I wish were packaged: latest protobufs, latest bzr, gcc 4.x