2016 Kernel miniconf CFP

Why yes, it’s another long URL thanks to Google Docs:

Got a kernel topic you want to talk about? Got a kernel topic you want to start discussion on? Or a Q&A? Submit NOW! We’re going for part sessions, part unconference.

Questions? Contact me at

MySQL 5.7.5 on POWER – thread priority

Good news everyone!

MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven’t yet gone and tried it all out, but since I’m me, I look at bugs database and git/bzr history first.

On Intel CPUs, when you’re spinning on a spin lock, you’re meant to execute the PAUSE CPU instruction. This tells the CPU that other execution threads in the same core should be given priority as you are currently not doing anything productive. Without this, you’re likely going to hurt on hyperthreaded CPUs.

In MySQL, there are custom spinlocks in order to do interesting adaptive mutex things to attempt to squeeze the most performance possible out of modern systems.

One of the (not 100% ready, but close) bugs with patches I submitted against MySQL 5.7 was for using the equivalent of the PAUSE instruction for POWER CPUs. On POWER, we’re a bit different, you can actually set priorities of threads (which may matter more, as POWER8 CPUs can be in SMT8 mode – where there are *eight* executing threads per core).

So, the good news is that in MySQL 5.7.5, the magic instructions for setting thread priority are in! This should mean great things for performance on POWER systems with any of the SMT modes enabled.

The next interesting part of this is how it interacts with other KVM guests on a system. At least on POWER (and on x86 as well, although I won’t go into details here) there’s a hypervisor call that a guest can make saying “hey, I’m spinning here, perhaps you want to make sure other vcpus execute so that at some point I can continue”. On POWER, this is the H_CONFER hcall, where you can basically do a directed yield to another vcpu (the one that holds the lock you’re trying to get is a good idea).

Generally though, it’s only the guest kernel that does this, not userspace. You can see the H_CONFER call in __spin_yield(arch_spinlock_t*) and __rw_yield(arch_rwlock_t*) in arch/powerpc/lib/locks.c in the kernel.

It would be interesting to see what extra we could get out of a system running multiple guests with MySQL servers if InnoDB/MySQL could properly yield to the right vcpu (well, thread I guess).

OpenPower firmware up on github!

With the whole OpenPower thing, a lot of low level firmware is being open sourced, which is really exciting for the platform – the less proprietary code sitting in memory the better in my books.

If you go to you’ll see code for a bunch of the low level firmware for OpenPower and POWER8.

Hostboot is the bit of code that brings up the CPU and skiboot both sets up hardware and provides runtime services to Linux (such as talking to the service processor, if one is present).

Patches to are (of course) really quite welcome. It shouldn’t be too hard to get your head around the basics.

To see the Linux side of the OPAL interface, go check out linux/arch/powerpc/platforms/powernv -there you can see how we ask OPAL to do things for us.

If you buy a POWER8 system from IBM running PowerKVM you’re running this code.

The joy of Unicode

So, back in late 2008, rather soon after we got to start working on Drizzle full time, someone discovered, or:

Since we had decided that Drizzle was going to be UTF-8 everywhere,(after seeing for years how hard it was for people to get character sets correct in MySQL) we soon added ☃.test to the tree, which tried a few interesting things:


Because what better to show off UTF-8 than using odd Unicode characters for table names, database names and file names. Well… it turns out we were all good except if you attempted to check out the source tree on Solaris. It was some combination of Python, Bazaar and Solaris that meant you just got python stacktraces and no source tree. So, if you look now it’s actually snowman.test and has been since the end of 2008, because Solaris 10.

A little while later, I was talking to Anthony Baxter at OSDC in Sydney and he mentioned Unicode above 2^16 in UTF-8…. so, we had clef.test (we’d learned since ☃ and we were not going to tall it 𝄢.test).

Fast forward a few years to, well, this week, and I was talking to Jeremy Kerr about petitboot and telling the tail of snowman.test. So out came the crazy Unicode characters:

  • U+1F4A9 PILE OF POO 💩
  • U+1F435 MONKEY FACE 🐵
  • U+1F431 CAT FACE 🐱

But guess what, there is no MONKEY FACE WITH TEARS OF JOY! I know, this is just unacceptable – TEARS OF JOY should be a modifier, because you may need U+1F6B9 MENS SYMBOL 🚹 with a TEARS OF JOY modifier at some point in your life.

Anyway, another place with tests involving odd Unicode characters is good for everyone, but still lacking if you need to boot an Operating System that’s MONKEY FACE WITH TEARS OF JOY.

Google Tests Homegrown Power8 Servers

Having joined IBM now and working on Linux on Power, I’m allowed to be all happy and gleeful about a non x86 CPU architecture again, and one where Linux and free software really is a big deal.

Some of my now colleagues talked about some things related to Power 8 at so you should go and check out their talks!

and now for something completely different…

As many of you know, I’ve been working in the MySQL world for quite a while now. IN fact, it was nearly 10 years ago when I first started hacking on MySQL Cluster at MySQL AB.

Most recently, I was at Percona which was a wonderful journey where over my nearly three years there the company at least doubled in size, launched several new software products and greatly improved the quality and frequency of releases.

However the time has come for something completely different. The MySQL world is rather mature, the future of Percona software is bright and, well, I could do with poking into something rather different.

So a couple of weeks ago I started at IBM in the Linux Technology Centre working on KVM on POWER and related things. No doubt there’ll be interesting things to blog about as time goes on, but it’s about time I posted my change of employment :)

Switching to Fedora from Ubuntu

I’ve run Ubuntu on my desktop (well… and laptop) since roughly the first release back in 2004. I’ve upgraded along the way, with reinstalls on the laptop limited to changing CPU architecture and switching full disk encryption.

Yesterday I wiped Ubuntu and installed Fedora.

Previously to Ubuntu I ran Debian. I actually ran Debian unstable on my desktop/laptop. I ran Debian Stable on any machines that had to sit there and “just work” and were largely headless. Back then Debian stable just didn’t have even remotely recent enough kernel, X and desktop apps to really be usable with any modern hardware. The downside to this was that having an IRC client open to #debian-devel and reading the topic to find if sid (codename for the unstable distribution) was pretty much a requirement if you ever thought about running “apt-get dist-upgrade”. This was exactly the kind of system that you wouldn’t give to non-expert family members as a desktop and expect them to maintain it.

Then along came Ubuntu. The basic premise was “a Debian derived distribution for the desktop, done right.” Brilliant. Absolutely amazingly brilliant. This is exactly what I wanted. I had a hope that I’d be able to focus on things other than “will dist-upgrade lead to 4 hours of fixing random things only to discover that X is fundamentally broken” and a possibility that I could actually recommend something to people who weren’t experts in the field.

For many years, Ubuntu served well. Frequent updates and relatively flawless upgrades between releases. A modern desktop environment, support for current hardware – heck, even non computer literate family members started applying their own security updates and even upgrading between versions!

Then, something started to go awry…. Maybe it was when Ubuntu shipped a kernel that helpfully erased the RAID superblock of the array in the MythTV machine… Maybe it was when I realized that Unity was failing as a basic window manager and that I swore less at fvwm…. Or maybe it was when I had a bug open for about 14,000 years on that if you left a Unity session logged in for too long all the icons in the dock would end up on top of each other at the top left of the screen making it rather obvious that nobody working on Ubuntu actually managed to stay logged in for more than a week. Or could it be that on the MythTV box and on my desktop having the login manager start (so you can actually log in to a graphical environment) is a complete crapshoot, with the MythTV box never doing it (even though it is enabled in upstart… trust me).

I think the final straw was the 13.04 upgrade. Absolutely nothing improved for me. If I ran Unity I got random graphics artifacts (a pulldown menu would remain on the screen) and with GNOME3 the unlock from screensaver screen was half corrupted and often just didn’t show – just type in your password and hope you’re typing it into the unlock screen and it hasn’t just pasted it into an IM or twitter or something. Oh, and the number of times I was prompted for my WiFi network password when it was saved in the keyring for AT LEAST TWO YEARS was roughly equivalent to the number of coffee beans in my morning espresso. The giant regressions in graphics further removed any trust I had that Mir may actually work when it becomes default(!?) in the next Ubuntu release.

GNOME3 is not perfect… I have to flip a few things in the tweak tool to have it not actively irritate me but on the whole there’s a number of things I quite like about it. It wins over Unity in an important respect: it actually functions as a window manager. A simple use case: scanning photos and then using GIMP to edit the result. I have a grand total of two applications open, one being the scanning software (a single window) and the other being the GIMP. At least half the time, when I switch back to the scanning program (i.e. it is the window at the front, maximized) I get GIMP toolbars on top of it. Seriously. It’s 2013 and we still can’t get this right?

So… I went and tried installing Fedora 19 (after ensuring I had an up to date backup).

The install went pretty smoothly, I cheated and found an external DVD drive and used a burnt DVD (this laptop doesn’t have an optical drive and I just CBF finding a suitably sized USB key and working out how to write the image to it correctly).

The installer booted… I then managed to rather easily decrypt my disk and set it to preserve /home and just format / and /boot (as XFS and ext3 respectively) and use the existing swap. Brilliant – I was hoping I wouldn’t have to format and restore from backup (a downside to using Maildir is that you end up with an awful lot of files). Install was flawless, didn’t take any longer than expected and I was able to reboot into a new Fedora environment. It actually worked.

I read somewhere that Fedora produces an initramfs that is rather specific to the hardware you’re currently running on, which just fills me with dread for my next hardware upgrade. I remember switching hard disks from one Windows 98 machine to another and it WAS NOT FUN. I hope we haven’t made 2013 the year of Windows 98 emulation, because I lived through that without ever running the damn thing at home and I don’t want to repeat it.

Some preferences had to be set again, there’s probably some incompatibility between how Ubuntu does things and how Fedora does things. I’m not too fussed about that though.

I did have to go and delete every sign of Google accounts in GNOME Online Accounts as it kept asking for a password (it turns out that two-factor-auth on Google accounts doesn’t play too nice). To be fair, this never worked in Ubuntu anyway.

In getting email going, I had to manually configure postfix (casually annoying to have to do it again) and procmail was actually a real pain. Why? SELinux. It turns out I needed to run “restorecon -r /home”. The way it would fail was silently and without any error anywhere. If I did “setenforce 0” it would magically work, but I actually would like to run with SELinux: better security is better. It seems that the restorecon step is absolutely essential if you’re bringing across an existing partition.

Getting tor, polipo and spamassasin going was fairly easy. I recompiled notmuch, tweaked my .emacs and I had email back too. Unfortunately, it appears that Chromium is not packaged for Fedora (well.. somebody has an archive, but the packages don’t appear to be GPG signed, so I’m not going to do that). There’s a complaint that Chromium is hard to package blah blah blah but if Debian and Ubuntu manage it, surely Fedora can. I use different browsers for different jobs and although I can use multiple instances of Firefox, it doesn’t show up as different instances in alt-tab menu, which is just annoying.

It appears that the version of OTR is old, so I filed a bug for that (and haven’t yet had the time to build+package libotr 4.0.0 – but it’s sorely needed). The pytrainer app that is used to look at the results of my Garmin watch was missing some depedencies (bug filed) and I haven’t yet tried to get the Garmin watch to sync… but that shouldn’t be too hard…

The speakers on my laptop still don’t work – so it’s somebody screwing up either the kernel driver or pulseaudio that makes the speakers only sometimes work for a few seconds and then stop working (while the headphone port works fine).

On the whole, I’m currently pretty happy with it. We’ll see how the upgrade to Fedora 20 goes though…. It’s nice using a desktop environment that’s actually supported by my distribution and that actually remotely works.

Is MySQL bigger than Linux?

I’m going to take the numbers from my previous post, MySQL Modularity, Are We There Yet? for the “kernel” size of MySQL – that is, everything that isn’t a plugin or storage engine.

For Linux kernel, I’m just going to use the a-bit-old git tree I have on my laptop. I’ve decided that the following directories are for “plugins” drivers/ arch/ sound/ firmware/ crypto/ usr/ virt/ tools/ scripts/ fs/*/* and everything else is core kernel code.

Version Total LoC Total Plugin LoC Remaining (kernel)
MySQL 5.6.10 1,049,344 265,189 784,155 (74% kernel)
MariaDB 5.5 1,142,118 315,796 826,322 (72% kernel)
Linux 9,983,269 8,824,121 1,159,148 (11% kernel)

The scary thing is that it’s surprisingly close, MySQL/MariaDB core is roughly 68-71% the  size of the Linux kernel. This is probably an unfairly large number for Linux too as there’s much more of Linux that is pluggable and modular… so I actually suspect they’re closer to exactly the same size.

If we look at the net/ directory in linux, it’s a grand total of 493,000 lines of code, all of which is fairly modular and independent. You could, quite reasonably, claim that the core of Linux is in fact closer to half a million lines of code than a million, making MySQL significantly larger.

So how many engineers are looking after each code base? We know there are over a thousand Linux kernel developers contributing to each release (e.g. for data back in 2010, and for Feb 2013).

I’m now going to fudge some things to attempt to work out how many “developers” are working on linux core code rather than drivers and arch specific things. I work out there’s probably about 20-25% of linux developers who work on things that are not drivers, filesystems or arch code. This is around 250-300 developers for each kernel release.

So… how many people have ever committed code to MySQL? This is fairly easy to find out: I simply looked at the entire bzr history, grepped out every committer and then uniqued the list (this required more than just sort -u as people used different email addresses and names). How many people have ever committed code to MySQL (i.e. their code can be found in the MySQL 5.6 bzr tree)? 312.

How many committers to MySQL 5.6 are there? 161. This is pretty amazing, that’s about half of what the total is. However, this number is misleading. For example, my name is there and the last commit to the MySQL tree from me was in 2008. You also see names such as Monty Taylor and Kristian Nielsen – all three of us not having worked for MySQL/Sun/Oracle for a great number of years. At the very least, there’s been a lot of code integration into MySQL 5.6 from many existing sources that were not previously in MySQL trunk.

How many CPU cycles does a SQL query take? (or pagefaults caused… or L2 cache misses… or CPU migrations…)

I like profilers. I use them when trying to make software (such as Drizzle) faster. Many profilers suck – and pretty much all of them are impossible to attach to a running system. Two notable exceptions are oprofile and dtrace (for Linux and Solaris respectively). The downside of oprofile is that it is non trivial to configure and get running and is pretty much all or nothing. Dtrace has the major disadvantage of that it is Solaris specific, so is only available to a minority of our users (and developers).

The new Linux Performance Events interface (perf_event) presents to userspace a nice abstraction of the hardware Performance Monitoring Unit inside the CPU. Typically these are processor specific (i.e. the one in a Core is different than the one in a Core 2) and can only be used by one thing at a time. The perf_events interface lets multiple applications/threads use the PMU (switching state at context switch as needed), even giving us ratios of how much time we got events for so we can do realistic estimates. It also provides some common defines to ask for things like CPU Cycles (the value you program into the PMU differs per CPU architecture for that, so an abstraction is rather welcome: it means we don’t need to have #ifdef __powerpc__ in our userspace code to support PowerPC, just a kernel that does).

Since perf_events gives us an interface to only get counts for our thread, we can map this onto connected sessions to Drizzle (and if we were using a pool_of_threads type scheduler in Drizzle, we’d need a bit of extra code to get things right, but with a thread per connection scheduler, we get it for free). A simple use of this could be to answer the question “How many CPU cycles does this SQL query take?” with the condition that you do not want how many CPU cycles were spent executing other things (e.g. the 1,000 other SQL queries currently being executed on your database server).

Many of you will now point out the RDTSC instruction for the x86 and ask why I’m just not using it. With RDTSC you’re only getting “how many cycles since reboot”. So using two values from the TSC and finding the difference only tells you how many cycles were between the two reads, not how many cycles were spent executing your thread/process. So the value of “cycles executed” gathered with RDTSC varies between a loaded and non-loaded system. With perf_events, it does not.

So…. after talking to paulus about perf_events, I decided to see how I could plug this into Drizzle to start getting interesting numbers out. Sure enough, a little bit of work later and I have my first proof of concept implementation over in lp:~stewart-flamingspork/drizzle/perf-events. That tree has a perf_udf() function that is like BENCHMARK except it returns the number of CPU cycles spent executing the expression N times. For example, how many CPU cycles does it take for the SQL expression MD5(‘pants’) to be evaluated?

drizzle> select perf_udf(1, MD5(‘pants’));
| perf_udf(1, MD5(‘pants’)) |
| 43540 |
1 row in set (0 sec)

So on my laptop, it’s about 40,000 cycles (over multiple runs I saw it vary between about 39,000 to 44,000). The really neat thing about using the perf_events interface is that if you run this on a different CPU architecture that has perf_events available in the kernel you’re currently running, it “just works”. e.g. if I ran this on a POWER5 box, I’d find out how many cycles it took there! No special code in Drizzle, yay!

The really neat next thing I tried was to run the same perf_udf query while running drizzleslap on the same database server, completely hammering it. I get exactly the same result (within normal variance)!

That isn’t the best part though. The best part is the other bits of information we can get out of the PMU:


So the same way we can use the new ‘perf’ utility to see what a process is doing to the machine, we can enable people to do exactly the same thing with specific SQL queries (and through a bit of extra code, you could aggregate for users/applications). Not only that, but we could write a plugin for Drizzle to occasionally sample queries running through the database server and build up a quite complete profile of what’s going on over time.

We can also get software events out of perf_events such as:


So for engines that do memory mapping of files on disk, we can find out which of your SQL queries are causing the page faults! We should also be able to find out if the operating system kernel is bouncing your execution threads around CPUs a lot.

The biggest possibility for awesomeness comes from the perf_event ability to get periodic call traces (you specify how often) including down into the kernel. This means that we could, on demand and only when we enable it, profile where time is being spent for that specific SQL query. The most important thing to note is that when this gathering is not enabled, the overhead is zero. Even when enabled for one query, this should have minimal impact on other things currently running (you’re going to use some extra CPU doing the profile, but we’re only profiling that one SQL query, not the 1000 other ones executing at the same time). So we could tell you that your query spends 9/10ths of its time in filesort() without ever adding any extra instructions to the filesort() codepath (and anywhere else where you may want to find out how much CPU time was spent). We could even tell you how much time was being spent in the kernel doing specific IO operations!

So I’m thinking that this could be used to make something pretty useful that could easily be used on production systems due to the zero overhead when not enabled and the small overhead when enabled.

Debian unstable on a Sun Fire T1000

So i got the T1000 working again (finally, after much screwing about trying to get the part). I then hit the ever annoying “no console” problem, where the console didn’t work – kind of problematic.

After a firmware upgrade, and passing “console=/dev/ttyS0” to the kernel, things work.

So the T1000 firmware 6.3 doesn’t work with modern debian kernels. Thing swork with 6.7 though.

Using Dtrace to find out if the hardware or Solaris is slow (but really just working around the problem)

A little while ago, I was the brave soul tasked with making sure Drizzle was working properly and passing all tests on Solaris and OpenSolaris. Brian recently blogged about some of the advantages of also running on Solaris and the SunStudio compilers – more warnings from the compiler is a good thing. Many kudos goes to Monty Taylor for being the brave soul who fixed most of the compiler warnings (and for us, warnings=errors – so we have to fix them) for the SunStudio compilers before I got to making te tests work.

So, I got to the end of it all and got pointed to an OpenSolaris x86 box where the drizzleslap test was timing out. The timeout for tests is some amazingly long amount of time – 15 minutes. All the drizzle-test-run tests are rather short tests.

To make running the tests quick, I usually LD_PRELOAD libeatmydata – a simple way of disabling pesky things like fsync that take a long time (rumors that I nickname it libmacosxsimulation are entirely true). It’s pretty simple to build libeatmydata on Solaris too (I periodically do this and always intend to check in the associated Makefile but never do).

Unfortunately, on OpenSolaris a bunch of things are built 32bit and others 64bit and just doing “ ./dtr” doesn’t work – I’d have to modify the test script to only do the LD_PRELOAD for drizzled – which is annoying.

On my T1000 running Debian, the drizzleslap test takes 42 seconds to complete with libeatmydata, or 393 seconds when it’s really doing fsyncs. So for it to be timing out on this OpenSolaris x86 box – i.e. taking more than 15 minutes, was strange.

So… what was going on? Step 1: is anything actually going on? One way to test this is to see if disk IO is being generated. On Linux, we can use “iostat”. On Solaris, we can use “zpool iostat”. Things were going to disk for the whole time of the test. Time to compare what the difference between the platforms is.

Well.. a typical way that tests have taken forever have been because of lots of transactions: i.e. lots of fsync(). You are then dependent on the fsync() performance.

If we look at “iostat -x” and the avgrq-sz field on Linux, we’ll see that the average request size is on 10-12 sectors (512 byte blocks). i.e. about 5 or 6kb.

If we look at “zpool iostat 1” on OpenSolaris, we see a bit of a different story, but similar enough that you could safely assume that lots of small synchronous IOs were going on. After a bit of reading of the ZFS on-disk format documents, I had a slightly better idea what was going on that could be causing me seeing a larger average request size on ZFS than on Linux with XFS.

So… perhaps it’s the speed of these syncs? Ordinarily, I’d just write up a quick LD_PRELOAD library that wraps fsync() and times it (perhaps writing to a file so I could do analysis on it later). Since I was working on Solaris… I thought I’d try DTrace. Some google-foo and dtrace hacking later, I tried this:

stewart@drizzle-dev:~/drizzle/sparc$ time pfexec dtrace -n ‘syscall::fdsync:entry /execname == “drizzled” / { self->ts[self->stack++] = timestamp; } syscall::fdsync:return /self->ts[self->stack – 1]/ { this->elapsed = timestamp – self->ts[–self->stack]; @[probefunc] = count(); @a[probefunc] = quantize(this->elapsed); self->ts[self->stack] =0; }’

dtrace: description 'syscall::fdsync:entry ' matched 2 probes

  fdsync                                                         1600
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1520
       134217728 |@@                                       79
       268435456 |                                         1
       536870912 |                                         0        

real	4m26.837s
user	0m0.657s
sys	0m0.566s

Which did seem like an awful long time for an fsync() to take. Although the filesystem was on a single disk, it was meant to be made remotely recently, and it’s sitting on a Sun controller… so it should be a bit better than that. From reading some of the ZFS on-disk spec, it could be some bug that means we’re waiting for a checkpoint to be written instead of forcing the sync out when we call fsync() – but I sought another solution (as on other Solaris/OpenSolaris systems this wasn’t a problem – so perhaps fixed in newer kernels or it’s a driver issue).

So I went and added “–commit=100” to a bunch of places in the drizzleslap test to batch things into transactions. The idea being to greatly reduce the number of fsync() calls to bring the execution time of the drizzleslap test on the machine to get below 15minutes. A bit of jiggerypokery later (some tests needed to not have the –commit to avoid various locking foo) and I had something that should run.

Now, ~113 seconds on the T1000 on Linux (with a single SATA disk, down from an original 393 seconds) and ~437 seconds on the OpenSolaris box. For giggles, tried it on a Solaris box that’s running UFS on a 10k RPM SAS drive: ~44 seconds.

In Summary:

T1000, Linux, libeatmydata, XFS: ~42 seconds (before optim)
T1000, Linux, 7200RPM SATA, XFS: ~113 seconds
T5240, Solaris 10, 10k RPM SAS, UFS: ~44 seconds
16 core Xeon, OpenSolaris, 7200RPM, ZFS: ~437 seconds

So, on that hardware setup – something is strange. The 10k SAS drive on UFS on the CoolThreads box is really nice though…. makes me want that kind of disk here.

This page was useful, and I used it as a basis for some of my DTrace scripts:

Also thanks to several people on #opensolaris on Freenode who helped me out with various Solaris specific commands in tracking this down.

fixing drizzle on linux sparc

Since I got fed up with Solaris the other day, the T1000 is running Debian. This means that “I’ll care about Drizzle on Linux Sparc”.

OMG were things broken in the most “trivial” ways.

A good quick intro to the issues is Memory alignment on SPARC, or a 300x speedup!

It all comes down to memory alignment.

So I pulled the MySQL 6.0 bzr tree onto the box to try it too… I haven’t seen so many compiler warnings in ages (okay… since I last built MySQL.. drizzle is warning-clean and it makes it hard to remember a time before that). I think it works purely by accident.

So I’m gradually getting all of Drizzle working on Linux Sparc (a few things fixed already).

It’d be great if the T1k had faster disk though (make -j30 is fun… but IO isn’t on a single 160GB 7200rpm disk)… anybody wanting to donate an SSD?