Optimizing database access in Django: A patchwork story

tl;dr: I made Patchwork a lot faster by looking at what database queries were being generated and optimizing them either by making Django produce better queries or by adding better indexes.

Introduction to Patchwork

One of the key bits of infrastructure a bunch of maintainers of Open Source Software use is a tool called Patchwork. We use it for a bunch of OpenPOWER firmware development, several Linux subsystems use it as well as freedesktop.org.

The purpose of Patchwork is to supplement the patches-to-a-mailing-list development work flow. It allows a maintainer to see all the patches that have been posted on the list, How many Acked-by/Reviewed-by/Tested-by replies they have, delegate responsibility for the patch to a co-maintainer, track (and change) the state of the patch (e.g. to “Under Review”, “Changes Requested”, or “Accepted”), and create bundles of patches to help in review and testing.

Since patchwork is an open source project itself, there’s several instances of it out there in common use. One of the main instances is https://patchwork.ozlabs.org/ which is (funnily enough) used by a bunch of people connected to OzLabs for projects that are somewhat connected to OzLabs. e.g. the linuxppc-dev project and the skiboot and petitboot projects. There’s also a kernel.org instance, which is used by some kernel subsystems.

Recent versions of Patchwork have added some pretty cool features such as the ability to integrate with CI systems such as Snowpatch which helps maintainers see if patches submitted are likely to break things.

Unfortunately, there’s also been some complaints that recent version of patchwork have gotten slower than previous ones. This may well be the case, or it could just be that the volume of patches is much higher and there’s load on the database. Anyway, after asking a few questions about what the size and scope was of the patchwork database on ozlabs.org, I went “hrm… this sounds like it shouldn’t really be a problem… perhaps I should look into this”.

Attacking the problem…

Every so often it is revealed that I know a little bit about databases.

Getting a development environment up for Patchwork is amazingly easy thanks to Docker and the great work of the Patchwork maintainers. The only thing you need to load in is an example dataset. I started by importing mail from a few mailing lists I’m subscribed to, which was Good Enough(TM) for an initial look.

Due to how Django forces us to design a database schema though, the suggested method of getting a sample data set will not mirror what occurs in a production system with multiple lists. It’s for this reason that I ended up using a copy of a live dataset for much of my work rather than constructing an artificial one.

Patchwork supports both a MySQL and PostgreSQL database backend. Since the one on ozlabs.org is backed by PostgreSQL, I ended up loading a large dataset into PostgreSQL for most of my work, although I also did some testing with MySQL.

The current patchwork.ozlabs.org instance has a database of around 13GB in side, with about a million patches. You may think this is big, my database brain goes “no, this is actually quite small and everything should be a lot faster than it is even on quite limited hardware”

The problem with ORMs

It turns out that Patchwork is written in Django, an ORM (Object-Relational Mapping) framework in Python – and thus something that pretty effectively obfuscates application code from the SQL being run.

There is one thing that Django misses that could be a pretty big general performance boost to many applications: it doesn’t support composite primary keys. For some databases (e.g. MySQL’s InnoDB engine) the PRIMARY KEY is a clustered index – that is, the physical layout of the rows on disk reflect primary key order. You can use this feature to your advantage and have much higher cache hits of your database pages.

Unfortunately though, we cannot do that with Django, so we lose a bunch of possible performance because of it (especially for queries that are going to have to bring in data from disk). In fact, we’re forced to use an ID field that’ll scatter our rows all over the place rather than do something efficient. You can somewhat get back some of the performance by creating covering indexes, but this costs in terms of index maintenance and disk space.

It should be noted that PostgreSQL doesn’t have a similar concept, although there is a (locking) CLUSTER statement that can (as an offline operation for the table) re-arrange existing rows to be in index order. In my testing, this can give a bit of a boost to performance of some of the Patchwork queries.

With MySQL, you’d look at a bunch of statistics on what pages are being brought in and paged out of the InnoDB buffer pool. With PostgreSQL it’s a bit more complex as it relies heavily on the OS page cache.

My main experience is with MySQL like environment, so I’ve had to re-learn a bunch of PostgreSQL things in this work which was kind of fun. It may be “because of my upbringing” but it seems as if there’s a lot more resources and documentation out in the wild about optimizing MySQL environments than PostgreSQL ones, especially when it comes to documentation around a bunch of things inside the database server. A lot of credit should go to the MySQL Documentation team – I wish the PostgreSQL documentation was up to the same standard.

Another issue is that fetching BLOBs is generally an expensive operation that you want to avoid unless you’re going to use them. Thus, fetching the whole “object” at once isn’t always optimal. The Django query generation appears to be somewhat buggy when it comes to “hey, don’t fetch these columns, I don’t need them”, so you do have to watch what query is produced not just what query you expect to be produced. For example, [01/11] Improve patch listing performance (~3x).

Another issue with Django is how you go from your Python code to an actual SQL query, especially when the produced SQL query is needlessly complex or inefficient. I’ve seen Django always produce an ORDER BY for one table, even when not needed, I’ve also seen it always join tables even when you’re getting no columns from one of them and there’s no way you’re asking for it. In fact, I had to revert to raw SQL for one of my performance improvements as I just couldn’t beat it into submission: [10/11] Be sensible computing project patch counts.

An ORM can be great for getting apps out quickly, or programming in a familiar way. But like many things, an understanding of what is going on underneath is key for extracting maximum performance.

Also, if you ever hear something like “ORM $x doesn’t scale” then maybe that person just hasn’t looked at how to use the ORM better. The same goes for if they say “database $y doesn’t scale”- especially if it’s a long existing relational database such as MySQL or PostgreSQL.

Speeding up listing current patches for a project

17 SQL queries in 4477ms
More than 4 seconds in the database
does not make page load time great.

Fortunately though, the Django development environment lets you really easily dive into what queries are being generated and (at least roughly) where they’re being generated from. There’s a sidebar in your browser that shows how many SQL queries were needed to generate the page and how long they took. The key to making your application go faster is to run fewer queries in less time.

I was incredibly impressed with how easy it was to see what queries were run, where they were run from, and the EXPLAIN output for them.

By clicking on that SQL button on the right side of your browser, you get this wonderful chart of what queries were executed, when, and how long they took. From this, it is incredibly obvious which query is the most problematic: the one that took more than four seconds!

In the dim dark days of web development, you’d have to turn on a Slow Query Log on the database server and then grep through your source code or some other miserable activity. I am so glad I didn’t have to do that.

More than four seconds for a single database query does not make for a nice UX.

This particular query was a real hairy one, the EXPLAIN output from PostgreSQL (and MySQL) was certainly long and involved and would most certainly not feature in the first half of an “Introduction to query optimization” day long workshop. If you haven’t brushed up on various bits of documentation on understanding EXPLAIN, you should! The MySQL EXPLAIN FORMAT=JSON is especially fantastic for getting deep details as to what’s going on with query execution.

The big performance gain here was to have the database be able to execute the query in a much more efficient way by creating two covering indexes for part of the query. To work out what indexes to create, one has to look at the EXPLAIN output and work out why the database is choosing to do either a sequential scan of a large table, or use an index that doesn’t exclude that many rows. In this case, I tweaked the code to slightly change the query that was generated as well as adding a covering index. What we ended up with is something that is dramatically faster.

The main query is ~350x faster than before

You’ll notice that it appears that the first query there takes a lot more time but it doesn’t, it just takes a lot more time relative to the main query.

In fact, this particular page is one that people have mentioned at being really, really slow to load. With the main query now about 350 times faster than it was originally, it shouldn’t be a problem anymore.

A future improvement would be to cache the COUNT() for the common case, as it’s pretty easily computed when new patches come in or states change.

The patches that help this particular page have been submitted upstream here:

Making viewing a patch with comments faster

Now that we can list patches faster, can we make other pages that Patchwork has quicker?

One such page is viewing a patch (or cover letter) that has a lot of comments on it. One feature of Patchwork is that it will display all the email replies to a patch or cover letter in the Web UI. But… this seemed slow

On one of the most commented patches I could find, we ended up executing one hundred and seventy seven SQL queries to view it! If we dove into it, a bunch of the queries looked really really similar…

I’ve got 99 queries where I only need 1.

The problem here is that the Patchwork UI is wanting to find out the name of each person who submitted a comment, and is doing that by querying the ID from a table. What it should be doing instead is a SQL JOIN on the original query and just fetching all that information in one go: make the database server do the work, it’s really good at it.

My patch [02/11] 4x performance improvement for viewing patch with many comments   does just that by using the select_related() method correctly, as well as being explicit about what information we want to retrieve.

We’re now only a few milliseconds to grab all the comments

With that patch, we’re down to a constant number of queries and around a 3x-7x faster time executing them depending if we have a warm cache or not.

The one time I had to use raw SQL

When viewing a project page (such as https://patchwork.ozlabs.org/project/qemu-devel/ ) it displays the number of patches (archived and not archived) for the project. By looking at what SQL queries are executed to collect these numbers, you’ll notice two things. First, here are the queries:

COUNT() queries can be expensive

First thing you’ll notice is that they took a loooooong time to execute (more than a second each). The second thing, if you look closer, is that they contain a join which is completely unneeded.

I spent a good long while trying to make Django behave, and I just could not. I believe it’s due to the model having some inheritance in it. Writing the query by hand ended up being the best solution, and it gave a significant performance improvement:

Unfortunately, only 4x faster.

Arguably, a better way would be to precompute the count for the archived/non-archived patches and just display them. I (or someone else who knows more about Django) may want to look at that for a future improvement.

Conclusion and final thoughts

There’s a few more places where there could be some optimizations, but currently I cannot get any single page to take more than between 40-400ms in the database when running on my laptop – and that’s Good Enough(TM) for now.

The next steps are getting these patches through a round or two of review, and then getting them into a Patchwork release and deployed out on patchwork.ozlabs.org and see if people can find any new ways to make things slow.

If you’re interested, the full patchset with cover letter is here: [00/11] Performance for ALL THE THINGS!

The diffstat is interesting, as most of the added code is auto-generated by Django for database migrations (adding of indexes).

 .../migrations/0027_add_comment_date_index.py | 23 +++++++++++++++++
 .../0028_add_list_covering_index.py           | 19 ++++++++++++++
 .../0029_add_submission_covering_index.py     | 19 ++++++++++++++
 patchwork/models.py                           | 21 ++++++++++++++--
 patchwork/templates/patchwork/submission.html | 16 ++++++------
 patchwork/views/__init__.py                   |  8 +++++-
 patchwork/views/cover.py                      |  5 ++++
 patchwork/views/patch.py                      |  7 ++++++
 patchwork/views/project.py                    | 25 ++++++++++++++++---
 9 files changed, 128 insertions(+), 15 deletions(-)
 create mode 100644 patchwork/migrations/0027_add_comment_date_index.py
 create mode 100644 patchwork/migrations/0028_add_list_covering_index.py
 create mode 100644 patchwork/migrations/0029_add_submission_covering_index.py

I think the lesson is that making dramatic improvements to performance of your Django based app does not mean you have to write a lot of code or revert to raw SQL or abandon your ORM. In fact, use it properly and you can get a looong way. It’s just that to use it properly, you’re going to have to understand the layer below the ORM, and not just treat the database as a magic black box.

ccache and op-build

You may have heard of ccache (Compiler Cache) which saves you heaps of real world time when rebuilding a source tree that is rather similar to one you’ve recently built before. It’s really useful in buildroot based projects where you’re building similar trees, or have done a minor bump of some components.

In trying to find a commit which introduced a bug in op-build (OpenPOWER firmware), I noticed that hostboot wasn’t being built using ccache and we were always doing a full build. So, I started digging into it.

It turns out that a bunch of the perl scripts for parsing the Machine Readable Workbook XML in hostboot did a bunch of things like foreach $key (%hash) – which means that the code iterates over the items in hash order rather than an order that would produce predictable output such as “attribute name” or something. So… much messing with that later, I had hostboot generating the same output for the same input on every build.

Next step was to work out why I was still getting a lot of CCACHE misses. It turns out the default ccache size is 5GB. A full hostboot build uses around 7.1GB of that.

So, if building op-build with CCACHE, be sure to set both BR2_CCACHE=y in your config as well as something like BR2_CCACHE_INITIAL_SETUP="--max-size 20G"

Hopefully my patches hit hostboot and op-build soon.

A (simplified) view of OpenPOWER Firmware Development

I’ve been working on trying to better document the whole flow of code that goes into a build of firmware for an OpenPOWER machine. This is partially to help those not familiar with it get a better grasp of the sheer scale of what goes into that 32/64MB of flash.

I also wanted to convey the components that we heavily re-used from other Open Source projects, what parts are still “IBM internal” (as they relate to the open source workflow) and which bits are primarily contributed to by IBMers (at least at this point in time).

As such, let’s start with the legend of the diagram:

Now, the diagram:

Simplified development flow for OpenPOWER firmware

The end thing that a user with a machine will download and apply (or that comes shipped with a box) is the purple “Installable Firmware Release” nodes (bottom center). In this diagram, there are 4 of them. One for POWER9 systems such as the just-announced AC922 system (this is the “OP910 Release” node, which is the witherspoon_defconfig in the op-build tree); one for the p9dsu platform (p9dsu_defconfig in op-build) and one is for IBM FSP based systems such as the S812L and S822L systems (or S812/S822 in OPAL mode).

There are more platforms out there, but this diagram is meant to be simplified. The key difference with the p9dsu platform is that this is produced by somebody other than IBM.

All of these releases are based off the upstream op-build project, op-build is the light blue box in the center of the diagram. We do regular X.Y releases and sometimes do X.Y.Z releases. It’s primarily a pull request based workflow currently, so everything goes via a pull request. The op-build project brings together all the POWER specific firmware components (pretty much everything in every other light blue/blue box) along with a Linux kernel and buildroot.

The kernel and buildroot are the two big yellow boxes on the top right. Buildroot brings together a lot of open source components that are in our firmware image (including some power specific ones that we get through upstream buildroot).

For Linux, this is a pretty simplified view of the process, but we primarily ship the stable tree (with maybe up to half a dozen patches).

The skiboot and petitboot components both use a mailing list based workflow (similar to kernel) as well as X.Y and X.Y.Z releases (again, similar to the linux kernel).

On the far left of the diagram, we have Hostboot, SBE and OCC. These are three firmware components that come from the traditional IBM POWER Firmware group, and are shared with the IBM non-OpenPOWER POWER systems (“traditional” POWER). These components have part of their code from from an (internal) repository called “ekb” which also goes into a (very) low level debug tool and the FSP based systems. There’s also an (internal) gerrit instance that’s the primary place where code review/development discussions are for these components.

In future posts, I’ll probably delve into more specifics of the current development process, and how we may try and change things for the better.

How I do email (at work)

Recently, I blogged on my home email setup and in that post, I hinted that my work setup was rather different. I have entirely separate computing devices for work and personal, a setup I strongly recommend. This also lets me “go home” from work even when working from home, I use a different physical machine!

Since I work for IBM I have (at least) two email accounts for work: a Lotus Notes one and a internet standards compliant one. It’s “easy” enough to get the Notes one to forward to the standards compliant one, from which I can just use fetchmail or similar to pull down mail.

I run mail through a rather simple procmail script: it de-mangles some URL mangling that can happen in the current IBM email infrastructure, runs things through SpamAssassin and deliver to a date based Maildir (or one giant pile for spam).

My ~/.procmailrc looks something like this:

LOGFILE=$HOME/mail_log
LOGABSTRACT=yes

DATE=`date +"%Y%m"`
MAILDIR=Maildir/INBOX
DEFAULT=$DATE/

:0fw
| magic_script_to_unmangle_things

:0fw
| spamc

:0
* ^X-Spam-Status: Yes
$HOME/Maildir/junkmail/incoming/

I use tail -f mail_log as a really dumb kind of biff replacement.

Now, what do I read and write mail with? Notmuch! It is the only thing that even comes close to being able to deal with a decent flow of mail. I have a couple of saved searches just to track how much mail I pull in a day/week. Today (on Monday), it says 442 today and 10,403 over the past week.

For the most part, my workflow is kind of INBOX-ZERO like, except that I currently view victory as INBOX 2000. Most mail does go into my INBOX, the notable exceptions are two main mailing lists I’m subscribed to mostly as FYI and to search/find things when needed. Those are the Linux Kernel Mailing List (LKML) and the buildroot mailing list. Why notmuch rather than just searching the web for mailing list archives? Notmuch can return the result of a query in less time it takes light to get to and from the United States in ideal conditions.

For work, I don’t sync my mail anywhere. It’s just on my laptop. Not having it on my phone is a feature. I have a notmuch post-new hook that does some initial tagging of mail, and as such I have this in my ~/.notmuch-config:

[new]
tags=new;

My post-new hook looks like this:

#!/bin/bash

# immediately archive all messages from "me"
notmuch tag -new -- tag:new and from:stewart@linux.vnet.ibm.com

# tag all message from lists
notmuch tag +devicetree +list -- tag:new and to:devicetree@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:devicetree and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread +list -new -- tag:new and tag:devicetree and not to:stewart@linux.vnet.ibm.com

notmuch tag +linuxppc +list -- tag:new and to:linuxppc-dev@lists.ozlabs.org
notmuch tag +linuxppc +list -- tag:new and cc:linuxppc-dev@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:linuxppc
notmuch tag +openbmc +list -- tag:new and to:openbmc@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:openbmc

notmuch tag +lkml +list -- tag:new and to:linux-kernel@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:lkml and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:lkml and not tag:linuxppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemuppc +list -- tag:new and to:qemu-ppc@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemuppc and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemuppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemu +list -- tag:new and to:qemu-devel@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemu and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemu and not to:stewart@linux.vnet.ibm.com

notmuch tag +buildroot +list -- tag:new and to:buildroot@buildroot.org
notmuch tag +buildroot +list -- tag:new and to:buildroot@busybox.net
notmuch tag +buildroot +list -- tag:newa nd to:buildroot@uclibc.org
notmuch tag +inbox +unread -new -- tag:new and tag:buildroot and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:buildroot and not to:stewart@linux.vnet.ibm.com

notmuch tag +ibmbugzilla -- tag:new and from:bugzilla@us.ibm.com

# finally, retag all "new" messages "inbox" and "unread"
notmuch tag +inbox +unread -new -- tag:new

This leaves me with both an inbox and a listinbox. I do not look at the overwhelming majority of mail that hits the listinbox – It’s mostly for following up on individual things. If I started to need to care more about specific topics, I’d probably add something in there for them so I could easily find them.

My notmuch emacs setup has a bunch of saved searches, so my notmuch-hello screen looks something like this:

This gets me a bit of a state-of-the-world-of-email-to-look-at view for the day. I’ll often have meetings first thing in the morning that may reference email I haven’t looked at yet, and this generally lets me quickly find mail related to the problems of the day and start being productive.

How I do email (at home)

I thought I might write something up on how I’ve been doing email both at home and at work. I very much on purpose keep the two completely separate, and have slightly different use cases for both of them.

For work, I do not want mail on my phone. For personal mail, it turns out I do want this on my phone, which is currently an Android phone. Since my work and personal email is very separate, the volume of mail is really, really different. Personal mail is maybe a couple of dozen a day at most. Work is… orders of magnitude more.

Considering I generally prefer free software to non-free software, K9 Mail is the way I go on my phone. I have it set up to point at the IMAP and SMTP servers of my mail provider (FastMail). I also have a google account, and the gmail app works fine for the few bits of mail that go there instead of my regular account.

For my mail accounts, I do an INBOX ZERO like approach (in reality, I’m pretty much nowhere near zero, but today I learned I’m a lot closer than many colleagues). This means I read / respond / do / ignore mail and then move it to an ARCHIVE folder. K9 and Gmail both have the ability to do this easily, so it works well.

Additionally though, I don’t want to care about limits on storage (i.e. expire mail from the server after X days), nor do I want to rely on “the cloud” to be the only copy of things. I also don’t want to have to upload any of past mail I may be keeping around. I also generally prefer to use notmuch as a mail client on a computer.

For those not familiar with notmuch, it does tags on mail in Maildir, is extremely fast and can actually cope with a quantity of mail. It also has this “archive”/INBOX ZERO workflow which I like.

In order to get mail from FastMail and Gmail onto a machine, I use offlineimap. An important thing to do is to set “status_backend = sqlite” for each Account. It turns out I first hacked on sqlite for offlineimap status a bit over ten years ago – time flies. For each Account I also set presynchook = ~/Maildir/maildir-notmuch-presync (below) and a postsynchook = notmuch new. The presynchook is run before we sync, and its job is to move files around based on the tags in notmuch and the postsynchook lets notmuch catch any new mail that’s been fetched.

My maildir-notmuch-presync hook script is:

#!/bin/bash
notmuch search --output=files not tag:inbox and folder:fastmail/INBOX|xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/fastmail/Archive/cur/"

notmuch search --output=files folder:fastmail/INBOX and tag:spam |xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/fastmail/Spam/cur/"
ARCHIVE_DIR=$HOME/Maildir/INBOX/`date +"%Y%m"`/cur/
mkdir -p $ARCHIVE_DIR
notmuch search --output=files folder:fastmail/Archive and date:..90d and not tag:flagged | xargs -I'{}' mv '{}' "$ARCHIVE_DIR"

# Gmail
notmuch search --output=files not tag:inbox and folder:gmail/INBOX|grep 'INBOX/gmail/INBOX/' | xargs -I'{}' rm '{}'
notmuch search --output=files folder:gmail/INBOX and tag:spam |xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/gmail/[Gmail].Spam/cur/"

So This keeps 90 days of mail on the fastmail server, and archives older mail off into month based archive dirs. This is simply to keep directory sizes not too large, you could put everything in one directory… but at some point that gets a bit silly.

I don’t think this is all the most optimal setup I could have, but it does let me read and answer mail on my phone and desktop (as well as use a web client if I want to). There is a bit of needless copying of messages by offlineimap under certain circumstances, but I don’t get enough personal mail for it to be a problem.

Ten years of libeatmydata!

So, ten years ago (how is that even possible… it seems like it was just a couple of years ago), there was the first commit in the libeatmydata repository (now in git on github rather than in bzr on launchpad). The first implementation was literally just this:

#include <sys/types.h>
#include 
#include 
#include <sys/stat.h>
#include 

int errno;

int fsync(int fd)
{
       errno=0;
       return 0;
}

Soooo…. kind of incredibly simple. But, hey, it worked! Little did I know, that these two lines of code were going to grow into 166 lines of C in order to do it a bit more “properly”.

My initial use case was making the MySQL test suite run faster: 30% faster back then! In fact, it was better than using tmpfs! It’s still used for that (even though I no longer hack on MySQL with any regularity), see github issue #1 for a recent bug that cropped up.

Since then, I’m aware of eatmydata being used to build entire operating systems and in production in way too many places (on way too many machines). The probability that any given human who’s used a computer in the past 10 years has used libeatmydata, used a package built with it or used a service with it running somewhere in production is so close to 1 that I don’t want to think about it.

Well… here’s to the next ten years of eating data!

Installing Windows on a USB key

For some unknown reason, the Windows installer doesn’t let you install to a USB key. Luckily, there’s a simple workaround. It turns out that only the very first step of installation cares about that. So, if you can fool it (say, by running in qemu), you can have a USB key with a Windows install rather than having to dual boot on your hard disk (e.g. if you run Linux and want all that fast in-built SSD space for Linux)

  1. Download the Windows Installer: https://www.microsoft.com/en-au/software-download/windows10ISO
  2. Start the installer in a VM, with the USB key passed through to the VM as the hard disk (or use a disk image the same size as your USB key for transfer with a utility such as ‘dd’ later). e.g. do:
    qemu-img create -f raw win-installed.img 50G
    
    qemu-system-x86_64 --enable-kvm -m 8G -cdrom Downloads/Win10_1709_English_x64.iso -hda win-installed.img -boot d
  3. At the first reboot of the installer, instead of letting it boot, stop the VM. You are going to copy the install at this state to the USB key.
  4. Boot from the USB key, go through the rest of the installer. You’re done!

Updating Intel Management Engine firmware on a Lenovo without a Windows Install

This is how I updated my Intel ME firmware on my Lenovo X1 Carbon Gen 4 (reports say this also has worked for Gen5 machines). These instructions are pretty strongly inspired by https://news.ycombinator.com/item?id=15744152

Why? Intel security advisory and CVE-2017-5705, CVE-2017-5708, CVE-2017-5711, and CVE-2017-5712 should be reason enough.

You will need:

  • To download about 3.5GB of stuff
  • A USB key
  • Linux installed
  • WINE or a Windows box to run two executables (because self extracting archives are a thing on Windows apparently)
  • A bit of technical know-how. A shell prompt shouldn’t scare you too hard.

Steps:

  1. Go to https://www.microsoft.com/en-au/software-download/windows10ISO and download the 32-bit ISO.
  2. Mount the ISO as a loopback device (e.g. by right clicking and choosing to mount, or by doing ‘sudo mount -o loop,ro file.iso /mnt’
  3. Go to Lenovo web site for Drivers & Software for your laptop, under Chipset, there’s ME Firmware and Software downloads You will need both. It looks like this:
  4. Run both exe files with WINE or on a windows box to extract the archives, you do not need to run the installers at the end.
  5. you now need to extract the management engine drivers. You can do this in  ~/.wine/drive_c/DRIVERS/WIN/AMT, with “cabextract SetupME.exe” or (as suggested in the comments) you can use the innoextract utility (from linux) to extract them (a quick check shows this to work)
  6. Save off HECI_REL folder, it’s the only extracted thing you’ll need.
  7. Go and install https://wimlib.net/ – we’re going to use it to create the boot disk. (it may be packaged for your distro).
    If you don’t have the path /usr/lib/syslinux/modules/bios on your system but you do have /usr/share/syslinux/modules/bios – you will need to change a bit of the file programs/mkwinpeimg.in to point to the /usr/share locations rather than /usr/lib before you install wimlib. This probably isn’t needed if you’re installing from packages, but may be requried if you’re on, say, Fedora.
  8. Copy ~/.wine/drive_c/DRIVERS to a new folder, e.g. “winpe_overlay” (or copy from the Windows box you extracted things on)
  9. Use mkwinpeimg to create the boot disk, pointing it to the mounted Windows 10 installer and the “winpe_overlay”:
    mkwinpeimg -W /path/to/mounted/windows10-32bit-installer/ -O winpe_overlay disk.img
  10. Use ‘dd’ to write it to your USB key
  11. Reboot, go into BIOS and turn Secure Boot OFF, Legacy BIOS ON, and AMT ON.
  12. Boot off the USB disk you created.
  13. In the command prompt of the booted WinPE environment, run the following to start the update:
    cd \
    cd HECI_REL\win10
    drvload heci.inf
    cd \
    cd win\me
    MEUpdate.cmd

    It should look something like this:

  14. Reboot, go back into BIOS and change your settings back to how you started.

ZMODEM saves the day! Or, why my firmware for a machine with a CPU from 2017 contains a serial file transfer protocol from the 1980s

Recently, I added the package lrzsz to op-build in this commit. This package provides the rz and sz commands – for receive zmodem and send zmodem respectively. For those who don’t know, op-build builds a firmware image for OpenPOWER machines, and adding this package adds the commands to the petitboot shell (the busybox environment you get when you “exit to shell” from the boot menu).

For those who aren’t familiar with ZMODEM, you should first get off my lawn, and secondly, know that it’s a method for sending/receiving files over something akin to a serial port, e.g. a modem. The basic protocol is “I want to send you this file named FOO”, “okay, I would like to receive it”, “here’s some data and a checksum, did you get it and does it match the checksum?”, “yes!”, “okay, great, here’s the next bit” until the file is transferred. Importantly, it has a provision for “no, I didn’t get that right” and for bits to be resent.

The one thing that pretty much always somewhat works on a computer is a serial port (or something that looks like a serial port to software). When you connect to the IPMI console (“ipmitool sol activate”), the host sees this as a serial port that it pumps bytes over. With OpenBMC, you can actually connect to this serial port via SSH.

When diagnosing weird problems during firmware bringup such as “why doesn’t PCI work” or “why does my network adapter not work” (or, perhaps, somebody helpfully didn’t plug the network cable in), it can be useful to copy off a bunch of debug information from the machine.

You may say “can’t you just print the log file to the screen and save it?” and you’d be right, you can do that for text – it’s really annoying for binary data though. Plus, there have been bugs in the console implementations on pretty much every BMC I’ve ever used that makes them not as reliable as you’d like.

So, how could we transfer a file over the serial connection we have to the machine? The same way we did on a BBS! Enter ZMODEM. The error recovery properties are perfect in this situation.

So, how do you use it? I’ve found two ways that work well: GNU screen and zssh. For GNU Screen, you’ll want to configure it to catch zmodem traffic by doing “control-a:zmodem catch<ENTER>” (you need the colon there). After that if you execute “sz” on the host and the rest should be obvious! If you wanted to send a file to the host, run “rz” rather than “sz”.

op-test-framework: Let’s break the console!

One of the things I’ve been working on fairly quietly is the test suite for OpenPOWER firmware: op-test-framework. I’ve approach things I’m hacking on from the goal of “when I merge patches into skiboot, can I be confident that I haven’t merged something that’s broken existing functionality?”

By testing host firmware, we incidentally (as well as on purpose) test a whole bunch of BMC functionality. One bit of functionality we rely on a lot is the host “serial” console. Typically, this is exposed to the user over IPMI Serial Over LAN (SOL), or on OpenBMC it’s also exposed as something you can ssh to (which proves to be both faster and more reliable than IPMI, not to mention there’s some remote semblance of security).

When running through some tests, I noticed something pretty odd, it appeared as though we were sometimes missing some console output on larger IOs. This usually isn’t a problem as when we’re using expect(1) (or the python equivalent pexpect) we end up having all sorts of delays here there and everywhere to work around all the terrible things you hope you never learn. So… how could I test that? Well.. what about checking the output of something like dd if=/dev/zero bs=1024 count=16|hexdump -C to see if we get the full output?

Time to add a test to op-test-framework! Adding such a test is pretty easy. If we look at the source of the test I added, we can see what happens (source here).

class Console():
    bs = 1024
    count = 8
    def setUp(self):
        conf = OpTestConfiguration.conf
        self.bmc = conf.bmc()
        self.system = conf.system()

    def runTest(self):
        self.system.goto_state(OpSystemState.PETITBOOT_SHELL)
        console = self.bmc.get_host_console()
        self.system.host_console_unique_prompt()
        bs = self.bs
        count = self.count
        self.assertTrue( (bs*count)%16 == 0, "Bug in test writer. Must be multiple of 16 bytes: bs %u count %u / 16 = %u" % (bs, count, (bs*count)%16))
        try:
            zeros = console.run_command("dd if=/dev/zero bs=%u count=%u|hexdump -C -v" % (bs, count), timeout=120)
        except CommandFailed as cf:
            self.assertEqual(cf.exitcode, 0)
        self.assertTrue( len(zeros) == 3+(count*bs)/16, "Unexpected length of zeros %u" % (len(zeros)))

First thing you’ll notice is that this looks like a Python unittest. It’s because it is. The unittest infrastructure was a path of least resistance, so we started with it. This class isn’t the one that’s actually run, we do a little bit of inheritance magic in order to run the same test with different parameters (see https://github.com/open-power/op-test-framework/blob/6c74fb0fb0993ae8ae1a7aa62ec58e57c0080686/testcases/Console.py#L50)

class Console8k(Console, unittest.TestCase):
    bs = 1024
    count = 8

class Console16k(Console, unittest.TestCase):
    bs = 1024
    count = 16

class Console32k(Console, unittest.TestCase):
    bs = 1024
    count = 32

The setUp() function is pure boiler plate, we grab some objects from the configuration of the test run, namely information about the BMC and the system itself, so we can do things to both. The real magic happens in runTest().

op-test-framework tracks the state of the machine being tested across each test. This means that if we’re executing 101 tests in the petitboot shell, we don’t need to do 101 separate boots to petitboot. The self.system.goto_state(OpSystemState.PETITBOOT_SHELL) statement says “Please ensure the system is booted to the petitboot shell”. Other states include OFF (obvious) and OS, which is when the machine is booted to the OS.

The next two lines ensure we can run commands on the console (where console is IPMI Serial over LAN or other similar connection, such as the SSH console provided by OpenBMC):

console = self.bmc.get_host_console()
self.system.host_console_unique_prompt()

The host_console_unique_prompt() call is a bit ugly, and I’m hoping we fix the APIs so that this isn’t needed. Basically, it sets things up so that pexpect will work properly.

The bit that does the work is the try/except block along with the assertTrue. We don’t currently check that the content is all correct, we just check we got the right *amount* of content.

It turns out, this check is enough to reveal a bug that turns out to be deep in the core Linux TTY layer, and has caused Jeremy some amount of fun (for certain values of fun).

Want to know more about how the console works? Jeremy blogged on it.

Fedora 25 + Lenovo X1 Carbon 4th Gen + OneLink+ Dock

As of May 29th 2017, if you want to do something crazy like use *both* ports of the OneLink+ dock to use monitors that aren’t 640×480 (but aren’t 4k), you’re going to need a 4.11 kernel, as everything else (for example 4.10.17, which is the latest in Fedora 25 at time of writing) will end you in a world of horrible, horrible pain.

To install, run this:

sudo dnf install \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-core-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-cross-headers-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-devel-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-modules-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-tools-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/kernel-tools-libs-4.11.3-200.fc25.x86_64.rpm \
https://kojipkgs.fedoraproject.org//packages/kernel/4.11.3/200.fc25/x86_64/perf-4.11.3-200.fc25.x86_64.rpm

This grabs a kernel that’s sitting in testing and isn’t yet in the main repositories. However, I can now see things on monitors, rather than 0 to 1 monitor (most often 0). You can also dock/undock and everything doesn’t crash in a pile of fail.

I remember a time when you could fairly reliably buy Intel hardware and have it “just work” with the latest distros. It’s unfortunate that this is no longer the case, and it’s more of a case of “wait six months and you’ll still have problems”.

Urgh.

(at least Wayland and X were bug for bug compatible?)

API, ABI and backwards compatibility are a hard necessity

Recently, I was reading a thread on LKML on a proposal to change the behavior of the open system call when confronted with unknown flags. The thread is worth a read as the topic of augmenting things that exist probably by accident to be “better” is always interesting, as is the definition of “better”.

Keeping API and/or ABI compatibility is something that isn’t a new problem, and it’s one that people are pretty good at sometimes messing up.

This problem does not go away just because “we have cloud now”. In any distributed system, in order to upgrade it (or “be agile” as the kids are calling it), you by definition are going to have either downtime or at least two versions running concurrently. Thus, you have to have your interfaces/RPCs/APIs/ABIs/protocols/whatever cope with changes.

You cannot instantly upgrade the world, it happens gradually. You also have to design for at least three concurrent versions running. One is the original, the second is your upgrade, your third is the urgent fix because the upgrade is quite broken in some new way you only discover in production.

So, the way you do this? Never ever EVER design for N-1 compatibility only. Design for going back a long way, much longer than you officially support. You want to have a design and programming culture of backwards compatibility to ensure you can both do new and exciting things and experiment off to the side.

It’s worth going and rereading Rusty’s API levels posts from 2008:

j-core + Numato Spartan 6 board + Fedora 25

A couple of changes to http://j-core.org/#download_bitstream made it easy for me to get going:

  • In order to make ModemManager not try to think it’s a “modem”, create /etc/udev/rules.d/52-numato.rules with the following content:
    # Make ModemManager ignore Numato FPGA board
    ATTRS{idVendor}=="2a19", ATTRS{idProduct}=="1002", ENV{ID_MM_DEVICE_IGNORE}="1"
  • You will need to install python3-pyserial and minicom
  • The minicom command line i used was:
    sudo stty -F /dev/ttyACM0 -crtscts && minicom -b 115200 -D /dev/ttyACM0

and along with the instructions on j-core.org, I got it to load a known good build.

Books referenced in my Organizational Change talk at LCA2017

All of these are available as Kindle books, but I’m sure you can get 3D copies too:

The Five Dysfunctions of a Team: A Leadership Fable by Patrick M. Lencioni
Leading Change by John P. Kotter
Who Says Elephants Can’t Dance? Louis V. Gerstner Jr.
Nonviolent Communication: A language of Life by Marshall B. Rosenberg and Arun Gandhi

Fast Reset, Trusted Boot and the security of /sbin/reboot

In OpenPOWER land, we’ve been doing some work on Secure and Trusted Boot while at the same time doing some work on what we call fast-reset (or fast-reboot, depending on exactly what mood someone was in at any particular time…. we should start being a bit more consistent).

The basic idea for fast-reset is that when the OS calls OPAL reboot, we gather all the threads in the system using a combination of patching the reset vector and soft-resetting them, then cleanup a few bits of hardware (we do re-probe PCIe for example), and reload & restart the bootloader (petitboot).

What this means is that typing “reboot” on the command line goes from a ~90-120+ second affair (through firmware to petitboot, linux distros still take ages to shut themselves down) down to about a 20 second affair (to petitboot).

If you’re running a (very) recent skiboot, you can enable it with a special hidden NVRAM configuration option (although we’ll likely enable it by default pretty soon, it’s proving remarkably solid). If you want to know what that NVRAM option is… Use the source, Luke! (or git history, but I’ve yet to see a neat Star Wars reference referring to git commit logs).

So, there’s nothing like a demo. Here’s a demo with Ubuntu running off an NVMe drive on an IBM S822LC for HPC (otherwise known as Minsky or Garrison) which was running the HTX hardware exerciser, through fast-reboot back into Petitboot and then booting into Ubuntu and auto-starting the exerciser (HTX) again.

Apart from being stupidly quick when compared to a full IPL (Initial Program Load – i.e. boot), since we’re not rebooting out of band, we have no way to reset the TPM, so if you’re measuring boot, each subsequent fast-reset will result in a different set of measurements.

This may be slightly confusing, but it’s not really a problem. You see, if a machine is compromised, there’s nothing stopping me replacing /sbin/reboot with something that just prints things to the console that look like your machine rebooted but in fact left my rootkit running. Indeed, fast-reset and a full IPL should measure different values in the TPM.

It also means that if you ever want to re-establish trust in your OS, never do a reboot from the host – always reboot out of band (e.g. from a BMC). This, of course, means you’re trusting your BMC to not be compromised, which I wouldn’t necessarily do if you suspect your host has been.

Windows NT4 for PowerPC guest on OPAL on POWER8 in qemu

Sometimes, programming is just for fun. This is what PREPHV is for Andrei Warkentin. To quote the README:

“This is mostly a huge ugly hack, derived from my
ppc64le_hello code. The running philosophy here is
to throw things together late at night with my family
asleep and see how far I get without a real design
or without a real desire to implement boring things
like IDE (*sigh*) emulation”

Since my day job is maintaining the firmware that it runs on, I decided to have a go (it also ties in with the retro stuff I’ve been blogging about). So…

screenshot-from-2016-10-30-17-22-20and I’m off! (yes, this is the very latest qemu and skiboot):

screenshot-from-2016-10-30-17-23-32screenshot-from-2016-10-30-17-23-48Yes, prephv does clear all thirty two megabytes of guest memory

screenshot-from-2016-10-30-17-24-15A quick diversion, if you try Windows NT 3.51 for PowerPC, you get this:

screenshot-from-2016-10-30-18-17-35

But on NT4, you continue unharmed:

screenshot-from-2016-10-30-17-22-32A sign I needed to hack my filesystem of bits of NT installer bits a bit more:

screenshot-from-2016-10-30-17-22-45But, on my next try:

screenshot-from-2016-10-30-17-25-26Well… looks like there’s an instruction that needs to be emulated (and there’s no code to currently do that). Mind you… this is decently far into booting before we hit anything fatal, which is a pretty impressive effort – and it is tempting to continue and see if it’ll run on real hardware and if it could be made to work well enough to not find any disks :)

Failed Retro emulation attempts

For reasons that should escape everybody, I went back and looked at some old Operating Systems a little while ago: OS/2 Warp, Windows 3.11 and Microsoft Chicago. So, I went on a little adventure this weekend, largely in failure though.

Windows NT 3.51

This was the first version (err… no, second I think) of Windows NT that I ever used.

Lesson 1: qemu doesn’t expose a SCSI adapter that isn’t virtio-scsi (and I have a feeling there aren’t Windows NT 3.51 installer driver floppies for virtio-scsi)

screenshot_winnt3-1_2016-10-29_191707Lesson 2: OMG I’m so glad I don’t have to wait for things to be read off floppy disks anymore:

screenshot_winnt3-51_2016-10-29_193833Lesson 3: I’d forgotten that the Windows directory on NT 3.51 was different to every other Windows NT version, being \WINNT35

screenshot_winnt3-51_2016-10-29_194040screenshot_winnt3-51_2016-10-29_194139Lesson 3: Yeah, sometimes there’s just fail.

screenshot_winnt3-51_2016-10-29_194147

Windows NT 4.0

This brought the UI of Windows 95 to Windows NT. It was a thing. It required a fairly beefy PC for the day, but it could use two CPUs if you were that amazingly rich (dual Pentium Pro was a thing)

Lesson 1: Windows NT 4 does not like 8GB disks. My idea of “creating a small disk for a VM for an old OS as it probably won’t work well with a 20GB disk” needs to be adjusted. I’m writing this on a system with 8 times more RAM than what I ended up using for a disk for Windows NT 4.

screenshot_winnt4-0_2016-10-29_192840But hey, back to \WINNT rather than \WINNT35 or \WINDOWS

screenshot_winnt4-0_2016-10-29_192937

Lesson 2: Sometimes, full system emulation turns out to be a better idea:

screenshot_winnt4-0_2016-10-29_192418screenshot_winnt4-0_2016-10-29_192304Lesson 3: Remember when Windows couldn’t actually format NTFS in the installer and it installed to FAT and then converted to NTFS? No? Well, aren’t you lucky.

screenshot_winnt4-0_2016-10-29_193102screenshot_winnt4-0_2016-10-29_193235Apple Rhapsody DR2

Before there was MacOS X, there was a project called Rhapsody. This was to take NeXTStep (from NeXT, which Apple bought to get both NeXTStep and Steve Jobs as every internal “let’s replace the aging MacOS” project had utterly failed for the past ten years). Rhapsody was not going to be backwards compatible until everybody said that was a terrible idea and the Blue Box was added (known as Classic) – basically, a para-virtualized VM running the old MacOS 9.

Anyway, for the first two developer releases, it was also available on x86 (not just PowerPC). This was probably because a PowerPC port to Macs was a lot newer than the x86 port.

So, I dusted off the (virtual) Boot and Driver floppies and fired up qemu…..

screenshot_rhapsodydr2_2016-10-29_172732Yeah, MacOS X got a better installer…

screenshot_rhapsodydr2_2016-10-29_173227Hopeful!

screenshot_rhapsodydr2_2016-10-29_182218This was after I decided that using KVM was a bad Idea:

screenshot_rhapsodydr2_2016-10-29_172143

screenshot_rhapsodydr2_2016-10-29_173333Nope… and this is where we stop. There seems to be some issue with ATA drivers? I honestly can’t be bothered to debug it (although… for the PowerPC version… maybe).

MacOS 9.2

Well.. this goes a lot better now thanks to a whole bunch of patches hitting upstream Qemu recently (thanks Ben!)

screenshot-from-2016-10-29-20-37-24screenshot-from-2016-10-29-20-42-10Yeah, I was kind of tempted to set up Outlook Express to read my email…. But running MacOS 9 was way too successful, so I had to stop there :)

Workaround for opal-prd using 100% CPU

opal-prd is the Processor RunTime Diagnostics daemon, the userspace process that on OpenPower systems is responsible for some of the runtime diagnostics. Although a userspace process, it memory maps (as in mmap) in some code loaded by early firmware (Hostboot) called the HostBoot RunTime (HBRT) and runs it, using calls to the kernel to accomplish any needed operations (e.g. reading/writing registers inside the chip). Running this in user space gives us benefits such as being able to attach gdb, recover from segfaults etc.

The reason this code is shipped as part of firmware rather than as an OS package is that it is very system specific, and it would be a giant pain to update a package in every Linux distribution every time a new chip or machine was introduced.

Anyway, there’s a bug in the HBRT code that means if there’s an ECC error in the HBEL (HostBoot Error Log) partition in the system flash (“bios” or “pnor”… the flash where your system firmware lives), the opal-prd process may get stuck chewing up 100% CPU and not doing anything useful. There’s https://github.com/open-power/hostboot/issues/67 for this.

You will notice a problem if the opal-prd process is using 100% CPU and the last log messages are something like:

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(the parameters to readFromDevice may differ)

Luckily, there’s a simple workaround to fix it all up! You will need the pflash utility. Primarily, pflash is meant only for developers and those who know what they’re doing. You can turn your computer into a brick using it.

pflash is packaged in Ubuntu 16.10 and RHEL 7.3, but you can otherwise build it from source easily enough:

git clone https://github.com/open-power/skiboot.git
cd skiboot/external/pflash
make

Now that you have pflash, you just need to erase the HBEL partition and write (ECC) zeros:

dd if=/dev/zero of=/tmp/hbel bs=1 count=147456
pflash -P HBEL -e
pflash -P HBEL -p /tmp/hbel

Note: you cannot just erase the partition or use the pflash option to do an ECC erase, you may render your system unbootable if you get it wrong.

After that, restart opal-prd however your distro handles restarting daemons (e.g. systemctl restart opal-prd.service) and all should be well.

MySQL removes the FRM (7 years after Drizzle did)

The new MySQL 8.0.0 milestone release that was recently announced brings something that has been a looooong time coming: the removal of the FRM file. I was the one who implemented this in Drizzle way back in 2009 (July 28th 2009 according to Brian)- and I may have had a flashback to removing the tentacles of the FRM when reading the MySQL 8.0.0 announcement.

As an idea for how long this has been on the cards, I’ll quote Brian from when we removed it in Drizzle:

We have been talking about getting rid of FRM since around 2003. I remember a drive up to northern Finland with Kaj Arnö, where we spent an hour talking about this. I, David, and MontyW have talked about this for years.

http://krow.livejournal.com/642329.html

Soo… it was a known problem for at least thirteen years. One of the issues removing it was how pervasive all of the FRM related things were. I shudder at the mention of “pack_flag” and Jay Pipes probably does too.

At the time, we tried a couple of approaches as to how things should look. Our philosophy with Drizzle was that it should get out of the way at let the storage engines be the storage engines and not try to second guess them or keep track of things behind their back. I still think that was the correct architectural approach: the role of Drizzle was to put SQL on top of a storage engine, not to also be one itself.

Looking at the MySQL code, there’s one giant commit 31350e8ab15179acab5197fa29d12686b1efd6ef. I do mean giant too, the diffstat is amazing:

 786 files changed, 58471 insertions(+), 25586 deletions(-)

How anyone even remotely did code review on that I have absolutely no idea. I know the only way I could get it to work in Drizzle was to do it incrementally, a series of patches that gradually chiseled out what needed to be taken out so I could put it an API and the protobuf code.

Oh, and in case you’re wondering:

- uint offset,pack_flag;
+ uint offset;

Thank goodness. Now, you may not appreciate that as much as I might, but pack_flag was not the height of design, it was… pretty much a catchalll for some kind of data about a field that wasn’t something that already had a field in the FRM. So it may include information on if the field could be null or not, if it’s decimal, how many bytes an integer takes, that it’s a number and how many oh, just don’t ask.

Also gone is the weird interval_id and a whole bunch of limitations because of the FRM format, including one that I either just discovered or didn’t remember: if you used all 256 characters in an enum, you couldn’t create the table as MySQL would pick either a comma or an unused character to be the separator in the FRM!?!

Also changed is how the MySQL server handles default values. For those not aware, the FRM file contains a static copy of the row containing default values. This means the default values are computed once on table creation and never again (there’s a bunch of work arounds for things like AUTO_INCREMENT and DEFAULT NOW()). The new sql/default_values.cc is where this is done now.

For now at least, table metadata is also written to a file that appears to be JSON format. It’s interesting that a SQL database server is using a schemaless file format to describe schema. It appears that these files exist only for disaster recovery or perhaps portable tablespaces. As such, I’m not entirely convinced they’re needed…. it’s just a thing to get out of sync with what the storage engine thinks and causes extra IO on DDL (as well as forcing the issue that you can’t have MVCC into the data dictionary itself).

What will be interesting is to see the lifting of these various limitations and how MariaDB will cope with that. Basically, unless they switch, we’re going to see some interesting divergence in what you can do in either database.

There’s certainly differences in how MySQL removed the FRM file to the way we did it in Drizzle. Hopefully some of the ideas we had were helpful in coming up with this different approach, as well as an extra seven years of in-production use.

At some point I’ll write something up as to the fate of Drizzle and a bit of a post-mortem, I think I may have finally worked out what I want to say…. but that is a post for another day.