One of the core bits of infrastructure I use as a maintainer is Patchwork (I wrote about making it faster recently). Patchwork tracks patches sent to a mailing list, allowing me as a maintainer to track the state of them (New|Under Review|Changes Requested|Accepted etc), combine them into patch bundles, look at specific series, test results etc.
One of the core bits of software I use is my email client, notmuch. Most other mail clients are laughably slow and clunky, or just plain annoying for absorbing a torrent of mail and being able to deal with it or just plain ignore it but have it searchable locally.
You may think your mail client is better than notmuch, but you’re wrong.
A key feature of notmuch is tagging email. It doesn’t do the traditional “folders” but instead does tags (if you’ve used gmail, you’d be somewhat familiar).
One of my key work flows as a maintainer is looking at what patches are outstanding for a project, and then reviewing them. This should also be a core part of any contributor to a project too. You may think that a tag:unread and to:project-list@foo query would be enough, but that doesn’t correspond with what’s in patchwork.
So, I decided to make a tool that would add tags to messages in notmuch corresponding with the state of the patch in patchwork. This way, I could easily search for “patches marked as New in patchwork” (or Under Review or whatever) and see what I should be reviewing and looking at merging or commenting on.
Logically, this wouldn’t be that hard, just use the (new) Patchwork REST API to get the state of everything and issue the appropriate notmuch commands.
But just going one way isn’t that interesting, I wanted to be able to change the tags in notmuch and have them sync back up to Patchwork. So, I made that part of the tool too.
Introducing pwnm-sync: a tool to sync patchwork and notmuch.
With this tool I can easily see the patchwork state of any patch that I have in my notmuch database. For projects that I’m a maintainer on (i.e. can change the state of patches), If I update the patches of that email and run pwnm-sync again, it’ll update the state in patchwork.
I’ve been using this for a few weeks myself and it’s made my maintainer workflow significantly nicer.
It may also be useful to people who want to find what patches need some review.
The sync time is mostly dependent on how fast your patchwork instance is for API requests. Unfortunately, we need to make some improvements on the Patchwork side of things here, but a full sync of the above takes about 4 minutes for me. You can also add a –epoch option (with a date/time) to say “only fetch things from patchwork since that date” which makes things a lot quicker for incremental syncs. For me, I typically run it with an epoch of a couple of months ago, and that takes ~20-30 seconds to run. In this case, if you’ve locally updated a old patch, it will still sync that change up to patchwork.
It’s a python3 script using the notmuch python bindings, the requests-futures module for asynchronous HTTP requests (so we can have the patchwork server assemble the next page of results while we process the previous one), and a local sqlite3 database to store state in so we can work out what changed locally / server side.
tl;dr: I made Patchwork a lot faster by looking at what database queries were being generated and optimizing them either by making Django produce better queries or by adding better indexes.
Introduction to Patchwork
One of the key bits of infrastructure a bunch of maintainers of Open Source Software use is a tool called Patchwork. We use it for a bunch of OpenPOWER firmware development, several Linux subsystems use it as well as freedesktop.org.
The purpose of Patchwork is to supplement the patches-to-a-mailing-list development work flow. It allows a maintainer to see all the patches that have been posted on the list, How many Acked-by/Reviewed-by/Tested-by replies they have, delegate responsibility for the patch to a co-maintainer, track (and change) the state of the patch (e.g. to “Under Review”, “Changes Requested”, or “Accepted”), and create bundles of patches to help in review and testing.
Since patchwork is an open source project itself, there’s several instances of it out there in common use. One of the main instances is https://patchwork.ozlabs.org/ which is (funnily enough) used by a bunch of people connected to OzLabs for projects that are somewhat connected to OzLabs. e.g. the linuxppc-dev project and the skiboot and petitboot projects. There’s also a kernel.org instance, which is used by some kernel subsystems.
Recent versions of Patchwork have added some pretty cool features such as the ability to integrate with CI systems such as Snowpatch which helps maintainers see if patches submitted are likely to break things.
Unfortunately, there’s also been some complaints that recent version of patchwork have gotten slower than previous ones. This may well be the case, or it could just be that the volume of patches is much higher and there’s load on the database. Anyway, after asking a few questions about what the size and scope was of the patchwork database on ozlabs.org, I went “hrm… this sounds like it shouldn’t really be a problem… perhaps I should look into this”.
Attacking the problem…
Every so often it is revealed that I know a little bit about databases.
Getting a development environment up for Patchwork is amazingly easy thanks to Docker and the great work of the Patchwork maintainers. The only thing you need to load in is an example dataset. I started by importing mail from a few mailing lists I’m subscribed to, which was Good Enough(TM) for an initial look.
Due to how Django forces us to design a database schema though, the suggested method of getting a sample data set will not mirror what occurs in a production system with multiple lists. It’s for this reason that I ended up using a copy of a live dataset for much of my work rather than constructing an artificial one.
Patchwork supports both a MySQL and PostgreSQL database backend. Since the one on ozlabs.org is backed by PostgreSQL, I ended up loading a large dataset into PostgreSQL for most of my work, although I also did some testing with MySQL.
The current patchwork.ozlabs.org instance has a database of around 13GB in side, with about a million patches. You may think this is big, my database brain goes “no, this is actually quite small and everything should be a lot faster than it is even on quite limited hardware”
The problem with ORMs
It turns out that Patchwork is written in Django, an ORM (Object-Relational Mapping) framework in Python – and thus something that pretty effectively obfuscates application code from the SQL being run.
There is one thing that Django misses that could be a pretty big general performance boost to many applications: it doesn’t support composite primary keys. For some databases (e.g. MySQL’s InnoDB engine) the PRIMARY KEY is a clustered index – that is, the physical layout of the rows on disk reflect primary key order. You can use this feature to your advantage and have much higher cache hits of your database pages.
Unfortunately though, we cannot do that with Django, so we lose a bunch of possible performance because of it (especially for queries that are going to have to bring in data from disk). In fact, we’re forced to use an ID field that’ll scatter our rows all over the place rather than do something efficient. You can somewhat get back some of the performance by creating covering indexes, but this costs in terms of index maintenance and disk space.
It should be noted that PostgreSQL doesn’t have a similar concept, although there is a (locking) CLUSTER statement that can (as an offline operation for the table) re-arrange existing rows to be in index order. In my testing, this can give a bit of a boost to performance of some of the Patchwork queries.
With MySQL, you’d look at a bunch of statistics on what pages are being brought in and paged out of the InnoDB buffer pool. With PostgreSQL it’s a bit more complex as it relies heavily on the OS page cache.
My main experience is with MySQL like environment, so I’ve had to re-learn a bunch of PostgreSQL things in this work which was kind of fun. It may be “because of my upbringing” but it seems as if there’s a lot more resources and documentation out in the wild about optimizing MySQL environments than PostgreSQL ones, especially when it comes to documentation around a bunch of things inside the database server. A lot of credit should go to the MySQL Documentation team – I wish the PostgreSQL documentation was up to the same standard.
Another issue is that fetching BLOBs is generally an expensive operation that you want to avoid unless you’re going to use them. Thus, fetching the whole “object” at once isn’t always optimal. The Django query generation appears to be somewhat buggy when it comes to “hey, don’t fetch these columns, I don’t need them”, so you do have to watch what query is produced not just what query you expect to be produced. For example, [01/11] Improve patch listing performance (~3x).
Another issue with Django is how you go from your Python code to an actual SQL query, especially when the produced SQL query is needlessly complex or inefficient. I’ve seen Django always produce an ORDER BY for one table, even when not needed, I’ve also seen it always join tables even when you’re getting no columns from one of them and there’s no way you’re asking for it. In fact, I had to revert to raw SQL for one of my performance improvements as I just couldn’t beat it into submission: [10/11] Be sensible computing project patch counts.
An ORM can be great for getting apps out quickly, or programming in a familiar way. But like many things, an understanding of what is going on underneath is key for extracting maximum performance.
Also, if you ever hear something like “ORM $x doesn’t scale” then maybe that person just hasn’t looked at how to use the ORM better. The same goes for if they say “database $y doesn’t scale”- especially if it’s a long existing relational database such as MySQL or PostgreSQL.
Speeding up listing current patches for a project
Fortunately though, the Django development environment lets you really easily dive into what queries are being generated and (at least roughly) where they’re being generated from. There’s a sidebar in your browser that shows how many SQL queries were needed to generate the page and how long they took. The key to making your application go faster is to run fewer queries in less time.
I was incredibly impressed with how easy it was to see what queries were run, where they were run from, and the EXPLAIN output for them.
By clicking on that SQL button on the right side of your browser, you get this wonderful chart of what queries were executed, when, and how long they took. From this, it is incredibly obvious which query is the most problematic: the one that took more than four seconds!
In the dim dark days of web development, you’d have to turn on a Slow Query Log on the database server and then grep through your source code or some other miserable activity. I am so glad I didn’t have to do that.
This particular query was a real hairy one, the EXPLAIN output from PostgreSQL (and MySQL) was certainly long and involved and would most certainly not feature in the first half of an “Introduction to query optimization” day long workshop. If you haven’t brushed up on various bits of documentation on understanding EXPLAIN, you should! The MySQL EXPLAIN FORMAT=JSON is especially fantastic for getting deep details as to what’s going on with query execution.
The big performance gain here was to have the database be able to execute the query in a much more efficient way by creating two covering indexes for part of the query. To work out what indexes to create, one has to look at the EXPLAIN output and work out why the database is choosing to do either a sequential scan of a large table, or use an index that doesn’t exclude that many rows. In this case, I tweaked the code to slightly change the query that was generated as well as adding a covering index. What we ended up with is something that is dramatically faster.
You’ll notice that it appears that the first query there takes a lot more time but it doesn’t, it just takes a lot more time relative to the main query.
In fact, this particular page is one that people have mentioned at being really, really slow to load. With the main query now about 350 times faster than it was originally, it shouldn’t be a problem anymore.
A future improvement would be to cache the COUNT() for the common case, as it’s pretty easily computed when new patches come in or states change.
The patches that help this particular page have been submitted upstream here:
Now that we can list patches faster, can we make other pages that Patchwork has quicker?
One such page is viewing a patch (or cover letter) that has a lot of comments on it. One feature of Patchwork is that it will display all the email replies to a patch or cover letter in the Web UI. But… this seemed slow
On one of the most commented patches I could find, we ended up executing one hundred and seventy seven SQL queries to view it! If we dove into it, a bunch of the queries looked really really similar…
The problem here is that the Patchwork UI is wanting to find out the name of each person who submitted a comment, and is doing that by querying the ID from a table. What it should be doing instead is a SQL JOIN on the original query and just fetching all that information in one go: make the database server do the work, it’s really good at it.
With that patch, we’re down to a constant number of queries and around a 3x-7x faster time executing them depending if we have a warm cache or not.
The one time I had to use raw SQL
When viewing a project page (such as https://patchwork.ozlabs.org/project/qemu-devel/ ) it displays the number of patches (archived and not archived) for the project. By looking at what SQL queries are executed to collect these numbers, you’ll notice two things. First, here are the queries:
First thing you’ll notice is that they took a loooooong time to execute (more than a second each). The second thing, if you look closer, is that they contain a join which is completely unneeded.
I spent a good long while trying to make Django behave, and I just could not. I believe it’s due to the model having some inheritance in it. Writing the query by hand ended up being the best solution, and it gave a significant performance improvement:
Arguably, a better way would be to precompute the count for the archived/non-archived patches and just display them. I (or someone else who knows more about Django) may want to look at that for a future improvement.
Conclusion and final thoughts
There’s a few more places where there could be some optimizations, but currently I cannot get any single page to take more than between 40-400ms in the database when running on my laptop – and that’s Good Enough(TM) for now.
The next steps are getting these patches through a round or two of review, and then getting them into a Patchwork release and deployed out on patchwork.ozlabs.org and see if people can find any new ways to make things slow.
I think the lesson is that making dramatic improvements to performance of your Django based app does not mean you have to write a lot of code or revert to raw SQL or abandon your ORM. In fact, use it properly and you can get a looong way. It’s just that to use it properly, you’re going to have to understand the layer below the ORM, and not just treat the database as a magic black box.
You may have heard of ccache (Compiler Cache) which saves you heaps of real world time when rebuilding a source tree that is rather similar to one you’ve recently built before. It’s really useful in buildroot based projects where you’re building similar trees, or have done a minor bump of some components.
In trying to find a commit which introduced a bug in op-build (OpenPOWER firmware), I noticed that hostboot wasn’t being built using ccache and we were always doing a full build. So, I started digging into it.
It turns out that a bunch of the perl scripts for parsing the Machine Readable Workbook XML in hostboot did a bunch of things like foreach $key (%hash) – which means that the code iterates over the items in hash order rather than an order that would produce predictable output such as “attribute name” or something. So… much messing with that later, I had hostboot generating the same output for the same input on every build.
Next step was to work out why I was still getting a lot of CCACHE misses. It turns out the default ccache size is 5GB. A full hostboot build uses around 7.1GB of that.
So, if building op-build with CCACHE, be sure to set both BR2_CCACHE=y in your config as well as something like BR2_CCACHE_INITIAL_SETUP="--max-size 20G"
Hopefully my patches hit hostboot and op-build soon.
I’ve been working on trying to better document the whole flow of code that goes into a build of firmware for an OpenPOWER machine. This is partially to help those not familiar with it get a better grasp of the sheer scale of what goes into that 32/64MB of flash.
I also wanted to convey the components that we heavily re-used from other Open Source projects, what parts are still “IBM internal” (as they relate to the open source workflow) and which bits are primarily contributed to by IBMers (at least at this point in time).
As such, let’s start with the legend of the diagram:
Now, the diagram:
The end thing that a user with a machine will download and apply (or that comes shipped with a box) is the purple “Installable Firmware Release” nodes (bottom center). In this diagram, there are 4 of them. One for POWER9 systems such as the just-announced AC922 system (this is the “OP910 Release” node, which is the witherspoon_defconfig in the op-build tree); one for the p9dsu platform (p9dsu_defconfig in op-build) and one is for IBM FSP based systems such as the S812L and S822L systems (or S812/S822 in OPAL mode).
There are more platforms out there, but this diagram is meant to be simplified. The key difference with the p9dsu platform is that this is produced by somebody other than IBM.
All of these releases are based off the upstream op-build project, op-build is the light blue box in the center of the diagram. We do regular X.Y releases and sometimes do X.Y.Z releases. It’s primarily a pull request based workflow currently, so everything goes via a pull request. The op-build project brings together all the POWER specific firmware components (pretty much everything in every other light blue/blue box) along with a Linux kernel and buildroot.
The kernel and buildroot are the two big yellow boxes on the top right. Buildroot brings together a lot of open source components that are in our firmware image (including some power specific ones that we get through upstream buildroot).
For Linux, this is a pretty simplified view of the process, but we primarily ship the stable tree (with maybe up to half a dozen patches).
The skiboot and petitboot components both use a mailing list based workflow (similar to kernel) as well as X.Y and X.Y.Z releases (again, similar to the linux kernel).
On the far left of the diagram, we have Hostboot, SBE and OCC. These are three firmware components that come from the traditional IBM POWER Firmware group, and are shared with the IBM non-OpenPOWER POWER systems (“traditional” POWER). These components have part of their code from from an (internal) repository called “ekb” which also goes into a (very) low level debug tool and the FSP based systems. There’s also an (internal) gerrit instance that’s the primary place where code review/development discussions are for these components.
In future posts, I’ll probably delve into more specifics of the current development process, and how we may try and change things for the better.
Recently, I blogged on my home email setup and in that post, I hinted that my work setup was rather different. I have entirely separate computing devices for work and personal, a setup I strongly recommend. This also lets me “go home” from work even when working from home, I use a different physical machine!
Since I work for IBM I have (at least) two email accounts for work: a Lotus Notes one and a internet standards compliant one. It’s “easy” enough to get the Notes one to forward to the standards compliant one, from which I can just use fetchmail or similar to pull down mail.
I run mail through a rather simple procmail script: it de-mangles some URL mangling that can happen in the current IBM email infrastructure, runs things through SpamAssassin and deliver to a date based Maildir (or one giant pile for spam).
I use tail -f mail_log as a really dumb kind of biff replacement.
Now, what do I read and write mail with? Notmuch! It is the only thing that even comes close to being able to deal with a decent flow of mail. I have a couple of saved searches just to track how much mail I pull in a day/week. Today (on Monday), it says 442 today and 10,403 over the past week.
For the most part, my workflow is kind of INBOX-ZERO like, except that I currently view victory as INBOX 2000. Most mail does go into my INBOX, the notable exceptions are two main mailing lists I’m subscribed to mostly as FYI and to search/find things when needed. Those are the Linux Kernel Mailing List (LKML) and the buildroot mailing list. Why notmuch rather than just searching the web for mailing list archives? Notmuch can return the result of a query in less time it takes light to get to and from the United States in ideal conditions.
For work, I don’t sync my mail anywhere. It’s just on my laptop. Not having it on my phone is a feature. I have a notmuch post-new hook that does some initial tagging of mail, and as such I have this in my ~/.notmuch-config:
My post-new hook looks like this:
# immediately archive all messages from "me"
notmuch tag -new -- tag:new and from:firstname.lastname@example.org
# tag all message from lists
notmuch tag +devicetree +list -- tag:new and to:email@example.com
notmuch tag +inbox +unread -new -- tag:new and tag:devicetree and to:firstname.lastname@example.org
notmuch tag +listinbox +unread +list -new -- tag:new and tag:devicetree and not to:email@example.com
notmuch tag +linuxppc +list -- tag:new and to:firstname.lastname@example.org
notmuch tag +linuxppc +list -- tag:new and cc:email@example.com
notmuch tag +inbox +unread -new -- tag:new and tag:linuxppc
notmuch tag +openbmc +list -- tag:new and to:firstname.lastname@example.org
notmuch tag +inbox +unread -new -- tag:new and tag:openbmc
notmuch tag +lkml +list -- tag:new and to:email@example.com
notmuch tag +inbox +unread -new -- tag:new and tag:lkml and to:firstname.lastname@example.org
notmuch tag +listinbox +unread -new -- tag:new and tag:lkml and not tag:linuxppc and not to:email@example.com
notmuch tag +qemuppc +list -- tag:new and to:firstname.lastname@example.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemuppc and to:email@example.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemuppc and not to:firstname.lastname@example.org
notmuch tag +qemu +list -- tag:new and to:email@example.com
notmuch tag +inbox +unread -new -- tag:new and tag:qemu and to:firstname.lastname@example.org
notmuch tag +listinbox +unread -new -- tag:new and tag:qemu and not to:email@example.com
notmuch tag +buildroot +list -- tag:new and to:firstname.lastname@example.org
notmuch tag +buildroot +list -- tag:new and to:email@example.com
notmuch tag +buildroot +list -- tag:newa nd to:firstname.lastname@example.org
notmuch tag +inbox +unread -new -- tag:new and tag:buildroot and to:email@example.com
notmuch tag +listinbox +unread -new -- tag:new and tag:buildroot and not to:firstname.lastname@example.org
notmuch tag +ibmbugzilla -- tag:new and from:email@example.com
# finally, retag all "new" messages "inbox" and "unread"
notmuch tag +inbox +unread -new -- tag:new
This leaves me with both an inbox and a listinbox. I do not look at the overwhelming majority of mail that hits the listinbox – It’s mostly for following up on individual things. If I started to need to care more about specific topics, I’d probably add something in there for them so I could easily find them.
My notmuch emacs setup has a bunch of saved searches, so my notmuch-hello screen looks something like this:
This gets me a bit of a state-of-the-world-of-email-to-look-at view for the day. I’ll often have meetings first thing in the morning that may reference email I haven’t looked at yet, and this generally lets me quickly find mail related to the problems of the day and start being productive.
For some unknown reason, the Windows installer doesn’t let you install to a USB key. Luckily, there’s a simple workaround. It turns out that only the veryfirst step of installation cares about that. So, if you can fool it (say, by running in qemu), you can have a USB key with a Windows install rather than having to dual boot on your hard disk (e.g. if you run Linux and want all that fast in-built SSD space for Linux)
Mount the ISO as a loopback device (e.g. by right clicking and choosing to mount, or by doing ‘sudo mount -o loop,ro file.iso /mnt’
Go to Lenovo web site for Drivers & Software for your laptop, under Chipset, there’s ME Firmware and Software downloads You will need both. It looks like this:
Run both exe files with WINE or on a windows box to extract the archives, you do not need to run the installers at the end.
you now need to extract the management engine drivers. You can do this in ~/.wine/drive_c/DRIVERS/WIN/AMT, with “cabextract SetupME.exe” or (as suggested in the comments) you can use the innoextract utility (from linux) to extract them (a quick check shows this to work)
Save off HECI_REL folder, it’s the only extracted thing you’ll need.
Go and install https://wimlib.net/ – we’re going to use it to create the boot disk. (it may be packaged for your distro).
If you don’t have the path /usr/lib/syslinux/modules/bios on your system but you do have /usr/share/syslinux/modules/bios – you will need to change a bit of the file programs/mkwinpeimg.in to point to the /usr/share locations rather than /usr/lib before you install wimlib. This probably isn’t needed if you’re installing from packages, but may be requried if you’re on, say, Fedora.
Copy ~/.wine/drive_c/DRIVERS to a new folder, e.g. “winpe_overlay” (or copy from the Windows box you extracted things on)
Use mkwinpeimg to create the boot disk, pointing it to the mounted Windows 10 installer and the “winpe_overlay”:
Recently, I added the package lrzsz to op-build in this commit. This package provides the rz and sz commands – for receive zmodem and send zmodem respectively. For those who don’t know, op-build builds a firmware image for OpenPOWER machines, and adding this package adds the commands to the petitboot shell (the busybox environment you get when you “exit to shell” from the boot menu).
For those who aren’t familiar with ZMODEM, you should first get off my lawn, and secondly, know that it’s a method for sending/receiving files over something akin to a serial port, e.g. a modem. The basic protocol is “I want to send you this file named FOO”, “okay, I would like to receive it”, “here’s some data and a checksum, did you get it and does it match the checksum?”, “yes!”, “okay, great, here’s the next bit” until the file is transferred. Importantly, it has a provision for “no, I didn’t get that right” and for bits to be resent.
The one thing that pretty much always somewhat works on a computer is a serial port (or something that looks like a serial port to software). When you connect to the IPMI console (“ipmitool sol activate”), the host sees this as a serial port that it pumps bytes over. With OpenBMC, you can actually connect to this serial port via SSH.
When diagnosing weird problems during firmware bringup such as “why doesn’t PCI work” or “why does my network adapter not work” (or, perhaps, somebody helpfully didn’t plug the network cable in), it can be useful to copy off a bunch of debug information from the machine.
You may say “can’t you just print the log file to the screen and save it?” and you’d be right, you can do that for text – it’s really annoying for binary data though. Plus, there have been bugs in the console implementations on pretty much every BMC I’ve ever used that makes them not as reliable as you’d like.
So, how could we transfer a file over the serial connection we have to the machine? The same way we did on a BBS! Enter ZMODEM. The error recovery properties are perfect in this situation.
So, how do you use it? I’ve found two ways that work well: GNU screen and zssh. For GNU Screen, you’ll want to configure it to catch zmodem traffic by doing “control-a:zmodem catch<ENTER>” (you need the colon there). After that if you execute “sz” on the host and the rest should be obvious! If you wanted to send a file to the host, run “rz” rather than “sz”.
One of the things I’ve been working on fairly quietly is the test suite for OpenPOWER firmware: op-test-framework. I’ve approach things I’m hacking on from the goal of “when I merge patches into skiboot, can I be confident that I haven’t merged something that’s broken existing functionality?”
By testing host firmware, we incidentally (as well as on purpose) test a whole bunch of BMC functionality. One bit of functionality we rely on a lot is the host “serial” console. Typically, this is exposed to the user over IPMI Serial Over LAN (SOL), or on OpenBMC it’s also exposed as something you can ssh to (which proves to be both faster and more reliable than IPMI, not to mention there’s some remote semblance of security).
When running through some tests, I noticed something pretty odd, it appeared as though we were sometimes missing some console output on larger IOs. This usually isn’t a problem as when we’re using expect(1) (or the python equivalent pexpect) we end up having all sorts of delays here there and everywhere to work around all the terrible things you hope you never learn. So… how could I test that? Well.. what about checking the output of something like dd if=/dev/zero bs=1024 count=16|hexdump -C to see if we get the full output?
Time to add a test to op-test-framework! Adding such a test is pretty easy. If we look at the source of the test I added, we can see what happens (source here).
class Console8k(Console, unittest.TestCase):
bs = 1024
count = 8
class Console16k(Console, unittest.TestCase):
bs = 1024
count = 16
class Console32k(Console, unittest.TestCase):
bs = 1024
count = 32
The setUp() function is pure boiler plate, we grab some objects from the configuration of the test run, namely information about the BMC and the system itself, so we can do things to both. The real magic happens in runTest().
op-test-framework tracks the state of the machine being tested across each test. This means that if we’re executing 101 tests in the petitboot shell, we don’t need to do 101 separate boots to petitboot. The self.system.goto_state(OpSystemState.PETITBOOT_SHELL) statement says “Please ensure the system is booted to the petitboot shell”. Other states include OFF (obvious) and OS, which is when the machine is booted to the OS.
The next two lines ensure we can run commands on the console (where console is IPMI Serial over LAN or other similar connection, such as the SSH console provided by OpenBMC):
As of May 29th 2017, if you want to do something crazy like use *both* ports of the OneLink+ dock to use monitors that aren’t 640×480 (but aren’t 4k), you’re going to need a 4.11 kernel, as everything else (for example 4.10.17, which is the latest in Fedora 25 at time of writing) will end you in a world of horrible, horrible pain.
This grabs a kernel that’s sitting in testing and isn’t yet in the main repositories. However, I can now see things on monitors, rather than 0 to 1 monitor (most often 0). You can also dock/undock and everything doesn’t crash in a pile of fail.
I remember a time when you could fairly reliably buy Intel hardware and have it “just work” with the latest distros. It’s unfortunate that this is no longer the case, and it’s more of a case of “wait six months and you’ll still have problems”.
(at least Wayland and X were bug for bug compatible?)
Recently, I was reading a thread on LKML on a proposal to change the behavior of the open system call when confronted with unknown flags. The thread is worth a read as the topic of augmenting things that exist probably by accident to be “better” is always interesting, as is the definition of “better”.
Keeping API and/or ABI compatibility is something that isn’t a new problem, and it’s one that people are pretty good at sometimes messing up.
This problem does not go away just because “we have cloud now”. In any distributed system, in order to upgrade it (or “be agile” as the kids are calling it), you by definition are going to have either downtime or at least two versions running concurrently. Thus, you have to have your interfaces/RPCs/APIs/ABIs/protocols/whatever cope with changes.
You cannot instantly upgrade the world, it happens gradually. You also have to design for at least three concurrent versions running. One is the original, the second is your upgrade, your third is the urgent fix because the upgrade is quite broken in some new way you only discover in production.
So, the way you do this? Never ever EVER design for N-1 compatibility only. Design for going back a long way, much longer than you officially support. You want to have a design and programming culture of backwards compatibility to ensure you can both do new and exciting things and experiment off to the side.
It’s worth going and rereading Rusty’s API levels posts from 2008:
All of these are available as Kindle books, but I’m sure you can get 3D copies too:
The Five Dysfunctions of a Team: A Leadership Fable by Patrick M. Lencioni Leading Change by John P. Kotter Who Says Elephants Can’t Dance? Louis V. Gerstner Jr. Nonviolent Communication: A language of Life by Marshall B. Rosenberg and Arun Gandhi
In OpenPOWER land, we’ve been doing some work on Secure and Trusted Boot while at the same time doing some work on what we call fast-reset (or fast-reboot, depending on exactly what mood someone was in at any particular time…. we should start being a bit more consistent).
The basic idea for fast-reset is that when the OS calls OPAL reboot, we gather all the threads in the system using a combination of patching the reset vector and soft-resetting them, then cleanup a few bits of hardware (we do re-probe PCIe for example), and reload & restart the bootloader (petitboot).
What this means is that typing “reboot” on the command line goes from a ~90-120+ second affair (through firmware to petitboot, linux distros still take ages to shut themselves down) down to about a 20 second affair (to petitboot).
If you’re running a (very) recent skiboot, you can enable it with a special hidden NVRAM configuration option (although we’ll likely enable it by default pretty soon, it’s proving remarkably solid). If you want to know what that NVRAM option is… Use the source, Luke! (or git history, but I’ve yet to see a neat Star Wars reference referring to git commit logs).
So, there’s nothing like a demo. Here’s a demo with Ubuntu running off an NVMe drive on an IBM S822LC for HPC (otherwise known as Minsky or Garrison) which was running the HTX hardware exerciser, through fast-reboot back into Petitboot and then booting into Ubuntu and auto-starting the exerciser (HTX) again.
Apart from being stupidly quick when compared to a full IPL (Initial Program Load – i.e. boot), since we’re not rebooting out of band, we have no way to reset the TPM, so if you’re measuring boot, each subsequent fast-reset will result in a different set of measurements.
This may be slightly confusing, but it’s not really a problem. You see, if a machine is compromised, there’s nothing stopping me replacing /sbin/reboot with something that just prints things to the console that look like your machine rebooted but in fact left my rootkit running. Indeed, fast-reset and a full IPL should measure different values in the TPM.
It also means that if you ever want to re-establish trust in your OS, never do a reboot from the host – always reboot out of band (e.g. from a BMC). This, of course, means you’re trusting your BMC to not be compromised, which I wouldn’t necessarily do if you suspect your host has been.
opal-prd is the Processor RunTime Diagnostics daemon, the userspace process that on OpenPower systems is responsible for some of the runtime diagnostics. Although a userspace process, it memory maps (as in mmap) in some code loaded by early firmware (Hostboot) called the HostBoot RunTime (HBRT) and runs it, using calls to the kernel to accomplish any needed operations (e.g. reading/writing registers inside the chip). Running this in user space gives us benefits such as being able to attach gdb, recover from segfaults etc.
The reason this code is shipped as part of firmware rather than as an OS package is that it is very system specific, and it would be a giant pain to update a package in every Linux distribution every time a new chip or machine was introduced.
Anyway, there’s a bug in the HBRT code that means if there’s an ECC error in the HBEL (HostBoot Error Log) partition in the system flash (“bios” or “pnor”… the flash where your system firmware lives), the opal-prd process may get stuck chewing up 100% CPU and not doing anything useful. There’s https://github.com/open-power/hostboot/issues/67 for this.
You will notice a problem if the opal-prd process is using 100% CPU and the last log messages are something like:
Luckily, there’s a simple workaround to fix it all up! You will need the pflash utility. Primarily, pflash is meant only for developers and those who know what they’re doing. You can turn your computer into a brick using it.
pflash is packaged in Ubuntu 16.10 and RHEL 7.3, but you can otherwise build it from source easily enough:
git clone https://github.com/open-power/skiboot.git
Now that you have pflash, you just need to erase the HBEL partition and write (ECC) zeros:
IBM (my employer) recently announced the new S822LC for HPC POWER8+NVLINK NVIDIA P100 GPUs server (press release, IBM Systems Blog, The Register). The “For HPC” suffix on the model number is significant, as the S822LC is a different machine. What makes the “for HPC” variant different is that the POWER8 CPU has (in addition to PCIe), logic for NVLink to connect the CPU to NVIDIA GPUs.
There’s also the NVIDIA Tesla P100 GPUs which are NVIDIA’s latest in an SXM2 form factor, but instead of delving into GPUs, I’m going to tell you how to compile the firmware for this machine.
You see, this is an OpenPOWER machine. It’s an OpenPOWER machine where the vendor (in this case IBM) has worked to get all the needed code upstream, so you can see exactly what goes into a firmware build.
To build the latest host firmware (you can cross compile on x86 as we use buildroot to build a cross compiler):
That’s it! Give it a while and you’ll end up with output/images/garrison.pnor – which is a firmware image to flash onto PNOR. The machine name is garrison as that’s the code name for the “S822LC for HPC” (you may see Minsky in the press, but that’s a rather new code name, Garrison has been around for a lot longer as a name).
I was eager to try it with skiboot to see if it could find anything.
Luckily, it was pretty easy. I built Smatch according to their documentation and then built skiboot:
make CHECK="/home/stewart/smatch/smatch" C=1 -j20 all check
Due to some differences in how we implement abort() and assert() in skiboot, I added “_abort”, “abort” and “assert_fail” to smatch_data/no_return_funcs in the Smatch source tree to silence some false positives.
It seems that there’s a few useful warnings there (some of which I’ve fixed in skiboot master already), along with some false positives around the preprocessor/complier tricks we do to ensure at compile time that an OPAL call definition has the correct number of arguments specified.
Recently, we merged into the op-build project (the build scripts for OpenPOWER Firmware) a defconfig for building OPAL for (certain) POWER9 simulators. I won’t bother linking over to articles on the POWER9 chip or schedule (there’s search engines for that), but with this commit – if you happen to be able to get your hands on a POWER9 simulator, you can now boot to the petitboot bootloader on it!
We’re using upstream Linux 4.7.0-rc3 and upstream skiboot (master), so all of this code is already upstream!
Now, by no means is this complete. There’s some fairly fundamental things that are missing (e.g. PCI) – but how many other platforms can you build open source firmware for before you can even get your hands on a simulator?