Tracing flash reads (and writes) during boot

On OpenPOWER POWER9 systems, we typically talk to the flash chips that hold firmware for the host (i.e. the POWER9) processor through a daemon running on the BMC (aka service processor) rather than directly.

We have host firmware map “windows” on the LPC bus to parts of the flash chip. This flash chip can in fact be a virtual one, constructed dynamically from files on the BMC.

Since we’re mapping windows into this flash address space, we have some knowledge as to what IO the host is doing to/from the pnor. We can use this to output data in the blktrace format and feed into existing tools used to analyze IO patterns.

So, with a bit of learning of the data format and learning how to drive the various tools, I was ready to patch the BMC daemon (mboxbridge) to get some data out.

An initial bit of data is a graph of the windows into PNOR opened up during an normal boot (see below).

PNOR windows created over the course of a normal boot.

This shows us that over the course of the boot, we open a bunch of windows, and switch them around a fair bit early on. This makes sense as early in boot we do not yet have DRAM working and page in firmware on-demand into L3 cache.

Later in boot, you can see the loading of larger chunks of firmware into memory. It’s also possible to see that this seems to take longer than it should – and indeed, we have a bug there.

Next, by modifying the code again, I introduced recording of when we used a window that the BMC had already cached. While the host will only see one window at a time, the BMC can keep around the ones it prepared earlier in order to avoid IO to the actual flash chips (which are SPI flash, so aren’t incredibly fast).

Here we can see that we’re likely not doing the most efficient things during boot, and there’s probably room for some optimization.

Normal boot but including re-used Windows rather than just created ones

Finally, in order to get finer grained information, I reduced the window size from one megabyte down to 4096 bytes. This will impose a heavy speed penalty as it’ll mean we will have to create a lot more windows to do the same amount of IO, but it means that since we’re using the page size of hostboot, we’ll see each individual page in/out operation that it does during boot.

So, from the next graph, we can see that there’s several “hot” areas of the image, and on the whole it’s not too many pages. This gives us a hint that a bit of effort to reduce binary image size a little bit could greatly reduce the amount of IO we have to do.

4096 byte (i.e. page) size window, capturing the bits of flash we need to read in several times due to being low on memory when we’re L3 cache constrained.

The iowatcher tool also can construct a video of the boot and what “blocks” are being read.

Video of what blocks are read from flash during booting

So, what do we get from this adventure? Well, we get a good list of things to look into in order to improve boot performance, and we get to back these up with data rather than guesswork. Since this also works on unmodified host firmware, we’re measuring what we really boot rather than changing it in order to measure it.

What you need to reproduce this:

OpenPOWER, OpenCompute and fostering a firmware development community

Recently, I was at the OpenPOWER Summit in San Jose where people could see the Barreleye server (specs and design here, initial Rackspace blog post here). Barreleye is an OpenCompute form factor POWER8 server. It’s not only an OpenPOWER machine, which means all of the host firmware is free and open source software, but there’s also OpenBMC meaning that the source to the OS and userspace running on the BMC (service processor) is also open!

In addition, the firmware enablement came from Foxconn (see this skiboot commit), which means we’re being successful in enabling people who aren’t part of IBM to join the development community for OpenPOWER firmware and get the changes needed to support their machines accepted upstream.

Granted, the size of a firmware development community is always likely to be relatively small, but I really like how Foxconn has shown leadership to other ODMs on interacting with and becoming part of the OpenPOWER firmware community.