Speeding up Blackbird boot: the SBE

The Self Boot Engine (SBE) is a small embedded PPE42 core inside the POWER9 CPU which has the unenvious job of getting a single POWER9 core ready enough to start executing instructions out of L3 cache, and poking some instructions into said cache for the core to start executing.

It’s called the “Self Boot Engine” as in generations prior to POWER8, it was the job of the FSP (Service Processor) to do all of the booting for the CPU. On POWER8, there was still an SBE, but it was a custom instruction set (this was the Power On Reset Engine – PORE), while the PPE42 is basically a 32bit powerpc core cut straight down the middle (just the way to make it awkward for toolchains).

One of the things I noted in my post on Booting temporary firmware on the Raptor Blackbird is that we got serial console output from the SBE. It turns out one of thing things explicitly not enabled by Raptor in their build was this output as “it made the SBE boot much slower”. I’d actually long suspected this, but hadn’t really had the time to delve into it.

Since for POWER9, the firmware for the SBE is now open source code, as is the ppe42-binutils and ppe42-gcc toolchain for it. This means we can hack on it!

WARNING: hacking on your SBE firmware can be relatively dangerous, as it’s literally the first thing that needs to work in order to boot the system, and there isn’t (AFAIK) publicly documented easy way to re-flash your SBE firmware if you mess it up.

Seeing as we saw a regression in boot time with the UART output enabled, we need to look at the uartPutChar() function in sbeConsole.C (error paths removed for clarity):

static void uartPutChar(char c)
{
    #define SBE_FUNC "uartPutChar"
    uint32_t rc = SBE_SEC_OPERATION_SUCCESSFUL;
    do {
        static const uint64_t DELAY_NS = 100;
        static const uint64_t DELAY_LOOPS = 100000000;

        uint64_t loops = 0;
        uint8_t data = 0;
        do {
            rc = readReg(LSR, data);
...
            if(data == LSR_BAD || (data & LSR_THRE))
            {
                break;
            }
            delay(DELAY_NS, 1000000);
        } while(++loops < DELAY_LOOPS);

...
        rc = writeReg(THR, c);
...
    } while(0);

    #undef SBE_FUNC
}

One thing you may notice if you’ve spent some time around serial ports is that it’s not using the transmit FIFO! While according to Wikipedia the original 16550 had a broken FIFO, but we’re certainly not going to be hooked up to an original rev of that silicon.

To compare, let’s look at the skiboot code, which is all in hw/lpc-uart.c:

static void uart_check_tx_room(void)
{
	if (uart_read(REG_LSR) & LSR_THRE) {
		/* FIFO is 16 entries */
		tx_room = 16;
		tx_full = false;
	}
}

The uart_check_tx_room() function is pretty simple, it checks if there’s room in the FIFO and knows that there’s 16 entries. Next, we have a busy loop that waits until there’s room again in the FIFO:

static void uart_wait_tx_room(void)
{
	while (!tx_room) {
		uart_check_tx_room();
		if (!tx_room) {
			smt_lowest();
			do {
				barrier();
				uart_check_tx_room();
			} while (!tx_room);
			smt_medium();
		}
	}
}

Finally, the bit of code that writes the (internal) log buffer out to a serial port:

/*
 * Internal console driver (output only)
 */
static size_t uart_con_write(const char *buf, size_t len)
{
	size_t written = 0;

	/* If LPC bus is bad, we just swallow data */
	if (!lpc_ok() && !mmio_uart_base)
		return written;

	lock(&uart_lock);
	while(written < len) {
		if (tx_room == 0) {
			uart_wait_tx_room();
			if (tx_room == 0)
				goto bail;
		} else {
			uart_write(REG_THR, buf[written++]);
			tx_room--;
		}
	}
 bail:
	unlock(&uart_lock);
	return written;
}

The skiboot code ends up being a bit more complicated thanks to a number of reasons, but the basic algorithm could be applied to the SBE code, and rather than busy waiting for each character to be written out before sending the other into the FIFO, we could just splat things down there and continue with life. So, I put together a patch to try out.

Before (i.e. upstream SBE code): it took about 15 seconds from “Welcome to SBE” to “Booting Hostboot”.

Now (with my patch): Around 10 seconds.

It’s a full five seconds (33%) faster to get through the SBE stage of booting. Wow.

Hopefully somebody looks at the pull request sometime soon, as it’s probably useful to a lot of people doing firmware and Operating System development.

So, Happy New Year for Blackbird owners (I’ll publish a build with this and other misc improvements “soon”).

A close-to-upstream firmware build for the Raptor Blackbird

It goes without saying that using this build is a At Your Own Risk and I make zero warranty. AFAIK it can’t physically destroy your system.

My GitHub op-build branch stewart-blackbird-v1 has all the changes built into this build (the VERSION displayed in firmware will be slightly weird as I did the tagging afterwards… this is not meant to be “howto release firmware to the public”). Follow op-build pull 3341 for the state of upstreaming everything.

Binaries are over at https://www.flamingspork.com/blackbird/stewart-blackbird-v1-images/ (see the git branch of op-build for source).

To flash it (temporarily), grab blackbird.pnor, get it to /tmp on your BMC and follow the instructions I posted the other day.

I’d be interested in any feedback on what does/does not work.

Are you Fans of the Blackbird? Speak up, I can’t hear you over the fan.

So, as of yesterday, I started running a pretty-close-to-upstream op-build host firmware stack on my Blackbird. Notable yak-shaving has included:

Apart from that, I was all happy as Larry. Except then I went into the room with the Blackbird in it an went “huh, that’s loud”, and since it was bedtime, I decided it could all wait until the morning.

It is now the morning. Checking fan speeds over IPMI, one fan stood out (fan2, sitting at 4300RPM). This was a bit of a surprise as what’s silkscreened on the board is that the rear case fan is hooked up to ‘fan2″, and if we had a “start from 0/1” mix up, it’d be the front case fan. I had just assumed it’d be maybe OCC firmware dying or something, but this wasn’t the case (I checked – thanks occtoolp9!)

After a bit of digging around, I worked out this mapping:

IPMI fan0Rear Case FanMotherboard Fan 2
IPMI fan1Front Case FanMotherboard Fan 3
IPMI fan2CPU FanMotherboard Fan 1

Which is about as surprising and confusing as you’d think.

After a bunch of digging around the Raptor ports of OpenBMC and Hostboot, it seems that the IPL Observer which is custom to Raptor controls if the BMC decides to do fan control or not.

You can get its view of the world from the BMC via the (incredibly user friendly) poking at DBus:

busctl get-property org.openbmc.status.IPL /org/openbmc/status/IPL org.openbmc.status.IPL current_status; busctl get-property org.openbmc.status.IPL /org/openbmc/status/IPL org.openbmc.status.IPL current_istep

Which if you just have the Hostboot patch in (like I first did) you end up with:

s "IPL_RUNNING"
s "21,3"

Which is where Hostboot exits the IPL process (as you see on the screen) and hands over to skiboot. But if you start digging through their op-build tree, you find that there’s a signal_linux_start_complete script which calls pnv-lpc to write two values to LPC ports 0x81 and 0x82. The pnv-lpc utility is the external/lpc/ binary from skiboot, and these two ports are the “extended lpc port 80h” state.

So, to get back fan control? First, build the lpc utility:

git clone git@github.com:open-power/skiboot.git
cd skiboot/external/lpc
make

and then poke the magic values of “IPL complete and linux running”:

$ sudo ./lpc io 0x81.b=254
[io] W 0x00000081.b=0xfe
$ sudo ./lpc io 0x82.b=254
[io] W 0x00000082.b=0xfe

You get a friendly beep, and then your fans return to sanity.

Of course, for that to work you need to have debugfs mounted, as this pokes OPAL debugfs to do direct LPC operations.

Next up: think of a smarter way to trigger that than “stewart runs it on the command line”. Also next up: work out the better way to determine that fan control should be on and patch the BMC.

Booting temporary firmware on the Raptor Blackbird

In a future post, I’ll detail how to build my ported-to-upstream Blackbird firmware. Here though, we’ll explore booting some firmware temporarily to experiment.

Step 1: Copy your new PNOR image over to the BMC.
Step 2: …
Step 3: Profit!

Okay, not really, once you’ve copied over your image, ensure the computer is off and then you can tell the daemon that provides firmware to the host to use a file backend for it rather than the PNOR chip on the motherboard (i.e. yes, you can boot your system even when the firmware chip isn’t there – although I’ve not literally tried this).

root@blackbird:~# mboxctl --backend file:/tmp/blackbird.pnor 
SetBackend: Success
root@blackbird:~# obmcutil poweron

If we look at the serial console (ssh to the BMC port 2200) we’ll see Hostboot start, realise there’s newer SBE code, flash it, and reboot:

--== Welcome to Hostboot hostboot-b284071/hbicore.bin ==--

  3.02606|secure|SecureROM valid - enabling functionality
  5.14678|Booting from SBE side 0 on master proc=00050000
  5.18537|ISTEP  6. 5 - host_init_fsi
  5.47985|ISTEP  6. 6 - host_set_ipl_parms
  5.54476|ISTEP  6. 7 - host_discover_targets
  6.56106|HWAS|PRESENT> DIMM[03]=8080000000000000
  6.56108|HWAS|PRESENT> Proc[05]=8000000000000000
  6.56109|HWAS|PRESENT> Core[07]=1511540000000000
  6.61373|ISTEP  6. 8 - host_update_master_tpm
  6.61529|SECURE|Security Access Bit> 0x0000000000000000
  6.61530|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
  6.61543|ISTEP  6. 9 - host_gard
  7.20987|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
  7.20988|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
  7.20989|HWAS|FUNCTIONAL> Core[07]=1511540000000000
  7.21299|ISTEP  6.11 - host_start_occ_xstop_handler
  8.28965|ISTEP  6.12 - host_voltage_config
  8.47973|ISTEP  7. 1 - mss_attr_cleanup
  9.07674|ISTEP  7. 2 - mss_volt
  9.35627|ISTEP  7. 3 - mss_freq
  9.63029|ISTEP  7. 4 - mss_eff_config
 10.35189|ISTEP  7. 5 - mss_attr_update
 10.38489|ISTEP  8. 1 - host_slave_sbe_config
 10.45332|ISTEP  8. 2 - host_setup_sbe
 10.45450|ISTEP  8. 3 - host_cbs_start
 10.45574|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 10.48675|ISTEP  8. 5 - host_attnlisten_proc
 10.50338|ISTEP  8. 6 - host_p9_fbc_eff_config
 10.50771|ISTEP  8. 7 - host_p9_eff_config_links
 10.53338|ISTEP  8. 8 - proc_attr_update
 10.53634|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 10.55234|ISTEP  8.10 - proc_xbus_scominit
 10.56202|ISTEP  8.11 - proc_xbus_enable_ridi
 10.57788|ISTEP  8.12 - host_set_voltages
 10.59421|ISTEP  9. 1 - fabric_erepair
 10.65877|ISTEP  9. 2 - fabric_io_dccal
 10.66048|ISTEP  9. 3 - fabric_pre_trainadv
 10.66665|ISTEP  9. 4 - fabric_io_run_training
 10.66860|ISTEP  9. 5 - fabric_post_trainadv
 10.67060|ISTEP  9. 6 - proc_smp_link_layer
 10.67503|ISTEP  9. 7 - proc_fab_iovalid
 11.10386|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 11.15103|ISTEP 10. 1 - proc_build_smp
 11.27537|ISTEP 10. 2 - host_slave_sbe_update
 11.68581|sbe|System Performing SBE Update for PROC 0, side 0
 34.50467|sbe|System Rebooting To Complete SBE Update Process
 34.50595|IPMI: Initiate power cycle
 34.54671|Stopping istep dispatcher
 34.68729|IPMI: shutdown complete

One of the improvements is we now get output from the SBE! This means that when we do things like mess up secure boot and non secure boot firmware (I’ll explain why/how this is a thing later), we’ll actually get something useful out of a serial port:

--== Welcome to SBE - CommitId[0x8b06b5c1] ==--
istep 3.19
istep 3.20
istep 3.21
istep 3.22
istep 4.1
istep 4.2
istep 4.3
istep 4.4
istep 4.5
istep 4.6
istep 4.7
istep 4.8
istep 4.9
istep 4.10
istep 4.11
istep 4.12
istep 4.13
istep 4.14
istep 4.15
istep 4.16
istep 4.17
istep 4.18
istep 4.19
istep 4.20
istep 4.21
istep 4.22
istep 4.23
istep 4.24
istep 4.25
istep 4.26
istep 4.27
istep 4.28
istep 4.29
istep 4.30
istep 4.31
istep 4.32
istep 4.33
istep 4.34
istep 5.1
istep 5.2
SBE starting hostboot

And then we’re back into normal Hostboot boot (which we’ve all seen before) and end up at a newer petitboot!

Petitboot 1.11 on a Raptor Blackbird

One notable absence from that screenshot is my installed Fedora is missing. This is because there appears to be a bug in the 5.3.7 kernel that’s currently upstream, and if we drop to the shell and poke at lspci and dmesg, we can work out what could be the culprit:

Exiting petitboot. Type 'exit' to return.
You may run 'pb-sos' to gather diagnostic data
No password set, running as root. You may set a password in the System Configuration screen.
# lspci
0000:00:00.0 PCI bridge: IBM Device 04c1
0001:00:00.0 PCI bridge: IBM Device 04c1
0001:01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
0002:00:00.0 PCI bridge: IBM Device 04c1
0002:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0003:00:00.0 PCI bridge: IBM Device 04c1
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM Device 04c1
0004:01:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM Device 04c1
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
# dmesg|grep -i nvme
[    2.991038] nvme nvme0: pci function 0001:01:00.0
[    2.991088] nvme 0001:01:00.0: enabling device (0140 -> 0142)
[    3.121799] nvme nvme0: Identify Controller failed (19)
[    3.121802] nvme nvme0: Removing after probe failure status: -5
# uname -a
Linux skiroot 5.3.7-openpower1 #2 SMP Sat Dec 14 09:06:20 PST 2019 ppc64le GNU/Linux

If for some reason the device didn’t show up in lspci, then I’d look at the skiboot firmware log, which is /sys/firmware/opal/msglog.

Looking at upstream stable kernel patches, it seems like 5.3.8 has a interesting looking patch when you realize that ppc64le uses a 64k page size:

commit efac0f186ea654e8389f5017c7f643ef48cb4b93
Author: Kevin Hao <haokexin@gmail.com>
Date:   Fri Oct 18 10:53:14 2019 +0800

    nvme-pci: Set the prp2 correctly when using more than 4k page
    
    commit a4f40484e7f1dff56bb9f286cc59ffa36e0259eb upstream.
    
    In the current code, the nvme is using a fixed 4k PRP entry size,
    but if the kernel use a page size which is more than 4k, we should
    consider the situation that the bv_offset may be larger than the
    dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then
    cause the command can't be executed correctly.
    
    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kevin Hao <haokexin@gmail.com>
    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

So, time to go try 5.3.8. My yaks are getting quite smooth.

Oh, and when you’re done with your temporary firmware, either fiddle with mboxctl or restart the systemd service for it, or reboot your BMC or… well, I gotta leave you something to work out on your own :)

Building OpenPOWER firmware on Fedora 31

One of the challenges with Fedora 31 is that /usr/bin/python is now Python 3 rather than Python 2. Just about every python script in existence relies on /usr/bin/python being Python 2 and not anything else. I can’t really recall, but this probably happened with the 1.5 to 2 transition as well (although IIRC that was less breaking).

What this means is that for projects that are half-way through converting to python 3, everything breaks.

op-build is one of these projects.

So, we need:

After all that, you can actually build a pnor image on Fedora 31. Even on Fedora 31 ppc64le, which is literally what I’ve just done.

Upstreaming Blackbird firmware (step 1: skiboot)

Now that I can actually boot the machine, I could test and send my patch upstream for Blackbird support in skiboot. One thing I noticed with the current firmware from Raptor is that the PCIe slot names were wrong. While a pretty minor point, it’s a bit funny that there’s only two slots and the names were wrong.

The PCIe slot names are used to call out the physical location of PCIe cards in the system, so if you, say, hit a bunch of errors, OS/firmware can say “It’s this card in the slot labeled BLAH on the board”.

With my patch, the slot table from skiboot is spat out looking like this:

[   64.296743001,5] PHB#0000:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=SLOT1 PCIE 4.0 X16 
 [   64.296875483,5] PHB#0001:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=SLOT2 PCIE 4.0 X8 
 [   64.297054197,5] PHB#0001:01:00.0 [EP  ] 8086 f1a8 R:03 C:010802 (  mass-storage) LOC_CODE=SLOT2 PCIE 4.0 X8
 [   64.297285067,5] PHB#0002:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin SATA 
 [   64.297411565,5] PHB#0002:01:00.0 [LGCY] 1b4b 9235 R:11 C:010601 (          sata) LOC_CODE=Builtin SATA
 [   64.297554540,5] PHB#0003:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin USB 
 [   64.297732049,5] PHB#0003:01:00.0 [EP  ] 104c 8241 R:02 C:0c0330 (      usb-xhci) LOC_CODE=Builtin USB
 [   64.297848624,5] PHB#0004:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin Ethernet 
 [   64.298026870,5] PHB#0004:01:00.0 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298212291,5] PHB#0004:01:00.1 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298424962,5] PHB#0004:01:00.2 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298587848,5] PHB#0005:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..02 SLOT=BMC 
 [   64.298722540,5] PHB#0005:01:00.0 [ETOX] 1a03 1150 R:04 C:060400 B:02..02 LOC_CODE=BMC
 [   64.298850009,5] PHB#0005:02:00.0 [PCID] 1a03 2000 R:41 C:030000 (           vga) LOC_CODE=BMC

If you want to give it a go, grab the patch, build skiboot, and flash it on. Alternatively, you can download a built skiboot here. To flash it, do this:

# Copy to your BMC for the Blackbird
scp skiboot-v6.5-146-g376bed3f.lid.xz.stb root@blackbird:/tmp/

# then, ssh to the BMC
$ ssh root@blackbird

# ensure the machine is off
obmcutil poweroff --wait

# Now, make a backup copy (remember to copy it off /tmp on the bmc)
pflash -P PAYLOAD -r /tmp/skiboot-backup

# and flash the new skiboot:
pflash -e -P PAYLOAD -p /tmp/skiboot.lid.xz.stb

# now, power on the box
obmcutil poweron

Black(bird) boots!

Well, after the half false start of not having RAM so really not being able to do much (yeah yeah, I hear you – I’m weak for not just running Linux in L3), my RAM arrived today. Putting the sticks in was easy (of course), although does not make for an exciting photo.

One DIMM in the Blackbird

After that, I SSH’d the the BMC and then did “obmcutil poweron” (as is traditional) and started looking at the console via conneting via SSH to port 2200 on the BMC. I was then greeted by the (by this time in my life rather familiar) Hostboot:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
 3.02902|secure|SecureROM valid - enabling functionality
   7.15613|Booting from SBE side 0 on master proc=00050000
   7.19697|ISTEP  6. 5 - host_init_fsi
   7.54226|ISTEP  6. 6 - host_set_ipl_parms
   8.06280|ISTEP  6. 7 - host_discover_targets
   9.19791|HWAS|PRESENT> DIMM[03]=8080000000000000
   9.19792|HWAS|PRESENT> Proc[05]=8000000000000000
   9.19794|HWAS|PRESENT> Core[07]=1511540000000000
   9.55305|ISTEP  6. 8 - host_update_master_tpm
   9.60521|SECURE|Security Access Bit> 0x0000000000000000
   9.60522|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
   9.63093|ISTEP  6. 9 - host_gard
   9.89867|HWAS|Blocking Speculative Deconfig
   9.90128|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
   9.90129|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
   9.90130|HWAS|FUNCTIONAL> Core[07]=1511540000000000
   9.90329|ISTEP  6.11 - host_start_occ_xstop_handler
  11.19092|ISTEP  6.12 - host_voltage_config
  11.30246|ISTEP  7. 1 - mss_attr_cleanup
  12.61924|ISTEP  7. 2 - mss_volt
  12.92705|ISTEP  7. 3 - mss_freq
  13.67475|ISTEP  7. 4 - mss_eff_config
  14.95827|ISTEP  7. 5 - mss_attr_update
  14.97307|ISTEP  8. 1 - host_slave_sbe_config
  15.05372|ISTEP  8. 2 - host_setup_sbe
  15.10258|ISTEP  8. 3 - host_cbs_start
  15.10381|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
  15.11144|ISTEP  8. 5 - host_attnlisten_proc
  15.11213|ISTEP  8. 6 - host_p9_fbc_eff_config
  15.13552|ISTEP  8. 7 - host_p9_eff_config_links
  15.20087|ISTEP  8. 8 - proc_attr_update
  15.20191|ISTEP  8. 9 - proc_chiplet_fabric_scominit
  15.21891|ISTEP  8.10 - proc_xbus_scominit
  15.22929|ISTEP  8.11 - proc_xbus_enable_ridi
  15.24717|ISTEP  8.12 - host_set_voltages
  15.26620|ISTEP  9. 1 - fabric_erepair
  15.42123|ISTEP  9. 2 - fabric_io_dccal
  15.42436|ISTEP  9. 3 - fabric_pre_trainadv
  15.42887|ISTEP  9. 4 - fabric_io_run_training
  15.43207|ISTEP  9. 5 - fabric_post_trainadv
  15.44893|ISTEP  9. 6 - proc_smp_link_layer
  15.45454|ISTEP  9. 7 - proc_fab_iovalid
  15.87126|ISTEP  9. 8 - host_fbc_eff_config_aggregate
  15.89174|ISTEP 10. 1 - proc_build_smp
  16.54194|ISTEP 10. 2 - host_slave_sbe_update
  18.63876|sbe|System Performing SBE Update for PROC 0, side 0
  41.69727|sbe|System Rebooting To Complete SBE Update Process
  41.72189|IPMI: Initiate power cycle
  42.40652|IPMI: shutdown complete

The first IPL updated the Self Boot Engine firmware on the chip, so it automatically applied the new firmware and rebooted to finish applying it. This is perfectly normal, it just shows itself as a longer boot time. Booting continues:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
 3.02810|secure|SecureROM valid - enabling functionality
   6.07331|Booting from SBE side 0 on master proc=00050000
   6.11485|ISTEP  6. 5 - host_init_fsi
   6.60361|ISTEP  6. 6 - host_set_ipl_parms
   6.98640|ISTEP  6. 7 - host_discover_targets
   7.53975|HWAS|PRESENT> DIMM[03]=8080000000000000
   7.53976|HWAS|PRESENT> Proc[05]=8000000000000000
   7.53977|HWAS|PRESENT> Core[07]=1511540000000000
   7.79123|ISTEP  6. 8 - host_update_master_tpm
   7.79263|SECURE|Security Access Bit> 0x0000000000000000
   7.79264|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
   7.82684|ISTEP  6. 9 - host_gard
   8.26609|HWAS|Blocking Speculative Deconfig
   8.26865|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
   8.26866|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
   8.26867|HWAS|FUNCTIONAL> Core[07]=1511540000000000
   8.27142|ISTEP  6.11 - host_start_occ_xstop_handler
   9.69606|ISTEP  6.12 - host_voltage_config
   9.81183|ISTEP  7. 1 - mss_attr_cleanup
  10.95130|ISTEP  7. 2 - mss_volt
  11.39875|ISTEP  7. 3 - mss_freq
  12.15655|ISTEP  7. 4 - mss_eff_config
  13.63504|ISTEP  7. 5 - mss_attr_update
  13.65162|ISTEP  8. 1 - host_slave_sbe_config
  13.78039|ISTEP  8. 2 - host_setup_sbe
  13.78143|ISTEP  8. 3 - host_cbs_start
  13.78247|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
  13.79015|ISTEP  8. 5 - host_attnlisten_proc
  13.79114|ISTEP  8. 6 - host_p9_fbc_eff_config
  13.79734|ISTEP  8. 7 - host_p9_eff_config_links
  13.85128|ISTEP  8. 8 - proc_attr_update
  13.85783|ISTEP  8. 9 - proc_chiplet_fabric_scominit
  13.87991|ISTEP  8.10 - proc_xbus_scominit
  13.89056|ISTEP  8.11 - proc_xbus_enable_ridi
  13.91122|ISTEP  8.12 - host_set_voltages
  13.93077|ISTEP  9. 1 - fabric_erepair
  14.05235|ISTEP  9. 2 - fabric_io_dccal
  14.13131|ISTEP  9. 3 - fabric_pre_trainadv
  14.13616|ISTEP  9. 4 - fabric_io_run_training
  14.13934|ISTEP  9. 5 - fabric_post_trainadv
  14.14087|ISTEP  9. 6 - proc_smp_link_layer
  14.14656|ISTEP  9. 7 - proc_fab_iovalid
  14.59454|ISTEP  9. 8 - host_fbc_eff_config_aggregate
  14.61811|ISTEP 10. 1 - proc_build_smp
  15.24074|ISTEP 10. 2 - host_slave_sbe_update
  17.16022|sbe|System Performing SBE Update for PROC 0, side 1
  40.16808|ISTEP 10. 4 - proc_cen_ref_clk_enable
  40.27866|ISTEP 10. 5 - proc_enable_osclite
  40.31297|ISTEP 10. 6 - proc_chiplet_scominit
  40.55805|ISTEP 10. 7 - proc_abus_scominit
  40.57942|ISTEP 10. 8 - proc_obus_scominit
  40.58078|ISTEP 10. 9 - proc_npu_scominit
  40.60704|ISTEP 10.10 - proc_pcie_scominit
  40.66572|ISTEP 10.11 - proc_scomoverride_chiplets
  40.66874|ISTEP 10.12 - proc_chiplet_enable_ridi
  40.68407|ISTEP 10.13 - host_rng_bist
  40.75548|ISTEP 10.14 - host_update_redundant_tpm
  40.75785|ISTEP 11. 1 - host_prd_hwreconfig
  41.15067|ISTEP 11. 2 - cen_tp_chiplet_init1
  41.15299|ISTEP 11. 3 - cen_pll_initf
  41.15544|ISTEP 11. 4 - cen_pll_setup
  41.18530|ISTEP 11. 5 - cen_tp_chiplet_init2
  41.18762|ISTEP 11. 6 - cen_tp_arrayinit
  41.19050|ISTEP 11. 7 - cen_tp_chiplet_init3
  41.19286|ISTEP 11. 8 - cen_chiplet_init
  41.19553|ISTEP 11. 9 - cen_arrayinit
  41.19986|ISTEP 11.10 - cen_initf
  41.20215|ISTEP 11.11 - cen_do_manual_inits
  41.20497|ISTEP 11.12 - cen_startclocks
  41.20802|ISTEP 11.13 - cen_scominits
  41.21171|ISTEP 12. 1 - mss_getecid
  42.25709|ISTEP 12. 2 - dmi_attr_update
  42.30382|ISTEP 12. 3 - proc_dmi_scominit
  42.32572|ISTEP 12. 4 - cen_dmi_scominit
  42.32798|ISTEP 12. 5 - dmi_erepair
  42.35000|ISTEP 12. 6 - dmi_io_dccal
  42.35218|ISTEP 12. 7 - dmi_pre_trainadv
  42.35489|ISTEP 12. 8 - dmi_io_run_training
  42.37076|ISTEP 12. 9 - dmi_post_trainadv
  42.39541|ISTEP 12.10 - proc_cen_framelock
  42.40772|ISTEP 12.11 - host_startprd_dmi
  42.41974|ISTEP 12.12 - host_attnlisten_memb
  42.44506|ISTEP 12.13 - cen_set_inband_addr
  42.58832|ISTEP 13. 1 - host_disable_memvolt
  43.67808|ISTEP 13. 2 - mem_pll_reset
  43.75070|ISTEP 13. 3 - mem_pll_initf
  43.85043|ISTEP 13. 4 - mem_pll_setup
  43.87372|ISTEP 13. 6 - mem_startclocks
  43.88970|ISTEP 13. 7 - host_enable_memvolt
  43.89177|ISTEP 13. 8 - mss_scominit
  45.10013|ISTEP 13. 9 - mss_ddr_phy_reset
  45.38105|ISTEP 13.10 - mss_draminit
  45.95447|ISTEP 13.11 - mss_draminit_training
  47.20963|ISTEP 13.12 - mss_draminit_trainadv
  47.32161|ISTEP 13.13 - mss_draminit_mc
  47.49186|ISTEP 14. 1 - mss_memdiag
  69.53224|ISTEP 14. 2 - mss_thermal_init
  69.66891|ISTEP 14. 3 - proc_pcie_config
  69.71959|ISTEP 14. 4 - mss_power_cleanup
  69.72385|ISTEP 14. 5 - proc_setup_bars
  69.83889|ISTEP 14. 6 - proc_htm_setup
  69.84748|ISTEP 14. 7 - proc_exit_cache_contained
  69.89430|ISTEP 15. 1 - host_build_stop_image
  73.08679|ISTEP 15. 2 - proc_set_pba_homer_bar
  73.12352|ISTEP 15. 3 - host_establish_ex_chiplet
  73.13714|ISTEP 15. 4 - host_start_stop_engine
  73.19059|ISTEP 16. 1 - host_activate_master
  74.44590|ISTEP 16. 2 - host_activate_slave_cores
  74.53820|ISTEP 16. 3 - host_secure_rng
  74.54651|ISTEP 16. 4 - mss_scrub
  74.56565|ISTEP 16. 5 - host_load_io_ppe
  74.78752|ISTEP 16. 6 - host_ipl_complete
  75.50085|ISTEP 18.11 - proc_tod_setup
  75.94190|ISTEP 18.12 - proc_tod_init
  75.97575|ISTEP 20. 1 - host_load_payload
  77.12340|ISTEP 20. 2 - host_load_hdat
  78.05195|ISTEP 21. 1 - host_runtime_setup
  83.87001|htmgt|OCCs are now running in ACTIVE state
  89.72649|ISTEP 21. 2 - host_verify_hdat
  89.77252|ISTEP 21. 3 - host_start_payload
 [   90.400516933,5] OPAL skiboot-c81f9d6 starting…

The rest of the skiboot log was also spat out, and then the familiar Petitboot screen:

Welcome to Petitboot!

It lives! I even had a bit of a look at the sensors to see power consumption and temperatures. All looks good:

ipmitool sdr|grep -v ns
 occ0             | 0x00              | ok
 occ1             | 0x00              | ok
 p0_core3_temp    | 51 degrees C      | ok
 p0_core5_temp    | 49 degrees C      | ok
 p0_core7_temp    | 50 degrees C      | ok
 p0_core11_temp   | 49 degrees C      | ok
 p0_core15_temp   | 50 degrees C      | ok
 p0_core17_temp   | 50 degrees C      | ok
 p0_core19_temp   | 50 degrees C      | ok
 p0_core21_temp   | 50 degrees C      | ok
 dimm0_temp       | 36 degrees C      | ok
 dimm4_temp       | 39 degrees C      | ok
 fan0             | 1300 RPM          | ok
 fan1             | 1200 RPM          | ok
 fan2             | 1000 RPM          | ok
 p0_power         | 60 Watts          | ok
 p0_vdd_power     | 31 Watts          | ok
 p0_vdn_power     | 10 Watts          | ok
 cpu_1_ambient    | 30.90 degrees C   | ok
 pcie             | 27 degrees C      | ok
 ambient          | 25.40 degrees C   | ok

Next up? I guess I should install an OS.

Blackbird (singing in the dead of night..)

Way back when Raptor Computer Systems was doing pre-orders for the microATX Blackboard POWER9 system, I put in a pre-order. Since then, I’ve had a few life changes (such as moving to the US and starting to work for Amazon rather than IBM), but I’ve finally gone and done (most of) the setup for my own POWER9 system on (or under) my desk.

An 8 core POWER9 CPU, in bubble wrap and plastic packaging.

Everything came in a big brown box, all rather well packed. I had the board, CPU, heatsink assembly and the special tool to attach the heatsink to the board. Although unique to POWER9, the heatsink/fan assembly was one of the easier ones I’ve ever attached to a board.

The board itself looks pretty much as you’d expect – there’s a big spot for the CPU, a couple of PCI slots, a couple of DIMM slots and some SATA connectors.

The bits that are a bit unusual for a micro-ATX board are the big space reserved for FlexVer, the ASPEED BMC chip and the socketed flash. FlexVer is something I’m not ever going to use, and instead wish that there was an on-board m2 SSD slot instead, even if it was just PCIe. Having to sacrifice a PCIe slot just for a SSD is kind of a bummer.

The Blackbird POWER9 board
The POWER9 chip in socket

One annoying thing is my DIMMs are taking their sweet time in getting here, so I couldn’t actually populate the board with any memory.

Even without memory though, you can start powering it on and see that everything else works okay (i.e. it’s not completely boned). So, even without DIMMs, I could plug it in, and observe the Hostboot firmware complaining about insufficient hardware to IPL the box.

It Lives!

Yep, out the console (via ssh) you clearly see where things fail:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--

  3.03104|secure|SecureROM valid - enabling functionality
  6.67619|Booting from SBE side 0 on master proc=00050000
  6.85100|ISTEP  6. 5 - host_init_fsi
  7.23753|ISTEP  6. 6 - host_set_ipl_parms
  7.71759|ISTEP  6. 7 - host_discover_targets
 11.34738|HWAS|PRESENT> Proc[05]=8000000000000000
 11.34739|HWAS|PRESENT> Core[07]=1511540000000000
 11.69077|ISTEP  6. 8 - host_update_master_tpm
 11.73787|SECURE|Security Access Bit> 0x0000000000000000
 11.73787|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
 11.76276|ISTEP  6. 9 - host_gard
 11.96654|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
 11.96655|HWAS|FUNCTIONAL> Core[07]=1511540000000000
 12.07554|================================================
 12.07554|Error reported by hwas (0x0C00) PLID 0x90000007
 12.10289|  checkMinimumHardware found no functional dimm cards.
 12.10290|  ModuleId   0x03 MOD_CHECK_MIN_HW
 12.10291|  ReasonCode 0x0c06 RC_SYSAVAIL_NO_MEMORY_FUNC
 12.10292|  UserData1  HUID of node : 0x0002000000000000
 12.10293|  UserData2  number of present, non-functional dimms : 0x0000000000000000
 12.10294|------------------------------------------------
 12.10417|  Callout type             : Procedure Callout
 12.10417|  Procedure                : EPUB_PRC_FIND_DECONFIGURED_PART
 12.10418|  Priority                 : SRCI_PRIORITY_HIGH
 12.10419|------------------------------------------------
 12.10420|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.10421|================================================
 12.51718|================================================
 12.51719|Error reported by hwas (0x0C00) PLID 0x90000007
 12.51720|  Insufficient hardware to continue.
 12.51721|  ModuleId   0x03 MOD_CHECK_MIN_HW
 12.51722|  ReasonCode 0x0c04 RC_SYSAVAIL_INSUFFICIENT_HW
 12.54457|  UserData1   : 0x0000000000000000
 12.54458|  UserData2   : 0x0000000000000000
 12.54458|------------------------------------------------
 12.54459|  Callout type             : Procedure Callout
 12.54460|  Procedure                : EPUB_PRC_FIND_DECONFIGURED_PART
 12.54461|  Priority                 : SRCI_PRIORITY_HIGH
 12.54462|------------------------------------------------
 12.54462|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.54463|================================================
 12.73660|System shutting down with error status 0x90000007
 12.75545|================================================
 12.75546|Error reported by istep (0x1700) PLID 0x90000007
 12.77991|  IStep failed, see other log(s) with the same PLID for reason.
 12.77992|  ModuleId   0x01 MOD_REPORTING_ERROR
 12.77993|  ReasonCode 0x1703 RC_FAILURE
 12.77994|  UserData1  eid of first error : 0x9000000800000c04
 12.77995|  UserData2  Reason code of first error : 0x0000000100000609
 12.77996|------------------------------------------------
 12.77996|  host_gard
 12.77997|------------------------------------------------
 12.77998|  Callout type             : Procedure Callout
 12.77998|  Procedure                : EPUB_PRC_HB_CODE
 12.77999|  Priority                 : SRCI_PRIORITY_LOW
 12.78000|------------------------------------------------
 12.78001|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.78002|================================================

Looking forward to getting some DIMMs to show/share more.

Looking at the state of Blackbird firmware

Having been somewhat involved in OpenPOWER firmware, I have a bunch of experience and opinions on maintaining firmware trees for products, what working with upstream looks like and all that.

So, with my new Blackbird system I decided to take a bit of a look as to what the firmware situation was like.

There’s two main parts of firmware: BMC and Host. The BMC firmware runs purely on the ASPEED AST2500 and is based on OpenBMC while the host firmware is what runs on the POWER9 and is based off of OpenPOWER Firmware as assembled by op-build.

Initial impressions on the BMC is that there doesn’t seem to be any web based UI for it, which is kind of disappointing, as the Web UI being developed upstream has some nice qualities, and I’d say I even enjoyed using it when it was built into BMC firmware for systems we had when I was at IBM.

Looking at the git trees, the raptor-v1.00 tag is OpenBMC 2.7.0-dev-533-g386e5602e while current master is 2.8.0-dev-960-g10f7830bd. The spot where it split off was 2.7.0-dev-430-g7443ee80b, from April 2019 – so it’s not too old, but I’m also not convinced there should have been some security patches since then.

I’m not sure if any of the OpenBMC code is upstream, I haven’t looked.

Unfortunately, none of the host firmware is upstream.

On the host firmware side, v2.3-rc2-67-ga6a5f142 is the Raptor tag, and that compares with current master of v2.4-305-g54d8daf4, the place where Raptor forked was v2.3-rc2-9-g7b556015, again in April of 2019. Considering there was an upstream release in May of 2019 (v2.3), and again in July (v2.4), it could have easily have made it into an upstream release.

Unfortunately, there doesn’t seem to have been an upstream op-build release since v2.4 back in July (when I made it shortly before leaving IBM).

The skiboot component of host firmware has had an upstream release since I left (v6.5 in mid-August 2019), so the (rather trivial) platform support could have easily made it. I have a cleaned up and ready to upstream patch for it, I just need some DIMMs to actually test with before I send the patch.

As the current firmware situation stands, producing another build with updated upstream code is tricky due to the out-of-tree nature of the Blackbird patches, and a straight “git merge” is probably doable by some people, but not everybody.

On my TODO list is to get all the code into a state I can upstream it, assess vulnerability to CVE-2019-6260, and work out how I want to make it do Secure Boot (something that isn’t in upstream firmware yet, and currently would require a TPM, which I do not have).

AWS Welcomes Stewart

A little over a month ago now, I started a new role at Amazon Web Services (AWS) as a Principal Engineer with Amazon Linux. Everyone has been wonderfully welcoming and helpful. I’m excited about the future here, the team, and our mission.

Thanks to all my IBM colleagues over the past five and a half and a bit years too, I really enjoyed working with you on OpenPOWER and hope it continues to gain traction. I have my Blackbird now and am eagerly waiting for a spare 20 minutes to assemble it.

CVE-2019-6260: Gaining control of BMC from the host processor

This is details for CVE-2019-6260 – which has been nicknamed “pantsdown” due to the nature of feeling that we feel that we’ve “caught chunks of the industry with their…” and combined with the fact that naming things is hard, so if you pick a bad name somebody would have to come up with a better one before we publish.

I expect OpenBMC to have a statement shortly.

The ASPEED ast2400 and ast2500 Baseboard Management Controller (BMC) hardware and firmware implement Advanced High-performance Bus (AHB) bridges, which allow arbitrary read and write access to the BMC’s physical address space from the host, or from the network if the BMC console uart is attached to a serial concentrator (this is atypical for most systems).

Common configuration of the ASPEED BMC SoC’s hardware features leaves it open to “remote” unauthenticated compromise from the host and from the BMC console. This stems from AHB bridges on the LPC and PCIe buses, another on the BMC console UART (hardware password protected), and the ability of the X-DMA engine to address all of the BMC’s M-Bus (memory bus).

This affects multiple BMC firmware stacks, including OpenBMC, AMI’s BMC, and SuperMicro. It is independent of host processor architecture, and has been observed on systems with x86_64 processors IBM POWER processors (there is no reason to suggest that other architectures wouldn’t be affected, these are just the ones we’ve been able to get access to)

The LPC, PCIe and UART AHB bridges are all explicitly features of Aspeed’s designs: They exist to recover the BMC during firmware development or to allow the host to drive the BMC hardware if the BMC has no firmware of its own. See section 1.9 of the AST2500 Software Programming Guide.

The typical consequence of external, unauthenticated, arbitrary AHB access is that the BMC fails to ensure all three of confidentiality, integrity and availability for its data and services. For instance it is possible to:

  1. Reflash or dump the firmware of a running BMC from the host
  2. Perform arbitrary reads and writes to BMC RAM
  3. Configure an in-band BMC console from the host
  4. “Brick” the BMC by disabling the CPU clock until the next AC power cycle

Using 1 we can obviously implant any malicious code we like, with the impact of BMC downtime while the flashing and reboot take place. This may take the form of minor, malicious modifications to the officially provisioned BMC image, as we can extract, modify, then repackage the image to be re-flashed on the BMC. As the BMC potentially has no secure boot facility it is likely difficult to detect such actions.

Abusing 3 may require valid login credentials, but combining 1 and 2 we can simply change the locks on the BMC by replacing all instances of the root shadow password hash in RAM with a chosen password hash – one instance of the hash is in the page cache, and from that point forward any login process will authenticate with the chosen password.

We obtain the current root password hash by using 1 to dump the current flash content, then using https://github.com/ReFirmLabs/binwalk to extract the rootfs, then simply loop-mount the rootfs to access /etc/shadow. At least one BMC stack doesn’t require this, and instead offers “Press enter for console”.

IBM has internally developed a proof-of-concept application that we intend to open-source, likely as part of the OpenBMC project, that demonstrates how to use the interfaces and probes for their availability. The intent is that it be added to platform firmware test
suites as a platform security test case. The application requires root user privilege on the host system for the LPC and PCIe bridges, or normal user privilege on a remote system to exploit the debug UART interface. Access from userspace demonstrates the vulnerability of systems in bare-metal cloud hosting lease arrangements where the BMC
is likely in a separate security domain to the host.

OpenBMC Versions affected: Up to at least 2.6, all supported Aspeed-based platforms

It only affects systems using the ASPEED ast2400, ast2500 SoCs. There has not been any investigation into other hardware.

The specific issues are listed below, along with some judgement calls on their risk.

iLPC2AHB bridge Pt I

State: Enabled at cold start
Description: A SuperIO device is exposed that provides access to the BMC’s address-space
Impact: Arbitrary reads and writes to the BMC address-space
Risk: High – known vulnerability and explicitly used as a feature in some platform designs
Mitigation: Can be disabled by configuring a bit in the BMC’s LPC controller, however see Pt II.

iLPC2AHB bridge Pt II

State: Enabled at cold start
Description: The bit disabling the iLPC2AHB bridge only removes write access – reads are still possible.
Impact: Arbitrary reads of the BMC address-space
Risk: High – we expect the capability and mitigation are not well known, and the mitigation has side-effects
Mitigation: Disable SuperIO decoding on the LPC bus (0x2E/0x4E decode). Decoding is controlled via hardware strapping and can be turned off at runtime, however disabling SuperIO decoding also removes the host’s ability to configure SUARTs, System wakeups, GPIOs and the BMC/Host mailbox

PCIe VGA P2A bridge

State: Enabled at cold start
Description: The VGA graphics device provides a host-controllable window mapping onto the BMC address-space
Impact: Arbitrary reads and writes to the BMC address-space
Risk: Medium – the capability is known to some platform integrators and may be disabled in some firmware stacks
Mitigation: Can be disabled or filter writes to coarse-grained regions of the AHB by configuring bits in the System Control Unit

DMA from/to arbitrary BMC memory via X-DMA

State: Enabled at cold start
Description: X-DMA available from VGA and BMC PCI devices
Impact: Misconfiguration can expose the entirety of the BMC’s RAM to the host
AST2400 Risk: High – SDK u-boot does not constrain X-DMA to VGA reserved memory
AST2500 Risk: Low – SDK u-boot restricts X-DMA to VGA reserved memory
Mitigation: X-DMA accesses are configured to remap into VGA reserved memory in u-boot

UART-based SoC Debug interface

State: Enabled at cold start
Description: Pasting a magic password over the configured UART exposes a hardware-provided debug shell. The capability is only exposed on one of UART1 or UART5, and interactions are only possible via the physical IO port (cannot be accessed from the host)
Impact: Misconfiguration can expose the BMC’s address-space to the network if the BMC console is made available via a serial concentrator.
Risk: Low
Mitigation: Can be disabled by configuring a bit in the System Control Unit

LPC2AHB bridge

State: Disabled at cold start
Description: Maps LPC Firmware cycles onto the BMC’s address-space
Impact: Misconfiguration can expose vulnerable parts of the BMC’s address-space to the host
Risk: Low – requires reasonable effort to configure and enable.
Mitigation: Don’t enable the feature if not required.
Note: As a counter-point, this feature is used legitimately on OpenPOWER systems to expose the boot flash device content to the host

PCIe BMC P2A bridge

State: Disabled at cold start
Description: PCI-to-BMC address-space bridge allowing memory and IO accesses
Impact: Enabling the device provides limited access to BMC address-space
Risk: Low – requires some effort to enable, constrained to specific parts of the BMC address space
Mitigation: Don’t enable the feature if not required.

Watchdog setup

State: Required system function, always available
Description: Misconfiguring the watchdog to use “System Reset” mode for BMC reboot will re-open all the “enabled at cold start” backdoors until the firmware reconfigures the hardware otherwise. Rebooting the BMC is generally possible from the host via IPMI “mc reset” command, and this may provide a window of opportunity for BMC compromise.
Impact: May allow arbitrary access to BMC address space via any of the above mechanisms
Risk: Low – “System Reset” mode is unlikely to be used for reboot due to obvious side-effects
Mitigation: Ensure BMC reboots always use “SOC Reset” mode

The CVSS score for these vulnerabilities is: https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=3DAV:A/AC:L/PR:=N/UI:N/S:U/C:H/I:H/A:H/E:F/RL:U/RC:C/CR:H/IR:H/AR:M/MAV:L/MAC:L/MPR:N/MUI:N=/MS:U/MC:H/MI:H/MA:H

There is some debate on if this is a local or remote vulnerability, and it depends on if you consider the connection between the BMC and the host processor as a network or not.

The fix is platform dependent as it can involve patching both the BMC firmware and the host firmware.

For example, we have mitigated these vulnerabilities for OpenPOWER systems, both on the host and BMC side. OpenBMC has a u-boot patch that disables the features:

https://gerrit.openbmc-project.xyz/#/c/openbmc/meta-phosphor/+/13290/

Which platforms can opt into in the following way:

https://gerrit.openbmc-project.xyz/#/c/openbmc/meta-ibm/+/17146/

The process is opt-in for OpenBMC platforms because platform maintainers have the knowledge of if their platform uses affected hardware features. This is important when disabling the iLPC2AHB bridge as it can be a bit of a finicky process.

See also https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/11164 for a WIP OpenBMC Security Architecture document which should eventually contain all these details.

For OpenPOWER systems, the host firmware patches are contained in op-build v2.0.11 and enabled for certain platforms. Again, this is not by default for all platforms as there is BMC work required as well as per-platform changes.

Credit for finding these problems: Andrew Jeffery, Benjamin
Herrenschmidt, Jeremy Kerr, Russell Currey, Stewart Smith. There have been many more people who have helped with this issue, and they too deserve thanks.

Switching to iPhone Part 2: Seriously?

In which I ask of Apple, “Seriously?”.

That was pretty much my reaction with Apple sticking to Lightning connectors rather than going with the USB-C standard. Having USB-C around the place for my last two (Android) phones was fantastic. I could charge a phone, external battery, a (future) laptop, all off the same wall wart and with the same cable. It is with some hilarity that I read that the new iPad Pro has USB-C rather than Lightning.

But Apple’s dongle fetish reigns supreme, and so I get a multitude of damn dongles all for a wonderfully inflated price with an Australia Tax whacked on top.

The most egregious one is the Lightning-to-3.5mm dongle. In the office, I have a good set of headphones. The idea is to block out the sound of an open plan office so I can actually get some concentrating done. With tiny dedicated MP3 players and my previous phones, these sounded great. The Apple dongle? It sounds terrible. Absolutely terrible. The Lighting-to-3.5mm adapter might be okay for small earbuds but it is nearly completely intolerable for any decent set of headphones. I’m now in the market for a Bluetooth headphone amplifier. Another bunch of money to throw at another damn dongle.

Luckily, there seems to be a really good Bluetooth headphone amplifier on Amazon. The same Amazon that no longer ships to Australia. Well, there’s an Australian seller, for six times the price.

Urgh.

Tracing flash reads (and writes) during boot

On OpenPOWER POWER9 systems, we typically talk to the flash chips that hold firmware for the host (i.e. the POWER9) processor through a daemon running on the BMC (aka service processor) rather than directly.

We have host firmware map “windows” on the LPC bus to parts of the flash chip. This flash chip can in fact be a virtual one, constructed dynamically from files on the BMC.

Since we’re mapping windows into this flash address space, we have some knowledge as to what IO the host is doing to/from the pnor. We can use this to output data in the blktrace format and feed into existing tools used to analyze IO patterns.

So, with a bit of learning of the data format and learning how to drive the various tools, I was ready to patch the BMC daemon (mboxbridge) to get some data out.

An initial bit of data is a graph of the windows into PNOR opened up during an normal boot (see below).

PNOR windows created over the course of a normal boot.

This shows us that over the course of the boot, we open a bunch of windows, and switch them around a fair bit early on. This makes sense as early in boot we do not yet have DRAM working and page in firmware on-demand into L3 cache.

Later in boot, you can see the loading of larger chunks of firmware into memory. It’s also possible to see that this seems to take longer than it should – and indeed, we have a bug there.

Next, by modifying the code again, I introduced recording of when we used a window that the BMC had already cached. While the host will only see one window at a time, the BMC can keep around the ones it prepared earlier in order to avoid IO to the actual flash chips (which are SPI flash, so aren’t incredibly fast).

Here we can see that we’re likely not doing the most efficient things during boot, and there’s probably room for some optimization.

Normal boot but including re-used Windows rather than just created ones

Finally, in order to get finer grained information, I reduced the window size from one megabyte down to 4096 bytes. This will impose a heavy speed penalty as it’ll mean we will have to create a lot more windows to do the same amount of IO, but it means that since we’re using the page size of hostboot, we’ll see each individual page in/out operation that it does during boot.

So, from the next graph, we can see that there’s several “hot” areas of the image, and on the whole it’s not too many pages. This gives us a hint that a bit of effort to reduce binary image size a little bit could greatly reduce the amount of IO we have to do.

4096 byte (i.e. page) size window, capturing the bits of flash we need to read in several times due to being low on memory when we’re L3 cache constrained.

The iowatcher tool also can construct a video of the boot and what “blocks” are being read.

Video of what blocks are read from flash during booting

So, what do we get from this adventure? Well, we get a good list of things to look into in order to improve boot performance, and we get to back these up with data rather than guesswork. Since this also works on unmodified host firmware, we’re measuring what we really boot rather than changing it in order to measure it.

What you need to reproduce this:

Switching to iPhone: Part 1

I have used Android phones since the first one: the G1. I’m one of the (relatively) few people who has used Android 1.0. I’ve had numerous Android phones since then, mostly the Google flagship.

I have fond memories of the Nexus One and Galaxy Nexus, as well as a bunch of time running Cyanogen (often daily builds, because YOLO) to get more privacy preserving features (or a more recent Android). I had a Sony Z1 Compact for a while which was great bang for buck except for the fact the screen broke whenever you looked at it sideways. Great kudos to the Sony team for being so friendly to custom firmware loads.

I buy my hardware from physical stores. Why? Well, it means that the NSA and others get to spend extra effort to insert hardware modifications (backdoors), as well as the benefit of having a place to go to/set the ACCC on to get my rights under Australian Consumer Law.

My phone before last was a Nexus 5X. There were a lot of good things about this phone; the promise of fast charging via USB-C was one, as was the ever improving performance of the hardware and Android itself. Well… it just got progressively slower, and slower, and slower – as if it was designed to get near unusable by the time of the next Google phone announcement.

Inevitably, my 5X succumbed to the manufacturing defect that resulted in a boot loop. It would start booting, and then spontaneously reboot, in a loop, forever. The remedy? Replace it under warranty! That would take weeks, which isn’t a suitable timeframe in this day and age to be without a phone, so I mulled over buying a Google Pixel or my first ever iPhone (my iPhone owning friends assured me that if such a thing happens with an iPhone that Apple would have swapped it on the spot). Not wanting to give up a lot of the personal freedom that comes with the Android world, I spent the $100 more to get the Pixel, acutely aware that having a phone was now a near $1000/year habit.

The Google Pixel was a fantastic phone (except the price, they should have matched the iPhone price). The camera was the first phone camera I actually went “wow, I’m impressed” over. The eye-watering $279 to replace a cracked screen, the still eye-watering cost of USB-C cables, and the seat to process the HDR photos were all forgiven. It was a good phone. Until, that is, less than a year in, the battery was completely shot. It would power off when less than 40% and couldn’t last the trip from Melbourne airport to Melbourne city.

So, with a flagship phone well within the “reasonable quality” time that consumer law would dictate, I contacted Google after going through all the standard troubleshooting. Google agreed this was not normal and that the phone was defective. I was told that they would mail me a replacement, I could transfer my stuff over and then mail in the broken one. FANTASTIC!! This was soooo much better than the experience with the 5X.

Except that it wasn’t. A week later, I rang back to ask what was going on as I hadn’t received the replacement; it turns out Google had lied to me, I’d have to mail the phone to them and then another ten business days later I’d have a replacement. Errr…. no, I’ve been here before.

I rang the retailer, JB Hi-Fi; they said it would take them at least three weeks, which I told them was not acceptable nor a “reasonable timeframe” as dictated by consumer law.

So, with a bunch of travel imminent, I bought a big external USB-C battery and kept it constantly connected as without it the battery percentage went down faster than the minutes ticked over. I could sort it out once I was back from travel.

So, I’m back. In fact, I drove back from a weekend away and finally bit the bullet – I went to pick up a phone who’s manufacturer has a reputation of supporting their hardware.

I picked up an iPhone.

I figured I should write up how, why, my reasons, and experiences in switching phone platforms. I think my next post will be “Why iPhone and not a different Android”.

pwnm-sync: Synchronizing Patchwork and Notmuch

One of the core bits of infrastructure I use as a maintainer is Patchwork (I wrote about making it faster recently). Patchwork tracks patches sent to a mailing list, allowing me as a maintainer to track the state of them (New|Under Review|Changes Requested|Accepted etc), combine them into patch bundles, look at specific series, test results etc.

One of the core bits of software I use is my email client, notmuch. Most other mail clients are laughably slow and clunky, or just plain annoying for absorbing a torrent of mail and being able to deal with it or just plain ignore it but have it searchable locally.

You may think your mail client is better than notmuch, but you’re wrong.

A key feature of notmuch is tagging email. It doesn’t do the traditional “folders” but instead does tags (if you’ve used gmail, you’d be somewhat familiar).

One of my key work flows as a maintainer is looking at what patches are outstanding for a project, and then reviewing them. This should also be a core part of any contributor to a project too. You may think that a tag:unread and to:project-list@foo query would be enough, but that doesn’t correspond with what’s in patchwork.

So, I decided to make a tool that would add tags to messages in notmuch corresponding with the state of the patch in patchwork. This way, I could easily search for “patches marked as New in patchwork” (or Under Review or whatever) and see what I should be reviewing and looking at merging or commenting on.

Logically, this wouldn’t be that hard, just use the (new) Patchwork REST API to get the state of everything and issue the appropriate notmuch commands.

But just going one way isn’t that interesting, I wanted to be able to change the tags in notmuch and have them sync back up to Patchwork. So, I made that part of the tool too.

Introducing pwnm-sync: a tool to sync patchwork and notmuch.

notmuch-hello tag counts for pwnm-sync tagged
patches in patchwork
notmuch-hello tag counts for pwnm-sync tagged
patches in patchwork

With this tool I can easily see the patchwork state of any patch that I have in my notmuch database. For projects that I’m a maintainer on (i.e. can change the state of patches), If I update the patches of that email and run pwnm-sync again, it’ll update the state in patchwork.

I’ve been using this for a few weeks myself and it’s made my maintainer workflow significantly nicer.

It may also be useful to people who want to find what patches need some review.

The sync time is mostly dependent on how fast your patchwork instance is for API requests. Unfortunately, we need to make some improvements on the Patchwork side of things here, but a full sync of the above takes about 4 minutes for me. You can also add a –epoch option (with a date/time) to say “only fetch things from patchwork since that date” which makes things a lot quicker for incremental syncs. For me, I typically run it with an epoch of a couple of months ago, and that takes ~20-30 seconds to run. In this case, if you’ve locally updated a old patch, it will still sync that change up to patchwork.

Implementation Details

It’s a python3 script using the notmuch python bindings, the requests-futures module for asynchronous HTTP requests (so we can have the patchwork server assemble the next page of results while we process the previous one), and a local sqlite3 database to store state in so we can work out what changed locally / server side.

Getting it

Head to https://github.com/stewartsmith/pwnm-sync or just:

git clone https://github.com/stewartsmith/pwnm-sync.git

Optimizing database access in Django: A patchwork story

tl;dr: I made Patchwork a lot faster by looking at what database queries were being generated and optimizing them either by making Django produce better queries or by adding better indexes.

Introduction to Patchwork

One of the key bits of infrastructure a bunch of maintainers of Open Source Software use is a tool called Patchwork. We use it for a bunch of OpenPOWER firmware development, several Linux subsystems use it as well as freedesktop.org.

The purpose of Patchwork is to supplement the patches-to-a-mailing-list development work flow. It allows a maintainer to see all the patches that have been posted on the list, How many Acked-by/Reviewed-by/Tested-by replies they have, delegate responsibility for the patch to a co-maintainer, track (and change) the state of the patch (e.g. to “Under Review”, “Changes Requested”, or “Accepted”), and create bundles of patches to help in review and testing.

Since patchwork is an open source project itself, there’s several instances of it out there in common use. One of the main instances is https://patchwork.ozlabs.org/ which is (funnily enough) used by a bunch of people connected to OzLabs for projects that are somewhat connected to OzLabs. e.g. the linuxppc-dev project and the skiboot and petitboot projects. There’s also a kernel.org instance, which is used by some kernel subsystems.

Recent versions of Patchwork have added some pretty cool features such as the ability to integrate with CI systems such as Snowpatch which helps maintainers see if patches submitted are likely to break things.

Unfortunately, there’s also been some complaints that recent version of patchwork have gotten slower than previous ones. This may well be the case, or it could just be that the volume of patches is much higher and there’s load on the database. Anyway, after asking a few questions about what the size and scope was of the patchwork database on ozlabs.org, I went “hrm… this sounds like it shouldn’t really be a problem… perhaps I should look into this”.

Attacking the problem…

Every so often it is revealed that I know a little bit about databases.

Getting a development environment up for Patchwork is amazingly easy thanks to Docker and the great work of the Patchwork maintainers. The only thing you need to load in is an example dataset. I started by importing mail from a few mailing lists I’m subscribed to, which was Good Enough(TM) for an initial look.

Due to how Django forces us to design a database schema though, the suggested method of getting a sample data set will not mirror what occurs in a production system with multiple lists. It’s for this reason that I ended up using a copy of a live dataset for much of my work rather than constructing an artificial one.

Patchwork supports both a MySQL and PostgreSQL database backend. Since the one on ozlabs.org is backed by PostgreSQL, I ended up loading a large dataset into PostgreSQL for most of my work, although I also did some testing with MySQL.

The current patchwork.ozlabs.org instance has a database of around 13GB in side, with about a million patches. You may think this is big, my database brain goes “no, this is actually quite small and everything should be a lot faster than it is even on quite limited hardware”

The problem with ORMs

It turns out that Patchwork is written in Django, an ORM (Object-Relational Mapping) framework in Python – and thus something that pretty effectively obfuscates application code from the SQL being run.

There is one thing that Django misses that could be a pretty big general performance boost to many applications: it doesn’t support composite primary keys. For some databases (e.g. MySQL’s InnoDB engine) the PRIMARY KEY is a clustered index – that is, the physical layout of the rows on disk reflect primary key order. You can use this feature to your advantage and have much higher cache hits of your database pages.

Unfortunately though, we cannot do that with Django, so we lose a bunch of possible performance because of it (especially for queries that are going to have to bring in data from disk). In fact, we’re forced to use an ID field that’ll scatter our rows all over the place rather than do something efficient. You can somewhat get back some of the performance by creating covering indexes, but this costs in terms of index maintenance and disk space.

It should be noted that PostgreSQL doesn’t have a similar concept, although there is a (locking) CLUSTER statement that can (as an offline operation for the table) re-arrange existing rows to be in index order. In my testing, this can give a bit of a boost to performance of some of the Patchwork queries.

With MySQL, you’d look at a bunch of statistics on what pages are being brought in and paged out of the InnoDB buffer pool. With PostgreSQL it’s a bit more complex as it relies heavily on the OS page cache.

My main experience is with MySQL like environment, so I’ve had to re-learn a bunch of PostgreSQL things in this work which was kind of fun. It may be “because of my upbringing” but it seems as if there’s a lot more resources and documentation out in the wild about optimizing MySQL environments than PostgreSQL ones, especially when it comes to documentation around a bunch of things inside the database server. A lot of credit should go to the MySQL Documentation team – I wish the PostgreSQL documentation was up to the same standard.

Another issue is that fetching BLOBs is generally an expensive operation that you want to avoid unless you’re going to use them. Thus, fetching the whole “object” at once isn’t always optimal. The Django query generation appears to be somewhat buggy when it comes to “hey, don’t fetch these columns, I don’t need them”, so you do have to watch what query is produced not just what query you expect to be produced. For example, [01/11] Improve patch listing performance (~3x).

Another issue with Django is how you go from your Python code to an actual SQL query, especially when the produced SQL query is needlessly complex or inefficient. I’ve seen Django always produce an ORDER BY for one table, even when not needed, I’ve also seen it always join tables even when you’re getting no columns from one of them and there’s no way you’re asking for it. In fact, I had to revert to raw SQL for one of my performance improvements as I just couldn’t beat it into submission: [10/11] Be sensible computing project patch counts.

An ORM can be great for getting apps out quickly, or programming in a familiar way. But like many things, an understanding of what is going on underneath is key for extracting maximum performance.

Also, if you ever hear something like “ORM $x doesn’t scale” then maybe that person just hasn’t looked at how to use the ORM better. The same goes for if they say “database $y doesn’t scale”- especially if it’s a long existing relational database such as MySQL or PostgreSQL.

Speeding up listing current patches for a project

17 SQL queries in 4477ms
More than 4 seconds in the database
does not make page load time great.

Fortunately though, the Django development environment lets you really easily dive into what queries are being generated and (at least roughly) where they’re being generated from. There’s a sidebar in your browser that shows how many SQL queries were needed to generate the page and how long they took. The key to making your application go faster is to run fewer queries in less time.

I was incredibly impressed with how easy it was to see what queries were run, where they were run from, and the EXPLAIN output for them.

By clicking on that SQL button on the right side of your browser, you get this wonderful chart of what queries were executed, when, and how long they took. From this, it is incredibly obvious which query is the most problematic: the one that took more than four seconds!

In the dim dark days of web development, you’d have to turn on a Slow Query Log on the database server and then grep through your source code or some other miserable activity. I am so glad I didn’t have to do that.

More than four seconds for a single database query does not make for a nice UX.

This particular query was a real hairy one, the EXPLAIN output from PostgreSQL (and MySQL) was certainly long and involved and would most certainly not feature in the first half of an “Introduction to query optimization” day long workshop. If you haven’t brushed up on various bits of documentation on understanding EXPLAIN, you should! The MySQL EXPLAIN FORMAT=JSON is especially fantastic for getting deep details as to what’s going on with query execution.

The big performance gain here was to have the database be able to execute the query in a much more efficient way by creating two covering indexes for part of the query. To work out what indexes to create, one has to look at the EXPLAIN output and work out why the database is choosing to do either a sequential scan of a large table, or use an index that doesn’t exclude that many rows. In this case, I tweaked the code to slightly change the query that was generated as well as adding a covering index. What we ended up with is something that is dramatically faster.

The main query is ~350x faster than before

You’ll notice that it appears that the first query there takes a lot more time but it doesn’t, it just takes a lot more time relative to the main query.

In fact, this particular page is one that people have mentioned at being really, really slow to load. With the main query now about 350 times faster than it was originally, it shouldn’t be a problem anymore.

A future improvement would be to cache the COUNT() for the common case, as it’s pretty easily computed when new patches come in or states change.

The patches that help this particular page have been submitted upstream here:

Making viewing a patch with comments faster

Now that we can list patches faster, can we make other pages that Patchwork has quicker?

One such page is viewing a patch (or cover letter) that has a lot of comments on it. One feature of Patchwork is that it will display all the email replies to a patch or cover letter in the Web UI. But… this seemed slow

On one of the most commented patches I could find, we ended up executing one hundred and seventy seven SQL queries to view it! If we dove into it, a bunch of the queries looked really really similar…

I’ve got 99 queries where I only need 1.

The problem here is that the Patchwork UI is wanting to find out the name of each person who submitted a comment, and is doing that by querying the ID from a table. What it should be doing instead is a SQL JOIN on the original query and just fetching all that information in one go: make the database server do the work, it’s really good at it.

My patch [02/11] 4x performance improvement for viewing patch with many comments   does just that by using the select_related() method correctly, as well as being explicit about what information we want to retrieve.

We’re now only a few milliseconds to grab all the comments

With that patch, we’re down to a constant number of queries and around a 3x-7x faster time executing them depending if we have a warm cache or not.

The one time I had to use raw SQL

When viewing a project page (such as https://patchwork.ozlabs.org/project/qemu-devel/ ) it displays the number of patches (archived and not archived) for the project. By looking at what SQL queries are executed to collect these numbers, you’ll notice two things. First, here are the queries:

COUNT() queries can be expensive

First thing you’ll notice is that they took a loooooong time to execute (more than a second each). The second thing, if you look closer, is that they contain a join which is completely unneeded.

I spent a good long while trying to make Django behave, and I just could not. I believe it’s due to the model having some inheritance in it. Writing the query by hand ended up being the best solution, and it gave a significant performance improvement:

Unfortunately, only 4x faster.

Arguably, a better way would be to precompute the count for the archived/non-archived patches and just display them. I (or someone else who knows more about Django) may want to look at that for a future improvement.

Conclusion and final thoughts

There’s a few more places where there could be some optimizations, but currently I cannot get any single page to take more than between 40-400ms in the database when running on my laptop – and that’s Good Enough(TM) for now.

The next steps are getting these patches through a round or two of review, and then getting them into a Patchwork release and deployed out on patchwork.ozlabs.org and see if people can find any new ways to make things slow.

If you’re interested, the full patchset with cover letter is here: [00/11] Performance for ALL THE THINGS!

The diffstat is interesting, as most of the added code is auto-generated by Django for database migrations (adding of indexes).

 .../migrations/0027_add_comment_date_index.py | 23 +++++++++++++++++
 .../0028_add_list_covering_index.py           | 19 ++++++++++++++
 .../0029_add_submission_covering_index.py     | 19 ++++++++++++++
 patchwork/models.py                           | 21 ++++++++++++++--
 patchwork/templates/patchwork/submission.html | 16 ++++++------
 patchwork/views/__init__.py                   |  8 +++++-
 patchwork/views/cover.py                      |  5 ++++
 patchwork/views/patch.py                      |  7 ++++++
 patchwork/views/project.py                    | 25 ++++++++++++++++---
 9 files changed, 128 insertions(+), 15 deletions(-)
 create mode 100644 patchwork/migrations/0027_add_comment_date_index.py
 create mode 100644 patchwork/migrations/0028_add_list_covering_index.py
 create mode 100644 patchwork/migrations/0029_add_submission_covering_index.py

I think the lesson is that making dramatic improvements to performance of your Django based app does not mean you have to write a lot of code or revert to raw SQL or abandon your ORM. In fact, use it properly and you can get a looong way. It’s just that to use it properly, you’re going to have to understand the layer below the ORM, and not just treat the database as a magic black box.

ccache and op-build

You may have heard of ccache (Compiler Cache) which saves you heaps of real world time when rebuilding a source tree that is rather similar to one you’ve recently built before. It’s really useful in buildroot based projects where you’re building similar trees, or have done a minor bump of some components.

In trying to find a commit which introduced a bug in op-build (OpenPOWER firmware), I noticed that hostboot wasn’t being built using ccache and we were always doing a full build. So, I started digging into it.

It turns out that a bunch of the perl scripts for parsing the Machine Readable Workbook XML in hostboot did a bunch of things like foreach $key (%hash) – which means that the code iterates over the items in hash order rather than an order that would produce predictable output such as “attribute name” or something. So… much messing with that later, I had hostboot generating the same output for the same input on every build.

Next step was to work out why I was still getting a lot of CCACHE misses. It turns out the default ccache size is 5GB. A full hostboot build uses around 7.1GB of that.

So, if building op-build with CCACHE, be sure to set both BR2_CCACHE=y in your config as well as something like BR2_CCACHE_INITIAL_SETUP="--max-size 20G"

Hopefully my patches hit hostboot and op-build soon.

A (simplified) view of OpenPOWER Firmware Development

I’ve been working on trying to better document the whole flow of code that goes into a build of firmware for an OpenPOWER machine. This is partially to help those not familiar with it get a better grasp of the sheer scale of what goes into that 32/64MB of flash.

I also wanted to convey the components that we heavily re-used from other Open Source projects, what parts are still “IBM internal” (as they relate to the open source workflow) and which bits are primarily contributed to by IBMers (at least at this point in time).

As such, let’s start with the legend of the diagram:

Now, the diagram:

Simplified development flow for OpenPOWER firmware

The end thing that a user with a machine will download and apply (or that comes shipped with a box) is the purple “Installable Firmware Release” nodes (bottom center). In this diagram, there are 4 of them. One for POWER9 systems such as the just-announced AC922 system (this is the “OP910 Release” node, which is the witherspoon_defconfig in the op-build tree); one for the p9dsu platform (p9dsu_defconfig in op-build) and one is for IBM FSP based systems such as the S812L and S822L systems (or S812/S822 in OPAL mode).

There are more platforms out there, but this diagram is meant to be simplified. The key difference with the p9dsu platform is that this is produced by somebody other than IBM.

All of these releases are based off the upstream op-build project, op-build is the light blue box in the center of the diagram. We do regular X.Y releases and sometimes do X.Y.Z releases. It’s primarily a pull request based workflow currently, so everything goes via a pull request. The op-build project brings together all the POWER specific firmware components (pretty much everything in every other light blue/blue box) along with a Linux kernel and buildroot.

The kernel and buildroot are the two big yellow boxes on the top right. Buildroot brings together a lot of open source components that are in our firmware image (including some power specific ones that we get through upstream buildroot).

For Linux, this is a pretty simplified view of the process, but we primarily ship the stable tree (with maybe up to half a dozen patches).

The skiboot and petitboot components both use a mailing list based workflow (similar to kernel) as well as X.Y and X.Y.Z releases (again, similar to the linux kernel).

On the far left of the diagram, we have Hostboot, SBE and OCC. These are three firmware components that come from the traditional IBM POWER Firmware group, and are shared with the IBM non-OpenPOWER POWER systems (“traditional” POWER). These components have part of their code from from an (internal) repository called “ekb” which also goes into a (very) low level debug tool and the FSP based systems. There’s also an (internal) gerrit instance that’s the primary place where code review/development discussions are for these components.

In future posts, I’ll probably delve into more specifics of the current development process, and how we may try and change things for the better.

How I do email (at work)

Recently, I blogged on my home email setup and in that post, I hinted that my work setup was rather different. I have entirely separate computing devices for work and personal, a setup I strongly recommend. This also lets me “go home” from work even when working from home, I use a different physical machine!

Since I work for IBM I have (at least) two email accounts for work: a Lotus Notes one and a internet standards compliant one. It’s “easy” enough to get the Notes one to forward to the standards compliant one, from which I can just use fetchmail or similar to pull down mail.

I run mail through a rather simple procmail script: it de-mangles some URL mangling that can happen in the current IBM email infrastructure, runs things through SpamAssassin and deliver to a date based Maildir (or one giant pile for spam).

My ~/.procmailrc looks something like this:

LOGFILE=$HOME/mail_log
LOGABSTRACT=yes

DATE=`date +"%Y%m"`
MAILDIR=Maildir/INBOX
DEFAULT=$DATE/

:0fw
| magic_script_to_unmangle_things

:0fw
| spamc

:0
* ^X-Spam-Status: Yes
$HOME/Maildir/junkmail/incoming/

I use tail -f mail_log as a really dumb kind of biff replacement.

Now, what do I read and write mail with? Notmuch! It is the only thing that even comes close to being able to deal with a decent flow of mail. I have a couple of saved searches just to track how much mail I pull in a day/week. Today (on Monday), it says 442 today and 10,403 over the past week.

For the most part, my workflow is kind of INBOX-ZERO like, except that I currently view victory as INBOX 2000. Most mail does go into my INBOX, the notable exceptions are two main mailing lists I’m subscribed to mostly as FYI and to search/find things when needed. Those are the Linux Kernel Mailing List (LKML) and the buildroot mailing list. Why notmuch rather than just searching the web for mailing list archives? Notmuch can return the result of a query in less time it takes light to get to and from the United States in ideal conditions.

For work, I don’t sync my mail anywhere. It’s just on my laptop. Not having it on my phone is a feature. I have a notmuch post-new hook that does some initial tagging of mail, and as such I have this in my ~/.notmuch-config:

[new]
tags=new;

My post-new hook looks like this:

#!/bin/bash

# immediately archive all messages from "me"
notmuch tag -new -- tag:new and from:stewart@linux.vnet.ibm.com

# tag all message from lists
notmuch tag +devicetree +list -- tag:new and to:devicetree@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:devicetree and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread +list -new -- tag:new and tag:devicetree and not to:stewart@linux.vnet.ibm.com

notmuch tag +linuxppc +list -- tag:new and to:linuxppc-dev@lists.ozlabs.org
notmuch tag +linuxppc +list -- tag:new and cc:linuxppc-dev@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:linuxppc
notmuch tag +openbmc +list -- tag:new and to:openbmc@lists.ozlabs.org
notmuch tag +inbox +unread -new -- tag:new and tag:openbmc

notmuch tag +lkml +list -- tag:new and to:linux-kernel@vger.kernel.org
notmuch tag +inbox +unread -new -- tag:new and tag:lkml and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:lkml and not tag:linuxppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemuppc +list -- tag:new and to:qemu-ppc@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemuppc and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemuppc and not to:stewart@linux.vnet.ibm.com

notmuch tag +qemu +list -- tag:new and to:qemu-devel@nongnu.org
notmuch tag +inbox +unread -new -- tag:new and tag:qemu and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:qemu and not to:stewart@linux.vnet.ibm.com

notmuch tag +buildroot +list -- tag:new and to:buildroot@buildroot.org
notmuch tag +buildroot +list -- tag:new and to:buildroot@busybox.net
notmuch tag +buildroot +list -- tag:newa nd to:buildroot@uclibc.org
notmuch tag +inbox +unread -new -- tag:new and tag:buildroot and to:stewart@linux.vnet.ibm.com
notmuch tag +listinbox +unread -new -- tag:new and tag:buildroot and not to:stewart@linux.vnet.ibm.com

notmuch tag +ibmbugzilla -- tag:new and from:bugzilla@us.ibm.com

# finally, retag all "new" messages "inbox" and "unread"
notmuch tag +inbox +unread -new -- tag:new

This leaves me with both an inbox and a listinbox. I do not look at the overwhelming majority of mail that hits the listinbox – It’s mostly for following up on individual things. If I started to need to care more about specific topics, I’d probably add something in there for them so I could easily find them.

My notmuch emacs setup has a bunch of saved searches, so my notmuch-hello screen looks something like this:

This gets me a bit of a state-of-the-world-of-email-to-look-at view for the day. I’ll often have meetings first thing in the morning that may reference email I haven’t looked at yet, and this generally lets me quickly find mail related to the problems of the day and start being productive.

How I do email (at home)

I thought I might write something up on how I’ve been doing email both at home and at work. I very much on purpose keep the two completely separate, and have slightly different use cases for both of them.

For work, I do not want mail on my phone. For personal mail, it turns out I do want this on my phone, which is currently an Android phone. Since my work and personal email is very separate, the volume of mail is really, really different. Personal mail is maybe a couple of dozen a day at most. Work is… orders of magnitude more.

Considering I generally prefer free software to non-free software, K9 Mail is the way I go on my phone. I have it set up to point at the IMAP and SMTP servers of my mail provider (FastMail). I also have a google account, and the gmail app works fine for the few bits of mail that go there instead of my regular account.

For my mail accounts, I do an INBOX ZERO like approach (in reality, I’m pretty much nowhere near zero, but today I learned I’m a lot closer than many colleagues). This means I read / respond / do / ignore mail and then move it to an ARCHIVE folder. K9 and Gmail both have the ability to do this easily, so it works well.

Additionally though, I don’t want to care about limits on storage (i.e. expire mail from the server after X days), nor do I want to rely on “the cloud” to be the only copy of things. I also don’t want to have to upload any of past mail I may be keeping around. I also generally prefer to use notmuch as a mail client on a computer.

For those not familiar with notmuch, it does tags on mail in Maildir, is extremely fast and can actually cope with a quantity of mail. It also has this “archive”/INBOX ZERO workflow which I like.

In order to get mail from FastMail and Gmail onto a machine, I use offlineimap. An important thing to do is to set “status_backend = sqlite” for each Account. It turns out I first hacked on sqlite for offlineimap status a bit over ten years ago – time flies. For each Account I also set presynchook = ~/Maildir/maildir-notmuch-presync (below) and a postsynchook = notmuch new. The presynchook is run before we sync, and its job is to move files around based on the tags in notmuch and the postsynchook lets notmuch catch any new mail that’s been fetched.

My maildir-notmuch-presync hook script is:

#!/bin/bash
notmuch search --output=files not tag:inbox and folder:fastmail/INBOX|xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/fastmail/Archive/cur/"

notmuch search --output=files folder:fastmail/INBOX and tag:spam |xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/fastmail/Spam/cur/"
ARCHIVE_DIR=$HOME/Maildir/INBOX/`date +"%Y%m"`/cur/
mkdir -p $ARCHIVE_DIR
notmuch search --output=files folder:fastmail/Archive and date:..90d and not tag:flagged | xargs -I'{}' mv '{}' "$ARCHIVE_DIR"

# Gmail
notmuch search --output=files not tag:inbox and folder:gmail/INBOX|grep 'INBOX/gmail/INBOX/' | xargs -I'{}' rm '{}'
notmuch search --output=files folder:gmail/INBOX and tag:spam |xargs -I'{}' mv '{}' "$HOME/Maildir/INBOX/gmail/[Gmail].Spam/cur/"

So This keeps 90 days of mail on the fastmail server, and archives older mail off into month based archive dirs. This is simply to keep directory sizes not too large, you could put everything in one directory… but at some point that gets a bit silly.

I don’t think this is all the most optimal setup I could have, but it does let me read and answer mail on my phone and desktop (as well as use a web client if I want to). There is a bit of needless copying of messages by offlineimap under certain circumstances, but I don’t get enough personal mail for it to be a problem.