Using llvm-mca for predicting CPU cycle impact of code changes

Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor and see how many CPU cycles each instruction took, and then do the math as to how long a sequence of instructions would take to execute. This cycle counting was used pretty effectively to do really neat things such as how you’d get anything on the screen from an Atari 2600. Modern CPUs are… complex. They can do several things at once, in a different order than what you wrote them in, and have an interesting arrangement of shared resources to allocate.

So, unlike with simpler hardware, if you have a sequence of instructions for a modern processor, it’s going to be pretty hard to work out how many cycles that could take by hand, and it’s going to differ for each micro-architecture available for the instruction set.

When designing a microprocessor, simulating what a series of existing instructions will take to execute compared to the previous generation of microprocessor is pretty important. The aim should be for it to take less time or energy or some other metric that means your new processor is better than the old one. It can be okay if processor generation to generation some sequence of instructions take more cycles, if your cycles are more frequent, or power efficient, or other positive metric you’re designing for.

Programmers may want this simulation too, as some code paths get rather performance critical for certain applications. Open Source tools for this aren’t as prolific as I’d like, but there is llvm-mca which I (relatively) recently learned about.

llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU.

the llvm-mca docs

So, when looking at an issue in the IPv6 address and connection hashing code in Linux last year, and being quite conscious of modern systems dealing with a LOT of network packets, and thus this can be quite CPU usage sensitive, I wanted to make sure that my suggested changes weren’t going to have a large impact on performance – across the variety of CPU generations in use.

There’s two ways to do this: run everything, throw a lot of packets at something, and measure it. That can be a long dev cycle, and sometimes just annoying to get going. It can be a lot quicker to simulate the small section of code in question and do some analysis of it before going through the trouble of spinning up multiple test environments to prove it in the real world.

So, enter llvm-mca and the ability to try and quickly evaluate possible changes before testing them. Seeing as the code in question was nicely self contained, I could easily get this to a point where I could easily get gcc (or llvm) to spit out assembler for it separately from the kernel tree. My preference was for gcc as that’s what most distros end up compiling Linux with, including the Linux distribution that’s my day job (Amazon Linux).

In order to share the results of the experiments as part of the discussion on where the code changes should end up, I published the code and results in a github project as things got way too large to throw on a mailing list post and retain sanity.

I used a container so that I could easily run it in a repeatable isolated environment, as well as have others reproduce my results if needed. Different compiler versions and optimization levels will very much produce different sequences of instructions, and thus possibly quite different results. This delta in compiler optimization levels is partially why the numbers don’t quite match on some of the mailing list messages, although the delta of the various options was all the same. The other reason is learning how to better use llvm-mca to isolate down the exact sequence of instructions I was caring about (and not including things like the guesswork that llvm-mca has to do for branches).

One thing I learned along the way is how to better use llvm-mca to get the results that I was looking for. One trick is to very much avoid branches, as that’s going to be near complete guesswork as there’s not a simulation of the branch predictor (at least in the version I was using.

The big thing I wanted to prove: is doing the extra work having a small or large impact on number of elapsed cycles. The answer was that doing a bunch of extra “work” was essentially near free. The CPU core could execute enough things in parallel that the incremental cost of doing extra work just… wasn’t relevant.

This helped getting a patch deployed without impact to performance, as well as get a patch upstream, fixing an issue that was partially fixed 10 years prior, and had existed since day 1 of the Linux IPv6 code.

Naturally, this wasn’t a solo effort, and that’s one of the joys of working with a bunch of smart people – both at the same company I work for, and in the broader open source community. It’s always humbling when you’re looking at code outside your usual area of expertise that was written (and then modified) by Really Smart People, and you’re then trying to fix a problem in it, while trying to learn all the implications of changing that bit of code.

Anyway, check out llvm-mca for your next adventure into premature optimization, as if you’re going to get started with evil, you may as well start with what’s at the root of all of it.

Booting temporary firmware on the Raptor Blackbird

In a future post, I’ll detail how to build my ported-to-upstream Blackbird firmware. Here though, we’ll explore booting some firmware temporarily to experiment.

Step 1: Copy your new PNOR image over to the BMC.
Step 2: …
Step 3: Profit!

Okay, not really, once you’ve copied over your image, ensure the computer is off and then you can tell the daemon that provides firmware to the host to use a file backend for it rather than the PNOR chip on the motherboard (i.e. yes, you can boot your system even when the firmware chip isn’t there – although I’ve not literally tried this).

root@blackbird:~# mboxctl --backend file:/tmp/blackbird.pnor 
SetBackend: Success
root@blackbird:~# obmcutil poweron

If we look at the serial console (ssh to the BMC port 2200) we’ll see Hostboot start, realise there’s newer SBE code, flash it, and reboot:

--== Welcome to Hostboot hostboot-b284071/hbicore.bin ==--

  3.02606|secure|SecureROM valid - enabling functionality
  5.14678|Booting from SBE side 0 on master proc=00050000
  5.18537|ISTEP  6. 5 - host_init_fsi
  5.47985|ISTEP  6. 6 - host_set_ipl_parms
  5.54476|ISTEP  6. 7 - host_discover_targets
  6.56106|HWAS|PRESENT> DIMM[03]=8080000000000000
  6.56108|HWAS|PRESENT> Proc[05]=8000000000000000
  6.56109|HWAS|PRESENT> Core[07]=1511540000000000
  6.61373|ISTEP  6. 8 - host_update_master_tpm
  6.61529|SECURE|Security Access Bit> 0x0000000000000000
  6.61530|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
  6.61543|ISTEP  6. 9 - host_gard
  7.20987|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
  7.20988|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
  7.20989|HWAS|FUNCTIONAL> Core[07]=1511540000000000
  7.21299|ISTEP  6.11 - host_start_occ_xstop_handler
  8.28965|ISTEP  6.12 - host_voltage_config
  8.47973|ISTEP  7. 1 - mss_attr_cleanup
  9.07674|ISTEP  7. 2 - mss_volt
  9.35627|ISTEP  7. 3 - mss_freq
  9.63029|ISTEP  7. 4 - mss_eff_config
 10.35189|ISTEP  7. 5 - mss_attr_update
 10.38489|ISTEP  8. 1 - host_slave_sbe_config
 10.45332|ISTEP  8. 2 - host_setup_sbe
 10.45450|ISTEP  8. 3 - host_cbs_start
 10.45574|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 10.48675|ISTEP  8. 5 - host_attnlisten_proc
 10.50338|ISTEP  8. 6 - host_p9_fbc_eff_config
 10.50771|ISTEP  8. 7 - host_p9_eff_config_links
 10.53338|ISTEP  8. 8 - proc_attr_update
 10.53634|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 10.55234|ISTEP  8.10 - proc_xbus_scominit
 10.56202|ISTEP  8.11 - proc_xbus_enable_ridi
 10.57788|ISTEP  8.12 - host_set_voltages
 10.59421|ISTEP  9. 1 - fabric_erepair
 10.65877|ISTEP  9. 2 - fabric_io_dccal
 10.66048|ISTEP  9. 3 - fabric_pre_trainadv
 10.66665|ISTEP  9. 4 - fabric_io_run_training
 10.66860|ISTEP  9. 5 - fabric_post_trainadv
 10.67060|ISTEP  9. 6 - proc_smp_link_layer
 10.67503|ISTEP  9. 7 - proc_fab_iovalid
 11.10386|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 11.15103|ISTEP 10. 1 - proc_build_smp
 11.27537|ISTEP 10. 2 - host_slave_sbe_update
 11.68581|sbe|System Performing SBE Update for PROC 0, side 0
 34.50467|sbe|System Rebooting To Complete SBE Update Process
 34.50595|IPMI: Initiate power cycle
 34.54671|Stopping istep dispatcher
 34.68729|IPMI: shutdown complete

One of the improvements is we now get output from the SBE! This means that when we do things like mess up secure boot and non secure boot firmware (I’ll explain why/how this is a thing later), we’ll actually get something useful out of a serial port:

--== Welcome to SBE - CommitId[0x8b06b5c1] ==--
istep 3.19
istep 3.20
istep 3.21
istep 3.22
istep 4.1
istep 4.2
istep 4.3
istep 4.4
istep 4.5
istep 4.6
istep 4.7
istep 4.8
istep 4.9
istep 4.10
istep 4.11
istep 4.12
istep 4.13
istep 4.14
istep 4.15
istep 4.16
istep 4.17
istep 4.18
istep 4.19
istep 4.20
istep 4.21
istep 4.22
istep 4.23
istep 4.24
istep 4.25
istep 4.26
istep 4.27
istep 4.28
istep 4.29
istep 4.30
istep 4.31
istep 4.32
istep 4.33
istep 4.34
istep 5.1
istep 5.2
SBE starting hostboot

And then we’re back into normal Hostboot boot (which we’ve all seen before) and end up at a newer petitboot!

Petitboot 1.11 on a Raptor Blackbird

One notable absence from that screenshot is my installed Fedora is missing. This is because there appears to be a bug in the 5.3.7 kernel that’s currently upstream, and if we drop to the shell and poke at lspci and dmesg, we can work out what could be the culprit:

Exiting petitboot. Type 'exit' to return.
You may run 'pb-sos' to gather diagnostic data
No password set, running as root. You may set a password in the System Configuration screen.
# lspci
0000:00:00.0 PCI bridge: IBM Device 04c1
0001:00:00.0 PCI bridge: IBM Device 04c1
0001:01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
0002:00:00.0 PCI bridge: IBM Device 04c1
0002:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0003:00:00.0 PCI bridge: IBM Device 04c1
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM Device 04c1
0004:01:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM Device 04c1
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
# dmesg|grep -i nvme
[    2.991038] nvme nvme0: pci function 0001:01:00.0
[    2.991088] nvme 0001:01:00.0: enabling device (0140 -> 0142)
[    3.121799] nvme nvme0: Identify Controller failed (19)
[    3.121802] nvme nvme0: Removing after probe failure status: -5
# uname -a
Linux skiroot 5.3.7-openpower1 #2 SMP Sat Dec 14 09:06:20 PST 2019 ppc64le GNU/Linux

If for some reason the device didn’t show up in lspci, then I’d look at the skiboot firmware log, which is /sys/firmware/opal/msglog.

Looking at upstream stable kernel patches, it seems like 5.3.8 has a interesting looking patch when you realize that ppc64le uses a 64k page size:

commit efac0f186ea654e8389f5017c7f643ef48cb4b93
Author: Kevin Hao <haokexin@gmail.com>
Date:   Fri Oct 18 10:53:14 2019 +0800

    nvme-pci: Set the prp2 correctly when using more than 4k page
    
    commit a4f40484e7f1dff56bb9f286cc59ffa36e0259eb upstream.
    
    In the current code, the nvme is using a fixed 4k PRP entry size,
    but if the kernel use a page size which is more than 4k, we should
    consider the situation that the bv_offset may be larger than the
    dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then
    cause the command can't be executed correctly.
    
    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kevin Hao <haokexin@gmail.com>
    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

So, time to go try 5.3.8. My yaks are getting quite smooth.

Oh, and when you’re done with your temporary firmware, either fiddle with mboxctl or restart the systemd service for it, or reboot your BMC or… well, I gotta leave you something to work out on your own :)

MySQL 5.7.5 on POWER – thread priority

Good news everyone!

MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven’t yet gone and tried it all out, but since I’m me, I look at bugs database and git/bzr history first.

On Intel CPUs, when you’re spinning on a spin lock, you’re meant to execute the PAUSE CPU instruction. This tells the CPU that other execution threads in the same core should be given priority as you are currently not doing anything productive. Without this, you’re likely going to hurt on hyperthreaded CPUs.

In MySQL, there are custom spinlocks in order to do interesting adaptive mutex things to attempt to squeeze the most performance possible out of modern systems.

One of the (not 100% ready, but close) bugs with patches I submitted against MySQL 5.7 was for using the equivalent of the PAUSE instruction for POWER CPUs. On POWER, we’re a bit different, you can actually set priorities of threads (which may matter more, as POWER8 CPUs can be in SMT8 mode – where there are *eight* executing threads per core).

So, the good news is that in MySQL 5.7.5, the magic instructions for setting thread priority are in! This should mean great things for performance on POWER systems with any of the SMT modes enabled.

The next interesting part of this is how it interacts with other KVM guests on a system. At least on POWER (and on x86 as well, although I won’t go into details here) there’s a hypervisor call that a guest can make saying “hey, I’m spinning here, perhaps you want to make sure other vcpus execute so that at some point I can continue”. On POWER, this is the H_CONFER hcall, where you can basically do a directed yield to another vcpu (the one that holds the lock you’re trying to get is a good idea).

Generally though, it’s only the guest kernel that does this, not userspace. You can see the H_CONFER call in __spin_yield(arch_spinlock_t*) and __rw_yield(arch_rwlock_t*) in arch/powerpc/lib/locks.c in the kernel.

It would be interesting to see what extra we could get out of a system running multiple guests with MySQL servers if InnoDB/MySQL could properly yield to the right vcpu (well, thread I guess).

Tyan OpenPower

Good news everyone! Tyan has announced the availability of their first OpenPOWER system! They call this a Customer Reference System, which means it’s an excellent machine to start poking at OpenPower and POWER8 (or deploying applications on).

Because it’s an OpenPower machine, it runs the open source Open Power firmware (all up on github) and will happily run Linux (feel free to port your other operating system kernels). I’ll be writing more on the OpenPower firmware soon as, well, technical details are fun!

Ubuntu 14.10 is listed as recommended as not only have they been building for POWER8 but have spent some time ensuring things work fairly well out-of-the-box (both as a KVM guest and running native on the bare metal). Or, you can always just boot whatever the mainline kernel is at – build for the POWERNV (POWER non-virtualized) platform (be sure to include all the required drivers) and have fun!