Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor and see how many CPU cycles each instruction took, and then do the math as to how long a sequence of instructions would take to execute. This cycle counting was used to great effect for really neat things, such as getting anything onto the screen at all from an Atari 2600. Modern CPUs are… complex. They can do several things at once, execute instructions in a different order than you wrote them in, and have an interesting arrangement of shared resources to allocate.
So, unlike with simpler hardware, if you have a sequence of instructions for a modern processor, it's going to be pretty hard to work out by hand how many cycles it will take, and the answer is going to differ for each micro-architecture available for the instruction set.
When designing a microprocessor, simulating how long a series of existing instructions will take to execute, compared to the previous generation of microprocessor, is pretty important. The aim should be for it to take less time, or energy, or some other metric that means your new processor is better than the old one. It can be okay if, generation to generation, some sequence of instructions takes more cycles, provided your cycles are more frequent, or more power efficient, or better on whatever other metric you're designing for.
Programmers may want this simulation too, as some code paths get rather performance critical for certain applications. Open source tools for this aren't as prolific as I'd like, but there is llvm-mca, which I (relatively) recently learned about.
"llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU."
(the llvm-mca docs)
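If you haven't run it before, the basic workflow is to feed llvm-mca a stream of assembly and tell it which CPU model to simulate. A minimal sketch (hash.c and the target CPU here are just placeholders):

    # Emit assembly for a small C file and pipe it straight into llvm-mca,
    # modelling a specific microarchitecture.
    clang -O2 -S -o - hash.c | llvm-mca -mcpu=skylake

    # -timeline shows each instruction's progress through the pipeline, and
    # -iterations controls how many times the block is notionally executed.
    clang -O2 -S -o - hash.c | llvm-mca -mcpu=skylake -timeline -iterations=300

The output includes total cycles, dispatch width, IPC, and per-resource pressure, which is the kind of data the rest of this post leans on.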
So, when looking at an issue in the IPv6 address and connection hashing code in Linux last year, I was quite conscious that modern systems deal with a LOT of network packets, which makes this code quite sensitive to CPU usage. I wanted to make sure that my suggested changes weren't going to have a large impact on performance across the variety of CPU generations in use.
There are two ways to do this. One is to run everything: throw a lot of packets at something and measure it. That can be a long dev cycle, and sometimes just annoying to get going. The other is to simulate the small section of code in question and do some analysis of it, which can be a lot quicker, before going through the trouble of spinning up multiple test environments to prove it in the real world.
So, enter llvm-mca and the ability to quickly evaluate possible changes before testing them. Seeing as the code in question was nicely self contained, I could get it to a point where I could have gcc (or llvm) spit out the assembler for it separately from the kernel tree. My preference was for gcc, as that's what most distros end up compiling Linux with, including the Linux distribution that's my day job (Amazon Linux).
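As a sketch of what that looks like (the file name and CPU model are illustrative, not the actual ones from the project):

    # Build the extracted, self-contained copy of the code with gcc at the
    # optimization level you care about, keeping just the assembly.
    gcc -O2 -S -o inet6_hash.s inet6_hash.c

    # llvm-mca consumes plain assembly, so gcc output generally works as
    # input too; pick the CPU model for the hardware generation of interest.
    llvm-mca -mcpu=skylake inet6_hash.s

Repeating the llvm-mca step with a different -mcpu is how you get a view across CPU generations without having any of that hardware handy.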
In order to share the results of the experiments as part of the discussion on where the code changes should end up, I published the code and results in a GitHub project, as things got way too large to throw into a mailing list post and retain sanity.
I used a container so that I could easily run it in a repeatable, isolated environment, as well as have others reproduce my results if needed. Different compiler versions and optimization levels will very much produce different sequences of instructions, and thus possibly quite different results. That difference is partially why the numbers don't quite match some of the mailing list messages, although the relative difference between the various options was the same. The other reason is that I learned how to better use llvm-mca to isolate the exact sequence of instructions I cared about (and to exclude things like the guesswork llvm-mca has to do for branches).
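The container side of it doesn't need to be anything fancy. Something with this general shape is enough (the image, packages, and driver script below are assumptions, not what the project actually uses):

    # Run the analysis in a pinned container image so toolchain versions are
    # fixed and anyone can reproduce the numbers; run-analysis.sh is a
    # hypothetical script that builds the .s files and runs llvm-mca on them.
    podman run --rm -v "$PWD":/src -w /src fedora:39 \
        bash -c 'dnf install -y gcc clang llvm make && ./run-analysis.sh'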
One thing I learned along the way is how to better use llvm-mca to get the results I was looking for. One trick is to very much avoid branches, as analyzing them is near complete guesswork: there's no simulation of the branch predictor (at least in the version I was using).
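The other trick is llvm-mca's region markers, which let you measure only the instructions you care about rather than a whole function. They can be injected from C source as inline assembly comments; a small sketch (the function is made up, not the kernel code in question):

    /* Only the region between the markers is analyzed by llvm-mca. */
    unsigned int mix(unsigned int a, unsigned int b)
    {
        unsigned int r;
        __asm volatile("# LLVM-MCA-BEGIN mix");
        r = (a * 0x9e3779b9u) ^ b;
        __asm volatile("# LLVM-MCA-END");
        return r;
    }

Compile with -S and the markers come through as comments in the assembly, and llvm-mca reports only on the named region. It's worth eyeballing the generated .s file, since the optimizer is free to move instructions across bare asm statements like these.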
The big thing I wanted to prove: does doing the extra work have a small or a large impact on the number of elapsed cycles? The answer was that doing a bunch of extra "work" was essentially free. The CPU core could execute enough things in parallel that the incremental cost of doing the extra work just… wasn't relevant.
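A purely illustrative example of what I mean (this isn't the actual kernel change): if the extra work is a couple more multiply-and-fold steps with no dependency on the existing ones, an out-of-order core can overlap them with the work it was already doing, rather than serially adding their full latency:

    /* Hypothetical hash fold: the two "extra" multiplies below are
       independent of the earlier ones, so they can issue in parallel on
       separate execution ports instead of extending the critical path. */
    static unsigned int fold4(unsigned int a, unsigned int b,
                              unsigned int c, unsigned int d)
    {
        unsigned int h = (a * 0x9e3779b9u) ^ (b * 0x85ebca6bu);
        /* the extra work */
        h ^= (c * 0xc2b2ae35u) ^ (d * 0x27d4eb2fu);
        return h;
    }

llvm-mca's resource pressure and timeline views are exactly what shows whether additions like this fit into otherwise-idle execution slots or push out the total cycle count.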
This helped get a patch deployed without impacting performance, as well as get a patch upstream, fixing an issue that was partially fixed 10 years prior and had existed since day 1 of the Linux IPv6 code.
Naturally, this wasn’t a solo effort, and that’s one of the joys of working with a bunch of smart people – both at the same company I work for, and in the broader open source community. It’s always humbling when you’re looking at code outside your usual area of expertise that was written (and then modified) by Really Smart People, and you’re then trying to fix a problem in it, while trying to learn all the implications of changing that bit of code.
Anyway, check out llvm-mca for your next adventure into premature optimization: if you're going to get started with evil, you may as well start with what's at the root of all of it.