Using llvm-mca for predicting CPU cycle impact of code changes

Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor, see how many CPU cycles each instruction took, and do the math on how long a sequence of instructions would take to execute. This cycle counting was used pretty effectively to do really neat things, such as getting anything onto the screen from an Atari 2600 at all. Modern CPUs are… complex. They can do several things at once, in a different order than you wrote them in, and have an interesting arrangement of shared resources to allocate.

So, unlike with simpler hardware, if you have a sequence of instructions for a modern processor, it’s going to be pretty hard to work out by hand how many cycles it will take, and the answer will differ for each micro-architecture that implements the instruction set.

When designing a microprocessor, simulating how long a series of existing instructions will take to execute compared to the previous generation is pretty important. The aim should be for it to take less time, or less energy, or improve some other metric that means your new processor is better than the old one. It can be okay if, from one generation to the next, some sequence of instructions takes more cycles, provided your cycles are more frequent, or more power efficient, or better on whatever other metric you’re designing for.

Programmers may want this simulation too, as some code paths get rather performance critical for certain applications. Open Source tools for this aren’t as prolific as I’d like, but there is llvm-mca which I (relatively) recently learned about.

llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU.

— the llvm-mca docs

So, when looking at an issue in the IPv6 address and connection hashing code in Linux last year, and being quite conscious that modern systems deal with a LOT of network packets and that this code is therefore quite sensitive to CPU usage, I wanted to make sure that my suggested changes weren’t going to have a large impact on performance – across the variety of CPU generations in use.

There are two ways to do this. One is to run everything: throw a lot of packets at the system and measure it. That can be a long dev cycle, and sometimes just annoying to get going. The other can be a lot quicker: simulate the small section of code in question and do some analysis of it before going through the trouble of spinning up multiple test environments to prove it in the real world.

So, enter llvm-mca and the ability to quickly evaluate possible changes before testing them. Seeing as the code in question was nicely self-contained, I could easily get to the point where gcc (or llvm) would spit out assembler for it separately from the kernel tree. My preference was for gcc, as that’s what most distros end up compiling Linux with, including the Linux distribution that’s my day job (Amazon Linux).
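As a rough sketch of the workflow (hash_func() and the file names here are invented for illustration, not the actual kernel code), you compile the self-contained snippet to assembler and point llvm-mca at it, picking the CPU model you care about with -mcpu:

/* hash_snippet.c – a made-up, self-contained stand-in for the code under test. */
#include <stdint.h>

uint32_t hash_func(uint32_t a, uint32_t b)
{
  /* Some straight-line arithmetic to analyse. */
  a ^= b * 0x9e3779b9u;
  a = (a << 13) | (a >> 19);
  return a + b;
}

/*
 * Then something like:
 *   gcc -O2 -S hash_snippet.c -o hash_snippet.s
 *   llvm-mca -mcpu=skylake hash_snippet.s
 * gives estimated cycles, uOps and resource pressure for that CPU model.
 */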

In order to share the results of the experiments as part of the discussion on where the code changes should end up, I published the code and results in a github project as things got way too large to throw on a mailing list post and retain sanity.

I used a container so that I could easily run it in a repeatable, isolated environment, as well as have others reproduce my results if needed. Different compiler versions and optimization levels will very much produce different sequences of instructions, and thus possibly quite different results. That difference in compiler options is partially why the numbers don’t quite match some of the mailing list messages, although the relative difference between the various options stayed the same. The other reason is that I learned how to better use llvm-mca to isolate the exact sequence of instructions I cared about (and not include things like the guesswork that llvm-mca has to do for branches).

One thing I learned along the way is how to better use llvm-mca to get the results I was looking for. One trick is to very much avoid branches, as analysing them is near complete guesswork since there’s no simulation of the branch predictor (at least in the version I was using).
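One way to do that isolation is llvm-mca’s analysis region markers, which are just comments in the assembly, so they can be dropped in from C source via inline asm and only the instructions between them get analysed. A sketch (the function and region name are invented for the example):

#include <stdint.h>

uint32_t hash_step(uint32_t a, uint32_t b)
{
  __asm volatile("# LLVM-MCA-BEGIN hash_step");
  a ^= b * 0x9e3779b9u;
  a = (a << 13) | (a >> 19);
  __asm volatile("# LLVM-MCA-END");
  /* Keep the region simple: the inline asm markers can themselves
     perturb what the compiler generates a little. */
  return a + b;
}

Then compile with -S as before and point llvm-mca at the resulting assembly.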

The big thing I wanted to prove: does doing the extra work have a small or large impact on the number of elapsed cycles? The answer was that doing a bunch of extra “work” was essentially free. The CPU core could execute enough things in parallel that the incremental cost of doing the extra work just… wasn’t relevant.

This helped get a patch deployed without impacting performance, as well as get a patch upstream, fixing an issue that was partially fixed 10 years prior and had existed since day 1 of the Linux IPv6 code.

Naturally, this wasn’t a solo effort, and that’s one of the joys of working with a bunch of smart people – both at the same company I work for, and in the broader open source community. It’s always humbling when you’re looking at code outside your usual area of expertise that was written (and then modified) by Really Smart People, and you’re then trying to fix a problem in it, while trying to learn all the implications of changing that bit of code.

Anyway, check out llvm-mca for your next adventure into premature optimization, as if you’re going to get started with evil, you may as well start with what’s at the root of all of it.

C bitfields considered harmful

In C (and C++) you can specify that a struct member should take a specific number of bits of storage by declaring it as “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32 bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct – e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; };” – the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. The layout is implementation-defined, so sizeof(struct foo) may vary based on compiler, platform, compiler version, phases of the moon or whether you’ve washed your hands recently.
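A quick way to see what your particular compiler decided (a throwaway sketch, nothing more):

#include <stdint.h>
#include <stdio.h>

struct foo {
  uint32_t foo:4;
  uint32_t blah;
  uint32_t blergh:20;
};

int main(void)
{
  /* Whether this prints 12, 16 or something else depends on the compiler,
     the target ABI and how it chose to pack the bitfields. */
  printf("sizeof(struct foo) = %zu\n", sizeof(struct foo));
  return 0;
}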

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831, which relates to InnoDB performance on POWER8. InnoDB uses C bitfields to hold what are themselves sets of bit flags (urgh), for things like the “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        Use DICT_TF_GET_COMPACT(), DICT_TF_GET_ZIP_SSIZE(),
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

volatile considered harmful

While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.

Memory barriers are hard.

Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.

Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.

On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.

If you want up-to-date reads (and to not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load and reserve works).

What the volatile keyword does is force the compiler to emit the load and store instructions instead of optimizing them away. It is useful when talking to hardware, as those loads and stores are actually doing something there that the compiler doesn’t know about, so they must not be removed.

The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.

Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.
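To make that concrete, here’s a sketch of the pattern volatile does not make safe, next to one correct alternative using C11 atomics (the names are invented for the example):

#include <stdatomic.h>
#include <stdbool.h>

int shared_data;

/* WRONG: volatile gives you loads and stores, but no ordering guarantees
   between threads – a reader could see ready_wrong become true before it
   sees the write to shared_data. */
volatile bool ready_wrong;

/* One correct alternative: a C11 atomic with release/acquire ordering,
   which makes the compiler emit the required barriers (lwsync etc. on POWER). */
atomic_bool ready;

void producer(void)
{
  shared_data = 42;
  atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
  while (!atomic_load_explicit(&ready, memory_order_acquire))
    ;                       /* spin until the producer publishes */
  return shared_data;       /* guaranteed to see 42 */
}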

Is Python the new BASIC?

Today I managed to finally find a way to express what I’ve been thinking for a while: “Python is the new BASIC”. Think about it: it’s easy to get started in, there are books and tutorials on it everywhere, a bunch of real-world software is actually written in it, and with all the different versions and modules (and versions of modules) there are a billion subtle differences to trip you up.

There’s also the group of people (like me) who don’t particularly like it, for a bunch of quite valid reasons. The lack of static typing is a huge barrier for me.

I am of the opinion that the ideal language with the ideal compiler would not let buggy code compile. It may not be as easy to program in this hypothetical language, but seeing as code has to exist and be debugged for orders of magnitude more time than it takes to write it, making it harder to write bugs is a good thing. After all, my experience with Python apps is that bugs manifest themselves at run time, to the user, rather than to the developer at the time of writing. Also, a compiler error is better than a unit test failure.

Discuss.

Getting a file size (on Windows)

The first point I’d like to make is that you’re using a Microsoft Windows API, so you have already lost. You are just not quite aware of how much you have lost.

A quick look around and you say “Ahh… GetFileSize, that’s what I want to do!” Except, of course, you’re wrong. You don’t want to use GetFileSize at all. It has the following signature:

DWORD WINAPI GetFileSize(
  __in       HANDLE hFile,
  __out_opt  LPDWORD lpFileSizeHigh
);

Yes, it supports larger than 4GB files! How? A pointer to the high-order doubleword is passed in! So how do you know if this errored? Return -1? WRONG! The return value is only the low-order doubleword, so your file length could legitimately be 0x1ffffffff – the low part is all ones even though nothing went wrong. To find out if you actually had an error, you must call GetLastError! Instead of one call, you now have two.
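A sketch of what calling it “correctly” looks like, going by the documented behaviour (hopefully enough to put you off):

#include <windows.h>

/* Returns the size of an already-open file, or -1 on error.
   This is the documented GetFileSize dance: INVALID_FILE_SIZE is only
   an error if GetLastError() says so. */
static long long file_size_the_hard_way(HANDLE hFile)
{
  DWORD high = 0;
  DWORD low = GetFileSize(hFile, &high);

  if (low == INVALID_FILE_SIZE && GetLastError() != NO_ERROR)
    return -1;    /* a real error */

  return ((long long)high << 32) | low;
}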

The Microsoft documentation even acknowledges that this is stupid: “Because of this behavior, it is recommended that you use GetFileSizeEx instead.”

GetFileSizeEx is presumably named “Ex” as in “I broke up with my ex because their API sucked.”

You now have something that looks like this:

BOOL WINAPI GetFileSizeEx(
  __in   HANDLE hFile,
  __out  PLARGE_INTEGER lpFileSize
);

Which starts to look a little bit nicer. For a start, the return code of BOOL seems to indicate success or failure.

You now get to provide a pointer to a LARGE_INTEGER. Which, in case you missed it, is:

typedef union _LARGE_INTEGER {
  struct {
    DWORD LowPart;
    LONG HighPart;
  };
  struct {
    DWORD LowPart;
    LONG HighPart;
  } u;
  LONGLONG QuadPart;
} LARGE_INTEGER, *PLARGE_INTEGER;

Why this abomination? Well… “If your compiler has built-in support for 64-bit integers, use the QuadPart member to store the 64-bit integer. Otherwise, use the LowPart and HighPart members to store the 64-bit integer.”

That’s right kiddies… if you’ve decided to lose from the get-go and have a compiler that doesn’t support 64-bit integers, you can still get the file size! Of course, you’re using a compiler that doesn’t have 64-bit integer support… and the Microsoft documentation indicates that the GetFileSizeEx call requires Windows 2000… so it’s post-y2k and you’re using a compiler without 64-bit ints? You have already lost.

Oh, but you say something about binary compatibility for apps written in the old days (handwave something like that). Well… let’s see… IRIX will give you 64-bit numbers in stat (stat64) unless you build with -o32 – giving you the old ABI. I just can’t see a use for GetFileSize… somebody please enlighten me.

Which header would you include? Any Linux/UNIX person would think of something logical – say sys/stat.h (Linux man page says sys/types.h, sys/stat.h and unistd.h). No, nothing sensible like that. It’s “Declared in WinBase.h; include Windows.h”.
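Putting the pieces together for the sane(ish) path, a minimal sketch (the file name is invented, and error handling is kept to the bare minimum):

#include <windows.h>
#include <stdio.h>

int main(void)
{
  /* Open some file read-only; "test.dat" is just an example path. */
  HANDLE h = CreateFileA("test.dat", GENERIC_READ, FILE_SHARE_READ,
                         NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  if (h == INVALID_HANDLE_VALUE)
    return 1;

  LARGE_INTEGER size;
  if (GetFileSizeEx(h, &size))
    printf("%lld bytes\n", (long long)size.QuadPart);

  CloseHandle(h);
  return 0;
}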

So… you thought that obviously somebody went through the API and gave you all this Ex function goodness to get rid of mucking about with parts of a 64bit int? You were wrong. Let me say it with this:

DWORD WINAPI GetCompressedFileSizeTransacted(
  __in       LPCTSTR lpFileName,
  __out_opt  LPDWORD lpFileSizeHigh,
  __in       HANDLE hTransaction
);

I’ll now tell you that this was introduced in Vista/Server 2008.

Obviously, you want to be able to use Transactional NTFS on Windows Vista with a compiler that doesn’t have 64-bit ints. Oh, and you must then make another function call to see if something went wrong?

But you know what… perhaps we can get away from this complete and utter world of madness and use stat()…. or rather… perhaps _stati64().

Well… you’d be fine except for the fact that these seem to lie to you (at least on Windows Server 2003 and Vista) – it seems that even Explorer can lie to you.

But perhaps you’ve been barking up the wrong tree… you obviously don’t want to find the file size at all – what you want is to FindFirstFile! No, you don’t want FindFirstFileEx (in this case, Ex is for Extremely complicated). It’s meant to be faster too… you know, maybe.

So remember kids, smoke your crack pipe – you’re going to need it if using this thing called the Microsoft Windows File Management Functions.