Using llvm-mca for predicting CPU cycle impact of code changes

Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor, see how many CPU cycles each instruction took, and then do the math on how long a sequence of instructions would take to execute. This cycle counting was used pretty effectively to do really neat things, such as getting anything onto the screen at all from an Atari 2600. Modern CPUs are… complex. They can do several things at once, in a different order than you wrote them in, and have an interesting arrangement of shared resources to allocate.

So, unlike with simpler hardware, if you have a sequence of instructions for a modern processor, it’s going to be pretty hard to work out by hand how many cycles it will take, and the answer is going to differ for each micro-architecture available for the instruction set.

When designing a microprocessor, simulating how long a sequence of existing instructions will take to execute compared to the previous generation of microprocessor is pretty important. The aim should be for it to take less time, or energy, or some other metric that means your new processor is better than the old one. It can be okay if, from one generation to the next, some sequence of instructions takes more cycles, provided your cycles are more frequent, or more power efficient, or better in whatever other metric you’re designing for.

Programmers may want this simulation too, as some code paths get rather performance critical for certain applications. Open Source tools for this aren’t as prolific as I’d like, but there is llvm-mca which I (relatively) recently learned about.

llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU.

the llvm-mca docs

So, when looking at an issue in the IPv6 address and connection hashing code in Linux last year, and being quite conscious that modern systems deal with a LOT of network packets (so this code can be quite sensitive to CPU usage), I wanted to make sure that my suggested changes weren’t going to have a large impact on performance across the variety of CPU generations in use.

There are two ways to do this. One is to run everything: throw a lot of packets at the system and measure it. That can be a long dev cycle, and sometimes just annoying to get going. The other can be a lot quicker: simulate the small section of code in question and do some analysis of it before going through the trouble of spinning up multiple test environments to prove it in the real world.

So, enter llvm-mca and the ability to quickly evaluate possible changes before testing them. Seeing as the code in question was nicely self contained, I could fairly easily get gcc (or llvm) to spit out assembler for it separately from the kernel tree. My preference was for gcc as that’s what most distros end up compiling Linux with, including the Linux distribution that’s my day job (Amazon Linux).

In order to share the results of the experiments as part of the discussion on where the code changes should end up, I published the code and results in a github project as things got way too large to throw on a mailing list post and retain sanity.

I used a container so that I could run everything in a repeatable, isolated environment, as well as have others reproduce my results if needed. Different compiler versions and optimization levels will very much produce different sequences of instructions, and thus possibly quite different results. That is partially why the numbers don’t quite match some of the mailing list messages, although the relative difference between the various options was the same. The other reason is that I learned how to better use llvm-mca to isolate the exact sequence of instructions I cared about (and not include things like the guesswork that llvm-mca has to do for branches).

One thing I learned along the way was how to better use llvm-mca to get the results I was looking for. One trick is to very much avoid branches, as analyzing them is near complete guesswork: there’s no simulation of the branch predictor (at least in the version I was using).
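
The other trick worth knowing is telling llvm-mca exactly which instructions to analyze. The markers are just special comments in the generated assembly, and (as the llvm-mca docs show) you can emit them from C or C++ source with inline asm. A minimal sketch, with made-up function, file and constant names:

/* Minimal sketch of marking an llvm-mca analysis region from source. The
   markers become plain comments in the generated assembly, so they don't add
   instructions to measure (although inline asm can perturb optimization, so
   it's worth double checking the assembler output). */
unsigned hash_step(unsigned h, unsigned v)
{
  __asm__ volatile("# LLVM-MCA-BEGIN hash_step");
  h ^= v;
  h *= 0x9e3779b1u;  /* the bit of work we want cycle estimates for */
  __asm__ volatile("# LLVM-MCA-END");
  return h;
}

/* Then something like:
     g++ -O2 -S hash.cpp -o hash.s
     llvm-mca -mcpu=skylake hash.s
   reports on just that region, and you can repeat with a different -mcpu for
   each microarchitecture you care about. */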

The big thing I wanted to find out was whether doing the extra work would have a small or large impact on the number of elapsed cycles. The answer was that doing a bunch of extra “work” was essentially near free. The CPU core could execute enough things in parallel that the incremental cost of doing extra work just… wasn’t relevant.

This helped get a patch deployed without impacting performance, as well as get a patch upstream, fixing an issue that was partially fixed 10 years prior, and had existed since day 1 of the Linux IPv6 code.

Naturally, this wasn’t a solo effort, and that’s one of the joys of working with a bunch of smart people – both at the same company I work for, and in the broader open source community. It’s always humbling when you’re looking at code outside your usual area of expertise that was written (and then modified) by Really Smart People, and you’re then trying to fix a problem in it, while trying to learn all the implications of changing that bit of code.

Anyway, check out llvm-mca for your next adventure into premature optimization, as if you’re going to get started with evil, you may as well start with what’s at the root of all of it.

Every time I program a Mac…

… the preferred programming language changes.

I never programmed a 1980s Macintosh actually in the 1980s. It was sometime in the early 1990s that I first experienced Microsoft Basic for the Macintosh. I’d previously (unknowingly at the time as it was branded Commodore) experienced Microsoft BASIC on the Commodore 16, Commodore 64, and even the Apple ][, but the Macintosh version was something else. It let you do some pretty neat things such as construct a GUI with largely the same amount of effort as it took to construct a Text based UI on the micros I was familiar with.

Okay, to be fair, I’d also dabbled in Microsoft QBasic that came bundled with MS-DOS of the era, which let you do a whole bunch of graphics, so you could theoretically construct a GUI with it. Something I did attempt to do. Constructing a GUI on the Mac was so much easier.

Of course, Microsoft Basic wasn’t the preferred way to program on the Macintosh. At that time it was largely Pascal, with C being something that also existed – but you were going to see Pascal in Inside Macintosh. It was probably somewhat fortuitous that I’d poked at Pascal a bit as something alternate to look at in the high school computing classes. I can only remember using TurboPascal on DOS systems and never actually writing Pascal on the Macintosh.

By the middle part of the 1990s though, I was firmly incompetently writing C on the Mac. No doubt the quality of my code increased after I’d done some university courses that actually covered the language, rather than my only practical options being Inside Macintosh examples in Pascal and “C for Dummies”, which was very not-Macintosh. Writing C on UNIX/Linux was a lot easier – everything was made for it, including Actual Documentation!

Anyway, in the early 2000s I ran MacOS X for a bit on my white iBook G3, and did a (very) small amount of GUI / Project Builder (the precursor to Xcode) related development, instead largely focusing on command line / X11 things. The latest coolness was to use Objective-C to program applications (unless you were bringing over your Classic MacOS Carbon based application, in which case you could still write C). Enter some (incompetent) Objective-C coding!

Then Apple went to x86, so the hardware ceased being interesting, and I had no reason to poke at it even as a side effect of having hardware that could run the software stack. Enter a long-ass time of Debian, Ubuntu, and Fedora on laptops.

Come 2022 though, and (for reasons I should really write up) I’m poking at a Mac again, and it’s now Swift as the preferred way to write apps. So, I’m (incompetently) hacking away at Swift code. I have to admit, it’s pretty nice. I’ve managed to be somewhat productive in a relatively short amount of time, and all the affordances in the language are geared towards the kind of safety that is a PITA to achieve when coding in C.

So this is my WIP utility to be able to import photos from a Shotwell database into the macOS Photos app:

There’s a lot of rough edges and unknowns left, including how to actually do the import (it looks like there’s going to be Swift code doing AppleScript things as the PhotoKit API is inadequate). But hey, some incompetent hacking in not too much time has a kind-of photo browser thing going on that feels pretty snappy.

API, ABI and backwards compatibility are a hard necessity

Recently, I was reading a thread on LKML about a proposal to change the behavior of the open system call when confronted with unknown flags. The thread is worth a read, as the topic of augmenting behavior that probably exists only by accident to be “better” is always interesting, as is the definition of “better”.
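
To make the unknown-flags problem concrete (this is a generic sketch with made-up names, not anything from that thread): if an interface silently ignores flag bits it doesn’t understand, a caller can never tell whether a newly added flag was honoured or quietly dropped, whereas checking for unknown flags up front keeps the interface cleanly extensible.

/* Strict flag checking: an older implementation returns a clear error for a
   flag it doesn't implement, instead of silently doing the old thing. */
#include <stdint.h>
#include <errno.h>

#define MYAPI_FLAG_A      (1u << 0)  /* supported since v1 */
#define MYAPI_FLAG_B      (1u << 1)  /* added in v2 */
#define MYAPI_KNOWN_FLAGS (MYAPI_FLAG_A | MYAPI_FLAG_B)

int myapi_do_thing(uint32_t flags)
{
  if (flags & ~MYAPI_KNOWN_FLAGS)
    return -EINVAL;  /* caller can detect lack of support and fall back */

  /* ... the actual work ... */
  return 0;
}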

Keeping API and/or ABI compatibility is something that isn’t a new problem, and it’s one that people are pretty good at sometimes messing up.

This problem does not go away just because “we have cloud now”. In any distributed system, in order to upgrade it (or “be agile” as the kids are calling it), you by definition are going to have either downtime or at least two versions running concurrently. Thus, you have to have your interfaces/RPCs/APIs/ABIs/protocols/whatever cope with changes.

You cannot instantly upgrade the world; it happens gradually. You also have to design for at least three concurrent versions running: the original, your upgrade, and the urgent fix for when the upgrade turns out to be quite broken in some new way you only discover in production.

So, the way you do this? Never ever EVER design for N-1 compatibility only. Design for going back a long way, much longer than you officially support. You want to have a design and programming culture of backwards compatibility to ensure you can both do new and exciting things and experiment off to the side.

It’s worth going and rereading Rusty’s API levels posts from 2008:

Fuzzing Firmware – afl-fuzz + skiboot

In what is likely to be a series on how firmware makes some normal tools harder to use, first I’m going to look at american fuzzy lop – a tool for fuzz testing; if you’re not using it, then you most certainly have bugs it would find for you.

I first got interested in afl-fuzz during Erik de Castro Lopo’s excellent talk at linux.conf.au 2016 in Geelong earlier this year: “Fuzz all the things!”. In a previous life, the Random Query Generator managed to find a heck of a lot of bugs in MySQL (and Drizzle). For randgen info, see Philip Stoev’s talk on it from way back in 2009, a recent (2014) blog post on how Tokutek uses it, and some notes on how it was being used at Oracle from 2013. Basically, the randgen was a specialized fuzzer that (given a grammar) would randomly generate SQL queries and then (if the server didn’t crash) compare the result to some other database server (e.g. your previous version).

The afl-fuzz fuzzer takes a different approach – it’s a much more generic fuzzer rather than a targeted tool. Also, while tools such as the random query generator are extremely powerful and find specialized bugs, they’re hard to get started with. A huge benefit of afl-fuzz is that it’s really, really simple to get started with.

Basically, if you have a binary that takes input on stdin or as a (relatively small) file, afl-fuzz will just work and find bugs for you – read the Quick Start Guide and you’ll be finding bugs in no time!
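
To give a flavour of how little is needed (this is a generic sketch, not the actual skiboot test code): the target just has to read the file it’s given and feed the contents to the code you want to exercise.

/* Generic sketch of an afl-fuzz target. parse_blob() stands in for whatever
   parsing code is under test and is linked in separately; any crash or assert
   in it is exactly what afl-fuzz is hunting for. */
#include <stdio.h>
#include <stddef.h>
#include <vector>

extern int parse_blob(const unsigned char *data, size_t len);

int main(int argc, char **argv)
{
  if (argc != 2)
  {
    fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
    return 1;
  }

  FILE *f = fopen(argv[1], "rb");
  if (!f)
    return 1;

  std::vector<unsigned char> buf;
  unsigned char chunk[4096];
  size_t n;
  while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0)
    buf.insert(buf.end(), chunk, chunk + n);
  fclose(f);

  return parse_blob(buf.data(), buf.size());
}

/* Build with afl's compiler wrapper and run roughly like this:
     afl-g++ -O2 harness.cpp parser.o -o harness
     afl-fuzz -i testcases/ -o findings/ ./harness @@
   afl-fuzz substitutes @@ with the path of each input it generates. */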

For firmware of course, we’re a little different than a simple command line program as, well, we aren’t one! Luckily though, we have unit tests. These are just standard binaries that include a bunch of firmware code and get run in user space as part of “make check”. Also, just like unit tests for any project, people do send me patches that break tests (which I reject).

Some of these tests act on data we get from elsewhere – maybe reading other parts of firmware off PNOR, or interacting with data structures we get from other bits of firmware. For testing this code, it can be relatively easy to have the test read these off disk.

For skiboot, there’s a data structure we get from the service processor on FSP machines called HDAT. Basically, it’s just like the device tree, but different. Because yet another binary format is always a good idea (yes, that is laced with a heavy dose of sarcasm). One of the steps in early boot is to parse the HDAT data structure and convert it to a device tree. Luckily, we structured our code so that creating a unit test that can run in userspace was relatively easy: we just needed to dump this data structure out from a running machine. You can see the test case here. Basically, hdat_to_dt is a binary that reads the HDAT structure out of a pair of files and prints out a device tree. One of the regression tests we have is that we always produce the same output from the same input.

So… throwing that into AFL yielded a couple of pretty simple bugs, especially around aborting out on invalid data (it’s better to exit the process with failure rather than hit an assert). Nothing too interesting here on my simple input file, but it does mean that our parsing code exits “gracefully” on invalid data.

Another utility we have is actually a userspace utility for accessing the gard records in the flash. A GARD record is a record of a piece of hardware that has been deconfigured due to a fault (or a suspected fault). Usually this utility operates on PNOR flash through /dev/mtd – but really what it’s doing is talking to the libflash library, that we also use inside skiboot (and on OpenBMC) to read/write from flash directly, via /dev/mtd or just from a file. The good news? I haven’t been able to crash this utility yet!

So I modified the pflash utility to read from a file to attempt to fuzz the partition reading code we have for the partitioning format that’s on PNOR. So far, no crashes – although to even get it going I did have to fix a bug in the file handling code in pflash, so that’s already a win!

But crashing bugs aren’t the only type of bugs – afl-fuzz has exposed several cases where we act on uninitialized data. How? Well, we run some test cases under valgrind! This is the joy of user space unit tests for firmware – valgrind becomes a tool that you can run! Unfortunately, these bugs have been sitting in my “todo” pile (which is, of course, incredibly long).

Where to next? Fuzzing the firmware calls themselves would be nice – although that’s going to require a targeted tool that knows about what to pass each of the calls. Another round of afl-fuzz running would also be good, I’ve fixed a bunch of the simple things and having a better set of starting input files would be great (and likely expose more bugs).

Carbon footprint of interpreted languages

A thought from a good discussion with François at OSDC today: what is the carbon footprint of various languages? He mentioned that the carbon footprint of a new Haskell compiler release is remarkably non-trivial, due to every Haskell package in Debian needing to be rebuilt.

So, I thought, what’s the impact of something like Python? (or Perl). Every machine running the code has to do the bytecode compilation/JIT/interpretation of that code so when, say, Ubuntu ships some new version of $random_desktop_thing_written_in_python, we’re actually compiling it well over 20 million times. That’s a remarkably non-trivial amount of CPU time (and thus CO2 emissions).

So, program in compiled languages such as C or C++ as doing so will save polar bears.

A better set of Boost m4 macros

I just replaced the old Pandora boost m4 macros in a project with boost.m4 from https://github.com/tsuna/boost.m4 and it basically just solved all my problems with Boost and the whole set of distributions that I build for (everything from CentOS/RHEL 5 to Debian unstable).

I like things that other people maintain.

Stewart’s dot twenty rule

I realised I haven’t written on this for a while and I was asked about it again today.

Stewart’s dot twenty rule is that a piece of software is never really mature until a dot twenty release.

This was a variant of “never use a dot zero release” which has been around the industry for a long time (i.e. always wait for X.0.1).

My first written observation on my variant on this rule was back in 2006:

This is a really stupid metric of software maturity. It is, however, disturbingly accurate.

It seems to continue to be both really stupid and disturbingly accurate. The first few point releases are still going to have rough edges; once you get to about dot five you likely have something that’s intensely usable for a good number of people, by dot ten the more complex use cases should start to be okay, and once you get to dot twenty you could say it’s mature.

A topic for another time is how releasing often is one thing but maintaining a release is quite another.


Fun with Coverity found bugs (Episode 1)

Taking inspiration from the great series of blog posts “Fun with Bugs” (and not http://funwithbugs.com/, which is about both caring for and eating bugs), and since I recently went and ran Coverity against Drizzle, I thought I’d do a small series of posts on bugs that it has found (and that I’ve fixed).

An idea that has been pervasive in the Drizzle project (and one that I rather subscribe to) is that there are two types of correct: correct and obviously correct. Being obviously correct is much, much better than merely being correct.

The first category of problems that Coverity found was kind of interesting: there was a warning that data_file_name and index_file_name in class ha_myisam weren’t initialized in the ha_myisam constructor, nor in any function that it calls. It turns out that this was basically because the code wasn’t exactly optimal, and these variables were used kind of oddly. In fact, in writing this blog post I went back and found that there’s a bunch of extra dead code and these should just be removed, along with the code that “used” them.

The historical use for data_file_name and index_file_name were that (in MySQL) you could specify different paths for MyISAM data and index files, so that the FRM ended up in the server datadir, the data file ended up some other place and the index file was off behind the sofa. Since MyISAM is used only for temporary tables in Drizzle, this is entirely not needed.

Another place where a similar bug was found by Coverity was in the SQLExecutor class of the json_server plugin. The _err variable wasn’t initialized in the constructor. After some careful auditing, I think this was actually a false positive as it was set to something before being used, but it was pretty simple to prevent future bugs by initializing it.
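
The fix for this class of warning is as boring as it is obviously correct: make sure every member has a value by the time the constructor finishes. A sketch of the pattern (made-up members, not the actual Drizzle classes):

class SQLExecutorish
{
public:
  /* Initialize everything in the initializer list, so neither Coverity nor a
     future reader has to prove that a member gets set before it's used. */
  SQLExecutorish() : _err(0), _rows_sent(0) {}

private:
  int _err;
  unsigned long _rows_sent;
};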

Two instances of the same warning: one just found a bunch of code to delete (rather useful), and the other is rather minor but may help someone in the future.

Coming up next: total embarrassment bugs.

Limiting functions to 32k stack in Drizzle (and scoped_ptr)

I wonder if this comes under “Code Style” or not…

Anyway, Monty and I finished getting Drizzle ready for adding “-Wframe-larger-than=32768” as a standard compiler flag. This means that no function within the Drizzle source tree can use greater than 32kb stack – it’s a compiler warning – and with -Werror, it means that it’s a build error.

GCC is not perfect at detecting stack usage, but it’s pretty good.

Why have we done this?

Well, there is a little bit of recursion in the server… and we can craft queries to blow a small stack (not so good). On MacOS X, the default thread stack size is only 512kb. That doesn’t give you many frames if 32kb stack frames are even remotely common.

I found some interesting places to throw a lot of things on the stack too – that would be rather far down on a callchain – leading to the possibility of blowing up in really strange ways.

We’d love to make it 16kb…. but that’s a fair bit more work, so something for the future.

We’ve used the Boost scoped_ptr to address a bunch of these situations as it requires pretty much minimal code change for the same effect (except that the memory is dynamically allocated instead of being part of the stack frame).
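
The transformation is pretty mechanical. A sketch of the before and after (the struct and its size are made up, not real Drizzle code):

#include <boost/scoped_ptr.hpp>

struct SortBuffer { unsigned char bytes[48 * 1024]; };  /* a made-up large object */

void sort_things_before()
{
  SortBuffer buf;        /* 48kb in this frame: trips -Wframe-larger-than=32768 */
  buf.bytes[0]= 0;       /* ... use buf ... */
}

void sort_things_after()
{
  boost::scoped_ptr<SortBuffer> buf(new SortBuffer());  /* heap allocated,     */
  buf->bytes[0]= 0;      /* ... use *buf exactly as before ... freed on return */
}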

A tale of a bug…

So I sometimes get asked if we funnel bug reports or patches back to MySQL from Drizzle. MariaDB adds some interest here, as it is a lot closer to (and indeed compatible with) MySQL. With Drizzle, we have deviated really quite heavily from the MySQL codebase. There are still some common areas, but they’re getting rarer (especially ones where a patch could just be directly applied).

Back in June 2009, while working on Drizzle at Sun, I found a bug that I knew would affect both. The patch would even directly apply (well… close, but I made one anyway).

So the typical process of me filing a MySQL bug these days is:

  • Stewart files bug
  • In the next window of Sveta being awake, it’s verified.

This happened within a really short time.

Unfortunately, what happens next isn’t nearly as awesome.

Namely, nothing. For a year.

So a year later, I filed it in launchpad for MariaDB.

So, MariaDB was gearing up for a release, and this was a relatively low priority bug (but one with a working, correct and obvious patch). Within 2 months, Monty applied it and improved the error checking around it.

So MariaDB bug 588599 is Fix Committed (June 2nd 2010 – July 20th 2010), MySQL Bug 45377 is still Verified (July 20th 2009 – ….).

(and yes, this tends to be a general pattern I find)

But Mark says he gets things through… so yay for him.

Shocked and Stunned (that code exists and does work)

#define READ_ALL		1	/* openfrm: Read all parameters */
#define EXTRA_RECORD		8	/* Reservera plats f|r extra record */

and later on….

  if (prgflag & (READ_ALL+EXTRA_RECORD))
    records++;

Feel free to think about that for a second.

(I have an urge to add this to questions asked in a job interview…)

stringstream is completely useless (and why C++ should have a snprintf)

  1. It’s easy to screw up thread safety.
    If you’re trying to format something for output (e.g. leading zeros, only 1 decimal place or whatever… you know, format specifiers in printf), you are setting a property on the stream, not on what you’re converting. So if you have a thread that sets a format, adds something to the stream, and then unsets the format, another thread can’t safely come in and do something to that stream in the middle of it. Look out for thread unsafe cout code.
  2. You cannot use streams for any text that may need to be translated.
    gettext is what everybody uses. You cannot get a page into the manual before it tells you that translators may want to change the order of what you’re printing. This goes directly against stringstream (see the sketch just after this list).
  3. You need another reason? Number 2 rules it out for so much handling it’s not funny.
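
To make the translation point concrete, here’s a hedged sketch (names are mine): printf-style format strings with POSIX positional parameters let a translator reorder the arguments without any code change, which a chain of << insertions simply cannot do.

/* _() is the usual gettext convention; the %1$d / %2$s positional parameters
   are POSIX, and glibc's snprintf supports them. */
#include <stdio.h>
#include <libintl.h>
#define _(s) gettext(s)

void report(int nfiles, const char *user)
{
  char buf[128];

  /* English output: "jane owns 42 files". A translation is free to become
     "%1$d files are owned by %2$s" -- same arguments, different order, and
     the calling code doesn't change at all. */
  snprintf(buf, sizeof(buf), _("%2$s owns %1$d files"), nfiles, user);
  puts(buf);
}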

Progress in nofrm branch

“Ban FRM Now!” branch in Launchpad

Now we’re reading part of the table information out of the proto file on disk instead of the frm.

Not everything (yet) but a bit. Good first steps. Had to fix bugs along the way as well (and find weirdness in FRM file format…).

Progress is being made.

magic number super fun happy time

umm…..

int Field_timestamp::store(double nr)
{
  int error= 0;
  if (nr < 0 || nr > 99991231235959.0)
  {
    set_datetime_warning(DRIZZLE_ERROR::WARN_LEVEL_WARN,
                         ER_WARN_DATA_OUT_OF_RANGE,
                         nr, DRIZZLE_TIMESTAMP_DATETIME);
    nr= 0;					// Avoid overflow on buff
    error= 1;
  }
  error|= Field_timestamp::store((int64_t) rint(nr), false);
  return error;
}

(likely the same in mysql as well… haven’t checked though). These date and time things scare me.

Drizzle progress… (testing can be good)

We’ve been working on fixing up the remaining test cases so that they run with Drizzle. We’ve found: bugs in Drizzle, bugs in MySQL (one that seems to have been there for at least 10 years), bugs in the tests, tests that no longer apply and, occasionally, something like this:


/* Please god, will someone rewrite this to be readable :( */
if (to->pack_length() == from->pack_length() &&
!(to->flags & UNSIGNED_FLAG && !(from->flags & UNSIGNED_FLAG)) &&
to->real_type() != DRIZZLE_TYPE_ENUM &&
(to->real_type() != DRIZZLE_TYPE_NEWDECIMAL || (to->field_length == from->field_length && (((Field_num*)to)->dec == ((Field_num*)from)->dec))) &&
from->charset() == to->charset() &&
to->table->s->db_low_byte_first == from->table->s->db_low_byte_first &&
(!(to->table->in_use->variables.sql_mode & (MODE_NO_ZERO_DATE | MODE_INVALID_DATES)) || (to->type() != DRIZZLE_TYPE_DATE && to->type() != DRIZZLE_TYPE_DATETIME)) &&
(from->real_type() != DRIZZLE_TYPE_VARCHAR || ((Field_varstring*)from)->length_bytes == ((Field_varstring*)to)->length_bytes))
{ // Identical fields
/* This may happen if one does 'UPDATE ... SET x=x' */
if (to->ptr != from->ptr)
memcpy(to->ptr,from->ptr,to->pack_length());
return 0;
}

and no, I haven’t really changed the formatting.

libmallocfail

Bazaar branches of libmallocfail

A simple LD_PRELOAD library that takes parameters via environment variables and causes malloc() to occasionally fail.
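
The core of it looks something like the following. This is a from-memory sketch (glibc specific, and the environment variable name is made up), not the actual library:

/* An LD_PRELOAD-able malloc() that fails roughly once every N calls, with N
   taken from an environment variable. glibc specific: it calls straight
   through to __libc_malloc() to avoid the usual dlsym() bootstrapping dance,
   and the noexcept matches glibc's own declaration of malloc(). Not thread
   safe -- it's a sketch. */
#include <stdlib.h>
#include <errno.h>

extern "C" void *__libc_malloc(size_t size);

extern "C" void *malloc(size_t size) noexcept
{
  static unsigned long counter= 0;
  static unsigned long fail_every= 0;

  if (fail_every == 0)
  {
    const char *env= getenv("MALLOCFAIL_EVERY");  /* made-up variable name */
    fail_every= env ? strtoul(env, NULL, 10) : 1000;
    if (fail_every == 0)
      fail_every= 1000;
  }

  if (++counter % fail_every == 0)
  {
    errno= ENOMEM;  /* pretend we're out of memory */
    return NULL;
  }

  return __libc_malloc(size);
}

/* g++ -shared -fPIC -o libmallocfail-sketch.so mallocfail.cpp
   LD_PRELOAD=./libmallocfail-sketch.so MALLOCFAIL_EVERY=500 ./some_binary */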

The aim was to use this to test bits of MySQL/Drizzle, although since their build is libtool based, the binary in the tree is a libtool shell script, and I haven’t found a way to LD_PRELOAD only for mysqld and not for the shell script and the other processes spawned by it.

I have found a bug in libc though :)

MemberDB speed improvements

So I finally installed the xdebug PHP extension and started doing some performance analysis of MemberDB using xdebug and kcachegrind. The upshot is a number of commits to the bzr tree that dramatically improve performance in several key areas. The answer? Caching.

I’m not even talking using memcached or caching things in database tables or anything like that – just about everything is still the same dynamically produced content as before, but I’m now caching some simple things avoiding many round-trips to the database while executing a script.

There were a few things that were taking a fair bit of execution time:

  1. The generation of the menu. In MemberDB, there’s a menu on the left. There’s also a powerful (read: non-trivial) permissions system allowing relatively fine grained granting of permissions. So, we need to check that the user has permission to go to the page before showing the page in the menu.
    Previously, for each item in the menu, we’d do a lookup to the database – checking if they have the permission or they are an admin. This ended up taking a bit of time – up to 30% of the time for the front page was taken up just generating the menu!
    So, now I cache the set of permissions for the user: one function fetches them from the DB into a structure, and another checks the permissions of the user against that struct (there’s a sketch of the pattern just after this list).
    While testing this, I actually used memcached to cache the menu to see how much of an improvement I could get… I’m about 69/70ths of the speed of using memcached with a purely PHP implementation caching the permissions info.
  2. Getting the information about a member is done in a variety of places. On some pages, you want information on the current logged in user (or just need to find their member ID). These are now cached for the duration of the script, saving quite a few DB round trips.
  3. When viewing an election (not the results, just the normal “view election” page that lists candidates), we need to get the membership information on a number of users (okay… so technically I should rewrite some of the queries to use joins in the DB… but this was easier). I now have a (limited) cache of membership info. So now, when a member has nominated multiple people, we only pull the member info out of the database once.
  4. Rewrite the “current_members” view. The old one was not as efficient as it could be. While the new one has slightly different semantics (can have duplicate rows, it turns out the use of DISTINCT was adding a bit of execution time, which for a bunch of queries is not needed) it’s significantly quicker.
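
The permission cache is nothing fancier than memoizing one query per request. The real code is PHP; here’s the shape of the idea as a sketch (all names made up):

#include <set>
#include <string>

/* One DB round trip fills this per request; every subsequent menu item
   permission check is answered from memory. */
struct PermissionCache
{
  bool loaded;
  bool is_admin;
  std::set<std::string> perms;

  PermissionCache() : loaded(false), is_admin(false) {}
};

/* Stub standing in for the real database lookup. */
static void load_permissions_from_db(PermissionCache &cache)
{
  cache.is_admin= false;
  cache.perms.insert("view_members");   /* made-up permission names */
  cache.perms.insert("edit_elections");
}

static bool has_permission(PermissionCache &cache, const std::string &perm)
{
  if (!cache.loaded)
  {
    load_permissions_from_db(cache);  /* the only round trip */
    cache.loaded= true;
  }
  return cache.is_admin || cache.perms.count(perm) != 0;
}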

I used the faithful Apache Bench (ab) to do benchmarks against the modified PHP code. I think the biggest improvement was the view election page, which went from about 6 seconds/page to 0.2 seconds/page.

libeatmydata

Following my successful linux.conf.au talk “Eat My Data: How Everybody Gets POSIX File I/O Wrong“, I started to feel the need to easily be able to have my data eaten.

Okay, not quite. However, when you’ve written your software properly, so it uses fsync() correctly, opening files with O_SYNC or whatever – tests take longer as you’re having to wait for things to hit the rust.

So….. LD_PRELOAD=libeatmydata.so to the rescue! With a POSIX compliant fsync() (that does nothing) and filtering on open(2), it can take your test run times down dramatically.

The only time you shouldn’t use it for your tests is when you end up crashing the machine to test durability (i.e. when the OS doesn’t have the opportunity to cleanly write out the data to disk).
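
For the curious, the core trick is tiny. A sketch of the idea (the real library, as mentioned above, also does filtering on open(2) and is rather more careful than this):

/* An LD_PRELOAD-able fsync() that reports success without doing anything. */
#include <unistd.h>

extern "C" int fsync(int fd)
{
  (void) fd;  /* we solemnly swear the data is on the rust */
  return 0;
}

/* g++ -shared -fPIC -o eatmydata-sketch.so eatmydata.cpp
   LD_PRELOAD=./eatmydata-sketch.so ./run-your-tests */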

See the libeatmydata project page: http://www.flamingspork.com/projects/libeatmydata/

and the bazaar repository: http://www.flamingspork.com/src/libeatmydata

(it seems to have saved somewhere between 20 and 30% of the time for innodb/ndb tests in mysql-test-run).

SCM performance

Linus is right when he talks about the performance of SCMs…. and that BitKeeper was about the first one to be worth using at all (really).

But as an interesting speed comparison… I’ve managed to pull the latest git (with git) and build it in less time than BitKeeper has taken to pull the latest NDB tree…. hrrm..

One of the reasons I’m so enjoying quilt for every day hacking is that it is blindingly fast.