doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
{
        ulint   i, j;
        UT_LOW_PRIORITY_CPU();
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }
        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }
        UT_RESUME_PRIORITY_CPU();
        return(j);
}

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

Update on MySQL on POWER8

About 1.5 months ago I blogged on MySQL 5.6 on POWER andtalked about what I had to poke at to make modern MySQL versions run and run well on shiny POWER8 systems.

One of those bugs, MySQL bug 47213 (InnoDB mutex/rw_lock should be conscious of memory ordering other than Intel) was recently marked as CLOSED by the Oracle MySQL team and the upcoming 5.6.20 and 5.7.5 releases should have the fix!

This is excellent news for those wanting to run MySQL on SMP systems that don’t have an Intel-like memory model (e.g. POWER and MIPS64).

This was the most major and invasive patch in the patchset for MySQL on POWER. It’s absolutely fantastic that this has made it into 5.6.20 and 5.7.5 and may mean that these new versions will work out-of-the-box on POWER (I haven’t checked… but from glancing back at my patchset there was only one other patch that could be related to correctness rather than performance).

1 million SQL Queries Per Second: MySQL 5.7 on POWER8

I’ve previously covered MySQL 5.6 on POWER (with patch), MySQL 5.6 Performance on POWER8 (spoiler: new performance record) and MySQL 5.7 on POWER.

Of course, The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions. Also, these numbers should be considered preliminary, but trust me – I did get them and it’s not April 1st.

From my last post, you saw that with my preliminary patch for MySQL 5.7 to work on POWER, we could easily match the previous record for sysbench point select queries per second (i.e. key lookups). In fact, we could exceed the published record by a little bit which is kind of nice. At around 630kQPS, one could be rather happy.

But we still had 30-40% idle CPU on POWER8. This led me to file the following bug report:

  • Bug 72829: LOCK_grant is major contention point, leaves 30-40% idle CPU.

What’s going on is that there’s a rwlock in the MySQL Server that ensures that writers don’t collide with readers to the data structures describing the GRANTs (i.e. who has access to what). If you run a GRANT statement, it gets a writer lock, and nobody can read (i.e. check permissions) while everything is being updated. If you run a normal SQL statement, you get a read lock (non-exclusive) and can check permissions appropriately.

It’s been known for a long time that LOCK_grant was a bottleneck. Typically, some people have run with skip-grant-tables to help shorten the time the lock as held (as in MySQL you still take the mutex even though you’ve started the server with skip-grant-tables).

In Drizzle, we fixed that – moving authentication and authorization completely behind plugin APIs and if you didn’t load plugins for them, you executed near enough to zero instructions that it didn’t matter.

In my experiments, enabling skip-grant-tables actually hurt performance rather than helped. More investigation is needed, but it seems that simply the act of acquiring and releasing the rdlock is now a major bottleneck in some benchmarks (such as sysbench point select).

It turns out that this is a well known problem in other pieces of software (e.g. Linux kernel) and is pretty much what RCU (Read Copy Update) is best at. As far back as 2006 I remember attempting to get my head around RCU so that one day we could use it in MySQL or MySQL Cluster.

Another simpler method is simply splitting the mutex, with readers able to acquire any one of N mutexes and writers needing to acquire them all. This penalizes writers, but unless you’re executing a lot of GRANTs, you’re probably safe.

So… what is the theoretical maximum performance if this bottleneck went away?

I wrote a quick patch that just commented out the rdlock acquisition of LOCK_grant in the hot codepath of sysbench point selects. I wasn’t running GRANT statements at runtime so this was “safe”.

This patch is not production ready, it’s merely useful for demonstrating where we could be with MySQL 5.7 on POWER8 if one last bottleneck is fixed.

My results? Slightly over ONE MILLION QUERIES PER SECOND!

This is roughly twice the previous record.

This is with a dual socket 24 core POWER8 with SMT8 and DSCR=1 on 8 tables with sysbench 0.4.8. Sysbench itself is using a non-trivial amount of CPU and I could probably decently beat this number if I rewrote sysbench using the nonblocking API in libdrizzle (back when me made the Drizzle performance regression tests use a libdrizzle-ified sysbench we got double digit percentage improvement in our sysbench numbers).

There’s still around 7-10% idle CPU time… so there’s more room to grow.

Lacking a physical gauntlet to throw down, I’ll just have to submit a conference paper somewhere so that I can do that in person.

I really hope that we’re able to fix this bottleneck in MySQL 5.7 so that MySQL 5.7 will ship being able to do over a million queries per second. From SQL.