doing nothing on modern CPUs

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
{
        ulint   i, j;
        UT_LOW_PRIORITY_CPU();
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }
        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }
        UT_RESUME_PRIORITY_CPU();
        return(j);
}

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

See also:

13 thoughts on “doing nothing on modern CPUs

  1. Alternate lesson is that vendors of expensive, different and hard to access HW should expect open source software to not work well, or they can fund the effort to fix things or at least make this stuff easier to access.

  2. Alternate lesson for compiler vendors – provide the intrinsics for doing nothing independent from CPU (atomics are intrinsics probably everywhere already, so doing nothing is i think second most used functionality after atomics) . Or, alternative lesson for OS vendors – provide macros in OS headers for this type of stuff like Windows that has YieldProcessor macro for that .

  3. There’s not really a solid set of semantics at this stage – there’s “run at lower priority” as well as “do nothing for a bit” and these are even possible the same thing (on x86 they are) but can also be different – you don’t have to be in a spinloop to want to be at different priorities.

    One thing I didn’t mention is that when running as a guest on POWER, the PAPR spec lists a hypervisor call H_CONFER, which tells the hypervisor “I’m waiting on a lock held by THAT guy” (the equivalent on x86 I think is that the pause instruction is trapped by kvm).

  4. Pingback: Running OpenPOWER firmware with QEMU | Firmware Security

  5. Pingback: doing nothing on modern CPUs | MySQL

  6. Pingback: Log Buffer #438: A Carnival of the Vanities for DBAs | MySQL

  7. Why does os_compare_and_swap_lint cause any traffic? The cache lines for the stack address should be held in M-state by the local cache. Doing an atomic operation on any M-state cache line should be free of any bus activity.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.