How many CPU cycles does a SQL query take? (or pagefaults caused… or L2 cache misses… or CPU migrations…)

I like profilers. I use them when trying to make software (such as Drizzle) faster. Many profilers suck – and pretty much all of them are impossible to attach to a running system. Two notable exceptions are oprofile and DTrace (for Linux and Solaris respectively). The downside of oprofile is that it is non-trivial to configure and get running, and it is pretty much all or nothing. DTrace has the major disadvantage that it is Solaris-specific, so it is only available to a minority of our users (and developers).

The new Linux Performance Events interface (perf_event) presents to userspace a nice abstraction of the hardware Performance Monitoring Unit inside the CPU. Typically these are processor-specific (i.e. the one in a Core is different from the one in a Core 2) and can only be used by one thing at a time. The perf_events interface lets multiple applications/threads use the PMU (switching state at context switch as needed), even giving us ratios of how much time we got events for so we can make realistic estimates. It also provides some common defines to ask for things like CPU cycles (the value you program into the PMU for that differs per CPU architecture, so an abstraction is rather welcome: it means we don’t need #ifdef __powerpc__ in our userspace code to support PowerPC, just a kernel that does).

Since perf_events gives us an interface to get counts for just our thread, we can map this onto connected sessions in Drizzle (if we were using a pool_of_threads-type scheduler in Drizzle we’d need a bit of extra code to get things right, but with a thread-per-connection scheduler we get it for free). A simple use of this could be to answer the question “How many CPU cycles does this SQL query take?” – with the condition that you do not want to count CPU cycles spent executing other things (e.g. the 1,000 other SQL queries currently being executed on your database server).

Many of you will now point out the RDTSC instruction on x86 and ask why I’m not just using it. RDTSC only gives you “how many cycles since reboot”, so taking two TSC readings and finding the difference only tells you how many cycles elapsed between the two reads, not how many cycles were spent executing your thread/process. The value of “cycles executed” gathered with RDTSC therefore varies between a loaded and an unloaded system. With perf_events, it does not.

So…. after talking to paulus about perf_events, I decided to see how I could plug this into Drizzle to start getting interesting numbers out. Sure enough, a little bit of work later and I have my first proof of concept implementation over in lp:~stewart-flamingspork/drizzle/perf-events. That tree has a perf_udf() function that is like BENCHMARK except it returns the number of CPU cycles spent executing the expression N times. For example, how many CPU cycles does it take for the SQL expression MD5('pants') to be evaluated?

drizzle> select perf_udf(1, MD5('pants'));
+---------------------------+
| perf_udf(1, MD5('pants')) |
+---------------------------+
|                     43540 |
+---------------------------+
1 row in set (0 sec)

So on my laptop, it’s about 40,000 cycles (over multiple runs I saw it vary between about 39,000 and 44,000). The really neat thing about using the perf_events interface is that if you run this on a different CPU architecture that has perf_events available in the kernel you’re currently running, it “just works”, e.g. if I ran this on a POWER5 box, I’d find out how many cycles it took there! No special code in Drizzle, yay!

The really neat next thing I tried was to run the same perf_udf query while running drizzleslap on the same database server, completely hammering it. I get exactly the same result (within normal variance)!

That isn’t the best part though. The best part is the other bits of information we can get out of the PMU:

  • PERF_COUNT_HW_CPU_CYCLES
  • PERF_COUNT_HW_INSTRUCTIONS
  • PERF_COUNT_HW_CACHE_REFERENCES
  • PERF_COUNT_HW_CACHE_MISSES
  • PERF_COUNT_HW_BRANCH_INSTRUCTIONS
  • PERF_COUNT_HW_BRANCH_MISSES
  • PERF_COUNT_HW_BUS_CYCLES

So the same way we can use the new ‘perf’ utility to see what a process is doing to the machine, we can enable people to do exactly the same thing with specific SQL queries (and through a bit of extra code, you could aggregate for users/applications). Not only that, but we could write a plugin for Drizzle to occasionally sample queries running through the database server and build up a quite complete profile of what’s going on over time.

We can also get software events out of perf_events such as:

  • PERF_COUNT_SW_CPU_CLOCK
  • PERF_COUNT_SW_TASK_CLOCK
  • PERF_COUNT_SW_PAGE_FAULTS
  • PERF_COUNT_SW_CONTEXT_SWITCHES
  • PERF_COUNT_SW_CPU_MIGRATIONS
  • PERF_COUNT_SW_PAGE_FAULTS_MIN
  • PERF_COUNT_SW_PAGE_FAULTS_MAJ

So for engines that do memory mapping of files on disk, we can find out which of your SQL queries are causing the page faults! We should also be able to find out if the operating system kernel is bouncing your execution threads around CPUs a lot.

The biggest possibility for awesomeness comes from the perf_event ability to get periodic call traces (you specify how often) including down into the kernel. This means that we could, on demand and only when we enable it, profile where time is being spent for that specific SQL query. The most important thing to note is that when this gathering is not enabled, the overhead is zero. Even when enabled for one query, this should have minimal impact on other things currently running (you’re going to use some extra CPU doing the profile, but we’re only profiling that one SQL query, not the 1000 other ones executing at the same time). So we could tell you that your query spends 9/10ths of its time in filesort() without ever adding any extra instructions to the filesort() codepath (and anywhere else where you may want to find out how much CPU time was spent). We could even tell you how much time was being spent in the kernel doing specific IO operations!

So I’m thinking that this could be used to make something pretty useful that could easily be used on production systems due to the zero overhead when not enabled and the small overhead when enabled.

12 thoughts on “How many CPU cycles does a SQL query take? (or pagefaults caused… or L2 cache misses… or CPU migrations…)”


  2. Hi!

    hey, this sounds really powerful and interesting!

    I was wondering about this though:

    “over multiple runs I saw it vary between about 39,000 to 44,000”

    It’s probably my lack of knowledge but I kind of expected to see a deterministic number of cycles being consumed for the same high-level operation. I mean, sure, there are a lot of layers between MD5(’pants’) and the actual CPU instructions, but I’m at a loss why the number should differ between calls. Or is it feasible that in some cases the ‘pants’ data were loaded from memory whereas for other calls the data would be in the CPU cache, something like that? Does that conjecture even make sense? Or perhaps the number of cycles differs because that “hardware monitoring unit” in the CPU doesn’t record everything, and just samples the available events?

    Just curious.

  3. Instruction scheduling inside the CPU will be a big factor, as well as access to memory, cache and a few other things (e.g. branch prediction).

    You can get CPU *instructions*, which should be constant from run to run, but isn’t really as interesting because of how modern CPUs work (e.g. branches are *amazingly* expensive and no-ops are effectively free).

    Of course, we can give CPU instruction count too…. and then we just have to educate everybody about modern CPU architecture :)

  4. Hi Stewart,

    thanks for the explanation – you uncovered a tacit and wrong assumption on my part immediately: I was confusing cycles with the number of instructions.

    Indeed – the number of instructions isn’t that helpful a measure. But before I do it again: I am assuming that the number of cycles is a direct measure of the amount of time spent. Is that true? Or am I missing something again?

    kind regards, Roland.

  5. As long as you know the number of cycles per second, yep!

    (This gets interesting with CPUs that change frequency… e.g. the one in my laptop.)

    Perhaps we need that bit of information too… Don’t think that’s there… now I’m thinking…

  6. Hi – while oprofile is useful, it only provides statistical data (through sampling) and people tend to forget this point. I had some problems with this in the past: depending on your workload, it can be very misleading. For accurate data, callgrind is a very good (but slow) profiler (and kcachegrind a nice GUI). Performance counters are a very good option for production/test environments… but require the target application to be instrumented. That’s why I’m delighted to learn about your work on it.

  7. The issue with callgrind of course is that it really is slow, so is unrealistic in wall time and if you have threads, scheduling becomes much different to real life.

  8. Didier: You don’t have to instrument your app to use performance counters, you can run your entire program under “perf”.

  9. @Michael: I thought that ‘perf’ implemented statistical sampling (in a similar way to oprofile). This is what I understand from Arnaldo Carvalho de Melo’s presentation (Linux Plumbers Conference, September 2009). Am I mistaken? Don’t get me wrong: oprofile and perf are definitely useful tools, but their accuracy is limited, and sometimes it can be misleading. By instrumenting your application, you can get accurate data with low performance overhead.

  10. Hi,
    I’m currently working on my second-degree final exam, developing a global real-time (EDF) scheduler inside Linux. I have to measure the work done by the scheduler and the overhead due to migrations from one CPU to another, so I think perf counters will come in very handy.
    I can’t use the perf tools for monitoring; I’m trying to instrument my benchmark application to do the job.
    So the question is… where can I find some documentation on how to instrument a user-space application to monitor things? Is there any way to see how you did it?
    Thanks in advance,
    Juri

