Posts Tagged ‘linux’

How many CPU cycles does a SQL query take? (or pagefaults caused… or L2 cache misses… or CPU migrations…)

Saturday, October 10th, 2009

I like profilers. I use them when trying to make software (such as Drizzle) faster. Many profilers suck – and pretty much all of them are impossible to attach to a running system. Two notable exceptions are oprofile and dtrace (for Linux and Solaris respectively). The downside of oprofile is that it is non trivial to configure and get running and is pretty much all or nothing. Dtrace has the major disadvantage of that it is Solaris specific, so is only available to a minority of our users (and developers).

The new Linux Performance Events interface (perf_event) presents to userspace a nice abstraction of the hardware Performance Monitoring Unit inside the CPU. Typically these are processor specific (i.e. the one in a Core is different than the one in a Core 2) and can only be used by one thing at a time. The perf_events interface lets multiple applications/threads use the PMU (switching state at context switch as needed), even giving us ratios of how much time we got events for so we can do realistic estimates. It also provides some common defines to ask for things like CPU Cycles (the value you program into the PMU differs per CPU architecture for that, so an abstraction is rather welcome: it means we don’t need to have #ifdef __powerpc__ in our userspace code to support PowerPC, just a kernel that does).

Since perf_events gives us an interface to only get counts for our thread, we can map this onto connected sessions to Drizzle (and if we were using a pool_of_threads type scheduler in Drizzle, we’d need a bit of extra code to get things right, but with a thread per connection scheduler, we get it for free). A simple use of this could be to answer the question “How many CPU cycles does this SQL query take?” with the condition that you do not want how many CPU cycles were spent executing other things (e.g. the 1,000 other SQL queries currently being executed on your database server).

Many of you will now point out the RDTSC instruction for the x86 and ask why I’m just not using it. With RDTSC you’re only getting “how many cycles since reboot”. So using two values from the TSC and finding the difference only tells you how many cycles were between the two reads, not how many cycles were spent executing your thread/process. So the value of “cycles executed” gathered with RDTSC varies between a loaded and non-loaded system. With perf_events, it does not.

So…. after talking to paulus about perf_events, I decided to see how I could plug this into Drizzle to start getting interesting numbers out. Sure enough, a little bit of work later and I have my first proof of concept implementation over in lp:~stewart-flamingspork/drizzle/perf-events. That tree has a perf_udf() function that is like BENCHMARK except it returns the number of CPU cycles spent executing the expression N times. For example, how many CPU cycles does it take for the SQL expression MD5(‘pants’) to be evaluated?

drizzle> select perf_udf(1, MD5(‘pants’));
+—————————+
| perf_udf(1, MD5(‘pants’)) |
+—————————+
| 43540 |
+—————————+
1 row in set (0 sec)

So on my laptop, it’s about 40,000 cycles (over multiple runs I saw it vary between about 39,000 to 44,000). The really neat thing about using the perf_events interface is that if you run this on a different CPU architecture that has perf_events available in the kernel you’re currently running, it “just works”. e.g. if I ran this on a POWER5 box, I’d find out how many cycles it took there! No special code in Drizzle, yay!

The really neat next thing I tried was to run the same perf_udf query while running drizzleslap on the same database server, completely hammering it. I get exactly the same result (within normal variance)!

That isn’t the best part though. The best part is the other bits of information we can get out of the PMU:

  • PERF_COUNT_HW_CPU_CYCLES
  • PERF_COUNT_HW_INSTRUCTIONS
  • PERF_COUNT_HW_CACHE_REFERENCES
  • PERF_COUNT_HW_CACHE_MISSES
  • PERF_COUNT_HW_BRANCH_INSTRUCTIONS
  • PERF_COUNT_HW_BRANCH_MISSES
  • PERF_COUNT_HW_BUS_CYCLES

So the same way we can use the new ‘perf’ utility to see what a process is doing to the machine, we can enable people to do exactly the same thing with specific SQL queries (and through a bit of extra code, you could aggregate for users/applications). Not only that, but we could write a plugin for Drizzle to occasionally sample queries running through the database server and build up a quite complete profile of what’s going on over time.

We can also get software events out of perf_events such as:

  • PERF_COUNT_SW_CPU_CLOCK
  • PERF_COUNT_SW_TASK_CLOCK
  • PERF_COUNT_SW_PAGE_FAULTS
  • PERF_COUNT_SW_CONTEXT_SWITCHES
  • PERF_COUNT_SW_CPU_MIGRATIONS
  • PERF_COUNT_SW_PAGE_FAULTS_MIN
  • PERF_COUNT_SW_PAGE_FAULTS_MAJ

So for engines that do memory mapping of files on disk, we can find out which of your SQL queries are causing the page faults! We should also be able to find out if the operating system kernel is bouncing your execution threads around CPUs a lot.

The biggest possibility for awesomeness comes from the perf_event ability to get periodic call traces (you specify how often) including down into the kernel. This means that we could, on demand and only when we enable it, profile where time is being spent for that specific SQL query. The most important thing to note is that when this gathering is not enabled, the overhead is zero. Even when enabled for one query, this should have minimal impact on other things currently running (you’re going to use some extra CPU doing the profile, but we’re only profiling that one SQL query, not the 1000 other ones executing at the same time). So we could tell you that your query spends 9/10ths of its time in filesort() without ever adding any extra instructions to the filesort() codepath (and anywhere else where you may want to find out how much CPU time was spent). We could even tell you how much time was being spent in the kernel doing specific IO operations!

So I’m thinking that this could be used to make something pretty useful that could easily be used on production systems due to the zero overhead when not enabled and the small overhead when enabled.

Debian unstable on a Sun Fire T1000

Thursday, July 9th, 2009

So i got the T1000 working again (finally, after much screwing about trying to get the part). I then hit the ever annoying “no console” problem, where the console didn’t work – kind of problematic.

After a firmware upgrade, and passing “console=/dev/ttyS0″ to the kernel, things work.

So the T1000 firmware 6.3 doesn’t work with modern debian kernels. Thing swork with 6.7 though.

Using Dtrace to find out if the hardware or Solaris is slow (but really just working around the problem)

Wednesday, April 8th, 2009

A little while ago, I was the brave soul tasked with making sure Drizzle was working properly and passing all tests on Solaris and OpenSolaris. Brian recently blogged about some of the advantages of also running on Solaris and the SunStudio compilers – more warnings from the compiler is a good thing. Many kudos goes to Monty Taylor for being the brave soul who fixed most of the compiler warnings (and for us, warnings=errors – so we have to fix them) for the SunStudio compilers before I got to making te tests work.

So, I got to the end of it all and got pointed to an OpenSolaris x86 box where the drizzleslap test was timing out. The timeout for tests is some amazingly long amount of time – 15 minutes. All the drizzle-test-run tests are rather short tests.

To make running the tests quick, I usually LD_PRELOAD libeatmydata – a simple way of disabling pesky things like fsync that take a long time (rumors that I nickname it libmacosxsimulation are entirely true). It’s pretty simple to build libeatmydata on Solaris too (I periodically do this and always intend to check in the associated Makefile but never do).

Unfortunately, on OpenSolaris a bunch of things are built 32bit and others 64bit and just doing “LD_PRELOAD=libeatmydata.so ./dtr” doesn’t work – I’d have to modify the test script to only do the LD_PRELOAD for drizzled – which is annoying.

On my T1000 running Debian, the drizzleslap test takes 42 seconds to complete with libeatmydata, or 393 seconds when it’s really doing fsyncs. So for it to be timing out on this OpenSolaris x86 box – i.e. taking more than 15 minutes, was strange.

So… what was going on? Step 1: is anything actually going on? One way to test this is to see if disk IO is being generated. On Linux, we can use “iostat”. On Solaris, we can use “zpool iostat”. Things were going to disk for the whole time of the test. Time to compare what the difference between the platforms is.

Well.. a typical way that tests have taken forever have been because of lots of transactions: i.e. lots of fsync(). You are then dependent on the fsync() performance.

If we look at “iostat -x” and the avgrq-sz field on Linux, we’ll see that the average request size is on 10-12 sectors (512 byte blocks). i.e. about 5 or 6kb.

If we look at “zpool iostat 1″ on OpenSolaris, we see a bit of a different story, but similar enough that you could safely assume that lots of small synchronous IOs were going on. After a bit of reading of the ZFS on-disk format documents, I had a slightly better idea what was going on that could be causing me seeing a larger average request size on ZFS than on Linux with XFS.

So… perhaps it’s the speed of these syncs? Ordinarily, I’d just write up a quick LD_PRELOAD library that wraps fsync() and times it (perhaps writing to a file so I could do analysis on it later). Since I was working on Solaris… I thought I’d try DTrace. Some google-foo and dtrace hacking later, I tried this:

stewart@drizzle-dev:~/drizzle/sparc$ time pfexec dtrace -n ‘syscall::fdsync:entry /execname == “drizzled” / { self->ts[self->stack++] = timestamp; } syscall::fdsync:return /self->ts[self->stack - 1]/ { this->elapsed = timestamp – self->ts[--self->stack]; @[probefunc] = count(); @a[probefunc] = quantize(this->elapsed); self->ts[self->stack] =0; }’

dtrace: description 'syscall::fdsync:entry ' matched 2 probes
^C

  fdsync                                                         1600
  fdsync
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1520
       134217728 |@@                                       79
       268435456 |                                         1
       536870912 |                                         0        

real	4m26.837s
user	0m0.657s
sys	0m0.566s

Which did seem like an awful long time for an fsync() to take. Although the filesystem was on a single disk, it was meant to be made remotely recently, and it’s sitting on a Sun controller… so it should be a bit better than that. From reading some of the ZFS on-disk spec, it could be some bug that means we’re waiting for a checkpoint to be written instead of forcing the sync out when we call fsync() – but I sought another solution (as on other Solaris/OpenSolaris systems this wasn’t a problem – so perhaps fixed in newer kernels or it’s a driver issue).

So I went and added “–commit=100″ to a bunch of places in the drizzleslap test to batch things into transactions. The idea being to greatly reduce the number of fsync() calls to bring the execution time of the drizzleslap test on the machine to get below 15minutes. A bit of jiggerypokery later (some tests needed to not have the –commit to avoid various locking foo) and I had something that should run.

Now, ~113 seconds on the T1000 on Linux (with a single SATA disk, down from an original 393 seconds) and ~437 seconds on the OpenSolaris box. For giggles, tried it on a Solaris box that’s running UFS on a 10k RPM SAS drive: ~44 seconds.

In Summary:

T1000, Linux, libeatmydata, XFS: ~42 seconds (before optim)
T1000, Linux, 7200RPM SATA, XFS: ~113 seconds
T5240, Solaris 10, 10k RPM SAS, UFS: ~44 seconds
16 core Xeon, OpenSolaris, 7200RPM, ZFS: ~437 seconds

So, on that hardware setup – something is strange. The 10k SAS drive on UFS on the CoolThreads box is really nice though…. makes me want that kind of disk here.

This page was useful, and I used it as a basis for some of my DTrace scripts: http://fav.or.it/post/1146360/dtrace-and-the-mighty-hercules

Also thanks to several people on #opensolaris on Freenode who helped me out with various Solaris specific commands in tracking this down.

fixing drizzle on linux sparc

Tuesday, December 16th, 2008

Since I got fed up with Solaris the other day, the T1000 is running Debian. This means that “I’ll care about Drizzle on Linux Sparc”.

OMG were things broken in the most “trivial” ways.

A good quick intro to the issues is Memory alignment on SPARC, or a 300x speedup!

It all comes down to memory alignment.

So I pulled the MySQL 6.0 bzr tree onto the box to try it too… I haven’t seen so many compiler warnings in ages (okay… since I last built MySQL.. drizzle is warning-clean and it makes it hard to remember a time before that). I think it works purely by accident.

So I’m gradually getting all of Drizzle working on Linux Sparc (a few things fixed already).

It’d be great if the T1k had faster disk though (make -j30 is fun… but IO isn’t on a single 160GB 7200rpm disk)… anybody wanting to donate an SSD?