Ten years of libeatmydata!

So, ten years ago (how is that even possible… it seems like it was just a couple of years ago), there was the first commit in the libeatmydata repository (now in git on github rather than in bzr on launchpad). The first implementation was literally just this:

#include <sys/types.h>
#include <sys/stat.h>

int errno;

int fsync(int fd)
{
       return 0;
}

Soooo…. kind of incredibly simple. But, hey, it worked! Little did I know that these two lines of code were going to grow into 166 lines of C to do it a bit more “properly”.
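
For a rough idea of what “properly” can involve, here is a minimal sketch (not the actual libeatmydata source) of a no-op fsync() that still reports a bad file descriptor the way the real call would, plus matching fdatasync() and sync() stubs. The helper name fd_is_open and the choice of which calls to stub out are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 0 if fd is open, -1 with errno set to EBADF if not. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) == -1 ? -1 : 0;
}

/* Pretend to sync: succeed immediately, but only for a valid descriptor. */
int fsync(int fd)
{
    return fd_is_open(fd);
}

int fdatasync(int fd)
{
    return fd_is_open(fd);
}

/* sync() has no return value, so there is nothing to report. */
void sync(void)
{
}
```

Built as a shared object (for example with gcc -shared -fPIC) and named in LD_PRELOAD, these definitions are found ahead of the C library's own for dynamically linked programs.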

My initial use case was making the MySQL test suite run faster: 30% faster back then! In fact, it was better than using tmpfs! It’s still used for that (even though I no longer hack on MySQL with any regularity), see github issue #1 for a recent bug that cropped up.

Since then, I’m aware of eatmydata being used to build entire operating systems and in production in way too many places (on way too many machines). The probability that any given human who’s used a computer in the past 10 years has used libeatmydata, used a package built with it or used a service with it running somewhere in production is so close to 1 that I don’t want to think about it.

Well… here’s to the next ten years of eating data!

New libeatmydata release: 105

Over on the project page and on launchpad you can now download libeatmydata 105.

This release fixes a couple of bugs that came in via the Debian project, including a rather interesting one about some binaries not running .so ctors to properly init libeatmydata and the code path in the libeatmydata open() not really dealing with being called first in this situation.
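
The constructor problem is an ordering issue: a preloaded wrapper normally looks up the real functions when the library is loaded, so a wrapper that can be reached before that has happened needs to do the lookup itself. A minimal sketch of that pattern, assuming a dlsym(RTLD_NEXT, ...) lookup; the names real_open, init_wrappers and wrappers_ctor are invented here, the flag stripping is an assumption about what a sync-avoiding open() would do, and none of this is the code from the release:

```c
#define _GNU_SOURCE             /* for RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

static open_fn real_open;       /* NULL until init_wrappers() has run */

/* Safe to call repeatedly; only the first call does the lookup. */
static void init_wrappers(void)
{
    if (real_open == NULL)
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");
}

/* Normal path: runs when the shared object is loaded. */
__attribute__((constructor))
static void wrappers_ctor(void)
{
    init_wrappers();
}

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;

    /* A mode is passed when creating a file (O_TMPFILE is not handled here). */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    /* Fallback path: open() was reached before the constructor ran. */
    init_wrappers();
    if (real_open == NULL) {
        errno = EFAULT;
        return -1;
    }

    /* Drop the synchronous-write flags before handing over to the real open(). */
    return real_open(path, flags & ~(O_SYNC | O_DSYNC), mode);
}
```

The point of the design is that the wrapper never assumes the constructor has run: calling init_wrappers() again costs one pointer comparison once the lookup has succeeded.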


New libeatmydata release (65): MacOS X 10.7 fixes

This release incorporates contributions from Blair Zajac to fix issues on MacOS X 10.7.

You can get the source tarball over on the launchpad page for the release or directly from my web site:

New libeatmydata release!

I updated the web site for libeatmydata (woah!): http://flamingspork.com/projects/libeatmydata/ and the launchpad page: https://launchpad.net/libeatmydata to reflect this too.

New exciting things in the land of libeatmydata:

  • sync_file_range is now wrapped (thanks to Phillip Susi)
  • I now bundle the eatmydata helper script originally included in the debian packages
  • the autotools foo builds on MacOS X
  • I modified the eatmydata helper script to also do the right DYLD environment variables if it’s running on Darwin. i.e. the eatmydata helper script now runs on MacOS X too (well, it should – please test)
  • libeatmydata should now work just about everywhere that can LD_PRELOAD. Patches welcome.

If anyone knows how to build a non-versioned shared library using autotools… I’d love to hear it. libeatmydata is totally not something that needs soname versioning. I guess it’s harmless though.

nocache LD_PRELOAD

Want to do something like “cp big_file copy_of_big_file” or “tar xfz big_tarball.tar.gz” but without thrashing your cache?

Enrico Zini has a nice little LD_PRELOAD called nocache.

$ nocache tar xfz foo.tar.gz

Goes well with libeatmydata. A pair of tools for compensating for your Operating System casually hating you.

I imagine people will love this when taking database backups.

Using Dtrace to find out if the hardware or Solaris is slow (but really just working around the problem)

A little while ago, I was the brave soul tasked with making sure Drizzle was working properly and passing all tests on Solaris and OpenSolaris. Brian recently blogged about some of the advantages of also running on Solaris and the SunStudio compilers – more warnings from the compiler is a good thing. Many kudos go to Monty Taylor for being the brave soul who fixed most of the compiler warnings (and for us, warnings=errors – so we have to fix them) for the SunStudio compilers before I got to making the tests work.

So, I got to the end of it all and got pointed to an OpenSolaris x86 box where the drizzleslap test was timing out. The timeout for tests is some amazingly long amount of time – 15 minutes. All the drizzle-test-run tests are rather short tests.

To make running the tests quick, I usually LD_PRELOAD libeatmydata – a simple way of disabling pesky things like fsync that take a long time (rumors that I nickname it libmacosxsimulation are entirely true). It’s pretty simple to build libeatmydata on Solaris too (I periodically do this and always intend to check in the associated Makefile but never do).

Unfortunately, on OpenSolaris a bunch of things are built 32bit and others 64bit and just doing “LD_PRELOAD=libeatmydata.so ./dtr” doesn’t work – I’d have to modify the test script to only do the LD_PRELOAD for drizzled – which is annoying.

On my T1000 running Debian, the drizzleslap test takes 42 seconds to complete with libeatmydata, or 393 seconds when it’s really doing fsyncs. So for it to be timing out on this OpenSolaris x86 box – i.e. taking more than 15 minutes, was strange.

So… what was going on? Step 1: is anything actually going on? One way to test this is to see if disk IO is being generated. On Linux, we can use “iostat”. On Solaris, we can use “zpool iostat”. Things were going to disk for the whole time of the test. Time to compare what the difference between the platforms is.

Well… a typical reason for tests taking forever has been lots of transactions: i.e. lots of fsync(). You are then dependent on the fsync() performance.

If we look at “iostat -x” and the avgrq-sz field on Linux, we’ll see that the average request size is 10-12 sectors (512-byte blocks), i.e. about 5 or 6 KB.

If we look at “zpool iostat 1” on OpenSolaris, we see a bit of a different story, but similar enough that you could safely assume that lots of small synchronous IOs were going on. After a bit of reading of the ZFS on-disk format documents, I had a slightly better idea of what could be causing me to see a larger average request size on ZFS than on Linux with XFS.

So… perhaps it’s the speed of these syncs? Ordinarily, I’d just write up a quick LD_PRELOAD library that wraps fsync() and times it (perhaps writing to a file so I could do analysis on it later). Since I was working on Solaris… I thought I’d try DTrace. Some google-foo and dtrace hacking later, I tried this:

stewart@drizzle-dev:~/drizzle/sparc$ time pfexec dtrace -n 'syscall::fdsync:entry /execname == "drizzled" / { self->ts[self->stack++] = timestamp; } syscall::fdsync:return /self->ts[self->stack - 1]/ { this->elapsed = timestamp - self->ts[--self->stack]; @[probefunc] = count(); @a[probefunc] = quantize(this->elapsed); self->ts[self->stack] =0; }'

dtrace: description 'syscall::fdsync:entry ' matched 2 probes

  fdsync                                                         1600
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1520
       134217728 |@@                                       79
       268435456 |                                         1
       536870912 |                                         0        

real	4m26.837s
user	0m0.657s
sys	0m0.566s

The values are in nanoseconds, so nearly every fsync() was taking somewhere between 67 and 134 milliseconds, which did seem like an awfully long time for an fsync() to take. Although the filesystem was on a single disk, the disk was meant to have been made reasonably recently, and it’s sitting on a Sun controller… so it should be a bit better than that. From reading some of the ZFS on-disk spec, it could be some bug that means we’re waiting for a checkpoint to be written instead of forcing the sync out when we call fsync(). I sought another solution though, as on other Solaris/OpenSolaris systems this wasn’t a problem (so perhaps it’s fixed in newer kernels or it’s a driver issue).
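
For reading the histogram above: quantize() files each value under the largest power of two that does not exceed it, so a row's label is the lower bound of its bucket. A small sketch of that bucketing for non-negative values (an illustration only, not DTrace's implementation; quantize_bucket is a made-up name):

```c
#include <stdint.h>

/*
 * Lower bound of the power-of-two bucket that `value` falls into,
 * i.e. the largest power of two that is <= value (0 stays in bucket 0).
 */
static uint64_t quantize_bucket(uint64_t value)
{
    uint64_t bucket = 1;

    if (value == 0)
        return 0;

    /* Comparing against value / 2 means bucket * 2 can never overflow. */
    while (bucket <= value / 2)
        bucket *= 2;

    return bucket;
}
```

With the elapsed time measured in nanoseconds, quantize_bucket(100000000) gives 67108864, so a 100 ms fsync() is counted in the tallest row of the output.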

So I went and added “--commit=100” to a bunch of places in the drizzleslap test to batch things into transactions. The idea was to greatly reduce the number of fsync() calls and so bring the execution time of the drizzleslap test on that machine below 15 minutes. A bit of jiggerypokery later (some tests needed to not have the --commit to avoid various locking foo) and I had something that should run.

Now, ~113 seconds on the T1000 on Linux (with a single SATA disk, down from an original 393 seconds) and ~437 seconds on the OpenSolaris box. For giggles, tried it on a Solaris box that’s running UFS on a 10k RPM SAS drive: ~44 seconds.

In Summary:

T1000, Linux, libeatmydata, XFS: ~42 seconds (before optim)
T1000, Linux, 7200RPM SATA, XFS: ~113 seconds
T5240, Solaris 10, 10k RPM SAS, UFS: ~44 seconds
16 core Xeon, OpenSolaris, 7200RPM, ZFS: ~437 seconds

So, on that hardware setup – something is strange. The 10k SAS drive on UFS on the CoolThreads box is really nice though…. makes me want that kind of disk here.

This page was useful, and I used it as a basis for some of my DTrace scripts: http://fav.or.it/post/1146360/dtrace-and-the-mighty-hercules

Also thanks to several people on #opensolaris on Freenode who helped me out with various Solaris specific commands in tracking this down.