C bitfields considered harmful

In C (and C++) you can specify that a variable should take a specific number of bits of storage by doing “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct -e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; }” the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. In this case, sizeof(struct foo) isn’t defined and may vary based on compiler, platform, compiler version, phases of the moon or if you’ve washed your hands recently.

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831 which relates to InnoDB performance on POWER8. InnoDB uses C bitfields which are themselves bitfields (urgh) for things like “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

Some current MySQL Architecture writings

So, I’ve been looking around for a while (and a few times now) for any good resources that cover a bunch of MySQL architecture and technical details aimed towards the technically proficient but not MySQL literate audience. I haven’t really found anything. I mean, there’s the (huge and very detailed) MySQL manual, there’s the MySQL Internals manual (which is sometimes only 10 years out of date) and there’s various blog entries around the place. So I thought I’d write something explaining roughly how it all fits together and what it does to your system (processes, threads, IO etc).(Basically, I’ve found myself explaining this enough times in the past few years that I should really write it down and just point people to my blog). I’ve linked to things for more reading. You should probably read them at some point.

Years ago, there were many presentations on MySQL Architecture. I went to try and find some on YouTube and couldn’t. We were probably not cool enough for YouTube and the conferences mostly weren’t recorded. So, instead, I’ll just link to Brian on NoSQL – because it’s important to cover NoSQL as well.

So, here is a quick overview of executing a query inside a MySQL Server and all the things that can affect it. This isn’t meant to be complete… just a “brief” overview (of a few thousand words).

MySQL is an open source relational database server, the origins of which date back to 1979 with MySQL 1.0 coming into existence in 1995. It’s code that has some history and sometimes this really does show. For a more complete history, see my talk from linux.conf.au 2014: Past, Present and Future of MySQL (YouTube, Download).

At least of writing, everything here applies to MariaDB and Percona Server too.

The MySQL Server runs as a daemon (mysqld). Users typically interact with it over a TCP or UNIX domain socket through the MySQL network protocol (of which multiple implementations exist under various licenses). Each connection causes the MySQL Server (mysqld) to spawn a thread to handle the client connection.

There are now several different thread-pool plugins that instead of using one thread per connection, multiplex connections over a set of threads. However, these plugins are not commonly deployed and we can largely ignore them. For all intents and purposes, the MySQL Server spawns one thread per connection and that thread alone performs everything needed to service that connection. Thus, parallelism in the MySQL Server is gotten from executing many concurrent queries rather than one query concurrently.

The MySQL Server will cache threads (the amount is configurable) so that it doesn’t have to have the overhead of pthread_create() for each new connection. This is controlled by the thread_cache_size configuration option. It turns out that although creating threads may be a relatively cheap operation, it’s actually quite time consuming in the scope of many typical MySQL Server connections.

Because the MySQL Server is a collection of threads, there’s going to be thread local data (e.g. connection specific) and shared data (e.g. cache of on disk data). This means mutexes and atomic variables. Most of the more advanced ways of doing concurrency haven’t yet made it into MySQL (e.g. RCU hasn’t yet and is pretty much needed to get 1 million TPS), so you’re most likely going to see mutex contention and contention on cache lines for atomic operations and shared data structures.

There are also various worker threads inside the MySQL Server that perform various functions (e.g. replication).

Until sometime in the 2000s, more than one CPU core was really uncommon, so the fact that there were many global mutexes in MySQL wasn’t really an issue. These days, now that we have more reliable async networking and disk IO system calls but MySQL has a long history, there’s global mutexes still and there’s no hard and fast rule about how it does IO.

Over the past 10 years of MySQL development, it’s been a fight to remove the reliance on global mutexes and data structures controlled by them to attempt to increase the number of CPU cores a single mysqld could realistically use. The good news is that it’s no longer going to max out on the number of CPU cores you have in your phone.

So, you have a MySQL Client (e.g. the mysql client or something) connecting to the MySQL Server. Now, you want to enter a query. So you do that, say “SELECT 1;”. The query is sent to the server where it is parsed, optimized, executed and the result returns to the client.

Now, you’d expect this whole process to be incredibly clean and modular, like you were taught things happened back in university with several black boxes that take clean input and produce clean output that’s all independent data structures. At least in the case of MySQL, this isn’t really the case. For over a decade there’s been lovely architecture diagrams with clean boxes – the code is not like this at all. But this probably only worries you once you’re delving into the source.

The parser is a standard yacc one – there’s been attempts to replace it over the years, none of which have stuck – so we have the butchered yacc one still. With MySQL 5.0, it exploded in size due to the addition of things like SQL2003 stored procedures and it is of common opinion that it’s rather bloated and was better in 4.1 and before for the majority of queries that large scale web peeps execute.

There is also this thing called the Query Cache – protected by a single global mutex. It made sense in 2001 for a single benchmark. It is a simple hash of the SQL statement coming over the wire to the exact result to send(2) over the socket back to a client. On a single CPU system where you ask the exact same query again and again without modifying the data it’s the best thing ever. If this is your production setup, you probably want to think about where you went wrong in your life. On modern systems, enabling the query cache can drop server performance by an order of magnitude. A single global lock is a really bad idea. The query cache should be killed with fire – but at least in the mean time, it can be disabled.

Normally, you just have the SQL progress through the whole process of parse, optimize, execute, results but the MySQL Server also supports prepared statements. A prepared statement is simply this: “Hi server, please prepare this statement for execution leaving the following values blank: X, Y and Z” followed by “now execute that query with X=foo, Y=bar and Z=42”. You can call execute many times with different values. Due to the previously mentioned not-quite-well-separated parse, optimize, execute steps, prepared statements in MySQL aren’t as awesome as in other relational databases. You basically end up saving parsing the query again when you execute it with new parameters. More on prepared statements (from 2006) here. Unless you’re executing the same query many times in a single connection, server side prepared statements aren’t worth the network round trips.

The absolute worst thing in the entire world is MySQL server side prepared statements. It moves server memory allocation to be the responsibility of the clients. This is just brain dead stupid and a reason enough to disable prepared statements. In fact, just about every MySQL client library for every programming language ever actually fakes prepared statements in the client rather than trust every $language programmer to remember to close their prepared statements. Open many client connections to a MySQL Server and prepare a lot of statements and watch the OOM killer help you with your DoS attack.

So now that we’ve connected to the server, parsed the query (or done a prepared statement), we’re into the optimizer. The optimizer looks at a data structure describing the query and works out how to execute it. Remember: SQL is declarative, not procedural. The optimizer will access various table and index statistics in order to work out an execution plan. It may not be the best execution plan, but it’s one that can be found within reasonable time. You can find out the query plan for a SELECT statement by prepending it with EXPLAIN.

The MySQL optimizer is not the be all and end all of SQL optimizers  (far from it). A lot of MySQL performance problems are due to complex SQL queries that don’t play well with the optimizer, and there’s been various tricks over the years to work around deficiencies in it. If there’s one thing the MySQL optimizer does well it’s making quick, pretty good decisions about simple queries. This is why MySQL is so popular – fast execution of simple queries.

To get table and index statistics, the optimizer has to ask the Storage Engine(s). In MySQL, the actual storage of tables (and thus the rows in tables) is (mostly) abstracted away from the upper layers. Much like a VFS layer in an OS kernel, there is (for some definition) an API abstracting away the specifics of storage from the rest of the server. The API is not clean and there are a million and one layering violations and exceptions to every rule. Sorry, not my fault.

Table definitions are in FRM files on disk, entirely managed by MySQL (not the storage engines) and for your own sanity you should not ever look into the actual file format. Table definitions are also cached by MySQL to save having to open and parse a file.

Originally, there was MyISAM (well, and ISAM before it, but that’s irrelevant now). MyISAM was non-transactional but relatively fast, especially for read heavy workloads. It only allowed one writer although there could be many concurrent readers. MyISAM is still there and used for system tables. The current default storage engine is called InnoDB. It’s all the buzzwords like ACID and MVCC. Just about every production environment is going to be using InnoDB. MyISAM is effectively deprecated.

InnoDB originally was its own independent thing and has (to some degree) been maintained as if it kind of was. It is, however, not buildable outside a MySQL Server anymore. It also has its own scalability issues. A recent victory was splitting the kernel_mutex, which was a mutex that protected far too much internal InnoDB state and could be a real bottleneck where NRCPUs > 4.

So, back to query execution. Once the optimizer has worked out how to execute the query, MySQL will start executing it. This probably involves accessing some database tables. These are probably going to be InnoDB tables. So, MySQL (server side) will open the tables, looking up the MySQL Server table definition cache and creating a MySQL Server side table share object which is shared amongst the open table instances for that table. See here for scalability hints on these (from 2009). The opened table objects are also cached – table_open_cache. In MySQL 5.6, there is table_open_cache_instances, which splits the table_open_cache mutex into table_open_cache_instances mutexes to help reduce lock contention on machines with many CPU cores (> 8 or >16 cores, depending on workload).

Once tables are opened, there are various access methods that can be employed. Table scans are the worst (start at the start and examine every row). There’s also index scans (often seeking to part of the index first) and key lookups. If your query involves multiple tables, the server (not the storage engine) will have to do a join. Typically, in MySQL, this is a nested loop join. In an ideal world, this would all be really easy to spot when profiling the MySQL server, but in reality, everything has funny names like rnd_next.

As an aside, any memory allocated during query execution is likely done as part of a MEM_ROOT – essentially a pool allocator, likely optimized for some ancient libc on some ancient linux/Solaris and it just so happens to still kinda work okay. There’s some odd server configuration options for (some of the) MEM_ROOTs that get exactly no airtime on what they mean or what changing them will do.

InnoDB has its own data dictionary (separate to FRM files) which can also be limited in current MySQL (important when you have tens of thousands of tables) – which is separate to the MySQL Server table definitions and table definition cache.

But anyway, you have a number of shared data structures about tables and then a data structure for each open table. To actually read/write things to/from tables, you’re going to have to get some data to/from disk.

InnoDB tables can be stored either in one giant table space or file-per-table. (Even though it’s now configurable), InnoDB database pages are 16kb. Database pages are cached in the InnoDB Buffer Pool, and the buffer-pool-size should typically be about 80% of system memory. InnoDB will use a (configurable) method to flush. Typically, it will all be O_DIRECT (it’s configurable) – which is why “just use XFS” is step 1 in IO optimization – the per inode mutex in ext3/ext4 just doesn’t make IO scale.

InnoDB will do some of its IO in the thread that is performing the query and some of it in helper threads using native linux async IO (again, that’s configurable). With luck, all of the data you need to access is in the InnoDB buffer pool – where database pages are cached. There exists innodb_buffer_pool_instances configuration option which will split the buffer pool into several instances to help reduce lock contention on the InnoDB buffer pool mutex.

All InnoDB tables have a clustered index. This is the index by which the rows are physically sorted by. If you have an INT PRIMARY KEY on your  InnoDB table, then a row with that primary key value of 1 will be physically close to the row with primary key value 2 (and so on). Due to the intricacies of InnoDB page allocation, there may still be disk seeks involved in scanning a table in primary key order.

Every page in InnoDB has a checksum. There was an original algorithm, then there was a “fast” algorithm in some forks and now we’re converging on crc32, mainly because Intel implemented CPU instructions to make that fast. In write heavy workloads, this used to show up pretty heavily in profiles.

InnoDB has both REDO and UNDO logging to keep both crash consistency and provide consistent read views to transactions. These are also stored on disk, the redo logs being in their own files (size and number are configurable). The larger the redo logs, the longer it may take to run recovery after a crash. The smaller the redo logs, the more trouble you’re likely to run into with large or many concurrent transactions.

If your query performs writes to database tables, those changes are written to the REDO log and then, in the background, written back into the table space files. There exists configuration parameters for how much of the InnoDB buffer pool can be filled with dirty pages before they have to be flushed out to the table space files.

In order to maintain Isolation (I in ACID), InnoDB needs to assign a consistent read view to a new transaction. Transactions are either started explicitly (e.g. with BEGIN) or implicitly (e.g. just running a SELECT statement). There has been a lot of work recently in improving the scalability of creating read views inside InnoDB. A bit further in the past there was a lot of work in scaling InnoDB for greater than 1024 concurrent transactions (limitations in UNDO logging).

Fancy things that make InnoDB generally faster than you’d expect are the Adaptive Hash Index and change buffering. There are, of course, scalability challenges with these too. It’s good to understand the basics of them however and (of course), they are configurable.

If you end up reading or writing rows (quite likely) there will also be a translation between the InnoDB row format(s) and the MySQL Server row format(s). The details of which are not particularly interesting unless you’re delving deep into code or wish to buy me beer to hear about them.

Query execution may need to get many rows from many tables, join them together, sum things together or even sort things. If there’s an index with the sort order, it’s better to use that. MySQL may also need to do a filesort (sort rows, possibly using files on disk) or construct a temporary table in order to execute the query. Temporary tables are either using the MEMORY (formerly HEAP) storage engine or the MyISAM storage engine. Generally, you want to avoid having to use temporary tables – doing IO is never good.

Once you have the results of a query coming through, you may think that’s it. However, you may also be part of a replication hierarchy. If so, any changes made as part of that transaction will be written to the binary log. This is a log file maintained by the MySQL Server (not the storage engines) of all the changes to tables that have occured. This log can then be pulled by other MySQL servers and applied, making them replication slaves of the master MySQL Server.

We’ll ignore the differences between statement based replication and row based replication as they’re better discussed elsewhere. Being part of replication means you get a few extra locks and an additional file or two being written. The binary log (binlog for short) is a file on disk that is appended to until it reaches a certain size and is then rotated. Writes to this file vary in size (along with the size of transactions being executed). The writes to the binlog occur as part of committing the transaction (and the crash safety between writing to the binlog and writing to the storage engine are covered elsewhere – basically: you don’t want to know).

If your MySQL Server is a replication slave, then you have a thread reading the binary log files from another MySQL Server and then another thread (or, in newer versions, threads) applying the changes.

If the slow query log or general query log is enabled, they’ll also be written to at various points – and the current code for this is not optimal, there be (yes, you guess it) global mutexes.

Once the results of a query have been sent back to the client, the MySQL Server cleans things up (frees some memory) and prepares itself for the next query. You probably have many queries being executed simultaneously, and this is (naturally) a good thing.

There… I think that’s a mostly complete overview of all the things that can go on during query execution inside MySQL.

Update on MySQL on POWER8

About 1.5 months ago I blogged on MySQL 5.6 on POWER andtalked about what I had to poke at to make modern MySQL versions run and run well on shiny POWER8 systems.

One of those bugs, MySQL bug 47213 (InnoDB mutex/rw_lock should be conscious of memory ordering other than Intel) was recently marked as CLOSED by the Oracle MySQL team and the upcoming 5.6.20 and 5.7.5 releases should have the fix!

This is excellent news for those wanting to run MySQL on SMP systems that don’t have an Intel-like memory model (e.g. POWER and MIPS64).

This was the most major and invasive patch in the patchset for MySQL on POWER. It’s absolutely fantastic that this has made it into 5.6.20 and 5.7.5 and may mean that these new versions will work out-of-the-box on POWER (I haven’t checked… but from glancing back at my patchset there was only one other patch that could be related to correctness rather than performance).

Ghosts of MySQL Past: Part 2

This continues on from my post yesterday and also contains content from my linux.conf.au 2014 talk (view video here).

Way back in May in the year 2000, a feature was added to MySQL that would keep many people employed for many years – replication. In 3.23.15 you could replicate from one MySQL instance to another. This is commonly cited as the results of two weeks of work by one developer. The idea is simple: create a log of all the SQL queries that modify the database and then replay them on a slave. Remember, this is before there was concurrency and everything was ISAM or MyISAM, so this worked (for certain definitions of worked).

The key things to remember about MySQL replication are: it was easy to use, it was easy to set up and it was built into the MySQL Server. This is why it won. You have to fast forward to September in 2010 before PostgreSQL caught up! It was only with PostgreSQL 9.0 that you could have queryable read-only slaves with stock standard PostgreSQL.

If you want to know why MySQL was so much bigger than PostgreSQL, this built in and easy to use replication was a huge reason. There is the age of a decent scotch between read-only slaves for MySQL and PostgreSQL (although I don’t think I’ve ever pointed that out to my PostgreSQL friends when having scotch with them… I shall have to!)

In 2001, when space was an odyssey, the first GA (General Availability) release of MySQL 3.23 hit the streets (quite literally, this was back in the day of software that came in actual physical boxes, so it quite probably was literally hitting the streets).

For a good piece of trivia, it’s 3.23.22-beta that is the first release in the current bzr tree, which means that it was around this time that BitKeeper first came into use for MySQL source code.

We also saw the integration of InnoDB in 2001. What was supremely interesting is that the transactional storage engine was not from MySQL AB, it was from Innobase Oy. The internals of the MySQL server were certainly not set up for transactions, and for many years (in fact, to this day) we talk about how a transactional engine was shoehorned in there. Every transactional engine since has had to do the same odd things to, say, find out when a transaction was being started. The exception here is in Drizzle, where we finally cleaned up a bunch of this mess.

Having a major component of the MySQL server owned and controlled by another company was an interesting situation, and one that would prove interesting in a few years time.

We also saw Mårten Mickos become CEO in 2001, a role he would have through the Sun acquisition – an acquisition that definitively proved that you can build an open source company and sell it for a *lot* of money. It was also the year that saw MySQL AB accept its first round of VC funding, and this would (of course) have interesting implications: some good, some less ideal.

(We’ll continue tomorrow with Part 3!)

The EXAMPLE storage engine

The Example storage engine is meant to serve mainly as a code example of the stub of a storage engine for example purposes only (or so the code comment at the start of ha_example.cc reads). In reality however, it’s not very useful. It likely was back in 2004 when it could be used as a starting point for starting some simple new engines (my guess would be that more than a few of the simpler engines started from ha_example.cc).

The sad reality is the complexity of the non-obviousness of the bits o the storage engine API you actually care about are documented in ha_ndbcluster.cc, ha_myisam.cc and ha_innodb.cc. If you’re doing something that isn’t already done by one of those three engines: good luck.

Whenever I looked at ha_example.cc I always wished there was something more behind it… basically hoping that InnoDB would get a better and cleaner API with the server and would use that rather than the layering violations it has to do the interesting stuff.

That all being said, as a starting point, it probably helped spawn at least a dozen storage engines.

The ARCHIVE Storage Engine

I wonder how much longer the ARCHIVE storage engine is going to ship with MySQL…. I think I’m the last person to actually fix a bug in it, and that was, well, a good number of years ago now. It was created to solve a simple problem: write once read hardly ever. Useful for logs and the like. A zlib stream of rows in a file.

You can actually easily beat ARCHIVE for INSERT speed with a non-indexed MyISAM table, and with things like TokuDB around you can probably get pretty close to compression while at the same time having these things known as “indexes”.

ARCHIVE for a long time held this niche though and was widely and quietly used (and likely still is). It has the great benefit of being fairly lightweight – it’s only about 2500 lines of code (1130 if you exclude azio.c, the slightly modified gzio.c from zlib).

It also use the table discovery mechanism that NDB uses. If you remove the FRM file for an ARCHIVE table, the ARCHIVE storage engine will extract the copy it keeps to replace it. You can also do consistent backups with ARCHIVE as it’s an append-only engine. The ARCHIVE engine was certainly the simplest example code of this and a few other storage engine API things.

I’d love to see someone compare storage space and performance of ARCHIVE against TokuDB and InnoDB (hint hint, the Internet should solve this for me).

A few notes on InnoDB in MySQL 5.7.1

I’ve started poking around the MySQL 5.7.1 source tree (although just from tarball as I don’t see a BZR tree yet). I thought I’d share a few thoughts:

  • InnoDB temporary tables. Not REDO logged. What does this mean? It’s a huge step in removing the dependency on MEMORY and MyISAM engines for temporary tables used in query execution. With InnoDB temporary tables there is no reason for MEMORY engine to continue to exist, there is absolutely no way in which it is better.
  • InnoDB temp tables aren’t insert buffered
    This probably doesn’t really matter as you’re not going to be doing REDO logging for them (plus things are generally short lived)… but it could be a future area for performance improvement
  • The NO_REDO log mode appears to be implemented fairly neatly.
  • Improvements in innodb read only mode. What does this mean? Maybe we can finally get rid of the oddity of compressed read only MyISAM tables on read only media. (on the other hand, CDs and DVDs aren’t exactly a modern form of software distribution).
  • Some of the source code comments have improved.. it’s getting easier to understand InnoDB. I’d still make the argument that if you need source code comments you’re code isn’t clear enough… but any step is an improvement. (that being said, InnoDB was always easier than the server)
  • There is some pretty heavy refactoring of lock0lock.cc – I really need to sit down and poke at it a bunch.
  • The shared tablespace code (innodb system tablespace) has been heavily refactored. This also introduces tablespaces for temporary tables – and it appears to be implemented in the correct way.

I need to look into things a bunch more, and it’ll be really useful to see a bzr tree to better understand some of the changes.

More to come later, but that’s my quick look.


In MySQL 5.6 we have two new INFORMATION_SCHEMA tables for InnoDB that are likely going to cause confusion: INNODB_SYS_FIELDS and INNODB_SYS_COLUMNS. You may think these are likely to just be aliases of each other in order to make your life easier. However…

These are not the same thing. The INNODB_SYS_FIELDS table is all about key columns (fields) of InnoDB indexes, while INNODB_SYS_COLUMNS is about actual columns. This is even more confusing as within the MySQL source code, there is the Field set of objects that manipulate fields (columns) in a row.

Blegh. I’m glad it’s Friday.

Information on Bug#12704861 (which doesn’t exist in any public bug tracker)

Some of you may be aware that MySQL is increasingly using an Oracle-internal bug tracker. You can see these large bug numbers mentioned alongside smaller public bug numbers in recent MySQL release notes. If you’re particularly unlucky, you  just get a big Oracle-internal bug number. For a recently fixed bug, I dug further, posted up on the Percona blog: http://www.mysqlperformanceblog.com/2011/11/20/bug12704861/

Possibly interesting reading for those of you who interested in InnoDB, MySQL, BLOBs and crash recovery.

HailDB: A NoSQl API Direct to InnoDB

At the MySQL Conference and Expo last week I gave a session on HailDB. I’ve got the slides up on slideshare so you can either view through them or download them. I think the session went well, and there certainly is some interest in HailDB out there (which is great!).

innodb and memcached

I had a quick look at the source tree (I haven’t compiled it, just read the source – that’s what I do. I challenge any C/C++ compiler to keep up with my brain!) that’s got a tarball up on labs.mysql.com for the memcached interface to innodb. A few quick thoughts:

  • Where’s the Bazaar tree on launchpad? I hate pulling tarballs, following the dev tree is much more interesting from a tech perspective (especially for early development releases). I note that the NDB memcached stuff is up on launchpad now, so yay there. I would love it if the InnoDB team in general was much more open with development, especially with having source trees up on launchpad.
  • It embeds a copy of the memcached server engines branch into the MySQL tree. This is probably the correct way to go. There is no real sense in re-implementing the protocol and network stack (this is about half what memcached is anyway).
  • The copy of the memcached engine branch seems to be a few months old.
  • The current documentation appears to be the source code.
  • The innodb_memcached plugin embeds a memcached server using an API to InnoDB inside the MySQL server process (basically so it can access the same instance of InnoDB as a running MySQL server).
  • There’s a bit of something that kind-of looks similar to the Embedded InnoDB (now HailDB) API being used to link InnoDB and memcached together. I can understand why they didn’t go through the MySQL handler interface… this would be bracing to say the least to get correct. InnoDB APIs, much more likely to have fewer bugs.
  • If this accepted JSON and spat it back out… how fast would MongoDB die? weeks? months?
  • The above dot point would be a lot more likely if adding a column to an InnoDB table didn’t involve epic amounts of IO.
  • I’ve been wanting a good memcached protocol inside Drizzle, we have ,of course, focused on stability of what we do have first. That being said…. upgrade my flight home so I can open a laptop… probably be done well before I land….. (assuming I don’t get to it in the 15 other awesome things I want to hack on this week)

xtrabackup for Drizzle merge request

Follow it over on launchpad.

After having fixed an incredibly odd compiler warning (and with -Werror that we build with, error) on OSX (die die die) – xtrabackup for Drizzle is ready to be merged. This will bring it into our next milestone: freemont. Over the next few weeks you should see some good tests merged in for backup and restore too.

While not final final, I’m thinking that the installed binary name will be drizzlebackup.innobase. A simple naming scheme for various backup tools that are Drizzle specific. This casually pre-empts a drizzlebackup tool that can co-ordinate all of these (like the innobackupex script).

Things I’ve done in Drizzle

When writing my Dropping ACID: Eating Data in a Web 2.0 Cloud World talk for LCA2011 I came to the realisation that I had forgotten a lot of the things I had worked on in MySQL and MySQL Cluster. So, as a bit of a retrospective as part of the Drizzle7 GA release, I thought I might try and write down a (incomplete) list of the various things I’ve worked on in Drizzle.

I noticed I did a lot of code removal, that’s all fine and dandy but maybe I won’t list all of that… except perhaps my first branch that was merged :)


  • First ever branch that was merged: some mysys removal (use POSIX functions instead of wrappers that sometimes have different semantics than their POSIX functions), some removal of NETWARE, build scripts that weren’t helpful (i.e. weren’t what any build team ever used to build a release) and some other dead code removal.
  • Improve ‘make test’ time – transactions FTW! (this would be a theme for me over the years, I always want build and test to be faster)
  • Started moving functions out into their own plugins, removing the difference between UDFs (User Defined Functions) and builtin functions. One API to rule them all.
  • Ahhh compiler warnings (we now build with -Werror) and unchecked return codes from system calls (we now add this to some of our own methods, finding and fixing even more bugs). We did enable a lot of compiler warnings and OMG fix a lot of them.
  • Removal of FRM – use a protobuf message to describe the table. The first branch that implemented any of this was merged mid-November. It was pretty minimal, and we still had the FRM around for a little while yet – but it was the beginning of the end for features that couldn’t be implemented due to limitations in the FRM file format. I wrote a few blog entries about this.
  • A lot of test fixes for Drizzle
  • After relating the story of .test (hint: it broke the ability to check out the source tree on Solaris) to Anthony Baxter, he suggested trying something rather nasty… a unicode character above 2^16 – I chose 𝄢 – which at the time didn’t even render on Ubuntu – this was the test case you could only see correctly on MacOS X at the time (or some less broken Linux distro I guess). Some time later, I was actually able to view the test file on Ubuntu correctly.
  • I think it was November 1st when I started to work on Drizzle full time – this was rather awesome, although a really difficult decision as I did rather enjoy working with all the NDB guys.


  • January sparked the beginning of reading table information from the table protobuf message instead of the FRM file. The code around the FRM file was lovely and convoluted in places – to this day I’m surprised at the low bug count all that effort resulted in. One day I may write up a list of bugs and exploits probably available through the FRM code.
  • My hate for C++ fstream started in Feb 2009
  • In Feb I finally removed the code to work out (and store) in the FRM file how to display a set of fields for entering data into the table on a VT100 80×24 screen.
  • I filed my first GCC bug. The morning started off with a phone call and Brian asking me to look at some strange bug and ended with Intel processor manuals (mmm… the 387 is fun), the C language standard and (legitimately) finding it was a real compiler bug. Two hours and two minutes after filing the bug there was a patch to GCC fixing it – I was impressed.
  • By the end of Feb, the FRM was gone.
  • I spent a bit of time making Drizzle on linux-sparc work. We started building with compiler options that should be much friendlier to fast execution on SPARC processors (all to do with aligning things to word boundaries). -Wcast-align is an interesting gcc flag
  • Moved DDL commands to be in StorageEngine rather than handler for an instance of an open table (now known as Cursor)
  • MyISAM and CSV as temporary table only engines – this means we save a bunch of mucking about in the server.
  • Amazingly enough, sprintf is dangerous.
  • Moved to an API around what tables exist in a database so that the Storage Engines can own their own metadata.
  • Move to have Storage Engines deal with the protobuf table message directly.
  • Found out you should never assume that your process never calls fork() – if you use libuuid, it may. If it wasn’t for this and if we had param-build, my porting of mtr2 to Drizzle probably would have gone in.
  • There was this thing called pack_flag – its removal was especially painful.
  • Many improvements to the table protobuf message (FRM replacement) code and format – moving towards having a file format that could realistically be produced by code that wasn’t the Drizzle database kernel.
  • Many bug fixes, some acused by us, others exposed by us, others had always been there.


  • embedded_innodb storage engine (now HailDB). This spurred many bug and API fixes in the Storage Engine interface. This was an education in all the corner cases of the various interfaces, where the obvious way to do things was almost always not fully correct (but mostly worked).
  • Engine and Schema custom KEY=VALUE options.
  • Found a bug in the pthread_mutex implementation of the atomics<> template we had. That was fun tracking down.
  • started writing more test code for the storage engine interface and upper layer. With our (much improved) storage engine interface, this became relatively easy to implement a storage engine to test specific bits of the upper layer.
  • Wrote a CREATE TABLE query that would take over four minutes to run. Fixed the execution time too. This only existed because of a (hidden and undocumented) limitation in the FRM file format to do with ENUM columns.
  • Characters versus bytes is an important difference, and one that not all of the code really appreciated (or dealt with cleanly)
  • FOREIGN KEY information now stored in the table protobuf message (this was never stored in the FRM).
  • SHOW CREATE TABLE now uses a library that reads the table protobuf message (this will later be used in the replication code as well)
  • Started the HailDB project
  • Updated the innobase plugin to be based on the current InnoDB versions.
  • Amazingly, noticed that the READ_COMMITTED isolation level was never tested (even in the simplest way you would ever explain READ_COMMITTED).
  • Created a storage engine for testing called storage_engine_api_tester (or SEAPITester). It’s a dummy engine that checks we’re calling things correctly. It exposed even more bugs and strangeness that we weren’t aware of.
  • Fixed a lot of cases in the code where we were using a large stack frame in a function (greater than 32kb).
  • An initial patch using the internal InnoDB API to store the replication log


  • This can be detailed later, it’s still in progress :)
  • The big highlight: a release.

Drizzle gets InnoDB 1.0.9

My branch that updates the innobase plugin in Drizzle to be based on innodb_plugin 1.0.9 has been merged. For the next milestone, we’ll probably have 1.0.11 as well.

How’s the progress getting 1.1 and 1.2 in? Pretty good actually. We’ll have it for either this milestone or the next one.

and merging newer innodb into HailDB? It’s going well too, expect more news “soon”.

Second Drizzle Beta (and InnoDB update)

We just released the latest Drizzle tarball (2010-10-11 milestone). There are a whole bunch of bug fixes, but there are two things that are interesting from a storage engine point of view:

  • The Innobase plugin is now based on innodb_plugin 1.0.6
  • The embedded_innodb engine is now named HailDB and requires HailDB, it can no longer be built with embedded_innodb.

Those of you following Drizzle fairly closely have probably noticed that we’ve lagged behind in InnoDB versions. I’m actively working on fixing that – both for the innobase plugin and for the HailDB library.

If building the HailDB plugin (which is planned to replace the innobase plugin), you’ll need the latest HailDB release (which as of writing is 2.3.1). We’re making good additions to the HailDB API to enable the storage engine to have the same features as the Innobase plugin.

What was InnoDB+?

Yes, I said InnoDB+ with a plus sign at the end (also see the first comment here).

Please note that this blog post is only based on public information. It has absolutely nothing in it that I only could have learned from back when I worked at Sun or MySQL AB. Everything has links or pointers to where you can find the information out on the Internet and all thoughts are based on stringing these things together.

There was a lot of talk around the acquisition of Sun Microsystems by Oracle about MySQL (MySQL AB was bought by Sun). Some of the talk centred around Oracle and their ability to make a closed source version of MySQL with added bits that wouldn’t be released as GPL. They’ve since proved that they’re quite willing to do this to an open source project (see OpenSolaris).

Relatively recently, a bunch of history from the old InnoDB SVN trees was imported into the MySQL source tree. You can pull the revision of the SVN tree as of InnoDB Plugin 1.0.6 release by using revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:6263  from the MySQL repository – or just use a branch I’ve put up on launchpad for it (lp:~stewart/haildb/innodb-1.0.6-from-svn).

The first revision from the SVN tree was created on 2005-10-27, which you may remember was not too long after Oracle acquired Innobase on the 7th of October that year. The next two revisions were importing the 5.0 innodb code base, and then the 5.1 code base. Previous history can be found according to this blog post on Transactions on InnoDB.

According to Monty in the comment on the Pythian blog:

Oracle did work on a closed source version of InnoDB, codename InnoDB+, but they never released it, probably because our contract with them stopped them.

and from Eben Moglen’s letter to the EU Commission (via Baron Schwartz’s blog post):

Innobase could therefore have provided an enhanced version of InnoDB, like Oracle’s current InnoDB+, under non-GPL license

Most tellingly is a lot of references in the revision history to “branches/innodb+” as well as this commit:

revno: 0.5.148
revision-id: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6329
parent: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6322
committer: vasil
timestamp: Thu 2009-12-17 11:00:17 +0000
branches/innodb+: change name and version
Change name from “InnoDB Plugin” to “InnoDB+” and
version from 1.0.5 to 1.0.0.

So, from the revision history I’ve managed to work out that it likely was going to have the following features:

  • innodb_change_buffering (for values other than inserts)
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:4061
    Or, more tellingly revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:4053
    The latter tells about the merge of change buffering for delete-mark and delete in addition to the default of inserts.
  • Possibly compressed tables.
    revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2316 seems to show that it may have been copied across: “branches/innodb+: Copy from branches/zip r2315” in the comment.  There’s a lot of other merges of branches/zip as well
  • Something named FTS
    There is “branches/fts” in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2325 and revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2324  (there’s an import of a red-black tree implementation)
    If you also look at revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776
    you’ll see references to a innofts+ branch with ha_innodb.cc in it.
    So between a red-black tree and handler changes, this is surely something interesting.
  • Persistent statistics (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • Metrics Table (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • posix_fadvise() hints to temp files used in creating indexes (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2342 )
  • Improved recovery performance
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2989
    Talks about using the red-black tree for sorted insertion into the flush_list
  • native linux aio (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3913 )
  • group commit (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3923 )
  • New mutex to protect flush_list (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6330)

and finally, in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6819 you can see the change from “InnoDB+” back to “InnoDB” for being the built in default for MySQL 5.5

Warnings are now actual problems

Yesterday, I reached a happy milestone in HailDB development. All compiler warnings left in the api/ directory (the public interface to the database engine) are now either probable/possible bugs (that we need to look at closely) or are warnings due to unfinished code (that we should finish).

There’s still a bunch of compiler warnings that we’ve inherited (HailDB compiles with lots of warnings enabled) that we have to get through, but a lot will wait until after we update the core to be based on InnoDB 1.1.

HailDB 2.0.0 released!

(Reposted from the HailDB Blog. See also the announcement on the Drizzle Blog.)
We’ve made our first HailDB release! We’ve decided to make this a very conservative release. Fixing some minor bugs, getting a lot of compiler warnings fixed and start to make the name change in the source from Embedded InnoDB to HailDB.

Migrating your software to use HailDB is really simple. In fact, for this release, it shouldn’t take more than 5 minutes.

Highlights of this release:

  • A lot of compiler warnings have been fixed.
  • The build system is now pandora-build.
  • some small bugs have been fixed
  • Header file is now haildb.h instead of innodb.h
  • We display “HailDB” instead of “Embedded InnoDB”
  • Library name is libhaildb instead of libinnodb
  • It is probably binary compatible with the last Embedded InnoDB release, but we don’t have explicit tests for that, so YMMV.

Check out the Launchpad page on 2.0.0 and you can download the tarball either from there or right here:

  • haildb-2.0.0.tar.gz
    MD5:  183b81bfe2303aed435cdc8babf11d2b
    SHA1:  065e6a2f2cb2949efd7b8f3ed664bc1ac655cd75

HailDB, Hudson, compiler warnings and cppcheck

I’ve integrated HailDB into our Hudson setup (haildb-trunk on Hudson). I’ve also made sure that Hudson is tracking the compiler warnings. We’ve enabled more compiler warnings than InnoDB has traditionally been compiled with – this means we’ve started off with over 4,300 compiler warnings! Most of those are not going to be anything remotely harmful – however, we often find that it’s 1 in 1000 that is a real bug. I’ve managed to get it down to about 1,700 at the moment (removing a lot of harmless ones).

I’ve also enabled a cppcheck run on it. Cppcheck is a static analysis tool for C/C++. We’ve also enabled it for Drizzle (see drizzle-build-cppcheck on Hudson). When we enabled it for Drizzle, we immediately found three real bugs! There is also a coding style checker which we’ve also enabled on both projects. So far, cppcheck has not found any real bugs in HailDB, just some style warnings.

So, I encourage you to try cppcheck if you’re writing C/C++.