An old note on the Storage Engine API

Whenever I stick my head into the MySQL storage engine API, I’m reminded of a MySQL User Conference from several years ago now.

Specifically, I’m reminded of a slide from an early talk at the MySQL User Conference by Paul McCullagh describing developing PBXT. For “How to write a Storage Engine for MySQL”, it went something like this:

  1. Develop basic INSERT (write_row) support – INSERT INTO t1 VALUES (42)
  2. Develop full table scan (rnd_init, rnd_next, rnd_end)  – SELECT * from t1
  3. If you’re sane, stop here.

A lot of people stop at step 3. It’s a really good place to stop too. It avoids most of the tricky parts that are unexpected, undocumented and unlogical (yes, I’m inventing words here).

MySQL vs Drizzle plugin APIs

There’s a big difference in how plugins are treated in MySQL and how they are treated in Drizzle. The MySQL way has been to create a C API in front of the C++-like (I call it C- as it manages to take the worst of both worlds) internal “API”. The Drizzle way is to have plugins be first class citizens and use exactly the same API as if they were inside the server.

This means that MySQL attempts to maintain API stability. This isn’t something worth trying for. Any plugin that isn’t trivial quickly surpasses what is exposed via the C API and has to work around it, or, it’s a storage engine and instead you have this horrible mash of C and C++. The byproduct of this is that no core server features are being re-implemented as plugins. This means the API is being developed in a vacuum devoid of usefulness. At least, this was the case… The authentication plugin API seems to be an exception, and it’s interesting to note that semisync replication is in fact a plugin.

So times may be changing… sort of. Yesterday I noted that some storage engine API features are only available if you’re InnoDB and I’ve voiced my general disappointment in the audit API being unsuitable to implement various forms of query logging already in the server (general query log, slow query log).

One thing to note: when the API is the same for both inside the server and a plugin, it makes initial refactoring very easy, and you quickly see the bits that could be improved.

Some storage engine features you only get if you’re InnoDB

I had reason to look into the extended secondary index code in MariaDB and MySQL recently, and there was one bit that I really didn’t like.

MariaDB:

share->set_use_ext_keys_flag(legacy_db_type == DB_TYPE_INNODB);

MySQL:

use_extended_sk= (legacy_db_type == DB_TYPE_INNODB);

In case you were wondering what “legacy_db_type” actually does, let me tell you: it’s not legacy at all, it’s kind of key to how the whole “metadata” system in MySQL works. For example, to drop a table, this magic number is used to work out what storage engine to call to drop the table.

Now, these code snippets basically kiss goodbye to the idea of a “pluggable storage engine” architecture. If you’re not InnoDB, you don’t get to have certain features. This isn’t exactly MySQL or MariaDB encouraging an open storage engine ecosystem (quite the opposite really).

Having the MySQL server have this incredibly basic, busy and incomplete understanding of metadata has always been a bit of a mess. The code for reading a table definition out of the FRM file really does show its age, and has fingers all through the server.

If somebody was serious about refactoring server code, you’d certainly be looking here, as this code is a major source of arbitrary limitations. However, if you have the server and the engine(s) both having separate views of what is the “correct” state of metadata you end up with a mess (anyone who has had InnoDB be out of sync with FRMs knows this one). I worry that the FRM code will be replaced with something even less understandable by humans, again making the mistake that the server knows the state of the engine better than the engine does.

See Also:

New libeatmydata release

Good news everyone! There’s a new libeatmydata release! I’ve put a source tarball up on the launchpad page: release-79.

This version packs:

  • RPM and debian packaging in tree
  • A bug fix so that O_SYNC and O_DSYNC are properly discarded on 32bit machines both with and without _FILE_OFFSET_BITS being set.

I’d love to hear any feedback and receive any patches (hopefully things still work well on MacOS X and Solaris). So far, libeatmydata has had contributions from the following people, and many thanks to them:

  • Stewart Smith
  • Alexey Bychko
  • Blair Zajac
  • Phillip Susi
  • Modestas Vainius
  • Monty Taylor
  • Olly Betts
  • Pavel Pushkarev
  • Elliot Murphy
  • Eric Wong
  • Tamas TEVESZ
  • Joachim Berdal Haga
  • Mohsen Hariri

The EXAMPLE storage engine

The Example storage engine is meant to serve mainly as a code example of the stub of a storage engine for example purposes only (or so the code comment at the start of ha_example.cc reads). In reality however, it’s not very useful. It likely was back in 2004 when it could be used as a starting point for starting some simple new engines (my guess would be that more than a few of the simpler engines started from ha_example.cc).

The sad reality is the complexity of the non-obviousness of the bits o the storage engine API you actually care about are documented in ha_ndbcluster.cc, ha_myisam.cc and ha_innodb.cc. If you’re doing something that isn’t already done by one of those three engines: good luck.

Whenever I looked at ha_example.cc I always wished there was something more behind it… basically hoping that InnoDB would get a better and cleaner API with the server and would use that rather than the layering violations it has to do the interesting stuff.

That all being said, as a starting point, it probably helped spawn at least a dozen storage engines.

The ARCHIVE Storage Engine

I wonder how much longer the ARCHIVE storage engine is going to ship with MySQL…. I think I’m the last person to actually fix a bug in it, and that was, well, a good number of years ago now. It was created to solve a simple problem: write once read hardly ever. Useful for logs and the like. A zlib stream of rows in a file.

You can actually easily beat ARCHIVE for INSERT speed with a non-indexed MyISAM table, and with things like TokuDB around you can probably get pretty close to compression while at the same time having these things known as “indexes”.

ARCHIVE for a long time held this niche though and was widely and quietly used (and likely still is). It has the great benefit of being fairly lightweight – it’s only about 2500 lines of code (1130 if you exclude azio.c, the slightly modified gzio.c from zlib).

It also use the table discovery mechanism that NDB uses. If you remove the FRM file for an ARCHIVE table, the ARCHIVE storage engine will extract the copy it keeps to replace it. You can also do consistent backups with ARCHIVE as it’s an append-only engine. The ARCHIVE engine was certainly the simplest example code of this and a few other storage engine API things.

I’d love to see someone compare storage space and performance of ARCHIVE against TokuDB and InnoDB (hint hint, the Internet should solve this for me).

The MySQL Cluster storage engine

This is one close to my heart. I’ve recently written on other storage engines: Where are they now: MySQL Storage Engines, The MERGE storage engine: not dead, just resting…. or forgotten and The MEMORY storage engine. Today, it’s the turn of MySQL Cluster.

Like InnoDB, MySQL Cluster started outside of MySQL. Those of you paying attention at home may notice a correlation between storage engines not written exclusively for MySQL and being at all successful.

NDB (for Network DataBase) started inside Ericsson, originally written in a language called PLEX, which was internal to Ericsson and used in the AXE telephone switches. Mikael Ronstrom’s PHD thesis covered NDB and even covered things that (at least were) yet to be implemented (it’s been quite a few years since I leafed through it last). The project at Ericsson (IIRC) was shelved a couple of times, but eventually got spun out into an Ericsson Business Innovation company called Alzato.

Some remnants of PLEX can still be found in the NDB source code (if you look really hard that is). At some point the code was fed through a PLEX to C++ converter and development continued from there. Some of the really, really old parts of the source may seem weird either due to this or some hand optimization for SPARC processors in the 1990s.

In 2003, MySQL AB acquired Alzato and work on a storage engine plugin for MySQL to interface to the (C++ API only) NDB was underway. Seeing as the storage engine interface was so simple, easy and modular it would only take several years for the interface to NDB to become mature.

The biggest problem: NDB itself worked really well if your workload fit exactly what it was good at… if you deviated, horrific performance and/or crashes were not as uncommon as we’d have liked. This was a source of strain for many years with the developers and support team on one side and some of the less-than-careful sales team on the other. That being said, there have been some absolutely awesome sales people selling NDB into markets it truly fits, and this is why there’s barely a place in the world where placing a mobile phone call doesn’t go through MySQL Cluster at some point.

You should read Tomas Ulin’s post Celebrating 10 years @MySQL for a bit of an insight into how Alzato became part of MySQL AB (which later became part of Sun which became part of Oracle).

I joined the MySQL Cluster team at MySQL in December 2004, not too long after Alzato was acquired, but certainly when the NDB storage engine in MySQL 4.1 was in its very early stages – it was then by no means a general purpose database.

Over the years, MySQL Cluster gained both traction and features, making it useful for more applications. One of the biggest marketing successes of MySQL was the storage engine architecture and how you could just “plug in” different engines. The reality (of course) was far different and even though MySQL Cluster did just “plug in” to MySQL, it was certainly not a drop in replacement.

In MySQL 5.0, a bunch of neat new features were added:

  • Engine condition pushdown
    This enabled conditions on non-indexed columns to be evaluated on the data nodes rather than having every row pulled up to the SQL node to be evaluated.
  • Batched read interface
    So that queries like SELECT FOO FROM BAR WHERE A IN (1,2,3) were executed as a single network round trip rather than 3 round trips.
  • Query cache
    Although the query cache should die, hey, at least it worked with NDB now…. in a way.
  • Reduced IndexMemory usage
    Remember, NDB is an in-memory database, so saving a bunch of bytes for secondary indexes was a big thing.

the first release with things I really worked on was MySQL 5.1. My first talk (to a packed room) at the MySQL User Conference in 2006 was on new features in MySQL Cluster 5.1. I’m still quite proud of that talk even though I know I am a much better speaker than I was then (It would have been great to have had more guidance… but hey, learning from experience is good too).

We added a lot in 5.1:

  • Integration with replication
    This is where row based replication was born. It was a real team effort with the NDB kernel part (going from memory and bzr logs) having been written by Tomas and Jonas seems to have a bunch of code there too. I worked a bunch on the NDB Injector thread in mysqld, Mats worked on the core row based code (at the time the most C++ like code in the entire MySQL world). You could now have a cluster replicate to another cluster with the giant bottleneck that is MySQL replication.
  • disk data
    You could store non-indexed columns on disk. I implemented the INFORMATION_SCHEMA.FILES table for this, I was young and naive enough to think that the InnoDB guys would also fill out this table and all would be happy with the world (I’m lucky I haven’t been holding my breath on this one).
  • Variable Sized columns
    A VARCHAR(255) would actually not always use more than 255bytes if you just stored a single character in it. Catch? Only for in-memory columns.
  • User defined partitioning
    Because NDB desperately needed more options, we let the user choose how they wanted to partition up their data (per table).
  • Autodiscovery of schema changes
    This was a giant workaround to the epic mess that is FRM files and data dictionary things inside the MySQL Server. It is because of all this code that when I went to rewrite the whole thing for Drizzle I took the approach of “just pass it down to the engines, the server must not attempt to know better”. FWIW, I’m still right: if the server tries to be clever you now have two places for bugs to be, not just one.
  • Distribution awareness
    i.e. better selection of which data node to talk to for a particular query, reducing latency.
  • Online add/drop index.
    How long did it take for other engines to get this? Let’s not think about that :)

After that the really interesting stuff started to happen, that is, the first major fork of MySQL: MySQL Cluster Carrier Grade Edition (CGE). Why? We had customers that simply couldn’t wait for MySQL 6.0 (after all, they’d still be waiting).

We had MySQL Cluster CGE 6.1, 6.2, 6.3 and now we’re into 7.0, 7.1 and 7.2. There is without doubt that it’s the longest serving and surviving MySQL fork. There were non-trivial changes inside the MySQL server too, which caused enough of a merge problem for the (small) Cluster team.

One big thing that you’re probably still all waiting for? Replication conflict detection and resolution in circular/multi-master replication setups. It was an NDB first and been used in production for a decent amount of time.

I remember a hack while on an airplane led to the CompressedBackup and CompressedLCP options (used zlib when writing out checkpoints/backups) – something that took more time than you’d think to go from prototype to production ready code.

The last few things I worked on in MySQL Cluster before going and working full time on Drizzle was the Windows port, online add/drop node and NDBINFO.

I’ve left out so many cool MySQL Cluster things that were worked on over the years (e.g. online add/drop column, rewriting of LCP code, micro GCPs, crash-safe DDL, the test suite). I really should mention the test suite, in lines of code it was over three times that of MyISAM.. and that was probably six years ago that I worked that out.

One thing to think about: when Innobase Oy was bought by Oracle and there was this effort to have a transactional storage engine that was inside MySQL AB rather than another company, I pointed out that I thought it would take less time adding the needed features to NDB and integrating it inside the MySQL server binary (and with the addition of online add node you could go from stand alone DB server to a full cluster with no down time) than it would for any of the alternatives to get to a suitable level of maturity.

I wish I put money on this… I put money on the MySQL 5.1 GA release date (which I was happy to loose), but in the years since you can see that InnoDB is still reigning supreme with all that came to replace it having fallen away for one reason or another. It’s still on track to have MySQL Cluster be the only real alternative (now also, funnily enough, owned by Oracle). I have to say, it’s kind of a hollow victory though, it would have been nice to see Falcon and PBXT be serious players in today’s market.

A few points on talking about the internet

There are a few things you should keep in mind when talking about the internet:

  • Use of the word “cyber” is not cool.
  • Whenever you hear the word “cyber” substitute it with “Information Super-Highway”… yes, it sounds that dated.
  • Use of the word “cyber” is applicable only in discussions relating to Doctor Who.
  • Whenever you see “facebook.com/BLAH” just think “AOL Keyword”. If you don’t know what AOL was, I likely have 437 trial CDs you can have.
  • There is no differentiation between life and online-life – just about everything is internet connected now. This very much counts for speech vs online speech.

A reminder: Leah and I are running in the upcoming MS Fun Run raising money for Australians affected by Multiple Sclerosis (MS). If you don’t sponsor us you’re going to die poor and alone. Really, I have an arrangement with all known deities to make it happen. Sponsor us here: http://register.mswalk.org.au/2013-MS-Walk-and-Fun-Run-Melbourne/Meep

OMG we appear to have entered a run for MS…

We are team MEEP – as that is the noise Beaker the cockatiel, our team mascot, makes. Stewart is going for 10 km, Leah is planning for 5 km … and Beaker will enjoy having two tired people to sit on afterwards. :)

We are taking part in the 2013 MS Walk and Fun Run in order to raise funds for people affected by multiple sclerosis (MS). MS is is the most common disease of the central nervous system and affects more than 23,000 Australians.

Did you know?

  • The average age of diagnosis of MS is just 30 years
  • MS affects three times as many women as men

MS Australia aims to minimize the impact of multiple sclerosis on all individuals affected by the disease, as well as their families, carers and the community, by offering a wide range of services, equipment and support. MS Australia’s goal is to assist everyone affected by MS to live life to their fullest potential and secure the care and support they need, until we ultimately find a cure.

You can sponsor us here: http://register.mswalk.org.au/2013-MS-Walk-and-Fun-Run-Melbourne/Meep