An old note on the Storage Engine API

Whenever I stick my head into the MySQL storage engine API, I’m reminded of a MySQL User Conference from several years ago now.

Specifically, I’m reminded of a slide from an early talk at the MySQL User Conference by Paul McCullagh describing developing PBXT. For “How to write a Storage Engine for MySQL”, it went something like this:

  1. Develop basic INSERT (write_row) support – INSERT INTO t1 VALUES (42)
  2. Develop full table scan (rnd_init, rnd_next, rnd_end)  – SELECT * from t1
  3. If you’re sane, stop here.

A lot of people stop at step 3. It’s a really good place to stop too. It avoids most of the tricky parts that are unexpected, undocumented and unlogical (yes, I’m inventing words here).

MySQL vs Drizzle plugin APIs

There’s a big difference in how plugins are treated in MySQL and how they are treated in Drizzle. The MySQL way has been to create a C API in front of the C++-like (I call it C- as it manages to take the worst of both worlds) internal “API”. The Drizzle way is to have plugins be first class citizens and use exactly the same API as if they were inside the server.

This means that MySQL attempts to maintain API stability. This isn’t something worth trying for. Any plugin that isn’t trivial quickly surpasses what is exposed via the C API and has to work around it, or, it’s a storage engine and instead you have this horrible mash of C and C++. The byproduct of this is that no core server features are being re-implemented as plugins. This means the API is being developed in a vacuum devoid of usefulness. At least, this was the case… The authentication plugin API seems to be an exception, and it’s interesting to note that semisync replication is in fact a plugin.

So times may be changing… sort of. Yesterday I noted that some storage engine API features are only available if you’re InnoDB and I’ve voiced my general disappointment in the audit API being unsuitable to implement various forms of query logging already in the server (general query log, slow query log).

One thing to note: when the API is the same for both inside the server and a plugin, it makes initial refactoring very easy, and you quickly see the bits that could be improved.

The EXAMPLE storage engine

The Example storage engine is meant to serve mainly as a code example of the stub of a storage engine for example purposes only (or so the code comment at the start of ha_example.cc reads). In reality however, it’s not very useful. It likely was back in 2004 when it could be used as a starting point for starting some simple new engines (my guess would be that more than a few of the simpler engines started from ha_example.cc).

The sad reality is the complexity of the non-obviousness of the bits o the storage engine API you actually care about are documented in ha_ndbcluster.cc, ha_myisam.cc and ha_innodb.cc. If you’re doing something that isn’t already done by one of those three engines: good luck.

Whenever I looked at ha_example.cc I always wished there was something more behind it… basically hoping that InnoDB would get a better and cleaner API with the server and would use that rather than the layering violations it has to do the interesting stuff.

That all being said, as a starting point, it probably helped spawn at least a dozen storage engines.

The MEMORY storage engine

I recently wrote about Where are they now: MySQL Storage Engines and The MERGE storage engine: not dead, just resting…. or forgotten. Today, it’s the turn of the MEMORY storage engine – otherwise known as HEAP.

This is yet another piece of the MySQL server that sits largely unmaintained and unloved. The MySQL Manual even claims that it supports encryption… with the caveat of having to use the SQL functions for encryption/decryption rather than in the engine itself (so, basically, it supports encryption about as much as every other engine does).

The only “recent” innovation in the MEMORY engine was the dynamic row patch that ended up making its way into Percona Server (and isn’t enabled by default). This forced me to go and look at the code of the MEMORY engine again and I cannot possibly drink enough in my lifetime to erase the memory.

The MEMORY engine is used by just about everybody as you probably have a SQL query somewhere that uses an in memory temporary table. I can, however, feel the comments being added to this post right now by people who use gdb to set server variables that not a single query in their systems use MEMORY….. (IIRC there have been some patches around that would throw an error rather than create a temporary table)

We had a early version of the dynamic row format patch in Drizzle for a while… and if you turned it on, all sorts of things horrifically broke. It was a remarkably non-trivial amount of work to get that code to work properly and this is largely a testament to the “design” of the MEMORY engine.

While it may be efficient or fast or something (likely on 1990s hardware and workloads), it misses the boat completely on the things that matter today: simultaneous access, MVCC, BLOB/TEXT columns and transactions. Basically, it’s a engine that’s really only useful for a single connection in limited use cases…. and even then, it’s likely a good way to ruin things. MyISAM is better as at least on memory pressure things may be written out to disk sensibly…. and if InnoDB had a “don’t log this table” mode it would beat that absolute pants off it.

It is, again, another part of the MySQL server that’s remarkably hard to pull out and replace with something different/better. Why? Well, I wrote about it before: Refactoring Internal temporary tables (another stab at it). If it was easy, we’d likely have Tokyo Cabinet (via BlitzDB) or similar (some bit of code maintained by other people) doing the same job in Drizzle rather than this large chunk of code that nobody really cares about.

Where are they now: MySQL Storage Engines

There was once a big hooplah about the MySQL Storage Engine Architecture and how it was easy to just slot in some other method of storage instead of the provided ones. Over the years I’ve repeatedly mentioned how this wasn’t really the case and that it was remarkably non trivial.

Over the years there have been many storage engines crop up and then disappear. So… where are they now?

  • ISAM
    This became MyISAM…. you know you’ve been around MySQL a long time if you’ve ever had to deal with an ISAM table.
  • Gemini
    This was the first big test of the GPL in court. Basically, you have to obey the GPL (see wikipedia for more info). The code was released as GPL and development stopped. This has been dead since ca 2002.
  • Amira – http://launchpad.net/amira
    Antony first mentioned this in 2008 on his blog. This was a continuation of the Gemini engine, you can actually go over to launchpad and get the code. This was one of the projects to have a transactional storage engine not owned by Oracle after Innobase Oy was acquired by them. It went nowhere special as Netfrastructure was acquired which became Falcon.
  • BDB
    otherwise known as the BerkeleyDB engine. It was seldom used and never gained much of a userbase. It was unceremoniously dropped back in 2006 and both users didn’t really exist.
  • PBXT – http://pbxt.blogspot.com/
    I think we can credit PBXT with at least half of the features and performance improvements to InnoDB since it first emerged back in 2006. It got attention very quickly. Why? Because it was different. It had the very rare ability to outperform InnoDB in some places. You can still find PBXT in MariaDB, but sadly it can be hard to fund development of a MySQL storage engine, especially one as tied to MySQL as PBXT is, and it’s no longer under active development. Closely related was the Blob Streaming project which was way ahead of its time as an AlsoSQL access method. The good news is that the code was released under a BSD license in 2012 (was previously GPL). We even had PBXT in Drizzle for a while.
  • Blob Streaming (PBMS) – http://bpbdev.blogspot.com/
    This project was closely related to (but not depending exclusively on) PBXT. It embedded a HTTP server inside the database and could use it to read and write BLOBs. This was not only fairly cool but way ahead of its time. We owe the existence of both HandlerSocket and the memcached interface to InnoDB to PBMS (it was also an inspiration for the JSON server plugin for Drizzle, to address some of the use cases of the PBMS plugin).
  • Federated
    It’s still there… but is effectively unmaintained and dead. There’s even FederatedX in MariaDB which is an improvement, but still, the MySQL server really doesn’t lend itself kindly to this type of engine… it’s always been an oddity only suitable for very specific tasks.
  • Archive
    Although useful, effectively unmaintained. I kinda don’t want to say dead… but if it went away, I wouldn’t exactly be surprised.
  • CSV
    Currently used to access the log tables in MySQL… and hardly used otherwise. It’s odd that the same code doesn’t deal with SELECT INTO OUTFILE and LOAD DATA INFILE, and I doubt this will ever change. I’d say effectively niche/dead.
  • SolidDB
    Purchased by IBM, abandoned.
  • DB2
    Only ever on System i. Useful for very very few people… but you can still find it around if you’re one of them.
  • Infobright
    OMG it exists! This is probably because they’re largely just using the MySQL server as a way to implement the MySQL network protocol and all of the heavy lifting is done by their own code.
  • Xeround
    I’m quite surprised these guys are still around, as they’re a proprietary storage engine as a service, and initial testing wasn’t entirely promising.
  • TokuDB
    I cannot emphasize how much more interesting TokuDB would be if it were open source. It actually holds some promise… and with their recent work with mongo, perhaps this is a good way forward for them…
  • Maria/Aria
    Another “OMG Oracle just bought Innobase Oy” engine. This was a project to take MyISAM and turn it into a lean, mean, transactional storage engine machine. It’s still not there and I don’t think it ever will be.
  • Falcon
    This was the hot new thing. It came out of Netfrastructure, which MySQL AB acquired in order to help get a transactional storage engine after Innobase Oy was acquired by Oracle. If you’re keeping count, that’s three projects for a transactional storage engine. Falcon was the star though, receiving all the press and publicity (well before it was ready). There are many reasons why Falcon isn’t around today – the chief one probably being that Oracle bought Sun who had bought MySQL and thus a need for an “InnoDB replacement” instantly vanished. There was also immense management pressure for performance to be greater than InnoDB, without any allowance for or focus on correctness…. and this showed. This was quite disappointing as Falcon had a lot of good architectural things going for it.
  • BlitzDB – https://launchpad.net/blitzdb
    I had hoped we’d replace MyISAM with BlitzDB in Drizzle. It was a wrapper around Tokyo Cabinet to the storage engine API in Drizzle. Unfortunately, the ties to MyISAM are incredibly deep (see my recent post on internal temporary tables) and we never quite got there.

I think this is all the notable engines that were aimed at widespread adoption… what ones have I forgotten?

It’s interesting to note that only Archive, CSV, Xeround, TokuDB and Infobright can be gotten anywhere, and the latter two only in their own distribution (one proprietary) and Xeround only as a service.

Is your Storage Engine buggy or the database server?

If your storage engine returns an error from rnd_init (or doStartTableScan as it’s named in Drizzle) and does not save this error and return it in any subsequent calls to rnd_next, your engine is buggy. Namely it is buggy in that a) an error may not be reported back to the user and b) everything may explode horribly when rnd_next is called after rnd_init returned an error.

Unless it is running on MariaDB 5.2 or (soon, when the patch hits the tree) Drizzle.

Monty (Widenius, not Taylor) wrote a patch for MariaDB based on my bug report that addressed that problem. It uses the compiler feature to throw a warning if the result of a function isn’t checked to make sure that all places that call rnd_init are checking for an error from the engine.

Today I (finally) pulled that into Drizzle as well.

So… if your engine does the logical thing and goes “oh look, this method returns an error… I’ll return my error” it will exhibit bugs in MySQL but not MariaDB 5.2 or Drizzle (when patch hits).

Which is buggy, the server or the engine?

The MySQL bug number is 54166, filed in June 2010.

A more complete look at Storage Engine API

Okay… So I’ve blogged many times before about the Storage Engine API in Drizzle. This API is somewhat inherited from MySQL. We have very much attempted to make it a much cleaner interface. Our goals in making changes include: make it much easier to write and maintain a storage engine, make the upper layer code obviously correct and clear in what it’s doing and being able to more easily introduce optimisations.

I’ve recently added a Storage Engine that is only used in testing: storage_engine_api_tester. I’ve blogged on it producing call graphs (really state transition graphs) before both for Storage Engine and Cursor.

I’ve been expanding the test. My test engine is now a wrapper around a real engine instead of just a fake one. This lets us run real queries (and test cases) while testing what’s going on. At some point in the near future I plan to make it so that it will be able to log what calls go on to the engine and produce a graph just of those.

I added a lot more to the Storage Engine part of the wrapper. Below is what you can see is the current graph:

I’ve coded what I consider to be bugs as red and what I consider suspect as blue.

Also for the Cursor (colours mean the same):

As you can see, there’s currently some wacky possibilities. I’m investigating exactly what’s going on here – If I’m somehow missing some calls that I should be wrapping (I don’t think so) or if we are really doing some dumb-ass things in the upper layer.

Also, please do not be under any impression that any of this means that we’re going to have a stable API. We’re not. To stabilise on this would just be insane – way too much of it still makes not much sense.

Cursor states

Following on from my post yesterday on the various states of a Storage Engine, I said I’d have a go with the Cursor object too. A Cursor is used by the Drizzle kernel to get and set data in a table. There can be more than one cursor open at once, and more than one per thread. If your engine cannot cope with this, it is its responsibility to figure it out and return the appropriate errors.

Let’s look at a really simple operation, inserting a couple of rows and then reading them back via a full table scan.

Now, this graph is slightly incomplete as there is no doEndTableScan() call. But you can see in which order things are meant to happen. In this case, “store_lock()” means that store_lock() has been called, so when coming back from doInsertRecord() we do not call store_lock() again, rather, we’re just in a state where it has already been executed.

For MySQL handler, think ::write_row() for doInsertRecord() and ::rnd_init() for doStartTableScan().

This diagram was again auto-generated from my test engine.

Storage Engine API state graph

Drizzle still has a number of quirks inherited from the MySQL Storage Engine API (e.g. BLOBs, row buffer, CREATE SELECT and lack of DDL transaction boundaries, key tuple format). One of the things we fixed a long time ago was to have proper methods for StorageEngines to be called for: startTransaction, startStatement, endStatement, commit and rollback.

If you’ve had to implement a transactional storage engine in MySQL you will be well aware of the pattern of “in every Storage Engine/handler call: if transaction doesn’t exist, begin.” We’ve tried to fix this in the Drizzle API for a number of reasons. I think having this obvious set of calls will make the API a lot easier to understand. I am also very interested in making things much easier to prove correct.

A while ago I spotted Bug 587772, which was the READ COMMITTED isolation level not working correctly with InnoDB. It turns out that the most basic example for READ COMMITTED failed. Hrrm… this is no good. It worked on MySQL, so this was certainly something that we broke. What was more worrying is that there wasn’t a test for this in the test suite (and at the time I couldn’t find one in the MySQL test suite either, so I think we inherited the missing test).

I recently started delving in, actually going to solve this. I noticed something worrying, endStatement wasn’t being called, which is where the innobase plugin would release the read view that it used for the statement. You’d think that it would grab a new one on startStatement, but because of the previous design of the API (remember “if txn isn’t started, start it!”) this also happened for getting the read view for the statement… so we instead got a REPEATABLE READ isolation level.

I wanted a test.

Previously, I’ve created a dummy storage engine (tableprototester) and used it to test the server code for reading the table protobuf message. I thought about doing a Storage Engine for this problem too, basically looking at the calls to the Storage Engine as transitions between states in a state machine.

A basic view of a transaction could be:

State transitions for a transaction. Transaction can be empty OR have one or more statementsThat is, a transaction starts and has zero or more statements before it commits or gets rolled back.

By coding up a data structure of allowable state transitions, a small function to assert() on invalid transitions and enough of the boilerplate to make the engine “work”, I was able to hit an assert() exactly where I’d expected it: at an invalid transition from START STATEMENT to COMMIT.

To fix the initial bug (READ COMMITTED not working), I filled in a few state transitions for the system as a whole that aren’t quite correct. From the diagram below, you can quite obviously see where the obvious bugs are (it helps that I’ve coloured them red):

There is absolutely no sense in going BEGIN -> END STATEMENT or immediately to COMMIT. These should be relatively easy to solve too, but are separate bugs.

I wish to expand this in the future to cover Cursor as well. It will also be useful to ensure that DDL can be wrapped in transactions. Not to mention the last few HTON flags that exist (and should likely go away).

To generate the diagrams, I just wrote a little utility to dump out the state transitions in dot, using it to generate the diagrams.

BLOBS in the Drizzle/MySQL Storage Engine API

Another (AFAIK) undocumented part of the Storage Engine API:

We all know what a normal row looks like in Drizzle/MySQL row format (a NULL bitmap and then column data):

Nothing that special. It’s a fixed sized buffer, Field objects reference into it, you read out of it and write the values into your engine. However, when you get to BLOBs, we can’t use a fixed sized buffer as BLOBs may be quite large. So, the format with BLOBS is the bit in the row is a length of the blob (1, 2, 3 or 4 bytes – in Drizzle it’s only 3 or 4 bytes now and soon only 4 bytes once we fix a bug that isn’t interesting to discuss here). The Second part of the in-row part is a pointer to a location in memory where the BLOB is stored. So a row that has a BLOB in it looks something like this:

The size of the pointer is (of course) platform dependent. On 32bit machines it’s 4 bytes and on 64bit machines it’s 8 bytes.

Now, if I were any other source of documentation, I’d stop right here.

But I’m not. I’m a programmer writing a Storage Engine who now has the crucial question of memory management.

When your engine is given the row from the upper layer (such as doInsertRecord()/write_row()) you don’t have to worry, for the duration of the call, the memory will be there (don’t count on it being there after though, so if you’re not going to immediately splat it somewhere, make your own copy).

For reading, you are expected to provide a pointer to a location in memory that is valid until the next call to your Cursor. For example, rnd_next() call reads a BLOB field and your engine provides a pointer. At the subsequent rnd_next() call, it can free that pointer (or at doStopTableScan()/rnd_end()).

HOWEVER, this is true except for index_read_idx_map(), which in the default implementation in the Cursor (handler) base class ends up doing a doStartIndexScan(), index_read(), doEndIndexScan(). This means that if a BLOB was read, the engine could have (quite rightly) freed that memory already. In this case, you must keep the memory around until either a reset() or extra(HA_EXTRA_FLUSH) call.

This exception is tested (by accident) by a whole single query in type_blob.test – a monster of a query that’s about a seven way join with a group by and an order by. It would be quite possible to write a fairly functional engine and completely miss this.

Good luck.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Using the row buffer in Drizzle (and MySQL)

Here’s another bit of the API you may need to use in your storage engine (it also seems to be a rather unknown. I believe the only place where this has really been documented is ha_ndbcluster.cc, so here goes….

Drizzle (through inheritance from MySQL) has its own (in memory) row format (it could be said that it has several, but we’ll ignore that for the moment for sanity). This is used inside the server for a number of things. When writing a Storage Engine all you really need to know is that you’re expected to write these into your engine and return them from your engine.

The row buffer format itself is kind-of documented (in that it’s mentioned in the MySQL Internals documentation) but everywhere that’s ever pointed to makes the (big) assumption that you’re going to be implementing an engine that just uses a more compact variant of the in-memory row format. The notable exception is the CSV engine, which only ever cares about textual representations of data (calling val_str() on a Field is pretty simple).

The basic layout is a NULL bitmap plus the data for each non-null column:

Except that the NULL bitmap is byte aligned. So in the above diagram, with four nullable columns, it would actually be padded out to 1 byte:

Each column is stored in a type-specific way.

Each Table (an instance of an open table which a Cursor is used to iterate over parts of) has two row buffers in it: record[0] and record[1]. For the most part, the Cursor implementation for your Storage Engine only ever has to deal with record[0]. However, sometimes you may be asked to read a row into record[1], so your engine must deal with that too.

A Row (no, there’s no object for that… you just get a pointer to somewhere in memory) is made up of Fields (as in Field objects). It’s really made up of lots of things, but if you’re dealing with the row format, a row is made up of fields. The Field objects let you get the value out of a row in a number of ways. For an integer column, you can call Field::val_int() to get the value as an integer, or you can call val_str() to get it as a string (this is what the CSV engine does, just calls val_str() on each Field).

The Field objects are not part of a row in any way. They instead have a pointer to record[0] stored in them. This doesn’t help you if you need to access record[1] (because that can be passed into your Cursor methods). Although the buffer passed into various Cursor methods is usually record[0] it is not always record[0]. How do you use the Field objects to access fields in the row buffer then? The answer is the Field::move_field_offset(ptrdiff_t) method. Here is how you can use it in your code:

ptrdiff_t row_offset= buf - table->record[0];
(**field).move_field_offset(row_offset);
(do things with field)
(**field).move_field_offset(-row_offset);

Yes, this API completely sucks and is very easy to misuse and abuse – especially in error handling cases. We’re currently discussing some alternatives for Drizzle.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Storage Engine API: write_row, CREATE SELECT and DDL

(this probably applies exactly the same for MySQL and Drizzle… but I’m just speaking about current Drizzle here)

In my current merge request for the embedded-innodb-create-select-transaction-arrgh branch (also see this specific revision), you’ll notice an odd hoop that we have to jump through to make CREATE SELECT statements work with an engine such as InnoDB.

Basically, this is what happens:

  • start transaction
  • start executing SELECT QUERY (well, prepare executing it and fetch a row)
  • create table
  • attempt to insert into table

But… we have to do the DDL statement (i.e. the CREATE TABLE) in its own transaction. This means that the outer transaction (running the SELECT) shouldn’t be able to see it. Except it does. We can create a cursor on this table. However, when we try and do something with it (e.g. ib_cursor_first()) we then get the error message DB_MISSING_HISTORY from InnoDB. With a data dictionary that was REPEATABLE READ, we shouldn’t have this problem. However, we don’t have that.

So? What do we do? If we’re in ::write_row and we get an error and we’re running a SQLCOM_CREATE_TABLE sql_command (yes, we get to poke into current_session->lex->sql_command to find this out) we just magically restart the transaction so that we can (properly) see the created table and write rows to it.

This is not a sane part of the interface; it won’t be an issue for many engines but it is needed here.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this:

  • The buffer is in-order of the index. First column in the index is first in the buffer, second second etc.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte is set to 1 if the column is null. Else, 0 means not-null.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as int), the next max field length bytes are for that field.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Improving the Storage Engine “API”

I increasingly enclose the API part of “Storage Engine API” in quotes as it does score a rather large number on the API Design Rusty levels (Coined by Rusty Russell). I give it a 15 (out of 18. lower is better) in this case “The obvious use is wrong”.

The ideas is that your handler gets called to write a row (the amazingly named handler::write_row()). It’s passed a buffer which is the row to be stored. An engine that uses the MySQL row format (lets say, ARCHIVE) will simply pack the row and write it out.

Unless there is a TIMESTAMP field with auto set on insert. Up until now (and still now in MySQL) the engine had to check if this was the case and make sure the timestamp field was updated.

To remove this particular bonghit is actually a really small patch, which Jay recently got merged:

~drizzle-developers/drizzle/development : revision 873.1.16

Hopefully somebody does this soon for MySQL as well.