Is your Storage Engine buggy or the database server?

If your storage engine returns an error from rnd_init (or doStartTableScan as it’s named in Drizzle) and does not save this error and return it in any subsequent calls to rnd_next, your engine is buggy. Namely it is buggy in that a) an error may not be reported back to the user and b) everything may explode horribly when rnd_next is called after rnd_init returned an error.
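Here is a minimal sketch of the defensive pattern described above: remember the error from rnd_init and hand it back from rnd_next. The class is a hypothetical stand-in (it is not the real Cursor or handler base class), and the member names and error value are assumptions made purely for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine's cursor; NOT the real Cursor/handler API.
class SketchCursor
{
  int saved_init_error= 0;   // error remembered from rnd_init()
  bool fail_init;            // lets the example simulate a failing engine

public:
  explicit SketchCursor(bool fail) : fail_init(fail) {}

  int rnd_init()
  {
    // Pretend the engine could not start the scan (error code 1 is arbitrary).
    saved_init_error= fail_init ? 1 : 0;
    return saved_init_error;
  }

  int rnd_next()
  {
    // If the caller ignored the rnd_init() failure, report it again here
    // instead of touching scan state that was never set up.
    if (saved_init_error != 0)
      return saved_init_error;
    return 0;  // a real engine would read the next row here
  }
};
```

With this in place, a caller that ignores the return value of rnd_init() still sees the error on its first rnd_next() call.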

Unless it is running on MariaDB 5.2 or (soon, when the patch hits the tree) Drizzle.

Monty (Widenius, not Taylor) wrote a patch for MariaDB based on my bug report that addressed that problem. It uses the compiler feature that throws a warning when the result of a function isn’t checked, making sure that all places that call rnd_init check for an error from the engine.
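The post doesn’t name the compiler feature. With GCC and Clang, one way to get this behaviour is the warn_unused_result function attribute (C++17 also has [[nodiscard]]). The sketch below assumes GCC/Clang attribute syntax and uses hypothetical function names; it is not the actual patch.

```cpp
#include <cassert>

// Assumption: GCC/Clang attribute syntax; the function itself is hypothetical.
__attribute__((warn_unused_result))
int sketch_rnd_init(bool fail)
{
  return fail ? 1 : 0;   // 1 is an arbitrary error code
}

int start_scan(bool fail)
{
  // Writing just `sketch_rnd_init(fail);` here would draw a compiler
  // warning; using the value keeps the compiler quiet.
  int error= sketch_rnd_init(fail);
  if (error != 0)
    return error;        // propagate the engine's error to the caller
  return 0;
}
```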

Today I (finally) pulled that into Drizzle as well.

So… if your engine does the logical thing and goes “oh look, this method returns an error… I’ll return my error” it will exhibit bugs in MySQL but not MariaDB 5.2 or Drizzle (when patch hits).

Which is buggy, the server or the engine?

The MySQL bug number is 54166, filed in June 2010.

A more complete look at Storage Engine API

Okay… So I’ve blogged many times before about the Storage Engine API in Drizzle. This API is somewhat inherited from MySQL. We have very much attempted to make it a much cleaner interface. Our goals in making changes include making it much easier to write and maintain a storage engine, making the upper layer code obviously correct and clear in what it’s doing, and making it easier to introduce optimisations.

I’ve recently added a Storage Engine that is only used in testing: storage_engine_api_tester. I’ve blogged on it producing call graphs (really state transition graphs) before both for Storage Engine and Cursor.

I’ve been expanding the test. My test engine is now a wrapper around a real engine instead of just a fake one. This lets us run real queries (and test cases) while testing what’s going on. At some point in the near future I plan to make it so that it will be able to log what calls go on to the engine and produce a graph just of those.

I added a lot more to the Storage Engine part of the wrapper. Below is the current graph:

I’ve coded what I consider to be bugs as red and what I consider suspect as blue.

Also for the Cursor (colours mean the same):

As you can see, there are currently some wacky possibilities. I’m investigating exactly what’s going on here: whether I’m somehow missing some calls that I should be wrapping (I don’t think so) or whether we are really doing some dumb-ass things in the upper layer.

Also, please do not be under any impression that any of this means that we’re going to have a stable API. We’re not. To stabilise on this would just be insane – way too much of it still makes not much sense.

Storage Engine API state graph

Drizzle still has a number of quirks inherited from the MySQL Storage Engine API (e.g. BLOBs, row buffer, CREATE SELECT and lack of DDL transaction boundaries, key tuple format). One of the things we fixed a long time ago was to have proper methods for StorageEngines to be called for: startTransaction, startStatement, endStatement, commit and rollback.

If you’ve had to implement a transactional storage engine in MySQL you will be well aware of the pattern of “in every Storage Engine/handler call: if transaction doesn’t exist, begin.” We’ve tried to fix this in the Drizzle API for a number of reasons. I think having this obvious set of calls will make the API a lot easier to understand. I am also very interested in making things much easier to prove correct.

A while ago I spotted Bug 587772, which was the READ COMMITTED isolation level not working correctly with InnoDB. It turns out that the most basic example for READ COMMITTED failed. Hrrm… this is no good. It worked on MySQL, so this was certainly something that we broke. What was more worrying is that there wasn’t a test for this in the test suite (and at the time I couldn’t find one in the MySQL test suite either, so I think we inherited the missing test).

I recently started delving in to actually solve this. I noticed something worrying: endStatement wasn’t being called, which is where the innobase plugin would release the read view that it used for the statement. You’d think that it would grab a new one on startStatement, but because of the previous design of the API (remember “if txn isn’t started, start it!”) this also happened for getting the read view for the statement… so we instead got a REPEATABLE READ isolation level.

I wanted a test.

Previously, I’ve created a dummy storage engine (tableprototester) and used it to test the server code for reading the table protobuf message. I thought about doing a Storage Engine for this problem too, basically looking at the calls to the Storage Engine as transitions between states in a state machine.

A basic view of a transaction could be:

[Figure: State transitions for a transaction. A transaction can be empty OR have one or more statements.]

That is, a transaction starts and has zero or more statements before it commits or gets rolled back.

By coding up a data structure of allowable state transitions, a small function to assert() on invalid transitions and enough of the boilerplate to make the engine “work”, I was able to hit an assert() exactly where I’d expected it: at an invalid transition from START STATEMENT to COMMIT.
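To give a feel for the idea, here is a minimal sketch of a transition table plus an assert() on invalid transitions. The state names and the table contents are assumptions that follow the basic view above (where a transaction may be empty); this is not the actual test engine code.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical state names, loosely following the calls named in the post.
enum class State { Begin, StartStatement, EndStatement, Commit, Rollback };

// Allowed transitions for the basic transaction view (illustrative only).
static const std::set<std::pair<State, State> > allowed_transitions= {
  {State::Begin, State::StartStatement},
  {State::Begin, State::Commit},
  {State::Begin, State::Rollback},
  {State::StartStatement, State::EndStatement},
  {State::EndStatement, State::StartStatement},
  {State::EndStatement, State::Commit},
  {State::EndStatement, State::Rollback},
};

bool is_allowed(State from, State to)
{
  return allowed_transitions.count(std::make_pair(from, to)) != 0;
}

// Tracks the current state and asserts on any transition not in the table.
struct TransactionTracker
{
  State current= State::Begin;

  void transition(State to)
  {
    assert(is_allowed(current, to));  // fires on an invalid transition
    current= to;
  }
};
```

Going from StartStatement straight to Commit is not in the table, so a tracker wired into the engine calls would trip the assert() at exactly that point.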

To fix the initial bug (READ COMMITTED not working), I filled in a few state transitions for the system as a whole that aren’t quite correct. From the diagram below, you can quite obviously see where the obvious bugs are (it helps that I’ve coloured them red):

There is absolutely no sense in going BEGIN -> END STATEMENT or immediately to COMMIT. These should be relatively easy to solve too, but are separate bugs.

I wish to expand this in the future to cover Cursor as well. It will also be useful to ensure that DDL can be wrapped in transactions. Not to mention the last few HTON flags that exist (and should likely go away).

To generate the diagrams, I just wrote a little utility to dump out the state transitions in dot, using it to generate the diagrams.
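A utility like that can be very small. The following is a sketch of one possible shape, not the utility I actually wrote: the Transition type and the to_dot function are hypothetical names, and the transition list is passed in rather than read from the engine.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: one (from, to) pair of state names per edge.
typedef std::pair<std::string, std::string> Transition;

// Render the transitions as a Graphviz "dot" directed graph.
std::string to_dot(const std::vector<Transition>& transitions)
{
  std::ostringstream out;
  out << "digraph transitions {\n";
  for (const Transition& t : transitions)
    out << "  \"" << t.first << "\" -> \"" << t.second << "\";\n";
  out << "}\n";
  return out.str();
}
```

The resulting text can be fed to Graphviz (for example with dot -Tpng) to render a diagram.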

Using the row buffer in Drizzle (and MySQL)

Here’s another bit of the API you may need to use in your storage engine (it also seems to be rather unknown). I believe the only place where this has really been documented is, so here goes….

Drizzle (through inheritance from MySQL) has its own (in memory) row format (it could be said that it has several, but we’ll ignore that for the moment for sanity). This is used inside the server for a number of things. When writing a Storage Engine all you really need to know is that you’re expected to write these into your engine and return them from your engine.

The row buffer format itself is kind-of documented (in that it’s mentioned in the MySQL Internals documentation) but everywhere that’s ever pointed to makes the (big) assumption that you’re going to be implementing an engine that just uses a more compact variant of the in-memory row format. The notable exception is the CSV engine, which only ever cares about textual representations of data (calling val_str() on a Field is pretty simple).

The basic layout is a NULL bitmap plus the data for each non-null column:

Except that the NULL bitmap is byte aligned. So in the above diagram, with four nullable columns, it would actually be padded out to 1 byte:

Each column is stored in a type-specific way.
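As a small worked example of the byte alignment, here is a sketch that computes the size of the NULL bitmap and tests a bit in it. Both helpers are hypothetical; the sketch assumes one bit per nullable column with no other flag bits sharing the bitmap, and the bit order within a byte is also an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a byte-aligned NULL bitmap, one bit per nullable column.
// Assumption: no other flag bits share the bitmap.
std::size_t null_bitmap_bytes(std::size_t nullable_columns)
{
  return (nullable_columns + 7) / 8;
}

// Hypothetical helper: is the bit for nullable column `n` set?
// Assumption: lowest bit of each byte comes first.
bool is_null_bit_set(const unsigned char* row, std::size_t n)
{
  return (row[n / 8] & (1u << (n % 8))) != 0;
}
```

Four nullable columns need four bits, which rounds up to one byte; nine nullable columns would need two.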

Each Table (an instance of an open table which a Cursor is used to iterate over parts of) has two row buffers in it: record[0] and record[1]. For the most part, the Cursor implementation for your Storage Engine only ever has to deal with record[0]. However, sometimes you may be asked to read a row into record[1], so your engine must deal with that too.

A Row (no, there’s no object for that… you just get a pointer to somewhere in memory) is made up of Fields (as in Field objects). It’s really made up of lots of things, but if you’re dealing with the row format, a row is made up of fields. The Field objects let you get the value out of a row in a number of ways. For an integer column, you can call Field::val_int() to get the value as an integer, or you can call val_str() to get it as a string (this is what the CSV engine does, just calls val_str() on each Field).

The Field objects are not part of a row in any way. They instead have a pointer to record[0] stored in them. This doesn’t help you if you need to access record[1] (because that can be passed into your Cursor methods). Although the buffer passed into various Cursor methods is usually record[0] it is not always record[0]. How do you use the Field objects to access fields in the row buffer then? The answer is the Field::move_field_offset(ptrdiff_t) method. Here is how you can use it in your code:

ptrdiff_t row_offset= buf - table->record[0];
field->move_field_offset(row_offset);
(do things with field)
field->move_field_offset(-row_offset);

Yes, this API completely sucks and is very easy to misuse and abuse – especially in error handling cases. We’re currently discussing some alternatives for Drizzle.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this:

  • The buffer is in-order of the index. First column in the index is first in the buffer, second second etc.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte is set to 1 if the column is null. Else, 0 means not-null.
  • From (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as int), the next max field length bytes are for that field.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.
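To make the list above a bit more concrete, here is a sketch that builds a key tuple for a two-part index: a nullable 4-byte integer followed by a NOT NULL VARCHAR. The helper names are hypothetical. The integer byte order (little-endian) and the padding of the VARCHAR part out to its maximum length are assumptions on my part, so check them against a real key buffer before relying on this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: append one nullable 4-byte integer key part.
// Assumption: the integer is stored little-endian.
void append_nullable_int(std::vector<unsigned char>& key, bool is_null,
                         uint32_t value)
{
  key.push_back(is_null ? 1 : 0);        // NULL indicator byte
  std::size_t start= key.size();
  key.resize(start + 4, 0);              // fixed-length part, zero-filled
  if (!is_null)
    for (int i= 0; i < 4; i++)
      key[start + i]= (value >> (8 * i)) & 0xff;
}

// Hypothetical helper: append one NOT NULL VARCHAR key part.
// Assumption: the part is padded with zeros up to max_len bytes.
void append_varchar(std::vector<unsigned char>& key, const std::string& s,
                    std::size_t max_len)
{
  std::size_t len= s.size() < max_len ? s.size() : max_len;
  key.push_back(len & 0xff);             // 2 byte length, little-endian
  key.push_back((len >> 8) & 0xff);
  std::size_t start= key.size();
  key.resize(start + max_len, 0);        // zero-fill so memcmp is safe
  if (len != 0)
    std::memcpy(key.data() + start, s.data(), len);
}
```

Parsing the buffer in index_read is the same walk in reverse: read the NULL byte if the part is nullable, then the fixed-length bytes or the two length bytes plus data.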

I’ll discuss the use of this for rnd_pos() and position() in a later post…


on TableIdentifier (and the death of path as a parameter to StorageEngines)

As anybody who has ever implemented a Storage Engine for MySQL will know, a bunch of the DDL calls got passed a parameter named “path”. This was a filesystem path. Depending on what platform you were running, it may contain / or \ (and no, it’s not consistent on each platform). Add to that the difference if you were creating temporary tables (table name of #sql_somethingsomething) and the difference if you were one of the two (built in) engines that were able to be used for creating internal temporary tables (temp tables that are created during query execution that do not belong in a schema). Well… you had a bit of a mess.

My earlier attempts involved splitting everything up into two strings: schema name and table name. This ended badly. The final architecture we decided on was to have an object passed around that would deal with various transformations (from what the user entered to what we can store on file systems, or to what temporary table maps to what unique name). This is TableIdentifier.

Brian has been introducing it around the code for a while now, and we’ve just got it into most of the places where table names are passed to Storage Engines. This means that if you’re writing a Storage Engine that doesn’t just blindly store things in files, you can sensibly use the getSchemaName() and getTableName() methods to call your API.
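As a rough illustration of why this is nicer than parsing a path, here is a sketch using a hypothetical stand-in class. Only the two accessor names come from the real TableIdentifier; the class itself, the constructor and the backend_table_key helper are assumptions for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in exposing the two accessors named above;
// NOT the real TableIdentifier class.
class SketchIdentifier
{
  std::string schema;
  std::string table;

public:
  SketchIdentifier(const std::string& s, const std::string& t)
    : schema(s), table(t) {}

  const std::string& getSchemaName() const { return schema; }
  const std::string& getTableName() const { return table; }
};

// Hypothetical engine helper: build the name a non-file backend might use,
// with no filesystem path parsing involved.
std::string backend_table_key(const SketchIdentifier& id)
{
  return id.getSchemaName() + "." + id.getTableName();
}
```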

Embedded InnoDB: querying the configuration

I am rather excited about being able to do awesome things such as this to get the current configuration of your server:

    ->  WHERE NAME IN ("data_file_path", "data_home_dir");
+----------------+-------+
| NAME           | VALUE |
+----------------+-------+
| data_file_path | NULL  |
| data_home_dir  | ./    |
+----------------+-------+
2 rows in set (0 sec)

    -> WHERE NAME = "io_capacity";
+-------------+-------+
| NAME        | VALUE |
+-------------+-------+
| io_capacity | 200   |
+-------------+-------+
1 row in set (0 sec)

Coming soon: status in a table.

(this is for the upcoming embedded_innodb plugin, which uses the API provided by Embedded InnoDB to implement a Storage Engine for Drizzle)

Writing A Storage Engine for Drizzle, Part 2: CREATE TABLE

The DDL code paths for Drizzle are increasingly different from MySQL. For example, the embedded_innodb StorageEngine CREATE TABLE code path is completely different than what it would have to be for MySQL. This is because of a number of reasons, the primary one being that Drizzle uses a protobuf message to describe the table format instead of several data structures and a FRM file.

We are pretty close to having the table protobuf message format finalised (there are a few bits left to clean up, but expect them done Real Soon Now (TM)). You can see the definition (which is pretty simple to follow) in drizzled/message/table.proto. Also check out my series of blog posts on the table message (more posts coming, I promise!).

Drizzle allows either your StorageEngine or the Drizzle kernel to take care of storage of table metadata. You tell the Drizzle kernel that your engine will take care of metadata itself by specifying HTON_HAS_DATA_DICTIONARY to the StorageEngine constructor. If you don’t specify HTON_HAS_DATA_DICTIONARY, the Drizzle kernel stores the serialized Table protobuf message in a “table_name.dfe” file in a directory named after the database. If you have specified that you have a data dictionary, you’ll also have to implement some other methods in your StorageEngine. We’ll cover these in a later post.

If you ever dealt with creating a table in MySQL, you may recognize this method:

virtual int create(const char *name, TABLE *form, HA_CREATE_INFO *info)=0;

This is not how we do things in Drizzle. We now have this function in StorageEngine that you have to implement:

int doCreateTable(Session* session, const char *path,
                  Table& table_obj,
                  drizzled::message::Table& table_message)

The existence of the Table parameter is largely historic and at some point will go away. In the Embedded InnoDB engine, we don’t use the Table parameter at all. Shortly we’ll also get rid of the path parameter, instead having the table schema in the Table message and helper functions to construct path names.

Methods named “doFoo” (such as doCreateTable) mean that there is a method named foo() (such as createTable()) in the base class. It does some base work (such as making sure the table_message is filled out and handling any errors) while the “real” work is done by your StorageEngine in the doCreateTable() method.
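The foo()/doFoo() split can be sketched like this. The classes are hypothetical stand-ins (not the real StorageEngine or drizzled::message::Table), and the check done in the base class is an invented example of “base work”.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the table message; NOT drizzled::message::Table.
struct SketchTableMessage
{
  std::string name;
};

// Hypothetical stand-in for the StorageEngine base class.
class SketchEngineBase
{
public:
  virtual ~SketchEngineBase() {}

  // Public entry point: does the shared work, then delegates.
  int createTable(SketchTableMessage& message)
  {
    if (message.name.empty())
      return 1;                     // arbitrary error: message not filled out
    return doCreateTable(message);  // engine-specific work
  }

protected:
  virtual int doCreateTable(SketchTableMessage& message)= 0;
};

// An engine only has to implement the doFoo() half.
class SketchEngine : public SketchEngineBase
{
public:
  int tables_created= 0;

protected:
  int doCreateTable(SketchTableMessage&) override
  {
    tables_created++;               // a real engine would create storage here
    return 0;
  }
};
```

The upper layer only ever calls createTable(), so the shared checks cannot be skipped by an individual engine.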

The Embedded InnoDB engine goes through the table message and constructs a data structure for the Embedded InnoDB library to create a table. The ARCHIVE storage engine is much simpler, and it pretty much just creates the header of the ARZ file, mostly ignoring the format of the table. The best bet is to look at the code from one of these engines, depending on what type of engine you’re working on. This code, along with the table message definition should be more than enough.


Writing A Storage Engine for Drizzle, Part 1: Plugin basics

So, you’ve decided to write a Storage Engine for Drizzle. This is excellent news! The API is continually being improved and if you’ve worked on a Storage Engine for MySQL, you’ll notice quite a few differences in some areas.

The first step is to create a skeleton StorageEngine plugin.

You can see my skeleton embedded_innodb StorageEngine plugin in its merge request.

The important steps are:

1. Create the plugin directory

e.g. mkdir plugin/embedded_innodb

2. Create the plugin.ini file describing the plugin

Create the plugin.ini file in the plugin directory (so it’s plugin/plugin_name/plugin.ini).
An example plugin.ini for embedded_innodb is:

title=InnoDB Storage Engine using the Embedded InnoDB library
description=Work in progress engine using libinnodb instead of including it in tree.

This gives us a title and description, along with telling the build system what sources to compile and what headers to make sure to include in any source distribution.

3. Add plugin dependencies

Your plugin may require extra libraries on the system. For example, the embedded_innodb plugin uses the Embedded InnoDB library (libinnodb).

Other examples include the MD5 function requiring either openssl or gnutls, the gearman related plugins requiring gearman libraries, the UUID() function requiring libuuid and BlitzDB requiring Tokyo Cabinet libraries.

For embedded_innodb, pandora-build has a macro for finding libinnodb on the system. We want to run this configure check, so we create a file in the plugin directory (i.e. plugin/plugin_name/ and add the check to it.

For embedded_innodb, the file just contains this one line:


We also want to add two things to plugin.ini; one to tell the build system only to build our plugin if libinnodb was found and the other to link our plugin with libinnodb. For embedded_innodb, it’s these two lines:

build_conditional="x${ac_cv_libinnodb}" = "xyes"

Not too hard at all! This should look relatively familiar for those who have seen autoconf and automake in the past.

Some plugins (such as the md5 function) have a bit more custom auto-foo in plugin.ini and (as one of two libraries can be used). You can do pretty much anything with the plugin system, but you’re a lot more likely to keep it simple like we have here.

4. Add skeleton source code for your StorageEngine

While this will change a little bit over time (and is a little long to just paste into here), you can see what I did for embedded_innodb in the skeleton-embedded-innodb-engine tree.

5. Build!

You will need to re-run ./config/ so the build system picks up your new plugin. When you run ./configure --help afterwards, you should see options for building with/without your new plugin.

6. Add a test

You will probably want to add a test to see that your plugin loads successfully. When your plugin is built, the test suite automatically picks up any tests you have in the plugin/plugin_name/tests directory. This is in the same format as general MySQL and Drizzle tests: tests go in a t/ directory, expected results in a r/ directory.

Since we are loading a plugin, we will also need some server options to make sure that plugin is loaded. These are stored in the rather inappropriately named test-master.opt file (that’s the test name with “-master.opt” appended to the end instead of “.test“). For the embedded_innodb plugin_load test, we have a plugin/embedded_innodb/tests/t/plugin_load-master.opt file with the following content:


You can have pretty much anything in the plugin_load.test file… if you’re fancy, you’ll have a SELECT query on data_dictionary.plugins to check that the plugin really is there. Be sure to also add a r/plugin_load.result file. (My preferred method is to just create an empty result file, run the test suite and examine the rejected output before renaming the .reject file to .result.)

Once you’ve added your test, you can run it either by just typing “make test” (which will run the whole test suite), or you can go into the main tests/ directory and run ./ --suite=plugin_name (which will just run the tests for your plugin).

7. Check the code in, feel good about self

and you’re done. Well… the start of a Storage Engine plugin is done :)
