SHOW CREATE TABLE using table protobuf message

… and really testing the replication code path for CREATE TABLE.

So, for a very long time now, Drizzle has been using a protobuf based structure to describe the structure of tables. The idea was to have engines rather painlessly generate this structure themselves (which several now do). A secondary idea was to use this structure itself for CREATE TABLE (in progress; embedded_innodb does in fact only use the table message for its CREATE TABLE code path). The third idea was to put the table protobuf message into the replication stream instead of the CREATE TABLE statement (i.e. a SQL string). This means that you could (easily) write a replicator to a DBMS with a different SQL syntax, or to a system that doesn’t speak SQL at all.

The final step, to reduce duplicated functionality, would be to have the code that implements SHOW CREATE TABLE use a common bit of code for turning a table protobuf message back into a SQL string.

We now have that.

Just now, my branch to replace the old SHOW CREATE TABLE code (that was using TableShare and friends) with the statement_transform code (that we also use in converting a replication log to SQL commands) hit trunk.

Yay.
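
For a feel of what that shared code has to do, here’s a minimal sketch (purely illustrative – this is not the actual statement_transform API, which handles indexes, defaults, table options and plenty more) that walks a Table message and emits a CREATE TABLE string using the accessors generated from drizzled/message/table.proto:

#include <sstream>
#include <string>
#include "drizzled/message/table.pb.h"

/* Hedged sketch: turn a Table message into a CREATE TABLE string.
   Type handling and quoting are drastically simplified. */
std::string table_message_to_sql(const drizzled::message::Table &table)
{
  std::ostringstream sql;
  sql << "CREATE TABLE `" << table.name() << "` (";

  for (int i= 0; i < table.field_size(); ++i)
  {
    const drizzled::message::Table::Field &field= table.field(i);
    if (i > 0)
      sql << ", ";
    sql << "`" << field.name() << "` ";

    switch (field.type())
    {
    case drizzled::message::Table::Field::INTEGER:
      sql << "INT";
      break;
    case drizzled::message::Table::Field::VARCHAR:
      sql << "VARCHAR(" << field.string_options().length() << ")";
      break;
    default:
      sql << "TEXT";  /* remaining types elided for brevity */
      break;
    }
  }

  sql << ") ENGINE=" << table.engine().name();
  return sql.str();
}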

ENUM now works properly (in Drizzle)

Over at the Drizzle blog, the recent 2010-06-07 tarball was announced. This tarball release has my fixes for the ENUM type, so that it now works as it should. I was quite amazed that such a small block of code could have so many bugs! One of the most interesting was the documented limit we inherited from MySQL (see the MySQL Docs on ENUM) of a maximum of 65,535 elements for an ENUM column.

This all started out from a quite innocent comment of Jay’s in a code review for adding support for the ENUM data type to the embedded_innodb engine: I should use a constant instead of the magic 0x10000 number as the limit in an assert checking the sanity of values getting passed to the engine. Seeing as there wasn’t a constant already in the code for that (surprise number 1), I said I’d fix it properly in a separate patch (creating a bug for it so it wouldn’t get lost) and the code went in.

So, now, a few weeks after that, I got around to dealing with that bug (because hey, it was going to be an easy fix that would give me a nice sense of accomplishment). A quick look at the Field_enum code raised my suspicions of bugs… I initially wondered if we’d get any error message if a StorageEngine returned a table definition that had too many ENUM elements (for example, 70,000). So I added a table to the tableprototester plugin (a simple dummy engine that is loaded for testing the parsing of specially constructed table messages) that had 70,000 elements in a single ENUM column. It didn’t throw an error. Darn. It did, however, produce an incredibly large result for SHOW CREATE TABLE.
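
Building such a message is pleasantly easy with the generated protobuf API. Here’s a hedged sketch of roughly what it looks like (the field_value member name inside set_options is an assumption on my part – table.proto is the authoritative source):

#include <cstdio>
#include "drizzled/message/table.pb.h"

/* Hedged sketch: build a table message with a single ENUM column
   holding 70,000 elements. */
drizzled::message::Table make_huge_enum_table()
{
  drizzled::message::Table table;
  table.set_name("huge_enum");
  table.set_type(drizzled::message::Table::STANDARD);
  table.mutable_engine()->set_name("tableprototester");

  drizzled::message::Table::Field *field= table.add_field();
  field->set_name("e");
  field->set_type(drizzled::message::Table::Field::ENUM);

  for (int i= 0; i < 70000; ++i)
  {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "enum%d", i);
    field->mutable_set_options()->add_field_value(buf);  /* assumed member name */
  }

  return table;
}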

Often with bugs like this I’ll try to see if the problem is something inherited from MySQL, and I’ll often file a bug with MySQL as well if that’s the case. If I can, I’ll sometimes attach the associated patch from Drizzle that fixes the bug, sometimes with a patch directly for (and tested on) MySQL (if it’s not going to take me too long). Whether these patches are ever applied is a whole other thing – and sometimes you get responses like “each engine is meant to have auto_increment behave differently!” – which doesn’t inspire confidence.

But anyway, the MySQL limit is somewhere between 10,850 and 10,900 elements. This is not at all what’s documented. I’ve filed the appropriate bug (Bug #54194) with a reproducible test case and the bit of problematic code. It turns out that this is (yet another) limit of the FRM file – the FRM can only be “about 64k”. The bit of code in MySQL that was doing the checking for the ENUM limit was this:


  /* Hack to avoid bugs with small static rows in MySQL */
  reclength=max(file->min_record_length(table_options),reclength);
  if (info_length+(ulong) create_fields.elements*FCOMP+288+
      n_length+int_length+com_length > 65535L || int_count > 255)
  {
    my_message(ER_TOO_MANY_FIELDS, ER(ER_TOO_MANY_FIELDS), MYF(0));
    DBUG_RETURN(1);
  }

Note that this check is on accumulated byte lengths (the ENUM element names count towards int_length) rather than on a number of elements, so the real ceiling depends on how long your element names are. It’s no surprise to anyone, then, how this specific limit (the number of elements in an ENUM) got missed when I converted Drizzle from using an FRM over to a protobuf based structure.

A bunch of other cleanup and a whole lot of extra testing later, I can pretty confidently state that the ENUM type in Drizzle works exactly how you’d think it would.

Either way, if you’re getting anywhere near 10,000 choices for an ENUM column you have no doubt already lost.

Writing A Storage Engine for Drizzle, Part 2: CREATE TABLE

The DDL code paths in Drizzle are increasingly different from MySQL’s. For example, the embedded_innodb StorageEngine CREATE TABLE code path is completely different from what it would have to be for MySQL. There are a number of reasons for this, the primary one being that Drizzle uses a protobuf message to describe the table format instead of several data structures and an FRM file.

We are pretty close to having the table protobuf message format be final (there are a few bits left to clean up, but expect them done Real Soon Now (TM)). You can see the definition (which is pretty simple to follow) in drizzled/message/table.proto. Also check out my series of blog posts on the table message (more posts coming, I promise!).

Drizzle allows either your StorageEngine or the Drizzle kernel to take care of storage of table metadata. You tell the Drizzle kernel that your engine will take care of metadata itself by specifying HTON_HAS_DATA_DICTIONARY to the StorageEngine constructor. If you don’t specify HTON_HAS_DATA_DICTIONARY, the Drizzle kernel stores the serialized Table protobuf message in a “table_name.dfe” file in a directory named after the database. If you have specified that you have a data dictionary, you’ll also have to implement some other methods in your StorageEngine. We’ll cover these in a later post.
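
As a rough sketch (the constructor signature here is illustrative, not copied from the tree), an engine that owns its metadata announces it something like this:

#include <string>
#include "drizzled/plugin/storage_engine.h"

/* Hedged sketch: the exact StorageEngine constructor arguments may
   differ; the point is that HTON_HAS_DATA_DICTIONARY is passed as an
   engine flag so the kernel will not write .dfe files for us. */
class MyDictionaryEngine : public drizzled::plugin::StorageEngine
{
public:
  MyDictionaryEngine(const std::string &name_arg)
    : drizzled::plugin::StorageEngine(name_arg, HTON_HAS_DATA_DICTIONARY)
  {}

  /* ... doCreateTable() and the data dictionary methods go here ... */
};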

If you’ve ever dealt with creating a table in MySQL, you may recognize this method:

virtual int create(const char *name, TABLE *form, HA_CREATE_INFO *info)=0;

This is not how we do things in Drizzle. We now have this function in StorageEngine that you have to implement:

int doCreateTable(Session* session, const char *path,
                  Table& table_obj,
                  drizzled::message::Table& table_message)

The existence of the Table parameter is largely historic and at some point will go away. In the Embedded InnoDB engine, we don’t use the Table parameter at all. Shortly we’ll also get rid of the path parameter, instead having the table schema in the Table message and helper functions to construct path names.

Methods named “doFoo” (such as doCreateTable) mean that there is a method named foo() (such as createTable()) in the base class. The base method does some common work (such as making sure the table_message is filled out and handling any errors) while the “real” work is done by your StorageEngine in the doCreateTable() method.
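
Illustratively, the pattern has this shape (a hedged sketch, not the exact kernel code):

#include "drizzled/message/table.pb.h"

class Session;  /* forward declarations for the sketch */
class Table;

class StorageEngine
{
public:
  int createTable(Session *session, const char *path,
                  Table &table, drizzled::message::Table &table_message)
  {
    /* base work: make sure the message is usable, handle errors */
    if (not table_message.IsInitialized())
      return -1;  /* illustrative error handling */

    /* the "real", engine-specific work */
    return doCreateTable(session, path, table, table_message);
  }

protected:
  virtual int doCreateTable(Session *session, const char *path,
                            Table &table,
                            drizzled::message::Table &table_message)= 0;
};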

The Embedded InnoDB engine goes through the table message and constructs a data structure for the Embedded InnoDB library to create a table. The ARCHIVE storage engine is much simpler: it pretty much just creates the header of the ARZ file, mostly ignoring the format of the table. The best bet is to look at the code from one of these engines, depending on what type of engine you’re working on. This code, along with the table message definition, should be more than enough.
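
For a flavour of the first approach, here’s a hedged sketch of walking the field list with the accessors the protobuf compiler generates from table.proto (the engine-side type mapping is elided):

/* Sketch: inside doCreateTable(), table_message is the parameter
   shown earlier. */
for (int i= 0; i < table_message.field_size(); ++i)
{
  const drizzled::message::Table::Field &field= table_message.field(i);

  switch (field.type())
  {
  case drizzled::message::Table::Field::INTEGER:
    /* map field.name() to the engine's native integer column type */
    break;
  case drizzled::message::Table::Field::VARCHAR:
    /* field.string_options().length() gives the declared length */
    break;
  default:
    break;  /* remaining types elided for brevity */
  }
}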

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

The Table protobuf message format

If you’ve ever opened up drizzled/message/table.proto in the Drizzle source tree you will have seen what’s in the table message: the structure that describes a database table in Drizzle. Previously I’ve talked about the Table message more generally, giving a fair bit of history of the FRM file and how we’ve replaced it with both the Table protobuf message and an infrastructure inside Drizzle so that Storage Engines own their own metadata.

Yesterday I talked about the Schema protobuf message format in some detail, and this time I’m talking about the Table protobuf message in a similar amount of detail.

The first time we were loading (then only part of) the table definition out of a protobuf message was way back in January 2009 (I blogged about it too). It was an adventure untangling all sorts of things to get to a much nicer place (where we are now). The code in the server is not perfect… I’ll be the first to admit that some of it is rather strange, but that’s mostly all behind the scenes for people interested in the protobuf Table message!

The Table message has several embedded messages in it too. We need to have information on the Storage Engine, Fields and Indexes (and each of those can have other properties). It is much more complex than the simple Schema message.

Let’s have a look at the basic structure of the Table message:

message Table {

    /* *SNIP* (Here go the definitions for TableType, StorageEngine, Field, Index, ForeignKeyConstraint, TableOptions and TableStats) */

  required string name = 1;
  required TableType type = 5;
  required StorageEngine engine = 2;
  repeated Field field = 3;
  repeated Index indexes = 4;

  repeated ForeignKeyConstraint fk_constraint = 8;
  optional TableOptions options = 9;
  optional TableStats stats = 10;
}

(We’ve skipped the definitions for the embedded messages for now)

This all seems pretty logical: a table has a name and a type, lives in a Storage Engine, has Fields, may have Indexes and foreign key constraints, and has some options and statistics (the statistics may go away at some point “soon”).

Let’s have a look at the TableType message definition:

  enum TableType {
    STANDARD = 0;
    TEMPORARY = 1;
    INTERNAL = 2;
  }

It’s pretty simple: the table type is either a standard table (what you get from CREATE TABLE), a temporary table (what you get from CREATE TEMPORARY TABLE) or an INTERNAL table (what you get when Drizzle uses a temporary table during query execution).

Next, the StorageEngine message:

  message StorageEngine {

    message EngineOption {
      enum EngineOptionType {
        BOOL = 0;
        INTEGER = 1;
        STRING = 2;
      }

      required string option_name = 1;
      required string option_value = 2;
      required EngineOptionType option_type = 3;
    }

    required string name = 1;
    repeated EngineOption option = 2;
  }

The main part is the “name” member, which is just the name of the storage engine (e.g. “PBXT”, “INNODB”, “ARCHIVE”). We do, however, have support in the StorageEngine message for engine-specific options (in key-value form). Expect these to be used more in the near future.
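
Setting one of these options from code is a one-liner per member with the generated API; a small sketch (the ROW_FORMAT key and its value are purely illustrative):

/* Sketch: attaching an engine-specific option to a table message. */
drizzled::message::Table table;
drizzled::message::Table::StorageEngine *engine= table.mutable_engine();
engine->set_name("INNODB");

drizzled::message::Table::StorageEngine::EngineOption *option=
    engine->add_option();
option->set_option_name("ROW_FORMAT");   /* illustrative key */
option->set_option_value("COMPACT");     /* illustrative value */
option->set_option_type(
    drizzled::message::Table::StorageEngine::EngineOption::STRING);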

Specifying Fields is probably the most complex part of the table message. The Field message looks like this (with many embedded messages):

message Field {
    required string name = 1;
    required FieldType type = 2;
    optional FieldFormatType format = 3;
    optional FieldOptions options = 4;
    optional FieldConstraints constraints = 5;
    optional NumericFieldOptions numeric_options = 6;
    optional StringFieldOptions string_options = 7;

    optional string comment = 16; /* Reserve 0-15 for frequently accessed attributes */
    optional SetFieldOptions set_options = 17;
    optional TimestampFieldOptions timestamp_options = 18;
}

So… what does this all mean? Well, Fields have a type, they’re stored in a format, there are options attached to them, and there may be constraints as well as field-type-specific options.

The different field types should be fairly familiar by now:

    enum FieldType {
      DOUBLE = 0;
      VARCHAR = 1;
      BLOB = 2;
      ENUM = 3;
      INTEGER = 4;
      BIGINT = 5;
      DECIMAL = 6;
      DATE = 7;
      TIME = 8;
      TIMESTAMP = 9;
      DATETIME = 10;
    }

We also allow fields in different formats. Currently, these are default, fixed and dynamic. The idea is you can tell the engine (or the engine can tell you) how it’s storing the field. This is currently here as a nicety and the users for this are few and far between.

    enum FieldFormatType {
      DefaultFormat= 0;
      FixedFormat= 1;
      DynamicFormat= 2;
    }

The FieldOptions get interesting though:

    message FieldOptions {
      optional string default_value = 1;
      optional string update_value = 2;
      optional bool default_null = 3 [default = false];
      optional bytes default_bin_value = 4;
    }

You’ll no doubt be intrigued by the existence of both “default_value” and “default_bin_value”. Ordinarily, using a string to contain a textual representation of the default value (e.g. “foo” or “42”) is fine. However, BLOB columns can have defaults that aren’t representable in a text string; you need binary data (e.g. a default value containing ‘\0’).
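
A small sketch of setting such a binary default (the two-argument setter for bytes members comes from the protobuf compiler):

/* Sketch: a BLOB default containing a NUL byte, which no text
   string can represent. */
drizzled::message::Table table;
drizzled::message::Table::Field *field= table.add_field();
field->set_name("b");
field->set_type(drizzled::message::Table::Field::BLOB);
field->mutable_options()->set_default_bin_value("\x00\x42\x00", 3);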

For TIMESTAMP columns, we continue to support DEFAULT NOW() and the ability to update the timestamp column on UPDATE. How is this represented in the table message? Well… default_value will be “NOW()” and update_value will be “NOW()”. It is intended that in the future it will be possible to have arbitrary SQL expressions for these. This does, of course, require support in the Drizzle server.
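
In code, filling out such a column’s options looks something like this sketch (ts_field is assumed to point at a TIMESTAMP Field in the table message):

/* Sketch: TIMESTAMP ... DEFAULT NOW() ON UPDATE NOW(), as described
   above. */
ts_field->mutable_options()->set_default_value("NOW()");
ts_field->mutable_options()->set_update_value("NOW()");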

The default_null bool should be rather obvious :)

Well… that’s enough for today. Next time: more of the Field message!

The Schema protobuf message: Drizzle’s metadata on a schema

I’ve previously talked about table metadata in Drizzle and how we use the table protobuf message to describe a table (see Drizzle FRM Replacement and others). The model in Drizzle is that the engine is responsible for its metadata. For schemas (you may be thinking ‘database’ but we’re moving to the Schema terminology in Drizzle) we also have a small amount of metadata.

The protobuf message is specified in drizzled/message/schema.proto and is incredibly short. In fact, here it is in its entirety:

package drizzled.message;
option optimize_for = SPEED;

message Schema {
  required string name = 1;
  optional string collation = 2;
}

We don’t keep an awful lot of metadata about schemas: a Schema has a name and a default collation.

You can also read the db.opt file directly using the provided (and very simple) schema_reader utility.
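
Something in the spirit of schema_reader takes only a handful of lines with the generated API; a hedged sketch (the real utility may well differ):

#include <fstream>
#include <iostream>
#include "drizzled/message/schema.pb.h"

/* Hedged sketch: parse a serialised Schema message off disk and
   print what we know about it. */
int main(int argc, char *argv[])
{
  if (argc != 2)
  {
    std::cerr << "usage: " << argv[0] << " <schema file>" << std::endl;
    return 1;
  }

  drizzled::message::Schema schema;
  std::fstream input(argv[1], std::ios::in | std::ios::binary);
  if (not schema.ParseFromIstream(&input))
    return 1;

  std::cout << "name: " << schema.name() << std::endl;
  if (schema.has_collation())
    std::cout << "collation: " << schema.collation() << std::endl;
  return 0;
}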

In the near future, we could have CREATE DATABASE and CREATE SCHEMA replicated via this protobuf message. This would make the replication stream extremely easy for utilities to parse.

We’ll also (rather shortly) have key-value pairs as options to CREATE SCHEMA/CREATE DATABASE. More on that later :)

Drizzle FRM replacement: the table proto

Drizzle originally inherited the FRM file from MySQL (which inherited it from UNIREG). The FRM file stores metadata about a table: what columns it has, what type those columns are, what indexes exist, any default values, comments etc. In the days of MyISAM, this worked relatively well. The row data was stored in table.MYD, indexes on top of it in table.MYI and information about the format of the row in table.FRM. Since MyISAM itself wasn’t crash safe, it didn’t really matter that creating or deleting the FRM file along with the table wasn’t either.

As more sophisticated engines were introduced (e.g. InnoDB) that had their own data dictionary, there started to be more of a problem. There were now two places storing information about a table: the FRM file and the data dictionary specific to the engine. Even if the data dictionary of the storage engine was crash safe, the FRM file was not plugged into that, so you could end up in a situation where the storage engine recovered from a crash okay, but the FRM was incorrect for what the engine recovered to. This would always require manual intervention to find out what went wrong and then fix it (in some rather unusual ways).

When the MySQL Cluster (NDB) engine was introduced, a new set of problems arose. Now the MySQL server was connecting to an existing database, where tables could be created on other nodes connected to the cluster. You not only had the problems of crash recovery, but also the problem of keeping the FRM files in sync across many nodes, requiring all sorts of interesting solutions that, for the most part, do work.

The “obvious” solution to some of these problems would be for an engine to write out an FRM file itself. This is much easier said than done. The file format was never created to be read and written by multiple pieces of software, the code that did the reading and writing inside the server was not reusable elsewhere, and the only documentation (apart from a decent chunk of the MySQL source tree) is the rather incomplete definition in the MySQL Internals wiki (http://forge.mysql.com/wiki/MySQL_Internals_File_Formats) – not nearly enough to write a correct FRM file, as the specifics are very, very odd.

Our goals for reworking the metadata system in Drizzle were to allow engines to own their own metadata (removing any opportunity for inconsistencies between the engine and the ‘FRM’) and, for engines without their own data dictionary, to replace the FRM file format with something simple and well documented.

One option was to use SQL as the standard storage format, but it is rather non-trivial and expensive to parse – especially if we were to use it as the preferred way of talking table definitions with storage engines. We had been looking at the protobuf library (http://code.google.com/p/protobuf/) ever since its first release, and it has a number of very nice characteristics: a description language for a data structure that is used to generate APIs for reading and writing it in a number of programming languages, and a standard (documented) way to serialize the data structure.

After a bit of discussion, we arrived at a good outline for the table definition proto. The current one can always be found in the Drizzle source tree at drizzled/message/table.proto. The current format is very close to final (i.e. one that we’ll support upgrades from).

The process of modifying the Drizzle code base so that it would write (and read) a file format different to the FRM isn’t worth going into too much here, although there were some interesting hurdles to overcome. An interesting one was that the FRM file contains a binary image of the default row for the table (in the row format that the server uses); we now store the default value for each column in the proto and generate the default row when we read the proto. Another interesting one was removing and refactoring “pack_flag” – the details of which should only be extracted from Jay or Stewart with a liberal application of fine ale.

The end result is that we now have storage engines that are completely responsible for their own metadata. One example is the ARCHIVE engine. In the CREATE TABLE code path, the ARCHIVE storage engine gets the table definition in an object that represents the table proto. It can examine the parameters it needs and then either store the proto directly or convert it into its own format. Since ARCHIVE is simple, it just serialises the table proto (using a standard function provided by the protobuf library) and stores it in the .ARZ data file for the table. This instantly makes the ARCHIVE storage engine crash safe for CREATE and DROP TABLE, as there is only one file on disk, so no two files to get out of sync.
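
The serialise/parse round trip itself is just two standard protobuf calls; a sketch (the surrounding .ARZ file handling is elided, and the error handling is illustrative):

#include <string>
#include "drizzled/message/table.pb.h"

/* Sketch: serialise on CREATE TABLE, parse again on open – the two
   standard protobuf calls an engine needs in order to own its
   metadata. */
bool round_trip(const drizzled::message::Table &table_message)
{
  std::string blob;
  if (not table_message.SerializeToString(&blob))
    return false;

  /* ... ARCHIVE would now write 'blob' into the .ARZ header ... */

  drizzled::message::Table recovered;
  return recovered.ParseFromString(blob);
}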

If an engine does not have its own data dictionary, it can still use the default implementation, which just stores the serialised table proto in a file on disk.

We can also now use this interface to move INFORMATION_SCHEMA into its own storage engine. This means we can remove a lot of special-case code for INFORMATION_SCHEMA throughout the server and instead just have an INFORMATION_SCHEMA storage engine that says it has the following tables in the INFORMATION_SCHEMA database. Because the table definition is now in a documented format with a standard API, this becomes a relatively trivial exercise.

What we’re all looking forward to is when the InnoDB data dictionary is linked into the new interface and we can have a truly crash safe database server.

Another wonderful side effect is that since we now have a standard data structure for representing a table definition, we can integrate this with the replication system. In the “near” future, we can represent a CREATE TABLE in the replication stream as a table proto and not as raw SQL. If you wanted to apply the replication stream to a different database server, you would then only have to write a table proto to SQL converter. If the target database system doesn’t do SQL at all, you could generate API calls to create the table.

So we now have a rather flexible system in place, with the code implementing it being increasingly simple and able to be “obviously correct”.

Things that easily fall out of this work that people have written about:
– CREATE TABLE LIKE with ENGINE clause
http://krow.livejournal.com/671235.html
– table_raw_reader – looking at the raw representation of table metadata
http://www.flamingspork.com/blog/2009/10/01/table_raw_reader-reading-the-table-proto-from-disk-and-examining-everything/
– Table discovery
http://www.flamingspork.com/blog/2009/07/29/table-discovery-for-drizzle-take-2-now-merged/

Some more info:
http://krow.livejournal.com/642329.html

table_raw_reader – reading the table proto from disk and examining everything

Ever wondered exactly *what* is in a Drizzle Table proto? Well, wonder no more. A while back a little utility called table_raw_reader hit the Drizzle codebase. It’s a simple command line utility that takes a serialised table proto file (such as a .dfe file) as an argument, reads it off disk and then prints out a text representation using the TextFormat class of the protobuf library.

An example:

stewart@willster:~/drizzle/jay-and-stewart-remove-pack_flag$ ./drizzled/message/table_raw_reader ./tests/var/master-data/test/t1.dfe
name: "t1"
engine {
  name: "InnoDB"
}
field {
  name: "id"
  type: INTEGER
  format: DefaultFormat
  options {
    length: 11
  }
}
field {
  name: "padding"
  type: VARCHAR
  format: DefaultFormat
  options {
    length: 800
  }
  string_options {
    length: 200
    collation_id: 45
    collation: "utf8_general_ci"
  }
}
indexes {
  name: "PRIMARY"
  is_primary: true
  is_unique: true
  type: UNKNOWN_INDEX
  key_length: 4
  index_part {
    fieldnr: 0
    compare_length: 4
    key_type: 8
  }
  options {
  }
}
type: STANDARD
options {
  collation: "utf8_general_ci"
  collation_id: 45
  pack_record: true
  row_type: ROW_TYPE_DEFAULT
}

This can help a lot when debugging; it has already helped me a lot. It’s left as an exercise for the reader to create a utility that lets you modify the table proto in this text-based format.
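
As a hint for that exercise, here’s a hedged sketch of the reverse direction, using the same TextFormat class to parse the text representation back into a Table message:

#include <fstream>
#include <sstream>
#include <google/protobuf/text_format.h>
#include "drizzled/message/table.pb.h"

/* Hedged sketch: read the text representation back into a Table
   message, then write the binary form back out. */
bool text_to_table_proto(const char *text_path, const char *out_path)
{
  std::ifstream in(text_path);
  std::stringstream buffer;
  buffer << in.rdbuf();

  drizzled::message::Table table;
  if (not google::protobuf::TextFormat::ParseFromString(buffer.str(), &table))
    return false;

  std::fstream out(out_path, std::ios::out | std::ios::binary);
  return table.SerializeToOstream(&out);
}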

pack_flag is dead!

(At least in Drizzle)

Brian merged my jay-and-stewart-remove-pack_flag branch and it’s now in trunk. Removing pack_flag was a task for the bell milestone and, in true collaborative effort, it took more than one person to unravel its dark mysteries. Hats go off to Jay, who bravely ripped more of it out with the aid of excellent Seattle coffee. I finished it off by removing the last remnants and fixing an outstanding bug (and at the same time making us look at the DECIMAL code and shudder in fear of what it may mean).

With pack_flag gone, we can now do insane things like remove the last bits of TINYINT from the code; previously, if you did that, things broke.

This also means we are one more step towards a table proto that is maintainable long term and doesn’t have incomprehensible magic.