Storing the table message in Embedded InnoDB

One of the exciting things[1] about working on a storage engine in Drizzle is that you get to manage your own metadata. When the database engine you’re writing the storage engine interface for has a pretty complete data dictionary (e.g. Embedded InnoDB) you could just directly use it. At some point I plan to do this for the embedded_innodb engine for Drizzle so that you could just point Drizzle at an existing Embedded InnoDB database and run SQL queries on it.

The Drizzle table message does have some things in it that aren’t in the InnoDB data dictionary though (e.g. table and column comments). We want to preserve these (and also handle cases where several data types in Drizzle map to the same data type in InnoDB). Since the Embedded InnoDB API allows us to do things within the DDL transaction (such as insert a row into a table), we store the serialized table message in a table as part of the DDL transaction. This means we can have fully crash safe DDL! There is no way the table definition can get out of sync with what is in InnoDB; we are manipulating them both in the same transaction!

The table structure we’re using is pretty simple. There are two columns: table_name VARCHAR(IB_MAX_TABLE_NAME_LEN) and message BLOB.

The operations we need are:

  • store the table message in doCreateTable (INSERT)
  • rename the table message in doRenameTable (UPDATE the table_name column)
  • delete the table message in doDropTable (DELETE)
  • list tables in a database (SELECT with prefix)
  • get table message (SELECT using key lookup)

All of which are pretty easy to implement using the Embedded InnoDB API.
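As a very rough sketch (error handling trimmed, the metadata table name here is made up, and the ib_* calls follow the Embedded InnoDB API documentation rather than the actual plugin source), storing the message inside the same DDL transaction looks something like this:

#include <innodb.h>
#include <cstring>
#include <string>

/* Sketch only: insert the serialized table message using the same ib_trx_t
   that the surrounding CREATE TABLE is running in. */
static ib_err_t store_table_message(ib_trx_t ddl_transaction,
                                    const char *table_name,
                                    const std::string &serialized_message)
{
  ib_crsr_t cursor;
  ib_err_t err;

  /* "data_dictionary/table_definitions" is a placeholder name for the
     two-column (table_name, message) table described above. */
  err= ib_cursor_open_table("data_dictionary/table_definitions",
                            ddl_transaction, &cursor);
  if (err != DB_SUCCESS)
    return err;

  ib_tpl_t tuple= ib_clust_read_tuple_create(cursor);
  ib_col_set_value(tuple, 0, table_name, strlen(table_name));
  ib_col_set_value(tuple, 1, serialized_message.data(),
                   serialized_message.length());

  err= ib_cursor_insert_row(cursor, tuple);  /* part of the DDL transaction */

  ib_tuple_delete(tuple);
  ib_cursor_close(cursor);
  return err;
}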

[1] Maybe I need to get out more….

Finding Ada

Ada Lovelace Day is an international day of blogging to celebrate the achievements of women in technology and science.

- http://findingada.com/

This is something I had wanted to do last year… and I’m finding I have the same problem this year. My idea was to write about someone who has had an influence on me. The problem is picking one person to write about. Throughout my life there have been many women in technology who have influenced me. I started going through people in my head… and got to a very long list rather quickly.

So, instead, I shall write about the future.

To the future Ada, who will find this whole exercise of picking a woman in technology to write about as absurd as if we, today, picked a woman who votes to write about.

on TableIdentifier (and the death of path as a parameter to StorageEngines)

As anybody who has ever implemented a Storage Engine for MySQL will know, a bunch of the DDL calls got passed a parameter named “path”. This was a filesystem path. Depending on what platform you were running on, it may contain / or \ (and no, it’s not consistent on each platform). Add to that the differences when creating temporary tables (table names of #sql_somethingsomething), and further differences if you were one of the two (built-in) engines that could be used for internal temporary tables (temp tables created during query execution that do not belong to a schema)… well, you had a bit of a mess.

My earlier attempts involved splitting everything up into two strings: schema name and table name. This ended badly. The final architecture we decided on was to have an object passed around that would deal with various transformations (from what the user entered to what we can store on file systems, or to what temporary table maps to what unique name). This is TableIdentifier.

Brian has been introducing it around the code for a while now, and it has now reached most of the places where table names are passed to Storage Engines. This means that if you’re writing a Storage Engine that doesn’t just blindly store things in files, you can sensibly use the getSchemaName() and getTableName() methods to call your API.
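For example (a sketch only: the namespace and exact return types are approximate, and this is not the real engine method signature), an engine can now ask the identifier for names rather than parsing a path string:

#include <string>
#include <iostream>

// Sketch: ask the TableIdentifier for the names instead of parsing a filesystem path.
void createTableInMyEngine(const drizzled::TableIdentifier &identifier)
{
  const std::string schema= identifier.getSchemaName(); // e.g. "test"
  const std::string table= identifier.getTableName();   // e.g. "t1", or the mangled
                                                         // unique name of a temp table
  // Hand these straight to your engine's own API rather than a path on disk.
  std::cout << "creating " << schema << "." << table << std::endl;
}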

One last bit of evil….

You can store things for later!
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= malloc(1000); return snprintf(s, 100, \"%p\", a); }") as RESULT;
+-----------+
| RESULT    |
+-----------+
| 0x199c610 |
+-----------+
1 row in set (0 sec)
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x199c610; strcpy(a, \"Hello World!\"); strcpy(s,\"done\"); return strlen(s); }") as result;
+--------+
| result |
+--------+
| done   |
+--------+
1 row in set (0.01 sec)
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x199c610; strcpy(s, a); return strlen(s); }") as result;
+--------------+
| result       |
+--------------+
| Hello World! |
+--------------+
1 row in set (0.01 sec)
And then… I can disconnect, reconnect, or whatever (as for any of the above really) before cleaning up my memory:
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x19a9bc0; free(a); strcpy(s,\"done\"); return strlen(s); }") as result;
+--------+
| result |
+--------+
| done   |
+--------+
1 row in set (0 sec)

A MD5 stored procedure for Drizzle… in C

So, just in case that wasn’t evil enough for you… perhaps you have something you want to know the MD5 checksum of. So, you could just do this:

drizzle> select md5('Hello World!');
+----------------------------------+
| md5('Hello World!')              |
+----------------------------------+
| ed076287532e86365e841e92bfc50d8c |
+----------------------------------+
1 row in set (0 sec)

But that is soooo boring.

Since we have the SSL libs already loaded into Drizzle, and using my very evil libtcc plugin… we could just implement it in C. We can even use malloc!

drizzle> SELECT LIBTCC("#include <string.h>\n#include <stdlib.h>\n#include <openssl/md5.h>\nint foo(char* s) { char *a = malloc(100); MD5_CTX context; unsigned char digest[16]; MD5_Init(&context); strcpy(a,\"Hello World!\"); MD5_Update(&context, a, strlen(a)); MD5_Final(digest, &context); snprintf(s, 33, \"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x\", digest[0], digest[1], digest[2], digest[3],digest[4], digest[5], digest[6], digest[7],digest[8], digest[9], digest[10], digest[11],digest[12], digest[13], digest[14], digest[15]); free(a); return 32; }") AS RESULT;

+----------------------------------+
| RESULT                           |
+----------------------------------+
| ed076287532e86365e841e92bfc50d8c | 
+----------------------------------+
1 row in set (0.01 sec)

Currently the parameter is static in the C version due to me not having… well… done a good job implementing the calling of C code.

Stored Procedures/Functions for Drizzle

Previously, in “Thoughts on Thoughts on Drizzle” I theorized that one of the major reasons why we did not see lots of people jumping at stored procedures in MySQL was that it wasn’t in their native language (for lack of a better term). We’ve seen External Language Stored Procedures for MySQL that let you write stored procedures in some other languages…. but I felt something was missing.

Firstly, I wanted a language I was really familiar with and comfortable writing complex things in.

Secondly, it should be compiled so that it runs as fast as possible.

Thirdly, it shouldn’t just be linking to a pre-compiled library (drizzle function plugins do that already)

So… the obvious choice was C.

I have a really, really, really early prototype:

drizzle> SELECT LIBTCC("int foo(char* s) { s[0]='4'; s[1]='2'; s[2]=0; return 2; }") AS RESULT;

+--------+
| RESULT |
+--------+
| 42     |
+--------+
1 row in set (0 sec)

or… a bit more sophisticated:

drizzle> SELECT LIBTCC("#include <string.h>\nint foo(char* s) { strcpy(s,\"Hello World!\");; return strlen(s); }") AS RESULT;

+--------------+
| RESULT       |
+--------------+
| Hello World! |
+--------------+
1 row in set (0 sec)

I’m using a function as a bit of a cheat… but the string is passed to libtcc (modified so it’s a shared library so I can load it into drizzle), where it is compiled into native object code (in my case x86-64) and then run.
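Under the hood this is (roughly) what any libtcc user does; here is a trimmed-down sketch of the idea (error checking mostly omitted, and note that the tcc_relocate() signature varies between tcc releases):

#include <libtcc.h>
#include <cstddef>

// Sketch: compile the user-supplied string to native code in memory,
// look up foo() and call it with the result buffer.
static int compile_and_run(const char *source, char *result_buffer)
{
  TCCState *state= tcc_new();
  tcc_set_output_type(state, TCC_OUTPUT_MEMORY);

  if (tcc_compile_string(state, source) == -1)
  {
    tcc_delete(state);
    return -1;                              /* compile error */
  }

  tcc_relocate(state, TCC_RELOCATE_AUTO);   /* older tcc releases take no second argument */

  typedef int (*entry_point)(char *);
  entry_point foo= (entry_point) tcc_get_symbol(state, "foo");
  int length= (foo != NULL) ? foo(result_buffer) : -1;

  tcc_delete(state);
  return length;
}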

With the right bits of foo… I could allow calling of all sorts of server functions…. such as those to execute SQL inside the current transaction context.

There are a number of reasons why this is Pure Evil(TM):

  • It executes inside the address space of your database server
    one null pointer dereference and your database server is all gone.
  • It’s arbitrary code injection by design
    Exactly how insane are you? Security--;
  • While great for me and my C hacking friends, possibly not for web app developers, who likely aren’t writing their apps in C every day.
  • See the first reason. Is that not reason enough? Memory protection is a good thing yo.

Anyway, you can see the code up on launchpad in the drizzle-libtcc-function branch. You’ll need to modify your tcc source so that the Makefile snippet for libtcc.o looks like this:

# libtcc generation and test
libtcc.o: $(NATIVE_FILES)
        $(CC) -fPIC -o $@ -c libtcc.c $(NATIVE_TARGET) $(CFLAGS)

libtcc.a: libtcc.o
        $(AR) rcs $@ $^

libtcc.so: libtcc.o
        $(CC) -shared -Wl,-soname,libtcc.so.1 -o $@ libtcc.o

Embedded InnoDB: InnoDB Status

With the Embedded InnoDB plugin I’m working on, you can use the INNODB_STATUS table function in the data_dictionary to do pretty neat things.

For example, we can see that each autocommit transaction causes an fsync, and that if you insert multiple rows in a single statement, you still only get one fsync:

drizzle> SELECT * FROM DATA_DICTIONARY.INNODB_STATUS
    -> WHERE name="fsync_req_done";
+----------------+-------+
| NAME           | VALUE |
+----------------+-------+
| fsync_req_done |    25 |
+----------------+-------+
1 row in set (0 sec)

drizzle> insert into t1 values (1);
Query OK, 1 row affected (0.05 sec)

drizzle> SELECT * FROM DATA_DICTIONARY.INNODB_STATUS WHERE name="fsync_req_done";
+----------------+-------+
| NAME           | VALUE |
+----------------+-------+
| fsync_req_done |    26 |
+----------------+-------+
1 row in set (0 sec)

drizzle> insert into t1 values (1),(2),(3),(4);
Query OK, 4 rows affected (0 sec)
Records: 4  Duplicates: 0  Warnings: 0

drizzle> SELECT * FROM DATA_DICTIONARY.INNODB_STATUS WHERE name="fsync_req_done";
+----------------+-------+
| NAME           | VALUE |
+----------------+-------+
| fsync_req_done |    27 |
+----------------+-------+
1 row in set (0 sec)

Embedded InnoDB: querying the configuration

I am rather excited about being able to do awesome things such as this to get the current configuration of your server:

drizzle> SELECT NAME,VALUE 
    -> FROM DATA_DICTIONARY.INNODB_CONFIGURATION
    ->  WHERE NAME IN ("data_file_path", "data_home_dir");
+----------------+-------+
| NAME           | VALUE |
+----------------+-------+
| data_file_path | NULL  | 
| data_home_dir  | ./    | 
+----------------+-------+
2 rows in set (0 sec)

drizzle> SELECT NAME,VALUE 
    -> FROM DATA_DICTIONARY.INNODB_CONFIGURATION 
    -> WHERE NAME = "io_capacity";
+-------------+-------+
| NAME        | VALUE |
+-------------+-------+
| io_capacity | 200   | 
+-------------+-------+
1 row in set (0 sec)

Coming soon: status in a table.

(this is for the upcoming embedded_innodb plugin, which uses the API provided by Embedded InnoDB to implement a Storage Engine for Drizzle)

Thoughts on Thoughts on Drizzle :)

Mark has some good thoughts on drizzle. I think they’re all valid… and have some extra thoughts too:

“I have problems to solve today”. This is (of course) an active concern in my brain… If we don’t have something out that solves some set of problems with reasonable stability and reliability (and soon), then we are failing. I feel we’re getting there, and will have a solid foundation to build upon.

Drizzle replication, MySQL replication: “I can’t compare the two until Drizzle replication is running in production.“. Completely agree. We need to only say replication is stable and reliable when it really is. Realistic test suites are needed. Very defensive programming of the replication system is needed (you want to know when something has gone wrong). We also need to have it constantly be verifying the right thing is going on. We want our problems to be user visible, not silent and invisible. Having high standards will hopefully pay off when people start running it in production….

3 byte int: “Does this mean that some of my tables will grow from 3GB to 4GB on disk?” I think we’re moving the responsibility down to the engines. The 3 byte int type says two things: use less storage space, limit the maximum value. Often you want the former, not the latter. There are many ways to more efficiently pack integers for storage when they are usually smaller than the maximum you want. The protobuf library does a good job of it.
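For instance, a protobuf-style base-128 varint spends one byte per seven bits of magnitude, so small values stay small on disk while still decoding to a normal, aligned 32 or 64-bit integer in memory. A sketch of the encoding side:

#include <stdint.h>
#include <stddef.h>

// Sketch of base-128 varint encoding: 1 byte for 0..127, 2 bytes for
// 128..16383, and so on, up to 10 bytes for a full 64-bit value.
size_t varint_encode(uint64_t value, unsigned char *out)
{
  size_t length= 0;
  do
  {
    unsigned char byte= value & 0x7F;   // low seven bits
    value>>= 7;
    if (value != 0)
      byte|= 0x80;                      // high bit set: more bytes follow
    out[length++]= byte;
  } while (value != 0);
  return length;
}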

I think it is the job of storage engines to do better here. Once you’re in memory, 3 byte numbers are horrible to work with.. copy out of a row buffer, convert into a 32bit number and then do foo. Modern CPUs favor 32 or 64bit alignment of data a *lot*. 3byte numbers do not align to 32 or 64bits very well… making things much slower for the common case of using cached data.

“I need stored procedures. They are required for high-performance OLTP as they minimize transaction duration for multi-statement transactions.” The reduction of network round trips is always crucial. I think a lot of round trips could go away if you could issue multiple statements at once (not via semicolon separating them, by protocol awesomeness).

There should be a way to send a set of statements that should be executed. There should also be a way to specify that if no error occurred, commit. This could then be (in the common case) a single round trip to the database. You then only have to make round-trips when what statement to issue next depends on the result of a previous one. The next step being to reduce these round trips… which can either be solved by executing something inside the database server (e.g. stored procedures) or something closer to the database server so that the round trips aren’t as large. This would be where Gearman enters.

I’m interested to see where these two approaches (issuing in batches and executing closer to the DB server) fall down… I know that latency may not be as good… but throughput should be a lot better.

I take heart with “I have yet to use them in MySQL” though. I have my own theories as to why this is… my biggest thought is that it’s because the many, many programmers writing SQL that Mark sees aren’t SQL Stored Procedure programmers. They spend their days in a couple of languages (maybe Perl, Python, PHP, Java, C, C++) and never programmed SQL:2003 Stored Procedures and it just doesn’t come as quickly (or as bug free) as writing code in the languages you use every day.

“Long Running insert, update and delete statements consume too many resources in InnoDB.” I wonder if this desire for MyISAM could be filled by PBXT or BlitzDB? The main reason that MyISAM is currently a temporary table only engine is that MyISAM and the server core were never that well separated.

My ultimate wish is that all engine authors take the approach of that there is an API to their engine and the Storage Engine is merely glue between the database server and their API.

The BlitzDB engine has this, Innobase partially does (and my Embedded InnoDB work goes the whole way) and MySQL Cluster is likely the oldest example.

As a side note, the BlitzDB plugin should go into the main Drizzle tree fairly soon. One of the joys of having an optional plugin that doesn’t touch the core of the server is that we can do this without much worry at all.

“Does Drizzle build on Windows?” Well… no. Funnily enough though, it is increasingly easy to make a Windows port. All the platform specific things are increasingly just plugins. The build system is a sticker… and no, we’re not going to switch to CMake. The C stands for something, and it’s something that even I may not print here… (I had never thought that being able to open up automake generated Makefiles and look at them would be a feature).

This next Drizzle milestone release should be exciting though…

I look forward to having Drizzle widely deployed and relied upon… I think we’ll do well..

Drizzle Developer Day 2010

Hi one and all!

Interested in database systems? Interested because you use them? Because you manage them? Write SQL that goes to them? Or are you one of the people of questionable sanity like myself who develops them?

Well… do we have the offer for you.

Friday, April 16th. Right after the MySQL Conference and Expo at the Santa Clara Convention Center, you can come along to the Drizzle Developer Day.

You will want to add your name to this wiki page: http://drizzle.org/wiki/Drizzle_Developer_Day_2010_signup

Suggest topics over at:
http://drizzle.org/wiki/Drizzle_Developer_Day_2010

Hope to see you there!

Writing A Storage Engine for Drizzle, Part 2: CREATE TABLE

The DDL code paths for Drizzle are increasingly different from MySQL. For example, the embedded_innodb StorageEngine CREATE TABLE code path is completely different from what it would have to be for MySQL. This is for a number of reasons, the primary one being that Drizzle uses a protobuf message to describe the table format instead of several data structures and an FRM file.

We are pretty close to the table protobuf message format being final (there are a few bits left to clean up, but expect them done Real Soon Now (TM)). You can see the definition (which is pretty simple to follow) in drizzled/message/table.proto. Also check out my series of blog posts on the table message (more posts coming, I promise!).

Drizzle allows either your StorageEngine or the Drizzle kernel to take care of storage of table metadata. You tell the Drizzle kernel that your engine will take care of metadata itself by specifying HTON_HAS_DATA_DICTIONARY to the StorageEngine constructor. If you don’t specify HTON_HAS_DATA_DICTIONARY, the Drizzle kernel stores the serialized Table protobuf message in a “table_name.dfe” file in a directory named after the database. If you have specified that you have a data dictionary, you’ll also have to implement some other methods in your StorageEngine. We’ll cover these in a later post.
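As a rough sketch only (the real constructor signature and flag plumbing live in drizzled/plugin/storage_engine.h; this just shows where the flag goes), an engine that manages its own metadata passes the flag up from its constructor:

#include <string>

// Sketch: with HTON_HAS_DATA_DICTIONARY set, Drizzle will not write
// table_name.dfe files; the engine must answer metadata queries itself.
class MyDictionaryEngine : public drizzled::plugin::StorageEngine
{
public:
  MyDictionaryEngine(const std::string &name_arg)
    : drizzled::plugin::StorageEngine(name_arg, HTON_HAS_DATA_DICTIONARY)
  {}
};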

If you have ever dealt with creating a table in MySQL, you may recognize this method:

virtual int create(const char *name, TABLE *form, HA_CREATE_INFO *info)=0;

This is not how we do things in Drizzle. We now have this function in StorageEngine that you have to implement:

int doCreateTable(Session* session, const char *path,
                  Table& table_obj,
                  drizzled::message::Table& table_message)

The existence of the Table parameter is largely historic and at some point will go away. In the Embedded InnoDB engine, we don’t use the Table parameter at all. Shortly we’ll also get rid of the path parameter, instead having the table schema in the Table message and helper functions to construct path names.

Methods named “doFoo” (such as doCreateTable) mean that there is a method named foo() (such as createTable()) in the base class. It does some base work (such as making sure the table_message is filled out and handling any errors) while the “real” work is done by your StorageEngine in the doCreateTable() method.

The Embedded InnoDB engine goes through the table message and constructs a data structure for the Embedded InnoDB library to create a table. The ARCHIVE storage engine is much simpler, and it pretty much just creates the header of the ARZ file, mostly ignoring the format of the table. The best bet is to look at the code from one of these engines, depending on what type of engine you’re working on. This code, along with the table message definition should be more than enough.
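As a sketch of that first approach (the field accessors follow what the generated protobuf code provides for drizzled/message/table.proto; the add_engine_column() helper is entirely made up), the heart of such a doCreateTable() looks something like:

// Hypothetical stand-in for however your engine adds a column, e.g. for
// Embedded InnoDB this would feed ib_table_schema_add_col().
static void add_engine_column(const char *, const char *, int)
{
}

// Sketch: walk the Table message and hand each column to the engine.
int MyEngine::doCreateTable(Session* session, const char *path,
                            Table& table_obj,
                            drizzled::message::Table& table_message)
{
  (void) session;
  (void) table_obj;   /* unused: everything we need is in the message */

  for (int column= 0; column < table_message.field_size(); column++)
  {
    const drizzled::message::Table::Field &field= table_message.field(column);
    add_engine_column(path, field.name().c_str(), field.type());
  }

  /* ... then create the table via your engine's API, store the message, etc. */
  return 0;
}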

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Continuing the journey

A couple of months ago (December 1st for those playing along at home) it marked five years to the day that I started at MySQL AB (now Sun, now Oracle). A good part of me is really surprised it was for that long and other parts surprised it wasn’t longer. Through MySQL and Sun, I met some pretty amazing people, worked with some really smart ones and formed really solid and awesome friendships. Of course, not everything was perfect (sometimes not even close), but we did have some fun.

Up until November 2008 (that’s 3 years and 11 months for those playing at home) I worked on MySQL Cluster. Still love the product and love how much better we’re making Drizzle so it’ll be the best SQL interface to NDB :)

The ideas behind Drizzle had been talked about for a while… and with my experience with internals of the MySQL server, I thought that some change and dramatic improvement was sorely needed.

Then, in 2008, Brian created a tree. I was soon sending in patches at nights, we announced to the whole world at OSCON and it captured a lot of attention.

Since November 2008 I’ve been working on Drizzle full time. It was absolutely awesome that I had the opportunity to spend all my days hacking on Drizzle – both directly with fantastic people and for fantastic people.

But… the Sun set… which was exciting and sad at the same time.

Never to fear! There were plenty of places wanting Drizzle hackers (and MySQL hackers). For me, it came down to this: “real artists ship”. While there were other places where I would no doubt be happy and work on something really cool, the only way I could end up working out where I should really be was: what is the best way to have Drizzle make a stable release that we’d see be suitable for deployment? So, Where Am I Now?

Rackspace.

Where I’ll again be spending all my time hacking Drizzle.

Bike Riding in the storm

Out on a pier down St Kilda, the weather looked… well… like it could be a bit annoying on the way back:

but then… just a bit down the way…. it hit:

It was “a bit wet”. Big blocks of ice falling from the sky (that hurt).

Anyway, on the way back we found a storm water drain:

Yes, behind Michael is just all water (and I’m not talking about the Bay).

Still managed to get a 36.5km ride out of it, so not all bad.

Writing A Storage Engine for Drizzle, Part 1: Plugin basics

So, you’ve decided to write a Storage Engine for Drizzle. This is excellent news! The API is continually being improved and if you’ve worked on a Storage Engine for MySQL, you’ll notice quite a few differences in some areas.

The first step is to create a skeleton StorageEngine plugin.

You can see my skeleton embedded_innodb StorageEngine plugin in its merge request.

The important steps are:

1. Create the plugin directory

e.g. mkdir plugin/embedded_innodb

2. Create the plugin.ini file describing the plugin

Create the plugin.ini file in the plugin directory (so it’s plugin/plugin_name/plugin.ini).
An example plugin.ini for embedded_innodb is:

[plugin]
title=InnoDB Storage Engine using the Embedded InnoDB library
description=Work in progress engine using libinnodb instead of including it in tree.
sources=embedded_innodb_engine.cc
headers=embedded_innodb_engine.h

This gives us a title and description, along with telling the build system what sources to compile and what headers to make sure to include in any source distribution.

3. Add plugin dependencies

Your plugin may require extra libraries on the system. For example, the embedded_innodb plugin uses the Embedded InnoDB library (libinnodb).

Other examples include the MD5 function requiring either openssl or gnutls, the gearman related plugins requiring gearman libraries, the UUID() function requiring libuuid and BlitzDB requiring Tokyo Cabinet libraries.

For embedded_innodb, pandora-build has a macro for finding libinnodb on the system. We want to run this configure check, so we create a plugin.ac file in the plugin directory (i.e. plugin/plugin_name/plugin.ac) and add the check to it.

For embedded_innodb, the plugin.ac file just contains this one line:

PANDORA_HAVE_LIBINNODB

We also want to add two things to plugin.ini; one to tell the build system only to build our plugin if libinnodb was found and the other to link our plugin with libinnodb. For embedded_innodb, it’s these two lines:

build_conditional="x${ac_cv_libinnodb}" = "xyes"
ldflags=${LTLIBINNODB}

Not too hard at all! This should look relatively familiar to those who have seen autoconf and automake in the past.

Some plugins (such as the md5 function) have a bit more custom auto-foo in plugin.ini and plugin.ac (as one of two libraries can be used). You can do pretty much anything with the plugin system, but you’re a lot more likely to keep it simple like we have here.

4. Add skeleton source code for your StorageEngine

While this will change a little bit over time (and is a little long to just paste into here), you can see what I did for embedded_innodb in the skeleton-embedded-innodb-engine tree.

5. Build!

You will need to re-run ./config/autorun.sh so the build system picks up your new plugin. When you run ./configure --help afterwards, you should see options for building with/without your new plugin.

6. Add a test

You will probably want to add a test to see that your plugin loads successfully. When your plugin is built, the test suite automatically picks up any tests you have in the plugin/plugin_name/tests directory. This is in the same format as general MySQL and Drizzle tests: tests go in a t/ directory, expected results in a r/ directory.

Since we are loading a plugin, we will also need some server options to make sure that plugin is loaded. These are stored in the rather inappropriately named test-master.opt file (that’s the test name with “-master.opt” appended to the end instead of “.test“). For the embedded_innodb plugin_load test, we have a plugin/embedded_innodb/tests/t/plugin_load-master.opt file with the following content:

--plugin_add=embedded_innodb

You can have pretty much anything in the plugin_load.test file… if you’re fancy, you’ll have a SELECT query on data_dictionary.plugins to check that the plugin really is there. Be sure to also add an r/plugin_load.result file (my preferred method is to just create an empty result file, run the test suite, and examine the rejected output before renaming the .reject file to .result).

Once you’ve added your test, you can run it either by just typing “make test” (which will run the whole test suite), or you can go into the main tests/ directory and run ./test-run.pl --suite=plugin_name (which will just run the tests for your plugin).

7. Check the code in, feel good about self

and you’re done. Well… the start of a Storage Engine plugin is done :)

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).