The EXAMPLE storage engine

The Example storage engine is meant to serve mainly as a code example of the stub of a storage engine for example purposes only (or so the code comment at the start of ha_example.cc reads). In reality however, it’s not very useful. It likely was back in 2004 when it could be used as a starting point for starting some simple new engines (my guess would be that more than a few of the simpler engines started from ha_example.cc).

The sad reality is the complexity of the non-obviousness of the bits o the storage engine API you actually care about are documented in ha_ndbcluster.cc, ha_myisam.cc and ha_innodb.cc. If you’re doing something that isn’t already done by one of those three engines: good luck.

Whenever I looked at ha_example.cc I always wished there was something more behind it… basically hoping that InnoDB would get a better and cleaner API with the server and would use that rather than the layering violations it has to do the interesting stuff.

That all being said, as a starting point, it probably helped spawn at least a dozen storage engines.

The ARCHIVE Storage Engine

I wonder how much longer the ARCHIVE storage engine is going to ship with MySQL…. I think I’m the last person to actually fix a bug in it, and that was, well, a good number of years ago now. It was created to solve a simple problem: write once read hardly ever. Useful for logs and the like. A zlib stream of rows in a file.

You can actually easily beat ARCHIVE for INSERT speed with a non-indexed MyISAM table, and with things like TokuDB around you can probably get pretty close to compression while at the same time having these things known as “indexes”.

ARCHIVE for a long time held this niche though and was widely and quietly used (and likely still is). It has the great benefit of being fairly lightweight – it’s only about 2500 lines of code (1130 if you exclude azio.c, the slightly modified gzio.c from zlib).

It also use the table discovery mechanism that NDB uses. If you remove the FRM file for an ARCHIVE table, the ARCHIVE storage engine will extract the copy it keeps to replace it. You can also do consistent backups with ARCHIVE as it’s an append-only engine. The ARCHIVE engine was certainly the simplest example code of this and a few other storage engine API things.

I’d love to see someone compare storage space and performance of ARCHIVE against TokuDB and InnoDB (hint hint, the Internet should solve this for me).

A few notes on InnoDB in MySQL 5.7.1

I’ve started poking around the MySQL 5.7.1 source tree (although just from tarball as I don’t see a BZR tree yet). I thought I’d share a few thoughts:

  • InnoDB temporary tables. Not REDO logged. What does this mean? It’s a huge step in removing the dependency on MEMORY and MyISAM engines for temporary tables used in query execution. With InnoDB temporary tables there is no reason for MEMORY engine to continue to exist, there is absolutely no way in which it is better.
  • InnoDB temp tables aren’t insert buffered
    This probably doesn’t really matter as you’re not going to be doing REDO logging for them (plus things are generally short lived)… but it could be a future area for performance improvement
  • The NO_REDO log mode appears to be implemented fairly neatly.
  • Improvements in innodb read only mode. What does this mean? Maybe we can finally get rid of the oddity of compressed read only MyISAM tables on read only media. (on the other hand, CDs and DVDs aren’t exactly a modern form of software distribution).
  • Some of the source code comments have improved.. it’s getting easier to understand InnoDB. I’d still make the argument that if you need source code comments you’re code isn’t clear enough… but any step is an improvement. (that being said, InnoDB was always easier than the server)
  • There is some pretty heavy refactoring of lock0lock.cc – I really need to sit down and poke at it a bunch.
  • The shared tablespace code (innodb system tablespace) has been heavily refactored. This also introduces tablespaces for temporary tables – and it appears to be implemented in the correct way.

I need to look into things a bunch more, and it’ll be really useful to see a bzr tree to better understand some of the changes.

More to come later, but that’s my quick look.

INNODB_SYS_FIELDS vs INNODB_SYS_COLUMNS

In MySQL 5.6 we have two new INFORMATION_SCHEMA tables for InnoDB that are likely going to cause confusion: INNODB_SYS_FIELDS and INNODB_SYS_COLUMNS. You may think these are likely to just be aliases of each other in order to make your life easier. However…

These are not the same thing. The INNODB_SYS_FIELDS table is all about key columns (fields) of InnoDB indexes, while INNODB_SYS_COLUMNS is about actual columns. This is even more confusing as within the MySQL source code, there is the Field set of objects that manipulate fields (columns) in a row.

Blegh. I’m glad it’s Friday.

Information on Bug#12704861 (which doesn’t exist in any public bug tracker)

Some of you may be aware that MySQL is increasingly using an Oracle-internal bug tracker. You can see these large bug numbers mentioned alongside smaller public bug numbers in recent MySQL release notes. If you’re particularly unlucky, you  just get a big Oracle-internal bug number. For a recently fixed bug, I dug further, posted up on the Percona blog: http://www.mysqlperformanceblog.com/2011/11/20/bug12704861/

Possibly interesting reading for those of you who interested in InnoDB, MySQL, BLOBs and crash recovery.

innodb and memcached

I had a quick look at the source tree (I haven’t compiled it, just read the source – that’s what I do. I challenge any C/C++ compiler to keep up with my brain!) that’s got a tarball up on labs.mysql.com for the memcached interface to innodb. A few quick thoughts:

  • Where’s the Bazaar tree on launchpad? I hate pulling tarballs, following the dev tree is much more interesting from a tech perspective (especially for early development releases). I note that the NDB memcached stuff is up on launchpad now, so yay there. I would love it if the InnoDB team in general was much more open with development, especially with having source trees up on launchpad.
  • It embeds a copy of the memcached server engines branch into the MySQL tree. This is probably the correct way to go. There is no real sense in re-implementing the protocol and network stack (this is about half what memcached is anyway).
  • The copy of the memcached engine branch seems to be a few months old.
  • The current documentation appears to be the source code.
  • The innodb_memcached plugin embeds a memcached server using an API to InnoDB inside the MySQL server process (basically so it can access the same instance of InnoDB as a running MySQL server).
  • There’s a bit of something that kind-of looks similar to the Embedded InnoDB (now HailDB) API being used to link InnoDB and memcached together. I can understand why they didn’t go through the MySQL handler interface… this would be bracing to say the least to get correct. InnoDB APIs, much more likely to have fewer bugs.
  • If this accepted JSON and spat it back out… how fast would MongoDB die? weeks? months?
  • The above dot point would be a lot more likely if adding a column to an InnoDB table didn’t involve epic amounts of IO.
  • I’ve been wanting a good memcached protocol inside Drizzle, we have ,of course, focused on stability of what we do have first. That being said…. upgrade my flight home so I can open a laptop… probably be done well before I land….. (assuming I don’t get to it in the 15 other awesome things I want to hack on this week)

xtrabackup for Drizzle merge request

Follow it over on launchpad.

After having fixed an incredibly odd compiler warning (and with -Werror that we build with, error) on OSX (die die die) – xtrabackup for Drizzle is ready to be merged. This will bring it into our next milestone: freemont. Over the next few weeks you should see some good tests merged in for backup and restore too.

While not final final, I’m thinking that the installed binary name will be drizzlebackup.innobase. A simple naming scheme for various backup tools that are Drizzle specific. This casually pre-empts a drizzlebackup tool that can co-ordinate all of these (like the innobackupex script).

Things I’ve done in Drizzle

When writing my Dropping ACID: Eating Data in a Web 2.0 Cloud World talk for LCA2011 I came to the realisation that I had forgotten a lot of the things I had worked on in MySQL and MySQL Cluster. So, as a bit of a retrospective as part of the Drizzle7 GA release, I thought I might try and write down a (incomplete) list of the various things I’ve worked on in Drizzle.

I noticed I did a lot of code removal, that’s all fine and dandy but maybe I won’t list all of that… except perhaps my first branch that was merged :)

2008

  • First ever branch that was merged: some mysys removal (use POSIX functions instead of wrappers that sometimes have different semantics than their POSIX functions), some removal of NETWARE, build scripts that weren’t helpful (i.e. weren’t what any build team ever used to build a release) and some other dead code removal.
  • Improve ‘make test’ time – transactions FTW! (this would be a theme for me over the years, I always want build and test to be faster)
  • Started moving functions out into their own plugins, removing the difference between UDFs (User Defined Functions) and builtin functions. One API to rule them all.
  • Ahhh compiler warnings (we now build with -Werror) and unchecked return codes from system calls (we now add this to some of our own methods, finding and fixing even more bugs). We did enable a lot of compiler warnings and OMG fix a lot of them.
  • Removal of FRM – use a protobuf message to describe the table. The first branch that implemented any of this was merged mid-November. It was pretty minimal, and we still had the FRM around for a little while yet – but it was the beginning of the end for features that couldn’t be implemented due to limitations in the FRM file format. I wrote a few blog entries about this.
  • A lot of test fixes for Drizzle
  • After relating the story of .test (hint: it broke the ability to check out the source tree on Solaris) to Anthony Baxter, he suggested trying something rather nasty… a unicode character above 2^16 – I chose 𝄢 – which at the time didn’t even render on Ubuntu – this was the test case you could only see correctly on MacOS X at the time (or some less broken Linux distro I guess). Some time later, I was actually able to view the test file on Ubuntu correctly.
  • I think it was November 1st when I started to work on Drizzle full time – this was rather awesome, although a really difficult decision as I did rather enjoy working with all the NDB guys.

2009:

  • January sparked the beginning of reading table information from the table protobuf message instead of the FRM file. The code around the FRM file was lovely and convoluted in places – to this day I’m surprised at the low bug count all that effort resulted in. One day I may write up a list of bugs and exploits probably available through the FRM code.
  • My hate for C++ fstream started in Feb 2009
  • In Feb I finally removed the code to work out (and store) in the FRM file how to display a set of fields for entering data into the table on a VT100 80×24 screen.
  • I filed my first GCC bug. The morning started off with a phone call and Brian asking me to look at some strange bug and ended with Intel processor manuals (mmm… the 387 is fun), the C language standard and (legitimately) finding it was a real compiler bug. Two hours and two minutes after filing the bug there was a patch to GCC fixing it – I was impressed.
  • By the end of Feb, the FRM was gone.
  • I spent a bit of time making Drizzle on linux-sparc work. We started building with compiler options that should be much friendlier to fast execution on SPARC processors (all to do with aligning things to word boundaries). -Wcast-align is an interesting gcc flag
  • Moved DDL commands to be in StorageEngine rather than handler for an instance of an open table (now known as Cursor)
  • MyISAM and CSV as temporary table only engines – this means we save a bunch of mucking about in the server.
  • Amazingly enough, sprintf is dangerous.
  • Moved to an API around what tables exist in a database so that the Storage Engines can own their own metadata.
  • Move to have Storage Engines deal with the protobuf table message directly.
  • Found out you should never assume that your process never calls fork() – if you use libuuid, it may. If it wasn’t for this and if we had param-build, my porting of mtr2 to Drizzle probably would have gone in.
  • There was this thing called pack_flag – its removal was especially painful.
  • Many improvements to the table protobuf message (FRM replacement) code and format – moving towards having a file format that could realistically be produced by code that wasn’t the Drizzle database kernel.
  • Many bug fixes, some acused by us, others exposed by us, others had always been there.

2010:

  • embedded_innodb storage engine (now HailDB). This spurred many bug and API fixes in the Storage Engine interface. This was an education in all the corner cases of the various interfaces, where the obvious way to do things was almost always not fully correct (but mostly worked).
  • Engine and Schema custom KEY=VALUE options.
  • Found a bug in the pthread_mutex implementation of the atomics<> template we had. That was fun tracking down.
  • started writing more test code for the storage engine interface and upper layer. With our (much improved) storage engine interface, this became relatively easy to implement a storage engine to test specific bits of the upper layer.
  • Wrote a CREATE TABLE query that would take over four minutes to run. Fixed the execution time too. This only existed because of a (hidden and undocumented) limitation in the FRM file format to do with ENUM columns.
  • Characters versus bytes is an important difference, and one that not all of the code really appreciated (or dealt with cleanly)
  • FOREIGN KEY information now stored in the table protobuf message (this was never stored in the FRM).
  • SHOW CREATE TABLE now uses a library that reads the table protobuf message (this will later be used in the replication code as well)
  • Started the HailDB project
  • Updated the innobase plugin to be based on the current InnoDB versions.
  • Amazingly, noticed that the READ_COMMITTED isolation level was never tested (even in the simplest way you would ever explain READ_COMMITTED).
  • Created a storage engine for testing called storage_engine_api_tester (or SEAPITester). It’s a dummy engine that checks we’re calling things correctly. It exposed even more bugs and strangeness that we weren’t aware of.
  • Fixed a lot of cases in the code where we were using a large stack frame in a function (greater than 32kb).
  • An initial patch using the internal InnoDB API to store the replication log

2011:

  • This can be detailed later, it’s still in progress :)
  • The big highlight: a release.

Drizzle gets InnoDB 1.0.9

My branch that updates the innobase plugin in Drizzle to be based on innodb_plugin 1.0.9 has been merged. For the next milestone, we’ll probably have 1.0.11 as well.

How’s the progress getting 1.1 and 1.2 in? Pretty good actually. We’ll have it for either this milestone or the next one.

and merging newer innodb into HailDB? It’s going well too, expect more news “soon”.

Second Drizzle Beta (and InnoDB update)

We just released the latest Drizzle tarball (2010-10-11 milestone). There are a whole bunch of bug fixes, but there are two things that are interesting from a storage engine point of view:

  • The Innobase plugin is now based on innodb_plugin 1.0.6
  • The embedded_innodb engine is now named HailDB and requires HailDB, it can no longer be built with embedded_innodb.

Those of you following Drizzle fairly closely have probably noticed that we’ve lagged behind in InnoDB versions. I’m actively working on fixing that – both for the innobase plugin and for the HailDB library.

If building the HailDB plugin (which is planned to replace the innobase plugin), you’ll need the latest HailDB release (which as of writing is 2.3.1). We’re making good additions to the HailDB API to enable the storage engine to have the same features as the Innobase plugin.

What was InnoDB+?

Yes, I said InnoDB+ with a plus sign at the end (also see the first comment here).

Please note that this blog post is only based on public information. It has absolutely nothing in it that I only could have learned from back when I worked at Sun or MySQL AB. Everything has links or pointers to where you can find the information out on the Internet and all thoughts are based on stringing these things together.

There was a lot of talk around the acquisition of Sun Microsystems by Oracle about MySQL (MySQL AB was bought by Sun). Some of the talk centred around Oracle and their ability to make a closed source version of MySQL with added bits that wouldn’t be released as GPL. They’ve since proved that they’re quite willing to do this to an open source project (see OpenSolaris).

Relatively recently, a bunch of history from the old InnoDB SVN trees was imported into the MySQL source tree. You can pull the revision of the SVN tree as of InnoDB Plugin 1.0.6 release by using revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:6263  from the MySQL repository – or just use a branch I’ve put up on launchpad for it (lp:~stewart/haildb/innodb-1.0.6-from-svn).

The first revision from the SVN tree was created on 2005-10-27, which you may remember was not too long after Oracle acquired Innobase on the 7th of October that year. The next two revisions were importing the 5.0 innodb code base, and then the 5.1 code base. Previous history can be found according to this blog post on Transactions on InnoDB.

According to Monty in the comment on the Pythian blog:

Oracle did work on a closed source version of InnoDB, codename InnoDB+, but they never released it, probably because our contract with them stopped them.

and from Eben Moglen’s letter to the EU Commission (via Baron Schwartz’s blog post):

Innobase could therefore have provided an enhanced version of InnoDB, like Oracle’s current InnoDB+, under non-GPL license

Most tellingly is a lot of references in the revision history to “branches/innodb+” as well as this commit:

revno: 0.5.148
revision-id: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6329
parent: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6322
committer: vasil
timestamp: Thu 2009-12-17 11:00:17 +0000
message:
branches/innodb+: change name and version
Change name from “InnoDB Plugin” to “InnoDB+” and
version from 1.0.5 to 1.0.0.

So, from the revision history I’ve managed to work out that it likely was going to have the following features:

  • innodb_change_buffering (for values other than inserts)
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:4061
    Or, more tellingly revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:4053
    The latter tells about the merge of change buffering for delete-mark and delete in addition to the default of inserts.
  • Possibly compressed tables.
    revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2316 seems to show that it may have been copied across: “branches/innodb+: Copy from branches/zip r2315″ in the comment.  There’s a lot of other merges of branches/zip as well
  • Something named FTS
    There is “branches/fts” in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2325 and revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2324  (there’s an import of a red-black tree implementation)
    If you also look at revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776
    you’ll see references to a innofts+ branch with ha_innodb.cc in it.
    So between a red-black tree and handler changes, this is surely something interesting.
  • Persistent statistics (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • Metrics Table (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • posix_fadvise() hints to temp files used in creating indexes (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2342 )
  • Improved recovery performance
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2989
    Talks about using the red-black tree for sorted insertion into the flush_list
  • native linux aio (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3913 )
  • group commit (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3923 )
  • New mutex to protect flush_list (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6330)

and finally, in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6819 you can see the change from “InnoDB+” back to “InnoDB” for being the built in default for MySQL 5.5

Warnings are now actual problems

Yesterday, I reached a happy milestone in HailDB development. All compiler warnings left in the api/ directory (the public interface to the database engine) are now either probable/possible bugs (that we need to look at closely) or are warnings due to unfinished code (that we should finish).

There’s still a bunch of compiler warnings that we’ve inherited (HailDB compiles with lots of warnings enabled) that we have to get through, but a lot will wait until after we update the core to be based on InnoDB 1.1.

HailDB 2.0.0 released!

(Reposted from the HailDB Blog. See also the announcement on the Drizzle Blog.)
We’ve made our first HailDB release! We’ve decided to make this a very conservative release. Fixing some minor bugs, getting a lot of compiler warnings fixed and start to make the name change in the source from Embedded InnoDB to HailDB.

Migrating your software to use HailDB is really simple. In fact, for this release, it shouldn’t take more than 5 minutes.

Highlights of this release:

  • A lot of compiler warnings have been fixed.
  • The build system is now pandora-build.
  • some small bugs have been fixed
  • Header file is now haildb.h instead of innodb.h
  • We display “HailDB” instead of “Embedded InnoDB”
  • Library name is libhaildb instead of libinnodb
  • It is probably binary compatible with the last Embedded InnoDB release, but we don’t have explicit tests for that, so YMMV.

Check out the Launchpad page on 2.0.0 and you can download the tarball either from there or right here:

  • haildb-2.0.0.tar.gz
    MD5:  183b81bfe2303aed435cdc8babf11d2b
    SHA1:  065e6a2f2cb2949efd7b8f3ed664bc1ac655cd75

HailDB, Hudson, compiler warnings and cppcheck

I’ve integrated HailDB into our Hudson setup (haildb-trunk on Hudson). I’ve also made sure that Hudson is tracking the compiler warnings. We’ve enabled more compiler warnings than InnoDB has traditionally been compiled with – this means we’ve started off with over 4,300 compiler warnings! Most of those are not going to be anything remotely harmful – however, we often find that it’s 1 in 1000 that is a real bug. I’ve managed to get it down to about 1,700 at the moment (removing a lot of harmless ones).

I’ve also enabled a cppcheck run on it. Cppcheck is a static analysis tool for C/C++. We’ve also enabled it for Drizzle (see drizzle-build-cppcheck on Hudson). When we enabled it for Drizzle, we immediately found three real bugs! There is also a coding style checker which we’ve also enabled on both projects. So far, cppcheck has not found any real bugs in HailDB, just some style warnings.

So, I encourage you to try cppcheck if you’re writing C/C++.

The rotating blades database benchmark

(and before you ask, yes “rotating blades” comes from “become a fan”)

I’m forming the ideas here first and then we can go and implement it. Feedback is much appreciated.

Two tables.

Table one looks like this:

CREATE TABLE fan_of (
user_id BIGINT,
item_id BIGINT,
PRIMARY KEY (user_id, item_id),
INDEX (item_id)
);

That is, two columns, both 64bit integers. The primary key covers both columns (a user cannot be a fan of something more than once) and can be used to look up all things the user is a fan of. There is also an index over item_id so that you can find out which users are a fan of an item.

The second table looks like this:

CREATE TABLE fan_count (
item_id BIGINT PRIMARY KEY,
fans BIGINT
);

Both tables start empty.

You will have 1000, 2000,4000 and 8000 concurrent clients attempting to run the queries. These concurrent clients must behave as if they could be coming from a web server. The spirit of the benchmark is to have 8000 threads (or processes) talk to the database server independent of each other.

The following set of queries will be run a total of 23,000,000 (twenty three million) times. The my_user_id below is an incrementing ID per connection allocated by partitioning 23,000,000 evenly between all the concurrent clients (e.g. for 1000 connections each connection gets 23,000 sequential ids)

You must run the following queries.

  • How many fans are there of item 12345678 (e.g. SELECT fans FROM fan_count WHERE item_id=12345678)
  • Is my_user_id already a fan of item 12345678 (e.g. SELECT user_id FROM fan_of WHERE user_id=my_user_id AND item_id=12345678)
  • The next two queries MUST be in the same transaction:
    • my_user_id becomes a fan of item 12345678 (e.g. INSERT INTO fans (user_id,item_id) values (my_user_id, 12345678))
    • increment count of fans (e.g. UPDATE fan_count SET fans=fans+1 WHERE item_id=12345678)

For the first query you are allowed to use a caching layer (such as memcached) but the expiry time must be 5 seconds or less.

You do not have to use SQL. You must however obey the transaction boundary above. The insert and the update must be part of the same transaction.

Results should include: min, avg, max response time for each query as well as the total time to execute the benchmark.

Data must be durable to a machine being switched off and must still be available with that machine switched off. If committing to local disk, you must also replicate to another machine. If running asynchronous replication, the clock does not stop until all changes have been applied on the slave. If doing asynchronous replication, you must also record the replication delay throughout the entire test.

In the event of timeout or deadlock in doing the insert and update part, you must go back to the first query (how many fans) and retry. Having to retry does not count towards the 23,000,000 runs.

At the end of the benchmark, the query SELECT fans FROM fan_count WHERE item_id=12345678 should return 23,000,000.

Yes, this is a very evil benchmark. It seems to be a bit indicative about the kind of peak load that can be experienced by a bunch of Web 2.0 sites that have a “like” or “become a fan” style buttons. I fully expect the following:

  • Pretty much all systems will nosedive in performance after 1000 concurrent clients
  • Transaction rollbacks due to deadlock detection or lock wait timeouts will be a lot.
  • Many existing systems and setups not complete it in reasonable time.
  • A solution using Scale Stack to be an early winner (backed by MySQL or Drizzle)
  • Somebody influenced by Domas turning InnoDB deadlock detection off very quickly.
  • Somebody to call this benchmark “stupid” (that person will have a system that fails dismally at this benchmark)
  • Somebody who actually has any knowledge of modern large scale web apps to suggest improvements
  • Nobody even attempting to benchmark the Oracle database
  • Somebody submitting results with MySQL to not wait until the replication stream has finished applying.
  • Some NoSQL systems to suck considerably more than their SQL counterparts.

Storage Engine API: write_row, CREATE SELECT and DDL

(this probably applies exactly the same for MySQL and Drizzle… but I’m just speaking about current Drizzle here)

In my current merge request for the embedded-innodb-create-select-transaction-arrgh branch (also see this specific revision), you’ll notice an odd hoop that we have to jump through to make CREATE SELECT statements work with an engine such as InnoDB.

Basically, this is what happens:

  • start transaction
  • start executing SELECT QUERY (well, prepare executing it and fetch a row)
  • create table
  • attempt to insert into table

But… we have to do the DDL statement (i.e. the CREATE TABLE) in its own transaction. This means that the outer transaction (running the SELECT) shouldn’t be able to see it. Except it does. We can create a cursor on this table. However, when we try and do something with it (e.g. ib_cursor_first()) we then get the error message DB_MISSING_HISTORY from InnoDB. With a data dictionary that was REPEATABLE READ, we shouldn’t have this problem. However, we don’t have that.

So? What do we do? If we’re in ::write_row and we get an error and we’re running a SQLCOM_CREATE_TABLE sql_command (yes, we get to poke into current_session->lex->sql_command to find this out) we just magically restart the transaction so that we can (properly) see the created table and write rows to it.

This is not a sane part of the interface; it won’t be an issue for many engines but it is needed here.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Announcing HailDB

I just announced our continuation of the Embedded InnoDB project under the name of HailDB. Check out the announcement over at http://www.haildb.com/.

HailDB is a relational database that is embeddable within applications. You embed HailDB by linking to a shared library and calling a clean and simple API. HailDB is a continuation of the Embedded InnoDB project. It is not itself a database server, but is a library implementing the storage layer. With the addition of the HailDB plugin to Drizzle you get a full SQL interface.

Read more at http://www.haildb.com

Embedded InnoDB is in the tree!

Well… the start of it :)

I’ve taken the approach of taking tiny incremental steps (and getting review for each step) in implementing a Storage Engine based on the Embedded InnoDB library. What hit lp:drizzle (the trunk branch, for the 2010-04-07 milestone tarball) is only a handful of these small steps, so this engine is not remotely ready for end users.

There should be more of my Embedded InnoDB work hitting the tree in the upcoming days/weeks, enough to get it to a satte that one could describe as functional :)