Storage Engine API state graph

Drizzle still has a number of quirks inherited from the MySQL Storage Engine API (e.g. BLOBs, row buffer, CREATE SELECT and lack of DDL transaction boundaries, key tuple format). One of the things we fixed a long time ago was to give StorageEngines a proper set of methods to be called: startTransaction, startStatement, endStatement, commit and rollback.

If you’ve had to implement a transactional storage engine in MySQL you will be well aware of the pattern of “in every Storage Engine/handler call: if transaction doesn’t exist, begin.” We’ve tried to fix this in the Drizzle API for a number of reasons. I think having this obvious set of calls will make the API a lot easier to understand. I am also very interested in making things much easier to prove correct.
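
The old pattern looks roughly like this (an illustrative sketch with made-up names, not code from any actual engine):

struct Session { bool trx_active; };

static void trx_begin(Session *session)
{
  session->trx_active= true;
}

/* Every Storage Engine entry point had to lazily begin the transaction
   itself – the “if transaction doesn’t exist, begin” dance. */
static int engine_write_row(Session *session, const unsigned char *buf)
{
  (void) buf;

  if (! session->trx_active)   /* if transaction doesn’t exist... */
    trx_begin(session);        /* ...begin */

  /* ... the actual row insert ... */
  return 0;
}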

A while ago I spotted Bug 587772, which was the READ COMMITTED isolation level not working correctly with InnoDB. It turns out that the most basic example for READ COMMITTED failed. Hrrm… this is no good. It worked on MySQL, so this was certainly something that we broke. What was more worrying is that there wasn’t a test for this in the test suite (and at the time I couldn’t find one in the MySQL test suite either, so I think we inherited the missing test).

I recently started delving in to actually solve this. I noticed something worrying: endStatement wasn’t being called, which is where the innobase plugin would release the read view that it used for the statement. You’d think that it would grab a new one on startStatement, but because of the previous design of the API (remember “if txn isn’t started, start it!”), grabbing the read view for the statement also happened as part of that lazy start… so we instead got the REPEATABLE READ isolation level.

I wanted a test.

Previously, I’ve created a dummy storage engine (tableprototester) and used it to test the server code for reading the table protobuf message. I thought about writing a Storage Engine for this problem too: basically treating the calls to the Storage Engine as transitions between states in a state machine.

A basic view of a transaction could be:

State transitions for a transaction: a transaction can be empty or have one or more statements.

That is, a transaction starts and has zero or more statements before it commits or gets rolled back.

By coding up a data structure of allowable state transitions, a small function to assert() on invalid transitions and enough of the boilerplate to make the engine “work”, I was able to hit an assert() exactly where I’d expected it: at an invalid transition from START STATEMENT to COMMIT.
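
The shape of that data structure and assert() is roughly this (a minimal sketch – the state names and exact transition set here are illustrative, not copied from the actual plugin):

#include <assert.h>

enum state { NONE, IN_TRANSACTION, IN_STATEMENT };

struct transition { state from; state to; };

static const transition allowed[]= {
  { NONE,           IN_TRANSACTION },  /* startTransaction */
  { IN_TRANSACTION, IN_STATEMENT   },  /* startStatement   */
  { IN_STATEMENT,   IN_TRANSACTION },  /* endStatement     */
  { IN_TRANSACTION, NONE           },  /* commit, rollback */
};

static state current= NONE;

static void enter(state next)
{
  for (unsigned int i= 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
  {
    if (allowed[i].from == current && allowed[i].to == next)
    {
      current= next;
      return;
    }
  }
  assert(! "invalid Storage Engine API state transition");
}

Each of the engine’s startTransaction/startStatement/endStatement/commit/rollback methods then simply calls enter() with the appropriate target state.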

To fix the initial bug (READ COMMITTED not working), I filled in a few state transitions for the system as a whole that aren’t quite correct. From the diagram below, you can see where the obvious bugs are (it helps that I’ve coloured them red):

There is absolutely no sense in going BEGIN -> END STATEMENT or immediately to COMMIT. These should be relatively easy to solve too, but are separate bugs.

I wish to expand this in the future to cover Cursor as well. It will also be useful to ensure that DDL can be wrapped in transactions. Not to mention the last few HTON flags that exist (and should likely go away).

To generate the diagrams, I just wrote a little utility to dump out the state transitions in dot, and ran that through graphviz.
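
The dumper can be as simple as this (continuing the sketch above, reusing its allowed table and adding names for the states):

#include <stdio.h>

static const char *state_name[]= { "NONE", "IN_TRANSACTION", "IN_STATEMENT" };

static void dump_dot(void)
{
  printf("digraph transitions {\n");
  for (unsigned int i= 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
    printf("  %s -> %s;\n",
           state_name[allowed[i].from],
           state_name[allowed[i].to]);
  printf("}\n");
}

Feed the output to dot -Tpng and out come the diagrams.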

HailDB being built by default in Drizzle

It just hit trunk – if you have HailDB installed when you build Drizzle, you will now get the HailDB plugin built. You can even run Drizzle with it (remove the innobase plugin, load the HailDB plugin). Previously, we had problems building both due to symbol conflicts between innobase and HailDB. We’ve fixed this with help from the linker.

So, enjoy HailDB… well, test it and report bugs that I can fix :)

New APIs in HailDB

In the current HailDB we have a number of new API calls that you may like:

  • ib_status_get_all()
    Is very similar to ib_cfg_get_all(). This allows the library to add new status variables without applications having to know about them in advance, because the library returns the list of variables that exist. For Drizzle, this means that the DATA_DICTIONARY.HAILDB_STATUS table will automatically include any new status variables we add to HailDB, without a single extra line of code having to be written.
  • ib_set_panic_handler()
    Having a shared library call exit() is generally considered impolite. Previously, if HailDB hit corruption (or some other nasty condition), it could call exit() and you’d never get a chance to display a sensible error message to your user (especially bad in a GUI app, where an error message printed to the console would go unseen). This call allows an application to specify a callback for when HailDB enters such a condition. We’ll still be unable to continue (and we strongly advise that you do in fact exit the process in your callback), but you’re at least now able to (for example) pop up a dialog box saying sorry.
  • ib_trx_set_client_data()
    This call lets you associate a void* with a transaction. HailDB keeps this pointer in its transaction data structure and in some callbacks (e.g. ib_set_trx_is_interrupted_handler(), see below) will pass this pointer back to you for you to use to help make a decision. In InnoDB in MySQL, this is the THD. In Drizzle, it’s the Session.
  • ib_set_trx_is_interrupted_handler()
    In various wait conditions (e.g. waiting for a row lock), HailDB will call the callback you set with this function with the client data (set with ib_trx_set_client_data()) to work out if the transaction has been cancelled. This enables an application to implement something like the MySQL/Drizzle KILL command to cancel a transaction in another thread.
  • ib_get_duplicate_key()
    If you just got a duplicate key error, this function will tell you what key it was. This allows you to implement a nicer error message.
  • ib_get_table_statistics()
    This function gives you access to some basic table statistics that HailDB maintains. These include an approximate row count, the clustered index size, the total size of secondary indexes, as well as a “modified counter” which can give you a rough idea of how out of date these statistics are.

All of these are new to HailDB (and weren’t available in embedded_innodb), many in the new 2.3 development release. You can see usage examples both in the HailDB test suite and (for most of them) in the Drizzle HailDB Storage Engine.
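
As a rough illustration of how a few of these fit together (the callback signatures below are assumptions based on the descriptions above, not the actual haildb.h typedefs – check the header before copying this):

#include <haildb.h>
#include <stdio.h>
#include <stdlib.h>

/* A stand-in for whatever your application tracks per connection;
   in Drizzle this role is played by the Session. */
struct my_session
{
  int killed;
};

/* Called instead of exit() if HailDB hits corruption: we get to tell
   the user something sensible first, but should still exit. */
static void panic_handler(int error, const char *message)
{
  fprintf(stderr, "HailDB panic (%d): %s\n", error, message);
  abort();
}

/* Called during waits (e.g. on a row lock) with the pointer we set via
   ib_trx_set_client_data(), so we can implement something like KILL. */
static int trx_is_interrupted(void *client_data)
{
  return ((struct my_session *) client_data)->killed;
}

static void example_setup(struct my_session *session)
{
  ib_set_panic_handler(panic_handler);
  ib_set_trx_is_interrupted_handler(trx_is_interrupted);

  /* associate our session with a new transaction */
  ib_trx_t trx= ib_trx_begin(IB_TRX_REPEATABLE_READ);
  ib_trx_set_client_data(trx, session);
}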

Second Drizzle Beta (and InnoDB update)

We just released the latest Drizzle tarball (2010-10-11 milestone). There are a whole bunch of bug fixes, but there are two things that are interesting from a storage engine point of view:

  • The Innobase plugin is now based on innodb_plugin 1.0.6
  • The embedded_innodb engine is now named HailDB and requires HailDB; it can no longer be built against Embedded InnoDB.

Those of you following Drizzle fairly closely have probably noticed that we’ve lagged behind in InnoDB versions. I’m actively working on fixing that – both for the innobase plugin and for the HailDB library.

If building the HailDB plugin (which is planned to replace the innobase plugin), you’ll need the latest HailDB release (which as of writing is 2.3.1). We’re making good additions to the HailDB API to enable the storage engine to have the same features as the Innobase plugin.

Drizzle7 Beta!

Just in case you missed it, I’m rather thrilled that our latest tarball of Drizzle is named Beta. Specifically, we’re calling it Drizzle7. Seven is a very nice number, and it seems rather appropriate.

This release is for a stand-alone database server. A lot of the infrastructure for replication is there (with testing), but the big thing we want to hammer on and get perfect here is Drizzle7 as a stand-alone database server.

Can I trust it? If you trust InnoDB to store your data, then yes, you can trust Drizzle (it uses InnoDB too).

What was InnoDB+?

Yes, I said InnoDB+ with a plus sign at the end (also see the first comment here).

Please note that this blog post is only based on public information. It has absolutely nothing in it that I only could have learned from back when I worked at Sun or MySQL AB. Everything has links or pointers to where you can find the information out on the Internet and all thoughts are based on stringing these things together.

There was a lot of talk around the acquisition of Sun Microsystems by Oracle about MySQL (MySQL AB was bought by Sun). Some of the talk centred around Oracle and their ability to make a closed source version of MySQL with added bits that wouldn’t be released as GPL. They’ve since proved that they’re quite willing to do this to an open source project (see OpenSolaris).

Relatively recently, a bunch of history from the old InnoDB SVN trees was imported into the MySQL source tree. You can pull the revision of the SVN tree as of InnoDB Plugin 1.0.6 release by using revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:6263  from the MySQL repository – or just use a branch I’ve put up on launchpad for it (lp:~stewart/haildb/innodb-1.0.6-from-svn).

The first revision from the SVN tree was created on 2005-10-27, which you may remember was not too long after Oracle acquired Innobase on the 7th of October that year. The next two revisions were importing the 5.0 innodb code base, and then the 5.1 code base. Earlier history can be found as described in this blog post on Transactions on InnoDB.

According to Monty in the comment on the Pythian blog:

Oracle did work on a closed source version of InnoDB, codename InnoDB+, but they never released it, probably because our contract with them stopped them.

and from Eben Moglen’s letter to the EU Commission (via Baron Schwartz’s blog post):

Innobase could therefore have provided an enhanced version of InnoDB, like Oracle’s current InnoDB+, under non-GPL license

Most telling is the large number of references in the revision history to “branches/innodb+”, as well as this commit:

revno: 0.5.148
revision-id: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6329
parent: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6322
committer: vasil
timestamp: Thu 2009-12-17 11:00:17 +0000
message:
branches/innodb+: change name and version
Change name from “InnoDB Plugin” to “InnoDB+” and
version from 1.0.5 to 1.0.0.

So, from the revision history I’ve managed to work out that it likely was going to have the following features:

  • innodb_change_buffering (for values other than inserts)
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/zip:4061
    Or, more tellingly revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:4053
    The latter tells about the merge of change buffering for delete-mark and delete in addition to the default of inserts.
  • Possibly compressed tables.
    revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2316 seems to show that it may have been copied across: “branches/innodb+: Copy from branches/zip r2315” in the comment.  There’s a lot of other merges of branches/zip as well
  • Something named FTS
    There is “branches/fts” in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2325 and revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2324  (there’s an import of a red-black tree implementation)
    If you also look at revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776
    you’ll see references to a innofts+ branch with ha_innodb.cc in it.
    So between a red-black tree and handler changes, this is surely something interesting.
  • Persistent statistics (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • Metrics Table (also revid: svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6776)
  • posix_fadvise() hints to temp files used in creating indexes (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2342 )
  • Improved recovery performance
    See revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:2989
    Talks about using the red-black tree for sorted insertion into the flush_list
  • native linux aio (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3913 )
  • group commit (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:3923 )
  • New mutex to protect flush_list (revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6330)

and finally, in revid:svn-v4:16c675df-0fcb-4bc9-8058-dcc011a37293:branches/innodb%2B:6819 you can see the change from “InnoDB+” back to “InnoDB”, in preparation for being the built-in default in MySQL 5.5.

LCA Miniconf Call for Papers: Data Storage: Databases, Filesystems, Cloud Storage, SQL and NoSQL

This miniconf aims to cover many of the current methods of data storage and retrieval and attempt to bring order to the universe. We’re aiming to cover what various systems do, what the latest developments are and what you should use for various applications.

We aim for talks both from developers of the software in question and from developers using it.

Aiming for some combination of: PostgreSQL, Drizzle, MySQL, XFS, ext[34], Swift (open source cloud storage, part of OpenStack), memcached, TokyoCabinet, TDB/CTDB, CouchDB, MongoDB, Cassandra, HBase….. and more!

Call for Papers open NOW (until 22nd October).

Warnings are now actual problems

Yesterday, I reached a happy milestone in HailDB development. All compiler warnings left in the api/ directory (the public interface to the database engine) are now either probable/possible bugs (that we need to look at closely) or are warnings due to unfinished code (that we should finish).

There’s still a bunch of compiler warnings that we’ve inherited (HailDB compiles with lots of warnings enabled) that we have to get through, but a lot will wait until after we update the core to be based on InnoDB 1.1.

SHOW CREATE TABLE using table protobuf message

… and really testing the replication code path for CREATE TABLE.

So, for a very long time now, Drizzle has been using a protobuf-based structure to describe the structure of tables. The idea was to be able to have engines rather painlessly generate this structure themselves (which several now do). A secondary idea was to use this structure itself for CREATE TABLE (in progress; embedded_innodb does in fact only use the table message for its CREATE TABLE codepath). The third idea was to just put the table protobuf message into the replication stream instead of the CREATE TABLE statement (i.e. a SQL string). This means that you could (easily) write a replicator to a DBMS with different SQL syntax, or to a system that doesn’t speak SQL at all.

The final step, to reduce duplicated functionality, would be to have the code that does SHOW CREATE TABLE use a common bit of code for turning a table protobuf message back into a SQL string.

We now have that.

Just now, my branch to replace the old SHOW CREATE TABLE code (that was using TableShare and friends) with the statement_transform code (that we also use in converting a replication log to SQL commands) hit trunk.
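
The idea boils down to something like this self-contained sketch (the real code uses the protobuf-generated message::Table and the statement_transform functions; the type and function below are simplified, hypothetical stand-ins):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

/* A drastically simplified stand-in for the table protobuf message. */
struct TableMessage
{
  struct Field { std::string name; std::string sql_type; };
  std::string name;
  std::string engine;
  std::vector<Field> fields;
};

/* One transform, shared by SHOW CREATE TABLE and the code that turns a
   replication log back into SQL commands. */
static std::string table_message_to_sql(const TableMessage &t)
{
  std::string sql= "CREATE TABLE " + t.name + " (";
  for (std::size_t i= 0; i < t.fields.size(); i++)
  {
    if (i != 0)
      sql += ", ";
    sql += t.fields[i].name + " " + t.fields[i].sql_type;
  }
  sql += ") ENGINE=" + t.engine;
  return sql;
}

int main()
{
  TableMessage t;
  t.name= "t1";
  t.engine= "InnoDB";
  TableMessage::Field f= { "a", "INT" };
  t.fields.push_back(f);

  std::cout << table_message_to_sql(t) << std::endl;
  return 0;
}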

Yay.

Storage Engine independent test suite

This is something that has never existed in the MySQL world: nothing to help you start developing your engine.

Sure, you could start running the whole test suite against your engine… but wherever a test wasn’t specifically about MyISAM, you’d certainly hit things that assumed it – even something as simple as having to change the result file so that SHOW CREATE TABLE masked out the ENGINE= part.

Also, if you were just starting out and were trying to incrementally write your engine, instead of just hacking away for 6 months on “everything” and then testing “everything” and hoping that anything at all worked – you were out of luck.

So when working on the embedded_innodb engine for Drizzle I wanted to finally fix this problem. I was not going to fix it perfectly, or completely. What I wanted was a set of simple tests, that were very short and that tested distinct bits of the engine.

So… if you look in the Drizzle tree, in plugin/embedded_innodb/tests/ you’ll find a bunch of test cases beginning with basic_ (they’re also now in plugin/pbxt/tests). These check very basic functionality and should be useful when starting to write your own storage engine.

They should also port easily to MySQL (possibly without changes), but I haven’t tried it. Porting to PBXT was simple: a regex to replace InnoDB with PBXT.

Enjoy.

embedded_innodb engine (HailDB Engine)

A whole bunch of stuff got merged into the embedded_innodb engine recently. The end game is to have the embedded_innodb engine link only to HailDB (where we’re continuing maintenance of Embedded InnoDB).

So, what made it in? A lot of bug fixes (especially around error reporting and tables without an explicit primary key), SELECT FOR UPDATE, support for temporary tables and support for prefix indexes.

We also now, if it’s available, link to HailDB instead of Embedded InnoDB. At some point “soon” we shall just require HailDB.

HailDB 2.0.0 released!

(Reposted from the HailDB Blog. See also the announcement on the Drizzle Blog.)
We’ve made our first HailDB release! We’ve decided to make this a very conservative release: fixing some minor bugs, getting a lot of compiler warnings fixed, and starting to make the name change in the source from Embedded InnoDB to HailDB.

Migrating your software to use HailDB is really simple. In fact, for this release, it shouldn’t take more than 5 minutes.

Highlights of this release:

  • A lot of compiler warnings have been fixed.
  • The build system is now pandora-build.
  • Some small bugs have been fixed.
  • The header file is now haildb.h instead of innodb.h.
  • We display “HailDB” instead of “Embedded InnoDB”.
  • The library name is libhaildb instead of libinnodb.
  • It is probably binary compatible with the last Embedded InnoDB release, but we don’t have explicit tests for that, so YMMV.
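
Given the renames above, porting an application really is mostly mechanical. A minimal sketch (error handling omitted; the ib_* calls are the same API that Embedded InnoDB exposed):

/* Before: #include <innodb.h> and link with -linnodb */
/* After:  the header and the library are renamed     */
#include <haildb.h>   /* link with -lhaildb */

int main(void)
{
  ib_init();                       /* same calls as Embedded InnoDB */
  ib_startup("barracuda");

  /* ... create tables, run transactions ... */

  ib_shutdown(IB_SHUTDOWN_NORMAL);
  return 0;
}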

Check out the Launchpad page on 2.0.0 and you can download the tarball either from there or right here:

  • haildb-2.0.0.tar.gz
    MD5:  183b81bfe2303aed435cdc8babf11d2b
    SHA1:  065e6a2f2cb2949efd7b8f3ed664bc1ac655cd75

A tale of a bug…

So I sometimes get asked if we funnel bug reports or patches back to MySQL from Drizzle. MariaDB adds some interest here, as it is a lot closer to (and indeed compatible with) MySQL. With Drizzle, we have deviated really quite heavily from the MySQL codebase. There are still some common areas, but they’re getting rarer (especially ones where you could just directly apply a patch).

Back in June 2009, while working on Drizzle at Sun, I found a bug that I knew would affect both. The patch would even directly apply (well… close, but I made one anyway).

So the typical process of me filing a MySQL bug these days is:

  • Stewart files bug
  • In the next window of Sveta being awake, it’s verified.

This happened within a really short time.

Unfortunately, what happens next isn’t nearly as awesome.

Namely, nothing. For a year.

So a year later, I filed it in launchpad for MariaDB.

So: MariaDB was gearing up for a release, and this was a relatively low-priority bug (but one with a working, correct and obvious patch). Within two months, Monty had applied it and improved the error checking around it.

So MariaDB bug 588599 is Fix Committed (June 2nd 2010 – July 20th 2010), MySQL Bug 45377 is still Verified (July 20th 2009 – ….).

(and yes, this tends to be a general pattern I find)

But Mark says he gets things through… so yay for him.

linux.conf.au 2011 CFP Open!

Head on over to http://lca2011.linux.org.au/ and check it out!

You’ve got until August 7th to put in a paper, miniconf, poster or tutorial.

Things I’d like to see come from my kinda world:

  • topics on running large numbers of machines
  • latest in large scale web infrastructure
  • latest going on in the IO space (SSD, filesystems, SSD as L2 cache)
  • Applications of above technologies and what it means for application performance
  • Scalable and massive tcp daemons (i.e. Eric should come talk on scalestack)
  • exploration of pain points in current technologies and discussion on ways to fix them (from people really in the know)
  • A Hydra tutorial: starting with stock Ubuntu lucid, and exiting the tutorial with some analysis running on my project.
  • Something that completely takes me off guard and is awesome.

I’d love to see people from the MySQL, Drizzle and Rackspace worlds have a decent presence. For those who’ve never heard of/been to an LCA before: we reject at least another whole conference worth of papers. It’s the conference on the calendar that everything else moves around.

PBMS in Drizzle

Some of you may have noticed that blob streaming has been merged into the main Drizzle tree recently. There are a few hooks inside the Drizzle kernel that PBMS uses; everything else is just in the plugin.

For those not familiar with PBMS, it does two things: provide a place (not in the table) for BLOBs to be stored (locally on disk or even out to S3), and provide an HTTP interface to get and store BLOBs.

This means you can do really neat things such as have your BLOBs replicated, consistent and all those nice databasey things as well as easily access them in a scalable way (everybody knows how to cache HTTP).

This is a great addition to the AlsoSQL arsenal of Drizzle. I’m looking forward to it advancing and being adopted (much easier now that it’s in the main repository).

Drizzle @ Velocity (seemed to go well)

Monty’s talk at Velocity 2010 seemed to go down really well (at least from reading the agile admin entry on Drizzle). There are a few great bits from this article that just made me laugh:

“Oracle’s ‘run Java within the database’ is an example of totally retarded functionality whose main job is to ruin your life.”

Love it that we’re managing to get the message out.

ENUM now works properly (in Drizzle)

Over at the Drizzle blog, the recent 2010-06-07 tarball was announced. This tarball release has my fixes for the ENUM type, so that it now works as it should. I was quite amazed that such a small block of code could have so many bugs! One of the most interesting was the documented limit we inherited from MySQL (see the MySQL Docs on ENUM) of a maximum of 65,535 elements for an ENUM column.

This all started out from a quite innocent comment of Jay’s in a code review for adding support for the ENUM data type to the embedded_innodb engine: he suggested that I should use a constant instead of the magic 0x10000 number as a limit on an assert for sanity of values getting passed to the engine. Seeing as there wasn’t already a constant in the code for that (surprise number 1), I said I’d fix it properly in a separate patch (creating a bug for it so it wouldn’t get lost) and the code went in.

So, now, a few weeks after that, I got around to dealing with that bug (because hey, this was going to be an easy fix that’ll give me a nice sense of accomplishment). A quick look in the Field_enum code raised my suspicions of bugs… I initially wondered if we’d get any error message if a StorageEngine returned a table definition that had too many ENUM elements (for example, 70,000). So, I added a table to the tableprototester plugin (a simple dummy engine that is loaded for testing the parsing of specially constructed table messages) that had 70,000 elements for a single ENUM column. It didn’t throw an error. Darn. It did, however, have an incredibly large result for SHOW CREATE TABLE.

Often with bugs like this I may try to see if the problem is something inherited from MySQL. I’ll often file a bug with MySQL as well if that’s the case. If I can, I’ll sometimes attach the associated patch from Drizzle that fixes the bug, sometimes with a patch directly for and tested on MySQL (if it’s not going to take me too long). If these patches are ever applied is a whole other thing – and sometimes you get things like “each engine is meant to have auto_increment behave differently!” – which doesn’t inspire confidence.

But anyway, the real MySQL limit is somewhere between 10,850 and 10,900 elements. This is not at all what’s documented. I’ve filed the appropriate bug (Bug #54194) with a reproducible test case and the bit of problematic code. It turns out that this is (yet another) limit of the FRM file: the whole FRM has to fit in about 64KB. The bit of code in MySQL that was doing the checking for the ENUM limit was this:


/* Hack to avoid bugs with small static rows in MySQL */
reclength= max(file->min_record_length(table_options), reclength);
/* The whole FRM header has to fit in 64KB. int_length is the combined
   length of all ENUM/SET element names, so the real ENUM element limit
   depends on how long your element names are. */
if (info_length + (ulong) create_fields.elements*FCOMP + 288 +
    n_length + int_length + com_length > 65535L || int_count > 255)
{
  my_message(ER_TOO_MANY_FIELDS, ER(ER_TOO_MANY_FIELDS), MYF(0));
  DBUG_RETURN(1);
}

So it’s no surprise how this specific limit (the number of elements in an ENUM) got missed when I converted Drizzle from using an FRM over to a protobuf-based structure.

So a bunch of other cleanup later, a whole lot of extra testing and I can pretty confidently state that the ENUM type in Drizzle does work exactly how you think it would.

Either way, if you’re getting anywhere near 10,000 choices for an ENUM column you have no doubt already lost.

New CREATE TABLE performance record!

4 min 20 sec

So next time somebody complains about NDB taking a long time in CREATE TABLE, you’re welcome to point them to this :)

  • A single CREATE TABLE statement.
  • It had ONE column.
  • It was an ENUM column.
  • With 70,000 possible values.
  • It was 605kb of SQL.
  • It ran on Drizzle.

This was to test if you could create an ENUM column with greater than 2^16 possible values (you’re not supposed to be able to) – bug 589031 has been filed.

How does it compare to MySQL? Well… there are other problems (I filed Bug 54194 – “ENUM limit of 65535 elements isn’t true”). Since we don’t have any limitations in Drizzle due to the FRM file format, we actually get to execute the CREATE TABLE statement.

Still, why did this take over four minutes? I luckily managed to run poor man’s profiler during query execution, and very easily found out that a thread was constantly running check_duplicates_in_interval(), which does a stupid linear search for duplicates. It turns out that, for 70,000 items, this takes approximately four minutes and 19.5 seconds. I filed Bug 589055, “CREATE TABLE with ENUM fields with large elements takes forever” (where forever is defined as a bit over four minutes).

So I replaced check_duplicates_in_interval() with an implementation using a hash table (boost::unordered_set, actually), as I wasn’t quite immediately in the mood for ripping all of TYPELIB out of the server. I can now run the CREATE TABLE statement in less than half a second.
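
The replacement boils down to something like this (a simplified sketch of the approach rather than the exact patch – the real code also has to compare names according to the column’s collation):

#include <boost/unordered_set.hpp>
#include <string>

/* O(n) duplicate detection over the ENUM element names using a hash
   set, instead of the O(n^2) pairwise scan that
   check_duplicates_in_interval() performed. */
static bool has_duplicates(const char **names, unsigned int count)
{
  boost::unordered_set<std::string> seen;

  for (unsigned int i= 0; i < count; i++)
  {
    /* insert().second is false if the name was already in the set */
    if (! seen.insert(std::string(names[i])).second)
      return true;
  }
  return false;
}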

So now, I can run my test case in much less time and indeed check for correct behaviour rather quickly.

I do have an urge to find out how big I can get a valid table definition file though… it should be over 32MB…

BLOBS in the Drizzle/MySQL Storage Engine API

Another (AFAIK) undocumented part of the Storage Engine API:

We all know what a normal row looks like in the Drizzle/MySQL row format: a NULL bitmap and then the column data.

Nothing that special. It’s a fixed-size buffer; Field objects reference into it, you read values out of it and write them into your engine. However, when you get to BLOBs, we can’t use a fixed-size buffer, as BLOBs may be quite large. So with BLOBs, the first part of the in-row data is the length of the BLOB (1, 2, 3 or 4 bytes – in Drizzle it’s only 3 or 4 bytes now, and soon it will be only 4 bytes once we fix a bug that isn’t interesting to discuss here). The second part of the in-row data is a pointer to the location in memory where the BLOB is stored. So the in-row part of a BLOB column is just: [length][pointer].

The size of the pointer is (of course) platform dependent. On 32bit machines it’s 4 bytes and on 64bit machines it’s 8 bytes.
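
To make the layout concrete, here’s an illustrative sketch of unpacking the in-row part of a BLOB column, assuming the usual little-endian length encoding (not code from the Drizzle source – the real engines go through the Field_blob helpers):

#include <cstring>
#include <stdint.h>

/* The in-row data is length_bytes bytes of length, followed by a raw
   pointer – sizeof(char*) bytes: 4 on 32-bit platforms, 8 on 64-bit –
   to wherever the BLOB itself lives in memory. */
static const unsigned char *
unpack_blob(const unsigned char *pos, unsigned int length_bytes,
            uint32_t *length)
{
  *length= 0;
  for (unsigned int i= 0; i < length_bytes; i++)
    *length|= ((uint32_t) pos[i]) << (8 * i);

  const unsigned char *blob;
  memcpy(&blob, pos + length_bytes, sizeof(blob));
  return blob;
}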

Now, if I were any other source of documentation, I’d stop right here.

But I’m not. I’m a programmer writing a Storage Engine who now faces the crucial question of memory management.

When your engine is given the row from the upper layer (such as in doInsertRecord()/write_row()), you don’t have to worry: for the duration of the call, the memory will be there. (Don’t count on it being there afterwards though, so if you’re not going to immediately splat it somewhere, make your own copy.)

For reading, you are expected to provide a pointer to a location in memory that is valid until the next call to your Cursor. For example, a rnd_next() call reads a BLOB field and your engine provides a pointer; at the subsequent rnd_next() call, it can free that pointer (or at doStopTableScan()/rnd_end()).

HOWEVER, this is true except for index_read_idx_map(), which in the default implementation in the Cursor (handler) base class ends up doing a doStartIndexScan(), index_read(), doEndIndexScan(). This means that if a BLOB was read, the engine could have (quite rightly) freed that memory already. In this case, you must keep the memory around until either a reset() or extra(HA_EXTRA_FLUSH) call.
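
In other words, a defensively written engine ends up managing the backing memory something like this (a sketch with illustrative names, not the actual Cursor interface):

#include <stdlib.h>

/* Tracks the buffer backing the BLOB pointer most recently handed to
   the server. Because of the index_read_idx_map() case, the safe rule
   is: free it only when producing the next row, or at reset() /
   extra(HA_EXTRA_FLUSH) – never earlier. */
struct blob_keeper
{
  unsigned char *buf;

  blob_keeper() : buf(NULL) {}

  /* called when producing a new row (rnd_next(), index_read(), ...) */
  unsigned char *replace(unsigned char *new_buf)
  {
    free(buf);          /* the previous row’s BLOB can go now */
    buf= new_buf;
    return buf;
  }

  /* called from reset() and extra(HA_EXTRA_FLUSH) */
  void release()
  {
    free(buf);
    buf= NULL;
  }
};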

This exception is tested (by accident) by exactly one query in type_blob.test – a monster of a query that’s about a seven-way join with a group by and an order by. It would be quite possible to write a fairly functional engine and completely miss this.

Good luck.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).