Multi-tenancy Drizzle

My previous post focused on some of the problems of doing multi-tenant MySQL.

One of the reasons why I started hacking on Drizzle was that the multi-tenancy options for MySQL just weren’t very good (this is also the reason why I run my blog in a VM and not a shared hosting solution).

What you really want is to be able to give your users access to a virtual database server. What you don’t want is to be administering a separate database server for each of your users. What you want are CATALOGs.

A CATALOG is a collection of SCHEMAs (which have TABLEs in them). Each CATALOG is isolated from all the others. Once you connect to a catalog, that’s it. They are entirely separate units. There are no cross-catalog queries or CHANGE CATALOG commands. It is as if each catalog is its own database server.

You can easily imagine a world where there are per-catalog resource limits too (max connections, max temp tables etc).

For the Drizzle7 release, we got in some preliminary support to ensure that the upgrade path would be easy. You’ll notice that all the schemas you create are in the ‘local’ catalog (you can even spot this on the file system in the datadir).

For the next Drizzle milestone, the intent has been to flesh out this support to enable a very elegant multi-tenant solution for Drizzle.

One of the things we worked on a little while ago now is introducing TableIdentifier and SchemaIdentifier objects into Drizzle. Historically (and still in the MySQL codebase) tables would be referenced by a string in the format “database/table_name” (except sometimes when it could be “database\table_name”). There were also various times when this was the name as entered by the user and times when this was converted into a safe form for storing on disk (and comparing to one another).

Everywhere in Drizzle where we have to deal with the path to a table, we call a method on a TableIdentifier for it. There is also a method for getting a series of bytes that can be used as a key in a data structure (e.g. table definition cache). It’s even used to work out what path on the file system to store the data file in.

I would say that the use of TableIdentifier and SchemaIdentifier has prevented many bugs from creeping into the server during development. It took several aborted goes to get something like this into the codebase, just because how the database name, table name and table path strings were being used and abused around the server.

So, with our cleaned up codebase using TableIdentifier and SchemaIdentifier, how hard would it really be to introduce the new level of CATALOG?

So along comes an epic hacking session last weekend. Basically I wanted to see how close I could get in such a short time. I introduced a CatalogIdentifier and then forced the SchemaIdentifier constructor to require one as a parameter….

The great benefit? Every place in the code that I needed to change was a compile error. The overwhelming majority of places I had to fix were shown to me by GCC. I cannot express how wonderful this is.

Anyway, by going through all of these places and fixing up a few things that also needed fixing (including just throwing in a few temporary hacks), I got it working!

I have a tree on launchpad (lp:~stewart/drizzle/multitenant) where you can create multiple catalogs, connect to any of the catalogs you’ve created (or local), create schemas and tables in them. Each catalog is separate. So you can have 1 server with 2 schemas called ‘test’ and 2 tables called ‘t1’.

This is a pretty early development tree… and it comes with some limitations (that will need to be fixed):

  • “local” is the default catalog (as it is in Drizzle7)
  • “local” must exist
  • I_S and DATA_DICTIONARY only show up in local (i.e. SHOW TABLES doesn’t even work yet)
  • HailDB is “local” only
  • testing equals approximately zero
  • currently the only protocol plugin that supports connecting to another catalog is the console plugin

But, as a result of only a weekend of hacking, pretty cool.

Paving the way for easy Database-As-Aa-Service.

Things I’ve done in Drizzle

When writing my Dropping ACID: Eating Data in a Web 2.0 Cloud World talk for LCA2011 I came to the realisation that I had forgotten a lot of the things I had worked on in MySQL and MySQL Cluster. So, as a bit of a retrospective as part of the Drizzle7 GA release, I thought I might try and write down a (incomplete) list of the various things I’ve worked on in Drizzle.

I noticed I did a lot of code removal, that’s all fine and dandy but maybe I won’t list all of that… except perhaps my first branch that was merged :)

2008

  • First ever branch that was merged: some mysys removal (use POSIX functions instead of wrappers that sometimes have different semantics than their POSIX functions), some removal of NETWARE, build scripts that weren’t helpful (i.e. weren’t what any build team ever used to build a release) and some other dead code removal.
  • Improve ‘make test’ time – transactions FTW! (this would be a theme for me over the years, I always want build and test to be faster)
  • Started moving functions out into their own plugins, removing the difference between UDFs (User Defined Functions) and builtin functions. One API to rule them all.
  • Ahhh compiler warnings (we now build with -Werror) and unchecked return codes from system calls (we now add this to some of our own methods, finding and fixing even more bugs). We did enable a lot of compiler warnings and OMG fix a lot of them.
  • Removal of FRM – use a protobuf message to describe the table. The first branch that implemented any of this was merged mid-November. It was pretty minimal, and we still had the FRM around for a little while yet – but it was the beginning of the end for features that couldn’t be implemented due to limitations in the FRM file format. I wrote a few blog entries about this.
  • A lot of test fixes for Drizzle
  • After relating the story of ☃.test (hint: it broke the ability to check out the source tree on Solaris) to Anthony Baxter, he suggested trying something rather nasty… a unicode character above 2^16 – I chose 𝄢 – which at the time didn’t even render on Ubuntu – this was the test case you could only see correctly on MacOS X at the time (or some less broken Linux distro I guess). Some time later, I was actually able to view the test file on Ubuntu correctly.
  • I think it was November 1st when I started to work on Drizzle full time – this was rather awesome, although a really difficult decision as I did rather enjoy working with all the NDB guys.

2009:

  • January sparked the beginning of reading table information from the table protobuf message instead of the FRM file. The code around the FRM file was lovely and convoluted in places – to this day I’m surprised at the low bug count all that effort resulted in. One day I may write up a list of bugs and exploits probably available through the FRM code.
  • My hate for C++ fstream started in Feb 2009
  • In Feb I finally removed the code to work out (and store) in the FRM file how to display a set of fields for entering data into the table on a VT100 80×24 screen.
  • I filed my first GCC bug. The morning started off with a phone call and Brian asking me to look at some strange bug and ended with Intel processor manuals (mmm… the 387 is fun), the C language standard and (legitimately) finding it was a real compiler bug. Two hours and two minutes after filing the bug there was a patch to GCC fixing it – I was impressed.
  • By the end of Feb, the FRM was gone.
  • I spent a bit of time making Drizzle on linux-sparc work. We started building with compiler options that should be much friendlier to fast execution on SPARC processors (all to do with aligning things to word boundaries). -Wcast-align is an interesting gcc flag
  • Moved DDL commands to be in StorageEngine rather than handler for an instance of an open table (now known as Cursor)
  • MyISAM and CSV as temporary table only engines – this means we save a bunch of mucking about in the server.
  • Amazingly enough, sprintf is dangerous.
  • Moved to an API around what tables exist in a database so that the Storage Engines can own their own metadata.
  • Move to have Storage Engines deal with the protobuf table message directly.
  • Found out you should never assume that your process never calls fork() – if you use libuuid, it may. If it wasn’t for this and if we had param-build, my porting of mtr2 to Drizzle probably would have gone in.
  • There was this thing called pack_flag – its removal was especially painful.
  • Many improvements to the table protobuf message (FRM replacement) code and format – moving towards having a file format that could realistically be produced by code that wasn’t the Drizzle database kernel.
  • Many bug fixes, some acused by us, others exposed by us, others had always been there.

2010:

  • embedded_innodb storage engine (now HailDB). This spurred many bug and API fixes in the Storage Engine interface. This was an education in all the corner cases of the various interfaces, where the obvious way to do things was almost always not fully correct (but mostly worked).
  • Engine and Schema custom KEY=VALUE options.
  • Found a bug in the pthread_mutex implementation of the atomics<> template we had. That was fun tracking down.
  • started writing more test code for the storage engine interface and upper layer. With our (much improved) storage engine interface, this became relatively easy to implement a storage engine to test specific bits of the upper layer.
  • Wrote a CREATE TABLE query that would take over four minutes to run. Fixed the execution time too. This only existed because of a (hidden and undocumented) limitation in the FRM file format to do with ENUM columns.
  • Characters versus bytes is an important difference, and one that not all of the code really appreciated (or dealt with cleanly)
  • FOREIGN KEY information now stored in the table protobuf message (this was never stored in the FRM).
  • SHOW CREATE TABLE now uses a library that reads the table protobuf message (this will later be used in the replication code as well)
  • Started the HailDB project
  • Updated the innobase plugin to be based on the current InnoDB versions.
  • Amazingly, noticed that the READ_COMMITTED isolation level was never tested (even in the simplest way you would ever explain READ_COMMITTED).
  • Created a storage engine for testing called storage_engine_api_tester (or SEAPITester). It’s a dummy engine that checks we’re calling things correctly. It exposed even more bugs and strangeness that we weren’t aware of.
  • Fixed a lot of cases in the code where we were using a large stack frame in a function (greater than 32kb).
  • An initial patch using the internal InnoDB API to store the replication log

2011:

  • This can be detailed later, it’s still in progress :)
  • The big highlight: a release.

Drizzle7

We’ve released Drizzle7! Not only that, we’re now calling it Generally Available – a GA release.

What does this mean? What does this GA label mean?

You could view as a GA label being “we’re pretty confident people aren’t going to on mass ask for our heads when they start using it”… which isn’t a too bad description. We also plan to maintain it, there could be future releases in this series that just include bug fixes – we won’t just immediately tell you to go and use the latest tarball or bzr tree. This release series is a good one to use.

Drizzle7 is something that can be packaged in Linux distros. It’s no longer something where the best bet is to add the PPA and upgrade every two weeks or build from source yourself. If you’re looking to deploy Drizzle (or develop against it) – you can rely on this release.

I’ll never use the words “production ready” to describe a release – it’s never up to me. It’s up to each person or organisation looking to deploy a piece of software to decide if that bit of software is production ready for them.

Personally, I’m looking forward to see how people can break it. While Drizzle is the best tested FOSS SQL RDBMS server, I’m sure there’s new an interesting ways it can be broken by saying we’re ready for a much larger crowd to hammer on it.

Overall, I think we’ve managed to take the now defunct MySQL 6.0 tree (way back in 2008) and release something that can truly live up to the line “database for cloud”. Drizzle is modern, modular, rather solid and understandable. The future is bright, there is so much more to do to make the ultimate database for cloud. Drizzle7 is a great platform to build on – both for us (developers) and us (people who use relational databases).

Fixed in Drizzle: No more “GOTCHA’s”

 

O'Reilly MySQL Conference & Expo 2011

At the upcoming MySQL Conference and Expo, I’m going to give a Thursday afternoon (2pm) session entitled Fixed in Drizzle: No more “GOTCHA’s”. I plan to have a lot of fun with this session..

If you go back to the very start of when I started submitting code to Drizzle (June 2008) – I was going and fixing some of my favourite “gotcha’s” inside the code: BUILD/ scripts that didn’t build the way releases would, wrappers on POSIX functions with different (and inconsistent) semantics, NETWARE support, a non thread safe client lib, my_errno (different to errno) etc. I won’t really be talking about internals like this – it may give me a happy but really isn’t the latest awesome in databases.

I’ll instead be going over the way more awesome user and DBA visible things we’ve fixed/added/removed from Drizzle that make it a database with as few GOTCHA’s as possible.

Authentication (pluggable, LDAP), Logging (to syslog, gearman), DATA_DICTIONARY, INFORMATION_SCHEMA, engines owning their own metadata, STRICT mode by default, removing global mutexes, improving the Storage Engine API, improving the replication log, including code such as PBXT and PBMS Blob Streaming, filesystem engine (read files from disk like a table), pluggable protocol, UTF8 by default, ENUM data type, auto_increment behaviour.

All this and more is “Fixed in Drizzle”.

(oh, and there’s no 24bit integer or a BLOB that can only be 255 bytes)

Undocumented ALTER TABLE that does *nothing* (useful)

(at least since MySQL 5.1.42)

alter table t1 force;

Pretty neat huh? In fact, in Drizzle this will end up doing a copying alter table. Not useful.

There’s an over four year old bug report in MySQL (Bug#24091).

I’m just going to remove that bit from the parser in Drizzle – it makes no sense.

SQL Oddity: ALTER TABLE and default values

So, the MySQL (and Drizzle) ALTER TABLE syntax allows you to easily change the default value of a column. For example:

CREATE TABLE t1 (answer int);
ALTER TABLE t1 ALTER answer SET DEFAULT 42;

So, you create a TIMESTAMP column and forgot to set the default value to CURRENT_TIMESTAMP. Easy, just ALTER TABLE:

create table t1 (a timestamp);
alter table t1 alter a set default CURRENT_TIMESTAMP;

(This is left as another exercise for the reader as to what this will do – again, maybe not what you expect)

ALTER TABLE RENAME RENAME RENAME

Here’s a nice challenge for you. What does the following do (or error out on?):

CREATE TABLE t1 (a int);
CREATE TABLE t2 (b int);
ALTER TABLE t1 RENAME t3, RENAME t2, RENAME t4;

I’d be interested to know what a) you think it does and then b) if you were surprised when you went and typed it into your RDBMS of choice.

No implicit commit (on the road to transactional DDL)

A long time ago, in a time that can only serve to make some feel old and others older, MySQL didn’t support transactions. Each statement was executed as it went, there was no ROLLBACK (or COMMIT or crash recovery etc). Then there were transactions. Other RDBMSs implement auto_commit functionality, but for MySQL users, we think of it as the magic compatibility mode that (mostly) makes applications written for MyISAM magically work on InnoDB (okay, and making “you should use transactions” a really easy consulting gig :)

I’m currently working on finishing up a patch that removes the implicit COMMIT from DDL operations in Drizzle. Instead, you get an error message saying that Transactional DDL is not currently supported. I see a future where we have one of two situations (possibly depending on the storage engine): support DDL within normal transactions, DDL only transactions (cannot mix with DML). The latter (DDL only transactions) I see as the option for InnoDB/HailDB.

Is your Storage Engine buggy or the database server?

If your storage engine returns an error from rnd_init (or doStartTableScan as it’s named in Drizzle) and does not save this error and return it in any subsequent calls to rnd_next, your engine is buggy. Namely it is buggy in that a) an error may not be reported back to the user and b) everything may explode horribly when rnd_next is called after rnd_init returned an error.

Unless it is running on MariaDB 5.2 or (soon, when the patch hits the tree) Drizzle.

Monty (Widenius, not Taylor) wrote a patch for MariaDB based on my bug report that addressed that problem. It uses the compiler feature to throw a warning if the result of a function isn’t checked to make sure that all places that call rnd_init are checking for an error from the engine.

Today I (finally) pulled that into Drizzle as well.

So… if your engine does the logical thing and goes “oh look, this method returns an error… I’ll return my error” it will exhibit bugs in MySQL but not MariaDB 5.2 or Drizzle (when patch hits).

Which is buggy, the server or the engine?

The MySQL bug number is 54166, filed in June 2010.

Persistent index statistics for InnoDB

In browsing the BZR tree for lp:mysql-server, I noticed some rather exciting code had been merged into the Innobase code.

You may be aware that InnoDB will do some index dives when opening a table to get some statistics about the indexes that can help the optimiser make good query plans.

The problem being that this is many disk seeks. It means that on server restart, you have to spend a whole bunch of time seeking around the disk reading index pages.

Not any more.

There is now code merged in to store the calculated statistics in a table inside InnoDB so that these index dives don’t have to happen on startup.

Originally, this looked like it was going to make it into InnoDB+. The good news is that it’s now in a public source tree. I look forward to when it hits a stable release.

(hopefully somebody other than me can beat me to it and write a nice description of the algorithms involved… the code is pretty easy to follow, so it shouldn’t be hard)

Replication log inside InnoDB

The MySQL replication system has always had the replication log (“binlog”) as a separate set of files on disk. Originally, this really didn’t matter as, well, MyISAM wasn’t transactional or crash safe so the binlog didn’t need to be either. If you crashed on a busy write workload, your replication was just going to be hosed anyway.

So then came a time where everybody used InnoDB. Transactional, crash-safe and all the goodies. Then, a bit later, came storing master rpl log position in InnoDB log and XA with the binlog. So a rather long time after MySQL first had replication, you could pull the power cord on the master with a decent amount of certainty that things would be okay when you turned it on again.

I am, of course, totally ignoring the slave state and if it’s safe to do that on slaves.

Using XA to make the binlog and InnoDB consistent does have a cost. That cost is fsync()s. You have to do a lot more of them (two phase commit here).

As you may be aware, at a (much) earlier point in Drizzle we completely ripped out the replication code. Why? A lot of it was very much still geared to support statement based replication – something we certainly didn’t want to support. We also did not really want to keep the legacy binlog format. We wanted it to be very, very pluggable.

So the initial implementation is a transaction log file. Basically, we write out the replication messages to a file. A slave reads this and applies the operations. Pretty simple and foolproof to implement.

But it’s pluggable.

What if we stored the transaction log inside innodb? Not only that, what if we wrote it as part of the transaction that was doing the changes? That way, no XA is needed – everything is consistent with a COMMIT. This would greatly reduce the number of fsync()s needed to be consistent.

Now… the first thing people will say is “arrggh! You’re writing the data *four* times now”. First being the txn data into the log, then the replication log into the log, and then both of these are written back to the data file. It turns out that this is much cheaper than doing the additional fsync()s.

In one of our tests, the file based transaction log: ~300tps. Transaction log in InnoDB: ~1200tps.

I think that’s an acceptable trade-off.

We’ve just merged the first bit of this work into Drizzle.

Props go to Joe Daly, Brian and myself for making it work.

A more complete look at Storage Engine API

Okay… So I’ve blogged many times before about the Storage Engine API in Drizzle. This API is somewhat inherited from MySQL. We have very much attempted to make it a much cleaner interface. Our goals in making changes include: make it much easier to write and maintain a storage engine, make the upper layer code obviously correct and clear in what it’s doing and being able to more easily introduce optimisations.

I’ve recently added a Storage Engine that is only used in testing: storage_engine_api_tester. I’ve blogged on it producing call graphs (really state transition graphs) before both for Storage Engine and Cursor.

I’ve been expanding the test. My test engine is now a wrapper around a real engine instead of just a fake one. This lets us run real queries (and test cases) while testing what’s going on. At some point in the near future I plan to make it so that it will be able to log what calls go on to the engine and produce a graph just of those.

I added a lot more to the Storage Engine part of the wrapper. Below is what you can see is the current graph:

I’ve coded what I consider to be bugs as red and what I consider suspect as blue.

Also for the Cursor (colours mean the same):

As you can see, there’s currently some wacky possibilities. I’m investigating exactly what’s going on here – If I’m somehow missing some calls that I should be wrapping (I don’t think so) or if we are really doing some dumb-ass things in the upper layer.

Also, please do not be under any impression that any of this means that we’re going to have a stable API. We’re not. To stabilise on this would just be insane – way too much of it still makes not much sense.

Limiting functions to 32k stack in Drizzle (and scoped_ptr)

I wonder if this comes under “Code Style” or not…

Anyway, Monty and I finished getting Drizzle ready for adding “-Wframe-larger-than=32768” as a standard compiler flag. This means that no function within the Drizzle source tree can use greater than 32kb stack – it’s a compiler warning – and with -Werror, it means that it’s a build error.

GCC is not perfect at detecting stack usage, but it’s pretty good.

Why have we done this?

Well, there is a little bit of recursion in the server… and we can craft queries to blow a small stack (not so good). On MacOS X, the default thread stack size is only 512kb. This gives not many frames if 32kb stack is a even remotely common.

I found some interesting places to throw a lot of things on the stack too – that would be rather far down on a callchain – leading to the possibility of blowing up in really strange ways.

We’d love to make it 16kb…. but that’s a fair bit more work, so something for the future.

We’ve used the Boost scoped_ptr to address a bunch of these situations as it provides pretty much minimal code change for the same effect (except that memory is dynamically allocated instead of as part of the stack frame).

Drizzle gets InnoDB 1.0.9

My branch that updates the innobase plugin in Drizzle to be based on innodb_plugin 1.0.9 has been merged. For the next milestone, we’ll probably have 1.0.11 as well.

How’s the progress getting 1.1 and 1.2 in? Pretty good actually. We’ll have it for either this milestone or the next one.

and merging newer innodb into HailDB? It’s going well too, expect more news “soon”.

Cursor states

Following on from my post yesterday on the various states of a Storage Engine, I said I’d have a go with the Cursor object too. A Cursor is used by the Drizzle kernel to get and set data in a table. There can be more than one cursor open at once, and more than one per thread. If your engine cannot cope with this, it is its responsibility to figure it out and return the appropriate errors.

Let’s look at a really simple operation, inserting a couple of rows and then reading them back via a full table scan.

Now, this graph is slightly incomplete as there is no doEndTableScan() call. But you can see in which order things are meant to happen. In this case, “store_lock()” means that store_lock() has been called, so when coming back from doInsertRecord() we do not call store_lock() again, rather, we’re just in a state where it has already been executed.

For MySQL handler, think ::write_row() for doInsertRecord() and ::rnd_init() for doStartTableScan().

This diagram was again auto-generated from my test engine.

Rackspace Rookie-O (in Hong Kong!)

I’d meant to finish writing this way back in July… but I failed at that. Now is a good time to talk about Rookie-O as my again new colleague Andrew Hutchings (Buy his and Sergei’s book on MySQL 5.1 Plugin Development!) just went through the same thing (but in London instead of Hong Kong) given by the same trainer (Hi Eddie!).

Rackspace is the second employer I’ve had that has some kind of new hire training (the first being Sun). I am, of course, not quite counting Salmiakki as new-hire training for MySQL (although I probably should). To quote from the Wikipedia article: “Although the rumor of the heart attack was a hoax, the drink may still cause harm. The strong flavor almost completely masks the presence of ethanol, and the drinker may not realize he is consuming a drink almost 40% alcohol by volume (80-proof), leading to possible alcohol poisoning.” A promising introduction to the company.

Monty, MÃ¥rten and Kaj with Salmiakki singing Helan GÃ¥r at the MySQL User Conference Japan in 2007

Monty, MÃ¥rten and Kaj with Salmiakki singing Helan GÃ¥r at the MySQL User Conference Japan in 2007

I could possibly say something about the Sun New-Hire training… but I’m just trying to find something positive to say – and I can’t. I got a bit of hacking done? Seriously.

Actually coordinating a time to attend a Rookie-O (Rookie Orientation, the Rackspace name for new hire training) was rather tricky. There was one right before the MySQL User Conference back in April (not the best of timing), one during an upcoming team meeting (again, not ideal) and one that got organised in the middle of everything for the office in Hong Kong. So, I headed to Hong Kong.

Hong Kong streetlife

The Hong Kong office is relatively new (late 2008) and there were people there who hadn’t gone through the standard Rackspace Rookie-O (Orientation).

Rackers walking Hong Kong at Night

It was rather cool to hang out with other people who worked for the company – and in totally different areas than I do. I did get a better understanding for how the rest of the company operates and the people involved. The training itself was useful and substantially less geared towards not-my-job than Sun’s was.

The good news is that Andrew thought it was useful too. Pretty impressed so far.