MySQL removes the FRM (7 years after Drizzle did)

The new MySQL 8.0.0 milestone release that was recently announced brings something that has been a looooong time coming: the removal of the FRM file. I was the one who implemented this in Drizzle way back in 2009 (July 28th 2009 according to Brian)- and I may have had a flashback to removing the tentacles of the FRM when reading the MySQL 8.0.0 announcement.

As an idea for how long this has been on the cards, I’ll quote Brian from when we removed it in Drizzle:

We have been talking about getting rid of FRM since around 2003. I remember a drive up to northern Finland with Kaj Arnö, where we spent an hour talking about this. I, David, and MontyW have talked about this for years.

http://krow.livejournal.com/642329.html

Soo… it was a known problem for at least thirteen years. One of the issues removing it was how pervasive all of the FRM related things were. I shudder at the mention of “pack_flag” and Jay Pipes probably does too.

At the time, we tried a couple of approaches as to how things should look. Our philosophy with Drizzle was that it should get out of the way at let the storage engines be the storage engines and not try to second guess them or keep track of things behind their back. I still think that was the correct architectural approach: the role of Drizzle was to put SQL on top of a storage engine, not to also be one itself.

Looking at the MySQL code, there’s one giant commit 31350e8ab15179acab5197fa29d12686b1efd6ef. I do mean giant too, the diffstat is amazing:

 786 files changed, 58471 insertions(+), 25586 deletions(-)

How anyone even remotely did code review on that I have absolutely no idea. I know the only way I could get it to work in Drizzle was to do it incrementally, a series of patches that gradually chiseled out what needed to be taken out so I could put it an API and the protobuf code.

Oh, and in case you’re wondering:

- uint offset,pack_flag;
+ uint offset;

Thank goodness. Now, you may not appreciate that as much as I might, but pack_flag was not the height of design, it was… pretty much a catchalll for some kind of data about a field that wasn’t something that already had a field in the FRM. So it may include information on if the field could be null or not, if it’s decimal, how many bytes an integer takes, that it’s a number and how many oh, just don’t ask.

Also gone is the weird interval_id and a whole bunch of limitations because of the FRM format, including one that I either just discovered or didn’t remember: if you used all 256 characters in an enum, you couldn’t create the table as MySQL would pick either a comma or an unused character to be the separator in the FRM!?!

Also changed is how the MySQL server handles default values. For those not aware, the FRM file contains a static copy of the row containing default values. This means the default values are computed once on table creation and never again (there’s a bunch of work arounds for things like AUTO_INCREMENT and DEFAULT NOW()). The new sql/default_values.cc is where this is done now.

For now at least, table metadata is also written to a file that appears to be JSON format. It’s interesting that a SQL database server is using a schemaless file format to describe schema. It appears that these files exist only for disaster recovery or perhaps portable tablespaces. As such, I’m not entirely convinced they’re needed…. it’s just a thing to get out of sync with what the storage engine thinks and causes extra IO on DDL (as well as forcing the issue that you can’t have MVCC into the data dictionary itself).

What will be interesting is to see the lifting of these various limitations and how MariaDB will cope with that. Basically, unless they switch, we’re going to see some interesting divergence in what you can do in either database.

There’s certainly differences in how MySQL removed the FRM file to the way we did it in Drizzle. Hopefully some of the ideas we had were helpful in coming up with this different approach, as well as an extra seven years of in-production use.

At some point I’ll write something up as to the fate of Drizzle and a bit of a post-mortem, I think I may have finally worked out what I want to say…. but that is a post for another day.

Some current MySQL Architecture writings

So, I’ve been looking around for a while (and a few times now) for any good resources that cover a bunch of MySQL architecture and technical details aimed towards the technically proficient but not MySQL literate audience. I haven’t really found anything. I mean, there’s the (huge and very detailed) MySQL manual, there’s the MySQL Internals manual (which is sometimes only 10 years out of date) and there’s various blog entries around the place. So I thought I’d write something explaining roughly how it all fits together and what it does to your system (processes, threads, IO etc).(Basically, I’ve found myself explaining this enough times in the past few years that I should really write it down and just point people to my blog). I’ve linked to things for more reading. You should probably read them at some point.

Years ago, there were many presentations on MySQL Architecture. I went to try and find some on YouTube and couldn’t. We were probably not cool enough for YouTube and the conferences mostly weren’t recorded. So, instead, I’ll just link to Brian on NoSQL – because it’s important to cover NoSQL as well.

So, here is a quick overview of executing a query inside a MySQL Server and all the things that can affect it. This isn’t meant to be complete… just a “brief” overview (of a few thousand words).

MySQL is an open source relational database server, the origins of which date back to 1979 with MySQL 1.0 coming into existence in 1995. It’s code that has some history and sometimes this really does show. For a more complete history, see my talk from linux.conf.au 2014: Past, Present and Future of MySQL (YouTube, Download).

At least of writing, everything here applies to MariaDB and Percona Server too.

The MySQL Server runs as a daemon (mysqld). Users typically interact with it over a TCP or UNIX domain socket through the MySQL network protocol (of which multiple implementations exist under various licenses). Each connection causes the MySQL Server (mysqld) to spawn a thread to handle the client connection.

There are now several different thread-pool plugins that instead of using one thread per connection, multiplex connections over a set of threads. However, these plugins are not commonly deployed and we can largely ignore them. For all intents and purposes, the MySQL Server spawns one thread per connection and that thread alone performs everything needed to service that connection. Thus, parallelism in the MySQL Server is gotten from executing many concurrent queries rather than one query concurrently.

The MySQL Server will cache threads (the amount is configurable) so that it doesn’t have to have the overhead of pthread_create() for each new connection. This is controlled by the thread_cache_size configuration option. It turns out that although creating threads may be a relatively cheap operation, it’s actually quite time consuming in the scope of many typical MySQL Server connections.

Because the MySQL Server is a collection of threads, there’s going to be thread local data (e.g. connection specific) and shared data (e.g. cache of on disk data). This means mutexes and atomic variables. Most of the more advanced ways of doing concurrency haven’t yet made it into MySQL (e.g. RCU hasn’t yet and is pretty much needed to get 1 million TPS), so you’re most likely going to see mutex contention and contention on cache lines for atomic operations and shared data structures.

There are also various worker threads inside the MySQL Server that perform various functions (e.g. replication).

Until sometime in the 2000s, more than one CPU core was really uncommon, so the fact that there were many global mutexes in MySQL wasn’t really an issue. These days, now that we have more reliable async networking and disk IO system calls but MySQL has a long history, there’s global mutexes still and there’s no hard and fast rule about how it does IO.

Over the past 10 years of MySQL development, it’s been a fight to remove the reliance on global mutexes and data structures controlled by them to attempt to increase the number of CPU cores a single mysqld could realistically use. The good news is that it’s no longer going to max out on the number of CPU cores you have in your phone.

So, you have a MySQL Client (e.g. the mysql client or something) connecting to the MySQL Server. Now, you want to enter a query. So you do that, say “SELECT 1;”. The query is sent to the server where it is parsed, optimized, executed and the result returns to the client.

Now, you’d expect this whole process to be incredibly clean and modular, like you were taught things happened back in university with several black boxes that take clean input and produce clean output that’s all independent data structures. At least in the case of MySQL, this isn’t really the case. For over a decade there’s been lovely architecture diagrams with clean boxes – the code is not like this at all. But this probably only worries you once you’re delving into the source.

The parser is a standard yacc one – there’s been attempts to replace it over the years, none of which have stuck – so we have the butchered yacc one still. With MySQL 5.0, it exploded in size due to the addition of things like SQL2003 stored procedures and it is of common opinion that it’s rather bloated and was better in 4.1 and before for the majority of queries that large scale web peeps execute.

There is also this thing called the Query Cache – protected by a single global mutex. It made sense in 2001 for a single benchmark. It is a simple hash of the SQL statement coming over the wire to the exact result to send(2) over the socket back to a client. On a single CPU system where you ask the exact same query again and again without modifying the data it’s the best thing ever. If this is your production setup, you probably want to think about where you went wrong in your life. On modern systems, enabling the query cache can drop server performance by an order of magnitude. A single global lock is a really bad idea. The query cache should be killed with fire – but at least in the mean time, it can be disabled.

Normally, you just have the SQL progress through the whole process of parse, optimize, execute, results but the MySQL Server also supports prepared statements. A prepared statement is simply this: “Hi server, please prepare this statement for execution leaving the following values blank: X, Y and Z” followed by “now execute that query with X=foo, Y=bar and Z=42”. You can call execute many times with different values. Due to the previously mentioned not-quite-well-separated parse, optimize, execute steps, prepared statements in MySQL aren’t as awesome as in other relational databases. You basically end up saving parsing the query again when you execute it with new parameters. More on prepared statements (from 2006) here. Unless you’re executing the same query many times in a single connection, server side prepared statements aren’t worth the network round trips.

The absolute worst thing in the entire world is MySQL server side prepared statements. It moves server memory allocation to be the responsibility of the clients. This is just brain dead stupid and a reason enough to disable prepared statements. In fact, just about every MySQL client library for every programming language ever actually fakes prepared statements in the client rather than trust every $language programmer to remember to close their prepared statements. Open many client connections to a MySQL Server and prepare a lot of statements and watch the OOM killer help you with your DoS attack.

So now that we’ve connected to the server, parsed the query (or done a prepared statement), we’re into the optimizer. The optimizer looks at a data structure describing the query and works out how to execute it. Remember: SQL is declarative, not procedural. The optimizer will access various table and index statistics in order to work out an execution plan. It may not be the best execution plan, but it’s one that can be found within reasonable time. You can find out the query plan for a SELECT statement by prepending it with EXPLAIN.

The MySQL optimizer is not the be all and end all of SQL optimizers  (far from it). A lot of MySQL performance problems are due to complex SQL queries that don’t play well with the optimizer, and there’s been various tricks over the years to work around deficiencies in it. If there’s one thing the MySQL optimizer does well it’s making quick, pretty good decisions about simple queries. This is why MySQL is so popular – fast execution of simple queries.

To get table and index statistics, the optimizer has to ask the Storage Engine(s). In MySQL, the actual storage of tables (and thus the rows in tables) is (mostly) abstracted away from the upper layers. Much like a VFS layer in an OS kernel, there is (for some definition) an API abstracting away the specifics of storage from the rest of the server. The API is not clean and there are a million and one layering violations and exceptions to every rule. Sorry, not my fault.

Table definitions are in FRM files on disk, entirely managed by MySQL (not the storage engines) and for your own sanity you should not ever look into the actual file format. Table definitions are also cached by MySQL to save having to open and parse a file.

Originally, there was MyISAM (well, and ISAM before it, but that’s irrelevant now). MyISAM was non-transactional but relatively fast, especially for read heavy workloads. It only allowed one writer although there could be many concurrent readers. MyISAM is still there and used for system tables. The current default storage engine is called InnoDB. It’s all the buzzwords like ACID and MVCC. Just about every production environment is going to be using InnoDB. MyISAM is effectively deprecated.

InnoDB originally was its own independent thing and has (to some degree) been maintained as if it kind of was. It is, however, not buildable outside a MySQL Server anymore. It also has its own scalability issues. A recent victory was splitting the kernel_mutex, which was a mutex that protected far too much internal InnoDB state and could be a real bottleneck where NRCPUs > 4.

So, back to query execution. Once the optimizer has worked out how to execute the query, MySQL will start executing it. This probably involves accessing some database tables. These are probably going to be InnoDB tables. So, MySQL (server side) will open the tables, looking up the MySQL Server table definition cache and creating a MySQL Server side table share object which is shared amongst the open table instances for that table. See here for scalability hints on these (from 2009). The opened table objects are also cached – table_open_cache. In MySQL 5.6, there is table_open_cache_instances, which splits the table_open_cache mutex into table_open_cache_instances mutexes to help reduce lock contention on machines with many CPU cores (> 8 or >16 cores, depending on workload).

Once tables are opened, there are various access methods that can be employed. Table scans are the worst (start at the start and examine every row). There’s also index scans (often seeking to part of the index first) and key lookups. If your query involves multiple tables, the server (not the storage engine) will have to do a join. Typically, in MySQL, this is a nested loop join. In an ideal world, this would all be really easy to spot when profiling the MySQL server, but in reality, everything has funny names like rnd_next.

As an aside, any memory allocated during query execution is likely done as part of a MEM_ROOT – essentially a pool allocator, likely optimized for some ancient libc on some ancient linux/Solaris and it just so happens to still kinda work okay. There’s some odd server configuration options for (some of the) MEM_ROOTs that get exactly no airtime on what they mean or what changing them will do.

InnoDB has its own data dictionary (separate to FRM files) which can also be limited in current MySQL (important when you have tens of thousands of tables) – which is separate to the MySQL Server table definitions and table definition cache.

But anyway, you have a number of shared data structures about tables and then a data structure for each open table. To actually read/write things to/from tables, you’re going to have to get some data to/from disk.

InnoDB tables can be stored either in one giant table space or file-per-table. (Even though it’s now configurable), InnoDB database pages are 16kb. Database pages are cached in the InnoDB Buffer Pool, and the buffer-pool-size should typically be about 80% of system memory. InnoDB will use a (configurable) method to flush. Typically, it will all be O_DIRECT (it’s configurable) – which is why “just use XFS” is step 1 in IO optimization – the per inode mutex in ext3/ext4 just doesn’t make IO scale.

InnoDB will do some of its IO in the thread that is performing the query and some of it in helper threads using native linux async IO (again, that’s configurable). With luck, all of the data you need to access is in the InnoDB buffer pool – where database pages are cached. There exists innodb_buffer_pool_instances configuration option which will split the buffer pool into several instances to help reduce lock contention on the InnoDB buffer pool mutex.

All InnoDB tables have a clustered index. This is the index by which the rows are physically sorted by. If you have an INT PRIMARY KEY on your  InnoDB table, then a row with that primary key value of 1 will be physically close to the row with primary key value 2 (and so on). Due to the intricacies of InnoDB page allocation, there may still be disk seeks involved in scanning a table in primary key order.

Every page in InnoDB has a checksum. There was an original algorithm, then there was a “fast” algorithm in some forks and now we’re converging on crc32, mainly because Intel implemented CPU instructions to make that fast. In write heavy workloads, this used to show up pretty heavily in profiles.

InnoDB has both REDO and UNDO logging to keep both crash consistency and provide consistent read views to transactions. These are also stored on disk, the redo logs being in their own files (size and number are configurable). The larger the redo logs, the longer it may take to run recovery after a crash. The smaller the redo logs, the more trouble you’re likely to run into with large or many concurrent transactions.

If your query performs writes to database tables, those changes are written to the REDO log and then, in the background, written back into the table space files. There exists configuration parameters for how much of the InnoDB buffer pool can be filled with dirty pages before they have to be flushed out to the table space files.

In order to maintain Isolation (I in ACID), InnoDB needs to assign a consistent read view to a new transaction. Transactions are either started explicitly (e.g. with BEGIN) or implicitly (e.g. just running a SELECT statement). There has been a lot of work recently in improving the scalability of creating read views inside InnoDB. A bit further in the past there was a lot of work in scaling InnoDB for greater than 1024 concurrent transactions (limitations in UNDO logging).

Fancy things that make InnoDB generally faster than you’d expect are the Adaptive Hash Index and change buffering. There are, of course, scalability challenges with these too. It’s good to understand the basics of them however and (of course), they are configurable.

If you end up reading or writing rows (quite likely) there will also be a translation between the InnoDB row format(s) and the MySQL Server row format(s). The details of which are not particularly interesting unless you’re delving deep into code or wish to buy me beer to hear about them.

Query execution may need to get many rows from many tables, join them together, sum things together or even sort things. If there’s an index with the sort order, it’s better to use that. MySQL may also need to do a filesort (sort rows, possibly using files on disk) or construct a temporary table in order to execute the query. Temporary tables are either using the MEMORY (formerly HEAP) storage engine or the MyISAM storage engine. Generally, you want to avoid having to use temporary tables – doing IO is never good.

Once you have the results of a query coming through, you may think that’s it. However, you may also be part of a replication hierarchy. If so, any changes made as part of that transaction will be written to the binary log. This is a log file maintained by the MySQL Server (not the storage engines) of all the changes to tables that have occured. This log can then be pulled by other MySQL servers and applied, making them replication slaves of the master MySQL Server.

We’ll ignore the differences between statement based replication and row based replication as they’re better discussed elsewhere. Being part of replication means you get a few extra locks and an additional file or two being written. The binary log (binlog for short) is a file on disk that is appended to until it reaches a certain size and is then rotated. Writes to this file vary in size (along with the size of transactions being executed). The writes to the binlog occur as part of committing the transaction (and the crash safety between writing to the binlog and writing to the storage engine are covered elsewhere – basically: you don’t want to know).

If your MySQL Server is a replication slave, then you have a thread reading the binary log files from another MySQL Server and then another thread (or, in newer versions, threads) applying the changes.

If the slow query log or general query log is enabled, they’ll also be written to at various points – and the current code for this is not optimal, there be (yes, you guess it) global mutexes.

Once the results of a query have been sent back to the client, the MySQL Server cleans things up (frees some memory) and prepares itself for the next query. You probably have many queries being executed simultaneously, and this is (naturally) a good thing.

There… I think that’s a mostly complete overview of all the things that can go on during query execution inside MySQL.

Past, Present and future of MySQL and variants Part 1: Ghosts of MySQL Past

You can watch the video of my linux.conf.au 2014 talk here: http://mirror.linux.org.au/linux.conf.au/2014/Wednesday/28-Past_Present_and_future_of_MySQL_and_variants_-_Stewart_Smith.mp4

But let’s talk about things in blog form rather than video form :)

Back in 1979, there was UNIREG. A text UI to records (rows) in a database (err, table). The reason I mention UNIREG is that it had FoRMs which as you may have guessed by my capitalization there is where the FRM file comes from.

In 1986, UNIREG came to UNIX. That’s right kids, the 80×24 VT100 interface to ISAM (Index Sequential Access Method – basically rows are written in insert order and indexes point to them) came to UNIX. There was no generic query language, just FoRMs and reports. In fact, to this day, that 80×24 text interface is stored in the FRM file by MySQL and never ever used (I’ve written about this before).

Then there was this mSQL thing around the 1990s, which was a small SQL server (with source) but not FOSS. Originally, Monty W plugged in his ISAM engine but it wasn’t quite the right fit… so in 1995, we had MySQL 1.0 and MySQL AB was founded.

Fast forward a bit and in 1996 we had MySQL 3.19 and development continued. It managed to gain features, performance, ports to different operating systems and CPU architectures and, of course, stability.

It wasn’t until the year 2000 that MySQL adopted the GPL. This turned out to be a huge step in the right direction for increased adoption. At the time, this was a huge risk for the company, essentially risking all the revenue of the company on making the software more free.

This was the birth of the dual licensing business model. You see, the client library (libmysql) was also GPL, which meant it was easy to use if your application was also GPL, but if you were going to distribute your application and it wasn’t under a GPL compatible license (there was also a FOSS exception so that things like PHP could use it) then you needed a license.

Revenue from licensing was to be significant throughout the entire history of MySQL AB.

(We’ll continue this in part 2 tomorrow)

Things I’ve done in Drizzle

When writing my Dropping ACID: Eating Data in a Web 2.0 Cloud World talk for LCA2011 I came to the realisation that I had forgotten a lot of the things I had worked on in MySQL and MySQL Cluster. So, as a bit of a retrospective as part of the Drizzle7 GA release, I thought I might try and write down a (incomplete) list of the various things I’ve worked on in Drizzle.

I noticed I did a lot of code removal, that’s all fine and dandy but maybe I won’t list all of that… except perhaps my first branch that was merged :)

2008

  • First ever branch that was merged: some mysys removal (use POSIX functions instead of wrappers that sometimes have different semantics than their POSIX functions), some removal of NETWARE, build scripts that weren’t helpful (i.e. weren’t what any build team ever used to build a release) and some other dead code removal.
  • Improve ‘make test’ time – transactions FTW! (this would be a theme for me over the years, I always want build and test to be faster)
  • Started moving functions out into their own plugins, removing the difference between UDFs (User Defined Functions) and builtin functions. One API to rule them all.
  • Ahhh compiler warnings (we now build with -Werror) and unchecked return codes from system calls (we now add this to some of our own methods, finding and fixing even more bugs). We did enable a lot of compiler warnings and OMG fix a lot of them.
  • Removal of FRM – use a protobuf message to describe the table. The first branch that implemented any of this was merged mid-November. It was pretty minimal, and we still had the FRM around for a little while yet – but it was the beginning of the end for features that couldn’t be implemented due to limitations in the FRM file format. I wrote a few blog entries about this.
  • A lot of test fixes for Drizzle
  • After relating the story of ☃.test (hint: it broke the ability to check out the source tree on Solaris) to Anthony Baxter, he suggested trying something rather nasty… a unicode character above 2^16 – I chose 𝄢 – which at the time didn’t even render on Ubuntu – this was the test case you could only see correctly on MacOS X at the time (or some less broken Linux distro I guess). Some time later, I was actually able to view the test file on Ubuntu correctly.
  • I think it was November 1st when I started to work on Drizzle full time – this was rather awesome, although a really difficult decision as I did rather enjoy working with all the NDB guys.

2009:

  • January sparked the beginning of reading table information from the table protobuf message instead of the FRM file. The code around the FRM file was lovely and convoluted in places – to this day I’m surprised at the low bug count all that effort resulted in. One day I may write up a list of bugs and exploits probably available through the FRM code.
  • My hate for C++ fstream started in Feb 2009
  • In Feb I finally removed the code to work out (and store) in the FRM file how to display a set of fields for entering data into the table on a VT100 80×24 screen.
  • I filed my first GCC bug. The morning started off with a phone call and Brian asking me to look at some strange bug and ended with Intel processor manuals (mmm… the 387 is fun), the C language standard and (legitimately) finding it was a real compiler bug. Two hours and two minutes after filing the bug there was a patch to GCC fixing it – I was impressed.
  • By the end of Feb, the FRM was gone.
  • I spent a bit of time making Drizzle on linux-sparc work. We started building with compiler options that should be much friendlier to fast execution on SPARC processors (all to do with aligning things to word boundaries). -Wcast-align is an interesting gcc flag
  • Moved DDL commands to be in StorageEngine rather than handler for an instance of an open table (now known as Cursor)
  • MyISAM and CSV as temporary table only engines – this means we save a bunch of mucking about in the server.
  • Amazingly enough, sprintf is dangerous.
  • Moved to an API around what tables exist in a database so that the Storage Engines can own their own metadata.
  • Move to have Storage Engines deal with the protobuf table message directly.
  • Found out you should never assume that your process never calls fork() – if you use libuuid, it may. If it wasn’t for this and if we had param-build, my porting of mtr2 to Drizzle probably would have gone in.
  • There was this thing called pack_flag – its removal was especially painful.
  • Many improvements to the table protobuf message (FRM replacement) code and format – moving towards having a file format that could realistically be produced by code that wasn’t the Drizzle database kernel.
  • Many bug fixes, some acused by us, others exposed by us, others had always been there.

2010:

  • embedded_innodb storage engine (now HailDB). This spurred many bug and API fixes in the Storage Engine interface. This was an education in all the corner cases of the various interfaces, where the obvious way to do things was almost always not fully correct (but mostly worked).
  • Engine and Schema custom KEY=VALUE options.
  • Found a bug in the pthread_mutex implementation of the atomics<> template we had. That was fun tracking down.
  • started writing more test code for the storage engine interface and upper layer. With our (much improved) storage engine interface, this became relatively easy to implement a storage engine to test specific bits of the upper layer.
  • Wrote a CREATE TABLE query that would take over four minutes to run. Fixed the execution time too. This only existed because of a (hidden and undocumented) limitation in the FRM file format to do with ENUM columns.
  • Characters versus bytes is an important difference, and one that not all of the code really appreciated (or dealt with cleanly)
  • FOREIGN KEY information now stored in the table protobuf message (this was never stored in the FRM).
  • SHOW CREATE TABLE now uses a library that reads the table protobuf message (this will later be used in the replication code as well)
  • Started the HailDB project
  • Updated the innobase plugin to be based on the current InnoDB versions.
  • Amazingly, noticed that the READ_COMMITTED isolation level was never tested (even in the simplest way you would ever explain READ_COMMITTED).
  • Created a storage engine for testing called storage_engine_api_tester (or SEAPITester). It’s a dummy engine that checks we’re calling things correctly. It exposed even more bugs and strangeness that we weren’t aware of.
  • Fixed a lot of cases in the code where we were using a large stack frame in a function (greater than 32kb).
  • An initial patch using the internal InnoDB API to store the replication log

2011:

  • This can be detailed later, it’s still in progress :)
  • The big highlight: a release.

ENUM now works properly (in Drizzle)

Over at the Drizzle blog, the recent 2010-06-07 tarball was announced. This tarball release has my fixes for the ENUM type, so that it now works as it should. I was quite amazed that such a small block of code could have so many bugs! One of the most interesting was the documented limit we inherited from MySQL (see the MySQL Docs on ENUM) of a maximum of 65,535 elements for an ENUM column.

This all started out from a quite innocent comment of Jay‘s in a code review for adding support for the ENUM data type to the embedded_innodb engine. It was all pretty innocent… saying that I should use a constant instead of the magic 0x10000 number as a limit on an assert for sanity of values getting passed to the engine. Seeing as there wasn’t a constant already in the code for that (surprise number 1), I said I’d fix it properly in a separate patch (creating a bug for it so it wouldn’t get lost) and the code went in.

So, now, a few weeks after that, I got around to dealing with that bug (because hey, this was going to be an easy fix that’ll give me a nice sense of accomplishment). A quick look in the Field_enum code raised my suspicions of bugs… I initially wondered if we’d get any error message if a StorageEngine returned a table definition that had too many ENUM elements (for example, 70,000). So, I added a table to the tableprototester plugin (a simple dummy engine that is loaded for testing the parsing of specially constructed table messages) that had 70,000 elements for a single ENUM column. It didn’t throw an error. Darn. It did, however, have an incredibly large result for SHOW CREATE TABLE.

Often with bugs like this I may try to see if the problem is something inherited from MySQL. I’ll often file a bug with MySQL as well if that’s the case. If I can, I’ll sometimes attach the associated patch from Drizzle that fixes the bug, sometimes with a patch directly for and tested on MySQL (if it’s not going to take me too long). If these patches are ever applied is a whole other thing – and sometimes you get things like “each engine is meant to have auto_increment behave differently!” – which doesn’t inspire confidence.

But anyway, the MySQL limit is somewhere between 10850 and 10900. This is not at all what’s documented. I’ve filed the appropriate bug (Bug #54194) with reproducible test case and the bit of problematic code. It turns out that this is (yet another) limit of the FRM file. The limit is “about 64k FRM”. The bit of code in MySQL that was doing the checking for the ENUM limit was this:


/* Hack to avoid bugs with small static rows in MySQL */
  reclength=max(file->min_record_length(table_options),reclength);
  if (info_length+(ulong) create_fields.elements*FCOMP+288+
      n_length+int_length+com_length > 65535L || int_count > 255)
  {
    my_message(ER_TOO_MANY_FIELDS, ER(ER_TOO_MANY_FIELDS), MYF(0));
    DBUG_RETURN(1);
  }

So it’s no surprise to anyone how this specific limit (the number of elements in an ENUM) got missed when I converted Drizzle from using an FRM over to a protobuf based structure.

So a bunch of other cleanup later, a whole lot of extra testing and I can pretty confidently state that the ENUM type in Drizzle does work exactly how you think it would.

Either way, if you’re getting anywhere near 10,000 choices for an ENUM column you have no doubt already lost.

The Table protobuf message format

If you’ve ever opened up drizzled/message/table.proto in the Drizzle source tree you will have seen what’s in the table message: the structure that describes a database table in Drizzle. Previously I’ve talked about the Table message more generally, giving a fair bit of history of the FRM file and how we’ve replaced it with both the Table protobuf message and an infrastructure inside Drizzle so that Storage Engines own their own metadata.

Yesterday I talked about the Schema protobuf message format in more detail, and this time I’m talking about the Table protobuf message in a similar amount.

The first time we were loading (then only part of) the table definition out of a protobuf message was way back in January 2009 (I blogged about it too). It was an adventure untangling all sorts of things to get to a much nicer place (where we are now). The code in the server is not perfect… I’ll be the first to admit that some of it is rather strange, but that’s mostly all behind the scenes for people interested in the protobuf Table message!

The Table message has several embedded messages in it too. We need to have information on the Storage Engine, Fields and Indexes (and each of those can have other properties). It is much more complex than the simple Schema message.

Let’s have a look at the basic structure of the Table message:

message Table {

    /* *SNIP* (Here goes the definitions for TableType, StorageEngine, Field, Index, ForeignKeyConstrain, TableOptions and TableStats) */

  required string name = 1;
  required TableType type = 5;
  required StorageEngine engine = 2;
  repeated Field field = 3;
  repeated Index indexes = 4;

  repeated ForeignKeyConstraint fk_constraint = 8;
  optional TableOptions options = 9;
  optional TableStats stats = 10;
}

(We’ve skipped the definitions for the embedded messages for now)

This seems all pretty logical; a table has a name, a type, is in a Storage Engine, has Fields, may have Indexes, may have foreign key constraints, it has some options and statistics (the statistics may go away at some point “soon”).

Let’s have a look at the TableType message definition:

  enum TableType {
    STANDARD = 0;
    TEMPORARY = 1;
    INTERNAL = 2;
  }

It’s pretty simple, the table type is either a standard table (what you get from CREATE TABLE), a temporary table (what you get from CREATE TEMPORARY TABLE) or an INTERNAL table (what you get when Drizzle uses a temporary table during query execution).

Next, the StorageEngine message:

  message StorageEngine {

    message EngineOption {
      enum EngineOptionType {
        BOOL = 0;
        INTEGER = 1;
        STRING = 2;
      }

      required string option_name = 1;
      required string option_value = 2;
      required EngineOptionType option_type = 3;
    }

    required string name = 1;
    repeated EngineOption option = 2;
  }

The main part is the “name” member, which is just the name of the storage engine (e.g. “PBXT”,  “INNODB”, “ARCHIVE”). We do however have support specified in the StorageEngine message for engine specific options (in key value form). Expect these to be used more in the near future.

Specifying Fields is probably the most complex part of the table message. The Field message looks like this (with many embedded messages):

message Field {
    required string name = 1;
    required FieldType type = 2;
    optional FieldFormatType format = 3;
    optional FieldOptions options = 4;
    optional FieldConstraints constraints = 5;
    optional NumericFieldOptions numeric_options = 6;
    optional StringFieldOptions string_options = 7;

    optional string comment = 16; /* Reserve 0-15 for frequently accessed attributes */
    optional SetFieldOptions set_options = 17;
    optional TimestampFieldOptions timestamp_options = 18;
}

So… what does this all mean? Well, Fields have a type, they’re stored in a format, there’s options attached to them, there may be constraints as well as field type specific options.

The different field types should be fairly familiar by now:

    enum FieldType {
      DOUBLE = 0;
      VARCHAR = 1;
      BLOB = 2;
      ENUM = 3;
      INTEGER = 4;
      BIGINT = 5;
      DECIMAL = 6;
      DATE = 7;
      TIME = 8;
      TIMESTAMP = 9;
      DATETIME = 10;
    }

We also allow fields in different formats. Currently, these are default, fixed and dynamic. The idea is you can tell the engine (or the engine can tell you) how it’s storing the field. This is currently here as a nicety and the users for this are few and far between.

    enum FieldFormatType {
      DefaultFormat= 0;
      FixedFormat= 1;
      DynamicFormat= 2;
    }

The FieldOptions get interesting though:

    message FieldOptions {
      optional string default_value = 1;
      optional string update_value = 2;
      optional bool default_null = 3 [default = false];
      optional bytes default_bin_value = 4;
    }

You’ll no doubt be intrigued by the existence of both “default_value” and “default_bin_value”. Ordinarily, using a string to contain a textual representation of the default value (e.g. “foo” or “42”) is fine. However, for BLOB columns, you can have defaults that aren’t representable in a text string, you need binary data (e.g. the default value contains ‘\0’).

For TIMESTAMP columns, we continue to support DEFAULT NOW() and the ability to update the timestamp column on UPDATE. How is this represented in the table message? Well… default_value will be “NOW()” and update_value will be “NOW()”. It is intended that in the future it will be possible to have arbitrary SQL expressions for these. This does, of course, require support in the Drizzle server.

The default_null bool should be rather obvious :)

Well… that’s enough for today. Next time: more of the Field message!

Drizzle FRM replacement: the table proto

Drizzle originally inherited the FRM file from MySQL (which inherited it from UNIREG). The FRM file stores metadata about a table; what columns it has, what type those columns are, what indexes, any default values, comments etc are all stored in the FRM. In the days of MyISAM, this worked relatively well. The row data was stored in table.MYD, indexes on top of it in table.MYI and information about the format of the row was
in table.FRM. Since MyISAM itself wasn’t crash safe, it didn’t really matter if creating/deleting the FRM file along with the table was either.

As more sophisticated engines were introduced (e.g. InnoDB) that had their own data dictionary, there started to be more of a problem. There were now two places storing information about a table: the FRM file and the data dictionary specific to the engine. Even if the data dictionary of the storage engine was crash safe, the FRM file was not plugged into that, so you could end up in a situation where the storage engine
recovered from a crash okay, but the FRM was incorrect for what the engine recovered to. This would always require manual intervention to find out what went wrong and then fix it (in some rather unusual ways).

When the MySQL Cluster (NDB) engine was introduced, a new set of problems arose. Now the MySQL server was connecting to an existing database, where tables could be created on other nodes connected to the cluster. You now not only had the problems of crash recovery, but the problems of keeping the FRM files in sync across many nodes, requiring
all sorts of interesting solutions that, for the most part, do work.

The “obvious” solution to some of these problems would be for an engine to write out an FRM file itself. This is much easier said than done. The file format was never created to be read and written by multiple pieces of software, the code that did the reading and writing inside the server was not reusable elsewhere and the only documentation (that
wasn’t a decent chunk of the MySQL source tree) is the rather incomplete definition in the MySQL Internals wiki (http://forge.mysql.com/wiki/MySQL_Internals_File_Formats) – not nearly enough to write a correct FRM file as the specifics are very, very odd.

Our goals for reworking the metadata system in Drizzle were: to allow engines to own their own metadata (removing any opportunity to have inconsistencies between the engine and the ‘FRM’) and for engines without their own data dictionary, to replace the FRM file format with something simple and well documented.

One option was to use SQL as the standard storage format, but it is rather non-trivial and expensive to parse – especially if we were to use it as the preferred way of talking table definitions with storage engines. We had been looking at the protobuf library
(http://code.google.com/p/protobuf/) ever since its first release and it has a number of very nice characteristics: a description language of a data structure that is then used to generate APIs for reading and writing it in a number of programming languages and a standard (documented) way to serialize the data structure.

After a bit of discussion, we arrived at a good outline for the table definition proto. The current one can always be found in the Drizzle source tree at drizzled/message/table.proto. The current format is very close to final (i.e. one that we’ll suppport upgrades from).

The process of modifying the Drizzle code base so that it would write (and read) a file format different to the FRM isn’t worth going too much into here although there were some interesting hurdles to overcome. An interesting one was the FRM file contains a binary image of the default row for the table (which is in the row format that the server uses); we now store the default value for each column in the proto and generate the default row when we read the proto. Another interesting one was removing and refactoring “pack_flag” – the details of which should only be extracted from Jay or Stewart with a liberal application of fine ale.

The end result is that we now have storage engines that are completely responsible for their own metadata. One example is the ARCHIVE engine. In the CREATE TABLE code path, the ARCHIVE storage engine gets the table definition in an object that represents the table proto. It can examine the parameters it needs to and then either store the proto directly, or convert it into its own format. Since ARCHIVE is simple, it just stores
the table proto in a serialised form (using a standard function provided by the protobuf library) and stores it in the .ARZ data file for the table. This instantly makes the ARCHIVE storage engine crash safe for CREATE and DROP table as there is only 1 file on disk, so no two files to get out of sync.

If an engine does not have its own data dictionary, it can still use the default implementation which just stores the serialised table proto in a file on disk.

We can also now use this interface to move INFORMATION_SCHEMA into its own storage engine. This means we can remove a lot of special case code throughout the server for INFORMATION_SCHEMA and instead just have a INFORMATION_SCHEMA storage engine that says it has the following tables in the INFORMATION_SCHEMA database. Because the table definition is now in a documented format with a standard API, this becomes a relatively trivial exercise.

What we’re all looking forward to is when the InnoDB data dictionary is linked into the new interface and we can have a truly crash safe database server.

Another wonderful side effect is since we now have a standard data structure for representing a table definition, we can integrate this with the replication system. In the “near” future, we can represent a CREATE TABLE in the replication stream as a table proto and not the raw SQL. If you were wanting to apply the replication stream to a different database server, you then only have to write a table proto to SQL
converter. If the target database system doesn’t do SQL at all, you could generate API calls to create the table.

So we now have a rather flexible system in place, with the code implementing it being increasingly simple and possible to be “obviously correct”.

Things that easily fall out of this work that people have written about:
– CREATE TABLE LIKE with ENGINE clause
http://krow.livejournal.com/671235.html
– table_raw_reader – looking at the raw representation of table metadata
http://www.flamingspork.com/blog/2009/10/01/table_raw_reader-reading-the-table-proto-from-disk-and-examining-everything/
– Table discovery
http://www.flamingspork.com/blog/2009/07/29/table-discovery-for-drizzle-take-2-now-merged/

Some more info:
http://krow.livejournal.com/642329.html

No implicit defaults

See also: MySQL Bug 43151

The MySQL Manual proudly states that you don’t get implicit default values if strict mode.

mysql> set sql_mode='STRICT_ALL_TABLES';
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (a int, b int, c int, d int, e int, f int, g int, h int,i int,
primary key (a,b,c,d,e,f,g,i,h)) engine=MyISAM;
Query OK, 0 rows affected (0.03 sec)

mysql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` int(11) NOT NULL default '0',
  `b` int(11) NOT NULL default '0',
  `c` int(11) NOT NULL default '0',
  `d` int(11) NOT NULL default '0',
  `e` int(11) NOT NULL default '0',
  `f` int(11) NOT NULL default '0',
  `g` int(11) NOT NULL default '0',
  `h` int(11) NOT NULL default '0',
  `i` int(11) NOT NULL default '0',
  PRIMARY KEY  (`a`,`b`,`c`,`d`,`e`,`f`,`g`,`i`,`h`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> insert into t1 (a) values (1);
Query OK, 1 row affected (0.00 sec)

Which means your getting a default value you never specified.

In my latest Drizzle patch, we no longer give you what you didn’t ask for (an implicit default).

You can still (of course) specify an explicit default (and I’ve done this in some of our test cases).

The FRM file format

It’s fortunate that I’m watching Veronica Mars again with a mate; a more-than-you-think amount of detective work is required to understand the relationship (and format) of the TABLE_SHARE, the FRM file and HA_CREATE_INFO. Oh, also you’ll need drizzled/base.h and drizzled/structs.h and drizzled/table_share.h is also a good one to have open.

The FRM file is really a FoRM file from UNIREG (see copies of really old mysql docs around the place or even better, the links off Sheeri‘s blog post). Also, Jan has some thoughts on FRM too and Thava has a scary frmdump php script.

I have to agree completely with Jan:

  • “the internals document is missing all the interesting parts”
    I’ve just read the source. Everything in the internals doc is easily gotten from the source, so when I finally did take a close look it was “I know all this”. I can’t fault the docs team here at all – I’d place this 3rd from the bottom in priorities. In fact, it’s better to fix it than to document it.
  • ” get the hands dirty and get into the code … it got really dirty”
    Oh yeah – and it’s only gotten worse with things added to it.

It contains interesting nuggets like unireg_check (or unireg_type, depending on where you read) that does:

 enum utype  { NONE,DATE,SHIELD,NOEMPTY,CASEUP,PNR,BGNR,PGNR,YES,NO,REL,
                CHECK,EMPTY,UNKNOWN_FIELD,CASEDN,NEXT_NUMBER,INTERVAL_FIELD,
                BIT_FIELD, TIMESTAMP_OLD_FIELD, CAPITALIZE, BLOB_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

But really only the timestamp things… which should be default magic, but it’s somewhere tied in (Jay had thoughts last time I spoke to him… hopefully going away soon). A bunch of these aren’t ever used and are just relics from UNIREG. In fact… I went and removed what wasn’t needed and just ended up with:

  enum utype  { NONE,
                NEXT_NUMBER,
                TIMESTAMP_OLD_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

Which does seem a bit nicer. The  fact that TIMESTAMP_OLD_FIELD is used as in interim value is, wel, scary. At least with a smaller set of possiblities it will be easier to convert into the proto format.

A hint of a brighter future is in the comment there:

/*
    We use three additional unireg types for TIMESTAMP to overcome limitation
    of current binary format of .frm file. We'd like to be able to support
    NOW() as default and on update value for such fields but unable to hold
    this info anywhere except unireg_check field. This issue will be resolved
    in more clean way with transition to new text based .frm format.
    See also comment for Field_timestamp::Field_timestamp().
  */

Hrrm… a text based FRM? That would be much nicer to read the code for. Unfortunately, it doesn’t really exist. Some FRMs are text in MySQL, but not ones to do with tables. You can look at the FRM for a VIEW in a text editor and see the SQL quite easily (the file format is text).

So I can’t go look at any nice text based format code – it’s all uint2korr() and friends. Yes folks, this is about the only place left in the code with function names in Swedish. What does korr mean? “accurate, correct, correctly”. If you look at korr.h, you’ll see that it’s just for storing in machine independent format: low byte first.

My favourite korr functions:

  • uint3korr
    which reads 4 bytes, so remember to alloc it, initialise it or Valgrind will make you its bitch.
  • uint5korr
    err… 5 bytes of course
  • uint6korr
    6 bytes (getting the pattern now)
  • uint7korr
    which doesn’t actually exist. Nobody loves Seven – George Costanza was wrong.

It also (as Jan showed) does the whole layout on a 80 column terminal for you! This functionality is going, going gone in Drizzle and won’t be coming back.

There’s also an “empty record on start of formfile” (see make_empty_rec in unireg.cc). This bit is going to cause me some pain relatively soon. Not so much for writing something like it out (default values can be easily put in the proto) but by then constructing it on open (with some careful footing around the issue of the egg coming before the chicken).

Incidently, when discussing with Daniel Stone about this (and explaining all the weirdness) it did cause him to exclaim “omg, it’s XKB!” – so that probably helps the X hackers in the room to relate.

The biggest test in moving from FRM to proto is to only rewrite this part of the code – the TABLE_SHARE, field, Create_foo etc have sooo many bits I want to change/fix. Going down the rat-hole into an endless cycle of fixing is always a possibility. Sometimes (like with unireg_type) the cleanup lets me really discover what the code is doing, so that’s being done (but will go away “soon”).

Progress in nofrm branch

“Ban FRM Now!” branch in Launchpad

Now we’re reading part of the table information out of the proto file on disk instead of the frm.

Not everything (yet) but a bit. Good first steps. Had to fix bugs along the way as well (and find weirdness in FRM file format…).

Progress is being made.

drop table fail (on the road to removing the FRM)

So… in removing the FRM file in Drizzle, I found a bit of a nugget on how drop table works (currently in the MySQL server and now “did” in Drizzle).

If you DROP TABLE t1; this is what happens

  • open the .frm file
  • read first 10bytes (oh, and if you get EIO there, in a SELECT * FROM INFORMATION_SCHEMA.TABLES you’ll get “Error” instead of “Base Table”)
  • if (header[0] != (unsigned char) 254 || header[1] != 1 ||
    (header[2] != FRM_VER && header[2] != FRM_VER+1 &&
    (header[2] < FRM_VER+3 || header[2] > FRM_VER+4)))
    return true;
    Which means that you probably (well, should have) set your enum legacy_db_type to DB_TYPE_UNKNOWN in the caller of bool mysql_frm_type(Session *, char *path, enum legacy_db_type *dbt) otherwise you end up in some form of pain.
  • Else, *dbt= (enum legacy_db_type) (uint) *(header + 3);
    return true;                   // Is probably a .frm table

I do like the “probably”.

Oh, and on a “storage engine api” front, some places seem to expect handler::delete_table(const char* name) to return ENOENT on table not existing. In reality however:

  • int ha_heap::delete_table(const char *name)
    {
    -  int error= heap_delete_table(name);
    -  return error == ENOENT ? 0 : error;
    +  return heap_delete_table(name);
    }
  • InnoDB (note the behaviour of returning DB_TABLE_NOT_FOUND… which isn’t ENOENT)
    err = DB_TABLE_NOT_FOUND;
    ut_print_timestamp(stderr);

    fputs(“  InnoDB: Error: table “, stderr);
    ut_print_name(stderr, trx, TRUE, name);
    fputs(” does not exist in the InnoDB internaln”
    “InnoDB: data dictionary though MySQL is”
    ” trying to drop it.n”
    “InnoDB: Have you copied the .frm file”
    ” of the table to then”
    “InnoDB: MySQL database directory”
    ” from another database?n”
    “InnoDB: You can look for further help fromn”
    “InnoDB: http://dev.mysql.com/doc/refman/5.1/en/”
    “innodb-troubleshooting.htmln”,
    stderr);

  • and MyISAM would generate the error message itself, but that’s fixed with:
    -  if (my_delete_with_symlink(from, MYF(MY_WME)))
    +  if (my_delete_with_symlink(from, 0))
    return(my_errno);

and just to add to the fun, elsewhere in the code, a access(2) call on the frm file name is used to determine if the table exists or not.

The road to removing the FRM has all sorts of these weird-ass things along it. Kinda nice to be able to replace this with something better (and, hopefully – good).

But let me sum up with sql_table.cc:

“This code is wrong and will be removed, please do not copy.”

Goodbye FRM (or at least the steps to it)

Since before MySQL was MySQL, there has been the .FRM file. Of course, what it really wanted to be was “.form” -  a file that stored how to display a form on your (green) CRT. Jan blogged earlier in the year on this still being there, even in MySQL 5.1 (albeit not in any useful form).

So why do we want it to die?

Well… it’s not exactly very useful anymore.

There are a few things it’s used for….

If database/table.frm exists, the table exists (or, on Windows, you may also get databasetable.frm). This is tested in a few bits in the code by a call to access(2).

Most engines have their own data dictionaries (Innodb, PBXT, NDB, Falcon). Keeping these in sync with the FRMs can be problematic at best. This is especially true with distributed engines such as NDB.

The current solution is that on the SQL node that is creating the table, we create the FRM file, gzip it, and store it in the cluster. Then, other nodes, if they go “err… no local frm” first call ha_create_table_from_engine() which NDB will go and see if the table exists in the cluster. If so, it copies the FRM from the cluster to local disk and then the SQL server continues on its way with the standard way of opening a table (through the FRM). If you do DDL through the NDB-API (and not via SQL) then well… you get to keep both pieces.

As for if you crash during a table rename (with any engine with its own data dictionary.. e.g. InnoDB)… you again get to keep both pieces. (There is a bit of discussion on this over here)

Having FRM files also doesn’t especially lead to having multiple versions of table metadata co-existing in the server.

The fun part of reading a frm is open_binary_frm in table.cc. It reads in the frm into a TABLE_SHARE. If we only had some other way of filling out a TABLE_SHARE… one from the engine itself…

But what about any metadata that the engine data dictionary doesn’t have? For example, many server types may map to 1 engine type. An example of this is the GIS types in MySQL. For most engines, these just map to BLOBs. The engine itself has no knowledge about that, but we should fill out the table definition correctly…. so for this type of thing the engine may need to store some additional metadata. This is pretty easy for transactional engines: put it in a table! (although you then have your own problem about keeping this synchronised with any DDL). For engines that don’t have their own data dictionary, we can just provide a set of routines to store/read a frm type file (based on protobufs no doubt).

There also seems to be some entanglement with LOCK_open. Ahhh LOCK_open, the lock that nobody can possibly understand.

The tricky thing will be not rewriting every little bit from scratch all at once but rather go for the incremental bits….