Ghosts of MySQL Past, Part 7: PBXT

Recently, I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of. Also see Part 1, Part 2, Part 3, Part 4, Part 5 and Part 6. My feed fell off Planet MySQL for a bit, so you may have missed those posts – feel free to go on a trip down memory lane before returning here for, well, more of a trip down memory lane.

At the start of 2005, Paul McCullagh started working on a new transactional storage engine for MySQL. He announced it in early 2006, so the majority of the initial development was done before Oracle bought Innobase Oy.

It’s at this point I should correct myself from Part 5, where I talked about when Maria started: the project actually began at the start of 2005, about ten months before InnoDB Friday. I think it was mostly small-scale work at first, with a larger focus coming months later.

But anyway, let’s talk about PBXT! It had a different architecture from InnoDB, and thus a potentially interesting set of performance characteristics. The shortest way to describe the original architecture is “kind of log based”: rows are written to a log file, and a handle file points to where each row sits in the log.

The design also split each row into a fixed-size part and a variable-size part. The fixed-size part was stored alongside the handle in the record data file, so fetching just the fixed-size columns was incredibly quick.
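
To make the shape of that concrete, here is a minimal sketch of the idea in C++; the field names and sizes are mine for illustration, not PBXT’s actual on-disk format:

    // A toy illustration of a log-based design: the variable-size part of a
    // row lives in an append-only log, while a handle entry keeps the
    // fixed-size columns inline plus a pointer into the log.
    #include <cstdint>

    struct RowHandle {
        uint32_t log_id;              // which log file holds the variable-size part
        uint64_t log_offset;          // where in that log it starts
        uint32_t var_length;          // and how long it is
        unsigned char fixed_part[64]; // fixed-size columns, readable without
                                      // ever touching the log
    };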

Not having to write data twice, and having really quick access to some columns, gave PBXT quite a decent performance advantage for many workloads.

The BLOB Streaming feature that was also worked on (originally just for PBXT, then for any engine) was quite ahead of its time. Think HandlerSocket, but done more correctly. HandlerSocket is an awful binary protocol that supports very limited operations and uses two TCP ports (one for reads, one for writes). BLOB streaming used HTTP: something anyone could use, manipulate, proxy and do whatever they wanted with.
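
As a rough illustration of why HTTP mattered: any stock HTTP client could fetch a BLOB. This sketch uses libcurl and an invented URL; the real BLOB streaming engine’s URL scheme differed in its details:

    // Fetch a streamed BLOB with an ordinary HTTP client (libcurl) and write
    // it to stdout. The URL is invented for illustration; the point is that
    // any HTTP-speaking tool, cache or proxy could do the same.
    #include <curl/curl.h>
    #include <cstdio>

    static size_t write_cb(char *data, size_t size, size_t nmemb, void *out) {
        return fwrite(data, size, nmemb, static_cast<FILE *>(out));
    }

    int main() {
        CURL *curl = curl_easy_init();
        if (curl == nullptr)
            return 1;
        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://db-host:8080/mydb/mytable/blob/12345"); // hypothetical
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout);
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        return (rc == CURLE_OK) ? 0 : 1;
    }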

If we look at Paul’s Top 5 Wishes for storage engines, an engine test suite and sane (or at least documented) APIs would clearly have helped. Having to do API tracing to discover when you should start a transaction is probably not the best way to attract developers – the amount of time that better server interfaces could have saved was immense (and not just for PBXT developers).

So, why aren’t we all using PBXT now? Well… if MySQL had shipped with PBXT, it might have been a different story. There was (and indeed is) a very uneven playing field for MySQL storage engines. If you’re in the main MySQL tree, you’re everywhere. If not, you’ve got a heck of a lot of work to do on packaging and on keeping up with the various API changes (not to mention the ABI, which – since we’re talking C++ and since we’re talking MySQL – changed a lot). With InnoDB being “good enough” for many, being forced to improve by challengers such as PBXT, and being in the tree, it simply kept the majority of the users… and it’s not easy making money as a storage engine vendor, while developing a storage engine certainly costs money.

With no more than a couple of people working on it at any one time, it’s amazing that PBXT had such a big influence – and we can likely credit many InnoDB performance improvements to PBXT being so much better in some areas.

An interesting side story: while MySQL belonged to Sun, I was sending some build fixes, compiler warning fixes and other minimal patches to Paul, and he mentioned that perhaps it was time he had a contributor agreement. I pointed to the Sun contributor agreement as an example of one he could use – maybe he could just replace “Sun Microsystems” with “PrimeBase” and I could submit that through the Sun system; at the very least, it would be amusing. Paul thought that sounded like a great idea, so I submitted Sun’s own contributor agreement (the one it expected outside contributors to sign) to Sun, for Sun to agree to.

Well… a few weeks later I got a response, and Sun wasn’t exactly keen on it – until I pointed out a few times that it was, in fact, their own agreement, at which point it was finally all okay.

If your company wants people to agree to a contributor agreement: see if they’d agree to it if it was for someone else. It was an interesting exercise.

Further reading:

Ghosts of MySQL Past, Part 6: The engine revs

This week I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of. Also see Part 1, Part 2, Part 3, Part 4 and Part 5. My feed fell off Planet MySQL for a bit, so you may have missed those posts.

Netfrastructure was many things besides just a database engine, but the storage engine was the part that interested MySQL. One interesting feature of Netfrastructure was its inbuilt Java VM, which meant you could do all sorts of neat tricks right next to the data. And with Jim and Ann coming from much more feature-complete databases, some parts of MySQL were going to be a bit of a shock.

There was still work to be done on the Falcon engine itself; it wasn’t completely finished and ready to just drop in, even though you may have heard things like “to be completed later this year”.

There was also another experiment to see if MySQL could get a storage engine that wasn’t owned by somebody else: resurrecting Gemini. Remember, it was GPL licensed by then, and it could conceivably be updated to the latest MySQL versions. You can go and see the code for the Amira project on Launchpad. This went nowhere, but it could have been an interesting story had the project been given more resources; in the end, it was Falcon that got all the attention and effort.

This was going to be the ultimate triumph of the Pluggable Storage Engine Architecture, something that had been touted in presentations and marketing material for years. What really happened? Well, it took NDB (now MySQL Cluster) somewhere between two and four years before you could really say it was stable and free of many “obvious” bugs/unintended deviations in behavior. Would Falcon be any different?

In reality, the Falcon project unearthed all of the warts of the storage engine interface. There were many exchanges of “Is it really like that?”, “Yes, yes it is, sorry.”

The APIs (and I use the term loosely) between a storage engine and the database server were originally used to plug in ISAM and then MyISAM. The first transactional engine (InnoDB) required a bunch of hacks to work, and the API was never cleaned up. When the third transactional engine came along (MySQL Cluster) – which was also a distributed engine – there were again no real fixes in the API, just more of the same workarounds and oddness. So by the time Falcon came along, if you were lucky the API was merely not what you’d expect. There were many, many situations where not only was the official documentation inaccurate, but you had to commit several gross layering violations just to get something rather fundamental to a transactional database working.

Arguably the worst thing was foreign keys, a requirement for the Falcon project. You had to implement your own SQL parser just to get any information about foreign keys. Let’s not even get started on auto_increment, whose correct behavior is pretty much completely undocumented.
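
To give a flavor of what that meant in practice, here is a deliberately crude sketch (entirely mine) of the kind of string scanning an engine author was pushed towards. Quoted identifiers, comments and multi-column keys all defeat it, which is exactly why a proper parser was needed:

    // A toy scan for FOREIGN KEY clauses in a CREATE TABLE statement. No
    // real engine could get away with this -- it mishandles quoted
    // identifiers, comments and multi-column keys -- but it shows why
    // engines ended up embedding their own SQL parsers.
    #include <string>
    #include <vector>

    std::vector<std::string> find_foreign_key_clauses(const std::string &create_sql) {
        std::vector<std::string> clauses;
        std::string::size_type pos = 0;
        while ((pos = create_sql.find("FOREIGN KEY", pos)) != std::string::npos) {
            // Naively assume the clause ends at the next comma; a clause like
            // FOREIGN KEY (a, b) REFERENCES t (a, b) already breaks this.
            std::string::size_type end = create_sql.find(',', pos);
            clauses.push_back(create_sql.substr(
                pos, end == std::string::npos ? std::string::npos : end - pos));
            pos += sizeof("FOREIGN KEY") - 1;
        }
        return clauses;
    }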

Next, we’ll look at PBXT.

The MEMORY storage engine

I recently wrote about Where are they now: MySQL Storage Engines and The MERGE storage engine: not dead, just resting…. or forgotten. Today, it’s the turn of the MEMORY storage engine – otherwise known as HEAP.

This is yet another piece of the MySQL server that sits largely unmaintained and unloved. The MySQL Manual even claims that it supports encryption… with the caveat that you have to use the SQL functions for encryption/decryption rather than anything in the engine itself (so, basically, it supports encryption about as much as every other engine does).

The only “recent” innovation in the MEMORY engine was the dynamic row patch that ended up making its way into Percona Server (and isn’t enabled by default). This forced me to go and look at the code of the MEMORY engine again and I cannot possibly drink enough in my lifetime to erase the memory.

The MEMORY engine is used by just about everybody, since you probably have a SQL query somewhere that uses an in-memory temporary table. I can, however, feel the comments being added to this post right now by people who use gdb to set server variables so that not a single query on their systems uses MEMORY… (IIRC there have been some patches around that would throw an error rather than create a temporary table.)

We had an early version of the dynamic row format patch in Drizzle for a while… and if you turned it on, all sorts of things broke horrifically. It was a remarkably non-trivial amount of work to get that code working properly, which is largely a testament to the “design” of the MEMORY engine.

While it may be efficient or fast or something (likely on 1990s hardware and workloads), it misses the boat completely on the things that matter today: concurrent access, MVCC, BLOB/TEXT columns and transactions. Basically, it’s an engine that’s really only useful to a single connection in limited use cases… and even then, it’s likely a good way to ruin things. MyISAM is better, as at least under memory pressure things may be written out to disk sensibly… and if InnoDB had a “don’t log this table” mode, it would beat the absolute pants off it.

It is, again, another part of the MySQL server that’s remarkably hard to pull out and replace with something different/better. Why? Well, I wrote about it before: Refactoring Internal temporary tables (another stab at it). If it was easy, we’d likely have Tokyo Cabinet (via BlitzDB) or similar (some bit of code maintained by other people) doing the same job in Drizzle rather than this large chunk of code that nobody really cares about.

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can treat this buffer blindly as a single value only for non-nullable indexes over a single column.

The basic format is this (a decoding sketch follows the list):

  • The buffer is in index order: the first column in the index comes first in the buffer, the second column second, and so on.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, the first byte is set to 1 if the value is NULL and 0 if it is not.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed-length fields (such as INT), the next max-field-length bytes hold that field.
  • For VARCHAR, there is always a 2-byte (little-endian) length. This differs from the row format, which may use 1 or 2 bytes; in the key tuple format it is ALWAYS two bytes.
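
Putting those rules together, decoding a single nullable VARCHAR key part might look like the sketch below. The maximum part length is hard-coded here for illustration (a real engine reads it from the server’s key part metadata), and the two-byte length read assumes a little-endian host:

    // Decode one nullable VARCHAR key part from the key tuple buffer,
    // following the rules above: one NULL-flag byte, then a 2-byte
    // little-endian length, then the value zero-filled out to its maximum
    // length (so the server can memcmp whole key values).
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    static const size_t MAX_PART_LEN = 255; // illustrative; really comes from
                                            // the server's key part metadata

    // Returns the total number of bytes this key part occupies in the
    // buffer. On SQL NULL, *value is set to nullptr.
    size_t decode_varchar_key_part(const unsigned char *buf,
                                   const char **value, uint16_t *value_len) {
        size_t pos = 0;
        const bool is_null = (buf[pos++] == 1); // NULL-flag byte
        uint16_t len;
        memcpy(&len, buf + pos, sizeof(len));   // always 2 bytes for VARCHAR
        pos += sizeof(len);
        if (is_null) {
            *value = nullptr;                   // the 2 length bytes are 0
            *value_len = 0;
        } else {
            *value = reinterpret_cast<const char *>(buf + pos);
            *value_len = len;
        }
        return pos + MAX_PART_LEN;              // value region padded to max
    }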

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).