Ghosts of MySQL Past, Part 7: PBXT

Recently, I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of. Also see Part 1, Part 2, Part 3, Part 4, Part 5 and Part 6. My feed feel off Planet MySQL for a bit so you may have missed those posts – so feel free to go on a trip down memory lane before returning here for, well, more of a trip down memory lane.

At the start of 2005, Paul McCullagh started working on a new transactional storage engine for MySQL. He announced it in early 2006, so the majority of initial development was done before Oracle bought InnoBase Oy.

It’s at this point I should correct myself from my Part 5 where I was talking about when Maria started – it was actually at the start of 2005 when the project started, about 10 months before InnoDB Friday. I think this was mostly small scale work at first, before months later having a larger focus on it.

But anyway, let’s talk about PBXT! It had a different architecture than InnoDB and thus could have an interesting set of performance characteristics. The shortest way to describe the original architecture is “kind of log based”. Rows are written to a log file and there’s a handle file that points to where the rows are in the log.

Also, the design had both a fixed size and a variable sized part of the row. The fixed size part was stored alongside the handle in the record data file, so it was incredibly quick at getting the fixed size part of the row.

Not having to write data twice and having really quick access to some columns gave PBXT a quite decent performance advantage for many work loads.

The BLOB Streaming feature that was also worked on (originally just for PBXT and then for any engine) was quite ahead of its time. Think HandlerSocket but think of it done more correctly. HandlerSocket is an awful binary protocol that supports very limited operations and uses two TCP ports (one for reads, one for writes). The blob streaming used HTTP, something that anyone could use and manipulate and proxy and whatever you’d want.

If we look at Paul’s Top 5 Wishes for storage engines, an engine test suite and sane APIs (or at least documented) would clearly have helped. Having to do API tracing to discover when you should start a transaction is probably not the best way to attract developers – the amount of time that could have been saved given better interfaces in the server was immense (and not just for PBXT developers).

So, why aren’t we all using PBXT now? Well… if MySQL had shipped with PBXT, it would possibly have been a different story. There was (and indeed is) a very uneven playing field for MySQL storage engines. If you’re in the main MySQL tree then you’re everywhere. If not, then you’ve got a heck of a lot of work on packaging and keeping up to date with various API changes (not to mention ABI, which since we’re talking C++ and since we’re talking MySQL, changed a lot). With InnoDB being “good enough” for many and actually being forced to improve due to challengers such as PBXT (as well as being in the tree) it just continued to have the majority of the users… and it’s not easy making money as a storage engine vendor – and developing a storage engine does cost money.

With no more than a couple of people working on it at any one time, it’s amazing that PBXT had such a big influence – and we can likely credit many InnoDB performance improvement to PBXT being so much better in some areas.

An interesting side story: I was sending some build fixes/complier warning fixes and other minimal patches to Paul when MySQL belonged to Sun and he mentioned that perhaps it was time he had a contributor agreement. I pointed out the Sun contributor agreement as an example of one that could be used, and maybe he could just replace “Sun Microsystems” with “PrimeBase” and I could submit that to the Sun system – at the very least, it would be amusing. Paul said that sounded like a great idea and I submitted Sun’s contributor agreement (which it expected outside contributors to agree to) to Sun for them to agree to.

Well… a few weeks later I got a response, and Sun wasn’t exactly keen on it – until I pointed out a few times that it was in fact their agreement and then finally it was all okay.

If your company wants people to agree to a contributor agreement: see if they’d agree to it if it was for someone else. It was an interesting exercise.

Further reading:

The MySQL Cluster storage engine

This is one close to my heart. I’ve recently written on other storage engines: Where are they now: MySQL Storage EnginesThe MERGE storage engine: not dead, just resting…. or forgotten and The MEMORY storage engine. Today, it’s the turn of MySQL Cluster.

Like InnoDB, MySQL Cluster started outside of MySQL. Those of you paying attention at home may notice a correlation between storage engines not written exclusively for MySQL and being at all successful.

NDB (for Network DataBase) started inside Ericsson, originally written in a language called PLEX, which was internal to Ericsson and used in the AXE telephone switches. Mikael Ronstrom’s PHD thesis covered NDB and even covered things that (at least were) yet to be implemented (it’s been quite a few years since I leafed through it last). The project at Ericsson (IIRC) was shelved a couple of times, but eventually got spun out into an Ericsson Business Innovation company called Alzato.

Some remnants of PLEX can still be found in the NDB source code (if you look really hard that is). At some point the code was fed through a PLEX to C++ converter and development continued from there. Some of the really, really old parts of the source may seem weird either due to this or some hand optimization for SPARC processors in the 1990s.

In 2003, MySQL AB acquired Alzato and work on a storage engine plugin for MySQL to interface to the (C++ API only) NDB was underway. Seeing as the storage engine interface was so simple, easy and modular it would only take several years for the interface to NDB to become mature.

The biggest problem: NDB itself worked really well if your workload fit exactly what it was good at… if you deviated, horrific performance and/or crashes were not as uncommon as we’d have liked. This was a source of strain for many years with the developers and support team on one side and some of the less-than-careful sales team on the other. That being said, there have been some absolutely awesome sales people selling NDB into markets it truly fits, and this is why there’s barely a place in the world where placing a mobile phone call doesn’t go through MySQL Cluster at some point.

You should read Tomas Ulin’s post Celebrating 10 years @MySQL for a bit of an insight into how Alzato became part of MySQL AB (which later became part of Sun which became part of Oracle).

I joined the MySQL Cluster team at MySQL in December 2004, not too long after Alzato was acquired, but certainly when the NDB storage engine in MySQL 4.1 was in its very early stages – it was then by no means a general purpose database.

Over the years, MySQL Cluster gained both traction and features, making it useful for more applications. One of the biggest marketing successes of MySQL was the storage engine architecture and how you could just “plug in” different engines. The reality (of course) was far different and even though MySQL Cluster did just “plug in” to MySQL, it was certainly not a drop in replacement.

In MySQL 5.0, a bunch of neat new features were added:

  • Engine condition pushdown
    This enabled conditions on non-indexed columns to be evaluated on the data nodes rather than having every row pulled up to the SQL node to be evaluated.
  • Batched read interface
    So that queries like SELECT FOO FROM BAR WHERE A IN (1,2,3) were executed as a single network round trip rather than 3 round trips.
  • Query cache
    Although the query cache should die, hey, at least it worked with NDB now…. in a way.
  • Reduced IndexMemory usage
    Remember, NDB is an in-memory database, so saving a bunch of bytes for secondary indexes was a big thing.

the first release with things I really worked on was MySQL 5.1. My first talk (to a packed room) at the MySQL User Conference in 2006 was on new features in MySQL Cluster 5.1. I’m still quite proud of that talk even though I know I am a much better speaker than I was then (It would have been great to have had more guidance… but hey, learning from experience is good too).

We added a lot in 5.1:

  • Integration with replication
    This is where row based replication was born. It was a real team effort with the NDB kernel part (going from memory and bzr logs) having been written by Tomas and Jonas seems to have a bunch of code there too. I worked a bunch on the NDB Injector thread in mysqld, Mats worked on the core row based code (at the time the most C++ like code in the entire MySQL world). You could now have a cluster replicate to another cluster with the giant bottleneck that is MySQL replication.
  • disk data
    You could store non-indexed columns on disk. I implemented the INFORMATION_SCHEMA.FILES table for this, I was young and naive enough to think that the InnoDB guys would also fill out this table and all would be happy with the world (I’m lucky I haven’t been holding my breath on this one).
  • Variable Sized columns
    A VARCHAR(255) would actually not always use more than 255bytes if you just stored a single character in it. Catch? Only for in-memory columns.
  • User defined partitioning
    Because NDB desperately needed more options, we let the user choose how they wanted to partition up their data (per table).
  • Autodiscovery of schema changes
    This was a giant workaround to the epic mess that is FRM files and data dictionary things inside the MySQL Server. It is because of all this code that when I went to rewrite the whole thing for Drizzle I took the approach of “just pass it down to the engines, the server must not attempt to know better”. FWIW, I’m still right: if the server tries to be clever you now have two places for bugs to be, not just one.
  • Distribution awareness
    i.e. better selection of which data node to talk to for a particular query, reducing latency.
  • Online add/drop index.
    How long did it take for other engines to get this? Let’s not think about that :)

After that the really interesting stuff started to happen, that is, the first major fork of MySQL: MySQL Cluster Carrier Grade Edition (CGE). Why? We had customers that simply couldn’t wait for MySQL 6.0 (after all, they’d still be waiting).

We had MySQL Cluster CGE 6.1, 6.2, 6.3 and now we’re into 7.0, 7.1 and 7.2. There is without doubt that it’s the longest serving and surviving MySQL fork. There were non-trivial changes inside the MySQL server too, which caused enough of a merge problem for the (small) Cluster team.

One big thing that you’re probably still all waiting for? Replication conflict detection and resolution in circular/multi-master replication setups. It was an NDB first and been used in production for a decent amount of time.

I remember a hack while on an airplane led to the CompressedBackup and CompressedLCP options (used zlib when writing out checkpoints/backups) – something that took more time than you’d think to go from prototype to production ready code.

The last few things I worked on in MySQL Cluster before going and working full time on Drizzle was the Windows port, online add/drop node and NDBINFO.

I’ve left out so many cool MySQL Cluster things that were worked on over the years (e.g. online add/drop column, rewriting of LCP code, micro GCPs, crash-safe DDL, the test suite). I really should mention the test suite, in lines of code it was over three times that of MyISAM.. and that was probably six years ago that I worked that out.

One thing to think about: when Innobase Oy was bought by Oracle and there was this effort to have a transactional storage engine that was inside MySQL AB rather than another company, I pointed out that I thought it would take less time adding the needed features to NDB and integrating it inside the MySQL server binary (and with the addition of online add node you could go from stand alone DB server to a full cluster with no down time) than it would for any of the alternatives to get to a suitable level of maturity.

I wish I put money on this… I put money on the MySQL 5.1 GA release date (which I was happy to loose), but in the years since you can see that InnoDB is still reigning supreme with all that came to replace it having fallen away for one reason or another. It’s still on track to have MySQL Cluster be the only real alternative (now also, funnily enough, owned by Oracle). I have to say, it’s kind of a hollow victory though, it would have been nice to see Falcon and PBXT be serious players in today’s market.

Where are they now: MySQL Storage Engines

There was once a big hooplah about the MySQL Storage Engine Architecture and how it was easy to just slot in some other method of storage instead of the provided ones. Over the years I’ve repeatedly mentioned how this wasn’t really the case and that it was remarkably non trivial.

Over the years there have been many storage engines crop up and then disappear. So… where are they now?

  • ISAM
    This became MyISAM…. you know you’ve been around MySQL a long time if you’ve ever had to deal with an ISAM table.
  • Gemini
    This was the first big test of the GPL in court. Basically, you have to obey the GPL (see wikipedia for more info). The code was released as GPL and development stopped. This has been dead since ca 2002.
  • Amira – http://launchpad.net/amira
    Antony first mentioned this in 2008 on his blog. This was a continuation of the Gemini engine, you can actually go over to launchpad and get the code. This was one of the projects to have a transactional storage engine not owned by Oracle after Innobase Oy was acquired by them. It went nowhere special as Netfrastructure was acquired which became Falcon.
  • BDB
    otherwise known as the BerkeleyDB engine. It was seldom used and never gained much of a userbase. It was unceremoniously dropped back in 2006 and both users didn’t really exist.
  • PBXT – http://pbxt.blogspot.com/
    I think we can credit PBXT with at least half of the features and performance improvements to InnoDB since it first emerged back in 2006. It got attention very quickly. Why? Because it was different. It had the very rare ability to outperform InnoDB in some places. You can still find PBXT in MariaDB, but sadly it can be hard to fund development of a MySQL storage engine, especially one as tied to MySQL as PBXT is, and it’s no longer under active development. Closely related was the Blob Streaming project which was way ahead of its time as an AlsoSQL access method. The good news is that the code was released under a BSD license in 2012 (was previously GPL). We even had PBXT in Drizzle for a while.
  • Blob Streaming (PBMS) – http://bpbdev.blogspot.com/
    This project was closely related to (but not depending exclusively on) PBXT. It embedded a HTTP server inside the database and could use it to read and write BLOBs. This was not only fairly cool but way ahead of its time. We owe the existence of both HandlerSocket and the memcached interface to InnoDB to PBMS (it was also an inspiration for the JSON server plugin for Drizzle, to address some of the use cases of the PBMS plugin).
  • Federated
    It’s still there… but is effectively unmaintained and dead. There’s even FederatedX in MariaDB which is an improvement, but still, the MySQL server really doesn’t lend itself kindly to this type of engine… it’s always been an oddity only suitable for very specific tasks.
  • Archive
    Although useful, effectively unmaintained. I kinda don’t want to say dead… but if it went away, I wouldn’t exactly be surprised.
  • CSV
    Currently used to access the log tables in MySQL… and hardly used otherwise. It’s odd that the same code doesn’t deal with SELECT INTO OUTFILE and LOAD DATA INFILE, and I doubt this will ever change. I’d say effectively niche/dead.
  • SolidDB
    Purchased by IBM, abandoned.
  • DB2
    Only ever on System i. Useful for very very few people… but you can still find it around if you’re one of them.
  • Infobright
    OMG it exists! This is probably because they’re largely just using the MySQL server as a way to implement the MySQL network protocol and all of the heavy lifting is done by their own code.
  • Xeround
    I’m quite surprised these guys are still around, as they’re a proprietary storage engine as a service, and initial testing wasn’t entirely promising.
  • TokuDB
    I cannot emphasize how much more interesting TokuDB would be if it were open source. It actually holds some promise… and with their recent work with mongo, perhaps this is a good way forward for them…
  • Maria/Aria
    Another “OMG Oracle just bought Innobase Oy” engine. This was a project to take MyISAM and turn it into a lean, mean, transactional storage engine machine. It’s still not there and I don’t think it ever will be.
  • Falcon
    This was the hot new thing. It came out of Netfrastructure, which MySQL AB acquired in order to help get a transactional storage engine after Innobase Oy was acquired by Oracle. If you’re keeping count, that’s three projects for a transactional storage engine. Falcon was the star though, receiving all the press and publicity (well before it was ready). There are many reasons why Falcon isn’t around today – the chief one probably being that Oracle bought Sun who had bought MySQL and thus a need for an “InnoDB replacement” instantly vanished. There was also immense management pressure for performance to be greater than InnoDB, without any allowance for or focus on correctness…. and this showed. This was quite disappointing as Falcon had a lot of good architectural things going for it.
  • BlitzDB – https://launchpad.net/blitzdb
    I had hoped we’d replace MyISAM with BlitzDB in Drizzle. It was a wrapper around Tokyo Cabinet to the storage engine API in Drizzle. Unfortunately, the ties to MyISAM are incredibly deep (see my recent post on internal temporary tables) and we never quite got there.

I think this is all the notable engines that were aimed at widespread adoption… what ones have I forgotten?

It’s interesting to note that only Archive, CSV, Xeround, TokuDB and Infobright can be gotten anywhere, and the latter two only in their own distribution (one proprietary) and Xeround only as a service.