Ghosts of MySQL Past: Part 2

This continues from my post yesterday and also contains content from my linux.conf.au 2014 talk (view video here).

Way back in May in the year 2000, a feature was added to MySQL that would keep many people employed for many years – replication. In 3.23.15 you could replicate from one MySQL instance to another. This is commonly cited as the result of two weeks of work by one developer. The idea is simple: create a log of all the SQL queries that modify the database and then replay them on a slave. Remember, this is before there was any real concurrency and everything was ISAM or MyISAM, so this worked (for certain definitions of worked).

The key things to remember about MySQL replication are: it was easy to use, it was easy to set up and it was built into the MySQL Server. This is why it won. You have to fast-forward to September 2010 before PostgreSQL caught up! It was only with PostgreSQL 9.0 that you could have queryable read-only slaves with stock standard PostgreSQL.
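For a taste of just how easy “easy to set up” was: a couple of lines of config and a couple of statements. The sketch below uses made-up host names and credentials, and the modern spelling of the commands rather than exactly what 3.23 accepted:

  # master my.cnf: give the server an id and turn on the binary log
  #   [mysqld]
  #   server-id = 1
  #   log-bin   = mysql-bin
  # slave my.cnf: just needs its own unique server-id
  mysql -h slave.example.com -e "
    CHANGE MASTER TO
      MASTER_HOST='master.example.com',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=4;
    START SLAVE;"

That was it – no external tools, no extra packages, nothing to compile.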

If you want to know why MySQL was so much bigger than PostgreSQL, this built-in and easy-to-use replication was a huge reason. There is the age of a decent scotch between read-only slaves for MySQL and PostgreSQL (although I don’t think I’ve ever pointed that out to my PostgreSQL friends when having scotch with them… I shall have to!)

In 2001, when space was an odyssey, the first GA (General Availability) release of MySQL 3.23 hit the streets (quite literally, this was back in the day of software that came in actual physical boxes, so it quite probably was literally hitting the streets).

For a good piece of trivia, it’s 3.23.22-beta that is the first release in the current bzr tree, which means that it was around this time that BitKeeper first came into use for MySQL source code.

We also saw the integration of InnoDB in 2001. What was supremely interesting is that the transactional storage engine was not from MySQL AB, it was from Innobase Oy. The internals of the MySQL server were certainly not set up for transactions, and for many years (in fact, to this day) we talk about how a transactional engine was shoehorned in there. Every transactional engine since has had to do the same odd things to, say, find out when a transaction was being started. The exception here is in Drizzle, where we finally cleaned up a bunch of this mess.

Having a major component of the MySQL server owned and controlled by another company was an interesting situation, and one that would prove interesting in a few years time.

We also saw Mårten Mickos become CEO in 2001, a role he would hold through the Sun acquisition – an acquisition that definitively proved that you can build an open source company and sell it for a *lot* of money. It was also the year that saw MySQL AB accept its first round of VC funding, and this would (of course) have interesting implications: some good, some less ideal.

(We’ll continue tomorrow with Part 3!)

Past, Present and Future of MySQL and variants Part 1: Ghosts of MySQL Past

You can watch the video of my linux.conf.au 2014 talk here: http://mirror.linux.org.au/linux.conf.au/2014/Wednesday/28-Past_Present_and_future_of_MySQL_and_variants_-_Stewart_Smith.mp4

But let’s talk about things in blog form rather than video form :)

Back in 1979, there was UNIREG. A text UI to records (rows) in a database (err, table). The reason I mention UNIREG is that it had FoRMs, which, as you may have guessed from my capitalization, is where the FRM file comes from.

In 1986, UNIREG came to UNIX. That’s right kids, the 80×24 VT100 interface to ISAM (Indexed Sequential Access Method – basically rows are written in insert order and indexes point to them) came to UNIX. There was no generic query language, just FoRMs and reports. In fact, to this day, that 80×24 text interface is stored in the FRM file by MySQL and never ever used (I’ve written about this before).

Then there was this mSQL thing around the 1990s, which was a small SQL server (with source) but not FOSS. Originally, Monty W plugged in his ISAM engine but it wasn’t quite the right fit… so in 1995, we had MySQL 1.0 and MySQL AB was founded.

Fast forward a bit and in 1996 we had MySQL 3.19 and development continued. It managed to gain features, performance, ports to different operating systems and CPU architectures and, of course, stability.

It wasn’t until the year 2000 that MySQL adopted the GPL. This turned out to be a huge step in the right direction for increased adoption. At the time it was also a massive risk, essentially betting all of the company’s revenue on making the software more free.

This was the birth of the dual licensing business model. You see, the client library (libmysql) was also GPL, which meant it was easy to use if your application was also GPL, but if you were going to distribute your application and it wasn’t under a GPL compatible license (there was also a FOSS exception so that things like PHP could use it) then you needed a license.

Revenue from licensing was to be significant throughout the entire history of MySQL AB.

(We’ll continue this in part 2 tomorrow)

and now for something completely different…

As many of you know, I’ve been working in the MySQL world for quite a while now. In fact, it was nearly 10 years ago when I first started hacking on MySQL Cluster at MySQL AB.

Most recently, I was at Percona, which was a wonderful journey: over my nearly three years there, the company at least doubled in size, launched several new software products and greatly improved the quality and frequency of its releases.

However, the time has come for something completely different. The MySQL world is rather mature, the future of Percona software is bright and, well, I could do with poking into something rather different.

So a couple of weeks ago I started at IBM in the Linux Technology Centre working on KVM on POWER and related things. No doubt there’ll be interesting things to blog about as time goes on, but it’s about time I posted my change of employment :)

Converting MySQL trees to git

I have put up a set of scripts on github: https://github.com/stewartsmith/bzr-to-git-conversion-scripts. Why do I need these? Well… if only bzr fast-export|git fast-import worked flawlessly for large, complex and old trees. It doesn’t.

Basically, when you clone this repo you can run “./sync-BLAH.sh” and it’ll pull the BZR trees for the project, convert them to git and clean things up a bit. You will likely have to edit the sync-BLAH.sh scripts as I have them pointed at branches on my own machine (to speed up the process – not having to pull fresh BZR branches of MySQL trees over the network is a feature; it’s never been fast). You’ll also want to edit the git remotes to point where you want the git trees to end up.

I’ve done it for:

What problems did I hit? Well… the first is performance: things are slow unless you tweak a bunch of knobs, and even then it’s merely rather slow instead of painfully slow. So in the empty git repo I set core.compression=1, which makes zlib a whole lot faster.

I naturally give the correct incantation to bzr fast-export to munge tag names appropriately, set a git branch name (each BZR branch ends up as a git branch) and use a marks file (this speeds up incremental syncs).

For one of the branches I was importing, BZR had allowed the invalid committer of “billy-earney billy.earney@gmail.com\n <>” – yes, a newline in the committer. This messes up the fast-import format, so I have to run the entire fast-export output through sed to clean it up.

We then use bzr fast-import-filter to apply a user map – which is me looking at the appropriate committers and cleaning them up so that we get better attribution in the resulting git trees, as well as cleaning up some errors in the bzr tree so that Git likes them (most notably, a missing < or > – not both – around email addresses). The user map is fairly Percona specific, but there’s at least one or two entries for Oracle committers too.

Next, I pass the output through pv(1) to do two things: monitor the output to see that it’s still going, and to have a transfer buffer so that git fast-import doesn’t stall waiting for input – amazingly enough, this gave a decent boost to import speed.

Finally, when we’re done doing the import of all of the revisions for all of the bzr branches, if this is our first run, we set the HEAD ref to the last BZR branch name and then do a git repack. Through experimentation, I’ve found that "git repack -AdfF --depth=100 --window=500" is what gives me the smallest size possible.
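Putting all of that together, one sync run has roughly this shape. This is a sketch only: the flag spellings are from memory (check bzr fast-export --help and bzr fast-import-filter --help), the paths are made up, and fixup.sed stands in for the newline-in-committer cleanup mentioned above:

  cd mysql.git
  git config core.compression 1    # cheaper zlib, much faster imports

  bzr fast-export --rewrite-tag-names --git-branch=mysql-5.6 \
      --export-marks=../marks.bzr ../bzr/mysql-5.6 |
    sed -f fixup.sed |             # strip the broken committer line
    bzr fast-import-filter --user-map=../user.map - |
    pv |                           # progress meter + pipe buffer
    git fast-import

  # first run only: point HEAD somewhere sensible, then repack hard
  git symbolic-ref HEAD refs/heads/mysql-5.6
  git repack -AdfF --depth=100 --window=500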

My lca2014 talk video: Past, Present and Future of MySQL and variants

Last Wednesday morning I gave my talk at linux.conf.au 2014. You can now view and download the recording of it here:

http://mirror.linux.org.au/linux.conf.au/2014/Wednesday/28-Past_Present_and_future_of_MySQL_and_variants_-_Stewart_Smith.mp4

(hopefully more free formats will come soon – the all-volunteer AV team has been absolutely amazing getting things up this quickly).

Hong Kong (OpenStack Summit)

I’ll be in Hong Kong for the upcoming OpenStack Summit Nov 5-8. I’d be thrilled to talk database things with others present, especially around Trove DBaaS (DataBase as a Service) and high availability MySQL for OpenStack deployments.

I was last in Hong Kong in 2010 when I worked for Rackspace. The closest office to me was in Hong Kong, so that’s where I did my HR onboarding training. I remember telling friends on the Sunday night before leaving that I might be able to make dinner later in the week, depending entirely on whether somebody got back to me about whether I was going to Hong Kong that week. I was, and I went. I took some photos while there.

Walking from the hotel where we were staying to the Rackspace office could be done pretty much entirely through buildings without going outside. There were bits of art around too, which is just kind of awesome – I’m always in favour of random art.
Statues in walkways

The photo below was the view from my hotel room. The OpenStack summit is just by the airport rather than in the middle of town, so the views will be decidedly different to this, but still probably quite spectacular if you’re in the right place (I plan to take camera gear, so shout if you want to journey too).
Hotel Window (Hong Kong)

There are some pretty awesome markets around Hong Kong offering just about everything you’d want, including a lot just out on the street.
Java Road
Hong Kong Street Market

Nighttime was pretty awesome; having people from around the world journey out into the night was great.
Rackers walking Hong Kong at Night

I was there during the World Cup, and the streets were wonderfully decorated. I’m particularly proud of this photo as it was handheld, at night, after beer.
Hong Kong streetlife

Who is working on MariaDB 10.0?

There was some suggestion after my previous post (Who works on MariaDB and MySQL?) that I look at MariaDB 10.0 – so I have. My working was very simple, in a current MariaDB 10.0 BZR tree (somewhat beyond 10.0.3), I ran the following command:

bzr log -n0 -rtag:mariadb-10.0.0..|egrep '(author|committer): '| \
  sed -e 's/^\s*//; s/committer: //; s/author: //'| \
  sort -u|grep -iv oracle


MariaDB Foundation/Monty Program/SkySQL:

  1. Alexander Barkov
  2. Alexey Botchkov
  3. Daniel Bartholomew
  4. Elena Stepanova
  5. Igor Babaev
  6. Jani Tolonen
  7. knielsen
  8. Michael Widenius
  9. sanja
  10. Sergei Golubchik
  11. Sergey Petrunya
  12. Sergey Vojtovich
  13. timour
  14. Vladislav Vaintroub

Elsewhere:

  1. Kentoku SHIBA (4 commits)
  2. Lixun Peng (1 commit)
  3. Olivier Bertrand (212 commits)

From Oracle (i.e. revisions merged from Oracle MySQL):

  • 81 names (which I won’t list here as 81 is a lot)

The results are no different if you go back to the first revision that differs between MariaDB 5.5 and 10.0 (found using bzr missing). Even when grepping through the bzr log for things such as “patch by”, “contribution” or “originally”, I can only find one or two more names as original authors of patches (about the same as I can for patches going into the Oracle tree).

Please point me to revisions (revid is best way) that come from outside contributors as then I really can update this to show that there’s a larger developer community.

The current development version of Drizzle (7.2) has just as many contributors as the MariaDB development version (10.0) – although Drizzle does have fewer commits.

nanomysql – tiny MySQL client lib

I recently got pointed towards https://github.com/shodanium/nanomysql/ which is a tiny (less than 400 lines of C++) MySQL client library that is GPL licensed.

If you need to link into non-GPL-compatible code, there is the (slightly larger and more fully featured) libdrizzle library. But if you want something *tiny* and are okay with the GPL, then nanomysql may be something to look at.

Who works on MariaDB and MySQL?

Looking at the committers/authors of patches in the bzr tree for MariaDB 5.5.31.

Non-Oracle contributors:

  1. Alexander Barkov
  2. Alexey Botchkov
  3. Elena Stepanova
  4. Igor Babaev
  5. knielsen
  6. Michael Widenius
  7. sanja
  8. Sergei Golubchik
  9. Sergey Petrunya
  10. timour
  11. Vladislav Vaintroub

Oracle (as they pull Oracle changes):

  1. Aditya A
  2. Akhila Maddukuri
  3. Alexander Nozdrin
  4. Anirudh Mangipudi
  5. Annamalai Gurusami
  6. Astha Pareek
  7. Balasubramanian Kandasamy
  8. Chaithra Gopalareddy
  9. Daniel Fischer
  10. Gleb Shchepa
  11. Harin Vadodaria
  12. Hery Ramilison
  13. Igor Solodovnikov
  14. Inaam Rana
  15. Jon Olav Hauglid
  16. kevin.lewis
  17. Krunal Bauskar
  18. Marc Alff
  19. Marko Mäkelä
  20. Mattias Jonsson
  21. Murthy Narkedimilli
  22. Neeraj Bisht
  23. Nisha Gopalakrishnan
  24. Nuno Carvalho
  25. Olav Sandstaa
  26. Pedro Gomes
  27. prabakaran thirumalai
  28. Praveenkumar Hulakund
  29. Ravinder Thakur
  30. Satya Bodapati
  31. sayantan.dutta
  32. Shivji Kumar Jha
  33. Sujatha Sivakumar
  34. Sunanda Menon
  35. Sunny Bains
  36. Thayumanavar
  37. Tor Didriksen
  38. Venkata Sidagam
  39. Venkatesh Duggirala
  40. Yasufumi Kinoshita

Observations:

  1. All the non-Oracle contributors work for SkySQL (and worked for Monty Program before that)
  2. Even when you go back to MariaDB 5.5.23 I can only find evidence for a maximum of 2-3 external contributions of code to MariaDB since then.
  3. In the same time frame (5.5.23-5.5.32) I see 1 or 2 going into Oracle trees, so it’s roughly the same.
  4. If you look at the contributors from Oracle over 5.5.23 to 5.5.32 there are closer to twice as many as the 40 listed above.

Somebody please correct me if I’m wrong here… perhaps the MariaDB guys are just really bad at clearly marking commits that come from elsewhere? I’ve looked for “patch.*by”, “original” and “ontributed” (the missing first letter is deliberate, to match both cases) and only turned up the above.

Are MariaDB tests adding anything extra over Oracle MySQL tests?

I grabbed all the tests introduced in MariaDB 5.5.32 (i.e. "bzr diff -rtag:mariadb-5.5.31..mariadb-5.5.32 mysql-test/" and some foo) and threw them in their own test file. I only kept tests for crashing bugs and ignored those that required plugins (there were two or three, but nothing major). So now I have a test file that should crash MariaDB 5.5.31 and probably before. But, the question is: does this crash Percona Server or MySQL?
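For the curious, the “some foo” was roughly this (a sketch – the tree layout, file names and test name are all made up):

  # collect the tests added between the two releases
  cd maria-5.5
  bzr diff -rtag:mariadb-5.5.31..tag:mariadb-5.5.32 mysql-test/ > new-tests.diff
  # hand-pick the crashing-bug tests out of the diff into one .test file,
  # drop it into the tree under test, and run it:
  cp crashing_bugs.test ../percona-server/mysql-test/t/
  cd ../percona-server/mysql-test
  perl mysql-test-run.pl crashing_bugs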

While it is excellent to see the MariaDB guys including tests for their crashing bugs, are these MariaDB specific or do they affect other MySQL flavours?

I built a release build of top-of-trunk Percona Server and ran the test against it. I got no crashes. In a debug build, I got two. One was to do with REPAIR on an ARCHIVE table and the other was "SELECT UNIX_TIMESTAMP(STR_TO_DATE('2020','%Y'));". I found the same thing for a debug build of top-of-tree MySQL.

All the other tests for crashing bugs – of which there were 14 – were MariaDB specific. So, out of 16 total, only 2 applied to Percona Server and MySQL.

An old note on the Storage Engine API

Whenever I stick my head into the MySQL storage engine API, I’m reminded of a MySQL User Conference from several years ago now.

Specifically, I’m reminded of a slide from an early talk at the MySQL User Conference by Paul McCullagh describing developing PBXT. For “How to write a Storage Engine for MySQL”, it went something like this:

  1. Develop basic INSERT (write_row) support – INSERT INTO t1 VALUES (42)
  2. Develop full table scan (rnd_init, rnd_next, rnd_end)  – SELECT * from t1
  3. If you’re sane, stop here.

A lot of people stop at step 3. It’s a really good place to stop too. It avoids most of the tricky parts that are unexpected, undocumented and unlogical (yes, I’m inventing words here).

MySQL vs Drizzle plugin APIs

There’s a big difference in how plugins are treated in MySQL and how they are treated in Drizzle. The MySQL way has been to create a C API in front of the C++-like (I call it C- as it manages to take the worst of both worlds) internal “API”. The Drizzle way is to have plugins be first class citizens and use exactly the same API as if they were inside the server.

This means that MySQL attempts to maintain API stability. This isn’t something worth trying for. Any plugin that isn’t trivial quickly surpasses what is exposed via the C API and has to work around it, or, it’s a storage engine and instead you have this horrible mash of C and C++. The byproduct of this is that no core server features are being re-implemented as plugins. This means the API is being developed in a vacuum devoid of usefulness. At least, this was the case… The authentication plugin API seems to be an exception, and it’s interesting to note that semisync replication is in fact a plugin.

So times may be changing… sort of. Yesterday I noted that some storage engine API features are only available if you’re InnoDB and I’ve voiced my general disappointment in the audit API being unsuitable to implement various forms of query logging already in the server (general query log, slow query log).

One thing to note: when the API is the same for both inside the server and a plugin, it makes initial refactoring very easy, and you quickly see the bits that could be improved.

The EXAMPLE storage engine

The Example storage engine is meant to serve mainly as the stub of a storage engine, for example purposes only (or so the code comment at the start of ha_example.cc reads). In reality, however, it’s not very useful. It likely was back in 2004, when it could be used as a starting point for some simple new engines (my guess would be that more than a few of the simpler engines started from ha_example.cc).

The sad reality is that the complex and non-obvious bits of the storage engine API you actually care about are really only documented by the implementations in ha_ndbcluster.cc, ha_myisam.cc and ha_innodb.cc. If you’re doing something that isn’t already done by one of those three engines: good luck.

Whenever I looked at ha_example.cc I always wished there was something more behind it… basically hoping that InnoDB would get a better and cleaner API with the server and would use that rather than the layering violations it has to do the interesting stuff.

That all being said, as a starting point, it probably helped spawn at least a dozen storage engines.

The ARCHIVE Storage Engine

I wonder how much longer the ARCHIVE storage engine is going to ship with MySQL… I think I’m the last person to actually fix a bug in it, and that was, well, a good number of years ago now. It was created to solve a simple problem: write once, read hardly ever. Useful for logs and the like. A zlib stream of rows in a file.

You can actually easily beat ARCHIVE for INSERT speed with a non-indexed MyISAM table, and with things like TokuDB around you can probably get pretty close to its compression while at the same time having these things known as “indexes”.

ARCHIVE for a long time held this niche though and was widely and quietly used (and likely still is). It has the great benefit of being fairly lightweight – it’s only about 2500 lines of code (1130 if you exclude azio.c, the slightly modified gzio.c from zlib).

It also uses the table discovery mechanism that NDB uses. If you remove the FRM file for an ARCHIVE table, the ARCHIVE storage engine will extract the copy it keeps and put it back. You can also do consistent backups with ARCHIVE as it’s an append-only engine. The ARCHIVE engine was certainly the simplest example code of this and a few other storage engine API things.
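Table discovery is easy to see in action on a scratch server (a sketch only – the datadir path is made up, and please don’t try this anywhere you care about):

  mysql -e "CREATE TABLE test.t1 (a INT) ENGINE=ARCHIVE"
  mysql -e "INSERT INTO test.t1 VALUES (42)"
  mysql -e "FLUSH TABLES"              # make sure t1 isn't cached open
  rm /var/lib/mysql/test/t1.frm        # remove the table definition
  mysql -e "SELECT * FROM test.t1"     # works: ARCHIVE puts the FRM back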

I’d love to see someone compare storage space and performance of ARCHIVE against TokuDB and InnoDB (hint hint, the Internet should solve this for me).

The MySQL Cluster storage engine

This is one close to my heart. I’ve recently written on other storage engines: Where are they now: MySQL Storage Engines, The MERGE storage engine: not dead, just resting…. or forgotten and The MEMORY storage engine. Today, it’s the turn of MySQL Cluster.

Like InnoDB, MySQL Cluster started outside of MySQL. Those of you paying attention at home may notice a correlation between storage engines not written exclusively for MySQL and being at all successful.

NDB (for Network DataBase) started inside Ericsson, originally written in a language called PLEX, which was internal to Ericsson and used in the AXE telephone switches. Mikael Ronstrom’s PhD thesis covered NDB and even covered things that (at least at the time) were yet to be implemented (it’s been quite a few years since I leafed through it last). The project at Ericsson (IIRC) was shelved a couple of times, but eventually got spun out into an Ericsson Business Innovation company called Alzato.

Some remnants of PLEX can still be found in the NDB source code (if you look really hard that is). At some point the code was fed through a PLEX to C++ converter and development continued from there. Some of the really, really old parts of the source may seem weird either due to this or some hand optimization for SPARC processors in the 1990s.

In 2003, MySQL AB acquired Alzato and work on a storage engine plugin for MySQL to interface to the (C++ API only) NDB was underway. Seeing as the storage engine interface was so simple, easy and modular it would only take several years for the interface to NDB to become mature.

The biggest problem: NDB itself worked really well if your workload fit exactly what it was good at… if you deviated, horrific performance and/or crashes were not as uncommon as we’d have liked. This was a source of strain for many years with the developers and support team on one side and some of the less-than-careful sales team on the other. That being said, there have been some absolutely awesome sales people selling NDB into markets it truly fits, and this is why there’s barely a place in the world where placing a mobile phone call doesn’t go through MySQL Cluster at some point.

You should read Tomas Ulin’s post Celebrating 10 years @MySQL for a bit of an insight into how Alzato became part of MySQL AB (which later became part of Sun which became part of Oracle).

I joined the MySQL Cluster team at MySQL in December 2004, not too long after Alzato was acquired, but certainly when the NDB storage engine in MySQL 4.1 was in its very early stages – it was then by no means a general purpose database.

Over the years, MySQL Cluster gained both traction and features, making it useful for more applications. One of the biggest marketing successes of MySQL was the storage engine architecture and how you could just “plug in” different engines. The reality (of course) was far different and even though MySQL Cluster did just “plug in” to MySQL, it was certainly not a drop in replacement.

In MySQL 5.0, a bunch of neat new features were added:

  • Engine condition pushdown
    This enabled conditions on non-indexed columns to be evaluated on the data nodes rather than having every row pulled up to the SQL node to be evaluated (there’s a quick illustration just after this list).
  • Batched read interface
    So that queries like SELECT FOO FROM BAR WHERE A IN (1,2,3) were executed as a single network round trip rather than 3 round trips.
  • Query cache
    Although the query cache should die, hey, at least it worked with NDB now…. in a way.
  • Reduced IndexMemory usage
    Remember, NDB is an in-memory database, so saving a bunch of bytes for secondary indexes was a big thing.
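As promised above, a quick illustration of what engine condition pushdown buys you. This is a sketch – the table and column are made up, and the variable spelling is the 5.0/5.1-era one:

  mysql -e "SET engine_condition_pushdown = ON"
  mysql -e "EXPLAIN SELECT * FROM t1 WHERE non_indexed_col = 42\G"
  # with pushdown on, the Extra column reads
  #   Using where with pushed condition
  # instead of plain 'Using where' - the filtering happens on the
  # data nodes rather than in mysqld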

The first release with things I really worked on was MySQL 5.1. My first talk (to a packed room) at the MySQL User Conference in 2006 was on new features in MySQL Cluster 5.1. I’m still quite proud of that talk, even though I know I am a much better speaker now than I was then (it would have been great to have had more guidance… but hey, learning from experience is good too).

We added a lot in 5.1:

  • Integration with replication
    This is where row-based replication was born. It was a real team effort, with the NDB kernel part (going from memory and bzr logs) having been written by Tomas; Jonas seems to have a bunch of code there too. I worked a bunch on the NDB injector thread in mysqld, and Mats worked on the core row-based code (at the time the most C++-like code in the entire MySQL world). You could now have a cluster replicate to another cluster, with the giant bottleneck that is MySQL replication.
  • disk data
    You could store non-indexed columns on disk. I implemented the INFORMATION_SCHEMA.FILES table for this, I was young and naive enough to think that the InnoDB guys would also fill out this table and all would be happy with the world (I’m lucky I haven’t been holding my breath on this one).
  • Variable Sized columns
    A VARCHAR(255) would no longer always consume the full 255 bytes if you just stored a single character in it. The catch? Only for in-memory columns.
  • User defined partitioning
    Because NDB desperately needed more options, we let the user choose how they wanted to partition up their data (per table).
  • Autodiscovery of schema changes
    This was a giant workaround to the epic mess that is FRM files and data dictionary things inside the MySQL Server. It is because of all this code that when I went to rewrite the whole thing for Drizzle I took the approach of “just pass it down to the engines, the server must not attempt to know better”. FWIW, I’m still right: if the server tries to be clever you now have two places for bugs to be, not just one.
  • Distribution awareness
    i.e. better selection of which data node to talk to for a particular query, reducing latency.
  • Online add/drop index.
    How long did it take for other engines to get this? Let’s not think about that :)

After that the really interesting stuff started to happen, that is, the first major fork of MySQL: MySQL Cluster Carrier Grade Edition (CGE). Why? We had customers that simply couldn’t wait for MySQL 6.0 (after all, they’d still be waiting).

We had MySQL Cluster CGE 6.1, 6.2, 6.3 and now we’re into 7.0, 7.1 and 7.2. It is, without doubt, the longest-serving and longest-surviving MySQL fork. There were non-trivial changes inside the MySQL server too, which caused enough of a merge problem for the (small) Cluster team.

One big thing that you’re probably still all waiting for? Replication conflict detection and resolution in circular/multi-master replication setups. It was an NDB first and has been used in production for a decent amount of time.

I remember a hack while on an airplane led to the CompressedBackup and CompressedLCP options (used zlib when writing out checkpoints/backups) – something that took more time than you’d think to go from prototype to production ready code.

The last few things I worked on in MySQL Cluster before going and working full time on Drizzle was the Windows port, online add/drop node and NDBINFO.

I’ve left out so many cool MySQL Cluster things that were worked on over the years (e.g. online add/drop column, rewriting of the LCP code, micro GCPs, crash-safe DDL, the test suite). I really should mention the test suite: in lines of code it was over three times the size of MyISAM… and it was probably six years ago that I worked that out.

One thing to think about: when Innobase Oy was bought by Oracle and there was an effort to have a transactional storage engine that was inside MySQL AB rather than another company, I pointed out that I thought it would take less time to add the needed features to NDB and integrate it inside the MySQL server binary (and with the addition of online add node, you could go from a standalone DB server to a full cluster with no downtime) than it would for any of the alternatives to get to a suitable level of maturity.

I wish I had put money on this… I did put money on the MySQL 5.1 GA release date (which I was happy to lose), but in the years since you can see that InnoDB is still reigning supreme, with all that came to replace it having fallen away for one reason or another. It’s still on track to have MySQL Cluster be the only real alternative (now also, funnily enough, owned by Oracle). I have to say, it’s kind of a hollow victory though; it would have been nice to see Falcon and PBXT be serious players in today’s market.

TokuDB

Big news at the Percona Live MySQL Conference and Expo this week: Tokutek open sourced TokuDB, thus making my previous post Where are they now: MySQL Storage Engines out of date in just a few days.

In this case, I really don’t mind. It’s rather exciting that they’ve gone ahead and done this – and it’s not just a code drop: https://github.com/Tokutek/ft-engine is where things are at, and the most recent commits were 2 and 18 hours ago, which means this is being maintained. This is certainly a good way to grow a developer community.

While being a MySQL engine is really interesting, the MongoDB integration is certainly something to watch – and may be something quite huge for Tokutek (after all, it’s the database engine parts of Mongo that are the most troublesome – along with the client library license).

A few notes on InnoDB in MySQL 5.7.1

I’ve started poking around the MySQL 5.7.1 source tree (although just from tarball as I don’t see a BZR tree yet). I thought I’d share a few thoughts:

  • InnoDB temporary tables. Not REDO logged. What does this mean? It’s a huge step in removing the dependency on the MEMORY and MyISAM engines for temporary tables used in query execution. With InnoDB temporary tables there is no reason for the MEMORY engine to continue to exist; there is absolutely no way in which it is better.
  • InnoDB temp tables aren’t insert buffered
    This probably doesn’t really matter as you’re not going to be doing REDO logging for them (plus things are generally short lived)… but it could be a future area for performance improvement
  • The NO_REDO log mode appears to be implemented fairly neatly.
  • Improvements in InnoDB read-only mode. What does this mean? Maybe we can finally get rid of the oddity of compressed read-only MyISAM tables on read-only media (on the other hand, CDs and DVDs aren’t exactly a modern form of software distribution).
  • Some of the source code comments have improved… it’s getting easier to understand InnoDB. I’d still make the argument that if you need source code comments your code isn’t clear enough… but any step is an improvement. (That being said, InnoDB was always easier than the server.)
  • There is some pretty heavy refactoring of lock0lock.cc – I really need to sit down and poke at it a bunch.
  • The shared tablespace code (innodb system tablespace) has been heavily refactored. This also introduces tablespaces for temporary tables – and it appears to be implemented in the correct way.

I need to look into things a bunch more, and it’ll be really useful to see a bzr tree to better understand some of the changes.

More to come later, but that’s my quick look.

The MEMORY storage engine

I recently wrote about Where are they now: MySQL Storage Engines and The MERGE storage engine: not dead, just resting…. or forgotten. Today, it’s the turn of the MEMORY storage engine – otherwise known as HEAP.

This is yet another piece of the MySQL server that sits largely unmaintained and unloved. The MySQL Manual even claims that it supports encryption… with the caveat of having to use the SQL functions for encryption/decryption rather than in the engine itself (so, basically, it supports encryption about as much as every other engine does).

The only “recent” innovation in the MEMORY engine was the dynamic row patch that ended up making its way into Percona Server (and isn’t enabled by default). This forced me to go and look at the code of the MEMORY engine again and I cannot possibly drink enough in my lifetime to erase the memory.

The MEMORY engine is used by just about everybody, as you probably have a SQL query somewhere that uses an in-memory temporary table. I can, however, feel the comments being added to this post right now by people (the sort who use gdb to set server variables) insisting that not a single query in their systems uses MEMORY… (IIRC there have been some patches around that would throw an error rather than create a temporary table.)
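If you want to check whether your own queries are in that boat, EXPLAIN will tell you (a sketch – table and column are made up):

  mysql -e "EXPLAIN SELECT col, COUNT(*) FROM t1 GROUP BY col ORDER BY COUNT(*)\G"
  # Extra: Using temporary; Using filesort
  # that 'Using temporary' is a MEMORY table, at least until it grows
  # too big and gets converted to an on-disk MyISAM table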

We had an early version of the dynamic row format patch in Drizzle for a while… and if you turned it on, all sorts of things horrifically broke. It was a remarkably non-trivial amount of work to get that code to work properly, and this is largely a testament to the “design” of the MEMORY engine.

While it may be efficient or fast or something (likely on 1990s hardware and workloads), it misses the boat completely on the things that matter today: simultaneous access, MVCC, BLOB/TEXT columns and transactions. Basically, it’s an engine that’s really only useful for a single connection in limited use cases… and even then, it’s likely a good way to ruin things. MyISAM is better, as at least on memory pressure things may be written out to disk sensibly… and if InnoDB had a “don’t log this table” mode it would beat the absolute pants off it.

It is, again, another part of the MySQL server that’s remarkably hard to pull out and replace with something different/better. Why? Well, I wrote about it before: Refactoring Internal temporary tables (another stab at it). If it was easy, we’d likely have Tokyo Cabinet (via BlitzDB) or similar (some bit of code maintained by other people) doing the same job in Drizzle rather than this large chunk of code that nobody really cares about.

The MERGE storage engine: not dead, just resting…. or forgotten.

Following on from my fun post on Where are they now: MySQL Storage Engines, I thought I’d cover the few storage engines that are really just interfaces to a collection of things. In this post, I’m talking about MERGE.

The MERGE engine was basically a multiplexer down to a number of MyISAM tables. They all had to have the same structure, there was no parallel query execution, and it saw fairly limited use. One of the main benefits was that you could then actually put more rows in a MyISAM table than your “files up to 2/4GB” file system allowed. With the advent of partitioning, this really should have instantly gone away and been replaced by it. It wasn’t.

It is another MySQL feature that exists likely due to customer demand at the time. It’s not a complete solution by any means, PARTITIONING is way more complete and universal…. and much harder to get right inside the MySQL server – which is why MERGE exists. It was easier to write a storage engine that wrapped MyISAM than it was to have any form of partitioning in the server.

One advantage of MERGE tables is it means that you could parallelize myisamchk to repair your broken MyISAM tables after a crash. One step better than no crash safety is at least parallel recovery. The disadvantage being that you’re using MERGE and MyISAM tables.

There is also the great security problem of MRG_MYISAM (the other name for MERGE tables): if you create a MyISAM table t1 and have a user able to access it, if they can create a MERGE table that accesses t1 (say m1) and you then revoke their access to t1, they’ll still be able to access t1 through m1.
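A sketch of the hole (user, database and table names are all made up, and the exact privilege checks varied between versions):

  mysql -e "CREATE TABLE db1.t1 (a INT) ENGINE=MyISAM; INSERT INTO db1.t1 VALUES (42)"
  mysql -e "GRANT SELECT ON db1.t1 TO 'bob'@'localhost'"
  mysql -e "GRANT SELECT, CREATE ON db1.m1 TO 'bob'@'localhost'"
  # bob wraps t1 while he still has access to it:
  mysql -u bob -e "CREATE TABLE db1.m1 (a INT) ENGINE=MRG_MYISAM UNION=(t1)"
  mysql -e "REVOKE SELECT ON db1.t1 FROM 'bob'@'localhost'"
  mysql -u bob -e "SELECT * FROM db1.m1"    # still returns 42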

MERGE still seems to exist in MySQL 5.6 without even a warning that it’ll go away… which I suspect it will…. we long since got rid of it in Drizzle as, well, what you really want is a query rewrite engine that does views, partitioning etc etc.

Can anyone think of a reason why you should still use MERGE tables in 2013? I can’t.

Detecting if a MySQL server supports partitioning

This morning, this Percona XtraBackup bug came to my attention: https://bugs.launchpad.net/bugs/1170340 – basically, it’s now really quite tricky to determine if a MySQL server you’re connected to supports partitioning or not.

If you’re connected to anything less than MySQL 5.6, you can use the have_partitioning variable. But since that’s gone in 5.6, you’re going to get a false negative if you’re connected to 5.6. You could use the INFORMATION_SCHEMA.PLUGINS table, but that’s not there in 5.0, so you need yet another workaround there too.

A simple version check could be the solution… but what if you compiled the server without partitioning support?
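The least fragile approach I can come up with is to probe rather than trust any single signal. A sketch (which also handles “compiled without partitioning”, since then the plugin simply never shows up as ACTIVE):

  #!/bin/sh
  # Does the server we're connected to support partitioning?
  # 5.1-5.5: the have_partitioning variable is authoritative.
  have=$(mysql -BNe "SHOW VARIABLES LIKE 'have_partitioning'" | awk '{print $2}')
  if [ "$have" = "YES" ]; then
      echo "partitioning: yes"; exit 0
  fi
  # 5.6: the variable is gone, so ask INFORMATION_SCHEMA.PLUGINS.
  # (On 5.0 this query fails, which is fine - 5.0 has no partitioning.)
  status=$(mysql -BNe "SELECT PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS
      WHERE PLUGIN_NAME='partition'" 2>/dev/null)
  if [ "$status" = "ACTIVE" ]; then
      echo "partitioning: yes"; exit 0
  fi
  echo "partitioning: no"; exit 1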