Preliminary MySQL Cluster benchmark results on POWER8

Yesterday, I got the basics going for MySQL Cluster on POWER. Today, I finished up a couple more patches to improve performance and ran some benchmarks.

This is on a 3.7Ghz POWER8 machine with non-balanced memory (only 2 of the 4 NUMA nodes have memory, so we have less total memory bandwidth than we could have, plus I’m going to bind ndbmtd to the CPUs in these NUMA nodes)

With a setup of a single replica and two data nodes on the one machine (each bound to a specific NUMA node), running the flexAsync benchmark on MySQL Cluster 7.3.7, I could get around:

  • 3.2 million reads/sec
  • 2.6 million deletes/sec
  • 2.4 million updates/sec
  • 2.4 million inserts/sec.

So, that’s at least in the right ballpark for a first go.

(I’m running this on a big endian host kernel, some random kernel I booted on the box and built with gcc 4.8 with whatever build options the MySQL Cluster cmake foo chooses by default)

MySQL Cluster on POWER8

So, I’ve written previously on MySQL on POWER, and today is a quick bit of news about MySQL Cluster on POWER – specifically MySQL Cluster 7.3.7.

I ran into three main issues in getting some flexAsync benchmark results. One of them was the fact that I wanted to do this in the middle of all the POWER8 machines I usually use moving buildings (hard to run benchmarks when computers are packed up in boxes on a truck).

The next issue was that ndbmtd (the multi-threaded data node) needs memory barriers for the magic message passing stuff between threads. So, that’s pretty easy (about an eight line patch).

The next issue was in the results from flexAsync, it turns out 32bit math is a bad idea with results from my POWER8 box.

My preliminary performance numbers are fairly promising (actually… what is the world record for a single machine and NDB these days? Single data node?). I think there’s a bit more low hanging fruit and a couple more things that are a bit more involved.

Bugs with patches:

  • Bug 74782 – compile fix (memory barriers for POWER)
  • Bug 74781 – flexAsync uses 32bit math, leading to incorrect summary on POWER8

Ghosts of MySQL Past, Part 8: The First Fork.

This is the 8th installment in the rather long series that started with Part 1 about a month ago.

Back in 2006, we were in the situation where MySQL 5.0 had taken forever, and the first “GA” release was not suitable for production. Looking towards MySQL 5.1, it was also unlikely to be out any time soon. The MySQL Cluster team had customers that needed new features in a stable release. The majority of users didn’t use the MySQL server at all, they directly used the C++ NDB API for the vast majority of queries – so the vast majority of release blocker bugs in the MySQL server would not affect the production readiness of MySQL Cluster for these customers.

So, the decision was wisely made to do separate releases from a separate tree for MySQL Cluster. This was named MySQL Cluster: Carrier Grade Edition and exists to this day.

The main use case for MySQL Cluster at this time was running telephone networks, specifically the Home Location Registry databases of GSM phone networks. Basically, you need to keep a database of which tower each subscriber is associated with so when you go to make a phone call (or SMS) the network can properly route the call. This means there’s some realtime response requirements and hardcore availabilty which demands a special type of database.

NDB has a long history (some of it detailed in a previous post), but for those kind of interested in internals, I’ll quote Frazer Clement (now a long time MySQL Cluster developer although I completely forget which year he joined the team, which is just slightly embarrassing):

…Erlang and Ndb Cluster share some Plex heritage, which can still be seen in their architectures today. Since Plex, Erlang has mated with Prolog, and Ndb Cluster was involved in a car crash with C++.

The customers for MySQL Cluster were not the ones who bought the $5000 support option… Typically, paying for the addition of a relatively major feature was considered quite okay and even “normal”. After all, the effort that goes into constructing the entire software stack of a large cell phone network is rather large, and MySQL Cluster would end up being a very small part of that.

Many major features went into MySQL Cluster first, sometimes years before they made the general MySQL Server: row based replication, circular replication with conflict resolution and online DDL (add/drop index and column). In fact, it kind of incredibly frustrates me that we solved online add column in NDB so many years ago and you still can’t add a column to an InnoDB table without some serious planning.

The key thing about MySQL Cluster releases? They happened. They were also regular, addressing customer issues and new major versions brought features that worked for the use cases of those who needed them.

Interestingly enough, you may recognize the person who ran the MySQL Cluster team as the same person who has overseen the now regular release cycle of the MySQL Server itself (and has now been in the MySQL world for over ten years).

The MySQL Cluster storage engine

This is one close to my heart. I’ve recently written on other storage engines: Where are they now: MySQL Storage Engines, The MERGE storage engine: not dead, just resting…. or forgotten and The MEMORY storage engine. Today, it’s the turn of MySQL Cluster.

Like InnoDB, MySQL Cluster started outside of MySQL. Those of you paying attention at home may notice a correlation between storage engines not written exclusively for MySQL and being at all successful.

NDB (for Network DataBase) started inside Ericsson, originally written in a language called PLEX, which was internal to Ericsson and used in the AXE telephone switches. Mikael Ronstrom’s PHD thesis covered NDB and even covered things that (at least were) yet to be implemented (it’s been quite a few years since I leafed through it last). The project at Ericsson (IIRC) was shelved a couple of times, but eventually got spun out into an Ericsson Business Innovation company called Alzato.

Some remnants of PLEX can still be found in the NDB source code (if you look really hard that is). At some point the code was fed through a PLEX to C++ converter and development continued from there. Some of the really, really old parts of the source may seem weird either due to this or some hand optimization for SPARC processors in the 1990s.

In 2003, MySQL AB acquired Alzato and work on a storage engine plugin for MySQL to interface to the (C++ API only) NDB was underway. Seeing as the storage engine interface was so simple, easy and modular it would only take several years for the interface to NDB to become mature.

The biggest problem: NDB itself worked really well if your workload fit exactly what it was good at… if you deviated, horrific performance and/or crashes were not as uncommon as we’d have liked. This was a source of strain for many years with the developers and support team on one side and some of the less-than-careful sales team on the other. That being said, there have been some absolutely awesome sales people selling NDB into markets it truly fits, and this is why there’s barely a place in the world where placing a mobile phone call doesn’t go through MySQL Cluster at some point.

You should read Tomas Ulin’s post Celebrating 10 years @MySQL for a bit of an insight into how Alzato became part of MySQL AB (which later became part of Sun which became part of Oracle).

I joined the MySQL Cluster team at MySQL in December 2004, not too long after Alzato was acquired, but certainly when the NDB storage engine in MySQL 4.1 was in its very early stages – it was then by no means a general purpose database.

Over the years, MySQL Cluster gained both traction and features, making it useful for more applications. One of the biggest marketing successes of MySQL was the storage engine architecture and how you could just “plug in” different engines. The reality (of course) was far different and even though MySQL Cluster did just “plug in” to MySQL, it was certainly not a drop in replacement.

In MySQL 5.0, a bunch of neat new features were added:

  • Engine condition pushdown
    This enabled conditions on non-indexed columns to be evaluated on the data nodes rather than having every row pulled up to the SQL node to be evaluated.
  • Batched read interface
    So that queries like SELECT FOO FROM BAR WHERE A IN (1,2,3) were executed as a single network round trip rather than 3 round trips.
  • Query cache
    Although the query cache should die, hey, at least it worked with NDB now…. in a way.
  • Reduced IndexMemory usage
    Remember, NDB is an in-memory database, so saving a bunch of bytes for secondary indexes was a big thing.

the first release with things I really worked on was MySQL 5.1. My first talk (to a packed room) at the MySQL User Conference in 2006 was on new features in MySQL Cluster 5.1. I’m still quite proud of that talk even though I know I am a much better speaker than I was then (It would have been great to have had more guidance… but hey, learning from experience is good too).

We added a lot in 5.1:

  • Integration with replication
    This is where row based replication was born. It was a real team effort with the NDB kernel part (going from memory and bzr logs) having been written by Tomas and Jonas seems to have a bunch of code there too. I worked a bunch on the NDB Injector thread in mysqld, Mats worked on the core row based code (at the time the most C++ like code in the entire MySQL world). You could now have a cluster replicate to another cluster with the giant bottleneck that is MySQL replication.
  • disk data
    You could store non-indexed columns on disk. I implemented the INFORMATION_SCHEMA.FILES table for this, I was young and naive enough to think that the InnoDB guys would also fill out this table and all would be happy with the world (I’m lucky I haven’t been holding my breath on this one).
  • Variable Sized columns
    A VARCHAR(255) would actually not always use more than 255bytes if you just stored a single character in it. Catch? Only for in-memory columns.
  • User defined partitioning
    Because NDB desperately needed more options, we let the user choose how they wanted to partition up their data (per table).
  • Autodiscovery of schema changes
    This was a giant workaround to the epic mess that is FRM files and data dictionary things inside the MySQL Server. It is because of all this code that when I went to rewrite the whole thing for Drizzle I took the approach of “just pass it down to the engines, the server must not attempt to know better”. FWIW, I’m still right: if the server tries to be clever you now have two places for bugs to be, not just one.
  • Distribution awareness
    i.e. better selection of which data node to talk to for a particular query, reducing latency.
  • Online add/drop index.
    How long did it take for other engines to get this? Let’s not think about that :)

After that the really interesting stuff started to happen, that is, the first major fork of MySQL: MySQL Cluster Carrier Grade Edition (CGE). Why? We had customers that simply couldn’t wait for MySQL 6.0 (after all, they’d still be waiting).

We had MySQL Cluster CGE 6.1, 6.2, 6.3 and now we’re into 7.0, 7.1 and 7.2. There is without doubt that it’s the longest serving and surviving MySQL fork. There were non-trivial changes inside the MySQL server too, which caused enough of a merge problem for the (small) Cluster team.

One big thing that you’re probably still all waiting for? Replication conflict detection and resolution in circular/multi-master replication setups. It was an NDB first and been used in production for a decent amount of time.

I remember a hack while on an airplane led to the CompressedBackup and CompressedLCP options (used zlib when writing out checkpoints/backups) – something that took more time than you’d think to go from prototype to production ready code.

The last few things I worked on in MySQL Cluster before going and working full time on Drizzle was the Windows port, online add/drop node and NDBINFO.

I’ve left out so many cool MySQL Cluster things that were worked on over the years (e.g. online add/drop column, rewriting of LCP code, micro GCPs, crash-safe DDL, the test suite). I really should mention the test suite, in lines of code it was over three times that of MyISAM.. and that was probably six years ago that I worked that out.

One thing to think about: when Innobase Oy was bought by Oracle and there was this effort to have a transactional storage engine that was inside MySQL AB rather than another company, I pointed out that I thought it would take less time adding the needed features to NDB and integrating it inside the MySQL server binary (and with the addition of online add node you could go from stand alone DB server to a full cluster with no down time) than it would for any of the alternatives to get to a suitable level of maturity.

I wish I put money on this… I put money on the MySQL 5.1 GA release date (which I was happy to loose), but in the years since you can see that InnoDB is still reigning supreme with all that came to replace it having fallen away for one reason or another. It’s still on track to have MySQL Cluster be the only real alternative (now also, funnily enough, owned by Oracle). I have to say, it’s kind of a hollow victory though, it would have been nice to see Falcon and PBXT be serious players in today’s market.

AlsoSQL

So there’s a bit of a swelling around the idea of NoSQL. That is, databases that don’t have an SQL interface in front of them – with the promise of better performance. With a well designed backend, this is no doubt the case.

A flexible query language is rather useful though. I think we’ll see the rise of AlsoSQL. That is systems that present a fast and simple protocol along with a SQL interface.

This hybrid system has seen use for many years. MySQL Cluster is one such example. SQL through MySQL Server, NoSQL through NDB API.

With Drizzle, I feel we’ll be in a pretty good position to offer non-sql based protocols and access methods to existing storage engines.

Continuing the journey

A couple of months ago (December 1st for those playing along at home) it marked five years to the day that I started at MySQL AB (now Sun, now Oracle). A good part of me is really surprised it was for that long and other parts surprised it wasn’t longer. Through MySQL and Sun, I met some pretty amazing people, worked with some really smart ones and formed really solid and awesome friendships. Of course, not everything was perfect (sometimes not even close), but we did have some fun.

Up until November 2008 (that’s 3 years and 11 months for those playing at home) I worked on MySQL Cluster. Still love the product and love how much better we’re making Drizzle so it’ll be the best SQL interface to NDB :)

The ideas behind Drizzle had been talked about for a while… and with my experience with internals of the MySQL server, I thought that some change and dramatic improvement was sorely needed.

Then, in 2008, Brian created a tree. I was soon sending in patches at nights, we announced to the whole world at OSCON and it captured a lot of attention.

Since November 2008 I’ve been working on Drizzle full time. It was absolutely awesome that I had the opportunity to spend all my days hacking on Drizzle – both directly with fantastic people and for fantastic people.

But… the Sun set… which was exciting and sad at the same time.

Never to fear! There were plenty of places wanting Drizzle hackers (and MySQL hackers). For me, it came down to this: “real artists ship”. While there were other places where I would no doubt be happy and work on something really cool, the only way I could end up working out where I should really be was: what is the best way to have Drizzle make a stable release that we’d see be suitable for deployment? So, Where Am I Now?

Rackspace.

Where I’ll again be spending all my time hacking Drizzle.

NDB$INFO with SQL hits beta

Bernhard blogged over at http://ocklin.blogspot.com/2010/02/mysql-cluster-711-is-there.html that MySQL Cluster 7.1.1 Beta has been released. The big feature (from my point of view) is the SQL interface on top of NDB$INFO. This means there is now full infrastructure from the NDB data nodes right out to SQL in the MySQL Server for adding monitoring to any bit of the internals of the data nodes.

Drizzle FRM replacement: the table proto

Drizzle originally inherited the FRM file from MySQL (which inherited it from UNIREG). The FRM file stores metadata about a table; what columns it has, what type those columns are, what indexes, any default values, comments etc are all stored in the FRM. In the days of MyISAM, this worked relatively well. The row data was stored in table.MYD, indexes on top of it in table.MYI and information about the format of the row was
in table.FRM. Since MyISAM itself wasn’t crash safe, it didn’t really matter if creating/deleting the FRM file along with the table was either.

As more sophisticated engines were introduced (e.g. InnoDB) that had their own data dictionary, there started to be more of a problem. There were now two places storing information about a table: the FRM file and the data dictionary specific to the engine. Even if the data dictionary of the storage engine was crash safe, the FRM file was not plugged into that, so you could end up in a situation where the storage engine
recovered from a crash okay, but the FRM was incorrect for what the engine recovered to. This would always require manual intervention to find out what went wrong and then fix it (in some rather unusual ways).

When the MySQL Cluster (NDB) engine was introduced, a new set of problems arose. Now the MySQL server was connecting to an existing database, where tables could be created on other nodes connected to the cluster. You now not only had the problems of crash recovery, but the problems of keeping the FRM files in sync across many nodes, requiring
all sorts of interesting solutions that, for the most part, do work.

The “obvious” solution to some of these problems would be for an engine to write out an FRM file itself. This is much easier said than done. The file format was never created to be read and written by multiple pieces of software, the code that did the reading and writing inside the server was not reusable elsewhere and the only documentation (that
wasn’t a decent chunk of the MySQL source tree) is the rather incomplete definition in the MySQL Internals wiki (http://forge.mysql.com/wiki/MySQL_Internals_File_Formats) – not nearly enough to write a correct FRM file as the specifics are very, very odd.

Our goals for reworking the metadata system in Drizzle were: to allow engines to own their own metadata (removing any opportunity to have inconsistencies between the engine and the ‘FRM’) and for engines without their own data dictionary, to replace the FRM file format with something simple and well documented.

One option was to use SQL as the standard storage format, but it is rather non-trivial and expensive to parse – especially if we were to use it as the preferred way of talking table definitions with storage engines. We had been looking at the protobuf library
(http://code.google.com/p/protobuf/) ever since its first release and it has a number of very nice characteristics: a description language of a data structure that is then used to generate APIs for reading and writing it in a number of programming languages and a standard (documented) way to serialize the data structure.

After a bit of discussion, we arrived at a good outline for the table definition proto. The current one can always be found in the Drizzle source tree at drizzled/message/table.proto. The current format is very close to final (i.e. one that we’ll suppport upgrades from).

The process of modifying the Drizzle code base so that it would write (and read) a file format different to the FRM isn’t worth going too much into here although there were some interesting hurdles to overcome. An interesting one was the FRM file contains a binary image of the default row for the table (which is in the row format that the server uses); we now store the default value for each column in the proto and generate the default row when we read the proto. Another interesting one was removing and refactoring “pack_flag” – the details of which should only be extracted from Jay or Stewart with a liberal application of fine ale.

The end result is that we now have storage engines that are completely responsible for their own metadata. One example is the ARCHIVE engine. In the CREATE TABLE code path, the ARCHIVE storage engine gets the table definition in an object that represents the table proto. It can examine the parameters it needs to and then either store the proto directly, or convert it into its own format. Since ARCHIVE is simple, it just stores
the table proto in a serialised form (using a standard function provided by the protobuf library) and stores it in the .ARZ data file for the table. This instantly makes the ARCHIVE storage engine crash safe for CREATE and DROP table as there is only 1 file on disk, so no two files to get out of sync.

If an engine does not have its own data dictionary, it can still use the default implementation which just stores the serialised table proto in a file on disk.

We can also now use this interface to move INFORMATION_SCHEMA into its own storage engine. This means we can remove a lot of special case code throughout the server for INFORMATION_SCHEMA and instead just have a INFORMATION_SCHEMA storage engine that says it has the following tables in the INFORMATION_SCHEMA database. Because the table definition is now in a documented format with a standard API, this becomes a relatively trivial exercise.

What we’re all looking forward to is when the InnoDB data dictionary is linked into the new interface and we can have a truly crash safe database server.

Another wonderful side effect is since we now have a standard data structure for representing a table definition, we can integrate this with the replication system. In the “near” future, we can represent a CREATE TABLE in the replication stream as a table proto and not the raw SQL. If you were wanting to apply the replication stream to a different database server, you then only have to write a table proto to SQL
converter. If the target database system doesn’t do SQL at all, you could generate API calls to create the table.

So we now have a rather flexible system in place, with the code implementing it being increasingly simple and possible to be “obviously correct”.

Things that easily fall out of this work that people have written about:
– CREATE TABLE LIKE with ENGINE clause
http://krow.livejournal.com/671235.html
– table_raw_reader – looking at the raw representation of table metadata
http://www.flamingspork.com/blog/2009/10/01/table_raw_reader-reading-the-table-proto-from-disk-and-examining-everything/
– Table discovery
http://www.flamingspork.com/blog/2009/07/29/table-discovery-for-drizzle-take-2-now-merged/

Some more info:
http://krow.livejournal.com/642329.html

NDB Kernel size over releases

So Jonas pointed out that the NDB kernel hasn’t changed too much in size over releases. Let’s have a look:

In fact, the size went down slightly from 4.1 to 5.0. In this, 6.4 and 7.0 are the same thing but appear twice for completeness.

You can see the raw results in the spreadsheet here.

Feedback from MySQL Cluster tutorial

Way back on Monday (at the MySQL Conference and Expo), I gave a full day tutorial on MySQL Cluster. I awoke early in the morning to a “oh ha ha” URL in an IM; but no, it wasn’t jetlag playing tricks with me. Luckily, this didn’t take much (if anything) away from the purpose of the day: teaching people about NDB.

Distracting-and-this-time-really-annoying-thing-of-the-day-2: It seems that O’Reilly had cut back on power this year, and there were no power boards in the room. A full day interactive tutorial, and nowhere to plug in laptops. Hrrm.. Luckily, having over the many years I’ve been speaking at this event, I’ve gotten to know the AV guys okay, and asked them. They totally deserve a medal. Tutorial started at 8:30, I noticed at 7:30, and it was all fixed by 7:45. The front half of the room (enough for everyone coming) had enough power for everyone. It was quite okay to bunch everybody up – means I have to run around less.

This years tutorial was modified from last year (and that does take time, even though I’ve given it many times before). I wanted to remove out of date things, trim bits down (to better fit into the time we have, based on more experience on how long it takes to get interactive parts done) and add a bit.

When we got to the end of the day (yes, I ran over… and everybody stayed, so either I’m really scary or the material is really interesting) I pleaded for feedback. It’s amazingly scary doing an interactive tutorial. You’re placing the success of the session not so much on you, but on everyone who’s come to it.

Sometimes I’ve gotten not much feedback at all; this time was different. I spoke to a number of people afterwards (and some via email) and got some really good suggestions for small changes that would have greatly enhanced the day for them. I was pleased that they also really enjoyed the tutorial and liked the interactivity. I (and it seems a great many others) do not much like tutorials that are just long talks.

People walked out of my tutorial with a good overview of what MySQL Cluster was, how to set one up, use one, do a bit of admin and some of how it works.

I even dragged Jonas up to explain in great detail the 2 phase commit protocol for transactions. Of course, this is detail you don’t ever need to know to deploy – but people are intersted in internals.

So far the session has received an average of 4 stars in evaluations (four five star, two four star and one two star). I’d be really interested in feedback from the person who gave two stars, as this may mean I missed getting something done for them (e.g. providing information, help etc). Even though it is hard to spread yourself around a room of 60-ish-plus people, I do like to do it well. There is the other possibility of people not coming prepared, which will mean they may be bored for a lot of the day if they don’t jump in with another group and help learn that way.

So, I’m rather happy with how my first session went.

MySQL Cluster Tutorial

This year I am again giving a MySQL Cluster Tutorial at the MySQL Conference and Expo. As those who have attended before can tell you, this is a hands on tutorial. I don’t just stand up the front and talk at you for a day, that would be very boring (for all of us). While there is a good amount of presented material (there is a decent amount of theory to get through), there is a large component that involves setting up a cluster, putting data in, getting data out, backup, restore.

So if you’re wanting to learn about MySQL Cluster in a nice and friendly hands-on environment, I can recommend coming to my tutorial.

The tutorial isn’t the be-all and end-all tutorial. It does not teach you everything. It does give you a decent introduction though.

linux.conf.au 2009 wrap-up (incl Open Source Databases Mini-conf): Day 0-1

It’s no secret that I love linux.conf.au. My first was linux.conf.au 2003, in Perth and I’ve been to every one since (there are at least two people who’ve been to every single one, including CALU as it was called in 1999).

I’ve been on the board of Linux Australia for some insane proportion of the years since then (joining in 2003). Linux Australia is the not-for-profit community organisation that puts on linux.conf.au. It’s all volunteers and amazingly enough we have more than one group of people wanting to put on linux.conf.au each year!

This year, we Marched South to Hobart.

Here I detail what I saw, what I wish I saw and whatever else comes to mind.

Sunday – Before the conference

Ran into Bdale while checking in. Short flight down. A million and one people on the plane and on the ground that I knew. It must be linux.conf.au.

Seeing way too many awesome people I know, checking into accommodation (oh my, what a hill), registering for conf, beer and then off to a “ghosts of conferences past” dinner – where a few people who had organised previous linux.conf.au’s were hastily gathered together to chat to part of the 2010 team.

Monday – Open Source Databases Miniconf Day 1

Oh, that’s right – I’m running the OSDB Miniconf :)

First up, Monty Taylor spoke on “NDB/Bindings – Use the MySQL Cluster Direct API from languages you actually like for fun and profit”. Possibly taking the prize for the longest talk title of the conference. The NDB API is not SQL, it’s what the MySQL server (and one day, when Monty and I get around to it, Drizzle) translates SQL into for NDB. That being said, you can (pretty much always) write NDB API code that dramatically outperforms equivilent SQL (for a variety of reasons). Monty maintains the NDB/Bindings project that lets you use languages other than C++ for the NDB API.

At the same time as Monty was speaking, I wish I’d been able to fork() and go and see “Is Parallel Programming Hard, And, If So, Why?by Paul McKenney and Michael Still talking about MythNetTV (pull RSS feeds of video in as MythTV programs).

After morning tea, we were meant to have “InnoDB scaling up and performance” by Bruce Huang, but he was a no-show. Hint: if you don’t want bad things to be said about you by conference organisers, either show up or let them know you’re not able to make it.

Instead, we led a crazy Q&A type session around the room which was a whole lot of fun. Really a “ask the experts” meets running up-and-down stairs with a microphone.

Next up, Arjen Lentz who runs Open Query spoke on “OurDelta: Builds for MySQL”. The best way to describe OurDelta is a “distribution of MySQL”. It’s the MySQL server plus a bunch of patches provided by various people that haven’t yet made it into the main source tree (for any number of reasons).

At the same time (if you’ve never been to linux.conf.au, you’ll find that you often want to be in at least 3 places at once) I would have really liked to see “MythTV Internals by Nigel Pearson” (I co-wrote Practical MythTV with Michael Still, which is having a “second edition” in wiki form over at http://www.mythtvbook.com/) as well as the panel on geek parenting as this may be something I’m one day faced with.

Up next: Russell Coker filled in for Kaigai (same talk, different speaker) to talk on The Security-Enhanced PostgreSQL – “System-wide consistency” in access controls. I found this quite interesting and different approaches to database security are worth looking at. Modern applications (read: web applications) don’t map their uses to database users at all. There are usually two users on the database server: the super user and the user that the app uses. It would be nice to have a good solution for those who want it.

Again, If I had the ability to be in two places at once, I would have also seen “How I Learned To Stop Worrying And Love ACPI” by the extremely handsome Matthew Garrett.

Monty Widenius (blog here – and yes, we have two Monty’s now… which does cause confusion) talking about the Maria storage engine. Maria is based on MyISAM, but adding crash safety and transactions (among other things).

Again, if I was able to be in several places at once I would have also seen Rusty‘s “Large CPUmasks”, Nathan Scott talking about “System level performance management with PCP” and Bdale’s “Collaborating Successfully with large corporations”.

An awesome start to the conference.

row id in MySQL and Drizzle (and the engines)

Some database engines have a fundamental concept of a row id. The row id is everything you need to know to locate a row. Common uses include secondary indexes (key is what’s indexed, value is rowid which you then use to lookup the row).

One design is the InnoDB method of having secondary indexes have the value in the index be the primary key of the row. Another is to store the rowid instead. Usually (or often… or sometimes…) rowid is much smaller than the pkey of the row. This is how innodb can answer some queries just out of the index. If it used rowid, it may involve more IO to answer the query. All this is irrelevant if you never want just the primary key from a secondary index.

Some engines are designed from the start to have rowid, others it’s added later (e.g. NDB).

Anyway… all beside the point. Did you know you can do this in mysql or drizzle:

drizzle> create table t1 (a int primary key);
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 (a) values (1);
Query OK, 1 row affected (0.01 sec)

drizzle> select _rowid from t1;
+--------+
| _rowid |
+--------+
|      1 |
+--------+
1 row in set (0.00 sec)

Is that the rowid from the engine? No (although at least NDB will let you select the real ROWID through a pseudo column through NDBAPI). Quoting from the MySQL manual:

If a PRIMARY KEY or UNIQUE index consists of only one column that has an integer type, you can also refer to the column as _rowid in SELECT statements.

Unfortunately, this isn’t correct… as this lovely bit of “oh my, what an excellent way to obfuscate my database app!” shows:

drizzle> create table t1 (a int primary key, b varchar(100));
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 values (1,”foo”);
Query OK, 1 row affected (0.00 sec)

drizzle> update t1 set b=”foobar!” where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

drizzle> select * from t1;
+—+———+
| a | b |
+—+———+
| 1 | foobar! |
+—+———+
1 row in set (0.00 sec)

So how is this implemented? In two places: in sql_base.cc find_field_in_table() and in table.cc during FRM parsing (this is how I found it). We can even do things Oracle can’t (insert, update and delete):

drizzle> update t1 set a=2 where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 2 | foobar! |
+---+---------+
1 row in set (0.00 sec)

drizzle> update t1 set _rowid=3 where _rowid=2;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 3 | foobar! |
+---+---------+
1 row in set (0.00 sec)

SQLite also has something similar (see the autoinc docs).

I do wonder if anybody uses this functionality. It’s even tested (I was quite shocked at this) in the auto_increment and heap_auto_increment tests.

People on IRC as some measure of a project

#mysql isn’t too fair to include, as it’s really about users, not dev. #mysql-ndb is there because i heart ndb.

Oh, and linux.conf.au is there because it’s *awesome* and you should go.

Totally unscientific due to i’m only taking a sample once and whatever… but it kinda interests me…

Speaker: MySQL Conference & Expo 2009 – O’Reilly Conferences, April 20 – 23, 2009, Santa Clara, CA

Yes, I’m speaking at  the upcoming MySQL Conference & Expo 2009 – on April 20 – 23 (and yes, it’s in Santa Clara again).

I have three sessions:

MySQL Cluster Tutorial: this time with 6.4 feature goodness. Very hands-on, very interactive.

MySQL Cluster on Windows:  (insert witty text about hating operating system freedom here)

Memory Management in MySQL and Drizzle: not magic setting of buffer variables, but memory allocation and management inside the server, a bunch of malloc() discussion and hopefully some interesting numbers.

MySQL Cluster (NDB) on Win32 progress

Many things have been happenning in the land of NDB on Win32 as of late.

I’ve fixed about 700 compiler warnings (some of which were real bugs) leaving about 161 to go on Win32 (VS2003). We’re getting a few more warnings on Win64 (some of which look merely semantic, while others could be real bugs), but the main focus now is getting 32bit going really well.

I fixed a number of bugs that were around preventing lots of things from working properly:

Disk Data (i.e. CREATE TABLESPACE, CREATE LOGFILE GROUP, and CREATE TABLE… TABLESPACE ts1 STORAGE DISK) now works. The main problem here was that our filesystem abstraction layer for the NDB kernel (ndbd) once had a Win32 port… which has sorely bitrotted over the years. As new features were introduced to the file IO interface, they (of course) weren’t also added to the Win32 abstraction. In the disk data case, the OM_INIT feature, which on FSOPENREQ (open a file) allows data to be passed in for initialising the file. Previously, I fixed this to allocate the file on disk and create a file of the same size, but i didn’t add the feature that writes initial data to the file. This caused bugs as soon as you tried to use the disk data tables (the files weren’t initialised, so you hit asserts on corrupt disk data files).

Paths in the server: for whatever bizarre and stupid reason, the MySQL server can end up having paths to a table as ./database/table OR .\database\table. The latter *never* shows up on non-Win32 platforms but can *sometimes* show up on Win32. Ick ick ick ick. Anyway, we (in the NDB handler) weren’t dealing with this properly, causing problems around some metadata ops.

Our pushbuild system takes each push to a source tree, builds it on a variety of platfroms and runs the mysql-test-run.pl test suite. The Win32 hosts are actually running on vmware. In order to make tests run faster, on Linux we use /dev/shm for the data files. Microsoft Windows doesn’t have a good ram disk, so we create a file on /dev/shm on the host and map that as a drive inside Windows (and format it as NTFS). This drive is only 1GB. This is not enough disk space for running all the clusters (yes, plural) started by the test suite (and everything would die with ENOSPC). The workaround I’ve come up with is that for debug builds, we simply enable NTFS file compression on files ndbd creates.

Win64 is also working! Pushbuild builds and runs on 64bit, and the Win64 host is building with NDB and passing about the same amount of tests as the Win32 hosts!

The bad news is that the NDB with replication tests are pretty much all failing… so I’m fairly confident that cluster replication is very broken on Win32 (and 64) at the moment.

I’ve had to do a fair amount of fixing on a bunch of the test cases (mainly to do with finding where various NDB utilities are). They’ve also prompted fixes in NDB (automatically converting / to \ in ndbd on Win32 for CREATE DATAFILE/UNDOFILE).

If you want to give it a go – you can get the source from launchpad. Either in the mysql-5.1-telco-6.4 tree, or if you want a few more things fixed, always have a look at the mysql-5.1-telco-6.4-win tree. Hopefully both are synced with the latest internal trees (i.e. plain 6.4 is working on win32) by the time you read this.

Iggy and I discussed installers for NDB on Windows in Riga, and we should have something soon-ish for those of you who don’t build from source.

getarg calls srand() ???

storage/ndb/test/src/getarg.c

Guess what? It calls srand(time(NULL)) in getarg(). Why you ask? well.. what you want to be able to when specifying a flag is have it be true, false or it could “maybe” be set.

That’s right kids… maybe.

I’m sure it’s used somewhere in our test suite to get coverage on different things.. but umm.. yeah, interesting discovery for today.