I don’t read code comments

They are wrong.

Misleading at best.

Reworking parts of Drizzle (which came directly from MySQL) it can get painfully obvious. Things like “afaiu” and “???” appear in more than one place (that is if the comment isn’t just obviously wrong).

A comment merely states what one person thought the code did at some point in the past. It has no relation to what the code actually does now.

Hold me, I’m scared….

From ha_myisam.cc:

/*
TODO: switch from protocol to push_warning here. The main reason we didn’t
it yet is parallel repair. Due to following trace:
mi_check_print_msg/push_warning/sql_alloc/my_pthread_getspecific_ptr.

Also we likely need to lock mutex here (in both cases with protocol and
push_warning).
*/
protocol->prepare_for_resend();
protocol->store(name, length, system_charset_info);
protocol->store(param->op_name, system_charset_info);
protocol->store(msg_type, system_charset_info);
protocol->store(msgbuf, msg_length, system_charset_info);
if (protocol->write())
sql_print_error(“Failed on my_net_write, writing to stderr instead: %s\n”,
msgbuf);
return;

Hopefully this will serve as a good TODO list to go and fix at some point.

The Drizzle Snowman – PlanetMySQL fail

If you went “wtf” at The Drizzle Snowman – WIN! – Jay Pipes on PlanetMySQL suddenly ending at “create table”, you should click through and see the unicode character for a snowman.

In other fun, we’ve also created tables with the name of cloud symbol, umbrella symbol and umbrella with rain drops symbol. All of these seem rather appropriate for Drizzle.

What VERSION in INFORMATION_SCHEMA.TABLES means (hint: not what you think)

It’s the FRM file format version number.

It’s not the version of the table as one might expect (i.e. after CREATE it’s 1. Then, if you ALTER, it’s 2. Alter again 3 etc).

In Drizzle, we now return 0.

In future, I plan that Drizzle will allow the engine to say what version it is (where 0 is “dunno”).

This’ll be a good step towards being able to cope with multiple versions of a table in use at once (and making sense of this to the user).

drop table fail (on the road to removing the FRM)

So… in removing the FRM file in Drizzle, I found a bit of a nugget on how drop table works (currently in the MySQL server and now “did” in Drizzle).

If you DROP TABLE t1; this is what happens

  • open the .frm file
  • read first 10bytes (oh, and if you get EIO there, in a SELECT * FROM INFORMATION_SCHEMA.TABLES you’ll get “Error” instead of “Base Table”)
  • if (header[0] != (unsigned char) 254 || header[1] != 1 ||
    (header[2] != FRM_VER && header[2] != FRM_VER+1 &&
    (header[2] < FRM_VER+3 || header[2] > FRM_VER+4)))
    return true;
    Which means that you probably (well, should have) set your enum legacy_db_type to DB_TYPE_UNKNOWN in the caller of bool mysql_frm_type(Session *, char *path, enum legacy_db_type *dbt) otherwise you end up in some form of pain.
  • Else, *dbt= (enum legacy_db_type) (uint) *(header + 3);
    return true;                   // Is probably a .frm table

I do like the “probably”.

Oh, and on a “storage engine api” front, some places seem to expect handler::delete_table(const char* name) to return ENOENT on table not existing. In reality however:

  • int ha_heap::delete_table(const char *name)
    {
    -  int error= heap_delete_table(name);
    -  return error == ENOENT ? 0 : error;
    +  return heap_delete_table(name);
    }
  • InnoDB (note the behaviour of returning DB_TABLE_NOT_FOUND… which isn’t ENOENT)
    err = DB_TABLE_NOT_FOUND;
    ut_print_timestamp(stderr);

    fputs(“  InnoDB: Error: table “, stderr);
    ut_print_name(stderr, trx, TRUE, name);
    fputs(” does not exist in the InnoDB internaln”
    “InnoDB: data dictionary though MySQL is”
    ” trying to drop it.n”
    “InnoDB: Have you copied the .frm file”
    ” of the table to then”
    “InnoDB: MySQL database directory”
    ” from another database?n”
    “InnoDB: You can look for further help fromn”
    “InnoDB: http://dev.mysql.com/doc/refman/5.1/en/”
    “innodb-troubleshooting.htmln”,
    stderr);

  • and MyISAM would generate the error message itself, but that’s fixed with:
    -  if (my_delete_with_symlink(from, MYF(MY_WME)))
    +  if (my_delete_with_symlink(from, 0))
    return(my_errno);

and just to add to the fun, elsewhere in the code, a access(2) call on the frm file name is used to determine if the table exists or not.

The road to removing the FRM has all sorts of these weird-ass things along it. Kinda nice to be able to replace this with something better (and, hopefully – good).

But let me sum up with sql_table.cc:

“This code is wrong and will be removed, please do not copy.”

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

*cough*

(and then wipe coffee off the computer)

of course the real aim should be to scale with one instance on the machine as scaling with multiple instances on the one machine isn’t scaling at all – it’s scale out, but with more problems (now when one machine goes down, so do 1110202434 database instances).

phrase from nearest book

from elliot: phrase from nearest book

  • Grab the nearest book.
  • Open it to page 56.
  • Find the fifth sentence.
  • Post the text of the sentence in your journal along with these instructions.
  • Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

My result:

“A lot of UN staff call themselves human rights experts but need a shoulder to cry on each time there’s a killing”

– Emergency Sex (and other desperate measures)
Kenneth Cain, Heidi Postlewait and Andrew Thomson

Technology predictions

In 2 years (ish):

  • the majority of consumer bought machines (which will be laptops) will have SSD and not rotational media
  • At the same time, servers with larger storage requirements will use disk as we once used tape.
  • At least one Linux distributoin will be shipping with btrfs as default
  • OpenSolaris will be looking interesting and not annoying to try out (a lot more “just work” and easy to get going).
  • Unless Sun puts ZFS under a GPL compatible license so it can make it into the Linux kernel, it will become nothing more than a Solaris oddity as other file systems will have caught up (and possibly surpassed).
  • There will be somebody developing a a MySQL compatible release based off Drizzle
  • Somebody will have ported Drizzle back to Microsoft Windows… possibly Microsoft.
  • X will still be used for graphics on Linux, although yet another project will start up to “replace X with something modern”, get a lot of press and then fail.

In 5 years:

  • Apple will single handedly control 1/3rd the mobile phone market
  • The other 2/3rds will be divided between Blackberry (small), Windows Mobile and Android.
  • Linux desktop market share will be much higher than Apple’s

That’s all for now…

libmallocfail

Bazaar branches of libmallocfail

Simple LD_PRELOAD library that will take parameters via environment variables and cause malloc() to occationally fail.

Aim was to use this to test bits of MySQL/Drizzle although since their libtool based stuf, the binary in tree is a libtool shell script, and I haven’t found a way to LD_PRELOAD only for mysqld and not the shell script and the other processes spawned by it.

I have found a bug in libc though :)

Goodbye FRM (or at least the steps to it)

Since before MySQL was MySQL, there has been the .FRM file. Of course, what it really wanted to be was “.form” -  a file that stored how to display a form on your (green) CRT. Jan blogged earlier in the year on this still being there, even in MySQL 5.1 (albeit not in any useful form).

So why do we want it to die?

Well… it’s not exactly very useful anymore.

There are a few things it’s used for….

If database/table.frm exists, the table exists (or, on Windows, you may also get databasetable.frm). This is tested in a few bits in the code by a call to access(2).

Most engines have their own data dictionaries (Innodb, PBXT, NDB, Falcon). Keeping these in sync with the FRMs can be problematic at best. This is especially true with distributed engines such as NDB.

The current solution is that on the SQL node that is creating the table, we create the FRM file, gzip it, and store it in the cluster. Then, other nodes, if they go “err… no local frm” first call ha_create_table_from_engine() which NDB will go and see if the table exists in the cluster. If so, it copies the FRM from the cluster to local disk and then the SQL server continues on its way with the standard way of opening a table (through the FRM). If you do DDL through the NDB-API (and not via SQL) then well… you get to keep both pieces.

As for if you crash during a table rename (with any engine with its own data dictionary.. e.g. InnoDB)… you again get to keep both pieces. (There is a bit of discussion on this over here)

Having FRM files also doesn’t especially lead to having multiple versions of table metadata co-existing in the server.

The fun part of reading a frm is open_binary_frm in table.cc. It reads in the frm into a TABLE_SHARE. If we only had some other way of filling out a TABLE_SHARE… one from the engine itself…

But what about any metadata that the engine data dictionary doesn’t have? For example, many server types may map to 1 engine type. An example of this is the GIS types in MySQL. For most engines, these just map to BLOBs. The engine itself has no knowledge about that, but we should fill out the table definition correctly…. so for this type of thing the engine may need to store some additional metadata. This is pretty easy for transactional engines: put it in a table! (although you then have your own problem about keeping this synchronised with any DDL). For engines that don’t have their own data dictionary, we can just provide a set of routines to store/read a frm type file (based on protobufs no doubt).

There also seems to be some entanglement with LOCK_open. Ahhh LOCK_open, the lock that nobody can possibly understand.

The tricky thing will be not rewriting every little bit from scratch all at once but rather go for the incremental bits….

Perhaps you’re not all stupid…. although…

As we (the rest of the world) heave a huge sigh of relief at the result of the US Presidential election it’s morning here and I’m happy that Obama is the president elect.

However, it looks like Proposition 8 will get up in California. EPIC FAIL. I hope the remaining ballots tip the result in the other direction, I really do.

I don’t see how the state (where state includes country and state… i.e. governments) has any right to legislate anything to do with human relationships.

What is marriage? Beyond the personal committment between consenting adults, it sets up a few things in case one party is incapacitated or dies. So why don’t we have a set of forms for this information?

In case I’m unable to make my own medical decisions, a list of people who (in order) I want to make them for me. If I’m not married, surely I still have a right to say who gets to make medical decisions for me.

On the “who gets my stuff when I die” front, a legal will pretty much covers that.

Some places seem to have tax benefits if you’re married… which to me is much like the government interfering in something that is none of its business. Surely being with a partner is its own reward, not something you do for tax incentives.

So you’re left with things like: shared property if relationship ends, right to adopt, access to fertility treatment etc

Shared property: How is this different from “two friends buy a house together, share it. they then fall out/decide to sell and there is dispute as one put more $$ into it”? It’s not, it’s just on a larger scale.

All I can say on who gets to reproduce is that it is surely a child is best raised by people who love them and can provide them with a safe, nurturing and positive environment.

Solution? Abolish Marriage.

Well… at least any legal definition thereof and everybody can live happily ever after.

Except that’s not what the issue is. It’s a debate between those who believe in personal freedom and right to live as you choose (on the proviso that it does not cause harm to others) and those who do not.

I have exactly zero time for those who think abortion should be illegal. When no child on this earth dies of hunger or easily preventative disease, I will then hear your argument. Odds are I will disagree with you, but until that day, you are just heartless.

But for the moment, we have until January 20th to see what more Bush/Cheney can do to make Americans constantly apologise for their country.

There is a great deal of hope for Obama, so I say this: don’t fuck up.

(Oh, and please don’t follow our new guys and try and erect a great internet firewall… because the club of countries that do that: Iran, Myanmar, North Korea, and Syria is one you really want to be part of).

I remain cautiously optimistic.

Singing in the Rain

The past 3 years, 11 months I have worked full time on NDB (MySQL Cluster). It’s been awesome. Love the product and people. In the time I’ve been on the Cluster team, we’ve gone from a small group that would easily fit in the (old old) Stockholm office to one that requires large rooms to house us all in. It’s also been all about smart people (you have to be to work on a distributed database).

With MySQL Cluster 6.4 we’re getting in a bunch of features that have been on the “wide adoption” wishlist. With each release of NDB we’ve gained a wedge of applications that can be used with it – and 6.4 is no exception.

One of the biggest things that’s been worked on is multithreaded data nodes. If you check out Jonas‘ recent posts on 500,000 reads/sec and then a massive 700,000 reads/sec.

We’ve also got a Microsoft Windows port coming up, which a number of people have asked for over the years. Mostly I think this is a “I want to try it out” thing and not a deployment thing. (can any sane person deploy a HA app on Win32?)

I’ve used “NDB$INFO” as the ultimate answer to any problem for a while now. It’s been the much-wanted monitoring interface. We have a lot of info inside NDB that currently isn’t easily user accessible (or only accessible through the magic DUMP interface or by gathering up many events in the cluster log). We have the start of NDB$INFO in 6.4 now and Martin will be continuing my work in making it truly awesome.

So go and grab the 6.4 tree and have a look – things are looking sweet.

What next for me?

Well… a while ago I started hacking on Drizzle. Why? Well… I thought we could move the database server in a new direction and make it more modular, leaner, meaner query machine.

And now, I’m starting to work on it full time.

It’s exciting, and I’ll be blogging on the first TODO which is remove the FRM file and switch to a full discovery method shortly.

UPDATE: Yes, I’m working full time on Drizzle for Sun Microsystems (in the CTO group). While not spending work time on NDB anymore, no doubt you’ll still see fun-time patches.

MySQL Cluster (NDB) on Win32 progress

Many things have been happenning in the land of NDB on Win32 as of late.

I’ve fixed about 700 compiler warnings (some of which were real bugs) leaving about 161 to go on Win32 (VS2003). We’re getting a few more warnings on Win64 (some of which look merely semantic, while others could be real bugs), but the main focus now is getting 32bit going really well.

I fixed a number of bugs that were around preventing lots of things from working properly:

Disk Data (i.e. CREATE TABLESPACE, CREATE LOGFILE GROUP, and CREATE TABLE… TABLESPACE ts1 STORAGE DISK) now works. The main problem here was that our filesystem abstraction layer for the NDB kernel (ndbd) once had a Win32 port… which has sorely bitrotted over the years. As new features were introduced to the file IO interface, they (of course) weren’t also added to the Win32 abstraction. In the disk data case, the OM_INIT feature, which on FSOPENREQ (open a file) allows data to be passed in for initialising the file. Previously, I fixed this to allocate the file on disk and create a file of the same size, but i didn’t add the feature that writes initial data to the file. This caused bugs as soon as you tried to use the disk data tables (the files weren’t initialised, so you hit asserts on corrupt disk data files).

Paths in the server: for whatever bizarre and stupid reason, the MySQL server can end up having paths to a table as ./database/table OR .\database\table. The latter *never* shows up on non-Win32 platforms but can *sometimes* show up on Win32. Ick ick ick ick. Anyway, we (in the NDB handler) weren’t dealing with this properly, causing problems around some metadata ops.

Our pushbuild system takes each push to a source tree, builds it on a variety of platfroms and runs the mysql-test-run.pl test suite. The Win32 hosts are actually running on vmware. In order to make tests run faster, on Linux we use /dev/shm for the data files. Microsoft Windows doesn’t have a good ram disk, so we create a file on /dev/shm on the host and map that as a drive inside Windows (and format it as NTFS). This drive is only 1GB. This is not enough disk space for running all the clusters (yes, plural) started by the test suite (and everything would die with ENOSPC). The workaround I’ve come up with is that for debug builds, we simply enable NTFS file compression on files ndbd creates.

Win64 is also working! Pushbuild builds and runs on 64bit, and the Win64 host is building with NDB and passing about the same amount of tests as the Win32 hosts!

The bad news is that the NDB with replication tests are pretty much all failing… so I’m fairly confident that cluster replication is very broken on Win32 (and 64) at the moment.

I’ve had to do a fair amount of fixing on a bunch of the test cases (mainly to do with finding where various NDB utilities are). They’ve also prompted fixes in NDB (automatically converting / to \ in ndbd on Win32 for CREATE DATAFILE/UNDOFILE).

If you want to give it a go – you can get the source from launchpad. Either in the mysql-5.1-telco-6.4 tree, or if you want a few more things fixed, always have a look at the mysql-5.1-telco-6.4-win tree. Hopefully both are synced with the latest internal trees (i.e. plain 6.4 is working on win32) by the time you read this.

Iggy and I discussed installers for NDB on Windows in Riga, and we should have something soon-ish for those of you who don’t build from source.

No opt-out of filtered Internet

Computerworld – No opt-out of filtered Internet (via Chris)

EPIC FAIL. Looks like I’m going to have to reconsider which end of the ballot paper the ALP goes at (hint, it’s now towards the bottom).

Pay no attention to the criminals behind the curtain… we’re just going to let them do their thing and make sure it’s hard for you to see them.

With a false positive rate (incorrectly blocking something) of 10,000/1,000,000 (otherwise known as 1%) we’ll no doubt see things blocked that shouldn’t be.

What happened to my free country?

evolution-data-server even worse (is that possible?)

Just caught it using 713MB of resident memory. What the fuck? I don’t even have Evolution running! There’s only the clock applet (which does pull things out of calendar i guess…).

Does Evolution win the prize for worst piece of free software yet?

new NetworkManager VPNC not better at all (in fact, much worse)

I upgraded to Ubuntu 8.10 the other day, NetworkManager promptly forgot my wireless LAN key (grr… lucky I keep a copy in a text file) as well as my VPN configuration. It’s also changed the UI for entering what specific networks to route over the VPN (’cause the last thing you want is putting all your traffic through VPN when you have a perfectly good internet connection here… or even worse, I do *not* need to go via Sydney or the US to access the machine 2ft from me thank you very much).

Generally not happy with the new NetworkManager.