Record autotest numbers for NDB

So, with a bunch of recent tests I added (and some bugs that have been fixed) we’re now consistently getting 203 or 204 passing tests. We typically have around 8 or 9 that fail – often because the test itself is broken or not quite deterministic. Or there’s a bug… :)

(All numbers are for the daily-basic list of tests on various 5.1 branches.)

It would be great to hit 300 by this time next year… which means a lot of test cases… hrrm… anybody want to volunteer?

MySQL Conf coming up (and memories of last year)

Andy Dustman just blogged referencing his previous posts on last year’s MySQL User Conference. This year’s is coming up soon (April 23–26) and the pressure to have all my presentations perfect is mounting (err… by the way, they will be).

Last year was a blast. Long days (stretching into the evenings) of sessions, BoFs, food and beer, discussing all sorts of things that in some way related back to databases (and rather often, surprisingly enough, MySQL).

What was also great was being able to talk to lots of people who are doing real things out in the real world about MySQL Cluster and whether it’s remotely suitable for their application. Often the answer is “I think you’re looking for replication”, which is perfectly okay too.

I’m in a few days early (and around a few days after) – so if you’re around the area, do give me a yell – it’d be cool to hang out.

FYI, I’m giving the following sessions:

  • MySQL Cluster: The Complete Tutorial (Parts I and II)
    Which is a total of six hours of MySQL Cluster goodness. It’s aimed at people who know MySQL (or are pretty good with other RDBMSs and can fake it) and want to know about MySQL Cluster. It’s a hands-on tutorial, so be prepared!
  • Introduction to MySQL Cluster
    A 45-minute whirlwind introduction to MySQL Cluster. Assumes some MySQL knowledge. Good if you’ve heard about this cluster thing (even from just reading the title of this session) and want to know what it’s all about.
  • Exploring New Features in MySQL 5.1 Cluster
    A 45-minute blast of a session on what’s new for MySQL Cluster in the 5.1 release. This will cover just about everything that was in last year’s presentation on the same topic. So if you came last year and come to this one again… I’m going to make fun of you for being a groupie :)
  • Bleeding Edge MySQL Cluster: Upcoming Cool Things
    A whole hour on the stuff you shouldn’t use in production. The topic list is sort-of known… it really is the latest and greatest that should be coming to a tree somewhere, sometime… this year. We’ll no doubt talk about online add node, online add/drop attribute, the multithreaded NDB kernel, API improvements and a whole lot more!
  • The Design and Internals of MySQL Cluster
    What happens under the hood in MySQL Cluster? Find out here! An hour for those with a real technical mind. If source code and network protocol descriptions scare you, this possibly isn’t for you – otherwise, expect an hour of coolness.

Yes, there seems to be a “Stewart” track at the conf :) Apparently people enjoyed my sessions last year… so there was a tendency to accept my sessions this year.

NDB Online Add Node Progress (or rather, testing it)

So, the sitch as of today:

I’ve added an ndb_mgm_set_configuration() call to the mgmapi – a not-so-casually evil API call that sends a packed ndb_mgm_configuration object (like the one you get from ndb_mgm_get_configuration()) to the management server, which then resets its lists of nodes for event reporting and for ClusterMgr and starts serving things out of this new configuration. Notably, if a data node restarts, it gets this new configuration.

By itself, this will let us write test programs for online configuration changes (e.g. changing DataMemory).
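For illustration, here’s a rough sketch of what such a test program could do. The surrounding mgmapi calls are real; the exact signature of ndb_mgm_set_configuration() and the config-editing step are my assumptions, so treat it as a sketch rather than the committed test:

#include <mgmapi.h>

/* Sketch: fetch the running configuration, tweak it, push it back. */
int change_config_online(const char *connectstring)
{
  NdbMgmHandle h= ndb_mgm_create_handle();
  ndb_mgm_set_connectstring(h, connectstring);
  if (ndb_mgm_connect(h, 0, 0, 0) != 0)
    return -1;

  /* 0 = any configuration version */
  struct ndb_mgm_configuration *conf= ndb_mgm_get_configuration(h, 0);
  if (conf == NULL)
  {
    ndb_mgm_disconnect(h);
    ndb_mgm_destroy_handle(&h);
    return -1;
  }

  /* ... modify e.g. DataMemory in conf here, via the
     configuration iterator interfaces ... */

  /* the new call: hand the whole packed object back to ndb_mgmd */
  int res= ndb_mgm_set_configuration(h, conf);

  ndb_mgm_destroy_configuration(conf);
  ndb_mgm_disconnect(h);
  ndb_mgm_destroy_handle(&h);
  return res;
}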

I’ve also added a Disabled property to data nodes. If it’s set, just about everywhere ignores the node.

This allows a test program to test add/drop node functionality – without the need for external clusterware stopping and starting processes.

If you start with a large cluster, a test program can disable some nodes and do an initial cluster restart (essentially starting a new, smaller cluster) and then add the disabled nodes back in to form a larger cluster. Due to the way we do things, we actually still have the Transporters to the nodes we’re adding, which is slightly different from what happens in the real world. However, it keeps the test program independent of any mechanism for starting a node on a machine – so I don’t (for example) need to run ndb_cpcd on my laptop while testing.

But anyway, I now have a test program, which we could run as part of autotest, that will happily take a 4 node cluster, start a 2 node cluster from it, and then online add the 2 disabled nodes. In NDBT terms it looks something like the sketch below.
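The NDBT macros here are from storage/ndb/test, but the runXxx step functions are hypothetical stand-ins for the real ones:

#include <NDBT_Test.hpp>

/* Each step function has the usual
   int runXxx(NDBT_Context* ctx, NDBT_Step* step) signature. */

NDBT_TESTSUITE(testOnlineAddNode);
TESTCASE("OnlineAddNode",
         "Initial-restart a 2 node cluster out of 4, "
         "then online add the 2 disabled nodes"){
  INITIALIZER(runRestartWithTwoNodesDisabled);
  STEP(runEnableAndAddNodes);
  VERIFIER(runCheckClusterHasFourNodes);
}
NDBT_TESTSUITE_END(testOnlineAddNode);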

Adding these new nodes into a node group is currently not working with my patches though… for some reason the DBDICT transaction doesn’t seem to be going through the prepare phase… no doubt a bug in my code relating to something that’s changed in DBDICT over the past year.

So there is progress towards having the ability to add data nodes (and node groups) to a running cluster.

Online table re-organisation is another thing altogether though… and no doubt there are some good subtle bugs yet to be written.

Code size of an engine versus test suite

If you count the lines of code in the MySQL Cluster (NDB) test suite (mysql-5.1/storage/ndb/test – excluding the old ODBC stuff) you come up with about 104,000 lines of code. This is in contrast to the approximately 350,000 lines of code for the NDB engine itself (excluding the handler, which is an additional 12,000 lines – the handler isn’t tested much by the NDB test suite… mysql-test-run.pl is meant to take care of a lot of that).

If you go and check the MyISAM tree, it’s only 40,545 lines of code – for the entire engine. That’s right: the MySQL Cluster test suite is about 2.5 times the size of MyISAM.

If you look at the mysql-test-run.pl tests, which are just lists of SQL statements with static data, they come to 250,000 lines (excluding result files). The NDB tests do things programmatically – so they can generate large amounts of data and different loads quite easily.

The architecture of the NDB tests (commonly referred to as autotest, ATRT or the HUGO framework) is very different from mysql-test-run.pl – it easily allows you to write a test that is high on concurrency, high on load and high on amount of data. It’s also modular, so when you get an issue from a customer (or need to do some benchmarking on a specific type of schema) you can use the utility programs to help you (e.g. there’s one that does random PK updates to tables, one that does scans, one that does index operations etc.) – see the sketch below.
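To give a flavour of it, here’s a minimal sketch of a hugo-style test. The class and method names are from storage/ndb/test, but the signatures are abridged from memory, so treat this as illustrative:

#include <NdbApi.hpp>
#include <HugoTransactions.hpp>

/* Load a table with generated data, then hammer it with random
   PK updates – two of the building blocks most NDB tests reuse. */
int load_and_update(Ndb *ndb, const NdbDictionary::Table *tab)
{
  HugoTransactions hugo(*tab);
  if (hugo.loadTable(ndb, 100000) != 0)   /* 100k generated rows */
    return -1;
  return hugo.pkUpdateRecords(ndb, 100000);
}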

There’s a whole bunch of things you just cannot do with mysql-test-run.pl.

Then we get to fault injection… MySQL Cluster is a distributed system designed to withstand failure. Without testing this, we can never say it’s remotely HA. So we test it. A lot. We inject failures into nodes to check our node failure handling, and using the utility programs and some basic shell it’s possible to do custom tests (such as multi-node failure) where our test suite doesn’t have the best coverage yet.

Again, this is either not possible or extremely hard with mysql-test-run.pl.

mysqlslap is the hint of a nice utility to help with testing… but using it in mysql-test-run.pl scripts in a verifiable way (i.e. checking that what came out is what went in, using a variety of access methods – full table scans, PK lookups, index scans, stored procedures, cursors, views, joins etc.) is tricky at best (and realistically impossible).

Yes, I’m really pining for a better test suite infrastructure for the MySQL Server – it can only lead to better quality software… even just having somebody rewrite a bunch of the hugo classes to use the MySQL C API would be useful.

mgmapi timeouts and resurrecting the online add node

The other day I managed to send off what are nearly the final patches for adding proper timeout support to the MySQL Cluster management API. Jonas has had a bit of a look and found one thing I’d missed, but it’ll probably get in somewhere soon (probably the Carrier Grade Edition first, then the others… 5.1 makes sense IMHO, if only for the amount of management server testing that my patches add).
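From the API user’s side the result is pleasantly boring. A sketch, assuming a call along the lines of ndb_mgm_set_timeout() taking milliseconds:

#include <mgmapi.h>

NdbMgmHandle h= ndb_mgm_create_handle();
ndb_mgm_set_connectstring(h, "mgmhost:1186");
/* fail calls on this handle if the management server doesn't
   respond within 5 seconds, instead of hanging forever */
ndb_mgm_set_timeout(h, 5000);
ndb_mgm_connect(h, 0, 0, 0);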

Unfortunately, in what we laughingly call the past, the management server – for whatever hysterical raisins – never really received much direct testing. Sure, if the data nodes couldn’t get their configuration, or autotest couldn’t control the daemons, then things were obviously broken. But a subtle (or not so subtle) change in API or behaviour would certainly not be picked up.

The real “feature of the year” (not my words), though, is fault injection for the management server, which we can use in testing. The MySQL Cluster kernel (the data nodes) already has extensive fault injection that is regularly exercised by ATRT (storage/ndb/test in the source tree).

I’ve also started to resurrect my online add node patch, which has been sitting around in various states for over a year (actually… about 14 months… I just haven’t touched it in 12), and port it to the latest 5.1 tree (as I’m not sure where it’ll end up, start at the lowest common denominator – it’s possible it’ll end up in Carrier Grade first too). Now comes the problem of testing the sucker. Previously I had a shockingly bad shell script and hard-coded files to make this go.

Obviously, hard-coded stuff is not the way to go. The real way is to be able to do everything neatly and programmatically, so we can run it as part of the regular autotest.

timeout units

Following a discussion about MythTV on #xfs (as you do), and a wondering of “hrrm… I wonder what unit that timeout is in” about some NDB code, I wish to make the following announcement:

All timeout values in NDB related APIs will now be given in centijiffies of the server system. For APIs that can talk to multiple hosts, it will be furlongs per fortnight.

I feel that having a consistent interface such as this will lead to much less confusion and better apps.

NDB! NDB! The storage engine for me!

Today I set up a mysqld connected to my not-quite-HA cluster at home to replicate from my MythTV database into the cluster. The idea behind this is to eat an increasing amount of my own dogfood around the house.

To do this, I also set up the MySQL Instance Manager to manage the now multiple instances of MySQL Servers on one box here. I found it a pain to do – it should be a lot simpler, but isn’t. At least now things are going okay… but the feature wish list I have is rather long (perhaps I should hack some stuff up in this “spare time” I’ve been hearing so much about).

I’m also about 10 minutes (or however long the build takes) away from moving one of the data nodes off the machine, so it will be a real 2 node system (but I still have to move the management server to a third machine to have any real HA… I have a PowerPC machine marked for that, I just have to await some patches to make it work :)

Currently though, my Gallery is being served off this. There are so many more photos I should add; I just haven’t come up with a decent way to interface f-spot and Gallery – especially when I go back and retouch, delete or tag photos.

MySQL 5.1.14 has hit the streets, the kids love it.

Over at the DevZone (MySQL 5.1.14 Downloads), the cool kids are grabbing the latest 5.1 beta. Lots of Cluster fixes in this release too. We’re getting to a much more polished state for NDB with each release, and that’s a good thing to see.

On a totally different topic, I bought a really sweet-smelling mango today and cannot wait for the right time sometime this afternoon to eat it. All the summer fruits are really nice at the moment (a benefit of a warm December, I guess) and I’m loving it.

Although 37–41 degrees (Celsius, duh) can be less fun with a rather warm laptop.

online online online! (or restarts are for wusses)

I often see things go past my eyes where customers (and users – i.e. those who don’t send wads of cash our way and hence are not financially supporting my beer, curry and photography habits) have amazing uptime and reliability requirements.

When talking to businesses that use MySQL, it’s not uncommon to have the “if the DB is down, our business doesn’t operate” line bandied around. How people make sure this never happens can differ (hint: it often involves replication and good sysadmin practices).

One thing I like doing is making things easier for people. Sometimes it’s also a much more complicated problem than you’re initially led to believe.

I think configuration files are obsolete. Okay, maybe just for databases. Everything should be changeable as an online operation, and through a standard interface – in our case, SQL. That makes it suddenly really easy to write portable UIs around the admin functionality (no getting the parsing, generation and – most trickily – modification of text-based config files right), just the issuing of SQL to the server, which is relatively simple. This even enables web apps to tune the database a bit, opting for various amounts of automation for various applications – in a cross-platform way!

One of my visions for NDB (MySQL Cluster) is to get rid of the (user-visible) configuration file and manage everything through SQL (or the management client, something like that). This way you could ALTER CLUSTER ADD NODE, ALTER CLUSTER SET DataMemory=4GB etc. and things should “just work”, taking however long is needed – without downtime.

In a clustered environment, we could do these operations transactionally, so that in the event of node or system failure we have some hope of being in a nicely consistent state, and so that during system recovery (or node recovery) we’re not performing a configuration change in addition to restarting (as could happen if you edited a config file and then had a crash).

Config changes could also have EXPLAIN – a non-modifying operation that would explain what would be done – e.g. a rolling restart, taking approximately X minutes per node and Y minutes total. This could help in planning and scheduling configuration changes.

(i wonder if that made any sense)

pluggable NDB

Spoke with Brian the other day on what was required to get NDB to be a pluggable engine – and started hacking.

The tricky bits involve the dependencies of things like mysqldump and ndb_restore on some headers to determine which tables shouldn’t be dumped (hint: the cluster database used for replication).

Also, all those command line parameters and global variables – they’re fun too. It turns out InnoDB and PBXT are also waiting on this. In the meantime, I’ve done a hack that puts config options in a table.

I’m currently blocked on getting the embedded server (libmysqld) to build properly – but I now have a sql/mysqld binary with pluggable NDB. All the libtool foo too.

Hopefully I’ll soon be able to follow up with an “it works” post.

Disk allocation, XFS, NDB Disk Data and more…

I’ve talked about disk space allocation previously, mainly revolving around XFS (namely because it’s what I use – it’s a sensible choice for large file systems and large files, and has a nice suite of tools for digging into what’s going on).

Most people write software that just calls write(2) (or libc things like fwrite or fprintf) to do file I/O – including space allocation. Probably 99% of file I/O is fine to do like this, and the allocators for your file system get it mostly right (some more right than others). Remember, disk seeks are really, really expensive, so the fewer you have to do, the better (i.e. fragmentation == bad).

I recently (finally) wrote my patch to use xfsctl to get better allocation for NDB disk data files (data files and undo files).
patch at:
http://lists.mysql.com/commits/15088

This actually ends up giving us a rather nice speed boost in some of the test suite runs.

The problem is:
– two cluster nodes on 1 host (in the case of the mysql-test-run script)
– each node has a complete copy of the database
– ALTER TABLESPACE ADD DATAFILE / ALTER LOGFILEGROUP ADD UNDOFILE creates files on *both* nodes. We want to zero these out.
– files are opened with O_SYNC (IIRC)

The patch I committed uses XFS_IOC_RESVSP64 to allocate (unwritten) extents and then posix_fallocate to zero out the file (the glibc implementation of this call just writes zeros out).

Now, ideally it would be beneficial (and probably faster) to have XFS do this in-kernel. Asynchronously would be pretty cool too… but hey :)

The reason we don’t want unwritten extents is that NDB has some realtime properties, and futzing about with extents and the like in the FS during transactions isn’t such a good idea.

So this led me to try XFS_IOC_ALLOCSP64 – which doesn’t carry the “unwritten extents” warning that RESVSP64 does. However, with the two processes writing the files out, I get heavy fragmentation. Even with a RESVSP followed by an ALLOCSP I get the same result.

So it seems that ALLOCSP re-allocates extents (even if it doesn’t have to) and really doesn’t gain you much (I didn’t do enough timing to see whether it was any quicker).

I’ve asked whether this is expected behaviour on the XFS list… we’ll see what the response is (I haven’t had time yet to go read the code… I should though).

So what improvement does this patch make? Well, I’ll quote my commit comments:

BUG#24143 Heavy file fragmentation with multiple ndbd on single fs

If we have the XFS headers (at build time) we can use XFS specific ioctls
(once testing the file is on XFS) to better allocate space.

This dramatically improves performance of mysql-test-run cases as well:

e.g.
number of extents for ndb_dd_basic tablespaces and log files
BEFORE this patch: 57, 13, 212, 95, 17, 113
WITH this patch  :  ALL 1 or 2 extents

(results are consistent over multiple runs. BEFORE always has several files
with lots of extents).

As for timing of test run:
BEFORE
ndb_dd_basic                   [ pass ]         107727
real    3m2.683s
user    0m1.360s
sys     0m1.192s

AFTER
ndb_dd_basic                   [ pass ]          70060
real    2m30.822s
user    0m1.220s
sys     0m1.404s

(results are again consistent over various runs)

similar for other tests (BEFORE and AFTER):
ndb_dd_alter                   [ pass ]         245360
ndb_dd_alter                   [ pass ]         211632

So what about the patch? It’s actually really tiny:


--- 1.388/configure.in	2006-11-01 23:25:56 +11:00
+++ 1.389/configure.in	2006-11-10 01:08:33 +11:00
@@ -697,6 +697,8 @@
 sys/ioctl.h malloc.h sys/malloc.h sys/ipc.h sys/shm.h linux/config.h \
 sys/resource.h sys/param.h)
 
+AC_CHECK_HEADERS([xfs/xfs.h])
+
 #--------------------------------------------------------------------
 # Check for system libraries. Adds the library to $LIBS
 # and defines HAVE_LIBM etc

--- 1.36/storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp	2006-11-03 02:18:41 +11:00
+++ 1.37/storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp	2006-11-10 01:08:33 +11:00
@@ -18,6 +18,10 @@
 #include
 #include
 
+#ifdef HAVE_XFS_XFS_H
+#include <xfs/xfs.h>
+#endif
+
 #include "AsyncFile.hpp"
 
 #include
@@ -459,6 +463,18 @@
 Uint32 index = 0;
 Uint32 block = refToBlock(request->theUserReference);
 
+#ifdef HAVE_XFS_XFS_H
+    if(platform_test_xfs_fd(theFd))
+    {
+      ndbout_c("Using xfsctl(XFS_IOC_RESVSP64) to allocate disk space");
+      xfs_flock64_t fl;
+      fl.l_whence= 0;
+      fl.l_start= 0;
+      fl.l_len= (off64_t)sz;
+      if(xfsctl(NULL, theFd, XFS_IOC_RESVSP64, &fl) < 0)
+        ndbout_c("failed to optimally allocate disk space");
+    }
+#endif
 #ifdef HAVE_POSIX_FALLOCATE
 posix_fallocate(theFd, 0, sz);
 #endif

So get building your MySQL Cluster with the XFS headers installed and run on XFS for sweet, sweet disk allocation.

mysql NDB team trees up on bkbits.net

If you head over to mysql on bkbits.net, you can get a copy of the NDB team trees. This is where we push stuff before it hits the main MySQL trees, so that we can get some extra testing in (also for when we pull from the main tree). So you can be relatively assured that this is going to work fairly well for NDB and have the latest bug fixes.

Of course, if anything is going to break here – it’s going to be NDB :)

This should allow you to get easy access to the latest-and-greatest NDB code.

At some point soon I’ll update my scripts that generate doxygen output (and builds) to do the -ndb trees.

enjoy!

weekly builds

Saturn’s autoweb

I’ve hacked my scripts that generate doxygen docs to also build MySQL 4.1, 5.0 and 5.1 for AMD64 (the box that it’s running on) with Cluster. This is to help my idea of running Gallery at home with NDB disk data tables in very recent MySQL builds.

How’s it going so far? Well… I’ve found some bugs and some seemingly strange behaviour here and there. Bug reports will come; I’m also currently running a bit of an older build.

I’ll make the URL of the Gallery public at some point too.

DaveM on Ingo’s SMP lock validator

DaveM talks about Ingo’s new SMP lock validator for the Linux kernel.

A note reminding me to go take a look and see what can be ripped out and placed into various bits of MySQL and NDB. Ideally, of course, it could be turned into an LD_PRELOAD for pthread mutexes.
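As a taste of that LD_PRELOAD idea (my sketch, nothing to do with Ingo’s actual implementation): interpose the pthread calls, record who acquires what, and do the lock-order analysis on the side.

#include <dlfcn.h>
#include <pthread.h>
#include <cstdio>

/* Build with: g++ -shared -fPIC lockhook.cc -o lockhook.so -ldl
   Run with:   LD_PRELOAD=./lockhook.so mysqld ...               */
extern "C" int pthread_mutex_lock(pthread_mutex_t *m)
{
  typedef int (*lock_fn)(pthread_mutex_t*);
  static lock_fn real_lock=
    (lock_fn) dlsym(RTLD_NEXT, "pthread_mutex_lock");
  /* a real validator would record (thread, mutex) acquisition
     order here rather than just printing */
  fprintf(stderr, "lock %p\n", (void*) m);
  return real_lock(m);
}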

Anybody who wants to look deeper into it before I wake up again is welcome to (and please tell me what you find).

How auto_increment is implemented in NDB

I was writing this in an email to a co-worker today; it could possibly interest people in the outside world as well. It’s a good idea to look at the source at the same time as reading this :)

In ha_ndbcluster::write_row(byte*),

if (table_share->primary_key != MAX_KEY)
{
  /*
   * Increase any auto_incremented primary key
   */
  if (has_auto_increment)
  {
    THD *thd= table->in_use;

    m_skip_auto_increment= FALSE;
    update_auto_increment();
    /* Ensure that handler is always called for auto_increment values */
    thd->next_insert_id= 0;
    m_skip_auto_increment= !auto_increment_column_changed;
  }
}

We set next_insert_id to 0 so that in handler::update_auto_increment() we end up calling the handler and never doing it just inside the server.

The handler function that we end up in is: ha_ndbcluster::get_auto_increment().

From here we end up inside NDB to do the actual work (not in the table handler).

Looking inside storage/ndb/src/ndbapi/Ndb.cpp at the method:

Ndb::getAutoIncrementValue(NdbDictionary::Table*,Uint32)

which really just calls Ndb::getTupleIdFromNdb(Uint32,Uint32)
which either returns a cached value, or goes off and does a call to NDB to get either 1 auto increment value or the full cacheSize we’ve requested (which is worked out in ha_ndbcluster::get_auto_increment()). The increment itself is done in the interestingly named Ndb::opTupleIdOnNdb(Uint32 aTableId, Uint64 opValue, Uint32 op) (with op=0).

This increments an entry in the SYSTAB_0 table inside the sys database in NDB. The row with SYSKEY_0 equal to the table id holds the auto increment value. You can watch this with a tool such as ndb_select_all (e.g. ndb_select_all -d sys SYSTAB_0, grepping for the table id you found with ndb_show_tables) while inserting rows into a table with an auto_increment column.
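And if you want to poke at this from an NDB API program rather than through mysqld, the call described above is directly usable. A minimal sketch (signatures as given above, slightly abridged):

#include <NdbApi.hpp>

/* Grab the next auto_increment value for a table, asking NDB to
   reserve a batch of 32 so that most calls are served from the
   local cache rather than a round trip to SYSTAB_0. */
Uint64 next_auto_inc(Ndb *ndb, const char *table_name)
{
  const NdbDictionary::Table *tab=
    ndb->getDictionary()->getTable(table_name);
  return ndb->getAutoIncrementValue(tab, 32);
}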