MySQL Cluster (NDB) on Win32 progress

Many things have been happenning in the land of NDB on Win32 as of late.

I’ve fixed about 700 compiler warnings (some of which were real bugs) leaving about 161 to go on Win32 (VS2003). We’re getting a few more warnings on Win64 (some of which look merely semantic, while others could be real bugs), but the main focus now is getting 32bit going really well.

I fixed a number of bugs that were around preventing lots of things from working properly:

Disk Data (i.e. CREATE TABLESPACE, CREATE LOGFILE GROUP, and CREATE TABLE… TABLESPACE ts1 STORAGE DISK) now works. The main problem here was that our filesystem abstraction layer for the NDB kernel (ndbd) once had a Win32 port… which has sorely bitrotted over the years. As new features were introduced to the file IO interface, they (of course) weren’t also added to the Win32 abstraction. In the disk data case, the OM_INIT feature, which on FSOPENREQ (open a file) allows data to be passed in for initialising the file. Previously, I fixed this to allocate the file on disk and create a file of the same size, but i didn’t add the feature that writes initial data to the file. This caused bugs as soon as you tried to use the disk data tables (the files weren’t initialised, so you hit asserts on corrupt disk data files).

Paths in the server: for whatever bizarre and stupid reason, the MySQL server can end up having paths to a table as ./database/table OR .\database\table. The latter *never* shows up on non-Win32 platforms but can *sometimes* show up on Win32. Ick ick ick ick. Anyway, we (in the NDB handler) weren’t dealing with this properly, causing problems around some metadata ops.

Our pushbuild system takes each push to a source tree, builds it on a variety of platfroms and runs the mysql-test-run.pl test suite. The Win32 hosts are actually running on vmware. In order to make tests run faster, on Linux we use /dev/shm for the data files. Microsoft Windows doesn’t have a good ram disk, so we create a file on /dev/shm on the host and map that as a drive inside Windows (and format it as NTFS). This drive is only 1GB. This is not enough disk space for running all the clusters (yes, plural) started by the test suite (and everything would die with ENOSPC). The workaround I’ve come up with is that for debug builds, we simply enable NTFS file compression on files ndbd creates.

Win64 is also working! Pushbuild builds and runs on 64bit, and the Win64 host is building with NDB and passing about the same amount of tests as the Win32 hosts!

The bad news is that the NDB with replication tests are pretty much all failing… so I’m fairly confident that cluster replication is very broken on Win32 (and 64) at the moment.

I’ve had to do a fair amount of fixing on a bunch of the test cases (mainly to do with finding where various NDB utilities are). They’ve also prompted fixes in NDB (automatically converting / to \ in ndbd on Win32 for CREATE DATAFILE/UNDOFILE).

If you want to give it a go – you can get the source from launchpad. Either in the mysql-5.1-telco-6.4 tree, or if you want a few more things fixed, always have a look at the mysql-5.1-telco-6.4-win tree. Hopefully both are synced with the latest internal trees (i.e. plain 6.4 is working on win32) by the time you read this.

Iggy and I discussed installers for NDB on Windows in Riga, and we should have something soon-ish for those of you who don’t build from source.

SetFileValidData Function (Windows) – Now with added FAIL

SetFileValidData Function (Windows)

There seems to be two options on Win32 for preallocating disk space to files.

Basically, I want a equivilent to posix_fallocate or the ever wonderful xfsctl XFS_IOC_RESVSP64 call.

The idea being to (quickly) create a large file on disk that is stored efficiently (i.e. isn’t fragmented).

From SQL, you’d do something like “CREATE LOGFILE GROUP lg1 ADD UNDOFILE ‘uf1’ INITIAL_SIZE 1G;” and expect a 1GB file on disk. One way of getting this is calling write() (or WriteFile() on Win32) repeatedly until you’ve written a 1GB file full of zeros. This means you’re generating approximately 1GB of IO.

Except it’s worse than that: every time you extend the file, you’re going to be changing the metadata (file and free space information). If you’re lucky, you won’t be using a file system that writes a new transaction to the journal for each time you do this.

If your file system allocator doesn’t like you today (even more likely when you’ve got more than one process doing IO), you may end up with rather fragmented files as well – especially if you’re doing synchronous IO. So you want some method of saying “this file will be size X, please allocate disk space to it in the most efficient way for a file size of X” as it’s not possible to infer this from everyday IO calls (I guess the Win32 CopyFile and CopyFileEx calls could though).

It probably doesn’t do it, but having a CopyFile call would be neat for copy on write file systems and saving space… although I wonder how many Win32 apps would cope with ENOSPC on a write to an existing part of a file.

On IRIX we used the magic xfsctl() with the XFS_IOC_RESVSP64 argument. On Linux (with XFS), we use the same. On ext2/ext3 the only way to get the same has been to (with the file system unmounted), parse the file system and implement it yourself. Although (and this just in) the brand new fallocate() call should help with this. The posix_fallocate() call in GNU libc has just been a wrapper around the simple method of writing 0 to a file from start to end (albeit rather efficiently).

XFS implements something called “unwritten extents”. An unwritten extent says “this range of blocks is allocated to this file. If reading from this range, return a zero page. If writing, split the unwritten extent into 3 parts: before, the newly written extent (which isn’t unwritten: i.e. now valid data), and the after extent.” Simple, rather efficient and gets really good allocation as XFS gets to search the free space btrees based on size.

So what to do on Win32 (apart from drink heavily to try and make it all go away)?

There’s SetFileValidData, but that needs special permissions and may expose previously deleted data from other users. i.e. massive security hole. FAIL

There’s SetEndOfFile which, quoting the MS docs: “If the file is extended, the contents of the file between the old end of the file and the new end of the file are not defined.” Not exactly reassuring… but introduced in W2k, so rather safe to use today. Doesn’t save you from having to fill the file with zeros as part of initialisation though.

There’s SetFileInformationByHandle, which looks like it may do exactly what I want… if you read between the lines of the documentation. But it’s only supported starting with Vista. Which you all use of course, so that’s not a problem.

Building MySQL on Windows – MySQL Forge Wiki

Building MySQL on Windows – MySQL Forge Wiki

This one covers running mysqld in the VisualStudio debugger, which can be useful.

I have no special ndb_mgmd.exe or ndbd.exe in debugger instructions or wisdom (running them from mysql-test-run.pl at least). I’ve attached debugger to already running (started by mysql-test-run.pl) ndb processes, but haven’t made any changes to mtr to make it like the mysqld of “go and enter this”.

Building MySQL Cluster on Windows (for Windows)

You will need:

  • CMake (at least 2.4.7)
  • Bazaar (the newer the better – 1.6 was just released – at least use that)
  • Gnu Bison
  • Visual Studio (Express works, but I’m talking about 2005 here)
  • … and all this installed on a Microsoft Windows machine.
  • … and to hate yourself, you are going to be using Windows after all.

Then, get and build it:

  1. Get the source:
    bzr branch lp:~mysql/mysql-server/mysql-5.1-telco-6.4-win
  2. Run CMake. the CMake GUI can now be used to select compile options! You’ll have to set the path “where is the source code” to where you put the source code in step 1.
  3. Hit “Configure” in CMake
  4. Select the target (i.e. the version of Visual Studio you’re going to use)
  5. Select the build options. HINT: WITH_NDBCLUSTER_STORAGE_ENGINE may be a useful one to enable
  6. Hit Configure again
  7. Hit Ok.
  8. CMAKE now generates the Visual Studio project. Use this time to drink some good scotch.
  9. Open Mysql.sln (which should launch Visual Studio)
  10. Go Build -> Build Solution (or hit F7)

Now you can go and have much whisky as this will take a few minutes. You should now have a set of built binaries for MySQL Cluster on Windows. Scary.

ndb_mgm.exe builds (and works) in mysql-5.1-telco-6.4-win

“MySQL Cluster 6.4 Windows tree” branch in Launchpad

(which really should have the -fail suffix… but anyway)

In what will (soon) be mirrored to launchpad, all but 17 targets (yeah, working on that… but it’s out of 130 or something) build.

Not only that, I’ve used the management client (ndb_mgm.exe) to monitor the cluster running my Bugzilla instance (which is now a rather old 6.3 build).

Getting closer to NDB on Windows.

Be afraid. Be very, very afraid.

“MySQL Cluster 6.4 Windows tree” branch in Launchpad

“MySQL Cluster 6.4 Windows tree” branch in Launchpad

That’s right folks, I’m pushing up patches for MySQL Cluster on Windows. This tree is incomplete, and no promises on when enough will be pushed for it to even compile on Windows.

Tree is updated when launchpad pulls from our internal tree.

Getting a file size (on Windows)

The first point I’d like to make is that you’re using a Microsoft Windows API, so you have already lost. You are just not quite aware of how much you have lost.

A quick look around and you say “Ahh… GetFileSize, that’s what I want to do!” Except, of course, you’re wrong. You don’t want to use GetFileSize at all. It has the following signature:

DWORD WINAPI GetFileSize(  __in       HANDLE hFile,

__out_opt  LPDWORD lpFileSizeHigh

);

Yes, it supports larger than 4GB files! How? A pointer to the high-order doubleword is passed in! So how do you know if this errored? Return -1? WRONG! Because the high word could have been set and your file length could legitimately be 0x1ffffffff. So to find out if you actually had an error, you must call GetLastError! Instead of one call, you now have two.

The Microsoft documentation even acknowledges that this is stupid: “Because of this behavior, it is recommended that you use GetFileSizeEx instead.”

GetFileSizeEx is presumably named “Ex” as in “i broke up with my ex because their API sucked.”

You now have something that looks like this:

BOOL WINAPI GetFileSizeEx(  __in   HANDLE hFile,

__out  PLARGE_INTEGER lpFileSize

);

Which starts to look a little bit nicer. For a start, the return code of BOOL seems to indicate success or failure.

You now get to provide a pointer to a LARGE_INTEGER. Which, if you missed it, a LARGE_INTEGER is:

typedef union _LARGE_INTEGER {  struct {

DWORD LowPart;

LONG HighPart;

};

struct {

DWORD LowPart;

LONG HighPart;

} u;

LONGLONG QuadPart;

} LARGE_INTEGER,

*PLARGE_INTEGER;

Why this abomination? Well… ” If your compiler has built-in support for 64-bit integers, use the QuadPart member to store the 64-bit integer. Otherwise, use the LowPart and HighPart members to store the 64-bit integer.”

That’s right kiddies… if you’ve decided to loose from the get-go and have a compiler that doesn’t support 64-bit integers, you can still get the file size! Of course, you’re using a compiler that doesn’t have 64bit integer support… and the Microsoft documentation indicates that the GetFileSizeEx call requires Windows 2000… so it’s post y2k and you’re using a compiler without 64-bit ints? You have already lost.

Oh, but you say something about binary compatibility for apps written in the old days (handwave something like that). Well… let’s see… IRIX will give you 64bit numbers in stat (stat64) unless you build with -o32 – giving you the old ABI. I just can’t see a use for GetFileSize….. somebody please enlighten me.

Which header would you include? Any Linux/UNIX person would think of something logical – say sys/stat.h (Linux man page says sys/types.h, sys/stat.h and unistd.h). No, nothing sensible like that. It’s “Declared in WinBase.h; include Windows.h”.

So… you thought that obviously somebody went through the API and gave you all this Ex function goodness to get rid of mucking about with parts of a 64bit int? You were wrong. Let me say it with this:

DWORD WINAPI GetCompressedFileSizeTransacted(
  __in       LPCTSTR lpFileName,
  __out_opt  LPDWORD lpFileSizeHigh,
  __in       HANDLE hTransaction
);

I’ll now tell you that this was introduced in Vista/Server 2008.

Obviously, you want to be able to use Transaction NTFS on Windows Vista with a compiler that doesn’t have 64 bit ints. Oh, and you must then make another function call to see if something went wrong?

But you know what… perhaps we can get away from this complete and utter world of madness and use stat()…. or rather… perhaps _stati64().

Well… you’d be fine except for the fact that these seem to lie to you (at least on Windows Server 2003 and Vista) – it seems that even Explorer can lie to you.

But perhaps you’ve been barking up the wrong tree… you obviously don’t want to find the file size at all – what you want is to FindFirstFile! No, you don’t want FindFirstFileEx (in this case, Ex is for Extremely complicated). It’s meant to be faster too… you know, maybe.

So remember kids, smoke your crack pipe – you’re going to need it if using this thing called the Microsoft Windows File Management Functions.