DaveM on Ingo’s SMP lock validator

DaveM talks about Ingo’s new SMP lock validator for the Linux kernel.

A note reminding me to go take a look and see what can be ripped out and placed into various bits of MySQL and NDB. Ideally, of course, it could be turned into an LD_PRELOAD library for pthread mutexes.
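To make the idea concrete for myself, here is a minimal sketch of the bookkeeping a lock-order validator needs. It is not Ingo’s code and not a real LD_PRELOAD library: lv_acquire(), lv_release(), lv_reset() and LV_MAX_LOCKS are hypothetical names, locks are plain integer ids, and it only tracks a single thread. A real wrapper would call something like this from inside its pthread_mutex_lock() and pthread_mutex_unlock() replacements.

```c
#include <stdio.h>
#include <string.h>

#define LV_MAX_LOCKS 16   /* hypothetical limit for this sketch */

/* lv_order[a][b] != 0 means "b was acquired while a was held" at some point. */
static unsigned char lv_order[LV_MAX_LOCKS][LV_MAX_LOCKS];
/* Locks currently held, in acquisition order (one thread only, for brevity). */
static int lv_held[LV_MAX_LOCKS];
static int lv_nheld;

/* Forget everything that has been recorded so far. */
void lv_reset(void)
{
  memset(lv_order, 0, sizeof lv_order);
  lv_nheld = 0;
}

/*
 * Record that lock `id` is being taken.
 * Returns 0 if the order is consistent with history, 1 if it inverts an
 * order seen earlier (or re-takes a held lock), -1 on a bad id or full stack.
 */
int lv_acquire(int id)
{
  int i, inversion = 0;

  if (id < 0 || id >= LV_MAX_LOCKS || lv_nheld >= LV_MAX_LOCKS)
    return -1;
  for (i = 0; i < lv_nheld; i++) {
    int held = lv_held[i];
    if (held == id || lv_order[id][held]) {
      fprintf(stderr, "lock order problem: %d taken while holding %d\n",
              id, held);
      inversion = 1;
    }
    lv_order[held][id] = 1;   /* remember: held comes before id */
  }
  lv_held[lv_nheld++] = id;
  return inversion;
}

/* Record that lock `id` has been released (most recent acquisition first). */
void lv_release(int id)
{
  int i;

  for (i = lv_nheld - 1; i >= 0; i--) {
    if (lv_held[i] == id) {
      memmove(&lv_held[i], &lv_held[i + 1],
              (size_t)(lv_nheld - i - 1) * sizeof lv_held[0]);
      lv_nheld--;
      return;
    }
  }
}
```

The point of this style of checking is that the inversion is reported the first time both orders have been seen, even if the two code paths never actually race and deadlock during the run.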

Anybody who wants to look deeper into it before I wake up again is welcome to (and to tell me what they find).

ha_file

In what I laughingly call “spare time” I started hacking on ha_file.cc, otherwise known as the FILE storage engine. My idea is relatively simple: I want to be able to store and access my photos from MySQL. I also want the storage to be relatively efficient and to have the raw image files on disk, not tied up in some other format (my file system is pretty good at storing multi-megabyte files, thank you very much). The file system also doesn’t need anything fancy to re-use space when I delete things. I should also be able to serve the images directly (and efficiently) out of a web server (satisfying the efficiency itch). You could also use something like DMF to migrate old rows off to tape.
So, I started some hacking and some designing, and I now have a working design and a nearly working read/write implementation. I’ll share the code when it does, in fact, actually work (by “work” I mean reads and writes basic rows).
I’ve decided to go for the approach of storing columns in extended attributes. Why columns? ’cause then you can access them either from the command line or programmatically through another interface. It also adds an extra layer of evil. With XFS and sufficiently large inodes, these should all fit in the inode anyway. ext3 also has some nice optimisations that should help with performance too.

For blob data, I plan to just store that in the file. In my photos table example, you could then just run an image browser (e.g. gthumb) on the data directory for the table and see your images. It also means that recovery programs (see my jpeg_recover.c) will work as well.

Knowing the primary key of the row (which I plan to use as the file name for the row) then allows us to generate URLs that could be directly served by a lightweight http server, avoiding all that database code when you’re just serving up an image to a client.
Symbolic links can be used to have indexes.
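
As a rough sketch of the symlink-as-index idea (not code from ha_file.cc): each index is a directory, each key value is a symlink in it, and the link target is the row file. The function names and the one-link-per-key layout are assumptions made for illustration, so this only models a unique index, and key values containing a slash are simply rejected.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Add an index entry: index_dir/key_value becomes a symlink to row_path.
 * Returns 0 on success, -1 on error (including a key that already exists).
 */
int add_index_entry(const char *index_dir, const char *key_value,
                    const char *row_path)
{
  char link_path[4096];
  int n;

  /* A key is used as a file name here, so it must not be empty or hold '/'. */
  if (key_value[0] == '\0' || strchr(key_value, '/') != NULL)
    return -1;
  n = snprintf(link_path, sizeof link_path, "%s/%s", index_dir, key_value);
  if (n < 0 || (size_t) n >= sizeof link_path)
    return -1;
  return symlink(row_path, link_path);
}

/*
 * Index lookup: copy the row path for key_value into out (NUL-terminated).
 * Returns the length of the path, or -1 if the key is not in the index.
 */
ssize_t lookup_index_entry(const char *index_dir, const char *key_value,
                           char *out, size_t out_len)
{
  char link_path[4096];
  ssize_t len;
  int n;

  if (out_len == 0)
    return -1;
  n = snprintf(link_path, sizeof link_path, "%s/%s", index_dir, key_value);
  if (n < 0 || (size_t) n >= sizeof link_path)
    return -1;
  len = readlink(link_path, out, out_len - 1);
  if (len < 0)
    return -1;
  out[len] = '\0';
  return len;
}
```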

We can write new rows to a temp directory, sync them, then move them into place. Zero time crash recovery. Index consistency can be handled at runtime with a small extra check.
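
The write, sync, move sequence could look roughly like the following. This is a sketch under assumptions rather than the engine’s code: write_row_atomically() is a hypothetical name, both paths are assumed to be on the same file system (rename() is only atomic within one), and syncing the containing directory afterwards is left out.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a row to tmp_path, force it to disk, then move it into place.
 * Readers see either no row file or a complete one, never a partial write.
 * Returns 0 on success, -1 on failure (the temp file is cleaned up).
 */
int write_row_atomically(const char *tmp_path, const char *final_path,
                         const void *data, size_t len)
{
  const char *p = data;
  int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0644);

  if (fd < 0)
    return -1;
  while (len > 0) {
    ssize_t n = write(fd, p, len);
    if (n < 0) {
      if (errno == EINTR)
        continue;               /* interrupted, just retry */
      close(fd);
      unlink(tmp_path);
      return -1;
    }
    p += n;
    len -= (size_t) n;
  }
  if (fsync(fd) != 0) {         /* data must be on disk before the rename */
    close(fd);
    unlink(tmp_path);
    return -1;
  }
  if (close(fd) != 0) {
    unlink(tmp_path);
    return -1;
  }
  if (rename(tmp_path, final_path) != 0) {
    unlink(tmp_path);
    return -1;
  }
  return 0;
}
```

After a crash, anything still sitting in the temp directory was never committed and can just be deleted, which is where the “zero time” recovery comes from.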

At some point I should write down how I plan to do isolation levels too, but that’s for another day.

I at least hope that the resulting code may be a useful example for people wanting to implement a storage engine.

A simple implementation should be fairly fast too (with a slightly tuned file system).

I heart valgrind (or: an early patch integrating the MySQL MEM_ROOT stuff with valgrind)

Everybody knows that valgrind is great.

Well, I was observing a problem in some MySQL code: it looked like we were writing over some memory that we weren’t meant to (as the structure hadn’t been initialised yet). But, seeing as this was memory that had been allocated off a MEM_ROOT (one of our memory allocators), valgrind wasn’t going to spit out anything.

This is because this bit of memory had already been allocated and subsequently “freed”, but then reallocated. The “free”ing overwrites the memory with garbage (which is what the MEM_ROOT code does) so that you should see crashes (and a pattern) when you do something bad.

The traditional way to troubleshoot this is to modify your memory allocator so that it just calls malloc() and free() all the time (and valgrind will trap them). We have some code in there to do that too. However, this requires changing some ifdefs and is probably not as efficient.

Valgrind has some macros you can insert into your memory allocator code that tell valgrind that you have “freed” this memory (VALGRIND_MAKE_NOACCESS), have allocated it (VALGRIND_MAKE_WRITABLE) or have written valid data to it (VALGRIND_MAKE_READABLE). These are semi-documented in the valgrind/memcheck.h header (the memory pool macros live in valgrind/valgrind.h).

These are designed to only add a few CPU instructions to your code, so it should be possible to always have them in your build (you can stop them doing anything by building with -DNVALGRIND IIRC).

(I haven’t done any benchmarks on the code to see if there is any impact though).

Valgrind also has a great (largely undocumented) feature of being able to integrate with memory pools. Since our MEM_ROOT is largely just this, we can get some added benefits here too (one should be better valgrind warnings when we do some bad stuff).
It lets you associate memory with a memory pool, and then just say “this pool has been freed”. Saves you having to keep track of each pointer in the code to pass to “free”. It also can give you valgrind warnings when you try and allocate memory to something that hasn’t been initialised as a memory pool.
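
To show the shape of it outside of MEM_ROOT, here is a toy bump allocator annotated with the client requests. It is a sketch, not the patch: toy_pool and its functions are made-up names, it uses the newer VALGRIND_MAKE_MEM_NOACCESS spelling found in later Valgrind releases (the patch in this post uses the older VALGRIND_MAKE_NOACCESS name), and it falls back to no-op macros when the Valgrind headers are not installed so that it still builds.

```c
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/valgrind.h>
#    include <valgrind/memcheck.h>
#    define TOY_HAVE_VALGRIND 1
#  endif
#endif
#ifndef TOY_HAVE_VALGRIND
/* No-op stand-ins so the sketch compiles without Valgrind's headers. */
#  define VALGRIND_CREATE_MEMPOOL(pool, rzB, is_zeroed) ((void) 0)
#  define VALGRIND_DESTROY_MEMPOOL(pool)                ((void) 0)
#  define VALGRIND_MEMPOOL_ALLOC(pool, addr, size)      ((void) 0)
#  define VALGRIND_MAKE_MEM_NOACCESS(addr, len)         ((void) 0)
#endif

/* A toy pool: one fixed buffer, allocations are carved off the front. */
typedef struct {
  unsigned char buf[256];
  size_t used;
} toy_pool;

void toy_pool_init(toy_pool *p)
{
  p->used = 0;
  /* Register the pool, then mark its unallocated space as off limits. */
  VALGRIND_CREATE_MEMPOOL(p, 0, 0);
  VALGRIND_MAKE_MEM_NOACCESS(p->buf, sizeof p->buf);
}

void *toy_pool_alloc(toy_pool *p, size_t n)
{
  void *ptr;

  if (n > sizeof p->buf)
    return NULL;
  n = (n + 7) & ~(size_t) 7;            /* round up to 8 bytes */
  if (n > sizeof p->buf - p->used)
    return NULL;
  ptr = p->buf + p->used;
  p->used += n;
  /* Tell valgrind this chunk now belongs to the pool and may be written. */
  VALGRIND_MEMPOOL_ALLOC(p, ptr, n);
  return ptr;
}

void toy_pool_free_all(toy_pool *p)
{
  /* One call "frees" every chunk; no per-pointer bookkeeping needed. */
  VALGRIND_DESTROY_MEMPOOL(p);
  VALGRIND_MAKE_MEM_NOACCESS(p->buf, sizeof p->buf);
  p->used = 0;
  VALGRIND_CREATE_MEMPOOL(p, 0, 0);
}
```

Run natively, the client requests do nothing useful; under memcheck, touching a chunk after toy_pool_free_all() should be reported as an invalid access instead of silently reading trashed memory.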

The most interesting thing about writing the patch was finding some false positive warnings. Namely, a trick used in a couple of places (I see 2) in the code is to create a temporary memory root on the stack, allocate a larger block of memory and then “swap” the memory roots to be based in this block of memory. I had to write a swap_root function to implement this as valgrind doesn’t export a “swap memory pool” function. It would be a useful addition; maybe I’ll go and suggest it to the developers.

Anyway, I got over that hurdle and now have this patch which seems to work pretty well. I still get a couple of (possible) false positives. We’ll see if this finds any neat bugs. Also, a good exercise would be to see how many extra instructions are really generated and if this has any effect on performance at all.

===== include/my_sys.h 1.196 vs edited =====
--- 1.196/include/my_sys.h 2006-05-22 20:04:34 +10:00
+++ edited/include/my_sys.h 2006-05-26 16:22:11 +10:00
@@ -804,6 +804,7 @@
extern void set_prealloc_root(MEM_ROOT *root, char *ptr);
extern void reset_root_defaults(MEM_ROOT *mem_root, uint block_size,
uint prealloc_size);
+extern void swap_root(MEM_ROOT* new_root, MEM_ROOT* old);
extern char *strdup_root(MEM_ROOT *root,const char *str);
extern char *strmake_root(MEM_ROOT *root,const char *str,uint len);
extern char *memdup_root(MEM_ROOT *root,const char *str,uint len);
===== mysys/my_alloc.c 1.33 vs edited =====
--- 1.33/mysys/my_alloc.c 2005-11-24 07:44:54 +11:00
+++ edited/mysys/my_alloc.c 2006-05-26 19:21:12 +10:00
@@ -22,6 +22,8 @@
#undef EXTRA_DEBUG
#define EXTRA_DEBUG
+#include "valgrind/valgrind.h"
+#include "valgrind/memcheck.h"

/*
Initialize memory root
@@ -66,9 +68,12 @@
mem_root->free->size= pre_alloc_size+ALIGN_SIZE(sizeof(USED_MEM));
mem_root->free->left= pre_alloc_size;
mem_root->free->next= 0;
+ VALGRIND_MAKE_NOACCESS(mem_root->free+ALIGN_SIZE(sizeof(USED_MEM)),
+ pre_alloc_size);
}
}
#endif
+ VALGRIND_CREATE_MEMPOOL(mem_root,0,0);
DBUG_VOID_RETURN;
}

@@ -217,6 +222,9 @@
mem_root->first_block_usage= 0;
}
DBUG_PRINT("exit",("ptr: 0x%lx", (ulong) point));
+// fprintf(stderr,"root: %lx point: %lx size:%lx\n",mem_root,point,Size);
+ VALGRIND_MEMPOOL_ALLOC(mem_root,point,Size);
+ VALGRIND_MAKE_WRITABLE(point,Size);
DBUG_RETURN(point);
#endif
}
@@ -286,7 +294,8 @@
for (next= root->free; next; next= *(last= &next->next))
{
next->left= next->size - ALIGN_SIZE(sizeof(USED_MEM));
- TRASH_MEM(next);
+ VALGRIND_MAKE_NOACCESS(next+ALIGN_SIZE(sizeof(USED_MEM)),next->left);
+// TRASH_MEM(next);
}

/* Combine the free and the used list */
@@ -296,7 +305,8 @@
for (; next; next= next->next)
{
next->left= next->size - ALIGN_SIZE(sizeof(USED_MEM));
- TRASH_MEM(next);
+ VALGRIND_MAKE_NOACCESS(next+ALIGN_SIZE(sizeof(USED_MEM)),next->left);
+// TRASH_MEM(next);
}

/* Now everything is set; Indicate that nothing is used anymore */
@@ -357,12 +367,55 @@
{
root->free=root->pre_alloc;
root->free->left=root->pre_alloc->size-ALIGN_SIZE(sizeof(USED_MEM));
- TRASH_MEM(root->pre_alloc);
+ //TRASH_MEM(root->pre_alloc);
root->free->next=0;
}
root->block_num= 4;
root->first_block_usage= 0;
+ VALGRIND_DESTROY_MEMPOOL(root);
+ VALGRIND_CREATE_MEMPOOL(root,0,0);
+ VALGRIND_MAKE_READABLE(root,sizeof(MEM_ROOT));
+ if(root->pre_alloc)
+ {
+ VALGRIND_MAKE_READABLE(root->pre_alloc, ALIGN_SIZE(sizeof(USED_MEM)));
+ VALGRIND_MEMPOOL_ALLOC(root,root->pre_alloc,root->pre_alloc->size);
+ VALGRIND_MAKE_READABLE(root->pre_alloc, ALIGN_SIZE(sizeof(USED_MEM)));
+ }
DBUG_VOID_RETURN;
+}
+
+void swap_root(MEM_ROOT* new_root, MEM_ROOT* old)
+{
+ memcpy((char*) new_root, (char*) old, sizeof(MEM_ROOT));
+ VALGRIND_DESTROY_MEMPOOL(old);
+ VALGRIND_CREATE_MEMPOOL(new_root,0,0);
+
+ reg1 USED_MEM *next;
+
+ VALGRIND_MEMPOOL_ALLOC(new_root,new_root,sizeof(MEM_ROOT));
+ VALGRIND_MAKE_READABLE(new_root,sizeof(MEM_ROOT));
+
+ /* iterate through (partially) free blocks */
+ next= new_root->free;
+ do
+ {
+ if(!next)
+ break;
+ VALGRIND_MEMPOOL_ALLOC(new_root,next,next->size-next->left);
+ VALGRIND_MAKE_READABLE(next,next->size-next->left);
+ next= next->next;
+ } while(1);
+
+ /* now go through the used blocks and mark them free */
+ next= new_root->used;
+ do
+ {
+ if(!next)
+ break;
+ VALGRIND_MEMPOOL_ALLOC(new_root,next,next->size-next->left);
+ VALGRIND_MAKE_READABLE(next,next->size-next->left);
+ next= next->next;
+ } while(1);
}

/*
===== sql/table.cc 1.215 vs edited =====
--- 1.215/sql/table.cc 2006-05-23 05:54:55 +10:00
+++ edited/sql/table.cc 2006-05-26 18:12:21 +10:00
@@ -150,7 +150,8 @@

#endif

- memcpy((char*) &share->mem_root, (char*) &mem_root, sizeof(mem_root));
+// memcpy((char*) &share->mem_root, (char*) &mem_root, sizeof(mem_root));
+ swap_root(&share->mem_root,&mem_root);
pthread_mutex_init(&share->mutex, MY_MUTEX_INIT_FAST);
pthread_cond_init(&share->cond, NULL);
}
@@ -252,7 +253,7 @@
hash_free(&share->name_hash);

/* We must copy mem_root from share because share is allocated through it */
- memcpy((char*) &mem_root, (char*) &share->mem_root, sizeof(mem_root));
+ swap_root(&mem_root,&share->mem_root);//memcpy((char*) &mem_root, (char*) &share->mem_root, sizeof(mem_root));
free_root(&mem_root, MYF(0)); // Free's share
DBUG_VOID_RETURN;
}
===== storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp 1.87 vs edited =====
--- 1.87/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp 2006-04-25 22:02:07 +10:00
+++ edited/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp 2006-05-26 12:15:43 +10:00
@@ -3148,9 +3148,23 @@

CreateTableRecordPtr createTabPtr;
ndbrequire(c_opCreateTable.find(createTabPtr, callbackData));

- //@todo check error
- ndbrequire(createTabPtr.p->m_errorCode == 0);
+
+ if(createTabPtr.p->m_errorCode != 0)
+ {
+ char buf[255];
+ TableRecordPtr tabPtr;
+ c_tableRecordPool.getPtr(tabPtr, createTabPtr.p->m_tablePtrI);
+
+ BaseString::snprintf(buf, sizeof(buf),
+ "Unable to restart, fail while creating table %d"
+ " error: %d. Most likely change of configuration",
+ tabPtr.p->tableId,
+ createTabPtr.p->m_errorCode);
+ progError(__LINE__,
+ NDBD_EXIT_INVALID_CONFIG,
+ buf);
+ ndbrequire(createTabPtr.p->m_errorCode == 0);
+ }

Callback callback;
callback.m_callbackData = callbackData;

Upgrade to OpenOffice.org 2.0.2 and stop murderous urges

It’s no great secret that I think the stability of OpenOffice.org2 Impress in what’s shipped in Ubuntu Breezy leaves a lot to be desired. By ‘a lot’ I mean copy and pasting is unreliable and the Slide Sorter just stopped working for me without crashes (in at least one document).

However, I took the plunge and did something I usually don’t like doing – installing non-official debs.

deb http://people.ubuntu.com/~doko/ubuntu/ breezy-updates/
deb-src http://people.ubuntu.com/~doko/ubuntu/ breezy-updates/
I am now a much happier camper.

Saving is still amazingly slow, but the lack of crashes has made my week.

doko is my hero for the week. A Tip Of The Hat for him.

Nirvana

Today I’ve put With The Lights Out in the virtual CD player (read: amaroK) and am remembering why I like Nirvana. I just haven’t listened to it in long enough. Actually, now that I come to think of it… long enough is like, what, a week? I know I was listening to it on the way to/from SFO on the plane.

Now, after spending the morning on financial issues (tax not fun, especially the weird way I end up getting paid and the complications it has) I may actually get to some work…. Well, conference tutorial preparation… Doing a tutorial where you get the people in the room to do things is always tricky – and nerve-racking.

Kristian on “How to blog for a planet”

How to blog for a planet – MySQL-dump

I have to say I disagree with the whole teaser/article body thing. I really don’t like having RSS feeds that don’t contain the full article. It means I can’t read them offline. I often like to catch up on RSS while offline. I also don’t particularly feel the need to have to make yet another click to view the content of an article.

Yes, it’s a little more bandwidth. But really, it’s cheap. Especially with mod_gzip and whatever else optimised foo we can do.

Maybe planet aggregators could get more clever in summarising entries? Or not. How many people actually read a planet from the web site anyway?

An Apple article on MySQL on Mac OS X

MySQL on Mac OS X: An Ideal Development Combination

They got one bit a bit unclear. They say “In fact, the development team at MySQL AB uses the Mac platform for developing the MySQL server software itself.” Which is misleading at best if not downright wrong.

Yes, some people do use MacOS X. But some also use Microsoft Windows, some FreeBSD and a lot use Linux (various flavours – mine’s Ubuntu). The way their sentence reads is that we only use the Mac platform. This is wrong. They even quote Brian later on as saying that “A significant number of the developers inside MySQL AB use MacOS X as one of their development platforms.” So they can’t be ignorant of the fact that the Mac is just another platform.
They then go on to, again, mislead at best: “the MySQL database was originally an open-source project, but is now owned by a commercial enterprise”. WHAT??? Oh, if you read the next sentence (and disregard this one) you find out you can get both commercial and free, open source licenses.

Apart from that, it reads like marketing. Good for us though, more exposure of MySQL to OSX people is a good thing.

Oh, and I am pointing this out to Apple too. I’m not some asshole who just whines on his blog :)

UPDATE: Brian mentions in the comments of this entry that Apple is taking the feedback seriously and has contacted him about my feedback. So, I’m quite impressed. In fact, kudos to Apple. Anybody who actually takes notice of comments submitted on their web site is doing pretty good. I also have the feeling that this entry perhaps came over a little strong…. so go back and interpret it as “hey, maybe people could read the article and get the wrong idea”.

last.fm

Set myself up on last.fm, changed to amaroK for playing music (so things go to last.fm) and added foo to the sidebar of my blog. I guess the trick now would be to get something to auto-add my current tune to the bottom of each entry. Maybe :)

I’m a GNOME boy, but amaroK seems to be leaps and bounds ahead of either rhythmbox or the version of banshee that ships with Ubuntu Breezy.

My main complaint with amaroK is that it looks nothing like my other desktop applications – it stands out that it’s a KDE app and not a GNOME app. Some clue on how to fix this would be appreciated.

Back to last.fm, I think the goal is to help suggest music that you may like by looking at what you listen to and what other people who listen to some of the stuff you listen to listen to. Seems interesting at least.

Newfangled technology to remove the “so what have you been listening to?” question from the list of things to talk about with friends :)

How auto_increment is implemented in NDB

I was writing this in an email to a co-worker today; it could possibly interest people in the outside world as well. It’s a good idea to look at the source at the same time as reading this :)

In ha_ndbcluster::write_row(byte*),

if (table_share->primary_key != MAX_KEY)
{
  /*
   * Increase any auto_incremented primary key
   */
  if (has_auto_increment)
  {
    THD *thd= table->in_use;

    m_skip_auto_increment= FALSE;
    update_auto_increment();
    /* Ensure that handler is always called for auto_increment values */
    thd->next_insert_id= 0;
    m_skip_auto_increment= !auto_increment_column_changed;
  }
}

We set next_insert_id to 0 so that in handler::update_auto_increment() we end up calling the handler and never doing it just inside the server.

The handler function that we end up in is: ha_ndbcluster::get_auto_increment().

From here we end up inside NDB to do the actual work (not in the table handler).

Looking inside storage/ndb/src/ndbapi/Ndb.cpp at the method:

Ndb::getAutoIncrementValue(NdbDictionary::Table*,Uint32)

which really just calls Ndb::getTupleIdFromNdb(Uint32,Uint32)
which either returns a cached value, or goes off and does a call to NDB to get either 1 auto increment value or the full cacheSize we’ve requested (which is worked out in ha_ndbcluster::get_auto_increment()). This increment is done in the interestingly named Ndb::opTupleIdOnNdb(Uint32 aTableId, Uint64 opValue, Uint32 op) (with op=0).

This increments an entry in the SYSTAB_0 table inside the sys database in NDB. The row with SYSKEY_0 equal to the table id keeps the auto increment value. You can watch this by using a tool such as ndb_select_all on this table (grepping for the table id which you found with ndb_show_tables) while inserting rows into a table with an auto_increment value.
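
The “return a cached value, or go and reserve a batch” logic can be sketched like this. It is an illustration, not the NDB API code: id_cache, reserve_range() and next_auto_increment() are hypothetical names, and a plain static variable stands in for the SYSTAB_0 row (starting it at 1 is also just an assumption).

```c
#include <stdint.h>

/* A range of ids already reserved for this client: [next, last). */
typedef struct {
  uint64_t next;
  uint64_t last;
} id_cache;

/* Stand-in for the shared counter row; the real one lives in the cluster. */
static uint64_t shared_counter = 1;
/* Counts how often we had to go to the shared counter. */
static unsigned round_trips;

/* Reserve n consecutive ids from the shared counter, return the first. */
static uint64_t reserve_range(uint32_t n)
{
  uint64_t first = shared_counter;

  shared_counter += n;
  round_trips++;
  return first;
}

/*
 * Hand out the next id. Only when the cached range is used up do we go
 * back to the shared counter, reserving cache_size ids in one round trip.
 */
uint64_t next_auto_increment(id_cache *c, uint32_t cache_size)
{
  if (c->next == c->last) {
    if (cache_size == 0)
      cache_size = 1;
    c->next = reserve_range(cache_size);
    c->last = c->next + cache_size;
  }
  return c->next++;
}
```

With a cache size of 3, three inserts cost one trip to the shared counter and the fourth triggers the next one. The flip side of caching is that ids reserved but never used leave gaps in the sequence.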

A Googly MySQL Cluster Talk – Google Video

A Googly MySQL Cluster Talk – Google Video

The talk I gave at Google is now up on Google Video for all to see. I don’t think I gave it as well as I did at the User Conference (largely because, I think, by this time I was really tired), but it still went well (I think).

Feedback is much appreciated – always looking for ways to improve my talks.

Oh, I’m also wearing an Augie March t-shirt.

Update: watching yourself give a presentation is a bit strange… but hopefully I can learn from watching my own talk.

Things I’ve learnt so far:

  • Some words are spoken a bit quickly/mumbly, probably due to not knowing how long this presentation would go for (it was a bit cobbled together in the two sessions before mine)
  • I (for whatever reason) had nano instead of micro for SCI latency. Oh well, hopefully nobody will notice my mistake (apart from when I blogged it. eep)
  • Make sure the mouse cursor on your slides is hidden/at the edge of the screen
  • At least for Google Video resolution, small text can’t be read (see my Components of a Cluster slide, you can’t read the names of the processes – ndbd, mysqld, ndb_mgmd). Getting around this can be interesting for some slides that are easily viewable on a projector, but not on the video. Doh.
    • One possible solution is to “zoom” in on important text.
    • For example, the “A Configuration” slide. When I talk about each config parameter, have it zoom up to a larger font size (on that big bit of white space on the right)
  • I really do want to walk around a room. Having to stand in the one place for the video camera was nearly killing me :)
    • This is interesting as for my UC presentation I was annoyed that I couldn’t get a wireless microphone. Instead I was getting tangled up in a microphone cord.
  • I should possibly have some good slides on the split brain problem and a quick example of how we avoid it (and why you need at least three machines)
  • I sometimes pronounce “server” strangely.
  • I should have a diagram that includes a magic load balancer box showing it load balancing between the mysql servers connected to the cluster.
  • I should possibly have a clear diagram showing where you can do queries (UPDATE) on cluster (everywhere) versus with replication (only the master).
  • I don’t think I should laugh when there’s something that I think is funny.
  • Be more careful about when I stop for a drink of water. Think about it. Between slides can be good, but only when there’s something up for people to look at… maybe the next topic (or something else exciting)
    • I’m unsure if photos of scantily clad models are suitable here… :)
  • For “find out more” – have screenshots of the various things (and a photo of the book)
  • Need a clear definition of fragment.
  • I need to pronounce ‘configurable’ more clearly.
  • A laser pointer is not visible on Google Video

That’s all I wrote down while watching it. Other input welcome.

caught in the act of explaining how easy it is to provide food for me

In this photo: Gallery :: MySQL Users Conference 2006 :: 3 it seems that Jeremy managed to capture me (in the background) explaining to the O’Reilly conference woman (I forget her name…. Arjen knows) that it really wasn’t too hard for them to get me lunch considering that inside the building the restaurant staff managed to get food for me (so why can’t they do it outside). I also expressed my dismay at my (afternoon/morning… I forget now) interaction with the staff involving a request for an apple (see “How not to do customer service”).

This was after the catering staff took over an hour and a half to bring me a small plate of vegetables (mostly spinach – which I do like – but there’s only so much you can eat) and a salad (that never arrived).

I was also rather dismayed at the comment (from the conf woman) about how they “didn’t cater for vegans”. I then gave an explanation (that she seemed surprised by) that all vegetarians can eat food that is vegan – and a whole bunch of stuff they had out there could be vegan with next to no effort (e.g. not putting cheese in salad, vegetarian fried rice without egg). There’s also the fact that the sushi bar and the bar in the hotel (not 10ft from where we are) have managed to get me food easily and relatively promptly before.

Things did improve after this. Things like soy milk being available with morning/afternoon tea and coffee (and being well used, I’m not the only one who drinks it). Also, the appearance of fruit. Not surprisingly, it was mostly gone rather quickly (people like fruit).

Lesson for all conference organisers: have fruit. Everywhere. People like fruit. It’s good snack food that keeps your brain going (not your gut).

Oh, and I, of course, would have filled out the bit on the web form saying my dietary requirements. So it’s not as if they can claim ignorance.

Everything did work out eventually though. It just would have been nice to have lunch go smoother. Also, note to catering staff – kosher does not equal vegan. It really, really doesn’t.

10,000 Days

Have I mentioned how awesome the new Tool album is?

I can’t remember changing the star rating in rhythmbox, I think it even knows how awesome it is.

The album packaging itself is rather impressive. Stereoscopic glasses and all. I’ll take some photos and post.

I cannot wait until they tour. I’m very tempted to just fly to where they’re playing sooner. Hrrm… this could be getting a little too fanboy.

Interesting SSL with Australian System Administrator’s Conference 2006

The Australian System Administrator’s Conference 2006
It’s interesting that the online registration doesn’t have an SSL certificate that matches. I now have to find a printer to produce dead tree to mail.

Considering that I don’t actually own a printer, this is getting interesting…

P.S. come to my tutorial on MySQL Cluster!