MySQL removes the FRM (7 years after Drizzle did)

Posted on 2016-09-27 by Stewart Smith

The new MySQL 8.0.0 milestone release that was recently announced brings something that has been a looooong time coming: the removal of the FRM file. I was the one who implemented this in Drizzle way back in 2009 (July 28th 2009 according to Brian)- and I may have had a flashback to removing the tentacles of the FRM when reading the MySQL 8.0.0 announcement.

As an idea for how long this has been on the cards, I’ll quote Brian from when we removed it in Drizzle:

We have been talking about getting rid of FRM since around 2003. I remember a drive up to northern Finland with Kaj ArnÃ¶, where we spent an hour talking about this. I, David, and MontyW have talked about this for years.

http://krow.livejournal.com/642329.html

Soo… it was a known problem for at least thirteen years. One of the issues removing it was how pervasive all of the FRM related things were. I shudder at the mention of “pack_flag” and Jay Pipes probably does too.

At the time, we tried a couple of approaches as to how things should look. Our philosophy with Drizzle was that it should get out of the way at let the storage engines be the storage engines and not try to second guess them or keep track of things behind their back. I still think that was the correct architectural approach: the role of Drizzle was to put SQL on top of a storage engine, not to also be one itself.

Looking at the MySQL code, there’s one giant commit 31350e8ab15179acab5197fa29d12686b1efd6ef. I do mean giant too, the diffstat is amazing:

 786 files changed, 58471 insertions(+), 25586 deletions(-)

How anyone even remotely did code review on that I have absolutely no idea. I know the only way I could get it to work in Drizzle was to do it incrementally, a series of patches that gradually chiseled out what needed to be taken out so I could put it an API and the protobuf code.

Oh, and in case you’re wondering:

- uint offset,pack_flag;
+ uint offset;

Thank goodness. Now, you may not appreciate that as much as I might, but pack_flag was not the height of design, it was… pretty much a catchalll for some kind of data about a field that wasn’t something that already had a field in the FRM. So it may include information on if the field could be null or not, if it’s decimal, how many bytes an integer takes, that it’s a number and how many oh, just don’t ask.

Also gone is the weird interval_id and a whole bunch of limitations because of the FRM format, including one that I either just discovered or didn’t remember: if you used all 256 characters in an enum, you couldn’t create the table as MySQL would pick either a comma or an unused character to be the separator in the FRM!?!

Also changed is how the MySQL server handles default values. For those not aware, the FRM file contains a static copy of the row containing default values. This means the default values are computed once on table creation and never again (there’s a bunch of work arounds for things like AUTO_INCREMENT and DEFAULT NOW()). The new sql/default_values.cc is where this is done now.

For now at least, table metadata is also written to a file that appears to be JSON format. It’s interesting that a SQL database server is using a schemaless file format to describe schema. It appears that these files exist only for disaster recovery or perhaps portable tablespaces. As such, I’m not entirely convinced they’re needed…. it’s just a thing to get out of sync with what the storage engine thinks and causes extra IO on DDL (as well as forcing the issue that you can’t have MVCC into the data dictionary itself).

What will be interesting is to see the lifting of these various limitations and how MariaDB will cope with that. Basically, unless they switch, we’re going to see some interesting divergence in what you can do in either database.

There’s certainly differences in how MySQL removed the FRM file to the way we did it in Drizzle. Hopefully some of the ideas we had were helpful in coming up with this different approach, as well as an extra seven years of in-production use.

At some point I’ll write something up as to the fate of Drizzle and a bit of a post-mortem, I think I may have finally worked out what I want to say…. but that is a post for another day.

First look at MySQL 8.0.0 Milestone

Posted on 2016-09-23 by Stewart Smith

So, about ten days ago the MySQL Server Team released MySQL 8.0.0 Milestone to the world. One of the most unfortunate things about MySQL development is that it’s done behind closed doors, with the only hints of what’s to come arriving in maybe a note on a bug or such milestone releases that contain a lot of code changes. How much code change? Well, according to the text up on github for the 8.0 branch “This branch is 5714 commits ahead, 4 commits behind 5.7. ”

Way back in 2013, I looked at MySQL Code Size over releases, which I can again revisit and include both MySQL 5.7 and 8.0.0.

While 5.7 was a big jump again, we seem to be somewhat leveling off, which is a good thing. Managing to add features and fix long standing problems without bloating code size is good for software maintenance. Honestly, hats off to the MySQL team for keeping it to around a 130kLOC code size increase over 5.7 (that’s around 5%).

These days I’m mostly just a user of MySQL, pointing others in the right direction when it comes to some issues around it and being the resident MySQL grey(ing)beard(well, if I don’t shave for a few days) inside IBM as a very much side project to my day job of OPAL firmware.

So, personally, I’m thrilled about no more FRM, better Unicode, SET PERSIST and performance work. With my IBM hat on, I’m thrilled about the fact that it compiled on POWER out of the box and managed to work (I haven’t managed to crash it yet). There seems to be a possible performance issue, but hey, this is a huge improvement over the 5.7 developer milestones when run on POWER.

A lot of the changes are focused around usability, making it easier to manage and easier to run at at least a medium amount of scale. This is long overdue and it’s great to see even seemingly trivial things like SET PERSIST coming (I cannot tell you how many times that has tripped me up).

In a future post, I’ll talk about the FRM removal!

Lesson 124 in why scales on a graph matter…

Posted on 2016-09-22 by Stewart Smith

The original article presented two graphs: one of MariaDB searches (which are increasing) and the other showing MySQL searches (decreasing or leveling out). It turns out that the y axis REALLY matters.

I honestly expected better….

@mariadb that Trends graph is misleading. Steady over 12months, flattening from peak 12+y ago, well before MariaDB. pic.twitter.com/sP6H65FEvR

â€” Stewart Smith (@stewartsmith) September 22, 2016

Try “Will @mariadb ever replace @postgresql “. At current rate, equal on Google Trends in 12-20 years if no pg growth. pic.twitter.com/6BP9FuU1RU

â€” Stewart Smith (@stewartsmith) September 22, 2016

Video of my Percona Live Talk: Why would I run MySQL/MariaDB on POWER anyway?

Posted on 2016-05-06 by Stewart Smith

Good news everyone! There’s video up for the talk I gave at Percona Live in April 2016 up: Why would I run MySQL/MariaDB on POWER anyway?

The talk is a general overview of POWER and why MySQL/MariaDB may be a good fit.

“Toy” database on mainframes

Posted on 2016-04-29 by Stewart Smith

While much less common than 10 or 15 (err… even 20) years ago, you still sometimes hear MySQL being called a “toy” database. The good news is, when somebody says that, they’re admitting ignorance and you can help educate them!

Recently, a fellow IBMer submitted a pull request (and bug) to start having MySQL support on Z Series (s390x).

Generally speaking, when there’s effort being spent on getting something to run on Z, it is in no way considered a toy by those who’ll end up using it.

MySQL Contributions status

Posted on 2016-02-26 by Stewart Smith

This post is an update to the status of various MySQL bugs (some with patches) that I’ve filed over the past couple of years (or that people around me have). I’m not looking at POWER specific ones, as there are these too, but each of these bugs here deal with general correctness of the code base.

Firstly, let’s look at some points I’ve raised:

Incorrect locking for global_query_id (bug #72544)
Raised on May 5th, 2014 on the internals list. As of today, no action (apart from Dimitri verifying the bug back in May 2014). There continues to be locking that perhaps only works by accident around query IDs.Soon, this bug will be two years old.
Endian code based on CPU type rather than endian define (bug #72715)
About six-hundred and fifty days ago I filed this bug – back in May 2014, which probably has a relatively trivial fix of using the correct #ifdef of BIG_ENDIAN/LITTLE_ENDIAN rather than doing specific behavior based on #ifdef __i386__
What’s worse is that this looks like somebody being clever for a compiler in the 1990s, which unlikely ends up with the most optimal code today.
mysql-test-run.pl –valgrind-all does not run all binaries under valgrind (bug #74830)
Yeah, this should be a trivial fix, but nothing has happened since November 2014.
I’m unlikely to go provide a patch simply because it seems to take sooooo damn long to get anything merged.
MySQL 5.1 doesn’t build with Bison 3.0 (bug #77529)
Probably of little consequence, unless you’re trying to build MySQL 5.1 on a linux distro released in the last couple of years. Fixed in Maria for a long time now.

Trivial patches:

Incorrect function name in DBUG_ENTER (bug #78133)
Pull request number 25 on github – a trivial patch that is obviously correct, simply correcting some debug only string.
So far, over 191 days with no action. If you can’t get trivial and obvious patches merged in about 2/3rds of a year, you’re not going to grow contributions. Nearly everybody coming to a project starts out with trivial patches, and if a long time contributor who will complain loudly on the internet (like I am here) if his trivial patches aren’t merged can’t get it in, what chance on earth does a newcomer have?
In case you’re wondering, this is the patch:
```
--- a/sql/rpl_rli_pdb.cc
+++ b/sql/rpl_rli_pdb.cc
@@ -470,7 +470,7 @@ bool Slave_worker::read_info(Rpl_info_handler *from)
 
 bool Slave_worker::write_info(Rpl_info_handler *to)
 {
-  DBUG_ENTER("Master_info::write_info");
+  DBUG_ENTER("Slave_worker::write_info");
```
InnoDB table flags in bitfield is non-optimal (bug #74831)
With a patch since I filed this back in November 2014, it’s managed to sit idle long enough for GCC 4.8 to practically disappear from anywhere I care about, and 4.9 makes better optimization decisions. There are other reasons why C bitfields are an awful idea too.

Actually complex issues:

InnoDB mutex spin loop is missing GCC barrier (bug #72755)
Again, another bug filed back in May 2014, where InnoDB is doing a rather weird trick to attempt to get the compiler to not optimize away a spinloop. There’s a known good way of doing this, it’s called a compiler barrier. I’ve had a patch for nearly two years, not merged :(
buf_block_align relies on random timeouts, volatile rather than memory barriers (bug #74775)
This bug was first filed in November 2014 and deals with a couple of incorrect assumptions about memory ordering and what volatile means.
While this may only exhibit a problem on ARM and POWER processors (as well as any other relaxed memory ordering architectures, x86 is the notable exception), it’s clearly incorrect and very non-portable.
Don’t expect MySQL 5.7 to work properly on ARM (or POWER). Try this:
```
./mysql-test-run.pl rpl.rpl_checksum_cache --repeat=10
```
You’ll likely find MySQL > 5.7.5 still explodes.
In fact, there’s also Bug #79378 which Alexey Kopytov filed with patch that’s been sitting idle since November 2015 which is likely related to this bug.

Not covered here: universal CRC32 hardware acceleration (rather than just for innodb data pages) and other locking issues (some only recently discovered). I also didn’t go into anything filed in December 2015… although in any other project I’d expect something filed in December 2015 to have been looked at by now.

Like it or not, MySQL is still upstream for all the MySQL derivatives active today. Maybe this will change as RocksDB and TokuDB gain users and if WebScaleSQL, MariaDB and Percona can foster a better development community.

POWER8 Accelerated CRC32 merged in MariaDB 10.1

Posted on 2015-12-18 by Stewart Smith

Earlier on in benchmarking MySQL and MariaDB on POWER8, we noticed that on write workloads (or read workloads involving a lot of IO) we were spending a bunch of time computing InnoDB page checksums. This is a relatively well known MySQL problem and has existed for many years and Percona even added innodb_fast_checksum to Percona Server to help alleviate the problem.

In MySQL 5.6, we got the ability to use CRC32 checksums, which are great in that they’re a lot faster to compute than tho old InnoDB “new” checksum. There’s code inside InnoDB to use the x86 SSE2 crc32q instruction to accelerate performing the checksum on compatible x86 CPUs (although oddly enough, the CRC32 checksum in the binlog does not use this acceleration).

However, on POWER, we’d end up using the software implementation of CRC32, which used a lot more CPU than we’d like. Luckily, CRC32 is really common code and for POWER8, we got some handy instructions to help computing it. Unfortunately, this required brushing up on vector polynomial math in order to understand how to do it all quickly. The end result was Anton coming up with crc32-vpmsum code that we could drop into projects that embed a copy of crc32 that was about 41 times faster than the best non-vpmsum implementation.

Recently, Daniel Black took the patch that had passed through both Daniel Axten‘s and my hands and worked on upstreaming it into MariaDB and MySQL. We did some pretty solid benchmarking on the improvement you’d get, and we pretty much cannot notice the difference between innodb_checksum=off and having it use the POWER8 accelerated CRC32 checksum, which frees up maybe 30% of CPU time to be used for things like query execution! My original benchmark showed a 30% improvement in sysbench read/write workloads.

The excellent news? Two days ago, MariaDB merged POWER8 accelerated crc32! This means that IO heavy workloads on MariaDB on POWER8 will get much faster in the next release.

MySQL bug 74776 is open, with patch attached, so hopefully MySQL will merge this soon too (hint hint).

1 Million SQL Queries per second: GA MariaDB 10.1 on POWER8

Posted on 2015-10-19 by Stewart Smith

A couple of days ago, MariaDB announced that MariaDB 10.1 is stable GA – around 19 months since the GA of MariaDB 10.0. With MariaDB 10.1 comes some important scalabiity improvements, especially for POWER8 systems. On POWER, we’re a bit unique in that we’re on the higher end of CPUs, have many cores, and up to 8 threads per core (selectable at runtime: 1, 2, 4 or 8/core) – so a dual socket system can easily be a 160 thread machine.

Recently, we (being IBM) announced availability of a couple of new POWER8 machines – machines designed for Linux and cloud environments. They are very much OpenPower machines, and more info is available here: http://www.ibm.com/marketplace/cloud/commercial-computing/us/en-us

Combine these two together, with Axel Schwenke running some benchmarks and you get 1 Million SQL Queries per second with MariaDB 10.1 on POWER8.

Having worked a lot on both MySQL for POWER and the firmware that ships in the S882LC, I’m rather happy that 1 Million queries per second is beyond what it was in June 2014, which was a neat hack on MySQL 5.7 that showed the potential of MySQL on POWER8 but wasn’t yet a product. Now, you can run a GA release of MariaDB on GA POWER8 hardware designed for scale-out cloud environments and get 1 Million SQL queries/second (with fewer cores than my initial benchmark last year!)

What’s even more impressive is that this million queries per second is in a KVM guest!

doing nothing on modern CPUs

Posted on 2015-08-28 by Stewart Smith

Sometimes you don’t want to do anything. This is understandably human, and probably a sign you should either relax or get up and do something.

For processors, you sometimes do actually want to do absolutely nothing. Often this will be while waiting for a lock. You want to do nothing until the lock is free, but you want to be quick about it, you want to start work once that lock is free as soon as possible.

On CPU cores with more than one thread (e.g. hyperthreading on Intel, SMT on POWER) you likely want to let the other threads have all of the resources of the core if you’re sitting there waiting for something.

So, what do you do? On x86 there’s been the PAUSE instruction for a while and on POWER there’s been the SMT priority instructions.

The x86 PAUSE instruction delays execution of the next instruction for some amount of time while on POWER each executing thread in a core has a priority and this is how chip resources are handed out (you can set different priorities using special no-op instructions as well as setting the Relative Priority Register to map how these coarse grained priorities are interpreted by the chip).

So, when you’re writing spinlock code (or similar, such as the implementation of mutexes in InnoDB) you want to check if the lock is free, and if not, spin for a bit, but at a lower priority than the code running in the other thread that’s doing actual work. The idea being that when you do finally acquire the lock, you bump your priority back up and go do actual work.

Usually, you don’t continually check the lock, you do a bit of nothing in between checking. This is so that when the lock is contended, you don’t just jam every thread in the system up with trying to read a single bit of memory.

So you need a trick to do nothing that the complier isn’t going to optimize away.

Current (well, MySQL 5.7.5, but it’s current in MariaDB 10.0.17+ too, and other MySQL versions) code in InnoDB to “do nothing” looks something like this:

ulint ut_delay(ulint   delay)
{
        ulint   i, j;
        UT_LOW_PRIORITY_CPU();
        j = 0;
        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }
        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }
        UT_RESUME_PRIORITY_CPU();
        return(j);
}

On x86, UT_RELAX_CPU() ends up being the PAUSE instruction.

On POWER, the UT_LOW_PRIORITY_CPU() and UT_RESUME_PRIORITY_CPU() tunes the SMT thread priority (and on x86 they’re defined as nothing).

If you want an idea of when this was all written, this comment may be a hint:

/*!< in: delay in microseconds on 100 MHz Pentium */

But, if you’re not on x86 you don’t have the PAUSE instruction, instead, you end up getting this code:

# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which you may think “yep, that does nothing and is not optimized away by the compiler”. Except you’d be wrong! What it actually does is generates a lot of memory traffic. You’re now sitting in a tight loop doing atomic operations, which have to be synchronized between cores (and sockets) since there’s no real way that the hardware is going to be able to work out that this is only a local variable that is never accessed from anywhere.

Additionally, the ut_always_false and j variable there is also attempts to trick the complier into not optimizing the loop away, and since ut_always_false is a global, you’re generating traffic to a single global variable too.

Instead, what’s needed is a compiler barrier. This simple bit of nothing tells the compiler “pretend memory has changed, so you can’t optimize around this point”.

__asm__ __volatile__ ("":::"memory")

So we can eliminate all sorts of useless non-work and instead do what we want: do nothing (a for loop for X iterations that isn’t optimized away by the compiler) and don’t have side effects.

In MySQL bug 74832 I detailed this with the appropriately produced POWER assembler. Unfortunately, this patch (submitted under the OCA) has sat since November 2014 (so, over 9 months) with no action. I’m a bit disappointed by that to be honest.

Anyway, the real moral of this story is: don’t implement your own locking primitives. You’re either going to get it wrong or you’ll be wrong in a few years when everything changes under you.

MySQL on NUMA machines just got better!

Posted on 2015-07-14 by Stewart Smith

A followup to my previous entry , my patch that was part of Bug #72811 Set NUMA mempolicy for optimum mysqld performance has been merged!

I hope it’s enabled by default so that everything “just works”.

I also hope it filters down through MariaDB and Percona Server fairly quickly.

Also, from the release notes on that bug, I think we can expect 5.7.8 any day now.

The sad state of MySQL and NUMA

Posted on 2015-07-08 by Stewart Smith

Way back in 2010, MySQL Bug 57241 was filed, pointing out that the “swap insanity” problem was getting serious on x86 systems – with NUMA being more and more common back then.

The swapping problem is due to running out of memory on a NUMA node and having to swap things to other nodes (see Jeremy Cole‘s blog entry also from 2010 on the topic of swap insanity). This was back when 64GB and dual quad core CPUs was big – in the past five years big systems have gotten bigger.

Back then there were two things you could do to have your system be usable: 1) numa=off as kernel boot parameter (this likely has other implications though) and 2) “numactl –interleave all” in mysqld_safe script (I think MariaDB currently has this built in if you set an option but I don’t think MySQL does, otherwise perhaps the bug would have been closed).

Anyway, it’s now about 5 years since this bug was opened and even when there’s been a patch in the Twitter MySQL branch for a while (years?) and my Oracle Contributor Agreement signed patch attached to bug 72811 since May 2014 (over a year) we still haven’t seen any action.

My patch takes the approach of you want things allocated at server startup to be interleaved across nodes (e.g. buffer pool) while runtime allocations are probably per connection and are thus fine (in fact, better) to do node local allocations.

Without a patch like this, or without running mysqld with the right numactl incantation, you end up either having all your memory on one NUMA node (potentially not utilising full memory bandwidth of the hardware), or you end up with swap insanity, or you end up with some other not exactly what you’d expect situation.

While we could have MySQL be more NUMA aware and perhaps do a buffer pool instance per NUMA node or some such thing, it’s kind of disappointing that for dedicated database servers bought in the past 7+ years (according to one comment on one of the bugs) this crippling issue hasn’t been addressed upstream.

Just to make it even more annoying, on certain workloads you end up with a lot of mutex contention, which can end up meaning that binding MySQL to fewer NUMA nodes (memory and CPU) ends up increasing performance (cachelines don’t have as far to travel) – this is a different problem than swap insanity though, and one that is being addressed.

Update: My patch as part of https://bugs.mysql.com/bug.php?id=72811 has been merged! MySQL on NUMA machines just got a whole lot better. I just hope it’s enabled by default…

Going beyond 1.3 MILLION SQL Queries/second

Posted on 2015-04-28 by Stewart Smith

So, on a large IBM POWER8 system I was recently running the newly coined “yesmark” benchmark, which is best translated as this:

Benchmark (N for concurrency): for i in {1..N}; do yes "DO 0;" | mysql > /dev/null & done
Live results: mysqladmin -ri 1 extended-status | grep Questions

Which sounds all fun until you realize that it’s *amazingly* close in results to a sysbench point select benchmark these days (well, with MySQL 5.7.7).

Since yesmark doesn’t use InnoDB though, MariaDB is back in the game.

I don’t think it matters between MariaDB and MySQL at this point for yesbench. With MySQL in a KVM guest on a shared 2 socket POWER8 I could get 754kQPS and on a larger system, I could get 1.3 million / sec.

1.3 Million queries / sec is probably the highest number anybody has ever seen out of MySQL or MariaDB, so that’s fairly impressive in itself.

What’s also impressive is that on this workload, mysqld was still only using 50% of CPU in the system. The mysql command line client was really heavy user.

Other users are: 8% completely idle, another 12% in linux scheduler (alarmingly high really). So out of all execution time, only about 44% spent in mysqld, 29% in mysql client.

It seems that the current issues scaling to two socked POWER8 machines are the same as with scaling to other large systems, when we go beyond about 20 POWER8 cores (SMT8), we start to find new and interesting challenges.

Towards (and beyond) ONE MILLION queries per second

Posted on 2015-04-10 by Stewart Smith

At Percona Live MySQL Conference 2015 next week I’ll be presenting on “Towards One MILLION queries per second” on 14th April at 4:50pm in Ballroom A.

This is the story of work I’ve been doing to get MySQL executing ONE MILLION SQL queries per second. It involves tales of MySQL, tales of the POWER8 Processor and a general amount of fun in extracting huge amounts of performance.

As I speak, I’m working on some even more impressive benchmark results! New hardware, new MySQL versions and really breaking news on MySQL scalability.

Preliminary results from POWER8 optimized CRC32 for MySQL

Posted on 2015-02-13 by Stewart Smith

So, Anton got some useful code working that I could patch into a MySQL server for testing purposes – a POWER8 optimized CRC32 implementation.

I went with a pretty stock MySQL 5.6.22 (one patch) with sysbench preparing a single 2GB table (10,000,000 rows). I then hacked up innochecksum so that it would only do the correct CRC32 (rather than trying each checksum type). Using the standard CRC32 algorithm it took around three seconds to verify all of the checksums. With a POWER8 optimized CRC32: 0.4-0.5 seconds. Useful speed-up!

I then ran sysbench read/write with 16 threads with oltp-table-size=10000 (on the larger table) to see if there would be an improvement in a “real world” workload. I got about 30% better performance on read/write operations!

Using perf to see where CPU was going, CPU time spent doing CRC32 calculations went down from ~2.5% to ~0.25%!

In theory, we should be able to get about 52GiB/sec of CRC32 out of a 4.1Ghz POWER8 core. I don’t think we’ll be hitting this in MySQL any time soon.

Give us another week or two and we’ll likely have a patch that’s ready to merge.

Initial benchmarks look promising though!

MySQL-next = Drizzle 5 years ago?

Posted on 2015-02-07 by Stewart Smith

With JSON functionality, alternate protocols (HTTP, memcache), a move towards saner defaults and crash safety, pluggable logging etc it really looks like MySQL is following what we did in Drizzle years ago, which is great!

C bitfields considered harmful

Posted on 2014-11-14 by Stewart Smith

In C (and C++) you can specify that a variable should take a specific number of bits of storage by doing “uint32_t foo:4;” rather than just “uint32_t foo”. In this example, the former uses 4 bits while the latter uses 32bits. This can be useful to pack many bit fields together.

Or, that’s what they’d like you to think.

In reality, the C spec allows the compiler to do just about anything it wants with these bitfields – which usually means it’s something you didn’t expect.

For a start, in a struct -e.g. “struct foo { uint32_t foo:4; uint32_t blah; uint32_t blergh:20; }” the compiler could go and combine foo and blergh into a single uint32_t and place it somewhere… or it could not. In this case, sizeof(struct foo) isn’t defined and may vary based on compiler, platform, compiler version, phases of the moon or if you’ve washed your hands recently.

Where this can get interesting is in network protocols (OMG DO NOT DO IT), APIs (OMG DO NOT DO IT), protecting different parts of a struct with different mutexes (EEP, don’t do it!) and performance.

I recently filed MySQL bug 74831 which relates to InnoDB performance on POWER8. InnoDB uses C bitfields which are themselves bitfields (urgh) for things like “flag to say if this table is compressed”. At various parts of the code, this flag is checked.

When you apply this simple patch:

--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        Use DICT_TF_GET_COMPACT(), DICT_TF_GET_ZIP_SSIZE(),
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I get 10,000 key lookups/sec more than without it!

Why is this? If you go and read the bug, you’ll see that the amount of CPU time spent on the instruction checking the bit flag is actually about the same… and this puzzled me for a while. That is, until Anton reminded me that the PMU can be approximate and perhaps I should look at the loads.

Sure enough, the major difference is that with the bitfield in place (i.e. MySQL 5.7.5 as it stands today), there is a ld instruction doing the load – which is a 64bit load. In my patched version, it’s a lwx instruction – which is a 32bit load.

So, basically, we were loading 8 bytes instead of 4 every time we were checking if it was a compressed table.

So, along with yesterday’s lesson of never, ever, ever use volatile, today’s lesson is never, ever, ever use bitfields.

volatile considered harmful

Posted on 2014-11-13 by Stewart Smith

While playing with MySQL 5.7.5 on POWER8, I came across a rather interesting bug (74775 – and this is not the only one… I think I have a decent amount of auditing and patching to do now) which made me want to write a bit on memory barriers and the volatile keyword.

Memory barriers are hard.

Like, super hard. It’s the kind of thing that makes you curse hardware designers, probably because they’re not magically solving all your problems for you. Basically, as you get more CPU cores and each of them have caches, it gets more expensive to keep everything in sync. It’s quite obvious that with *ahem* an eventually consistent model, you could save a bunch of time and effort at the expense of shifting some complexity into software.

Those in the MySQL world should recognize this – we’ve been dealing with asynchronous replication for well over a decade as a good way to scale.

On some CPU architectures (POWER for example) not all loads are created equal. When you load a value from memory, it will be consistent with your thread of execution. That is, with any stores that you have done in this thread of execution. If another thread updates that memory location you may not see that update even if your load occurs after that thread updates that memory location. Think eventually consistent.

If you want up to date reads (and not clobber writes), then you get to do memory barriers! (a topic for elsewhere – the PowerISA document has good explanations of what we have on POWER though, and how load with reserve works).

What the volatile keyword does is generate load and store instructions. It is useful when talking to hardware, as the load and store instructions are actually doing something there that the compiler doesn’t know about and thus shouldn’t optimize away.

The volatile keyword does not add any memory barriers. This is important to realize – volatile just makes loads and stores happen for your thread, not in relation to any other threads of execution. Thus, you cannot use volatile as a thread synchronization mechanism at all. It is completely and totally wrong.

Basically, if you have a volatile variable and you do stores to it in one thread and loads in another, after the store happens, it could be quite a long time before the thread doing the loads sees it! For some applications this may be okay (although I can’t really think of any beyond very very inaccurate status variables)… but if it matters at all for application correctness, volatile is the wrong thing to use.

Preliminary MySQL Cluster benchmark results on POWER8

Posted on 2014-11-12 by Stewart Smith

Yesterday, I got the basics going for MySQL Cluster on POWER. Today, I finished up a couple more patches to improve performance and ran some benchmarks.

This is on a 3.7Ghz POWER8 machine with non-balanced memory (only 2 of the 4 NUMA nodes have memory, so we have less total memory bandwidth than we could have, plus I’m going to bind ndbmtd to the CPUs in these NUMA nodes)

With a setup of a single replica and two data nodes on the one machine (each bound to a specific NUMA node), running the flexAsync benchmark on MySQL Cluster 7.3.7, I could get around:

3.2 million reads/sec
2.6 million deletes/sec
2.4 million updates/sec
2.4 million inserts/sec.

So, that’s at least in the right ballpark for a first go.

(I’m running this on a big endian host kernel, some random kernel I booted on the box and built with gcc 4.8 with whatever build options the MySQL Cluster cmake foo chooses by default)

MySQL Cluster on POWER8

Posted on 2014-11-11 by Stewart Smith

So, I’ve written previously on MySQL on POWER, and today is a quick bit of news about MySQL Cluster on POWER – specifically MySQL Cluster 7.3.7.

I ran into three main issues in getting some flexAsync benchmark results. One of them was the fact that I wanted to do this in the middle of all the POWER8 machines I usually use moving buildings (hard to run benchmarks when computers are packed up in boxes on a truck).

The next issue was that ndbmtd (the multi-threaded data node) needs memory barriers for the magic message passing stuff between threads. So, that’s pretty easy (about an eight line patch).

The next issue was in the results from flexAsync, it turns out 32bit math is a bad idea with results from my POWER8 box.

My preliminary performance numbers are fairly promising (actually… what is the world record for a single machine and NDB these days? Single data node?). I think there’s a bit more low hanging fruit and a couple more things that are a bit more involved.

Bugs with patches:

Bug 74782 – compile fix (memory barriers for POWER)
Bug 74781 – flexAsync uses 32bit math, leading to incorrect summary on POWER8

MariaDB Foundation board

Posted on 2014-10-14 by Stewart Smith

There seems be a bit of an exodus from the MariaDB Foundation board recently… I’m not sure exactly what to make of it all, but the current members according to https://mariadb.org/en/foundation/ are:

Rasmus Johansson (chair)
Michael “Monty” Widenius
Jeremy Zawodny
Sergei Golubchik

With Jeremy Zawodny being the only non-MariaDB Corp member.

Recently, Jeremy Cole asked some people about their membership:

@jeremycole Nope. Resigned months ago.

— Mike Milinkovich (@mmilinkov) October 2, 2014

@jeremycole Hi Jeremy, no, sorry, I resigned a little while back.

— Andrew Katz (@andrewjskatz) October 1, 2014

@jeremycole yes, though I'm having some doubts about its future (for a variety of reasons) :-(

— Jeremy Zawodny (@jzawodn) October 1, 2014

I’m a little worried for the project, the idea of a foundation around it and for people I count as friends who work on MariaDB.

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: