MySQL Cluster on POWER8

I’ve written previously about MySQL on POWER, and today’s quick bit of news is about MySQL Cluster on POWER, specifically MySQL Cluster 7.3.7.

I ran into three main issues in getting some flexAsync benchmark results. The first was timing: I wanted to do this while all the POWER8 machines I usually use were moving buildings (it’s hard to run benchmarks when the computers are packed up in boxes on a truck).

The next issue was that ndbmtd (the multi-threaded data node) needs memory barriers for the message passing between threads, and these had to be added for POWER. That was pretty easy (about an eight-line patch).
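To illustrate why message passing between threads needs barriers at all, here is a minimal sketch of a producer handing a value to a consumer. It is not the actual patch and not NDB’s code: the `Mailbox` type and the `publish`, `consume` and `run_handoff` functions are hypothetical, and it uses standard C++ atomics rather than NDB’s own barrier macros. On POWER the compiler has to emit real barrier instructions (such as `lwsync`) to provide these orderings, whereas on x86 ordinary loads and stores already provide them, which is how missing barriers can go unnoticed on x86.

```cpp
#include <atomic>
#include <thread>

// Hypothetical single-slot mailbox; not NDB's actual data structure.
struct Mailbox {
    int payload = 0;                 // plain (non-atomic) message body
    std::atomic<bool> ready{false};  // flag the consumer polls
};

// Producer: write the payload, then publish it.
void publish(Mailbox& box, int value) {
    box.payload = value;
    // Release store: the payload write cannot be reordered after the flag write.
    box.ready.store(true, std::memory_order_release);
}

// Consumer: wait for the flag, then read the payload.
int consume(Mailbox& box) {
    // Acquire load: the payload read cannot be reordered before the flag read.
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Run one producer/consumer hand-off and return what the consumer saw.
int run_handoff(int value) {
    Mailbox box;
    int seen = 0;
    std::thread consumer([&] { seen = consume(box); });
    std::thread producer([&] { publish(box, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

Calling `run_handoff(42)` should return 42 on any conforming implementation. Without the release/acquire pair, a weakly ordered CPU would be allowed to let the consumer see the flag before the payload.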

The last issue was in the results from flexAsync: it turns out that 32-bit math is a bad idea with results from my POWER8 box, because the summary comes out wrong.
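As a rough illustration of how 32-bit math can corrupt a benchmark summary, the sketch below computes operations per second from an operation count and an elapsed time in milliseconds. The function names and the formula are assumptions made for this example and are not taken from flexAsync itself.

```cpp
#include <cstdint>

// Hypothetical summary helper using 32-bit arithmetic throughout.
// ops * 1000 wraps modulo 2^32 once ops exceeds about 4.29 million.
std::uint32_t ops_per_sec_32(std::uint32_t ops, std::uint32_t elapsed_ms) {
    return ops * 1000u / elapsed_ms;
}

// Same formula, widened to 64 bits before multiplying, so it cannot wrap
// for any 32-bit operation count.
std::uint64_t ops_per_sec_64(std::uint32_t ops, std::uint32_t elapsed_ms) {
    return static_cast<std::uint64_t>(ops) * 1000u / elapsed_ms;
}
```

With 5,000,000 operations in 1,000 ms, the 64-bit version returns 5,000,000 while the 32-bit version wraps and returns 705,032. The faster the machine, the sooner an intermediate value like this overflows.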

My preliminary performance numbers are fairly promising (actually… what is the world record for a single machine and NDB these days? Single data node?). I think there’s a bit more low-hanging fruit and a couple more things that are a bit more involved.

Bugs with patches:

  • Bug 74782 – compile fix (memory barriers for POWER)
  • Bug 74781 – flexAsync uses 32bit math, leading to incorrect summary on POWER8

21 thoughts on “MySQL Cluster on POWER8”

  1. Interesting. Not sure if it would help, but you can also implement the xcng function to make use of futexes instead of mutexes on Linux.

    Looks like simple enough patches to get into the trees.

    I don’t think there is a specific record for single-machine NDB benchmarks, but I have run quite a lot on a 96 CPU thread machine that is a few years old now, and a few years ago I ran 4-5 million reads/writes per second on a single node. The data node LDM threads can handle at least 250k PK operations per second per CPU on this box. Today it is possible to have up to 32 LDM threads, so I gather you should be able to get at least 32x that (about 8M) if you have a sufficient number of CPUs. I gather that POWER8 might go a bit beyond 250k, so maybe you can squeeze it up to around 10-20M PK operations per second on a single node.

    Actually with enough machines of large size even 1G PK operations per second is getting within reach :)

    One more thing about getting record numbers is that 7.4 improves PK operations by about 5-10% and scan operations by about 30-40%.

  2. I just realized it’s probably nearly 10 years to the day to the first time I compiled NDB around my first day of starting work for MySQL AB – and that was on a PowerPC machine.

    As for numbers… if you check out the flexAsync bug, I may have “forgotten” to erase my preliminary results in example output :)

  3. Pingback: Preliminary MySQL Cluster benchmark results on POWER8 | Ramblings

  4. I’ve just been using NoOfReplicas=1 with two data nodes, and NDBAPI benchmarks rather than SQL. Each data node is bound to a NUMA node on the host. The system I’ve been doing preliminary work on is not fully loaded with memory (it’s usually not used for benchmarks), so it’s not really illustrative of a system you would buy for use with MySQL Cluster.

  5. I went ahead and implemented xcng() for POWER and got a small speed boost, typically 50,000-100,000 ops/sec, although it may have been a bit higher for deletes (600k ops/sec, no idea why though).

    The highest peak I’ve seen is currently 3.8 million ops/sec…. so if we loaded a POWER8 box with memory to increase the memory bandwidth and balance the NUMA nodes, 10M+ should be possible I think :)

    Interestingly enough, MD5Transform is showing up really high in profiles, and about 3% of total execution time is spent on a single load instruction in mt.cpp.

    I think my next step is to ensure correct functionality rather than chasing benchmark numbers too much. I suspect there’s a race in initial start at least: I’m sometimes seeing a node fall over if the other one has not yet reached start phase 0 before being started, and the crash is certainly odd (looking at random memory locations).

  6. Great to see the efforts you’re making, Stewart. The MD5 transform happens in the TC threads, so I am pretty sure this should not be an issue as long as you have a sufficient number of TC threads. Have you used ThreadConfig? With ThreadConfig you can ensure that the LDM threads, TC threads, send threads, receive threads and the main threads use different CPUs if you like. It is also good to check whether you have misconfigured the system with some resource set too small. I also get a performance boost from locking threads to CPUs.

    Haven’t seen this initial start problem you discuss, but I know that the config parameter that sets the initial start watchdog timer needs to be set at times.

    The most obvious problems that you might encounter are related to the memory barriers in mt.cpp. These have been developed and tested on x86 and SPARC, so this is an important area to be on the lookout for if strange things happen.

  7. yeah, I’m a little suspicious of the code in mt.cpp… not totally (otherwise I’d expect other explosions) but it seems fairly solid currently. I wonder if Jonas wrote some nice torture utility somewhere in tree that I could run…

  8. There is mt-send-t.cpp. I am pretty sure Jonas also wrote some utilities that he never put in the tree, but I don’t really know. I think the mt.cpp parts are fairly well tested, but obviously POWER8 places higher requirements on the programmer than x86 and SPARC, so there might still be bugs.
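
Comments 1 and 5 above discuss implementing xcng() for POWER. As a minimal sketch of what an atomic-exchange primitive of that kind does, the code below builds one from the GCC/Clang `__atomic` builtins instead of hand-written POWER assembly. The signature and the `try_lock`/`unlock` helpers are assumptions made for illustration, not the code that went into the tree.

```cpp
// Hypothetical stand-in for an xcng()-style primitive: atomically store
// `val` at `addr` and return the previous value. Relies on GCC/Clang
// builtins, so it is not portable to other compilers.
static inline int xcng(volatile unsigned* addr, int val) {
    return static_cast<int>(
        __atomic_exchange_n(addr, static_cast<unsigned>(val), __ATOMIC_SEQ_CST));
}

// Minimal try-lock built on the exchange: returns true if the caller
// changed the lock word from 0 (free) to 1 (held).
static inline bool try_lock(volatile unsigned* lock_word) {
    return xcng(lock_word, 1) == 0;
}

// Release the lock; the release ordering keeps writes made while holding
// the lock from being reordered after the unlock.
static inline void unlock(volatile unsigned* lock_word) {
    __atomic_store_n(lock_word, 0u, __ATOMIC_RELEASE);
}
```

A futex-based lock would typically use an exchange like this on its fast path and only make a system call when the lock is contended, which is one way such a primitive can beat a plain mutex.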
