MySQL 5.6 Performance on POWER8

The following sentence is brought to you by IBM Legal: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

My previous post covered the work needed to get MySQL 5.6.17 running reliably on modern POWER systems. The patch to MySQL 5.6.17 that’s needed is available here.

For those who don’t know, POWER8 is the latest Power Architecture processor from IBM (my employer). These chips will be available in systems from IBM in June 2014 (i.e. Real Soon Now(TM)). There are some fairly impressive specs and numbers (see Wikipedia and elsewhere) – but what could this mean for actual applications?

Well, it turns out that MySQL is a pretty big thing in some target markets for POWER8, and inspired by Dimitri’s impressive benchmark numbers, I thought we should have a go on POWER8.

Firstly, I focused on MySQL 5.6 as it is the current stable release. MySQL 5.7 will be the subject of a future blog post.

The first step was to ensure that MySQL 5.6 worked correctly on POWER. My previous blog post covered the few bugs I ran into and filed (often with patches). This wasn’t too hard and I’m fairly confident the bug fixes are simple enough to get into MySQL 5.6 – I can’t comment on what would be/could be “officially supported”, that’s a business discussion :)

In order to ensure that my patch was not only correct but also performing well, I needed a benchmark. For my initial benchmark, I chose sysbench point selects (i.e. read-only key lookups), which should show the theoretical maximum queries per second you could pump through the MySQL server while also really stressing the mutex code.
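For reference, a point select is just a single-row primary key lookup. Assuming the default sysbench OLTP schema (tables named sbtest1 through sbtestN with an integer primary key id and a character column c – my description of the sysbench defaults, not something specific to this setup), each query looks roughly like:

SELECT c FROM sbtest1 WHERE id = 42;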

A simple comparison of my early patch that used heavyweight memory barriers versus Yasufumi’s patch that used more lightweight ones showed that using heavyweight barriers could be as much as a 50% performance hit – so getting this code right is important.

To add to the fun, the POWER8 processor has a few parameters you can tweak. There is the SMT mode, which dictates how many threads per core there are and can be changed at runtime: you can be in SMT=off, SMT=2, SMT=4 or SMT=8. Typically, only some workloads benefit from SMT8 rather than SMT4. There is also the DSCR (Data Stream Control Register), which controls data prefetching. For sysbench point selects, I’ve found we do slightly better (around 10%) when DSCR is set to 1 rather than zero – but YMMV on other benchmarks.

In my experiments, I’ve found that SMT4 or SMT8 seems to be the best bang for buck for MySQL workloads on POWER8. With SMT=2 rather than off, I’ve seen a ~50% performance boost in sysbench point select results. With SMT=4 I’ve seen another 50% boost (i.e. roughly double SMT=off performance). The benefit of SMT8 for MySQL 5.6 (and the 5.6 part is crucial here) may be minimal, especially for this benchmark. This is mostly due to hitting heavy mutex contention inside the MySQL server rather than anything else.

POWER8 systems come in either single or dual socket configurations, with a total of 4, 6, 8, 10, 12, 16, 20 or 24 cores depending on the configuration of the system (go check the IBM web site for specifics of what’s available in what model). This means that with SMT8, a dual socket, 24 core POWER8 system has 192 hardware threads – the system I was using for these benchmarks. With this number of cores and hardware threads, those familiar with MySQL on multi-core systems may already have an inkling that using the full capacity of such a system may be hard for MySQL.

Certainly for old versions of MySQL (such as 5.0 or 5.1) you’re going to get nowhere near full system utilization on POWER8. For MySQL 5.6 (and in the future, 5.7) you have a much better hope.

Before anyone asks, yes, I used jemalloc for most of my benchmarks and it helps by giving a single digit percent performance increase (around 3-4%).

The bottlenecks inside MySQL 5.6 for the sysbench point select workload are fairly well documented, so at best we may be striving to equal the performance of other CPU architectures rather than get much higher, simply because we hit mutex contention when creating read views inside InnoDB. The maximum performance is therefore a function of individual core CPU speed and the speed at which a lock can be acquired (i.e. related to how quickly you can bounce a cacheline holding a lock between cores).

This is exactly what I found on POWER8 with MySQL 5.6 – you hit the same bottleneck on POWER8 as you do everywhere else – creating read views in InnoDB.

That being said, my maximum sysbench point select result on POWER8 was 344kQPS. This not only matches the previous record holder but exceeds it by quite a decent amount.

This number was across 8 tables with mysqld bound to a single NUMA node (6 cores) and sysbench bound to another NUMA node (6 cores) on the same socket. For this benchmark, due to the mutex contention, bringing the second socket into play didn’t improve performance. Other benchmarks (e.g. standard sysbench read only) seem to scale much better with more CPU cores (no doubt the subject of a future blog post).

Single table sysbench point select was also impressive at 335kQPS – going to 8 tables only gained about 9kQPS! All of these results were with SMT4 and DSCR=1, which seems to be the best configuration for this type of workload.

Up next: MySQL 5.7 on POWER8.

The rotating blades database benchmark

(and before you ask, yes “rotating blades” comes from “become a fan”)

I’m forming the ideas here first and then we can go and implement it. Feedback is much appreciated.

Two tables.

Table one looks like this:

CREATE TABLE fan_of (
  user_id BIGINT,
  item_id BIGINT,
  PRIMARY KEY (user_id, item_id),
  INDEX (item_id)
);

That is, two columns, both 64-bit integers. The primary key covers both columns (a user cannot be a fan of something more than once) and can be used to look up all the things a user is a fan of. There is also an index over item_id so that you can find out which users are fans of an item.
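To illustrate the two access paths (these example queries are mine, not part of the benchmark itself):

SELECT item_id FROM fan_of WHERE user_id = my_user_id; -- everything a user is a fan of (primary key)
SELECT user_id FROM fan_of WHERE item_id = 12345678;   -- all fans of an item (item_id index)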

The second table looks like this:

CREATE TABLE fan_count (
  item_id BIGINT PRIMARY KEY,
  fans BIGINT
);

Both tables start empty.

You will have 1000, 2000, 4000 and 8000 concurrent clients attempting to run the queries. These concurrent clients must behave as if they could be coming from a web server. The spirit of the benchmark is to have 8000 threads (or processes) talk to the database server independently of each other.

The following set of queries will be run a total of 23,000,000 (twenty-three million) times. The my_user_id below is an incrementing ID per connection, allocated by partitioning 23,000,000 evenly between all the concurrent clients (e.g. for 1000 connections, each connection gets 23,000 sequential ids).

You must run the following queries.

  • How many fans are there of item 12345678 (e.g. SELECT fans FROM fan_count WHERE item_id=12345678)
  • Is my_user_id already a fan of item 12345678 (e.g. SELECT user_id FROM fan_of WHERE user_id=my_user_id AND item_id=12345678)
  • The next two queries MUST be in the same transaction:
    • my_user_id becomes a fan of item 12345678 (e.g. INSERT INTO fan_of (user_id, item_id) VALUES (my_user_id, 12345678))
    • increment count of fans (e.g. UPDATE fan_count SET fans=fans+1 WHERE item_id=12345678)

For the first query you are allowed to use a caching layer (such as memcached) but the expiry time must be 5 seconds or less.

You do not have to use SQL. You must however obey the transaction boundary above. The insert and the update must be part of the same transaction.
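For implementations that do stick with SQL, a minimal sketch of one iteration looks like this (the BEGIN/COMMIT wrapping is my own illustration of the transaction boundary; the queries themselves are the ones listed above):

SELECT fans FROM fan_count WHERE item_id = 12345678;
SELECT user_id FROM fan_of WHERE user_id = my_user_id AND item_id = 12345678;
BEGIN;
INSERT INTO fan_of (user_id, item_id) VALUES (my_user_id, 12345678);
UPDATE fan_count SET fans = fans + 1 WHERE item_id = 12345678;
COMMIT;
-- on deadlock or lock wait timeout, ROLLBACK and retry from the first SELECT (see the retry rule below)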

Results should include: min, avg, max response time for each query as well as the total time to execute the benchmark.

Data must be durable across a machine being switched off and must still be available with that machine switched off. If committing to local disk, you must also replicate to another machine. If running asynchronous replication, the clock does not stop until all changes have been applied on the slave, and you must also record the replication delay throughout the entire test.

In the event of timeout or deadlock in doing the insert and update part, you must go back to the first query (how many fans) and retry. Having to retry does not count towards the 23,000,000 runs.

At the end of the benchmark, the query SELECT fans FROM fan_count WHERE item_id=12345678 should return 23,000,000.

Yes, this is a very evil benchmark. It seems to be a bit indicative of the kind of peak load that can be experienced by a bunch of Web 2.0 sites that have “like” or “become a fan” style buttons. I fully expect the following:

  • Pretty much all systems will nosedive in performance after 1000 concurrent clients
  • A lot of transaction rollbacks due to deadlock detection or lock wait timeouts.
  • Many existing systems and setups failing to complete it in a reasonable time.
  • A solution using Scale Stack to be an early winner (backed by MySQL or Drizzle)
  • Somebody influenced by Domas turning InnoDB deadlock detection off very quickly.
  • Somebody to call this benchmark “stupid” (that person will have a system that fails dismally at this benchmark)
  • Somebody who actually has any knowledge of modern large scale web apps to suggest improvements
  • Nobody even attempting to benchmark the Oracle database
  • Somebody submitting MySQL results without waiting until the replication stream has finished applying.
  • Some NoSQL systems to suck considerably more than their SQL counterparts.