Way back in 2010, MySQL Bug 57241 was filed, pointing out that the “swap insanity” problem was getting serious on x86 systems as NUMA became more and more common.
The swapping problem is due to one NUMA node running out of memory, at which point the kernel starts swapping pages out even though other nodes still have free memory (see Jeremy Cole’s blog entry, also from 2010, on the topic of swap insanity). This was back when 64GB and dual quad core CPUs were big; in the past five years big systems have gotten bigger.
Back then there were two things you could do to keep your system usable: 1) boot with numa=off as a kernel parameter (this likely has other implications though) and 2) run “numactl --interleave=all” in the mysqld_safe script (I think MariaDB currently has this built in if you set an option, but I don’t think MySQL does; otherwise perhaps the bug would have been closed).
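For the second option, here is a minimal sketch of what wrapping the server start in numactl could look like. The helper names are my own, and the mysqld path and option in the example are placeholders rather than anything taken from mysqld_safe itself; the only real piece is the `numactl --interleave=all <command>` form.

```python
import shutil
import subprocess


def interleaved_command(server_argv, numactl="numactl"):
    """Prefix a server command line with `numactl --interleave=all`."""
    if not server_argv:
        raise ValueError("server_argv must name a program to run")
    return [numactl, "--interleave=all", *server_argv]


def launch_interleaved(server_argv):
    """Start the server under numactl; return None if numactl is not installed."""
    if shutil.which("numactl") is None:
        return None
    return subprocess.Popen(interleaved_command(server_argv))


if __name__ == "__main__":
    # Hypothetical binary path and option, used only for illustration.
    print(" ".join(interleaved_command(["/usr/sbin/mysqld", "--user=mysql"])))
```

Note that this interleaves every allocation the process makes for its whole lifetime, not just the ones made at startup.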
Anyway, it’s now about 5 years since this bug was opened, and even though there’s been a patch in the Twitter MySQL branch for a while (years?) and my patch, submitted under the Oracle Contributor Agreement, has been attached to bug 72811 since May 2014 (over a year), we still haven’t seen any action.
My patch takes the approach that things allocated at server startup (e.g. the buffer pool) should be interleaved across nodes, while runtime allocations are probably per connection and are thus fine (in fact, better) as node-local allocations.
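To make the idea concrete, here is a toy model of that placement policy. It is not the patch and it does not touch real memory; the function names and the page counts are made up purely to show where pages would land under each policy.

```python
from collections import Counter
from itertools import cycle


def place_pages(num_pages, nodes, policy, local_node=0):
    """Return a Counter of pages per node under a toy placement policy.

    "interleave" spreads pages round-robin over all nodes;
    "local" puts every page on the node the allocating thread runs on.
    """
    if policy == "interleave":
        order = cycle(range(nodes))
        return Counter(next(order) for _ in range(num_pages))
    if policy == "local":
        return Counter({local_node: num_pages})
    raise ValueError(f"unknown policy: {policy}")


def server_layout(buffer_pool_pages, connections, nodes):
    """Model the approach: interleave at startup, allocate locally at runtime.

    `connections` is a list of (node, pages) pairs, one per connection.
    """
    layout = place_pages(buffer_pool_pages, nodes, "interleave")
    for node, pages in connections:
        layout += place_pages(pages, nodes, "local", local_node=node)
    return layout
```

In this model, `place_pages(8, 2, "local")` puts all 8 buffer pool pages on node 0, while `server_layout(8, [(0, 2), (1, 1)], 2)` ends up with 6 pages on node 0 and 5 on node 1: the big startup allocation is spread evenly and each connection’s memory stays on the node it runs on.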
Without a patch like this, or without running mysqld with the right numactl incantation, you end up either having all your memory on one NUMA node (potentially not utilising the full memory bandwidth of the hardware), or you end up with swap insanity, or you end up in some other situation that isn’t exactly what you’d expect.
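If you want to see which of these situations you are in, Linux exposes per-mapping page placement in /proc/<pid>/numa_maps, where tokens like N0=1234 give the number of pages of a mapping that live on node 0. Below is a rough sketch that totals those tokens per node; the function names are my own, and the counts are pages rather than bytes (mappings backed by huge pages use a larger page size), so treat the result as an indication of balance only.

```python
import re
from collections import Counter

# Matches per-node page counts such as "N0=1234" in a numa_maps line.
NODE_TOKEN = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> tokens over all mappings, keyed by node number."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = NODE_TOKEN.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return totals


def read_numa_maps(pid):
    """Read the numa_maps file for a process (Linux only, needs permission)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()
```

Calling `pages_per_node(read_numa_maps(pid))` with the pid of mysqld gives a per-node total; if nearly everything sits on one node, the server is not using the memory of the others.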
While we could have MySQL be more NUMA aware and perhaps do a buffer pool instance per NUMA node or some such thing, it’s kind of disappointing that for dedicated database servers bought in the past 7+ years (according to one comment on one of the bugs) this crippling issue hasn’t been addressed upstream.
Just to make it even more annoying, on certain workloads you get a lot of mutex contention, which can mean that binding MySQL to fewer NUMA nodes (memory and CPU) actually increases performance (cachelines don’t have as far to travel). This is a different problem from swap insanity though, and one that is being addressed.
Update: My patch as part of https://bugs.mysql.com/bug.php?id=72811 has been merged! MySQL on NUMA machines just got a whole lot better. I just hope it’s enabled by default…