The MySQL replication system has always had the replication log (“binlog”) as a separate set of files on disk. Originally, this really didn’t matter as, well, MyISAM wasn’t transactional or crash safe so the binlog didn’t need to be either. If you crashed on a busy write workload, your replication was just going to be hosed anyway.
So then came a time where everybody used InnoDB. Transactional, crash-safe and all the goodies. Then, a bit later, came storing master rpl log position in InnoDB log and XA with the binlog. So a rather long time after MySQL first had replication, you could pull the power cord on the master with a decent amount of certainty that things would be okay when you turned it on again.
I am, of course, totally ignoring the slave state and if it’s safe to do that on slaves.
Using XA to make the binlog and InnoDB consistent does have a cost. That cost is fsync()s. You have to do a lot more of them (two phase commit here).
As you may be aware, at a (much) earlier point in Drizzle we completely ripped out the replication code. Why? A lot of it was very much still geared to support statement based replication – something we certainly didn’t want to support. We also did not really want to keep the legacy binlog format. We wanted it to be very, very pluggable.
So the initial implementation is a transaction log file. Basically, we write out the replication messages to a file. A slave reads this and applies the operations. Pretty simple and foolproof to implement.
But it’s pluggable.
What if we stored the transaction log inside innodb? Not only that, what if we wrote it as part of the transaction that was doing the changes? That way, no XA is needed – everything is consistent with a COMMIT. This would greatly reduce the number of fsync()s needed to be consistent.
Now… the first thing people will say is “arrggh! You’re writing the data *four* times now”. First being the txn data into the log, then the replication log into the log, and then both of these are written back to the data file. It turns out that this is much cheaper than doing the additional fsync()s.
In one of our tests, the file based transaction log: ~300tps. Transaction log in InnoDB: ~1200tps.
I think that’s an acceptable trade-off.
We’ve just merged the first bit of this work into Drizzle.
Props go to Joe Daly, Brian and myself for making it work.