Replication log inside InnoDB

The MySQL replication system has always had the replication log (“binlog”) as a separate set of files on disk. Originally, this really didn’t matter as, well, MyISAM wasn’t transactional or crash safe so the binlog didn’t need to be either. If you crashed on a busy write workload, your replication was just going to be hosed anyway.

So then came a time where everybody used InnoDB. Transactional, crash-safe and all the goodies. Then, a bit later, came storing the master's replication log position in the InnoDB log, and XA between InnoDB and the binlog. So a rather long time after MySQL first had replication, you could pull the power cord on the master with a decent amount of certainty that things would be okay when you turned it on again.

I am, of course, totally ignoring the slave state and if it’s safe to do that on slaves.

Using XA to make the binlog and InnoDB consistent does have a cost. That cost is fsync()s: two-phase commit means you have to do a lot more of them.
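
To see where the extra fsync()s come from, here's a back-of-the-envelope sketch of the per-commit counts. The helper names are mine, and this is the worst case – real servers amortise some of these flushes with group commit:

```python
# Illustrative sketch: fsync()s per commit with XA between InnoDB and the
# binlog, versus a replication log written inside the engine transaction.

def xa_commit_fsyncs():
    fsyncs = 0
    fsyncs += 1  # InnoDB: flush the prepare record to the redo log
    fsyncs += 1  # binlog: write the event and flush the binlog file
    fsyncs += 1  # InnoDB: flush the commit record to the redo log
    return fsyncs

def in_engine_commit_fsyncs():
    # The replication message is part of the same transaction, so it is
    # covered by the single redo-log flush that makes the COMMIT durable.
    return 1

print(xa_commit_fsyncs(), in_engine_commit_fsyncs())  # 3 1
```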

As you may be aware, at a (much) earlier point in Drizzle we completely ripped out the replication code. Why? A lot of it was very much still geared to support statement based replication – something we certainly didn’t want to support. We also did not really want to keep the legacy binlog format. We wanted it to be very, very pluggable.

So the initial implementation is a transaction log file. Basically, we write out the replication messages to a file. A slave reads this and applies the operations. Pretty simple and foolproof to implement.
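
That append-and-replay loop can be sketched in a few lines. This is a toy model using JSON lines, not Drizzle's actual on-disk message format:

```python
import json

def append_transaction(log_path, statements):
    # Master side: serialise the replication message and append it to the
    # transaction log file.
    with open(log_path, "a") as log:
        log.write(json.dumps({"stmts": statements}) + "\n")

def replay(log_path, apply):
    # Slave side: read each message in order and apply its operations.
    with open(log_path) as log:
        for line in log:
            for stmt in json.loads(line)["stmts"]:
                apply(stmt)
```

Pretty much the whole trick: ordered append on one end, ordered apply on the other.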

But it’s pluggable.

What if we stored the transaction log inside InnoDB? Not only that, what if we wrote it as part of the transaction that was doing the changes? That way, no XA is needed – everything is consistent with a COMMIT. This would greatly reduce the number of fsync()s needed to be consistent.
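
A sketch of the idea, using SQLite as a stand-in transactional engine (the table names are made up): the replication message is just another row written inside the same transaction, so one COMMIT makes both durable and they can never disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, val TEXT)")
conn.execute("CREATE TABLE sys_replication_log (commit_id INTEGER, message BLOB)")

# One transaction: the data change and the replication message commit
# together – no XA needed to keep them consistent.
with conn:
    conn.execute("INSERT INTO t1 VALUES (1, 'hello')")
    conn.execute("INSERT INTO sys_replication_log VALUES (1, ?)",
                 (b"INSERT INTO t1 VALUES (1, 'hello')",))

# The log lives in a table, so it is SELECTable like any other data.
rows = conn.execute("SELECT message FROM sys_replication_log").fetchall()
```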

Now… the first thing people will say is “arrggh! You’re writing the data *four* times now”. First the transaction data goes into the InnoDB log, then the replication message goes into that same log, and then both of these are written back to the data file. It turns out that this is much cheaper than doing the additional fsync()s.
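
The arithmetic behind that, with made-up latencies purely for illustration: sequential writes into an already-open log are cheap, while every fsync() is a forced trip to the disk.

```python
# Hypothetical costs – the point is the ratio, not the absolute numbers.
fsync_ms = 3.0   # assumed cost of one fsync() on some disk
write_ms = 0.05  # assumed cost of one buffered sequential write

# XA path: three flushes (prepare, binlog, commit), two log writes.
xa = 3 * fsync_ms + 2 * write_ms
# In-engine path: one flush at COMMIT, even with the data written four times.
in_engine = 1 * fsync_ms + 4 * write_ms

print(xa, in_engine)  # the extra writes are noise next to the saved fsyncs
```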

In one of our tests, the file-based transaction log managed ~300 tps; the transaction log inside InnoDB managed ~1200 tps.

I think that’s an acceptable trade-off.

We’ve just merged the first bit of this work into Drizzle.

Props go to Joe Daly, Brian and myself for making it work.

12 thoughts on “Replication log inside InnoDB”

  1. Why not just get rid of the replication log altogether?

    InnoDB already has a transaction log. Just use that log for replication. PBXT and Postgresql both do replication without a separate replication log.

  2. Using the engine log for replication has a few limitations:

    – you *must* replicate from engine X to engine X. You *cannot* even remotely easily replicate to engine Y.

    – you may be placing upon yourself limitations in upgrade path to future on-disk formats of that engine.

    – it is not a cross-engine solution. If you have two engines that you’re using, you would have to have two separate replication solutions – not great.

  3. I guess the replication log cannot be used to recover from backups anymore? As a disk failure would take down the log with the data…?

  4. You can still copy the log as part of your backup strategy – it’s even SELECTable. Joe made the transaction_reader program be able to connect to the database server and run the appropriate queries too.

  5. Andy,

    Engine-level replication via log is a problem for InnoDB because the log contains only REDO data (space_id:page_id:offset:payload) and as such, it does not contain “before” data, which can only be found in the UNDO segments.

    The Oracle logging mechanism (and other databases that support log-shipping engine based replication) contain UNDO/REDO information in the log (usually as entire before/after pages). The MySQL row based binary log has old/new images for updates, new images for inserts and old images for deletes, which makes it usable for replication.
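
    That distinction can be sketched with two toy record types. The field names are illustrative, not actual InnoDB or binlog structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedoRecord:
    # What InnoDB's log contains: enough to re-apply a page change,
    # but no "before" image (that lives in the UNDO segments).
    space_id: int
    page_id: int
    offset: int
    payload: bytes

@dataclass
class RowEvent:
    # What a row-based replication log carries: before/after images for
    # updates, an after image for inserts, a before image for deletes.
    table: str
    before: Optional[dict]
    after: Optional[dict]
```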

  6. Really a great idea that everything can happen in a single fsync().

    But I’m not sure whether the “storage engine log” or the “transaction log” contains the superset of data. If one is a superset of the other, it should be theoretically possible to generate the subset on the fly from the superset while reading.
    I wish the superset were the “transaction log”. In that case storage engines could just drop their log generation logic and redirect reads to a wrapper on top of the “transaction log”.

    Does that sound stupid?

  7. Pingback: drizzle.org

  8. I’m extremely surprised by the performance difference. Good thinking outside the box. :)

  9. Pingback: Crash-testing the innodb transaction log! » workcenter220

  10. I agree with @Nick. Even before I got down to the comments, I was very surprised at the level of performance increase. It is counterintuitive at first that writing the data 4 times would be the cheaper solution. I guess that’s why testing always gives the final answer.
