A warning to Solaris users… (fsync possibly doesn’t)

Read the following:

Linux has its fair share of dumb things with data too (ext3 not defaulting to using write barriers is a good one). This one, however, is particularly nasty… I’d have really hoped there were some good tests in place for this.

This should also be a good warning to anybody implementing advanced storage systems: we database guys really do want to be able to write things reliably and you really need to make sure this works.

So, Stewart’s current list of stupid shit you have to do to ensure a 1MB disk write goes to disk in a portable way:

  • You’re a database, so you’re using O_DIRECT
  • Keep each disk write under 32k
  • fsync()
  • Write 32-64MB of sequential data to hopefully force everything out of the drive write cache and onto the platter, so it survives a power failure (because barriers may not be on). Increase this to match whatever caching system happens to be in place. If you think there may be battery-backed RAID… maybe 1GB or 2GB of data writes
  • If you’re extending the file, don’t bother… that especially seems to be buggy. Create a new file instead.
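
For the curious, here is roughly what that list looks like in C. Treat it as a sketch under assumptions, not a recipe that is known to be safe anywhere: the function names (write_new_file, push_through_cache) are made up for illustration, the 16k chunk size and 4096-byte alignment are guesses that merely satisfy "under 32k" and typical O_DIRECT alignment rules, and enabling O_DIRECT through fcntl() is Linux behaviour. Where O_DIRECT is refused, the sketch quietly falls back to buffered writes plus fsync().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (16 * 1024)        /* assumed: under 32k and a multiple of ALIGN */
#define ALIGN 4096               /* assumed O_DIRECT alignment; device-specific */

/* Best effort: switch an open fd to O_DIRECT; stay buffered if refused. */
static void try_direct(int fd)
{
#ifdef O_DIRECT
    int fl = fcntl(fd, F_GETFL);
    if (fl >= 0)
        (void)fcntl(fd, F_SETFL, fl | O_DIRECT);
#else
    (void)fd;
#endif
}

/* Write exactly CHUNK bytes from an ALIGN-aligned buffer. */
static int write_chunk(int fd, const char *buf)
{
    size_t done = 0;
    while (done < CHUNK) {
        ssize_t n = write(fd, buf + done, CHUNK - done);
        if (n > 0) { done += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;
#ifdef O_DIRECT
        if (n < 0 && errno == EINVAL) {
            /* Filesystem rejected the direct write: drop O_DIRECT, retry. */
            int fl = fcntl(fd, F_GETFL);
            if (fl >= 0 && (fl & O_DIRECT) &&
                fcntl(fd, F_SETFL, fl & ~O_DIRECT) == 0)
                continue;
        }
#endif
        return -1;
    }
    return 0;
}

/* Write nchunks chunks (zeros when data is NULL), then fsync() and close(). */
static int write_all(int fd, const char *data, size_t nchunks)
{
    void *buf;
    int rc = 0;
    if (fd < 0) return -1;
    if (posix_memalign(&buf, ALIGN, CHUNK) != 0) { close(fd); return -1; }
    memset(buf, 0, CHUNK);
    try_direct(fd);
    for (size_t i = 0; i < nchunks && rc == 0; i++) {
        if (data) memcpy(buf, data + i * CHUNK, CHUNK);
        rc = write_chunk(fd, buf);
    }
    if (rc == 0 && fsync(fd) != 0) rc = -1;
    if (close(fd) != 0) rc = -1;
    free(buf);
    return rc;
}

/* Hypothetical helper: put len bytes into a brand-new file.
 * O_EXCL makes it fail rather than extend or overwrite an existing file.
 * len must be a multiple of CHUNK to keep every write aligned. */
int write_new_file(const char *path, const void *data, size_t len)
{
    if (len % CHUNK != 0) { errno = EINVAL; return -1; }
    return write_all(open(path, O_WRONLY | O_CREAT | O_EXCL, 0644),
                     data, len / CHUNK);
}

/* Hypothetical helper: write mb megabytes of sequential zeros to a scratch
 * file in the hope of pushing earlier data out of the drive write cache.
 * This is a heuristic, not a guarantee. */
int push_through_cache(const char *scratch, size_t mb)
{
    int rc = write_all(open(scratch, O_WRONLY | O_CREAT | O_TRUNC, 0644),
                       NULL, mb * (1024 * 1024 / CHUNK));
    unlink(scratch);
    return rc;
}
```

A caller would do something like write_new_file("journal.0001", buf, 1 << 20) and then push_through_cache("scratch.tmp", 64), bumping that 64 up towards 1024 or 2048 if there might be a battery-backed cache in the way. Both return 0 on success and -1 on failure. Note that zeros are the laziest possible filler; a drive or filesystem that compresses data may not be pushed very hard by them.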

Of course you could just assume that the OS kind of gets it right…. *laugh*