getting rid of duplicate emails, elegantly

I don’t mean duplicate emails in the way that everybody is thinking. This is different.

Due to a bug in offlineimap that I hit a little while ago, it has managed to make copies (sometimes even two copies) of each email in certain folders. Now, this isn’t so bad, as

a) email didn’t get lost

b) it’s just using extra disk, and disk is cheap.

but it is annoying when searching.

It’s also annoying because it’s decided to do this on folders such as INBOX/MySQL/bugs, which contains an email for each change to a bug report ever since I joined the company. That adds up to a lot of wasted inodes and disk blocks.

So, I’ve revived this project that I have in the back of my head of efficiently storing email in a database and being able to sync between instances of it.

This gives us some nice advantages. You can use replication to keep a backup of your email. You can put it in Cluster and have high availability email.

We can also do some neat tricks with tables holding all the info you need to display lists of emails, and probably get performance boosts over having to open each mail as we currently do. Current email solutions don’t scale to a million emails in a folder.
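To make the summary-table idea a bit more concrete, here is a minimal sketch in Python. It uses the standard sqlite3 module purely as a stand-in for a real database server, and every table, column and function name in it (message_body, message_summary, open_store, add_message, list_folder) is made up for illustration, not taken from any existing project. The point is only that listing a folder reads the small summary rows and never touches the raw message.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a toy mail store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS message_body (
            id  INTEGER PRIMARY KEY,
            raw BLOB NOT NULL              -- the full message, untouched
        );
        CREATE TABLE IF NOT EXISTS message_summary (
            id      INTEGER PRIMARY KEY REFERENCES message_body(id),
            folder  TEXT NOT NULL,
            sender  TEXT NOT NULL,
            subject TEXT NOT NULL,
            sent_at TEXT NOT NULL          -- ISO 8601, so text order is date order
        );
        CREATE INDEX IF NOT EXISTS summary_by_folder
            ON message_summary (folder, sent_at);
        """
    )
    return conn


def add_message(conn, folder, sender, subject, sent_at, raw):
    """Store the raw message once and a small summary row beside it."""
    cur = conn.execute("INSERT INTO message_body (raw) VALUES (?)", (raw,))
    conn.execute(
        "INSERT INTO message_summary (id, folder, sender, subject, sent_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (cur.lastrowid, folder, sender, subject, sent_at),
    )
    conn.commit()
    return cur.lastrowid


def list_folder(conn, folder, limit=50):
    """Return (sender, subject, sent_at) rows, newest first, without reading bodies."""
    rows = conn.execute(
        "SELECT sender, subject, sent_at FROM message_summary "
        "WHERE folder = ? ORDER BY sent_at DESC LIMIT ?",
        (folder, limit),
    )
    return rows.fetchall()
```

Whether this actually scales to a million emails in a folder depends on the database and the indexes, so treat it as a shape for the idea and not a benchmark.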

Partitioning will also be useful to make searches quicker (odds are what we’re searching for is recent and all sorts of foo).

Anyway… it’s interesting to see the bunch of errors that gets thrown up by the Mail::Box perl module on some of my Maildirs. Hrrm… I may have to resort to my own, more error-tolerant code. I’m determined to write scripts that cannot possibly lose anything.
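For a flavour of what "cannot possibly lose anything" might look like, here is a rough Python sketch of a read-only duplicate finder. It is not the Mail::Box code mentioned above. It assumes the standard Maildir layout (cur and new subdirectories) and it assumes the duplicates are byte-for-byte identical files, which may not hold if a sync tool rewrote headers on the copies. The function name find_duplicates is made up. It never deletes or moves anything; it only reports.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(maildir):
    """Group message files in a Maildir by the SHA-256 of their contents.

    Read-only: nothing is deleted, renamed or rewritten. Returns a pair
    (duplicates, unreadable), where duplicates maps a digest to the list
    of paths sharing it, and unreadable lists paths that could not be read.
    """
    by_digest = defaultdict(list)
    unreadable = []
    for sub in ("cur", "new"):
        directory = os.path.join(maildir, sub)
        if not os.path.isdir(directory):
            continue  # tolerate a partial or odd Maildir instead of dying
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            try:
                with open(path, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
            except OSError:
                unreadable.append(path)  # note it and carry on
                continue
            by_digest[digest].append(path)
    duplicates = {d: paths for d, paths in by_digest.items() if len(paths) > 1}
    return duplicates, unreadable
```

Reading whole files into memory is fine for a sketch; anything that acts on the report (moving the extra copies aside, say) is deliberately left out so that the worst this can do is waste some time.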

2 thoughts on “getting rid of duplicate emails, elegantly”

  1. We have a database solution for mail in our system, and we actually want to move to IMAP. Partially because of features, and partially because of problems we encounter.

    Basically, FULLTEXT searching only works with MyISAM tables, but we need the row-level locking capabilities of InnoDB. Also, binary logging is impossible, because the logs would grow too large. We currently have limits on message size and how long messages are kept for. Make sure you monitor the InnoDB free space as well (or use autoextend and monitor disk usage).

  2. Basically what I’m looking at is to have an external program be able to do the syncing (so you can do updates on disconnected computers and then resync them). So I wouldn’t be using the binary log for that.

    Although I intend to use it to have a replication slave as a backup.

    I am more tempted to link into Beagle than rely on fulltext indexing at the moment – at least until fulltext is supported across storage engines (and since I may be doing some of the coding for that, I’ll be an early adopter too).

    I’d also present an IMAP interface to the DB, otherwise it’s kinda silly.

    Personally I wouldn’t worry about message size or size of table – you just need some optimisations to get better performance!

    Mail clients such as Evolution keep their own indexes anyway.

    Since my mail chews up gigabytes as it is, I’m not too concerned about disk space usage.

    Disk is cheap – losing data due to deleted emails is not.
