getting rid of duplicate emails, elegantly

I don't mean duplicate emails in the way that everybody is thinking. This is different.

Due to a bug in offlineimap that I hit a little while ago, it's managed to make copies (sometimes even two copies) of each email in certain folders. Now, this isn't so bad, as:

a) email didn’t get lost

b) it’s just using extra disk, and disk is cheap.

but it is annoying when searching.

It’s also annoying because it’s decided to do this on folders such as INBOX/MySQL/bugs, which contains an email for each change to a bug report ever since I joined the company. That adds up to a lot of wasted inodes and disk blocks.
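For the immediate cleanup, something along these lines would do the trick: hash every file in a Maildir directory and report any byte-identical copies. It's only a sketch; the path handling is minimal, it assumes the offlineimap copies really are byte-for-byte identical, and it deliberately only prints rather than deletes, so a human gets to look before anything goes away.

```perl
#!/usr/bin/perl
# Sketch: find byte-identical duplicate messages in a Maildir's cur/ directory.
# Only reports; nothing is deleted.
use strict;
use warnings;
use Digest::SHA;

my $dir = shift || die "usage: $0 path/to/Maildir/cur\n";
my %seen;   # content digest => first filename seen with that content

opendir(my $dh, $dir) or die "opendir $dir: $!";
for my $file (sort readdir $dh) {
    next if $file =~ /^\./;          # skip . and .. and dotfiles
    my $path = "$dir/$file";
    next unless -f $path;
    my $digest = Digest::SHA->new(256)->addfile($path)->hexdigest;
    if (exists $seen{$digest}) {
        print "duplicate: $path (same content as $seen{$digest})\n";
    } else {
        $seen{$digest} = $path;
    }
}
closedir $dh;
```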

So, I’ve revived a project I’ve had in the back of my head for a while: efficiently storing email in a database and being able to sync between instances of it.

This gives us some nice advantages: you can use replication to keep a backup of your email, or put it in Cluster and have high-availability email.

We can also do some neat tricks with tables holding all the info you need to display lists of emails, and probably get performance boosts over having to open each message as we currently do; current email solutions just don’t scale to a million emails in a folder.
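To make that concrete, here's roughly the kind of summary table I mean. It's purely a sketch: the table, columns, and credentials are invented for illustration, and a real schema would need flags, threading, and the message bodies themselves.

```perl
#!/usr/bin/perl
# Sketch of the "draw a folder listing without opening every message" idea.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=mail", "mail", "secret",
                       { RaiseError => 1 });

# One row per message: just enough to draw a mailbox listing.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS message_summary (
        folder      VARCHAR(255) NOT NULL,
        uid         BIGINT UNSIGNED NOT NULL,
        sent_at     DATETIME NOT NULL,
        sender      VARCHAR(255) NOT NULL,
        subject     VARCHAR(255) NOT NULL,
        size_bytes  INT UNSIGNED NOT NULL,
        PRIMARY KEY (folder, uid),
        KEY by_date (folder, sent_at)
    )
});

# Drawing one screen of a million-message folder becomes one indexed query
# instead of a million open()/parse() calls.
my $rows = $dbh->selectall_arrayref(q{
    SELECT sent_at, sender, subject
      FROM message_summary
     WHERE folder = ?
     ORDER BY sent_at DESC
     LIMIT 50
}, undef, 'INBOX/MySQL/bugs');

printf "%s  %-30s  %s\n", @$_ for @$rows;
```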

Partitioning will also be useful to make searches quicker (odds are whatever we’re searching for is recent, so older partitions can mostly be skipped).
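As a sketch of what I mean, here's the same invented summary table range-partitioned by year (the date has to be pulled into the primary key for MySQL to accept the partitioning expression, and the partition bounds are just for illustration). A search bounded to recent mail then only has to touch the newest partitions.

```perl
#!/usr/bin/perl
# Sketch only: range-partition the summary table by year (MySQL 5.1+ syntax).
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=mail", "mail", "secret",
                       { RaiseError => 1 });

# The partitioning column must appear in the primary key, hence
# (folder, uid, sent_at) rather than just (folder, uid).
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS message_summary_by_year (
        folder   VARCHAR(255) NOT NULL,
        uid      BIGINT UNSIGNED NOT NULL,
        sent_at  DATETIME NOT NULL,
        sender   VARCHAR(255) NOT NULL,
        subject  VARCHAR(255) NOT NULL,
        PRIMARY KEY (folder, uid, sent_at)
    )
    PARTITION BY RANGE (YEAR(sent_at)) (
        PARTITION p2004 VALUES LESS THAN (2005),
        PARTITION p2005 VALUES LESS THAN (2006),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
});

# A search restricted to the last month only needs the newest partition(s).
my $recent = $dbh->selectall_arrayref(q{
    SELECT sent_at, subject
      FROM message_summary_by_year
     WHERE folder = ? AND sent_at >= NOW() - INTERVAL 30 DAY
     ORDER BY sent_at DESC
}, undef, 'INBOX/MySQL/bugs');

printf "%s  %s\n", @$_ for @$recent;
```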

Anyway… it’s interesting to see the bunch of errors that get thrown up by the Mail::Box perl module on some of my Maildirs. Hrrm… I may have to resort to my own, more error-tolerant code. I’m determined to write scripts that cannot possibly lose anything.
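For what it's worth, this is roughly how I'm poking at the folders with Mail::Box: everything opened read-only, and each folder wrapped in an eval so one bad Maildir doesn't stop the whole scan. The fields printed are just for illustration.

```perl
#!/usr/bin/perl
# Sketch: walk some Maildirs read-only with Mail::Box and see what breaks.
# Nothing is ever written or deleted; a folder that fails to open is
# reported and skipped.
use strict;
use warnings;
use Mail::Box::Manager;

my $mgr = Mail::Box::Manager->new;

for my $dir (@ARGV) {
    my $folder = eval { $mgr->open(folder => $dir, access => 'r') };
    if (!$folder) {
        warn "could not open $dir: " . ($@ || 'unknown error') . "\n";
        next;
    }
    my @msgs = $folder->messages;
    printf "%s: %d messages\n", $dir, scalar @msgs;
    for my $msg (@msgs) {
        # messageId() is what a duplicate check would key on.
        printf "  %s\t%s\n", $msg->messageId, ($msg->get('Subject') || '');
    }
    $folder->close;
}
```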