ImapGoose status update: v0.3.2

Plenty has happened since my initial announcement of ImapGoose. First of all, I implemented pipelining of uploads and downloads, which dramatically improved speeds (initial tests yielded roughly 1 GB of messages per minute on my host).

Scanning and task queue

An important bug quickly came to light: the local maildirs were never fully re-scanned. If any changes happened while ImapGoose was not running, they would not be detected the next time it ran, or ever!

My first approach at fixing this introduced substantial complexity into the task dispatching mechanism. There were no obvious faults with it, but it was too complex for any obvious faults to be visible. I redesigne…

The resulting design is also quite simple: the IMAP listener emits events indicating which mailbox has changed (this is all the information we get). The filesystem watcher emits events indicating which mailbox and file changed. In the case of the initial start-up event, no file is specified, which implies that the entire directory needs to be scanned for new files, deleted files, or changed flags.

The dispatcher accumulates the events it receives and executes sync tasks based on them. There are essentially three kinds of sync task:

Scan a single message in filesystem (and sync change to IMAP).
Scan entire filesystem (and sync changes to IMAP).
Scan IMAP (and sync changes to local filesystem).
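As a rough illustration, the three task kinds could be derived from incoming events as sketched below. The two event types mirror the ones shown later in this post; `TaskKind`, `Task`, `taskForLocal` and `taskForRemote` are hypothetical names for this sketch and are not taken from ImapGoose’s source.

```go
package main

import "fmt"

// RemoteEvent and LocalEvent mirror the event types described in this post.
type RemoteEvent struct {
	Mailbox string
}

type LocalEvent struct {
	Mailbox string
	AbsPath string // Empty = full mailbox, non-empty = specific file
}

// TaskKind is a hypothetical name for the three kinds of sync task.
type TaskKind int

const (
	ScanLocalFile     TaskKind = iota // scan one file, sync the change to IMAP
	ScanLocalMailbox                  // scan the whole directory, sync changes to IMAP
	ScanRemoteMailbox                 // scan IMAP, sync changes to the filesystem
)

// Task is comparable, so it can be used as a map key to collapse duplicates.
type Task struct {
	Kind    TaskKind
	Mailbox string
	AbsPath string
}

// taskForLocal maps a watcher event to a local-to-remote task.
func taskForLocal(ev LocalEvent) Task {
	if ev.AbsPath == "" {
		return Task{Kind: ScanLocalMailbox, Mailbox: ev.Mailbox}
	}
	return Task{Kind: ScanLocalFile, Mailbox: ev.Mailbox, AbsPath: ev.AbsPath}
}

// taskForRemote maps a listener event to a remote-to-local task.
func taskForRemote(ev RemoteEvent) Task {
	return Task{Kind: ScanRemoteMailbox, Mailbox: ev.Mailbox}
}

func main() {
	pending := map[Task]struct{}{}
	pending[taskForLocal(LocalEvent{Mailbox: "INBOX"})] = struct{}{}
	pending[taskForRemote(RemoteEvent{Mailbox: "INBOX"})] = struct{}{}
	pending[taskForRemote(RemoteEvent{Mailbox: "INBOX"})] = struct{}{} // same task, collapses
	fmt.Println(len(pending))                                          // prints 2
}
```

Because the local and remote tasks for the same mailbox are distinct values, queueing both at start-up needs no special handling in this sketch.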

There is no longer a distinction between “Full Sync” and “Incremental Sync” when deciding what to do. When performing an IMAP scan, the workers use CONDSTORE/QRESYNC (which returns a list of messages that have changed since last time), or fetch all UIDs depending on whether the mailbox has been seen before or not. This is a decision on how to scan the remote, and not a decision on what to do, so it’s one layer down.

A notable change in this refactor is that scanning the remote mailbox for changes is no longer implied in all sync events: if only a local file (or files) changed, we can determine what to do based on that file (or those files) and the status repository. Since we’ve been monitoring the remote mailbox, we know its latest state already and can decide what to do without any network I/O.

All these changes brought about another simplification. Previously, during the start-up sequence, both the listener and watcher would trigger full sync events for all known mailboxes. This implied duplicate sync events for mailboxes on both sides. I had implemented some special treatment for these to be properly deduplicated. Now queued sync events are directional, so while both the listener and watcher queue events, they’re complementary events: one syncs local changes to the IMAP server and the other syncs remote changes to the local filesystem. We don’t need special de-duplication for them, and this simplifies the start-up sequence of the dispatcher.

Nested mailboxes

Some issues surfaced relating to nested mailboxes. My initial design mapped the hierarchy separator to a period in Maildir names, inspired by the same behaviour in offlineimap. The mailbox lists/mine was stored on disk as lists.mine, but the mailbox lists.mine would also be stored on disk with the same name: a collision.

Eventually, I settled on fully supporting nested mailboxes. notmuch supports these just fine, and there’s realistically no problem with this. It’s likely what users of nested mailboxes expect too. The only limitation is that the mailbox names cur, new and tmp are prohibited, since these names have special meanings in Maildir structures.
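A check for the reserved names might look roughly like the following. This is a sketch under two assumptions of mine: that mailbox names use `/` as the hierarchy separator, and that the restriction applies to every level of the hierarchy. The function name `validMailboxName` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// reserved holds the directory names that Maildir itself uses.
var reserved = map[string]bool{"cur": true, "new": true, "tmp": true}

// validMailboxName reports whether no component of name collides with a
// directory that Maildir reserves for its own use.
func validMailboxName(name string) bool {
	for _, part := range strings.Split(name, "/") {
		if reserved[part] {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"lists/mine", "lists.mine", "lists/new", "tmp"} {
		fmt.Printf("%-12s valid=%v\n", name, validMailboxName(name))
	}
}
```

With nested directories, lists/mine and lists.mine map to different paths, so only the reserved names need rejecting.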

Because directory mapping changed, installations running a version prior to v0.2.0 need to migrate. The new -m command executes a migration, moving directories and updating the status database accordingly.

Single sqlite connection

Some users reported an occasional “database is locked” error from sqlite. This was caused by Go’s database/sql opening multiple connections to the database by default. While one connection was writing, another tried to write too and failed, because sqlite locks the whole database for writing and allows only one writer at a time. Fixing this was as simple as adding a call to db.SetMaxOpenConns(1). For unknown reasons, I was never able to personally reproduce this issue, despite having synchronised hundreds of thousands of messages.

Placing new messages in new/

Some MUAs make a distinction between the new/ and cur/ directories inside a maildir. ImapGoose now places new unseen messages in new/ and other messages in cur/. This should also help when writing scripts that notify about new unseen messages.

Concurrent edits on flags

It’s possible for a message to have conflicting flag edits, and this is now handled appropriately. For example: a message is synchronised initially with no flags. The IMAP version is externally edited to add the Replied flag and the local message is edited to add the Seen flag. When synchronising after these two events, we need to detect which flags changed exactly. When ImapGoose reads the IMAP message with only the Replied flag, it will compare it with the status. The message had no flag last time, which implies that the Replied flag is new. This is then added to the local copy, but this operation needs to only append a flag without overwriting others (in this case, the new Seen flag which was added locally).

Additionally, when saving the result into the status, we don’t record the Seen flag, because that one is not in sync. This particular operation synced remote-to-local flags only, and the status now records that the last time items were in sync, they all had the Replied flag.

In this particular scenario, we’ll always have another pending task to sync local-to-remote, which will detect the Seen flag that was added locally, and replicate that to the remote server, eventually having both sides with the same view.
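The remote-to-local half of this can be sketched as follows. It is my own reading of the description above rather than ImapGoose’s actual code: `FlagSet` and `syncRemoteToLocal` are hypothetical names, and flags are modelled as plain strings.

```go
package main

import "fmt"

// FlagSet is a hypothetical representation of a message's flags.
type FlagSet map[string]bool

func clone(s FlagSet) FlagSet {
	out := FlagSet{}
	for f := range s {
		out[f] = true
	}
	return out
}

// syncRemoteToLocal applies only the flag changes made on the remote side
// since the last sync. Flags changed locally are left alone and are not
// recorded in the status, so a later local-to-remote task still sees them.
func syncRemoteToLocal(status, remote, local FlagSet) (newLocal, newStatus FlagSet) {
	newLocal, newStatus = clone(local), clone(status)
	for f := range remote {
		if !status[f] { // added remotely since the last sync
			newLocal[f] = true
			newStatus[f] = true
		}
	}
	for f := range status {
		if !remote[f] { // removed remotely since the last sync
			delete(newLocal, f)
			delete(newStatus, f)
		}
	}
	return newLocal, newStatus
}

func main() {
	status := FlagSet{}                // last synced state: no flags
	remote := FlagSet{"Replied": true} // edited on the server
	local := FlagSet{"Seen": true}     // edited locally

	newLocal, newStatus := syncRemoteToLocal(status, remote, local)
	fmt.Println("local: ", newLocal)
	fmt.Println("status:", newStatus)
}
```

In the scenario from the text, the local copy ends up with both Replied and Seen, while the status records only Replied. The pending local-to-remote task then finds Seen missing from the status and uploads it.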

It took me several iterations to get this algorithm just right. My initial approach was to do a two-way sync of flags, but this introduced some added complexity and was somewhat non-obvious to follow. The final design is solid and simple to reason about.

Ignoring flags in filename matches

When looking for a file with an expected name, it’s possible that the file does not exist, yet the message hasn’t been deleted: it’s been renamed. This is because every time the flags of a Maildir message change, the message has to be renamed (the flags are reflected in the filename). This caused some issues in cases where ImapGoose tries to read a file whose flags have changed (and the change hasn’t been processed yet). To work around this, when looking for a file, ImapGoose needs to find any other file in the same directory with the same prefix and different flags. This is a computationally expensive operation, since it requires reading all the directory entries.

The extra cost doesn’t seem to have a noticeable impact on performance. This operation cannot be skipped, because otherwise when a message’s flags change, ImapGoose would delete the remote message, and then re-upload it with the new flags.

User service files

v0.3.0 includes service files for running ImapGoose as a user service under OpenRC and systemd (thanks Clayton).

Initial index import

I added support for converging an existing setup. That is, an IMAP account and a set of local Maildirs which are already in sync. An additional condition is required for this to work: messages in the Maildir must have a U=71931 portion in their filename, where 71931 is the UID of its counterpart on the IMAP server. OfflineIMAP encodes filenames this way, and I’ve heard that mbsync does too.
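Extracting the UID from such a filename could look like the sketch below. It assumes the `U=` portion is preceded by a comma, as in the OfflineIMAP-style names I have seen; the function name `uidFromFilename` and the example filenames are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// uidPattern matches a ",U=<digits>" portion anywhere in a filename.
// The leading comma is an assumption about how the tools separate fields.
var uidPattern = regexp.MustCompile(`,U=(\d+)`)

// uidFromFilename returns the IMAP UID encoded in a Maildir filename,
// or false if the name carries no usable U= portion.
func uidFromFilename(name string) (uint32, bool) {
	m := uidPattern.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	uid, err := strconv.ParseUint(m[1], 10, 32) // IMAP UIDs are 32-bit
	if err != nil {
		return 0, false
	}
	return uint32(uid), true
}

func main() {
	uid, ok := uidFromFilename("1700000000_0.1234.host,U=71931,FMD5=abc:2,S")
	fmt.Println(uid, ok) // prints: 71931 true
}
```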

This initial import is implemented as imapgoose -i, which scans all local messages and compares each one to the remote message with the same UID. If they are the same, ImapGoose records in its status database that these two messages are the same and in sync. No changes are made during this operation. Its output is as follows:

# Repeated for every mailbox:
Mailbox indexed mailbox=INBOX identical=433 mismatched=1 missing=6
# And finally:
Account indexed account=personal identical=83084 mismatched=4 missing=44

mismatched indicates that a file has a U= portion in its name, but either doesn’t match the other copy byte for byte or has flags that have diverged. The latter case is the usual one. missing indicates that a file has a U= portion, but no message exists on the server with that UID.

This initial import feature allows quickly migrating to ImapGoose from other tools, “importing” the existing state and syncing only newer changes in future. Such migrations are one-way. After ImapGoose has imported a mailbox like this, running it again without -i will start synchronising messages. Switching back to the previous tool at this point might be problematic, since its own internal status is ignorant of the changes made by ImapGoose.

imapgoose -i is safe to run at a later time, and is essentially idempotent. It should never be used while another instance is already running.

Simplicity through constraints

Much of ImapGoose’s simplicity comes from supporting only one specific use case.

I’ll use vdirsyncer as a counter-example: it can sync filesystem to caldav, or caldav to caldav, or webdav to filesystem, etc. To support all these combinations, every “backend” is implemented against a common API. This constrains the design: we can’t have one storage implement a feature that others do not. Otherwise, the others need to provide a fallback, or the feature has to be “optional”, with call sites working around its potential non-availability.

On the other hand, in ImapGoose, the IMAP listener and Mailbox watcher return entirely different event types:

// Listener sends: "remote changed"
type RemoteEvent struct {
	Mailbox string // Just mailbox name
}

// Watcher sends: "local changed"
type LocalEvent struct {
	Mailbox string
	AbsPath string // Empty = full mailbox, non-empty = specific file
}

When scanning an IMAP remote, we can ask it for a list including only items which have changed; when we scan a filesystem, we need to read all file entries ourselves. This is why ImapGoose can’t sync one IMAP server to another: it’s designed around this asymmetry.

Service discovery

The server configuration parameter is now optional. If no server is specified, it is determined by using DNS-based service discovery. As a reminder, your local DNS server MUST be a validating DNSSEC resolver in order to avoid MITM attacks.

Performance improvements

Large synchronisation operations sometimes took quite long. It turns out that I had lots of N+1 queries. I’m aware of the usual wisdom against these, but I wrote them thinking about sqlite’s article titled Many Small Queries Are Efficient In SQLite. Ultimately, one query with all results is still dramatically faster.

With that particular issue fixed, performance is good enough at this point that I’m not particularly inclined to tinker with it further. I am aware of places where there is room for theoretical improvements, but ImapGoose can max out my network uplink, so it’s pointless.

CRLF normalisation

Messages transmitted over the network are always transmitted with \r\n (carriage return, line feed) at the end of each line. When we download messages, they’re saved with these line endings. When we upload messages, they are expected to have the same line endings. In theory, any tool that writes to a Maildir SHOULD respect this convention. In practice, they don’t. notmuch even removes a single \r from a single line when a message is moved across mailboxes.

In theory, changing any lines of a message might invalidate any signature which it may contain. In practice, if the message was missing CR, it couldn’t have been transmitted anywhere without being changed anyway. Since messages can’t be uploaded without fixing a missing CR, they’re now automatically fixed (and a warning is logged).
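The fix itself is simple. A sketch of one way to do it is shown below; `normaliseCRLF` is a hypothetical name, and returning a boolean so the caller can log a warning is my own choice.

```go
package main

import "fmt"

// normaliseCRLF inserts a CR before every LF that lacks one and leaves
// existing CRLF pairs untouched. It reports whether anything changed,
// so the caller can log a warning.
func normaliseCRLF(msg []byte) ([]byte, bool) {
	out := make([]byte, 0, len(msg))
	changed := false
	for i, b := range msg {
		if b == '\n' && (i == 0 || msg[i-1] != '\r') {
			out = append(out, '\r')
			changed = true
		}
		out = append(out, b)
	}
	return out, changed
}

func main() {
	fixed, changed := normaliseCRLF([]byte("Subject: hi\r\n\nbody\n"))
	fmt.Printf("%q %v\n", fixed, changed) // prints: "Subject: hi\r\n\r\nbody\r\n" true
}
```

A message that already uses CRLF throughout comes back byte-for-byte identical, with `changed` set to false.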

Messages which we download are guaranteed to end in CRLF (because of how the IMAP protocol encodes them), so messages which were downloaded and moved around are not being tampered with in any way.

Keeping connections alive

Ideally, the connection with the NOTIFY listener remains permanently open. Keeping the connection open is relatively cheap. Through it, we can receive updates immediately, react instantly and fetch any updates that happened remotely.

If the connection is dropped, ImapGoose reconnects and sets up a new NOTIFY listener, but also needs to ask for the new status of all mailboxes. This adds some network overhead which is not ideal (but necessary) during each reconnection.

Many servers close idle connections after a while, so ImapGoose sends a NOOP (basically a ping) to the server every 15 minutes to keep the connection alive. Initial development showed that servers commonly close connections after 30 minutes, so this worked well to keep those connections alive. However, many servers use a much shorter timeout (as short as 5 minutes). When connected to these servers, ImapGoose would need to reconnect every 5 minutes, adding unnecessary overhead.

Making the interval configurable isn’t practical since people don’t have a convenient way to determine the right value. I changed the interval to 3 minutes, which works for servers with short timeouts while adding minimal burden for those with longer timeouts.

I’m still considering implementing auto-detection of how long a server takes to disconnect an idle connection. It’s possible to distinguish between a clean disconnect and a network glitch. ImapGoose could record the interval after which the server closes the connection, store this timeout, and then send pings at half that interval.

Development reflections

Purity of git history

I’ve been iterating on how I code with these new projects.

I made a lot of short micro-commits, reviewing them, keeping notes of bugs, and iterating with the model. When I’m done implementing a feature, I have a large number of commits, many of which have changes known to be faulty in some way, and some introducing hundreds of lines of code which are removed in the following commit. All these commits are steps in the right direction, but are a huge amount of noise. They all get squashed into one commit.

This results in larger commits, some a bit harder to follow since they mix a new feature with a refactor of another function which I noticed was necessary along the way. These types of commits are harder to review, and not ideal when sending somebody else patches. But when working on a new project where nobody else is looking at changes, these details don’t really matter.

By removing this burden of “keep small tidy commits” I’ve also moved faster, since I can focus on fixing the task without focusing on keeping clean commits. When sending patches to someone else, I sometimes need to split a large change into multiple commits, sometimes in a shuffled order. This kind of rebasing takes time, and is only meaningful if someone else will be reviewing commits, in order, and understanding the full context.

Multiple flags in Go

Go likes to be special and different in some unique aspects. One of these is command line arguments and flags. Go’s standard flag package doesn’t support the classic combined single-letter flags (like ls -lt). It treats everything after the dash as a single flag name, as in -list -time.

ImapGoose has -m and -v, but using -mv didn’t work: Go interprets that as a single mv flag rather than two separate flags.

I could have written a paragraph in the documentation warning about this, but it seems cleaner to just ignore Go’s flag parsing abilities and parse these flags on my own. It’s honestly less than 50 lines of code anyway.
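A hand-rolled parser along these lines can indeed be short. The version below is a sketch of the general technique, not ImapGoose’s actual code: the function name `parseShortFlags` is hypothetical, and it only handles boolean single-letter flags.

```go
package main

import (
	"fmt"
	"strings"
)

// parseShortFlags collects boolean single-letter flags from args,
// accepting both "-m -v" and the combined form "-mv". known lists the
// accepted letters. It returns the flags that were set and the remaining
// positional arguments.
func parseShortFlags(args []string, known string) (map[rune]bool, []string, error) {
	flags := map[rune]bool{}
	for i, arg := range args {
		if arg == "--" { // conventional end-of-flags marker
			return flags, args[i+1:], nil
		}
		if !strings.HasPrefix(arg, "-") || arg == "-" {
			return flags, args[i:], nil
		}
		for _, c := range arg[1:] {
			if !strings.ContainsRune(known, c) {
				return nil, nil, fmt.Errorf("unknown flag -%c", c)
			}
			flags[c] = true
		}
	}
	return flags, nil, nil
}

func main() {
	flags, rest, err := parseShortFlags([]string{"-mv", "extra"}, "imv")
	if err != nil {
		panic(err)
	}
	fmt.Println(flags['m'], flags['v'], flags['i'], rest) // prints: true true false [extra]
}
```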

Current status

The latest releases contain mostly minor fixes and optimisations. Work on ImapGoose has slowed down, with it being stable at this point. I’ll gather feedback on the current version for some days, and then tag a v1.0.0.
