Our current Bloom-filter-based protocol for syncing Automerge documents has some problems.
This document is a proposal for a new sync protocol that would not require any sync states — it remains efficient without storing the outcome of the last sync session with the same peer. (Sync states may still be desired from an application point of view, to inform the user, but this is a separate concern.) Moreover, it removes the need to recompute the hash graph in most situations, and instead allows the sync protocol to directly read data blocks from storage and send them over the network without having to decode and decompress them. Those data blocks can contain compressed blocks of changes, which take up much less space than the individual changes.
The new protocol does not use Bloom filters, but instead relies on Merkle trees that are structurally similar to the Merkle Search Trees that we want to adopt anyway for the purpose of syncing large document collections. If we can use essentially the same mechanism for syncing the changes within a single document as we use for syncing collections of documents, we will probably end up with a protocol that is simpler and requires less code overall.
The core idea of the proposed protocol is quite simple. We put all of the changes in a document’s editing history (from all branches) into a linear order, ensuring that this order is the same on all replicas (I’ll discuss later how to do this). We then build a Merkle tree in which every node represents a contiguous subsequence of this log, and in which every parent node represents the concatenation of the subsequences of each of its children. The tree is balanced, so its height is logarithmic in the number of changes. The tree is constructed in a deterministic way, so that two replicas with the same log of changes have the same tree structure, and hence the same root hash.
Two replicas can reconcile their trees by first comparing their root hashes; if they match, the replicas are in sync (barring hash collisions). Otherwise they compare the hashes at the next level: any subtrees with the same hashes must be identical, and they recurse into any subtrees whose hashes differ. In a logarithmic number of round-trips they know which changes they need to send to each other in order to get in sync.
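As a rough illustration of the recursion (not the wire protocol itself), here is a minimal Python sketch. The `Node` type is hypothetical, and for simplicity it assumes the local peer already knows the full set of subtree hashes the remote peer has; in the real protocol that knowledge is built up level by level, one round-trip per level, which is where the logarithmic bound comes from.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Node:
    """One Merkle tree node: at a leaf, `hash` is the change's own hash and
    `children` is empty; at an inner node, `hash` covers the whole subtree."""
    hash: bytes
    children: List["Node"] = field(default_factory=list)

def changes_to_send(local: Node, remote_hashes: Set[bytes]) -> List[bytes]:
    """Hashes of changes the remote peer is missing. A subtree whose hash the
    remote peer already knows must be identical on both sides, so it is
    skipped without looking inside it; only differing subtrees are expanded."""
    if local.hash in remote_hashes:
        return []
    if not local.children:                 # differing leaf: send this change
        return [local.hash]
    needed: List[bytes] = []
    for child in local.children:
        needed.extend(changes_to_send(child, remote_hashes))
    return needed
```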
The tree structure is determined similarly to a Merkle Search Tree, except that we use the linear order on changes instead of key order (it’s also similar to Prolly Trees and Skip Lists). Any change whose hash does not begin with 4 zero bits is at level 0 (a leaf node in the tree); a change whose hash begins with 4 zero bits is the start of a level-1 node (parent of leaf nodes); a change whose hash begins with 8 zero bits is the start of a level-2 node; in general, a change whose hash begins with 4*i zero bits is the start of a level-i node. (I picked 4 arbitrarily; we can run some experiments to find the best value.) The level-i nodes from one level-(i+1) boundary to the next are the children of that level-(i+1) node. Assuming change hashes have a pseudorandom uniform distribution, this results in a probabilistically balanced tree.
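To make the construction concrete, here is a Python sketch of the level rule and of building the tree bottom-up from the linear change log. The `Node` type, the use of SHA-256 over concatenated child hashes, and the handling of the run of nodes before the first boundary are assumptions of this sketch, not decisions the proposal commits to.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """At a leaf, `hash` is the change's own hash; at an inner node it covers
    the whole subtree (SHA-256 over the child hashes, as a stand-in)."""
    hash: bytes
    children: List["Node"] = field(default_factory=list)

def node_level(change_hash: bytes, bits_per_level: int = 4) -> int:
    """Level implied by a change hash: the number of leading zero bits,
    in units of bits_per_level (4 here, matching the arbitrary choice above)."""
    zeros = 0
    for byte in change_hash:
        if byte == 0:
            zeros += 8
        else:
            zeros += 8 - byte.bit_length()
            break
    return zeros // bits_per_level

def first_change(node: Node) -> bytes:
    """Hash of the first change covered by this subtree."""
    return node.hash if not node.children else first_change(node.children[0])

def build_tree(changes_in_order: List[bytes], bits_per_level: int = 4) -> Node:
    """Build the Merkle tree bottom-up from the agreed linear order of change
    hashes. A level-(i+1) node starts at every level-i node whose first change
    has at least (i+1)*bits_per_level leading zero bits; the very first node of
    the sequence also starts a parent, so every node has somewhere to go."""
    if not changes_in_order:
        raise ValueError("change log must not be empty")
    nodes = [Node(hash=h) for h in changes_in_order]    # level-0 leaves
    level = 1
    while len(nodes) > 1:
        parents: List[Node] = []
        for node in nodes:
            if not parents or node_level(first_change(node), bits_per_level) >= level:
                parents.append(Node(hash=b""))          # start a new parent
            parents[-1].children.append(node)
        for parent in parents:
            parent.hash = hashlib.sha256(
                b"".join(child.hash for child in parent.children)).digest()
        nodes, level = parents, level + 1
    return nodes[0]
```

Because the boundaries depend only on the change hashes and the agreed linear order, two replicas holding the same log produce the same tree and hence the same root hash.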
This construction also provides a natural structure for compressing blocks of changes: for some subtrees we might not store the individual changes within that subtree, but only store the compressed representation of that subtree’s subsequence of changes. We can always decompress this into the individual changes, but we would rarely need to do that: most of the time, if the tree reconciliation protocol reaches one of those subtrees, we just send the compressed representation of that whole subtree, which is faster and more compact.
The Merkle tree would still be computed on the basis of the uncompressed changes, so compression does not affect the Merkle root hash. When a peer receives a compressed block for the first time, it needs to decompress it into individual changes and recompute the hashes to make sure they match the Merkle hash for that subtree. However, the peer sending the compressed block does not need to do any decoding: it just loads the block’s bytes from storage and sends them over the network without further processing. And even the receiving peer does not need to recompute the hash graph of the whole document; it only needs to compute the subtree that it has just received.
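A sketch of the receiving side, with several placeholders: zlib stands in for whatever block compression is actually used, the length-prefixed framing of changes inside the block is invented for this example, and SHA-256 stands in for the change hash.

```python
import hashlib
import zlib   # stand-in for whatever block compression format is actually used

def verify_compressed_block(block: bytes, leaf_hashes: list) -> list:
    """What the receiving peer does with a compressed subtree it has not seen
    before: decompress it into individual changes, hash each one, and check the
    results against the subtree's leaf hashes (from which the subtree's Merkle
    hash is derived). Hypothetical framing: each change is length-prefixed and
    the changes are concatenated before compression."""
    raw = zlib.decompress(block)
    changes, pos = [], 0
    while pos < len(raw):
        length = int.from_bytes(raw[pos:pos + 4], "big")
        changes.append(raw[pos + 4:pos + 4 + length])
        pos += 4 + length
    if [hashlib.sha256(c).digest() for c in changes] != leaf_hashes:
        raise ValueError("compressed block does not match its Merkle subtree")
    return changes   # the sending peer, by contrast, ships `block` unmodified
```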
We would need some sort of heuristic for deciding which subtrees to store in this compressed way. For example, we could do it for all subtrees where the compressed representation is less than a quarter of the size of the individual changes, or something along those lines. We also might not compress any subtree that contains one of the current document heads, since that subtree is more likely to change often than a subtree that is further back in the document history. We can play with the parameters to see how it performs. The compression strategy can easily be changed without making breaking changes to the protocol.
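Expressed as code, such a heuristic might look like the following; the quarter-size threshold and the rule about document heads are the placeholder parameters mentioned above.

```python
def should_compress_subtree(raw_size: int, compressed_size: int,
                            contains_head: bool,
                            max_ratio: float = 0.25) -> bool:
    """Placeholder heuristic: store a subtree as a compressed block only if
    the compressed form is under a quarter of the raw size and the subtree
    does not contain a current document head (heads are likely to change soon)."""
    return not contains_head and compressed_size < max_ratio * raw_size
```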
All of this is dependent on peers putting changes in the same linear order. In principle, any topological sort of the dependency graph would work; for example, ordering changes lexicographically by the first opId in the change would give us a deterministic topological sort.
However, for the block compression to be effective, we want to keep a single actorId’s sequence of changes consecutive as much as possible (that way, when typing a text document, the reference character of an insertion is the previously inserted character with fairly high probability). Sorting changes by opId would interleave changes from two parallel branches in the linear order, which would destroy the effectiveness of the RLE compression.
Much better would be to define the linear order as the topological sort based on depth-first search, which is explained on Wikipedia. It starts with the oldest change, traverses the whole DAG depth-first (keeping track of which elements have already been visited), and then builds up the topological sort in reverse order (newest to oldest) as it returns from the recursive calls. The only nondeterminism in this algorithm is the traversal order when a change has two or more changes that depend on it (i.e. the start of a branch); we can make it deterministic by traversing those branches in order of the hashes of their first change, or in actorId order. With this algorithm, all changes from one branch are consecutive in the linear order.
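Here is a minimal Python sketch of that sort, assuming each change is identified by its hash and `deps` maps every change (including those with no dependencies) to the hashes it depends on. Ties at branch points are broken by successor hash, the first of the two tie-breaking options above; a real implementation would traverse iteratively rather than recursively to cope with long histories.

```python
from typing import Dict, List, Set

def topological_order(deps: Dict[bytes, List[bytes]]) -> List[bytes]:
    """Deterministic depth-first topological sort of the change DAG.
    `deps` maps each change hash to the hashes it depends on; we invert it
    to get successor edges, then traverse depth-first from the oldest
    changes, prepending each change once all of its descendants are done."""
    successors: Dict[bytes, List[bytes]] = {h: [] for h in deps}
    for change, parents in deps.items():
        for parent in parents:
            successors[parent].append(change)

    order: List[bytes] = []            # built newest-to-oldest by prepending
    visited: Set[bytes] = set()

    def visit(change: bytes) -> None:
        if change in visited:
            return
        visited.add(change)
        for succ in sorted(successors[change]):   # hash order breaks ties
            visit(succ)
        order.insert(0, change)        # prepend once all descendants are done

    for change in sorted(deps):        # deterministic entry order; also
        visit(change)                  # covers DAGs with multiple roots
    return order
```

With this traversal, every dependency ends up before the changes that depend on it, and the changes of one branch stay consecutive, which is what the block compression needs.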
When we receive some new changes, we need to figure out where to place them in the linear order without repeating the depth-first traversal on the whole change DAG. (Note that this topological sort is also a CRDT – we want all peers to converge to the same linear order of changes, regardless of the order in which changes are added.) I need to think harder about this, but I believe something like the following algorithm might work: