CouchDB is kinda postmodern

published by natevw on 2013 July 3, 3:32pm — Subscribe

There is a hierarchy of use cases CouchDB fulfills:

A central RESTful data store
A chain of SLEEPy databases
Many nodes sharing as much data as they agree to disagree about

The first is handy, but boring. It's great that with CouchDB I don't have to work through a bunch of boilerplate every time I want to spin up an API on top of a datastore. CouchDB is a datastore with a great API already included, but RESTful web middleware is nothing a little code generation or some developer typing practice couldn't already provide above any database. The only thing exciting about giving third-parties access to your database over HTTP is how easy it is to accidentally share too much of said data. (More on this in the context of CouchDB in another post, perhaps?)

The second starts to get interesting. As promoted, the "Syncable Lightweight Event Emitting Persistence" aspect of CouchDB's _changes feed allows clients to maintain an up-to-date mirror of an authoritative data set. This is powerful! I've often wanted "a CouchDB" of offline IETF RFCs, up-to-date Wikipedia articles, realtime Open Street Map edits, weather forecasts, you name it. Having "my own" copy of valuable information and being able to efficiently process secondary views of its contents would be liberating. I'm excited to see how far dat can take this idea.

Only the last use case makes CouchDB a truly masterless database, however.

What makes CouchDB [almost] unique among databases is not its _changes feed, but its revision trees! Let's call this feature DOZE (a stretch upon DecentraliZed ObjEcts), or perhaps NAP (for Normative-Agnostic Packratification), to be like the other two.

For every document, CouchDB also builds up metadata concerning its history, namely all known revisions ever. This revision metadata is not a simple ordered list, but a tree allowing multiple branches. When CouchDB replicates it uses this revtree lineage to decide whether an incoming change implies a direct modification to the local document, or a divergent "alternate universe" coming of age for said doc. For the "alternate universe" case, the revtree branches end up leading to multiple documents which co-exist simultaneously. DON'T WORRY "DOZE" GETS EVEN MORE PHILOSOPHICAL

So anyway, most of the documentation and even CouchDB's API itself, glosses over this important aspect of a given document's existence. When it must be explained the community calls it "conflict resolution", rather than the more accurate "branch management". What you need to understand is that the primary index CouchDB offers is not a key-value store. It is inherently a key-values store.

In most multi-node data architectures, "the truth" must be decided before a particular portion of the system is considered synchronized. The most common arrangement for keeping multiple "client" devices in sync to treat a "master" server somewhere as the Single Source of Truth, propagating all changes therethru. You could eliminate the central point of failure, but frequently all parties in such an arrangement still agree-to-agree that, in essence, there is One True Dataset; sync is a matter of eliminating deviations from this absolute.

Apple SyncServices requiring conflict resolution

With DOZE, there is no database. Sync becomes a matter of sharing state, not determining absolute truth. There is still an ideal; within a CouchDB "social network" a given key represents the identity of one particular document. All databases will eventually agree that doc._id exists/existed even if they end up with different contents. In practice, usually the document's type and usefulness within an app also remain consistent between all nodes. And with "standard" CouchDB there is even a veneer of complete system consistency — a polite lie of consensus that (based on its generation number or falling back to comparing random hash output) one revision will "win" against the others by default.

DOZE doesn't "resolve conflicts". CouchDB's actual replication algorithm simply compares and shares revision trees and the most recent known versions of each known document. This is a different model than people imagine based on what gets indexed locally or returned by a naïve GET request — even though on the surface CouchDB pretends like the conflicting versions don't exist, it is pushing and pulling all the options when it replicates. And they are options: you can continue to update a "losing" revision's branch just the same as a "winning" document's if you care to. Once you realize how winningly CouchDB's replicator treats all current versions of a given document you start to get extra annoyed at those 409 errors an individual database so adamantly throws to its own users.

Conflicts happen. They're harder to design an interface around, but this is the only reason CouchDB papers over them. If you want to "do masterless replication" you need to embrace conflicts. The difference between SLEEP and DOZE is the difference between publishing source code and accepting pull requests. SLEEP makes it easy to maintain a mirror, while DOZE makes it easy to maintain a fork. With DOZE you don't need to assume a neutral point of view exists, each database simply accepts only changes it agrees with. In theory, a future version of CouchDB could actually let in and replicate out document revisions that validate_doc_update rejects, without allowing them to "win" locally as far as the current style of indexes and app interfaces go.

In practice, I'm not arguing for an eminently tolerant database. CouchDB tends to assume disk is cheap, but I don't want my SSD wasted on your version of RFC 2616 replacing it with a copy of httpd's SVN repository. My disk, my definition of "normative" whenever possible. It's an interesting idea for allowing dissent "within" a higher level-application, e.g. Wikipedia itself — Ward Cunningham is building that, by the way — but in practice we're probably mostly interested in using CouchDB for offline apps (where the conflicts should indeed be resolved as soon as possible) and the more general DOZE for maintaining corrective patches to otherwise authoritative-but-changing datasets (where the "main server" branch should be left open and updates folded into the locally-patched revision).

In summary, if you're avoiding conflicts, you're probably not using CouchDB to its full potential. I'm as guilty of this as anyone, because developing on top of a single CouchDB database makes conflict resolution easy to "put off". And indeed, you don't have to resolve conflicts in CouchDB and when you do, you don't have to use the arbitrary "winning" revision to do so. Certainly if you're trying to join CouchDB's "masterless replication" ecosystem you'll need to donate some disk space to the cause — to wish there was only one current revision allowed is to wish the data model weren't really decentralized. Don't go back to throwing away data just because it looks old. Track your revtrees, deal with life's inevitable conflicts, and keep postmodernity weird!