What is a Couch?

published by natevw on 2012 May 29, 5:35pm

I've tried to explain this many times: to myself, to clients, and to peers. I'll try again here.

Various types of Couch

I suppose you could call these subspecies.

Apache CouchDB

Started life as Lotus Notes built of the web.

It is "a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API". Masterless synchronization is a very important part of its DNA, and I'd also note its ability to attach arbitrary binary files to said JSON documents.
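To make "JSON over regular HTTP" a little more concrete, here is a minimal sketch of the request that stores one document. The server address, the `notes` database and the `build_put_document` helper are all assumptions for illustration; the request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a CouchDB server listening here; adjust for your own setup.
COUCH_URL = "http://localhost:5984"


def build_put_document(db, doc_id, doc, base_url=COUCH_URL):
    """Build (but do not send) the HTTP request that stores a JSON document."""
    body = json.dumps(doc).encode("utf-8")
    # Document ids become part of the URL path, so they are percent-encoded.
    path = f"{base_url}/{db}/{urllib.parse.quote(doc_id, safe='')}"
    return urllib.request.Request(
        url=path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_document("notes", "first-note", {"title": "What is a Couch?"})
print(request.get_method(), request.full_url)
# Sending it for real would be: urllib.request.urlopen(request)
```

Nothing here is specific to a client library, which is rather the point: any tool that can speak HTTP and JSON can talk to the database.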

Use for:

all your NoSQL needs, or…
prototyping, casual app hosting
de-facto standard for peer-to-peer replication

Riak

No relation, not relevant, not capable of interbreeding. Happens to resemble CouchDB enough to merit a brief mention.

Use for:

all your NoSQL needs, when…
…you're not interested in CouchDB, but want a reliable document store that speaks JSON over HTTP

BigCouch

Apache CouchDB plus scalability (via Dynamo-style clustering).

Use for:

bigger datasets and/or greater uptime

TouchDB

Apache CouchDB minus scalability (currently lacks certain storage and concurrency optimizations).

Use for:

not-too-big datasets in native apps (Android, iPhone, Mac)

PouchDB

Apache CouchDB in the browser.

Use for:

offline Chrome and/or Firefox webapps

Couchbase 2.0

is still Membase, and has nothing at all in common with Apache CouchDB
except it will provide the same MapReduce/GeoCouch view features via the same API calls
(actually via a heavily modified fork of Apache CouchDB — consider this an implementation detail)
except it may be replication-compatible with TouchDB (and therefore CouchDB!) via a "Syncpoint" bridge
(actually includes the efforts of many past/present Apache CouchDB contributors — consider this too an implementation detail)

Use for:

all your NoSQL needs, and…
speed and scalability
also, did I mention speed and scalability?

So what is a Couch?

Uhh.

It seems, at its core, a Couch is a non-atomic collection of atomic JSON documents. Each document has a unique identity, and a sequential list of changes to a given set of these documents is always available.
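As a rough illustration of that core, here is a toy in-memory sketch. `TinyCouch` and its method names are hypothetical, and it leaves out revisions, conflicts and deletions entirely; it only shows whole-document writes, unique ids, and a sequential changes list.

```python
import copy


class TinyCouch:
    """Hypothetical in-memory stand-in for the core model, not a real Couch."""

    def __init__(self):
        self.docs = {}      # doc id -> latest stored document
        self.last_seq = 0   # counter bumped on every write
        self.seq_of = {}    # doc id -> sequence number of its latest change

    def put(self, doc_id, body):
        """Store one whole document; a single doc is the unit of atomicity."""
        self.last_seq += 1
        self.docs[doc_id] = copy.deepcopy(body)
        self.seq_of[doc_id] = self.last_seq
        return self.last_seq

    def get(self, doc_id):
        return copy.deepcopy(self.docs[doc_id])

    def changes(self, since=0):
        """Return (seq, doc_id) pairs newer than `since`, in write order."""
        newer = [(seq, doc_id) for doc_id, seq in self.seq_of.items() if seq > since]
        return sorted(newer)


couch = TinyCouch()
couch.put("alice", {"likes": "sofas"})
couch.put("bob", {"likes": "futons"})
couch.put("alice", {"likes": "couches"})   # replaces the whole document
print(couch.changes())          # [(2, 'bob'), (3, 'alice')]
print(couch.changes(since=2))   # [(3, 'alice')]
```

Note that there is no way to write "alice" and "bob" together in one step. That is the "non-atomic collection" half of the description.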

This core enables two common extensions:

Masterless, eventually consistent replication between any two such collections
Incremental (asynchronous) indexing of a collection, usually via "map" and "reduce" functions, usually defined by JavaScript code
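The replication idea can be sketched in a few lines, under heavy assumptions. `Peer`, `next_rev` and `pull_from` are hypothetical names, and the conflict handling is deliberately simplified: the higher revision simply wins and the loser is discarded, whereas Apache CouchDB picks a winner deterministically but keeps the losing revisions around as conflicts.

```python
import hashlib
import json


def next_rev(prev_rev, body):
    """Make a (generation, digest) revision for a new version of a document."""
    generation = prev_rev[0] + 1 if prev_rev else 1
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return (generation, digest)


class Peer:
    """Hypothetical collection that can pull changes from any other Peer."""

    def __init__(self):
        self.docs = {}   # doc id -> (rev, body)
        self.log = []    # append-only list of doc ids, one entry per change

    def put(self, doc_id, body):
        prev = self.docs.get(doc_id)
        self._store(doc_id, next_rev(prev[0] if prev else None, body), body)

    def _store(self, doc_id, rev, body):
        self.docs[doc_id] = (rev, body)
        self.log.append(doc_id)

    def pull_from(self, other, since=0):
        """Apply other's changes after `since`; return the new checkpoint."""
        for doc_id in other.log[since:]:
            rev, body = other.docs[doc_id]
            mine = self.docs.get(doc_id)
            # Simplification: the higher revision wins outright, on every peer.
            if mine is None or rev > mine[0]:
                self._store(doc_id, rev, body)
        return len(other.log)


phone, laptop = Peer(), Peer()
phone.put("todo", {"text": "buy a couch"})
laptop.put("note", {"text": "relax"})
laptop.pull_from(phone)
phone.pull_from(laptop)
print(phone.docs == laptop.docs)   # True
```

Neither peer is the master here. Each one only needs the other's list of changes and a rule for comparing revisions that every peer applies the same way.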

The first extension — peer-to-peer synchronization! — is the most interesting. The second extension is related to the first — because of the simple/powerful syncability of a collection's changes feed, we can decouple the practical "optimized access" indexing options from the core collecting of data.
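Here is a small sketch of that decoupling: an index that lives outside the collection and catches up by reading only the changes it has not seen yet. `IncrementalIndex` is a hypothetical name, the changes list is assumed to be `(seq, doc_id)` pairs, and deletions are ignored.

```python
class IncrementalIndex:
    """Hypothetical view index that catches up from a changes list on demand."""

    def __init__(self, map_fn):
        self.map_fn = map_fn   # doc -> iterable of (key, value) pairs
        self.rows = {}         # doc id -> rows emitted by that doc
        self.last_seq = 0      # how far into the changes list we have read

    def update(self, changes, docs):
        """Re-map only the documents changed since the last update."""
        for seq, doc_id in changes:
            if seq <= self.last_seq:
                continue   # already indexed
            self.rows[doc_id] = list(self.map_fn(docs[doc_id]))
            self.last_seq = seq

    def query(self, reduce_fn=None):
        """Return rows sorted by key, or one reduced value over all rows."""
        rows = sorted(row for emitted in self.rows.values() for row in emitted)
        return reduce_fn([value for _, value in rows]) if reduce_fn else rows


def seats_by_type(doc):
    yield (doc["type"], doc["seats"])


docs = {"a": {"type": "sofa", "seats": 3}, "b": {"type": "chair", "seats": 1}}
changes = [(1, "a"), (2, "b")]

index = IncrementalIndex(seats_by_type)
index.update(changes, docs)
print(index.query())        # [('chair', 1), ('sofa', 3)]

docs["a"] = {"type": "sofa", "seats": 4}
changes.append((3, "a"))
index.update(changes, docs)   # only seq 3 is processed this time
print(index.query(sum))     # 5
```

The collection never needs to know the index exists, so the same data could feed several indexes, or none.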

Above this core is where most of the differences lie. Does this collection contain data from multiple users, or just one? Will a collection live redundantly across multiple servers, or be self-contained on just one? Will a user have direct access to this collection of data, or use it only through a middleware layer? Does this data include access control and indexing methods and display logic, or is that outside the scope of a database?

In my opinion, direct bi-directional access to my own data via a standard peer-to-peer synchronization protocol is the most alluring, most important, feature of a Couch. However, implementing this in practice foists all manner of odd questions and unusual concerns upon the database layer. The Apache CouchDB project has managed some very creative solutions to perhaps 90% of these problems, but solving the next 90% will test the community's (communities'?) mettle.

And why does this matter.

Imagine a world.

Where you don't need to give your data to a big corporation, to have it on all your little devices.

Where you don't need to take your data out of one app, to use it in another.

Where you don't need all your data, but it doesn't hurt to keep it.

Is this terribly important? Maybe, maybe not.

At all likely? No.

Worth investigating?

I think so. And the Couch ecosystem provides a great foundation for trying.

Hippie.

So maybe all this data freedom mumbo-jumbo is for paranoid people who hate business models and don't believe in economies of scale.

Learning/knowing/understanding the Couch model is still important. It will improve the way you think about data. It will improve the way you manage state. It will improve your architecture.

The Couch ecosystem is still young, and it is becoming kind of a mess. But the diversity within this particular species of "NoSQL database" is an indication of how resilient the DNA in its nucleus is. Its core model is solid enough in theory and simple enough in practice that we already see a baseline of interoperability between Couches of all shapes and sizes.

Ignore all the magic unicorn webscale utopia stuff, pretend most versions aren't nearly impossible to compile, pretend the best API documentation isn't trapped in narrative form, and give CouchDB a try anyway. Then something something…relax! Okay.