The three ways to remove a document from CouchDB
So you're using CouchDB because it has a great RESTful interface for managing data. PUT a JSON document in with various options, GET it back later through several means. Did you know that how you DELETE can also affect how your document is persisted?
Let me explain.
Door number one: DELETE
You PUT a document {"_id":"mydoc", "some_data":42} into your database. You fetch this document directly via GET /whatever_db/mydoc, and the simplest way to "get rid of it" is to DELETE /whatever_db/mydoc. That's a bit oversimplified because you do also have to include the MVCC revision token, but modulo that, it's pretty much what you'd expect from a REST API.
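As a quick sketch, that round trip might look like the following from a script. The http://localhost:5984 address, the whatever_db database name, and the use of JavaScript's built-in fetch (run inside an async context) are my own assumptions, not anything CouchDB dictates:

// Look up the document first: DELETE needs the current MVCC revision.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

// Delete it, passing the revision as a ?rev= query parameter.
const res = await fetch(base + "/mydoc?rev=" + doc._rev, { method: "DELETE" });
console.log(await res.json()); // e.g. {"ok":true,"id":"mydoc","rev":"2-..."}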
Door number two: _deleted
The other way to make a document stop showing up is to PUT/POST it with a special field: {"_id":"mydoc", "some_data":42, "_deleted":true}
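In code, that's just an ordinary document update. A minimal sketch, under the same assumed URL and database as above (the deleted_at field is purely hypothetical, there to illustrate the "record the time you deleted it" idea from the wiki quote below):

// Write a new revision that carries _deleted:true, keeping any extra fields.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/mydoc", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    _id: doc._id,
    _rev: doc._rev,                        // the current revision is still required
    some_data: doc.some_data,              // fields preserved in the tombstone
    deleted_at: new Date().toISOString(),  // e.g. record when it was removed
    _deleted: true
  })
});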
Door number three: don't forget this
The third way is usually the wrong way, and it's not really "deletion" in the normal sense. You can POST document info to a database's _purge API to erase CouchDB's memory of it. The data is still on disk until after database/view compaction and similar disk storage caveats have been taken care of, but the data will cease to exist as far as the API is concerned.
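For completeness, a purge is a POST mapping each document ID to the leaf revisions you want erased. A sketch under the same assumptions as above; note that the exact response shape (and some behavior) varies between CouchDB versions:

// Erase CouchDB's memory of a document. Purged revisions are not
// replicated and cannot be recovered through the API afterwards.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/_purge", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ [doc._id]: [doc._rev] })
});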
Wait…deleted data doesn't normally cease to exist? When you delete a document, it's wrong for CouchDB to actually remove it?!
Right. The CouchDB wiki explains it this way:
Deleted documents remain in the database forever, even after compaction, to allow eventual consistency when replicating. If you delete using the DELETE method above, only the _id, _rev and a deleted flag are preserved. If you deleted a document by adding "_deleted":true then all the fields of the document are preserved. This is to allow, for example, recording the time you deleted a document, or the reason you deleted it.
Syncing _changes is foundational to CouchDB's data model; not forgetting deleted documents lets masterless replication bring each instance into a shared state.
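You can see those tombstones directly in the changes feed. A sketch, with placeholder values in the comment:

// Deleted documents still appear in _changes, which is what lets
// replication carry the deletion over to other databases.
const base = "http://localhost:5984/whatever_db";
const feed = await (await fetch(base + "/_changes")).json();

// A deleted document shows up with a "deleted" flag, along the lines of:
// {"seq": "...", "id": "mydoc", "changes": [{"rev": "2-..."}], "deleted": true}
console.log(feed.results.filter(function (row) { return row.deleted; }));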
Which one should you use?
The choice depends (mostly) on how you're syncing between databases:
- With filtered replication, you might want to add _deleted:true alongside the original document data
- For normal/plain/unfiltered replication, you can simply DELETE
- If you are NOT replicating, _purge has its uses
Simply DELETE-ing a document could break filtered replication. If a filter function only makes changes visible based on, say, an application-specific "type" or "user" field, these fields will not be present in the final {"_id":"mydoc","_deleted":true} stub left behind by a simple DELETE. Consider this filter function, equivalent to the one in the filter function guide:
function(doc, req) {
  // Only pass along documents whose "name" field matches the session's user.
  if (doc.name && doc.name == req.userCtx.name) {
    return true;
  } else {
    return false;
  }
}
This filter only exposes revisions if there is a "name" field and it matches the session's user information. Revisions resulting from a DELETE request will not have any "name" field. That change won't get passed along, so the document will be left in the target database as-is! In order to propagate deletes, either the document's other fields should be retained in its _deleted version (by using PUT/POST instead of DELETE), or the filter function would need to return true after checking if (doc._deleted), before looking at any other fields.
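For example, a _deleted-aware version of the filter above might look like this. It's a sketch of that second option, and it passes every deletion through, which has the trade-offs discussed below:

function(doc, req) {
  // Pass deletions through, even though the stub has no "name" field.
  if (doc._deleted) {
    return true;
  }
  // Otherwise, only expose documents belonging to the session's user.
  if (doc.name && doc.name == req.userCtx.name) {
    return true;
  }
  return false;
}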
Consistency
Even if you're using filtered replication, leaving behind full copies of data that is no longer needed might be undesirable. Remember that the final form of a document will eventually propagate to all databases. Especially if your documents are large or deletions are frequent, there's no sense wasting bandwidth and disk space just to work around a naïve filter function.
Note that simply propagating all deletions to avoid the filtering issue has its own drawbacks, though. If you're using filtered replication because the target has limited storage space, or to keep private data private, copying the last version of every deleted document to every replica is not a good solution. At best it would be a waste of resources, at worst it could leak sensitive data — especially if some parts of your app are simply flagging original contents as deleted, while some of your filters are assuming that any delete is safe to propagate!
A hybrid solution might be appropriate: if you know which fields are relied on, across all of your replication filters, you could store those in the final revision while omitting the others. This enables filter functions to remain picky about what they pass, while increasing space savings after compaction and slightly mitigating the risk of an inadvertently propagated delete.
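Continuing the hypothetical filter above, the hybrid approach would write a final revision that keeps only the field the filters check. A sketch, again against the assumed localhost database:

// A "hybrid" tombstone: keep only what the replication filters rely on,
// and drop every other field from the final revision.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/mydoc", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    _id: doc._id,
    _rev: doc._rev,
    name: doc.name,   // the only field the filter function checks
    _deleted: true
  })
});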
When you're consistently aware of the implications, any method can be used to control the final state of a document — just make sure its last words are the most appropriate epitaph.