The three ways to remove a document from CouchDB
So you're using CouchDB because it has a great RESTful interface for managing data. PUT a JSON document in with various options, GET it back later through several means. Did you know that how you DELETE can also affect how your document is persisted?
Let me explain.
Door number one: DELETE
You PUT a document {"_id":"mydoc", "some_data":42} into your database. You fetch this document directly via GET /whatever_db/mydoc, and the simplest way to "get rid of it" is to DELETE /whatever_db/mydoc. That's a bit oversimplified because you do also have to include the MVCC revision token, but modulo that, it's pretty much what you'd expect from a REST API.
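As a quick sketch, that round trip might look like the following from a script. The http://localhost:5984 address, the whatever_db database name, and the use of JavaScript's built-in fetch (run inside an async context) are my own assumptions, not anything CouchDB dictates:

// Look up the document first: DELETE needs the current MVCC revision.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

// Delete it, passing the revision as a ?rev= query parameter.
const res = await fetch(base + "/mydoc?rev=" + doc._rev, { method: "DELETE" });
console.log(await res.json()); // e.g. {"ok":true,"id":"mydoc","rev":"2-..."}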
Door number two: _deleted
The other way to make a document stop showing up is to PUT/POST it with a special field: {"_id":"mydoc", "some_data":42, "_deleted":true}
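In code, that's just an ordinary document update. A minimal sketch, under the same assumed URL and database as above (the deleted_at field is purely hypothetical, there to illustrate the "record the time you deleted it" idea from the wiki quote below):

// Write a new revision that carries _deleted:true, keeping any extra fields.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/mydoc", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    _id: doc._id,
    _rev: doc._rev,                        // the current revision is still required
    some_data: doc.some_data,              // fields preserved in the tombstone
    deleted_at: new Date().toISOString(),  // e.g. record when it was removed
    _deleted: true
  })
});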
Door number three: don't forget this
The third way is usually the wrong way, and it's not really "deletion" in the normal sense. You can POST document info to a database's _purge API to erase CouchDB's memory of it. The data is still on disk until after database/view compaction and similar disk storage caveats have been taken care of, but the data will cease to exist as far as the API is concerned.
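For completeness, a purge is a POST mapping each document ID to the leaf revisions you want erased. A sketch under the same assumptions as above; note that the exact response shape (and some behavior) varies between CouchDB versions:

// Erase CouchDB's memory of a document. Purged revisions are not
// replicated and cannot be recovered through the API afterwards.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/_purge", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ [doc._id]: [doc._rev] })
});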
Wait…deleted data doesn't normally cease to exist? When you delete a document, it's wrong for CouchDB to actually remove it?!
Right. The CouchDB wiki explains it this way:
Deleted documents remain in the database forever, even after compaction, to allow eventual consistency when replicating. If you delete using the DELETE method above, only the _id, _rev and a deleted flag are preserved. If you deleted a document by adding "_deleted":true then all the fields of the document are preserved. This is to allow, for example, recording the time you deleted a document, or the reason you deleted it.
Syncing _changes is foundational to CouchDB's data model; not forgetting deleted documents lets masterless replication bring each instance into a shared state.
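You can see those tombstones directly in the changes feed. A sketch, with placeholder values in the comment:

// Deleted documents still appear in _changes, which is what lets
// replication carry the deletion over to other databases.
const base = "http://localhost:5984/whatever_db";
const feed = await (await fetch(base + "/_changes")).json();

// A deleted document shows up with a "deleted" flag, along the lines of:
// {"seq": "...", "id": "mydoc", "changes": [{"rev": "2-..."}], "deleted": true}
console.log(feed.results.filter(function (row) { return row.deleted; }));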
Which one should you use?
The choice depends (mostly) on how you're syncing between databases:
- With filtered replication, you might want to add _deleted:true alongside the original document data
- For normal/plain/unfiltered replication, you can simply DELETE
- If you are NOT replicating, _purge has its uses
Simply DELETE-ing a document could break filtered replication. If a filter function only makes changes visible based on, say, an application-specific "type" or "user" field, these fields will not be present in the final {"_id":"mydoc","_deleted":true} stub left behind by a simple DELETE. Consider this filter function, equivalent to the one in the filter function guide:
function(doc, req) {
  // Only pass along documents whose "name" field matches the session's user.
  if (doc.name && doc.name == req.userCtx.name) {
    return true;
  } else {
    return false;
  }
}
This filter only exposes revisions if there is a "name" field and it matches the session's user information. Revisions resulting from a DELETE request will not have any "name" field. That change won't get passed along, so the document will be left in the target database as-is! In order to propagate deletes, either the document's other fields should be retained in its _deleted version (by using PUT/POST instead of DELETE), or the filter function would need to return true after checking if (doc._deleted), before looking at any other fields.
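For example, a _deleted-aware version of the filter above might look like this. It's a sketch of that second option, and it passes every deletion through, which has the trade-offs discussed below:

function(doc, req) {
  // Pass deletions through, even though the stub has no "name" field.
  if (doc._deleted) {
    return true;
  }
  // Otherwise, only expose documents belonging to the session's user.
  if (doc.name && doc.name == req.userCtx.name) {
    return true;
  }
  return false;
}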
Consistency
Even if you're using filtered replication, leaving behind full copies of data that is no longer needed might be undesirable. Remember that the final form of a document will eventually propagate to all databases. Especially if your documents are large or deletions are frequent, there's no sense wasting bandwidth and disk space just to work around a naïve filter function.
Note that simply propagating all deletions to avoid the filtering issue has its own drawbacks, though. If you're using filtered replication because the target has limited storage space, or to keep private data private, copying the last version of every deleted document to every replica is not a good solution. At best it would be a waste of resources, at worst it could leak sensitive data — especially if some parts of your app are simply flagging original contents as deleted, while some of your filters are assuming that any delete is safe to propagate!
A hybrid solution might be appropriate: if you know which fields are relied on, across all of your replication filters, you could store those in the final revision while omitting the others. This enables filter functions to remain picky about what they pass, while increasing space savings after compaction and slightly mitigating the risk of an inadvertently propagated delete.
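Continuing the hypothetical filter above, the hybrid approach would write a final revision that keeps only the field the filters check. A sketch, again against the assumed localhost database:

// A "hybrid" tombstone: keep only what the replication filters rely on,
// and drop every other field from the final revision.
const base = "http://localhost:5984/whatever_db";
const doc = await (await fetch(base + "/mydoc")).json();

await fetch(base + "/mydoc", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    _id: doc._id,
    _rev: doc._rev,
    name: doc.name,   // the only field the filter function checks
    _deleted: true
  })
});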
When you're consistently aware of the implications, any method can be used to control the final state of a document — just make sure its last words are the most appropriate epitaph.