Notes and tips on optimizing CouchDB performance

We’ve been using CouchDB for a couple of years now, starting with pcapr. Couch was still at version 0.8 when we first started using it, and it has come a long way since then. And so have we. We are actively using Couch both in the cloud (pcapr and Test Cloud) as well as in our product. With the last release of Mu Studio, we can now daisy chain any number of appliances to generate truly elastic scale. All the engines are coordinated through the master, which runs Couch, and we use it to map/reduce the massive amounts of statistics collected by the scale engines.

Having said that, here are some tips and assorted notes on squeezing every bit of performance out of Couch. These are not just Couch configuration settings, but more of an end-to-end set of tweaks. Do leave a comment if you have more suggestions.

Conflicting Thoughts

While Couch is all about JSON docs, it’s still important to think through who owns these docs and who updates them. Think about concurrency: many people trying to do many things at once. If they are all modifying the same document at the same time, you are going to get lots of conflicts, with everyone stepping on everyone else’s toes. It’s a lot easier to avoid conflicts up front and get your document saved without a lot of fuss.

In general, try to model your documents so that they can be saved and updated without conflicting with each other. And yeah, don’t start a flame war on a mailing list. Life’s short.
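To make the modeling point concrete, here's a rough sketch of the two shapes side by side. The blog-post-with-comments schema, the field names and the comment_doc helper are all made up for illustration; they are not how pcapr stores anything.

```ruby
require 'securerandom'

# Conflict-prone shape (hypothetical): every commenter has to rewrite this
# one document, so concurrent writers race on the same _rev.
shared_doc = {
  '_id'      => 'post-1',
  '_rev'     => '3-abc',
  'comments' => ['first!', 'nice post']
}

# Conflict-free shape (hypothetical): each comment is its own document and
# points back at the post, so no two writers ever touch the same doc.
def comment_doc(post_id, author, text)
  {
    '_id'     => SecureRandom.uuid,
    'type'    => 'comment',
    'post_id' => post_id,
    'author'  => author,
    'text'    => text
  }
end

a = comment_doc('post-1', 'alice', 'first!')
b = comment_doc('post-1', 'bob', 'nice post')
```

With the second shape, a view keyed on post_id can stitch the comments back together at read time.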

include_docs

I’ve seen this in a number of situations. The following map function should get you stripped of your MacBook and put on a remote dumb terminal:
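The pattern in question is a map function that emits the whole document as the value. A minimal sketch, written as the Ruby hash you'd save as a design document; the design doc name, the view name and the type field are placeholders, not anything from a real schema:

```ruby
# Hypothetical design document illustrating the anti-pattern.
wasteful_ddoc = {
  '_id'   => '_design/foo',
  'views' => {
    'by_type' => {
      # The entire doc is copied into the view index as the emitted value.
      'map' => 'function(doc) { emit(doc.type, doc); }'
    }
  }
}
```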

If you didn’t know about include_docs, it’s time to read up on it. Each view (as long as it’s not reduced) knows which document emitted it. This means that when you query the view, you can always add include_docs=true to the query parameters and get back the doc that emitted the key, value pair. Including the entire doc in the emit adds up precious view index space and is really redundant anyways.
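Here's a sketch of the leaner approach: emit null as the value and ask for the docs at query time. The design doc, view and field names are placeholders, and view_path is just a hypothetical helper for building the query string.

```ruby
require 'uri'

# Hypothetical design document: the value is null, so the index stays small.
lean_ddoc = {
  '_id'   => '_design/foo',
  'views' => {
    'by_type' => {
      'map' => 'function(doc) { emit(doc.type, null); }'
    }
  }
}

# Build the path for a view query; params become the query string.
def view_path(db, ddoc, view, params = {})
  path  = "/#{db}/_design/#{ddoc}/_view/#{view}"
  query = URI.encode_www_form(params)
  query.empty? ? path : "#{path}?#{query}"
end

# include_docs=true asks Couch to attach the source doc to each row.
path = view_path('db', 'foo', 'by_type', :include_docs => true)
# path is "/db/_design/foo/_view/by_type?include_docs=true"
```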

CouchRest and RestClient

We are a Ruby shop, through and through, though JavaScript is trending pretty darn fast. Still, we love Sinatra and the sheer power of Ruby to do some amazing stuff. You can read more about how we use Ruby in these blogs. While CouchRest and RestClient are both awesome, on a high-volume site there are things you need to watch out for to save precious bandwidth. The first thing, of course, is the default_headers that these libraries add, i.e.,

Accept: application/json
Accept-Encoding: gzip, deflate

You definitely don’t need those for basic Couch RESTful calls. The simplest way to get around this, of course, is to monkey patch these like so:
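A minimal sketch of such a patch. It assumes the baseline headers come from a default_headers instance method on RestClient::Request, which is how the RestClient versions I've looked at work; check your installed version before relying on it.

```ruby
# Load the gem when it is installed so that we reopen its class below;
# without it this file merely defines a stand-in, which is harmless.
begin
  require 'rest_client'
rescue LoadError
  nil
end

module RestClient
  class Request
    # Assumption: requests pick up their baseline headers from this method.
    # Returning an empty hash means only explicitly passed headers go out.
    def default_headers
      {}
    end
  end
end
```

The patch has to be loaded after the gem itself, otherwise the gem's own definition wins.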

The :content_type header is really only required for PUTs and POSTs, and dropping it elsewhere needs some reworking of the CouchRest gem. While you might say oh, yeah, whatever, if you are running on EC2 and paying for precious bandwidth, these extra bytes (60/request) on each Couch request add up pretty quickly over a million requests.
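For the skeptics, here's the back-of-the-envelope arithmetic for just the two Accept headers shown earlier. The only assumption is that each header line goes over the wire with a trailing CRLF.

```ruby
headers = [
  'Accept: application/json',
  'Accept-Encoding: gzip, deflate'
]

# Each header line is terminated by CRLF (2 bytes) on the wire.
bytes_per_request = headers.map { |h| h.bytesize + 2 }.inject(0) { |sum, n| sum + n }

requests = 1_000_000
total_mb = bytes_per_request * requests / 1_000_000.0

puts "#{bytes_per_request} bytes/request, #{total_mb} MB per million requests"
```

That comes to 58 bytes per request, or 58 MB of pure overhead per million requests, before counting anything else the libraries tack on.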

Pitfalls of CouchRest#update_doc

CouchRest needs lots of love in the update_doc method. Why? Because it takes the doc._id as the argument, does a full fetch of the document, updates it locally and tries to save it. While this is nice and clean, when you know you have the latest doc._rev, all you want to do is save it to the DB. See Conflicting Thoughts above and know what your program does. If you've organized your docs to minimize conflicts, then CouchRest's update_doc adds an extra fetch of the document when you already had the latest revision all along. I have a more optimistic implementation on pcapr. In other words, we assume that the revision at hand is already the latest and nobody has changed it yet. So the updates look like this:
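This is not the pcapr code, just a minimal sketch of the optimistic idea. optimistic_update is a hypothetical helper, and db is assumed to be anything that responds to save_doc the way a CouchRest database object does.

```ruby
# Apply the caller's changes to the doc we already hold and save it straight
# back, trusting that its _rev is still current. No fetch beforehand.
def optimistic_update(db, doc)
  yield doc          # mutate the in-memory doc in place
  db.save_doc(doc)   # one write; Couch rejects it with a 409 if _rev is stale
end

# Tiny stand-in database so the sketch runs without a Couch server.
class FakeDb
  attr_reader :saved, :fetches

  def initialize
    @saved   = []
    @fetches = 0
  end

  def get(_id)
    @fetches += 1
  end

  def save_doc(doc)
    @saved << doc
    { 'ok' => true, 'id' => doc['_id'] }
  end
end

db  = FakeDb.new
doc = { '_id' => 'job-1', '_rev' => '2-def', 'state' => 'queued' }
result = optimistic_update(db, doc) { |d| d['state'] = 'done' }
```

If the save does come back as a conflict, that's the moment to fall back to fetch, reapply and retry. The point is to pay for the extra round trip only when it's actually needed.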

Bulk deletes using a view

Much has been said about using _bulk_docs for insert performance. But if you are also deleting lots of documents in a write-heavy database, don't delete them one at a time. Here's a simple tip: use the results of a view to bulk delete a bunch of documents without fetching the docs first.
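A rough sketch of how this can work. It assumes the view's map function emits doc._rev as the value (something like emit(doc.some_key, doc._rev)), so each row already carries the id and revision needed for a delete. bulk_delete_payload is a hypothetical helper name.

```ruby
require 'json'

# Turn view rows into a _bulk_docs body that marks every doc as deleted.
# Assumes row['value'] holds the doc's current _rev.
def bulk_delete_payload(rows)
  docs = rows.map do |row|
    { '_id' => row['id'], '_rev' => row['value'], '_deleted' => true }
  end
  JSON.generate('docs' => docs)
end

# Rows shaped the way a view response returns them.
rows = [
  { 'id' => 'a', 'key' => 'old', 'value' => '1-aaa' },
  { 'id' => 'b', 'key' => 'old', 'value' => '4-bbb' }
]

payload = bulk_delete_payload(rows)
```

The resulting string is what you'd POST to /db/_bulk_docs: one request deletes the lot.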

Reducing the number of reduces

When you first start with Couch, there's a strong inclination to reduce your way through everything. Stop. Think. Relax. Imagine a queue of sorts where all you really want to know is how many elements are in the queue. For situations like this, you might be tempted to do something like this:
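A sketch of that tempting version, written as the Ruby hash you'd save as the design document. The names foo and bar match the query URLs in this section; the queue field is a placeholder for however your docs identify their queue.

```ruby
# Hypothetical design document: a map that emits 1 per queued item and a
# JavaScript reduce that adds them up, which keeps the view server busy.
counting_ddoc = {
  '_id'   => '_design/foo',
  'views' => {
    'bar' => {
      'map'    => 'function(doc) { emit(doc.queue, 1); }',
      'reduce' => 'function(keys, values, rereduce) { return sum(values); }'
    }
  }
}
```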

And you query with:

GET /db/_design/foo/_view/bar?group=true

While this works just fine, there's a simpler way to cut CPU cycles and not have the view server be part of the picture. Remove the reduce in the above view and query this way:

GET /db/_design/foo/_view/bar?limit=0

Couch will happily give you back a response that looks like this:

{"total_rows":42,"offset":0,"rows":[]}

See the total_rows? That's all you need to get at the queue size.
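In Ruby, pulling the number out is a one-liner. queue_size is a hypothetical helper, and the body is the sample response from above.

```ruby
require 'json'

# Read the queue size from a view response fetched with limit=0.
def queue_size(response_body)
  JSON.parse(response_body).fetch('total_rows')
end

size = queue_size('{"total_rows":42,"offset":0,"rows":[]}')
# size is 42
```

One thing to keep in mind: total_rows counts every row in the view, so this works when the view holds exactly the items you want to count.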

Symbols, JSON and Ruby

Face it, dealing with JSON in Ruby is kind of an eyesore. You end up with code that looks like this to get at the attributes within a document:

doc['foo']['bar']['size']

CouchRest doesn't do this currently, but it turns out JSON.parse takes an extra option to symbolize the names when deserializing JSON strings.

JSON.parse string, :symbolize_names => true

What this means is that you can now use symbols to peek into docs, which has a couple of benefits. First, the code looks a lot cleaner (and is less to type). Second, because symbols are interned strings, you don't incur the overhead of creating a new string object for every key and getting into GC hell within Ruby.
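A quick before-and-after, using a made-up document with the same nesting as the example above:

```ruby
require 'json'

json = '{"foo":{"bar":{"size":3}}}'

# Default parsing: string keys all the way down.
doc = JSON.parse(json)
with_strings = doc['foo']['bar']['size']

# With :symbolize_names, every key comes back as a symbol.
sym_doc = JSON.parse(json, :symbolize_names => true)
with_symbols = sym_doc[:foo][:bar][:size]
```

Both lookups return 3; the second just reads better and reuses the same symbols across every doc you parse.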

Statistics and Lies

Oh, one last thing. Couch internally collects all sorts of statistics that you can get at with the following query:

curl http://couch.host/_stats

Watching these in production is super beneficial and really tells you how your system is performing and where things can be tweaked.
