blitz.io – Path-finding with CouchDB

blitz.io went down for a short duration yesterday morning. It was an interesting day uncovering and identifying issues we hadn’t encountered before with multi-region CouchDB clusters that are doing multi-master continuous replication. In a lot of ways, we are path-finding and pushing CouchDB to its limits given that we are a write-heavy app. In the process, we are making up our own best practices and working around issues. Some of these issues are already addressed in trunk, but I wanted to document what we went through today and what we can do about this. Any ways, if you are running a large CouchDB cluster in production, would love to hear from you.

Going Down

We are running a 1.1 cluster and we mainly upgraded to this because of the replicator performance as well as the persistent _replicator database that remembers the state across reboots. This morning we got an alert that our virginia CouchDB cluster was down. This was interesting in that we had already setup upstart to restart CouchDB if it fails. With the append-only writes, we can simply kill -9 CouchDB and bring it right back up without an fsck of any sorts. But this wasn’t happening. Before we could figure out what the problem was, we simply rerouted all of our scale engines to our california cluster, temporarily keeping the site up.

Turns out that the cluster wasn’t dead, but wedged. Yes, official term. We had to manually kill and restart the CouchDBs. Note to ourselves: kill and restart anything in the architecture that looks like it might be unavailable. To use the @netflix term, build and run the Chaos Monkey periodically. Turns out, the Heroku tier does this automatically (restarts workers and dynos ever 24 hours) and our scale engines also do this automatically. So we need to figure something out for the CouchDB tier. Some thoughts and ideas below.

Update: The wedge, turns out, is CouchDB sucking up 100% of CPU, eating lots of memory and becoming unresponsive. All we know at at this point, when this happens, we get a 500,000 line stack trace dumping all of the docs that are in the replicator backlog.

Changing CouchDB’s and update_seq

All of our scale engines use the _changes feed from CouchDB and rely on the udpate_seq to grab tasks in case they reboot or restart. When we routed all the traffic engines to the california cluster, the update_seq in the new CouchDB was different from what they remembered and we had to effectively restart the scale engines. Lesson learned is each scale engine needs to remember the updated_seq as well as the CouchDB that it was last using. If the CouchDB that it connects to changes from the last time, then simply restart from the highest commit sequence in that CouchDB.

Replicator Woes

Given the sheer number of user registrations that happened today and the multiple, simultaneous load tests that the users were running (awesome!), the replicator would simply give up, multiple times, throughout the day. The try-for-infinity-and-beyond mode for continuous replication is already in trunk and we are eagerly looking for the next dot release. The temporary fix to handle this is to have this code run as a background worker on Heroku and simply kick the replicator back into action.

res = db.documents :startkey => 'to', :endkey => 'to_z', :include_docs => true
res['rows'].each do |row|
    doc = row['doc']
    if doc['_replication_state'] == 'error'
        puts "patching #{db} replicator"
        db.update_doc doc do |_doc|
            _doc.delete '_replication_state'
            _doc
        end
    end
end

Thinking Ahead

Working with the bleeding edge open-source in production is always fun! As much as we provide load and performance testing as a service, we are also building an awesome app that scales out horizontally. I think the key insight for us is just like how ephemeral Heroku dynos and our scale engines are, we need that level of redundancy and ephemeralness(!!) in the database layer. Some quick observations:

  • The Heroku tier uses config variables for the DB locations
  • Heroku has an API that you can use to programatically change config variables for your apps
  • All of our scale engines first talk to the Heroku tier to find out which CouchDB they should connect to
  • Configuring CouchDB (replication, restarts, etc) are all RESTful

This leads to some very interesting implications, where we simply reroute our scale engines to a secondary cluster, run compaction, restarts, etc on the primary CouchDB and then cut over without disrupting anything. Anyways, just an idea. Will write more once we’ve figured out what works for us.

Before we upgraded to 1.1, we were running 4-way multi-master replication between a 1.0.2 cluster and a 1.1 cluster. So we know that works. Also we are pretty excited about the database size optimizations, replicator speedups, automatic compaction, etc., that make running CouchDB in production so much more funner! Besides, we’ve raked up over a million hits on the scoreboard today. So inspite of all this, remember:

blitz.io makes load and performance testing a fun sport!

Bookmark and Share