Using CouchDB group_level for hierarchical data

CouchDB supports a group_level parameter in view queries. On pcapr, we never really needed this feature, even though we have over 52 different views. But in a recent internal project, we needed to display folders in the application that can be expanded and collapsed. Each document in CouchDB represents a file of sorts and contains its path name. One of the views in the app is a classic folder view that can be expanded recursively. Obviously, from a scaling perspective, we don’t want to load all this data up front, and that’s exactly where group_level comes in. This was my first time playing with this capability and I have to say, once you grok it, it’s totally cool.

Documents

Let’s say we have a number of documents in CouchDB that look like this:

{ "filename": "/alice/issues/customer-1/blah.gif", ... }
{ "filename": "/alice/issues/customer-2/foo.gif", ... }
{ "filename": "/bob/recent/hello.gif", ... }
{ "filename": "/bob/recent/world.gif", ... }
{ "filename": "/kowsik/one.gif", ... }
{ "filename": "/kowsik/two.gif", ... }
{ "filename": "/kowsik/mu/three.gif", ... }
{ "filename": "/kowsik/mu/four.gif", ... }
{ "filename": "/kowsik/mu/five.gif", ... }
{ "filename": "/kowsik/pcapr/three.gif", ... }
{ "filename": "/kowsik/pcapr/four.gif", ... }
{ "filename": "/kowsik/pcapr/five.gif", ... }
...

So what we really want to show is something like this:

+ alice
+ bob
+ kowsik

And when the user expands, say, the kowsik folder, we only want to fetch the immediate sub-folders of kowsik and show what’s in there.

+ alice
+ bob
- kowsik
  . mu
  . pcapr

The view

Here’s what the view in CouchDB looks like. It lets us incrementally fetch the nested folders:

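by_path: {
    map: function(doc) {
        if (doc.type === 'file') {
            var paths = doc.filename.split('/');
            paths.pop();                        // drop the file name, keep the folders
            if (paths[0] === '') paths.shift(); // a leading '/' yields an empty first element
            emit(paths, 1);
        }
    },
    reduce: '_sum'
}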

All we are doing here is splitting the filename on '/', dropping the file name itself, and emitting the remaining directory components as an array. Notice that the key in emit is an Array, and that’s exactly where group_level comes in. The _sum reduce lets CouchDB add up the number of files under each directory so we can display a count next to the folder name.
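To see what the map step emits, here is a minimal sketch that runs the same split-and-pop logic outside CouchDB. The helper name pathKey is hypothetical, not part of the app. It assumes filenames start with a leading slash, as in the sample documents, so it drops the empty first component that split produces; otherwise every key would start with "".

```javascript
// Hypothetical helper: derive the view key (folder components) from a filename.
function pathKey(filename) {
    var paths = filename.split('/');
    paths.pop();                          // drop the file name itself
    if (paths[0] === '') paths.shift();   // drop the empty component before a leading '/'
    return paths;
}

console.log(pathKey('/kowsik/mu/three.gif')); // [ 'kowsik', 'mu' ]
console.log(pathKey('/kowsik/one.gif'));      // [ 'kowsik' ]
```

A file that sits directly in a top-level folder gets a one-element key, while a nested file gets one element per folder on its path.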

Querying

This is both the easy and the tricky part. Let’s try to get the top-level folders. Here’s what the query looks like:

http://couch:5984/database/_design/files/_view/by_path?group=true&group_level=1&limit=20

What does this do? Since group_level is 1, CouchDB truncates each key to its first element, groups the rows that share it, and runs the reduce, which is _sum (a native Erlang implementation), over each group. Effectively it gives us the view we are looking for: a list of the top-level directories, each with the number of files beneath it. Now let’s say that the user expands the kowsik folder. How do we find the sub-folders under kowsik? Here’s how:
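If the grouping is hard to picture, the following sketch emulates it in memory. It is only an illustration, not how CouchDB implements it internally, and the function name groupByLevel is made up. It assumes the keys start with the top-level folder name and that every emitted value is 1.

```javascript
// Emulate group_level: truncate each key to `level` elements, then sum per group.
function groupByLevel(rows, level) {
    var totals = {};
    rows.forEach(function (row) {
        var key = JSON.stringify(row.key.slice(0, level));
        totals[key] = (totals[key] || 0) + row.value;
    });
    return Object.keys(totals).map(function (k) {
        return { key: JSON.parse(k), value: totals[k] };
    });
}

// Keys the map function would emit for the twelve sample documents.
var rows = [
    ['alice', 'issues', 'customer-1'], ['alice', 'issues', 'customer-2'],
    ['bob', 'recent'], ['bob', 'recent'],
    ['kowsik'], ['kowsik'],
    ['kowsik', 'mu'], ['kowsik', 'mu'], ['kowsik', 'mu'],
    ['kowsik', 'pcapr'], ['kowsik', 'pcapr'], ['kowsik', 'pcapr']
].map(function (key) { return { key: key, value: 1 }; });

console.log(groupByLevel(rows, 1));
// alice: 2, bob: 2, kowsik: 8
```

With level set to 1, the twelve rows collapse into three, one per top-level folder, and each value is the number of files anywhere below that folder.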

http://.../by_path?startkey=["kowsik",0]&endkey=["kowsik",{}]&group=true&group_level=2&limit=20

Let’s break it down. Here are the query parameters:

startkey=["kowsik",0]
endkey=["kowsik",{}]
group=true
group_level=2
limit=20

The first thing you will notice is that group_level has been bumped up to 2, the depth of the folders we want to list. Also notice that the startkey and endkey are now arrays that begin with the directory components of the expanded folder. The reason we use 0 and {} is view collation: numbers sort before strings and objects sort after them, so the range covers every key of the form ["kowsik", <name>, ...]. Files that sit directly in kowsik have the key ["kowsik"], which sorts before the startkey, so they are left out. So effectively we can find the immediate sub-folders of kowsik nice and easy. As the user keeps expanding folders, the group_level in the query increments and so does the number of elements in the startkey and endkey.
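The pattern is mechanical enough to wrap in a small helper. The sketch below is one way it might look. The name childrenQuery and its arguments are hypothetical, and it only builds the query string, leaving the host and view path to the caller.

```javascript
// Hypothetical helper: build the query string that lists the immediate
// sub-folders of `folder`, given as an array of path components.
function childrenQuery(folder, limit) {
    var params = {
        startkey: JSON.stringify(folder.concat([0])),  // numbers sort before strings
        endkey: JSON.stringify(folder.concat([{}])),   // objects sort after strings
        group: true,
        group_level: folder.length + 1,                // one level below the folder
        limit: limit
    };
    return Object.keys(params).map(function (name) {
        return name + '=' + encodeURIComponent(params[name]);
    }).join('&');
}

// Sub-folders of kowsik, then of kowsik/mu.
console.log(childrenQuery(['kowsik'], 20));
console.log(childrenQuery(['kowsik', 'mu'], 20));
```

The JSON in startkey and endkey is URL-encoded here, which is the safe choice when the query is sent from code rather than pasted into a browser. Passing an empty array yields a group_level of 1, which is the top-level listing.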

All in all, no crazy SELECTs, JOINs and other stuff that I can’t grok. Easy peasy.

Summary

If you haven’t checked out the interactive CouchDB tutorial, you should. One of these days we’ll add the group_level aspect of querying CouchDB views to the tutorial. But just so you know, we use CouchDB for more than just pcapr. When you run massive scale tests using Studio Scale, all of the test results (latency, response times, assertion failures, etc.) are stored in CouchDB so we can map/reduce at will to show you the coolest test results view. This is testing done in a sexy way!

A video on this is coming soon. Watch this blog!
