Using Map/Reduce for Network Forensics and Troubleshooting

We launched xtractr earlier this week for network forensics, troubleshooting and handling support escalations involving large packet captures. xtractr is a 4-tier app (more on that below) that combines the best of Web 2.0 with a new way of looking at packets. Looking beyond the “unleash the power of packets” message, I wanted to write a little about what’s under the hood and how we are using CouchDB-style Map/Reduce to uncover all sorts of information inside large packet captures.

Technology Stack

xtractr is a single Linux executable that you download from pcapr. This executable uses Ferret for searching, Mongoose for a RESTful API, SQLite for flow classification and a persistent store for various packet fields and labels, and V8 for reporting. xtractr uses tshark for getting at the various field values tucked away in those pesky packets. We purpose-built all of the flow classification and content extraction capabilities in addition to bridging these diverse technologies in a seamless manner using a RESTful API.

The xtractr application (delivered from pcapr) runs in your browser and uses jQuery, Flot and Sammy. This application, written entirely in JavaScript, uses JSONP (a cross-domain way of making Ajax calls, at least until HTML5 is mainstream) to communicate with your instance of xtractr and makes it super easy to find the needle in your packet stack. Given that search queries are king, we built the application so that as you click around, you can see the search queries being built as you go. This learn-by-example mode of the UI combines the best of Web 2.0 ease of use with the powerful and open command-line kungfu that most people are used to.
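For readers who have not used JSONP, here is a minimal sketch of the mechanism in plain JavaScript. The host, port, path, parameter name and callback name are all made up for illustration and are not the actual xtractr API. The browser side builds a URL that names a callback, and the server side wraps its JSON payload in a call to that callback.

```javascript
// Hypothetical request builder: appends the query parameters plus the name
// of the callback that the server should wrap its response in.
function buildJsonpUrl(baseUrl, params, callbackName) {
    var url = new URL(baseUrl);
    Object.keys(params).forEach(function (name) {
        url.searchParams.set(name, params[name]);
    });
    url.searchParams.set('callback', callbackName);
    return url.toString();
}

// What a JSONP-capable server sends back: a snippet of JavaScript that
// calls the named callback with the JSON payload as its argument.
function wrapJsonp(callbackName, payload) {
    return callbackName + '(' + JSON.stringify(payload) + ');';
}

// Assumed local endpoint and response shape, for illustration only.
var requestUrl = buildJsonpUrl(
    'http://localhost:8080/api/search',
    { q: 'flow.service:HTTP' },
    'onResults'
);
var responseBody = wrapJsonp('onResults', { total: 2 });

console.log(requestUrl);
console.log(responseBody);
```

In a browser, the page would load requestUrl through a script tag, and the returned snippet would run and call the global onResults function. That is what lets a page served from one origin read data from a server on another.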

One of our primary mandates when building xtractr was this:

the packet data never leaves your system!

Obviously, pcaps contain a wealth of information (packets never lie), including usernames and passwords, and we wanted to ensure that the index and the original pcaps stay with you. Besides, do you really want to upload a gig of data to the cloud?

One of the most common questions on forensics and packet-related mailing lists is “How do I look for foo in my pcap?” The collaborative aspect of xtractr comes from the fact that users can explicitly share complex search queries with the rest of the community. These queries are stored in CouchDB on pcapr. This allows the collective intelligence of the packet-geek community to help out those who are just trying to solve everyday problems. These community-contributed queries are called Nuggets. When you use a nugget, we fetch only the search query from pcapr and then apply it against your local xtractr index.

Using Map/Reduce for Reporting

One of the huge challenges in packet forensics is that packets have incredibly rich information content spread across many different layers, each of which might be interesting on its own. Now, we didn’t want to build crazy SQL joins (I’m personally JOIN-challenged) across 90,000+ Wireshark fields. So we ended up using Map/Reduce, very much inspired by CouchDB.

The simplest way to understand how this works is from the Interactive CouchDB tutorial that we published a while back. The basic idea is this. Each flow or packet in the index is conceptually a JSON document that looks like this:

{
    "id":169,
    "offset":16516,
    "length":496,
    "pcap":1,
    "flow":12,
    "time":28.9294,
    "dir":0,
    "src":"192.168.30.132",
    "dst":"192.168.40.234",
    "service":"HTTP",
    "title":"GET /index.html HTTP/1.1 ",
    "eth.src": "00:01:02:03:04:05",
    "eth.dst": "06:05:04:03:02:01",
    ...
}

Fields that have multiple values are conceptually stored as JSON arrays. Given this, let’s say you want to find the ‘Top Bandwidth Hoggers for HTTP’. The query string that generates a nice little chart looks like this:

flow.service:HTTP > sum('flow.src', 'flow.bytes')

The first part identifies all the flows that are HTTP. The second part is where the Map/Reduce comes in. Each conceptual flow is passed into the following JavaScript code. Here ‘flow.src’ becomes the _kfield and ‘flow.bytes’ becomes the _vfield. At a very high level, we are building a hash table with the concrete value of flow.src as the key and the sum of all the bytes as the value.

{
    map: function(flow) {
        var _key = flow[_kfield];
        if (_key) {
            flow.values(_vfield, function(_val) {
                if (typeof(_val) === 'number') {
                    emit(_key, _val);
                }
            });
        }
    },
    reduce: function(key, values) {
        return _sum(values);
    }
}
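To see the map/reduce pair above run end to end, here is a minimal, self-contained sketch. It is not xtractr's engine: emit, _sum, flow.values and the runReport driver are hypothetical stand-ins written for this example, and the sample flows are invented (including the contrived multi-valued flow.bytes, which is there only to exercise the array handling).

```javascript
// Hypothetical stand-in for the _sum helper used in the reduce step.
function _sum(values) {
    return values.reduce(function (total, v) { return total + v; }, 0);
}

// Wrap a plain document so that single values and JSON arrays can be
// walked the same way, mimicking the flow.values() call above.
function makeFlow(doc) {
    var flow = Object.assign({}, doc);
    flow.values = function (field, callback) {
        var v = doc[field];
        if (v === undefined) { return; }
        (Array.isArray(v) ? v : [v]).forEach(callback);
    };
    return flow;
}

// Hypothetical driver: run map over every document, group the emitted
// values by key, then reduce each group to a single value.
function runReport(docs, kfield, vfield) {
    var groups = new Map();
    function emit(key, value) {
        if (!groups.has(key)) { groups.set(key, []); }
        groups.get(key).push(value);
    }
    var job = {
        map: function (flow) {
            var key = flow[kfield];
            if (key) {
                flow.values(vfield, function (val) {
                    if (typeof val === 'number') { emit(key, val); }
                });
            }
        },
        reduce: function (key, values) { return _sum(values); }
    };
    docs.forEach(function (doc) { job.map(makeFlow(doc)); });
    var result = {};
    groups.forEach(function (values, key) {
        result[key] = job.reduce(key, values);
    });
    return result;
}

// Invented sample data, for illustration only.
var flows = [
    { 'flow.service': 'HTTP', 'flow.src': '192.168.30.132', 'flow.bytes': 1200 },
    { 'flow.service': 'HTTP', 'flow.src': '192.168.30.132', 'flow.bytes': 800 },
    { 'flow.service': 'HTTP', 'flow.src': '192.168.30.140', 'flow.bytes': [300, 200] },
    { 'flow.service': 'DNS', 'flow.src': '192.168.30.140', 'flow.bytes': 64 }
];

// The "flow.service:HTTP" part of the query, done here as a plain filter.
var httpFlows = flows.filter(function (f) { return f['flow.service'] === 'HTTP'; });
var report = runReport(httpFlows, 'flow.src', 'flow.bytes');
console.log(report); // { '192.168.30.132': 2000, '192.168.30.140': 500 }
```

The DNS flow never reaches the map step because the search part of the query filters it out first, which is the same split between search and report that the query string expresses.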

When we sprinkle some jQuery magic on the result data, we get this:

xtractr-report.png

Now wasn’t that easy? xtractr comes with a few different map/reduce functions that let you generate all sorts of cool reports with just a few clicks. While xtractr is a powerful standalone application for forensics, a lot of our customers use it directly with Mu Studio to statefully replay the problem traffic and resolve escalations very rapidly. Besides, Mu Studio can also automatically generate all the fuzz tests for you based on the flow you pulled out of xtractr.

So check out xtractr. You don’t have to be a packet geek to use it, but you get to benefit from the collective intelligence of those who are.
