2010/12/15

CouchDB: Implementation of SELECT COUNT(DISTINCT field)...

In CouchDB, the operation with group=true provides separate values for each unique key, like SELECT DISTINCT SQL query.

Next, we can get the total number of unique keys from the above results, like SELECT COUNT(DISTINCT ...) SQL query.

However, if the first results set is quite big, then the second count-up operation will spend a long time. It might be a problem.

Some examples ...

My ruby client can calculate the number of unique keys like following operation;

A example of ruby couchdb client.

couch = Couch::Server.new(@host, @port)
json = JSON.parse(couch.get(URI.escape('/example/_design/all/_view/test?group=true')).body)
total_keys = json['rows'].length

The processing time of the json to array conversion is so trivial, but the resident memory size and network traffic will be increased by this operation.

I implemented the operation with group_numrows=true like group=true, it returns just the length of the 'rows' array.

As an example, a curl command line returns the following results.

$ curl 'http://localhost:5984/example/_design/all/_view/all?group=true'
{"rows":[
{"key":["bar","35"],"value":3},
{"key":["foo","25"],"value":3},
{"key":["somebody","20"],"value":8},
{"key":["yasu","32"],"value":4}
]}

The operation with gorup_numrows=true returns the following line.

$ curl 'http://localhost:5984/example/_design/all/_view/all?group=true&group_numrows=true'
{"group_numrows":"4"}

If you use the limit=2 operation with above example, the results will be {"group_numrows":"2"}. So it's just return the total number which should be returned.

System Information

The patch file is available from the following link.

This patch was developed;

  • Ubuntu 10.04.1 LTS x86_64
  • Erlang R13B03

I'm not sure this implementation is robust enough to any production use, but I've been testing it with the jQuery flexbox plugin at http://www.yadiary.net/postal/main.fcgi.

Sorry it's Japanese only, but if you use pull-down selection boxes, you can see the results will be changed by your privious box's selection. Each selection boxes desn't have an entire results set, the group_numrows=true operation is used to show you the total number.

About performance

In my last blog post, the group_numrows=true operation is 8.5 times faster than first json['rows'].length case.There are almost 100K result lines (almost 3MB json string.)

My implementation is not smart, it just replaces the string output function to my count-up function.

If you have huge results set and just needs the total number and small subset of the results, this might be useful.

Thank you.

0 件のコメント: