Saturday, May 24, 2014

The case of the missing operator - How to use $sqrt in the MongoDB Aggregation Framework


The MongoDB Aggregation framework is a fast way of doing ad-hoc summary calculations and processing over many records. Unlike mapreduce it's declarative, akin to SELECT … GROUP BY .. HAVING in SQL but much more powerful.

It has a number of mathematical operators but from a statistical processing point of view, one annoying omission is a lack of square root operator – what if you have a series of values and need to calculate a standard deviation?

Fortunately – one of the earliest Apple developers – Sir Isaac Newton told us how to do this using the existing set of operators. I'm just translating his early blogs.

Newton's method says – guess, then take an average of your guess versus the desired outcome and repeat. Linear approximation. It's like the old playground game of guess the number I'm thinking then improve based on if you are too high or low.

In the aggregation framework this is simple.

Lets say we have a document

{ n : 29 }

we want to add a 'root' to that to make it 

{ n : 29, r: ??? }

where r is the square root.

Let's put that in a collection

>use blogcode
>db.roots.insert({n:29})

First we need an aggregation pipeline stage to add our first guess – whilst I could get clever here I'm simply going to always start with a guess of 1. In the mongoDB shell I will declare this as an object.

> firstguess = { $project : { n : 1, r : { $literal : 1}}}

This shows off a handy addition in MongoDB 2.6 – n:1 means copy n through from the previous step (or original data) as it was, to set r to the value 1 we need to use literal – that's nicer than any of the 2.4 ways to do this ( like using $add:[0,1]).

Then we need a step to refine out answer – 

> refine = { $project : { n: 1, r : { $divide : [ {$add : [{$divide : [ "$n","$r"]} , "$r" ]},2]}}}

Then run the pipeline

> db.roots.aggregate([firstguess,refine,refine,refine,refine, refine,refine,refine,refine])

> { "_id" : ObjectId("538062103439ddd3764ff791"), "n" : 29, "r" : 5.385164807134505 }

My Calculator says sqrt(29) = 5.385164807134504 so that's good enough!

How many times you want to refine it is up to you – it converges pretty fast. You can build that array of operators with a loop easily enough but too much iteration will take too long for little value – a couple of dozen at most will suffice.


How fast is it – lets put 1,000,000 values in to test.

> for(x=0;x<1000000;x++){ db.roots.insert{n:x} }


Because I want them all calculated inside the timer I've added a final pointless $group pipeline stage otherwise you get results as you pull them from the cursor.


use blogcode

var firstguess = { $project : { n : 1, r : { $literal : 1}}}
var refine = { $project : { n: 1, r : { $divide : [ {$add : [{$divide : [ "$n","$r"]} , "$r" ]},2]}}}

var start = new Date().getTime();

db.roots.aggregate([firstguess,refine,refine,refine,refine, refine,refine,refine,refine,{$group:{"_id":1}}])

var end = new Date().getTime(); 
var time = end - start;
print('1 Million roots in: ' + time + "ms");

On my Mac – that takes 14 seconds. As this is done entirely in projections  all processing will be done on the individual shards for maximum parallelism so using sharding or microsharding (or simply splitting the job into two pipelines) will increase speed pretty much linearly.

Square root in aggregation is also a JIRA Ticket ( SERVER-8568 ) feel free to vote for it.