Wednesday, October 11, 2017

Calculating Correlation inside MongoDB

I've recently been pondering the idea of a library of statistical and heuristic functions that run inside MongoDB using the aggregation pipeline. After all, if we can avoid pulling data out of the database, that should help performance. As a little experiment, here is the correlation coefficient of two fields using Pearson's rho. It's broken down into individual variables to make it easier to read than one huge piece of JavaScript. That's usually the best way to write pipelines.

//Pearson's rho as a pipeline

testdata = [{x:1,y:2},
            {x:2,y:3},
            {x:3,y:6},
            {x:4,y:8}]
db=db.getSiblingDB("stats")
db.pearsons.drop();
db.pearsons.insertMany(testdata)

x = "$x"
y = "$y"

//This is a pipeline stage
sumcolumns = { $group : { _id: true,
             count: { $sum: 1 },
             sumx : { $sum : x},
             sumy : { $sum : y},
             sumxsquared : { $sum : { $multiply : [x,x] } },
             sumysquared : { $sum : { $multiply : [y,y] } },
             sumxy : { $sum : { $multiply : [x,y] } }
           }}

//This is building a pipeline stage from objects
multiply_sumx_sumy = { $multiply : [ "$sumx","$sumy"] }
multiply_sumxy_count = { $multiply : ["$sumxy","$count"]}
partone = { $subtract : [ multiply_sumxy_count, multiply_sumx_sumy ]}

multiply_sumxsquared_count = { $multiply : ["$sumxsquared","$count"]}
sumx_squared = { $multiply : ["$sumx","$sumx"]}
subparttwo = { $subtract : [ multiply_sumxsquared_count,sumx_squared  ]}


multiply_sumysquared_count = { $multiply : ["$sumysquared","$count"]}
sumy_squared = { $multiply : ["$sumy","$sumy"]}
subpartthree = { $subtract : [ multiply_sumysquared_count,sumy_squared  ]}

parttwo = { $sqrt : {$multiply : [ subparttwo,subpartthree ]}}

//Glue it all together
rho  = {$project : { rho:  {$divide : [partone,parttwo]}}}


pipeline = [sumcolumns,rho]
db.pearsons.aggregate(pipeline)
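
As a sanity check, the same sums can be computed in plain JavaScript outside the database. This is only a sketch: pearson() is a name made up here for illustration and is not part of MongoDB. It uses the same formula the pipeline builds, (n*sumxy - sumx*sumy) / sqrt((n*sumxsquared - sumx^2) * (n*sumysquared - sumy^2)), and assumes neither field is constant, otherwise the denominator is zero.

```javascript
// Client-side cross-check of the pipeline above.
// pearson() is a hypothetical helper name, not a MongoDB function.
function pearson(points) {
  const n = points.length;
  let sumx = 0, sumy = 0, sumxsquared = 0, sumysquared = 0, sumxy = 0;

  // Same running totals as the $group stage
  for (const { x, y } of points) {
    sumx += x;
    sumy += y;
    sumxsquared += x * x;
    sumysquared += y * y;
    sumxy += x * y;
  }

  // Same arithmetic as partone and parttwo in the pipeline
  const partone = n * sumxy - sumx * sumy;
  const parttwo = Math.sqrt(
    (n * sumxsquared - sumx * sumx) * (n * sumysquared - sumy * sumy)
  );
  return partone / parttwo;
}

const sample = [{ x: 1, y: 2 }, { x: 2, y: 3 }, { x: 3, y: 6 }, { x: 4, y: 8 }];
console.log(pearson(sample)); // roughly 0.9845 for the test data above
```

If the pipeline and this function disagree on the same documents, one of the intermediate expressions is probably wired up wrongly.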

Thursday, July 13, 2017

MongoDB Queryable Backups - Time Travel in the Database

MongoDB maintains a statement-based transaction log of all write operations, called the oplog.

This is used to keep the high-availability replicas in sync with the primary copy.

The Backup Agent ships this, in encrypted slices, to the Backup Server every minute.

The Backup Server stores these slices in a database, called the Oplog Store.

The Backup Server then replays them into a copy of the database on the Backup Server called the Head Database.

Every few hours, it stops replaying them, looks at what has changed in the binary files of the Head Database and saves those changed file blocks to another database called the Blockstore, deduplicating as it goes. This is called a snapshot.

When you restore, it can pull back a snapshot and, if it still has the transactions since then, roll it forward to the point in time you want. Typically it keeps the transactions for 48 hours and keeps snapshots on a schedule for years.
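
The restore logic can be sketched with a toy in-memory model. This is not how the Backup Server is implemented: restoreTo() and the shapes of the snapshot and oplog objects are invented here purely to illustrate "pick the latest snapshot, then roll forward".

```javascript
// Toy model of point-in-time restore. All names and data shapes are hypothetical:
//   snapshot:    { ts, data }        - full copy of the data at time ts
//   oplog entry: { ts, key, value }  - a write applied at time ts
function restoreTo(targetTs, snapshots, oplog) {
  // Latest snapshot taken at or before the requested time
  const base = snapshots
    .filter(s => s.ts <= targetTs)
    .sort((a, b) => b.ts - a.ts)[0];
  if (!base) throw new Error("no snapshot at or before " + targetTs);

  // Start from a copy of the snapshot so the stored snapshot is not modified
  const state = { ...base.data };

  // Roll forward: replay, in order, the writes made after the snapshot
  // and up to the requested time
  const toReplay = oplog
    .filter(op => op.ts > base.ts && op.ts <= targetTs)
    .sort((a, b) => a.ts - b.ts);
  for (const op of toReplay) state[op.key] = op.value;

  return state;
}

const snapshots = [
  { ts: 100, data: { a: 1 } },
  { ts: 200, data: { a: 2, b: 1 } }
];
const oplog = [
  { ts: 150, key: "a", value: 2 },
  { ts: 180, key: "b", value: 1 },
  { ts: 250, key: "b", value: 5 },
  { ts: 300, key: "c", value: 9 }
];
console.log(restoreTo(260, snapshots, oplog)); // { a: 2, b: 5 }
```

In this model, rolling forward only works while the oplog entries after the chosen snapshot are still retained, which is why the retention windows matter.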

That's all nice, but it's not the amazing part:

With Queryable Backups, MongoDB can point the database server at the Blockstore and the Oplog Store and bring up a working, read-only copy of the database at some past point in time - without recreating the actual database.

If you thought Oracle's Flashback was neat - this is mind-blowing.

Wednesday, May 3, 2017

Twelve Steps to MongoDB Enlightenment



  1. You install the database with a single command, connect with a second and add your first document with a third - this seems really easy.
  2. You write a performance test in the single-threaded JavaScript shell and are underwhelmed - then try POCDriver and marvel at the difference.
  3. You use conditional updates and findOneAndUpdate to make sequences, locks and queues.
  4. You discover you cannot update an array in two different ways simultaneously - you raise a JIRA ticket.
  5. You find your first use for a truly dynamic schema - then wonder how you can index it.
  6. You discover the aggregation pipeline and map-reduce and wonder which to use - you use aggregation.
  7. You set up a pair of servers for failover, then wonder why, when one stops, they both do, halving your reliability - then you read the manual and add a third.
  8. You are missing data after a failover but find it in a directory on the server marked rollback. You discover write concerns.
  9. You deploy a replica set using Cloud Manager and it’s no better than the script you found on GitHub - then you try a live upgrade using it and are blown away.
  10. You shard and discover it's harder than it looks - but get some help and realise it's just your lack of understanding.
  11. You realise sharding is still harder than you thought and ask for expert help.
  12. You attend MongoDB World and realise people are doing far more amazing things than you are.