Thursday, April 2, 2015

MongoDB Full Text Search - just got 66% more usable.

Last time I blogged, I spoke about index compression. When you use the WiredTiger storage engine in MongoDB 3.0 you get compressed indexes - and better yet, indexes that don't need to be decompressed in RAM; they stay compressed and reduce the RAM footprint.

Why does this matter? Because for a database system to work well, you need enough RAM to hold your indexes.

MongoDB 2.4 added a beta Full Text Search (FTS) capability, and in 2.6 it became a GA release. Whilst it doesn't have all the bells and whistles of a dedicated FTS indexing engine like Elasticsearch, it has enough to make it suitable for many use cases. MongoDB's thinking is that some people will find it easier and cheaper if they don't need to add third-party text indexing to their MongoDB cluster, along with all the associated extra hardware and management.

Unfortunately, MongoDB FTS in 2.6 had a small issue; to understand it, you need to look at how FTS works.

When you specify one or more fields as being text searchable in MongoDB using collection.createIndex({ title: "text", article: "text" }), what the database server does is, during each insert or update, take the contents of those fields, parse them into words, stem them - reducing shopper and shopping to shop - and then treat the unique list of words as though it were an array in MongoDB, with one index entry for each.
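To make that concrete, here is a rough sketch of that indexing step in Python: tokenise, drop stopwords, stem, and de-duplicate. MongoDB actually uses a proper Snowball-style stemmer and per-language stopword lists; the crude suffix-stripper and tiny stopword set below are my own illustrative stand-ins, not MongoDB's real code.

```python
# Illustrative only: a naive version of what the server does per document.
STOPWORDS = {"the", "a", "an", "and", "to", "of"}

def naive_stem(word):
    """Strip a few common English suffixes - NOT a real stemmer."""
    for suffix in ("ping", "per", "ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_terms(text):
    """Return the unique set of stemmed terms for one document field."""
    words = (w.lower().strip(".,!?") for w in text.split())
    return {naive_stem(w) for w in words if w not in STOPWORDS}

print(sorted(index_terms("The shopper and the shopping trip")))
# shopper and shopping both collapse to shop - one index entry, not two
```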

If you assume that each document is 30% unique words after stopword removal, then that can be a lot of index entries, each with the word repeated AND an eight-byte record locator. MongoDB MMAP indexes are fairly simple, and the word is repeated in the index for every document it appears in.
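To get a feel for the cost, here is a back-of-envelope estimate. The document count comes from the Wikipedia load below; the average words per document and average key length are made-up but plausible assumptions, not measurements:

```python
# Rough MMAP text-index cost: one entry per unique word per document,
# each entry holding the word itself plus an 8-byte record locator.
docs = 4_335_341        # documents in the Wikipedia load below
words_per_doc = 500     # hypothetical average word count per document
unique_ratio = 0.30     # ~30% unique words after stopword removal
avg_word_len = 8        # hypothetical average key length in bytes
locator_len = 8         # record locator stored with every entry

entries = docs * words_per_doc * unique_ratio
index_bytes = entries * (avg_word_len + locator_len)
print(f"{entries:,.0f} entries, ~{index_bytes / 1e9:.1f} GB before B-tree overhead")
```

Even with these conservative numbers the raw entries alone run to roughly 10 GB, before any B-tree overhead, which is in the same ballpark as the real 2.6 index size shown below.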

Now, as I explained last week, WiredTiger does NOT repeat a key in the index within a block - so for indexes like these, where the same word is repeated in many documents, and where the next word lexicographically may share some of the start of this one, index prefix compression is a fantastic solution.

So what does this mean in reality? To test, I loaded the English edition of Wikipedia into MongoDB and built a full text index on the title and article text.

With MongoDB 2.6 the stats look like this:


{
    "ns" : "wikipedia.records",
    "count" : 4335341,
    "size" : 13774842288,
    "avgObjSize" : 3177,
    "numExtents" : 27,
    "storageSize" : 15124713408,      <- 15.1GB
    "nindexes" : 2,
    "totalIndexSize" : 16727360160,   <- 16.7GB
    "indexSizes" : {
        "_id_" : 123817344,
        "text_text_title_text" : 16603542816
    },
    "ok" : 1
}


With MongoDB 3.0 WiredTiger (Snappy) they look like this:

> db.records.stats()
{
    "ns" : "wikipedia.records",
    "count" : 4335341,
    "size" : 9526419045,
    "avgObjSize" : 2197,
    "storageSize" : 6302990336,       <- 6.3GB
    "nindexes" : 2,
    "totalIndexSize" : 4646367232,    <- 4.6GB
    "indexSizes" : {
        "_id_" : 45441024,
        "text_text_title_text" : 4600926208
    },
    "ok" : 1
}

That's huge: over 60% compression, and an index smaller than the data - not larger. Suddenly MongoDB FTS looks like a much more viable option, without all the additional setup or a separate indexing cluster. Unless you need the extra functionality, of course - but do you? And have you really accepted that it's OK to have the text index update out of sync with the records?
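For the curious, the arithmetic behind that claim, taken straight from the two stats outputs above:

```python
# Figures copied from the stats outputs above (bytes).
mmap_index = 16_603_542_816   # 2.6 MMAP text index
wt_index = 4_600_926_208      # 3.0 WiredTiger (Snappy) text index
data_size = 9_526_419_045     # 3.0 "size" (uncompressed BSON)

saving = 1 - wt_index / mmap_index
print(f"text index shrank by {saving:.0%}")              # ~72%
print(f"index/data ratio: {wt_index / data_size:.2f}")   # below 1.0
```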

Of course, it's only fair to see how much space Elasticsearch takes for the same data. To do that, I loaded the same data set into Elasticsearch, reading from MongoDB and inserting using the excellent Python APIs for both.


 curl localhost:9200/wikipedia/_stats | python -m json.tool

{
    "indices": {
        "wikipedia": {
            "primaries": {
                "completion": {
                    "size_in_bytes": 0
                },
                "docs": {
                    "count": 4335341,
                    "deleted": 0
                },
                "segments": {
                    "count": 134,
                    "index_writer_memory_in_bytes": 0,
                    "memory_in_bytes": 62915448,
                    "version_map_memory_in_bytes": 0
                },
                "store": {
                    "size_in_bytes": 14342148088,     <- 14.3GB
                    "throttle_time_in_millis": 946194
                },
                ...
            }
        }
    }
}
The answer: the index in ES was 14.3 GB, three times larger than the MongoDB index - and of course, you still need to retain your original copy, because all you get back from a search is a key.

This is in no way a criticism of Elasticsearch or any other indexing technology (although none hold a candle to the original "Memex Information Engine"). Rich FTS is a category of data technologies all of its own, and specialists will and should always exist. I'm just pointing out that, in the same way the music player in your phone can technically replace your home HiFi, converged search technology has come of age. Search is a commodity - it should just be there as a database feature.

Are you using MongoDB FTS? Did you consider and reject it already? I'd love to hear why in the comments.