Friday, September 9, 2016

Connecting to and Authenticating with MongoDB from Java using x509

Introduction

I recently worked with a MongoDB customer who wanted to do encryption in flight correctly: SSL/TLS, x509, mutual certificate authentication between clients and servers, the full monty.

We also created service accounts for applications, which would authenticate using x509 - as compared to human administrators, who would present a valid certificate to establish a TLS connection and then authenticate with their username and password from LDAP/Active Directory. As an added bonus, we configured MongoDB auditing to audit the deliberate activities of human admins whilst not auditing what the service accounts did.

One challenge we hit, though, was getting Java (actually in this case Clojure, but let's just say the JVM) to correctly connect and authenticate using x509. The Mongo shell worked; their Java client didn't. After a little head scratching we figured it out, and I'm sharing this with the Internet as that specific issue doesn't seem to be documented online anywhere.

To do this you need MongoDB configured with SSL, and done correctly: different certificates for clients and servers, and a central Certificate Authority (CA) which has signed them all and shared its public information.

Generate Required Keys

Create a Local Certificate Authority


Minimally, create a root signing certificate and key. Don't apply encryption or a password to the key for this example; you need to if you are doing this for a real system.

openssl req -newkey rsa:2048 -new -x509 -sha256 -extensions v3_ca -out ca.cert -keyout ca.key -subj "/C=GB/ST=Scotland/L=Glasgow/O=MongoDB/CN=mongodemo.com" -nodes

The combination of req and -x509 here creates and self-signs a certificate, and we have asked for it to be a v3_ca (Certificate Authority). The -nodes option suppresses the need for a password this time.


Request Certificates from the CA

Now you need to request a certificate for your MongoDB server. To do this it needs to have a real, fully qualified hostname, not localhost or localhost.local. If it doesn't have one, figure out your IP address using ifconfig (not 127.0.0.1, please). Mine is 192.168.0.112.

Our signing authority is called mongodemo.com; let's use that as the domain for the examples.

Then add this to /etc/hosts


echo "192.168.0.112  ssldemo.mongodemo.com" | sudo tee -a /etc/hosts
ping ssldemo.mongodemo.com
PING ssldemo.mongodemo.com (192.168.0.112): 56 data bytes
64 bytes from 192.168.0.112: icmp_seq=0 ttl=64 time=0.052 ms
64 bytes from 192.168.0.112: icmp_seq=1 ttl=64 time=0.086 ms

Now we need a server certificate for ssldemo.mongodemo.com

We get this by creating a Certificate Signing Request (CSR), and an associated key to unlock the returned certificate, which we can send to our CA to sign. The subject must have our fully qualified hostname as the Common Name (CN); also, depending on their rules, the CA may only be able to sign things for their Country or Organisation, so we will keep them similar.

openssl req -newkey rsa:2048 -new -sha256 -out server.csr -keyout server.key -subj "/C=GB/ST=Scotland/L=Glasgow/O=MongoDB/OU=demoservers/CN=ssldemo.mongodemo.com" -nodes

We also need a certificate for our client user - an email address is a reasonable choice for a common name, but not required. Importantly, the O and OU fields MUST differ in some way from the server certificate, otherwise this certificate is seen as another server in the cluster, not a user.

openssl req -newkey rsa:2048 -new  -sha256 -out client.csr -keyout client.key -subj "/C=GB/ST=Scotland/L=Glasgow/O=MongoDB/OU=demouser/CN=mongouser@appserver.org" -nodes

Have the CA Sign the requests

Now the CA gets the CSR and signs it. You would typically email the CSR to the CA; it's not a security-critical file. We have them in the same directory here, so no need to email, but we do need to add somewhere for the CA to keep track of what it's signed. It uses a little text-file-based database and a sequence file. On my Mac it needs a directory called demoCA; if the next openssl command fails you may need to create a different directory.

mkdir demoCA
touch demoCA/index.txt
echo 1000 > demoCA/serial

The CA then approves and signs the two CSRs and creates two certificates.

openssl ca  -in server.csr -out server.cert -keyfile ca.key -cert ca.cert -outdir .
Using configuration from /System/Library/OpenSSL/openssl.cnf
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number: 4096 (0x1000)
        Validity
            Not Before: Sep  9 12:36:40 2016 GMT
            Not After : Sep  9 12:36:40 2017 GMT
        Subject:
            countryName               = GB
            stateOrProvinceName       = Scotland
            organizationName          = MongoDB
            organizationalUnitName    = demoservers
            commonName                = ssldemo.mongodemo.com
        X509v3 extensions:
            X509v3 Basic Constraints:
                CA:FALSE
            Netscape Comment:
                OpenSSL Generated Certificate
            X509v3 Subject Key Identifier:
                91:5A:DE:72:B6:C0:23:D4:1D:D7:66:A7:D2:01:70:05:25:9F:1E:E1
            X509v3 Authority Key Identifier:
                keyid:8E:07:FC:22:E2:A3:93:09:6F:58:E2:A6:BC:FF:DE:7F:5E:0F:F1:F4

Certificate is to be certified until Sep  9 12:36:40 2017 GMT (365 days)
Sign the certificate? [y/n]:y


1 out of 1 certificate requests certified, commit? [y/n]y
Write out database with 1 new entries
Data Base Updated

openssl ca  -in client.csr -out client.cert -keyfile ca.key -cert ca.cert -outdir .
Using configuration from /System/Library/OpenSSL/openssl.cnf
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number: 4097 (0x1001)
        Validity
            Not Before: Sep  9 12:38:26 2016 GMT
            Not After : Sep  9 12:38:26 2017 GMT
        Subject:
            countryName               = GB
            stateOrProvinceName       = Scotland
            organizationName          = MongoDB
            organizationalUnitName    = demouser
            commonName                = mongouser@appserver.org
        X509v3 extensions:
            X509v3 Basic Constraints:
                CA:FALSE
            Netscape Comment:
                OpenSSL Generated Certificate
            X509v3 Subject Key Identifier:
                CD:67:5D:CF:1F:9C:45:E8:67:33:D6:A3:06:5E:AB:15:4C:6A:BA:E1
            X509v3 Authority Key Identifier:
                keyid:8E:07:FC:22:E2:A3:93:09:6F:58:E2:A6:BC:FF:DE:7F:5E:0F:F1:F4

Certificate is to be certified until Sep  9 12:38:26 2017 GMT (365 days)
Sign the certificate? [y/n]:y


1 out of 1 certificate requests certified, commit? [y/n]y
Write out database with 1 new entries
Data Base Updated

"Data Base" - how quaint!

As we made the CA key without a password, it's not asking us to authenticate ourselves before doing this. In a real-world setup it would, so a human can approve what gets signed.

Now the CA sends the certs back to the requester - this is safe as, until they are combined with their keys (which the requestor kept safe), they cannot be used. The CA will also send its own cert - remember it's no use for signing without the private key, but it can verify what's been signed.


Combine your new Certificates with your secret keys


Finally, to make usable keys, we combine the cert and key files:

cat server.cert server.key > server.pem
cat client.cert client.key > client.pem

At this point we have the files we need: a server PEM with the hostname as the Common Name (CN) field, a client PEM which differs in its O and OU fields from the server, and a CA cert file.

We can now use these to start MongoDB - we will start it in a way that requires clients to use SSL and to present a valid certificate. I'm assuming defaults for database paths, ports etc. We will first start by requiring SSL but not authentication - then we will enable x509 authentication.

mongod --sslMode=requireSSL --sslPEMKeyFile=server.pem --sslCAFile=ca.cert

At this point, open a second window so you can continue and leave that running with its log visible.

Connecting with the Mongo Shell


Now let's try to connect from the mongo client. I'm going to do this wrong a few times first so you can see the errors and how to correct them.

mongo
MongoDB shell version: 3.2.7
connecting to: test
2016-09-09T13:48:25.493+0100 E QUERY    [thread1] Error: network error while attempting to run command 'isMaster' on host '127.0.0.1:27017'  :

OK - and in the server log it says

 I NETWORK  [conn1] AssertionException handling request, closing client connection: 17189 The server is configured to only allow SSL connections

So I need to tell the mongo client to use SSL - it tried to connect without any encryption.

mongo --ssl
MongoDB shell version: 3.2.7
connecting to: test
E NETWORK  [thread1] SSL peer certificate validation failed: self signed certificate in certificate chain
E QUERY    [thread1] Error: socket exception [CONNECT_ERROR] for SSL peer certificate validation failed: self signed certificate in certificate chain :
connect@src/mongo/shell/mongo.js:229:14
@(connect):1:6

exception: connect failed

And on the server

E NETWORK  [conn2] no SSL certificate provided by peer; connection rejected

So it's not letting me in without a certificate - good. Let's try that again.

mongo --ssl --sslPEMKeyFile client.pem
MongoDB shell version: 3.2.7
connecting to: test
2016-09-09T13:51:42.379+0100 E NETWORK  [thread1] SSL peer certificate validation failed: self signed certificate in certificate chain
2016-09-09T13:51:42.379+0100 E QUERY    [thread1] Error: socket exception [CONNECT_ERROR] for SSL peer certificate validation failed: self signed certificate in certificate chain :
connect@src/mongo/shell/mongo.js:229:14
@(connect):1:6

exception: connect failed

This is less obvious - but it's telling us that the server's certificate doesn't seem to have been signed by a normal, big-name CA and we cannot verify it; we could be connecting to a fake server!

Now we can allow this if we want, using:

mongo --ssl --sslPEMKeyFile client.pem --sslAllowInvalidCertificates
MongoDB shell version: 3.2.7
connecting to: test
2016-09-09T13:53:27.131+0100 W NETWORK  [thread1] SSL peer certificate validation failed: self signed certificate in certificate chain
2016-09-09T13:53:27.131+0100 W NETWORK  [thread1] The server certificate does not match the host name 127.0.0.1
MongoDB Enterprise > 

But that's not the right thing to do - we have a CA file that will let us verify our connection. You need the above flags only if you are taking the simplest possible approach to certificate usage and using one self-signed cert.

The correct incantation is:

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert
MongoDB shell version: 3.2.7
connecting to: test
2016-09-09T13:54:41.662+0100 E NETWORK  [thread1] The server certificate does not match the host name 127.0.0.1
2016-09-09T13:54:41.663+0100 E QUERY    [thread1] Error: socket exception [CONNECT_ERROR] for The server certificate does not match the host name 127.0.0.1 :
connect@src/mongo/shell/mongo.js:229:14
@(connect):1:6

exception: connect failed

Oh - what happened there? Well, the error explains: we connected to 127.0.0.1, but the server's certificate can only validate its name, which is ssldemo.mongodemo.com. You cannot make verified SSL connections to localhost or non fully qualified hostnames.

Let us try that again:

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
MongoDB Enterprise >


Adding an x509 User


Awesome - now we are connected, we know we have the right server, and the server knows we have a valid certificate. If our client cert had been stored with a key password we might even have had some security at this point - but the server will accept any valid certificate here. This is not user authentication, just safe connectivity.

Now we could, at this point, use the standard SCRAM-SHA-1 (username/password) authentication or LDAP, over the top of SSL. But what we want is to authenticate ourselves using the client certificate. To do that I need to add a user matching the certificate I logged in with - at this point MongoDB knows what certificate connected; it also needs to know that it is a valid user and what rights they have.

First I need to extract my username from my certificate. I can do this with openssl too.

openssl x509 -in client.pem -inform PEM -subject -nameopt RFC2253 | grep subject
subject= CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB

Wow, that's a long - but at least unique - username. It's also been rearranged from the original form, so make sure you get it in RFC2253 format.

We then add that user to MongoDB's user collection like adding any other user. At this point we do not have authentication turned on, so we can do this; once we turn it on (or enable x509 cluster authentication!) we cannot - so we will make a very high-level user to start with, who can add other users.

Because of the JavaScript in the shell we cannot simply use the $external database directly, so we need a more programmatic way to call createUser in the shell.

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
MongoDB Enterprise > db.getSiblingDB("$external").runCommand(
...   {
...     createUser: 'CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB',
...     roles: [
...              { role: 'dbAdminAnyDatabase', db: 'admin' },
...              { role: 'readWriteAnyDatabase', db: 'admin' },
...              { role: 'userAdminAnyDatabase', db: 'admin' }
...            ]
...   }
... )
{ "ok" : 1 }


Now we need to restart mongod, tell it to enable user authentication, and tell it to use x509 (we also turn off username/password authentication here).

Stop the mongodb server and restart it with:

mongod --sslMode=requireSSL --sslPEMKeyFile=server.pem --sslCAFile=ca.cert --auth --setParameter authenticationMechanisms=MONGODB-X509

This will now only allow authentication with certificates (again, recall we've chosen not to password-protect our certs, but we should). We could also allow other authentication mechanisms here, such as the LDAP one I mentioned at the top of this blog.
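If we did want to allow both mechanisms at once, the authenticationMechanisms parameter takes a comma-separated list - a sketch (mechanism names as of MongoDB 3.2; check your version's docs):

```shell
# Illustrative: accept x509 service accounts AND username/password humans
mongod --sslMode=requireSSL --sslPEMKeyFile=server.pem --sslCAFile=ca.cert \
       --auth --setParameter authenticationMechanisms=MONGODB-X509,SCRAM-SHA-1
```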

Login using x509 from the Shell

Let us try logging in:

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
MongoDB Enterprise >

That's good, but am I authenticated? Is security even on?

MongoDB Enterprise > db.a.insert({a:1})
WriteResult({
        "writeError" : {
        "code" : 13,
        "errmsg" : "not authorized on test to execute command { insert: \"a\", documents: [ { _id: ObjectId('57d2b60f95a048a3d91d1a75'), a: 1.0 } ], ordered: true }"
        }
})

OK, I guess authentication is on, but I'm not authenticated - otherwise I would have been able to do that insert.

Why? Because when I connected, I forgot to log in! At this point I could still log in from here - but that's ugly; I ought to have done it from the command line, so I'll try again.

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com -u 'CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB'
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
2016-09-09T14:17:59.753+0100 E QUERY    [thread1] Error: Missing expected field "pwd" :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1441:20
@(auth):5:1
@(auth):1:2

exception: login failed



Ah, it's missing a pwd (password) - but I don't have one. This is a certificate for a service; that's the point, there are no passwords in the database. What did I miss?

I forgot to tell it to use X509:

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com -u 'CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB' --authenticationMechanism=MONGODB-X509
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
2016-09-09T14:31:00.430+0100 E QUERY    [thread1] Error: X.509 authentication must always use the $external database. :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1441:20
@(auth):6:1
@(auth):1:2

exception: login failed


Ah, and the $external database...

mongo --ssl --sslPEMKeyFile client.pem --sslCAFile ca.cert --host ssldemo.mongodemo.com -u 'CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB' --authenticationMechanism=MONGODB-X509 --authenticationDatabase='$external'
MongoDB shell version: 3.2.7
connecting to: ssldemo.mongodemo.com:27017/test
MongoDB Enterprise > db.a.insertOne({a:1})
{
        "acknowledged" : true,
        "insertedId" : ObjectId("57d2bc8366a06ee54966fa82")
}
MongoDB Enterprise >

Great, we have MongoDB correctly configured to accept x509 and a working x509 certificate to use as an example, but how do I now connect from code?

Connecting using x509 from your own code.

Correct URI Format.


Well, the first thing to know is that we want to use a MongoDB URI - which, being a URI in URL format, means we need to escape our username. There are online converters, but I'm all for doing this myself:

python
Python 2.7.8 (v2.7.8:ee879c0ffa11, Jun 29 2014, 21:07:35)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.quote("CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB")
'CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB'
>>>
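(If you are on Python 3, quote has moved to urllib.parse; a quick sketch of the same escape - safe='' just forces every reserved character, including the commas, to be encoded:)

```python
from urllib.parse import quote, unquote

subject = "CN=mongouser@appserver.org,OU=demouser,O=MongoDB,ST=Scotland,C=GB"
escaped = quote(subject, safe='')  # percent-encode '=', '@' and ','
print(escaped)
# CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB
assert unquote(escaped) == subject  # and it round-trips back
```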

That's an ugly username - OK, how do we use it in code?

Connecting from Python using x509


In Python it's easy to connect and use SSL once the server is set up.

import ssl
from pymongo import MongoClient

uri = "mongodb://CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB@ssldemo.mongodemo.com/?authMechanism=MONGODB-X509"
client = MongoClient(uri,
                     ssl=True,
                     ssl_certfile='client.pem',
                     ssl_cert_reqs=ssl.CERT_REQUIRED,
                     ssl_ca_certs='ca.cert')
client.test.a.insert_one({'a': 2})

Connecting To MongoDB with JAVA/JVM using X509 Certificates


Connecting from the JVM is a little harder to figure out - and this, I think, is why a number of people may end up on this blog post.

First, let us make a minimal Mongo client in Java. I'm doing this old school to avoid needing to install Gradle, Maven, Ant or anything else other than a JDK. Choose your own artificial sweeteners.


We will need the MongoDB Java Driver - we can get that with cURL:

curl -OL https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongo-java-driver/3.3.0/mongo-java-driver-3.3.0.jar

Then put the following in a file called LoginTest.java

import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;

public class LoginTest {

    public static void main(String[] args)
    {
        System.out.println("Login Test");
        MongoClient mongoClient = new MongoClient(new MongoClientURI(args[0]));
        System.out.println(mongoClient.getDatabase("test").getCollection("a").count());
    }
}

Compile:

javac -cp mongo-java-driver-3.3.0.jar LoginTest.java

Run:

java -cp mongo-java-driver-3.3.0.jar:. LoginTest mongodb://localhost

Login Test
Sep 09, 2016 3:00:01 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
Sep 09, 2016 3:00:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=localhost:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
Sep 09, 2016 3:00:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Exception in monitor thread while connecting to server localhost:27017
com.mongodb.MongoSocketReadException: Prematurely reached end of stream
        at com.mongodb.connection.SocketStream.read(SocketStream.java:88)
        at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:492)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:222)
        at com.mongodb.connection.CommandHelper.receiveReply(CommandHelper.java:134)
        at com.mongodb.connection.CommandHelper.receiveCommandResult(CommandHelper.java:121)
        at com.mongodb.connection.CommandHelper.executeCommand(CommandHelper.java:32)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:83)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:43)
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:116)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:116)
        at java.lang.Thread.run(Thread.java:745)

So let us try a proper connection string from the python example:

MacPro:blog jlp$ java -cp mongo-java-driver-3.3.0.jar:. LoginTest 'mongodb://CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB@ssldemo.mongodemo.com/?authMechanism=MONGODB-X509&ssl=true'
Login Test
Sep 09, 2016 3:02:10 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Cluster created with settings {hosts=[ssldemo.mongodemo.com:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
Sep 09, 2016 3:02:10 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=ssldemo.mongodemo.com:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
Sep 09, 2016 3:02:15 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Exception in monitor thread while connecting to server ssldemo.mongodemo.com:27017
com.mongodb.MongoSocketWriteException: Exception sending message
        at com.mongodb.connection.InternalStreamConnection.translateWriteException(InternalStreamConnection.java:463)
        at com.mongodb.connection.InternalStreamConnection.sendMessage(InternalStreamConnection.java:206)
        at com.mongodb.connection.CommandHelper.sendMessage(CommandHelper.java:89)
        at com.mongodb.connection.CommandHelper.executeCommand(CommandHelper.java:32)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:83)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:43)
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:116)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:116)
        at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target


Unable to find a valid certification path - hmm, good point, as we haven't specified where the certificate files are. In Java we need to put them in a Java keystore and access them from there.

So let's take the CA cert first. We will use a password of xxxxxx; you have to use at least six characters.

keytool -import -trustcacerts -alias slc -file ca.cert -keystore truststore.jks -storepass xxxxxx -noprompt

Now we just need to specify this when we start Java. This is because our top-level certificate/CA is in-house; if we had just been using a signing certificate given to us by a top-level CA then we may not have needed this, as Java has some built-in certificates.



java  -Djavax.net.ssl.trustStore=truststore.jks -cp mongo-java-driver-3.3.0.jar:. LoginTest 'mongodb://CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB@ssldemo.mongodemo.com/?authMechanism=MONGODB-X509&ssl=true'
Login Test
Sep 09, 2016 3:19:52 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Cluster created with settings {hosts=[ssldemo.mongodemo.com:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
Sep 09, 2016 3:19:52 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=ssldemo.mongodemo.com:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
Sep 09, 2016 3:19:57 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Exception in monitor thread while connecting to server ssldemo.mongodemo.com:27017
com.mongodb.MongoSocketReadException: Exception receiving message
        at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:481)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:226)
        at com.mongodb.connection.CommandHelper.receiveReply(CommandHelper.java:134)
        at com.mongodb.connection.CommandHelper.receiveCommandResult(CommandHelper.java:121)
        at com.mongodb.connection.CommandHelper.executeCommand(CommandHelper.java:32)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:83)
        at com.mongodb.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:43)
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:116)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:116)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:209)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at com.mongodb.connection.SocketStream.read(SocketStream.java:85)
        at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:492)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:222)
        ... 8 more

The clue here is in the server log

 E NETWORK  [conn91] no SSL certificate provided by peer; connection rejected

Of course... we haven't specified our client certificate anywhere. It looks like we just need to use -Djavax.net.ssl.keyStore as we did with the CA certificate, but this is where the tricky part is. (If you've read this far and you've been waiting for a tricky part, then why are you reading this?)

Because we are importing a Certificate and a Key combined - we need to handle it a little differently.

First we need to convert our .pem file into a different format, a PKCS12 keystore. Use xxxxxx for the password again.

openssl pkcs12 -export -out client.pkcs12 -in client.pem
Enter Export Password:xxxxxx
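If you are scripting this rather than typing at the prompt, openssl can take the export password non-interactively via -passout (same xxxxxx password - obviously don't do this with a real secret that would end up in your shell history):

```shell
openssl pkcs12 -export -out client.pkcs12 -in client.pem -passout pass:xxxxxx
```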

Then we need to convert that to a java keystore:

keytool -importkeystore -srckeystore client.pkcs12 -srcstoretype PKCS12 -destkeystore client.jks -deststoretype JKS
Enter destination keystore password:
Re-enter new password: xxxxxx
Enter source keystore password: xxxxxx
Entry for alias 1 successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled

Now, finally, we can specify that to our Java app, allowing us to perform x509 authentication from Java:

java  -Djavax.net.ssl.keyStore=client.jks  -Djavax.net.ssl.trustStore=truststore.jks  -Djavax.net.ssl.keyStorePassword=xxxxxx   -cp mongo-java-driver-3.3.0.jar:. LoginTest 'mongodb://CN%3Dmongouser%40appserver.org%2COU%3Ddemouser%2CO%3DMongoDB%2CST%3DScotland%2CC%3DGB@ssldemo.mongodemo.com/?authMechanism=MONGODB-X509&ssl=true'
Login Test
Sep 09, 2016 4:44:38 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Cluster created with settings {hosts=[ssldemo.mongodemo.com:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
Sep 09, 2016 4:44:39 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=ssldemo.mongodemo.com:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
Sep 09, 2016 4:44:44 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Opened connection [connectionId{localValue:1, serverValue:1}] to ssldemo.mongodemo.com:27017
Sep 09, 2016 4:44:44 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Monitor thread successfully connected to server with description ServerDescription{address=ssldemo.mongodemo.com:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 2, 7]}, minWireVersion=0, maxWireVersion=4, maxDocumentSize=16777216, roundTripTimeNanos=1383314}
Sep 09, 2016 4:44:44 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Opened connection [connectionId{localValue:2, serverValue:2}] to ssldemo.mongodemo.com:27017
4


Now, I don't know why the people who define Java feel there must be a command-line keystore password - though as this key is all it takes to log in, it makes sense - and I suspect there may be better ways to supply it to Java. Still, this short blog post is enough to get you connecting to MongoDB from Java, with two-way verification and x509.
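One such alternative (a sketch, not official driver guidance - the KEYSTORE_PW environment variable is made up for this example) is to set the same javax.net.ssl system properties from code, before the first SSL connection is opened, instead of passing them as -D flags:

```java
// Sketch: configure the keystore/truststore programmatically rather than
// via -D flags. KEYSTORE_PW is a hypothetical environment variable; these
// properties must be set before the MongoClient opens its first SSL socket.
public class KeystoreSetup {
    public static void main(String[] args) {
        String pw = System.getenv("KEYSTORE_PW");
        System.setProperty("javax.net.ssl.keyStore", "client.jks");
        System.setProperty("javax.net.ssl.keyStorePassword", pw == null ? "" : pw);
        System.setProperty("javax.net.ssl.trustStore", "truststore.jks");
        System.out.println(System.getProperty("javax.net.ssl.keyStore"));
    }
}
```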

Stay tuned for my next blog post which will be way more interesting.







Thursday, March 31, 2016

MongoDB is not a Javascript Database

Sometimes people refer to MongoDB as a Javascript/JSON database - it's not; its internal format is BSON, which is a serialised binary representation of objects. Unlike Javascript, everything in MongoDB is stored strongly typed and is quickly traversable.

That being said, people do access MongoDB with duck-typed languages like PHP, Perl, Python and, yes, Javascript - so it can certainly seem that way - and the Mongo shell is a Javascript REPL, so it all seems very Javascript focussed.

Why does this matter and what's this Blog post about?

Well, it matters because one of the things I want from a database is to know that dates are dates, integers are integers, and floating point numbers aren't being stored as strings. Some 'document' databases just store JSON strings and deal with types at runtime - ACID compliance is one debate, but simply knowing you stored a date as a date and not a string is important too.

Actually, I just wanted to get that off my chest before I get into the point of this post.

I want to talk about the fact that you can query MongoDB using Javascript - and, like many things, this is a very useful piece of rope: you can secure things to things, you can tie things and pull things and climb things - but you can also hang yourself! So use the following information wisely.

In MongoDB, a normal query is sent to the database as an object - basically query by example - you make an object in the language of your choice and pass it to a find function on a collection object to find matching objects.

query = new Document("name", "john")
cursor = connection.getDatabase("myApp").getCollection("myUsers").find(query).iterator()
while(cursor.hasNext()) {
  record = cursor.next()
}

Or in the shell (Javascript)

c = db.getSiblingDB("myApp").users.find({"name": "john"})

This is the MongoDB approach - rather than a query language targeted at humans writing strings - which is how SQL was created, this starts with the premise queries will be constructed by code from some form of UI - it's actually a much better and safer fit than constructing SQL strings.
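To make the "safer than constructing SQL strings" point concrete, here is a tiny sketch in plain JavaScript (the `byName` helper is hypothetical, just for illustration): because the filter is built as a data structure, user input is only ever a value inside an object, never part of a parsed query string, so there is nothing to inject into.

```javascript
// Hypothetical helper: build a MongoDB-style filter from raw user input.
// The input is only ever a *value* in the object, never parsed as syntax.
function byName(userInput) {
  return { name: userInput };
}

// Even a malicious-looking string is just data to be matched literally.
const filter = byName('john"; DROP TABLE users; --');
```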

What many don't know is you can also do this.

c = db.getSiblingDB("myApp").users.find('this.name == "john" ')

Wait - what? That's a query, as a string! Is it code or some undocumented query language? What's that all about?

Well that's the short version of 

c = db.getSiblingDB("myApp").users.find({ $where: 'this.name == "john"' })

Each record in the database is converted, server side, from its serialised form into a Javascript object 'this', and then the supplied Javascript is run against it; whether that returns true or false determines whether it's returned. It's magic and so powerful and ... slow.

The reason it's slow is twofold:

(1) Javascript - this is an interpreted/JIT language, and it's never fast despite what node.js fans tell you.

(2) It has to convert every record in the collection - no indexes are used here.

So IMPORTANT - DO NOT do this, don't use $where for your queries.

Except $where it makes sense (see what I did there)

What if you cannot express your query using the Mongo query language? The Javascript interpreter in MongoDB is very sandboxed - you cannot call out to other services or even access other parts of the DB; you are limited to one record. But it does let you express, for example, WHERE a+b > 5 or even WHERE a > b AND b < 5.
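What $where evaluates per document is just a JavaScript predicate over 'this'. Written as a plain function over a document (a sketch outside the database, not the server-side machinery itself), the "a > b and b < 5" example looks like this:

```javascript
// The predicate that $where would evaluate against each document.
function matches(doc) {
  return doc.a > doc.b && doc.b < 5;
}

const docs = [
  { a: 3,  b: 1 },   // a > b and b < 5  -> matches
  { a: 1,  b: 3 },   // a < b            -> no match
  { a: 10, b: 6 }    // b >= 5           -> no match
];
const hits = docs.filter(matches);
```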

Let's take that last one as an example and see what the speed is like.

First I'll make a million records in the shell.


for (x = 0; x < 1000000; x++) {
  db.mine.insert({ _id: x, a: Math.random() * 1000, b: Math.random() * 1000 })
}

Now let us see how long a simple unindexed normal search takes


var start = new Date().getTime();
db.mine.find({a:{$lt:1}}).count() //Count used to ensure it gets not just first 100 results
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

384ms - Not bad in human terms but you wouldn't want to do very many of them.

Let's try that javascript query above now

var start = new Date().getTime();
db.mine.find("this.a > this.b && this.b < 5").count()
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

38,226 ms - yes, nearly 40 seconds. This is why you just don't do this! That database is only 38MB.

But I don't like to leave you without some positive news. MongoDB also allows a hybrid mode, thanks to $where, so I can do:

c = db.mine.find({ b: { $lt: 5 }, $where: 'this.a > this.b' })

This will use an index for b - which I'd better check exists first.

db.mine.ensureIndex({b:1})
var start = new Date().getTime();
db.mine.find({b:{$lt:5},$where:"this.a > this.b"}).count()
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

OK That's 263 ms, 150 times faster and I get to do comparisons between fields.


But is it the best option? No, Javascript is rarely the best option. MongoDB offers something much faster and just as capable, the Aggregation Pipeline.

With aggregation, you can write the above query like this. I have to add the last line to do the count

matchQuery={$match:{b:{$lt:5}}}
compareVals = { if: { $gt:["$a","$b"] }, then:true ,else:false}
addBool = {$project: {r:"$$ROOT",k:{$cond: compareVals} } }
filterNot = {$match: {k:true} }
simCount={$group:{_id:1,count:{$sum:1}}}

var start = new Date().getTime();
db.mine.aggregate([matchQuery,addBool,filterNot,simCount])
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

And that takes ... 27ms, around 1,400 times faster than the Javascript!!

The debate rages over whether the Aggregation Framework is Turing complete - but if you are using MongoDB and running ANY in-database Javascript, you probably aren't doing the right thing. MongoDB is NOT a Javascript database.








Friday, July 3, 2015

How Wired Tiger opens up the option of MongoDB for Big Data Analytics.


Introduction

Using the Wired Tiger engine for MongoDB brings big improvements to concurrency, although, as I blogged last time, not in every case unless you understand how to take advantage of it.

However, the data compression in Wired Tiger is another real win, with the cost of enterprise disk being so high and high-performance cloud storage and SSDs not far behind.

What though if you want to use MongoDB as a data analytics platform? What if you have a lot of data – I'm going to avoid calling it big data because then everyone just discusses what big means – but what if you have a lot of data, it's arriving quickly and it has an ad-hoc schema?

It turns out that Wired Tiger has a great solution to that.

Let's set out a real world example – a number of companies, including Amadeus put data into MongoDB for ad-hoc analysis. The requirement is that a smallish number of users can perform any query or aggregation they want, or pull data into something else for further analysis.

To do this well they either have a huge quantity of RAM to keep all the data in RAM, or they have Fusion IO and other ultra-fast flash storage, or both.

That's good, and one of them asked me what the implications of Wired Tiger were for them – so for the last few months I've been working that out. The answer is that it saves you a fortune in hardware.

I have also made sure that without too much difficulty you can repeat the results I get here – or at least the first and final results. Details at the end…

In short – the challenge is to take the MOT (Vehicle inspection) reports from http://data.gov.uk , load them in and then to be able to perform any ad-hoc query and groupings over them. No pre-calculation, pre aggregation or even indexes are allowed – after all you may want to perform ANY computation. There are 290 Million reports so let's load them twice just to ensure we have at least half a Billion.

Each report is simple, tabular, fixed data – exactly what MongoDB is NOT best at, in a later blog we can compare and contrast the document model and RDBMS as we pull in the more specific defect reports, in this case let's keep it simple.

There are 28 GB of tab delimited MOT files on data.gov so that's 56 GB of RAW data – MongoDB will extend it with field names and BSON types and things so in the simplest mode it gets larger.

To save time as I tried multiple configurations I wrote a custom loader in C – this allows me to load the data in parallel on a sharded cluster – all the code for that is available too.


The tests were on an AWS c4.4xlarge instance with 30GB RAM and 16 CPUs – a typical 'Enterprise' machine, perhaps a little light on RAM.

All data was on EBS, non-PIOPS spinning disk; this was deliberate as it is the slow, cheap option, with only 150 IOPS max and a ~60MB/s max read speed. The idea was to use cheap local/SAN disk because, as we said, disk isn't cheap (unless it's slow like this stuff).

In these tests the query I chose, whilst a little contrived, was typical of the sorts of queries customer analysts run. In short – for each car, give it a greenness score based on its engine size and whether it's a diesel or not, then sum the total number of miles it's done. As each MOT has the total miles, this calculation is wholly worthless, as you count miles from MOTs in previous years multiple times. Don't get hung up on the query itself – the point is that it looks at all the data, calculates something based on each record, adds them up and groups them.

Query summary

Choose a band from the following table based on Engine Capacity (cc)


Capacity         Band
Less than 1000   1
1000-1199        2
1200-1599        3
1600-1999        4
2000-2499        5
2500-3499        6
More than 3500   7

If the Engine is Diesel – multiply by 1.5 – a 'Dirty' factor.
Now count how many vehicles in each band and the total number of miles.

In MongoDB's aggregation pipeline this is written as:

isover3500 = { $cond : [ { $gt : [ '$e',3500 ] }, 7, 6 ] }
isover2500 = { $cond : [ { $gt : [ '$e',2500 ] }, isover3500, 5 ] }
isover2000 = { $cond : [ { $gt : [ '$e',2000 ] }, isover2500, 4 ] }
isover1600 = { $cond : [ { $gt : [ '$e',1600 ] }, isover2000, 3 ] }
isover1200 = { $cond : [ { $gt : [ '$e',1200 ] }, isover1600, 2 ] }
getband = { $cond : [ { $gt : [ '$e',1000 ] }, isover1200, 1 ] }

isdirty = { $cond : [ { $eq : [ '$f' ,'D']}, 1.5 , 1 ]}
dirtypenalty = { $multiply : [ getband, isdirty ]}
banding = { $project : { b : dirtypenalty , m : 1}}
group = { $group : { _id : '$b' , count : { $sum : 1} , miles : { $sum : '$m' } } }
sort = {$sort:{_id:1}}
use vosa
db.mot_results.aggregate([banding,group,sort])
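To sanity-check the nested $cond chain, the same banding can be written as a plain JavaScript function. This is a hypothetical helper for testing the logic, not something the pipeline needs:

```javascript
// Reproduce the nested $cond chain: pick a band from engine capacity (cc)...
function greennessScore(engineCC, fuel) {
  let band;
  if (engineCC > 3500)      band = 7;
  else if (engineCC > 2500) band = 6;
  else if (engineCC > 2000) band = 5;
  else if (engineCC > 1600) band = 4;
  else if (engineCC > 1200) band = 3;
  else if (engineCC > 1000) band = 2;
  else                      band = 1;
  // ...then apply the 1.5x 'dirty' penalty for diesel engines ('D').
  return fuel === "D" ? band * 1.5 : band;
}
```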


Environment Summary

Data

Records: 580,650,946
Volume (CSV) : 56 GB
Volume (BSON): 88.9GB
Typical Record:

Hardware

Host: Amazon EC2
CPUs: 16
RAM: 30GB
DISK: 512GB Spinning, non optimised SAN.

Software

MongoDB Enterprise 3.0.4

My Voyage of Discovery.

I started by simply running MongoDB with default parameters, mmapv1 storage, to see what most folks' 'typical' results might be. Loading the data took me 120 minutes, and by the time I had finished I was using 170GB of disk space between indexes and BSON.

I then ran the query above a few times and got the following.

Engine | Options | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
Mmap   |         | 169             | 120              | 48:47             | 2        | 1       | 3       | 95


49 minutes is a long time to wait for a query, and 169GB is a lot of disk space. Of course mmap isn't known for being frugal – one thing that mmap does by default is allocate the next 'power of two' storage size for each record. This results in an average 50% overhead but aids in avoiding fragmentation and re-allocating space. It also means that growing records only need to move on disk a minimal number of times.
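The power-of-two allocation is easy to picture with a toy function (the 32-byte minimum slot here is an assumption for illustration, not mmap's actual minimum):

```javascript
// mmapv1-style power-of-two record allocation: a record's slot is the
// next power of two at or above its size, so re-growth rarely forces a move.
function powerOfTwoSlot(bytes) {
  let slot = 32;                 // assumed minimum slot, for illustration
  while (slot < bytes) slot *= 2;
  return slot;
}
```

A 300-byte document lands in a 512-byte slot, roughly 70% overhead for that record, and about 50% on average across random sizes.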

In this case though I am slightly bounded by IO – I have some I/O wait in there, so maybe I can speed things up if I make the data smaller. These are tiny records, so the extra space is being read from disk – it's not like those blocks can be skipped, as the disk reads a minimum of 8K at a time.

Therefore I turned off powerOfTwo sizing for the collection – technically, setting noPadding=true in MongoDB 3.0 – and got the following.

Engine | Options   | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
Mmap   | noPadding | 127             | 105              | 38:17             | 3        | 0       | 3       | 95


Well, there is a win there – 42GB less disk (about 25%) and a consequent improvement of around 20% in speed. Of course, if I were deleting or editing records then I would have some issues, but for my analytics platform this may be OK.

Aggregation pipelines are single threaded per shard in MongoDB – therefore, if I wanted to bring some more CPU to the party I needed to shard. Not that this worries me: as long as my shard key is nicely distributed I should be able to parallelise the task. Therefore, still using mmap, I created some microshards, all running on the same server, to see how that helped. I can of course add more servers too to keep scaling out.

Engine | Options              | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
Mmap   | noPadding, 2 shards  | 127             | 89               | 25:44             | 4        | 2       | 4       | 90
Mmap   | noPadding, 4 shards  | 127             | 90               | 25:45             | 4        | 1       | 10      | 85

I can see I have a serious improvement by adding shards and bringing in CPUs, but only up to two shards – at four I get no improvement. Why? Well, the clue is right there in the IO Wait column – that's me maxing out my disk IO, which isn't amazing, these being slow, cheap disks holding lots of data.

This is where I can bring Wired Tiger into the mix – if I compress the data on disk, then I can read it faster at the expense of more CPU. So I rebuilt using WiredTiger and zlib compression to see what that would look like.
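For anyone reproducing this, the 3.0-era mongod switches for WiredTiger with zlib block compression look roughly like this (the dbpath and port are placeholders, not the values used in these tests):

```shell
mongod --storageEngine wiredTiger \
       --wiredTigerCollectionBlockCompressor zlib \
       --dbpath /data/wt --port 27017
```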

Engine     | Options | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger | zlib    | 28.5            | 88               | 23:12             | 6        | 0       | 2       | 74

That's an improvement – not much, but then again I'm back to one CPU and that's my limiter. At this point though I noticed that my data is smaller than my RAM! The actual table data is only 24GB of the 30GB I have, so will it all fit in RAM as with mmap?

The answer is yes – and no. Unlike mmap, Wired Tiger doesn't work on the data in its native form on disk; it uncompresses it into its own cache – and that cache is NOT compressed, it's plain old BSON, so large again. That said, the OS should still cache the disk files themselves. I saw that the IO wait was 2 on every run – what was going on? Shouldn't it be zero even if CPU was limiting me? iostat showed 12 MB/s coming from disk.

Then I remembered that Wired Tiger takes 50% of RAM for its uncompressed cache by default – I could turn that down to 1GB, and sure enough, now we were managing to run without using the disk – now we can crank it right up.
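Turning the cache down is just another mongod switch in MongoDB 3.0; a sketch of one microshard's startup (cache size, dbpath and port here are illustrative placeholders):

```shell
mongod --storageEngine wiredTiger \
       --wiredTigerCacheSizeGB 1 \
       --dbpath /data/shard1 --port 27018
```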

I therefore tried a couple of shards – first clearing the OS page cache – and the results looked good: monitoring with iostat showed that after the first run no disk IO was happening at all. I got the following.


Engine     | Options                   | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger | zlib, 2 shards, 1GB cache | 28.5            | 68               | 15:31             | 13       | 0       | 0       | 88

So that's looking good. Unfortunately, when I went to four shards I was being throttled by disk IO again – four times the 1GB cache didn't leave me enough RAM to cache my data, and the MongoDB option didn't go lower than 1GB.

At this point I went to look at the source – and I had an unsupported answer: you CAN tune the cache much lower using the engineConfig cache_size parameter, although at that point support from MongoDB gets a little more negotiated. I kept going, loading in the data then setting the cache to 2MB (yes, megabytes) – this gave each mongod instance a nice tiny footprint and let me crank the shards all the way up to 16.

Engine     | Options                     | Disk Usage (GB) | Load Time (mins) | Query Time (mins) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger | zlib, 2MB cache, 4 shards   |                 | 68               | 8:32              | 25       | 0       | 0       | 75
WiredTiger | zlib, 2MB cache, 8 shards   |                 | 67               | 4:36              | 50       | 0       | 0       | 50
WiredTiger | zlib, 2MB cache, 16 shards  | 28.5            | 66               | 3:35              | 100      | 0       | 0       | 0

I was disappointed that the final 16 didn't halve it again – I'll probably investigate that more sometime; I suspect it's because I'm using some cores for decompressing data. But reducing an ad-hoc data processing job from over 48 minutes to 3.5 is a real case for using Wired Tiger in your analytic cluster. Conversely – you can buy about 80% less RAM and avoid expensive disks. That should answer my client's question.

If you want to try this at home, I threw (literally) all my materials into http://github.com/johnlpage/BigAggBlog – feel free to use those to get yourself up and going. Let me know if you want to try this seriously on your data and I may be able to lend a hand.

Anyone want to save me the effort and create a similar single-server, ad-hoc query/analysis setup in Hadoop, Spark, Cassandra or Postgres? I'd do it, but I'm sure many of you could do it far better and I'd love to see this like for like. As Mark Callaghan wisely observed, the value of a 'benchmark' is worth 1/numberOfSystemsTested – and no, I don't expect MongoDB to be the fastest analytic platform out there – that's SAS LASR! I'd like to know how far off the pace MongoDB is, though.