Turns Out, ‘Select’ Is Broken

My last few days have been super frantic. MongoDB decided to go haywire, and nothing made any sense.

Everything was humming along nicely, and then mayhem: the web app could no longer connect to MongoDB. Every request made the driver throw a TimeoutException:

“Timeout waiting for a MongoConnection.” (from MongoConnectionPool.AcquireConnection(MongoDatabase))

The pool runs out of connections and never replenishes itself. The whole site stays down until I restart the web server. I immediately suspected a connection leak in my code: I must have forgotten to dispose of something.

I managed to reproduce the error reliably by hammering the app with a massive load. Eventually the Mongo connection pool would pass out, and from that point on every request threw a connection timeout. This only happened on a live Azure server, not locally, so my fancy debugger was out of the game.

So I reviewed my code, trying to find a connection leak, but couldn’t see anything out of the ordinary. I dug into the source code of the MongoDB C# driver, trying to find how my code could leak. No cigar. Then I tried to blame the FluentMongo library that I use. That must be it! I suspected it might not dispose MongoCursorEnumerator properly, causing a leak under heavy load. I took the library out of my solution completely, but I was still no closer to fixing the error. I didn’t see any problem in MongoAzure’s code either.

The only thing left was the MongoDB C# driver itself. But we’re taught that select() isn’t broken. I can’t seriously be thinking that the MongoDB C# driver leaks connections, can I? But after a few days I hadn’t moved an inch from where I was when these shenanigans started, so I eventually decided to go through the driver’s own connection-pool code.

And there I found the answer. There was no leak. The timeout happened naturally under load. However, there is a bug in the MongoDB driver code: after the first connection timeout, the pool falls into a permanently broken state. More specifically, there’s a bit of code in MongoConnectionPool that clears out all its connections (e.g. after a timeout) without adjusting its connection counter accordingly. The pool therefore still believes it is at capacity and never opens new connections. In other words, it stays stuck in a drained state forever.

The fix was just a single line of code (resetting the pool counter to zero). I was about to submit the patch when I discovered it had already been fixed on their trunk, barely a few days earlier.
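To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above, not the driver's actual code: ToyPool, PoolTimeout, survives_timeout and the reset_count_on_clear flag are all made-up names, and the toy "times out" immediately instead of waiting for a connection to be released.

```python
class PoolTimeout(Exception):
    """Stands in for the driver's TimeoutException (hypothetical name)."""


class ToyPool:
    """Toy connection pool; NOT the real MongoConnectionPool."""

    def __init__(self, max_size, reset_count_on_clear):
        self.max_size = max_size
        self.reset_count_on_clear = reset_count_on_clear
        self.available = []  # idle connections ready for reuse
        self.count = 0       # how many connections the pool believes exist

    def acquire(self):
        if self.available:
            return self.available.pop()
        if self.count < self.max_size:
            self.count += 1
            return object()  # stand-in for opening a real connection
        # A real pool would block for a while here; the toy gives up at once.
        raise PoolTimeout("Timeout waiting for a connection")

    def clear(self):
        # Drop every connection, e.g. after a timeout.
        self.available.clear()
        if self.reset_count_on_clear:
            self.count = 0  # the one-line fix: counter matches reality again


def survives_timeout(reset_count_on_clear):
    """Return True if the pool can hand out a connection after one timeout."""
    pool = ToyPool(max_size=2, reset_count_on_clear=reset_count_on_clear)
    pool.acquire()
    pool.acquire()          # pool is now at capacity
    try:
        pool.acquire()      # heavy load: one request too many
    except PoolTimeout:
        pool.clear()        # pool throws away all its connections
    try:
        pool.acquire()      # traffic is back to normal
        return True
    except PoolTimeout:
        return False


print("without counter reset:", survives_timeout(False))
print("with counter reset:   ", survives_timeout(True))
```

Without the reset, the counter still says the pool is full even though it holds no connections, so every later acquire fails. With the reset, the pool simply opens fresh connections on the next request.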

The fix has not made it into any release yet. So if you’re using the official release version of the MongoDB driver (i.e. from NuGet), you have this bug sitting on your server. Whenever your application gets heavy enough traffic to cause a single connection timeout, it will knock out your MongoDB connection pool completely, and every request after that fails until you restart. So yeah, it’s pretty goddamn catastrophic.

The solution: pull the MongoDB driver from the latest source trunk and compile it yourself. The moral: sometimes ‘select’ can be broken.

