MongoDB CTO on Foursquare’s Scaling Issues

Eliot Horowitz is a co-founder and CTO of 10gen, makers of MongoDB, an increasingly trendy database that’s used by Foursquare, among others. He’s just released a statement on what went wrong this week that led to Foursquare’s 11-hour outage, which was followed by another “aftershock” of downtime the following day.

During waves of epic downtime this week, Foursquare’s engineers struggled with their servers and database architecture, desperately attempting to migrate data from failing shards to get the service back online.

In an official post-mortem, a Foursquare engineer told how one overloaded shard took down the entire service. Engineers introduced a new shard to the system and began migrating data (none of which was lost during the process), but were unsuccessful in bringing Foursquare back online. Ultimately, the team had to take five hours to reindex the original shard.

Still, the engineer admitted the team had no idea why the shard had failed in the first place.

In a more detailed write-up, Horowitz explains this mystery: the blame lies squarely with power users.

Two months ago, Foursquare began using a cluster of two shards, each with a maximum capacity of 66GB of RAM. In theory, checkins would be written evenly across the shards. Unfortunately for all parties involved, this did not happen.

“Assuming certain subsets of users are more active than others,” Horowitz wrote, “it’s conceivable that their updates might all go to the same shard. That’s what occurred in this case, resulting in one shard growing to 66GB and the other only to 50GB.”

So as the power users’ checkins grew, Shard A received more checkins than it could handle, queries became slow, a backlog built up and then the site went down.
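To make the imbalance concrete, here is a minimal sketch of how a handful of very active users can skew a two-shard cluster. It is not Foursquare's actual setup: the range-based placement by user ID, the helper names `shard_for` and `documents_per_shard`, and all of the user and checkin counts are assumptions chosen purely for illustration.

```python
from collections import Counter


def shard_for(user_id: int, num_shards: int, max_user_id: int) -> int:
    """Hypothetical range-based placement: each shard owns a contiguous block of user IDs."""
    block = (max_user_id + num_shards) // num_shards
    return user_id // block


def documents_per_shard(checkins_by_user: dict, num_shards: int) -> list:
    """Total checkins landing on each shard, given per-user checkin counts."""
    max_user_id = max(checkins_by_user)
    totals = Counter()
    for user_id, count in checkins_by_user.items():
        totals[shard_for(user_id, num_shards, max_user_id)] += count
    return [totals[shard] for shard in range(num_shards)]


# Made-up workload: 1,000 users, of whom the first 50 are power users
# with 200 checkins each; everyone else has 10.
checkins = {uid: (200 if uid < 50 else 10) for uid in range(1000)}
sizes = documents_per_shard(checkins, num_shards=2)
print(sizes)
```

In this toy workload both shards hold the same number of users, yet the shard that happens to own the power users ends up with nearly three times as many checkins as the other.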

The next logical question is, “How do you architect a system that scales for occasional users and power users alike, especially when you’re storing large numbers of objects with relatively small file sizes?”

While that’s not really a question we’re qualified to answer, we share Horowitz’ obvious conclusion that monitoring the system’s capacity is key. Operating at or near capacity is a recipe for trouble.

“Once you’re at max capacity,” he wrote, “it’s difficult to add more capacity without some downtime when objects are small. However, if caught in advance, adding more shards on a live system can be done with no downtime.” He went on to say that with 12 more hours of notice, Foursquare’s downtime might have been prevented.
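As a rough illustration of what "caught in advance" could look like, the sketch below flags shards that are close to their RAM limit and estimates how many hours of headroom remain. The 66GB and 50GB figures come from Horowitz's write-up; the 80 percent warning threshold, the growth rate and the function names are our own assumptions, not anything 10gen or Foursquare has described.

```python
def shards_over_threshold(used_gb: dict, capacity_gb: float, warn_fraction: float = 0.8) -> list:
    """Names of shards whose data size is at or above warn_fraction of capacity."""
    return sorted(
        name for name, used in used_gb.items()
        if used / capacity_gb >= warn_fraction
    )


def hours_until_full(used_gb: float, capacity_gb: float, growth_gb_per_hour: float) -> float:
    """Hours of headroom left at a constant growth rate (infinite if not growing)."""
    if growth_gb_per_hour <= 0:
        return float("inf")
    return max(0.0, (capacity_gb - used_gb) / growth_gb_per_hour)


# Shard sizes as reported at the time of the outage; capacity is the 66GB RAM limit.
shards = {"A": 66.0, "B": 50.0}
capacity = 66.0

print(shards_over_threshold(shards, capacity))        # shards needing attention
print(hours_until_full(shards["B"], capacity, 0.5))   # 0.5 GB/hour is a made-up rate
```

A check like this only helps if the alert fires while there is still enough headroom to add a shard and migrate data, which is why the threshold sits well below 100 percent.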

While we’re glad that 10gen and Foursquare are working to improve MongoDB and address current and future scalability issues, it’s tough to watch a service grow as quickly as Foursquare has, both in user adoption and in partnerships with (and revenue from) national brands, and still suffer such disturbing amounts of downtime.

Do you think MongoDB and Foursquare have what it takes to scale? Let us know what you think of Horowitz’ comments (and Foursquare’s architecture) in the comments.
