Eliot Horowitz is a co-founder and CTO of 10gen, makers of MongoDB, an increasingly trendy database that’s used by Foursquare, among others. He has just released a statement explaining what led to Foursquare’s 11-hour outage this week, which was followed by another “aftershock” of downtime the following day.
During waves of epic downtime this week, Foursquare’s engineers struggled with their servers and database architecture, desperately attempting to migrate data from failing shards to get the service back online.
In an official post-mortem, a Foursquare engineer described how one overloaded shard took down the entire service. Engineers introduced a new shard to the system and began migrating data (none of which was lost during the process), but were unsuccessful in bringing Foursquare back online. Ultimately, the team had to spend five hours reindexing the original shard.
Still, the engineer admitted the team had no idea why the shard had failed in the first place.
In a more detailed write-up, Horowitz explains this mystery: the blame lies squarely with power users.
Two months ago, Foursquare began using a cluster of two shards, each with a maximum of 66GB of RAM. In theory, checkins would be written evenly between the shards. Unfortunately for all parties involved, this did not happen.
“Assuming certain subsets of users are more active than others,” Horowitz wrote, “it’s conceivable that their updates might all go to the same shard. That’s what occurred in this case, resulting in one shard growing to 66GB and the other only to 50GB.”
So as the power users’ checkins grew, one shard received more checkins than it could handle, queries became slow, a backlog built up, and then the site went down.
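To see how an even split of users can still produce an uneven split of data, here is a minimal sketch in Python. It is not Foursquare's or MongoDB's actual routing logic: the range-based split on user ID, the user counts, and the checkin counts are all assumptions made up for illustration.

```python
# Hypothetical illustration only: two shards split by user-ID range.
# None of these numbers come from Foursquare or 10gen.

def shard_for(user_id, split_point):
    """Range-based routing: IDs below the split point go to shard A."""
    return "A" if user_id < split_point else "B"

def simulate(checkins_per_user, split_point):
    """Total the checkins that land on each shard."""
    totals = {"A": 0, "B": 0}
    for user_id, count in checkins_per_user.items():
        totals[shard_for(user_id, split_point)] += count
    return totals

# Assume 1,000 users, where the 100 earliest signups are power users
# who check in ten times as often as everyone else.
checkins = {uid: (50 if uid < 100 else 5) for uid in range(1000)}

# Each shard holds exactly 500 users, yet the data is far from balanced.
totals = simulate(checkins, split_point=500)
print(totals)  # {'A': 7000, 'B': 2500}
```

Both shards hold the same number of users here, but the one that happens to contain the heavy users ends up with nearly three times the data.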
The next logical question is, “How do you architect a system that scales for occasional users and power users alike, especially when you’re storing large numbers of objects with relatively small file sizes?”
While that’s not really a question we’re qualified to answer, we share Horowitz’ obvious conclusion that monitoring the system’s capacity is key. Operating at or near capacity is a recipe for trouble.
“Once you’re at max capacity,” he wrote, “it’s difficult to add more capacity without some downtime when objects are small. However, if caught in advance, adding more shards on a live system can be done with no downtime.” He went on to say that with 12 more hours of notice, Foursquare’s downtime might have been prevented.
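The kind of early warning Horowitz describes could be as simple as comparing each shard's data size to its RAM limit. The sketch below uses the 66GB and 50GB figures from his write-up. The 80 percent alert threshold and the function name are our own assumptions, not anything 10gen or Foursquare has described.

```python
# Minimal capacity check, assuming each shard's data size is already known.
RAM_LIMIT_GB = 66.0    # per-shard RAM figure cited in the write-up
WARN_FRACTION = 0.8    # hypothetical alert threshold, chosen for illustration

def shards_near_capacity(sizes_gb, limit_gb=RAM_LIMIT_GB,
                         warn_fraction=WARN_FRACTION):
    """Return the names of shards at or above the warning threshold."""
    threshold = limit_gb * warn_fraction
    return sorted(name for name, size in sizes_gb.items() if size >= threshold)

# Shard sizes at the time of the outage, per Horowitz.
alerts = shards_near_capacity({"A": 66.0, "B": 50.0})
print(alerts)  # ['A']
```

A check like this only helps if it runs often enough, and with a low enough threshold, to leave time to add shards before the limit is reached.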
While we’re glad that 10gen and Foursquare are working to improve MongoDB and address current and future scalability issues, it’s tough to watch a service grow as quickly as Foursquare has, both in user adoption and in partnerships with (and revenue from) national brands, and still suffer this much downtime.
Do you think MongoDB and Foursquare have what it takes to scale? Let us know what you think of Horowitz’ comments (and Foursquare’s architecture) in the comments.