It seems like every once in a while we all have to re-learn certain lessons.
As part of our daily processing, Gnip stores many terabytes of data in millions of keys on Amazon’s S3. Various aspects of serving our customers require that we pour over those keys and the data behind them, regularly.
As an example, every 24 hours we construct usage reports that provide visibility into how our customers are using our service. Are they consuming a lot or a little volume? Did their usage profile change? Are they not using us at all? So on and so on. We also have what we affectionately refer to as the “dude where’s my tweet” challenge; of the billion activities we deliver each day to our customers, inevitably someone says “hey, I didn’t receive Tweet ‘X’ what gives?” Answering that question requires that we store the ID of every Tweet a customer ever receives. Pouring over all this data every 24 hours is a challenge.
As we started on the project, it seemed like a good fit for Hadoop. It involves pulling in lots of small-ish files, doing some slicing, aggregate the results, and spitting them out the other end. Because we’re hosted in Amazon it was natural to use their Elastic MapReduce service (EMR).
Conceptually the code was straight forward and easy to understand. The logic fit the MapReduce programming model well. It requires a lot of text processing and sorts well into various stages and buckets. It was up and running quickly.
As the size of the input grew it started to have various problems, many of which came down to configuration. Hadoop options, JVM options, open file limits, number and size of instances, number of reducers, etc. We went through various rounds of tweaking settings and throwing more machines in the cluster, and it would run well for a while longer.
But it still occasionally had problems. Plus there was that nagging feeling that it just shouldn’t take this much processing power to do the work. Operational costs started to pop up on the radar.
So we did a small test to check the feasibility of getting all the necessary files from S3 onto a single EC2 instance and processing it with standard old *nix tools. After promising results we decided to pull it out of EMR. It took several days to re-write, but we’ve now got a simple Ruby script using various *nix goodies like cut, sort, grep and their friends. The script is parallel-ized via JRuby threads at various points that make sense (downloading multiple files at once and processing the files independently once they’ve been bucketed).
In the end it runs in less time than it did on EMR, on a single modest instance, is much simpler to debug and maintain, and costs far less money to run.
We landed in a somewhat counter-intuitive place. There’s great technology available these days to process large amounts of data; we continue to use Hadoop for other projects. But as we start to bring them into our tool-set we have to be careful not to forget the power of straight forward, traditional tools.
Simplicity wins.

In April 2010, the U.S. Library of Congress and Twitter 