Simplicity Wins

It seems like every once in a while we all have to re-learn certain lessons.

As part of our daily processing, Gnip stores many terabytes of data in millions of keys on Amazon’s S3. Various aspects of serving our customers require that we regularly pore over those keys and the data behind them.

As an example, every 24 hours we construct usage reports that provide visibility into how our customers are using our service. Are they consuming a lot or a little volume? Did their usage profile change? Are they not using us at all? So on and so on. We also have what we affectionately refer to as the “dude, where’s my tweet” challenge: of the billion activities we deliver each day to our customers, inevitably someone says “hey, I didn’t receive Tweet ‘X’; what gives?” Answering that question requires that we store the ID of every Tweet a customer ever receives. Poring over all this data every 24 hours is a challenge.

As we started on the project, it seemed like a good fit for Hadoop. It involves pulling in lots of small-ish files, doing some slicing, aggregating the results, and spitting them out the other end. Because we’re hosted in Amazon, it was natural to use their Elastic MapReduce (EMR) service.

Conceptually the code was straightforward and easy to understand. The logic fit the MapReduce programming model well: it requires a lot of text processing and sorts neatly into various stages and buckets. It was up and running quickly.

As the size of the input grew, the job started to have various problems, many of which came down to configuration: Hadoop options, JVM options, open file limits, number and size of instances, number of reducers, and so on. We went through various rounds of tweaking settings and throwing more machines into the cluster, and it would run well for a while longer.

But it still occasionally had problems. Plus there was that nagging feeling that it just shouldn’t take this much processing power to do the work. Operational costs started to pop up on the radar.

So we ran a small test to check the feasibility of pulling all the necessary files from S3 onto a single EC2 instance and processing them with standard old *nix tools. After promising results we decided to pull the job out of EMR. It took several days to rewrite, but we’ve now got a simple Ruby script using various *nix goodies like cut, sort, grep, and their friends. The script is parallelized via JRuby threads at the points where that makes sense (downloading multiple files at once, and processing the files independently once they’ve been bucketed).
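
To give a flavor of the shape that script took (the real thing isn’t published here, so the S3 client, key list, and field positions below are stand-ins), a stripped-down sketch looks something like this:

    require 'fileutils'

    # Hypothetical list of S3 keys to process; the real script builds this
    # from a bucket listing.
    keys = File.readlines('keys_to_process.txt').map(&:strip)
    workdir = 'work'
    FileUtils.mkdir_p(workdir)

    # Download several files at once. JRuby threads map to native threads,
    # so parallel downloads keep the network pipe full.
    keys.each_slice((keys.size / 4) + 1).map do |slice|
      Thread.new do
        slice.each do |key|
          # 's3cmd' stands in for whatever S3 client actually does the fetch.
          system('s3cmd', 'get', "s3://example-bucket/#{key}", "#{workdir}/#{File.basename(key)}")
        end
      end
    end.each(&:join)

    # Once everything is local, lean on the standard *nix toolchain:
    # cut out the fields we care about, sort, and count per customer.
    system("cat #{workdir}/* | cut -f1,3 | sort | uniq -c > usage_report.txt")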

In the end it runs in less time than it did on EMR, on a single modest instance, is much simpler to debug and maintain, and costs far less money to run.

We landed in a somewhat counter-intuitive place. There’s great technology available these days to process large amounts of data, and we continue to use Hadoop for other projects. But as we bring those tools into our toolset, we have to be careful not to forget the power of straightforward, traditional tools.

Simplicity wins.

Gnip: An Update

Gnip moved into our new office yesterday (other end of the block from our old office). The transition provided an opportunity for me to think about where we’ve been, and where we’re going.

Team

We continue to grow, primarily on the engineering side. Check out our jobs page if you’re interested in working on a hard problem, with smart people, in a beautiful place (Boulder, CO).

Technology

We’ve built a serious chunk of back-end infrastructure that I’d break into two general pieces: “the bus”, and “the pollers.”

“The Bus”

Our back-end moves large volumes of relatively small (usually <~3k bytes) chunks of data from A to B in a hurry. Data is “published” into Gnip, we do some backflips with it, then spit it out the other side to consumers.

“The Pollers”

Our efforts to get Publishers to push directly into Gnip didn’t pan out the way we initially planned, so we had to change course and acquire data ourselves. The bummer here was that we set out on an altruistic mission to relieve the polling pain the industry has been suffering from, but were met with such inertia that we didn’t get the coverage we wanted. The upside is that building polling infrastructure has allowed us to control more of our business destiny. We’ve gone through a few iterations on our approach to polling, from complex job scheduling and systems that “learn” and “adapt” to their surroundings, to dirt-simple, mindless grinders that ignorantly eat APIs/endpoints all day long. We’re currently slanting heavily toward simplicity in the model. The idea is to take learnings from the simple model over time and feed them into abstractions/refactorings that make the system smarter.
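
For the curious, the “dirt simple” model doesn’t amount to much more than a loop. The sketch below is illustrative only; the endpoint list, the cadence, and the publish_into_gnip hand-off are made up for the example:

    require 'net/http'
    require 'uri'

    # Hypothetical list of endpoints to grind on all day long.
    ENDPOINTS = [
      'http://api.example.com/updates.xml',
      'http://feeds.example.org/recent.atom'
    ]

    POLL_INTERVAL = 60 # seconds; a fixed, mindless cadence

    loop do
      ENDPOINTS.each do |endpoint|
        begin
          response = Net::HTTP.get_response(URI.parse(endpoint))
          # publish_into_gnip is a stand-in for handing the payload to the bus.
          publish_into_gnip(response.body) if response.is_a?(Net::HTTPSuccess)
        rescue StandardError => e
          # Ignorant by design: log the failure and keep grinding.
          warn "poll of #{endpoint} failed: #{e.message}"
        end
      end
      sleep POLL_INTERVAL
    end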

Deployment

We’re still in the cloud. Amazon’s EC2/S3 products have been a solid, highly flexible framework for us (albeit not necessarily the most cost-effective when your CPU utilization isn’t in the 90%+ range per box); hats off to those guys.

Industry

“The Polling Problem”

It’s been great to see the industry wake up and acknowledge “the polling problem” over the past year. SUP (Simple Update Protocol) popped up to provide more efficient polling for systems that couldn’t, or wouldn’t, move to an event-driven model: it provides a compact change-log, so pollers can poll the change-log and then do heavier polls only for the stuff that has changed. PubSubHubbub popped up to provide the framework for a distributed Gnip (though lacking inherent normalization); a combination of polling and events spread across nodes allows for a more decentralized approach.
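
The change-log pattern is worth spelling out, since it’s the heart of the efficiency win. The sketch below shows the general idea only; the JSON shape and URLs are invented for illustration and are not the actual SUP wire format:

    require 'json'
    require 'net/http'
    require 'uri'

    # Poll one small change-log document instead of hammering every source.
    changelog = JSON.parse(Net::HTTP.get(URI.parse('http://example.com/changelog.json')))

    # Only sources that actually changed get the heavier, full poll.
    changed_ids = changelog['updates'].map { |entry| entry['source_id'] }.uniq

    changed_ids.each do |source_id|
      full_feed = Net::HTTP.get(URI.parse("http://example.com/sources/#{source_id}/feed.atom"))
      # process is a stand-in for normalization and distribution.
      process(full_feed)
    end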

“Normalization”

The Activity Streams initiative grew legs and is walking. As with any “standards” (or “standards-like”) initiative, things are only as good as adoption. Building ideas in a silo without users makes for a fun exercise, but not much else. Uptake matters, and MySpace and Facebook (among many other smaller initiatives) have bitten off chunks of Activity Streams; that’s a very big, good sign for the industry. Structural and semantic consistency matters for applications digesting a lot of information. Gnip provides highly structured and consistent data to its consumers via gnip.xsd.

In order to meet its business needs, and to adapt to the constantly moving industry around it, Gnip has adjusted its approach on several fronts. We moved to incorporate polling. We understand that there is more than one way of doing things, and we will incorporate SUP and PubSubHubbub into our framework. Doing so will make our own polling efforts more effective, and also give us more flexibility in how we provide data to our consumers. While normalized data is nice for a large category of consumers, there is a large tier of customers that doesn’t need, or want, heavy normalization. Opaque message flow has significant value as well.

We set out to move mind-boggling amounts of information from A to B, and we’re doing that. Some of the nodes in the graph are shifting, but the model is sound. We’ve found there are primarily two types of data consumers: high-coverage of a small number of sources (“I need 100% of Joe, Jane, and Mike’s activity”), and “as high as you can get it”-coverage of a large number of sources (“I don’t need 100%, but I want very broad coverage”). Gnip’s adjusted to accommodate both.

Business

We’ve had to shift our resources to better focus on the paying segments of our audience. We initially thought “life-stream aggregators” would be our biggest paying customer segment; however, data/media analytics firms have proven significant. Catering to the customers who tell you “we have budget for that!” makes good business sense, and we’re attacking those opportunities.

Numbers + Architecture

We’ve been busy over the past several months working hard on what we consider a fundamental piece of infrastructure that the network has been lacking for quite some time. From “ping server for APIs” to “message bus,” we’ve been called a lot of things, and we are actually all of them rolled into one. I want to provide some insight into what our backend architecture looks like, since systems like this generally don’t get a lot of fanfare; they just have to “work.” Another title for this blog post could have been “The Glamorous Life of a Plumbing Company.”

First, some production numbers.

  • 99.9%: the Gnip service has 99.9% uptime.
  • 0: we have had zero Amazon EC2 instances fail.
  • 10: ten EC2 instances, of various sizes, run the core, redundant message bus infrastructure.
  • 2.5m: 2.5 million unique activities are HTTP POSTed (pushed) into Gnip’s Publisher front door each day.
  • 2.8m: 2.8 million activities are HTTP POSTed (pushed) out Gnip’s Consumer back door each day.
  • 2.4m: 2.4 million activities are HTTP GETed (polled) from Gnip’s Consumer back door each day.
  • $0: no money has been spent on framework licenses (unless you include “AWS”).

Second, our approach.

Simplicity wins. These production transaction-rate numbers, while solid, are not earth-shattering; we have, however, achieved much higher rates in load tests. We optimized for activity retrieval (outbound) as opposed to delivery into Gnip (inbound). That means every outbound POST/GET is moving static data off of disk; no math gets done. Every inbound activity results in processing to ensure proper Filtration and distribution; we do the “hard” work on delivery.
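
A toy version of that write-time fan-out looks like the following. The filter predicates and on-disk layout are assumptions for illustration; the real Filter model and storage are richer than this:

    # Sketch of doing the "hard" work at publish time. FILTERS and the
    # on-disk layout are illustrative, not the production implementation.
    FILTERS = {
      'consumer_a' => ->(activity) { activity[:actor] == 'joe' },
      'consumer_b' => ->(activity) { true } # wants everything
    }

    def deliver(activity)
      FILTERS.each do |consumer, matches|
        next unless matches.call(activity)
        # Append to a pre-built static file; the outbound GET later just
        # streams this file back, with no math done at read time.
        File.open("/var/gnip/#{consumer}/notifications.xml", 'a') do |f|
          f.puts(activity[:raw_xml])
        end
      end
    end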

We view our core system as handling ephemeral data. This has allowed us, thus far, to avoid having a database in the environment, which means we don’t have to deal with traditional database bottlenecks. To be sure, we have other challenges as a result, but we decided to take those on as opposed to having the “database maintenance and administration” ball and chain perpetually attached. So, in order to share contended state across multiple VMs, across multiple machine instances, we use shared memory in the form of TerraCotta. I’d say TerraCotta is “easy” for “simple” apps, but challenges emerge when you start dealing with very large data sets in memory (multiple gigabytes). We’re investing real energy in tuning our object graph, access patterns, and object types to keep things working as Gnip usage increases. For example, we’re in the midst of experimenting with pageable TerraCotta structures that ensure smaller chunks of memory can be paged into “cold” nodes.

When I look at the architecture we started with, compared to where we are now, there are no radical changes. We chose to start clustered, so we could easily add capacity later, and that has worked really well. We’ve had to tune things along the way (split various processes to their own nodes when CPU contention got too high, adjust object graphs to optimize for shared memory models, adjust HTTP timeout settings, and the like), but our core has held strong.

Our Stack

  • nginx – HTTP server, load balancing
  • JRE 1.6 – Core logic, REST Interface
  • TerraCotta – shared memory for clustering/redundancy
  • ejabberd – inbound XMPP server
  • Ruby – data importing, cluster management
  • Python – data importing

High-Level Core Diagram

[Image: Gnip core architecture diagram]

Gnip owes all of this to our team & our customers; thanks!

Three (Six?) Week Software Retrospective

I had to go back into older blog posts to remind myself when we launched; July 1st. It feels like we’ve been live since June 1st.

Looking Back

Things have gone incredibly well from an infrastructure standpoint. We’ve had to add/adjust some system monitoring parameters to accommodate the variety of Data Producers publishing into the system; different frequencies/volumes call for specialized treatment. We weren’t expecting the rate, or volume, of Collection creation we wound up with. Within three hours of going live, we had enough Collections in the system to adversely impact node startup/sync times. We patiently tuned our data model, and tuned TerraCotta locks, to get things back to normal. It’s looking like we’ll be in bed with TerraCotta for the long haul.

Amazon

I’m not sure I could be any more pleased with AWS. Our core service is heavily dependent on EC2, and that’s been running sans issues. We’re working on non-Amazon failover solutions that ensure uninterrupted service even if all of EC2 dies. Our backups are S3-dependent, so we had some behind-the-scenes issues last weekend when S3 was flaky; see my previous post on this issue. We haven’t had our day in the sun with outages, and I obviously hope we never do, but so far I’m walking around with a big “I <3 AWS” t-shirt on.

Other

On the convenience library front, we (Gnip + community) have made all of our code available on github. We’ve had tremendous community support and contribution on this front; so cool to see; thanks everyone!

Collections are by far the primary data access pattern (as opposed to raw public activity stream polling); not really a surprise.

Summize/Twitter has been a totally cool way to track ether discussion around Gnip. When we notice folks talking about Gnip, positive or negative, we can reach out in “real-time” and strike up a conversation.

That’s all for now.

Thanks to all the Data Producers and Consumers that have integrated with Gnip thus far!

A Note About Gnip and S3's Weekend Party

Amazon’s S3 outage over the weekend did not affect Gnip’s live service. Gnip uses S3 for system state archival/backup purposes, but the live data flow through Gnip was not affected, as we keep it in local instances (memory/local disk). We weren’t able to back up data while S3 was down, but its outage was intermittent, so during online windows we did our backups. Eventually the S3 outage was “over” and the balance between local storage and S3/remote storage was restored. If at some magical point it had become clear S3 simply wasn’t coming back online, we’d have moved our backups to another service.
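
The fallback behavior boils down to spooling backups locally and flushing whenever S3 is reachable. A rough sketch, with a hypothetical spool path and 's3cmd' standing in for whatever client actually pushes to S3:

    require 'fileutils'

    SPOOL_DIR = '/var/gnip/backup-spool' # hypothetical local spool location

    # Write the backup locally first, then try to push it to S3. If S3 is
    # having a bad weekend, the file simply waits in the spool.
    def backup(name, data)
      FileUtils.mkdir_p(SPOOL_DIR)
      File.open(File.join(SPOOL_DIR, name), 'w') { |f| f.write(data) }
      flush_spool
    end

    def flush_spool
      Dir.glob(File.join(SPOOL_DIR, '*')).each do |path|
        # 's3cmd put' stands in for the actual S3 client call.
        ok = system('s3cmd', 'put', path, "s3://example-backup-bucket/#{File.basename(path)}")
        File.delete(path) if ok
      end
    end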

Building scalable, redundant, highly available systems is the next big game. It actually has been for decades, but now a larger web application audience is becoming acutely aware of its importance and, subsequently, of how to accomplish it. At the end of the day, everything fails. The game becomes isolating the weak points, buttressing the critical points of your service to ensure “instant” recovery from all the failures you can anticipate, and minimizing complete system setup/restart time in case everything craters and you have to scramble to come back online.

I hope Gnip never has its day in the searing outage sun, but we’re not naive.

Brush your teeth before bed, eat right, exercise, and eliminate your Single Points of Failure.

The HOW of Gnip: Keep it Simple Stupid!

Keep it simple

Our designs are rooted in leveraging the most prevalent standards on the web today. A big mess of proprietary goop wouldn’t help anyone. HTTP servers have evolved to scale like mad when static data is concerned. So, we set out to host static flat files with commodity HTTP servers (nginx); that layer scales from here to the end of time. Scale trouble comes into the picture when you’re doing a lot of backflips around data at request time. To mitigate this kind of trouble, our API is static as well (to the chagrin of even some of us at Gnip). Gnip doesn’t do any interesting math when data is consumed from it (the read); we serve pre-constructed flat files for each request, which allows us to leverage several standard caches. The Publish side (the write) is where we do some cartwheels to normalize and distill inbound data, and get it into static structures/files for later consumption.

Being concerned only with message flow and notification, we were able to avoid the nastiest aspect of system scaling: the database. Databases suck; all of them do. “Pick your poison” is your only option. While I’m basking in the glory of not having a database in our midst, I’m not naive enough to think we’ll never need one in the main processing flow, but just let me enjoy it for now. To be clear, we do use Amazon’s SimpleDB to store non-critical-path info such as user account information, but that info is rarely accessed and easily hangs in memory. The main reason we can avoid a database right now is that we only advertise availability of 60 minutes worth of transient events; we don’t offer programmatic access to big date ranges, for example. SQL queries are nowhere to be found; jealous?
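
Because only the most recent 60 minutes are advertised, the read side reduces to picking a pre-built, per-minute file and pruning anything older. A sketch, with an assumed per-minute directory layout (the real bucket naming isn’t documented here):

    require 'time'

    DATA_DIR = '/var/gnip/activities' # assumed layout: one file per minute bucket

    # Map a publisher and a minute to the static file that gets served back.
    def bucket_path(publisher, at)
      File.join(DATA_DIR, publisher, at.utc.strftime('%Y%m%d%H%M') + '.xml')
    end

    # Keep only the advertised 60-minute window; anything older is deleted.
    def prune(publisher, now = Time.now)
      Dir.glob(File.join(DATA_DIR, publisher, '*.xml')).each do |path|
        minute = Time.strptime(File.basename(path, '.xml'), '%Y%m%d%H%M')
        File.delete(path) if now - minute > 60 * 60
      end
    end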

I want to take a moment and give props to the crew at Digg. With all the folks we’ve interacted with over the past few months on the Publisher side, Digg has nailed the dynamic/flexible web service API; kudos for completeness, consistency, and scalability! Real pros.

From Q’s to NAM

We went through one large backend re-design along the way (so far). We started with a queuing model for data replication across nodes in the system, to ensure that if you put a gun to the head of nodeA, the system would march along its merry way. This gave us a queuing view of the world. It was all well and good, but we wound up with a fairly complex topology of services listening to the bus for this or that; it got complicated. Furthermore, when we moved to EC2 (more on that below), the lack of multicast support meant we’d have to do some super dirty tricks to make listeners aware of event publishers going up and down; it was getting too kludgy.

After some investigation and prototyping, we settled on TerraCotta (a NAM, or network-attached memory, solution) for node replication at the memory level. It kept the app layer simple and, at least on the surface, should be tunable when things hit the fan. We’ve done minimal lock tuning thus far, but are happy with what we see. The prospect of just writing our app and thinking of it as a single thing, rather than worrying about how all this state gets replicated across n nodes, was soooo appealing.

Since going live last week, we’ve found some hot-spots with TerraCotta data replication when we bring up new nodes, so our first real tuning exercise has begun!

EC2

One day we needed seven machines to do load testing. I turned around and asked my traditional hosting provider for “seven machines” and was told “we’ll need 7-10 days to acquire the hardware.” The next day we started prototyping on EC2. Gnip’s traffic can increase/decrease by millions of requests in minutes (machine-to-machine interaction… no end-user click traffic patterns here), and our ability to manage those fluctuations in a cost-effective manner is crucial. With EC2, we can spin up/tear down a nearly arbitrary number of machines without anyone in the food chain having to run to the store to buy more hardware. I was sold!

Moving to EC2 wasn’t free. We had to overcome lame dynamic IP registration limitations, abandon a multicast component solution we’d been counting on (to be fair, EC2 wasn’t the exclusive impetus for this change; things were getting complex), and figure out how to run a system sans local persistent storage. We overcame all of these issues and are very happy on EC2. Rather than rewrite a post around our EC2 experience, however, I’ll just point you to a post on my personal blog if you want more detail.

Without going into painstaking app detail, that’s the gist of how we’ve built our current system.