We’ve been busy over the past several months working hard on what we consider a fundamental piece of infrastructure that the network has been lacking for quite some time. From “ping server for APIs” to “message bus,” we’ve been called a lot of things, and we are actually all of them rolled into one. I want to provide some insight into what our backend architecture looks like, since systems like this generally don’t get a lot of fanfare; they just have to “work.” Another title for this blog post could have been “The Glamorous Life of a Plumbing Company.”
First, some production numbers.
- 99.9%: the Gnip service has maintained 99.9% uptime.
- 0: we have had zero Amazon EC2 instances fail.
- 10: ten EC2 instances of various sizes run the core, redundant message bus infrastructure.
- 2.5m: 2.5 million unique activities are HTTP POSTed (pushed) into Gnip’s Publisher front door each day.
- 2.8m: 2.8 million activities are HTTP POSTed (pushed) out of Gnip’s Consumer back door each day.
- 2.4m: 2.4 million activities are HTTP GETed (polled) from Gnip’s Consumer back door each day.
- $0: no money has been spent on framework licenses (unless you include “AWS”).
Second, our approach.
Simplicity wins. These production transaction rates, while solid, are not earth shattering; we have, however, achieved much higher rates in load tests. We optimized for activity retrieval (outbound) as opposed to delivery into Gnip (inbound). That means every outbound POST/GET moves static data off disk; no math gets done. Every inbound activity, by contrast, triggers the processing needed to ensure proper filtration and distribution; we do the “hard” work on delivery into the system so that retrieval stays cheap.
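To make that write-time fan-out concrete, here’s a minimal sketch in Java (the language our core logic runs in). Every name in it (Activity, Filter, ConsumerBucket, InboundRouter) is a hypothetical stand-in rather than one of our actual classes; the point is simply that filter evaluation happens once, on the inbound path, and the outbound path just reads a precomputed bucket.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of doing the "hard" work on the inbound path.
// These types are hypothetical stand-ins, not Gnip's real classes.
interface Filter {
    boolean matches(Activity activity);
}

class Activity {
    final String payload; // the raw activity document as published
    Activity(String payload) { this.payload = payload; }
}

// One bucket per consumer filter; an outbound GET/POST just drains it.
class ConsumerBucket {
    private final List<String> ready = new CopyOnWriteArrayList<String>();
    void append(String payload) { ready.add(payload); }
    List<String> snapshot() { return ready; } // static data, no math
}

class InboundRouter {
    private final Map<Filter, ConsumerBucket> routes =
        new ConcurrentHashMap<Filter, ConsumerBucket>();

    // Called once per inbound POST: evaluate every filter now, so
    // retrieval later is a plain read of precomputed results.
    void publish(Activity activity) {
        for (Map.Entry<Filter, ConsumerBucket> e : routes.entrySet()) {
            if (e.getKey().matches(activity)) {
                e.getValue().append(activity.payload);
            }
        }
    }
}
```

The trade-off is deliberate: inbound cost scales with the number of filters, but the outbound path, where the traffic is, stays a plain read.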
We view our core system as handling ephemeral data. That has allowed us, thus far, to avoid having a database in the environment, which means we don’t have to deal with traditional database bottlenecks. To be sure, we have other challenges as a result, but we decided to take those on rather than have the “database maintenance and administration” ball and chain perpetually attached. So, in order to share contended state across multiple JVMs, across multiple machine instances, we use shared memory in the form of Terracotta. I’d say Terracotta is “easy” for “simple” apps, but challenges emerge when you start dealing with very large data sets in memory (multiple gigabytes). We’re investing real energy in tuning our object graph, access patterns, and object types to keep things working as Gnip usage increases. For example, we’re in the midst of experimenting with pageable Terracotta structures that ensure smaller chunks of memory can be paged into “cold” nodes.
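We’re not ready to publish the real thing, but the shape of the experiment looks roughly like this hypothetical sketch: instead of one monolithic collection, the shared graph is split into small page objects, so a “cold” node only faults in the pages it touches rather than the whole multi-gigabyte set. (Terracotta shares plain Java object graphs declared as roots in its config, so the code itself stays deliberately ordinary; PagedActivityLog and its page size are illustrative, not our production structure.)

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a "pageable" shared structure. Each page is
// its own small object, so a cold cluster node can fault in a single
// page instead of one enormous collection.
class PagedActivityLog {
    static final int PAGE_SIZE = 1024; // activities per page (made-up value)

    static class Page {
        final List<String> activities = new ArrayList<String>(PAGE_SIZE);
    }

    private final List<Page> pages = new ArrayList<Page>();

    synchronized void append(String activity) {
        Page last = pages.isEmpty() ? null : pages.get(pages.size() - 1);
        if (last == null || last.activities.size() >= PAGE_SIZE) {
            last = new Page(); // start a new, independently pageable chunk
            pages.add(last);
        }
        last.activities.add(activity);
    }

    synchronized List<String> page(int index) {
        return pages.get(index).activities; // touches just this page
    }
}
```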
When I look at the architecture we started with compared to where we are now, there are no radical changes. We chose to start clustered so we could easily add capacity later, and that has worked really well. We’ve had to tune things along the way (splitting various processes onto their own nodes when CPU contention got too high, adjusting object graphs to optimize for the shared memory model, adjusting HTTP timeout settings, and the like), but our core has held strong.
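The HTTP timeout tuning is mundane but worth a sketch: a slow or hung consumer endpoint can’t be allowed to tie up an outbound push indefinitely. Here’s an illustrative example using the stock JDK client; our actual client code differs, and the specific values are made up.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: bound how long an outbound push can hang on a
// slow consumer endpoint. Both the connect and the read need limits.
class PushClient {
    void push(String endpoint, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(2000); // ms to establish the connection
        conn.setReadTimeout(5000);    // ms to wait on the response
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(body);
        out.close();
        int status = conn.getResponseCode(); // forces the request out
        // real code would retry or drop per policy based on status
        conn.disconnect();
    }
}
```

For reference, here’s the stack everything above runs on: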
- nginx – HTTP server, load balancing
- JRE 1.6 – core logic, REST interface
- Terracotta – shared memory for clustering/redundancy
- ejabberd – inbound XMPP server
- Ruby – data importing, cluster management
- Python – data importing
High-Level Core Diagram
Gnip owes all of this to our team & our customers; thanks!