Copyright © 2010 Gnip, inc.
Gnip makes it easy to build social media tracking tools.
Our designs are rooted in leveraging the most prolific standards on the web today. A big mess of proprietary goop wouldn’t help anyone. HTTP servers have evolved to scale like mad when static data is concerned. So, we set out to host static flat files with commodity HTTP servers (nginx); that layer scales from here to the end of time. Scale trouble comes into the picture when you’re doing a lot of backflips around data at request time. To mitigate this kind of trouble, our API is static as well (to the chagrin of even some of us at Gnip). Gnip doesn’t do any interesting math when data is consumed from it (the read); we serve pre-constructed flat files for each request, which allows us to leverage several standard caches. The Publish side (the write) is where we do some cartwheels to normalize and distill inbound data, and get it into static structures/files for later consumption.
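To make the "static read, busy write" split a little more concrete, here is a minimal sketch in Python. It is not our production code: the one-file-per-minute layout, the JSON encoding, and the names bucket_path and publish are all assumptions made up for illustration. The point is only that the write path does the work, so the read path can be a plain file fetch that nginx and standard caches handle.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def bucket_path(root: Path, publisher: str, when: datetime) -> Path:
    """Return the flat file holding one publisher's events for one minute.

    The per-minute file layout is a hypothetical choice for this sketch.
    """
    return root / publisher / when.strftime("%Y%m%d%H%M.json")


def publish(root: Path, publisher: str, event: dict, when: datetime) -> Path:
    """Write path: do the work now so that reads need no computation."""
    path = bucket_path(root, publisher, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    events = json.loads(path.read_text()) if path.exists() else []
    events.append(event)
    # Write to a temporary file and rename it, so a reader never sees
    # a half-written bucket.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(events))
    tmp.replace(path)
    return path
```

A read is then just an HTTP GET for the file that bucket_path names; no application code runs at request time.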
Being concerned only with message flow and notification, we were able to avoid the nastiest aspect of system scaling: the database. Databases suck; all of them do. “Pick your poison” is your only option. While I’m basking in the glory of not having a database in our midst, I’m not naive enough to think we’ll never need one in the main processing flow, but just let me enjoy it for now. To be clear, we do use Amazon’s SimpleDB to store non-critical-path info such as user account information, but that info’s rarely accessed and easily hangs in memory. The main reason we can avoid a database right now is that we only advertise availability of 60 minutes’ worth of transient events. We don’t have programmatic access to big date ranges, for example. SQL queries are nowhere to be found; jealous?
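Here is a rough Python sketch of why a 60-minute window of transient events doesn't need a database. It is an illustration only, not a description of our implementation: the TransientWindow class and its methods are hypothetical names, and it assumes events arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class TransientWindow:
    """Keep only the events seen within the retention period.

    Assumes events are added with non-decreasing timestamps.
    """

    def __init__(self, retention: timedelta = timedelta(minutes=60)):
        self.retention = retention
        self._events = deque()  # (timestamp, event) pairs, oldest first

    def add(self, event, now: datetime) -> None:
        self._events.append((now, event))
        self._evict(now)

    def recent(self, now: datetime) -> list:
        """Return the events that are still inside the window at `now`."""
        self._evict(now)
        return [event for _, event in self._events]

    def _evict(self, now: datetime) -> None:
        # Drop everything older than the cutoff from the front of the queue.
        cutoff = now - self.retention
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
```

Because old events simply fall off the front, there is nothing to query by date range and nothing to index.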
I want to take a moment and give props to the crew at Digg. With all the folks we’ve interacted with over the past few months on the Publisher side, Digg has nailed the dynamic/flexible web service API; kudos for completeness, consistency, and scalability! Real pros.
We went through one large backend re-design along the way (so far). We started with a queuing model for data replication across nodes in the system to ensure that if you put a gun to the head of nodeA, the system would march along its merry way. This gave us a queuing view of the world. This was all well and good, but we wound up with a fairly complex topology of services listening to the bus for this or that. Furthermore, when we moved to EC2 (more on that below), the lack of multi-cast support meant we’d have to do some super dirty tricks to make listeners aware of event publishers going up and down; it was getting too kludgey.
After some investigation and prototyping, we settled on TerraCotta (a NAM solution) for node replication at the memory level. It kept the app layer simple, and, at least on the surface, should be tunable when things hit the fan. We’ve done minimal lock tuning thus far, but are happy with what we see. The prospect of just writing our app and thinking of it as a single thing, rather than “how does all this state get replicated across n number of nodes” was soooo appealing.
Since going live last week, we’ve found some hot-spots with TerraCotta data replication when we bring up new nodes, so our first real tuning exercise has begun!
One day we needed seven machines to do load testing. I turned around and asked my traditional hosting provider for “seven machines” and was told “we’ll need 7-10 days to acquire the hardware.” The next day we started prototyping on EC2. Gnip’s traffic can increase/decrease by millions of requests in minutes (machine-to-machine interaction… no end-user click traffic patterns here), and our ability to manage those fluctuations in a cost-effective manner is crucial. With EC2, we can spin-up/tear-down a nearly arbitrary number of machines without anyone in the food chain having to run to the store to buy more hardware. I was sold!
Moving to EC2 wasn’t free. We had to overcome lame dynamic IP registration limitations, abandon a multi-cast component solution we’d been counting on (to be fair, EC2 wasn’t the exclusive impetus for this change though; things were getting complex), and figure out how to run a system sans local persistent storage. We overcame all of these issues and are very happy on EC2. Rather than rewrite a post around our EC2 experience, however, I’ll just point you to a post on my personal blog if you want more detail.
Without going into painstaking app detail, that’s the gist of how we’ve built our current system.
Location-based social network Brightkite jumped onto Gnip this morning and is now pushing event notifications for a variety of user actions.
“Brightkite is now publishing all public activity, including checkins, notes and photos to Gnip for easy consumption by 3rd party aggregation services. Integrating Gnip was a breeze: it took us about 20 minutes from looking at the API docs to deploying a working implementation.”
Welcome to the club, guys; I hope that you find your load goes down and your distribution goes up!
A few months ago a handful of folks came together and took a practical look at the state of “web services” on the network today. As an industry we’ve enjoyed the explosion of web APIs over the past several years, but it’s been “every man for himself,” and we’ve been left with hundreds of web APIs being consumed in random ways (random protocols and formats). There have been a few cracks at standardizing some of this, but most have been left in spec form with, at best, fragmented implementations, and most have been too high level to provide anything more than good bedtime reading. We set out to build something; not write a story.
For a great overview of the situation Gnip is plunging into, check out Nik Cubrilovic’s post on TechCrunchIT: “The New Datastream Aggregators, FriendFeed and Standards.”
Our first service is the culmination of lots of work by smart, pragmatic people. From day one we’ve had excellent partners helping us along the way; from early integrations with our API, to discussing specifications and standards to follow (or not to follow; what you choose not to do is often more important than what you choose to do). While we aspire to solve all of the challenges in the data portability space, we’re a small team biting off small chunks along a path. We are going to need the support, feedback, and assistance of the broader data portability (formal & informal) community in order to succeed. Now that we’ve finally launched, we’ll be in “release early, release often” mode to ensure tight feedback loops around our products.
Enough; what did we build!?!
For those who want to cut to the chase, here’s our API doc.
We built a system that connects Data Consumers to Data Publishers in a low-latency, highly scalable, standards-based way. Data can be pushed or pulled into Gnip (via XMPP, Atom, RSS, REST) and it can be pushed or pulled out of Gnip (currently only via REST, but the rest will follow). This release of Gnip is focused on propagating user-generated activity events from point A to point B. Activity XML provides a terse format for Data Publishers to distribute their users’ activities. Collections XML provides a simple way for Data Consumers to only receive information about the users they care about. This release is about “change notification,” and a subsequent release will include the actual data along with the event.
As a Consumer, whether your application model is event- or polling-based, Gnip can get you near-realtime activity information about the users you care about. Our goal is a maximum 60 second latency for any activity that occurs on the network. While the time our service implementation takes to drive activities from end to end is measured in milliseconds, we need some room to breathe.
Data can come in to Gnip via many formats, but it is XSLT’d into a normalized Activity XML format which makes consuming activity events (e.g. “Joe dugg a news story at 10am”) from a wide array of Publishers a breeze. Along the way we started cringing at the verb/activity overlap between various Publishers; did Jane “tweet” or “post”? They’re kinda the same thing. After sitting down with Chris Messina, it became clear that everyone else was cringing too. A verb/activity normalization table has been started, and Gnip is going to distill the cornucopia of activities into a common, community-derived format in order to make consumption even easier.
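To show what a verb/activity normalization table could look like, here is a small Python sketch. Everything in it is made up for illustration: the table entries, the Activity fields, and the normalize function are assumptions, not the community table or our Activity XML schema, and the real format conversion is done with XSLT rather than Python.

```python
from dataclasses import dataclass, replace

# Hypothetical table: (publisher, publisher-specific verb) -> common verb.
VERB_TABLE = {
    ("twitter", "tweet"): "post",
    ("digg", "dugg"): "digg",
    ("brightkite", "checkin"): "checkin",
}


@dataclass(frozen=True)
class Activity:
    """A stand-in for one activity event; field names are illustrative."""
    publisher: str
    actor: str
    verb: str
    at: str


def normalize(activity: Activity) -> Activity:
    """Map a publisher-specific verb onto the common vocabulary.

    Verbs that are not in the table pass through, lowercased.
    """
    key = (activity.publisher.lower(), activity.verb.lower())
    common = VERB_TABLE.get(key, activity.verb.lower())
    return replace(activity, verb=common)
```

With a table like this, a Consumer that cares about “post” events doesn't need to know that one Publisher calls them tweets.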
Data Publishers now have a central clearinghouse to push data when events on their services occur. Gnip manages the relationship with Data Consumers, and figures out which protocols and formats they want to play with. It will take a while for the system to reach equilibrium with Gnip, but once it does, API balance will be reached; Publishers will notify Gnip when things happen, and Gnip will fan-out those events to an arbitrary number of Consumers in real-time (no throttling, no rate limiting).
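The fan-out idea, combined with a Consumer's collection of users, can be sketched in a few lines of Python. This is a toy under stated assumptions, not our service: the Consumer class, its inbox list, and the fan_out function are hypothetical names, and a real delivery would go over the wire rather than into a list.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """A Data Consumer and the set of user IDs it cares about."""
    name: str
    collection: set
    inbox: list = field(default_factory=list)


def fan_out(activity: dict, consumers: list) -> list:
    """Deliver one activity to every Consumer whose collection has the actor.

    Returns the names of the Consumers that were notified.
    """
    notified = []
    for consumer in consumers:
        if activity["actor"] in consumer.collection:
            consumer.inbox.append(activity)
            notified.append(consumer.name)
    return notified
```

The Publisher sends each event once; how many Consumers receive it is Gnip's problem, not the Publisher's.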
Gnip is centralized. After much consternation, we resolved to start out with a centralized model. Not necessarily because we think it’s the best path, but because it is the best path to get something started. Imagine the internet as a clustered application; decentralization is fundamental (DNS comes to mind). That said, we needed a starting point and now we have one. A conversation with Chris Saad highlighted some work Paul Jones (among others) had done around a standard mechanism for change notification discovery and subscription; getpingd. Getpingd describes a mechanism for distributed change notification. The Subscription side of getpingd feels like a no-brainer for Gnip to support, but I’m not sure how to consider the Discovery end of it. In some sense, I see Gnip (assuming getpingd’s discovery model is implemented) as a getpingd node in the graph. We have lots to consider in the federated/distributed model.
Gnip is a classic chicken-and-egg scenario: we need Publishers & Consumers to be interesting. If your service produces events that you want others on the network to consume, we’d love to see you as a Publisher in Gnip; pushing events into the system for wide consumption. If your service relies on events created by users on other applications, we’d love to see you as a Consumer in Gnip.
We’ve started out with convenience libraries for Perl, PHP, Java, Python, and Ruby. Rather than maintain these ourselves, we plan on publishing them to respective language community code sites/repositories.
That’s what we’ve built in a nutshell. I’ll soon blog about exactly how we’ve built it.
Let me say this up front:
I have a tendency to ramble. Why use a sentence when a paragraph will suffice, right? As a result, I limit myself to 100 word posts on my sporadically updated personal blog. I’ll follow suit here, with only occasional excursions into longer territory. This is one such post.
I’ll try not to ramble too much…
Data portability, the ability to create content on one web site and derive value from it on other sites and applications, has become one of the defining characteristics of what is commonly referred to as “Web 2.0”. An emerging class of services is taking advantage of this data to create entirely new products, including social aggregators (Plaxo Pulse, MyBlogLog, FriendFeed), social search (Lijit, Delver) and communications dashboards (Fuser, Orgoo, Digsby). Each of these services is predicated on the belief that user-generated content is the raw material upon which great companies can be built.
Data portability, via RSS or Atom or XMPP or open APIs, is neither difficult nor complex. These are known problems with straightforward solutions and open standards. But each connection between two services (e.g. MyBlogLog and Flickr or Plaxo and Digg) is a custom integration, requiring at least one of the parties to set up a custom channel to access, process and ultimately make use of the transferred data. As companies seek to create robust solutions built upon dozens or even hundreds of data feeds, engineers face a rapidly growing problem of building and maintaining these custom communication channels. Simply put, data portability is a big hassle.
Crucially, data portability has become the cost of entry for these services. It is not enough for a social aggregator to claim the most sources or a social search company the biggest pool of data. The leaders in this space are focused on filtering and presenting data in useful ways; out of a billion pieces of data, they seek to connect you with the appropriate information at the appropriate time. All of the work building and maintaining back-end data portability services comes at the cost of building better front-end features that draw and satisfy users.
That’s where Gnip comes in. We’re dedicated to making data portability suck less, by reducing the effort required to collect and manage the data upon which these awesome new services are being created. Gnip aims to simplify the process of aggregating, standardizing and maintaining large pools of data, ultimately making the process as simple as uploading a list of your users.
Our first service is a solution to a key problem facing data portability implementations (Jud will give you the details in just a moment). We at Gnip believe in direct solutions to painful problems, and as a result, our first service isn’t fancy. But it’s quick to integrate, it scales like a monster and it uses a variety of web standards; we believe we’ve solved this particular problem pretty well. Over the coming months we’ll roll out additional direct solutions to painful problems, and before long we’ll have a bona fide platform for pushing data around the web.
We’re incredibly excited by the bounty that Web 2.0 has created. We are living with an embarrassment of riches in terms of shared information and experiences. But it’s overwhelming. I personally believe that Web 3.0 will herald a return to the individual — story, picture, friend, experience — because in aggregate, that which has great meaning often becomes meaningless. So it’s up to these awesome new services to take the Web 2.0 bounty and find for each of us those few things that will fundamentally enhance our lives. To give us something meaningful.
I hope that we at Gnip can build a foundation that enables these awesome new services to focus all of their attention on making great things. We’ll happily lay plumbing, mix concrete and smelt tin to see that happen.