Gnip: The Story Behind the Name

Have you ever thought, “Gnip . . . well, that’s a strange name for a company; what does it mean?” As one of the newest members of the Gnip team, I found myself thinking that very same thing. And as I began telling my friends about this amazing new start-up I was going to be working for in Boulder, Colorado, they too began to ask about the meaning behind the name.

Gnip, pronounced (guh’nip), got its name from the very heart of what we do: realtime social media data collection and delivery. So let’s dive into . . .

Data Collection 101

There are two general methods for data collection: pull technology and push technology. Pull technology is best described as a data transfer in which the request is initiated by the data consumer and responded to by the data publisher’s server. In contrast, with push technology the data publisher’s server initiates the transfer, sending new data to the data consumer as it becomes available.

So why does this matter . . .

Well, most social media publishers use the pull method. This means that the data consumer’s system must constantly go out and “ping” the data publisher’s server, asking, “do you have any new data now?” . . . “how about now?” . . . “and now?” And this can cause a few issues:

  1. Deduplication – If you ping the social media server one second and ping it again a second later and there are no new results, you will receive the same results you got a second ago. This then requires deduplication of the data.
  2. Rate Limiting – Every social media data publisher’s server sets different rate limits, which control the number of times you can ping the server in a given time frame. These rate limits are constantly changing and typically aren’t published. As such, if your server pings the publisher’s server above the rate limit, it can result in a complete shutdown of your data collection, leaving you to figure out why the connection is broken (Is it the API? Is it the rate limit? What is the rate limit?).

So as you can see, pull technology can be a tricky beast.
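
To make the pain concrete, here’s a minimal polling sketch in Python. The feed URL, the JSON shape, and the “id” field are all hypothetical stand-ins for a real publisher’s API; the point is that the consumer has to carry a dedup store and guess at a backoff policy, exactly the two issues listed above.

    import time
    import requests  # third-party HTTP library: pip install requests

    FEED_URL = "https://api.example.com/updates"  # hypothetical publisher endpoint
    seen_ids = set()  # dedup store for results we've already processed

    def poll_once(interval):
        resp = requests.get(FEED_URL)
        if resp.status_code == 429:   # rate limited -- but what IS the limit?
            return interval * 2       # back off and guess
        for item in resp.json():
            if item["id"] in seen_ids:  # same results we got a second ago
                continue
            seen_ids.add(item["id"])
            print("new activity:", item)
        return max(interval / 2.0, 1.0)  # cautiously speed back up

    interval = 1.0
    while True:
        interval = poll_once(interval)
        time.sleep(interval)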

Enter Gnip

Gnip set out to give our customers a choice: receive data via either the push model or the pull model, regardless of the native delivery method of the data publisher’s server. In other words, we wanted to reverse the “ping” process for our customers. Hence, we reversed the word “ping” to get the name Gnip. And there you have it, the story behind the name!

New Beta 3 Update to api.gnip.com

We just finished making a major upgrade to the beta api.gnip.com environment. First, thank you to everyone for their patience during our middle-of-the-day upgrade. We normally schedule upgrades off hours or do a rolling upgrade, but tonight the entire team of ten is going to the Star Trek premiere. Anyway, we made the “management” decision to do the upgrade earlier in the day so it would not run into our company/family event tonight.

What’s new? Up until now the data publishers in api.gnip.com have been doing lazy scheduling, which means they would pull data, but with a small number of rules it was easy to miss data or for a filter never to get a hit in our scheduling. Yeah, that is a beta thing; the primary reason for having the beta out for so long was giving users a chance to move all their existing integrations to the new schema. With Beta 3 we have made some major enhancements to the system that, from what we see in our tests, greatly improve the amount of data flowing across all our data publishers using the new polling services.

From a timeline standpoint, we want to let the new features soak a bit and then lock down a date to take the system to production, possibly as early as the end of May. In the next few days we are running diagnostics and scaling out the system by putting it through its paces, so feel free to do the same or just work with our notification streams on any given publisher.

Live long and prosper. (Sorry, just could not resist the Star Trek line.)

Push & Pull

Over here at Gnip we’re knee-deep in the joys of polling. Our mission to “deliver the web’s data” has us using several approaches to hook consumers up with publisher activities. While we iron out the kinks around our polled publishers, I’m reminded of how broken polling is for many types of data. From rate limiting to minimum poll-interval definitions, polling inherently yields gaps between actions that take place along a timeline. In some cases those gaps are small, and potentially imperceptible, but in others they are large. Viewing one of our internal daily stats charts exemplified this push vs. pull dichotomy. Each color represents a separate publisher (top/black line == total). Guess which publishers are Push (event driven) and which are Pull (polling driven).

[Chart: Gnip Publisher Daily Chart]

The answer: the consistent, connected publishers/lines in the chart are Push (event driven), and the more variable publishers/lines are Pull (poll driven).

Our goal in life is to smooth those variable/jagged lines for the polled publishers, but along the path to data delivery nirvana, I thought I’d share this visual.

Software Evolution

Those of us who have been around for a while constantly joke about how “I remember building that 10 years ago” every time some big “new” trend emerges. It’s always a lesson in market readiness and timing for a given idea. The flurry around Google Chrome has rekindled the conversation around distributed apps. Most folks are tied up in the concept of a “new browser,” but Chrome is actually another crack at the age-old “distributed/server-side application” problem, albeit an apparently good one. The real news in Chrome (I’ll avoid the V8 vs. TraceMonkey conversation for now) is native Google Gears support.

My favorite kind of technology is the kind that quietly gets built, then one day you wake up and it’s changed everything. Google Gears has that potential, and if Chrome winds up with meaningful distribution (or Firefox adopts Gears), web apps as we know them will finally have markup-level access to local resources (read: “offline functionality”). This kind of evolution is long overdue.

Another lacking component on the network is the age-old, CS101 notion of event-driven architectures. HTTP GET dominates web traffic, and poor ol’ HTTP POST is rarely used. Publish and subscribe models are all but unused on the network today, and Gnip aims to change that. We see a world that is PUSH driven rather than PULL. The web has come a looooong way on GET, but apps are desperate for traditional flow paradigms such as local processor event loops. Our goal is to do this in a protocol-agnostic manner (e.g. REST/HTTP POST, XMPP, perhaps some distributed queuing model).
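
For contrast with the polling sketch above, here’s what the POST-driven side can look like: a minimal event receiver written with nothing but Python’s standard library. The port and the payload format are assumptions for illustration; the point is that the publisher initiates the request, so there is no polling loop and no rate limit to trip over.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class EventReceiver(BaseHTTPRequestHandler):
        def do_POST(self):
            # the publisher pushes each event to us as it happens
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            print("event received:", body.decode("utf-8"))
            self.send_response(200)
            self.end_headers()

    # no pinging, no backoff -- we just listen
    HTTPServer(("", 8080), EventReceiver).serve_forever()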

Watching today’s web apps poll each other to death is hard. With each new product that integrates serviceX, the latency of serviceX’s events propagating through the ecosystem degrades, and everyone loses. This is a broken model that, if left unresolved, will drive our web apps back into the dark ages once all the web service endpoints are overburdened to the point of being uninteresting.

We’ve seen fabulous adoption of our API since launching a couple of months ago. We hope that more Data Producers and Data Consumers leverage it going forward.

That Twitter Thing

Oh, crap, Eric’s gone and written another long post…

Since we publicly launched Gnip last week, we’ve been asked numerous times if we can integrate with Twitter or somehow help Twitter with the scaling issues they are facing.  We can, but we depend on Twitter giving us access to their XMPP feed.

We are huge fans of Twitter, so we’re patiently waiting for that access. In the meantime, the questions we’ve received have prompted us to explain two things: (1) how we would benefit Twitter and anyone who wants access to Twitter data, and (2) why – if you are a web service – it’s worth integrating with Gnip now rather than waiting either for (a) Gnip to integrate with Twitter or (b) you to get as popular as Twitter and have scale issues.

Let’s address the first issue: how we would benefit Twitter and anyone who wants to integrate with Twitter data.

Twitter has found that XMPP doesn’t scale for them, and as a result people are forced to poll their API *a lot* to get updates for their users. MyBlogLog has over 25,000 Twitter users that they throw against the Twitter API every 15 minutes. At 96 polls per user per day, that’s nearly 2.5 million queries against the API every day, for maybe 250K updates. Now add millions of pings from Plaxo and SocialThing and Lijit, and heaven forbid Yahoo starts beating up their API…

If Twitter starts pushing updates to us, via our dead-simple API or Atom or their XMPP server, we can immediately reduce by an order of magnitude the number of requests that some very large sites are making against their API. At the same time, we reduce the latency between when someone Tweets and when it shows up on consuming sites like Plaxo: from 15 minutes or more to 60 seconds or less.

We expect that Twitter has their collective heads down and is working around the clock to buttress their infrastructure, and it’s unlikely that they’re going to do anything optional until that’s sorted out. Unfortunately, “integrate with Gnip” probably falls into the optional category. We expect, however, that at some point Twitter will start opening up their data to more partners once they feel like they have their arms around their infrastructure.

If you run a web service and integrate with Gnip today, you’ll automatically be able to integrate with Twitter data once they give us access. Presumably you won’t have to wait in line to get direct Twitter integration. In addition, you’ll have immediate access to all of the other data providers that we integrate with, such as Delicious, Flickr, Magnolia, Get Satisfaction, Intense Debate, and Six Apart. For example, it took Brightkite only 15 minutes to integrate our API and start pushing data to our partners via us.

Now for the second topic: why – if you are a web service – it’s worth integrating with Gnip now rather than waiting either for (a) Gnip to integrate with Twitter or (b) you to get as popular as Twitter and have scale issues.

All things considered, it’s best not to end up in Twitter’s position. They have a ton of passionate users (I’m one of them) who want reliable service and don’t have infinite patience. The old startup cliché of “these are problems we’d like to have” is crap.

You don’t want to be in the position where your business suddenly takes off and your infrastructure falls over because people are banging your APIs to death. You don’t want your most passionate users calling for mass exodus. It’s better to take a few minutes to start pushing notifications to Gnip now than to do it when you’re pulling 20-hour days rebooting servers.

You also don’t want to be in the position where your company takes off and you suddenly get throttled by an API provider. Nothing is worse than having to pull data sources because you’ve over-polled and the host decides to turn off the spigot. Start pulling notifications from Gnip and feel secure that you’re only asking for data when there’s something new.

I still use Twitter every day.  Don’t try to kid me; I know you still do too.  Let them get on with their work and rest assured that we’ll integrate with them the instant we get the okay from them.

The WHAT of Gnip: Changing APIs from Pull to Push

A few months ago a handful of folks came together and took a practical look at the state of “web services” on the network today. As an industry we’ve enjoyed the explosion of web APIs over the past several years, but it’s been “every man for himself,” and we’ve been left with hundreds of web APIs being consumed in random ways (random protocols and formats). There have been a few cracks at standardizing some of this, but most have been left in spec form with, at best, fragmented implementations, and most have been too high level to provide anything more than good bedtime reading. We set out to build something; not write a story.

For a great overview of the situation Gnip is plunging into, check out Nik Cubrilovic’s post on techcrunchIT: “The New Datastream Aggregators, FriendFeed and Standards.”

Our first service is the culmination of lots of work by smart, pragmatic people. From day one we’ve had excellent partners helping us along the way, from early integrations with our API to discussing specifications and standards to follow (or not to follow; what you choose not to do is often more important than what you choose to do). While we aspire to solve all of the challenges in the data portability space, we’re a small team biting off small chunks along a path. We are going to need the support, feedback, and assistance of the broader data portability (formal & informal) community in order to succeed. Now that we’ve finally launched, we’ll be in “release early, release often” mode to ensure tight feedback loops around our products.

Enough; what did we build!?!

For those who want to cut to the chase, here’s our API doc.

We built a system that connects Data Consumers to Data Publishers in a low-latency, highly scalable, standards-based way. Data can be pushed or pulled into Gnip (via XMPP, Atom, RSS, REST), and it can be pushed or pulled out of Gnip (currently only via REST, but the rest to follow). This release of Gnip is focused on propagating user-generated activity events from point A to point B. Activity XML provides a terse format for Data Publishers to distribute their users’ activities. Collections XML provides a simple way for Data Consumers to receive information only about the users they care about. This release is about “change notification,” and a subsequent release will include the actual data along with the event.
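
To make that concrete, here’s an illustrative sketch of consuming a normalized activity event. The element and attribute names below are paraphrased from the description above purely for illustration; they are not the exact Activity XML schema (see the API doc for the real thing).

    import xml.etree.ElementTree as ET

    # Illustrative shape only; consult the API doc for the actual schema.
    activities_xml = """
    <activities>
      <activity at="2008-07-02T10:00:00Z" action="dugg" actor="joe"
                url="http://digg.com/a-news-story"/>
    </activities>
    """

    for activity in ET.fromstring(activities_xml):
        print(activity.get("actor"), activity.get("action"), "at", activity.get("at"))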


As a Consumer, whether your application model is event-based or polling-based, Gnip can get you near-realtime activity information about the users you care about. Our goal is a maximum 60-second latency for any activity that occurs on the network. While the time our service implementation takes to drive activities from end to end is measured in milliseconds, we need some room to breathe.

Data can come into Gnip via many formats, but it is XSLT’d into a normalized Activity XML format, which makes consuming activity events (e.g. “Joe dugg a news story at 10am”) from a wide array of Publishers a breeze. Along the way we started cringing at the verb/activity overlap between various Publishers; did Jane “tweet” or “post”? They’re kinda the same thing. After sitting down with Chris Messina, it became clear that everyone else was cringing too. A verb/activity normalization table has been started, and Gnip is going to distill the cornucopia of activities into a common, community-derived format in order to make consumption even easier.
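
A toy version of that normalization table might look like the following; the mappings shown are invented for illustration, and the community-derived table is the real source of truth.

    # Hypothetical verb normalization: publisher-specific verbs collapse
    # into one community-derived activity type.
    VERB_MAP = {
        "tweet": "post",
        "dugg": "vote",
        "favorited": "save",
        "bookmarked": "save",
    }

    def normalize(verb):
        return VERB_MAP.get(verb, verb)  # unknown verbs pass through unchanged

    print(normalize("tweet"))  # -> post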

Data Publishers now have a central clearinghouse to push data to when events occur on their services. Gnip manages the relationship with Data Consumers and figures out which protocols and formats they want to play with. It will take a while for the system to reach equilibrium with Gnip, but once it does, API balance will be reached: Publishers will notify Gnip when things happen, and Gnip will fan out those events to an arbitrary number of Consumers in real time (no throttling, no rate limiting).
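
The fan-out half of that balance fits in a few lines. Here’s a sketch, with hypothetical consumer endpoints: one inbound notification from a Publisher becomes N outbound POSTs, and the Consumers never have to poll anyone.

    import requests  # third-party HTTP library: pip install requests

    # Hypothetical endpoints for Consumers subscribed to this Publisher
    consumer_endpoints = [
        "https://consumer-a.example.com/events",
        "https://consumer-b.example.com/events",
    ]

    def fan_out(activity_xml):
        # one event in, N events out -- no throttling, no rate limiting
        for endpoint in consumer_endpoints:
            requests.post(endpoint, data=activity_xml,
                          headers={"Content-Type": "application/xml"})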

Gnip is centralized. After much consternation, we resolved to start out with a centralized model: not necessarily because we think it’s the best path, but because it is the best path to get something started. Imagine the internet as a clustered application; decentralization is fundamental (DNS comes to mind). That said, we needed a starting point, and now we have one. A conversation with Chris Saad highlighted some work Paul Jones (among others) had done around a standard mechanism for change notification discovery and subscription: getpingd. Getpingd describes a mechanism for distributed change notification. The Subscription side of getpingd feels like a no-brainer for Gnip to support, but I’m not sure how to consider the Discovery end of it. In some sense, I see Gnip (assuming getpingd’s discovery model is implemented) as a getpingd node in the graph. We have lots to consider in the federated/distributed model.

Gnip faces a classic chicken-and-egg scenario: we need both Publishers and Consumers to be interesting. If your service produces events that you want others on the network to consume, we’d love to see you as a Publisher in Gnip, pushing events into the system for wide consumption. If your service relies on events created by users on other applications, we’d love to see you as a Consumer in Gnip.

We’ve started out with convenience libraries for Perl, PHP, Java, Python, and Ruby. Rather than maintain these ourselves, we plan on publishing them to the respective language communities’ code sites/repositories.

That’s what we’ve built in a nutshell. I’ll soon blog about exactly how we’ve built it.