Why You Should Join Gnip

Gnip’s business is growing heartily. As a result, we need to field current demand, refine our existing product offering, and expand into completely new areas in order to deliver the web’s data. From a business standpoint we need to grow our existing sales team in order to capture as much of our traditional market as possible, as fast as possible. We also need to leverage established footholds in new verticals, and turn those into businesses as big as, or hopefully bigger than, our current primary market. The sales and business-line expansion at Gnip is in full swing, and we need more people on the sales and business team to help us achieve our goals.

From a technical standpoint I don’t know where to begin. We have a large existing customer base that we need to keep informed, help optimize, and generally support; we’re hiring technical support engineers. Our existing system scales just fine, but software was meant to iterate, and we have learned a lot about handling large volumes of real-time data streams, across many protocols and formats, for ultimate delivery to large numbers of customers. We want to evolve the current system to even better leverage computing resources, and provide a more streamlined customer experience. We’ve also bitten off a historical data set indexing challenge that is well… of true historical proportion. The historical beast needs feeding, and it needs big brains to feast on. We need folks who know Java very well and have search, indexing, and large data-set management backgrounds.

On the system administration side of things… if you like to twiddle IP tables, tune MTUs to optimize high-bandwidth data flows across broad geographic regions, or handle high-volume/bandwidth streaming content, then we’d like to hear from you. We need even more sys admin firepower.

Gnip is a technical product, with a technical sale. Our growth has us looking to offload a lot of the Sales Engineering support that the dev team currently takes on. Consequently, we’re looking to hire a Sales Engineer as well.

Gnip has a thriving business. We have a dedicated, passionate, intelligent team that knows how to execute. We’re building hard technology that has become a critical piece of the social media ecosystem. Gnip is also located in downtown Boulder, CO.


Twitter XML, JSON & Activity Streams at Gnip

About a month ago Twitter announced they will be shutting off XML for stream-based endpoints on Dec. 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you’ll be forced to get your head around something new. You’ll have to get smart, find the right JSON lib, change your code to use it (and any associated dependencies you weren’t already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.

As it turns out, Gnip’s format normalization shields you from all of this. Gnip customers get to stay focused on delivering value to their customers. Others integrating directly, and consuming stream data from Twitter in XML, have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).
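
To make the shielding idea concrete, here is a minimal sketch of format normalization in Python. The payloads, the field names (`id`, `text`, `screen_name`), and the `normalize_xml`/`normalize_json` helpers are illustrative assumptions, not Twitter’s actual schema or Gnip’s implementation. The point is only that code consuming the normalized shape never sees which wire format was used upstream.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads: the field names are assumptions for illustration,
# not the exact Twitter schema.
XML_TWEET = """<status>
  <id>123</id>
  <text>Hello from XML</text>
  <user><screen_name>jane</screen_name></user>
</status>"""

JSON_TWEET = '{"id": 123, "text": "Hello from JSON", "user": {"screen_name": "jane"}}'


def normalize_xml(payload):
    """Map an XML status document onto a common activity shape."""
    root = ET.fromstring(payload)
    return {
        "id": int(root.findtext("id")),
        "actor": root.findtext("user/screen_name"),
        "body": root.findtext("text"),
    }


def normalize_json(payload):
    """Map a JSON status document onto the same common activity shape."""
    doc = json.loads(payload)
    return {
        "id": int(doc["id"]),
        "actor": doc["user"]["screen_name"],
        "body": doc["text"],
    }
```

If the upstream format changes, only the matching `normalize_*` function has to be touched; anything reading the `id`/`actor`/`body` shape keeps working.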

From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).


Social Media in Natural Disasters

Gnip is located in Boulder, CO, and we’re unfortunately experiencing a spate of serious wildfires as we wind Summer down. Social media has been a crucial source of information for the community here over the past week as we have collectively Tweeted, Flickred, YouTubed and Facebooked our experiences. Mashups depicting the fires and associated social media quickly started emerging after the fires started. VisionLink (a Gnip customer) produced the most useful aggregated map of official boundary & placemark data, coupled with social media delivered by Gnip (click the “Feeds” section along the left-side to toggle social media); screenshot below.

Visionlink Gnip Social Media Map

With Gnip, they started displaying geo-located Tweets, then added Flickr photos with the flip of a switch. No new messy integrations that required learning a new API with all of its rate limiting, formatting, and delivery protocol nuances. Simple selection of data sources they deemed relevant to informing a community reacting, real-time, to a disaster.

It was great to see a firm focus on their core value proposition (official disaster relief data), and quickly integrate relevant social media without all the fuss.

Our thoughts are with everyone who was impacted by the fires.

PubSubHubbub (PuSH), Google and Buzz

Setting the quality, validity, and longevity of Google Buzz as a product aside, here’s a first reaction to its PubSubHubbub based API.

I love the pubsub model, because driving applications via events, vs. polling, is almost always advantageous, and certainly more efficient. Gnip has a chapter in O’Reilly’s Beautiful Data wherein we go deeper into why the world should be event driven rather than founded on incessant polling. bslatkin also has a good post on the topic (Why Polling Sucks).

Over the past few days we’ve built Google Buzz support into the Gnip offering, which has allowed me to finally dig into PuSH Subscription at the implementation level. Mike Barinek, previously with Gnip, built a Ruby PuSH hub, but I haven’t gone that deep yet.

Some PuSH Subscriber thoughts…

  • PuSH lacks support for batch topic subscription requests. This is a bummer when your customers want to subscribe to large numbers of topics, as you have to one-off each subscription request. Unfortunately, I don’t see an easy way to extend the protocol to allow for batching, as the request acknowledgment semantics are baked into the HTTP response code itself, rather than a more verbose HTTP body.
  • Simple and lightweight. As far as pubsub protocols go, PuSH is nice and neat. Good leverage, and definition, of how HTTP should be used to communicate the bare minimum. While in the bullet above I complain that I want some expandability on this front, which would pollute things a bit, the simplicity of the protocol can’t be argued with.
  • Google’s Hub
    • Happily accepts, and returns success for, batch topic subscription requests, when in fact all topics aren’t actually subscribed. Bug.
    • Is the most consistent app I’ve seen WRT predictable HTTP interaction patterns. Respectfully sends back 503/retry-afters when it needs to, and honors them. I wish I could say this about a dozen other HTTP interfaces I have to interact with.
    • Is fast to field subscription requests. However, the queue on the back that shuffles events through the system has proven inconsistent and flaky. I don’t think I’ve lost any data, but the latency and order in which events move through it isn’t as consistent as I’d like. In order for event driven architectures to work, this needs to be tightened up.
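
The one-request-per-topic limitation and the 503/retry-after behavior noted above can be sketched in Python. The parameter names (`hub.mode`, `hub.topic`, `hub.callback`, `hub.verify`) come from the PubSubHubbub spec; the hub URL, callback URL, helper names, and the 60-second default are hypothetical, and nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

HUB_URL = "https://hub.example.com/"                   # hypothetical hub endpoint
CALLBACK_URL = "https://subscriber.example.com/push"   # hypothetical callback


def subscription_request(topic):
    """Build (but do not send) a single-topic PuSH subscribe request."""
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": CALLBACK_URL,
        "hub.verify": "async",
    }).encode("ascii")
    return Request(
        HUB_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def subscription_requests(topics):
    """The protocol has no batching, so build one request per topic."""
    return [subscription_request(topic) for topic in topics]


def retry_delay(status, headers, default=60):
    """Seconds to wait before retrying a request.

    Honors a numeric Retry-After header on a 503. A date-valued
    Retry-After is not parsed in this sketch and falls back to `default`.
    """
    if status != 503:
        return 0
    value = headers.get("Retry-After", "")
    return int(value) if value.isdigit() else default
```

Because the hub’s acknowledgment is carried entirely by the status code of each response, a subscriber with a thousand topics ends up issuing a thousand of these requests and tracking each result separately.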

Here’s to event driven systems!

Of Client-Server Communication

We’ve recently been having some interesting conversations, both internally and with customers, about the challenges inherent in client-server software interaction, aka Web Services or Web APIs. The relatively baked state of web browsers and servers has shielded us from most of the issues that come with getting computers to talk to other computers.

It didn’t happen overnight, but today’s web browsing world rides on top of a well vetted pipeline of technology to give us good browsing (client-side) experiences. However, there are a lot of assumptions and moving parts behind our browser windows that get uncovered when working with web services (servers). Unfortunately, there are skeletons in the closet.

End-users’ web browsing demands eventually forced ports 80 and 443 (SSL) open across all firewalls and ISPs and we now take their availability for granted. When was the last time you heard someone ask “is port 80 open?” It’s probably been a while. By 2000, server-side HTTP implementations (web servers) started solidifying and at the HTTP-level client and server tier there was relatively little incompatibility. Expectations around socket timeouts and HTTP protocol exchanges were clear, and both sides of the connection adhered to those expectations.

Enter the world of web-services/APIs.

We’ve been enjoying the stable client-server interaction that web browsing has provided over the past 15 years, but web services/APIs thrust the ugly realities that lurk beneath into view. When we access modern web services through lower-level software (e.g. something other than the browser), we have to make assumptions and implementation/configuration choices that the browser otherwise makes for us. Among them…

  • port to use for the socket connection
    • the browser assumes you always want ’80’ (HTTP) or ‘443’ (HTTPS)
    • the browser provides built-in encryption handling of HTTPS/SSL
  • URL parsing
    • the browser uses static rules for interpreting and parsing URLs
  • HTTP request methods
    • browsers inherently know when to use GET vs. POST
  • HTTP POST bodies.
    • browsers pre-define how POST bodies are structured, and never deviate from this methodology
  • HTTP header negotiation (this is the big one).
    • browsers handle all of the following scenarios out-of-the-box
    • Request
      • compression support (e.g. gzip)
      • connection duration types (e.g. keep-alive)
      • authentication (basic/oauth/other)
      • user-agent specification
    • Response
      • chunked responses
      • content-types. the browser has a pre-defined set of content types that it knows how to handle internally.
      • content-encoding. the browser knows how to handle various encoding types (e.g. gzip compression), and does so by default
      • authentication (basic/oauth/other)
  • HTTP Response body formats/character sets/encodings
    • browsers juggle the combination between content-encoding, content-type, and charset handling to ensure their international audience can see the information as its author intended.
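
As an illustration of the response-side work in that list, here is a minimal Python sketch of what a non-browser client has to do by hand: undo the content-encoding, then pick the charset out of the content-type. The `decode_body` helper, the plain-dict header representation, and the UTF-8 fallback are assumptions made for the example, not a prescribed approach.

```python
import gzip
from email.message import Message


def decode_body(raw, headers):
    """Turn raw response bytes into text using the response headers.

    `headers` is a plain dict of header name -> value (an assumption for
    this sketch; real HTTP libraries expose their own header objects).
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # 1. Content-Encoding: undo gzip compression if the server applied it.
    if lowered.get("content-encoding", "").lower() == "gzip":
        raw = gzip.decompress(raw)

    # 2. Content-Type: extract the charset parameter, falling back to UTF-8.
    msg = Message()
    msg["Content-Type"] = lowered.get("content-type", "text/plain")
    charset = msg.get_content_charset() or "utf-8"

    return raw.decode(charset)
```

Getting either step wrong produces the classic symptoms: binary garbage where text was expected, or accented characters that come out mangled.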

Web browsers have the luxury of being able to lock down all of the above variables and not worry about changes in these assumptions. Having built browsers (Netscape/Firefox) for a living in the past, I can say it’s still a very difficult task, but at least the problem is constrained (e.g. ensure the end user can view the content within the browser). Web service consumers have to understand, and make decisions around, each of those points. Getting just one of them wrong can lead to issues in your application. These issues can range from connectivity or content handling problems to service authentication failures, and can lead to long guessing games of “what went wrong?”

To further complicate the API interaction pipeline, many IT departments prevent abnormal connection activity from occurring. This means that while your application may be “doing the right thing” (TM), a system that sits between your application and the API with which it is trying to interact may prevent the exchange from occurring as you intended.

What To Do?

First off, you need to be versed in more than just the documentation of the API you’re trying to use. Documentation is often outdated and doesn’t reflect actual implementations or account for bugs and behavioral nuances inherent in any API, so you also need to engage with its developer community/forums. From there, you need to ensure your HTTP client accounts for the assumptions I outline above and adheres to the API you’re interacting with. If you’re experiencing issues you’ll need to ensure your code is establishing the connection successfully, receiving the data it’s expecting, and parsing the data correctly. Never underestimate the value of a packet sniffer for viewing the raw HTTP exchange between your client and the server; debugging HTTP libraries at the code level (even with logging) often doesn’t yield the truth behind what’s being sent to the server and received.

The Power of cURL

This is an entire blog post in and of itself, but the Swiss Army knife of any web service developer is cURL. In the right hands, cURL allows you to easily construct HTTP requests to test interaction with a web service. Don’t underestimate the work of translating your cURL test into your software, however.
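
As a small illustration of that translation step, the sketch below pairs a cURL invocation (shown in a comment) with roughly equivalent Python. The endpoint, credentials, payload, and user-agent string are hypothetical; the thing to notice is that what cURL does implicitly for `-u` and `-d` has to be spelled out in code.

```python
import base64
import json
from urllib.request import Request

# The cURL test being translated (hypothetical endpoint and credentials):
#
#   curl -X POST https://api.example.com/v1/items \
#        -u alice:s3cret \
#        -H "Content-Type: application/json" \
#        -H "Accept-Encoding: gzip" \
#        -d '{"name": "widget"}'


def build_request(url, user, password, payload):
    """Build the same request explicitly, without sending it."""
    # curl's -u flag becomes a hand-built Basic Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept-Encoding": "gzip",
            "User-Agent": "example-client/0.1",  # hypothetical; curl sends its own
        },
    )


request = build_request(
    "https://api.example.com/v1/items", "alice", "s3cret", {"name": "widget"}
)
```

Comparing the request your code builds against the one cURL sends, header by header, is often the quickest way to find the difference that makes one work and the other fail.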

Today's API Guessing Game

Intro with Joshua’s digitized voice (from 1983’s WarGames) asking “shall we play a game?”

I’ve spent the better part of the past couple of days playing a game. I was chasing down some odd polling behavior observed in one of our internal prototype applications. It ultimately turned out to be some bad assumptions I made around how some code I wrote should behave. The rate limiting policy around one of the open APIs I was using was obfuscated.

The scenario reminded me of a challenge I faced earlier in my career at Netscape. We were trying to figure out how Netscape/Mozilla open source should function (early on in Mozilla’s life… pre-independence from AOL; e.g. 1998). We struggled managing corporate needs, sometimes around confidentiality, in the context of open source. Mozilla wouldn’t work if things that impacted the open source software on the Netscape-side of the engineering house weren’t openly discussed. As predicted, innovation suffered when significant code contributions being made by Netscape weren’t transparent. Netscape was faced with staying quiet about its intentions, or being open with them. Open sourcing the code (e.g. having an “open” API) wasn’t enough. The process by which the code/API was to evolve and function had to be open.

Netscape/AOL weren’t able to let go of key, though seemingly small, aspects of the project and innovation waned. Mozilla/Firefox didn’t explode until there was a formal transition from AOL to Mozilla Foundation many years later. While Firefox has pushed the industry forward by leaps and bounds since then, there were years of browser industry confusion and impedance due to a non-committal controlling interest.

The parallel I’m drawing between Netscape/Mozilla’s history and today’s “open” web APIs is that there are key players chokeholding the rest of the industry with inadequately supported, poorly communicated, API access policies. Access policies, while sometimes documented, are highly irregular and poorly communicated. The result is a developing ecosystem around these APIs that has to decide whether or not to play the API access guessing game. When a developer using some of today’s open APIs wakes up and rolls out of bed each morning they wonder “will my application work today?” That’s untenable in the long term.

Just as it was Netscape’s right to control the bits it wanted to, Kings of today’s API hill have a right to do whatever they want. To those who’ve been successful at creating unyielding demand; hats off! Use that power wisely however, and learn from history’s mistakes.

To APIs crying uncle due to the operational overload of their popularity I recommend moving to an event driven API access model (à la Gnip). When that’s simply not possible (though I’d argue it always is) use something like SUP to minimize constipation in the rest of the digestive system.

If you’re throttling access to your API because you don’t know what your business model is, hurry up, get it sorted, and communicate intentions. If you don’t, industry will find a way to pass you by.

Winding Down XMPP, for Now

Without going into a full blown post about XMPP, our take is that it’s a good model / protocol, with too many scattered implementations which is leaving it in the “immature” bucket. Apache wound up setting the HTTP standard, and an XMPP server equivalent hasn’t taken hold in the marketplace.

From Gnip’s perspective, XMPP is causing us pain and eating cycles.  More than half of all customer service requests are about XMPP and in many cases, the receiving party isn’t standing up their own server.  They’re running off of Google or Jabber.org and there’s not much we can do when they get throttled. As a result, we’ve decided that we should eliminate XMPP (both in/out bound) as soon as possible. Outbound will be shut off with our next code push on Wednesday; we’ll cut inbound when Twitter finds another way to push to us.

For the foreseeable future, our world revolves around increasing utility by adding to the breadth of publishers in our system.  Features / functionality that support that goal are, with few exceptions, our only priority and XMPP support isn’t in that mix.  Expect our first releases of hosted polling and usage statistics later this month.  We’ll reevaluate XMPP support when either a) we have cycles or b) a significant number of partners request it.

The WHAT of Gnip: Changing APIs from Pull to Push

A few months ago a handful of folks came together and took a practical look at the state of “web services” on the network today. As an industry we’ve enjoyed the explosion of web APIs over the past several years, but it’s been “every man for himself,” and we’ve been left with hundreds of web APIs being consumed in random ways (random protocols and formats). There have been a few cracks at standardizing some of this, but most have been left in spec form with, at best, fragmented implementations, and most have been too high level to provide anything more than good bedtime reading. We set out to build something; not write a story.

For a great overview of the situation Gnip is plunging into, check out Nik Cubrilovic’s post on techcrunchIT, “The New Datastream Aggregators, FriendFeed and Standards.”

Our first service is the culmination of lots of work by smart, pragmatic people. From day one we’ve had excellent partners helping us along the way; from early integrations with our API, to discussing specifications and standards to follow (or not to follow; what you choose not to do is often more important than what you choose to do). While we aspire to solve all of the challenges in the data portability space, we’re a small team biting off small chunks along a path. We are going to need the support, feedback, and assistance of the broader data portability (formal & informal) community in order to succeed. Now that we’ve finally launched, we’ll be in “release early, release often” mode to ensure tight feedback loops around our products.

Enough; what did we build!?!

For those who want to cut to the chase, here’s our API doc.

We built a system that connects Data Consumers to Data Publishers in a low-latency, highly scalable, standards-based way. Data can be pushed or pulled into Gnip (via XMPP, Atom, RSS, REST) and it can be pushed or pulled out of Gnip (currently only via REST, but the rest to follow). This release of Gnip is focused on propagating user generated activity events from point A to point B. Activity XML provides a terse format for Data Publishers to distribute their users’ activities. Collections XML provides a simple way for Data Consumers to only receive information about the users they care about. This release is about “change notification,” and a subsequent release will include the actual data along with the event.


As a Consumer, whether your application model is event- or polling-based Gnip can get you near-realtime activity information about the users you care about. Our goal is a maximum 60 second latency for any activity that occurs on the network. While the time our service implementation takes to drive activities from end to end is measured in milliseconds, we need some room to breathe.

Data can come in to Gnip via many formats, but it is XSLT’d into a normalized Activity XML format which makes consuming activity events (e.g. “Joe dugg a news story at 10am”) from a wide array of Publishers a breeze. Along the way we started cringing at the verb/activity overlap between various Publishers; did Jane “tweet” or “post”, they’re kinda the same thing? After sitting down with Chris Messina, it became clear that everyone else was cringing too. A verb/activity normalization table has been started, and Gnip is going to distill the cornucopia of activities into a common, community derived, format in order to make consumption even easier.
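
A verb normalization table of that kind can be pictured as a simple lookup. The Python sketch below is purely illustrative: the publisher names, verbs, and mappings are made up for the example and are not the actual table, and `normalize_verb` is a hypothetical helper.

```python
# Hypothetical mapping of (publisher, publisher-specific verb) -> common verb.
VERB_TABLE = {
    ("twitter", "tweet"): "post",
    ("twitter", "post"): "post",
    ("digg", "dugg"): "favorite",
    ("flickr", "upload"): "post",
}


def normalize_verb(publisher, verb):
    """Return the common verb, or the lowercased original if no mapping is known."""
    key = (publisher.lower(), verb.lower())
    return VERB_TABLE.get(key, verb.lower())
```

With a table like this, a Consumer that cares about “post” activities does not need to know that one Publisher calls it a “tweet” and another calls it an “upload”.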

Data Publishers now have a central clearinghouse to push data when events on their services occur. Gnip manages the relationship with Data Consumers, and figures out which protocols and formats they want to play with. It will take a while for the system to reach equilibrium with Gnip, but once it does, API balance will be reached; Publishers will notify Gnip when things happen, and Gnip will fan-out those events to an arbitrary number of Consumers in real-time (no throttling, no rate limiting).
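
The publish-once, fan-out-to-many flow described above can be sketched as a tiny in-memory broker in Python. The `Broker` class, its method names, and the event shape are assumptions for illustration only; a real system would deliver over the network and persist its subscriptions.

```python
from collections import defaultdict


class Broker:
    """Minimal in-memory fan-out: publishers push once, every consumer is notified."""

    def __init__(self):
        # publisher name -> list of consumer callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, publisher, callback):
        """Register a consumer callback for one publisher's events."""
        self._subscribers[publisher].append(callback)

    def publish(self, publisher, event):
        """Push an event to every consumer of this publisher; return the count."""
        callbacks = self._subscribers.get(publisher, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


# Two consumers subscribe to the same (hypothetical) publisher.
broker = Broker()
inbox_a, inbox_b = [], []
broker.subscribe("digg", inbox_a.append)
broker.subscribe("digg", inbox_b.append)
delivered = broker.publish("digg", {"actor": "joe", "verb": "dugg"})
```

The Publisher makes one call no matter how many Consumers are listening, which is the property that lets it stop worrying about throttling individual pollers.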

Gnip is centralized. After much consternation, we resolved to start out with a centralized model. Not necessarily because we think it’s the best path, but because it is the best path to get something started. Imagine the internet as a clustered application; decentralization is fundamental (DNS comes to mind). That said, we needed a starting point and now we have one. A conversation with Chris Saad highlighted some work Paul Jones (among others) had done around a standard mechanism for change notification discovery and subscription; getpingd. Getpingd describes a mechanism for distributed change notification. The Subscription side of getpingd feels like a no-brainer for Gnip to support, but I’m not sure how to consider the Discovery end of it. In some sense, I see Gnip (assuming getpingd’s discovery model is implemented) as a getpingd node in the graph. We have lots to consider in the federated/distributed model.

Gnip is a classic chicken-and-egg scenario; we need Publishers & Consumers to be interesting. If your service produces events that you want others on the network to consume, we’d love to see you as a Publisher in Gnip; pushing events into the system for wide consumption. If your service relies on events created by users on other applications, we’d love to see you as a Consumer in Gnip.

We’ve started out with convenience libraries for Perl, PHP, Java, Python, and Ruby. Rather than maintain these ourselves, we plan on publishing them to the respective language communities’ code sites/repositories.

That’s what we’ve built in a nutshell. I’ll soon blog about exactly how we’ve built it.