From API Consumers to API Designers: A Wish List

At Gnip, we spend a large part of our days integrating with third party APIs in the Social Media space. As part of this effort, we’ve come up with some API design best practices.

Use Standard HTTP Response Codes

HTTP has been around since the early '90s, and standard HTTP response codes have been around for almost as long. 200-level codes mean success, 400-level codes mean a client-side error, and 500-level codes indicate a server error. If there was an error during an API call to your service, please don’t send us back a 200 response and expect us to parse the response body for error details. If you want to rate limit us, please don’t send us back a 500; that makes us freak out.
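
Purely as an illustration (Flask and the endpoint below are our own assumptions, not anything prescribed here), a minimal sketch of what that looks like server-side: a 4xx for a client mistake, and a 503 with a Retry-After header when you need to slow a caller down, instead of a 500 or a 200-wrapped error.

from flask import Flask, jsonify, request

app = Flask(__name__)

def over_rate_limit(api_key):
    # Placeholder: consult your real rate limiter here.
    return False

@app.route("/items")
def items():
    api_key = request.args.get("api_key")
    if not api_key:
        # A client mistake gets a 4xx, not a 200 wrapping an error message.
        return jsonify(error="missing api_key"), 400
    if over_rate_limit(api_key):
        # Rate limiting isn't a server fault: a 503 plus Retry-After tells
        # the caller exactly what to do.
        return jsonify(error="rate limited"), 503, {"Retry-After": "60"}
    return jsonify(items=[])  # 200 only when the call actually succeeded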

Publish Your Rate Limits
We get it. You want the right to scale back your rate limits without a horde of angry developers wielding virtual pitchforks showing up on your mailing list. It would make everyone’s lives easier if you published your rate limits rather than having developers play a constant guessing game. Bonus points if you describe how your rate limits work. Do you limit per set of credentials, per API key, per IP address?
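
For what it's worth, many providers publish limits right in their response headers using the common (though by no means universal) X-RateLimit-* convention. Here's a minimal consumer-side sketch of reading them, with a hypothetical endpoint and Python's requests library as our assumptions; check the provider's docs for the real header names.

import requests

resp = requests.get("http://api.example.com/items")    # hypothetical endpoint
limit     = resp.headers.get("X-RateLimit-Limit")      # requests allowed per window
remaining = resp.headers.get("X-RateLimit-Remaining")  # requests left in this window
reset     = resp.headers.get("X-RateLimit-Reset")      # when the window resets
print(limit, remaining, reset)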

Use Friendly Ids, Not System Ids
We understand that it’s a common pattern to have an ugly system id (e.g. 17134916) backing a human readable id (e.g. ericwryan). As users of your API, we really don’t want to remember system ids, so why not go the extra mile and let us hit your API with friendly ids?

Allow Us to Limit Response Data
Let’s say your rate limit is pretty generous. What if Joe User is hammering your API once a second and retrieving 100 items with every request, even though, on average, he will only see one new item per day? Joe has just wasted a lot of your precious CPU, memory, and bandwidth. Protect your users. Allow them to ask for everything since the last id or timestamp they received.

Keep Your Docs Up to Date
Who has time to update their docs when you have customers banging on your door for bug fixes and new features? Well, you would probably have fewer customers banging on your door if they had a better understanding of how to use your product. Keep your docs up to date with your code.

Publish Your Search Parameter Constraints
Search endpoints are very common these days. Do you have one? How do we go about searching your data? Do you split search terms on whitespace? Do you split on punctuation? How does quoting affect your query terms? Do you allow boolean operators?

Use Your Mailing List
Do you have a community mailing list? Great! Then use it. Is there an unavoidable, breaking change coming in a future release? Let your users know as soon as possible. Do you keep a changelog of features and bug fixes? Why not publish this information for your users to see?

We consider this to be a fairly complete list for designing an API that is easy to work with. Feel free to yell at us (info at gnip) if you see us lacking in any of these departments.

Clusters & Silos

Gnip is nearing the one-year anniversary of our 2.0 product. We reset our direction several months ago, and as part of that shift we completely changed our architecture. I thought I’d write about that experience a bit.

Gnip 1.0

Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL), data aggregation service. We built and ran this product for about a year and a half. The system consisted of a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach because it obfuscated the ultimate consumer of their data; they wanted transparency. Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were all the rage, and load on a Publisher’s API was part of what degraded real-time delivery. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold.

We thought we could charge Publishers for alleviating their load pain. Boy were we wrong on that count. While Publishers love to complain about the load on their APIs, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification/PubSub solutions (SIP, SUP, PubSubHubbub). However, most simply scaled horizontally and threw money at the problem. Twitter has since shined a light on streaming HTTP (or whatever you want to call it… there are so many monikers), which is “as good as it gets” (leaving protocol buffers and compressed HTTP streams as simply optimizations to the model). I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middle-man, Gnip couldn’t afford antagonistic relations with data sources.

Gnip 2.0

Literally overnight, we walked away from further construction on our 1.0 platform. We had paying customers on it, however, so we operated it for several months before ultimately shutting it down, after migrating everyone we could to 2.0. Gnip 2.0 unintuitively departed from a clustered environment and instead started providing each consuming customer with explicit, transparent integrations with Publishers, all via standalone instances of the software running on standalone virtualized hardware instances (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials to the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they use credentials they obtain themselves, from the Publisher, to configure the instance. Publishers have full transparency into, and control of, who’s using their data.

The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I’d argue there are actually relatively few of them out there), the reality with clusters is that they’re hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, you have all of your customers leveraged against a big SPOF. If something cracks in the system, all of your customers feel that pain. On the flip side, silo’d instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are few and far between. So, the cost of not leveraging the hardware and software we’re generally inclined to architect for is indeed higher, but a simplified system has its benefits to be sure.

We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Building two such significant architectures under the same roof has been a fascinating thing to experience. The pros and cons of each are many. Where you wind up is an interesting function of your technical propensities as well as your business constraints. One size never fits all.

Response Code Nuances

While fixing a bug yesterday, I plowed through the code that does Gnip’s HTTP response code special-case handling. The scenarios we’re handling illustrate the complexities of integrating with many web APIs. It was a reminder of how much we all want standards to work, and how often they only partially do. Here are a few nuances you should consider if you’re doing API integrations by hand.

“retry-after”

When doing a polling-based integration with a “real-time” API, you’re inclined to poll it a lot. That has caused some service providers to tell you to slow down using the “retry-after” HTTP header. Some providers use other, not so standard, ways to cool you down, but those are beyond the scope of this post. When you get a non-200-level response back from a server, you should consider looking for the retry-after header, regardless of whether it was a 503 or a 300-level code (per the HTTP 1.1 specification). Generally, when a service sends a retry-after, its intention is clear, and you should respect the value that comes back. The format of that value can be either “seconds” or a more verbose time format that tells you when you should wait “until” before trying the request again. In practice, we’ve never seen the latter; only the “seconds” version. When we see retry-after, we sleep that duration; you should probably do the same.
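
A minimal sketch of that behavior, assuming Python's requests library and a placeholder endpoint:

import time
import email.utils
from datetime import datetime, timezone
import requests

resp = requests.get("http://api.example.com/activities")  # hypothetical endpoint
if resp.status_code != 200:
    retry_after = resp.headers.get("Retry-After")
    if retry_after:
        try:
            delay = int(retry_after)  # the "seconds" form, which is all we've seen
        except ValueError:
            # the rarer "until" form: an HTTP-date to wait for
            until = email.utils.parsedate_to_datetime(retry_after)
            delay = max(0, (until - datetime.now(timezone.utc)).total_seconds())
        time.sleep(delay)  # respect the server's wishes before retrying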

HTTP Response Code '999'

You can look for it in the spec, but you won’t find it. Delicious likes to send a '999' back when you’re hitting them too hard. Consider backing off for several minutes if you see this from them.

non-200 HTTP Response Bodies

While many services don’t bother sending response bodies back for non-200s, and those that do often don’t provide anything actionable, plenty do include something worthwhile. It’s a good idea to write those bodies to a log file (or at least the first n-hundred bytes) for human inspection. There can be some useful information in there to help you build a more effective and efficient integration.
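
A minimal sketch of that kind of defensive logging (hypothetical endpoint, assuming the requests library again):

import logging
import requests

logging.basicConfig(filename="api_errors.log", level=logging.WARNING)

resp = requests.get("http://api.example.com/activities")  # hypothetical endpoint
if resp.status_code != 200:
    # Truncate so one verbose error page can't flood the log.
    logging.warning("HTTP %s from %s: %s",
                    resp.status_code, resp.url, resp.text[:500])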

The matrix of services-to-response codes, and how you should respond to them, is big. The above is just a small slice of the scenarios your integrations will encounter, and that you’ll need to solve for.

While a service’s documentation is always some degree out of date, and you can only truly learn the behavioral characteristics through long nights of debugging, here are some pointers to service specific response codes that you might find useful.

Swiss Army Knives: cURL & tidy

Iterating quickly is what makes modern software initiatives work, and the mantra applies to everything in the stack. From planning your work to builds, things have to move fast, and feedback loops need to be short and sweet. In the realm of REST[-like] API integration, writing an application to visually validate the API you’re interacting with is overkill. At the end of the day, web services boil down to HTTP requests, which can be rapidly tested with a tight little application called cURL. You can test just about anything with cURL (yes, including HTTP streaming/Comet/long-poll interactions), and its configurability is endless. You’ll have to read the man page to get all the bells and whistles, but I’ll provide a few samples of common Gnip use cases here. At the end of this post I’ll clue you in to cURL’s indispensable cohort in web service slaying, ‘tidy.’

cURL power

cURL can generate custom HTTP client requests with any HTTP method you’d like. ProTip: the biggest gotcha I’ve seen trip up most people is leaving the URL unquoted. Many URLs don’t need quotes when being fed to cURL, but many do, and you should just get in the habit of quoting every one, otherwise you’ll spend time debugging your driver error for far too long. There are tons of great cURL tutorials out on the network; I won’t try to recreate those here.

POSTing

Some APIs want data POSTed to them. There are two forms of this.

Inline

curl -v -d "some=data" "http://blah.com/cool/api"

From File

curl -v -d @filename "http://blah.com/cool/api"

In either case, cURL defaults the content type to the ubiquitous “application/x-www-form-urlencoded”. While this is often the correct default, there are a couple of things to keep in mind. First, it assumes that the data you’re inlining, or that is in your file, is indeed formatted as such (e.g. key=value pairs). Second, when the API you’re working with does NOT want data in this format, you need to explicitly override the content-type header like so.

curl -v -d "someotherkindofdata" "http://blah.com/cool/api" --header "Content-Type: foo"

Authentication

Passing HTTP-basic authentication credentials along is easy.

curl -v -uUSERNAME[:PASSWORD] "http://blah.com/cool/api"

You can inline the password, but keep in mind your password will be cached in your shell history logs.

Show Me Everything

You’ll notice I’m using the “-v” option on all of my requests. “-v” allows me to see all the HTTP-level interaction (method, headers, etc), with the exception of a request POST body, which is crucial for debugging interaction issues. You’ll also need to use “-v” to watch streaming data fly by.

Crossing the Streams (cURL + tidy)

Most web services these days spew XML formatted data, and it is often not whitespace formatted such that a human can read it easily. Enter tidy. If you pipe your cURL output to tidy, all of life’s problems will melt away like a fallen ice-cream scoop on a hot summer sidewalk.

cURL’d web service API without tidy

curl -v "http://rss.clipmarks.com/tags/flower/"
...
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?><rss versi
on="2.0"><channel><title>Clipmarks | Flower Clips</title><link>http://clipmarks.com/tags/flower/</link><feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl><ttl>15</ttl
><description>Clip, tag and save information that's important to you. Bookmarks save entire pages...Clipmarks save the specific content that matters to you!</description><
language>en-us</language><item><title>Flower Shop in Parsippany NJ</title><link>http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link><description>&lt;
b&gt;clipped by:&lt;/b&gt; &lt;a href="http://clipmarks.com/clipper/dunguschariang/"&gt;dunguschariang&lt;/a&gt;&lt;br&gt;&lt;b&gt;clipper's remarks:&lt;/b&gt;  Send Dishg
ardens in New Jersey, NJ with the top rated FTD florist in Parsippany Avas specializes in Fruit Baskets, Gourmet Baskets, Dishgardens and Floral Arrangments for every Holi
day. Family Owned and Opperated for over 30 years. &lt;br&gt;&lt;div border="2" style="margin-top: 10px; border:#000000 1px solid;" width="90%"&gt;&lt;div style="backgroun
d-color:"&gt;&lt;div align="center" width="100%" style="padding:4px;margin-bottom:4px;background-color:#666666;overflow:hidden;"&gt;&lt;span style="color:#FFFFFF;f
...

cURL’d web service API with tidy

curl -v "http://rss.clipmarks.com/tags/flower/" | tidy -xml -utf8 -i
...
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?>
<rss version="2.0">
   <channel>
     <title>Clipmarks | Flower Clips</title>
     <link>http://clipmarks.com/tags/flower/</link>
     <feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl>
     <ttl>15</ttl>
     <description>Clip, tag and save information that's important to
       you. Bookmarks save entire pages...Clipmarks save the specific
       content that matters to you!</description>
     <language>en-us</language>
     <item>
       <title>Flower Shop in Parsippany NJ</title>
       <link>

http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link>

       <description>&lt;b&gt;clipped by:&lt;/b&gt; &lt;a
...

I know which one you’d prefer. So what’s going on? We’re piping the output to tidy and telling tidy to treat the document as XML (use XML structural parsing rules), treat encodings as UTF8 (so it doesn’t barf on non-latin character sets), and finally “-i” indicates that you want it indented (pretty printed essentially).

Right Tools for the Job

If you spend a lot of time whacking through the web service API forest, be sure you have a sharp machete. cURL and tidy make for a very sharp machete. Test driving a web service API before you start laying down code is essential. These tools allow you to create tight feedback loops at the integration level before you lay any code down; saving everyone time, energy and money.

PubSubHubbub (PuSH), Google and Buzz

Setting aside the quality, validity, and longevity of Google Buzz as a product, here’s a first reaction to its PubSubHubbub-based API.

I love the pubsub model, because driving applications via events, vs. polling, is almost always advantageous, and certainly more efficient. Gnip has a chapter in O’Reilly’s Beautiful Data wherein we go deeper into why the world should be event driven rather than founded on incessant polling. bslatkin also has a good post on the topic (Why Polling Sucks).

Over the past few days we’ve built Google Buzz support into the Gnip offering, which has allowed me to finally dig into PuSH Subscription at the implementation level. Mike Barinek, previously with Gnip, built a Ruby PuSH hub, but I haven’t gone that deep yet.
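
For the curious, a subscription request itself is tiny. Here's a minimal sketch of a single-topic subscription; the parameter names come from the PubSubHubbub 0.3 spec, while the requests library, the callback URL, and the topic are our own placeholders.

import requests

HUB = "https://pubsubhubbub.appspot.com/"           # Google's public hub
CALLBACK = "http://example.com/push/callback"       # your publicly reachable endpoint

resp = requests.post(HUB, data={
    "hub.mode": "subscribe",
    "hub.topic": "http://example.com/feeds/some-topic.atom",  # placeholder topic
    "hub.callback": CALLBACK,
    "hub.verify": "async",       # the hub will GET your callback with a challenge
    "hub.verify_token": "opaque-token-you-check-on-verification",
})
# 202 = accepted pending async verification; 204 = verified synchronously.
print(resp.status_code)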

Some PuSH Subscriber thoughts…

  • PuSH lacks support for batch topic subscription requests. This is a bummer when your customers want to subscribe to large numbers of topics, as you have to one-off each subscription request. Unfortunately, I don’t see an easy way to extend the protocol to allow for batching, as the request acknowledgment semantics are baked into the HTTP response code itself, rather than a more verbose HTTP body.
  • Simple and lightweight. As far as pubsub protocols go, PuSH is nice and neat. Good leverage, and definition, of how HTTP should be used to communicate the bare minimum. While in the bullet above I complain that I want some expandability on this front, which would pollute things a bit, the simplicity of the protocol is hard to argue with.
  • Google’s Hub
    • Happily accepts, and returns success for, batch topic subscription requests, when in fact not all of the topics are actually subscribed. Bug.
    • Is the most consistent app I’ve seen WRT predictable HTTP interaction patterns. Respectfully sends back 503/retry-afters when it needs to, and honors them. I wish I could say this about a dozen other HTTP interfaces I have to interact with.
    • Is fast to field subscription requests. However, the queue on the back end that shuffles events through the system has proven inconsistent and flaky. I don’t think I’ve lost any data, but the latency and order in which events move through it isn’t as consistent as I’d like. In order for event-driven architectures to work, this needs to be tightened up.

Here’s to event driven systems!

Migrating to the Twitter Streaming API: A Primer

Some context:

Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few partners and the world was happy. These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed. Over time, numerous factors caused Twitter to cease offering the firehose. There was much wailing and gnashing of teeth on that day, I can tell you!

At roughly the same time, Twitter bought real-time search company Summize and began offering everyone access to what is now known as the Search API. Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword. Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it) and Twitter had to deal with a constantly growing number of developers connected to an inherently inefficient interface.

Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, the gardenhose, track, birddog, etc) and provided feedback that helped shape and solidify Twitter’s Streaming API.

A month ago, Twitter engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API”. In the post, he detailed several reasons why the move is beneficial to developers. Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API. Thus, both the carrot and the stick have been laid out.

The streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

Question 1:  Do I need to make the switch?

Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

Examples:

  • Use the Streaming API any time you are tracking a keyword over time or sending notifications/summaries to a subscriber.
  • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
  • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
  • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache.

Question 2: Why should I make the switch?

  • First and foremost, you’ll get relevant tweets significantly faster.  Linearly polling an API or RSS feed for a given set of keywords automatically creates latency which increases at a linear rate.  Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours!  With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.
  • With traditional API polling, each query returns N results regardless of whether any results are new since your last request.  This puts the onus of deduping squarely on your shoulders.  This sounds like it should be simple — cache the last N result IDs in memory and ignore anything that’s been seen before.  At scale, high-frequency keywords will consume the cache and low-frequency keywords will quickly age out.  This means you’ll invariably have to hit the disk and begin thrashing your database. Thankfully, Twitter has already obviated much of this in the Search API with an optional “since_id” query parameter (see the sketch after this list), but plenty of folks either ignore the option or have never read the docs and end up with serious deduplication work.  With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
  • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
  • We all complain about the limited backwards view of Twitter’s search corpus.  On any given day, you’ll have access to somewhere between seven and 14 days worth of historical data (somewhere between one quarter to one half billion tweets), which is of limited value when trying to discover historical trends.  Additionally, for high volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system.  While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index.  As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
  • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).
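
To make the since_id point concrete, here's a minimal polling sketch against the Search API as it looked around the time of this post; the keyword, sleep interval, and handle() hook are placeholders, and the endpoint and field names should be adjusted for whatever the current API expects.

import time
import requests

SEARCH_URL = "http://search.twitter.com/search.json"

def handle(tweet):
    print(tweet.get("text"))      # placeholder for your real processing

since_id = 0
while True:
    resp = requests.get(SEARCH_URL, params={
        "q": "gnip",              # your keyword
        "rpp": 100,               # results per page
        "since_id": since_id,     # only tweets newer than the last one seen
    })
    data = resp.json()
    for tweet in data.get("results", []):
        handle(tweet)
    since_id = max(since_id, data.get("max_id", since_id))
    time.sleep(10)                # stay comfortably inside the rate limit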

Question 3: Will I have to change my API integration?

Twitter’s Streaming API uses streaming HTTP

  • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
  • That said, you’ll need to reset the connection every time you change keywords.  With the Streaming API, you upload the entire set of keywords when establishing a connection.  If you have a large number of keywords, it can take several minutes to upload all of them, and for that duration you won’t get any streaming results.  The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data.  In order to adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two-minute chunks.
  • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.
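
Here's a minimal sketch of that decoupling, assuming Python's standard threading and queue modules; the stream object and the index() step are placeholders for your actual connection and processing.

import json
import queue
import threading

backlog = queue.Queue()

def reader(stream):
    # stream: any iterable of raw JSON lines coming off the Streaming API;
    # wire this up to your actual connection.
    for line in stream:
        if line.strip():
            backlog.put(line)      # cheap; never do real work here

def index(activity):
    pass                           # placeholder: transform/index/store

def worker():
    while True:
        line = backlog.get()
        index(json.loads(line))
        backlog.task_done()

threading.Thread(target=worker, daemon=True).start()
# backlog.qsize() gives you a direct measure of how far behind you are.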

Streaming API consumers need to perform more filtering on their end

  • Twitter’s Streaming API only accepts single-term rules; no more complex queries.  Say goodbye to ANDs, ORs and NOTs.  This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end.  From now on, you’ll add one or more of the required keywords (Avatar and/or Movie) to the Streaming API, then filter the results yourself, dropping anything that doesn’t contain both keywords or that does contain the word “Game”.
  • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
  • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.
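
A minimal sketch of that kind of client-side filtering; the naive whitespace tokenization below is an illustrative simplification, and real matching needs to cope with punctuation, substrings, and so on.

REQUIRED = {"avatar", "movie"}   # every one of these must appear
EXCLUDED = {"game"}              # none of these may appear

def matches(tweet_text):
    words = set(tweet_text.lower().split())
    # Test every term, every time -- don't stop at the first hit.
    return REQUIRED.issubset(words) and not (EXCLUDED & words)

print(matches("avatar is my favorite movie"))        # True
print(matches("the avatar movie game is out now"))   # False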

Throttling is performed differently

  • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
  • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
  • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
  • The best place to begin your quest for increased Streaming API access is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for the coming Commercial Accounts.)
  • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track.  I know, I know, who would do such a thing?  Not you, certainly.  But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it”, in order to get a ton of tweets flowing to you.  You would be sorely disappointed, because Twitter enforces a secondary throttle, a percentage of the firehose available to each access level.  The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume.  Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor.  No soup for you!  You should monitor this by watching for “limit” notifications (a sketch of handling these appears after the deletes section below).  If you find yourself regularly receiving these, either tighten up your keywords or request greater access from Twitter.

Start tracking deletes

  • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.
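
Here's a minimal dispatch sketch covering ordinary tweets, the “limit” notices mentioned above, and deletion notices, using the message shapes documented around the time of this post; the store object is a hypothetical stand-in for your own persistence layer.

import json

def dispatch(raw_line, store):
    msg = json.loads(raw_line)
    if "delete" in msg:
        # Honor the user's wishes: strike the tweet from your records.
        store.delete(msg["delete"]["status"]["id"])
    elif "limit" in msg:
        # You were throttled; "track" counts tweets dropped on the floor.
        # Seeing these regularly means tighter keywords or more access.
        print("rate limited; missed %d tweets" % msg["limit"]["track"])
    elif "text" in msg:
        store.save(msg)            # an ordinary tweet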

Question 4: What if I want historical data too?


Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword.  Depending on your use case you may need some historical data to kick things off.  If so, you’ll want to make one simultaneous query to the Search API.  This means that you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data + low-latency / high-reliability future data.

And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.
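
A minimal backfill sketch using that parameter; the endpoint and the roughly 15-pages-of-100-results ceiling reflect the Search API as it stood when this was written.

import requests

def backfill(keyword):
    tweets = []
    for page in range(1, 16):          # ~15 pages x 100 results was the practical limit
        resp = requests.get("http://search.twitter.com/search.json",
                            params={"q": keyword, "rpp": 100, "page": page})
        results = resp.json().get("results", [])
        if not results:
            break                      # ran out of history early
        tweets.extend(results)
    return tweets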

Question 5: What if I need more help?

Twitter resources:

Streaming HTTP resources:

Gnip help:

  • Ask questions in the comments below and we’ll respond inline
  • Send email to eric@gnip.com to ask the Gnip team direct questions

Of Client-Server Communication

We’ve recently been having some interesting conversations, both internally and with customers, about the challenges inherent in client-server software interaction, aka Web Services or Web APIs. The relatively baked state of web browsers and servers has shielded us from most of the issues that come with getting computers to talk to other computers.

It didn’t happen overnight, but today’s web browsing world rides on top of a well-vetted pipeline of technology to give us good browsing (client-side) experiences. However, there are a lot of assumptions and moving parts behind our browser windows that get uncovered when working with web services (servers). There are skeletons in the closet, unfortunately.

End users’ web browsing demands eventually forced ports 80 and 443 (SSL) open across all firewalls and ISPs, and we now take their availability for granted. When was the last time you heard someone ask “is port 80 open?” It’s probably been a while. By 2000, server-side HTTP implementations (web servers) had started solidifying, and at the HTTP level there was relatively little incompatibility between clients and servers. Expectations around socket timeouts and HTTP protocol exchanges were clear, and both sides of the connection adhered to those expectations.

Enter the world of web-services/APIs.

We’ve been enjoying the stable client-server interaction that web browsing has provided over the past 15 years, but web services/APIs thrust the ugly realities that lurk beneath into view. When we access modern web services through lower-level software (e.g. something other than the browser), we have to make assumptions and implementation/configuration choices that the browser otherwise makes for us. Among them…

  • port to use for the socket connection
    • the browser assumes you always want '80' (HTTP) or '443' (HTTPS)
    • the browser provides built-in encryption handling of HTTPS/SSL
  • URL parsing
    • the browser uses static rules for interpreting and parsing URLs
  • HTTP request methods
    • browsers inherently know when to use GET vs. POST
  • HTTP POST bodies.
    • browsers pre-define how POST bodies are structured, and never deviate from this methodology
  • HTTP header negotiation (this is the big one).
    • browsers handle all of the following scenarios out-of-the-box
    • Request
      • compression support (e.g. gzip)
      • connection duration types (e.g. keep-alive)
      • authentication (basic/oauth/other)
      • user-agent specification
    • Response
      • chunked responses
      • content-types. the browser has a pre-defined set of content types that it knows how to handle internally.
      • content-encoding. the browser knows how to handle various encoding types (e.g. gzip compression), and does so by default
      • authentication (basic/oauth/other)
  • HTTP Response body formats/character sets/encodings
    • browsers juggle the combination between content-encoding, content-type, and charset handling to ensure their international audience can see the information as its author intended.

Web browsers have the luxury of being able to lock down all of the above variables and not worry about changes in these assumptions. Having built browsers (Netscape/Firefox) for a living in the past, I can say it’s still a very difficult task, but at least the problem is constrained (e.g. ensure the end user can view the content within the browser). Web service consumers have to understand, and make decisions around, each of those points. Getting just one of them wrong can lead to issues in your application. These issues can range from connectivity or content handling problems to service authentication failures, and can lead to long guessing games of “what went wrong?”
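
To make that concrete, here's a minimal sketch that spells out a handful of the decisions a browser would otherwise make for you; Python's requests library and the endpoint are our own assumptions, and every line below is a choice that can break an integration if it doesn't match what the service expects.

import requests

resp = requests.get(
    "https://api.example.com:443/things",       # scheme and port are now your call
    headers={
        "Accept-Encoding": "gzip",              # ask for compression explicitly
        "Accept": "application/json",           # the content type you can handle
        "User-Agent": "my-integration/1.0",     # identify yourself
    },
    auth=("username", "password"),              # HTTP Basic; OAuth is a different dance
    timeout=30,                                 # sockets won't time out sensibly on their own
)
resp.raise_for_status()                         # don't assume you got a 200
body = resp.text                                # decoded per the content-encoding and charset headers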

To further complicate the API interaction pipeline, many IT departments prevent abnormal connection activity from occurring. This means that while your application may be “doing the right thing” (TM), a system that sits between your application and the API with which it is trying to interact may prevent the exchange from occurring as you intended.

What To Do?

First off, you need to be versed not only in the documentation of the API you’re trying to use, but also in its developer community/forums; documentation is often outdated and doesn’t reflect actual implementations or account for bugs and behavioral nuances inherent in any API. From there, you need to ensure your HTTP client accounts for the assumptions I outline above and adheres to the API you’re interacting with. If you’re experiencing issues, you’ll need to ensure your code is establishing the connection successfully, receiving the data it’s expecting, and parsing the data correctly. Never underestimate a packet sniffer for viewing the raw HTTP exchange between your client and the server; debugging HTTP libraries at the code level (even with logging) often doesn’t yield the truth behind what’s being sent to the server and received.

The Power of cURL

This is an entire blog post in and of itself, but the swiss army knife of any web service developer is cURL. In the right hands, cURL allows you to easily construct HTTP requests to test interaction with a web service. Don’t underestimate the translation of your cURL test to your software however.

So You Want Some Social Data

If your product or service needs social data in today’s API marketplace, there are a few things you need to consider in order to most effectively consume said data.


I need all the data

First, you should double-check your needs. Data consumers often think they need “all the data,” when in fact they don’t. You may need “all the data” for a given set of entities (e.g. keywords, or users) on a particular service, but don’t confuse that with needing “all the data” a service generates. When it comes to high-volume services (such as Twitter), consuming “all of the data” actually amounts to resource-intensive engineering exercises on your end. There are often non-trivial scaling challenges involved when handling large data sets. Do some math and determine whether or not statistical sampling will give you all you need; the answer is usually “yes.” If the answer is “no,” be ready for an uphill (technical, financial, or business model) battle with service providers; they don’t necessarily want all of their data floating around out there.
Social data APIs are generally designed around prohibiting “all of the data” being accessed, either technically, or through terms of service agreements. However, they usually provide great access to narrow sets of data. Consider whether you need “100% of the data” for a relatively narrow slice of information; most social data APIs support this use case quite well.


Ingestion


Connectivity

There are three general styles you’ll wind up using to access an API, all of them HTTP-based: event-driven inbound POSTs (e.g. PubSubHubbub/WebHooks), polling via GET, or streaming via a long-held GET/POST. Each of these has its pros and cons. I’m avoiding XMPP in this post only because it is infrequently used and hasn’t seen widespread adoption (yet). Each style requires a different level of operational and programmatic understanding.


Authentication/Authorization

APIs usually have publicly available versions (usually limited in their capabilities), as well as versions that require registration for subsequent authenticated connections. The authC and authZ semantics around APIs range from simple to complex. You’ll need to understand the access characteristics around the specific services you want to access. Some require hands-on, human, authorization-level justification processes to be followed in order to have the “right level of access” granted to you and your product. Others are simple automated online registration forms that directly yield the account credentials necessary for API access.
HTTP Basic authentication, not surprisingly, is the predominant authentication scheme used, and authorization levels are conveniently tied to the account by the service provider. OAuth (proper and 2-legged) is gaining steam, however. You’ll also find API keys (URL-parameter or HTTP-header based) are still widely used.


Processing

How you process data once you receive it is certainly affected by which connection style you use. Note that most APIs don’t give you an option in how you connect to them; the provider decides for you. Processing data in the same step as receiving it can cause bottlenecks in your system, and ultimately put you on bad terms with the API provider you’re connecting to. An analogy would be drinking from the proverbial firehose. If you connect the firehose to your mouth, you might get a gulp or two down before you’re overwhelmed by the amount of water actually coming at you. You’ll either cause the firehose to back up on you, or you’ll start leaking water all over the place. Either way, you won’t be able to keep up with the amount of water coming at you. If your average ability to process data is slower than the rate at which it arrives, you’ll have a queueing challenge to contend with. Consider offline, or out-of-band, processing of data as it becomes available. For example, write it to disk or a database and have parallelized worker threads/processes parse/handle it from there. The point is, don’t process it in the moment in this case.
Many APIs don’t produce enough data to warrant out-of-band processing, so often inline processing is just fine. It all depends on what operations you’re trying to perform, the speed at which your technology stack can accomplish those operations, and the rate at which data arrives.


Reporting

If you don’t care about reporting initially, you will in short order. How much data are you receiving? What are peak volume periods? Which of the things you’re looking for are generating the most results?
API integrations inherently bind your software to someone else’s. Understanding how that relationship is functioning at any given moment is crucial to your day to day operations.


Monitoring

Reporting’s close sibling is monitoring. Understanding when an integration has gone south is just as important as knowing when your product is having issues; they’re one and the same. Integrating with an API means you’re dependent on someone else’s software, and that software can have any number of issues. From bugs, to planned upgrades or API changes, you’ll need to know when certain things change, and take appropriate action.


Web services/APIs are usually incredibly easy to “sample,” but truly integrating and operationalizing them is another, more challenging, process.

Social Data in a Marketplace

Gnip: shipping & handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a thing is necessary. Several times we’ve come up against folks who indicate that either a) “our (random publisher X) data’s already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.

If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either not financially able to afford a “web engineer” or simply technically focused elsewhere (if at all), need help accessing the data they need to make business decisions. Furthermore, while you may do your own integrations, how robust are your error reporting, monitoring, and management of your overall strategy? Odds are that you have not given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you were integrated with went down? Could you more efficiently receive the same data from a publisher, while relieving load from your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.

As the social media/data space has evolved over the past couple of years, the necessity of a managed, solution-as-a-service approach has become clear. As expected, the number of data consumers continues to explode, while the number of consumers with the technical capability to reliably integrate with the publishers is, as a ratio of the total, shrinking.

Finally, some good technical/formatting standards are catching on (PubSubHubbub, WebHooks, HTTP long-polling/streaming/Comet (thanks, Twitter), ActivityStreams), which is giving everyone a vocabulary and common conceptual understanding to use when discussing how and when real-time data is produced and consumed.

In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve, I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less inclined, players in the marketplace access to the data they need. As a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build around/with geo? No. I prefer a one-stop shop.