Response Code Nuances

While fixing a bug yesterday, I plowed through the code that does Gnip’s HTTP response code special-case handling. The scenarios we’re handling illustrate the complexities of integrating with many web APIs. It was a reminder of how much we all want standards to work, and how often they only partially do. Here are a few nuances you should consider if you’re doing API integrations by hand.

“retry-after”

When doing a polling-based integration with a “real-time” API, you’re inclined to poll it a lot. That has caused some service providers to tell you to slow down using the “retry-after” HTTP header. Some providers use other, less standard, ways to cool you down, but those are beyond the scope of this post. When you get a non-200-level response back from a server, you should consider looking for the retry-after header, regardless of whether it was a 503 or a 300-level code (the cases the HTTP 1.1 specification defines it for). Generally, when a service sends a retry-after, its intention is clear, and you should respect the value that comes back. Now, the format of that value can be either “seconds” or a more verbose time format that tells you when to wait “until” before trying the request again. In practice, we’ve never seen the latter; only the “seconds” version. When we see retry-after, we sleep for that duration; you should probably do the same.
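
Here’s a minimal sketch of that logic in Python using the requests library. The endpoint URL is hypothetical, and the HTTP-date branch is included for completeness even though we’ve only ever seen the seconds form in the wild:

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests

def fetch_honoring_retry_after(url):
    """GET a URL and, on any non-200, sleep for whatever Retry-After asks."""
    response = requests.get(url)
    if response.status_code != 200:
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            if retry_after.isdigit():
                # The "seconds" form -- the only one we've seen in practice.
                time.sleep(int(retry_after))
            else:
                # The HTTP-date form: sleep until the specified moment.
                wake_at = parsedate_to_datetime(retry_after)
                delay = (wake_at - datetime.now(timezone.utc)).total_seconds()
                if delay > 0:
                    time.sleep(delay)
    return response

# Hypothetical polling endpoint.
response = fetch_honoring_retry_after("https://api.example.com/activities")
```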

HTTP Response-code ‘999’

You can look for it in the spec, but you won’t find it. Delicious likes to send a ‘999’ back when you’re hitting them too hard. Consider backing off for several minutes if you see this from them.

non-200 HTTP Response Bodies

While many services don’t bother sending response bodies back for non-200s, and those that do often don’t provide anything actionable, some bodies are genuinely useful. It’s a good idea to write them to a log file (or at least the first n-hundred bytes) for human inspection; there can be information in there that helps you build a more effective and efficient integration.
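
Tying the last two points together, here’s a rough sketch of a non-200 handler that logs a truncated body and backs off on Delicious’s non-standard 999. The byte count and back-off duration are illustrative choices, not prescriptions:

```python
import logging
import time

import requests

logging.basicConfig(filename="api_errors.log", level=logging.WARNING)

MAX_BODY_BYTES = 500             # "first n-hundred bytes" -- pick your n
DELICIOUS_BACKOFF_SECONDS = 300  # "several minutes" is a judgment call

def handle_non_200(response):
    """Log a truncated response body and special-case non-standard codes."""
    snippet = response.content[:MAX_BODY_BYTES]
    logging.warning("HTTP %d from %s: %r",
                    response.status_code, response.url, snippet)
    if response.status_code == 999:
        # Delicious's unofficial "you're hitting us too hard" code.
        time.sleep(DELICIOUS_BACKOFF_SECONDS)

response = requests.get("https://api.example.com/activities")  # hypothetical
if response.status_code != 200:
    handle_non_200(response)
```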

The matrix of services-to-response codes, and how you should respond to them, is big. The above is just a small slice of the scenarios your integrations will encounter, and that you’ll need to solve for.

While a service’s documentation is always some degree out of date, and you can only truly learn a service’s behavioral characteristics through long nights of debugging, here are some pointers to service-specific response codes that you might find useful.

Social Data in a Marketplace

Gnip: shipping & handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a thing is necessary. Several times we’ve come up against folks who indicate that either a) “our (random publisher X) data’s already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.

If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either not financially able to afford a “web engineer” or simply technically focused elsewhere (if at all), need help accessing the data they need to make business decisions. Furthermore, while you may do your own integrations, how robust are your error reporting, your monitoring, and your management of the overall strategy? Odds are that you have not given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you were integrated with went down? Could you more efficiently receive the same data from a publisher, while relieving load from your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.

As the social media/data space has evolved over the past couple of years, the necessity of a managed, solution-as-a-service approach has become clear. As expected, the number of data consumers continues to explode, while the proportion of consumers with the technical capability to reliably integrate with the publishers is shrinking.

Finally, some good technical/formatting standards are catching on (PubSubHubbub, WebHooks, HTTP long-polling/streaming/Comet (thanks Twitter), ActivityStreams), which gives everyone a vocabulary and a common conceptual understanding to use when discussing how and when real-time data is produced and consumed.

In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve, I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less inclined, players in the marketplace access to the data they need. As a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build around/with geo? No. I prefer a one-stop shop.

New Digg-2000 Publisher, Hot Stuff

We always get interesting requests for doing additional processing on data sources. Some of these are addressed using Gnip filters, but others do not really fit the filter model. To support richer or more complex data processing, we have built some additional features into the Gnip platform. The first new publisher using some of these new features is “Digg-2000”.

Digg-2000 Publisher — What is it?

Lots of people submit stories to Digg, and lots of other people Digg those stories, which allows more popular information to rise to the top and become more discoverable. Several Gnip customers asked if we could make it possible for them to receive only stories that had reached a specific number of Diggs. We asked Digg about the idea, and they said it sounded great, since they have a Twitter account that provides a similar type of feature; so the Digg-2000 Gnip publisher was born.

Digg-2000 Publisher — How does it work?

On the Gnip platform we have set up a publisher that is listening to activities on Digg.

  • When new stories are submitted to Digg, we pick those new activities up, along with the Digg count on the story, and post them to the standard Gnip Digg publisher.
  • With every new Digg on a story, we increment our count for that story, and when it hits 2000 Diggs we re-post the original story to the Digg-2000 publisher.
  • The configuration of the Digg-2000 publisher lets us turn two different dials (a sketch of the logic follows this list):

    • The default configuration will only re-post stories that were first posted on Digg in the last two days. This means we are looking for currently active stories, not stories that were posted months ago and finally hit 2000 Diggs through slow and steady interest.
    • The default configuration re-posts at 2000 Diggs. This can be set to any number of Diggs — 100, 1000, 2000, etc.
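
Here’s a minimal sketch of that tracking logic in Python. The `StoryTracker` class, the in-memory dict, and the `repost` callback are illustrative assumptions; the real publisher runs inside the Gnip platform:

```python
from datetime import datetime, timedelta, timezone

# The two dials, set to the defaults described above.
DIGG_THRESHOLD = 2000
MAX_STORY_AGE = timedelta(days=2)

class StoryTracker:
    """Counts Diggs per story and re-posts a story once it crosses the
    threshold while still inside the freshness window."""

    def __init__(self, repost):
        self._stories = {}  # story_id -> (first_seen, digg_count, reposted)
        self._repost = repost  # hypothetical callback to the Digg-2000 publisher

    def on_new_story(self, story_id, digg_count=0):
        self._stories[story_id] = (datetime.now(timezone.utc), digg_count, False)

    def on_digg(self, story_id):
        if story_id not in self._stories:
            return
        first_seen, count, reposted = self._stories[story_id]
        count += 1
        fresh = datetime.now(timezone.utc) - first_seen <= MAX_STORY_AGE
        if count >= DIGG_THRESHOLD and fresh and not reposted:
            self._repost(story_id)
            reposted = True
        self._stories[story_id] = (first_seen, count, reposted)

# Usage with a stand-in callback.
tracker = StoryTracker(lambda story_id: print("re-post", story_id))
tracker.on_new_story("story-42")
for _ in range(DIGG_THRESHOLD):
    tracker.on_digg("story-42")  # prints "re-post story-42" at the 2000th Digg
```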

Pushing and Polling Data Differences in Approach on the Gnip platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we have also learned that, for many reasons, we need to work with services via a polling approach as well. For this reason, our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using a push, poll, or mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every end-user developer or company has the option to tell Gnip where to push the data for the filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data for their filters from a Gnip-generated URL.

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (i.e. Bob posted a photo, Susie made a comment, etc.) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.

PUSH services: When we have push services, the latency experienced is usually under 60 seconds, but we know this is not always the case, since the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (i.e. every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution at 10-minute intervals by default, and people need to keep this in mind when setting expectations for data flow from any given publisher.

Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and said “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10-minute interval is for the Community Edition (aka the free version). We have the ability to turn the dial, and using the smarts built into the polling service we can execute the right rules faster (i.e. every 60 seconds or faster for popular terms, and every 10, 20, etc. minutes for less popular ones). The key issue here is that for very prolific posters or very common keyword rules (i.e. “obama”, “http”, “google”) there can be more posts in the 10-minute default time-frame than we can collect in a single poll from the service endpoint.
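
To make “turning the dial” concrete, here’s a rough sketch of how a polling scheduler might assign per-rule intervals based on observed volume. The tier thresholds and the `poll_rule` callback are assumptions for illustration, not the actual Gnip scheduler:

```python
import heapq
import time

def interval_for(posts_per_poll):
    """Busier rules get polled more often (illustrative tiers)."""
    if posts_per_poll > 100:   # very common terms, e.g. "obama"
        return 60              # seconds
    if posts_per_poll > 10:
        return 600             # the 10-minute Community Edition default
    return 1200

def run_scheduler(rules, poll_rule):
    """rules: iterable of rule names; poll_rule(rule) -> posts found.
    Keeps a min-heap of (next_due, rule) and re-schedules each rule at
    an interval derived from how much it produced on the last poll."""
    now = time.time()
    queue = [(now, rule) for rule in rules]
    heapq.heapify(queue)
    while queue:
        due, rule = heapq.heappop(queue)
        time.sleep(max(0, due - time.time()))
        posts = poll_rule(rule)  # hypothetical: hits the service endpoint
        heapq.heappush(queue, (time.time() + interval_for(posts), rule))

# Usage with a stand-in poller that pretends every rule returns 5 posts:
# run_scheduler(["obama", "gnip"], lambda rule: 5)
```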

For now, the default expectation for our Community Edition platform users should be a 10-minute execution interval for all rules on any data publisher that is polled, which is consistent with the experience during our v2.1 beta. If your project or company needs something a bit snappier from the polled data publishers, contact us at info@gnip.com, or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious.

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube.

New Gnip schema live on demo.gnip.com

We have flipped the switch to allow people to start working with our new schema at http://demo.gnip.com. In addition to standing up the site with the updated schema, we have moved over all the existing accounts from the current system, so your existing gnipcentral.com username and password also get you access to the demo system.

The following publishers are in the demo system and we plan to add more over the course of the next month during the beta period.

  • Digg
  • Identi.ca
  • Seesmic
  • Six Apart
  • Twitter

We will be posting additional examples of how we mapped these social media services to the updated Gnip schema in the Gnip Community, and we will link to those examples from the blog as well as point to them in our standard release newsletter that goes out later today.

Based on the feedback we have received, there is a lot of interest in the enhanced metadata in the new schema, which can be used to support additional types of URLs, multiple tags, geo data, and rich media. Now, go grab some data and do something cool with it.

Solution Spotlight: Storytlr Using Gnip for Real-Time Social Data Integration

Who is Storytlr?
Storytlr provides a life streaming service that allows people to bring together their entire web 2.0 life and assemble their content to tell stories in a whole new way.  Learn more at their website, http://storytlr.com/, or their blog, http://blog.storytlr.com/.

Real-world results Storytlr says they are realizing from using Gnip
Storytlr is using Gnip to provide real-time data integration with Twitter, Digg, Delicious, and Seesmic. Since Storytlr started using Gnip, they have seen a reduction in latency for the data integration of these social media activity streams (i.e. the time elapsed for a tweet, digg, or event notice to show up in the Storytlr service from a third party is now real-time). Read more about how Storytlr added real-time integration using Gnip in their recent blog post.

We are looking forward to working more with the Storytlr team as we roll out more publishers that they can take advantage of in their business. 

Preview: Gnip Publisher Analytics

With everything going on here at Gnip, we want to regularly preview some of the new features we are working on so people can send us feedback and plan ahead. One feature we know a lot of people are interested in is usage and operational reporting and analytics. The reasons for adding an analytics dashboard are many, but the primary one is that we believe it will help companies and developers better understand the richness and variability of the data streams they care about.

Below is one example of the analytics features that we are planning to provide in the near future. This image shows the Digg Data Stream summary view with individual diggs, comments and submissions per second being streamed by the Gnip platform.

Figure: Gnip — Digg Data Stream Activity View

Obviously we could pivot on the summary view to show different types of details depending on any number of variables that Gnip partners and customers find interesting. If your company has specific requests for analytics and reporting please let us know.

Solution Spotlight: Strands Now Using Gnip

Strands is the newest company using the Gnip messaging platform for their web API data integration needs. Welcome Strands and thank you to Aaron for sharing what the team is doing!

Who is Strands?
Strands develops technologies to better understand people’s taste and help them discover things they like and didn’t know about. Strands has created a social recommendation engine that is able to provide real-time recommendations of products and services through computers, mobile phones and other Internet-connected devices. This enables users to discover new things, based on their online, offline and mobile activities. The Strands.com website helps people discover new things from other people. Visit http://www.strands.com to learn more.

Real-world results Strands says they are realizing from using Gnip
Strands.com is now able to give people updates faster and more reliably. In addition, Strands has seen reduced load on their system by not having to poll for updates on sites like Twitter, Flickr, Delicious, and Digg. Gnip allows Strands to receive push data from several of these sites, and at a minimum receive notifications when a user on these sites has made an update.

More Examples of How Companies are Using Gnip

We have noticed that we are interacting with two distinct groups of companies: those who instantly understand what Gnip does and those who struggle with what we do. So we decided to provide a few detailed real-world examples of the companies we are actively working with today to provide data integration and messaging services.

First, we are not an end-user-facing social aggregation application. (We repeat this often.) We see a lot of people wanting to put Gnip in that bucket along with social content aggregators like FriendFeed, Plaxo, and many others. These content aggregators are destination web sites that provide utility to end users by giving them the flexibility to bring their social graph, or part of it, together in one place. Also, many of these services now provide web APIs that allow people to use an alternative client to interact with their core services around status updates and conversations, as well as other features specific to the service.

Gnip is an infrastructure service; specifically, we provide an extensible messaging system that allows companies to more easily access, filter, and integrate data from web-based APIs. While someone could use Gnip to bring content into a personal social media client written for a specific social aggregator, that is not something we are focused on. Below are the company use cases we are focused on:

  1. Social content aggregators: One of the main reasons we started Gnip was to solve the problems caused by the point-to-point integration issues that were springing up with the increase of user-generated content and corresponding open web APIs. We believe that any developer who has written a poller once, twice, or for their nth API will tell you how unproductive it is to write and maintain this code. However, writing one-off pollers has become a necessary evil for many companies, since the content aggregators need to provide access to as many external services as possible for their end users. Plaxo, who recently integrated with Gnip to support their Plaxo Pulse feature, is a perfect example, as are several other companies.
  2. Business-specific applications: Another main reason we started Gnip is that we believe more and more companies see the value of integrating business and social data as a way to add compelling value to their own applications. The examples are wide-ranging, from how Eventvue uses Gnip to integrate Twitter streams into their online conference community solution, to the companies we have talked to about using Gnip to integrate web-based data to power everything from sales dashboards to customer service portals.
  3. Content producers: Today, Gnip offers value to content producers by giving developers an alternative tool for integrating with their web APIs. We are working with many producers, such as Digg, Delicious, Identi.ca, and Twitter, and plan to aggressively grow the set of producers available. The benefits producers see from working with Gnip include off-loading direct traffic from their web APIs and gaining another channel to make their content available. We are also working hard to add new capabilities for producers, including plans to provide more detailed analytics on how their data is consumed, and we are evaluating publishing features that would allow producers to define their own filters and target the service endpoints and web sites where they want to push relevant data for their own business needs.
  4. Market and brand research companies: We are working with several companies that provide market research and brand analysis. These companies see Gnip as an easy way to aggregate social media data to be included in their brand and market analysis client services.

Hopefully this set of company profiles provides more context on the areas we are focused on and the typical companies we work with every day. If your company does something that does not fit into these four areas and is using our services, please send me a note.

What We Are Up to At Gnip

As the newest member of the Gnip team, I have noticed that people ask a lot of the same questions about what we are doing at Gnip and the ways they can use our services in their business.

What we do

Gnip provides an extensible messaging platform that allows for the publishing and subscribing of events and data from across the Internet, which makes data portability exponentially less painful and more automatic once it is set up. Because Gnip is built as a platform of capabilities and not a web application, the core services are instantly useful for multiple scenarios, including data producers, data consumers, and custom web applications. Gnip is already being used with many of the most popular Internet data sources, including Twitter, Delicious, Flickr, Digg, and Plaxo.

How to use Gnip

So, who is the target user of Gnip? It is a developer: the platform is not a consumer-oriented web application but a set of services meant to be used by a developer or an IT department for a set of core use cases.

  • Data Consumers: You’ve built your pollers; let us tell you when and where to fire them. Avoid throttling and decrease latency from hours to seconds.
  • Data Producers: Push your data to us and reduce API traffic by an order of magnitude while increasing distribution through aggregators.
  • Custom web applications: You want to embed or publish content for use in your own application or a third-party application. Decide who, or what, you care about for any publisher, give us an endpoint, and we push the data to you so you can solve your business use cases, such as customer service websites, corporate websites, blogs, or any web application (see the sketch after this list).
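
As a rough illustration of the “give us an endpoint” model, here’s a minimal sketch of an HTTP endpoint that could receive pushed data. The path, port, and payload handling are all assumptions; consult the Gnip documentation for the actual delivery format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class PushHandler(BaseHTTPRequestHandler):
    """Accepts activity data POSTed to the endpoint you registered."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        # A real consumer would parse and store the activities here;
        # this sketch just acknowledges what arrived.
        print(f"received {length} bytes at {self.path}: {payload[:80]!r}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Register http://your-host:8080/ with your filter or subscription,
    # and data is pushed here as it flows through the platform.
    HTTPServer(("", 8080), PushHandler).serve_forever()
```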

Get started now

By leveraging the Gnip APIs, developers can easily design reusable services, such as push-based notifications, smart filters, and data streams, that can be used in all your web applications to make them better. Are you a developer? Give the new 2.0 version a try!