Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we also learned that, for many reasons, we need to work with some services via a polling approach. For this reason our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using push, poll or a mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every end-user developer or company has the option to tell Gnip where to push the data for which they have set up filters or a subscription. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab the data for their filters from a Gnip-generated URL.

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (e.g. Bob posted a photo, Susie made a comment, etc.) and the time it hits the Gnip platform to be delivered to our waiting users. Here are some basic thoughts to set expectations.

PUSH services: When a service pushes to us, the latency is usually under 60 seconds, but we know that this is not always the case, since sometimes the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally it is reasonable to expect 60 second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (e.g. every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution to 10 minute intervals by default, and people need to keep this in mind when setting their expectations for data flow from any given publisher.
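To see how these three factors interact, here is a rough back-of-the-envelope sketch in Python. The function names and sample numbers are made up for illustration and are not Gnip internals. It assumes one request per rule execution and that rules are spread evenly across the available request rate.

```python
def min_sweep_seconds(num_rules: int, requests_per_second: float) -> float:
    """Shortest time to execute every rule once, assuming one request per rule."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return num_rules / requests_per_second


def worst_case_latency_seconds(num_rules: int, requests_per_second: float,
                               rule_interval_seconds: float) -> float:
    """Rough upper bound on how long an activity waits to be picked up.

    An activity that lands just after a poll waits until that rule runs again,
    and a rule cannot run again sooner than a full sweep of all rules allows.
    """
    sweep = min_sweep_seconds(num_rules, requests_per_second)
    return float(max(rule_interval_seconds, sweep))


# Hypothetical numbers: 5 requests per second, 10 minute (600 s) rule interval.
print(min_sweep_seconds(3000, 5))                # 600.0
print(worst_case_latency_seconds(3000, 5, 600))  # 600.0
print(worst_case_latency_seconds(6000, 5, 600))  # 1200.0
```

Under these assumptions, once the number of rules outgrows what the request rate can cover in one interval, the rule count (not the configured interval) becomes the limiting factor.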

Expectations for POLLING in the Community Edition: So I am sure some people who just read the above stopped and said “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10 minute interval is for the Community Edition (aka the free version). We have the complete ability to turn the dial, and using the smarts built into the polling service feature we can execute the right rules faster (e.g. every 60 seconds or faster for popular terms, and every 10, 20, etc. minutes or more for less popular ones). The key issue here is that for very prolific posters or very common keyword rules (e.g. “obama”, “http”, “google”) there can be more posts in the 10 minute default time frame than we can collect in a single poll of the service endpoint.
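One way to picture "turning the dial" is a per-rule interval that shrinks when a poll comes back full and grows when it comes back empty. The Python sketch below is only an illustration of that idea, not a description of how the Gnip polling service actually schedules rules. The function name, the halving and doubling policy, the page size, and the 60 second and 20 minute bounds are all assumptions.

```python
def next_interval_seconds(current: float, results_returned: int, page_size: int,
                          fastest: float = 60.0, slowest: float = 1200.0) -> float:
    """Pick the next polling interval for one rule (hypothetical policy).

    A full page suggests there were more posts than one poll could collect,
    so poll twice as often. An empty poll suggests a quiet rule, so back off.
    """
    if results_returned >= page_size:
        proposed = current / 2
    elif results_returned == 0:
        proposed = current * 2
    else:
        proposed = current
    # Clamp to the allowed range so no rule runs too often or too rarely.
    return float(min(slowest, max(fastest, proposed)))


# Starting from the 10 minute (600 s) default, with an assumed page size of 100:
print(next_interval_seconds(600, 100, 100))  # 300.0  (busy rule such as "obama")
print(next_interval_seconds(600, 0, 100))    # 1200.0 (quiet rule)
print(next_interval_seconds(600, 40, 100))   # 600.0  (steady rule)
print(next_interval_seconds(60, 100, 100))   # 60.0   (already at the floor)
```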

For now, the default expectation for our Community Edition platform users should be a 10 minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta. If your project or company needs something a bit more snappy with the data publishers that are polled, then contact us at info@gnip.com or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube

That Twitter Thing

Oh, crap, Eric’s gone and written another long post…

Since we publicly launched Gnip last week, we’ve been asked numerous times if we can integrate with Twitter or somehow help Twitter with the scaling issues they are facing.  We can, but we depend on Twitter giving us access to their XMPP feed.

We are huge fans of Twitter so we’re patiently waiting for that access.  In the meantime, the questions we’ve received have prompted us to explain two things: (1) How we would benefit Twitter and anyone who wants access to Twitter data and (2) Why – if you are a web service – it’s worth integrating now with Gnip rather than waiting either for (a) Gnip to integrate with Twitter or (b) you to get as popular as Twitter and have scale issues.

Let’s address the first issue: How we would benefit Twitter and anyone that wants to integrate with Twitter data.

Twitter has found that XMPP doesn’t scale for them and as a result, people are forced to poll their API *a lot* to get updates for their users.  MyBlogLog has over 25,000 Twitter users that they throw against the Twitter API every 15 minutes.  This results in nearly 2.5 million queries against the API every day, for maybe 250K updates.  Now add millions of pings from Plaxo and SocialThing and Lijit and heaven forbid Yahoo starts beating up their API…
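The MyBlogLog numbers are easy to check. The quick Python calculation below assumes one API query per user per polling cycle and takes the "maybe 250K updates" figure at face value.

```python
users = 25_000               # MyBlogLog's Twitter users
poll_interval_minutes = 15   # how often each user is checked
updates_per_day = 250_000    # rough number of actual updates per day

# Assumption: one API query per user per polling cycle.
polls_per_user_per_day = 24 * 60 // poll_interval_minutes
queries_per_day = users * polls_per_user_per_day
useful_fraction = updates_per_day / queries_per_day

print(polls_per_user_per_day)    # 96
print(queries_per_day)           # 2400000
print(f"{useful_fraction:.1%}")  # 10.4%
```

In other words, roughly nine out of ten of those queries come back with nothing new.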

If Twitter starts pushing updates to us, via our dead simple API or Atom or their XMPP server, we can immediately reduce by an order of magnitude the number of requests that some very large sites are making against their API.  At the same time, we reduce the latency between when someone Tweets and when it shows up on consuming sites like Plaxo.  From 15 minutes or more to 60 seconds or less.

We expect that Twitter has their collective heads down and are working around the clock to buttress their infrastructure, and it’s unlikely that they’re going to do anything optional until that’s sorted out.  Unfortunately, “integrate with Gnip” probably falls into the optional category. We expect, however, that at some point Twitter will start opening up their data to more partners once they feel like they have their arms around their infrastructure.

If you run a web service and integrate with Gnip today, you’ll automatically be able to integrate with Twitter data once they give us access.  Presumably you won’t have to wait in line to get direct Twitter integration.  In addition, you’ll have immediate access to all of the other data providers that we integrate with, such as Delicious, Flickr, Magnolia, Get Satisfaction, Intense Debate and Six Apart.  For example, it only took Brightkite 15 minutes to integrate our API and start pushing data to our partners via us.

Now for the second topic.  Why – if you are a web service – it’s worth integrating with Gnip now rather than waiting either for (a) Gnip to integrate with Twitter or (b) you to get as popular as Twitter and have scale issues.

All things considered, it’s best not to end up in Twitter’s position.  They have a ton of passionate users (I’m one of them) who want reliable service and don’t have infinite patience.  The old startup cliche of “these are problems we’d like to have” is crap.

You don’t want to be in the position where your business suddenly takes off and your infrastructure falls over because people are banging your APIs to death.  You don’t want your most passionate users calling for mass exodus.  It’s better to take a few minutes to start pushing notifications to Gnip now than when you’re doing 20-hour days rebooting servers.

You also don’t want to be in the position where your company takes off and you suddenly get throttled by an API provider.  Nothing is worse than having to pull data sources because you’ve over-polled and the host decides to turn off the spigot.  Start pulling notifications from Gnip and feel secure that you’re only asking for data when there’s something new.

I still use Twitter every day.  Don’t try to kid me; I know you still do too.  Let them get on with their work and rest assured that we’ll integrate with them the instant we get the okay from them.