Pushing and Polling Data Differences in Approach on the Gnip platform

Obviously we have some understanding on the concepts of pushing and polling of data from service endpoints since we basically founded a company on the premise that the world needed a middleware push data service.    Over the last year we have had a lot of success with the push model, but we also learned that for many reasons we also need to work with services via a polling approach.   For this reason our latest v2.1 includes the Gnip Service Polling feature so that we can work with any service using push, poll or a mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away.   Every end user developer or company has the option to tell Gnip where to push data that you have set up filters or have a subscription.   We also realize not everyone has an IT setup to handle push so we have always provided the option for HTTP GET support that lets people grab data from a Gnip generated URL for your filters.

One place where the way Gnip collects data can make a difference, at this time, for our users is the expected latency of data.  Latency here refers to the time between the activity happening (i.e. Bob posted a photo, Susie made a comment, etc) and the time it hits the Gnip platform to be delivered to our awaiting users.     Here are some basic expectation setting thoughts.

PUSH services: When we have push services the latency experience is usually under 60 seconds, but we know that this is not always the case sense sometimes the services can back-up during heavy usage and latency can spike to minutes or even hours.   Still, when the services that push to us are running normal it is reasonable to expect 60 second latency or better and this is consistent for both the Community and Standard Edition of the Gnip platform.

POLLED services:   When Gnip is using our polling service to collect data the latency can vary from service to service based on a few factors

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (i.e. every 10 minutes).     Right now with the Community edition of the Gnip platform we are setting rule execution by default at 10 minute intervals and people need to have this in mind with their expectation for data flow from any given publisher.

Expectations for POLLING in the Community Edition: So I am sure some people who just read the above stopped and said “Why 10 minutes?”  Well we chose to focus on “breadth of data ” as the initial use case for polling.   Also, the 10 minute interval is for the Community edition (aka: the free version).   We have the complete ability to turn the dial and use the smarts built into the polling service feature we can execute the right rules faster (i.e. every 60 seconds or faster for popular terms and every 10, 20, etc minutes or more for less popular ones).    The key issue here is that for very prolific posting people or very common keyword rules (i.e. “obama”, “http”, “google”) there can be more posts that exist in the 10 minute default time-frame then we can collect in a single poll from the service endpoint.

For now the default expectation for our Community edition platform users should be a 10 minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta.    If your project or company needs something a bit more snappy with the data publishers that are polled then contact us at info@gnip.com or contact me directly at shane@gnip.com as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include:  WordPress, Identi.ca, Intense Debate, Twitter, Seesmic,  Digg, and Delicious

Current polled services on the platform include:   Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube

New Gnip Publishers: FriendFeed, YouTube and Hulu

We continue to push out new publishers to the beta http://api.gnip.com environment as we work to finish up the release and get the final touches on lots of new features.

The new publishers this week include the following:

  • FriendFeed-search:  Supports the KEYWORD rule-type and works with the standard FriendFeed Search interface for tracking conversations
  • Hulu: Supports the ACTOR rule-type and works with the standard Hulu interface for tracking conversations
  • Hulu-search: Supports the KEYWORD rule-type and works with the standard Hulu Search interface
  • YouTube: Supports the ACTOR and TAG rule-types and works witih the standard YouTube interface and tracks “uploads”
  • YouTube-search: Supports the KEYWORD rule type and works witih the standard YouTube-search interface

Ok, now go grab some data from these or any of our other now 20+ data publishers in the system.   Or read up on the new features in http://www.gnip.com/docs

Continue reading

Data Standards?

Today’s general data standards are akin to yesterday’s HTML/CSS browser support standards. The first rev of Gecko (not to be confused w/ the original Mosaic/Navigator rendering engine) at Netscape was truly standards compliant in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until “quirks-mode” became status quo. The abyss of broken data that machines, and humans, generate, eclipsed web pages back then, and it’s an ever present issue in the ATOM/RSS/XML available today.

Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. While ugly to us, we normalize data to make our Customers’ lives better. Consumer facing aggregators (Plaxo/FF) beautify the picture for their display layers. Gnip beautifies the picture for it’s data consumption API. Cleaning up the mess that exists on the network today has been an eye opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We’re able to work closely with said producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When, data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling) it’s a rats nest of unstructured, cobbled-together, ill-formated, and poorly-encoded/escaped data.

XML has provided self describing formats and structure, but it ends there. Thousands of pounds of wounded data shows up on Gnip’s doorstep each day, and that’s where Gnip’s normalization heavy lifting work comes into play. I thought I’d share some of the more common bustage we see, along with a little commentary around the category of problem

  • ![CDATA[]] is akin to void* and is way overused. The result is magical custom parsing of something that someone couldn’t fit into some higher-level structure.

    • If you’re back-dooring data/functions into an otherwise “content” payload, you should revisit your overall model. Just like void*, CDATA usually suggests an opaque box you’re trying to jam through the system.
  • Character limited message bodies (e.g. microblogging services) wind up providing data to Gnip that has escaped HTML sequences chopped in half, leaving the data consumer (Gnip in this case) guessing at what to do with a broken encoding. If I give you “&a”, you have to decide whether to consider it literally, expand it to “&”, or to drop it. None of which was intended by the user that generated the original content, they just typed ‘&’ into a text field somewhere.

    • Facebook has taken a swing at how to categorize “body”/”message” sizes which is nice, but clients need to do a better job truncating by taking downstream encoding/decoding/expansion realities into consideration.
  • Document bodies that have been escaped/encoded multiple times, subsequently leave us deciphering how many times to run them through the un-escape/decode channel.

    • _Lazy_. Pay attention to how you’re treating data, and be consistent.
  • Illegal characters in XML attribute/element values.

    • _LAZY_. Pay attention.
  • Custom extensions to “standard” formats (XMPP, RSS, ATOM). You think you’re doing the right thing by “extending” the format to do what you want, but you often wind up throwing a wrench in downstream processing. Widely used libs don’t understand your extensions, and much of the time, the extension wasn’t well constructed to begin with.

    • Sort of akin to CDATA, however, legitimate use cases exist for this. Keep in mind that by doing this, there are many libraries in the ecosystem that will not understand what you’ve done. You have to be confident that your data consumers are something you can control and ensure they’re using a lib/extension that can handle your stuff. Avoid extensions, or if you have to use them, get it right.
  • Namespace case-sensitivity/insensitivity assumptions differ from service to service.

    • Case-sensitivity rules were polluted with the advent of MS-DOS, and have been propagated over the years by end-user expectations. Inconsistency stinks, but this one’s around forever.
  • UTF-8, ASCII encoding bugs/misuse/misunderstanding. Often data claims to be encoded one way, when in fact it was encoded differently.

    • Understand your tool chain, and who’s modifying what, and when. Ensure consistency from top to bottom. Take the time to get it right.
  • UTF-16… don’t go there.

    • uh huh.
  • Libraries in the field to handle all of the above each make their own inconsistent assumptions.

    • It’s conceivable to me that Gnip winds up stating the art in XML processing libs, whether by doing it ourselves, or contributing to existing code trees. Lots of good work out there, none of it great.

You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.

If everyone would clean up their data by the end of the day, that’d be great. Thanks.

More Examples of How Companies are Using Gnip

We have noticed that we are interacting with two distinct groups of companies; those who instantly understand what Gnip does and those that struggle with what we do, so we decided to provide a few detailed real-world examples of the companies we are actively working with to provide data integration and messaging services today.

First, we are not an end-user facing social aggregation application. (We repeat this often.) We see a lot of people wanting to put Gnip in that bucket along with social content aggregators like FriendFeed, Plaxo and many others. These content aggregators are destination web sites that provide utility to end users by giving them flexibility to bring their social graph or part of their graph together in one place. Also, many of these services are now providing web APIs that allow people to use an alternative client to interact with their core services around status updates and conversations as well other features specific to the service.

Gnip is an infrastructure service and specifically we provide an extensible messaging system that allows companies to more easily access, filter and integrate data from web based APIs. While someone could use Gnip as a way to bring content into a personal social media client they want to write for a specific social aggregator it is not something we are focused. Below are the company use cases we are focused:

  1. Social content aggregators: One of the main reasons we started Gnip was to solve the problems being caused by the point-to-point integration issues that were springing up with the increase of user generated content and corresponding open web APIs. We believe that any developer who has written a poller once, twice, or to their nth API will tell you how unproductive it is to write and maintain this code. However, writing one-off pollers has become a necessary evil for many companies since the content aggregators need to provide access to as many external services as possible for their end users. Plaxo, who recently integrated to Gnip as a way to support their Plaxo Pulse feature is a perfect example, as are several other companies.
  2. Business specific applications: Another main reason we started Gnip was that we believe more and more companies are seeing the value of integrating business and social data as a way to add additional compelling value to their own applications. There are a very wide set of examples, such as how Eventvue uses Gnip as a way to integrate Twitter streams into their online conference community solution, and the companies we have talked to about how they can use Gnip to integrate web-based data to power everything from sales dashboards to customer service portals.
  3. Content producers: Today, Gnip offers value to content producers by providing developers an alternative tool that can be used to integrate to their web APIs. We are working with many producers, such as Digg, Delicious, Identi.ca, and Twitter, and plan to continue to grow the producers available aggressively. The benefits that producers see from working with Gnip include off-loading direct traffic to their web apis as well as providing another channel to make their content available. We are also working very hard to add new capabilities for producers, which includes plans to provide more detailed analytics on how their data is consumed and evaluating publishing features that could allow producers to define their own filters and target service endpoints and web sites where they want to push relevant data for their own business needs.
  4. Market and brand research companies: We are working with several companies that provide market research and brand analysis. These companies see Gnip as an easy way to aggregate social media data to be included in their brand and market analysis client services.

Hopefully this set of company profiles helps provide more context on the areas we are focused and the typical companies we are working with everyday. If your company does something that does not fit in these four areas and is using our services please send me a note.

Garbage In, Garbage Out

Gnip is an intermediary service for message flow across disparate network endpoints. Standing in the middle allows for a variety of value adds (Data Producers can “publish once, distribute to many,” Data Consumers can enjoy single service interaction rather than one-off’ing over and over again), but the quality of data that Data Producers push into the system is fundamental.

Only As Good As The Sum Of Our Parts

Gnip doesn’t control the quality of the data being published to it. Whether it comes in the form of XMPP messages, RSS, or ATOM, there are many issues that can come into play that can affect the data a Data Consumer receives.

  • Bad transport/delivery – The source XMPP, RSS, ATOM, or REST, feed can go down. When this happens for a given Publisher, that source has vanished and Gnip doesn’t receive messages for that Publisher. We’re only as good as the data coming in. While Gnip can consume data from XMPP, RSS, ATOM, and other sources, our preferred inbound message delivery method is via our REST API. Firing off messages to Gnip directly, and not through yet another layer, minimizes delivery issues.
  • Bad data – As any aggregator (Friend Feed, Social Thing, MoveableType Activity Streams…) can attest, the data coming across XMPP, RSS, and ATOM feeds today is a mess. From bad/illegal formatting, to bad/illegal data escaping, nearly every activity feed has unique issues that have to be handled on a case by case basis. There will be bugs. We will fix them as they arise. Once again, these issues can be minimized if Data Producers deliver messages directly to Gnip via our REST API.
  • Bad policy – This one’s interesting. Gnip makes certain assumptions about the kind of data it receives. In our current implementation we advertise to Data Consumers that Data Producers push all public, per user, change notifications generated within their systems, to Gnip. This usually corresponds to the existing public API policies for said Data Producers. We will eventually offer finely tuned, Data Producer controlled, data policies, but for today’s public facing Gnip service, we do not want to see Data Producers creating publishing policies specific to Gnip. Doing so confuses the middle-ware dynamic we’re trying to create with our current product, and subsequently muddies the water for everyone. Imagine a Data Consumer interacting with a Data Producer directly under one policy, then interacting with Gnip under another policy; confusing. Again, we will, perhaps earlier than we think, cater to unique data policies on a per Data Producer basis, but, we’re not there yet.

While addressing all of these issues is part of our vision, they’re not all resolved out of the gate.

The WHAT of Gnip: Changing APIs from Pull to Push

A few months ago a handful of folks came together and took a practical look at the state of “web services” on the network today. As an industry we’ve enjoyed the explosion of web APIs over the past several years, but it’s been “every man for himself,” and we’ve been left with hundreds of web APIs being consumed in random ways (random protocols and formats). There have been a few cracks at standardizing some of this, but most have been left in spec form with, at best, fragmented implementations, and most have been too high level to provide anything more than good bedtime reading. We set out to build something; not write a story.

For a great overview of the situation Gnip is plunging into, checkout Nik Cubrilovic’s post on techcrunchIT; “The New Datastream Aggregators, FriendFeed and Standards.”.

Our first service is the culmination of lots of work by smart, pragmatic, people. From day one we’ve had excellent partners helping us along the way; from early integrations with our API, to discussing specifications and standards to follow (or not to follow; what you chose not to do is often more important than what you chose to do). While we aspire to solve all of the challenges in the data portability space, we’re a small team biting off small chunks along a path. We are going to need the support, feedback, and assistance of the broader data portability (formal & informal) community in order to succeed. Now that we’ve finally launched, we’ll be in “release early, release often” mode to ensure tight feedback loops around our products.

Enough; what did we build!?!

For those who want to cut to the chase, here’s our API doc.

We built a system that connects Data Consumers to Data Publishers in a low-latency, highly-scalable standards-based way. Data can be pushed or pulled into Gnip (via XMPP, Atom, RSS, REST) and it can be pushed or pulled out of Gnip (currently only via REST, but the rest to follow). This release of Gnip is focused on propagating user generated activity events from point A to point B. Activity XML provides a terse format for Data Publishers to distribute their user’s activities. Collections XML provides a simple way for Data Consumers to only receive information about the users they care about. This release is about “change notification,” and a subsequent release will include the actual data along with the event.

 

As a Consumer, whether your application model is event- or polling-based Gnip can get you near-realtime activity information about the users you care about. Our goal is a maximum 60 second latency for any activity that occurs on the network. While the time our service implementation takes to drive activities from end to end is measured in milliseconds, we need some room to breathe.

Data can come in to Gnip via many formats, but it is XSLT’d into a normalized Activity XML format which makes consuming activity events (e.g. “Joe dugg a news story at 10am”) from a wide array of Publishers a breeze. Along the way we started cringing at the verb/activity overlap between various Publishers; did Jane “tweet” or “post”, they’re kinda the same thing? After sitting down with Chris Messina, it became clear that everyone else was cringing too. A verb/activity normalization table has been started, and Gnip is going to distill the cornucopia of activities into a common, community derived, format in order to make consumption even easier.

Data Publishers now have a central clearinghouse to push data when events on their services occur. Gnip manages the relationship with Data Consumers, and figures out which protocols and formats they want to play with. It will take awhile for the system to reach equilibrium with Gnip, but once it does, API balance will be reached; Publishers will notify Gnip when things happen, and Gnip will fan-out those events to an arbitrary number of Consumers in real-time (no throttling, no rate limiting).

Gnip is centralized. After much consternation, we resolved to start out with a centralized model. Not necessarily because we think it’s the best path, but because it is the best path to get something started. Imagine the internet as a clustered application; decentralization is fundamental (DNS comes to mind). That said, we needed a starting point and now we have one. A conversation with Chris Saad highlighted some work Paul Jones (among others) had done around a standard mechanism for change notification discovery and subscription; getpingd. Getpingd describes a mechanism for distributed change notification. The Subscription side of getpingd feels like a no-brainer for Gnip to support, but I’m not sure how to consider the Discovery end of it. In some sense, I see Gnip (assuming getpingd’s discovery model is implemented) as a getpingd node in the graph. We have lots to consider in the federated/distributed model.

Gnip is a classic chicken-and-egg scenario, we need Publishers & Consumers to be interesting. If your service produces events that you want others on the network to consume, we’d love to see you as a Publisher in Gnip; pushing events into the system for wide consumption. If your service relies on events created by users on other applications, we’d love to see you as a Consumer in Gnip.

We’ve started out with convenience libraries for perl, php, java, python, and ruby. Rather than maintain these ourselves, we plan on publishing them to respective language community code sites/repositories.

That’s what we’ve built in a nutshell. I’ll soon blog about exactly how we’ve built it.

The WHY of Gnip: Stop Building What Everyone Else is Building

Let me say this up front:

I have a tendency to ramble. Why use a sentence when a paragraph will suffice, right? As a result, I limit myself to 100 word posts on my sporadically updated personal blog. I’ll follow suit here, with only occasional excursions into longer territory. This is one such post.

I’ll try not to ramble too much…

Data portability, the ability to create content on one web site and derive value from it on other sites and applications, has become one of the defining characteristics of what is commonly referred to as “Web 2.0″. An emerging class of services are taking advantage of this data to create entirely new products, including social aggregators (Plaxo Pulse, MyBlogLog, FriendFeed), social search (Lijit, Delver) and communications dashboards (Fuser, Orgoo, Digsby). Each of these services is predicated on the belief that user-generated content is the raw material upon which great companies can be built.

Data portability, via RSS or ATOM or XMPP or open APIs is neither difficult nor complex. These are known problems with straightforward solutions and open standards. But each connection between two services (e.g. MyBlogLog and Flickr or Plaxo and Digg) is a custom integration, requiring at least one of the parties to set up a custom channel to access, process and ultimately make use of the transferred data. As companies seek to create robust solutions built upon dozens or even hundreds of data feeds, engineers face an exponentially growing problem of building and maintaining these custom communication channels. Simply put, data portability is a big hassle.

Crucially, data portability has become the cost of entry for these services. It is not enough for a social aggregator to claim the most sources or a social search company the biggest pool of data. The leaders in this space are focused on filtering and presenting data in useful ways; out of a billion pieces of data, they seek to connect you with the appropriate information at the appropriate time. All of the work building and maintaining back-end data portability services comes at the cost of building better front-end features that draw and satisfy users.

That’s where Gnip comes in. We’re dedicated to making data portability suck less, by reducing the effort required to collect and manage the data upon which these awesome new services are being created. Gnip aims to simplify the process of aggregating, standardizing and maintaining large pools of data, ultimately making he process as simple as uploading a list of your users.

Our first service is a solution to a key problem facing data portability implementations (Jud will give you the details in just a moment). We at Gnip believe in direct solutions to painful problems, and as a result, our first service isn’t fancy. But it’s quick to integrate, it scales like a monster and it uses a variety of web standards; we believe we’ve solved this particular problem pretty well. Over the coming months we’ll roll out additional direct solutions to painful problems, and before long we’ll have a bona fide platform for pushing data around the web.

We’re incredibly excited by the bounty that Web 2.0 has created. We are living with an embarrassment of riches in terms of shared information and experiences. But it’s overwhelming. I personally believe that Web 3.0 will herald a return to the individual — story, picture, friend, experience — because in aggregate, that which has great meaning often becomes meaningless. So it’s up to these awesome new services to take the Web 2.0 bounty and find for each of us those few things that will fundamentally enhance our lives. To give us something meaningful.

I hope that we at Gnip can build a foundation that enables these awesome new services to focus all of their attention on making great things. We’ll happily lay plumbing, mix concrete and smelt tin to see that happen.