Official Google Buzz Firehose Added to Gnip’s Social Media API

Today we’re excited to announce the integration of the Google Buzz firehose into Gnip’s social media data offering. Google Buzz data has been available via Gnip for some time, but today Gnip became one of the first official providers of the Google Buzz firehose.

The Google Buzz firehose is a stream of all public Buzz posts (excluding Twitter tweets) from all Google Buzz users. If you’re interested in the Google Buzz firehose, here are some things to know:

  • Google delivers it via Pubsubhubbub. If you don’t want to consume it via Pubsubhubbub, Gnip makes it available in any of our supported delivery methods: Polling (HTTP GET), Streaming HTTP (Comet), or Outbound HTTP POST (webhooks).
  • The format of the Firehose is XML Activity Streams. Gnip loves Activity Streams and we’re excited to see Google continue to push this standard forward.
  • Google Buzz activities are geo-enabled. If the end user attaches a geolocation to a Buzz post (either from a mobile Google Buzz client or through an import from another geo-enabled service), that location will be included in the Buzz activity.

We’re excited to bring the Google Buzz firehose to the Social Media Monitoring and Business Intelligence community through the power of the Gnip platform.

Here’s how to access the Google Buzz firehose. If you’re already a Gnip customer, just log in to your Gnip account and with 3 clicks you can have the Buzz firehose flowing into your system. If you’re not yet using Gnip and you’d like to try out the Buzz firehose to get a sense of volume, latency, and other key metrics, grab a free 3-day trial at http://try.gnip.com and check it out along with the 100 or so other feeds available through Gnip’s social media API.

Expanding Gnip's Facebook Graph API Support

One of our most requested features has long been Facebook support. While customers have had beta access for a while now, today we’re officially announcing support for several new Facebook Graph API feeds. As with the other feeds available through Gnip, Facebook data is available in Activity Streams format (as well as the original format, if you so desire), and you can choose your own delivery method (polling, webhook POSTing, or streaming). Gnip integrates with Facebook on your behalf, in a fully transparent manner, in order to feed you the Facebook data you’ve been longing for.

As with most services, Facebook’s APIs are in constant flux. Integrating with Gnip shields you from the ever-shifting sands of service integration: you don’t have to worry about authentication implementation changes or delivery method shifts.

Use-case Highlight

Discovery is hard. If you’re monitoring a brand or keyword for popularity or sentiment (positive or negative), it’s challenging to keep track of fan pages that crop up without notice. With Gnip, you can receive real-time notification when one of your search terms is found within a fan page. Discover when a community is forming around a given topic, product, or brand before others do.

We currently support the following endpoints, and will be adding more based on customer demand.

  • Keyword Search – Search over all public objects in the Facebook social graph.
  • Lookup Fan Pages by Keyword – Look up IDs for Fan Pages with titles containing your search terms.
  • Fan Page Feed (with or without comments) – Receive wall posts from a list of Facebook Fan Pages you define.
  • Fan Page Posts (by page owner, without comments) – Receive wall posts from a list of Facebook Fan Pages you define. Only shows wall posts made by the page owner.
  • Fan Page Photos (without comments) – Get photos for a list of Facebook Fan Pages.
  • Fan Page Info – Get information including fan count, mission, and products for a list of Fan Pages.

Give Facebook via Gnip a try (http://try.gnip.com), and let us know what you think at info@gnip.com.

Swiss Army Knives: cURL & tidy

Iterating quickly is what makes modern software initiatives work, and the mantra applies to everything in the stack. From planning your work to building it, things have to move fast, and feedback loops need to be short and sweet. In the realm of REST[-like] API integration, writing an application just to visually validate the API you’re interacting with is overkill. At the end of the day, web services boil down to HTTP requests, which can be tested rapidly with a tight little application called cURL. You can test just about anything with cURL (yes, including HTTP streaming/Comet/long-poll interactions), and its configurability is endless. You’ll have to read the man page to get all the bells and whistles, but I’ll provide a few samples of common Gnip use cases here. At the end of this post I’ll clue you in to cURL’s indispensable cohort in web service slaying, ‘tidy.’

cURL power

cURL can generate custom HTTP client requests with any HTTP method you’d like. ProTip: the biggest gotcha I’ve seen trip up most people is leaving the URL unquoted. Many URLs don’t need quotes when being fed to cURL, but many do, and you should just get in the habit of quoting every one; otherwise you’ll spend far too long debugging what turns out to be driver error. There are tons of great cURL tutorials out on the network; I won’t try to recreate those here.
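
To make the gotcha concrete, here’s a sketch with a hypothetical URL: unquoted, the shell interprets the ‘&’ and backgrounds the command with a truncated URL; quoted, the full URL reaches cURL.

curl -v http://blah.com/cool/api?tag=flower&count=10     # broken: the shell eats everything after '&'
curl -v "http://blah.com/cool/api?tag=flower&count=10"   # quoted: cURL sees the whole URL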

POSTing

Some APIs want data POSTed to them. There are two forms of this.

Inline

curl -v -d "some=data" "http://blah.com/cool/api"

From File

curl -v -d @filename "http://blah.com/cool/api"

In either case, cURL defaults the Content-Type header to the ubiquitous “application/x-www-form-urlencoded”. While that default is often the correct thing to do, there are a couple of things to keep in mind: first, it assumes that the data you’re inlining, or that sits in your file, is actually formatted that way (e.g. key=value pairs); second, when the API you’re working with does NOT want data in that format, you need to explicitly override the Content-Type header, like so.

curl -v -d "someotherkindofdata" "http://blah.com/cool/api" --header "Content-Type: foo"

Authentication

Passing HTTP-basic authentication credentials along is easy.

curl -v -uUSERNAME[:PASSWORD] "http://blah.com/cool/api"

You can inline the password, but keep in mind your password will be cached in your shell history logs.
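
If you’d rather keep credentials out of your shell history entirely, cURL can also read them from a ~/.netrc file; a minimal sketch with placeholder credentials:

echo "machine blah.com login USERNAME password PASSWORD" >> ~/.netrc
chmod 600 ~/.netrc
curl -v -n "http://blah.com/cool/api"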

Show Me Everything

You’ll notice I’m using the “-v” option on all of my requests. “-v” lets me see all of the HTTP-level interaction (method, headers, etc.), with the exception of the request POST body, and that visibility is crucial for debugging interaction issues. You’ll also need “-v” to watch streaming data fly by.
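
When you’re watching a streaming endpoint, it can also help to add “-N” so cURL doesn’t buffer its output and activities show up the moment they arrive; a sketch with a hypothetical streaming URL:

curl -v -N -uUSERNAME "http://blah.com/cool/streaming/api"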

Crossing the Streams (cURL + tidy)

Most web services these days spew XML-formatted data, and it is often not whitespace-formatted in a way a human can read easily. Enter tidy. If you pipe your cURL output to tidy, all of life’s problems will melt away like a fallen ice-cream scoop on a hot summer sidewalk.

cURL’d web service API without tidy

curl -v "http://rss.clipmarks.com/tags/flower/"
...
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?><rss versi
on="2.0"><channel><title>Clipmarks | Flower Clips</title><link>http://clipmarks.com/tags/flower/</link><feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl><ttl>15</ttl
><description>Clip, tag and save information that's important to you. Bookmarks save entire pages...Clipmarks save the specific content that matters to you!</description><
language>en-us</language><item><title>Flower Shop in Parsippany NJ</title><link>http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link><description>&lt;
b&gt;clipped by:&lt;/b&gt; &lt;a href="http://clipmarks.com/clipper/dunguschariang/"&gt;dunguschariang&lt;/a&gt;&lt;br&gt;&lt;b&gt;clipper's remarks:&lt;/b&gt;  Send Dishg
ardens in New Jersey, NJ with the top rated FTD florist in Parsippany Avas specializes in Fruit Baskets, Gourmet Baskets, Dishgardens and Floral Arrangments for every Holi
day. Family Owned and Opperated for over 30 years. &lt;br&gt;&lt;div border="2" style="margin-top: 10px; border:#000000 1px solid;" width="90%"&gt;&lt;div style="backgroun
d-color:"&gt;&lt;div align="center" width="100%" style="padding:4px;margin-bottom:4px;background-color:#666666;overflow:hidden;"&gt;&lt;span style="color:#FFFFFF;f
...

cURL’d web service API with tidy

curl -v "http://rss.clipmarks.com/tags/flower/" | tidy -xml -utf8 -i
...
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?>
<rss version="2.0">
   <channel>
     <title>Clipmarks | Flower Clips</title>
     <link>http://clipmarks.com/tags/flower/</link>
     <feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl>
     <ttl>15</ttl>
     <description>Clip, tag and save information that's important to
       you. Bookmarks save entire pages...Clipmarks save the specific
       content that matters to you!</description>
     <language>en-us</language>
     <item>
       <title>Flower Shop in Parsippany NJ</title>
       <link>

http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link>

       <description>&lt;b&gt;clipped by:&lt;/b&gt; &lt;a
...

I know which one you’d prefer. So what’s going on? We’re piping the output to tidy and telling tidy to treat the document as XML (use XML structural parsing rules), to treat encodings as UTF-8 (so it doesn’t barf on non-Latin character sets), and finally “-i” indicates that you want it indented (pretty-printed, essentially).

Right Tools for the Job

If you spend a lot of time whacking through the web service API forest, be sure you have a sharp machete. cURL and tidy make for a very sharp machete. Test driving a web service API before you start laying down code is essential. These tools allow you to create tight feedback loops at the integration level before you lay any code down, saving everyone time, energy, and money.

Of Client-Server Communication

We’ve recently been having some interesting conversations, both internally and with customers, about the challenges inherent in client-server software interaction, aka Web Services or Web APIs. The relatively baked state of web browsers and servers has shielded us from most of the issues that come with getting computers to talk to other computers.

It didn’t happen overnight, but today’s web browsing world rides on top of a well-vetted pipeline of technology to give us good browsing (client-side) experiences. However, there are a lot of assumptions and moving parts behind our browser windows that get uncovered when working with web services (servers). Unfortunately, there are skeletons in the closet.

End-users’ web browsing demands eventually forced ports 80 and 443 (SSL) open across all firewalls and ISPs, and we now take their availability for granted. When was the last time you heard someone ask “is port 80 open?” It’s probably been a while. By 2000, server-side HTTP implementations (web servers) had started solidifying, and at the HTTP level there was relatively little incompatibility between the client and server tiers. Expectations around socket timeouts and HTTP protocol exchanges were clear, and both sides of the connection adhered to those expectations.

Enter the world of web-services/APIs.

We’ve been enjoying the stable client-server interaction that web browsing has provided over the past 15 years, but web services/APIs thrust the ugly realities that lurk beneath into view. When we access modern web services through lower-level software (e.g. something other than the browser), we have to make assumptions and implementation/configuration choices that the browser otherwise makes for us. Among them (a cURL sketch after the list makes several of these choices explicit)…

  • port to use for the socket connection
    • the browser assumes you always want port 80 (HTTP) or 443 (HTTPS)
    • the browser provides built-in encryption handling of HTTPS/SSL
  • URL parsing
    • the browser uses static rules for interpreting and parsing URLs
  • HTTP request methods
    • browsers inherently know when to use GET vs. POST
  • HTTP POST bodies.
    • browsers pre-define how POST bodies are structured, and never deviate from this methodology
  • HTTP header negotiation (this is the big one).
    • browsers handle all of the following scenarios out-of-the-box
    • Request
      • compression support (e.g. gzip)
      • connection duration types (e.g. keep-alive)
      • authentication (basic/oauth/other)
      • user-agent specification
    • Response
      • chunked responses
      • content types: the browser has a pre-defined set of content types that it knows how to handle internally
      • content encoding: the browser knows how to handle various encoding types (e.g. gzip compression), and does so by default
      • authentication (basic/oauth/other)
  • HTTP Response body formats/character sets/encodings
    • browsers juggle the combination between content-encoding, content-type, and charset handling to ensure their international audience can see the information as its author intended.
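
To make the contrast concrete, here’s a hedged cURL sketch (hypothetical URL and credentials) that spells out, on a single request, several of the choices a browser would otherwise make silently:

# --compressed: request gzip and transparently decode it (content-encoding)
# -H "Accept: ...": content-type negotiation
# -A: user-agent specification
# -u: HTTP-basic authentication
curl -v --compressed -H "Accept: application/xml" -A "my-client/1.0" -uUSERNAME "https://blah.com/cool/api"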

Web browsers have the luxury of being able to lock down all of the above variables and not worry about changes in these assumptions. Having built browsers (Netscape/Firefox) for a living in the past, I can say it’s still a very difficult task, but at least the problem is constrained (e.g. ensure the end user can view the content within the browser). Web service consumers have to understand, and make decisions around, each of those points. Getting just one of them wrong can lead to issues in your application, ranging from connectivity and content handling to service authentication, and can lead to long guessing games of “what went wrong?”

To further complicate the API interaction pipeline, many IT departments prevent abnormal connection activity from occurring. This means that while your application may be “doing the right thing” (TM), a system that sits between your application and the API with which it is trying to interact may prevent the exchange from occurring as you intended.

What To Do?

First off, you need to be versed in more than just the documentation of the API you’re trying to use. Documentation is often outdated, doesn’t reflect actual implementations, and doesn’t account for the bugs and behavioral nuances inherent in any API, so you also need to engage with its developer community/forums. From there, you need to ensure your HTTP client accounts for the assumptions I outline above and adheres to the API you’re interacting with. If you’re experiencing issues, you’ll need to confirm that your code is establishing the connection successfully, receiving the data it’s expecting, and parsing that data correctly. Never underestimate using a packet sniffer to view the raw HTTP exchange between your client and the server; debugging HTTP libraries at the code level (even with logging) often doesn’t yield the truth about what’s actually being sent and received.
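
If you just need to see what cURL itself is putting on the wire (including the request body that “-v” hides), “--trace-ascii” is a lighter-weight alternative to a full packet sniffer; for traffic generated by your own application, something like tcpdump does the job. A sketch with a hypothetical host:

curl --trace-ascii - -d "some=data" "http://blah.com/cool/api"
# note: a sniffer only shows plaintext HTTP; for HTTPS you'll need client-side tracing instead
sudo tcpdump -A -s 0 'tcp port 80 and host blah.com'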

The Power of cURL

This is an entire blog post in and of itself, but the Swiss Army knife of any web service developer is cURL. In the right hands, cURL allows you to easily construct HTTP requests to test interaction with a web service. Don’t underestimate the work of translating your cURL test into your own software, however.

So You Want Some Social Data

If your product or service needs social data in today’s API marketplace, there are a few things you need to consider in order to most effectively consume said data.


I need all the data

First, you should double-check your needs. Data consumers often think they need “all the data,” when in fact they don’t. You may need “all the data” for a given set of entities (e.g. keywords, or users) on a particular service, but don’t confuse that with needing “all the data” a service generates. When it comes to high-volume services (such as Twitter), consuming “all of the data” actually amounts to a resource-intensive engineering exercise on your end. There are often non-trivial scaling challenges involved in handling large data sets. Do some math and determine whether or not statistical sampling will give you all you need; the answer is usually “yes.” If the answer is “no,” be ready for an uphill (technical, financial, or business model) battle with service providers; they don’t necessarily want all of their data floating around out there.

Social data APIs are generally designed to prohibit access to “all of the data,” either technically or through terms-of-service agreements. However, they usually provide great access to narrow sets of data. Consider whether you need “100% of the data” for a relatively narrow slice of information; most social data APIs support this use case quite well.


Ingestion


Connectivity

There are three general styles you’ll wind up using to access an API, all of them HTTP-based: event-driven inbound POST (e.g. PubSubHubbub/WebHooks), polling via GET, or streaming via GET/POST. Each of these has its pros and cons. I’m avoiding XMPP in this post only because it is infrequently used and hasn’t seen widespread adoption (yet). Each style requires a different level of operational and programmatic understanding.
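
A hedged sketch of what each style looks like from the command line, with hypothetical URLs and a hypothetical payload file (the webhook case is shown from the provider’s side, i.e. simulating the POST your endpoint would receive):

# Polling: issue a GET on an interval and de-duplicate against what you've already seen
curl -v "http://blah.com/cool/api/activities"

# Streaming: hold a single long-lived connection open and read activities as they arrive
curl -v -N "http://blah.com/cool/api/stream"

# Webhooks/PubSubHubbub: the provider POSTs to a URL you host; you can simulate that delivery
curl -v -d @sample_activity.xml -H "Content-Type: application/xml" "http://your-app.example.com/inbound"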


Authentication/Authorization

APIs usually have publicly available versions (usually limited in their capabilities), as well as versions that require registration and subsequent authenticated connections. The authC and authZ semantics around APIs range from simple to complex. You’ll need to understand the access characteristics of the specific services you want to reach. Some require hands-on, human, authorization-level justification processes to be followed in order to have the “right level of access” granted to you and your product. Others are simple automated online registration forms that directly yield the account credentials necessary for API access.

HTTP-basic authentication, not surprisingly, is the predominant authentication scheme in use, with authorization levels conveniently tied to the account by the service provider. OAuth (proper and two-legged) is gaining steam, however. You’ll also find that API keys (passed as URL parameters or HTTP headers) are still widely used.
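
Concretely, the common schemes look something like this from cURL’s perspective (hypothetical URLs, parameter names, and header names; OAuth request signing is involved enough that you’ll generally want a library rather than a hand-rolled request):

# HTTP-basic authentication
curl -v -uUSERNAME "http://blah.com/cool/api"

# API key as a URL parameter
curl -v "http://blah.com/cool/api?api_key=YOUR_KEY"

# API key in an HTTP header
curl -v -H "X-Api-Key: YOUR_KEY" "http://blah.com/cool/api"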


Processing

How you process data once you receive it is certainly affected by which connection style you use. Note that most APIs don’t give you an option in how you connect to them; the provider decides for you. Processing data in the same step as receiving it can cause bottlenecks in your system, and ultimately put you on bad terms with the API provider you’re connecting to. An analogy would be drinking from the proverbial firehose: if you connect the firehose to your mouth, you might get a gulp or two down before you’re overwhelmed by the amount of water actually coming at you. You’ll either cause the firehose to back up on you, or you’ll start leaking water all over the place. Either way, you won’t be able to keep up. If your average ability to process data is slower than the rate at which it arrives, you’ll have a queueing challenge to contend with. Consider offline, or out-of-band, processing of data as it becomes available: for example, write it to disk or a database and have parallelized worker threads/processes parse and handle it from there. The point is, in this case, don’t process it in the moment.

Many APIs don’t produce enough data to warrant out-of-band processing, so inline processing is often just fine. It all depends on what operations you’re trying to perform, the speed at which your technology stack can accomplish those operations, and the rate at which data arrives.
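
As a rough shell-level sketch of the out-of-band pattern (hypothetical URL and handler script, and assuming one activity per line purely for illustration), one process does nothing but write the stream to disk while a separate worker consumes the spool at its own pace:

# Receiver: append the stream to a spool file as fast as it arrives
curl -sN -uUSERNAME "http://blah.com/cool/api/stream" >> /var/spool/activities.log &

# Worker: follow the spool and process each activity independently of the receive rate
tail -F /var/spool/activities.log | while read -r activity; do
  ./process_activity.sh "$activity"
done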


Reporting

If you don’t care about reporting initially, you will in short order. How much data are you receiving? What are the peak volume periods? Which of the things you’re looking for are generating the most results?

API integrations inherently bind your software to someone else’s. Understanding how that relationship is functioning at any given moment is crucial to your day-to-day operations.


Monitoring

Reporting’s close sibling is monitoring. Understanding when an integration has gone south is just as important as knowing when your product is having issues; they’re one and the same. Integrating with an API means you’re dependent on someone else’s software, and that software can have any number of issues. From bugs to planned upgrades or API changes, you’ll need to know when things change, and take appropriate action.
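
Even a trivial probe run from cron can catch an integration that has gone south; a minimal sketch with a hypothetical endpoint and alert address (and assuming a local “mail” command is available):

# Alert if the API stops answering with a 200 within ten seconds
status=$(curl -s -o /dev/null -w "%{http_code}" -m 10 "http://blah.com/cool/api")
if [ "$status" != "200" ]; then
  echo "API returned ${status} at $(date)" | mail -s "integration alert" ops@example.com
fi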


Web services/APIs are usually incredibly easy to “sample,” but truly integrating and operationalizing them is another, more challenging, process.

Real-Time Event Notification via Gnip

PubSubHubbub and rssCloud are helping shed light on the technical solution to real-time event propagation: HTTP POST (aka webhooks). As a friendly reminder, if you’re building PubSubHubbub and/or rssCloud into your app as a publisher/author, you should consider pushing to Gnip as well. While Gnip, PubSubHubbub, and rssCloud all provide sound technical solutions to a huge problem, Gnip’s widespread adoption (thousands of existing subscribers) can get your events in front of a consumer base that Gnip has spent over a year cultivating. With very little integration work on your part (heck, we have a half-dozen convenience libs already built for you to use; pick your language), you can get your data out to a wide audience of existing Gnip subscribers.

Numbers + Architecture

We’ve been busy over the past several months working hard on what we consider a fundamental piece of infrastructure that the network has been lacking for quite some time. From “ping server for APIs” to “message bus,” we’ve been called a lot of things, and we are actually all of them rolled into one. I want to provide some insight into what our backend architecture looks like, as systems like this generally don’t get a lot of fanfare; they just have to “work.” Another title for this blog post could have been “The Glamorous Life of a Plumbing Company.”

First, some production numbers.

  • 99.9%: the Gnip service has 99.9% up-time.
  • 0: we have had zero Amazon EC2 instances fail.
  • 10: ten EC2 instances, of various sizes, run the core, redundant, message bus infrastructure.
  • 2.5m: 2.5 million unique activities are HTTP POSTed (pushed) into Gnip’s Publisher front door each day.
  • 2.8m: 2.8 million activities are HTTP POSTed (pushed) out Gnip’s Consumer back door each day.
  • 2.4m: 2.4 million activities are HTTP GETed (polled) from Gnip’s Consumer back door each day.
  • $0: no money has been spent on framework licenses (unless you include “AWS”).

Second, our approach.

Simplicity wins. These production transaction rate numbers, while solid, are not earth-shattering. We have, however, achieved much higher rates in load tests. We optimized for activity retrieval (outbound) as opposed to delivery into Gnip (inbound). That means every outbound POST/GET is moving static data off of disk; no math gets done. Every inbound activity results in processing to ensure proper filtration and distribution; we do the “hard” work when activities are delivered into Gnip.

We view our core system as handling ephemeral data. This has allowed us, thus far, to avoid having a database in the environment, which means we don’t have to deal with traditional database bottlenecks. To be sure, we have other challenges as a result, but we decided to take those on rather than have the “database maintenance and administration” ball and chain perpetually attached. So, in order to share contentious state across multiple VMs, across multiple machine instances, we use shared memory in the form of TerraCotta. I’d say TerraCotta is “easy” for “simple” apps, but challenges emerge when you start dealing with very large data sets in memory (multiple gigabytes). We’re investing real energy in tuning our object graph, access patterns, and object types to keep things working as Gnip usage increases. For example, we’re in the midst of experimenting with pageable TerraCotta structures that ensure smaller chunks of memory can be paged into “cold” nodes.

When I look at the architecture we started with, compared to where we are now, there are no radical changes. We chose to start clustered, so we could easily add capacity later, and that has worked really well. We’ve had to tune things along the way (split various processes to their own nodes when CPU contention got too high, adjust object graphs to optimize for shared memory models, adjust HTTP timeout settings, and the like), but our core has held strong.

Our Stack

  • nginx – HTTP server, load balancing
  • JRE 1.6 – Core logic, REST Interface
  • TerraCotta – shared memory for clustering/redundancy
  • ejabberd – inbound XMPP server
  • Ruby – data importing, cluster management
  • Python – data importing

High-Level Core Diagram

Gnip Core Architecture Diagram

Gnip owes all of this to our team & our customers; thanks!

Software Evolution

Those of us who have been around for a while constantly joke about how “I remember building that 10 years ago” every time some big “new” trend emerges. It’s always a lesson in market readiness and timing for a given idea. The flurry around Google Chrome has rekindled the conversation around distributed apps. Most folks are tied up in the concept of a “new browser,” but Chrome is actually another crack at the age-old “distributed/server-side application” problem, albeit apparently a good one. The real news in Chrome (I’ll avoid the V8 vs. TraceMonkey conversation for now) is native Google Gears support.

My favorite kind of technology is the kind that quietly gets built, then one day you wake up and it’s changed everything. Google Gears has that potential and if Chrome winds up with meaningful distribution (or Firefox adopts Gears) web apps as we know them will finally have mark-up-level access to local resources (read “offline functionality”). This kind of evolution is long overdue.

Another component lacking on the network is the age-old, CS-101 notion of event-driven architectures. HTTP GET dominates web traffic, and poor ol’ HTTP POST is rarely used. Publish-and-subscribe models are all but unused on the network today, and Gnip aims to change that. We see a world that is PUSH driven rather than PULL. The web has come a looooong way on GET, but apps are desperate for traditional flow paradigms such as local processor event loops. Our goal is to do this in a protocol-agnostic manner (e.g. REST/HTTP POST, XMPP, perhaps some distributed queuing model).

Watching today’s web apps poll each other to death is hard. With each new product that integrates serviceX, the latency of serviceX’s events propagating through the ecosystem degrades, and everyone loses. This is a broken model that, if left unresolved, will drive our web apps back into the dark ages once all the web service endpoints are overburdened to the point of being uninteresting.

We’ve seen fabulous adoption of our API since launching a couple of months ago. We hope that more Data Producers and Data Consumers leverage it going forward.