Twitter XML, JSON & Activity Streams at Gnip

About a month ago Twitter announced that it will be shutting off XML for stream-based endpoints on December 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you’ll be forced to get your head around something new. You’ll have to get smart, find the right JSON lib, change your code to use it (along with any associated dependencies you weren’t already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.
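For a concrete sense of the switch, here’s a minimal sketch in Ruby (using stdlib REXML and the json gem; the `<status>` field layout reflects Twitter’s then-current payloads, hedged as an illustration rather than a spec excerpt) of what a tweet parser might look like before and after:

```ruby
require 'rexml/document'
require 'json'

# Before: pulling fields out of a <status> XML document.
def parse_status_xml(raw)
  doc = REXML::Document.new(raw)
  {
    text: doc.elements['status/text'].text,
    user: doc.elements['status/user/screen_name'].text
  }
end

# After: the same fields from the JSON representation.
def parse_status_json(raw)
  status = JSON.parse(raw)
  {
    text: status['text'],
    user: status['user']['screen_name']
  }
end
```

The extraction logic is trivial either way; the cost is in swapping libraries, dependencies, and tests across an entire codebase.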

As it turns out, Gnip’s format normalization shields you from all of this. Gnip customers get to stay focused on delivering value to their customers. Others integrating directly, and consuming stream data from Twitter in XML, have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).

From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).

Fun.

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

How to Select a Social Media Data Provider

If you’re looking for social media data, you’ve got a lot of options: social media monitoring companies provide end-user brand tracking tools, some businesses provide deep-dive analyses of social data, other companies provide reputation scores for individual users, and still other services specialize in geographic social media display, to name just a few.

Some organizations ultimately decide to build internal tools for social media data analysis. Then they must decide whether to outsource the social data collection bit so they can focus their efforts on analyzing and visualizing the data, or to build everything — including API connections to each individual publisher — internally. Establishing and maintaining those API connections over time can be costly. If your team has the money and resources to build your own social media integrations, then go for it!

But if you’re shopping for raw social media data, you should consider a social media API – that is, a single API that aggregates raw data from dozens of different social media publishers – instead of making connections to each one of those dozens of social media APIs individually. And in the social media API market, there is only a small handful of companies for you to choose from. We are one of them and we would love to work with you. But we know that you’ll probably want to shop your options before making a decision, so we’d like to offer our advice to help you understand some of the most important factors in selecting a social media API provider.

Here are some good questions for you to ask every social media API solution you consider (including your own internal engineers, if you’re considering hiring them for the job):

Are your data collection methods in compliance with all social media publishers’ terms of use?

–> Here’s why it matters: by working with a company that violates any publisher’s terms of use, you risk unstable (or sudden loss of) access to the violated publisher’s data — not to mention the potential legal consequences of using black market data in your product. Conversely, if you work with a company that has a strong relationship with the social media publishers, our experience shows that you not only get stable, reliable data access, but you just might get rewarded with *extra* data access every now and then. (In case you’re wondering, Gnip’s methods are in compliance with each of our social media publishers’ terms of use.)

Do you provide results and allow parameter modifications via API, and do you maintain those API connections over time?

–> In our experience, establishing a single API connection to collect data from a single publisher isn’t hard. But! Establishing many API connections to various social media publishers and – this is key – maintaining those connections over time is really quite a chore. So much so that we made a whole long list of API-related difficulties associated with that integration work, based on our own experiences. Make sure that whoever you work with understands the ongoing work involved and is prepared to maintain your access to all of the social media APIs you care about over time.

How many data sources do you provide access to?

–> Even if you only want access to Twitter and Facebook today, it’s a good idea to think ahead. How much incremental work will be involved for you to integrate additional sources a few months down the line? Our own answer: using Gnip’s social media API, once you’re set up to receive your first feed from Gnip via API, it takes about one minute to configure Gnip to send you data from a second feed. Ten minutes later, you’re collecting data from 10 different feeds, all at no extra charge. Since you can configure Gnip to send all of your data in one format, you only need to create one parser, and all the data you want gets streamed into your product. You can even start getting data from a new social media source, decide it’s not useful for your product, and replace it with a different feed from a different source, all in a matter of seconds. We’re pretty proud that we’ve made it so fast and simple for you to receive data from new sources… (blush)… and we hope you’ll find it to be useful, too.

What format is your data delivered in?

–> Ten different social media sources might provide data in ten different formats. And that means you have to write ten different parsers to get all the data into your product. Gnip allows you to normalize all the social media data you want into one single format — Activity Streams — so you can collect all your results via one API and feed them into your product with just one parser.

Hope this helps! If you’ve got additional questions to suggest for our list, don’t hesitate to drop us a note. We’d love to hear from you.

Activity Streams

Gnip pledges allegiance to Activity Streams.

Consuming data from APIs with heterogeneous response formats is a pain. From basic format differences (XML vs JSON) to the semantics around structure and element meaning (custom XML structure, Atom, RSS), if you’re consuming data from multiple APIs, you have to handle each API’s responses differently. Gnip minimizes this pain by normalizing data from across services into Activity Streams. Activity Streams allows you to consistently digest responses from many services, using a single parsing routine in your code; no more special casing.
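To make that concrete, here’s a minimal sketch of a single Nokogiri routine digesting entries from any normalized feed. It assumes the XML flavor of Activity Streams rides on Atom with the activitystrea.ms 1.0 namespace; the element selection and the `each_activity` helper are ours for illustration, not a spec excerpt:

```ruby
require 'nokogiri'

NS = {
  'atom'     => 'http://www.w3.org/2005/Atom',
  'activity' => 'http://activitystrea.ms/spec/1.0/'
}

# One parsing routine for every publisher, since each feed arrives
# in the same normalized shape.
def each_activity(xml)
  Nokogiri::XML(xml).xpath('//atom:entry', NS).each do |entry|
    yield(
      verb:  entry.at_xpath('activity:verb', NS).text,
      actor: entry.at_xpath('atom:author/atom:name', NS).text,
      title: entry.at_xpath('atom:title', NS).text
    )
  end
end

# each_activity(feed_xml) { |a| puts "#{a[:actor]} #{a[:verb]}: #{a[:title]}" }
# (feed_xml stands in for any normalized feed body.)
```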

Gnip’s history with Activity Streams runs long and deep. We contributed to one of the first service/activity/verb mapping proposals, and have been implementing aspects of Activity Streams over the past couple of years. Over the past several months Activity Streams has gained enough traction that the decision for it to be Gnip’s canonical normalization format was only natural. We’ve flipped the switch and are proud to be part of such a useful standard.

The Activity Streams initiative is in the process of getting its JSON version together, so for now, we offer the XML version. As JSON crystallizes, we’ll offer that as well.

xml.to_json

Gnip spends an inordinate amount of time slicing and dicing data for our customers. Normalizing the web’s data is something we’ve been doing for a long time now, and we’ve gone through many iterations of it. While you can usually find a way from format A to format B (assuming the two are inherently extensible, as XML and JSON are), you often bastardize one or the other in the process. DeWitt Clinton (Googler) recently posted a clear and concise outline of the challenges around moving between various formats. I’ve been wanting to write a post using the above title for a couple of weeks, so a thank you to DeWitt for providing the inadvertent nudge.

Parsing

Here at Gnip we’ve done the rounds with respect to how to parse a formatted document. From homegrown regex’ing to framework-specific parsing libraries, the decisions around how and when to parse a document aren’t always obvious. Layer in the need to performantly parse large documents in real-time, and the challenge becomes palpable. Offline document parsing/processing (traditional Google crawler/index-style) allows you to push off many of the real-time processing challenges. I’m curious to see how Google’s real-time index (their “demo” PubSubHubbub hub implementation) fares with potentially hundreds of billions of events moving through it per day, in “real-time,” in the years to come.

When do you parse?

If you’re parsing structured documents (e.g. XML or JSON) in “real-time,” one of the first questions you need to answer is when you actually parse. Whether you parse when the data arrives at your system’s front door or when it’s on its way out can make or break your app. An assumption throughout this post is that you are dealing with “real-time” data, as opposed to data that can be processed “offline” for future on-demand use.

A good rule of thumb is to parse data on the way in when the ratio of outbound consumption to inbound data is greater than 1. If you have lots of consumers of your parsed/processed content, do the work once, up front, so it can be leveraged across all of the consumption (diagram below).

If the relationship between in/out is purely 1-to-1, then it doesn’t really matter, and other factors around your architecture will likely guide you. If the consumption dynamic is such that not all the information will be consumed 100% of the time (e.g. 1-to-something-less-than-1), then parsing on the outbound side generally makes sense (diagram below).

Synchronous vs. Asynchronous Processing

When handling large volumes of constantly changing data you may have to sacrifice the simplicity of serial/synchronous data processing, in favor of parallel/asynchronous data processing. If your inbound processing flow becomes a processing bottleneck, and things start queuing up to an unacceptable degree, you’ll need to move processing out of band, and apply multiple processors to the single stream of inbound data; asynchronous processing.
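As a minimal sketch of what moving parsing out of band can look like, here’s the pattern with Ruby threads and a `Queue`; the `deliver` hand-off and the inline sample stream are hypothetical stand-ins for the surrounding system:

```ruby
require 'json'

# Hypothetical downstream hand-off; stands in for "the bus."
def deliver(activity)
  puts activity['id']
end

queue   = Queue.new # thread-safe; pop blocks until something arrives
workers = 4.times.map do
  Thread.new do
    # Each worker drains the shared inbound queue until it sees a nil "poison pill."
    while (raw = queue.pop)
      deliver(JSON.parse(raw))
    end
  end
end

# The front door does no parsing at all; it just enqueues and moves on.
inbound = ['{"id":1}', '{"id":2}', '{"id":3}'] # stands in for the live stream
inbound.each { |raw| queue << raw }
4.times { queue << nil } # one pill per worker
workers.each(&:join)
```

The front door stays fast because enqueueing is cheap; the expensive parse happens in parallel, off the inbound path.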

How do you parse?

Regex parsing: While old-school, regex can get you a long way, performantly. However, this assumes you’re good at writing regular expressions. Simple missteps can make regexes perform incredibly slowly.
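For instance, a tightly scoped pattern extracts a field quickly, while stacked greedy wildcards invite catastrophic backtracking on large documents (a small illustration):

```ruby
raw = '<status><id>12345</id><text>hello</text></status>'

# A narrow, anchored pattern extracts one field quickly:
id = raw[/<id>(\d+)<\/id>/, 1] # => "12345"

# Stacked greedy wildcards, e.g. /<status>.*<text>(.*)<\/text>.*/m,
# can backtrack catastrophically as documents grow; avoid them.
```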

DOM-based parsing: While the APIs around DOM-based parsers are oh-so-tempting to use, that higher-level interface comes at a cost. DOM parsers often construct heavy object models around everything they find in a document and, most of the time, you won’t use but 10% of it. Most are configurable WRT how they parse, but often not to the degree of giving you just what you need. All have their own bugs you’ll learn to work through/around. Gnip currently uses Nokogiri for much of its XML document parsing.
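A quick sketch of that trade-off with Nokogiri: one call builds the whole node graph, even though we only pluck out two fields:

```ruby
require 'nokogiri'

raw = '<status><id>12345</id><text>hello</text>' \
      '<user><screen_name>jane</screen_name></user></status>'

doc = Nokogiri::XML(raw) # the full object model gets built here, used or not
text = doc.at_xpath('/status/text').text             # => "hello"
user = doc.at_xpath('/status/user/screen_name').text # => "jane"
```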

SAX-style parsing: It doesn’t get much faster. The trade-off to this kind of parsing is complexity. One of the crucial benefits to DOM-style parsing is that the node graph is constructed and maintained for you. SAX-style parsing requires that you track that structure yourself, and it often isn’t fun or pretty.
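Here’s the same sort of extraction done SAX-style with Nokogiri, as a minimal sketch: no tree is ever built, but the state tracking is on you:

```ruby
require 'nokogiri'

# No node graph gets built; we track just enough state by hand.
class TextCollector < Nokogiri::XML::SAX::Document
  attr_reader :texts

  def initialize
    @texts = []
    @in_text = false
  end

  def start_element(name, _attrs = [])
    @in_text = (name == 'text')
  end

  def characters(string)
    @texts << string if @in_text
  end

  def end_element(name)
    @in_text = false if name == 'text'
  end
end

raw = '<statuses><status><text>hello</text></status></statuses>'
handler = TextCollector.new
Nokogiri::XML::SAX::Parser.new(handler).parse(raw)
handler.texts # => ["hello"]
```

Even this toy handler needs three callbacks and a flag to recover one field; real documents multiply that bookkeeping quickly.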

Transformation

Whether you’re moving between different formats (e.g. XML and JSON), or making structural changes to the content, the promises around ease of transformation that were made by XSLT were never kept. For starters, no one moved beyond the 1.0 spec, which is grossly underpowered. Developers have come to rely on home-grown mapping languages (Gnip 1.0 employed a completely custom language for moving between arbitrary inbound XML documents and a known outbound structure), conveniences provided by the underlying parsing libraries, or the language frameworks they’re building in. For example, Ruby has “.to_json” methods sprinkled throughout many classes. While the method works much of the time for serializing an object of known structure, its output on more complex objects, like arbitrarily structured XML, is highly variable and not necessarily what you want in the end.

An example of when simple .to_json falls short is the handling of XML namespaces. While structural integrity is indeed maintained, and namespaces are translated, they’re meaningless in the world of JSON. So if you only need one-way transformation, out-of-the-box transformation methods leave your JSON cluttered in the end. Of course, as DeWitt points out, if you need round-trip integrity, then the clutter is necessary.
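A small illustration of that clutter (the key names are ours, but the shape is typical of round-trip-faithful converters):

```ruby
require 'json'

# Round-trip-faithful output drags XML plumbing along as ordinary keys:
faithful = {
  'entry' => {
    'xmlns'          => 'http://www.w3.org/2005/Atom',
    'xmlns:activity' => 'http://activitystrea.ms/spec/1.0/',
    'activity:verb'  => 'http://activitystrea.ms/schema/1.0/post',
    'title'          => 'hello'
  }
}

# One-way output can throw the prefixes away entirely:
lossy = { 'verb' => 'post', 'title' => 'hello' }

puts JSON.pretty_generate(faithful) # cluttered, but reversible
puts JSON.pretty_generate(lossy)    # clean, but one-way
```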

While custom mapping languages give you flexibility, they also require upkeep (bugs and features). Convenience-lib transformation routines are often written to a baseline specification and a strict set of structural rules, which are often violated by real-world documents.

Integrity

Simple transformations are… simple; they generally “just work.” The more complex the documents, however, the harder your transformation logic gets pushed and the more things start to break (if not on the implementation side then on the format side). Sure, you can beat a namespace-, attribute-, and element-laden XML document into JSON submission, but in doing so, you’ll likely defeat the purpose of JSON altogether (fast, small wire cost, easy JS objectification). While you might lose some of the format-specific benefits, the end may justify the means in this case. Sure it’s ugly, but in order to move the world closer to JSON, ugly XML-to-JSON transformers may need to exist for a while. Not everyone with an XML-spewing back-end can afford to build true JSON output into their systems (think Enterprise apps, for one).

In the End

Gnip’s working to normalize many sources of data into succinct, predictable streams of data. While taking on this step is part of our value proposition to customers, the ecosystem at large can benefit significantly from native JSON sources of data (in addition to prolific XML). XML’s been a great, necessary stepping stone for the industry, but 9 times out of 10, tighter JSON suffices. And finally, if anyone builds an XSLT 2.0 spec-compliant processor for Ruby, we’ll use it!

Gnip; An Update

Gnip moved into our new office yesterday (other end of the block from our old office). The transition provided an opportunity for me to think about where we’ve been, and where we’re going.

Team

We continue to grow, primarily on the engineering side. Check out our jobs page if you’re interested in working on a hard problem, with smart people, in a beautiful place (Boulder, CO).

Technology

We’ve built a serious chunk of back-end infrastructure that I’d break into two general pieces: “the bus” and “the pollers.”

“The Bus”

Our back-end moves large volumes of relatively small (usually <~3k bytes) chunks of data from A to B in a hurry. Data is “published” into Gnip, we do some backflips with it, then spit it out the other side to consumers.

“The Pollers”

Our efforts to get Publishers to push directly into Gnip didn’t pan out the way we initially planned. As a result we had to change course and acquire data ourselves. The bummer here was that we set out on an altruistic mission to relieve the polling pain that the industry has been suffering from, but were met with such inertia that we didn’t get the coverage we wanted. The upside is that building polling infrastructure has allowed us to control more of our business destiny. We’ve gone through a few iterations on our approach to polling: from complex job scheduling and systems that “learn” and “adapt” to their surroundings, to dirt-simple, mindless grinders that ignorantly eat APIs/endpoints all day long. We’re currently slanting heavily toward simplicity in the model. The idea is to take learnings from the simple model over time, and feed them into abstractions/refactorings that make the system smarter.
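For flavor, the dirt-simple grinder model might look something like this (a hedged sketch; the endpoint list and the `publish` hand-off into “the bus” are hypothetical stand-ins):

```ruby
require 'net/http'
require 'uri'

# Hypothetical hand-off into "the bus."
def publish(body)
  puts body.bytesize
end

ENDPOINTS = %w[
  https://example.com/feed_a
  https://example.com/feed_b
].freeze # stand-in endpoints

# The grinder: no scheduling smarts, just a fixed round-robin, forever.
loop do
  ENDPOINTS.each do |endpoint|
    response = Net::HTTP.get_response(URI(endpoint))
    publish(response.body) if response.is_a?(Net::HTTPSuccess)
  end
  sleep 60 # fixed cadence; a "smarter" poller would adapt this per endpoint
end
```

The appeal of the model is exactly its ignorance: there’s almost nothing to tune, break, or debug while we learn where smarts actually pay off.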

Deployment

We’re still in the cloud. Amazon’s EC2/S3 products have been a solid (albeit not necessarily the most cost-effective when your CPU utilization isn’t in the 90%+ range per box), highly flexible framework for us; hats off to those guys.

Industry

“The Polling Problem”

It’s been great to see the industry wake up and acknowledge “the polling problem” over the past year. SUP (Simple Update Protocol) popped up to provide more efficient polling for systems that couldn’t, or wouldn’t, move to an event-driven model. It provides a compact change-log; pollers check the change-log cheaply, then do heavier polls only for the stuff that has changed. PubSubHubbub popped up to provide the framework for a distributed Gnip (though lacking inherent normalization). A combination of polling and events spread across nodes allows for a more decentralized approach.
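The change-log pattern is easy to picture. Here’s a generic sketch of the idea; the URL, the `updates` field shape, and the `WATCHED`/`ingest` helpers are our assumptions, not the exact SUP wire format:

```ruby
require 'net/http'
require 'json'
require 'uri'

Feed = Struct.new(:sup_id, :url)
WATCHED = [Feed.new('abc123', 'https://example.com/feeds/abc123')] # hypothetical registry

# Hypothetical hand-off for full feed bodies.
def ingest(body)
  puts body.bytesize
end

# One cheap poll of the change-log...
changelog = JSON.parse(Net::HTTP.get(URI('https://example.com/sup.json')))
changed   = changelog['updates'].map(&:first) # assumed shape: [[sup_id, time], ...]

# ...then heavy fetches only for the feeds that actually changed.
WATCHED.each do |feed|
  ingest(Net::HTTP.get(URI(feed.url))) if changed.include?(feed.sup_id)
end
```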

“Normalization”

The Activity Streams initiative grew legs and is walking. As with any “standards” (or “standards-like”) initiative, things are only as good as adoption. Building ideas in a silo without users makes for a fun exercise, but not much else. Uptake matters, and MySpace and Facebook (among many other smaller initiatives) have bitten off chunks of Activity Streams, and that’s a very big, good sign for the industry. Structural and semantic consistency matters for applications digesting a lot of information. Gnip provides highly structured and consistent data to its consumers via gnip.xsd.

In order to meet its business needs, and to adapt to the constantly moving industry around it, Gnip has adjusted its approach on several fronts. We moved to incorporate polling. We understand that there is more than one way of doing things, and will incorporate SUP and PubSubHubbub into our framework. Doing so will make our own polling efforts more effective, and also provide data to our consumers with more flexibility. While normalized data is nice for a large category of consumers, there is a large tier of customers that doesn’t need, or want, heavy normalization. Opaque message flow has significant value as well.

We set out to move mind-boggling amounts of information from A to B, and we’re doing that. Some of the nodes in the graph are shifting, but the model is sound. We’ve found there are primarily two types of data consumers: high-coverage of a small number of sources (“I need 100% of Joe, Jane, and Mike’s activity”), and “as high as you can get it”-coverage of a large number of sources (“I don’t need 100%, but I want very broad coverage”). Gnip’s adjusted to accommodate both.

Business

We’ve had to shift our resources to better focus on the paying segments of our audience. We initially thought “life-stream aggregators” would be our biggest paying customer segment, however data/media analytics firms have proven significant. Catering to the customers who tell you “we have budget for that!” makes good business sense, and we’re attacking those opportunities.

Newest Gnip Partner: PostRank

Welcome PostRank!

Today Gnip and PostRank announced a new partnership (blog) (press release) that allows companies using the Gnip platform to access the nearly 3 million news articles and stories indexed by PostRank from a million discrete sources every day. In addition, PostRank collects the real-time social interactions with content across dozens of social networks and applications.

Gnip will be providing PostRank as a premium service that requires a subscription. All of the value-added features of the Gnip platform, including normalization, rule-based filtering, and push delivery of content, are supported with the new Gnip PostRank Data Publisher. For pricing information, contact info@gnip.com or shane@gnip.com.

Gravitational Shift

Gnip’s approach to getting more Publishers into the system has evolved. Over the past year we’ve learned a lot about the data delivery business and the state of its technological art. While our core infrastructure remains a highly performant data delivery bus, the way data arrives at Gnip’s front door is shifting.

We set out assuming the industry at large (both Publishers and Consumers) was tired of highly latent data access. What we’ve learned is that data Consumers (e.g. life-stream aggregators) are indeed weary of the latency, but that many Publishers aren’t as interested in distributing their data in real-time as we initially estimated. So, in order to meet intense Consumer demand to have data delivered in a normalized, minimal-latency (not necessarily “real-time”) manner, Gnip is adding many new polled Publishers to its offering.

Check out http://api.gnip.com and see how many Publishers we have to offer as a result of walking down the polling path.

Our goal remains to “deliver the web’s data,” and while the core Gnip delivery model remains the same, polling has allowed us to greatly expand the list of available Publishers in the system.

Tell us which Publishers/data sources you want Gnip to deliver for you! http://gnip.uservoice.com/

We have a long way to go, but we’re stoked at the rate we’re able to widen our Publisher offering now that our polling infrastructure is coming online.