Google+ Now Available from Gnip

Gnip is excited to announce the addition of Google+ to its repertoire of social data sources. Built on top of the Google+ Search API, Gnip’s stream allows its customers to consume real-time social media data from Google’s fast-growing social networking service. Using Gnip’s stream, customers can poll Google+ for public posts and comments matching the terms and phrases relevant to their business and client needs.

Google+ is an emerging player in the social networking space and pairs well with the Twitter, Facebook, and other microblog content currently offered by Gnip. Looking for volume? Google+ became the third largest social networking platform within a week of its public launch, and some are projecting it to become the world’s second largest social network within the next twelve months. Looking to consume content from social network influencers? Google+ is where they are (even former Facebook president Sean Parker says so).

By working with Gnip to consume a stream of Google+ data (alongside an abundance of other social data sources), you’ll have access to a normalized data format, unwound URLs, and data deduplication. Existing Gnip customers can seamlessly add Google+ to their Gnip Data Collectors (all you need is a Google API Key). New to Gnip? Let us help you design the right solution for your social data needs; contact sales@gnip.com.

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

Activity Streams

Gnip pledges allegiance to Activity Streams.

Consuming data from APIs with heterogeneous response formats is a pain. From basic format differences (XML vs JSON) to the semantics around structure and element meaning (custom XML structure, Atom, RSS), if you’re consuming data from multiple APIs, you have to handle each API’s responses differently. Gnip minimizes this pain by normalizing data from across services into Activity Streams. Activity Streams allows you to consistently digest responses from many services, using a single parsing routine in your code; no more special casing.
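As an illustration of that single parsing routine, here is a minimal Nokogiri sketch against the Atom flavor of Activity Streams. The exact element layout of Gnip’s payloads may differ, so treat the XPath expressions below as assumptions rather than a spec:

```ruby
require 'nokogiri'

# One routine that pulls the actor, verb, and object out of any Activity
# Streams (Atom) entry, regardless of which service the activity came from.
# Namespace URIs are the published Atom/Activity Streams ones; the element
# paths are an assumption for illustration.
NS = { 'atom'     => 'http://www.w3.org/2005/Atom',
       'activity' => 'http://activitystrea.ms/spec/1.0/' }

def parse_activity(entry_xml)
  doc = Nokogiri::XML(entry_xml)
  {
    actor:  doc.at_xpath('//atom:author/atom:name', NS)&.text,
    verb:   doc.at_xpath('//activity:verb', NS)&.text,
    object: doc.at_xpath('//activity:object/atom:id', NS)&.text
  }
end
```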

Gnip’s history with Activity Streams runs long and deep. We contributed to one of the first service/activity/verb mapping proposals and have been implementing aspects of Activity Streams over the past couple of years. Over the past several months, Activity Streams has gained enough traction that the decision to make it Gnip’s canonical normalization format was only natural. We’ve flipped the switch and are proud to be part of such a useful standard.

The Activity Streams initiative is in the process of getting its JSON version together, so for now we offer the XML version. As the JSON version crystallizes, we’ll offer that as well.

xml.to_json

Gnip spends an inordinate amount of time slicing and dicing data for our customers. Normalizing the web’s data is something we’ve been doing for a long time now, and we’ve gone through many iterations of it. While you can usually find a way from format A to format B (assuming the two are inherently extensible, as XML and JSON are), you often bastardize one or the other in the process. DeWitt Clinton (a Googler) recently posted a clear and concise outline of the challenges around moving between various formats. I’ve been wanting to write a post using the above title for a couple of weeks, so a thank you to DeWitt for providing the inadvertent nudge.

Parsing

Here at Gnip we’ve done the rounds with respect to how to parse a formatted document. From homegrown regexing to framework-specific parsing libraries, the decisions around how and when to parse a document aren’t always obvious. Layer in the need to performantly parse large documents in real-time, and the challenge becomes palpable. Offline document parsing/processing (traditional Google crawler/index-style) lets you push off many of the real-time processing challenges. I’m curious to see how Google’s real-time index (their “demo” PubSubHubbub hub implementation) fares with potentially hundreds of billions of events moving through it per day, in “real-time,” in the years to come.

When do you parse?

If you’re parsing structured documents (e.g. XML or JSON) in “real-time,” one of the first questions you need to answer is when you actually parse. Whether you parse when the data arrives at your system’s front door or when it’s on its way out can make or break your app. An assumption throughout this post is that you are dealing with “real-time” data, as opposed to data that can be processed “offline” for future on-demand use.

A good rule of thumb is to parse data on the way in when the ratio of outbound consumption to inbound data is greater than 1. If you have lots of consumers of your parsed/processed content, do the work once, up front, so it can be leveraged across all of that consumption (diagram below).

If the relationship between in and out is purely 1-to-1, then it doesn’t really matter, and other factors in your architecture will likely guide you. If the consumption dynamic is such that not all of the information will be consumed 100% of the time (i.e. 1-to-something-less-than-1), then parsing on the outbound side generally makes sense (diagram below).
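As a rough sketch of that rule of thumb (the class and method names here are illustrative, not Gnip’s API), parse-on-ingest versus deferring to the consumer might look like:

```ruby
require 'json'

# Illustrative only: when each inbound document fans out to many consumers,
# parse once on the way in; when it doesn't, hand off the raw document and
# let the outbound side parse if and when it actually needs to.
class IngestPipeline
  def initialize(consumers)
    @consumers = consumers # each consumer responds to #deliver(doc)
  end

  def receive(raw_json)
    if @consumers.size > 1
      parsed = JSON.parse(raw_json)               # do the work once, up front...
      @consumers.each { |c| c.deliver(parsed) }   # ...and reuse it for every consumer
    else
      @consumers.each { |c| c.deliver(raw_json) } # defer parsing to the outbound side
    end
  end
end
```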

Synchronous vs. Asynchronous Processing

When handling large volumes of constantly changing data, you may have to sacrifice the simplicity of serial/synchronous data processing in favor of parallel/asynchronous data processing. If your inbound processing flow becomes a bottleneck and things start queuing up to an unacceptable degree, you’ll need to move processing out of band and apply multiple processors to the single stream of inbound data: asynchronous processing.
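A minimal sketch of that out-of-band approach, assuming JSON payloads and a simple in-process thread pool (Gnip’s actual pipeline is more involved):

```ruby
require 'json'

# A bounded queue decouples the inbound listener from a pool of parser threads.
queue   = SizedQueue.new(10_000)   # applies back-pressure if the parsers fall behind
workers = 4.times.map do
  Thread.new do
    while (raw = queue.pop)
      activity = JSON.parse(raw)   # CPU-bound parsing happens off the ingest path
      # ... hand the parsed activity to downstream filtering/delivery ...
    end
  end
end

# The inbound side just enqueues and returns immediately.
def ingest(raw_document, queue)
  queue << raw_document
end
```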

How do you parse?

Regex parsing: While old-school, regex can get you a long way, performantly. However, this assumes you’re good at writing regular expressions; simple missteps can make regexes incredibly slow.
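For example, a tight, anchored pattern stays cheap, while nested unbounded repetition is exactly the kind of misstep that can backtrack catastrophically:

```ruby
# Cheap: a character class and no nested quantifiers.
link  = %r{<link\s+rel="alternate"\s+href="([^"]+)"}
# Risky: nested unbounded repetition invites catastrophic backtracking on
# input that never matches. Avoid patterns shaped like this.
risky = /(<.*>)*!/

doc = '<link rel="alternate" href="http://example.com/post/1"/>'
url = doc[link, 1]   # => "http://example.com/post/1"
```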

DOM-based parsing: While the APIs around DOM-based parsers are oh so tempting to use, that higher-level interface comes at a cost. DOM parsers often construct heavy object models around everything they find in a document, and most of the time you won’t use but 10% of it. Most are configurable WRT how they parse, but often not to the degree of giving you just what you need. All have their own bugs you’ll learn to work through/around. Gnip currently uses Nokogiri for much of its XML document parsing.
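A quick Nokogiri illustration of the trade-off: the whole node graph gets built for you, and you cherry-pick the two or three things you actually need:

```ruby
require 'nokogiri'

xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry><id>tag:example,2011:1</id><title>hello</title></entry>
    <entry><id>tag:example,2011:2</id><title>world</title></entry>
  </feed>
XML

doc = Nokogiri::XML(xml)       # builds the full object model up front
doc.remove_namespaces!         # convenient for a quick scrape; lossy in general
titles = doc.xpath('//entry/title').map(&:text)  # => ["hello", "world"]
```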

SAX-style parsing: It doesn’t get much faster. The trade-off with this kind of parsing is complexity. One of the crucial benefits of DOM-style parsing is that the node graph is constructed and maintained for you; SAX-style parsing requires that you track that structure yourself, and it often isn’t fun or pretty.
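Here is the SAX-style counterpart to the sketch above: faster and lighter, but the bookkeeping is now yours:

```ruby
require 'nokogiri'

# We have to track where we are in the document ourselves instead of getting
# a node graph for free.
class TitleCollector < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    @titles   = []
    @in_title = false
  end

  def start_element(name, _attrs = [])
    @in_title = (name == 'title')
  end

  def characters(text)
    @titles << text if @in_title
  end

  def end_element(_name)
    @in_title = false
  end
end

handler = TitleCollector.new
Nokogiri::XML::SAX::Parser.new(handler)
        .parse('<feed><entry><title>hello</title></entry></feed>')
handler.titles  # => ["hello"]
```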

Transformation

Whether you’re moving between formats (e.g. XML and JSON) or making structural changes to the content, the promises of easy transformation made by XSLT were never kept. For starters, no one moved beyond the 1.0 spec, which is grossly underpowered. Developers have come to rely on home-grown mapping languages (Gnip 1.0 employed a completely custom language for moving between arbitrary inbound XML documents and a known outbound structure), conveniences provided by the underlying parsing libraries, or the language frameworks they’re building in. For example, Ruby has “.to_json” methods sprinkled throughout many classes. While the method works much of the time for serializing an object of known structure, its output on more complex objects, like arbitrarily structured XML, is highly variable and not necessarily what you want in the end.

An example of where a simple .to_json falls short is the handling of XML namespaces. While structural integrity is indeed maintained and namespaces are translated, they’re meaningless in the world of JSON. So, if all you need is a one-way transformation, out-of-the-box transformation methods leave the resulting JSON cluttered. Of course, as DeWitt points out, if you need round-trip integrity, then the clutter is necessary.
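To make the clutter concrete, here is a hand-built illustration (not the output of any particular library) of what a round-trip-faithful mapping has to carry versus what a one-way consumer actually wants:

```ruby
require 'json'

# A faithful XML-to-JSON mapping drags the namespace declarations and prefixes
# along, even though JSON has no use for them.
faithful = {
  'entry' => {
    '@xmlns'          => 'http://www.w3.org/2005/Atom',
    '@xmlns:activity' => 'http://activitystrea.ms/spec/1.0/',
    'activity:verb'   => 'http://activitystrea.ms/schema/1.0/post',
    'title'           => 'hello'
  }
}

# What a one-way consumer actually wants.
lean = { 'verb' => 'post', 'title' => 'hello' }

puts JSON.pretty_generate(faithful)
puts JSON.pretty_generate(lean)
```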

While custom mapping languages give you flexibility, they also require upkeep (bugs and features). Convenience-library transformation routines are often written to a baseline specification and a strict set of structural rules, which real-world documents often violate.

Integrity

Simple transformations are… simple; they generally “just work.” The more complex the documents, however, the harder your transformation logic gets pushed and the more things start to break (if not on the implementation side then on the format side). Sure, you can beat a namespace-, attribute-, and element-laden XML document into JSON submission, but in doing so you’ll likely defeat the purpose of JSON altogether (fast, small wire cost, easy JS objectification). Still, while you might lose some format-specific benefits, the end may justify the means in this case. Sure it’s ugly, but in order to move the world closer to JSON, ugly XML-to-JSON transformers may need to exist for a while. Not everyone with an XML-spewing back-end can afford to build true JSON output into their systems (think Enterprise apps, for one).

In the End

Gnip is working to normalize many sources of data into succinct, predictable streams. While taking on this step is part of our value proposition to customers, the ecosystem at large would benefit significantly from native JSON sources of data (in addition to the prolific XML). XML has been a great, necessary stepping stone for the industry, but nine times out of ten tighter JSON suffices. And finally, if anyone builds an XSLT 2.0 spec-compliant processor for Ruby, we’ll use it!

Newest Gnip Partner: PostRank

Welcome PostRank!

Today Gnip and PostRank announced a new partnership (blog) (press release) that allows companies using the Gnip platform to access the nearly 3 million news articles and stories indexed by PostRank from a million discrete sources every day. In addition, PostRank collects the real-time social interactions with content across dozens of social networks and applications.

Gnip will be providing PostRank as a premium service that requires a subscription. All of the value-added features of the Gnip platform, including normalization, rule-based filtering, and push delivery of content, are supported with the new Gnip PostRank Data Publisher. For pricing information, contact info@gnip.com or shane@gnip.com.

Gravitational Shift

Gnip’s approach to getting more Publishers into the system has evolved. Over the past year we’ve learned a lot about the data delivery business and the state of its technological art. While our core infrastructure remains a highly performant data delivery bus, the way data arrives at Gnip’s front door is shifting.

We set out assuming the industry at large (both Publishers and Consumers) was tired of highly latent data access. What we’ve learned is that data Consumers (e.g. life-stream aggregators) are indeed weary of the latency, but many Publishers aren’t as interested in distributing their data in real-time as we initially estimated. So, in order to meet intense Consumer demand for data delivered in a normalized, minimal-latency (not necessarily “real-time”) manner, Gnip is adding many new polled Publishers to its offering.

Check out http://api.gnip.com and see how many Publishers we have to offer as a result of walking down the polling path.

Our goal remains to “deliver the web’s data,” and while the core Gnip delivery model remains the same, polling has allowed us to greatly expand the list of available Publishers in the system.

Tell us which Publishers/data sources you want Gnip to deliver for you! http://gnip.uservoice.com/

We have a long way to go, but we’re stoked at the rate we’re able to widen our Publisher offering now that our polling infrastructure is coming online.

New Gnip Solution Spotlight: Collective Intellect

We are thrilled to be working with Collective Intellect, a very interesting real-time marketing intelligence company that is also based locally in Boulder. Gnip is providing Collective Intellect with normalized data access to social media services so they can concentrate on delivering faster, richer, and more actionable market research data to their customers and partners.

Read about how Collective Intellect and other companies are using the Gnip platform in the Solution Spotlight page of our website.

 
