Gnip. The Story Behind the Name

Have you ever thought, “Gnip . . . well, that is a strange name for a company. What does it mean?” As one of the newest members of the Gnip team, I found myself thinking that very same thing. And as I began telling my friends about this amazing new start-up I would be working for in Boulder, Colorado, they too began to ask about the meaning behind the name.

Gnip, pronounced (guh’nip), got its name from the very heart of what we do: real-time social media data collection and delivery. So let’s dive into . . .

Data Collection 101

There are two general methods for data collection: pull technology and push technology. Pull technology is best described as a data transfer in which the request is initiated by the data consumer and responded to by the data publisher’s server. In contrast, with push technology the transfer is initiated by the data publisher’s server, which sends new data to the data consumer as it becomes available.

So why does this matter . . .

Well, most social media publishers use the pull method. This means that the data consumer’s system must constantly go out and “ping” the data publisher’s server, asking, “Do you have any new data now?” . . . “How about now?” . . . “And now?” This can cause a few issues:

  1. Deduplication – If you ping the publisher’s server one second and then ping it again a second later and nothing new has been posted, you will receive the same results you got a second ago. You then have to deduplicate the data on your end.
  2. Rate Limiting – Every social media data publisher’s server sets different rate limits, which control the number of times you can ping the server in a given time frame. These rate limits are constantly changing and typically aren’t published. If your system pings the publisher’s server more often than the rate limit allows, your data collection can be shut down entirely, leaving you to figure out why the connection broke (Is it the API? Is it the rate limit? What is the rate limit, anyway?).

So as you can see, pull technology can be a tricky beast.
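
To make the pain concrete, here is a minimal sketch of what a pull-based collector tends to look like; the endpoint, payload shape, and back-off strategy are assumptions for illustration, not any particular publisher’s API.

    require 'net/http'
    require 'json'
    require 'set'

    # Hypothetical publisher endpoint returning a JSON array of {"id": ...} items.
    ENDPOINT = URI('https://api.example-publisher.com/recent.json')

    seen_ids      = Set.new  # dedup store; in practice a database or cache with expiry
    poll_interval = 5        # seconds; has to stay under an unpublished rate limit

    loop do
      response = Net::HTTP.get_response(ENDPOINT)

      if response.is_a?(Net::HTTPSuccess)
        items = JSON.parse(response.body)
        fresh = items.reject { |item| seen_ids.include?(item['id']) } # issue 1: dedup
        fresh.each { |item| seen_ids << item['id'] }
        fresh.each { |item| puts "new activity #{item['id']}" }
      elsif response.code == '429' # issue 2: we tripped the rate limit
        poll_interval *= 2         # back off and guess at the real limit
      end

      sleep poll_interval
    end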

Enter Gnip

Gnip set out to give our customers a choice: receive data via either the push model or the pull model, regardless of the native delivery method of the data publisher’s server. In other words, we wanted to reverse the “ping” process for our customers. Hence, we reversed the word “ping” to get the name Gnip. And there you have it: the story behind the name!

Clusters & Silos

Gnip is nearing the one-year anniversary of our 2.0 product. We reset our direction several months ago, and as part of that shift we completely changed our architecture. I thought I’d write a bit about that experience.

Gnip 1.0

Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system comprised a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach because it obfuscated the ultimate consumer of their data; they wanted transparency.

Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were all the rage, and degraded real-time delivery was in part due to load on the Publisher’s API. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for that kind of demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy, were we wrong on that count. While Publishers love to complain about the load on their APIs, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification/PubSub solutions (SIP, SUP, PubSubHubbub). Most, however, simply scaled horizontally and threw money at the problem. Twitter has since shined a light on streaming HTTP (or whatever you want to call it; there are so many monikers), which is about as good as it gets (leaving protocol buffers and compressed HTTP streams as mere optimizations to the model). I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middleman, Gnip couldn’t afford antagonistic relations with its data sources.

Gnip 2.0

Literally overnight, we walked away from further construction on our 1.0 platform. We had paying customers on it, however, so we operated it for several months, migrating everyone we could to 2.0, before ultimately shutting it down. Gnip 2.0 counter-intuitively departed from a clustered environment and instead gave each consuming customer explicit, transparent integrations with Publishers, all via standalone instances of the software running on standalone virtualized hardware instances (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials for the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they configure it with credentials they obtain themselves from the Publisher. Publishers have full transparency into, and control of, who’s using their data.

The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I’d argue there are actually relatively few of them out there), the reality is that clusters are hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, all of your customers are leveraged against one big SPOF. If something cracks in the system, all of your customers feel the pain. On the flip side, silo’d instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are few and far between. So the cost of not leveraging the hardware/software patterns we’re generally inclined to architect around is indeed higher, but a simplified system has its benefits, to be sure.

We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Having built two such significant architectures under the same roof has been a fascinating thing to experience. The pros and cons of each are many. Where you wind up with your system is an interesting function of your technical propensities as well as your business constraints. One size never fits all.

Recent Hole In Your Facebook Data Coverage?

Recent changes to Facebook’s crawling policies have had a detrimental impact on some of our customers who were relying on other (black/grey market) third parties for their Facebook social media data collection strategy. I wrote about this dynamic and its potential fallout last month. These recent events underscore the risk and reality around relying on black market data in your business.

Gnip works with publishers like Facebook and Twitter to ensure our software is in clear, transparent compliance with their terms of use/service, which drastically reduces risks like this. All of Gnip’s data collection is above board and respects the policies set forth by the Publishers you want data from. When you use Gnip for your collection needs, you don’t have to live in fear of being “turned off” or sued over data you shouldn’t have.

For specifics on our Facebook integrations, check out yesterday’s post.

Have no fear, Gnip’s here to help. Give our numerous Facebook and other social media feeds a try at http://try.gnip.com or give us a shout at info@gnip.com.

Worried About Twitter's Move to OAuth?

Have you scheduled the engineering work to move to OAuth for Twitter’s interfaces? If you’re a Gnip user, you don’t have to! Gnip users don’t need to know anything about OAuth’s implementation in order to keep their data flowing, and for all of them that’s a huge relief. OAuth is non-trivial to set up and support, and it’s an API authentication/authorization mechanism that most data consumers shouldn’t have to worry about. That’s where Gnip steps in! One of our value-adds is that many API integration shifts, like this one, are hidden from our customers. You can merrily consume data without having to sink expensive resources into adapting to the constantly shifting sands of data provider APIs.

If you’re consuming data via Gnip, then when Twitter makes the switch on the affected APIs, all you’ll need to do is provide Gnip with your OAuth tokens (the new “username” and “password,” just more secure and controllable), and off you go! You don’t have to worry about query-param ordering, hashing, signing, and the associated handshakes.
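
For a sense of what gets absorbed on your behalf, here is a rough sketch of the signing OAuth 1.0a (HMAC-SHA1) requires for a single request. The credentials, request URL, and parameters below are placeholders, and real client libraries handle edge cases this skips.

    require 'openssl'
    require 'base64'
    require 'cgi'
    require 'securerandom'

    # Placeholder credentials obtained from the publisher.
    CONSUMER_KEY    = 'your-consumer-key'
    CONSUMER_SECRET = 'your-consumer-secret'
    TOKEN           = 'user-access-token'
    TOKEN_SECRET    = 'user-token-secret'

    def percent_encode(value)
      # OAuth wants RFC 3986 encoding (spaces as %20, not +).
      CGI.escape(value.to_s).gsub('+', '%20')
    end

    def oauth_signature(method, url, params)
      # 1. Sort and encode every parameter, including the oauth_* ones.
      normalized = params.sort.map { |k, v| "#{percent_encode(k)}=#{percent_encode(v)}" }.join('&')

      # 2. Build the signature base string: METHOD & URL & params.
      base = [method.upcase, percent_encode(url), percent_encode(normalized)].join('&')

      # 3. HMAC-SHA1 keyed with "consumer_secret&token_secret".
      key = "#{percent_encode(CONSUMER_SECRET)}&#{percent_encode(TOKEN_SECRET)}"
      Base64.strict_encode64(OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), key, base))
    end

    params = {
      'oauth_consumer_key'     => CONSUMER_KEY,
      'oauth_token'            => TOKEN,
      'oauth_signature_method' => 'HMAC-SHA1',
      'oauth_timestamp'        => Time.now.to_i.to_s,
      'oauth_nonce'            => SecureRandom.hex(16),
      'oauth_version'          => '1.0',
      'count'                  => '200'
    }

    puts oauth_signature('GET', 'https://api.example-publisher.com/timeline.json', params)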

Black-Market Data

This year’s Glue conference is right around the corner, and Gnip has a small spot that we’re going to use to discuss the emerging black market for data. Data providers like Gnip are under increasing pressure to provide their customers with “more data” than Publishers actually want them to have. Gnip has taken the moral high ground and complies with the terms of service of the Publishers we integrate with, yet we’re seeing more and more firms offer data that the Publishers explicitly disallow access to.

Our position is that we don’t want our customers building on a technical house of cards. While we could technically gather “more data,” doing so in a manner that violates the ToS of a given Publisher ultimately leads to an adversarial relationship between us and said Publisher, and we’d be putting our own customers at risk when that scenario goes south (and it eventually would). The result is a widening gap between legitimate and illegitimate data collection. This should signal to high-demand Publishers that they need to mature and build the real solutions that market dynamics clearly require.

I don’t have my thoughts fully formed on this topic, but it should make for a good discussion at Glue! Hope to see you there!

xml.to_json

Gnip spends an inordinate amount of time slicing and dicing data for our customers. Normalizing the web’s data is something we’ve been doing for a long time now, and we’ve gone through many incarnations of it. While you can usually find a way from format A to format B (assuming the two are inherently extensible, as XML and JSON are), you often bastardize one or the other in the process. DeWitt Clinton (a Googler) recently posted a clear and concise outline of the challenges of moving between various formats. I’ve been wanting to write a post with the above title for a couple of weeks, so thank you to DeWitt for providing the inadvertent nudge.

Parsing

Here at Gnip we’ve done the rounds with respect to how to parse a formatted document. From homegrown regexing to framework-specific parsing libraries, the decisions around how and when to parse a document aren’t always obvious. Layer in the need to performantly parse large documents in real time, and the challenge becomes palpable. Offline document parsing/processing (traditional Google crawler/index style) lets you push off many of the real-time processing challenges. I’m curious to see how Google’s real-time index (their “demo” PubSubHubbub hub implementation) fares with potentially hundreds of billions of events moving through it per day, in “real-time,” in the years to come.

When do you parse?

If you’re parsing structured documents (e.g. XML or JSON) in “real-time,” one of the first questions you need to answer is when you actually parse. Whether you parse data when it arrives at your system’s front door or when it’s on its way out can make or break your app. An assumption throughout this post is that you are dealing with “real-time” data, as opposed to data that can be processed “offline” for future on-demand use.

A good rule of thumb is to parse data on the way in when the relationship between inbound and outbound consumption is 1-to-many. If you have lots of consumers of your parsed/processed content, do the work once, up front, so it can be leveraged across all of that consumption (diagram below).

If the relationship between in and out is purely 1-to-1, then it doesn’t really matter, and other factors in your architecture will likely guide you. If the consumption dynamic is such that not all of the information will be consumed 100% of the time (e.g. 1-to-something-less-than-1), then parsing on the outbound side generally makes sense (diagram below).

Synchronous vs. Asynchronous Processing

When handling large volumes of constantly changing data, you may have to sacrifice the simplicity of serial/synchronous data processing in favor of parallel/asynchronous data processing. If your inbound processing flow becomes a bottleneck and things start queuing up to an unacceptable degree, you’ll need to move processing out of band and apply multiple processors to the single stream of inbound data; that’s asynchronous processing.
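
A minimal sketch of moving parsing out of band, assuming an in-memory, bounded queue and a pool of parser threads; parse_and_store is a stand-in for whatever your processing actually is.

    require 'json'

    QUEUE   = SizedQueue.new(10_000) # bounded, so a slow parser exerts back-pressure on intake
    WORKERS = 4

    # Stand-in for real processing (normalization, persistence, etc.).
    def parse_and_store(raw)
      activity = JSON.parse(raw)
      # ... normalize/store the activity here ...
    end

    workers = WORKERS.times.map do
      Thread.new do
        while (raw = QUEUE.pop) # nil is the shutdown signal
          parse_and_store(raw)
        end
      end
    end

    # The intake side just enqueues raw payloads and returns to reading the stream.
    def ingest(raw)
      QUEUE.push(raw)
    end

    # Shutdown: WORKERS.times { QUEUE.push(nil) }; workers.each(&:join)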

How do you parse?

Regex parsing: While old school, regex can get you a long way, performantly. However, this assumes you’re good at writing regular expressions; simple missteps can make a regex perform incredibly slowly.

DOM-based parsing: While the APIs around DOM-based parsers are oh so tempting to use, that higher-level interface comes at a cost. DOM parsers often construct heavy object models around everything they find in a document, and most of the time you won’t use more than 10% of it. Most are configurable with respect to how they parse, but often not to the degree of giving you only what you need. All have their own bugs that you’ll learn to work through or around. Gnip currently uses Nokogiri for much of its XML document parsing.

SAX-style parsing: It doesn’t get much faster. The trade-off with this kind of parsing is complexity. One of the crucial benefits of DOM-style parsing is that the node graph is constructed and maintained for you; SAX-style parsing requires that you track that structure yourself, and it often isn’t fun or pretty.
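
As an illustration of that trade-off, here are both styles with Nokogiri pulling titles out of a small feed-like document; the element names are just for the example.

    require 'nokogiri'

    xml = <<~XML
      <feed>
        <entry><title>first</title></entry>
        <entry><title>second</title></entry>
      </feed>
    XML

    # DOM style: the whole document is materialized as a node graph, then queried.
    doc = Nokogiri::XML(xml)
    puts doc.xpath('//entry/title').map(&:text)

    # SAX style: nothing is retained; we react to events and keep our own state.
    class TitleHandler < Nokogiri::XML::SAX::Document
      attr_reader :titles

      def initialize
        @in_title = false
        @titles   = []
      end

      def start_element(name, attrs = [])
        @in_title = (name == 'title')
      end

      def characters(text)
        @titles << text.strip if @in_title && !text.strip.empty?
      end

      def end_element(name)
        @in_title = false if name == 'title'
      end
    end

    handler = TitleHandler.new
    Nokogiri::XML::SAX::Parser.new(handler).parse(xml)
    puts handler.titles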

Transformation

Whether you’re moving between formats (e.g. XML and JSON) or making structural changes to the content, the promises of easy transformation made by XSLT were never kept. For starters, no one moved beyond the 1.0 spec, which is grossly underpowered. Developers have come to rely on homegrown mapping languages (Gnip 1.0 employed a completely custom language for moving between arbitrary inbound XML documents and a known outbound structure), conveniences provided by the underlying parsing libraries, or features of the language frameworks they’re building in. For example, Ruby sprinkles “.to_json” methods throughout many classes. While the method works much of the time for serializing an object of known structure, its output on more complex objects, like arbitrarily structured XML, is highly variable and not necessarily what you want in the end.

An example of where simple .to_json falls short is the handling of XML namespaces. While structural integrity is indeed maintained and namespaces are translated, they’re meaningless in the world of JSON. So if you only need a one-way transformation, the resulting JSON ends up cluttered when you use out-of-the-box transformation methods. Of course, as DeWitt points out, if you need round-trip integrity, then the clutter is necessary.
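
To make that concrete, here is a toy, deliberately naive XML-to-hash conversion that keeps namespace prefixes on key names (the namespaces and element names are made up for illustration); the resulting JSON carries clutter that means nothing to a JSON consumer.

    require 'nokogiri'
    require 'json'

    xml = <<~XML
      <entry xmlns:gnip="http://example.com/gnip-schema"
             xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
        <title>an activity</title>
        <gnip:matching_rule>some-rule</gnip:matching_rule>
        <geo:lat>40.0150</geo:lat>
      </entry>
    XML

    # Naive conversion: every element becomes a key, prefix and all.
    def node_to_hash(node)
      children = node.element_children
      key = node.namespace && node.namespace.prefix ? "#{node.namespace.prefix}:#{node.name}" : node.name
      value = children.empty? ? node.text : children.map { |c| node_to_hash(c) }.reduce(:merge)
      { key => value }
    end

    doc = Nokogiri::XML(xml)
    puts JSON.pretty_generate(node_to_hash(doc.root))
    # Keys like "gnip:matching_rule" and "geo:lat" survive into the JSON,
    # even though namespaces have no meaning there.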

While custom mapping languages give you flexibility, they also require upkeep (bugs and features). Convenience-library transformation routines are often written to a baseline specification and a strict set of structural rules, which real-world documents frequently violate.

Integrity

Simple transformations are… simple; they generally “just work.” The more complex the documents, however, the harder your transformation logic gets pushed and the more things start to break (if not on the implementation side then on the format side). Sure, you can beat a namespace-, attribute-, and element-laden XML document into JSON submission, but in doing so you’ll likely defeat the purpose of JSON altogether (fast, small wire cost, easy JS objectification). Still, even if you lose some format-specific benefits, the end may justify the means. It’s ugly, but in order to move the world closer to JSON, ugly XML-to-JSON transformers may need to exist for a while. Not everyone with an XML-spewing back-end can afford to build true JSON output into their systems (think Enterprise apps, for one).

In the End

Gnip is working to normalize many sources of data into succinct, predictable streams. While taking on this step is part of our value proposition to customers, the ecosystem at large would benefit significantly from native JSON sources of data (in addition to the prolific XML). XML has been a great, necessary stepping stone for the industry, but nine times out of ten tighter JSON suffices. And finally, if anyone builds an XSLT 2.0 spec-compliant parser for Ruby, we’ll use it!

So You Want Some Social Data

If your product or service needs social data in today’s API marketplace, there are a few things you need to consider in order to most effectively consume said data.


I need all the data

First, double-check your needs. Data consumers often think they need “all the data,” when in fact they don’t. You may need “all the data” for a given set of entities (e.g. keywords, or users) on a particular service, but don’t confuse that with needing “all the data” a service generates. When it comes to high-volume services such as Twitter, consuming “all of the data” amounts to a resource-intensive engineering exercise on your end; there are non-trivial scaling challenges involved in handling large data sets. Do some math and determine whether statistical sampling will give you all you need; the answer is usually “yes.” If the answer is “no,” be ready for an uphill (technical, financial, or business-model) battle with service providers; they don’t necessarily want all of their data floating around out there.
Social data APIs are generally designed to prohibit access to “all of the data,” either technically or through terms-of-service agreements. However, they usually provide great access to narrow sets of data. Consider whether you need “100% of the data” for a relatively narrow slice of information; most social data APIs support this use case quite well.
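
As a back-of-the-envelope version of that math, the classic sample-size formula for estimating a proportion shows you need surprisingly few activities for a stable estimate; this assumes simple random sampling, which is itself an approximation for most feeds.

    # n = z^2 * p * (1 - p) / e^2
    # z: z-score for the desired confidence, p: expected proportion, e: margin of error
    def sample_size(confidence_z: 1.96, proportion: 0.5, margin: 0.01)
      ((confidence_z**2 * proportion * (1 - proportion)) / margin**2).ceil
    end

    puts sample_size                # 95% confidence, +/- 1% => 9604 items
    puts sample_size(margin: 0.03)  # 95% confidence, +/- 3% => 1068 items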


Ingestion


Connectivity

There are three general styles you’ll wind up using to access an API, all of them HTTP-based: event-driven inbound POSTs (e.g. PubSubHubbub/WebHooks), polling GETs, and streaming GETs/POSTs. Each of these has its pros and cons. I’m avoiding XMPP in this post only because it is infrequently used and hasn’t seen widespread adoption (yet). Each style requires a different level of operational and programmatic understanding.
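
As a taste of the streaming style, here is a bare-bones streaming GET with Ruby’s Net::HTTP; the host, path, and credentials are placeholders, and production code needs the reconnect/back-off logic this omits.

    require 'net/http'

    uri = URI('https://stream.example-publisher.com/activities.json') # placeholder endpoint

    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      request = Net::HTTP::Get.new(uri)
      request.basic_auth('username', 'password') # or whatever auth the publisher requires

      http.request(request) do |response|
        # The connection stays open; the body arrives as an open-ended series of chunks.
        response.read_body do |chunk|
          # Each chunk may contain partial records; real code buffers and splits on delimiters.
          print chunk
        end
      end
    end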


Authentication/Authorization

APIs usually have publicly available versions (usually limited in their capabilities), as well as versions that require registration and subsequent authenticated connections. The authC and authZ semantics of APIs range from simple to complex. You’ll need to understand the access characteristics of the specific services you want to use. Some require hands-on, human justification processes in order to be granted the “right level of access” for you and your product. Others are simple automated online registration forms that directly yield the account credentials necessary for API access.
HTTP Basic authentication, not surprisingly, is the predominant authentication scheme, and authorization levels are conveniently tied to the account by the service provider. OAuth (both proper and 2-legged) is gaining steam, however. You’ll also find that API keys (passed as URL params or HTTP headers) are still widely used.
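
In request terms, those options look roughly like this; the header and parameter names vary per publisher, so the ones below are placeholders.

    require 'net/http'

    uri = URI('https://api.example-publisher.com/activities.json') # placeholder endpoint

    # Style 1: HTTP Basic, tied to the account you registered with the publisher.
    basic = Net::HTTP::Get.new(uri)
    basic.basic_auth('account-username', 'account-password')

    # Style 2: API key in a header (header name varies by publisher).
    keyed = Net::HTTP::Get.new(uri)
    keyed['X-Api-Key'] = 'your-api-key'

    # Style 3: API key as a URL parameter.
    param = Net::HTTP::Get.new(URI("#{uri}?api_key=your-api-key"))

    # Pick whichever one the publisher actually supports:
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(basic) }
    puts response.code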


Processing

How you process data once you receive it is certainly affected by which connection style you use. Note that most APIs don’t give you a choice in how you connect; the provider decides for you. Processing data in the same step as receiving it can cause bottlenecks in your system and ultimately put you on bad terms with the API provider you’re connecting to. The analogy is drinking from the proverbial firehose: if you connect the firehose to your mouth, you might get a gulp or two down before you’re overwhelmed by the amount of water coming at you. You’ll either cause the firehose to back up, or you’ll start leaking water all over the place; either way, you won’t keep up. If your average ability to process data is slower than the rate at which it arrives, you’ll have a queueing challenge to contend with. Consider offline, or out-of-band, processing of data as it becomes available: for example, write it to disk or a database and have parallelized worker threads/processes parse and handle it from there. The point is, don’t process it in the moment in that case.
Many APIs don’t produce enough data to warrant out-of-band processing, so inline processing is often just fine. It all depends on what operations you’re trying to perform, how quickly your technology stack can perform them, and the rate at which data arrives.
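
One low-tech version of out-of-band processing: the receive path only appends raw payloads to a spool file, and a separate worker (cron job, daemon, whatever) chews through the spool at its own pace. The paths and the process_record stub are placeholders.

    require 'json'
    require 'fileutils'

    SPOOL_DIR = '/var/spool/social-data' # placeholder path
    FileUtils.mkdir_p(SPOOL_DIR)

    # Receive path: do as little as possible, then get back to reading the firehose.
    def spool(raw_payload)
      File.open(File.join(SPOOL_DIR, "#{Time.now.strftime('%Y%m%d%H')}.log"), 'a') do |f|
        f.puts(raw_payload)
      end
    end

    # Stand-in for your real work: normalization, counting, persistence, etc.
    def process_record(record)
      puts record['id'] # assumes the payload has an "id" field
    end

    # Worker path: run separately (cron, daemon), at whatever pace you can afford.
    def drain(file)
      File.foreach(file) do |line|
        begin
          record = JSON.parse(line)
        rescue JSON::ParserError
          next # skip anything malformed
        end
        process_record(record)
      end
    end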


Reporting

If you don’t care about reporting initially, you will in short order. How much data are you receiving? What are peak volume periods? Which of the things you’re looking for are generating the most results?
API integrations inherently bind your software to someone else’s. Understanding how that relationship is functioning at any given moment is crucial to your day-to-day operations.
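
Even simple per-minute counters, bucketed by whatever rule or keyword produced each activity, answer most of those questions early on; the bucketing key below is an assumption about your data’s shape.

    require 'time'

    # counts[minute][rule] => number of activities
    counts = Hash.new { |h, minute| h[minute] = Hash.new(0) }

    def record(counts, activity_time, rule)
      minute = Time.parse(activity_time).strftime('%Y-%m-%d %H:%M')
      counts[minute][rule] += 1
    end

    record(counts, '2010-05-20T16:01:12Z', 'keyword:boulder')
    record(counts, '2010-05-20T16:01:45Z', 'keyword:boulder')
    record(counts, '2010-05-20T16:02:03Z', 'keyword:denver')

    # Peak minutes, and which rules are generating the most results:
    counts.each do |minute, per_rule|
      puts "#{minute}: #{per_rule.values.sum} activities #{per_rule.inspect}"
    end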


Monitoring

Reporting’s close sibling is monitoring. Understanding when an integration has gone south is just as important as knowing when your own product is having issues; they’re one and the same. Integrating with an API means you’re dependent on someone else’s software, and that software can have any number of issues. From bugs to planned upgrades or API changes, you’ll need to know when things change and take appropriate action.


Web services/APIs are usually incredibly easy to “sample,” but truly integrating and operationalizing them is another, more challenging, process.

Social Data in a Marketplace

Gnip; shipping & handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a thing is necessary. Several times we’ve come up against folks who say either a) “our (random publisher X) data’s already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.

If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either unable to afford a “web engineer” or simply technically focused elsewhere (if at all), need help accessing the data they need to make business decisions. Furthermore, even if you do your own integrations, how robust are your error reporting, monitoring, and overall management of your strategy? Odds are that you haven’t given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you’re integrated with went down? Could you receive the same data from a publisher more efficiently, while relieving load on your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.

As the social media/data space has evolved over the past couple of years, the necessity of a managed, solution-as-a-service approach has become clear. As expected, the number of data consumers continues to explode, while the proportion of consumers with the technical capability to reliably integrate with the publishers keeps shrinking.

Finally, some good technical/formatting standards are catching on (PubSubHubbub, WebHooks, HTTP long-polling/streaming/Comet (thanks, Twitter), ActivityStreams), which gives everyone a vocabulary and a common conceptual understanding to use when discussing how and when real-time data is produced and consumed.

In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less inclined players in the marketplace access to the data they need. For a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build around/with geo? No. I prefer a one-stop shop.

Gnip; An Update

Gnip moved into our new office yesterday (other end of the block from our old office). The transition provided an opportunity for me to think about where we’ve been, and where we’re going.

Team

We continue to grow, primarily on the engineering side. Check out our jobs page if you’re interested in working on a hard problem, with smart people, in a beautiful place (Boulder, CO).

Technology

We’ve built a serious chunk of back-end infrastructure that I’d break into two general pieces: “the bus” and “the pollers.”

“The Bus”

Our back-end moves large volumes of relatively small (usually <~3k bytes) chunks of data from A to B in a hurry. Data is “published” into Gnip, we do some backflips with it, then spit it out the other side to consumers.

“The Pollers”

Our efforts to get Publishers to push directly into Gnip didn’t pan out the way we initially planned, so we had to change course and acquire data ourselves. The bummer here is that we set out on an altruistic mission to relieve the polling pain the industry has been suffering from, but were met with such inertia that we didn’t get the coverage we wanted. The upside is that building polling infrastructure has allowed us to control more of our business destiny. We’ve gone through a few iterations on our approach to polling, from complex job scheduling and systems that “learn” and “adapt” to their surroundings, to dirt-simple, mindless grinders that ignorantly eat APIs/endpoints all day long. We’re currently slanting heavily toward simplicity in the model. The idea is to take learnings from the simple model over time and feed them into abstractions/refactorings that make the system smarter.

Deployment

We’re still in the cloud. Amazon’s EC2/S3 products have been a solid (albeit not necessarily the most cost-effective when your per-box CPU utilization isn’t in the 90%+ range), highly flexible framework for us; hats off to those guys.

Industry

“The Polling Problem”

It’s been great to see the industry wake up and acknowledge “the polling problem” over the past year. SUP (Simple Update Protocol) popped up to provide more efficient polling for systems that couldn’t, or wouldn’t, move to an event-driven model: it publishes a compact change-log, so pollers can poll the change-log and then do heavier polls only for the things that have changed. PubSubHubbub popped up to provide the framework for a distributed Gnip (though lacking inherent normalization); a combination of polling and events spread across nodes allows for a more decentralized approach.

“Normalization”

The Activity Streams initiative grew legs and is walking. As with any “standards” (or standards-like) initiative, things are only as good as adoption. Building ideas in a silo without users makes for a fun exercise, but not much else. Uptake matters, and MySpace and Facebook (among many other smaller initiatives) have bitten off chunks of Activity Streams; that’s a very big, good sign for the industry. Structural and semantic consistency matters for applications digesting a lot of information. Gnip provides highly structured and consistent data to its consumers via gnip.xsd.

In order to meet its business needs, and to adapt to the constantly moving industry around it, Gnip has adjusted its approach on several fronts. We moved to incorporate polling. We understand that there is more than one way of doing things, and we will incorporate SUP and PubSubHubbub into our framework. Doing so will make our own polling efforts more effective, and also give our consumers flexibility in how they receive data. While normalized data is nice for a large category of consumers, there is a large tier of customers that doesn’t need, or want, heavy normalization; opaque message flow has significant value as well.

We set out to move mind-boggling amounts of information from A to B, and we’re doing that. Some of the nodes in the graph are shifting, but the model is sound. We’ve found there are primarily two types of data consumers: those who want high coverage of a small number of sources (“I need 100% of Joe, Jane, and Mike’s activity”), and those who want “as high as you can get it” coverage of a large number of sources (“I don’t need 100%, but I want very broad coverage”). Gnip has adjusted to accommodate both.

Business

We’ve had to shift our resources to better focus on the paying segments of our audience. We initially thought “life-stream aggregators” would be our biggest paying customer segment; however, data/media analytics firms have proven significant. Catering to the customers who tell you “we have budget for that!” makes good business sense, and we’re attacking those opportunities.

Gravitational Shift

Gnip’s approach to getting more Publishers into the system has evolved. Over the past year we’ve learned a lot about the data delivery business and the state of its technological art. While our core infrastructure remains a highly performant data delivery bus, the way data arrives at Gnip’s front door is shifting.

We set out assuming the industry at large (both Publishers and Consumers) was tired of highly latent data access. What we’ve learned is that data Consumers (e.g. life-stream aggregators) are indeed weary of the latency, but many Publishers aren’t as interested in distributing their data in real-time as we initially estimated. So, in order to meet intense Consumer demand for data delivered in a normalized, minimal-latency (not necessarily “real-time”) manner, Gnip is adding many new polled Publishers to its offering.

Check out http://api.gnip.com and see how many Publishers we have to offer as a result of walking down the polling path.

Our goal remains to “deliver the web’s data,” and while the core Gnip delivery model remains the same, polling has allowed us to greatly expand the list of available Publishers in the system.

Tell us which Publishers/data sources you want Gnip to deliver for you! http://gnip.uservoice.com/

We have a long way to go, but we’re stoked at the rate we’re able to widen our Publisher offering now that our polling infrastructure is coming online.