Gnip. The Story Behind the Name

Have you ever thought, "Gnip . . . well, that's a strange name for a company; what does it mean?" As one of the newest members of the Gnip team, I found myself thinking that very same thing. And as I began telling my friends about this amazing new start-up I was going to be working for in Boulder, Colorado, they too began to inquire about the meaning behind the name.

Gnip, pronounced (guh'nip), got its name from the very heart of what we do: realtime social media data collection and delivery. So let's dive into . . .

Data Collection 101

There are two general methods for data collection: pull technology and push technology. Pull technology is best described as a data transfer in which the request is initiated by the data consumer and responded to by the data publisher's server. In contrast, with push technology the transfer is initiated by the data publisher's server, which sends new data to the consumer without being asked.

So why does this matter . . .

Well, most social media publishers use the pull method. This means the data consumer's system must constantly go out and "ping" the data publisher's server, asking, "Do you have any new data now?" . . . "How about now?" . . . "And now?" This can cause a few issues:

  1. Deduplication – If you ping the social media server one second and then again a second later and there are no new results, you will receive the same results you got a second ago. This then requires deduplicating the data.
  2. Rate limiting – Every social media data publisher's server sets different rate limits, which control the number of times you can ping the server in a given time frame. These rate limits are constantly changing and typically aren't published. If your server pings the publisher's server above the rate limit, it can result in a complete shutdown of your data collection, leaving you to figure out why the connection is broken (Is it the API? Is it the rate limit? What is the rate limit?).

So as you can see, pull technology can be a tricky beast.
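
To make the "ping . . . ping . . . ping" pattern concrete, here is a minimal sketch of the kind of loop a data consumer typically ends up writing against a pull-style API. The endpoint URL, the 429 status handling, and the field names are illustrative assumptions, not any particular publisher's API.

```python
import time
import requests

SEEN_IDS = set()   # deduplication store; a real system would persist this
ENDPOINT = "https://api.example-publisher.com/activities"  # hypothetical pull endpoint

def poll_once():
    resp = requests.get(ENDPOINT, params={"since": int(time.time()) - 60})
    if resp.status_code == 429:        # rate limit hit: back off and guess at the window
        time.sleep(60)
        return []
    resp.raise_for_status()
    fresh = []
    for activity in resp.json().get("activities", []):
        if activity["id"] not in SEEN_IDS:   # drop duplicates returned by overlapping polls
            SEEN_IDS.add(activity["id"])
            fresh.append(activity)
    return fresh

while True:
    for activity in poll_once():
        print(activity["id"])
    time.sleep(1)   # "do you have any new data now? . . . how about now?"
```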

Enter Gnip

Gnip set out to give our customers the option to receive data in either the push model or the pull model, regardless of the native delivery from the data publisher's server. In other words, we wanted to reverse the "ping" process for our customers. Hence, we reversed the word "ping" to get the name Gnip. And there you have it, the story behind the name!

All These Things That I’ve Done

A little over two years ago, Jud and I hatched an audacious plan — pair a deep data guy with a consumer guy to launch an enterprise company. We would build an incredible data service with the polish of a consumer app, then attack a market generally known for being rather dull with a combination of substance and style.

Over the last two years, Jud has done an amazing job serving as Gnip’s CTO and implicitly as VP of Engineering. Under his leadership, the engineering team has delivered a product that turns the process of integrating with dozens of diverse APIs into a push-button experience. The team he assembled is fantastically talented and passionate about making real-time data more easily consumed. My own team has performed equally well, adding much-needed process to Gnip’s sales and marketing.

Two years ago, if you asked Corporate America to define “social media,” they probably would have said “the blogs.” Last year, they would have probably answered “the blogs and Twitter” and this year they’re adding Facebook to their collective consciousness. The time is better than ever to bring Gnip’s platform to the enterprise and, ultimately, I’m not the CEO to do it. Our plan to have a consumer guy lead an enterprise company ended up having a few holes. For Gnip to thrive in the enterprise, it needs to be squarely in the hands of people who have previously succeeded in that space. So as of today, I’m stepping down as CEO and leaving the company. Jud is taking over as CEO.

I am honored to have worked with Jud and it has been a privilege to work with my team for the last two years. Anything that Gnip has accomplished so far has been because of them. Any criticisms that the company could have accomplished more in the last two years can be directed squarely at me. I look forward to seeing Jud and the team do great things in the years ahead.

New Gnip Push API Service

The Gnip product offerings are growing today as we officially announce a new Push API Service that will help companies more quickly and effectively deliver data to their customers, partners, and affiliates. (See the TechCrunch article: "Gnip Launches Push API To Create Real-Time Stream Of Business Data.")

This new offering leverages the Gnip SaaS Integration Platform but is provided as a complete white-label, embeddable solution that adds real-time push to an existing infrastructure. The main capabilities include the following:

  • Push Endpoint Management: Easily register service endpoints and APIs to create alternative push endpoints powered by the Gnip platform.
  • Real-time Data Delivery: A complete white-label approach allows company-defined URLs to be enhanced for real-time data delivery. Reduce your data latency and infrastructure costs while maintaining control of data access and offloading delivery to Gnip.
  • Reporting Dashboard: Access important metrics and usage information for service endpoints through a statistics API or a web-based dashboard.

A company can add the Push API Service to their existing infrastructure in hours or days with a few steps (a rough sketch of the receiving side follows the list):

  1. Tell us how you want Gnip to access your data and APIs; we have several methods depending on your infrastructure.
  2. Integrate the Gnip Push API Service into your website, with complete control of the user experience and branding.
  3. CNAME the subdomain to seamlessly add the Push API Service to your existing infrastructure.
  4. Track usage of the new Push API Service using the web-based console or stats API.
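
As a rough illustration of steps 1 and 2, the sketch below shows the kind of minimal HTTP endpoint a customer might stand up to receive pushed data once delivery is wired up. The use of Flask, the /gnip/push path, and the JSON payload shape are assumptions made purely for illustration; the actual integration details come from working with the Gnip team.

```python
# Minimal sketch of a customer-side receiver for pushed data.
# Flask, the /gnip/push path, and the payload shape are illustrative assumptions.
from flask import Flask, request

app = Flask(__name__)

@app.route("/gnip/push", methods=["POST"])
def receive_push():
    activities = request.get_json(force=True)   # batch of activities delivered by push
    for activity in activities:
        store(activity)                          # hand off to your own pipeline
    return "", 200

def store(activity):
    print("received:", activity)                 # placeholder for real persistence

if __name__ == "__main__":
    app.run(port=8080)
```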

If your company is interested in learning more about how Gnip can help move your existing repetitive API and website traffic to a more efficient push-based approach, contact us at info@gnip.com.

Newest Gnip Partner: PostRank

Welcome PostRank!

Today Gnip and PostRank announced a new partnership (blog) (press release) that allows companies using the Gnip platform to access the nearly 3 million news articles and stories indexed by PostRank from a million discrete sources every day. In addition, PostRank collects the real-time social interactions with content across dozens of social networks and applications.

Gnip will be providing PostRank as a premium service that requires a subscription. All of the value-added features of the Gnip platform, including normalization, rule-based filtering, and push delivery of content, are supported with the new Gnip PostRank Data Publisher. For pricing information, contact info@gnip.com or shane@gnip.com.

Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we also learned that, for many reasons, we need to work with some services via a polling approach. For this reason our latest release, v2.1, includes the Gnip Service Polling feature, so that we can work with any service using a push, poll, or mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every end-user developer or company has the option to tell Gnip where to push data for the filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data from a Gnip-generated URL for their filters.
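
For teams that take the HTTP GET option, consumption can be as simple as the sketch below. The URL shown is a made-up placeholder for the Gnip-generated URL of one of your filters, and the basic-auth credentials and JSON response shape are assumptions for illustration only.

```python
import time
import requests

# Placeholder for the Gnip-generated URL of one of your filters (not a real URL).
FILTER_URL = "https://api.gnip.example.com/publishers/twitter/filters/my-filter/activity"

def fetch_activities():
    resp = requests.get(FILTER_URL, auth=("user", "password"))  # auth scheme is an assumption
    resp.raise_for_status()
    return resp.json()                 # payload assumed to be a JSON list of activities

while True:
    for activity in fetch_activities():
        print(activity)                # stand-in for your own processing
    time.sleep(60)                     # pull on whatever cadence suits your application
```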

One place where the way Gnip collects data can make a difference, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (e.g., Bob posted a photo, Susie made a comment) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.

PUSH services: When we have push services, the latency is usually under 60 seconds, but we know that this is not always the case, since the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (e.g., every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution by default at 10-minute intervals, and people need to keep this in mind when setting their expectations for data flow from any given publisher.

Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and said, "Why 10 minutes?" Well, we chose to focus on "breadth of data" as the initial use case for polling. Also, the 10-minute interval is for the Community Edition (aka the free version). We have the complete ability to turn the dial and, using the smarts built into the polling service feature, execute the right rules faster (e.g., every 60 seconds or faster for popular terms, and every 10, 20, or more minutes for less popular ones). The key issue here is that for very prolific posters or very common keyword rules (e.g., "obama", "http", "google"), more posts can exist in the 10-minute default time frame than we can collect in a single poll from the service endpoint.

For now, the default expectation for our Community Edition platform users should be a 10-minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 beta. If your project or company needs something a bit snappier with the data publishers that are polled, then contact us at info@gnip.com or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.
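
To illustrate the "turn the dial" idea, here is a toy sketch of how a poll interval might be assigned per rule based on how prolific that rule is. The thresholds and intervals are invented for illustration and are not Gnip's actual scheduling logic.

```python
# Toy sketch of per-rule interval selection, loosely modeled on the description above.
# Thresholds and intervals are invented; they are not Gnip's real scheduler settings.

def poll_interval_seconds(avg_matches_per_poll: float, standard_edition: bool) -> int:
    if not standard_edition:
        return 10 * 60                 # Community Edition default: every 10 minutes
    if avg_matches_per_poll > 100:     # very popular rules ("obama", "http", "google")
        return 60                      # poll every minute or faster
    if avg_matches_per_poll > 10:
        return 5 * 60
    return 20 * 60                     # quiet rules can wait longer between polls

print(poll_interval_seconds(500, standard_edition=True))    # -> 60
print(poll_interval_seconds(500, standard_edition=False))   # -> 600
```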

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious.

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube.

Push & Pull

Over here at Gnip we're knee-deep in the joys of polling. Our mission to "deliver the web's data" has us using several approaches to hook consumers up with publisher activities. While we iron out the kinks around our polled publishers, I'm reminded of how broken polling is for many types of data. From rate limiting to minimum poll-interval definitions, polling inherently yields gaps between actions that take place along a timeline. In some cases those gaps are small, and potentially imperceptible, but in others they are large. One of our internal daily stats charts exemplifies this push vs. pull dichotomy. Each color represents a separate publisher (top/black line == total). Guess which publishers are Push (event driven) and which are Pull (polling driven).

Gnip Publisher Daily Chart

The answer: the consistent, connected publishers/lines in the chart are Push (event driven), and the more variable publishers/lines are Pull (poll driven).

Our goal in life is to smooth those variable/jagged lines for the polled publishers, but along the path to data delivery nirvana, I thought I'd share this visual.

Data Standards?

Today's general data standards are akin to yesterday's HTML/CSS browser support standards. The first rev of Gecko (not to be confused with the original Mosaic/Navigator rendering engine) at Netscape was truly standards compliant, in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until "quirks mode" became the status quo. The abyss of broken data that machines, and humans, generate eclipsed web pages back then, and it's an ever-present issue in the ATOM/RSS/XML available today.

Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. While it's ugly to us, we normalize data to make our Customers' lives better. Consumer-facing aggregators (Plaxo/FF) beautify the picture for their display layers; Gnip beautifies the picture for its data consumption API. Cleaning up the mess that exists on the network today has been an eye-opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We're able to work closely with those producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling), it's a rat's nest of unstructured, cobbled-together, ill-formatted, and poorly-encoded/escaped data.

XML has provided self-describing formats and structure, but it ends there. Thousands of pounds of wounded data show up on Gnip's doorstep each day, and that's where Gnip's normalization heavy lifting comes into play. I thought I'd share some of the more common bustage we see, along with a little commentary around each category of problem:

  • <![CDATA[ ]]> is akin to void* and is way overused. The result is magical custom parsing of something that someone couldn't fit into some higher-level structure.

    • If you’re back-dooring data/functions into an otherwise “content” payload, you should revisit your overall model. Just like void*, CDATA usually suggests an opaque box you’re trying to jam through the system.
  • Character-limited message bodies (e.g. microblogging services) wind up providing data to Gnip that has escaped HTML sequences chopped in half, leaving the data consumer (Gnip in this case) guessing at what to do with a broken encoding. If I give you "&a" (the start of an escaped ampersand that got truncated), you have to decide whether to treat it literally, expand it to "&", or drop it. None of these was intended by the user who generated the original content; they just typed '&' into a text field somewhere.

    • Facebook has taken a swing at how to categorize “body”/”message” sizes which is nice, but clients need to do a better job truncating by taking downstream encoding/decoding/expansion realities into consideration.
  • Document bodies that have been escaped/encoded multiple times subsequently leave us deciphering how many times to run them through the un-escape/decode channel (a rough sketch of this kind of defensive decoding appears after this list).

    • _Lazy_. Pay attention to how you’re treating data, and be consistent.
  • Illegal characters in XML attribute/element values.

    • _LAZY_. Pay attention.
  • Custom extensions to “standard” formats (XMPP, RSS, ATOM). You think you’re doing the right thing by “extending” the format to do what you want, but you often wind up throwing a wrench in downstream processing. Widely used libs don’t understand your extensions, and much of the time, the extension wasn’t well constructed to begin with.

    • Sort of akin to CDATA; however, legitimate use cases exist for this. Keep in mind that by doing this, there are many libraries in the ecosystem that will not understand what you've done. You have to be confident that your data consumers are something you can control and ensure they're using a lib/extension that can handle your stuff. Avoid extensions, or if you have to use them, get them right.
  • Namespace case-sensitivity/insensitivity assumptions differ from service to service.

    • Case-sensitivity rules were polluted with the advent of MS-DOS and have been propagated over the years by end-user expectations. Inconsistency stinks, but this one's here to stay.
  • UTF-8, ASCII encoding bugs/misuse/misunderstanding. Often data claims to be encoded one way, when in fact it was encoded differently.

    • Understand your tool chain, and who’s modifying what, and when. Ensure consistency from top to bottom. Take the time to get it right.
  • UTF-16… don’t go there.

    • uh huh.
  • Libraries in the field to handle all of the above each make their own inconsistent assumptions.

    • It's conceivable to me that Gnip winds up setting the state of the art in XML processing libs, whether by doing it ourselves or by contributing to existing code trees. Lots of good work is out there; none of it is great.
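
As a rough illustration of the defensive decoding that doubly-escaped bodies and illegal XML characters force on a consumer, here is a minimal sketch. It is a simplification under assumptions (a bounded number of unescape passes, XML 1.0 character rules), not Gnip's actual normalization code.

```python
import html
import re

# Characters that are illegal in XML 1.0 (most C0 control characters).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def normalize_body(raw: str, max_passes: int = 3) -> str:
    """Best-effort cleanup of a doubly-escaped, illegal-character-laden body."""
    text = ILLEGAL_XML_CHARS.sub("", raw)        # drop characters XML parsers reject
    for _ in range(max_passes):                  # unwind repeated escaping, e.g. &amp;amp;
        unescaped = html.unescape(text)
        if unescaped == text:                    # stop once decoding is a no-op
            break
        text = unescaped
    return text

print(normalize_body("Tom &amp;amp; Jerry\x07"))  # -> "Tom & Jerry"
```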

You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.

If everyone would clean up their data by the end of the day, that’d be great. Thanks.

Q(ueue), R, S, T, U, V…


As software evolves, you constantly adjust your view from the forest perspective to the tree perspective. Recently we've been stepping back from our Polling infrastructure to get the forest view. Today we spent the afternoon revisiting initial requirements, considering new requirements, and doing whiteboard math around sweet spots on curves. The conversation reminded me of a meeting we had with Bob Pasker nearly ten months ago.

We met with Bob as part of an investor’s due diligence process; Bob prodded us to a) opine on whether or not we were solving a worthy problem, and b) opine on whether or not we, as a team, were worth the investment risk (the investor gave us the money). Within sixty seconds of sitting down with him, he stopped our pitch and blurted out “you’re a queue!” In so many ways he was right. I’ve always thought about Gnip as a series of pipes that yield a message bus, but today we were visited by the queuing theory ghosts of implementations past. If you look at our architecture from a distance, we have several queues in the system. If you’ve ever worked with a multi-queue system, you’ve grappled with some of our challenges.

Managing the flow of data, in "real-time," from point A to point N in a reliable manner means you're queuing it up at various points along the way. At each processing stage, you queue the data and digest it (usually FIFO for Gnip). The particular challenges we faced today were around queuing constraints in our Polling model. Linear, in-process FIFO queues are relatively straightforward (though complex in and of themselves when "pushes" outstrip "pops" for long durations). However, when you start prioritizing queue entries (changing the natural FIFO order) and adding rate limitations on "pops," things get more challenging. When you start distributing queue "pop" consumption (e.g. worker threads) across NICs, you really start to feel the joy.

Gnip has a typical scheduler/job-processing pattern in our Polling infrastructure. We push jobs onto a queue in a prioritized manner, then off-box workers pop those jobs off the queue for processing. Currently those job processors are "dumb"; they know nothing other than how to process the job they just acquired. Getting that distributed grid of independently aware/scheduled job processors to coordinate and share state is the fun challenge du jour at Gnip.
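
A stripped-down version of that scheduler/worker pattern might look like the sketch below. It runs in one process with threads rather than off-box workers, and the job shape, priorities, and rate limit are invented purely for illustration.

```python
import queue
import threading
import time

jobs = queue.PriorityQueue()    # lower number == higher priority

def schedule(priority: int, endpoint: str):
    jobs.put((priority, time.time(), endpoint))  # timestamp keeps FIFO order within a priority

def worker(pops_per_second: float):
    while True:
        priority, _, endpoint = jobs.get()       # blocking "pop"
        print(f"polling {endpoint} (priority {priority})")
        jobs.task_done()
        time.sleep(1.0 / pops_per_second)        # crude rate limit on pops

# Two "dumb" workers that know nothing beyond the job they just acquired.
for _ in range(2):
    threading.Thread(target=worker, args=(5.0,), daemon=True).start()

schedule(0, "https://publisher-a.example.com/activities")
schedule(5, "https://publisher-b.example.com/activities")
jobs.join()
```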

Another general queuing challenge arises when queues become overwhelmed due to a pause in processing downstream. Whether the pause is caused by a node outage or a flood of "pushes," queue backup in an automated job-creation environment can yield phenomenal data build-up. Imagine inbound data from a given Publisher arriving at a rate of 30 messages/sec. For every minute of processing that doesn't occur, 1,800 messages back up on the queue. General traffic ebb and flow isn't such a big deal, as you build your system to grow/shrink accordingly, but when there are major issues (a network outage downstream, a bug, a crash, whatever), physical hardware limitations come into view. Trade-offs have to be made between data loss (dropping data to keep things alive) and chaining "store and forward" queues several levels deep. Operationally at Gnip, we've had to drop data at times, and at others we've had to build additional queuing layers where queue backup is more common.
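
That data-loss trade-off can be made concrete with a bounded buffer that sheds the oldest entries once a hardware-imposed cap is hit. The numbers here (30 messages/sec, a 100,000-message cap, an hour-long downstream pause) are illustrative only.

```python
from collections import deque

# Bounded buffer: once the cap is hit, the oldest messages are dropped to keep things alive.
MAX_BUFFERED = 100_000                     # stand-in for a physical memory/disk limit
buffer = deque(maxlen=MAX_BUFFERED)

def enqueue(message):
    dropped = len(buffer) == MAX_BUFFERED  # deque with maxlen silently evicts the oldest entry
    buffer.append(message)
    return dropped

# 30 messages/sec with downstream paused for an hour piles up 108,000 messages,
# so roughly 8,000 of them get shed under this policy.
drops = sum(enqueue(i) for i in range(30 * 60 * 60))
print(drops)                               # -> 8000
```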

Just another day in the life of shuffling data around the network.