I want '*'

My favorite requirement from customers is the “I want all the data, from all the sources, for all of history, and for all of the future” one. You’re never going to get it, from anyone, so reset your expectations. A few constructs fall out of this request.

Two Types of ‘feeds’

Firehoses

These are aggregate sources of data for a given publisher. They may, or may not, be a complete representation of that publisher’s data set. Everyone wants firehoses, but truth be told, there are very few of them in the wild, and those that do exist tend to carry the less “valuable” data. Think of firehose access as a statistically relevant sample, rather than a truly “complete” set of data.

Seeded Feeds

These encompass the majority of data sources, and they require that you know what you’re looking for, be it a keyword, a tag, a user name, a user id, or a geo-location.

In either case you need to know what it is you’re after. Blind, unfettered access to a given publisher’s feed is a rarity and actually isn’t all that interesting in the end; you just think it is because someone else had the product idea first (e.g. the publisher you want all the data from, such as Twitter).

Historical Access

Storing and indexing lots of data is conceptually simple, yet hard to implement at scale; just ask any of the big-three search engines. You can stuff as much data as possible into a database, and “search” it offline, in order to meet most historical data access requirements, but weaving that into a variably accessed consumer application isn’t always easy. While storage costs are generally nil for today’s highly compressible data, the operational management costs of your locally stored data aren’t.

“Real-time” Access

Processing data in a manner other than the one in which it originated causes an impedance mismatch. Stream-to-offline processing implies that you’ll have gaps in data due to queuing problems. Offline-to-stream suggests the same. Offline-to-offline and stream-to-stream are generally easy to get your head and code around, but be wary of overloading stream processing with too much work, as it then starts to feel like stream-to-offline. Once you enter that world, you need to solve parallel processing problems, in real time.
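
To make the stream-to-offline gap concrete, here is a minimal sketch (hypothetical names, not Gnip code) of a fast stream feeding a slow offline processor through a bounded queue; once the queue fills, activities get dropped, which is exactly the kind of gap queuing problems produce.

```python
import queue
import threading
import time

# Hypothetical illustration: a fast stream feeding a slow offline (batch-style)
# processor through a bounded queue. Once the queue is full, new activities are
# dropped, which is exactly the kind of gap a stream-to-offline hand-off produces.

inbox = queue.Queue(maxsize=100)   # bounded buffer between the stream and the batch job
dropped = 0                        # activities the offline side will never see


def stream_producer(total=1000):
    """Simulate a stream delivering roughly 1,000 activities per second."""
    global dropped
    for i in range(total):
        try:
            inbox.put_nowait({"id": i, "body": "activity %d" % i})
        except queue.Full:
            dropped += 1           # the gap: the queue backed up and data was lost
        time.sleep(0.001)


def offline_consumer():
    """Simulate a batch-style consumer that can only keep up with ~100 per second."""
    while True:
        inbox.get()
        time.sleep(0.01)           # ten times slower than the producer
        inbox.task_done()


threading.Thread(target=offline_consumer, daemon=True).start()
stream_producer()
print("dropped %d activities" % dropped)   # non-zero unless you add capacity or parallelism
```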

Regardless of access pattern, you can only introspect and access the data you initially seeded your sources with. If your seed was wrong (for example, you used the wrong set of users or keywords), processing the data doesn’t matter. Full circle to garbage in, garbage out.

If you find yourself asking your team, and/or a vendor, for the introductory requirement, I suggest you actually don’t have the focus on your product or idea that you’ll ultimately need in order to be successful. Batten down the hatches, and get crisp about precisely what it is you want to build, and precisely what data you need to do so. If you can do that, you will have a shot at success.

The Only Constant is Change

As a few people have mentioned online, Gnip laid off seven team members today. It was a horrible thing to have to do and my very best wishes go out to each team member who was let go. If you’re in Boulder and need a Java or PHP developer, an HR/office manager or an inside salesperson, send an email to eric@gnip.com and I’ll connect you with some truly awesome people.

I would like to address a few specific points for our partners, customers and friends:

  1. We believe as strongly as ever in providing data aggregation solutions for our customers.  If we didn’t, we would have returned to our investors the year of funding we have in the bank (now two years).
  2. We are still delivering the same data as yesterday. The existing platform is highly stable and will continue to churn out data as long as we want it to.
  3. The changes in personnel revolve around rebuilding the technology stack to allow for faster, more iterative releases. We’ve been hamstrung by a technology platform that was built under a very different set of assumptions more than a year ago. While exceptionally fast and stable, it is also a beast to extend.  The next rev will be far more flexible and able to accommodate the many smart feature requests we receive.

To Alex, Shane, Ingrid, JL, Jenna, Chris and Jen, it has been an honor working with you and I hope to have the privilege to do so again some day.

To our partners and customers, Gnip’s future is brighter than ever and we look forward to serving your social data needs for many years to come.

Sincerely,

Eric Marcoullier, CEO

Gravitational Shift

Gnip’s approach to getting more Publishers into the system has evolved. Over the past year we’ve learned a lot about the data delivery business and the state of its technological art. While our core infrastructure remains a highly performant data delivery bus, the way data arrives at Gnip’s front door is shifting.

We set out assuming the industry at large (both Publishers and Consumers) was tired of highly latent data access. What we’ve learned is that data Consumers (e.g. life-stream aggregators) are indeed weary of the latency, but that many Publishers aren’t as interested in distributing their data in real-time as we initially estimated. So, in order to meet intense Consumer demand to have data delivered in a normalized, minimal-latency (not necessarily “real-time”) manner, Gnip is adding many new polled Publishers to its offering.
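
For the curious, here is a minimal sketch of the kind of low-latency polling loop this implies. It is illustrative only, not Gnip’s implementation: the feed URL and interval are made up, and the conditional-GET headers are plain HTTP, nothing Gnip-specific.

```python
import time
import urllib.error
import urllib.request

# Illustrative polling loop (not Gnip's implementation): poll a generic RSS/Atom
# endpoint on a short interval, using conditional GETs so that unchanged feeds
# cost almost nothing and new entries are picked up within one poll interval.

FEED_URL = "http://example.com/feed.atom"   # hypothetical publisher feed
POLL_INTERVAL = 30                          # seconds; tune per publisher


def handle_new_entries(raw_bytes):
    """Placeholder: parse the feed, de-duplicate entries, push them downstream."""
    pass


def poll_forever():
    etag = None
    last_modified = None
    while True:
        request = urllib.request.Request(FEED_URL)
        if etag:
            request.add_header("If-None-Match", etag)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                etag = response.headers.get("ETag", etag)
                last_modified = response.headers.get("Last-Modified", last_modified)
                handle_new_entries(response.read())
        except urllib.error.HTTPError as error:
            if error.code != 304:            # 304 means nothing new, which is fine
                raise
        time.sleep(POLL_INTERVAL)
```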

Check out http://api.gnip.com and see how many Publishers we have to offer as a result of walking down the polling path.

Our goal remains to “deliver the web’s data,” and while the core Gnip delivery model remains the same, polling has allowed us to greatly expand the list of available Publishers in the system.

Tell us which Publishers/data sources you want Gnip to deliver for you! http://gnip.uservoice.com/

We have a long way to go, but we’re stoked at the rate we’re able to widen our Publisher offering now that our polling infrastructure is coming online.

Making "User-Generated Content" More Useful

We have been thinking a lot about user-generated content over the last few weeks and months as we reflect on how developers are using the Gnip platform to build solutions we never imagined when the company was started.

One of the great things about getting out and talking to lots of people is that we are always learning more about how we fit, or how other people think we fit, into the ecosystems that make up the Internet. Recently we realized that we really only exist because of the entire phenomenon of user-generated content. This is not the core idea with which we started the company. The core idea was really techy (see our FAQ). In addition to that original techy idea, it is now obvious to us that the primary Gnip mission also has to focus on making user-generated content universally accessible and useful, as it was intended by the original author who shared the content.

We still see Gnip providing innovative and bleeding-edge technology solutions for social and business data integration, but realizing that our mission must also include thinking about the people sharing the content in the first place impacts how we prioritize everything we do.

Do not worry, we are not going to start building web apps and become another social media aggregation solution (those are some of our partners and customers, and we love them all). Instead we are that much more excited to focus on the underlying platform for anyone who wants to integrate user-generated content.

What this most does for our team is help us understand that, in addition to providing a great platform for developers to do data integration, we also have to help developers using the Gnip platform access and integrate data in a way that upholds the original intent of the user who shared the content in the first place. Where something was originally shared and how it was shared does matter, and this is why our new schema goes so far as to include original destination URLs, author profile information, regarding URLs and other pertinent information that can be lost when people are just passing links, comments or re-tweets around, or hacking together social data.
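
To make that concrete, here is a hypothetical sketch of the provenance a normalizer can carry along with each activity. The field names are illustrative only; they are not the actual Gnip schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical illustration only: these field names are not the actual Gnip schema.
# They simply show the provenance a normalizer can carry along so downstream apps
# can honor where, how, and by whom the content was originally shared.


@dataclass
class NormalizedActivity:
    body: str                                   # what the author actually said
    author_name: str                            # author profile information
    author_profile_url: Optional[str] = None
    source_url: Optional[str] = None            # where it was originally shared
    regarding_urls: List[str] = field(default_factory=list)  # what it was about
    raw_payload: Optional[str] = None           # keep the original for auditing


activity = NormalizedActivity(
    body="Loving the new release!",
    author_name="jdoe",
    author_profile_url="http://example.com/jdoe",
    source_url="http://example.com/jdoe/status/42",
    regarding_urls=["http://example.com/release-notes"],
)
```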

So, now when we say Gnip is focused on Delivering the Web’s Data, we will be thinking about developers and the people everywhere who are using the Internet to just tell the world something in their own way.

Data Standards?

Today’s general data standards are akin to yesterday’s HTML/CSS browser support standards. The first rev of Gecko (not to be confused w/ the original Mosaic/Navigator rendering engine) at Netscape was truly standards compliant in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until “quirks-mode” became the status quo. The abyss of broken data that machines, and humans, generate eclipsed web pages back then, and it remains an ever-present issue in the ATOM/RSS/XML available today.

Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. While the data is ugly to us, we normalize it to make our Customers’ lives better. Consumer-facing aggregators (Plaxo/FF) beautify the picture for their display layers. Gnip beautifies the picture for its data consumption API. Cleaning up the mess that exists on the network today has been an eye-opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We’re able to work closely with said producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling), it’s a rat’s nest of unstructured, cobbled-together, ill-formatted, and poorly-encoded/escaped data.

XML has provided self-describing formats and structure, but it ends there. Thousands of pounds of wounded data show up on Gnip’s doorstep each day, and that’s where Gnip’s normalization heavy lifting comes into play. I thought I’d share some of the more common bustage we see, along with a little commentary around each category of problem:

  • <![CDATA[ ]]> is akin to void* and is way overused. The result is magical custom parsing of something that someone couldn’t fit into some higher-level structure.

    • If you’re back-dooring data/functions into an otherwise “content” payload, you should revisit your overall model. Just like void*, CDATA usually suggests an opaque box you’re trying to jam through the system.
  • Character limited message bodies (e.g. microblogging services) wind up providing data to Gnip that has escaped HTML sequences chopped in half, leaving the data consumer (Gnip in this case) guessing at what to do with a broken encoding. If I give you “&a”, you have to decide whether to treat it literally, expand it to “&”, or drop it. None of those options was intended by the user who generated the original content; they just typed ‘&’ into a text field somewhere.

    • Facebook has taken a swing at how to categorize “body”/”message” sizes which is nice, but clients need to do a better job truncating by taking downstream encoding/decoding/expansion realities into consideration.
  • Document bodies that have been escaped/encoded multiple times subsequently leave us deciphering how many times to run them through the un-escape/decode channel (a sketch of this kind of clean-up follows this list).

    • _Lazy_. Pay attention to how you’re treating data, and be consistent.
  • Illegal characters in XML attribute/element values.

    • _LAZY_. Pay attention.
  • Custom extensions to “standard” formats (XMPP, RSS, ATOM). You think you’re doing the right thing by “extending” the format to do what you want, but you often wind up throwing a wrench in downstream processing. Widely used libs don’t understand your extensions, and much of the time, the extension wasn’t well constructed to begin with.

    • Sort of akin to CDATA, however, legitimate use cases exist for this. Keep in mind that by doing this, there are many libraries in the ecosystem that will not understand what you’ve done. You have to be confident that your data consumers are something you can control and ensure they’re using a lib/extension that can handle your stuff. Avoid extensions, or if you have to use them, get it right.
  • Namespace case-sensitivity/insensitivity assumptions differ from service to service.

    • Case-sensitivity rules were polluted with the advent of MS-DOS, and have been propagated over the years by end-user expectations. Inconsistency stinks, but this one is here to stay.
  • UTF-8, ASCII encoding bugs/misuse/misunderstanding. Often data claims to be encoded one way, when in fact it was encoded differently.

    • Understand your tool chain, and who’s modifying what, and when. Ensure consistency from top to bottom. Take the time to get it right.
  • UTF-16… don’t go there.

    • uh huh.
  • Libraries in the field to handle all of the above each make their own inconsistent assumptions.

    • It’s conceivable to me that Gnip winds up advancing the state of the art in XML processing libs, whether by doing it ourselves or by contributing to existing code trees. Lots of good work out there, none of it great.
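
As referenced above, here is a minimal sketch of the kind of clean-up two of these items imply, namely illegal XML characters and repeatedly escaped bodies. It is illustrative only, not Gnip’s normalization code.

```python
import html
import re

# Illustrative clean-up helpers (not Gnip's actual normalizer) for two of the
# problems above: control characters that are illegal in XML 1.0, and bodies
# that were HTML-escaped more than once by a sloppy producer.

# XML 1.0 allows tab, newline, and carriage return, but no other control
# characters; everything else below 0x20 has to be stripped or rejected.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text):
    """Drop control characters that may not appear in XML element/attribute values."""
    return ILLEGAL_XML_CHARS.sub("", text)


def unescape_repeatedly(text, max_rounds=5):
    """Undo multiple rounds of HTML escaping, e.g. '&amp;amp;lt;b&amp;amp;gt;' becomes '<b>'.

    Keep unescaping until the string stops changing, with a cap as a safety valve.
    The trade-off: a user who literally typed '&amp;' gets collapsed too, which is
    why consistent escaping upstream matters so much.
    """
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text


if __name__ == "__main__":
    dirty = "caf\x00e &amp;amp; bar"
    print(unescape_repeatedly(strip_illegal_xml_chars(dirty)))   # prints "cafe & bar"
```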

You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.

If everyone would clean up their data by the end of the day, that’d be great. Thanks.

Gnip Pushed a New Platform Release This Week

We just pushed out a new release this week that includes new publishers and capabilities. Here is a summary of the release highlights. Enjoy!

  • New YouTube publisher: Do you need an easy way to access, filter and integrate YouTube content into your web application or website? Gnip now provides a YouTube publisher, so go create some new filters and start integrating YouTube-based content.
  • New Flickr publisher: Our first Flickr publisher had some issues with data consistency and could almost be described as broken. We built a brand new Flickr publisher to provide better access to content from Flickr. Creating filters is a snap so go grab some Flickr content.
  • Now publisher information can be shared across accounts: When multiple developers are using Gnip to integrate web APIs and feeds, it is sometimes useful to see other filters as examples. Sharing allows a user to see publisher activity and statistics, but does not grant the ability to edit or delete.
  • New Data Producer Analytics Dashboard: If your company is pushing content through Gnip, we understand it is important to see how, where, and by whom the content is being accessed on our platform, so with this release we have added a web-based data producer analytics dashboard. This is a beta feature, not where we want it yet, and we have some incomplete data issues. However, we wanted to get something available and then iterate based on feedback. If you are a data producer, let us know how to take this forward. The current version provides access to the complete list of filters created against a publisher, and the information can be downloaded in XML or CSV format.

Also, we have a few things we are working on for upcoming releases:

  • Gnip Polling: Our new Flickr and YouTube publishers both leverage our new Gnip Polling service, which we have started using internally for access to content that is not available via our push infrastructure. We plan to make this feature available externally to customers in the future, so stay tuned or contact us if you want to learn more.
  • User generated publishers from RSS Feeds: We are going to open up the system so anyone can create new publishers from RSS Feeds. This new feature makes it easy to access, filter and integrate tons of web based content.
  • Field level mapping on RSS feeds: A lot of the time, the field naming of RSS feeds across different endpoints does not match the way the field is named in your company. This new feature will allow editing and mapping at the individual field level to support normalization across multiple feeds (the sketch after this list shows the idea).
  • Filter rule batch updates: When your filters start to get big, adding lots of new rules can be a challenge. Based on direct customer feedback, it will soon be possible to batch upload filter rules.
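
Until field-level mapping lands, here is a hypothetical client-side sketch of what it amounts to: renaming each feed’s field names into the names your own system expects. The feed identifiers and field names are made up for illustration; this is not the upcoming Gnip feature itself.

```python
# Hypothetical illustration of field-level mapping as a concept (this is a
# client-side sketch, not the upcoming Gnip feature): each feed gets a mapping
# from its own field names to the names your application expects.

FIELD_MAPS = {
    "feed_a": {"title": "headline", "pubDate": "published_at", "link": "url"},
    "feed_b": {"name": "headline", "date": "published_at", "permalink": "url"},
}


def normalize_entry(feed_id, entry):
    """Rename a raw feed entry's fields using the per-feed mapping."""
    mapping = FIELD_MAPS.get(feed_id, {})
    return {mapping.get(key, key): value for key, value in entry.items()}


# Two feeds with different field names collapse into one shape.
print(normalize_entry("feed_a", {"title": "Hello", "pubDate": "2009-02-01", "link": "http://a.example/1"}))
print(normalize_entry("feed_b", {"name": "Hello", "date": "2009-02-01", "permalink": "http://b.example/1"}))
```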

Solution Spotlight: Soup.io is Now Using Gnip

Soup.io is now using the Gnip messaging platform for their web API data integration needs. Welcome Soup.io!

Who is Soup.io?
Soup.io provides an easy-to-use micro-blogging and lifestream service that serves as an aggregator for your public social media feeds. Visit their website at http://www.soup.io/ or their blog at http://updates.soup.io/ to learn more.

Real-world results Soup.io says they are realizing from using Gnip
Soup.io is using Gnip to provide data integration with Twitter, and they have seen a reduction in the latency of their Twitter integration (i.e. the time elapsed for a tweet to show up in the Soup.io service) since moving to Gnip. Now Soup.io users should see their Twitter notices show up within a minute of being sent on the Twitter service. Since Gnip provides data streams from many other providers as well (Flickr, Delicious, etc.), Soup.io is working to use Gnip as the way to access and integrate with these services in the future.

More Examples of How Companies are Using Gnip

We have noticed that we are interacting with two distinct groups of companies: those who instantly understand what Gnip does and those who struggle with what we do. So we decided to provide a few detailed real-world examples of the companies we are actively working with today to provide data integration and messaging services.

First, we are not an end-user facing social aggregation application. (We repeat this often.) We see a lot of people wanting to put Gnip in that bucket along with social content aggregators like FriendFeed, Plaxo and many others. These content aggregators are destination web sites that provide utility to end users by giving them the flexibility to bring their social graph, or part of their graph, together in one place. Also, many of these services now provide web APIs that allow people to use an alternative client to interact with their core services around status updates and conversations, as well as other features specific to the service.

Gnip is an infrastructure service; specifically, we provide an extensible messaging system that allows companies to more easily access, filter and integrate data from web based APIs. While someone could use Gnip as a way to bring content into a personal social media client they want to write for a specific social aggregator, that is not something we are focused on. Below are the company use cases we are focused on:

  1. Social content aggregators: One of the main reasons we started Gnip was to solve the problems being caused by the point-to-point integration issues that were springing up with the increase of user generated content and the corresponding open web APIs. We believe that any developer who has written a poller once, twice, or for their nth API will tell you how unproductive it is to write and maintain this code. However, writing one-off pollers has become a necessary evil for many companies, since content aggregators need to provide access to as many external services as possible for their end users. Plaxo, which recently integrated with Gnip as a way to support its Plaxo Pulse feature, is a perfect example, as are several other companies.
  2. Business specific applications: Another main reason we started Gnip was that we believe more and more companies are seeing the value of integrating business and social data as a way to add additional compelling value to their own applications. There are a very wide set of examples, such as how Eventvue uses Gnip as a way to integrate Twitter streams into their online conference community solution, and the companies we have talked to about how they can use Gnip to integrate web-based data to power everything from sales dashboards to customer service portals.
  3. Content producers: Today, Gnip offers value to content producers by providing developers an alternative tool that can be used to integrate with their web APIs. We are working with many producers, such as Digg, Delicious, Identi.ca, and Twitter, and plan to aggressively grow the list of available producers. The benefits producers see from working with Gnip include off-loading direct traffic to their web APIs as well as providing another channel to make their content available. We are also working very hard to add new capabilities for producers, including plans to provide more detailed analytics on how their data is consumed and to evaluate publishing features that could allow producers to define their own filters and target the service endpoints and web sites where they want to push relevant data for their own business needs.
  4. Market and brand research companies: We are working with several companies that provide market research and brand analysis. These companies see Gnip as an easy way to aggregate social media data to be included in their brand and market analysis client services.

Hopefully this set of company profiles provides more context on the areas we are focused on and the typical companies we are working with every day. If your company does something that does not fit in these four areas and is using our services, please send me a note.

What We Are Up to At Gnip

As the newest member of the Gnip team, I have noticed that people are asking a lot of the same questions about what we are doing at Gnip and the ways people can use our services in their business.

What we do

Gnip provides an extensible messaging platform that allows for the publishing or subscribing of events and data from across the Internet, which makes data portability dramatically less painful and more automatic once it is set up. Because Gnip is being built as a platform of capabilities and not a web application, the core services are instantly useful for multiple scenarios, including data producers, data consumers and any custom web applications. Gnip is already being used with many of the most popular Internet data sources, including Twitter, Delicious, Flickr, Digg, and Plaxo.

How to use Gnip

So, who is the target user of Gnip? It is a developer, as the platform is not a consumer-oriented web application, but a set of services meant to be used by a developer or an IT department for a set of core use cases.

  • Data Consumers: You’ve built your pollers; let us tell you when and where to fire them. Avoid throttling and decrease latency from hours to seconds.
  • Data Producers: Push your data to us and reduce API traffic by an order of magnitude while increasing distribution through aggregators.
  • Custom web applications: You want to embed or publish content to be used in your own application or for a third-party application. Decide who, or what, you care about for any Publisher, give us an end-point, and we push the data to you (see the sketch below) so you can solve your business use cases, such as customer service websites, corporate websites, blogs, or any web application.
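
As a purely illustrative sketch of the “give us an end-point and we push the data to you” model mentioned above, here is a minimal HTTP endpoint that accepts pushed activities. The payload shape and port are assumptions, not the Gnip wire format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Purely illustrative receiver for the push model (the payload shape and port are
# assumptions, not the Gnip wire format): the platform POSTs new activities to
# your endpoint, so you never have to poll the publisher yourself.


class ActivityReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            activities = json.loads(body)        # assume a JSON list of activities
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        for activity in activities:
            handle_activity(activity)            # your business logic goes here
        self.send_response(200)
        self.end_headers()


def handle_activity(activity):
    """Placeholder: write to a database, update a dashboard, notify a user, etc."""
    print("received:", activity)


if __name__ == "__main__":
    HTTPServer(("", 8080), ActivityReceiver).serve_forever()
```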

Get started now

By leveraging the Gnip APIs, developers can easily design reusable services, such as push-based notifications, smart filters and data streams, that can be used across all your web applications to make them better. Are you a developer? Give the new 2.0 version a try!