Strata, Big Data & Big Streams

Gnip had a great time at O’Reilly’s Strata 2011 conference in California last week. We signed up several months ago as a big sponsor without knowing exactly how things would come together. The bet paid off, and Strata was a huge success for us and for the industry at large. We were blown away by the relevance of the topics and the quality of the attendees and the discussions that were sparked. I was amazed at how much knowledge everyone now has around big data set analysis and processing. Technologies that were immature and new just a few years ago are now baked into the ecosystem and have become tools of the trade (e.g. Hadoop). All very cool to see.

That said, there remains a distinct gap between big data set handling and high-volume, real-time data stream handling. We’ve come a long way in processing monster data sets in batch or offline modes, but we have a long way to go when it comes to handling large streaming data sets. Hilary Mason of bit.ly hit this point squarely in her “What Data Tells Us” talk at Strata. With open source tooling we can fan out ungodly amounts of processing… like piranhas on fresh meat. However, blending that processing, and its high-latency transactions, into real-time streams of thousands of activities per second is not nearly as refined or well understood. Frankly, I’m shocked at how many engineers I run into who simply don’t understand asynchronous programming at all.
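As a tiny illustration of the asynchronous, streaming style I mean (a toy sketch, not anything from Gnip’s stack), here is a consumer that handles activities as they arrive instead of waiting to process them in a batch:

    import asyncio

    async def activity_stream():
        """Stand-in for a real-time feed producing a continuous flow of activities."""
        n = 0
        while n < 10:
            await asyncio.sleep(0.01)   # activities trickle in continuously, not in batches
            n += 1
            yield {"id": n, "body": f"activity {n}"}

    async def consume():
        # Each activity is handled as it arrives; slow work would be fanned out
        # to other tasks rather than allowed to block the stream.
        async for activity in activity_stream():
            print("processed", activity["id"])

    asyncio.run(consume())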

The night before the conference started, Pete Warden drove BigDataCamp @Strata, where Mike Montano from BackType gave a high-level overview of their infrastructure. He laid out a few tiers, describing a “speed” tier that did a lot of work on high-volume streams and a “batch” tier that did its work in a more offline manner. The blend of approaches was an interesting teaser into how Big Stream challenges can be handled. Gnip’s own infrastructure has had to address these challenges of course, and we went into a fair amount of detail in our Expanding The Twitter Firehose post awhile back.

Big Stream handling occupies a good part of my brain. I’d like to see Big Data discussion start to unravel Big Stream challenges as well.

What Does Compound Interest Have to do with Evolving APIs?

Albert Einstein was once asked what he considered the most important discoveries. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons. Instead he said: “Compound interest is the greatest mathematical discovery of all time.”

I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let’s walk through an example: say you owe your credit card company $1,000 and your interest rate is 16%. To keep it simple, we assume the credit card company only requires you to pay 1% as your minimum payment every year, so the effective interest rate is 15%. After 30 years of compounding you owe over $66,000!

Compound Interest Graph

If there were no compounding, you’d owe just a little over 5 grand!
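To make the arithmetic concrete, here is a minimal sketch (assuming the same $1,000 balance and 15% effective annual rate from the example above) that reproduces both numbers:

    # Compound vs. simple interest on the $1,000 example above.
    principal = 1_000.0
    rate = 0.15
    years = 30

    compound = principal * (1 + rate) ** years   # exponential growth with compounding
    simple = principal * (1 + rate * years)      # linear growth, interest never compounds

    print(f"Compound after {years} years: ${compound:,.0f}")  # ~$66,212
    print(f"Simple   after {years} years: ${simple:,.0f}")    # $5,500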

What I find truly bizarre, though, is that when we software engineers throw around words like “technological debt”, the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing (I’ll get back to that later), we are asked to come up with the quick hack that will make it work tomorrow and to deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.

And we are even staying within financial terminology by using the word “debt”! We could have said something like: “Well, it would take us roughly one week longer to integrate our current J2EE backend with this third-party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP.” But no, we keep it simple and refer to the custom XML parser as “technological debt”, and to no avail.

Now, the next time you have this conversation with your boss, show them the plot above, label the y-axis “lines of code we have to maintain” and the x-axis “development iterations”, and perhaps a bell will go off.

Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:

  1. Refactor early and often
  2. Outsource as much as possible of what you don’t consider your core competency.

For instance, if you have to consume millions of tweets every day, but your core competency does not include:

  • developing high performance code that is distributed in the cloud
  • writing parsers processing real time social activity data
  • maintaining OAuth client code and access tokens
  • keeping up with squishy rate limits and evolving social activity APIs

then it might be time for you to talk to us at Gnip!

New Digg-2000 Publisher, Hot Stuff

We always get interesting requests for additional processing on data sources. Some of these are addressed using Gnip filters, but others do not really fit the filter model. To support richer or more complex data processing we have built some additional features into the Gnip platform. The first new publisher using some of these new features is “Digg-2000”.

Digg-2000 Publisher — What is it?

Lots of people submit stories to Digg, and lots of other people Digg those stories, which lets the more popular information rise to the top and become discoverable. Several Gnip customers asked if we could make it possible for them to receive only stories that had reached a specific number of Diggs. We asked Digg about the idea, and they said it sounded great since they have a Twitter account that provides a similar feature, so the Digg-2000 Gnip publisher was born.

Digg-2000 Publisher — How Does It Work?

On the Gnip platform we have set up a publisher that is listening to activities on Digg.

  • When new stories are submitted to Digg, we pick those new activities up along with the Digg count on the story and post them to the standard Gnip Digg publisher.
  • With every new Digg on a story we increment our count for that story, and when it hits 2000 Diggs we re-post the original story to the Digg-2000 publisher.
  • The configuration of the Digg-2000 publisher lets us turn two different dials (a rough sketch follows this list).

    • By default, we only re-post stories that were first posted on Digg in the last two days. This means we are looking for currently active stories, not stories that were posted months ago and finally hit 2000 Diggs through slow and steady interest.
    • The default configuration re-posts at 2000 Diggs. This can be set to any number of Diggs (100, 1000, 2000, and so on).
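Here is a rough sketch of the dial logic described above. The counter, threshold, recency window, and publisher hooks are illustrative only; the actual publisher internals are not public.

    from datetime import datetime, timedelta, timezone

    # Illustrative defaults from the post: re-post at 2000 Diggs, and only for
    # stories first submitted within the last two days.
    DIGG_THRESHOLD = 2000
    MAX_STORY_AGE = timedelta(days=2)

    digg_counts = {}      # story_id -> current Digg count
    submitted_at = {}     # story_id -> submission time
    reposted = set()      # stories already pushed to the Digg-2000 publisher

    def on_new_story(story_id, count, when):
        """New story submitted to Digg: track it and post to the standard publisher."""
        digg_counts[story_id] = count
        submitted_at[story_id] = when
        post_to_standard_publisher(story_id)

    def on_new_digg(story_id, now=None):
        """Each new Digg increments the counter; at the threshold, re-post once."""
        now = now or datetime.now(timezone.utc)
        digg_counts[story_id] = digg_counts.get(story_id, 0) + 1
        fresh = now - submitted_at.get(story_id, now) <= MAX_STORY_AGE
        if fresh and digg_counts[story_id] >= DIGG_THRESHOLD and story_id not in reposted:
            reposted.add(story_id)
            post_to_digg_2000_publisher(story_id)

    def post_to_standard_publisher(story_id): ...    # hypothetical stand-in
    def post_to_digg_2000_publisher(story_id): ...   # hypothetical stand-in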

Gnip Licensing Updates Coming in August

Gnip has offered the same basic licensing options since we launched the 2.0 version of the platform last September. Over that year we have learned a lot about how companies and individual developers use the Gnip platform to discover, access, integrate and filter social and business data for their applications. In that time the daily volume of activities flowing across the platform has grown from thousands of activities across a handful of services to 100 to 150 million activities a day across almost forty different data sources.

Gnip Platform License Updates: In the second half of August, Gnip will introduce several changes to our licensing options that will impact both existing and new users.

  • Gnip will provide several licensing options for the Standard Edition service.

    • Commercial license: This is the default license for all commercial uses of the Gnip platform.
    • Non-profit license: This option will be available to companies and organizations with an appropriate 501(c) status.
    • Startup Partner license: This option is available to companies that meet the qualification terms of the partner program.
    • Trial license: This option will be the default experience for new users and provides 30 days to evaluate the platform.
  • The Community Edition of the Gnip platform is being retired, since we discovered over the last year that its TOS made the option a poor fit for real-world company use cases. We believe any small company using the Community Edition should be able to move to our Startup Partner Program.

Impact on existing and new users: The most obvious change for new users who sign up after these licensing updates is that their accounts will be active for 30 days. All existing users on the Gnip platform will have their accounts converted to 30-day trial accounts when the new licensing is rolled out during the second half of August.

Planning for the licensing updates: If your company fits any of the regular license options, please contact us at sales@gnip.com or info@gnip.com to discuss moving to the Commercial, Non-profit or Startup Partner licenses.

Gnip: Transitioning to New Twitter Streaming API in June

When we started Gnip last year, Twitter was among the first group of companies that understood the data integration problems we were trying to solve for developers and companies. Because Gnip and Twitter were able to work together, it has been possible to access and integrate Twitter data through the Gnip platform using Gnip Notifications since last July, and using Gnip Data Activities since last September.

All of this data access was the result of Gnip working with the Twitter XMPP “firehose” API to provide Twitter data access for users of both the Gnip Community and Standard Edition product offerings. Recently Twitter announced a new Streaming API and began an alpha program to start making it available. Gnip has been testing the new Streaming API, and we are now planning to move from the current XMPP API to the new Streaming API in the middle of June. This transition will mean some changes in the default behavior and in the ability to access Twitter data, as described below.

New Streaming API Transition Highlights

  1. Gnip will now be able to provide both Gnip Notifications and Gnip Data Activities to all users of the Gnip platform. We had stopped providing Data Activities to new customers last November when Twitter began working on the new API, but now all users of the Gnip platform can use either Notifications or Data Activities, based on what is appropriate for their application use case.
  2. There are no changes to the Gnip API or to the service endpoints of Gnip Publishers and Filters due to this transition. This only changes the default Twitter API that we integrate with for Twitter data. (added about 2 hours after the original post)
  3. The Twitter Streaming API is meant to accommodate a class of applications that require near-real-time access to Twitter public statuses, and it is offered in several tiers of streaming API methods. See the Twitter documentation for more information.
  4. The default Streaming API tiers that Gnip will make available are the new “spritzer” and “follow” stream methods. These are the only tiers made available publicly without requiring an end user agreement directly with Twitter at this time.
  5. The “spritzer” stream method is not a “firehose” like the XMPP stream that Gnip previously used as our default. The average message rate is still being worked out by Twitter, but at this time “spritzer” runs in the ballpark of 10-20 messages per second and can vary depending on lots of variables managed by Twitter.
  6. The “follow” stream method returns public statuses from a specified set of users, by ID.
  7. For more on “spritzer”, “follow”, and other methods, see the Twitter Streaming API Documentation (a rough consumption sketch follows this list).
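For a sense of what consuming a stream tier looks like, here is a minimal sketch. The endpoint URL and the basic-auth credentials are placeholders, not Gnip or Twitter specifics; consult the Twitter Streaming API documentation for the real endpoints and terms.

    import json
    import requests  # third-party: pip install requests

    # Illustrative/legacy-style URL only.
    STREAM_URL = "http://stream.twitter.com/spritzer.json"

    def consume_spritzer(username, password):
        """Read a newline-delimited JSON stream and hand each status to a handler."""
        with requests.get(STREAM_URL, auth=(username, password),
                          stream=True, timeout=90) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line:            # skip keep-alive newlines
                    continue
                handle_status(json.loads(line))

    def handle_status(status):
        # Stand-in for whatever your application does with each public status.
        print(status.get("id"), status.get("text", "")[:80])

    # consume_spritzer("user", "pass")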

What About Companies and Developers Whose Use Cases Are Not Met by the Twitter “Spritzer” and “Follow” Streaming API Methods?


Gnip and Twitter realize that many use cases exist for how companies want to use Twitter data, and that new applications are being built every day. Therefore we are exploring how companies that are authorized by Twitter for other Streaming API methods would be able to use the Gnip platform as their integration platform of choice.

Twitter has several additional Streaming API methods available to approved parties, access to which requires a signed agreement. To better understand which developers and companies using the Gnip platform could benefit from these other Streaming API options, we encourage Gnip platform users to take this short 12-question survey: Gnip: Twitter Data Publisher Survey (URL: http://www.surveymonkey.com/s.aspx?sm=dQEkfMN15NyzWpu9sUgzhw_3d_3d)

What About the Gnip Twitter-search Data Publisher?


The Gnip Twitter-search Data Publisher is not impacted by the transition to the new Twitter Streaming API, since it is implemented using the new Gnip Polling Service and provides keyword-based data integration with the search.twitter APIs.

We will provide more information as soon as we lock down the actual day for the transition. In the meantime, please take the survey, and as always, contact us directly at info@gnip.com or send me an email at shane@gnip.com.

Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we have also learned that, for many reasons, we need to work with some services via a polling approach. For this reason our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using a push, poll, or mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every developer or company has the option to tell Gnip where to push the data for the filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data from a Gnip-generated URL for their filters.

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (e.g. Bob posted a photo, Susie made a comment) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.

PUSH services: When services push to us, the latency is usually under 60 seconds, but we know this is not always the case, since the pushing services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (e.g. every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution by default at 10-minute intervals, and people need to keep this in mind when setting expectations for data flow from any given publisher.

Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and asked, “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10-minute interval applies to the Community Edition (aka the free version). We have the ability to turn the dial, and using the smarts built into the polling service we can execute the right rules faster (e.g. every 60 seconds or better for popular terms, and every 10, 20, or more minutes for less popular ones). The key issue is that for very prolific posters or very common keyword rules (e.g. “obama”, “http”, “google”), more posts can exist within the default 10-minute window than we can collect in a single poll from the service endpoint.
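Here is a minimal sketch of the kind of dial the polling service can turn. The tiering logic, intervals, and callbacks are illustrative only, not the actual Gnip scheduler.

    import heapq
    import time

    # Illustrative intervals: popular rules polled every 60s, everything else every 10 minutes.
    FAST_INTERVAL = 60
    DEFAULT_INTERVAL = 600

    def interval_for(rule, is_popular):
        return FAST_INTERVAL if is_popular(rule) else DEFAULT_INTERVAL

    def run_scheduler(rules, is_popular, poll):
        """Poll each rule on its own interval using a priority queue of due times."""
        queue = [(time.time(), rule) for rule in rules]
        heapq.heapify(queue)
        while queue:
            due, rule = heapq.heappop(queue)
            time.sleep(max(0.0, due - time.time()))
            poll(rule)  # hit the service endpoint for this rule
            heapq.heappush(queue, (time.time() + interval_for(rule, is_popular), rule))

    # Example usage with stand-in callbacks:
    # run_scheduler(["obama", "gnip"], is_popular=lambda r: r == "obama",
    #               poll=lambda r: print("polling", r))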

For now, the default expectation for our Community Edition platform users should be a 10-minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta. If your project or company needs something a bit snappier from the polled data publishers, contact us at info@gnip.com or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious.

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube.

Snipe: Sick Gnip Integration

Jeremy Hinegardner has written a super cool utility (he calls it Snipe) in Ruby that uses Gnip Notifications to optimize your data collection. In a nutshell, it digests Gnip Notifications for the Twitter Publisher (though it could obviously be re-purposed for any Publisher) and pings Twitter to retrieve the tweets associated with those Notifications, rounding out Gnip <activity>s. Enjoy, and hats off to Jeremy; well done.
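Snipe itself is written in Ruby; here is a rough Python sketch of the same pattern it follows. The notification field name and the Twitter status URL are illustrative assumptions, not Snipe’s actual code.

    import requests  # third-party: pip install requests

    # Illustrative endpoint; the real Snipe utility and Twitter API may differ.
    STATUS_URL = "http://twitter.com/statuses/show/{id}.json"

    def hydrate_notifications(notifications):
        """Given Gnip notification dicts carrying a tweet id, fetch the full tweets."""
        tweets = []
        for note in notifications:
            tweet_id = note["id"]  # assumed field name
            resp = requests.get(STATUS_URL.format(id=tweet_id), timeout=30)
            if resp.ok:
                tweets.append(resp.json())
        return tweets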

Guest Post, Rick Boykin: Gnip C# .NET Convenience Library

Now that the new Gnip convenience libraries have been published for a few weeks on GitHub, I’m going to tell you a bit about the libraries that I’m currently responsible for: the .NET libraries. So let’s dive in, shall we… The latest versions of the .NET libraries are heavily based on the previous version of the Java libraries, with a bit of .NET style thrown in. What that means is that I used Microsoft’s Java Language Conversion Assistant as a starting point, then mixed in some shell scripting with Bash, sed and Perl to fix the comments and some of the messy parts that did not translate very well. I then made it more C#-like by removing Java annotations, adding .NET attributes, taking advantage of the native .NET XML serializer, utilizing System.Net.HttpWebRequest for communications, etc. It actually went fairly quickly. The next task was to start the unit testing deep dive.

I have to say, I really didn’t know anything about the Gnip model, how it worked, or what it really was at first. It just looked like an interesting project with some good folks. Unit testing, however, is one place where you learn the details of how each little piece of a system really works. And since hardly any of my tests passed out of the gate (and I was not really convinced that I even had enough tests in place), I decided it was best to go at it until I was convinced. The library components are easy enough. The code is really separated into two parts. The first component is the data model, or Resources, which map directly to the Gnip XML model and live in the Gnip.Client.Resource namespace. The second component is the data access layer, or GnipConnection. The GnipConnection, when configured, is responsible for passing data to, and receiving data from, the Gnip servers. So there are really only two main pieces to this code: Resources and GnipConnection. The other code is just convenience and utility code to help make things a little more orderly and to reduce the amount of code.

So yeah, the testing… I used NUnit so folks can run the tests with the free version of Visual Studio, or even from the command line if you want. I included a Gnip.build NAnt file so that you can compile, run the tests, and create a zipped distribution of the code. I’ve also included an NUnit project file in the Gnip.ClientTest root (gnip.nunit) that you can open with the NUnit UI to get things going. To help configure the tests, there is an App.config file in the root of the test project that is used to set all the configuration parameters.

The tests, like the code, are divided into the Resource object tests and the GnipConnection tests (and a few utility tests). The premise of the Resource object tests is first to ensure that the Resource objects are solid. These are simple data objects with very little logic built in (which is not to say that testing them thoroughly is not of the utmost importance). There is a unit test for each of the data objects, and each proceeds by ensuring that the properties work properly, that the DeepEquals methods work properly, and that the marshalling to and from XML works properly. The DeepEquals methods are used extensively by the tests, so it is essential that we can trust them; as such, those tests are fairly comprehensive. The marshalling and un-marshalling tests are less so. They do a decent job; they just do not exercise every permutation of the XML elements and attributes. I do feel they are sufficient to convince me that things are okay.

The GnipConnection is responsible for creating, retrieving, updating and deleting Publishers and Filters, and for retrieving and publishing Activities and Notifications. There is also a mechanism built into the GnipConnection to get the time from the Gnip server and to use that value to calculate the time offset between the calling client machine and the Gnip server. Since the Gnip server publishes activities and notifications in 1-minute-wide addressable ‘buckets’, it is nice to know what the time is on the Gnip server with some degree of accuracy. No attempt is made to adjust for network latency, but we get pretty close to predicting the real Gnip time. That’s it. That little bit is realized in 25 or so methods on the GnipConnection class. Some of those methods are just different signatures of methods that do the same thing, only with a more convenient set of parameters. The GnipConnection tests try to exercise every API call with several permutations of data. They are not completely comprehensive; there are a lot of permutations. But I believe they hit every major corner case.
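For readers unfamiliar with the offset-and-bucket idea described above, here is a language-agnostic sketch in Python. It is not the .NET GnipConnection code, and the server-time call is a hypothetical stand-in.

    import time

    BUCKET_SECONDS = 60  # Gnip buckets are one minute wide

    def clock_offset(server_time_epoch_seconds):
        """Offset between the Gnip server clock and the local clock
        (network latency ignored, matching the behavior described above)."""
        return server_time_epoch_seconds - time.time()

    def current_bucket(offset_seconds):
        """Start of the 1-minute bucket the server is currently writing, as epoch seconds."""
        server_now = time.time() + offset_seconds
        return int(server_now // BUCKET_SECONDS) * BUCKET_SECONDS

    # Usage: fetch the server time via the library's time call (hypothetical here),
    # compute the offset once, then reuse it when addressing buckets.
    # offset = clock_offset(get_gnip_server_time())
    # bucket_start = current_bucket(offset)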

In testing all this, one thing I wanted to do was run my tests and validate the de-serialized XML against the XML Schema file I got from the good folks at Gnip. If I could de-serialize and then serialize a sufficiently diverse set of XML streams, while validating that those streams adhere to the XML Schema, then that was another bit of ammo for trusting that this thing works in situations beyond the test harness. In the Gnip.Client.Util namespace there is a helper class called XmlHelper that contains a singleton of itself. There is a property called ValidateXml that can be reached like this: XmlHelper.Instance.ValidateXml. Setting it to true causes the XML to be validated any time it is de-serialized, either in the tests or from the server. It is set to true in the tests. But it doesn’t work with the stock Xsd distributed by Gnip. That Xsd does not include an element definition for each element at the top level, which is required when validating against a schema. I had to create one that did. It is semantically identical to the Gnip version; it just pulls things out to the top level. You can find the custom version in the Gnip.Client/Xsd folder. By default it is compiled into the Gnip.Client.dll.
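The same validate-on-deserialize idea, sketched generically in Python with lxml (this is not the .NET XmlHelper, and the file name is a placeholder):

    from lxml import etree  # third-party: pip install lxml

    # Placeholder schema file for illustration only.
    schema = etree.XMLSchema(etree.parse("gnip_toplevel.xsd"))

    def parse_and_validate(xml_bytes):
        """Parse an XML payload and raise if it does not conform to the schema."""
        doc = etree.fromstring(xml_bytes)
        schema.assertValid(doc)  # raises DocumentInvalid on schema violations
        return doc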

One of the last things I did, which had nothing really to do with testing, was to create the IGnipConnection interface. Use it if you want. If you use some kind of Inversion of Control container like Unity, or if you like to code to interfaces, it should come in handy.
That’s all for now. Enjoy!

Rick is a Software Engineer and Technical Director at Mondo Robot in Boulder, Colorado. He has been designing and writing software professionally since 1989 and has worked with .NET for the last 4 years. He is a regular fixture at the Boulder .NET user’s group meetings and is a member of Boulder Digital Arts.