Albert Einstein was once asked what he considered the most important discoveries. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons – instead he said: “Compound interest is the greatest mathematical discovery of all time.”
I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let’s walk through an example: Say you owe your credit card company $1,000 at a 16% interest rate. To keep it simple, we assume the credit card company only requires you to pay 1% as your minimum payment every year, so the effective interest rate is 15%. After 30 years of compound interest you owe over $66,000!
If there were no compounding, you’d owe just a little over 5 grand!
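The two scenarios above can be checked with a few lines of arithmetic. This is just a sketch of the example in the text, using the post’s numbers ($1,000 principal, 16% rate, 1% minimum payment, so a 15% effective rate):

```python
# Credit-card example from above: $1,000 at 16% interest, paying only the
# 1% minimum each year, for an effective compounding rate of 15%.
def compounded(principal, effective_rate, years):
    """Balance after compounding at effective_rate for the given years."""
    return principal * (1 + effective_rate) ** years

def simple(principal, rate, minimum, years):
    """Balance if interest never compounded: a flat 15% of principal per year."""
    return principal + principal * (rate - minimum) * years

print(round(compounded(1000, 0.15, 30)))        # about $66,000
print(round(simple(1000, 0.16, 0.01, 30)))      # $5,500 -- "a little over 5 grand"
```

The gap between the two numbers is the whole point: the damage is not the interest rate, it is the compounding.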
What I find truly bizarre though is that when we software engineers throw around words like “technological debt” the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing – I’ll get back to that later – we are asked to come up with the quick hack that will make it work tomorrow and deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.
And we even stay within the terminology by saying “debt”! We could have said something like “Well, it would take us roughly 1 week longer to integrate our current J2EE backend with this 3rd party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP.” But no, we keep it simple and refer to the custom XML parser as “technological debt”, to no avail.
Now, the next time you have this conversation with your boss, show him the plot above and label the y-axis with “lines of code we have to maintain”, and the x-axis with “development iterations”, and perhaps a bell will go off.
Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:
For instance, if you have to consume millions of tweets every day, but your core competency does not contain:
then it might be time for you to talk to us at Gnip!
We continue to work on enriching the Gnip schema to provide second-level meta-data on user-generated activities. Given that we push tens to hundreds of millions of activities around daily, supporting more meta-data means a bit of work beyond just updating our schema.
Today we rolled out meta-data updates to the <actor> and <to> elements of the Gnip schema. The updates are new optional attributes that provide a place to map additional user information available on some social media services, such as Twitter. Initially we will add the new meta-data only to activities where the information is available inline with the activity; in the near future we are adding more platform features to support the scenario where a second API call is required to attach this meta-data to the activity.
Starting today, the <actor> element supports the numeric userID, friends, followers, and posts. In addition, we are now mapping the fullname and username to individual attributes in order to better support services that allow end users to create custom screen names and change those names. The <to> element was updated with a new attribute for the numeric userID.
Overview of updates to <actor> and <to> elements of Gnip schema:
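As a rough sketch of what consuming these new attributes might look like, here is a small parsing example. The attribute names and the exact XML shape below are illustrative assumptions based on the fields described above, not the literal Gnip schema:

```python
# Hypothetical sketch: reading the new <actor> attributes from an activity.
# The attribute names (uid, friends, followers, posts, fullname, username)
# mirror the fields described in the post; the XML shape is an assumption.
import xml.etree.ElementTree as ET

activity_xml = """
<activity>
  <actor uid="12345" friends="150" followers="2300" posts="8700"
         fullname="Jane Example" username="jane_ex">jane_ex</actor>
  <to uid="67890">bob</to>
</activity>
"""

root = ET.fromstring(activity_xml)
actor = root.find("actor")
follower_count = int(actor.get("followers"))  # numeric attributes arrive as strings
print(actor.get("fullname"), follower_count)
```

Splitting fullname and username into separate attributes, as described above, is what makes this kind of lookup stable even when a user changes their display name.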
We always get interesting requests for doing additional processing on data sources. Some of these are addressed using Gnip filters, but others do not really fit the filter model. To support richer or more complex data processing we have built some additional features into the Gnip platform. The first new publisher using some of these new features is “Digg-2000“.
Lots of people submit stories to Digg, and lots of other people Digg those stories, which lets the most popular information rise to the top and become discoverable. Several Gnip customers asked if we could make it possible for them to receive only stories that had reached a specific number of Diggs. We asked Digg about the idea, and they said it sounded great since they have a Twitter account that provides a similar type of feature, so the Digg-2000 Gnip publisher was born.
On the Gnip platform we have set up a publisher that is listening to activities on Digg.
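The core of the Digg-2000 idea is a threshold filter with de-duplication: emit a story the first time its Digg count crosses the bar. Here is a minimal sketch of that logic; the field names and threshold handling are illustrative, not Gnip’s actual implementation:

```python
# Illustrative sketch of the "Digg-2000" filter: publish a story the first
# time its digg count crosses a threshold. Field names are hypothetical.
THRESHOLD = 2000

def should_publish(story, already_published):
    """True when a story crosses the threshold and has not been sent before."""
    return story["diggs"] >= THRESHOLD and story["id"] not in already_published

published = set()
stream = [
    {"id": "a", "diggs": 150},    # below threshold, skipped
    {"id": "b", "diggs": 2450},   # crosses threshold, published
    {"id": "a", "diggs": 2010},   # same story, now over threshold, published once
]
out = []
for story in stream:
    if should_publish(story, published):
        published.add(story["id"])
        out.append(story["id"])
print(out)  # ['b', 'a']
```

Tracking what has already been sent is the part that does not fit the plain filter model, which is why this needed new platform features rather than a regular Gnip filter.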
Gnip has offered the same basic licensing options since we launched the 2.0 version of the platform last September. During that year we have learned a lot about how companies and individual developers use the Gnip platform to discover, access, integrate and filter social and business data for their applications. In that time the daily volume of activities flowing across the platform has grown from thousands of activities across a handful of services to 100 to 150 million activities in a given day across almost forty different data sources.
Gnip Platform License Updates: In the second half of August Gnip will introduce several changes to our licensing options that will impact existing and new users.
Impact on existing and new users: The most obvious change for new users who sign up after these licensing updates is that their accounts will be active for 30 days. All existing accounts on the Gnip Platform will be converted to 30-day trial accounts when the new licensing is rolled out during the second half of August.
Planning for the licensing updates: If your company meets any of the regular license options please contact us at email@example.com or firstname.lastname@example.org to discuss moving to the Commercial, Non-profit or Startup Partner licenses.
When we started Gnip last year, Twitter was among the first group of companies that understood the data integration problems we were trying to solve for developers and companies. Because Gnip and Twitter were able to work together, it has been possible to access and integrate Twitter data via the Gnip platform: since last July using Gnip Notifications, and since last September using Gnip Data Activities.
All of this data access was the result of Gnip working with the Twitter XMPP “firehose” API to provide Twitter data access for users of both the Gnip Community and Standard edition product offerings. Recently Twitter announced a new Streaming API and began an alpha program to start making the new API available. Gnip has been testing the new Streaming API, and we are now planning to move from the current XMPP API to the new Streaming API in the middle of June. This transition to the new Streaming API will mean some changes in the default behavior and ability to access Twitter data, as described below.
Twitter has several additional Streaming API methods available to approved parties that require a signed agreement to access. To better understand which developers and companies using the Gnip platform could benefit from these other Streaming API options we would encourage Gnip platform users to take this short 12 question survey: Gnip: Twitter Data Publisher Survey (URL: http://www.surveymonkey.com/s.aspx?sm=dQEkfMN15NyzWpu9sUgzhw_3d_3d)
The Gnip Twitter-search Data Publisher is not impacted by the transition to the new Twitter Streaming API since it is implemented using the new Gnip Polling Service and provides keyword-based data integration to the search.twitter APIs.
We will provide more information shortly, once we lock down the actual day for the transition. Please take the survey, and as always contact us directly at email@example.com or send me a direct email at firstname.lastname@example.org.
Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we have also learned that for many reasons we need to work with some services via a polling approach. For this reason our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using push, poll, or a mixed approach.
Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every developer or company has the option to tell Gnip where to push data for the filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data for their filters from a Gnip-generated URL.
One place where the way Gnip collects data can make a difference, at this time, for our users is the expected latency of data. Latency here refers to the time between the activity happening (i.e. Bob posted a photo, Susie made a comment, etc.) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.
PUSH services: When services push to us, the latency experience is usually under 60 seconds, but this is not always the case since services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.
POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:
a) How often we hit an endpoint (say 5 times per second)
b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)
c) How often we execute a specific rule (i.e. every 10 minutes). Right now with the Community edition of the Gnip platform we set rule execution at 10-minute intervals by default, and people need to keep this in mind when setting expectations for data flow from any given publisher.
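A quick back-of-envelope calculation shows why these three factors interact. Using the illustrative numbers above (70 million rules, each executed every 10 minutes), this sketch computes the aggregate request rate the polling service would need; the numbers are from the post, not actual Gnip internals:

```python
# Back-of-envelope sketch of the scheduling pressure described above,
# using the post's illustrative numbers (not actual Gnip internals).
def required_requests_per_second(num_rules, interval_seconds):
    """Requests/sec needed to execute every rule once per interval."""
    return num_rules / interval_seconds

# 70 million rules, each executed once every 10 minutes:
needed = required_requests_per_second(70_000_000, 10 * 60)
print(round(needed))  # ~116,667 requests/sec
```

That is several orders of magnitude above a per-endpoint rate like 5 requests per second, which is exactly why the polling service has to be smart about which rules run and how often.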
Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and asked “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10-minute interval applies to the Community edition (aka the free version). We have the ability to turn the dial: using the smarts built into the polling service we can execute the right rules faster (i.e. every 60 seconds or faster for popular terms, and every 10, 20, etc. minutes for less popular ones). The key issue is that for very prolific posters or very common keyword rules (i.e. “obama”, “http”, “google”), more posts can exist in the 10-minute default time-frame than we can collect in a single poll from the service endpoint.
For now, Community edition platform users should expect a 10-minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta. If your project or company needs something a bit snappier from the polled data publishers, then contact us at email@example.com or contact me directly at firstname.lastname@example.org, as these use cases require the Standard Edition of the Gnip platform.
Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious
Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube
Jeremy Hinegardner has written a super cool utility (he calls it Snipe) in Ruby that uses Gnip Notifications to optimize your data collection needs. In a nutshell, it digests Gnip Notifications for the Twitter Publisher (though it could obviously be re-purposed for any Publisher) and pings Twitter to retrieve the tweets associated with said Notifications; rounding out Gnip <activity>s. Enjoy, and hats off to Jeremy; well done.
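For readers unfamiliar with the pattern Snipe implements, here is a minimal sketch: a Gnip Notification carries just enough to identify an activity, and a follow-up call to the source service fills in the full payload. The `fetch_status` function below is a hypothetical stand-in for a real Twitter API lookup:

```python
# Minimal sketch of the notification-hydration pattern Snipe implements:
# notifications identify activities; a second call fetches the full payload.
def fetch_status(status_id):
    # Hypothetical stand-in for a lookup against the source service (Twitter).
    return {"id": status_id, "text": f"tweet body for {status_id}"}

def hydrate(notifications):
    """Turn lightweight notifications into full activities."""
    return [fetch_status(n["id"]) for n in notifications]

activities = hydrate([{"id": "101"}, {"id": "102"}])
print(activities[0]["text"])  # tweet body for 101
```

The appeal of the pattern is that the notification stream stays cheap and fast, and you pay the cost of a full fetch only for the activities you actually care about.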
We are pleased to announce an early access program for a new Gnip data publisher to access and integrate data from the Facebook Platform Open Streams API.
Developers and companies can sign up right now to be notified when the early access program is launched by sending an email to email@example.com with the subject: Facebook. Any company signing up for the early access program will be eligible for three free months subscription service to the Gnip data publisher for the Facebook Platform once it is generally released. At this time the early access program is planned to be launched in the summer.
And to provide a small taste of the upcoming integration here are two examples of what common Newsfeed actions on Facebook will look like when accessed via the planned Gnip data publisher.
1) Status update example (fbids in this example were changed from the actual ones in my stream item)
<actor metaURL="http://www.facebook.com/people/Shane-Pearson/12345">Shane Pearson</actor>
<body>It must be spring as my weekly trip to Lowes/Home Depot is back on the schedule</body>
2) Photo upload example (the Gnip data schema below maps to a Facebook activity stream example)
<actor metaURL="http://www.facebook.com/people/Snapshot-Smith/499225643">Snapshot Smith</actor>
<title>Snapshot Smith uploaded a photo.</title>
<body><p><a href="http://www.facebook.com/photo.php?pid=28&id=499225643&ref=at" caption="A very attractive wall, indeed"/></p></body>
<mediaURL type="thumbnail">http://photos-e.ak.fbcdn.net/photos-ak-snc1/v2692/195/117/499225643/s499225643_28_6861716.jpg</mediaURL>
<mediaURL type="content">http://www.facebook.com/photo.php?pid=28&id=499225643&ref=at</mediaURL>
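As a sketch of what consuming one of these activities might look like, here is a small parser pulling the media URLs out of an activity shaped like the photo example. The enclosing `<activity>` element and attribute escaping (`&amp;`) are added here to make the fragment well-formed XML for parsing:

```python
# Sketch: extracting media URLs from an activity shaped like the photo
# example above. The wrapping <activity> element and the &amp; escaping
# are additions to make the snippet well-formed, parseable XML.
import xml.etree.ElementTree as ET

activity_xml = """
<activity>
  <actor metaURL="http://www.facebook.com/people/Snapshot-Smith/499225643">Snapshot Smith</actor>
  <title>Snapshot Smith uploaded a photo.</title>
  <mediaURL type="thumbnail">http://photos-e.ak.fbcdn.net/photos-ak-snc1/v2692/195/117/499225643/s499225643_28_6861716.jpg</mediaURL>
  <mediaURL type="content">http://www.facebook.com/photo.php?pid=28&amp;id=499225643&amp;ref=at</mediaURL>
</activity>
"""

root = ET.fromstring(activity_xml)
# Index the media URLs by their type attribute: thumbnail vs. full content.
media = {m.get("type"): m.text.strip() for m in root.findall("mediaURL")}
print(media["thumbnail"])
```

Keeping thumbnail and content URLs as separate typed elements, as in the example, lets a consumer render a preview without fetching the full-size asset.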
Now that the new Gnip convenience libraries have been published for a few weeks on GitHub, I’m going to tell you a bit about the libraries that I’m currently responsible for, the .NET libraries. So, let’s dive in, shall we… The latest versions of the .NET libraries are heavily based on the previous version of the Java libraries, with a bit of .NET style thrown in. What that means is that I used Microsoft’s Java Language Conversion Assistant as a starting point, mixed in some shell scripting (Bash, sed, and Perl) to fix the comments and some of the messy parts that did not translate very well. I then made it more C#-like by removing Java annotations, adding .NET attributes, taking advantage of .NET’s native XML serializer, utilizing System.Net.HttpWebRequest for communications, etc. It actually went fairly quickly. The next task was to start the unit-testing deep dive.
I have to say, I really didn’t know anything about the Gnip model, how it worked, or what it really was, at first. It just looked like an interesting project and some good folks. Unit testing, however, is one place where you learn about the details of how each little piece of a system really works. And since hardly any of my tests passed out of the gate (and I was not convinced I even had enough tests in place), I decided it was best to go at it till I was convinced. The library components are easy enough. The code is really separated into two parts. The first component is the Data Model, or Resources, which directly map to the Gnip XML model and live in the Gnip.Client.Resource namespace. The second component is the Data Access Layer, or GnipConnection. The GnipConnection, when configured, is responsible for passing data to, and receiving data from, the Gnip servers. So there are really only two main pieces to this code. Pretty simple: Resources and GnipConnection. The other code is just convenience and utility code to help make things a little more orderly and to reduce the amount of code.
So yeah, the testing… I used NUnit so folks could utilize the tests with the free version of Visual Studio, or even the command line if you want. I included a Gnip.build NAnt file so that you can compile, run the tests, and create a zipped distribution of the code. I’ve also included an NUnit project file in the Gnip.ClientTest root (gnip.nunit) that you can open with the NUnit UI to get things going. To help configure the tests, there is an App.config file in the root of the test project that is used to set all the configuration parameters.
The tests, like the code, are divided into the Resource object tests and the GnipConnection tests (and a few utility tests). The premise of the Resource object tests is to first ensure that the Resource objects are solid. These are simple data objects with very little logic built in (which is not to say that testing them thoroughly is not of the utmost importance). There is a unit test for each one of the data objects, and they proceed by ensuring that the properties work properly, the DeepEquals methods work properly, and the marshalling to and from XML works properly. The DeepEquals methods are used extensively by the tests, so it is essential that we can trust them. As such, they are fairly comprehensive. The marshalling and un-marshalling tests are less so. They do a decent job; they just do not exercise every permutation of the XML elements and attributes. I do feel that they are sufficient to convince me that things are okay.
The GnipConnection is responsible for creating, retrieving, updating and deleting Publishers and Filters, and retrieving and publishing Activities and Notifications. There is also a mechanism built into the GnipConnection to get the Time from the Gnip server and to use that Time value to calculate the time offset between the calling client machine and the Gnip server. Since the Gnip server publishes activities and notifications in 1 minute wide addressable ‘buckets’, it is nice to know what the time is on the Gnip server with some degree of accuracy. No attempt is made to adjust for network latency, but we get pretty close to predicting the real Gnip time. That’s it. That little bit is realized in 25 or so methods on the GnipConnection class. Some of those methods are just different signatures of methods that do the same thing only with a more convenient set of parameters. The GnipConnection tests try to exercise every API call with several permutations of data. They are not completely comprehensive. There are a lot of permutations. But, I believe they hit every major corner case.
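The time-offset and bucket logic described above is easy to sketch. This is an illustrative Python rendering of the behavior the paragraph describes (estimate the server clock from a measured offset, then floor a timestamp to its 1-minute bucket), not the actual GnipConnection code:

```python
# Sketch of the 1-minute "bucket" idea described above: estimate the Gnip
# server clock via a measured offset, then floor timestamps to the start of
# their addressable 1-minute bucket. Mirrors the described behavior, not
# the actual GnipConnection implementation.
from datetime import datetime, timedelta, timezone

def server_time(local_now, offset):
    """Estimate the server clock: local clock plus the measured offset."""
    return local_now + offset

def bucket_start(ts):
    """Floor a timestamp to the start of its 1-minute bucket."""
    return ts.replace(second=0, microsecond=0)

local = datetime(2009, 6, 1, 12, 34, 56, tzinfo=timezone.utc)
offset = timedelta(seconds=-7)  # e.g. our clock runs 7 seconds ahead of the server
print(bucket_start(server_time(local, offset)).isoformat())
```

Ignoring network latency, as the library does, is usually fine here: being a second or two off rarely changes which 1-minute bucket a request lands in.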
In testing all this, one thing I wanted to do was to run my tests and have the de-serialization of the XML validate against the XML Schema file I got from the good folks at Gnip. If I could de-serialize and then serialize a sufficiently diverse set of XML streams, while validating that those streams adhere to the XML Schema, then that was another bit of ammo for trusting that this thing works in situations beyond the test harness. In the Gnip.Client.Uti namespace there is a helper class called XmlHelper that contains a singleton of itself. There is a property called ValidateXml that can be reached like this: XmlHelper.Instance.ValidateXml. Setting that to true will cause the XML to be validated anytime it is de-serialized, either in the tests or from the server. It is set to true in the tests. But it doesn’t work with the stock Xsd distributed by Gnip. That Xsd does not include an element definition for each element at the top level, which is required when validating against a schema. I had to create one that did. It is semantically identical to the Gnip version; it just pulls things out to the top level. You can find the custom version in the Gnip.Client/Xsd folder. By default it is compiled into the Gnip.Client.dll.
One of the last things I did, which had nothing really to do with testing, is to create the IGnipConnection interface. Use it if you want. If you use some kind of Inversion of Control container like Unity, or like to code to interfaces, it should come in handy.
That’s all for now. Enjoy!
Rick is a Software Engineer and Technical Director at Mondo Robot in Boulder, Colorado. He has been designing and writing software professionally since 1989, and working with .NET for the last 4 years. He is a regular fixture at the Boulder .NET user’s group meetings and is a member of Boulder Digital Arts.