Author: Gnip Admin

Simplicity Wins

It seems like every once in a while we all have to re-learn certain lessons.

As part of our daily processing, Gnip stores many terabytes of data in millions of keys on Amazon’s S3. Various aspects of serving our customers require that we pore over those keys and the data behind them regularly.

As an example, every 24 hours we construct usage reports that provide visibility into how our customers are using our service. Are they consuming a lot or a little volume? Did their usage profile change? Are they not using us at all? So on and so on. We also have what we affectionately refer to as the “dude where’s my tweet” challenge; of the billion activities we deliver each day to our customers, inevitably someone says “hey, I didn’t receive Tweet ‘X’, what gives?” Answering that question requires that we store the ID of every Tweet a customer ever receives. Poring over all this data every 24 hours is a challenge.

As we started on the project, it seemed like a good fit for Hadoop. It involves pulling in lots of small-ish files, doing some slicing, aggregating the results, and spitting them out the other end. Because we’re hosted in Amazon it was natural to use their Elastic MapReduce service (EMR).

Conceptually the code was straightforward and easy to understand. The logic fit the MapReduce programming model well. It requires a lot of text processing and sorts well into various stages and buckets. It was up and running quickly.

As the size of the input grew it started to have various problems, many of which came down to configuration. Hadoop options, JVM options, open file limits, number and size of instances, number of reducers, etc. We went through various rounds of tweaking settings and throwing more machines in the cluster, and it would run well for a while longer.

But it still occasionally had problems. Plus there was that nagging feeling that it just shouldn’t take this much processing power to do the work. Operational costs started to pop up on the radar.

So we did a small test to check the feasibility of getting all the necessary files from S3 onto a single EC2 instance and processing them with standard old *nix tools. After promising results we decided to pull the job out of EMR. It took several days to rewrite, but we’ve now got a simple Ruby script using various *nix goodies like cut, sort, grep and their friends. The script is parallelized via JRuby threads at the points where it makes sense (downloading multiple files at once and processing the files independently once they’ve been bucketed).

In the end it runs in less time than it did on EMR, on a single modest instance, is much simpler to debug and maintain, and costs far less money to run.
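The bucket-then-process shape described above can be sketched roughly as follows. This is only an illustration in Python, not the Ruby script itself. The tab-separated record layout, the sample data, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: one "customer_id<TAB>tweet_id" record per line.
RAW_LINES = [
    "acme\t1001",
    "acme\t1002",
    "globex\t2001",
    "acme\t1003",
    "globex\t2002",
]


def bucket_by_customer(lines):
    """Group records by customer so each bucket can be processed independently."""
    buckets = defaultdict(list)
    for line in lines:
        customer, tweet_id = line.split("\t")
        buckets[customer].append(tweet_id)
    return buckets


def summarize(item):
    """Per-bucket work: count the distinct tweet IDs delivered to one customer."""
    customer, tweet_ids = item
    return customer, len(set(tweet_ids))


def usage_report(lines, workers=4):
    """Bucket the records, then summarize the buckets on a small thread pool."""
    buckets = bucket_by_customer(lines)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize, buckets.items()))


if __name__ == "__main__":
    print(usage_report(RAW_LINES))
```

In CPython, threads mostly help with I/O-bound steps such as downloads; CPU-heavy per-bucket work would more likely be handed to separate processes or to the *nix tools themselves.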

We landed in a somewhat counter-intuitive place. There’s great technology available these days to process large amounts of data; we continue to use Hadoop for other projects. But as we start to bring these technologies into our tool-set we have to be careful not to forget the power of straightforward, traditional tools.

Simplicity wins.

Announcing Multiple Connections for Premium Twitter Feeds

A frequent request from our customers has been the ability to open multiple connections to Premium Twitter Feeds on their Gnip data collectors. Our customers have asked and we have delivered!

While multiple connections to standard data feeds have been available for quite some time, we have only allowed one connection to our Premium Twitter Feeds.  Beginning today you will be able to open multiple mirrored connections to Power Track, Decahose, Halfhose, and all of our other Premium Twitter Feeds.  This feature will be helpful when testing connections to your Gnip data collector in different environments (such as staging or review) without having an impact on your production connection.

You may be saying “Sounds great Gnip, but will I be charged the standard Twitter licensing fee for the same tweet delivered across multiple connections?” The answer is no!  You will pay a small flat fee per month for each additional connection.  If you’re interested in adding Multiple Connections to your Premium Twitter Feed please Contact Us.

Schrödinger's Cat is Always Dead in a Black Box

One thing that I’ve learned in my career is that no part of your system should be a black box. When an ugly customer issue rears its head, you must have immediate access to the relevant information to solve the problem.

In order to address this need for quick access to information, a lot of developers are diligent about writing informative log statements. This is certainly good practice but, generally speaking, log statements only give you insight into how things are performing at the application level. What about collecting metrics about how the system is performing as a whole? Enter Munin.

Out of the box, Munin is an Operating System level monitoring solution. It provides a simple web interface view into a multitude of Operating System metrics: number of processes running, network interface throughput, and IOStat output, just to name a few. Each of these metrics is collected on a 5 minute interval. Munin runs a handful of (mostly Perl) scripts that generate RRD graphs in PNG format. These graphs are dumped into /var/www/html/munin/.

Sounds pretty nifty, eh? So here’s how you get Munin up and running with Nginx in less than 5 minutes on CentOS:

Install and Start Munin

sudo yum install munin munin-node
sudo /etc/init.d/munin-node start
sudo /sbin/chkconfig munin-node on

No config is needed; the default Munin install should work well. You start Munin with a simple init.d script. The chkconfig command makes sure that Munin will automatically start on reboot.

Pro Tip: check out /usr/share/munin/plugins/ for a list of extra plugins. If you see anything you like, simply symlink it into /etc/munin/plugins and restart Munin and voila, more graphs. Also, be sure to check out the Munin Exchange for an extensive list of plugins written by third party developers.
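If none of the existing plugins fit, writing one is not much work. A Munin plugin is an executable that prints graph metadata when it is called with the argument config and prints current values otherwise. Here is a minimal sketch in Python. The metric, the field name queued, and read_queue_depth are made up for illustration; a real plugin would replace them with an actual measurement, and the file would need to be executable and symlinked into /etc/munin/plugins like any other plugin.

```python
#!/usr/bin/env python3
import sys


def config_lines():
    """Describe the graph; Munin asks for this by running the plugin with 'config'."""
    return [
        "graph_title Queued activities (example)",
        "graph_vlabel activities",
        "graph_category gnip_example",
        "queued.label queued",
    ]


def read_queue_depth():
    """Hypothetical metric source; replace with a real measurement."""
    return 42


def value_lines():
    """Report the current value for each field declared in config_lines()."""
    return ["queued.value %d" % read_queue_depth()]


def main(argv):
    # Munin passes 'config' as the first argument when it wants metadata.
    lines = config_lines() if argv[1:2] == ["config"] else value_lines()
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```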

Install, Configure, and Start Nginx
Now that Munin is started and generating all sorts of pretty graphs for us, we need to make these graphs accessible via your browser. We use Nginx pretty extensively at Gnip but you could just as easily serve these files up with Apache or just dump them on your network somewhere.

sudo yum install nginx

Once that’s done, you’ll need to edit /etc/nginx/nginx.conf and add the following location to the server listening on port 80:

location /munin/ {
    # root is joined with the full request URI, so /munin/ maps to /var/www/html/munin/
    root /var/www/html;
}

Now start Nginx:

sudo /etc/init.d/nginx start

There you have it. Now you should be able to open up your web browser and hit http://yourserver.foo.com/munin/ and you’ll have an at-a-glance view of your server. Be aware that there are alternatives out there. I came across Ganglia and Cacti but settled on Munin as it was the easiest to drop into our current setup.

From API Consumers to API Designers: A Wish List

At Gnip, we spend a large part of our days integrating with third party APIs in the Social Media space. As part of this effort, we’ve come up with some API design best practices.

Use Standard HTTP Response Codes

HTTP has been around since the early ’90s. Standard HTTP response codes have been around for quite some time. For example, 200-level codes have meant success, 400-level codes have meant a client-side error, and 500-level codes have indicated a server error. If there was an error during an API call to your service, please don’t send us back a 200 response and expect us to parse the response body for error details. If you want to rate limit us, please don’t send us back a 500; that makes us freak out.
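To make the point concrete, here is a rough sketch of how a client might branch on the status class alone. The function names and the retry policy are assumptions for illustration, not a description of any particular client.

```python
def classify_status(status_code):
    """Map an HTTP status code onto the broad classes described above."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client_error"
    if 500 <= status_code < 600:
        return "server_error"
    return "other"


def should_retry(status_code):
    """Hypothetical policy: retry only when the server says the fault is its own."""
    return classify_status(status_code) == "server_error"


if __name__ == "__main__":
    for code in (200, 404, 500):
        print(code, classify_status(code), should_retry(code))
```

A client built this way will treat a 500 used for rate limiting as a server fault and retry, and it will treat an error tucked inside a 200 body as a success. Neither is what the API owner intended.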

Publish Your Rate Limits
We get it. You want the right to scale back your rate limits without a horde of angry developers wielding virtual pitchforks showing up on your mailing list. It would make everyone’s lives easier if you published your rate limits rather than having developers play a constant guessing game. Bonus points if you describe how your rate limits work. Do you limit per set of credentials, per API key, per IP address?

Use Friendly Ids, Not System Ids
We understand that it’s a common pattern to have an ugly system id (e.g. 17134916) backing a human readable id (e.g. ericwryan). As users of your API, we really don’t want to remember system ids, so why not go the extra mile and let us hit your API with friendly ids?

Allow Us to Limit Response Data
Let’s say your rate limit is pretty generous. What if Joe User is hammering your API once a second and retrieving 100 items with every request, even though, on average, he will only see one new item per day? Joe has just wasted a lot of your precious CPU, memory, and bandwidth. Protect your users. Allow them to ask for everything since the last id or timestamp they received.
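A sketch of what "everything since the last id" might look like, using an in-memory list in place of a real API. The since_id and limit parameter names, and the data itself, are assumptions for illustration.

```python
# Hypothetical data store standing in for a real API backend.
ITEMS = [{"id": i, "text": "item %d" % i} for i in range(1, 101)]


def fetch_items(since_id=None, limit=100):
    """Return items newer than since_id, oldest first, capped at limit."""
    newer = [item for item in ITEMS if since_id is None or item["id"] > since_id]
    return newer[:limit]


def poll(last_seen_id):
    """Client side: ask only for what is new and remember the newest id seen."""
    new_items = fetch_items(since_id=last_seen_id)
    if new_items:
        last_seen_id = new_items[-1]["id"]
    return new_items, last_seen_id


if __name__ == "__main__":
    items, cursor = poll(None)    # first poll pulls the backlog
    print(len(items), cursor)
    items, cursor = poll(cursor)  # later polls return only new items
    print(len(items), cursor)
```

With this in place, Joe's once-a-second polling costs an empty response most of the time instead of 100 repeated items.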

Keep Your Docs Up to Date
Who has time to update their docs when you have customers banging on your door for bug fixes and new features? Well, you would probably have fewer customers banging on your door if they had a better understanding of how to use your product. Keep your docs up to date with your code.

Publish Your Search Parameter Constraints
Search endpoints are very common these days. Do you have one? How do we go about searching your data? Do you split search terms on whitespace? Do you split on punctuation? How does quoting affect your query terms? Do you allow boolean operators?
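These questions matter because the same query string can mean different things under different rules. A small illustration using Python's standard library follows; the query and the two splitting strategies are just examples, not a claim about how any particular search endpoint behaves.

```python
import shlex

QUERY = 'gnip "social media" api'


def split_on_whitespace(query):
    """Naive tokenization: quotes are treated as ordinary characters."""
    return query.split()


def split_respecting_quotes(query):
    """Shell-style tokenization: a quoted phrase becomes a single term."""
    return shlex.split(query)


if __name__ == "__main__":
    print(split_on_whitespace(QUERY))
    print(split_respecting_quotes(QUERY))
```

The first approach yields four terms, two of them with stray quote characters; the second yields three, with the phrase kept intact. Without documentation, an API consumer cannot tell which one the server applies.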

Use Your Mailing List
Do you have a community mailing list? Great! Then use it. Is there an unavoidable, breaking change coming in a future release? Let your users know as soon as possible. Do you keep a changelog of features and bug fixes? Why not publish this information for your users to see?

We consider this to be a fairly complete list on designing an API that is easy to work with. Feel free to yell at us (info at gnip) if you see us lacking in any of these departments.

Guest Post, Rick Boykin: Gnip C# .NET Convenience Library

Now that the new Gnip convenience libraries have been published for a few weeks on GitHub, I’m going to tell you a bit about the libraries that I’m currently responsible for, the .NET libraries.  So, let’s dive in, shall we… The latest versions of the .NET libraries are heavily based on the previous version of the Java libraries, with a bit of .NET style thrown in. What that means is that I used Microsoft’s Java Language Conversion Assistant as a starting point, then mixed in some scripting with Bash, sed, and Perl to fix the comments and some of the messy parts that did not translate very well. I then made it more C#-like by removing Java annotations, adding .NET attributes, taking advantage of the native .NET XML serializer, utilizing System.Net.HttpWebRequest for communications, etc. It actually went fairly quickly.  The next task was to start the unit testing deep dive.

I have to say, I really didn’t know anything about the Gnip model, how it worked, or what it really was, at first. It just looked like an interesting project and some good folks. Unit testing, however, is one place where you learn about the details of how each little piece of a system really works. And since hardly any of my tests passed out of the gate (and I was not really convinced that I even had enough tests in place), I decided it was best to go at it till I was convinced. The library components are easy enough. The code is really separated into two parts. The first component is the Data Model, or Resources, which map directly to the Gnip XML model and live in the Gnip.Client.Resource namespace. The second component is the Data Access Layer, or GnipConnection. The GnipConnection, when configured, is responsible for passing data to, and receiving data from, the Gnip servers.  So there are really only two main pieces to this code. Pretty simple: Resources and GnipConnection. The other code is just convenience and utility code to help make things a little more orderly and to reduce the amount of code.

So yeah, the testing… I used NUnit so folks could utilize the tests with the free version of Visual Studio, or even the command line if you want. I included a Gnip.build NAnt file so that you can compile, run the tests, and create a zipped distribution of the code. I’ve also included an NUnit project file in the Gnip.ClientTest root (gnip.nunit) that you can open with the NUnit UI to get things going. To help configure the tests, there is an App.config file in the root of the test project that is used to set all the configuration parameters.

The tests, like the code, are divided into the Resource object tests and the GnipConnection tests (and a few utility tests). The premise of the Resource object tests is to first ensure that the Resource objects are cool. These are simple data objects with very little logic built in (which is not to say that testing them thoroughly is not of the utmost importance). There is a unit test for each one of the data objects, and they proceed by ensuring that the properties work properly, the DeepEquals methods work properly, and that the marshalling to and from XML works properly. The DeepEquals methods are used extensively by the tests, so it is essential that we can trust them. As such, their tests are fairly comprehensive. The marshalling and un-marshalling tests are less so. They do a decent job; they just do not exercise every permutation of the XML elements and attributes. I do feel that they are sufficient to convince me that things are okay.
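The shape of such a round-trip test can be sketched as follows. This is written in Python rather than C#, and it is not the library's actual test code: the Activity fields and the XML layout are simplified assumptions, and the dataclass's field-by-field equality stands in for DeepEquals.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical, simplified resource; dataclass equality compares every field."""
    at: str
    action: str
    actor: str

    def to_xml(self):
        """Marshal the object into a single XML element with attributes."""
        element = ET.Element(
            "activity", {"at": self.at, "action": self.action, "actor": self.actor}
        )
        return ET.tostring(element, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        """Un-marshal an object from the XML produced by to_xml()."""
        element = ET.fromstring(text)
        return cls(
            at=element.get("at"),
            action=element.get("action"),
            actor=element.get("actor"),
        )


def test_round_trip():
    """Marshalling followed by un-marshalling should give back an equal object."""
    original = Activity(at="2008-07-02T11:16:16+00:00", action="post", actor="sally")
    assert Activity.from_xml(original.to_xml()) == original


if __name__ == "__main__":
    test_round_trip()
    print("round trip ok")
```

The round-trip assertion is only as trustworthy as the equality check behind it, which is why the equality tests need to be the more thorough of the two.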

The GnipConnection is responsible for creating, retrieving, updating and deleting Publishers and Filters, and retrieving and publishing Activities and Notifications. There is also a mechanism built into the GnipConnection to get the Time from the Gnip server and to use that Time value to calculate the time offset between the calling client machine and the Gnip server. Since the Gnip server publishes activities and notifications in 1 minute wide addressable ‘buckets’, it is nice to know what the time is on the Gnip server with some degree of accuracy. No attempt is made to adjust for network latency, but we get pretty close to predicting the real Gnip time. That’s it. That little bit is realized in 25 or so methods on the GnipConnection class. Some of those methods are just different signatures of methods that do the same thing only with a more convenient set of parameters. The GnipConnection tests try to exercise every API call with several permutations of data. They are not completely comprehensive. There are a lot of permutations. But, I believe they hit every major corner case.
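The offset idea is simple enough to sketch. The following Python is an illustration only, not the GnipConnection code: the function names are made up, and the bucket naming format (year, month, day, hour, minute) is an assumption for the example.

```python
from datetime import datetime, timedelta, timezone


def clock_offset(server_time, local_time):
    """Difference between the server clock and the local clock (latency ignored)."""
    return server_time - local_time


def estimated_server_time(local_time, offset):
    """Apply a previously measured offset to a local timestamp."""
    return local_time + offset


def bucket_start(moment):
    """Floor a timestamp to the start of its one-minute bucket."""
    return moment.replace(second=0, microsecond=0)


def bucket_id(moment):
    """Hypothetical bucket name: the bucket's start time as YYYYMMDDHHMM."""
    return bucket_start(moment).strftime("%Y%m%d%H%M")


if __name__ == "__main__":
    local = datetime(2009, 1, 1, 12, 0, 50, tzinfo=timezone.utc)
    server = local + timedelta(seconds=15)  # pretend the server clock is 15s ahead
    offset = clock_offset(server, local)
    print(bucket_id(local), bucket_id(estimated_server_time(local, offset)))
```

In this example the uncorrected local clock points at the previous minute's bucket, which is exactly the kind of off-by-one the offset is meant to avoid.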

In testing all this, one thing I wanted to do was to run my tests and have the de-serialization of the XML validate against the XML Schema file I got from the good folks at Gnip. If I could de-serialize and then serialize a sufficiently diverse set of XML streams, while validating that those streams adhere to the XML Schema, then that was another bit of ammo for trusting that this thing works in situations beyond the test harness. In the Gnip.Client.Uti namespace there is a helper class called XmlHelper that contains a singleton of itself. There is a property called ValidateXml that can be reached like this: XmlHelper.Instance.ValidateXml. Setting that to true will cause the XML to be validated any time it is de-serialized, either in the tests or from the server. It is set to true in the tests. But it doesn’t work with the stock Xsd distributed by Gnip. That Xsd does not include an element definition for each element at the top level, which is required when validating against a schema. I had to create one that did. It is semantically identical to the Gnip version; it just pulls things out to the top level. You can find the custom version in the Gnip.Client/Xsd folder. By default it is compiled into the Gnip.Client.dll.

One of the last things I did, which had nothing really to do with testing, was to create the IGnipConnection interface. Use it if you want. If you use some kind of Inversion of Control container like Unity, or like to code to interfaces, it should come in handy.

That’s all for now. Enjoy!

Rick is a Software Engineer and Technical Director at Mondo Robot in Boulder, Colorado. He has been designing and writing software professionally since 1989, and working with .NET for the last 4 years. He is a regular fixture at the Boulder .NET user’s group meetings and is a member of Boulder Digital Arts.