Simplicity Wins

It seems like every once in a while we all have to re-learn certain lessons.

As part of our daily processing, Gnip stores many terabytes of data in millions of keys on Amazon's S3. Various aspects of serving our customers require that we regularly pore over those keys and the data behind them.

As an example, every 24 hours we construct usage reports that provide visibility into how our customers are using our service. Are they consuming a lot or a little volume? Did their usage profile change? Are they not using us at all? So on and so on. We also have what we affectionately refer to as the "dude, where's my tweet" challenge: of the billion activities we deliver each day to our customers, inevitably someone says "hey, I didn't receive Tweet 'X', what gives?" Answering that question requires that we store the ID of every Tweet a customer ever receives. Poring over all this data every 24 hours is a challenge.

As we started on the project, it seemed like a good fit for Hadoop. It involves pulling in lots of small-ish files, doing some slicing, aggregating the results, and spitting them out the other end. Because we're hosted in Amazon, it was natural to use their Elastic MapReduce (EMR) service.

Conceptually the code was straightforward and easy to understand. The logic fit the MapReduce programming model well: it's mostly text processing, and it splits naturally into various stages and buckets. It was up and running quickly.

As the size of the input grew, we started to have various problems, many of which came down to configuration: Hadoop options, JVM options, open file limits, number and size of instances, number of reducers, etc. We went through various rounds of tweaking settings and throwing more machines at the cluster, and each time it would run well for a while longer.

But it still occasionally had problems. Plus there was that nagging feeling that it just shouldn’t take this much processing power to do the work. Operational costs started to pop up on the radar.

So we did a small test to check the feasibility of getting all the necessary files from S3 onto a single EC2 instance and processing them with standard old *nix tools. After promising results, we decided to pull it out of EMR. It took several days to rewrite, but we now have a simple Ruby script using various *nix goodies like cut, sort, grep and their friends. The script is parallelized via JRuby threads at the points where it makes sense (downloading multiple files at once, and processing the files independently once they've been bucketed).
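
For flavor, here's a minimal sketch of the shape of that script. The bucket name, key list, column layout, and s3cmd usage are all stand-ins for illustration, not our actual pipeline:

require "thread"

# Hypothetical list of S3 keys to process.
keys  = File.readlines("s3_keys.txt").map(&:strip)
queue = Queue.new
keys.each { |k| queue << k }

# Download several files at once; JRuby threads run in parallel.
workers = 4.times.map do
  Thread.new do
    loop do
      key = begin
        queue.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      # s3cmd stands in for whatever S3 client you use
      system("s3cmd", "get", "s3://example-bucket/#{key}", "work/#{key}")
    end
  end
end
workers.each(&:join)

# Let the *nix tools do the heavy lifting: pull out the ID column,
# sort it, and count occurrences.
system("cut -f2 work/* | sort | uniq -c > id_counts.txt")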

In the end it runs in less time than it did on EMR, on a single modest instance, is much simpler to debug and maintain, and costs far less money to run.

We landed in a somewhat counter-intuitive place. There's great technology available these days for processing large amounts of data, and we continue to use Hadoop for other projects. But as we bring these tools into our toolset, we have to be careful not to forget the power of straightforward, traditional tools.

Simplicity wins.

Gnip Client Libraries from Our Customers

Our customers rock. When they develop code to start using Gnip, they often share their libraries with us so that they can be useful to future Gnip customers as well. Although Gnip doesn't currently offer officially supported client libraries for accessing our social media API, we do like to highlight customers who choose to share their work.

In particular, here are a few Gnip client libraries that happy customers have developed and shared with us. We’ll be posting them in our Power Track documentation and you can also find them linked here:

Java
by Zauber
https://github.com/zaubersoftware/gnip4j

Python
by General Sentiment
https://github.com/vkris/gnip-python/blob/master/streamingClient.py

If you’ve developed a library for access to Gnip data and you’d like to share it with us at Gnip and other Gnip customers, then drop us a note at info@gnip.com. We’d love to hear from you.

Ruby on Rails BugMash

On Saturday, Gnip hosted a Rails BugMash. Ten people showed up. We mashed some bugs. We learned about Rails internals and about contributing to open source. It was organized by Prakash Murthy (@_prakash), Mike Gehard (@mikegehard) and me (@baroquebobcat).

What's a BugMash, you might ask? Rails BugMashes came out of RailsBridge's efforts to make the Ruby on Rails community more open and inclusive. We wanted to use the format to help get more people locally involved in OSS culture and to show that contributing to a big project like Rails is approachable by mere mortals.

One of the themes of the event was to help with migrating tickets from the old ticket system. Rails moved to GitHub Issues as the official place to file bug reports in April, and there are still a lot of active tickets in the old ticket tracker, Lighthouse.

For greater impact, we focused on tickets with patches already attached. For these tickets, all we needed to do was verify the patch and make a pull request on GitHub. Migrating these tickets was straightforward, and we got a few merged into Rails that afternoon.

When the event started, only two of us had contributed to Rails. By the end of the afternoon, everyone who participated had either submitted a patch or helped to do so. Some of the patches we submitted were merged in before the event was over.

Thanks to @benatkin, @anveo, @mikehoward, @ecoffey, @danielstutzman, @jasonnoble and @jsnrth for coming out on a Saturday to contribute to Rails. Also, thanks to the Rails core team members who were watching pull requests that Saturday and helped us with our commits and offered suggestions and comments on our work. In particular, José Valim (@josevalim), Santiago Pastorino (@spastorino) and Aaron Patterson (@tenderlove) were a great help.

Bugs Smashed

Updated: added participants' Twitter handles.

Twitter XML, JSON & Activity Streams at Gnip

About a month ago Twitter announced they will be shutting off XML for stream based endpoints on Dec, 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you’ll be forced to get your head around something new. You’ll have to get smart, find the right JSON lib, change your code to use it (and any associated dependencies you weren’t already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.
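
For the Ruby-minded, the change is roughly this. A minimal sketch; the payload is abbreviated, though the "text" field does match Twitter's status format of the day:

require "json"

# Before: digging the tweet text out of XML with your XML parser of choice.
# After: one call to the standard JSON library.
payload = '{"id":123,"text":"hello world","user":{"screen_name":"example"}}'
status  = JSON.parse(payload)
puts status["text"]                  # => hello world
puts status["user"]["screen_name"]   # => example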

As it turns out, Gnip's format normalization shields you from all of this. Gnip customers get to stay focused on delivering value to their own customers. Those integrating directly and consuming stream data from Twitter in XML have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).

From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).

Fun.

Schrödinger's Cat is Always Dead in a Black Box

One thing that I've learned in my career is that no part of your system should be a black box. When an ugly customer issue rears its head, you must have quick access to relevant information to solve the problem.

In order to address this need for quick access to information, a lot of developers are diligent about writing informative log statements. This is certainly good practice but, generally speaking, log statements only give you insight into how things are performing at the application level. What about collecting metrics about how the system is performing as a whole? Enter Munin.

Out of the box, Munin is an operating-system-level monitoring solution. It provides a simple web interface view into a multitude of operating system metrics: number of running processes, network interface throughput, iostat output, to name a few. Each of these metrics is collected on a 5 minute interval. Munin runs a handful of (mostly Perl) scripts that generate RRD graphs in PNG format. These graphs are dumped into /var/www/html/munin/.

Sounds pretty nifty, eh? So here’s how you get Munin up and running with Nginx in less than 5 minutes on CentOS:

Install and Start Munin

sudo yum install munin munin-node
sudo /etc/init.d/munin-node start
sudo /sbin/chkconfig munin-node on

No config is needed; the default Munin install should work well. You start Munin with a simple init.d script, and the chkconfig command makes sure that Munin will automatically start on reboot.

Pro Tip: check out /usr/share/munin/plugins/ for a list of extra plugins. If you see anything you like, simply symlink it into /etc/munin/plugins/, restart munin-node, and voila: more graphs. Also, be sure to check out the Munin Exchange for an extensive list of plugins written by third party developers.
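
For example, to enable the stock df_abs plugin (absolute disk usage; substitute whichever plugin you fancy):

sudo ln -s /usr/share/munin/plugins/df_abs /etc/munin/plugins/df_abs
sudo /etc/init.d/munin-node restart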

Install, Configure, and Start Nginx
Now that Munin is started and generating all sorts of pretty graphs for us, we need to make those graphs accessible from a browser. We use Nginx pretty extensively at Gnip, but you could just as easily serve these files up with Apache or just dump them on your network somewhere.

sudo yum install nginx

Once that’s done, you’ll need to edit /etc/nginx/nginx.conf and add the following location to the server listening on port 80:

location /munin/ {
    # nginx appends the request URI to root, so point root at the parent
    # directory; /munin/foo.png then resolves to /var/www/html/munin/foo.png
    root /var/www/html;
}

Now start Nginx:

sudo /etc/init.d/nginx start

There you have it. Now you should be able to open up your web browser and hit http://yourserver.foo.com/munin/ and you’ll have an at-a-glance view of your server. Be aware that there are alternatives out there. I came across Ganglia and Cacti but settled on Munin as it was the easiest to drop into our current setup.

What Does Compound Interest Have to do with Evolving APIs?

Albert Einstein was once asked what he considered the most important discoveries. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons; instead he said: "Compound interest is the greatest mathematical discovery of all time."

I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let's walk through an example. Say you owe your credit card company $1000, and your interest rate is 16%. To make it simple, we assume the credit card company only requires you to pay 1% as your minimum payment every year, so the effective interest rate is 15%. After 30 years of compound interest you owe more than $60,000!

Compound Interest Graph

If there were no compounding, you'd owe just a little over five grand!
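
If you want to check the arithmetic yourself, here's a quick sketch, using the 15% effective rate straight from the example:

principal = 1000.0
rate      = 0.15
years     = 30

compound = principal * (1 + rate)**years    # ~ $66,000 after 30 years
simple   = principal * (1 + rate * years)   # $5,500 without compounding

printf("compound: $%.0f, simple: $%.0f\n", compound, simple)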

What I find truly bizarre, though, is that when we software engineers throw around words like "technological debt", the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing (I'll get back to that later), we are asked to come up with the quick hack that will make it work tomorrow and deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.

And we are even speaking their language by using the word "debt"! We could have said something like: "Well, it would take us roughly one week longer to integrate our current J2EE backend with this 3rd-party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP." But no, we keep it simple and refer to the custom XML parser as "technological debt", and still to no avail.

Now, the next time you have this conversation with your boss, show him the plot above and label the y-axis with “lines of code we have to maintain”, and the x-axis with “development iterations”, and perhaps a bell will go off.

Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:

  1. Refactor early and often
  2. Outsource as much as possible of what you don’t consider your core competency.

For instance, if you have to consume millions of tweets every day, but your core competency does not contain:

  • developing high performance code that is distributed in the cloud
  • writing parsers processing real time social activity data
  • maintaining OAuth client code and access tokens
  • keeping up with squishy rate limits and evolving social activity APIs

then it might be time for you to talk to us at Gnip!

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

From API Consumers to API Designers: A Wish List

At Gnip, we spend a large part of our days integrating with third party APIs in the Social Media space. As part of this effort, we’ve come up with some API design best practices.

Use Standard HTTP Response Codes

HTTP has been around since the early '90s, and standard HTTP response codes have been around for nearly as long: 200-level codes mean success, 400-level codes mean a client-side error, and 500-level codes indicate a server error. If there was an error during an API call to your service, please don't send us back a 200 response and expect us to parse the response body for error details. And if you want to rate limit us, please don't send us back a 500; that makes us freak out.
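
From the consumer side, standard codes let client code branch on the response class instead of scraping the body. A minimal Ruby sketch (the endpoint URL is made up):

require "net/http"
require "uri"

res = Net::HTTP.get_response(URI.parse("http://api.example.com/items"))

case res
when Net::HTTPSuccess      # 2xx: all good, parse the body
  puts res.body
when Net::HTTPClientError  # 4xx: our fault; fix the request or back off
  warn "client error: #{res.code}"
when Net::HTTPServerError  # 5xx: their fault; retry later
  warn "server error: #{res.code}"
end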

Publish Your Rate Limits
We get it. You want the right to scale back your rate limits without a horde of angry developers wielding virtual pitchforks showing up on your mailing list. It would still make everyone's lives easier if you published your rate limits rather than having developers play a constant guessing game. Bonus points if you describe how your rate limits work. Do you limit per set of credentials, per API key, per IP address?

Use Friendly Ids, Not System Ids
We understand that it's a common pattern to have an ugly system id (e.g. 17134916) backing a human-readable id (e.g. ericwryan). As users of your API, we really don't want to remember system ids, so why not go the extra mile and let us hit your API with friendly ids?

Allow Us to Limit Response Data
Let's say your rate limit is pretty generous. What if Joe User is hammering your API once a second, retrieving 100 items with every request, even though on average he will only see one new item per day? Joe has just wasted a lot of your precious CPU, memory, and bandwidth. Protect your users: allow them to ask for everything since the last id or timestamp they received.
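
Something as simple as a since_id parameter goes a long way. A sketch, where the endpoint and parameter names are hypothetical:

require "net/http"
require "uri"

last_seen_id = 42    # highest id we've already processed
uri = URI.parse("http://api.example.com/items?since_id=#{last_seen_id}&count=100")
res = Net::HTTP.get_response(uri)
# The server returns only items newer than since_id, so Joe's once-a-second
# polling costs next to nothing when there's nothing new.
puts res.body if res.is_a?(Net::HTTPSuccess)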

Keep Your Docs Up to Date
Who has time to update their docs when customers are banging on your door for bug fixes and new features? Well, you would probably have fewer customers banging on your door if they had a better understanding of how to use your product. Keep your docs up to date with your code.

Publish Your Search Parameter Constraints
Search endpoints are very common these days. Do you have one? How do we go about searching your data? Do you split search terms on whitespace? Do you split on punctuation? How does quoting affect your query terms? Do you allow boolean operators?

Use Your Mailing List
Do you have a community mailing list? Great! Then use it. Is there an unavoidable, breaking change coming in a future release? Let your users know as soon as possible. Do you keep a changelog of features and bug fixes? Why not publish this information for your users to see?

We consider this a fairly complete list of guidelines for designing an API that is easy to work with. Feel free to yell at us (info at gnip) if you see us lacking in any of these departments.

Hidden Engineering Gotchas Behind Polling

I just spent a couple of days optimizing a customer's data collection on a Gnip instance for a specific social media data source API. It had been a while since I'd done this level of tuning, and it reminded me of just how many variables must be considered when optimally polling a source API for data.

Requests Per Second (RPS) Limits

Most services have a rate limit that a given IP address (or API key/login) cannot break. If you hit an endpoint too hard, the API backs you off and/or blocks you. Don't confuse RPS with concurrent connections, however; they're measured differently and each has its own limitations for a given API. In this particular case I was able to parallelize three requests because the total response time per request was ~3 seconds, so a given IP address never violated the API's RPS limit. Had the API been measuring concurrent connections, that would have been a different story.

Document/Page/Result-set Size

Impacting my ability to parallelize requests was the document size I was requesting of the API. Smaller document sizes (e.g. 10 activities instead of 1000) mean faster response times, which, when parallelized, run the risk of violating the RPS limit. On the other hand, larger documents take more time to fetch, whether because they're simply bigger and take longer to transfer over the wire, or because the API you're accessing takes a long time to assemble the document on the backend.

Cycle Time

The particular API I was working with was a "keyword" based API, meaning that I was polling for search terms/keywords. In Gnip parlance, we call these terms or keywords "rules" in order to generalize the terminology. A rule-set's "cycle time" is how long it takes a Gnip Data Collector to poll a given rule-set once. For example, if a rule-set's size is 1,000 and the API's RPS limit is 1, that rule-set's cycle time would be 1,000 seconds; every 1k seconds, each rule in the set has been polled. Obviously, the cycle time increases if the server takes longer than a second to respond to each request.
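
The arithmetic from that example in Ruby (a trivial sketch, assuming one request per rule per cycle):

rules      = 1_000   # rules in the set
rps_limit  = 1.0     # requests per second the API allows
cycle_time = rules / rps_limit   # => 1000.0 seconds to poll every rule once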

Skipping (missing data)

A given rule "skips" data during polling (meaning you will miss data because you're not covering enough ground) when either of the following conditions is true, where ARU (activity update rate) is the rate at which activities/events occur on the given rule (e.g. the number of times per second someone uploads a picture with the tag "foo"); the sketch after the list walks through both checks:

  • ARU is greater than the RPS limit multiplied by the document size; even polling the rule non-stop, activities occur faster than you can fetch them.
  • ARU multiplied by the rule-set's cycle time is greater than the document size; more activities accrue between polls of a rule than fit in the single document you fetch for it.
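
Here is that pair of checks as a sketch; the numbers are invented for illustration:

aru        = 2.0      # activities per second occurring on this rule
rps_limit  = 1.0      # requests per second we can spend on it
doc_size   = 100      # activities returned per request
cycle_time = 1000.0   # seconds between successive polls of the same rule

# Check 1: even polling this rule non-stop, activities arrive faster
# than we can fetch them.
skips_per_request = aru > rps_limit * doc_size

# Check 2: more activities accrue during one cycle than fit in the
# single document we fetch for the rule.
skips_per_cycle = aru * cycle_time > doc_size

puts "this rule will skip data" if skips_per_request || skips_per_cycle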

In order to collect the data you need optimally and in a timely manner, you have to balance all of these variables and adjust them based on the activity update rate for the rule-set you're interested in. While the variables make for engaging engineering exercises, do you want to spend your time sorting them out, or working on the core business problems you're trying to solve? Gnip provides visibility into these variables to ensure data is collected as effectively as possible.