Hidden Engineering Gotchas Behind Polling

August 20th, 2010
  • Posted by Jud Valeski, Co-Founder and CEO
No Comments

I just spent a couple of days optimizing a customer’s data collection on a Gnip instance, for a specific social media data source API. It had been awhile since I’d done this level of tuning, and it reminded me of just how many variables must be considered when optimally polling source API for data.

Requests Per Second (RPS) Limits

Most services have a rate limit that an given IP address (or API key/login) cannot break. If you hit an endpoint too hard, the API backs you off and/or blocks you. Don’t confuse RPS with concurrent connections however; they’re measured differently and each has its own limitations for a given API. In this particular case I was able to parallelize three requests because the total response time per request was ~3 seconds. The result was that a given IP address was not violating the API’s RPS limitations. Had the API been measuring concurrent connections, that would have been a different story.

Document/Page/Result-set Size

Impacting my ability to parallelize my requests was the document size I was requesting of the API. Smaller document sizes (e.g. 10 activities instead of 1000) meant faster response times, which when parallelized, run the risk of violating the RPS limits. On the other hand, larger document sizes take more time to get; whether because they’re simply bigger and take longer to transfer over the wire, or because the API you’re accessing is taking a long time to assemble the document on the backend.

Cycle Time

The particular API I was working with was a “keyword” based API, meaning that I was polling for search terms/keywords. In Gnip parlance we call these “terms” or “keywords,” “rules” in order to generalize the terminology. A rule-set’s “cycle time” is how long it takes a Gnip Data Collector to poll for a given rule-set once. For example, if a rule-set size is 1,000, and the API’s RPS limit is 1, that rule-set’s cycle time would be 1,000 seconds; every 1k seconds, each rule in the set has been polled. Obviously, the cycle time would increase if the server took longer than a second to respond to each requests.

Skipping (missing data)

A given rule “skips” data during polling (meaning, you will miss data because you’re not covering enough ground) when one of the following conditions is true. ARU (activity update rate) is the rate at which activities/events occur on the given rule (e.g. the number of times per second someone uploads a picture with the tag “foo”)

  • ARU is greater than the RPS limit (RPS represented as 1/RPS) multiplied by the document size.
  • ARU is greater than the rule-set’s cycle time

In order to optimally collect the data you need, in a timely manner, you have to balance all of these variables, and adjust them based on the activity update rate for the rule-set you’re interested in. While the variables make for engaging engineering exercises, do you want to spend time sorting these out, or spend time working on the core business issues you’re trying to solve? Gnip provides visibility into these variables to ensure data is most effectively collected.

Comments are closed.

Follow Gnip

Archive

Recent Posts
Categories
Tags
Blogroll

Recent Tweets

Switch to our mobile site