Guide to the Twitter API – Part 2 of 3: An Overview of Twitter’s Search API

The Twitter Search API can theoretically provide full coverage of ongoing streams of Tweets. That means it can, in theory, deliver 100% of Tweets that match the search terms you specify in near-realtime. In reality, though, the Search API is not intended to, and does not fully, support the constant, repeated searches that would be required to deliver 100% coverage. Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now. Because the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full-coverage streams for monitoring and analytics use cases. To get data from the Search API, your system repeatedly asks Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results (for example, the latest 100 Tweets). If more than 100 Tweets matching your query have been created since the last time you sent the request, some of the matching Tweets will be lost.
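To make this concrete, a single poll of the Search API of that era looked roughly like the request below (the search.twitter.com endpoint and its q, rpp, and since_id parameters are shown as they were documented at the time, and LAST_SEEN_ID is a placeholder for the highest Tweet ID you have already received):

curl -v "http://search.twitter.com/search.json?q=%22Coca%20Cola%22&rpp=100&since_id=LAST_SEEN_ID"

Each response is capped at rpp results (at most 100), so keeping up with a busy query means either paging through results or polling more often, both of which run straight into the rate limits discussed next.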

So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit of time is constrained by Twitter’s rate limits. Some queries are so popular (hello, “Justin Bieber”) that it can be impossible to make enough requests for that query alone to keep up with the stream. And that is only the beginning of the problem: no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.

Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. There might usually be fewer than 100 matching Tweets per second, but if there’s a spike (say the term becomes a trending topic after a Super Bowl commercial), there will likely be more than 100 per second. If, because of Twitter’s rate limits, you’re only allowed to send one request per second, you will miss some of the Tweets generated at the most critical moment of all.

Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.

Because of these issues, data collection strategies for monitoring use cases that rely exclusively on the Search API will frequently deliver poor coverage of Twitter data. And be forewarned: if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.

Although its coverage is not complete, one great thing about the Twitter Search API is the complex operator capabilities it supports, such as Boolean queries and geo filtering. Because of that, some people opt to use the Search API to collect a sampling of Tweets that match their search terms. These filtering features have been so well liked that Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).

So, to recap: the Twitter Search API offers great operator support, but you should know that you’ll generally see only a portion of the total Tweets that match your keywords, and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)

But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full-coverage Twitter firehose filtering. (See our Power Track feed if you’d like more info on that.)

Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…

Swiss Army Knives: cURL & tidy

Iterating quickly is what makes modern software initiatives work, and the mantra applies to everything in the stack. From planning your work to your builds, things have to move fast, and feedback loops need to be short and sweet. In the realm of REST[-like] API integration, writing an application just to visually validate the API you’re interacting with is overkill. At the end of the day, web services boil down to HTTP requests, which can be rapidly tested with a tight little application called cURL. You can test just about anything with cURL (yes, including HTTP streaming/Comet/long-poll interactions), and its configurability is endless. You’ll have to read the man page to get all the bells and whistles, but I’ll provide a few samples of common Gnip use cases here. At the end of this post I’ll clue you into cURL’s indispensable cohort in web service slaying: ‘tidy.’

cURL power

cURL can generate custom HTTP client requests with any HTTP method you’d like. ProTip: the biggest gotcha I’ve seen trip people up is leaving the URL unquoted. Many URLs don’t need quotes when being fed to cURL, but many do, and you should just get in the habit of quoting every one; otherwise you’ll spend far too long debugging what turns out to be driver error. There are tons of great cURL tutorials out on the network; I won’t try to recreate those here.

POSTing

Some APIs want data POSTed to them. There are two forms of this.

Inline

curl -v -d "some=data" "http://blah.com/cool/api"

From File

curl -v -d @filename "http://blah.com/cool/api"

In either case, cURL defaults the Content-Type to the ubiquitous “application/x-www-form-urlencoded”. While this is often the correct thing to do by default, there are a couple of things to keep in mind. One, it assumes that the data you’re inlining, or that is in your file, is indeed formatted as such (e.g. key=value pairs). Two, when the API you’re working with does NOT want data in this format, you need to explicitly override the Content-Type header like so.

curl -v -d "someotherkindofdata" "http://blah.com/cool/api" --header "Content-Type: foo"

Authentication

Passing HTTP-basic authentication credentials along is easy.

curl -v -uUSERNAME[:PASSWORD] "http://blah.com/cool/api"

You can inline the password, but keep in mind your password will be cached in your shell history logs.
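If you leave the password off entirely (no colon), cURL prompts you for it interactively, which keeps it out of your shell history:

curl -v -uUSERNAME "http://blah.com/cool/api"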

Show Me Everything

You’ll notice I’m using the “-v” option on all of my requests. “-v” lets me see all of the HTTP-level interaction (method, headers, etc.) except the request POST body, and that visibility is crucial for debugging interaction issues. You’ll also need to use “-v” to watch streaming data fly by.
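For example, watching a streaming endpoint is just another cURL request that never completes. Something like the following worked against Twitter’s sample stream as it existed when this was written (substitute whatever streaming URL and credentials you’re actually using):

curl -v -uUSERNAME "http://stream.twitter.com/1/statuses/sample.json"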

Crossing the Streams (cURL + tidy)

Most web services these days spew XML formatted data, and it is often not whitespace formatted such that a human can read it easily. Enter tidy. If you pipe your cURL output to tidy, all of life’s problems will melt away like a fallen ice-cream scoop on a hot summer sidewalk.

cURL’d web service API without tidy

curl -v "http://rss.clipmarks.com/tags/flower/"
...
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?><rss versi
on="2.0"><channel><title>Clipmarks | Flower Clips</title><link>http://clipmarks.com/tags/flower/</link><feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl><ttl>15</ttl
><description>Clip, tag and save information that's important to you. Bookmarks save entire pages...Clipmarks save the specific content that matters to you!</description><
language>en-us</language><item><title>Flower Shop in Parsippany NJ</title><link>http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link><description>&lt;
b&gt;clipped by:&lt;/b&gt; &lt;a href="http://clipmarks.com/clipper/dunguschariang/"&gt;dunguschariang&lt;/a&gt;&lt;br&gt;&lt;b&gt;clipper's remarks:&lt;/b&gt;  Send Dishg
ardens in New Jersey, NJ with the top rated FTD florist in Parsippany Avas specializes in Fruit Baskets, Gourmet Baskets, Dishgardens and Floral Arrangments for every Holi
day. Family Owned and Opperated for over 30 years. &lt;br&gt;&lt;div border="2" style="margin-top: 10px; border:#000000 1px solid;" width="90%"&gt;&lt;div style="backgroun
d-color:"&gt;&lt;div align="center" width="100%" style="padding:4px;margin-bottom:4px;background-color:#666666;overflow:hidden;"&gt;&lt;span style="color:#FFFFFF;f
...

cURL’d web service API with tidy

curl -v "http://rss.clipmarks.com/tags/flower/" | tidy -xml -utf8 -i
...
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?>
<rss version="2.0">
   <channel>
     <title>Clipmarks | Flower Clips</title>
     <link>http://clipmarks.com/tags/flower/</link>
     <feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl>
     <ttl>15</ttl>
     <description>Clip, tag and save information that's important to
       you. Bookmarks save entire pages...Clipmarks save the specific
       content that matters to you!</description>
     <language>en-us</language>
     <item>
       <title>Flower Shop in Parsippany NJ</title>
       <link>

http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link>

       <description>&lt;b&gt;clipped by:&lt;/b&gt; &lt;a
...

I know which one you’d prefer. So what’s going on? We’re piping the output to tidy and telling tidy to treat the document as XML (use XML structural parsing rules), treat encodings as UTF8 (so it doesn’t barf on non-latin character sets), and finally “-i” indicates that you want it indented (pretty printed essentially).

Right Tools for the Job

If you spend a lot of time whacking through the web service API forest, be sure you have a sharp machete. cURL and tidy make for a very sharp machete. Test driving a web service API before you start laying down code is essential; these tools let you create tight feedback loops at the integration level, saving everyone time, energy and money.

Gnip: An Update

Gnip moved into our new office yesterday (other end of the block from our old office). The transition provided an opportunity for me to think about where we’ve been, and where we’re going.

Team

We continue to grow, primarily on the engineering side. Check out our jobs page if you’re interested in working on a hard problem, with smart people, in a beautiful place (Boulder, CO).

Technology

We’ve built a serious chunk of back-end infrastructure that I’d break into two general pieces: “the bus” and “the pollers.”

“The Bus”

Our back-end moves large volumes of relatively small (usually <~3k bytes) chunks of data from A to B in a hurry. Data is “published” into Gnip, we do some backflips with it, and then we spit it out the other side to consumers.

“The Pollers”

Our efforts to get Publishers to push directly into Gnip didn’t pan out the way we initially planned. As a result, we had to change course and acquire data ourselves. The bummer here was that we set out on an altruistic mission to relieve the polling pain the industry has been suffering from, but were met with such inertia that we didn’t get the coverage we wanted. The upside is that building polling infrastructure has allowed us to control more of our business destiny. We’ve gone through a few iterations on our approach to polling, from complex job scheduling and systems that “learn” and “adapt” to their surroundings, to dirt-simple, mindless grinders that ignorantly eat APIs/endpoints all day long. We’re currently slanting heavily toward simplicity in the model. The idea is to take learnings from the simple model over time and feed them into abstractions/re-factorings that make the system smarter.

Deployment

We’re still in the cloud. Amazon’s EC2/S3 products have been a solid (albeit not necessarily the most cost-effective when your CPU utilization isn’t in the 90%+ range per box), highly flexible framework for us; hats off to those guys.

Industry

“The Polling Problem”

It’s been great to see the industry wake up and acknowledge “the polling problem” over the past year. SUP (Simple Update Protocol) popped up to provide more efficient polling for systems that couldn’t, or wouldn’t, move to an event-driven model: it provides a compact change-log that pollers can check cheaply, then go do heavier polls only for the stuff that has changed. PubSubHubbub popped up to provide the framework for a distributed Gnip (though lacking inherent normalization): a combination of polling and events spread across nodes that allows for a more decentralized approach.

“Normalization”

The Activity Streams initiative grew legs and is walking. As with any “standards” (or “standards-like”) initiative, things are only as good as adoption. Building ideas in a silo without users makes for a fun exercise, but not much else. Uptake matters, and MySpace and Facebook (among many other smaller initiatives) have bitten off chunks of Activity Streams, which is a very big, good sign for the industry. Structural and semantic consistency matters for applications digesting a lot of information. Gnip provides highly structured and consistent data to its consumers via gnip.xsd.

In order to meet its business needs, and to adapt to the constantly moving industry around it, Gnip has adjusted its approach on several fronts. We moved to incorporate polling. We understand that there is more than one way of doing things, and we will incorporate SUP and PubSubHubbub into our framework. Doing so will make our own polling efforts more effective and give our consumers more flexibility in how data is delivered. While normalized data is nice for a large category of consumers, there is a large tier of customers that doesn’t need, or want, heavy normalization. Opaque message flow has significant value as well.

We set out to move mind-boggling amounts of information from A to B, and we’re doing that. Some of the nodes in the graph are shifting, but the model is sound. We’ve found there are primarily two types of data consumers: high-coverage of a small number of sources (“I need 100% of Joe, Jane, and Mike’s activity”), and “as high as you can get it”-coverage of a large number of sources (“I don’t need 100%, but I want very broad coverage”). Gnip’s adjusted to accommodate both.

Business

We’ve had to shift our resources to better focus on the paying segments of our audience. We initially thought “life-stream aggregators” would be our biggest paying customer segment; however, data/media analytics firms have proven significant. Catering to the customers who tell you “we have budget for that!” makes good business sense, and we’re attacking those opportunities.

HOW-TO: Twitter Search Publisher

There has been some confusion around how to leverage Gnip’s Twitter Search (“twitter-search”) Publisher. We have work to do in order to clarify this use case from a usability/documentation standpoint, but in the meantime hopefully the following clarifies things a bit.

First off, “twitter-search” is a Polled Publisher, which means it is subject to high latencies as well as gaps in coverage. Secondly, we overload the “keyword” rule type in Filters in order to provide a mechanism for you to enter your http://search.twitter.com-compatible queries (see http://search.twitter.com/operators for more information). Any query you can run on http://search.twitter.com can be added to your Gnip Filter as a “keyword” rule.

For example, if you search Twitter for “Boulder, CO” (including the quotes), Twitter treats it as a literal, case-insensitive phrase search, and so will Gnip. “Boulder, CO” (excluding the quotes) yields an OR search on Twitter, and hence the same in Gnip. If you search for “cars AND trucks” you get Boolean operator behavior in Twitter, and subsequently in Gnip as well.

In short, we pass through the literal queries/strings that you hand Gnip, straight on through to Twitter. The “keywords” are opaque to Gnip. The only trick is in ensuring your “keywords” are entered into Gnip appropriately.

Through Gnip’s web interface, you can add comma-separated keywords to a Filter. This is usually straightforward; however, in the twitter-search Publisher case it takes extra care to get the results you want, especially when your queries include commas or quotes. As a result, keywords entered into a twitter-search Publisher Filter must conform to CSV quoting rules to ensure your queries get executed properly.

Through Gnip’s REST interface, you encapsulate the keywords within XML <rule> elements, so the CSV quoting rules can be ignored.
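As a purely hypothetical sketch (the exact endpoint path, Content-Type, and rule schema shown here are placeholders; see the Gnip API documentation referenced below for the real ones), adding the phrase query from above to a twitter-search Filter over REST would look something like this, with the query passed through verbatim inside a <rule> element:

curl -v -uUSERNAME -d @rules.xml --header "Content-Type: application/xml" "https://api.gnip.com/YOUR_ACCOUNT/filters/YOUR_FILTER/rules"

where rules.xml contains something along the lines of:

<rules><rule type="keyword">"Boulder, CO"</rule></rules>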

For some further examples of how to add twitter-search keywords, see the Gnip API documentation.

As a final note, the overload of “keyword” rule types in Filters is something we’re experimenting with and is subject to change.

Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we also learned that, for many reasons, we need to work with some services via a polling approach. For this reason our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using push, poll, or a mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every end-user developer or company has the option to tell Gnip where to push the data for the Filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data from a Gnip-generated URL for their Filters.
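For the pull option, grabbing the data is just an authenticated HTTP GET against the URL Gnip generates for your Filter. With cURL that is roughly the following (the URL below is a made-up placeholder; the real one comes from your Filter’s configuration):

curl -v -uUSERNAME "https://api.gnip.com/YOUR_PUBLISHER/filters/YOUR_FILTER/notification/current.xml"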

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (i.e. Bob posted a photo, Susie made a comment, etc.) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.

PUSH services: When we have push services, the latency experienced is usually under 60 seconds, but we know this is not always the case, since the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (i.e. every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution at 10-minute intervals by default, and people need to keep this in mind when setting expectations for data flow from any given publisher.

Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and said “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10-minute interval is for the Community Edition (aka the free version). We have the ability to turn the dial: using the smarts built into the polling service feature, we can execute the right rules faster (e.g. every 60 seconds or faster for popular terms, and every 10, 20, or more minutes for less popular ones). The key issue is that for very prolific posters, or for very common keyword rules (e.g. “obama”, “http”, “google”), there can be more posts in the default 10-minute time-frame than we can collect in a single poll from the service endpoint.

For now, the default expectation for our Community Edition platform users should be a 10-minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta. If your project or company needs something a bit snappier from the polled data publishers, then contact us at info@gnip.com or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube

Gravitational Shift

Gnip’s approach to getting more Publishers into the system has evolved. Over the past year we’ve learned a lot about the data delivery business and the state of its technological art. While our core infrastructure remains a highly performant data delivery bus, the way data arrives at Gnip’s front door is shifting.

We set out assuming the industry at large (both Publishers and Consumers) was tired of highly latent data access. What we’ve learned is that data Consumers (e.g. life-stream aggregators) are indeed weary of the latency, but that many Publishers aren’t as interested in distributing their data in real-time as we initially estimated. So, in order to meet intense Consumer demand to have data delivered in a normalized, minimal-latency (not necessarily “real-time”) manner, Gnip is adding many new polled Publishers to its offering.

Check out http://api.gnip.com and see how many Publishers we have to offer as a result of walking down the polling path.

Our goal remains to “deliver the web’s data,” and while the core Gnip delivery model remains the same, polling has allowed us to greatly expand the list of available Publishers in the system.

Tell us which Publishers/data sources you want Gnip to deliver for you! http://gnip.uservoice.com/

We have a long way to go, but we’re stoked at the rate we’re able to widen our Publisher offering now that our polling infrastructure is coming online.

New Beta 3 Update to api.gnip.com

We just finished making a major upgrade to the beta api.gnip.com environment. First, thank you to everyone for your patience during our middle-of-the-day upgrade. We normally schedule upgrades off-hours or do a rolling upgrade, but tonight the entire team of ten is going to the Star Trek premiere. Anyway, we made the “management” decision to do the upgrade earlier in the day so it would not run into our company/family event tonight.

What’s new? Up until now, the data publishers in api.gnip.com have been doing lazy scheduling, which meant they would pull data, but with a small number of rules it was actually easy to miss data or to have a filter never get a hit in our scheduling. Yeah, that is a beta thing; the primary reason for having the beta out for so long was to give users a chance to get all their existing integrations moved to the new schema. With Beta 3 we have made some major enhancements to the system that, from what we see in our tests, greatly improve the amount of data flowing across all of our data publishers that use the new polling services.

From a timeline standpoint, we want to let the new features soak a bit, and then we will lock down a date to take the system to production, as early as the end of May. In the next few days we are running diagnostics and scaling out the system by putting it through its paces, so feel free to do the same or just work with our notification streams on any given publisher.

Live long and prosper.  (sorry, just could not resist the Star Trek line)

Push & Pull

Over here at Gnip we’re knee-deep in the joys of polling. Our mission to “deliver the web’s data” has us using several approaches to hook consumers up with publisher activities. While we iron out the kinks around our polled publishers, I’m reminded of how broken polling is for many types of data. From rate limiting to minimum poll-interval definitions, polling inherently yields gaps between actions that take place along a timeline. In some cases those gaps are small, and potentially imperceptible, but in others they are large. Viewing one of our internal daily stats charts exemplified this push vs. pull dichotomy. Each color represents a separate publisher (the top/black line is the total). Guess which publishers are Push (event-driven) and which are Pull (polling-driven).

Gnip Publisher Daily Chart

The answer: the consistent, connected publishers/lines in the chart are Push (event-driven), and the more variable publishers/lines are Pull (poll-driven).

Our goal in life is to smooth out those variable/jagged lines for the polled publishers, but along the path to data delivery nirvana, I thought I’d share this visual.