Guide to the Twitter API – Part 3 of 3: An Overview of Twitter’s Streaming API

The Twitter Streaming API is designed to deliver limited volumes of data via two main types of realtime data streams: sampled streams and filtered streams. Many users like to use the Streaming API because the streaming nature of the data delivery means that the data is delivered closer to realtime than it is from the Search API (which I wrote about last week). But the Streaming API wasn’t designed to deliver full coverage results and so has some key limitations for enterprise customers. Let’s review the two types of data streams accessible from the Streaming API.The first type of stream is “sampled streams.” Sampled streams deliver a random sampling of Tweets at a statistically valid percentage of the full 100% Firehose. The free access level to the sampled stream is called the “Spritzer” and Twitter has it currently set to approximately 1% of the full 100% Firehose. (You may have also heard of the “Gardenhose,” or a randomly sampled 10% stream. Twitter used to provide some increased access levels to businesses, but announced last November that they’re not granting increased access to any new companies and gradually transitioning their current Gardenhose-level customers to Spritzer or to commercial agreements with resyndication partners like Gnip.)

The second type of data stream is “filtered streams.” Filtered streams deliver all the Tweets that match a filter you select (eg. keywords, usernames, or geographical boundaries). This can be very useful for developers or businesses that need limited access to specific Tweets.

Because the Streaming API is not designed for enterprise access, however, Twitter imposes some restrictions on its filtered streams that are important to understand. First, the volume of Tweets accessible through these streams is limited so that it will never exceed a certain percentage of the full Firehose. (This percentage is not publicly shared by Twitter.) As a result, only low-volume queries can reliably be accommodated. Second, Twitter imposes a query limit: currently, users can query for a maximum of 400 keywords and only a limited number of usernames. This is a significant challenge for many businesses. Third, Boolean operators are not supported by the Streaming API like they are by the Search API (and by Gnip’s API). And finally, there is no guarantee that Twitter’s access levels will remain unchanged in the future. Enterprises that need guaranteed access to data over time should understand that building a business on any free, public APIs can be risky.

The Search API and Streaming API are great ways to gather a sampling of social media data from Twitter. We’re clearly fans over here at Gnip; we actually offer Search API access through our Enterprise Data Collector. And here’s one more cool benefit of using Twitter’s free public APIs: those APIs don’t prohibit display of the Tweets you receive to the general public like premium Twitter feeds from Gnip and other resyndication partners do.

But whether you’re using the Search API or the Streaming API, keep in mind that those feeds simply aren’t designed for enterprise access. And as a result, you’re using the same data sets available to anyone with a computer, your coverage is unlikely to be complete, and Twitter reserves the right change the data accessibility or Terms of Use for those APIs at any time.

If your business dictates a need for full coverage data, more complex queries, an agreement that ensures continued access to data over time, or enterprise-level customer support, then we recommend getting in touch with a premium social media data provider like Gnip. Our complementary premium Twitter products include Power Track for data filtered by keyword or other parameters, and Decahose and Halfhose for randomly sampled data streams (10% and 50%, respectively). If you’d like to learn more, we’d love to hear from you at sales@gnip.com or 888.777.7405.

25 Million Free Tweets on Power Track

Last week we announced Twitter firehose filtering. This week we’re celebrating the news with Free Tweets for all. Sign up by February 28th to enjoy no licensing fees on your first 25 million Tweets in your first 60 days using Power Track. 

Power Track offers powerful filtering of the Twitter firehose, guaranteeing 100% Tweet delivery. For instance, filter by keyword or username to access all Tweets that match the criteria you care about and have all of the matching results delivered to you in realtime via API. Power Track supports Boolean operators, can match your filtering criteria even within expanded URLs, and has no query volume or traffic limitations, helping you access all of the data you want. And it’s only available from Gnip, currently the only authorized distributor of Twitter data via API.

The licensing fee for Power Track is $.10 per 1,000 Tweets, but we’re waiving that fee for the first 25 million Tweets in 60 days for Power Track customers who sign up by February 28th. 1-year agreement and Gnip data collector fee still required.

Learn More or Contact Us to start testing Power Track for firehose filtering. Cheers!

What Facebook Data is Available from Gnip's Social Media API?

Facebook is among the most in-demand (but also among the most challenging) social media sources to access. Most Facebook conversation data is private and so it’s not accessible via Facebook’s API or any of Gnip’s feeds. Facebook data availability is also pretty confusing to understand and the rules keep changing. So, let’s clarify what kinds of Facebook information we can offer through our social media API today. 

What Facebook Data is Available from Gnip?
Within the realm of publicly accessible data only, we provide:

  • User page content: status updates, wall posts, comments
  • Fan page content: wall posts, comments (probably more than you’ll find from any service), “Like” count, historical data up to 90 days


How Can You Get the Data?

Instead of a firehose of Facebook data, you enter parameters indicating what you want to find:

  • Keyword search
    You provide a list of keywords. We’ll return public mentions of those keywords.
  • Username search
    You provide a list of usernames. We’ll return publicly accessible posts generated by those users.
  • Fan page search
    You provide a list of fan pages. We’ll return publicly available posts and comments on those fan pages.

    While these lists are vastly simplified, we hope they’ll clarify what kinds of Facebook data most businesses can access legally, and exactly what Facebook data Gnip provides.

    Oh, and one last thing. We’re sometimes asked how we feel about Facebook’s privacy policies. At Gnip, we don’t make the rules — we just play by them. Our job is to facilitate access to the social data that publishers (like Facebook) officially make accessible to our customers.

    Best wishes to you with your Facebook data collection! If you think we might be able to help, please drop us a note.

    I want '*'

    My favorite requirement from customers is the “I want all the data, from all the sources, for all of history, and for all of future” one. You’re never going to get it, from anyone, so reset your expectations. A few constructs fall out of this request.

    Two Types of ‘feeds’

    Firehoses

    These are aggregate sources of data for a given publisher. They may, or may not, be a complete representation of that publisher’s data set. Everyone wants firehoses, but truth be told, there are very few of them in the wild, and those that do exist are of less “valuable” data. Consider access to firehoses to be at statistically relevant access levels, rather than truly “complete” sets of data.

    Seeded Feeds

    These encompass the majority of data sources, and they require that you know what you’re looking for. Be it a keyword, a tag, a user name, a user id, or a geo-location.

    In either case you need to know what it is you’re after. Blind, unfettered, access to a given publisher’s feed is a rarity and actually isn’t all that interesting in the end; you just think it is because someone else had the product idea first (e.g. the publisher you want all the data from… e.g. Twitter).

    Historical Access

    Storing and indexing lots of data is conceptually simple, yet hard to implement at scale; just ask any of the big-three search engines. You can stuff as much data as possible into a database, and “search” it offline, in order to meet most historical data access requirements, but weaving that into a variably accessed consumer application isn’t always easy. While storage costs are generally nil for today’s highly compressible data, the operational management costs of your locally stored data aren’t.

    “Real-time” Access

    Processing data in a manner other than which it originated causes an impedance miss-match. Stream-to-offline processing implies that you’ll have gaps in data due to queuing problems. Offline-to-stream suggests the same. Offline-to-offline and stream-to-stream are generally easy to get your head and code around, but be wary of overloading stream processing with too much work as it then starts to feel like stream-to-offline. Once you enter that world, you need to solve parallel processing problems; in real-time.

    Regardless of access pattern, you can only introspect and access the data you initially seeded your sources with. If your seed was wrong, for example you used the wrong set of users or keywords, processing the data doesn’t matter. Full circle to garbage in, garbage out.

    If you find yourself asking for the introductory requirement with your team, and/or a vendor, I suggest you actually don’t have the focus on your product or idea that you’ll ultimately need in order to be successful. Batten down the hatches, and get crisp about precisely what it is you want to build, and precisely what data you need to do so. If you can do that, you will have a shot at success.

    Migrating to the Twitter Streaming API: A Primer

    Some context:

    Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few of partners and the world was happy.  These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed.  Over time, numerous factors caused Twitter to cease offering the firehose.  There was much wailing and gnashing of teeth on that day, I can tell you!

    At roughly the same time, Twitter bought real-time search company Summize and began offering to everyone access to what is now known as the Search API.  Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword.  Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it) and Twitter had to deal with a constantly-growing number of developers connected to an inherently inefficient interface.

    Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, the gardenhose, track, birddog, etc) and provided feedback that helped shape and solidify Twitter’s Streaming API.

    A month ago Twitter Engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API“.  In the post, he detailed several reasons why the move is beneficial to developers.  Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API.  Thus, both the carrot and the stick have been laid out.

    The streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

    Question 1:  Do I need to make the switch?

    Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

    Examples:

    • Use the Streaming API any time you are tracking a keyword over time or sending notifications /  summaries to a subscriber.
    • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
    • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
    • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache.

    Question 2: Why should I make the switch?

    • First and foremost, you’ll get relevant tweets significantly faster.  Linearly polling an API or RSS feed for a given set of keywords automatically creates latency which increases at a linear rate.  Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours!  With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.
    • With traditional API polling, each query returns N results regardless of whether any results are new since your last request.  This puts the onus of deduping squarely on your shoulders.  This sounds like it should be simple — cache the last N resultIDs in memory and ignore anything that’s been seen before.  At scale, high-frequency keywords will consume the cache and low frequency keywords quickly age out.  This means you’ll invariably have to hit the disk and begin thrashing your database. Thankfully, Twitter has already obviated much of this in the Search API with an optional “since_id” query parameter, but plenty of folks either ignore the option or have never read the docs and end up with serious deduplication work.  With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
    • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
    • We all complain about the limited backwards view of Twitter’s search corpus.  On any given day, you’ll have access to somewhere between seven and 14 days worth of historical data (somewhere between one quarter to one half billion tweets), which is of limited value when trying to discover historical trends.  Additionally, for high volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system.  While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index.  As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
    • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).

    Question 3: Will I have to change my API integration?

    Twitter’s Streaming API uses streaming HTTP

    • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
    • That said, you’ll need to reset the connection every time you change keywords.  With the Streaming API , you upload the entire set of keywords when establishing a connection.  If you have a large number of keywords, it can take several minutes to upload all of them and during the duration you won’t get any streaming results.  The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data.  In order to adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two minute chunks.
    • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.

    Streaming API consumers need to perform more filtering on their end

    • Twitter’s Streaming API only accepts single-term rules; no more complex queries.  Say goodbye to ANDs, ORs and NOTs.  This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end.  From now on, you’ll add to the Streaming API one or more of the required keywords (Avatar and/or Movie) and filter out from the results anything without both keywords and containing the word “Game”.
    • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
    • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.

    Throttling is performed differently

    • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
    • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
    • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
    • The best place to begin your quest for increased Streaming API is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for coming Commercial Accounts.)
    • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track.  I know, I know, who would do such a thing?  Not you, certainly.  But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it” in order to get a ton of tweets flowing to you.  You would be sorely disappointed, because twitter enforces a secondary throttle, a percentage of firehose available to each access level.  The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume.  Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor.  No soup for you!  You should monitor this by watching for “limit” notifications.  If you find yourself regularly receiving these, either tighten up your keywords are request greater access from Twitter.

    Start tracking deletes

    • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.

    Question 4: What if I want historical data too?


    Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword.  Depending on your use case you may need some historical data to kick things off.  If so, you’ll want to make one simultaneous query to the Search API.  This means that you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data + low-latency / high-reliability future data.

    And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.

    Questions 5: What if I need more help?

    Twitter resources:

    Streaming HTTP resources:

    Gnip help:

    • Ask questions in the comments below and we’ll respond inline
    • Send email to eric@gnip.com to ask the Gnip team direct questions