Posts Tagged ‘Twitter’

Posted in APIs,How-To,Polling,real-time by Eric 2 Comments

Some context:

Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few of partners and the world was happy.  These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed.  Over time, numerous factors caused Twitter to cease offering the firehose.  There was much wailing and gnashing of teeth on that day, I can tell you!

At roughly the same time, Twitter bought real-time search company Summize and began offering to everyone access to what is now known as the Search API.  Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword.  Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it) and Twitter had to deal with a constantly-growing number of developers connected to an inherently inefficient interface.

Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, the gardenhose, track, birddog, etc) and provided feedback that helped shape and solidify Twitter’s Streaming API.

A month ago Twitter Engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API“.  In the post, he detailed several reasons why the move is beneficial to developers.  Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API.  Thus, both the carrot and the stick have been laid out.

The streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

Question 1:  Do I need to make the switch?

Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

Examples:

  • Use the Streaming API any time you are tracking a keyword over time or sending notifications /  summaries to a subscriber.
  • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
  • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
  • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache.

Question 2: Why should I make the switch?

  • First and foremost, you’ll get relevant tweets significantly faster.  Linearly polling an API or RSS feed for a given set of keywords automatically creates latency which increases at a linear rate.  Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours!  With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.
  • With traditional API polling, each query returns N results regardless of whether any results are new since your last request.  This puts the onus of deduping squarely on your shoulders.  This sounds like it should be simple — cache the last N resultIDs in memory and ignore anything that’s been seen before.  At scale, high-frequency keywords will consume the cache and low frequency keywords quickly age out.  This means you’ll invariably have to hit the disk and begin thrashing your database. Thankfully, Twitter has already obviated much of this in the Search API with an optional “since_id” query parameter, but plenty of folks either ignore the option or have never read the docs and end up with serious deduplication work.  With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
  • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
  • We all complain about the limited backwards view of Twitter’s search corpus.  On any given day, you’ll have access to somewhere between seven and 14 days worth of historical data (somewhere between one quarter to one half billion tweets), which is of limited value when trying to discover historical trends.  Additionally, for high volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system.  While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index.  As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
  • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).

Question 3: Will I have to change my API integration?

Twitter’s Streaming API uses streaming HTTP

  • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
  • That said, you’ll need to reset the connection every time you change keywords.  With the Streaming API , you upload the entire set of keywords when establishing a connection.  If you have a large number of keywords, it can take several minutes to upload all of them and during the duration you won’t get any streaming results.  The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data.  In order to adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two minute chunks.
  • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.

Streaming API consumers need to perform more filtering on their end

  • Twitter’s Streaming API only accepts single-term rules; no more complex queries.  Say goodbye to ANDs, ORs and NOTs.  This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end.  From now on, you’ll add to the Streaming API one or more of the required keywords (Avatar and/or Movie) and filter out from the results anything without both keywords and containing the word “Game”.
  • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
  • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.

Throttling is performed differently

  • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
  • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
  • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
  • The best place to begin your quest for increased Streaming API is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for coming Commercial Accounts.)
  • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track.  I know, I know, who would do such a thing?  Not you, certainly.  But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it” in order to get a ton of tweets flowing to you.  You would be sorely disappointed, because twitter enforces a secondary throttle, a percentage of firehose available to each access level.  The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume.  Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor.  No soup for you!  You should monitor this by watching for “limit” notifications.  If you find yourself regularly receiving these, either tighten up your keywords are request greater access from Twitter.

Start tracking deletes

  • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.

Question 4: What if I want historical data too?
Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword.  Depending on your use case you may need some historical data to kick things off.  If so, you’ll want to make one simultaneous query to the Search API.  This means that you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data + low-latency / high-reliability future data.

And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.

Questions 5: What if I need more help?

Twitter resources:

Streaming HTTP resources:

Gnip help:

  • Ask questions in the comments below and we’ll respond inline
  • Send email to eric@gnip.com to ask the Gnip team direct questions

Posted in APIs,Customers,Partners,Publishers by Eric No Comments

We continue to work on enriching the Gnip schema to provide second level meta-data on user generated activities .  Given we push 10s to 100s of of millions of activities around daily supporting more meta-data means a bit of work beyond just updating our schema.

Today we rolled out meta-data updates to the <actor> and <to> elements of the Gnip schema.   The updates today are new optional attributes that provide a place to map additional user information that is available on some social media services like Twitter and others.    Initially we will just add the new meta-data to activities where the information is available inline with activities and then in the near future we are adding more platform features to support the scenario where a second API call is required to add this meta-data to the activity.

Starting today the <actor> element has support for the numeric userID, friends, followers, and posts.   In addition, we are now mapping the fullname and username to individual attributes in order to better support services that allow end users to create custom screen names and change those names.   The <to> element was updated to provide a new attribute for numeric userID.

Overview of updates to <actor> and <to> elements of Gnip schema:

  • <actor> is the person who performed the activity on the service
    • <posts> is the number of updates made by the user
    • <followers> is the number of people following the user
    • <friends> is the number of people the user has friended
    • <fullname> is the descriptive name or screen name of the user
    • <username> is the username of the user on the service
    • <uid> the unique numeric ID for the user on the service
    • <metaURL> is the user profile link on the service
  • <to> is the person who the activity is in response
    • <uid> the unique numeric ID for the user on the service
    • <metaURL> is the user profile link on the servic


Posted in APIs,Customers,Industry,Long,Partners,Publishers,Release,Strategy,solutions by Eric No Comments

This post is meant to provide a reminder and additional guidance for Gnip platform users as we transition to the new Twitter Streaming API at the end of the week.   We have lots going and want to make sure companies and developers are keeping up with the moving parts.

  • Friday, June 19th:  Twitter is turning off the original XMPP firehose that we have used as the default “Twitter Data Publisher” in the Community Edition of the platform.
  • Starting on Friday, June 19th the new default “Twitter Data Publisher” in the Community Edition of the platform will be integrated to the new “spritzer” tier of the Twitter Streaming API.     Spritzer is a sample of the Twitter stream and not a “firehose”.   This is the default publicly available stream that Twitter is allowing Gnip to make available for anyone to integrate.
  • All Gnip users will be able to access full-data filters with the updated Twitter Data Publisher
  • If your company has an authorized Twitter account for the gardenhose, shadow or birddog tiers and do not want to build and maintain this integration contact us by email at info@gnip.com or shane@gnip.com to discuss how Gnip can provide a solution.

Helpful information about the new Twitter Streaming API:

PS:  The planned Facebook integration is coming along and we have our internal prototype completed.  Driving toward the beta and should have more details in the next week or two.

PSS: We would still appreciate any feedback people can provide on their Twitter data intgration needs – take the survey

Posted in APIs,Customers,Partners,Publishers,Release by Eric No Comments

Last week we informed the community of our plans to transition to the new Twitter Streaming API. (see the blog post)  This post is going to focus on providing some information on how Gnip Filters will be updated in order to support the new requirements of the Streaming API.

Here is a general summary of what Gnip users need to have in mind to prepare for the transition.

1) The Twitter Streaming API uses HTTP Basic Authentication to open up a connection.   The authentication requires the Twitter Username:Password combination, and the account access tier is set at the Twitter account level.

2) The default Gnip support provided to users will be to the “spritzer” and “follow” tiers as these are public and can be accessed by any valid Twitter account.

3) Developers and companies that have use cases which require higher levels of access (gardenhose, shadow, birddog) need to send an email directly to Twitter at api@twitter.com. The email should include basic information about your use case, the access level that is required (gardenhose, shadow, birddog), and the Twitter account to map the access.  Also,  Twitter has a new URL to request access for the gardenhose level.

Also, to provide a preview of what the new Gnip filters will provide we wanted to include some screen shots of what we are working on at this time.   (Also, you will notice the prototypes were built using an updated user experience we are working on for a future release)

Figure 1:  Gnip Filter Creation

This is the start page for creating a Gnip filter that will connect to the new Twitter Streaming API.   Users now will need to provide a valid Twitter account in order to support the HTTP Basic Authentication requirements of the API.

gnip_twitter_streaming_api_filter

Figure 2: Gnip Filters will support the multiple tiers of the Twitter Streaming API

Twitter has multiple tiers for the Streaming API which will be supported in this update to the Gnip filters.  In the developer web app or at the Gnip API it will be possible to select the Streaming API tier that the filter will access.

gnip_twitter_streaming_api_filter_2

Posted in APIs,Customers,Partners,Publishers by Eric No Comments

When we started Gnip last year Twitter was among the first group of companies that understood the data integration problems we were trying to solve for developers and companies.   Because Gnip and Twitter were able to work together it has been possible to access and integrate data from Twitter by using the Gnip platform since last July using Gnip Notifications, and since last September using Gnip Data Activities.

All of this data access was the result of Gnip working with the Twitter XMPP “firehose” API to provide Twitter data access for users of both the Gnip Community and Standard edition product offerings.   Recently Twitter announced a new Streaming API and began an alpha program to start making the new API available.  Gnip has been testing the new Streaming API and now we are planning to move from the current XMPP API to the new Streaming API in the middle of June.    This transition to the new Streaming API will mean some changes in the default behavior and ability to access Twitter data as described below

New Streaming API Transition Highlights

  1. Gnip will now be able to provide both Gnip Notifications and Gnip Data Activities to all users of the Gnip platform.   We had stopped providing access to Data Activities to new customers last November when Twitter began working on the new API, but now all users of the Gnip platform can use either Notifications or Data Activities based on what is appropriate for their application use case.
  2. There are no changes to the Gnip API or service endpoints of Gnip Publishers and Filters due to this transition.  This is changing the default Twitter API that we integrate to for data from Twitter (added about 2 hours after original post)
  3. The Twitter Streaming API is meant to accommodate a class of applications that require near-real-time access to Twitter public statuses and is provided with several tiers of streaming API methods.  See the Twitter documentation for more information.
  4. The default Streaming API tiers that Gnip will be making available are the new “spritzer” and “follow” stream methods.   These are the only tiers which are made available publicly without requiring an end user agreement directly with Twitter at this time.
  5. The “spritzer” stream method is not a “firehose” as the XMPP stream that Gnip previously used as our default.   The average messages per second is still being worked out by Twitter, but at this time “spritzer” runs in the ballpark of 10-20 messages per second and can vary depending on lots of variables being managed by Twitter.
  6. The “follow” stream method returns public statuses from a specified set of users, by ID.
  7. For more on “spritzer”, “follow”, and other methods see the Twitter Streaming API Documentation.
What About Companies and Developers With Use Cases Are Not Met With the Twitter “Spritzer” and “Follow” Streaming API methods
Gnip and Twitter realize that many use cases exist for how companies want to use Twitter data and that new applications are being built everyday.   Therefore we are exploring how companies that are authorized by Twitter for other Streaming API methods  would be able to use the Gnip platform as their integration platform of choice.

Twitter has several additional Streaming API methods available to approved parties that require a signed agreement to access.   To better understand which developers and companies using the Gnip platform could benefit from these other Streaming API options we would encourage Gnip platform users to take this short 12 question survey: Gnip: Twitter Data Publisher Survey (URL: http://www.surveymonkey.com/s.aspx?sm=dQEkfMN15NyzWpu9sUgzhw_3d_3d)

What About the Gnip Twitter-search Data Publisher?
The Gnip Twitter-search Data Publisher is not impacted by the transition to the new Twitter Streaming API since it is implemented using the new Gnip Polling Service and provides keyword-based data integration to the search.twitter APIs.

We will provide more information when we lock down the actual day for the transition shortly.    Please take the survey and as always please contact us directly at info@gnip.com or send me a direct email at shane@gnip.com

Posted in Industry,Partners by Eric No Comments

We wanted to provide an update on how people are able to interact with Twitter via the Gnip platform since people creating new filters will notice a change in behavior as of today.

Gnip has been working with Twitter for several months. This has given developers another option for how they access events and data from the Twitter APIs. During this time Twitter was making their meta-data available to a limited set of partners like Gnip, as they state in their blog post from earlier this week, this meta-data access was on an experimental basis. In addition, Twitter announced in their blog post that they are working on a solution that will better allow people to use a service like Gnip for data-driven applications, but the new solution is not ready. At Gnip we plan to continue to work with Twitter as they bring this new solution to market.

In the interim people who create new filters on the Gnip platform will notice that these filters only provide access to the Gnip Notifications features, and not the Gnip Data Streams features. This change was put in place for new filter creation until the updated solution for data becomes available from Twitter. Existing filters that use the Gnip Data Streams features will continue to function normally.

At Gnip we see this as yet another example of how we are focusing on delivering a standard way for developers to access all the web’s data. We will continue to work on with our partners and the industry to provide reliable tools and services that allow you to get real-time access to the data you need when and where you need it. However, in some cases we in the industry will experience growing pains as the social Internet continues to expand and innovate.