Handling High-Volume, Realtime, Big Social Data

The social ecosystem has become the pulse of the world. From delivering breaking news like the death of Osama Bin Laden before it hit mainstream media to helping President Obama host the first Twitter Town Hall, the realtime social web is flooded with valuable information just waiting to be analyzed and acted upon. With millions of users and billions of social activities passing through the ever-growing realtime social web each day, it is no wonder that companies need to reevaluate their traditional business models to take advantage of this big social data.

But with the social web growing exponentially, massive amounts of data are pouring into and out of social media publishers’ websites and APIs every second. In a talk I gave at GlueCon a couple of months ago, I ran through some math to put things into perspective. The numbers are a little dated, but the impact is the same. At that time there were approximately 155,000,000 Tweets per day, and the average size of a Tweet was approximately 2,500 Bytes (keep in mind this could include Retweets).

A Little Bit of Arithmetic

155,000,000 Tweets/day × 2,500 Bytes/Tweet = 387,500,000,000 Bytes/day

387,500,000,000 Bytes/day ÷ 24 hours = 16,145,833,333 Bytes/hour

16,145,833,333 Bytes/hour ÷ 60 minutes = 269,097,222 Bytes/minute

269,097,222 Bytes/minute ÷ 60 seconds = 4,484,953 Bytes/second

4,484,953 Bytes/second ÷ 1,048,576 Bytes/Megabyte ≈ 4.3 Megabytes/second

And in terms of data transfer rates . . .

1 Megabyte/second = 8 Megabits/second

So . . .

4.3 Megabytes/second × 8 Megabits/Megabyte ≈ 34 Megabits/second
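
If you want to play with the assumptions, here is a minimal Python sketch of the same back-of-the-envelope math; the tweet count and average tweet size are simply the figures quoted above.

# Back-of-the-envelope math for the daily Twitter volume quoted above.
TWEETS_PER_DAY = 155_000_000
AVG_TWEET_BYTES = 2_500

bytes_per_day = TWEETS_PER_DAY * AVG_TWEET_BYTES      # 387,500,000,000
bytes_per_second = bytes_per_day / (24 * 60 * 60)     # ~4,484,954
megabytes_per_second = bytes_per_second / 1_048_576   # ~4.3
megabits_per_second = megabytes_per_second * 8        # ~34

print(f"{bytes_per_day:,.0f} Bytes/day")
print(f"{bytes_per_second:,.0f} Bytes/second")
print(f"~{megabytes_per_second:.1f} MB/second (~{megabits_per_second:.1f} Mbit/second)")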

That’s a Lot of Data

So what does this mean for the data consumers, the companies wanting to reevaluate their traditional business models to take advantage of vast amounts of Twitter data? At Gnip we’ve learned that some of the collective industry data processing tools simply don’t work at this scale: out-of-the-box HTTP servers/configs aren’t sufficient to move the data, out-of-the-box config’d TCP stacks can’t deliver this much data, and consumption via typical synchronous GET request handling isn’t applicable. So we’ve built our own proprietary data handling mechanisms to capture and process mass amounts of realtime social data for our clients.

Twitter is just one example. We’re seeing more activity on today’s popular social media platforms and a simultaneous increase in the number of popular social media platforms. We’re dedicated to seamless social data delivery to our enterprise customer base and we’re looking forward to the next data processing challenge.

Guide to the Twitter API – Part 3 of 3: An Overview of Twitter’s Streaming API

The Twitter Streaming API is designed to deliver limited volumes of data via two main types of realtime data streams: sampled streams and filtered streams. Many users like the Streaming API because its streaming delivery means the data arrives closer to realtime than it does from the Search API (which I wrote about last week). But the Streaming API wasn’t designed to deliver full-coverage results, so it has some key limitations for enterprise customers. Let’s review the two types of data streams accessible from the Streaming API.

The first type of stream is “sampled streams.” Sampled streams deliver a random sampling of Tweets at a statistically valid percentage of the full 100% Firehose. The free access level to the sampled stream is called the “Spritzer,” and Twitter currently has it set to approximately 1% of the full 100% Firehose. (You may have also heard of the “Gardenhose,” a randomly sampled 10% stream. Twitter used to provide some increased access levels to businesses, but announced last November that they’re not granting increased access to any new companies and are gradually transitioning their current Gardenhose-level customers to Spritzer or to commercial agreements with resyndication partners like Gnip.)

The second type of data stream is “filtered streams.” Filtered streams deliver all the Tweets that match a filter you select (e.g., keywords, usernames, or geographical boundaries). This can be very useful for developers or businesses that need limited access to specific Tweets.

Because the Streaming API is not designed for enterprise access, however, Twitter imposes some restrictions on its filtered streams that are important to understand. First, the volume of Tweets accessible through these streams is limited so that it will never exceed a certain percentage of the full Firehose. (This percentage is not publicly shared by Twitter.) As a result, only low-volume queries can reliably be accommodated. Second, Twitter imposes a query limit: currently, users can query for a maximum of 400 keywords and only a limited number of usernames. This is a significant challenge for many businesses. Third, Boolean operators are not supported by the Streaming API like they are by the Search API (and by Gnip’s API). And finally, there is no guarantee that Twitter’s access levels will remain unchanged in the future. Enterprises that need guaranteed access to data over time should understand that building a business on any free, public APIs can be risky.
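
To make the filtered-stream model concrete, here is a minimal sketch of a keyword-tracking connection in Python using the requests library. The statuses/filter endpoint, the track parameter, and HTTP basic auth reflect the public Streaming API as documented at the time of writing, so treat them as assumptions to verify rather than gospel.

# A minimal sketch of a filtered-stream connection. The endpoint, parameters, and
# basic auth are assumptions based on the Streaming API docs at the time of writing.
import json
import requests

STREAM_URL = "https://stream.twitter.com/1/statuses/filter.json"   # assumed v1 endpoint
keywords = ["gnip", "social data"]   # keep well under the keyword caps described above

resp = requests.post(
    STREAM_URL,
    data={"track": ",".join(keywords)},
    auth=("username", "password"),   # placeholder credentials
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line:        # skip keep-alive newlines
        continue
    tweet = json.loads(line)
    print(tweet.get("text", ""))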

The Search API and Streaming API are great ways to gather a sampling of social media data from Twitter. We’re clearly fans over here at Gnip; we actually offer Search API access through our Enterprise Data Collector. And here’s one more cool benefit of using Twitter’s free public APIs: unlike premium Twitter feeds from Gnip and other resyndication partners, they don’t prohibit displaying the Tweets you receive to the general public.

But whether you’re using the Search API or the Streaming API, keep in mind that those feeds simply aren’t designed for enterprise access. And as a result, you’re using the same data sets available to anyone with a computer, your coverage is unlikely to be complete, and Twitter reserves the right to change the data accessibility or Terms of Use for those APIs at any time.

If your business dictates a need for full coverage data, more complex queries, an agreement that ensures continued access to data over time, or enterprise-level customer support, then we recommend getting in touch with a premium social media data provider like Gnip. Our complementary premium Twitter products include Power Track for data filtered by keyword or other parameters, and Decahose and Halfhose for randomly sampled data streams (10% and 50%, respectively). If you’d like to learn more, we’d love to hear from you at sales@gnip.com or 888.777.7405.

Guide to the Twitter API – Part 1 of 3: An Introduction to Twitter’s APIs

You may find yourself wondering . . . “What’s the best way to access the Twitter data I need?” Well, the answer depends on the type and amount of data you are trying to access.  Given that there are multiple options, we have designed a three-part series of blog posts that explain the differences between the coverage the general public can access and the coverage available through Twitter’s resyndication agreement with Gnip. Let’s dive in . . .

Understanding Twitter’s Public APIs . . . You Mean There is More than One?

In fact, there are three Twitter APIs: the REST API, the Streaming API, and the Search API. Within the world of social media monitoring and social media analytics, we need to focus primarily on the latter two.

  1. Search API – The Twitter Search API is a dedicated API for running searches against the index of recent Tweets (a minimal example of querying it follows this list)
  2. Streaming API – The Twitter Streaming API allows high-throughput, near-realtime access to various subsets of Twitter data (e.g., a 1% random sampling of Tweets, filtering for up to 400 keywords, etc.)
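
As a quick illustration of the difference in access patterns, the Search API is a plain HTTP GET that you poll, while the Streaming API holds a connection open and pushes data to you. Here is a minimal Python sketch of a Search API call; the search.twitter.com endpoint and its parameters reflect the public Search API at the time of writing and should be treated as assumptions to verify.

# A minimal one-off query against Twitter's public Search API. The endpoint and
# parameters reflect the Search API at the time of writing -- verify before use.
import requests

resp = requests.get(
    "http://search.twitter.com/search.json",
    params={"q": "gnip", "rpp": 100},   # rpp = results per page (100 was the max)
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("from_user"), result.get("text"))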

Whether you get your Twitter data from the Search API, the Streaming API, or through Gnip, only public statuses are available (protected Tweets are NOT included). Additionally, before Tweets are made available through either of these APIs or through Gnip, Twitter applies a quality filter to weed out spam.

So now that you have a general understanding of Twitter’s APIs . . . stay tuned for Part 2, where we will take a deeper dive into understanding Twitter’s Search API, coming next week…

 

How to Select a Social Media Data Provider

If you’re looking for social media data, you’ve got a lot of options: social media monitoring companies provide end-user brand tracking tools, some businesses provide deep-dive analyses of social data, other companies provide reputation scores for individual users, and still other services specialize in geographic social media display, to name just a few.

Some organizations ultimately decide to build internal tools for social media data analysis. Then they must decide between outsourcing the social data collection bit so they can focus their efforts on analyzing and visualizing the data, or building everything — including API connections to each individual publisher — internally. Establishing and maintaining those API connections over time can be costly. If your team has the money and resources to build your own social media integrations, then go for it!

But if you’re shopping for raw social media data, you should consider a social media API – that is, a single API that aggregates raw data from dozens of different social media publishers – instead of making connections to each one of those dozens of social media APIs individually. And in the social media API market, there is only a small handful of companies for you to choose from. We are one of them and we would love to work with you. But we know that you’ll probably want to shop your options before making a decision, so we’d like to offer our advice to help you understand some of the most important factors in selecting a social media API provider.

Here are some good questions for you to ask every social media API solution you consider (including your own internal engineers, if you’re considering hiring them for the job):

Are your data collection methods in compliance with all social media publishers’ terms of use?

–> Here’s why it matters: by working with a company that violates any publisher’s terms of use, you risk unstable (or sudden loss of) access to that publisher’s data — not to mention the potential legal consequences of using black market data in your product. Conversely, if you work with a company that has a strong relationship with the social media publishers, our experience shows that you not only get stable, reliable data access, but you just might get rewarded with *extra* data access every now and then. (In case you’re wondering, Gnip’s methods are in compliance with each of our social media publishers’ terms of use.)

Do you provide results and allow parameter modifications via API, and do you maintain those API connections over time?

–> In our experience, establishing a single API connection to collect data from a single publisher isn’t hard. But! Establishing many API connections to various social media publishers and – this is key – maintaining those connections over time is really quite a chore. So much so, we made a whole long list of API-related difficulties associated with that integration work, based on our own experiences. Make sure that whoever you work with understands the ongoing work involved and is prepared to maintain your access to all of the social media APIs you care about over time.

How many data sources do you provide access to?

–> Even if you only want access to Twitter and Facebook today, it’s a good idea to think ahead. How much incremental work will be involved for you to integrate additional sources a few months down the line? Our own answer to this question is this: using Gnip’s social media API, once you’re set up to receive your first feed from Gnip via API, it takes about 1 minute for you to configure Gnip to send you data from a 2nd feed. Ten minutes later, you’re collecting data from 10 different feeds, all at no extra charge. Since you can configure Gnip to send all of your data in one format, you only need to create one parser and all the data you want gets streamed into your product. You can even start getting data from a new social media source, decide it’s not useful for your product, and replace it with a different feed from a different source, all in a matter of seconds. We’re pretty proud that we’ve made it so fast and simple for you to receive data from new sources… (blush)… and we hope you’ll find it to be useful, too.

What format is your data delivered in?

–> Ten different social media sources might provide data in 10 different formats. And that means you have to write 10 different parsers to get all the data into your product. Gnip allows you to normalize all the social media data you want into one single format — Activity Streams — so you can collect all your results via one API and feed them into your product with just one parser.
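
To illustrate what “one parser” buys you, here is a minimal Python sketch that pulls the same few fields out of every normalized activity, no matter which publisher it came from. The actor/verb/object/published field names follow the general Activity Streams JSON shape and are assumptions about how your normalized feed is structured.

# One parser for activities normalized to an Activity Streams-style JSON shape,
# regardless of the originating publisher. The field names are assumptions about
# the normalized format you receive -- adjust them to your actual schema.
import json

def parse_activity(raw: str) -> dict:
    activity = json.loads(raw)
    return {
        "who": activity.get("actor", {}).get("displayName", "unknown"),
        "did": activity.get("verb", "post"),
        "what": activity.get("object", {}).get("content", ""),
        "when": activity.get("published"),
    }

# The same function handles a normalized Tweet, Facebook post, or any other feed.
example = ('{"actor": {"displayName": "jane"}, "verb": "post", '
           '"object": {"content": "hello"}, "published": "2011-07-01T12:00:00Z"}')
print(parse_activity(example))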

Hope this helps! If you’ve got additional questions to suggest for our list, don’t hesitate to drop us a note. We’d love to hear from you.

Gnip Integrates Facebook Search API

At F8 this week Facebook publicly announced the new Open Graph API.  As part of their announcement, Facebook also released a Search API against all public stream data created on Facebook.  Gnip is now excited to announce support for this Search API.  If your company has been looking to track conversations about your brand, product, or competitors on Facebook, you can now do so with the Gnip platform.

We’re happy to make this data available to existing Gnip customers and other developers eager to get access to public data.  Public data on Facebook currently includes posts and comments on fan pages and public user wall posts and status updates.

We’re super proud of our engineering team and all the work they’ve put into the Gnip platform to make it possible for us to quickly support the Facebook Search API and all other new APIs on the web.  If you’d like to get the public Facebook stream for a set of keywords, please get in touch with us at info@gnip.com.

Thanks to the Facebook team for making this public data available.  We look forward to continually making the case for more public data to help enterprises listen, understand, and respond to their customers.

Swiss Army Knives: cURL & tidy

Iterating quickly is what makes modern software initiatives work, and the mantra applies to everything in the stack. From planning your work, to builds, things have to move fast, and feedback loops need to be short and sweet. In the realm of REST[-like] API integration, writing an application to visually validate the API you’re interacting with is overkill. At the end of the day, web services boil down to HTTP requests which are rapidly tested with a tight little application called cURL. You can test just about anything with cURL (yes, including HTTP streaming/Comet/long-poll interactions), and its configurability is endless. You’ll have to read the man page to get all the bells and whistles, but I’ll provide a few samples of common Gnip use cases here. At the end of this post I’ll clue you into cURL’s indispensable cohort in web service slaying, ‘tidy.’

cURL power

cURL can generate custom HTTP client requests with any HTTP method you’d like. ProTip: the biggest gotcha I’ve seen trip up most people is leaving the URL unquoted. Many URLs don’t need quotes when being fed to cURL, but many do, and you should just get in the habit of quoting every one, otherwise you’ll spend time debugging your driver error for far too long. There are tons of great cURL tutorials out on the network; I won’t try to recreate those here.

POSTing

Some APIs want data POSTed to them. There are two forms of this.

Inline

curl -v -d "some=data" "http://blah.com/cool/api"

From File

curl -v -d @filename "http://blah.com/cool/api"

In either case, cURL defaults the content-type to the ubiquitous “application/x-www-form-urlencoded”. While this is often the correct default, there are a couple of things to keep in mind. One: this assumes that the data you’re inlining, or that is in your file, is indeed formatted as such (e.g. key=value pairs). Two: when the API you’re working with does NOT want data in this format, you need to explicitly override the Content-Type header like so.

curl -v -d "someotherkindofdata" "http://blah.com/cool/api" --header "Content-Type: foo"

Authentication

Passing HTTP-basic authentication credentials along is easy.

curl -v -uUSERNAME[:PASSWORD] "http://blah.com/cool/api"

You can inline the password, but keep in mind your password will be cached in your shell history logs.

Show Me Everything

You’ll notice I’m using the “-v” option on all of my requests. “-v” allows me to see all the HTTP-level interaction (method, headers, etc), with the exception of a request POST body, which is crucial for debugging interaction issues. You’ll also need to use “-v” to watch streaming data fly by.

Crossing the Streams (cURL + tidy)

Most web services these days spew XML formatted data, and it is often not whitespace formatted such that a human can read it easily. Enter tidy. If you pipe your cURL output to tidy, all of life’s problems will melt away like a fallen ice-cream scoop on a hot summer sidewalk.

cURL’d web service API without tidy

curl -v "http://rss.clipmarks.com/tags/flower/"
...
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?><rss versi
on="2.0"><channel><title>Clipmarks | Flower Clips</title><link>http://clipmarks.com/tags/flower/</link><feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl><ttl>15</ttl
><description>Clip, tag and save information that's important to you. Bookmarks save entire pages...Clipmarks save the specific content that matters to you!</description><
language>en-us</language><item><title>Flower Shop in Parsippany NJ</title><link>http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link><description>&lt;
b&gt;clipped by:&lt;/b&gt; &lt;a href="http://clipmarks.com/clipper/dunguschariang/"&gt;dunguschariang&lt;/a&gt;&lt;br&gt;&lt;b&gt;clipper's remarks:&lt;/b&gt;  Send Dishg
ardens in New Jersey, NJ with the top rated FTD florist in Parsippany Avas specializes in Fruit Baskets, Gourmet Baskets, Dishgardens and Floral Arrangments for every Holi
day. Family Owned and Opperated for over 30 years. &lt;br&gt;&lt;div border="2" style="margin-top: 10px; border:#000000 1px solid;" width="90%"&gt;&lt;div style="backgroun
d-color:"&gt;&lt;div align="center" width="100%" style="padding:4px;margin-bottom:4px;background-color:#666666;overflow:hidden;"&gt;&lt;span style="color:#FFFFFF;f
...

cURL’d web service API with tidy

curl -v "http://rss.clipmarks.com/tags/flower/" | tidy -xml -utf8 -i
...
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?>
<rss version="2.0">
   <channel>
     <title>Clipmarks | Flower Clips</title>
     <link>http://clipmarks.com/tags/flower/</link>
     <feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl>
     <ttl>15</ttl>
     <description>Clip, tag and save information that's important to
       you. Bookmarks save entire pages...Clipmarks save the specific
       content that matters to you!</description>
     <language>en-us</language>
     <item>
       <title>Flower Shop in Parsippany NJ</title>
       <link>

http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link>

       <description>&lt;b&gt;clipped by:&lt;/b&gt; &lt;a
...

I know which one you’d prefer. So what’s going on? We’re piping the output to tidy and telling tidy to treat the document as XML (use XML structural parsing rules), treat encodings as UTF8 (so it doesn’t barf on non-latin character sets), and finally “-i” indicates that you want it indented (pretty printed essentially).

Right Tools for the Job

If you spend a lot of time whacking through the web service API forest, be sure you have a sharp machete. cURL and tidy make for a very sharp machete. Test driving a web service API before you start laying down code is essential. These tools allow you to create tight feedback loops at the integration level before you lay any code down; saving everyone time, energy and money.

Migrating to the Twitter Streaming API: A Primer

Some context:

Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few partners and the world was happy.  These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed.  Over time, numerous factors caused Twitter to cease offering the firehose.  There was much wailing and gnashing of teeth on that day, I can tell you!

At roughly the same time, Twitter bought real-time search company Summize and began offering everyone access to what is now known as the Search API.  Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword.  Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it) and Twitter had to deal with a constantly-growing number of developers connected to an inherently inefficient interface.

Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, the gardenhose, track, birddog, etc) and provided feedback that helped shape and solidify Twitter’s Streaming API.

A month ago Twitter Engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API”.  In the post, he detailed several reasons why the move is beneficial to developers.  Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API.  Thus, both the carrot and the stick have been laid out.

The streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

Question 1:  Do I need to make the switch?

Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

Examples:

  • Use the Streaming API any time you are tracking a keyword over time or sending notifications/summaries to a subscriber.
  • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
  • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
  • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache. (A minimal sketch of this look-up pattern follows this list.)
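
For the widget case in the last bullet, the look-up pattern is simple: keep your own store in front of the Search API so repeat queries never leave your infrastructure. A minimal Python sketch follows; the in-memory store and the fetch_from_search_api helper are hypothetical placeholders for your own code.

# A minimal sketch of the "check your own store first" pattern for a search widget.
import time

CACHE_TTL_SECONDS = 300   # assumed freshness window; tune for your use case
_store = {}               # stand-in for a real centralized data store

def fetch_from_search_api(query: str) -> list:
    """Placeholder for an actual Search API call."""
    return []

def load_results(query: str) -> list:
    cached = _store.get(query)
    if cached and time.time() - cached["fetched_at"] < CACHE_TTL_SECONDS:
        return cached["results"]               # serve from your own store
    results = fetch_from_search_api(query)     # only hit the publisher on a miss
    _store[query] = {"results": results, "fetched_at": time.time()}
    return results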

Question 2: Why should I make the switch?

  • First and foremost, you’ll get relevant tweets significantly faster.  Polling an API or RSS feed for a given set of keywords creates latency that grows linearly with the number of keywords.  Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours!  With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.  (A quick sketch of this math follows this list.)
  • With traditional API polling, each query returns N results regardless of whether any results are new since your last request.  This puts the onus of deduping squarely on your shoulders.  This sounds like it should be simple — cache the last N resultIDs in memory and ignore anything that’s been seen before.  At scale, high-frequency keywords will consume the cache and low frequency keywords quickly age out.  This means you’ll invariably have to hit the disk and begin thrashing your database. Thankfully, Twitter has already obviated much of this in the Search API with an optional “since_id” query parameter, but plenty of folks either ignore the option or have never read the docs and end up with serious deduplication work.  With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
  • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
  • We all complain about the limited backwards view of Twitter’s search corpus.  On any given day, you’ll have access to somewhere between seven and 14 days worth of historical data (somewhere between one quarter to one half billion tweets), which is of limited value when trying to discover historical trends.  Additionally, for high volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system.  While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index.  As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
  • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).
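
To make the latency math in the first bullet concrete, here is a quick Python sketch; it assumes one query per second and that the average latency is half of one full pass through the keyword list.

# Average polling latency when you cycle through keywords at one query per second:
# a new tweet waits, on average, half of one full pass through the keyword list.
def average_latency_seconds(num_keywords: int, queries_per_second: float = 1.0) -> float:
    full_cycle_seconds = num_keywords / queries_per_second
    return full_cycle_seconds / 2

for n in (1_000, 100_000):
    hours = average_latency_seconds(n) / 3600
    print(f"{n:>7,} keywords -> average latency ~{hours:.2f} hours")
# 1,000 keywords   -> ~0.14 hours (a little over eight minutes)
# 100,000 keywords -> ~13.89 hours (almost 14 hours)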

Question 3: Will I have to change my API integration?

Twitter’s Streaming API uses streaming HTTP

  • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
  • That said, you’ll need to reset the connection every time you change keywords.  With the Streaming API, you upload the entire set of keywords when establishing a connection.  If you have a large number of keywords, it can take several minutes to upload all of them, and during that window you won’t get any streaming results.  The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data.  In order to adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two-minute chunks.
  • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.  (A minimal sketch of this split follows this list.)
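
As a concrete example of that decoupling, here is a minimal Python sketch that keeps the stream reader as thin as possible and hands raw lines to a separate worker through a queue; read_stream_lines is a hypothetical placeholder for your actual streaming connection.

# Decouple reading the stream from processing it so the reader never falls behind.
import queue
import threading

backlog = queue.Queue()   # raw lines waiting to be processed

def read_stream_lines():
    """Placeholder generator yielding raw lines from the Streaming API connection."""
    yield from ()

def reader():
    for line in read_stream_lines():
        backlog.put(line)             # do as little as possible on the read path

def processor():
    while True:
        line = backlog.get()
        # parse, filter, index, etc. -- all the slow work happens here, off the read path
        backlog.task_done()

threading.Thread(target=reader, daemon=True).start()
threading.Thread(target=processor, daemon=True).start()
# backlog.qsize() is a rough, always-available measure of how far behind you are.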

Streaming API consumers need to perform more filtering on their end

  • Twitter’s Streaming API only accepts single-term rules; no more complex queries.  Say goodbye to ANDs, ORs and NOTs.  This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end.  From now on, you’ll add to the Streaming API one or more of the required keywords (Avatar and/or Movie) and then filter out any result that doesn’t contain both keywords or that does contain the word “Game” (see the filtering sketch after this list).
  • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
  • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.
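
Using the “Avatar Movie -Game” example above, a client-side filter might look like the following Python sketch (naive substring matching for brevity; real tokenization and routing are up to you).

# Client-side Boolean logic for the "Avatar Movie -Game" example: track the broad
# keywords with the Streaming API, then re-apply AND/NOT on your side.
def matches_avatar_movie_not_game(tweet_text: str) -> bool:
    text = tweet_text.lower()
    required = ("avatar", "movie")   # ANDed terms you tracked
    excluded = ("game",)             # NOTed terms you must drop yourself
    return (all(term in text for term in required)
            and not any(term in text for term in excluded))

print(matches_avatar_movie_not_game("Loved the Avatar movie last night"))    # True
print(matches_avatar_movie_not_game("New Avatar movie tie-in game is out"))  # False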

Throttling is performed differently

  • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
  • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
  • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
  • The best place to begin your quest for increased Streaming API access is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for coming Commercial Accounts.)
  • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track.  I know, I know, who would do such a thing?  Not you, certainly.  But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it” in order to get a ton of tweets flowing to you.  You would be sorely disappointed, because Twitter enforces a secondary throttle, a percentage of the firehose available to each access level.  The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume.  Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor.  No soup for you!  You should monitor this by watching for “limit” notifications.  If you find yourself regularly receiving these, either tighten up your keywords or request greater access from Twitter.  (A sketch of watching for these notices follows this list.)
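
A minimal Python sketch of watching for those “limit” notifications is below; the message shape shown is an assumption based on the Streaming API documentation of the time, so double-check the exact format before relying on it.

# Watch the stream for "limit" notices, which report roughly how many matching
# tweets were dropped. The {"limit": {"track": N}} shape is an assumption based on
# the Streaming API docs at the time of writing -- verify the exact format.
import json

def handle_line(raw_line: str) -> None:
    message = json.loads(raw_line)
    if "limit" in message:
        dropped = message["limit"].get("track", 0)
        print(f"WARNING: throttled; ~{dropped} matching tweets were not delivered")
        return
    process_tweet(message)   # your normal tweet handling

def process_tweet(tweet: dict) -> None:
    pass   # placeholder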

Start tracking deletes

  • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.  (A sketch of handling these notices follows.)
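
Here is a minimal Python sketch of honoring those deletion notices; the message shape is an assumption based on the Streaming API documentation of the time, and TweetStore is a hypothetical stand-in for your own storage layer.

# Honor deletion notices by striking the tweet from your own store. The
# {"delete": {"status": {...}}} shape is an assumption based on the Streaming API
# docs of the time -- verify the exact format before relying on it.
import json

class TweetStore:
    """Placeholder store; wire these methods up to your own database."""
    def save_tweet(self, tweet: dict) -> None: ...
    def delete_tweet(self, tweet_id) -> None: ...

def handle_streaming_message(raw_line: str, store: TweetStore) -> None:
    message = json.loads(raw_line)
    if "delete" in message:
        status = message["delete"].get("status", {})
        store.delete_tweet(status.get("id"))   # strike it from the public record
        return
    store.save_tweet(message)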

Question 4: What if I want historical data too?


Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword.  Depending on your use case you may need some historical data to kick things off.  If so, you’ll want to make one simultaneous query to the Search API.  This means that you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data + low-latency / high-reliability future data.

And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.
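
If you do use the “page” parameter to backfill, a paging loop might look like the Python sketch below; the rpp and page parameters and the roughly 15-page ceiling reflect the Search API at the time of writing, so verify them against current docs.

# Backfill recent history for a keyword by paging through the Search API. The
# endpoint, the rpp/page parameters, and the ~15-page ceiling are assumptions
# based on the Search API at the time of writing.
import requests

def backfill(keyword: str, max_pages: int = 15) -> list:
    tweets = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "http://search.twitter.com/search.json",
            params={"q": keyword, "rpp": 100, "page": page},
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break                    # no more history available
        tweets.extend(results)
    return tweets                    # up to ~1,500 tweets (15 pages x 100 results)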

Question 5: What if I need more help?

Twitter resources:

Streaming HTTP resources:

Gnip help:

  • Ask questions in the comments below and we’ll respond inline
  • Send email to eric@gnip.com to ask the Gnip team direct questions

Social Data in a Marketplace

Gnip: shipping & handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a thing is necessary. Several times we’ve come up against folks who indicate either that a) “our (random publisher X) data’s already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.

If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either not financially able to afford a “web engineer” or simply technically focused elsewhere (if at all), need help accessing the data they need to make business decisions. Furthermore, while you may do your own integrations, how robust is your error reporting, monitoring, and management of your overall strategy? Odds are that you have not given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you were integrated with went down? Could you more efficiently receive the same data from a publisher, while relieving load from your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.

As the social media/data space has evolved over the past couple of years, the necessity of a managed/solution-as-a-service has become clear. As expected, the number of data consumers continues to explode, while the number of consumers with technical capability to reliably integrate with the publishers, as a ratio to total, is shrinking.

Finally some good technical/formatting standards are catching on (PubSubHubbub, WebHooks, HTTP-long-polling/streaming/Comet (thanks Twitter), ActivityStreams), which is giving everyone a vocabulary and common conceptual understanding to use when discussing how/when real-time data is produced/consumed.

In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less inclined, players in the marketplace access to the data they need. As a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build around/with geo? No. I prefer a one-stop shop.

New Gnip Push API Service

The Gnip product offerings are growing today as we officially announce a new Push API Service that will help companies more quickly and effectively deliver data to their customers, partners and affiliates.  (See the TechCrunch article: Gnip Launches Push API To Create Real-Time Stream Of Business Data)

This new offering leverages the Gnip SaaS Integration Platform but is provided as a complete white-label and embeddable solution adding real-time push to an existing infrastructure.  The main capabilities include the following:

  • Push Endpoint Management: Easily register service endpoints and APIs to create alternative Push endpoints that are powered by the Gnip platform.
  • Real-time Data Delivery: Complete white-label approach allows for company-defined URLs to be enhanced for real-time data delivery. Reduce your data latency and infrastructure costs while maintaining control of data access and offloading the delivery to Gnip.
  • Reporting Dashboard: Access important metrics and usage information for service endpoints through a statistics API or a web-based dashboard.

A company can add the Push API Service to its existing infrastructure in hours or days with a few steps:

  1. Tell us how you want Gnip to access your data and APIs as we have several methods depending on your infrastructure
  2. Integrate the Gnip Push API Service to your website with complete control of the user experience and branding
  3. CNAME the subdomain in order to seamlessly add the Push API Service to your existing infrastructure
  4. Track usage of the new Push Service API using a web-based console or stats API

If your company is interested in learning more about how Gnip can help move your existing repetitive API and website traffic to a more efficient push-based approach, contact us at info@gnip.com.