Copyright © 2010 Gnip, inc.
Gnip is always looking for ways to improve its filtering capabilities, and customer feedback plays a huge role in these efforts. We are excited to announce enhancements to our PowerTrack product that allow for more precise filtering of the Twitter Firehose, an enhancement request that came directly from you, our customers.
Gnip PowerTrack rules now support OR and Grouping using (). We have also loosened limitations on the number of characters and the number of clauses per rule. Specifically, a single rule can now include up to 10 positive clauses and up to 50 negative clauses (previously 10 total clauses). Additionally, the character limit per rule has grown from 255 characters to 1024.
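As a rough illustration of the new limits, the sketch below checks a rule against the 1024-character cap and the 10 positive / 50 negative clause caps. The tokenization is an assumption, not Gnip's actual rule grammar: it counts each whitespace-separated term or quoted phrase as one clause, ignores `OR` and parentheses, and treats a leading `-` as negation. The name `check_rule` is hypothetical.

```python
import re

MAX_CHARS = 1024     # per-rule character limit from the announcement
MAX_POSITIVE = 10    # positive clauses allowed per rule
MAX_NEGATIVE = 50    # negative clauses allowed per rule

# Assumption: a clause is a quoted phrase or a run of characters that are
# not whitespace or parentheses, optionally prefixed with "-" for negation.
_CLAUSE = re.compile(r'-?"[^"]*"|[^\s()]+')


def check_rule(rule):
    """Return a list of limit violations for one rule (empty if none)."""
    problems = []
    if len(rule) > MAX_CHARS:
        problems.append(f"rule is {len(rule)} characters (limit {MAX_CHARS})")

    # Drop the OR keyword so only real clauses are counted.
    clauses = [c for c in _CLAUSE.findall(rule) if c != "OR"]
    negative = sum(1 for c in clauses if c.startswith("-"))
    positive = len(clauses) - negative

    if positive > MAX_POSITIVE:
        problems.append(f"{positive} positive clauses (limit {MAX_POSITIVE})")
    if negative > MAX_NEGATIVE:
        problems.append(f"{negative} negative clauses (limit {MAX_NEGATIVE})")
    return problems


print(check_rule('(gnip OR "social data") -spam'))
```

Running a check like this before submitting rules is one way to catch oversized rules early; the real service remains the authority on what counts as a clause.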
With these changes, we are now able to offer our customers a much more robust and precise filtering language to ensure you receive the Tweets that matter most to you and your business. However, these improvements bring their own set of specific constraints that are important to be aware of. Examples and details on these limitations are as follows:
OR and Grouping Examples
Character Limitations
Limitations
Precedence
For example a rule of:
You can find full details of the Gnip PowerTrack filtering changes in our online documentation.
Know of another way we can improve our filtering to meet your needs? Let us know in the comments below.
At Gnip, we find ‘speed’ to be one of the most fascinating aspects of social media, specifically with regard to news stories. We continue to see a trend towards the ‘breaking’ of news stories on platforms like Twitter. Both the speed at which a story is broken and the speed at which that story catches on show the incredible power of this medium for information exchange. And as we’ve pointed out before, different social media streams offer different analytical value: Twitter versus a news feed, for example.
Last night proved a great example of this as word of Huntsman’s withdrawal from the GOP presidential race crept out. Interestingly, the news was broken on Twitter by Peter Hamby, a CNN Political Reporter. While CNN followed up on this news a few minutes later, it seems the reporter (or the network) realized the inherent ‘newswire’ value of breaking this news as fast as possible, and used Twitter as part of their strategy to do so!
This Tweet was followed by what we’ve begun to see as the normal ‘Twitter’ spike for breaking news. The chart below, built by our Data Scientist Scott, shows how quickly news of Huntsman’s withdrawal was retweeted and passed along. Comparing it with an aggregate news feed (in this case, NewsGator’s Datawire Firehose, a content aggregator derived from crowdsourced RSS feeds that contains many articles from traditional media providers) brings some interesting differences to light.

Comparing tweets of “huntsman” and news articles breaking Jon Huntsman’s withdrawal from the GOP primary race. The blue curves show the “Social Activity Pulse” that characterizes the growth and decay of media activity around this topic. By fitting the rate of articles or tweets to a function, we can compare standard measures such as time-to-peak, story half-life, etc. (More on this in a future post.) The peak in Twitter is reached at about the same time as the first story arrives from NewsGator, over 10 minutes after the story broke on Twitter.
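The caption mentions time-to-peak and half-life without describing the fitted function. As a simpler stand-in for that fit, the sketch below reads both measures directly off binned counts. The per-minute counts are made-up illustration data, not the Huntsman measurements, and the name `pulse_metrics` is hypothetical.

```python
def pulse_metrics(counts, bin_minutes=1.0):
    """Return (time_to_peak, half_life) in minutes for per-bin counts.

    time_to_peak: minutes from the first non-empty bin to the busiest bin.
    half_life: minutes from the peak until the count first drops to half
    the peak value or lower (None if that never happens in the data).
    Assumes at least one bin has a non-zero count.
    """
    start = next(i for i, c in enumerate(counts) if c > 0)
    peak = max(range(len(counts)), key=lambda i: counts[i])
    half = counts[peak] / 2

    half_life = None
    for i in range(peak + 1, len(counts)):
        if counts[i] <= half:
            half_life = (i - peak) * bin_minutes
            break
    return (peak - start) * bin_minutes, half_life


# Made-up tweets-per-minute series, for illustration only.
example = [0, 2, 10, 40, 90, 120, 100, 70, 55, 30, 12]
print(pulse_metrics(example))
```

Reading the values straight from the bins is noisier than fitting a curve, but it makes the two measures concrete and lets two streams be compared on the same terms.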
Both streams show a similar curve in story adoption, peak and tail. What’s different is the timeframe of the content. Twitter’s data spikes about 10 minutes earlier than NewsGator’s. NewsGator’s content is more in-depth, as it contains news stories and blog posts, but as we’ve seen in other cases, Twitter is the place where news breaks these days.
There’s always a lot going on here at Gnip, but this week is especially packed with the team looking to make a big splash at Salesforce.com’s annual Dreamforce event. Salesforce is obviously a huge player in the software space and the theme of this year’s Dreamforce is “Welcome to the Social Enterprise” which fits really nicely with what we do.
At the conference, we’ll be speaking at two sessions and sponsoring the Hack-a-thon. In the first presentation, Drinking from the Firehose: How Social Data is Changing Business Practices, Jud (@jvaleski) and Chris (@chrismoodycom) will discuss the ways that social data is being used to drive innovation across a variety of industries from Financial Services and Emergency Response to Local Business and Consumer Electronics. They’ll also give a glimpse into the technical challenges involved in handling the ever-increasing volume of data that’s flowing out of Twitter every day. If you’re at Dreamforce, this session is on Tuesday (8/30) from 11am to noon in the DevZone Theater on the 2nd floor of Moscone West.
In the second presentation, Your Guide to Understanding the Twitter API, Rob (@robjohnson) will talk through the best ways to get access to the Twitter data that you’re looking for, examining the pros and cons of the various methods. You can check out Rob’s session on Tuesday (8/30) from 3:00 to 3:30 in the Lightning Forum in the DevZone on the 2nd floor of Moscone West.
And finally, we’re sponsoring the Hack-a-thon where teams of developers will create cloud apps for the social enterprise using Twitter feeds from Gnip and at least one of the Salesforce platforms (Force.com, Heroku, Database.com). The winning team stands to take home at least $10,000 in prize money. We’re really excited to see the creative solutions that the teams develop! All submissions are due no later than 6am on Thursday (9/1), so sign up now and get going!
Want to meet up in person at Dreamforce? Give any of us a shout @jvaleski, @chrismoodycom, @robjohnson, @funkefred.

Providing Klout Scores, a measurement of a user’s overall online influence, for every individual in the ever-growing base of Twitter users was the task at hand for Matthew Thomson, VP of Platform at Klout. With massive amounts of data flowing in every second, Thomson and Klout’s scientists and engineers needed a fast and reliable way to process and filter the Twitter Firehose, eliminating data that was unnecessary for calculating and assigning Twitter users’ Klout Scores.
- Matthew Thomson
VP of Platform, Klout
By selecting Gnip as their trusted premium Twitter data delivery partner, Klout tripled their API volume and increased their ability to provide influence scores by 50 percent among Twitter users in less than one month.
For the full details, read the success story here.
But with the ever-growing social web, massive amounts of data are pouring into and out of social media publishers’ websites and APIs every second. In a talk I gave at GlueCon a couple of months ago, I ran through some math to put things into perspective. The numbers are a little dated, but the impact is the same. At that time there were approximately 155,000,000 Tweets per day, and the average size of a Tweet was approximately 2,500 Bytes (keep in mind this could include Retweets).
A Little Bit of Arithmetic
155,000,000 Tweets/day × 2,500 Bytes = 387,500,000,000 Bytes/day
387,500,000,000 Bytes/day ÷ 24 hours = 16,145,833,333 Bytes/hour
16,145,833,333 Bytes/hour ÷ 60 minutes = 269,097,222 Bytes/minute
269,097,222 Bytes/minute ÷ 60 seconds = 4,484,953 Bytes/second
4,484,953 Bytes/second ÷ 1,048,576 Bytes/megabyte = 4.2 Megabytes/second
And in terms of data transfer rates . . .
1 Megabyte/second = 8 Megabits/second
So . . .
4.2 Megabytes/second × 8 Megabits/Megabyte = 33.6 Megabits/second
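For anyone who wants to rerun the numbers with newer volumes, the same chain of conversions fits in a few lines. This is only a sketch: the constant names are ours, and the inputs are the dated figures quoted above. Carrying full precision gives roughly 4.28 megabytes/second and 34.2 megabits/second, slightly above the truncated figures in the post.

```python
TWEETS_PER_DAY = 155_000_000
BYTES_PER_TWEET = 2_500
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_MEGABYTE = 1_048_576  # 2**20, the divisor used in the post
BITS_PER_BYTE = 8

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
megabytes_per_second = bytes_per_second / BYTES_PER_MEGABYTE
megabits_per_second = megabytes_per_second * BITS_PER_BYTE

print(f"{bytes_per_day:,} bytes/day")
print(f"{bytes_per_second:,.0f} bytes/second")
print(f"{megabytes_per_second:.2f} megabytes/second")
print(f"{megabits_per_second:.1f} megabits/second")
```

Note that this is an average rate over a whole day; peak moments will run well above it.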
That’s a Lot of Data
So what does this mean for the data consumers, the companies wanting to reevaluate their traditional business models to take advantage of vast amounts of Twitter data? At Gnip we’ve learned that some of the industry’s standard data processing tools simply don’t work at this scale: out-of-the-box HTTP servers and configurations aren’t sufficient to move the data, TCP stacks in their default configuration can’t deliver this much data, and consumption via typical synchronous GET request handling isn’t applicable. So we’ve built our own proprietary data handling mechanisms to capture and process massive amounts of realtime social data for our clients.
Twitter is just one example. We’re seeing more activity on today’s popular social media platforms and a simultaneous increase in the number of popular social media platforms. We’re dedicated to seamless social data delivery to our enterprise customer base and we’re looking forward to the next data processing challenge.
The first type of stream is “sampled streams.” Sampled streams deliver a random sampling of Tweets at a statistically valid percentage of the full 100% Firehose. The free access level to the sampled stream is called the “Spritzer” and Twitter has it currently set to approximately 1% of the full 100% Firehose. (You may have also heard of the “Gardenhose,” or a randomly sampled 10% stream. Twitter used to provide some increased access levels to businesses, but announced last November that they’re not granting increased access to any new companies and gradually transitioning their current Gardenhose-level customers to Spritzer or to commercial agreements with resyndication partners like Gnip.)
The second type of data stream is “filtered streams.” Filtered streams deliver all the Tweets that match a filter you select (e.g., keywords, usernames, or geographical boundaries). This can be very useful for developers or businesses that need limited access to specific Tweets.
Because the Streaming API is not designed for enterprise access, however, Twitter imposes some restrictions on its filtered streams that are important to understand. First, the volume of Tweets accessible through these streams is limited so that it will never exceed a certain percentage of the full Firehose. (This percentage is not publicly shared by Twitter.) As a result, only low-volume queries can reliably be accommodated. Second, Twitter imposes a query limit: currently, users can query for a maximum of 400 keywords and only a limited number of usernames. This is a significant challenge for many businesses. Third, Boolean operators are not supported by the Streaming API like they are by the Search API (and by Gnip’s API). And finally, there is no guarantee that Twitter’s access levels will remain unchanged in the future. Enterprises that need guaranteed access to data over time should understand that building a business on any free, public APIs can be risky.
The Search API and Streaming API are great ways to gather a sampling of social media data from Twitter. We’re clearly fans over here at Gnip; we actually offer Search API access through our Enterprise Data Collector. And here’s one more cool benefit of using Twitter’s free public APIs: those APIs don’t prohibit display of the Tweets you receive to the general public like premium Twitter feeds from Gnip and other resyndication partners do.
But whether you’re using the Search API or the Streaming API, keep in mind that those feeds simply aren’t designed for enterprise access. As a result, you’re using the same data sets available to anyone with a computer, your coverage is unlikely to be complete, and Twitter reserves the right to change the data accessibility or Terms of Use for those APIs at any time.
If your business dictates a need for full coverage data, more complex queries, an agreement that ensures continued access to data over time, or enterprise-level customer support, then we recommend getting in touch with a premium social media data provider like Gnip. Our complementary premium Twitter products include Power Track for data filtered by keyword or other parameters, and Decahose and Halfhose for randomly sampled data streams (10% and 50%, respectively). If you’d like to learn more, we’d love to hear from you at sales@gnip.com or 888.777.7405.
Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now. Since the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full coverage streams for monitoring and analytics use cases. To get data from the Search API, your system may repeatedly ask Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results to the request (for example “latest 100 Tweets”). If there have been more than 100 Tweets created about a search query since the last time you sent the request, some of the matching Tweets will be lost.
So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit time is constrained by Twitter’s rate limits. Some queries are so popular (hello “Justin Bieber”) that it can be impossible to make enough requests to Twitter for that query alone to keep up with this stream. And this is only the beginning of the problem, as no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.
Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. There might usually be fewer than 100 matching Tweets per second, but if there’s a spike (say that term becomes a trending topic after a Super Bowl commercial), then there will likely be more than 100 per second. If, because of Twitter’s rate limits, you’re only allowed to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.
Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.
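The two scenarios above can be put into numbers with a small back-of-the-envelope model. This is a sketch under stated assumptions, not a description of Twitter's actual rate limits: it uses the one-request-per-second and 100-results-per-request figures from the examples, assumes terms are polled round-robin, and assumes matching Tweets arrive at a steady rate. The function name `search_poll_coverage` is hypothetical.

```python
def search_poll_coverage(num_terms, tweets_per_second,
                         requests_per_second=1.0, results_per_request=100):
    """Estimate polling coverage for one term in a round-robin schedule.

    Returns (seconds_between_polls, fraction_of_tweets_captured) for a term
    that receives `tweets_per_second` matching Tweets at a steady rate.
    """
    # Each term gets polled once per full pass over the term list.
    seconds_between_polls = num_terms / requests_per_second
    tweets_between_polls = tweets_per_second * seconds_between_polls

    # A request returns at most `results_per_request` Tweets; anything
    # beyond that since the previous poll is lost.
    if tweets_between_polls <= results_per_request:
        return seconds_between_polls, 1.0
    return seconds_between_polls, results_per_request / tweets_between_polls


# One term during a spike of 200 matching Tweets per second.
print(search_poll_coverage(num_terms=1, tweets_per_second=200))

# 1,000 terms, with one term averaging a Tweet every two seconds.
print(search_poll_coverage(num_terms=1000, tweets_per_second=0.5))
```

In the second case the term is polled only once every 1,000 seconds, so even a modest half-Tweet-per-second term overflows the 100-result window and most of its Tweets are never seen.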
Because of these issues related to the monitoring use cases, data collection strategies relying exclusively on the Search API will frequently deliver poor coverage of Twitter data. Also, be forewarned, if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.
Although coverage is not complete, one great thing about the Twitter Search API is the complex operator capabilities it supports, such as Boolean queries and geo filtering. That is why some people opt to use the Search API to collect a sampling of Tweets that match their search terms. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).
So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)
But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full coverage Twitter firehose filtering. (See our Power Track feed if you’d like more info on that.)
Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…
Late last week, several members of the Gnip team attended the 2011 Glue Conference, including Gnip’s very own CEO Jud Valeski, who delivered a keynote presentation on High-Volume, Realtime Data Stream Handling. Check out Audrey Watters’ article on ReadWriteCloud, Gnip CEO on the Challenges of Handling the Real-Time, Big Data Firehose; it does a great job of summing up Jud’s presentation.
We were thrilled to once again sponsor such an innovative and informative event dedicated to the bits and pieces, APIs and metadata, standards, and connectors that help “glue” together the applications of the web. We would like to congratulate Eric Norlin (@defrag) and team for putting on a great conference.
It was exciting to see several of our customers and partners at the conference, including ReportGrid, a realtime analytics startup that used Gnip data to power its demo. For those of you that we met at the conference, it was a pleasure! For those of you that we missed, give us a call or shoot us an email; we would love to hear from you. See you next year at the 2012 Glue Conference!
We’re excited to announce two new enrichments today: we’ve partnered with Klout to deliver influence score data and we’ve enabled filtering by languages on our Twitter firehose-based premium data feeds. Combined with Gnip’s other enrichments (format normalization, URL expansion, etc.), we hope you’ll find it easier than ever to filter your Twitter feeds to precisely the data you want. (See all Gnip Enrichments)
Our latest partner, Klout, is known as “the standard for influence.” Our friends there analyze Twitter and other social media data to determine how influential (or not) different Twitter users are and assign “Klout Scores” to them accordingly. (Last we checked, @gnip’s Klout Score was 41 on Klout’s scale of 1 to 100.) Klout is a Gnip customer as well, so we’re particularly pleased to work with them to bring Klout Score metadata to other Gnip customers and share the love.
Now when you access premium Twitter data through Gnip, you can opt to have each user’s Klout Score appended to their Tweets. Klout filtering capabilities are also available via Gnip — for example, when you use our Power Track feed, you can choose to receive Tweets only from users whose Klout Score exceeds a certain number. Although Klout data has been available upon request to existing Gnip customers for some time, today marks the official start of our partnership and Klout enrichment on Gnip feeds. Welcome to the family, Klout!
Our other new enrichment feature today, language filtering, has long been requested by a wide variety of Gnip customers (our international clients in particular!). Starting today, language filtering is also available on Gnip’s premium Twitter feeds for 11 languages: English, Dutch, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Spanish, and Swedish (with more to follow).
To filter for English Tweets only, for instance, just append “lang:EN” to each relevant rule you’re querying. You can also enter “lang:EN” as a rule on its own if you’d like to receive all Tweets that our algorithm has identified as English language Tweets. Our language filtering option is based on our recently announced language metadata, built from the open sourced JTCL, using n-gram frequencies to categorize Tweets into given languages.
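If you maintain your rules in code, appending the language clause can be automated. The helper below is a minimal sketch: `with_language` is a hypothetical name, the rules are plain strings, and wrapping OR rules in parentheses is a defensive assumption about operator precedence, which this post does not spell out.

```python
def with_language(rules, lang="EN"):
    """Append a lang: clause to each rule, grouping OR rules first."""
    tagged = []
    for rule in rules:
        rule = rule.strip()
        # Group rules that contain OR so the language clause applies to the
        # whole rule rather than only to the last alternative (assumption).
        if " OR " in f" {rule} ":
            rule = f"({rule})"
        tagged.append(f"{rule} lang:{lang}")
    return tagged


print(with_language(["gnip", "coke OR pepsi"]))
```

A rule that is already fully parenthesized simply gains a second, harmless pair of parentheses.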
With these two new filtering capabilities you can construct a whole new class of streams using Power Track, such as:
Although Klout Scores and language filtering are only available on premium Twitter feeds so far, many of Gnip’s data enrichments come included with every Gnip Data Collector. Contact us to learn more or try Gnip’s enrichments for yourself.
Gnip and Automattic Make Whole New Universe of Data Available
January 17th, 2012
Tags: automattic, blogs, comments, deep analysis, engagement, firehose, gnip, intensedebate, jetpack, likes, wordpress.com, wordpress.org
Posted by Bill Adkins, Director of Business Development, in Data, Partners, Product
“This new data from Automattic is a big addition and a testament to Gnip’s commitment to drive the social data economy forward. This is an important source to add to the social data mix, one that we know our customers will take full advantage of.”
- Rob Begg, VP Marketing of Radian6
Today, we’re excited to announce a major addition to our coverage of the conversations taking place on blogs around the world. We’re expanding our relationship with Automattic to make a whole new universe of blog and comment data available to the market for the first time anywhere.
For those who don’t know, Automattic is a network of web services including WordPress.com, VIP hosting and support, Polldaddy, IntenseDebate, and Jetpack. We’ve been delivering data from WordPress.com and IntenseDebate for about a year and a half and found that while our customers loved their data, they always wanted more.
As of today, we are offering the full firehose of blog posts and comments from Jetpack-powered WordPress.org sites, as well as engagement streams of “likes” from WordPress.com and IntenseDebate. The new data from WordPress.org greatly increases the coverage available to those who are looking to do deep analysis of blog posts and comments. The new engagement streams enable companies to pull in reaction data to quickly understand sentiment, relevance and resonance. With this they can gauge the intensity of opinion around fast moving blog and comment conversations, helping prioritize critical response.
Being full firehoses, all of the streams from Automattic ensure 100% coverage in realtime, giving customers the peace of mind that they can keep up with the entire discussion on fast moving threads.
The scope of coverage offered by Automattic is pretty incredible. Check out some of these stats:
We’re thrilled to be able to offer these new data streams to our customers and can’t wait to see the amazing things they’ll be able to do with them.
Updated: Coverage in GigaOM – Gnip and WordPress deepen ties, expand data partnership