Copyright © 2010 Gnip, inc.
Gnip makes it easy to build social media tracking tools.
The first type of stream is “sampled streams.” Sampled streams deliver a random sampling of Tweets at a statistically valid percentage of the full 100% Firehose. The free access level to the sampled stream is called the “Spritzer” and Twitter has it currently set to approximately 1% of the full 100% Firehose. (You may have also heard of the “Gardenhose,” or a randomly sampled 10% stream. Twitter used to provide some increased access levels to businesses, but announced last November that they’re not granting increased access to any new companies and gradually transitioning their current Gardenhose-level customers to Spritzer or to commercial agreements with resyndication partners like Gnip.)
The second type of data stream is “filtered streams.” Filtered streams deliver all the Tweets that match a filter you select (e.g., keywords, usernames, or geographical boundaries). This can be very useful for developers or businesses that need limited access to specific Tweets.
Because the Streaming API is not designed for enterprise access, however, Twitter imposes some restrictions on its filtered streams that are important to understand. First, the volume of Tweets accessible through these streams is limited so that it will never exceed a certain percentage of the full Firehose. (This percentage is not publicly shared by Twitter.) As a result, only low-volume queries can reliably be accommodated. Second, Twitter imposes a query limit: currently, users can query for a maximum of 400 keywords and only a limited number of usernames. This is a significant challenge for many businesses. Third, Boolean operators are not supported by the Streaming API as they are by the Search API (and by Gnip’s API). And finally, there is no guarantee that Twitter’s access levels will remain unchanged in the future. Enterprises that need guaranteed access to data over time should understand that building a business on any free, public API can be risky.
The Search API and Streaming API are great ways to gather a sampling of social media data from Twitter. We’re clearly fans over here at Gnip; we actually offer Search API access through our Enterprise Data Collector. And here’s one more cool benefit of using Twitter’s free public APIs: those APIs don’t prohibit display of the Tweets you receive to the general public like premium Twitter feeds from Gnip and other resyndication partners do.
But whether you’re using the Search API or the Streaming API, keep in mind that those feeds simply aren’t designed for enterprise access. As a result, you’re using the same data sets available to anyone with a computer, your coverage is unlikely to be complete, and Twitter reserves the right to change the data accessibility or Terms of Use for those APIs at any time.
If your business dictates a need for full coverage data, more complex queries, an agreement that ensures continued access to data over time, or enterprise-level customer support, then we recommend getting in touch with a premium social media data provider like Gnip. Our complementary premium Twitter products include Power Track for data filtered by keyword or other parameters, and Decahose and Halfhose for randomly sampled data streams (10% and 50%, respectively). If you’d like to learn more, we’d love to hear from you at sales@gnip.com or 888.777.7405.
Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now. Since the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full coverage streams for monitoring and analytics use cases. To get data from the Search API, your system may repeatedly ask Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results to the request (for example “latest 100 Tweets”). If there have been more than 100 Tweets created about a search query since the last time you sent the request, some of the matching Tweets will be lost.
So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit time is constrained by Twitter’s rate limits. Some queries are so popular (hello “Justin Bieber”) that it can be impossible to make enough requests to Twitter for that query alone to keep up with this stream. And this is only the beginning of the problem, as no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.
Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. Usually there might be fewer than 100 matching Tweets per second, but if there’s a spike (say that term becomes a trending topic after a Super Bowl commercial), then there will likely be more than 100 per second. If Twitter’s rate limits only allow you to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.
Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.
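The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figures of one request per second and 100 results per request are the example numbers used above rather than Twitter’s actual limits, the function names are made up for this post, and the sketch assumes requests are shared evenly (round-robin) across all of your terms.

```python
def polling_interval_seconds(num_terms, requests_per_second=1.0):
    """Seconds between two polls of the same term when the allowed
    requests are shared round-robin across all tracked terms."""
    return num_terms / requests_per_second


def max_capturable_rate(num_terms, page_size=100, requests_per_second=1.0):
    """Highest per-term Tweet rate (Tweets per second) that can still be
    captured in full, given that each poll returns at most page_size Tweets."""
    return page_size / polling_interval_seconds(num_terms, requests_per_second)


def missed_tweets_per_poll(tweets_per_second, num_terms, page_size=100,
                           requests_per_second=1.0):
    """Tweets lost between two polls of one term at a steady Tweet rate."""
    interval = polling_interval_seconds(num_terms, requests_per_second)
    produced = tweets_per_second * interval
    return max(0.0, produced - page_size)


# One term during a spike of 150 Tweets per second: 50 are lost every second.
print(missed_tweets_per_poll(150, num_terms=1))
# 1,000 terms: each term is polled every 1,000 seconds, so even a modest
# 0.5 Tweets per second overflows the 100-result page between polls.
print(polling_interval_seconds(1000), missed_tweets_per_poll(0.5, num_terms=1000))
```

Under these assumptions, a term tracked alongside 999 others can only be captured in full if it averages no more than 0.1 Tweets per second.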
Because of these issues related to the monitoring use cases, data collection strategies relying exclusively on the Search API will frequently deliver poor coverage of Twitter data. Also, be forewarned, if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.
Although coverage is not complete, one great thing about the Twitter Search API is the complex operator capabilities it supports, such as Boolean queries and geo filtering. Some people opt to use the Search API to collect a sampling of Tweets that match their search terms precisely because of these operators. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).
So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)
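To give a feel for why deduplication matters when polling, here is a minimal Python sketch. It is not Gnip’s implementation: the function name is invented for illustration, and it assumes each result is a dictionary with a unique "id" field. Because consecutive polls for “the most recent results” usually overlap, a collector has to remember which Tweets it has already seen.

```python
def merge_new_tweets(seen_ids, polled_results):
    """Return only the results not seen in earlier polls.

    seen_ids is a set of Tweet IDs that is updated in place;
    polled_results is a list of dicts with an "id" key (an assumed shape).
    """
    fresh = []
    for tweet in polled_results:
        tweet_id = tweet["id"]
        if tweet_id not in seen_ids:
            seen_ids.add(tweet_id)
            fresh.append(tweet)
    return fresh


seen = set()
first_poll = merge_new_tweets(seen, [{"id": 1}, {"id": 2}])
# The second poll overlaps with the first; only Tweet 3 is new.
second_poll = merge_new_tweets(seen, [{"id": 2}, {"id": 3}])
print([t["id"] for t in first_poll], [t["id"] for t in second_poll])
```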
But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full coverage Twitter firehose filtering. (See our Power Track feed if you’d like more info on that.)
Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…
Understanding Twitter’s Public APIs . . . You Mean There is More than One?
In fact, there are three Twitter APIs: the REST API, the Streaming API, and the Search API. Within the world of social media monitoring and social media analytics, we need to focus primarily on the latter two.
Whether you get your Twitter data from the Search API, the Streaming API, or through Gnip, only public statuses are available (and NOT protected Tweets). Additionally, before Tweets are made available to both of these APIs and Gnip, Twitter applies a quality filter to weed out spam.
So now that you have a general understanding of Twitter’s APIs . . . stay tuned for Part 2, where we will take a deeper dive into understanding Twitter’s Search API, coming next week…
Sometimes we’re asked why it makes sense to access social media data from Gnip and not through direct access to the publicly accessible APIs. (We usually get this question from people who have never tried to access data from various social media APIs; those who have tried it understand how tedious and time-intensive data collection is and they can’t wait to hand their social data collection over to Gnip to manage for them.)
So, if you’ve never tried collecting data from multiple social media APIs at once… why would you use Gnip instead of connecting directly to the publicly accessible APIs? Here are 10 of the reasons…
#10 – Customer Support
The development teams behind most public APIs are often busy, so they’re tough (if not impossible) for most developers to reach with questions. At Gnip, we actually want to talk to you. We offer enterprise-level support so clients can contact us at all odd hours and receive a thoughtful, thorough response. And we work closely with a variety of sources, so we can reach out to them directly if necessary.
#9 – Reliability
Public APIs are not contractually guaranteed; data availability and access levels may change at any time, with or without warning to users. Many businesses worry about building their businesses on data that doesn’t come with contractual agreements. When you subscribe to premium data such as the premium Twitter feeds available through Gnip, we provide you with a formal agreement. This locks in your access level, price, service, and terms of use for the duration of your agreement.
#8 – Rate limit recommendations
Instead of having to figure out rate limits for the various sources on your own, Gnip can recommend rate limits based on our own extensive experience with the various APIs.
#7 – Delivery in your protocol of choice: never poll for data again
A lot of developers think polling for data is tedious… and unfortunately, most APIs are polling-based. So if you go to the sources directly, you have to poll their servers for the data. By using Gnip, you can choose between polling for your data and having your data streamed to you.
#6 – New feed setup in seconds
Without Gnip, it can take many hours (or days) of a developer’s time to set up a new API connection, parse the new feed, and start bringing data into your system. With Gnip, it can take as little as 30 seconds and no dev effort at all to start consuming the data.
#5 – Gnip is the only source for some data
Gnip can offer access to some data that’s not available from any other source (e.g., premium Twitter volume-based feeds like our Decahose and Halfhose).
#4 – Established premium data partnerships
Established partnerships with premium data publishers (Twitter, BackType, WordPress, etc.) make it quick and easy for Gnip customers to test and add premium data feeds.
#3 – Established relationships with all publishers
Because we manage data collection for customers all day every day, we’re among the earliest to know when API changes happen and the fastest to make any necessary changes to keep your data flowing.
#2 – APIs are generally hard to manage
Publishers change their APIs sometimes. Some APIs change frequently and without warning or documentation (cough, Facebook, cough) while others change less frequently. But no matter what, change is inevitable. Gnip manages your social media data delivery over time so you can keep your data flowing smoothly and reliably with minimal effort.
#1 – Enrichments
A variety of enrichments, or added metadata and features, come included with feeds delivered through Gnip data collectors. Some of the most popular enrichments include format normalization across sources (so you only have to write one parser for all your social media data), Klout Score inclusion (currently available for premium Twitter feeds), and language detection and filtering via a proprietary Gnip algorithm. We add enrichments all the time, so look for lots more to come.
We think Gnip is pretty cool (yes, we’re biased)… but even we know that Gnip isn’t for everyone. If you only need 1 feed from 1 source, the data you need is available through a publicly accessible API, you have an engineer who can monitor and optimize your data consumption regularly, and you’re certain that you will never need any other feeds forever and ever, then Gnip probably isn’t the right choice for you.
But if you’d like to ensure you’re receiving top-quality premium data access without requiring your engineering team to invest lots of time in data collection, we’d like to invite you to give Gnip a try. We’ve got lots of happy customers already and we just might prove valuable to you, too.
In particular, here are a few Gnip client libraries that happy customers have developed and shared with us. We’ll be posting them in our Power Track documentation and you can also find them linked here:
Java
by Zauber
https://github.com/zaubersoftware/gnip4j
PHP
by Socialping
https://github.com/socialping/Gnip-Power-Track-PHP-Classes
Python
by General Sentiment
https://github.com/vkris/gnip-python/blob/master/streamingClient.py
If you’ve developed a library for access to Gnip data and you’d like to share it with us at Gnip and other Gnip customers, then drop us a note at info@gnip.com. We’d love to hear from you.
We’re excited to announce two new enrichments today: we’ve partnered with Klout to deliver influence score data and we’ve enabled filtering by languages on our Twitter firehose-based premium data feeds. Combined with Gnip’s other enrichments (format normalization, URL expansion, etc.), we hope you’ll find it easier than ever to filter your Twitter feeds to precisely the data you want. (See all Gnip Enrichments)
Our latest partner, Klout, is known as “the standard for influence.” Our friends there analyze Twitter and other social media data to determine how influential (or not) different Twitter users are and assign “Klout Scores” to them accordingly. (Last we checked, @gnip’s Klout Score was 41 on Klout’s scale of 1 to 100.) Klout is a Gnip customer as well, so we’re particularly pleased to work with them to bring Klout Score metadata to other Gnip customers and share the love.
Now when you access premium Twitter data through Gnip, you can opt to have each user’s Klout Score appended to their Tweets. Klout filtering capabilities are also available via Gnip — for example, when you use our Power Track feed, you can choose to receive Tweets only from users whose Klout Score exceeds a certain number. Although Klout data has been available upon request to existing Gnip customers for some time, today marks the official start of our partnership and Klout enrichment on Gnip feeds. Welcome to the family, Klout!
Our other new enrichment feature today, language filtering, has long been requested by a wide variety of Gnip customers (our international clients in particular!). Starting today, language filtering is also available on Gnip’s premium Twitter feeds for 11 languages: English, Dutch, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Spanish, and Swedish (with more to follow).
To filter for English Tweets only, for instance, just append “lang:EN” to each relevant rule you’re querying. You can also enter “lang:EN” as a rule on its own if you’d like to receive all Tweets that our algorithm has identified as English language Tweets. Our language filtering option is based on our recently announced language metadata, built from the open sourced JTCL, using n-gram frequencies to categorize Tweets into given languages.
With these two new filtering capabilities you can construct a whole new class of streams using Power Track, such as:
Although Klout Scores and language filtering are only available on premium Twitter feeds so far, many of Gnip’s data enrichments come included with every Gnip Data Collector. Contact us to learn more or try Gnip’s enrichments for yourself.
Today we’re excited to announce language enrichment for the Decahose, Power Track, and other commercial Twitter feeds. In the past few years, many of our customers have asked us how they can identify which Tweets come from which language. Starting today, you can use Gnip’s enrichments to easily identify the language of your Tweets.
For instance, if you’re using Power Track to find all Tweets matching “Coca Cola,” now you can identify which of those are written in which language. We’re starting with support for eight languages: English, Spanish, German, French, Italian, Dutch, Portuguese, and Swedish, as available according to our confidence level for each language. You can expect more language support from Gnip in the coming weeks.
Starting from the open sourced JTCL, we’re using n-gram frequencies to categorize a Tweet into a given language. We’re thoroughly impressed with the accuracy levels thus far.
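For the curious, the general idea of categorizing text by n-gram frequencies can be shown with a toy Python sketch. This is not Gnip’s algorithm and not JTCL’s code: the training sentences are tiny made-up samples, the function names are invented, and scoring by cosine similarity over character trigrams is an assumption chosen to keep the example short.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count the character trigrams in a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Tiny illustrative training samples; a real system would use large corpora.
TRAINING = {
    "EN": "the quick brown fox jumps over the lazy dog",
    "ES": "el rápido zorro marrón salta sobre el perro perezoso",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in TRAINING.items()}

print(detect_language("the quick brown fox", PROFILES))
print(detect_language("el perro", PROFILES))
```

With so little training text the sketch only recognizes phrases close to its samples; short, slangy Tweets are much harder, which is why a confidence level per language matters in practice.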
We’re excited about the use cases this enables across the industry. We know many of our friends are rapidly adopting Twitter (hello Japan!) and we’re glad to start providing better support for these global conversations.
Language enrichment is the first step toward a powerful language filtering capability for Twitter and Gnip’s 30+ other sources. If you’d like to try Twitter firehose filtering and language enrichment or to request support for your particular language, send us a note and say… ciao.
Social media is popular — no surprise there. And as a result, there’s a huge amount of social media data in the world and every day the pool of data grows… not just a little bit, but enormously. For instance, just recently our partner Twitter blogged about their business growth and the numbers are staggering.
This social conversation data is valuable. Someday it will yield insights worth many millions, perhaps billions, of dollars for businesses. But the analyses and insights are only barely beginning to take shape. We hear from social media analytics companies every day and we see lots of interesting applications of this data. So… how can social media data be used? Here’s a partial list of social data applications that I believe will begin to take shape over the next decade or so:
All of these projects can be built on public social media conversation data that’s legally and practically accessible. All of the necessary data is (or is on the roadmap to be) accessible via Gnip. But access to the data is only step one — the next step is building great algorithms and applications to draw insights from that data. We leave that part to our customers.
So, here’s to the analysts who are working with huge social data sets to bring social data analyses and insights to fruition and ultimately make the barrage of public data that surrounds us increasingly useful. Here at Gnip we’re grateful for your efforts and eager to find out what you learn.
Gnip is hiring. So you might wonder… what’s it really like to work at Gnip?
For me, working at Gnip means working with people I trust and like who bring good ideas, logical thought, hard work, and diverse experiences to the conversation. It means working to grow a product I believe is good for the world; a product that facilitates access to valuable information.
It’s the feeling of energy and responsibility that come with working at a company that’s seminal in our industry and growing quickly. It’s the spirit of hard work and encouragement, but also fun, that our team collectively strives for.
It’s the startup factor that allows for automatic approval for any tool that might make it easier for me to get my job done, whether that’s purchasing software or midday coffee at The Cup. It’s being encouraged to innovate and having the latitude to solve problems in new ways.
So come join us. Our door is open and we’d love to hear from you.
Power Track offers powerful filtering of the Twitter firehose, guaranteeing 100% Tweet delivery. For instance, filter by keyword or username to access all Tweets that match the criteria you care about and have all of the matching results delivered to you in realtime via API. Power Track supports Boolean operators, can match your filtering criteria even within expanded URLs, and has no query volume or traffic limitations, helping you access all of the data you want. And it’s only available from Gnip, currently the only authorized distributor of Twitter data via API.
The licensing fee for Power Track is $.10 per 1,000 Tweets, but we’re waiving that fee for the first 25 million Tweets in 60 days for Power Track customers who sign up by February 28th. 1-year agreement and Gnip data collector fee still required.
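To put the offer in numbers, here is a small Python sketch of the licensing arithmetic. It only models the per-Tweet licensing fee stated above; the function name is made up, and the Gnip data collector fee, the 60-day window, and the 1-year agreement are not modeled.

```python
from decimal import Decimal

# Licensing fee stated above: $0.10 per 1,000 Tweets.
FEE_PER_THOUSAND = Decimal("0.10")


def licensing_fee(tweet_count, waived_tweets=0):
    """Licensing fee in dollars for tweet_count Tweets, after subtracting
    any waived Tweets. Other fees are deliberately left out."""
    billable = max(0, tweet_count - waived_tweets)
    return Decimal(billable) / 1000 * FEE_PER_THOUSAND


# Value of the waiver itself: 25 million Tweets at the standard rate.
print(licensing_fee(25_000_000))
# A customer consuming 30 million Tweets during the promotion pays only
# for the 5 million beyond the waived amount.
print(licensing_fee(30_000_000, waived_tweets=25_000_000))
```

At the stated rate, the waived 25 million Tweets correspond to $2,500 in licensing fees.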
Learn More or Contact Us to start testing Power Track for firehose filtering. Cheers!