Google+ Now Available from Gnip

Gnip is excited to announce the addition of Google+ to its repertoire of social data sources. Built on top of the Google+ Search API, Gnip’s stream allows its customers to consume realtime social media data from Google’s fast-growing social networking service. Using Gnip’s stream, customers can poll Google+ for public posts and comments matching the terms and phrases relevant to their business and client needs.

Google+ is an emerging player in the social networking space and a great pairing with the Twitter, Facebook, and other microblog content currently offered by Gnip. If you are looking for volume, Google+ became the third largest social networking platform within a week of its public launch, and some are projecting it to emerge as the world’s second largest social network within the next twelve months. Looking to consume content from social network influencers? Google+ is where they are (even former Facebook President Sean Parker says so).

By consuming a stream of Google+ data through Gnip (alongside an abundance of other social data sources), you’ll have access to a normalized data format, unwound URLs, and data deduplication. Existing Gnip customers can seamlessly add Google+ to their Gnip Data Collectors (all you need is a Google API Key). New to Gnip? Let us help you design the right solution for your social data needs: contact sales@gnip.com.

Guide to the Twitter API – Part 3 of 3: An Overview of Twitter’s Streaming API

The Twitter Streaming API is designed to deliver limited volumes of data via two main types of realtime data streams: sampled streams and filtered streams. Many users like the Streaming API because its streaming delivery means the data arrives closer to realtime than it does from the Search API (which I wrote about last week). But the Streaming API wasn’t designed to deliver full-coverage results, and so it has some key limitations for enterprise customers. Let’s review the two types of data streams accessible from the Streaming API.

The first type of stream is “sampled streams.” Sampled streams deliver a random sampling of Tweets at a statistically valid percentage of the full 100% Firehose. The free access level to the sampled stream is called “Spritzer,” and Twitter currently has it set to approximately 1% of the full 100% Firehose. (You may have also heard of the “Gardenhose,” a randomly sampled 10% stream. Twitter used to provide some increased access levels to businesses, but announced last November that it is not granting increased access to any new companies and is gradually transitioning its current Gardenhose-level customers to Spritzer or to commercial agreements with resyndication partners like Gnip.)

The second type of data stream is “filtered streams.” Filtered streams deliver all the Tweets that match a filter you select (e.g., keywords, usernames, or geographical boundaries). This can be very useful for developers or businesses that need limited access to specific Tweets.

Because the Streaming API is not designed for enterprise access, however, Twitter imposes some restrictions on its filtered streams that are important to understand. First, the volume of Tweets accessible through these streams is limited so that it will never exceed a certain percentage of the full Firehose. (This percentage is not publicly shared by Twitter.) As a result, only low-volume queries can reliably be accommodated. Second, Twitter imposes a query limit: currently, users can query for a maximum of 400 keywords and only a limited number of usernames. This is a significant challenge for many businesses. Third, Boolean operators are not supported by the Streaming API like they are by the Search API (and by Gnip’s API). And finally, there is no guarantee that Twitter’s access levels will remain unchanged in the future. Enterprises that need guaranteed access to data over time should understand that building a business on any free, public APIs can be risky.
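For a sense of what the keyword limit means in practice, here is a minimal sketch of building the `track` parameter for a filtered-stream request while enforcing the 400-term cap. The endpoint URL is the historical Streaming API address (authentication and the connection itself are omitted); the helper name and constant are assumptions of the sketch, not Gnip or Twitter code.

```python
import urllib.parse

# Historical filtered-stream endpoint, shown for reference only; opening
# the connection also requires authentication, which is omitted here.
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"
MAX_TRACK_TERMS = 400  # keyword cap described in the post

def build_track_param(keywords):
    """Join keywords into the comma-separated `track` parameter,
    refusing rule sets that exceed the Streaming API's keyword cap."""
    if len(keywords) > MAX_TRACK_TERMS:
        raise ValueError(
            "Streaming API accepts at most %d track terms" % MAX_TRACK_TERMS)
    return urllib.parse.urlencode({"track": ",".join(keywords)})

print(build_track_param(["coca cola", "pepsi"]))  # track=coca+cola%2Cpepsi
```

Checking the limit client-side, before opening the stream, avoids burning a connection attempt on a request the API will reject anyway.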

The Search API and Streaming API are great ways to gather a sampling of social media data from Twitter. We’re clearly fans over here at Gnip; we actually offer Search API access through our Enterprise Data Collector. And here’s one more cool benefit of using Twitter’s free public APIs: those APIs don’t prohibit display of the Tweets you receive to the general public like premium Twitter feeds from Gnip and other resyndication partners do.

But whether you’re using the Search API or the Streaming API, keep in mind that those feeds simply aren’t designed for enterprise access. As a result, you’re using the same data sets available to anyone with a computer, your coverage is unlikely to be complete, and Twitter reserves the right to change the data accessibility or Terms of Use for those APIs at any time.

If your business dictates a need for full coverage data, more complex queries, an agreement that ensures continued access to data over time, or enterprise-level customer support, then we recommend getting in touch with a premium social media data provider like Gnip. Our complementary premium Twitter products include Power Track for data filtered by keyword or other parameters, and Decahose and Halfhose for randomly sampled data streams (10% and 50%, respectively). If you’d like to learn more, we’d love to hear from you at sales@gnip.com or 888.777.7405.

Guide to the Twitter API – Part 2 of 3: An Overview of Twitter’s Search API

The Twitter Search API can theoretically provide full coverage of ongoing streams of Tweets. That means it can, in theory, deliver 100% of Tweets that match the search terms you specify, almost in realtime. But in reality, the Search API is not intended to, and does not fully, support the repeated, constant searches that would be required to deliver 100% coverage.

Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now. Since the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full-coverage streams for monitoring and analytics use cases. To get data from the Search API, your system repeatedly asks Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results (for example, the “latest 100 Tweets”). If more than 100 matching Tweets have been created since the last time you sent the request, some of them will be lost.
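To make those polling mechanics concrete, here is a minimal sketch (hypothetical helper name, Tweet ids reduced to small integers) of the bookkeeping a Search API poller performs between requests: remember the newest id seen, and drop anything older from the next page. Note what this loop cannot do: Tweets that fell off the end of a full page are simply gone.

```python
def collect_new(page, since_id):
    """Filter one polled result page (newest first, capped at e.g. 100
    entries) down to Tweets newer than since_id; return them along with
    the updated since_id to use on the next request."""
    fresh = [tweet for tweet in page if tweet["id"] > since_id]
    new_since = max((tweet["id"] for tweet in fresh), default=since_id)
    return fresh, new_since

page = [{"id": 105}, {"id": 104}, {"id": 101}]  # one polled page
fresh, since_id = collect_new(page, since_id=101)
print([t["id"] for t in fresh], since_id)  # [105, 104] 105
```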

So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit of time is constrained by Twitter’s rate limits. Some queries are so popular (hello, “Justin Bieber”) that it can be impossible to make enough requests for that query alone to keep up with its stream. And this is only the beginning of the problem: no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.

Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. Usually there might be fewer than 100 matching Tweets per second. But if there’s a spike (say the term becomes a trending topic after a Super Bowl commercial), there will likely be more than 100 per second. If, because of Twitter’s rate limits, you’re only allowed to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.

Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.
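The arithmetic in the paragraph above is easy to sanity-check. The rate limit and per-term Tweet rate below are illustrative assumptions for the sketch, not published Twitter figures:

```python
# Back-of-the-envelope check of the 1,000-term polling scenario.
RATE_LIMIT_RPS = 1.0   # assumed: one Search API request per second
TERMS = 1000           # "Coca Cola" plus 999 other tracked terms
PAGE_SIZE = 100        # max results returned per request

seconds_between_polls = TERMS / RATE_LIMIT_RPS   # each term polled every 1000 s

tweet_rate = 0.2  # assumed Tweets/sec mentioning one term on an average day
arrived = tweet_rate * seconds_between_polls     # Tweets accruing between polls
missed = max(0.0, arrived - PAGE_SIZE)           # Tweets that never reach you

print(seconds_between_polls, arrived, missed)    # 1000.0 200.0 100.0
```

Even a modest one-Tweet-every-five-seconds term loses half its Tweets in this scenario, which is the coverage gap the paragraph describes.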

Because of these issues related to the monitoring use cases, data collection strategies relying exclusively on the Search API will frequently deliver poor coverage of Twitter data. Also, be forewarned, if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.

Although coverage is not complete, one great thing about the Twitter Search API is the complex operator capabilities it supports, such as Boolean queries and geo filtering; many people opt to use it to collect a sampling of Tweets matching their search terms for exactly those features. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).

So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)

But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full-coverage Twitter Firehose filtering. (See our Power Track feed if you’d like more info on that.)

Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…

Announcing Multiple Connections for Premium Twitter Feeds

A frequent request from our customers has been the ability to open multiple connections to Premium Twitter Feeds on their Gnip data collectors. Our customers have asked and we have delivered!

While multiple connections to standard data feeds have been available for quite some time, we have only allowed one connection to our Premium Twitter Feeds.  Beginning today you will be able to open multiple mirrored connections to Power Track, Decahose, Halfhose, and all of our other Premium Twitter Feeds.  This feature will be helpful when testing connections to your Gnip data collector in different environments (such as staging or review) without having an impact on your production connection.

You may be saying, “Sounds great, Gnip, but will I be charged the standard Twitter licensing fee for the same Tweet delivered across multiple connections?” The answer is no! You will pay a small flat fee per month for each additional connection. If you’re interested in adding Multiple Connections to your Premium Twitter Feed, please Contact Us.

Hidden Engineering Gotchas Behind Polling

I just spent a couple of days optimizing a customer’s data collection on a Gnip instance for a specific social media data source API. It had been a while since I’d done this level of tuning, and it reminded me of just how many variables must be considered when optimally polling a source API for data.

Requests Per Second (RPS) Limits

Most services have a rate limit that a given IP address (or API key/login) cannot break. If you hit an endpoint too hard, the API backs you off and/or blocks you. Don’t confuse RPS with concurrent connections, however; they’re measured differently, and each has its own limitations for a given API. In this particular case I was able to parallelize three requests because the total response time per request was ~3 seconds. The result was that a given IP address was not violating the API’s RPS limitations. Had the API been measuring concurrent connections, that would have been a different story.
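That parallelization math can be sketched as follows. The three-request, ~3-second figures come from the case described above; the helper names are made up for illustration, and the simulated round trip is shortened so the example runs quickly:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def effective_rps(workers, response_time):
    """Steady-state requests per second when `workers` requests are kept
    in flight and each round trip takes `response_time` seconds."""
    return workers / response_time

# With ~3 s responses, three parallel requests still average 1 request/second,
# staying under a 1 RPS cap while tripling throughput versus serial polling.
print(effective_rps(3, 3.0))  # 1.0

def fake_request(i):
    time.sleep(0.05)  # stand-in for a ~3 s API round trip
    return i

# The three requests overlap, so wall time is ~one round trip, not three.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_request, range(3)))
print(results)  # [0, 1, 2]
```

The same three in-flight requests would, of course, count as three concurrent connections, which is why the two limits must not be conflated.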

Document/Page/Result-set Size

Impacting my ability to parallelize my requests was the document size I was requesting of the API. Smaller document sizes (e.g., 10 activities instead of 1,000) mean faster response times, which, when parallelized, run the risk of violating the RPS limits. On the other hand, larger document sizes take more time to retrieve, whether because they’re simply bigger and take longer to transfer over the wire, or because the API you’re accessing takes a long time to assemble the document on the backend.

Cycle Time

The particular API I was working with was a “keyword”-based API, meaning that I was polling for search terms/keywords. In Gnip parlance we call these terms or keywords “rules” in order to generalize the terminology. A rule-set’s “cycle time” is how long it takes a Gnip Data Collector to poll a given rule-set once. For example, if a rule-set’s size is 1,000 and the API’s RPS limit is 1, that rule-set’s cycle time would be 1,000 seconds; every 1,000 seconds, each rule in the set has been polled. Obviously, the cycle time increases if the server takes longer than a second to respond to each request.
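The cycle-time relationship above can be written as a small helper. The serial-polling assumption (one request in flight, paced by whichever is slower, the RPS cap or the server) is mine:

```python
def cycle_time_seconds(ruleset_size, rps_limit, response_time=0.0):
    """Seconds needed to poll every rule in the set once, assuming requests
    are issued serially, paced by the slower of the RPS cap or the server."""
    per_request = max(1.0 / rps_limit, response_time)
    return ruleset_size * per_request

# The example from the text: 1,000 rules at 1 RPS -> 1,000 s per cycle.
print(cycle_time_seconds(1000, rps_limit=1))                     # 1000.0
# A server that takes 2 s per response doubles the cycle time.
print(cycle_time_seconds(1000, rps_limit=1, response_time=2.0))  # 2000.0
```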

Skipping (missing data)

A given rule “skips” data during polling (meaning you will miss data because you’re not covering enough ground) when one of the following conditions is true, where ARU (activity update rate) is the rate at which activities/events occur for the given rule (e.g., the number of times per second someone uploads a picture with the tag “foo”):

  • ARU is greater than the RPS limit (RPS represented as 1/RPS) multiplied by the document size.
  • ARU is greater than the rule-set’s cycle time
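One way to read the conditions above (my interpretation, since a rate and a time must be compared in the same units) is: a rule skips whenever more activities accrue during one polling cycle than a single result page can return. As a sketch, with hypothetical names:

```python
def will_skip(aru, cycle_time, document_size):
    """True when more activities accrue for a rule during one polling cycle
    (aru activities/sec * cycle_time sec) than one result page can return,
    so some activities are never retrieved."""
    return aru * cycle_time > document_size

# 0.5 activities/sec polled every 1,000 s with 100-activity pages:
# 500 activities arrive but only 100 can be fetched, so the rule skips.
print(will_skip(0.5, 1000, 100))   # True
# A quieter rule (0.05/sec) accrues only 50 per cycle and is safe.
print(will_skip(0.05, 1000, 100))  # False
```

This is why the variables trade off against each other: shrinking the cycle time or growing the document size both push a rule back under the skip threshold.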

In order to optimally collect the data you need, in a timely manner, you have to balance all of these variables, and adjust them based on the activity update rate for the rule-set you’re interested in. While the variables make for engaging engineering exercises, do you want to spend time sorting these out, or spend time working on the core business issues you’re trying to solve? Gnip provides visibility into these variables to ensure data is most effectively collected.

What's Up.

A few weeks have passed since we made some major product direction/staffing/technology-stack changes at Gnip. Most of the dust has settled, and here’s an update.

What Changed Externally

api.gnip.com is alive, well, and fully supported. From a product standpoint, we’re now also pursuing a decentralized data access model to broaden our offering. The original centralized product continues to serve its customers well, but it doesn’t fit all the use cases we want to nail. It turns out that while many folks want to be completely hands-off WRT how their data is collected (“just get me the data”), they still want full transparency into, and control of, the process. That transparency and control is on its way via Gnip Data Collectors that customers configure through an easy-to-use GUI.

To summarize, externally, you will soon see additional product offerings/approaches around data movement.

What Changed Internally

A lot. api.gnip.com is a phenomenal message bus that can reliably filter and move data from A to B at insane volumes. To achieve this, however, we left a few things by the wayside that we realized we couldn’t leave there any longer. Customer demand and internal Product direction needs (obviously coupled with customer needs) were such that we needed to approach the product offering from a different technical angle.

GUI & Data

We neglected a non-trivial tier of our customer base by almost exclusively focusing on the REST API to the system. Without the constraint of a GUI, technical/architectural/implementation decisions that come with building software were blinded by “the backend.” As a result, we literally cut our data off from the GUI tier. Getting data into the GUI was like raising the Titanic; doable, but hard and time consuming. Too hard for what we needed to do as a business. We’d bolted the UI framework onto the side, and customized how everything moved in/out of the core platform to the GUI layer. We weren’t able to keep up with Product needs on the GUI side.

Statistics

Similar to the GUI, getting statistics out of the system in a consumer-friendly manner was too hard. Businesses have become accustomed to running SQL queries to collect information/statistics. While one can bolt SQL interfaces onto customized systems, you have to ask yourself whether you really want to. What if you started with something that natively spoke SQL?

So…

We introduced a stack that supports a decentralized data collection approach, as well as off-the-shelf GUI, statistics collection/display, and SQL interface; “Cloud” instances, running Linux (obviously), MySQL, and Rails. We have prototypes up and running internally, and things are going great.

Product Details

I’ve been vague here on purpose. We’re still honing all the features, capabilities, and market opportunities in front of us, and I don’t want to commit to them right now.

The People

I want to end on a personal note. My mind was blown by the people we decided to “let go” in this process; all of them incredibly high quality.

All I can say here is that it’s all in the people. You build teams that meet the needs of the business. For the sand that shifted, Eric and I are to blame. We undoubtedly burned bridges with amazing people during this process, and that is excruciating. Those no longer with us are great, and all of them have either already jumped into new projects/companies, or are weighing their options. The best of luck to you, and I hope to work with you again someday.