Handling High-Volume, Realtime, Big Social Data

The social ecosystem has become the pulse of the world. From delivering breaking news like the death of Osama Bin Laden before it hit mainstream media to helping President Obama host the first Twitter Town Hall, the realtime social web is flooded with valuable information just waiting to be analyzed and acted upon. With millions of users and billions of social activities passing through the ever-growing realtime social web each day, it is no wonder that companies need to reevaluate their traditional business models to take advantage of this big social data.But with the exponentially ever-growing social web, massive amounts of data are pouring into and out of social media publishers’ websites and APIs every second. In a talk I gave at GlueCon a couple of months ago, I ran down some math to put things into perspective. The numbers are a little dated, but the impact is the same. At that time there were approximately 155,000,000 Tweets per day and the average size of a Tweet was approximately 2,500 Bytes (keep in mind this could include Retweets).

A Little Bit of Arithmetic

155,000,000 Tweets/day   2,500 Bytes = 387,500,000,000 Bytes/day

387,500,000,000 Bytes/day  24 Hours = 16,145,833,333 Bytes/hour

16,145,833,333 Bytes/hour 60 minutes = 269,097,222 Bytes/minute

269,097,222 Bytes/minute 60 second = 4,484,953 Bytes/second

4,484,953 Bytes/second  1,048,576 Bytes/megabyte = 4.2 Megabytes/second

And in terms of data transfer rates . . .

1 Megabyte/second = 8 Megabits/second

So . . .

4.2 Megabytes/second  8 Megabits/Megabyte = 33.8 Megabits/second

That’s a Lot of Data

So what does this mean for the data consumers, the companies wanting to reevaluate their traditional business models to take advantage of vast amounts of Twitter data? At Gnip we’ve learned that some of the collective industry data processing tools simply don’t work at this scale: out-of-the-box HTTP servers/configs aren’t sufficient to move the data, out-of-the-box config’d TCP stacks can’t deliver this much data, and consumption via typical synchronous GET request handling isn’t applicable. So we’ve built our own proprietary data handling mechanisms to capture and process mass amounts of realtime social data for our clients.

Twitter is just one example. We’re seeing more activity on today’s popular social media platforms and a simultaneous increase in the number of popular social media platforms. We’re dedicated to seamless social data delivery to our enterprise customer base and we’re looking forward to the next data processing challenge.

Guide to the Twitter API – Part 2 of 3: An Overview of Twitter’s Search API

The Twitter Search API can theoretically provide full coverage of ongoing streams of Tweets. That means it can, in theory, deliver 100% of Tweets that match the search terms you specify almost in realtime. But in reality, the Search API is not intended and does not fully support the repeated constant searches that would be required to deliver 100% coverage.Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now. Since the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full coverage streams for monitoring and analytics use cases.  To get data from the Search API, your system may repeatedly ask Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results to the request (for example “latest 100 Tweets”). If there have been more than 100 Tweets created about a search query since the last time you sent the request, some of the matching Tweets will be lost.

So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number or requests you’re allowed to make per unit time is constrained by Twitter’s rate limits. Some queries are so popular (hello “Justin Bieber”) that it can be impossible to make enough requests to Twitter for that query alone to keep up with this stream.  And this is only the beginning of the problem as no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.

Let’s consider a couple examples to clarify.  First, say you want all Tweets mentioning “Coca Cola” and only that one term. There might be fewer than 100 matching Tweets per second usually — but if there’s a spike (say that term becomes a trending topic after a Super Bowl commercial), then there will likely be more than 100 per second. If because of Twitter’s rate limits, you’re only allowed to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.

Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.

Because of these issues related to the monitoring use cases, data collection strategies relying exclusively on the Search API will frequently deliver poor coverage of Twitter data. Also, be forewarned, if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.

Although coverage is not complete, one great thing about the Twitter Search API is the complex operator capabilities it supports, such as Boolean queries and geo filtering. Although the coverage is limited, some people opt to use the Search API to collect a sampling of Tweets that match their search terms because it supports Boolean operators and geo parameters. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).

So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)

But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full coverage Twitter firehose filtering. (See our Power Track feed if you’d like for more info on that.)

Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…

Gnip. The Story Behind the Name

Have you ever thought “Gnip”. . . well that is a strange name for a company, what does it mean? As one of the newest members of the Gnip team I found myself thinking that very same thing. And as I began telling my friends about this amazing new start-up that I was going to be working for in Boulder, Colorado they too began to inquire as to the meaning behind the name.

Gnip, pronounced (guh’nip), got its name from the very heart of what we do, realtime social media data collection and delivery. So let’s dive in to . . .

Data Collection 101

There are two general methods for data collection, pull technology and push technology. Pull technology is best described as a data transfer in which the request is initiated by the data consumer and responded to by the data publisher’s server. In contrast, push technology refers to the request being initiated by the data publisher’s server and sent to the data consumer.

So why does this matter . . .

Well most social media publishers use the pull method. This means that the data consumer’s system must constantly go out and “ping” the data publisher’s server asking, “do you have any new data now?” . . . “how about now?” . . . “and now?” And this can cause a few issues:

  1. Deduplication – If you ping the social media server one second and then ping it again a second later and there were no new results, you will receive the same results you got one second ago. This would then require deduplication of the data.
  2. Rate Limiting – every social media data publisher’s server out there sets different rate limits, a limit used to control the number of times you can ping a server in a given time frame. These rate limits are constantly changing and typically don’t get published. As such, if your server is set to ping the publisher’s server above the rate limit, it could potentially result in complete shut down of your data collection, leaving you to determine why the connection is broken (Is it the API . . . Is it the rate limit . . . What is the rate limit)?

So as you can see, pull technology can be a tricky beast.

Enter Gnip

Gnip sought to provide our customers with the option: to receive data in either the push model or the pull model, regardless of the native delivery from the data publisher’s server. In other words we wanted to reverse the “ping” process for our customers. Hence, we reversed the word “ping” to get the name Gnip. And there you have it, the story behind the name!

Twitter XML, JSON & Activity Streams at Gnip

About a month ago Twitter announced they will be shutting off XML for stream based endpoints on Dec, 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you’ll be forced to get your head around something new. You’ll have to get smart, find the right JSON lib, change your code to use it (and any associated dependencies you weren’t already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.

Gnip’s format normalization shields you from all of this as it turns out. Gnip customers get to stay focused on delivering value to their customers. Others integrating directly, and consuming stream data from Twitter in XML, have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).

From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).

Fun.

Our Travel Schedule: Come Meet Us!

Some days we feel like a rock band. We go “on tour” regularly for client or publisher meetings and to attend conferences relevant to social data collection. In recent months, we’ve been to New York and San Francisco and attended TWTRCon and Enterprise 2.0 and Web 2.0 Expo, to name a few. We like to meet our fans — er, I mean, our customers — in person, so here’s our upcoming travel schedule…

===========================

October 18-19, 2010 – Washington, DC
Predictive Analytics World Conference (more info)

Labeled as “the business-focused event for predictive analytics professionals, managers and commercial practitioners,” we’ll be there! Drop us a note if you’re attending or if your business is in the DC area and you’d like to meet us.

November 17-18, 2010 – Denver, CO
Defrag Conference (
more info)
We’re a sponsor for Defrag, and we’d love to see you there. Register for Defrag and enter code “sponsor15” for a 15% registration discount.

November 17-18, 2010 – NY, NY
Client Visit Trip
While some of us are at Defrag, our Director of Business and Strategy will be in the Big Apple. If you’re in the City and you’d like to meet up, just drop him a note (rob@gnip.com).

We don’t have many business trips planned over the holidays, but we are already getting excited about a conference coming up in 2011…

February 1-3, 2011 – Santa Clara, CA
O’Reilly’s Strata Conference: The Business of Data (more info)
We’re so excited about a conference all about The Business of Data (that’s us!) that we’ll be a Strata sponsor. We hope to see lots of businesses interested in social media data analysis there, so we’d love to hear about it if you expect to attend.

===========================

Please drop us a note at info@gnip.com if we’ll be in your neck of the woods anytime soon. It will be our pleasure to meet you.

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

Hidden Engineering Gotchas Behind Polling

I just spent a couple of days optimizing a customer’s data collection on a Gnip instance, for a specific social media data source API. It had been awhile since I’d done this level of tuning, and it reminded me of just how many variables must be considered when optimally polling source API for data.

Requests Per Second (RPS) Limits

Most services have a rate limit that an given IP address (or API key/login) cannot break. If you hit an endpoint too hard, the API backs you off and/or blocks you. Don’t confuse RPS with concurrent connections however; they’re measured differently and each has its own limitations for a given API. In this particular case I was able to parallelize three requests because the total response time per request was ~3 seconds. The result was that a given IP address was not violating the API’s RPS limitations. Had the API been measuring concurrent connections, that would have been a different story.

Document/Page/Result-set Size

Impacting my ability to parallelize my requests was the document size I was requesting of the API. Smaller document sizes (e.g. 10 activities instead of 1000) meant faster response times, which when parallelized, run the risk of violating the RPS limits. On the other hand, larger document sizes take more time to get; whether because they’re simply bigger and take longer to transfer over the wire, or because the API you’re accessing is taking a long time to assemble the document on the backend.

Cycle Time

The particular API I was working with was a “keyword” based API, meaning that I was polling for search terms/keywords. In Gnip parlance we call these “terms” or “keywords,” “rules” in order to generalize the terminology. A rule-set’s “cycle time” is how long it takes a Gnip Data Collector to poll for a given rule-set once. For example, if a rule-set size is 1,000, and the API’s RPS limit is 1, that rule-set’s cycle time would be 1,000 seconds; every 1k seconds, each rule in the set has been polled. Obviously, the cycle time would increase if the server took longer than a second to respond to each requests.

Skipping (missing data)

A given rule “skips” data during polling (meaning, you will miss data because you’re not covering enough ground) when one of the following conditions is true. ARU (activity update rate) is the rate at which activities/events occur on the given rule (e.g. the number of times per second someone uploads a picture with the tag “foo”)

  • ARU is greater than the RPS limit (RPS represented as 1/RPS) multiplied by the document size.
  • ARU is greater than the rule-set’s cycle time

In order to optimally collect the data you need, in a timely manner, you have to balance all of these variables, and adjust them based on the activity update rate for the rule-set you’re interested in. While the variables make for engaging engineering exercises, do you want to spend time sorting these out, or spend time working on the core business issues you’re trying to solve? Gnip provides visibility into these variables to ensure data is most effectively collected.

How to Select a Social Media Data Provider

If you’re looking for social media data, you’ve got a lot of options: social media monitoring companies provide end-user brand tracking tools, some businesses provide deep-dive analyses of social data, other companies provide a reputation scores for individual users, and still other services specialize in geographic social media display, to name just a few. 

Some organizations ultimately decide to build internal tools for social media data analysis. Then they must decide between outsourcing the social data collection bit so they can focus their efforts on analyzing and visualizing the data, or building everything — including API connections to each individual publisher — internally. Establishing and maintaining those API connections over time can be costly. If your team has the money and resources to build your own social media integrations, then go for it!

But if you’re shopping for raw social media data, you should consider a social media API – that is, a single API that aggregates raw data from dozens of different social media publishers – instead of making connections to each one of those dozens of social media APIs individually. And in the social media API market, there is only a small handful of companies for you to choose from. We are one of them and we would love to work with you. But we know that you’ll probably want to shop your options before making a decision, so we’d like to offer our advice to help you understand some of the most important factors in selecting a social media API provider.

Here are some good questions for you to ask every social media API solution you consider (including your own internal engineers, if you’re considering hiring them for the job):

Are your data collection methods in compliance with all social media publishers’ terms of use?

–> Here’s why it matters: by working with a company that violates any publisher’s terms of use, you risk unstable (or sudden loss of) access to violated publisher’s data — not to mention the potential legal consequences of using black market data in your product. Conversely, if you work with a company that has a strong relationship with the social media publishers, our experience shows that you not only get stable, reliable data access, but you just might get rewarded with *extra* data access every now and then. (In case you’re wondering, Gnip’s methods are in compliance with each of our social media publishers’ terms of use.)

Do you provide results and allow parameter modifications via API, and do you maintain those API connections over time?

–> In our experience, establishing a single API connection to collect data from a single publisher isn’t hard. But! Establishing many API connections to various social media publishers and – this is key – maintaining those connections over time is really quite a chore. So much so, we made a whole long list of API-related difficulties associated with that integration work, based on our own experiences. Make sure that whoever you work with understands the ongoing work involved and is prepared to maintain your access to all of the social media APIs you care about over time.

How many data sources do you provide access to?

–> Even if you only want access to Twitter and Facebook today, it’s a good idea to think ahead. How much incremental work will be involved for you to integrate additional sources a few months down the line? Our own answer to this question is this: using Gnip’s social media API, once you’re set up to receive your first feed from Gnip via API, it takes about 1 minute for you to configure Gnip to send you data from a 2nd feed. Ten minutes later, you’re collecting data from 10 different feeds, all at no extra charge. Since you can configure Gnip to send all of your data in one format, you only need to create one parser and all the data you want gets streamed into your product. You can even start getting data from a new social media source, decide it’s not useful for your product, and replace it with a different feed from a different source, all in a matter of seconds. We’re pretty proud that we’ve made it so fast and simple for you to receive data from new sources… (blush)… and we hope you’ll find it to be useful, too.

What format is your data delivered in?

–> Ten different social media sources might provide data in 10 different formats. And that means you have to write 10 different parsers to get all the data into your product. Gnip allows you to normalize all the social media data you want into one single format — Activity Streams — so you can collect all your results via one API and feed them into your product with just one parser.

Hope this helps! If you’ve got additional questions to suggest for our list, don’t hesitate to drop us a note. We’d love to hear from you.

Recent Hole In Your Facebook Data Coverage?

Recent changes to Facebook’s crawling policies have had a detrimental impact on some of our customers who were relying on other (black/grey market) third parties for their Facebook social media data collection strategy. I wrote about this dynamic and its potential fallout last month. These recent events underscore the risk and reality around relying on black market data in your business.

Gnip works with publishers, like Facebook and Twitter, to ensure our software is in clear, transparent, compliance with their terms of use/service in order to drastically reduce risks like this. All of Gnip’s data collection is above board and is done respectfully of the policies set forth by the Publishers you want data from. When you use Gnip for your collection needs, you don’t have to live in fear of being “turned off” or sued over data you shouldn’t have.

For specifics on our Facebook integrations, checkout yesterday’s post.

Have no fear, Gnip’s here to help. Give our numerous Facebook and other social media feeds a try at http://try.gnip.com or give us a shout at info@gnip.com