Big Boulder: Creating the Social Data Ecosystem with Twitter

Ryan Sarver and Doug Williams of the Twitter platform discuss the launch of commercial public social data nearly two years ago and how the Twitter firehose has evolved.

Doug Williams and Ryan Sarver at Big Boulder

Ryan and Doug of Twitter have been there three years and describe the experience as “learning as you go.” What excited Doug was that, for the first time, there was an open data source on the web, and he became the API evangelist. Over the last six years, Twitter has grown to 140 million users. During that incredible growth phase, scaling has been tough for the company, yet having that many users has created many opportunities for people to build products on top of that social data.

Several years ago, Twitter decided to change how data was syndicated. Twitter was built to serve consumers, so at the time it was hard to take resources away from that to support the API. Two years ago, Twitter was fielding many requests but wasn’t able to provide enterprise-level support for those features. At the time, Joe Fernandez of Klout was in the same building and was making many of those requests, because Klout felt it could build much cooler features with access to more types of metadata. Wanting to focus on serving the consumers using the product while still supporting the API, Twitter decided to work with outside companies such as Gnip to deliver its social data with that enterprise-level support. Twitter chose a small number of companies to provide the data because it wanted to know where the data was going and how it was being used. Doug Williams called working with Gnip one of the most successful partnerships Twitter has ever had. It was also important to Twitter that providing companies with data create value in both directions: the companies represented at Big Boulder are helping to create a better audience, encouraging companies to invest time and resources into Twitter.

One of the most frequent questions Twitter is asked is whether they’re going to do analytics. Ryan and Doug said they will continue to build out baseline features, but they’ll rely on other companies to provide the features businesses need to do business and interact on Twitter. They want companies to get the analytics they need, and they recognize that the analytics other social media companies provide add value to Twitter. When people get the analytics they need, they understand the value Twitter provides, and that powers their decisions to use Twitter and advertise on the Twitter platform. They also said Twitter is firmly committed to providing Twitter data and is investing in that commitment by making sure companies receive all the tweets they need.

The conversation segued to the firehose. Gnip still receives requests for access to the entire firehose, but most companies are realizing they don’t really need it. Receiving the entire firehose can be prohibitive: it can be too much to consume, and expensive. Ryan and Doug said they will continue to license the firehose when it makes sense, but the overall trend is to serve those use cases less and less, with each request evaluated on a case-by-case basis. Twitter is committed to ensuring that businesses have a clear path to getting the social data they need, as they recognize businesses are being built around Twitter data.

Last week Twitter announced expandable tweets, or as they call it internally, In Tweet Media. This is an exciting advancement for the platform because it provides more options for choosing how your content on Twitter gets consumed. Publishers want control over how that social data gets syndicated. Since news frequently breaks on Twitter, publishers want to be able to tell their stories on the platform, allowing them to drive greater distribution with Twitter.

So exactly what is Twitter doing about spam? A lot. According to Ryan, the spam prevention team is one of the largest at Twitter, and they’ve made several acquisitions around it. As they pointed out, spam is still a problem even for email, which has been around for many years; it’s an ongoing battle they’ll have to fight for as long as the platform exists. Doug noted that when people test the Gnip platform, they often start a new account, and the tweet can then be marked as spam and not make it through. Doug said, to the delight of Jud and Chris, “messages marked as spam that you sent from a test account are Twitter’s issue, not Gnip’s issue.” The takeaway: don’t create a new account just to test Twitter.

Twitter now has 140 million users and 400 million tweets every couple of days. Yet the team still sees lots of room for growth and would love to see everyone with a phone using Twitter. In fact, during the Q&A they talked about how often Twitter is used during protests, and how much consideration they put into making sure Twitter can be used without a smartphone and that tweets are quickly delivered around the world. As they try to encourage more use, they recognize that much of the world knows about Twitter; the gap is helping people understand why they need to be part of the platform.

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, on Big Boulder on Storify, and on Gnip’s Facebook page.

Taming The Social Media Firehose, Part I

This is the first post in our series on what a social media “firehose” (e.g., a streaming API) is and what it takes to turn it into useful information for your organization. Here I outline some of the high-level challenges and considerations when consuming the social media firehose; in Parts II and III, I will give more practical examples.

Social Media Firehose

Why consume the social media firehose?

The idea of consuming large amounts of social data is to get small data: to gain insights and answer questions, to guide strategy and help with decision making. To accomplish these objectives, you are not only going to collect data from the firehose; you are also going to have to parse it, then scrub and structure it based on the analysis you will pursue. (If you’re not familiar with the term “parse,” it means machines are working to understand the structure and contents of the social media activity data.) This might mean analyzing text for sentiment, looking at the time series of the volume of mentions of your brand on Tumblr, following the trail of political reactions through the social network of commenters, or any of thousands of other possibilities.

What do we mean by a social media firehose?

Gnip offers social media data from Twitter, Tumblr, Disqus and Automattic (WordPress blogs) in the form of “firehoses.” In each case, the firehose is a continuous stream of flexibly structured social media activities arriving in near-real time. Consuming that sounds like it might be a little tricky. While the technology required to consume and analyze social media firehoses is not new, the synthesis of tools and ideas needed to successfully consume the firehose deserves some consideration.

It may help to start by contrasting firehoses with a more common way of looking at the API world: the plain-vanilla HTTP request and response. The explosion of SOAPy (Simple Object Access Protocol) and RESTful APIs has enabled the integration and functional ecosystem of nearly every application on the Web. At the core of web services is a pair of simple ideas: that we can leverage the simple infrastructure of HTTP requests (the biggest advantage may be that we can build on existing web servers, load balancers, etc.), and that scalable applications can be built on simple stateless request/response pairs exchanging bite-sized chunks of data in standard formats.

Firehoses are a little different in that, while we may choose to use HTTP for many of the reasons REST and SOAP did, we don’t plan to get responses in mere bite-sized chunks. With a firehose, we intend to open a connection to the server once and stream data indefinitely.
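To make that concrete, here is a minimal sketch of such a connection in Python, assuming a hypothetical newline-delimited JSON endpoint and placeholder credentials (this is not an official Gnip client):

```python
# A minimal sketch of a long-lived streaming HTTP connection.
# The endpoint URL and credentials are placeholders, not a real Gnip stream.
import json
import requests

STREAM_URL = "https://stream.example.com/streams/track/prod.json"

def handle_activity(raw_line):
    """Stand-in for real processing; here we just print the activity body."""
    activity = json.loads(raw_line)
    print(activity.get("body", ""))

with requests.get(STREAM_URL, auth=("user", "pass"), stream=True, timeout=90) as resp:
    resp.raise_for_status()
    # iter_lines() yields activities as they arrive; with a firehose, the
    # response body never "ends", so we never wait for it to complete.
    for line in resp.iter_lines():
        if line:  # keep-alive heartbeats arrive as empty lines
            handle_activity(line)
```

The single request stays open, and the body is processed incrementally rather than waited on as a whole.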

Once you are consuming the firehose, and (even more importantly) with some analysis in mind, you will choose a structure that adequately supports that approach. With any luck (more likely, smart people and hard work), you will end up not with Big Data, but rather with simple insights: simple to understand and clearly prescriptive for improving products, building stronger customer relationships, preventing the spread of disease, or any other outcome you can imagine.

The Elements of a Firehose

Now that we have a why, let’s zero in on consuming the firehose. Returning to the definition above, here is what we need to address:

Continuous. For example, the Twitter full firehose delivers over 300M activities per day. That is an average of roughly 3,500 activities/second, or 1 activity every ~290 microseconds. The WordPress firehose delivers nearly 400K activities per day. While this is a much more leisurely 4.6 activities/second, there still isn’t much time to sleep between activities arriving every 0.22 s. And if your system isn’t continuously pulling data out of the firehose, much can be lost in a short time.
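The back-of-the-envelope arithmetic behind those numbers, using the daily volumes quoted above:

```python
# Average rates and inter-arrival gaps implied by the daily volumes above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

for name, per_day in [("Twitter full firehose", 300_000_000),
                      ("WordPress firehose", 400_000)]:
    per_second = per_day / SECONDS_PER_DAY
    gap = 1.0 / per_second
    print(f"{name}: {per_second:,.1f} activities/s, one every {gap:.6f} s")

# Twitter full firehose: 3,472.2 activities/s, one every 0.000288 s (~290 microseconds)
# WordPress firehose: 4.6 activities/s, one every 0.216000 s (~0.22 s)
```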

Streams. As mentioned above, the intention is to make a firehose connection and consume the stream of social media activities indefinitely. Gnip delivers the social media stream over HTTP. The consumer of the data needs to build their HTTP client so that it can decompress and process the buffer without waiting for the end of the response. This isn’t your traditional request/response paradigm (that’s why we’re not called Ping; also, that name was taken).
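As a sketch of what that can look like, assuming the stream is sent with Content-Encoding: gzip and activities are newline-delimited JSON (endpoint and credentials are again placeholders), each chunk is decompressed as it arrives rather than buffering the whole response:

```python
# Incremental decompression of a streaming HTTP response.
# Assumes gzip-encoded, newline-delimited activities; URL/credentials are placeholders.
import zlib
import requests

STREAM_URL = "https://stream.example.com/streams/track/prod.json"

def process(raw_line):
    """Stand-in for real parsing/handling of one complete activity."""
    print(raw_line[:80])

decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS => gzip framing
buffer = b""

resp = requests.get(STREAM_URL, auth=("user", "pass"), stream=True,
                    headers={"Accept-Encoding": "gzip"})
resp.raise_for_status()

# Read raw compressed chunks and decompress them as they arrive; never wait
# for the end of the response, because it never ends.
for chunk in resp.raw.stream(8192, decode_content=False):
    buffer += decompressor.decompress(chunk)
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            process(line)
```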

Unstructured data. I prefer “flexibly structured,” because there is plenty of structure in the JSON- or XML-formatted activities contained in the firehose. While you can simply and quickly get to the data and metadata for each activity, you will still need to parse and filter it. You will also need to make choices about how to store activity data in the structure that best supports your modeling and analysis. It is not so much a question of which tool is good or popular, but rather of what question you want to answer with the data.
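For example, here is a sketch of parsing one JSON activity and keeping only the fields a particular analysis needs. The payload is abbreviated and invented; the field names loosely follow the Activity Streams convention, but treat the exact keys as illustrative:

```python
import json

# An abbreviated, invented activity in Activity Streams style (illustrative only).
raw = """{
  "id": "tag:search.twitter.com,2005:123456789",
  "verb": "post",
  "postedTime": "2012-06-20T16:40:00.000Z",
  "body": "Consuming the firehose at #BigBoulder",
  "actor": {"preferredUsername": "example_user", "followersCount": 42}
}"""

activity = json.loads(raw)

# Keep only what the analysis needs before deciding how to store it.
record = {
    "id": activity["id"],
    "posted": activity["postedTime"],
    "user": activity["actor"]["preferredUsername"],
    "text": activity["body"],
}
print(record)
```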

Time-ordered activities done by people. The primary structure of the firehose data is that it represents the individual activities of people rather than summaries or aggregations. The stream of data in the firehose describes activities such as the following (a short sketch of routing activities by type follows the list):

  • Tweets, micro-blogs
  • Blog/rich-media posts
  • Comments/threaded discussions
  • Rich media-sharing (urls, reposts)
  • Location data (place, long/lat)
  • Friend/follower relationships
  • Engagement (e.g. Likes, up- and down-votes, reputation)
  • Tagging
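Here is the routing sketch mentioned above: a small dispatcher that buckets incoming activities by type. The verbs and object types are illustrative; real payloads vary by publisher:

```python
# A sketch of routing activities by type (verbs/objectTypes are illustrative).
def route(activity):
    verb = activity.get("verb", "")
    obj_type = activity.get("object", {}).get("objectType", "")
    if verb == "post" and obj_type == "note":
        return "micro-blog post"
    if verb == "post" and obj_type == "article":
        return "blog post"
    if verb == "post" and obj_type == "comment":
        return "comment / threaded discussion"
    if verb in ("share", "like", "favorite"):
        return "engagement or sharing"
    if verb == "follow":
        return "friend/follower relationship"
    return "other"

print(route({"verb": "post", "object": {"objectType": "comment"}}))
```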

Real-time. Activities can be delivered soon after they are created by the user (this is referred to as low latency). (Paul Kedrosky points out that a ’70s station wagon full of DVDs has about the same bandwidth as the internet, but an inconvenient coast-to-coast latency of about 4 days.) Both bandwidth and latency are measures of speed. Many people know how to worry about bandwidth, but latency issues can really mess up real-time communications even if you have plenty of bandwidth. When consuming the Twitter firehose, it is common to see latency (measured as the time from tweet creation to parsing the tweet as it arrives from the firehose) of ~1.6 s, and as low as 300 milliseconds. WordPress posts and comments arrive, on average, 2.5 seconds after they are created.
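A sketch of measuring that latency yourself, using the activity’s creation timestamp (the postedTime field name and its format are assumptions for illustration):

```python
# Per-activity latency: time from creation (postedTime) to the moment we parse it.
from datetime import datetime, timezone

def latency_seconds(activity):
    posted = datetime.strptime(activity["postedTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
    posted = posted.replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - posted).total_seconds()

example = {"postedTime": "2012-06-20T16:40:00.000Z"}
print(f"latency: {latency_seconds(example):.1f} s")
```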

So there are a lot of activities and they are coming fast. And they never stop, so you never want to close your connection or stop processing activities.

However, in real life “indefinitely” is more of an ideal than a regular achievement. The stream of data may be interrupted by any number of variations in the network and server capabilities along the line between Justin Bieber tweeting and my analyzing what brand of hair gel teenaged girls are going to be talking their boyfriends into using next week.
We need to work around practicalities such as high network latency, limited bandwidth, running out of disk space, service provider outages, etc. In the real world, we need connection monitoring, dynamic shaping of the firehose, redundant connections and historical replay to get at missed data.
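Connection monitoring and reconnection are the first line of defense. A minimal sketch, assuming the same placeholder endpoint as above: reconnect with exponential backoff whenever the stream drops, and note the gap so a historical replay can fill it in later:

```python
# Keep a "forever" connection alive in practice: reconnect with exponential
# backoff when the stream drops. Endpoint, credentials, and handling are placeholders.
import time
import requests

STREAM_URL = "https://stream.example.com/streams/track/prod.json"

def handle(raw_line):
    print(raw_line[:80])

def consume_forever():
    backoff = 1
    while True:
        try:
            with requests.get(STREAM_URL, auth=("user", "pass"),
                              stream=True, timeout=90) as resp:
                resp.raise_for_status()
                backoff = 1  # healthy connection: reset the backoff
                for line in resp.iter_lines():
                    if line:
                        handle(line)
        except requests.RequestException as exc:
            # The window between disconnect and reconnect is what a historical
            # replay would later need to fill in.
            print(f"stream dropped ({exc}); retrying in {backoff}s")
            time.sleep(backoff)
            backoff = min(backoff * 2, 320)  # cap the wait

consume_forever()
```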

In Part II we make this all more concrete. We will collect data from the firehose and analyze it. Along the way, we will address particular challenges of consuming the firehose and discuss some strategies for dealing with them.


Customer Spotlight – Klout

Providing Klout Scores, a measurement of a user’s overall online influence, for every individual in the ever-growing base of Twitter users was the task at hand for Matthew Thomson, VP of Platform at Klout. With massive amounts of data flowing in every second, Thomson and Klout’s scientists and engineers needed a fast and reliable solution for processing, filtering, and eliminating data from the Twitter firehose that was unnecessary for calculating and assigning Twitter users’ Klout Scores.

“Not only has Gnip helped us triple our API volume in less than one month but they provided us with a trusted social media data delivery platform necessary for efficiently scaling our offerings and keeping up with the ever-increasing volume of Twitter users.”

- Matthew Thomson
VP of Platform, Klout

By selecting Gnip as their trusted premium Twitter data delivery partner, Klout tripled their API volume and increased their ability to provide influence scores for Twitter users by 50 percent, all in less than one month.

Get the full details: read the success story here.