The Social Cocktail, Part 1: Mixology

Gnip’s Chris Moody has been talking about the “Social Cocktail” recently, both at Strata and Big Boulder. At Gnip, we talk about the social cocktail a lot–mainly because people like cocktails. But it is also an apt metaphor for thinking about which social data is useful for our customers. What audiences and modes of conversation are needed to build out an understanding of your market, your customers, perceptions of your product and the evolution of your message?

The fundamental question it answers is: Why analyze social data from more than one publisher?

Each social media publisher brings distinct capabilities and audiences, and encourages unique ways for users to interact and express themselves. The overlap in audience between some publishers is low, so adding publishers helps broaden topic coverage and audience perspective. Microblogs (e.g. Twitter) are fast and concise, making it easier to tease out breaking stories and emerging conversations. Blog comments indicate engagement and controversy, and therefore point back to interesting blog posts, where the in-depth analysis is found. Votes and likes provide additional signals of reader engagement–indications of the quality and the pitch of conversation.

To get the right mix, it is essential to understand some of the properties of each publisher’s firehose.  In this post, we’ll look at high-level attributes of the social media publishers.  In the next post, we will dive into a brief description of the social media responses to expected and unexpected events. Finally, in the third post, we’ll end with an example of the social cocktail in examining a real-world event—the JPMorgan-Chase $2+ billion loss announcement in May 2012.

One revealing way to compare publishers is to understand their relative speed and content richness.  In this case, fast content means that a statistically relevant sample of activity arrives shortly after the event or topic happens in the real world.  “Shortly” can mean tweets follow the event by less than a minute for some topics on Twitter (e.g. earthquakes).  In contrast to the speediest media responses, posts about the 2008 banking crisis in major US banks are probably still being written in 2012 as we continue to examine and discuss bank regulation.

Another telling dimension for comparison is content richness. While Tweets are very fast, they are also concise. To be a rapid responder to an earthquake or other immediate event, you only have time for the barest facts because you have only 30 seconds to respond with 140 characters. “Just felt an earthquake in DC,” would be a typical response. On the other hand, a publisher such as Tumblr encourages rich media sharing with Spotify plugins, support for video and audio formats, very simple photo uploading and sharing capabilities and a tradition of the users appreciating and sharing creative and artistic photography.  Blogs on Automattic’s WordPress platform can range from 10s of words to 10,000, giving ample opportunity for a writer to explore subtle ideas and complex analyses.

A few properties of Social Media Firehoses explained


Table 1. Comparison of publisher ingredients.

Both Speed and Content Size can be quantified. We often use measures such as the time for the story activity to peak or the ½-life of a story to characterize speed. See Social Media Pulse for discussion and analytical details.
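As a rough sketch of how these two speed measures might be computed, assume the activity counts for a story have already been bucketed per minute. The function names and the sample counts are illustrative only and are not taken from Social Media Pulse.

```python
def time_to_peak(counts):
    """Index (e.g. minutes after the first bucket) of the busiest bucket."""
    return max(range(len(counts)), key=lambda i: counts[i])


def half_life(counts):
    """Buckets after the peak until volume first falls to half the peak.

    Returns None if the series ends before volume drops that far.
    """
    peak = time_to_peak(counts)
    half = counts[peak] / 2.0
    for i in range(peak + 1, len(counts)):
        if counts[i] <= half:
            return i - peak
    return None


# Made-up per-minute counts for a fast-moving story.
counts = [2, 40, 100, 80, 60, 45, 30, 20]
print(time_to_peak(counts), half_life(counts))  # 2 3
```

A fitted model gives smoother estimates than this direct read-off, which is sensitive to noise in any single bucket.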

Content richness can be simply characterized by the number of characters. While this is fairly indicative of the balance of rich vs. concise information in a stream for text content, it overlooks media such as audio, photos, interactive applications, video and music. Other measures of richness might include audience participation, user-network interactions, amount of back-and-forth in a conversation and many higher-level measurements from textual analysis.

Surprising and satisfying cocktails come from a careful mixing of quality and–sometimes unexpected–ingredients. With practice and the right combination of publishers, you can mix a social cocktail to enrich your understanding of the conversations between customers, prospects, partners and pundits. You want to experiment with the mix to match your business use case. Getting the mix just right is always rewarding.

Next week: The Social Media Cocktail, Part 2 – Expected vs. Unexpected Events

Taming The Social Media Firehose, Part III – Tumblr

In Part I, I discussed high-level attributes of the social media firehose. In Part II, I examined a single event by looking at activities from four firehoses for the earthquake in Mexico earlier this year. In Part III, I wrap up this series with some guidelines for using unique rich content from social media firehoses that may be less familiar. To keep it real, I use examples from the Tumblr firehose.

Since the Twitter APIs and firehoses have been available for years, you may be very familiar with many analysis strategies you can apply to the Twitter data and metadata.  I illustrated a couple of very simple ideas in the last post. With Twitter data and metadata, the opportunities to understand tweets in the context of time, timezone, geolocation, language, social graph, etc. are as big as your imagination.

Due to the popularity of blogging for both personal and corporate communication, many of you will also understand some of the opportunities of the WordPress firehose.  With the addition of firehoses of comments, you have the capabilities of connecting threads of conversation to realize another possible analysis strategy. “Likes” and Disqus “votes” provide additional hints about user reaction and engagement–yet another way to filter and understand posts and comments.

Why go to the effort and expense of adding a new firehose?
There are three benefits from investing your efforts in learning to integrate these differences. Users of social networks choose to participate in Twitter, Tumblr or other social networks based on their affinities and preferences. Integrating additional active social media sources gives:

  1. Richer audience demographics
  2. More diverse perspective and preference
  3. Broader topic coverage.

Here’s an example.


The newest firehose from Gnip became available earlier in 2012. Tumblr’s exciting because the unique, rich content from Tumblr provides a complementary perspective and a distinct form of conversation. Tumblr is important because of the unique audience and modes of interaction common within this audience and platform.

With a firehose of over 50 million new posts a day from web users, Tumblr is a source with strong social sharing features and an active network of users where discussions can reach a large audience quickly.  Some Tumblr posts have been reblogged more than a million times and stories regularly travel to thousands of readers in a couple of days.

Before jumping into consuming the Tumblr firehose in the next section, it may help to understand some of what makes it different and valuable. These questions provide a useful framework when approaching any unfamiliar stream of social data.

What is unique about the Tumblr firehose?

1. Demographics. The user community on Tumblr skews young, over-indexing strongly in the 18-24 demographic of trend setters and cool hunters.

2. Communication and Activity Style. As you are thinking about filtering and mining the Tumblr firehose, realize conversations on Tumblr are often quite different from what you’ll find on other social platforms. As you start to interpret the data from Tumblr it’s important to note that Tumblr has an inside language. For example, many sites contain f**kyeah___ in their name and URL. When you start to hone in on your topic, you will need to understand the inside language used for both positive and negative responses. Terms you consider negative on one platform may have positive connotations on another. Be sure to review a subset of your data to get a feel for the nuances before drawing larger conclusions.

3. Rich Content. Content is rich in that there are many types of media and a wide range of depth. Users will post audio, video, animated gifs, simple photos as well as short and long text posts.

You’ll also see 7 different Post Types on Tumblr. These represent the different types of content that users can post on Tumblr. They break out as follows:

Table of Post Types on Tumblr

Table 1 – Tumblr post type breakdown.

To answer the questions, we often rely on filters based on text since these are the simplest filters to think about and create.  The textual data and metadata available in the Tumblr firehose include titles, tags and image captions in addition to the text of the body of the post. Including all of this content allows us to filter approximately 20% of the Tumblr firehose based on text. Additional strategies include looking at reblog and “like” activity, as well as reblog and “like” relationships between users.  More sophisticated strategies such as applying character or object recognition to images open up the tens of millions of activities daily for mining and exploration.
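A text filter over titles, tags, captions and body text might be sketched as follows. The field names ("title", "body", "caption", "tags") are assumptions for illustration; check them against the activity format of the stream you actually consume.

```python
# Hypothetical field names; check the activity format of your own stream.
TEXT_FIELDS = ("title", "body", "caption")


def searchable_text(activity):
    """Join the text-bearing fields of one activity into a lowercase string."""
    parts = [activity.get(field) or "" for field in TEXT_FIELDS]
    parts.extend(activity.get("tags") or [])
    return " ".join(parts).lower()


def matches(activity, terms):
    """True if any filter term appears in the activity's text fields."""
    text = searchable_text(activity)
    return any(term.lower() in text for term in terms)


# A photo post with no body text can still match on a tag or caption.
photo_post = {"caption": "New kicks", "tags": ["Nike", "sneakers"]}
print(matches(photo_post, ["nike"]))       # True
print(matches(photo_post, ["lufthansa"]))  # False
```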

4. Rich Topics. In addition to diverse content forms, Tumblr has attracted many active conversations on a wide variety of topics. This content is often very complementary to other social media platforms due to differences in audience, tone, volume or perspective. With more than 20 billion total posts to date, there is content about almost anything you can imagine.  Some examples include:

  • Brands. Any brand you can think of is being discussed right now on Tumblr. Big brands with an official presence on Tumblr include Coca-Cola, Nike, IBM, Target, Urban Outfitters, Puma, Huggies, Lufthansa, Mac Cosmetics and many more. NPR and the President of the United States have their own presences on Tumblr.
  • Fashion and Cosmetics. Because of the visual nature of the medium and cool-hunting audience it attracts, there is a large volume of content related to cosmetics and fashion.
  • Music and Movies. With Spotify music plugins and easy upload and sharing of visual content, pop culture plays a big role in the interests and attention of many of the active users on Tumblr. Information, analysis and fan content is rich, creative and travels through the community rapidly.

5. Reblogs and Likes. Tumblr is all about engagement! The primary user activities for interactions are Reblogs and Likes. Some entries are reblogged thousands of times in a day or two. When a user reblogs a post, it places the other user’s post into their own blog with any changes they make. There is a list of all of the notes (likes, reblogs) associated with a post appended to that post wherever it shows up on Tumblr. Each post activity record in the firehose can contain reblog info. It will have a count, a link to the blog this entry was a reblog of and a link to the root entry. To build the blog note list that a user would see at the bottom of a liked or reblogged entry, you have to trace each entry in the stream (i.e. keep a history or know what you want to watch) or scrape the notes section of a page.

Filtering and Mining The Tumblr Firehose

Volume. There are a number of metrics we can use to talk about the volume of the Tumblr firehose. The three gating resources that we run up against most often are related to the network (bandwidth and latency) and storage (e.g. disk space). Tumblr activities are delivered compressed, so for estimating, the bandwidth and disk space requirements can be based on the same numbers. The Tumblr firehose averages about 900 MB/hour compressed volume during peak hours, falling to a minimum of 300 MB/hour during slower periods of the day.

To store the firehose on disk, plan on ~16 GB/day based on current volumes. Planning for bandwidth, you want headroom of 2-5 x average peak hourly bandwidth (4 to 10 Mbps) depending on your tolerance for disconnects during peak events.
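The arithmetic behind these planning numbers can be checked in a few lines. This sketch uses only the figures quoted above and assumes 1 MB = 8 Mb and 1 GB = 1000 MB.

```python
def mb_per_hour_to_mbps(mb_per_hour):
    """Convert megabytes per hour to megabits per second (1 MB = 8 Mb)."""
    return mb_per_hour * 8 / 3600.0


peak_mbps = mb_per_hour_to_mbps(900)  # peak-hour volume quoted above
low_mbps = mb_per_hour_to_mbps(300)   # slow-period volume quoted above

print(round(peak_mbps, 2))            # 2.0 Mbps at peak
print(round(low_mbps, 2))             # 0.67 Mbps in slow periods
print(2 * peak_mbps, 5 * peak_mbps)   # 4.0 10.0 (the 2-5x headroom range)

# Average hourly volume implied by ~16 GB/day, in MB/hour.
print(round(16 * 1000 / 24.0))        # 667
```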

The other consideration is end-to-end network latency as discussed in Consuming the Firehose, Part II.  Very simplistically, latency can limit the throughput of your network (regardless of bandwidth) by using up too much time negotiating connections and acknowledging packets. (For a detailed calculation, see, for example, The TCP Window, Latency, and the Bandwidth Delay Product.)  The theoretical limit for 20 Mbps throughput is 50-70 ms (depends on TCP window size), but practically you will want to reliably observe less than this (< 50 ms) to realize reliable network performance.
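To make the latency limit concrete, here is a simplified sketch of the bandwidth-delay relationship, throughput <= window / round-trip time, for a single TCP connection. The window sizes below are assumptions chosen for illustration, and the calculation ignores packet loss, slow start and window scaling details.

```python
def max_rtt_ms(window_bytes, throughput_mbps):
    """Largest round-trip time (ms) at which one TCP connection can sustain
    the given throughput, using throughput <= window / RTT."""
    return window_bytes * 8 / (throughput_mbps * 1e6) * 1000


# Illustrative window sizes in kilobytes (1 KB = 1000 bytes here).
for window_kb in (64, 128, 175):
    print(window_kb, round(max_rtt_ms(window_kb * 1000, 20), 1))
```

Under these assumptions, windows of roughly 128 to 175 KB correspond to latency limits of about 51 to 70 ms at 20 Mbps, which is in line with the range quoted above; a 64 KB window would need latency under about 26 ms.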

Metadata. A firehose is a time-ordered, near real-time stream of user activities. While this structure is clearly powerful for identifying emerging trends around brands or news stories, the time-ordered stream is not the optimal structure for looking at other things, like the structure of social networks to discover, e.g., influencers. Fortunately, the Tumblr firehose activities contain a lot of helpful metadata about place, time, and social network to get answers to these questions.
Each activity has a post objectType as discussed above as well as links to resources referred to in the post such as image files, video files and audio files. Each activity has a source link that takes you back to the original post on Tumblr. If the post is a re-blog, it will also have records like the JSON example below, describing the number of reblogs, the root blog and blog this post reblogged.

"tumblrRebloggedFrom" : {
         "author" : {
               "displayName" : "A Glimpse",
               "link" : ""
         },
         "link" : ""
},
"tumblrRebloggedRoot" : {
         "author" : {
                "displayName" : "Armed With A Mind",
                "link" : ""
         },
         "link" : ""
}

To assemble the entire reblog chain, you must connect the reblog activities within the firehose using this metadata.
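One way to connect the reblog activities is to index each post by the post it was reblogged from and then walk that index back toward the root. This is a minimal sketch: it assumes each activity carries its own post URL in a top-level "link" field (an assumption), and the URLs are placeholders.

```python
def build_parent_index(activities):
    """Map each post link to the link of the post it was reblogged from."""
    parents = {}
    for activity in activities:
        source = activity.get("tumblrRebloggedFrom")
        if source:
            parents[activity["link"]] = source["link"]
    return parents


def reblog_chain(link, parents):
    """Follow reblog links back toward the root; newest post first."""
    chain = [link]
    while link in parents and parents[link] not in chain:
        link = parents[link]
        chain.append(link)
    return chain


# Placeholder URLs for illustration only.
activities = [
    {"link": "http://b.example/2",
     "tumblrRebloggedFrom": {"link": "http://a.example/1"}},
    {"link": "http://c.example/3",
     "tumblrRebloggedFrom": {"link": "http://b.example/2"}},
]
parents = build_parent_index(activities)
print(reblog_chain("http://c.example/3", parents))
```

The chain is only as complete as your history of the stream: a reblog whose parent activity you never stored stops the walk early.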

Additional engagement metadata is available in the form of likes (hearts in the Tumblr interface) in a separate Tumblr engagement firehose.

Tumblr Likes Metadata

Non-Text Based Filters. Not all non-text post types have enough textual context (captions, title and tags) to identify a topic or analyze sentiment through simple text filtering. You will want to develop strategies for dealing with some ambiguity around the meaning of posts with very little text content. This ambiguity cannot be reduced unless you have audio or image analysis capabilities (e.g. OCR or audio transcription). Approximately 20% of all posts (about 15M activities per day) can be filtered effectively with text-based filtering of text, URL text, tags and captions.

Memes. Another consideration related to the Tumblr language is that official brand sites as well as many bloggers tend to promote a style or overall image more than providing a catalog of particular products. As a result, e.g., you will match the brand name with a lot of cool stuff, but may see specific product names and descriptions much less frequently. There are many memes within Tumblr that will lead you to influencers and sentiment, but looking at “catalog” terms won’t be the most effective path.

I hope I have uncovered some of the mysteries of successfully consuming social media firehoses.  I have only suggested a handful of questions one might try to answer with the social media data. The community of professionals providing text analysis, image analysis, machine learning for prediction, classification and recommendation, and many other wonders is continuing to invent and refine ways to model and predict real-world behavior based on billions of social media interactions.  The start of this process is always a great question.  Best of luck (and the benefits of all of Gnip’s experience and technology) to you as you jump into consuming the social media firehose.

Full Series:

Taming The Social Media Firehose, Part I – High-level attributes of a firehose

Taming The Social Media Firehose, Part II – Looking at a single event through four firehoses

Taming The Social Media Firehose, Part III – Tumblr


Taming The Social Media Firehose, Part II

In part I, we discussed some high-level attributes of the social media firehose and what is needed to digest the data. Now let’s collect some data and perform a simple analysis to see how it goes.

On March 20, 2012, there was a large magnitude 7.4 earthquake in Oaxaca, Mexico. (See the USGS record.) Due to the severity of the earthquake, it was felt in many locations in Southern Mexico and as far as 300 miles away in Mexico City.

Collecting data from multiple firehoses around a surprise event such as an earthquake can give a quick sense of the unfolding situation on the ground in the short term as well as helping us understand the long-term implications of destruction or injury. Social data use continues to evolve in natural disasters.

For this post, let’s limit the analysis to two questions:

  1. How does the volume of earthquake-related posts and tweets evolve over time?
  2. How rich or concise are social media conversations about the earthquake? To keep this simple, treat the size of the social media activities as a proxy for content richness.

In the past few months, Gnip has made new, rich firehoses such as Disqus, WordPress and Tumblr available. Each social media service attracts a different audience and has strengths for revealing different types of social interactions. For each topic of interest, you’ll want to understand the audience and the activities common to the publisher.

Out of the possible firehoses we can use to track the earthquake through social media activity, four firehoses will be used in this post:

  • Twitter
  • WordPress Posts
  • WordPress Comments
  • Newsgator (to compare to traditional media)

There are some common tasks and considerations for solving problems like looking at earthquake data across social media.  For examples, see this post on the taxonomy of data science. To work toward an answer, you will always need a strategy of attack to address each of the common data science challenges.

For this project, we will focus on the following steps needed to digest and analyze a social media firehose:

  1. Connect and stream data from the firehoses
  2. Apply filters to the incoming data to reduce to a manageable volume
  3. Store the data
  4. Parse and structure relevant data for analysis
  5. Count (descriptive statistics)
  6. Model
  7. Visualize
  8. Interpret

Ok, that’s the background. Let’s go.

1. Connect to the firehose. We are going to collect about a day’s worth of data. The simplest way to collect the data is with a cURL statement like this,

curl --compressed -s

This command opens a streaming HTTP connection to the Gnip firehose server and delivers a continuous stream of JSON-formatted, GZIP-compressed data from the stream named “pt1.json” to my analysis data collector. If everything goes as planned, this will collect data from the firehose until the process is manually stopped.

The depth of the URL (each level of …/…/…) is the RESTful way of defining the interface. Gnip provides a URL for every stream a user configures on the server. In this case, I configured a streaming Twitter server named “pt1.json.”

A more realistic application would ensure against data loss by making this basic client connection more robust. For example, you may want to monitor the connection so that if it dies due to network latency or other network issues, the connection can be quickly re-established. Or, if you cannot afford to miss any activities, you may want to maintain redundant connections.
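A more robust client might wrap the connection in a reconnect loop with exponential backoff. The sketch below is only an outline of that idea: `connect` is a hypothetical callable that opens the stream and returns an iterable of lines, and the demo uses a fake connection rather than a real firehose.

```python
import time


def consume_with_reconnect(connect, handle, max_attempts=5, base_delay=1.0,
                           sleep=time.sleep):
    """Read lines from connect(), reconnecting with exponential backoff.

    connect: hypothetical callable returning an iterable of lines.
    handle:  callable invoked for each line received.
    Gives up after max_attempts consecutive attempts that deliver no data.
    """
    failures = 0
    while failures < max_attempts:
        try:
            for line in connect():
                failures = 0      # data is flowing, so reset the counter
                handle(line)
        except OSError:
            pass                  # treat any I/O error as a dropped connection
        failures += 1
        sleep(base_delay * 2 ** (failures - 1))


# Demo with a fake connection that delivers two lines and then stays down.
attempts = []


def fake_connect():
    attempts.append(1)
    if len(attempts) == 1:
        return iter(['{"id": 1}', '{"id": 2}'])
    raise OSError("connection refused")


received = []
consume_with_reconnect(fake_connect, received.append, max_attempts=3,
                       sleep=lambda seconds: None)
print(len(received), len(attempts))  # 2 3
```

A production client would also log each disconnect and, where available, ask the provider to replay activities missed during the gap.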

One of the network challenges is that volume from the firehoses changes with a daily cycle, a weekly cycle and surprise events such as earthquakes. These changes can be many times the average volume of posts. There are many strategies for dealing with volume variations and planning network and server capacities. For example, you may design graceful data loss procedures or, if your data provider provides it, content shaping such as buffering, prioritized rules or rule-production caps. In the case of content shaping, you may build an application to monitor rule-production volumes and react quickly to large changes in volume by restricting your rule set.

Here is a short list of issues to keep in mind when planning your firehose connection:

  • Bandwidth must be sufficient for activity volume peaks rather than averages.
  • Latency can cause disconnects as well as having adverse effects on time-sensitive analysis.
  • Disconnects may occur due to bandwidth or latency as well as network outage or client congestion.
  • Implement redundancy and connection monitoring to ensure against activity loss.
  • Activity bursts may require additional hardware, bandwidth, processing power or filter updates. Volume can change by 10x or more.
  • Publisher terms of service may require additional filtering to comply with requirements on how or when data may be used, for example, appropriately handling activities that were deleted or protected after your system received them.
  • Repeated activities must be de-duplicated and missing activities identified.
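As one illustration of the de-duplication item in the list above, a consumer can remember the identifiers it has already seen. The sketch assumes every activity carries a unique "id" field, which is an assumption about the payload.

```python
def deduplicate(activities, seen=None):
    """Yield each activity the first time its id is seen.

    Assumes every activity has a unique "id" field. In a long-running
    consumer the seen set would need to be bounded, for example by
    expiring old ids.
    """
    seen = set() if seen is None else seen
    for activity in activities:
        if activity["id"] not in seen:
            seen.add(activity["id"])
            yield activity


# The same activity delivered twice, e.g. over redundant connections.
stream = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
print([activity["id"] for activity in deduplicate(stream)])  # ['a', 'b']
```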

2. Preliminary filters. Generally, you will want to apply broad filter terms early in the process to enable faster, more manageable downstream processing. Many terms related to nearly any topic of interest will appear not only in the activities you are interested in, but also in unrelated activities (noise). The practical response is to continuously refine filter rules to exclude unwanted activities. This may be as simple as keyword filtering, or as sophisticated as machine learning identification of activity noise.

While it would help to use a carefully crafted rule set for our earthquake filter of the firehoses, it turns out that we can learn a lot with the two simple rules “quake” and “terramoto,” the English and Spanish terms commonly appearing in activities related to the earthquake.  For our example analysis, we don’t get enough noise with these two terms to worry about additional filtering. So, each of the firehoses is initially filtered with these two key words. With a simple filter added, our connection looks like this,

curl --compressed -s \
| grep -i -e "quake" -e "terramoto"

The “grep” command simply keeps only the activities that contain the term “quake” or “terramoto”; the “-i” means to do this without worrying about case.

The filter shown in the example will match activities in which either term appears in any part of the activity including activity text, URLs, tagging, descriptions, captions etc. In order to filter more precisely on, for example, only blog post content, or only tweet user profile, we would need to parse the activity before filtering.
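Parsing before filtering might look like the sketch below. It assumes one JSON activity per line and that the post text lives in a "body" field; the sample lines are made up.

```python
import json


def body_matches(line, terms):
    """True if any term appears (case-insensitively) in the activity body."""
    try:
        activity = json.loads(line)
    except ValueError:
        return False              # skip blank keep-alives and partial lines
    if not isinstance(activity, dict):
        return False
    body = (activity.get("body") or "").lower()
    return any(term in body for term in terms)


terms = ("quake", "terramoto")
lines = [
    '{"body": "Huge quake just now", "link": "http://example.com/1"}',
    '{"body": "Lunch time", "link": "http://example.com/quake"}',
    '',
]
print([body_matches(line, terms) for line in lines])  # [True, False, False]
```

The second sample line would pass the grep filter because the term appears in its URL, but it is rejected here because only the body is inspected.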

Alternatively, we can configure Gnip’s Powertrack filtering when we set up our server with rules for restricting filtering to certain fields or volume shaping. For example, to filter Tweets based on a Twitter user’s profile location settings, we might add the rule,

user_profile_location:"Mexico City"

Or, to shape matched Tweets volume for very common terms, we might add the rule to restrict output to 50% of matched Tweets with,


For the earthquake example, we use all matched activities.

3. Store the data. Based on the desired analysis, there are a wide variety of choices for storing data. You may choose to create an historical archive, load a processing queue, and push the data to cluster storage for processing with, for example, Hadoop. Cloud-based key-value stores can be economical, but may not have the response characteristics required for solving your problem. Choices should be driven by precise business questions rather than technology buzz.

Continuing to work toward the earthquake analysis, we will store activities to a file to keep things simple.

curl --compressed -s \
| grep -i -e "quake" -e "terramoto" \
> earthquake_data.json

Plans for moving and storing data should take into account typical activity volumes. Let’s look at some examples of firehose volumes. As GZIP-compressed, JSON-formatted activities, 100M Tweets come to about 25 gigabytes.  While this takes less than 2 minutes to transfer to disk at 300 MB/s (SATA II), it takes about 6 hours at 10 Mb/s (e.g. a typical congested Ethernet network). Firehose sizes vary and one day of posts is a bit more manageable at 350MB.
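The transfer times quoted above follow from simple arithmetic, sketched here with decimal units (1 GB = 1e9 bytes, 1 MB/s = 8 Mb/s). The function name is illustrative.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Time to move size_bytes over a link of the given bit rate."""
    return size_bytes * 8 / float(rate_bits_per_second)


archive = 25e9        # 25 gigabytes of compressed Tweets
sata = 300e6 * 8      # 300 MB/s expressed in bits per second
ethernet = 10e6       # 10 Mb/s

print(round(transfer_seconds(archive, sata)))                # 83 (seconds)
print(round(transfer_seconds(archive, ethernet) / 3600, 1))  # 5.6 (hours)
```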

Filtered Earthquake data for the Twitter, WordPress and Newsgator firehoses is only a few gigabytes, so we will just work from local disk.

4. Parse and structure relevant data. This is the point where we make decisions about data structure and tools that best support the desired analysis. The data are time-ordered social media activities with a variety of metadata. On one hand, it may prove useful to load the data into HBase to leverage the scalability of Hadoop, while on the other, structuring a subset of the data or metadata and inserting it into a relational database to leverage the speed of indexes might be a good fit. There is no silver bullet for big data.

Keeping it simple for the earthquake data, use a Python script to parse the JSON-formatted activities and extract the date-time of each post.

import json

# Print the timestamp of each activity (one JSON object per line).
with open("earthquake_data.json") as f:
    for activity in f:
        print(json.loads(activity)["postedTime"])

Now we can analyze the time-evolution of activity volume by counting up the number of mentions appearing in a minute. Similarly, to estimate content complexity, we can add a few more lines of Python to count characters in the text of the activity.

import json

# Print the number of characters in the text of each activity.
with open("earthquake_data.json") as f:
    for activity in f:
        print(len(json.loads(activity)["body"]))

5. Descriptive statistics. Essentially, counting the number of earthquake references. Now that we have extracted dates and sizes, our earthquake analysis is simple. A few ideas for more interesting analysis could be understanding and identifying key players in the social network, extracting entities from text such as places or people, or performing sentiment analysis, or watching the wave of tweets move out from the earthquake epicenter.
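The counting step might look like the following sketch, which tallies mentions per minute and averages the body length. It assumes ISO 8601 timestamps in "postedTime" and text in "body"; the sample activities are made up.

```python
import json
from collections import Counter


def summarize(lines):
    """Return (mentions per minute, mean body length) for JSON activity lines."""
    per_minute = Counter()
    total_chars = 0
    n = 0
    for line in lines:
        activity = json.loads(line)
        # Assumes ISO 8601 timestamps such as 2012-03-20T18:02:47.000Z,
        # so the first 16 characters identify the minute.
        per_minute[activity["postedTime"][:16]] += 1
        total_chars += len(activity.get("body", ""))
        n += 1
    mean_chars = total_chars / float(n) if n else 0.0
    return per_minute, mean_chars


# Made-up activities for illustration.
sample = [
    '{"postedTime": "2012-03-20T18:03:10.000Z", "body": "quake!"}',
    '{"postedTime": "2012-03-20T18:03:55.000Z", "body": "big quake"}',
    '{"postedTime": "2012-03-20T18:04:02.000Z", "body": "terramoto"}',
]
per_minute, mean_chars = summarize(sample)
print(sorted(per_minute.items()))
print(mean_chars)  # 8.0
```

To run it on the stored data, pass the open file object for earthquake_data.json in place of the sample list.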

6. Model. Descriptive statistics are okay, but well-formed questions make explicit assumptions about correlation and/or causality and–importantly–are testable.  The next step is to build a model and some related hypotheses we can test.

A simple model we can examine is that surprise events fit a “double-exponential” pulse in activity rate. The motivating idea is that news spreads more quickly as more people know about it (exponential growth) until nearly everyone who cares knows. After saturation, the discussion of a topic dies off exponentially (analogously to radioactive decay). If this hypothesis works out, we have a useful modelling tool that enables comparison of events and conversations between different types of events and across firehoses. To learn more about the attributes and fitting social media volume data to the double exponential, see Social Media Pulse.
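One way to write such a pulse is a saturating exponential rise multiplied by an exponential decay. This particular parameterization and its parameter values are assumptions for illustration; the form used in Social Media Pulse may differ.

```python
import math


def pulse(t, r0, alpha, beta, t0=0.0):
    """Activity rate at time t for a pulse starting at t0.

    alpha sets how fast the story rises, beta how fast it decays.
    """
    if t <= t0:
        return 0.0
    dt = t - t0
    return r0 * (1.0 - math.exp(-alpha * dt)) * math.exp(-beta * dt)


def peak_time(alpha, beta, t0=0.0):
    """Time of maximum rate, from setting the derivative of pulse() to zero."""
    return t0 + math.log((alpha + beta) / beta) / alpha


# Illustrative parameters, in units of 1/minute.
alpha, beta = 0.2, 0.01
tp = peak_time(alpha, beta)
print(round(tp, 1))  # 15.2 minutes to peak
```

Fitting alpha and beta to the per-minute counts for each firehose then gives comparable numbers for how quickly each story rises and fades.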

7. Visualize. Finally, we are ready to visualize the results of earthquake-related activity. We will use the simple tactic of looking at time-dependent activity in Figure 1. Twitter reports start arriving immediately after the earthquake and the volume grows to a peak within minutes. Traditional media (Newsgator) stories peak about the same time and continue throughout the day while blogs and blog comments peak and continue into the following day.

Figure 1. Twitter Reaction to Earthquakes

8. Interpret. Referring to Figure 1, a few interesting features emerge.  First, Twitter volume is significant within a minute or two of the earthquake and peaks in about half an hour. Within an hour, the story on Twitter starts to decay. In this case, the story’s natural decay is slowed by the continual release of news stories about the earthquake. Users continue to share new Tweets for nearly 12 hours.

The prominent bump in Tweet volume  just before 1:00 UTC is the result of Tweets by the members of the Brazilian boy-band “Restart,” who appeared to be visiting Mexico City at the time of the earthquake and came online to inform their fanbase that they were okay. The band’s combined retweet volume during this period added a couple thousand tweets to the background of earthquake tweets (which also grew slightly during this period due to news coverage).

While it is not in general the case, traditional media earthquake coverage  (represented by the bottom graph of the Newsgator firehose) peaks at about the same time as Tweet volume.  We commonly see Tweet volume peaking minutes, hours and occasionally days before traditional media.  In this case, the quake was very large attracting attention and pressure to answer questions about damage and injuries.

WordPress blog posts about the earthquake illustrate a common pattern for blog posts: they live on a daily cycle. Notice the second wave of WordPress posts starting around 9:00 UTC.  Blogs take a little longer to peak because they typically contain analysis of the situation, photos and official statements.  Also, many blog readers choose to check in on the blogs they follow in the morning and evening.

A couple of comments on the topic of content richness… As you might have guessed, Tweets are concise. Tweets are limited to 140 characters, but average in the mid 80s.  Comments tend to be slightly longer than Tweets at about 250 characters. Posts have the widest range of sizes, with a few very large posts (up to 15,000 words). Posts also often contain rich media such as images or embedded video.

From our brief analysis, it is clear that different firehoses represent different audiences, a range of content richness and publisher-specific user modes of interaction. “What firehoses do I need?” To start to answer, it may be useful to frame the comparison in terms of speed vs. content richness. Stories move at different speeds and have different content richness based on the firehose and the story. Your use case may require rapid reaction to the social conversation, or nuanced understanding of long-term sentiment. Surprise stories don’t move in the same ways as conversations about expected events.  Each firehose represents a different set of users with different demographics and sensibilities. You will need to understand the content and behavior of each firehose based on the questions you want to answer.

As you can see by getting your hands a little dirty, looking at how a specific event is being discussed in social media is fairly straightforward.  Hopefully I have shown a simple and direct route to answers and given some useful context on the considerations for building a real-world social media application. Always remember to start with a good, well-formed question.

In Taming The Social Media Firehose,  Part III, we will look at consuming the unique, rich and diverse content of the Tumblr Firehose.

Taming The Social Media Firehose, Part I

This is the first post in our series on what a social media “firehose” (e.g. streaming api) is and what it takes to turn it into useful information for your organization.  Here I outline some of the high-level challenges and considerations when consuming the social media firehose; in Parts II and III, I will give more practical examples.

Social Media Firehose

Why consume the social media firehose?

The idea of consuming large amounts of social data is to get small data–to gain insights and answer questions, to guide strategy and help with decision making. To accomplish these objectives, you are not only going to collect data from the firehose, but you are going to have to parse it, scrub and structure it based on the analysis you will pursue. (If you’re not familiar with the term “parse,” it means machines are working to understand the structure and contents of the social media activity data.) This might mean analyzing text for sentiment, looking at the time-series of the volume of mentions of your brand on Tumblr, following the trail of political reactions on the social network of commenters or any of thousands of other possibilities.

What do we mean by a social media firehose?

Gnip offers social media data from Twitter, Tumblr, Disqus and Automattic (WordPress blogs) in the form of “firehoses.”  In each case, the firehose is a continuous stream of flexibly structured social media activities arriving in near-real time. Consuming that sounds like it might be a little tricky. While the technology required to consume and analyze social media firehoses is not new, the synthesis of tools and ideas needed to successfully consume the firehose deserves some consideration.

It may help to start by contrasting firehoses with a more common way of looking at the API world–the plain vanilla HTTP request and response. The explosion of SOAPy (Simple Object Access Protocol) and RESTful APIs has enabled the integration and functional ecosystem of nearly every application on the Web. At the core of web services is a pair of simple ideas: that we can leverage the simple infrastructure of HTTP requests (the biggest advantage may be that we can build on existing web servers, load balancers, etc.), and that scalable applications can be built on simple stateless request/response pairs exchanging bite-sized chunks of data in standard formats.

Firehoses are a little different in that, while we may choose to use HTTP for many of the reasons REST and SOAP did, we don’t plan to get responses in mere bite-sized chunks.  With a firehose, we intend to open a connection to the server once and stream data indefinitely.

Once you are consuming the firehose, and–even more importantly–with some analysis in mind, you will choose a structure that adequately supports your approach. With any luck (more likely smart people and hard work), you will end up not with Big Data, but rather with simple insights–simple to understand and clearly prescriptive for improving products, building stronger customer relationships, preventing the spread of disease, or any other outcome you can imagine.

The Elements Of a Firehose

Now that we have a why, let’s zero in on consuming the firehose. Returning to the definition above, here is what we need to address:

Continuous. For example, the Twitter full firehose delivers over 300M activities per day. That is an average of 3,500 activities/second or 1 activity every 290 microseconds. The WordPress firehose delivers nearly 400K activities per day. While this is a much more leisurely 4.6 activities/second, there still isn’t much time to sleep between the 1 activity every 0.22 s.  And if your system isn’t continuously pulling data out of the firehose, much can be lost in a short time.
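The arithmetic behind those rates is easy to check. Here is a minimal sketch that uses only the daily volumes quoted above; the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def firehose_rate(activities_per_day):
    """Return (activities per second, seconds between activities)."""
    per_second = activities_per_day / SECONDS_PER_DAY
    return per_second, 1.0 / per_second

twitter_rate, twitter_gap = firehose_rate(300_000_000)
wordpress_rate, wordpress_gap = firehose_rate(400_000)

print(f"Twitter:   {twitter_rate:,.0f}/s, one every {twitter_gap * 1e6:.0f} microseconds")
print(f"WordPress: {wordpress_rate:.1f}/s, one every {wordpress_gap:.2f} s")
```

The figures in the text are these values rounded for readability.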

Streams. As mentioned above, the intention is to make a firehose connection and consume the stream of social media activities indefinitely. Gnip delivers the social media stream over HTTP. The consumer of data needs to build their HTTP client so that it can decompress and process the buffer without waiting for the end of the response. This isn’t your traditional request-response paradigm (that’s why we’re not called Ping–and also, that name was taken).
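To make "process the buffer without waiting for the end of the response" concrete, here is a rough sketch. It assumes the stream is gzip-compressed and carries one JSON activity per line; both are assumptions for illustration, as are the function name and the made-up activities. The chunks are simulated in memory, whereas a real client would read them from the open HTTP response as they arrive.

```python
import gzip
import json
import zlib

def iter_activities(chunks):
    """Yield one parsed activity per line from an iterable of gzip-compressed byte chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        # Hand back every complete line; keep any partial line for the next chunk.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)

# Simulate a stream: compress three made-up activities, then deliver them in small pieces.
payload = b"".join(
    json.dumps({"id": i, "body": f"activity {i}"}).encode() + b"\n" for i in range(3)
)
compressed = gzip.compress(payload)
chunks = [compressed[i:i + 16] for i in range(0, len(compressed), 16)]

activities = list(iter_activities(chunks))
print([a["id"] for a in activities])
```

The key design point is that decompression and line splitting both keep state between chunks, so an activity that straddles two network reads is still parsed exactly once.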

Unstructured data. I prefer “flexibly structured” because there is plenty of structure in the JSON or XML formatted activities contained in the firehose. While you can simply and quickly get to the data and metadata for the activity, you will need to parse and filter the activity. You will need to make choices about how to store activity data in the structure that best supports your modeling and analysis. It is not so much what tool is good or popular, but rather what question you want to answer with the data.
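As a small illustration of "parse and filter," the sketch below turns a raw JSON activity into a flat record and keeps it only if the text mentions a keyword. The field names ("body", "postedTime", "actor") and the sample activities are hypothetical; check the actual schema of whichever firehose you consume.

```python
import json

def extract(raw, keywords):
    """Parse one JSON activity; return a flat record if its text mentions a keyword, else None."""
    activity = json.loads(raw)
    text = activity.get("body", "")
    if not any(k in text.lower() for k in keywords):
        return None
    # Keep only the fields this particular analysis needs.
    return {
        "id": activity.get("id"),
        "posted": activity.get("postedTime"),
        "author": activity.get("actor", {}).get("name"),
        "text": text,
    }

# Made-up activities with hypothetical field names.
raw_activities = [
    '{"id": "1", "postedTime": "2012-05-10T21:00:00Z", "actor": {"name": "a"}, "body": "Big bank loss announced"}',
    '{"id": "2", "postedTime": "2012-05-10T21:00:05Z", "body": "Lunch was great"}',
]
records = [r for r in (extract(raw, ["bank"]) for raw in raw_activities) if r is not None]
print(records)
```

Using `.get()` with defaults is one way to cope with flexible structure: a missing field becomes `None` instead of an exception.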

Time-ordered activities done by people. The primary structure of the firehose data is that it represents the individual activities of people rather than summaries or aggregations. The stream of data in the firehose describes activities such as:

  • Tweets, micro-blogs
  • Blog/rich-media posts
  • Comments/threaded discussions
  • Rich media-sharing (urls, reposts)
  • Location data (place, long/lat)
  • Friend/follower relationships
  • Engagement (e.g. Likes, up- and down-votes, reputation)
  • Tagging

Real-time. Activities can be delivered soon after they are created by the user (this is referred to as low latency). (Paul Kedrosky points out that a 70s station wagon full of DVDs has about the same bandwidth as the internet, but an inconvenient coast-to-coast latency of about 4 days.) Both bandwidth and latency are measures of speed. Many people know how to worry about bandwidth, but latency issues can really mess up real-time communications even if you have plenty of bandwidth. When consuming the Twitter firehose, it is common to realize latency (measured as the time from Tweet creation to parsing the Tweet coming from the firehose) of ~1.6 s and as low as 300 milliseconds. WordPress posts and comments arrive 2.5 seconds after they are created on average.
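Measuring that latency yourself is mostly a matter of subtracting timestamps. Here is a minimal sketch, assuming each activity carries a UTC creation timestamp in the format shown in the comment; the format and the function name are assumptions, not a documented schema.

```python
from datetime import datetime, timezone

def latency_seconds(created_at, parsed_at):
    """Seconds between an activity's creation time and the moment we parsed it."""
    # Assumed timestamp format, e.g. "2012-05-10T21:00:00.000Z" (UTC).
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return (parsed_at - created).total_seconds()

# Made-up example: an activity parsed 1.6 s after it was created.
parsed_at = datetime(2012, 5, 10, 21, 0, 1, 600000, tzinfo=timezone.utc)
latency = latency_seconds("2012-05-10T21:00:00.000Z", parsed_at)
print(f"latency: {latency:.1f} s")
```

In live use, `parsed_at` would be `datetime.now(timezone.utc)`, and the result is only as trustworthy as the clock synchronization between the publisher and your machine.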

So there are a lot of activities and they are coming fast. And they never stop, so you never want to close your connection or stop processing activities.

However, in real life “indefinitely” is more of an ideal than a regular achievement. The stream of data may be interrupted by any number of variations in the network and server capabilities along the line between Justin Bieber tweeting and my analyzing what brand of hair gel teenaged girls are going to be talking their boyfriends into using next week.
We need to work around practicalities such as high network latency, limited bandwidth, running out of disk space, service provider outages, etc. In the real world, we need connection monitoring, dynamic shaping of the firehose, redundant connections and historical replay to get at missed data.
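Connection monitoring can start very simply: notice when the stream ends or fails, wait a little, and reconnect. The sketch below shows one common approach, exponential backoff. All names here are hypothetical; `connect` stands in for whatever opens your stream, and the demo uses a fake connection that fails once.

```python
import time

def consume_forever(connect, handle, max_backoff=60.0, sleep=time.sleep, max_attempts=None):
    """Keep a stream open: reconnect with exponential backoff whenever it drops."""
    backoff = 1.0
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            for activity in connect():
                handle(activity)
                backoff = 1.0  # data is flowing again, so reset the delay
        except (ConnectionError, TimeoutError) as err:
            print(f"stream interrupted: {err}")
        # The stream ended or failed; wait before reconnecting.
        sleep(backoff)
        backoff = min(backoff * 2, max_backoff)

# Demo with a fake connection that fails once, then delivers two activities.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated outage")
    return iter(["activity-1", "activity-2"])

received, delays = [], []
consume_forever(flaky_connect, received.append, sleep=delays.append, max_attempts=2)
print(received, delays)
```

Backoff only limits the damage. Activities missed while disconnected still have to come from a redundant connection or a historical replay.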

In Part II we make this all more concrete. We will collect data from the firehose and analyze it. Along the way, we will address particular challenges of consuming the firehose and discuss some strategies for dealing with them.


Are Facebook Users More Optimistic than Twitter Users?

New Year’s Eve gives us a sense of closure on the past and an opportunity to make new dreams. With the emergence of social media, we can now see these reflections and resolutions transpire in realtime. As we observed the posts, comments, and tweets related to the New Year, we saw the typical expressions on Facebook and Twitter of best wishes for the coming year and pithy observations about the past year. What we didn’t expect was that users of the two popular social media sites would have different outlooks on the world.

As we enter 2012, Facebook users are more optimistic than Twitter users.

You’re probably wondering how we can say that. Well, we looked at all of the public posts on Facebook and Tweets on Twitter that contained “Happy New Year.” For all of those posts and Tweets, we compared the use of positive words such as “better” and “good” to the use of negative words such as “worse” and “bad.” We found that Tweets with positive words appeared 8 times more frequently than Tweets with negative words. You might be thinking a ratio of 8 to 1 is pretty optimistic…

It may be, but posts on Facebook had a ratio of 40 to 1–such a huge difference led us to speculate that Facebook is a more optimistic place than Twitter.
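For the curious, a ratio like this can be computed in a few lines. This is only a sketch of the idea, not the exact analysis we ran: the word lists contain just the examples named above, and the sample posts are made up.

```python
import re

POSITIVE = {"better", "good"}
NEGATIVE = {"worse", "bad"}

def optimism_ratio(posts):
    """Ratio of posts containing a positive word to posts containing a negative word."""
    positive = negative = 0
    for post in posts:
        words = set(re.findall(r"[a-z']+", post.lower()))
        positive += bool(words & POSITIVE)
        negative += bool(words & NEGATIVE)
    return positive / negative if negative else float("inf")

# Made-up posts that all contain "Happy New Year".
sample = [
    "Happy New Year! This one will be better",
    "Happy New Year, hoping for good things",
    "Happy New Year... last year was bad",
    "Happy New Year everyone",
]
print(optimism_ratio(sample))
```

Note that a post containing both a positive and a negative word counts on both sides of the ratio in this sketch.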

Interesting stuff. Could be a variety of reasons for the difference, from the mix of users on each service to the fact that Facebook is used to communicate with friends, while Twitter is used to broadcast to followers. We’ll leave the speculation up to you.

Gnip Cagefight #2: Pumpkin Pie vs. Pecan Pie

Thanksgiving is a time for family gatherings, turkey with all the delicious fixings, football, and let’s not forget, pie! If your family is anything like mine, multiple pie flavors are required to satisfy the differing palates and strong opinions. So we wondered, which pies are people discussing for the holiday? What better way to celebrate and answer that question than with a Gnip Cagefight.

Welcome to the Battle of the Pies!

For those of you who have been in a pie eating contest or had a pie in the face, you know this one will be a fight all the way down to the very last crumb. In one corner (well actually it is the Gnip Octagon so can you really have corners, oh well) we have The Traditionalist, pumpkin pie, and in the opposite corner, The Newcomer, pecan pie. Without further ado, Ladies and Gentlemen, Let’s Get Ready to Rumble, wait wrong sport. Let’s Fight!

Six Social Media Sources, Two Words, One Winner . . . And the Winner Is . . .


Source               Pumpkin Pie   Pecan Pie   Winning Ratio (Pumpkin Pie to Pecan Pie)
Twitter              X                         4:1
Facebook             X                         5:1
Google+              X                         6:1
Newsgator            X                         3:1
WordPress            X                         5:1
WordPress Comments   X                         2:1
Overall              +6 Winner!    +0 :(


We looked at one week’s worth of data across six of the top social media sources and determined that pumpkin pie “takes the cake” (so to speak) across every source.

In this case, it is interesting to point out that in sources like Twitter, Facebook, Google+ and WordPress we see higher winning ratios, while sources that tend to have higher latency such as Newsgator and WordPress Comments were a little more even. Is this because, on further consideration, pecan pie sounds pretty good? Or is it that everyone will have to have two pies and, with pecan as the traditional second, it is highly discussed?

Top Pie Recipes

Even though pumpkin pie was our clear winner, we thought it would be fun to share a few of the most popular holiday pie recipes by social media source:

  1. Twitter – Cook du Jour Gluten-Free Pumpkin Pie and Pecan Pie Video Recipe from
  2. Facebook – Ben Starr’s Pumpkin Bourbon Pecan Pie Recipe
  3. Newsgator – BlogHer’s Pumpkin Pecan Roulade with Orange Mascarpone Cream Pie Recipe
  4. WordPress and WordPress Comments – Chocolate Bourbon Pecan Pie from

Non-Traditional Thanksgiving Pies

Another interesting fact that came out of this Cagefight was the counts of non-traditional Thanksgiving pies that were mentioned across the social media sources we surveyed. Though we rarely find these useful for communicating numerical values effectively, you can’t not have a pie chart in this post.

Happy Thanksgiving!

Gnip Cagefight #1: Beer vs. Wine

Welcome to the very first edition of the Gnip Cagefight! Over the next couple of weeks we’ll select a common word pair to enter the Gnip Octagon to fight to the finish in a no holds barred battle of Tweets. Two words will enter. Only one will leave.

In addition to crowning the victor, we’ll also call out some of the fun, interesting, strange, and bizarre trends that we glean from the data. Leave us a comment with any contenders you’d like to see in the future.

Now without further delay, let’s dive into our first Gnip Cagefight… Put your hands together for Wine vs. Beer!

And the Winner is . . .

We looked at one week of Tweets that contained the words “beer” or “wine,” and beer was the more commonly used term, appearing in 53.1% of those tweets vs. 48.1% for wine. Now you might be saying, “Hey, that’s more than 100%!” You are correct! That’s because beer and wine appear together about 13,801 times–along with an uncomfortable hangover, we presume. (Is this an opportunity to sell aspirin?)
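The "more than 100%" effect is just inclusion-exclusion: Tweets that mention both terms are counted once for each. Here is a small sketch with made-up Tweets, followed by the overlap implied by the two rounded percentages above.

```python
def term_shares(tweets, a, b):
    """Share of tweets containing term a, term b, and both (whole-word matches)."""
    word_sets = [set(t.lower().split()) for t in tweets]
    total = len(word_sets)
    has_a = sum(a in w for w in word_sets)
    has_b = sum(b in w for w in word_sets)
    has_both = sum(a in w and b in w for w in word_sets)
    return has_a / total, has_b / total, has_both / total

# Made-up sample in which every Tweet mentions at least one of the two terms.
sample = ["beer time", "cold beer", "beer and wine tonight", "red wine", "glass of wine"]
beer, wine, both = term_shares(sample, "beer", "wine")
print(f"beer {beer:.0%}, wine {wine:.0%}, both {both:.0%}")

# With the percentages from this post: 53.1% + 48.1% - 100% is the share containing both.
overlap_pct = round(53.1 + 48.1 - 100, 1)
print(f"overlap implied by the percentages: {overlap_pct}%")
```

Because every Tweet in the sample contains at least one term, the beer share plus the wine share minus the overlap comes back to exactly 100%.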

With beer as our victor, we wanted to answer the age old question . . .

What time is Beer Thirty?

To answer this question, we analyzed the volume of Tweets containing the term “beer” throughout each day and averaged that across the week’s worth of data we collected. Each Tweet’s time was moved into the time zone of the Tweeter and normalized against the daily cycle of Tweet volume. Based on the graph below, true beer thirty is 5pm local time. This gives great meaning to the saying “It’s 5 o’clock somewhere.”
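The two steps described here, shifting each Tweet into local time and normalizing against overall Tweet volume, can be sketched as follows. This is an illustration under assumptions rather than our production code: each Tweet is a made-up pair of UTC timestamp and UTC offset, and the sketch pools all days together instead of averaging day by day.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def local_hour(utc_time, utc_offset_hours):
    """Shift a UTC timestamp into the author's local time and return the hour (0-23)."""
    return (utc_time + timedelta(hours=utc_offset_hours)).hour

def normalized_hourly_share(term_tweets, all_tweets):
    """For each local hour, the term's Tweet count divided by the count of all Tweets."""
    term_counts = Counter(local_hour(t, off) for t, off in term_tweets)
    base_counts = Counter(local_hour(t, off) for t, off in all_tweets)
    return {h: term_counts[h] / base_counts[h] for h in sorted(base_counts)}

def utc(hour):
    # Arbitrary made-up date; only the hour matters for this sketch.
    return datetime(2011, 9, 1, hour, tzinfo=timezone.utc)

# (UTC time, UTC offset of the Tweeter) pairs, all invented for illustration.
all_tweets = [(utc(21), -4), (utc(23), -6), (utc(17), 0), (utc(16), -4), (utc(18), -6), (utc(19), -7)]
beer_tweets = [(utc(21), -4), (utc(23), -6), (utc(16), -4)]

share = normalized_hourly_share(beer_tweets, all_tweets)
peak_hour = max(share, key=share.get)
print(peak_hour, share)
```

Dividing by the baseline matters because raw counts mostly show when people tweet at all, not when they tweet about beer.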

Beer Drinkers have a Wider Vocabulary than Wine Drinkers

Another fascinating tidbit that came out of the data was that beer drinkers have a wider vocabulary than wine drinkers. Normalizing for the number of words used, we find that beer drinkers use 14% more distinct words than wine drinkers. Wine drinkers tend to use the same idioms, for example, “glass of wine” or “red wine,” more than beer drinkers use their most common phrases. Does this mean that beer drinkers are 14% smarter than wine drinkers? Or that they use very creative spelling? We won’t wade any further into that question, but you can be the judge.
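One simple way to "normalize for the number of words used" is to count distinct words over equal-sized samples of text, since a bigger pile of Tweets will always contain more distinct words. The sketch below takes that approach with made-up Tweets; it is an assumption about one reasonable normalization, not a record of the exact method behind the 14% figure.

```python
import re

def tokens(tweets):
    """Lowercased words from a list of Tweets, in order."""
    return [w for t in tweets for w in re.findall(r"[a-z']+", t.lower())]

def distinct_words(tweets, sample_size):
    """Count distinct words among the first `sample_size` words of a corpus."""
    return len(set(tokens(tweets)[:sample_size]))

# Made-up Tweets for illustration only.
beer = ["cold beer after work", "craft beer festival this weekend", "beer and pizza night"]
wine = ["glass of wine", "glass of red wine", "red wine tonight"]

# Compare over the same number of words so the larger corpus gets no head start.
n = min(len(tokens(beer)), len(tokens(wine)))
beer_distinct = distinct_words(beer, n)
wine_distinct = distinct_words(wine, n)
print(n, beer_distinct, wine_distinct)
```

Repeated idioms such as "glass of wine" pull the distinct-word count down, which is the effect described above.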

That’s all for our inaugural Gnip Cagefight. Hope you enjoyed it and be sure to let us know what words you’d like to see in the octagon in the future.

The VMAs, Lady Gaga and Data Science

Hi everyone. I’m the new Data Scientist here at Gnip. I’ll be analyzing the fascinating data that we have coming from all of our varied social data streams to pull out the stories, both impactful and trivial, that are flowing through social media conversations. I’m still getting up-to-speed but wanted to share one of the first social events that I’ve dug into, the 2011 MTV Video Music Awards.

Check out the info below and let me know in the comments what you think and what you’d like to see more of.  And now, on with the show…

3.6M Tweets Mention “VMA”

The volume of tweets containing “VMA” rose steadily from a few hours before the VMA pre-show was broadcast, up to the starting of the pre-show at 8:00 PM ET (00:00 GMT) and remained fairly strong during the event. It trailed to low volume within the hour after the VMA broadcast ended at 11:15 PM ET (03:15 GMT). Tweets mentioning “VMA” totaled 3.6M during the 7 hours surrounding and including the VMA broadcast.


Lady Gaga Steals the “Tweet” Show

The largest volume of tweets for an individual artist is the mentions of “gaga.” Lady Gaga performed early in the show and the surge of tweets during her performance surpassed 35k tweets per minute for about 8 minutes. Again in the second half, Lady Gaga tweet volume briefly jumped above 50k per minute. Tweets mentioning “gaga” totaled 1.8M during the 7 hours surrounding and including the VMA broadcast.

As you can see in the chart below, other artists that garnered significant tweet volumes included Beyonce’, Justin Bieber, Chris Brown, Katy Perry and Kanye West. Perry, West and Brown got a lot of attention during their appearances, while Justin Bieber and Lady Gaga led the counts in volume by maintaining a fairly steady stream of tweets during the broadcast.

Term              Representation of Tweets Sampled
VMA               44 %
Lady Gaga         21 %
Beyonce           16 %
Justin Bieber     10 %
MTV               9.2 %
Chris Brown       8.0 %
Katy Perry        5.6 %
Kanye West        4.8 %
Jonas             3.5 %
Taylor Swift      2.1 %
Rihanna           1.1 %
Eminem            0.55 %
Michael Jackson   0.18 %
Ke$ha             0.17 %
Cher              0.14 %
Paramore          0.12 %




In contrast, it is interesting to note that Beyonce’ and Chris Brown gained most of their tweet attention around their performances, with very large surges in tweet volume. Beyonce’s volume–another Beyonce’ bump–continues after her performance as Twitter users absorb the news of her pregnancy.



One surprise that emerged from looking for other artists connected to the VMAs was Michael Jackson’s tweet volume. While Jackson garnered many Retweets after winning the King of the VMA poll, he also received a large number of natural tweets lamenting his passing and celebrating his past successes.


The free-form text and limited length of Twitter messages create a number of challenges for monitoring an event via Twitter comments. People refer to the event differently and focus on different parts of the event. There will be spelling variations and differences in idioms and nicknames used to describe people and performances. Do we search for “Bieber”, “Beiber” and “Justin”?  Will tweeters use “Beyonce” or “Beyonce’”? Knowledge of what we are monitoring is required; preparing tools to adapt to things we learn during the event is also essential to getting good results.

One effective strategy is to use one or two tokens to identify tweets related to the event. The objective is to choose terms that we know are related to the event, that won’t be widely used outside the event, and that will give a representative sample–diverse and with sufficient volume. Once we have started to collect the event-focused twitter sample, we can look for relevant terms correlated with the filter term to find out what else people are tweeting about during the event.
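Here is a rough sketch of that second step: given a sample collected with a filter term, count which other words show up alongside it. The function name, the tiny stopword list, and the sample tweets are all made up for illustration; a real analysis would need a proper stopword list and a way to fold spelling variants together.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "at", "is", "and", "of", "to"})

def cooccurring_terms(tweets, filter_term, top=3):
    """Most common words appearing in tweets that contain the filter term."""
    counts = Counter()
    for tweet in tweets:
        words = set(re.findall(r"[a-z0-9']+", tweet.lower()))
        if filter_term in words:
            # Count each word once per tweet, ignoring the filter term itself.
            counts.update(words - {filter_term} - STOPWORDS)
    return counts.most_common(top)

# Made-up tweets for illustration.
tweets = [
    "Gaga opens the VMA show",
    "VMA performance by Gaga is wild",
    "Beyonce at the VMA",
    "Watching football tonight",
]
top_terms = cooccurring_terms(tweets, "vma", top=1)
print(top_terms)
```

Terms that surface this way can then be added to the filter or tracked as their own time series during the event.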

Hope you enjoyed this first post. Look for more to come.