Sampling: Not Just For Rappers Anymore

The Gnip data science team (myself, Dr. Josh Montague, Brian Lehman) has been thinking about firehose sampling in the last few weeks.  We see research and hear stories based on analysis of a randomly sampled subset of the Twitter, Tumblr or other social data firehose. This whitepaper looks at the common trade-off we see in sampling: we want a data stream that represents the entire audience of a social platform, while controlling costs and limiting activities to match analysis capacity.

Both Gnip’s customers and the greater social data ecosystem frequently use sampling to assess patterns in social data. We created a whitepaper that provides a step-by-step methodology to calculate social activity rates, confidence in our rate estimates and the ability to identify signals of emerging topics or stories. We wanted to know what we could detect and with what certainty.

The sampling whitepaper describes the tradeoffs between the three key variables in sampling social data: rate of activities (e.g. the number of blog posts or Tweets over time), confidence levels around our estimates of rate, and meaningful changes in those rates, i.e., signal. These three variables are interrelated and present a measurement challenge: once you fix two of these parameters by choice or constraint, you calculate the third.

Sampling - Confidence vs Activities vs Signals

While the whitepaper deals with the tradeoffs of activity, signal and confidence when designing a measurement, that is a little abstract. To make this more concrete, think of the trade-off problem as a way of addressing questions like those below. If you’ve asked any of these questions in your own work with social data, we think this whitepaper might help.

  • The activity rate has doubled from five counts to ten counts between two of my measurements. Is this a significant change, or is this expected variation e.g. due to low-frequency events?

  • I want to minimize the total number of activities that I consume (for reasons of cost, storage, etc). How can I do this while still detecting a factor of two change in activity rate in one hour?

  • How long should I count activities to detect a change in rate of 5%?

  • How do I describe the trade-off between signal latency and rate uncertainty?

  • How do I define confidence levels on activity rate estimates for a time series with only twenty events per day?

  • I plan to bucket the data in order to estimate activity rate. How big (i.e. what duration) should the buckets be?

  • How many activities should I target to collect in each bucket in order to have 95% confidence that my activity rate estimate is accurate for each bucket?
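To give a feel for the last few questions, here is a minimal sketch. It assumes activities arrive as a Poisson process and uses the normal approximation, in which the relative half-width of a 95% confidence interval on a count of n events is roughly 1.96 / sqrt(n). This is a textbook approximation, not necessarily the method used in the whitepaper, and the function names and the example rate of 20 activities per minute are made up for illustration.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def counts_needed(relative_half_width, z=Z_95):
    """Smallest count n for which z / sqrt(n) <= relative_half_width.

    Assumes Poisson counts and the normal approximation, which is
    reasonable for large n and poor for very small counts.
    """
    return math.ceil((z / relative_half_width) ** 2)


def bucket_duration(rate_per_minute, relative_half_width, z=Z_95):
    """Minutes of counting needed at an assumed average activity rate."""
    return counts_needed(relative_half_width, z) / rate_per_minute


if __name__ == "__main__":
    print(counts_needed(0.10))          # counts for a +/-10% estimate
    print(counts_needed(0.05))          # counts for a +/-5% estimate
    print(bucket_duration(20.0, 0.10))  # minutes at 20 activities/minute
```

Under these assumptions, halving the uncertainty takes roughly four times as many activities, which is the latency cost described in the rest of this post.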

Our summer data science intern, Jinsub Hong, and data scientist Brian Lehman created an animation to help visualize the relationship between confidence interval size, time of observation (or, alternatively, the number of activities observed), and the signal we can detect in a firehose of social data.

The animation below shows confidence intervals for different bin sizes. As the bin size increases, we count more events, so the rate estimate becomes increasingly certain. However, we have to wait longer to get the result (latency).

At what bin size can we be confident that the activity rate has changed significantly?  For short buckets of only a minute or two, the variation in the measured rate is large, comparable to the potential signal.  For longer buckets, the signal becomes more distinct, but the time we have to wait in order to make this conclusion goes up accordingly.

The first and last frame show representative potential signals. In the first frame, this potential signal is about the same size as the variability of the activity rate, so we can’t conclusively say the activity rate has changed. With the larger bin size in the final frame, the signal is much larger than the activity rate uncertainty. We can be confident this represents a real change in the activity rate.
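One way to make the "is this change real?" question concrete is an exact conditional test. The sketch below assumes two buckets of equal duration with Poisson counts; under the null hypothesis of an unchanged rate, the first count given the total follows a Binomial(n, 1/2) distribution. This is a standard test chosen for illustration, not necessarily the whitepaper's method, and the function names are hypothetical.

```python
from math import comb


def binomial_cdf(k, n):
    """P(X <= k) for X ~ Binomial(n, 0.5), computed exactly."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def rate_change_p_value(count_a, count_b):
    """Two-sided p-value for 'both equal-length buckets share one rate'.

    Conditional on the total count, the smaller count is compared
    against a Binomial(total, 0.5) distribution.
    """
    total = count_a + count_b
    smaller = min(count_a, count_b)
    return min(1.0, 2.0 * binomial_cdf(smaller, total))


if __name__ == "__main__":
    # Short buckets: five counts versus ten counts.
    print(round(rate_change_p_value(5, 10), 3))  # about 0.30, not significant
    # Ten times longer buckets with the same factor-of-two change.
    print(rate_change_p_value(50, 100) < 0.001)  # True
```

The same factor-of-two change that is indistinguishable from noise at five versus ten counts becomes clear at fifty versus one hundred, at the price of waiting ten times longer.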

For full details, you can download the paper. If you have questions about the whitepaper, please leave a comment below.

Social Data Mashups Following Natural Disasters

Exactly what role can social data play in natural disasters?

We collaborated with Gnip Plugged In partner, Alteryx, to do data mashups around FEMA relief related to Hurricane Sandy for a recent presentation at the Glue Conference. Alteryx makes it easy for users to solve data analysis problems and make decisions.

Gnip and Alteryx created a data mashup for Hurricane Sandy using six different data sources showing what kinds of tools we can create after natural disasters. Starting with mapping from TomTom, the data mashup also included data about businesses from Dun & Bradstreet, demographic data from Experian, registrations from FEMA, geotagged articles from Metacarta, and geotagged Tweets from Gnip.  We concentrated on FEMA efforts and reactions during spring 2013.

This kind of data mashup allows us to drill down into multiple aspects of evacuation zones. One of the easiest examples of this mashup is the ability to see what services and resources are available from businesses (from Dun & Bradstreet) while complementary official efforts are organized.

FEMA Hurricane Sandy Maps

Or it can help prioritize which areas to assist first by mashing population densities with registrations from FEMA.

FEMA Hurricane Sandy Registrations

FEMA Hurricane Sandy Density Map

Using geotagged social data from Twitter is another way to identify areas that need help, as well as monitor recovering areas. Combining sentiment analysis with Tweets provides instant feedback on the frustrations or successes that constituents are feeling and seeing.

Hurricane Sandy Social Tweets

We think this type of data mashup with Alteryx is just the beginning of what is possible with social data. If you have questions or ideas for data mashups, leave them in the comments!

Aspirational Brands on Tumblr: Lexus vs. Toyota

Gnip conducted a brief analysis of the Toyota family of brands (Toyota, 4Runner, Camry, Highlander, Lexus, Prius, Rav4, Scion, Sequoia, Tacoma, Tundra) on multiple social media platforms. We looked at brand mentions on Tumblr, Twitter, WordPress and WordPress comments during the period of Oct. 15 to Nov. 15, 2012.

As you would expect, Toyota was the most frequently mentioned brand on each social platform, with one enormous exception – Tumblr. Lexus had 5 times as many mentions on Tumblr as Toyota. This highlights how aspirational brands do exceptionally well on Tumblr where niche communities of fans often form around brands. (Attention brand managers, this happens whether the company is involved or not). A central component of Tumblr is visual content, which also plays well with aspirational brands. Furthermore, Tumblr content is both extremely viral and has a long shelf life meaning that content shared on Tumblr can be shared for longer periods of time and jump to more diverse sub-groups within the network than other social networks. During the month Gnip tracked mentions, Lexus received more than 200,000 mentions while Toyota received 40,000.

In social media, it is easy to rely on Twitter as a kind of alert system for when content is being shared, but at Gnip we’ve seen time and time again that content that pops up elsewhere doesn’t always pop up on Twitter. Each social media network has its own attributes, audience and modes of interaction. Because of likes, reblogging, and the way timelines are read by Tumblr users, Tumblr has active communities that aren’t found elsewhere.

Lexus on Tumblr

Four Themes From the Visualized Conference

The first Visualized conference was held in mid-town Manhattan last week. Even with Sandy and a nor’easter, the conference went off with only a few minor hiccups. The idea behind Visualized is a TED-like objective of exploring the intersection of big data, story telling and design. It worked.

Throwing designers and techies together is one of my favorite forums because of what is common and what is different. On one hand, artists are increasingly skilled with technical tools; on the other, these people are often coming at things from very different perspectives.

The advantages of mixing these people at Visualized go beyond simple idea sharing.  Each person specializes, leading to amazing expertise, skill, and focused perspective, but also leaving something out. It is not that everyone can learn to do everything, but rather, by sharing projects, methods and tools, we can learn what to ask and who to seek out for collaboration.  The advantage of this mix is that it is the most reliable way to produce projects that evoke emotion with story, design and data to engage and inform.

We were treated to amazing technical talent and creativity, evident in, for example, Cedric Kiefer’s generative dancer reproduction “unnamed soundsculpture.”  To create the basic model, his team started with a song and a dancer, knitting together the 3D surface images from three Microsoft Kinect cameras. They re-generated the movie of the dance by simulating the individual particles captured in the imaging and then enhancing these to generate more particles under the influence of “gravity” and “wind” driven by the music.

unnamed soundsculpture from Daniel Franke on Vimeo.

Cedric and his team radically expand ideas of numeric visualization by capturing and building on organic physical data in complex and subtle ways, generating a whole, engrossing new experience from the familiar elements.

Four themes surfaced repeatedly in the ideas and presentations of the speakers:


Teams

Most of the projects were produced by teams made up of people with a handful of diverse skills and affinities. I heard descriptions of teams such as, “we have a designer (color, composition and proportion sense, works in Illustrator, Photoshop, pen and paper…), a data scientist (data munging, machine learning, statistical analysis…), a data visualization artist (JavaScript, D3 skills, web API mashup skills…), someone who is driven by narrative and story telling (journalist, marketing project lead…), a database guy, etc.”

Assembling and honing these teams of technical and artistic creatives is probably a rare skill in itself and the result is a powerful engine of exploration, creativity and communication.

Hilary Mason summed up the second-level data scientist talent shortage clearly: “Every company I know is looking to hire a data scientist; every data scientist I know is looking to hire a data artist.”  As broad as data scientists’ skills are, many are recognizing the value of talented designers with the appropriate programming skills for crafting a clear, engaging message.

The New York Times teams (two different teams presented), the WNYC team of two, and many others showed the power of teams creating together and bringing diverse talents to projects.

There were a couple of notable individual efforts. My favorite was Santiago Ortiz’s beautiful, complex and functional visualization and navigation of his personal Knowledgebase. His design elegantly uses the 7-set Venn diagram, and his deep insights into searching by category and time come together perfectly.


Story

Sniffing out the story is fundamental to projects that evoke emotion with story, design and data to engage and inform. Journalists can smell drama and conflict and ask lots of questions. They have a sense of where to dig deeper. They are able to stick to the thread of the story and have a valuable work ethic around finding the details and tying up loose ends.

A large part of the success of Shan Carter and his team in creating the New York Times paths to the White House win visualization comes from their ability to return over and over to the basic idea of making a relevant, accurate and understandable visualization of the various likely outcomes in each of the battleground states.  This visualization went through 257 iterations checked into their GitHub repository, with a few evident cycles of creative expanding followed by refocusing on the story.

Data Mashups

API-mashup skills found their best examples in the news teams. WNYC’s accomplishments in creating data/visualization mashups to communicate evacuation zones, subway outages, flood zone information updated in real-time during the storms, and other embeddable web widgets were amazing.  While their designs didn’t have the polish of some of the “slower” work presented, they produced great, accurate and timely results in days and sometimes hours.


Design

“A visualization should clarify in ways that words cannot.” (Sven Ehrmann)

This summed up what I found awe-inspiring and satisfying in the design work. Since I primarily work with data visualization, I often rely on the graph-reading skills of my audience rather than optimized design. This may be necessary for many business applications, but when the message is important and the investment you can reasonably expect from your audience is uneven, to take short cuts on design is to completely miss opportunities to engage and inform.  Great designers are masters at creating memory because they are able to reliably create “emotion linked to experience” (Ciel Hunter).

Jake Porway summed up data, team, story and design in the observations section of his presentation:

  • Data without analysis isn’t doing anything
  • Interdisciplinary teams are required
  • Visualization is a process (see the example from Shan Carter at NY Times above)
  • Tools make amazing outcomes possible with limited resources
  • There is a lot of potential to do a great deal of good when we learn to evoke emotion with story, design and data to engage and inform.


Strata 2012 – Strata Grows Up

O'Reilly Strata Conference Making Data Work

O’Reilly Media’s Strata was held in NYC recently, just before Sandy arrived.  The conference was sold out to building capacity.  By this measure, it was the most successful Strata to date. Strata and Hadoop World were combined into a single conference. The week started with tutorials, meet-ups, a mini Maker Faire and Ignite Talks and ended with the more traditional conference format of keynotes, breakouts and an exhibit hall for vendors.

Some of the most interesting and relevant idea-driven keynotes included Mike Flowers talking about data used to understand building code violations in NYC, Rich Hickey addressing opportunities and challenges of adding back to big-data analysis platforms some of the traditional features like indexes, queries and transactions, and Shamila Mullighan’s recommendations for combining internal data with public APIs to enhance your data science.

Tim Estes gave a passionate talk about attention and the “responsibility to know” (and the growing gap between them), asserting “Understanding is a great cause.”  Doug Cutting talked about the future of Hadoop, and Julie Steele interviewed “Mathbabe” (Cathy O’Neil) about real-world vs. academic data science.  Joe Hellerstein addressed the challenges of resources and attention resulting from 80% of a typical data scientist’s activity being spent preparing and transforming data for analysis. Samantha Ravich wrapped up the keynotes with an appeal for data science tools that better match decision makers’ needs and modes of working.


Demand Outstrips Supply for Talent
Many speakers pointed to the dire need for more data science talent. In many cases, this was emphasized by pointing to the data going unanalyzed, answers going unfound, and sometimes open positions unfilled.

In what ways is the data scientist shortage directly problematic, and in what ways has it become shorthand for the larger problem that businesses don’t have the decision processes, managers, infrastructure, etc. needed to effectively make decisions from big data? There seemed to be some glossing over the point that the data science talent shortage is largely an opportunity cost and, to the extent there is uneven use of data in your industry, a competitive issue, rather than an actual cost of doing business today. The shortage of data science talent is accompanied by a shortage of managers able to focus on good questions, direct resources to data science projects and make decisions based on data.

The data scientist is key, but also, only successful in bringing competitive advantages when the context of good questions and data-driven decisions is in place. Concentrating on the shortage of DS when your management team is unprepared to participate in and leverage insights gained from good data science work seems sort of silly–Samantha Ravich’s talk had a clear example of this in how the Bush administration decision process regarding poppy production in Afghanistan went wrong.

Data Science Infrastructure
The other recurring theme was Data Science Infrastructure. Many data scientists have noticed that a huge proportion of their daily work is finding, shaping, loading, moving, transforming and connecting data. The product announcements as well as many of the idea-oriented talks pointed to and quantified this challenge. “Spend more time doing the science part of your job” is the idea behind Platfora, OpenChorus, Impala and Joe Hellerstein’s Data Wrangler.

For Gnip, a personal highlight was Greenplum announcing OpenChorus, a collaborative big data environment with integrations to Gnip, Kaggle and Tableau. Informatica announced their continued work toward a “no-code” environment for big-data analytics. MapR, SAS and other well-known players had their say at keynotes as well.

In general, the last two days of Strata seemed focused more on the line manager of big-data insights and infrastructure in the organization and less on the analysis or visualization practitioner. Some bright spots on the practitioner side were Donald Miner’s MapReduce patterns, Kim Rees on creating great visualizations and Cathy O’Neil on the realities of mining and making decisions based on weak signals in timeseries data.

Twitter and SXSW: Barometer of Trends

Last week we talked about tracking SXSW from 2007 to 2012 using Gnip’s Historical PowerTrack for Twitter. This gave us insight into year-over-year trends in SXSW Tweets and now we’re going to look at how SXSW trends have changed over time.

With every square inch of Austin packed with the social media influential, SXSW provides an interesting avenue to examine trends, big and small, to see what people are talking about on Twitter. Now that companies can use Gnip’s Historical PowerTrack for Twitter to baseline events, it provides a whole other avenue to determine trends.

Party vs. Panel
People have such a love/hate relationship with SXSW. Some people love it for its networking opportunities and great sessions, while other people decry it as one giant party. Letting the data speak for the truth, it seems that in earlier years of the conference, people came for the panels and hopefully to learn something from their peers. But by 2011, the word “party” overtook those interested in “panel” by more than 10,000 Tweets. People were talking about the best places to meet people rather than the best places to learn. That same year, there were 13,072 mentions of the word RSVP in SXSW Tweets talking about plans to find the best parties and likely indulging in the practice of RSVPing for 136 events and actually attending 12 of those events.

Geo-location Wars
While Twitter is useful for helping understand how cultural events are changing, the use cases extend further into helping understand the rise and fall of startups. With the launch of Foursquare and Gowalla at SXSW in 2009, it was the beginning of the so-called geo-location wars. Many people have wondered how Foursquare ended up the winner, and SXSW provides interesting insight into how Foursquare came out on top. Back in 2009, if you looked at SXSW Tweets, it would tell you it was anyone’s game because, surprisingly, Foursquare received only a little more than 100 Tweets more than Gowalla. By 2010, Foursquare had been more clearly marked as the winner, receiving nearly double the Tweets that Gowalla was receiving. At that point, everyone was still writing posts to determine the pros and cons of each service, but the social data was clear: Foursquare had the buzz that year, in part due to its ability to easily publish updates, badges and mayorships on Twitter and perhaps even its rogue game of Foursquare outside the Convention Center. By 2011, Foursquare had completely suckerpunched Gowalla, taking the lion’s share of public voice with nearly 65,000 Tweets to Gowalla’s nearly 8,000 Tweets. By the end of 2011, Facebook ended up making an acqui-hire for the Gowalla team.

BBQ vs. Tacos
This next trend might seem silly, who cares if more people are interested in BBQ or Tacos? I mean, what significant impact can this social data have? But if you’re a restaurant chain or looking to start a new franchise chain, it would be interesting to know about cultural food trends such as the rise of cupcakes as it is happening.

While many have long suspected that Austin was a BBQ kind of town, the social data has shown that at the last SXSW, Tacos overtook BBQ as the most talked about grub to grab. More data science would have to be done to determine if the Taco is becoming a more widespread cultural trend, but when all other Tweet volumes were falling in 2012, the term Tacos was charging full-steam ahead.

This is just the beginning of what social data can tell companies about trends and market research. We think historical social data will prove invaluable to market research, given the sheer volume of conversations that are happening on Twitter.

In The Future, The Data Scientist Will be Replaced by Tools

Some of you are celebrating. Some of you are muttering about how you could never be replaced by a machine.

What is the case for? What is the case against? How should we think about the investments in infrastructure, talent, education and tools that we hope will provide the competitive insights from “big data” everyone seems to be buzzing about?

First, you might ask why try to replace the data scientist with tools?  At least one reason is in the news: The looming talent gap.

Wired UK reports:

“Demand is already outstripping supply. A recent global survey from EMC found that 65 percent of data science professionals believe demand for data science talent will outpace supply over the next five years, while a report from last year by McKinsey identified the need in the US alone for at least 190,000 deep analytical data scientists in the coming years.”

Maybe we should turn to tools to replace some or all of what the data scientist does. Can you replace a data scientist with tools?  An emerging group of startups would like you to think this is already possible. For example, Metamarkets headlines their product page with “Data science as a service.” They go on to explain:

 Analyzing and understanding these data streams can increase revenue and improve user engagement, but only if you have the highly skilled data scientists necessary to turn data into useful information.

Metamarkets’ mission is to democratize data science by delivering powerful analytics that are easy and intuitive for everyone.

SriSatish Ambati of the early startup 0xdata (pronounced hex-data) goes a step further with the idea that “the scale of the underlying data and the complexity of running advanced analysis are details that need to be hidden.” (GigaOm article)

On the other side of the coin, Cathy O’Neil at Mathbabe set out the case in her blog a few weeks ago that not only can you not replace the data scientist with tools, you shouldn’t even allow the non-data-scientist near the data scientist’s tools:

 As I see it, there are three problems with the democratization of algorithms:

 1. As described already, it lets people who can load data and press a button describe themselves as data scientists.

 2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.

 3. Businesses might think they have awesome data scientists when they don’t. […] posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

If this is a topic that interests you, we’ve submitted a panel on this topic for SXSW this spring in Austin to discuss issues surrounding data science and tools. We will talk about what tools are available today, how they make us more effective as well as some of the pitfalls of tool use. And we will look into the future of tools to see where and if data scientists can be replaced by tools. Would love a vote!


  • John Myles White (@johnmyleswhite) – Coauthor of Machine Learning for Hackers and Ph.D. student in the Princeton Psychology Department, where he studies human decision-making.
  • Yael Garten (@yaelgarten) – Senior Data Scientist at LinkedIn.
  • James Dixon (@jamespentaho) – CTO at Pentaho, open source tools for business intelligence.

Update: One of our panelists, John Myles White, has provided some thoughtful analysis of companies that rely on automating or assisting data science tasks. See his blog post.

The Social Cocktail, Part 3: Many Publishers Build One Story

In the first post, we looked at high-level attributes of the social media publishers. Then, we spent time looking at the social media responses to expected and unexpected events. To end this series, let’s dive into an example of the evolution of a single story across a mix of publishers. This will provide some intuition into how the social cocktail works when examining a real-world event— in this case, the JPMorgan-Chase $2+ billion loss announcement on May 10, 2012.

JPMorgan-Chase Trading Loss

Twitter: Fast and Concise
On May 10, 2012, immediately after market closing, JPMorgan-Chase CEO Jamie Dimon held a shareholder call to announce a $2 billion trading loss. While traditional news agencies reported the call announcement late in the afternoon, Twitter led the way with reports from call participants who started tweeting while on the call a few minutes after it started.

To see how the volume on Twitter evolved, see figure 1. In each case, the points represent activity volumes on the topic of JPMorgan and “loss” while the lines represent function fits to either the Social Media Pulse or a Gaussian curve (a simple approximation for expected event traffic when averaging over the daily cycle).

As Reuters and others released news stories and Europe started to wake up, a second Twitter pulse is visible. Toward the right-hand side of the graph, the daily cycle of Tweets dominates the conversation about JPMorgan and “loss” with a curve more characteristic of broadly reported, expected events.

Twitter Reacts to JPMorgan Trading Loss

Figure 1. Twitter and StockTwits audiences comment on JPMorgan and “loss” after the announcement of a $2B trading loss on the evening of May 10, 2012. Volumes are normalized so that peak volume = 1 for each publisher.
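The normalization mentioned in the caption can be sketched in a few lines. The publisher names come from this post, but the hourly volumes below are invented for illustration and the helper name is hypothetical.

```python
def normalize_to_peak(volumes):
    """Scale a volume series so that its largest value equals 1."""
    peak = max(volumes)
    if peak <= 0:
        raise ValueError("series needs at least one positive volume")
    return [v / peak for v in volumes]


if __name__ == "__main__":
    # Invented hourly counts, NOT the data behind Figure 1.
    hourly_volumes = {
        "Twitter": [120, 960, 480, 240],
        "StockTwits": [4, 16, 8, 2],
    }
    normalized = {
        publisher: normalize_to_peak(series)
        for publisher, series in hourly_volumes.items()
    }
    for publisher, series in normalized.items():
        print(publisher, series)
```

Normalizing each publisher separately throws away relative size, but it lets the timing and shape of reactions on very different-sized platforms be compared on one chart.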

StockTwits: Fast and Concise, Focused

Much of the analysis that applies to Twitter applies to StockTwits–the major exceptions are in the expertise of the users and focus of the content. The StockTwits service serves traders, and participants are mostly professional investors. Because the audience and the content are curated, there is very little off-topic chatter.  Further, much of the content is specific analysis of JPMorgan’s loss, analysis of the stock price movement following the announcement and information about after-hours price indicators.
On Friday (May 11th), discussion of the loss reaches only about 40% of the peak of the night before. This is likely due to the message rapidly saturating the highly connected community on StockTwits.

Comments: Both Fast and Slow, Concise

Because there was a lot of financial news attention on the story, news stories started to appear soon after the call and these attracted comments immediately (this was the fast response). The data shown in Figure 2 includes both comments from Automattic and Disqus. These comment platforms are used for comments on both personal blogs and on news stories posted online by news organizations, so there is a mix of comments on news stories as well as personal analysis.

A graph about comments on the JPMorgan trading loss

Figure 2.  Commenters on blogs and news stories react to the announcement of a $2B trading loss on the evening of 10 May 2012, and an even stronger contingent reacts early on 11 May. Volumes are normalized so that peak volume = 1.

More-considered news and blog stories appeared on Friday morning, May 11th, and these spurred a second (slower) pulse of comment responses.

An additional pattern that is often seen in comments is that people tend to read blogs at certain times of day (e.g. morning or evening) by habit. Because of this, we sometimes see comment volumes spiking at the start or end of the day in very active timezones.

Tumblr: Medium and Very Rich

The Tumblr audience reacted to the news as if the story had broken on Tumblr rather than on traditional news. This is unique among the publishers studied here. This pattern of slowly growing traffic during the first few hours after the shareholder call may indicate the nature of the conversation on Tumblr. Rather than an event-response reaction, as on Twitter, or a considered reaction, as with blogs, the reaction of the audience on Tumblr accelerates as the type of content Tumblr users reblog appears in the network. While the initial posts on Tumblr refer to news stories, the spread of the story through reblogging happens as a ramp up to the peak over a few hours.

The following day, the Tumblr story evolves like an expected event.

Not only is the timeline unique, but Tumblr content is also unique. Early posts have rich media including political cartoons and more right-brained political commentary and humor than the text-comment crowd. Adding Tumblr to your social media mix may present additional challenges in evaluating and analyzing the content, but the sensibilities as well as the activity of this audience adds a dimension not found in the content from the other publishers.

Blogs: Medium and Rich

A few quick, factual reports from the call were published in the form of blog posts as can be seen by the slight “heaviness” in the curve at the end of the day (May 10th). However, the large majority of the blog traffic is the traditional, considered and refined reactions published throughout the following day. The traffic on May 11th follows the pattern of an event everyone already knows about.  The discussion here is analysis and commentary as people explore the implications of the story.

The large majority of the blog content is text or text with a picture of Mr. Dimon. Stories vary from dozens of words to a few thousand.

Graph Showing Blog Reactions to JPMorgan Trading Loss

Figure 3.  Content-rich and text-rich reactions to the announcement of a $2B trading loss on the evening of 10 May 2012.  In-depth analysis continues with heavy posting during the day on 11 May. Volumes are normalized so that peak volume = 1 for each publisher.

Finally, take a look at these timelines shown together in Figure 4.  This view gives a clear indication of the timing of reactions between the publishers.

Social media reaction to JPMorgan trading loss

 Figure 4. The points show the normalized volume of activities about “JPMorgan” and “loss” following the May 10th announcement from Jamie Dimon. Lines represent fits to models of typical social media reactions.  Volumes are normalized so that peak volume = 1 for each publisher.

This example story demonstrates the potential of mixing perspectives, audience and styles of conversation in creating a full description of the social media response to events. With the right mix, we can identify stories and emerging topics within minutes and we can quickly characterize the relative size and speed of a story. We can identify user engagement and the rate and focus of content sharing, and dig into deeper analysis. With this mix of social data, we might be getting close to the perfect cocktail.

The Social Cocktail, Part 2: Expected vs. Unexpected Events

In the last post on social cocktails, we looked at some high-level attributes of the social media publishers, and how these attributes might help you choose the right mix of topic coverage, audience, speed and depth. In this post, I will pay a little more attention to the speed dimension by diving into a description of the social media responses to expected and unexpected events. Finally, in the third post, we’ll end with an example of the social cocktail in examining a real-world event: the JPMorgan-Chase $2+ billion loss announcement in May 2012.

When we want to quantify the social data response to an event, we often start by looking at the volume of activities around that event. Applying some filtering allows us to group activities on a topic and look at how the volume of the stream of these activities evolves with time.
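As a rough illustration of the filter-then-count step, the sketch below keeps only activities whose text contains every required term and counts them per minute. The `(timestamp, text)` tuple layout, the function name `volume_per_minute`, and the sample activities are all assumptions made for this example; real activity payloads are richer and real filtering rules are more expressive than simple substring matching.

```python
from collections import Counter
from datetime import datetime


def volume_per_minute(activities, required_terms):
    """Count activities per minute whose text contains every required term.

    `activities` is an iterable of (timestamp, text) pairs. Matching is a
    case-insensitive substring test, which is a deliberate simplification.
    """
    terms = [term.lower() for term in required_terms]
    counts = Counter()
    for timestamp, text in activities:
        lowered = text.lower()
        if all(term in lowered for term in terms):
            # Truncate the timestamp to the start of its minute.
            bucket = timestamp.replace(second=0, microsecond=0)
            counts[bucket] += 1
    # Return (minute, count) pairs in time order.
    return sorted(counts.items())


# Made-up sample activities for illustration only.
sample = [
    (datetime(2012, 5, 10, 21, 0, 5), "JPMorgan announces trading loss"),
    (datetime(2012, 5, 10, 21, 0, 40), "Big LOSS at JPMorgan tonight"),
    (datetime(2012, 5, 10, 21, 1, 10), "JPMorgan earnings call starting"),
    (datetime(2012, 5, 10, 21, 2, 30), "jpmorgan loss is $2B"),
]

series = volume_per_minute(sample, ["jpmorgan", "loss"])
```

Note that minutes with no matching activity are simply absent from the result; for plotting or fitting you would usually fill those gaps with zeros.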

Time-series volume measurements of social data generally show three distinct patterns for breaking events. These patterns can be related to the users’ expectations of the event. The rate of spread and reach of a story also depend on the level of interest of the audience. When looking at the time series of activity volume, we see patterns characterized in Table 1.

[table “4” not found /]

Table 1.  Social data reaction event types.

Expected events often drive significant volume on social media because people want to comment, observe, banter, trash talk, and analyse.  Twitter volumes surge during the World Cup, the Super Bowl, the Video Music Awards, etc.  However, these events show a gradual growth and decay around the event rather than abrupt changes. The bump in volume may last from a few hours to a few days.  It is often somewhat symmetric in its growth and decay. Figure 1 shows some examples of social data volume around expected events including the VMAs and Hurricane Irene.

Figure 1. Expected events show smooth and fairly symmetric growth and decay of social media volume over time. Examples from the VMAs and Hurricane Irene.
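One simple way to put a number on "fairly symmetric" is to compare how long the volume stays above half of its peak before the peak versus after it. This is only a sketch of one possible measure, not the method used to produce the figures; `half_max_widths` is a hypothetical helper and the two series are made-up. It assumes a non-empty series sampled at even intervals with a single dominant peak.

```python
def half_max_widths(volumes):
    """Return (rise, fall) measured in samples.

    rise: samples between the start of the contiguous run at or above
          half of the peak and the peak itself.
    fall: samples between the peak and the end of that same run.
    Assumes `volumes` is non-empty and evenly sampled in time.
    """
    peak_index = max(range(len(volumes)), key=volumes.__getitem__)
    half = volumes[peak_index] / 2

    # Walk left from the peak while the volume stays at or above half-max.
    start = peak_index
    while start > 0 and volumes[start - 1] >= half:
        start -= 1

    # Walk right from the peak while the volume stays at or above half-max.
    end = peak_index
    while end < len(volumes) - 1 and volumes[end + 1] >= half:
        end += 1

    return peak_index - start, end - peak_index


# Made-up series: a smooth bump and an abrupt spike with a slow decay.
smooth_bump = [1, 2, 4, 7, 10, 7, 4, 2, 1]
abrupt_spike = [1, 1, 1, 20, 14, 10, 7, 5, 3]

bump_rise, bump_fall = half_max_widths(smooth_bump)
spike_rise, spike_fall = half_max_widths(abrupt_spike)
```

Similar rise and fall values suggest the symmetric shape of an expected event, while a rise near zero with a longer fall points to an abrupt spike. Real data is noisy, so some smoothing before measuring is usually needed.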

Unexpected events result in abrupt spikes in volume. On Twitter, these spikes may reach tens of thousands of related tweets per minute within five minutes of an event. Social data volume around unexpected events usually grows rapidly until the networks of related users are saturated with the information, then the volume decays exponentially.  These spikes have well-defined growth, peak and decay half-lives. See Social Media Pulse for discussion and analytical details.
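To make the idea of a decay half-life concrete, here is a minimal sketch that estimates it from post-peak volumes. It assumes a pure exponential decay and fits a straight line to the logarithm of the volume by ordinary least squares; this is a generic textbook approach and not necessarily the model described in Social Media Pulse. The function name `decay_half_life` and the synthetic data are assumptions for illustration, and all volumes must be positive for the logarithm to be defined.

```python
import math


def decay_half_life(times, volumes):
    """Estimate the half-life of an exponentially decaying volume.

    Fits ln(volume) = a + slope * t by least squares and returns
    ln(2) / -slope, in the same units as `times`.
    Assumes every volume is positive and at least two distinct times.
    """
    logs = [math.log(v) for v in volumes]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("volume is not decaying over this window")
    return math.log(2) / -slope


# Synthetic post-peak data: volume constructed to halve every 10 minutes.
minutes = [0, 5, 10, 15, 20]
volume = [8000 * 0.5 ** (t / 10) for t in minutes]

half_life = decay_half_life(minutes, volume)  # close to 10 minutes
```

The same fit applied to the pre-peak window, with the sign of the slope reversed, gives a rough growth doubling time, which makes it possible to compare the speed of different stories on a common scale.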

There is a key difference between events that are witnessed by many social media users simultaneously, and news that breaks exclusively on social media.  Examples of the former include a spectacular goal in the World Cup, an earthquake, or Beyoncé performing pregnant at the VMAs.  Because users see these events at the same time, the social data volume instantly jumps to high levels; there is very little ramp up. See Figures 2a and 2b for examples of a simultaneous unexpected event response curve.

Figure 2a – Earthquake in Mexico.

Figure 2b – Steve Jobs Resigns as CEO of Apple

News that breaks partially or exclusively on social media has an observable ramp up as the news spreads through the network of related users. The rate of this spread can depend on the number of followers, the credibility of the source, etc.  These events have a convex ramp up as shown in Figure 3.

Figure 3.  The sad news of Steve Jobs passing away was originally broken through traditional news sources.  But the story quickly travelled on Twitter to many users who were not watching the news, so it took a shape somewhat like a story breaking on Twitter.

You are now equipped to understand the dynamics of an event at a deeper level.  Quantifying the speed of a story will help you consistently characterize and compare the impact of events and the response of the audience. With these tools in hand, you are now ready to add a twist to your social cocktail with the garnish of recognizing activity patterns over time.