A Collection of 25 Data Stories

25 Data Stories from Gnip

At Gnip, we believe that social data has unlimited value and near limitless application. This Data Stories collection is a compilation of applications that we have found in our practice, highlighting both unique uses of social data and interesting discoveries along the way. Our initial Data Story describes an interview with the world’s first music data journalist, Liv Buli, and how she applied social data in her work. Her answers honestly blew us away, and to this day we continue to be surprised by the different ways people are applying social data in each new interview.

The 25 real-world examples detailed in this Data Stories compilation cover an incredible range of topics. Some of my favorite use cases are deeply compelling, such as the use of social data in epidemiology studies and in the aftermath of natural disasters. Others deal with the seemingly more ordinary, such as common recipe substitutions and traffic patterns within cities. One thing is consistent across them all, however: the understanding that social data is key to unlocking previously unknown insights and solutions.

One of the more exciting aspects of these new discoveries is the fact that we are just now learning where social data will take our research next. Today, social data is used in fields as disparate as journalism, academic research, financial markets, business intelligence and consumer taste sampling. Tomorrow? Only the future will tell, but we’re excited to be along for the ride.

We hope you enjoy this collection of Data Stories, and please continue to share with us the stories you find. We think the whole ecosystem benefits when we let the world know what social data is capable of producing.

Download the 25 Data Stories Here. 

Data Story: Eric Colson of Stitch Fix

Data Stories is our blog series highlighting the cool and unusual ways people use data. I was intrigued by a presentation Eric Colson gave at Strata about Stitch Fix, a personal shopping site that relies heavily on data along with its personal shoppers. This was a fun interview for us because several of my female colleagues order Stitch Fix boxes filled with items Stitch Fix thinks they might like. It’s amazing to see how data impacts even fashion. As a side note, this is Gnip’s 25th Data Story, so be on the watch for a compilation of all of our amazing stories.

Eric Colson of Stitch Fix

1. Most people think of Stitch Fix as a personal shopping service, powered by professional stylists. But behind the scenes you are also using data and algorithms. Can you explain how this all works?

We use both machine processing and expert human judgment in our styling algorithm. Each resource plays a vital role. Our inventory is both diverse and vast. This is necessary to ensure we have relevant merchandise for each customer’s specific preferences. However, it is so vast that it is simply not feasible for a human stylist to search through it all. So, we use machine-learning algorithms to filter and rank-order all the inventory in the context of each customer. The results of this process are presented to the human stylist through a graphical interface that allows her to further refine the selections. By focusing her on only the most relevant merchandise, the stylist can apply her expert judgment. We’ve learned that, while machines are very fast at processing millions of data points, they still lack the prowess of the virtuoso. For example, machines often struggle with curating items around a unifying theme. In addition, machines are not capable of empathizing; they can’t detect when a customer has unarticulated preferences – say, a secret yearning to be pushed in a more edgy direction. In contrast, human stylists are great at these things. Yet they are far more costly and slower in their processing. So the two resources are very complementary! The machines narrow down the vast inventory to a highly relevant and qualified subset so that the more thoughtful and discerning human stylist can effectively apply her expert judgment.
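
To make the division of labor concrete, here is a minimal, hypothetical sketch of the pattern Eric describes: a machine scores and rank-orders the inventory for each customer, and only a short list is handed to the human stylist. The scoring function, field names and data are invented for illustration; this is not Stitch Fix’s actual system.

# Illustrative sketch only -- not Stitch Fix's actual algorithm or data model.
def score(item, customer):
    """Toy relevance score: overlap between item attributes and stated preferences."""
    return len(set(item["attributes"]) & set(customer["preferences"]))

def shortlist_for_stylist(inventory, customer, k=30):
    """Machine step: rank the full inventory and hand only the top k to the stylist."""
    ranked = sorted(inventory, key=lambda item: score(item, customer), reverse=True)
    return ranked[:k]

inventory = [
    {"sku": "A1", "attributes": ["edgy", "denim"]},
    {"sku": "B2", "attributes": ["classic", "silk"]},
    {"sku": "C3", "attributes": ["edgy", "leather"]},
]
customer = {"id": 42, "preferences": ["edgy", "leather"]}

# The stylist sees only this shortlist and applies her expert judgment to it.
print(shortlist_for_stylist(inventory, customer, k=2))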

2. What do you think would need to change if you ever began offering a similar service for men?

We would likely need entirely new algorithms and different sets of data. Men are less aware of how things should fit them or what styles would look good on them (at least, I am!). Men also shop less frequently, but typically indulge in bigger hauls when they do. Also, styles are less fluid for men, and we tend to be more loyal to what is tried and true. In fact, a feature to “send me the same stuff I got last time” might do really well with men. In contrast, our female customers would be sorely disappointed if we ever sent them the same thing twice!

So, while the major technology pieces of our platform are general enough to scale into different categories, we’d still want to collect new data and develop different algorithms and features to accommodate men.

3. How did you use your background at Netflix to help Stitch Fix become such a data-driven company?

Data is in the DNA at Stitch Fix. Even before I joined (first as an advisor and later as an employee), they had already built a platform to capture extremely rich data. Powerful distinctions that describe the merchandise are captured and persisted as structured data attributes, through both expert human judgment and transactional histories (e.g. How edgy is a piece of merchandise? How well does it do with moms?). This is a rare capability – one that even surpasses what Netflix had. And the customer data at Stitch Fix is unprecedented! We are able to collect so much more information about preferences because our customers know it’s critical to our efforts to personalize for them. I only wish I had this type of data while at Netflix!

So, in some ways Stitch Fix already had an edge over Netflix with respect to data. That said, the Netflix ethos of democratizing innovation has permeated the Stitch Fix culture. Like Netflix, we try not to let our biases and opinions blind us as we try new ideas. Instead, we take our beliefs about how to improve the customer experience and reformulate them as hypotheses. We then run an A/B test and let the data speak for itself. We either reject or accept the hypothesis based on the observed outcome. The process takes emotion and ego away and allows us to make better decisions.
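
As a rough illustration of that workflow, here is a minimal sketch of a two-proportion z-test of the sort an A/B test might use to accept or reject a hypothesis. The experiment, metric and numbers are hypothetical; this is not Stitch Fix’s actual testing framework.

# Hypothetical A/B test: does a new styling feature improve the keep rate?
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for a difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_ztest(conv_a=420, n_a=5000, conv_b=468, n_b=5000)
print("z = %.2f, p = %.3f" % (z, p))  # accept or reject based on the observed outcome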

Also, like Netflix, we invest heavily in our data and algorithms. Both companies recognize the differentiating value in finding relevant things for their customers. In fact, given our business model, algorithms are even more crucial to Stitch Fix than they are to Netflix. Yet it was Netflix that pioneered the framework for establishing this capability as a strategic differentiator.

4. How else is Stitch Fix driven by data?

Given our unique data, we are able to pioneer new techniques for most business processes. For example, take the process of sourcing and procuring our inventory. Since we have the capability of getting the right merchandise in front of the right customer, we can do more targeted purchasing. We don’t need to make sweeping generalizations about our customer base. Instead, we can allow each customer to be unique. This allows us to buy more diverse inventory in smaller lots, since we know we will be able to send it only to the customers for whom it is relevant.

We also have the inherent ability to improve over time. With each shipment, we get valuable feedback. Our customers tell us what they liked and didn’t like. They give us feedback on the overall experience and on every item they receive. This allows us to better personalize to them for the next shipment and even allows us to apply the learnings to other customers.

5.  Your stylists will sometimes override machine-generated recommendations based on other information they have access to. For example, customers can put together a Pinterest board so that they can show the stylist things they like. Do you think machines will ever process this data?

No time soon! Processing unstructured data such as images and raw text is squarely in the purview of humans. Machines are notoriously challenged when it comes to extracting the meaning conveyed in this type of information. For example, when a customer pins a picture to a Pinterest board, often they are expressing their fondness for a general concept, or even an aspiration, as opposed to the desire for a specific item. While machine learning has made great strides in processing unstructured data, there is still a long way to go before it can be relied on.

Thanks to Eric for the interview! If you have suggestions for other Data Stories, please leave a comment! 

Continue reading

Adobe Adds Foursquare to Adobe Social

One of the most exciting aspects of launching new social media publishers is seeing how that social data is used. Earlier this summer we announced that full coverage of Foursquare check-in data was available exclusively from Gnip. Today, Plugged In to Gnip partner Adobe announced that they’re first-to-market with a commercial Foursquare analytics platform.

While Foursquare has always provided tools for businesses to manage their own presence, the introduction of Foursquare data to Adobe Social creates entirely new opportunities. Large retailers will be able to understand check-in trends over time and compare their performance against the competition. Concert promoters will be able to analyze the impact of specific marketing campaigns. Sports teams will be able to see which games are popular with an engaged social media audience.

We’re thrilled that this data is now available to the broad range of customers who use Adobe Social and can’t wait to see how they use it to derive new insights to drive their businesses.

Kudos to Adobe for moving so quickly to adopt Foursquare into their platform!


Social Data Mashups Following Natural Disasters

Exactly what role can social data play in natural disasters?

We collaborated with Gnip Plugged In partner Alteryx on data mashups around FEMA relief related to Hurricane Sandy for a recent presentation at the Glue Conference. Alteryx makes it easy for users to solve data analysis problems and make decisions.

Gnip and Alteryx created a data mashup for Hurricane Sandy using six different data sources showing what kinds of tools we can create after natural disasters. Starting with mapping from TomTom, the data mashup also included data about businesses from Dun & Bradstreet, demographic data from Experian, registrations from FEMA, geotagged articles from Metacarta, and geotagged Tweets from Gnip.  We concentrated on FEMA efforts and reactions during spring 2013.

This kind of data mashup allows us to drill down into multiple aspects of evacuation zones. One of the simplest examples is the ability to see what services and resources are available from businesses (from Dun & Bradstreet) while complementary official efforts are organized.

FEMA Hurricane Sandy Maps

Or it can help prioritize which areas to assist first by mashing population densities with registrations from FEMA.

FEMA Hurricane Sandy Registrations

FEMA Hurricane Sandy Density Map

Using geotagged social data from Twitter is another way to identify areas that need help, as well as monitor recovering areas. Combining sentiment analysis with Tweets provides instant feedback on the frustrations or successes that constituents are feeling and seeing.

Hurricane Sandy Social Tweets

We think this kind of data mashup with Alteryx is just the beginning of what is possible with social data. If you have questions or ideas for data mashups, leave them in the comments!

Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see which posts were most likely to generate likes, and we interviewed Dmitrii Vlasov, who came in second for the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up-and-coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle accidentally, as it always happens. I read some blog post about the Million Song Dataset Challenge provided by Last.fm and a bunch of other organizations. The task was to predict which songs would be liked by users based on their existing listening history. This immediately made me feel excited because I’m an active Last.fm user and had been reflecting on what connections between people can be established based on their music preferences. But the contest was coming to an end, so I switched to the WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Now I’m a senior CS student in Togliatty, Russia. I can’t say that I have a special background in data science – I had more than a year-long course in probability theory and mathematical statistics at university, some self-taught skills in semantic analysis, and a big love for Python as a tool for implementing ideas. Also, I’ve enrolled in the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts attract more likes (80,000-90,000 characters). This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show the relative correlation between the number of photos, characters and videos and the number of likes received. The big relative “folks love” at certain photo counts means that there were not many posts with that many photos, but most of them were high quality. Quick empirical analysis shows that these are a special type of post – “big photo posts.” They are usually photo reports, photo collections or scrapbooks. For such posts, 10-15 photos are not enough, but at the same time 10-15 photos seem too overloaded for a normal post. The same can be said about a large amount of text in a post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80-90 thousand characters are the winners in the “heavyweight category.” These are big research pieces, novels and political contemplations. The analysis is quite simple, but it shows that if you want to create media-rich or text-rich content, it should be really media- or text-rich. Or you may fall into a hollow of unsuitability.
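
For readers curious what this sort of bucketing looks like, here is a toy sketch that groups posts by photo count and compares average likes per bucket. The data below is invented for illustration; the real analysis ran over the WordPress/GigaOm contest dataset.

# Toy example: average likes per photo-count bucket (made-up data).
from collections import defaultdict

posts = [
    {"photos": 2, "likes": 5},
    {"photos": 12, "likes": 3},
    {"photos": 35, "likes": 40},
    {"photos": 42, "likes": 55},
]

buckets = defaultdict(list)
for post in posts:
    buckets[post["photos"] // 10 * 10].append(post["likes"])  # buckets of width 10

for lower in sorted(buckets):
    likes = buckets[lower]
    print("%d-%d photos: %.1f average likes" % (lower, lower + 9, sum(likes) / len(likes)))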

4. What else would you like to predict with social data if you got the chance?

Now I’m working on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data, e.g. to predict the reaction to one or another politician’s statement based on a user’s Twitter feed. Or to extract all the “real” theses of a politician based on all of his public speeches.

Bad Data, the Right Data and Le Data

Gnip believes social data can change the world, and our leadership team has been writing about data for O’Reilly and speaking at the Sentiment Symposium and at LeWeb. We wanted to share what they have been talking about.

Bad Data by O'Reilly

Gnip CEO Jud Valeski wrote a chapter in the recently released O’Reilly handbook “Bad Data” by Ethan McCallum. Jud wrote the chapter called “Social Data: Erasable Ink?” about how the evolving social media landscape is challenging expectations about how people interact with social data and who owns it. Gnip is committed to providing terms-of-service-compliant social data, and this chapter talks about the expectations around social data and how the various players are managing them.


Our COO Chris Moody speaking at the Sentiment Symposium on “Building Sentiment Analysis on the Right Social Data”

Building Sentiment Analysis on the Right Social Data (Chris Moody, Gnip) from Seth Grimes on Vimeo.

Jud being interviewed by Robert Scoble at LeWeb

Social Data and The Election

If you’re excitedly awaiting the results of the election and want to keep an eye on what people are saying about Election 2012 on social media, we have a list of resources below:

  • I Voted Map – A realtime map by the good people at Foursquare allowing people to check in and say “I voted.” The map compiles checkins from voters.
  • The Twitter Political Index – Twitter’s official coverage measuring the sentiment between each Presidential candidate and trending topics related to the election.
  • Tumbling the Election – Coverage from Tumblr’s editorial team tracking top election related hashtags.
  • Facebook Stories on the Election – Watch Americans who said they voted on Facebook in real time.
  • Tumblr Election 2012 – Union Metrics visualization of trending election-related tags on Tumblr. Shows how many posts per second are about the election.
  • Infinigon Group – Tracking realtime political sentiment on social media.
  • 2012 Election Mood Meter – Netbase’s election mood meter tracks sentiment on social media for the Presidential and Vice Presidential candidates and breaks it down further by gender.
  • Electoral Map based on Tweets – The Guardian has created a map to show who would win the election based on Tweets.
  • Election Day on Twitter – Al Jazeera and Flowics show buzz volume on Twitter about each of the Presidential candidates as well as a live stream about Tweets on each candidate.
  • Rock The Vote Real-Time Politics – Splunk and MTV have created a visualization of hashtags on Twitter about each Presidential candidate.
  • The Crowdwire – Bluefin Labs has created what they’re calling a “Social Exit Poll” looking at how people are talking about how they voted on social media.
  • Yahoo! Election Control Room –  Attensity has teamed up with Yahoo! to show how America is feeling about the election and sharing select Tweets.
  • US Electoral Compass 2012 – Brandwatch allows you to select a state and date range to show what political issues each state is talking about.
  • NBC Politics – Using Crimson Hexagon to power their social media analysis.
  • Twitter Sentiment Analysis – USC is tracking sentiment around each candidate.

Who are we missing? Also, as a bonus, you can see our interview with Gabriel Banos of Zauber Labs on predicting the election with social data and Union Metrics’ comparison of the candidates using Tumblr data.

Social Data Around Voter Values

What's In The [Social] Data?

I was reading about the Higgs boson this morning and came across this cartoon explanation of matter, particle acceleration, and the Higgs boson. The very last pic in the cartoon (below) reminded me of the cartoon that’s been in my head for years, the one that pops into my head when I go to work. It drives what we do at Gnip. All of our energy is focused on helping our customers answer that question: “What’s in the data?” We do this by reliably collecting, filtering, enriching, and delivering billions of public social activities (social data) to our customers with business-critical data needs, every day.

What's In the Data

What's In The Data slide from PHD Comics - http://www.phdcomics.com/comics.php?f=1489

Geosocial Data: Patterns of Everyday Life

My love for checking in, and thus geolocation, began after SXSW 2009 while I racked up points and worked hard to become the leader of Boulder, ultimately losing to Eric Wu. Since then, my views on geolocation have evolved, and I have become especially enamored with the way geosocial data allows us to leave trails of the lives we and others are living. At its best, geolocation plus social connects us to the friends we are close to by letting us know who is near, and collectively, social data can identify common interests and patterns of behavior we couldn’t see in the past.

Since 2008, Foursquare has evolved into a service with 50 million users, two billion check-ins and a facelift launching tomorrow; Twitter has opened up a geolocation API; Facebook Places launched and continues to evolve; Highlight launched; and Gowalla was acquired by Facebook. All of these advancements have happened in a couple of short years. Geotagging allows this new crop of social networks to add your geographic location via metadata, and now you can add location to Tweets, photos, videos and more.

Patterns of My Life

Every time I check in and share my location, I start leaving a trail of my day-to-day life. This trail, at its most basic, serves as a virtual diary of where I went and with whom. Timehop emails me each day to tell me what I did a year ago, while services such as Rewind.Me allow me to search my patterns and how I stack up against others.

Tripmeter lets me see my virtual trail and how I travel throughout the day based on Foursquare and Facebook check-ins, similar to what Route does. Where Do You Go even lets you heatmap the places you visit most often (hint: I hate South Boulder).

Foursquare Heat Map

Checkins Are a Moving Census

But collectively, the patterns woven by geosocial data are incredibly telling and act as a living census. Intriguingly, researchers from Carnegie Mellon have created what they call “Livehoods,” which are neighborhoods defined not only by geographic proximity but also by geotagged social data. Essentially, the similarities are based on where people check in. While the data only includes those using geolocation, it shows that people who check into a local restaurant and a similar bar create cultural neighborhoods. This data is more than just an intellectual curiosity. Companies can analyze customer patterns to focus marketing efforts, identify companies to partner with and determine new brick-and-mortar locations.

Example of Livehood Data
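
For a flavor of how check-ins can define neighborhoods, here is a toy sketch in the spirit of Livehoods: venues are treated as similar when the same users check in to both, and clusters of similar venues become data-driven “neighborhoods.” The check-in data is invented and the clustering is a bare-bones spectral clustering; it is not the Livehoods team’s actual method or dataset.

# Toy Livehoods-style clustering on invented check-in data.
import numpy as np
from sklearn.cluster import SpectralClustering

checkins = {                      # venue -> set of user ids who checked in there
    "cafe_a": {1, 2, 3},
    "bar_b": {2, 3, 4},
    "gym_c": {7, 8},
    "diner_d": {3, 7, 8, 9},
}
venues = list(checkins)

# Affinity: number of shared visitors between each pair of venues.
n = len(venues)
affinity = np.zeros((n, n))
for i, vi in enumerate(venues):
    for j, vj in enumerate(venues):
        affinity[i, j] = len(checkins[vi] & checkins[vj])

labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(affinity)
for venue, label in zip(venues, labels):
    print(venue, "-> neighborhood", label)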

I particularly love the idea of an app built on Foursquare data called “When Should I Visit?” that tells you the best time to visit London tourist attractions based on Foursquare check-ins. Other use cases for this type of social data could tell people when to visit high-traffic destinations such as the DMV. I love knowing when not to be somewhere as much as knowing which locations and parties are trending.

HealthMap uses geosocial data and news reports to help track epidemics as they pop up. The mapping system was created by a team of researchers, epidemiologists and software developers at Children’s Hospital Boston to monitor epidemics in real time as they break out. Rumi Chunara worked on this project and also helped use geosocial data to track how cholera spread in Haiti. (Rumi will be speaking at Gnip’s social data conference, Big Boulder, about social data in public service.) Geosocial data has unlimited uses in the cases of health epidemics and natural disasters.

Companies are starting to create passive geolocation check-ins, such as EpicMix from Vail Resorts, which enables skiers to automatically check in via RFID tags read at the ski lifts. The system tells users how much they skied, where they skied, their vertical ascents and where their friends are on the mountain. During the last Coachella, 30,000 concertgoers used RFID bands from Intellix to check in and update their Facebook status at portals spaced throughout the concert grounds. Near field communication is another way social data provides amazing patterns.

Geosocial data allows us insight into the patterns of everyday people, and the applications for this are endless.

Taming The Social Media Firehose, Part II

In part I, we discussed some high-level attributes of the social media firehose and what is needed to digest the data. Now let’s collect some data and perform a simple analysis to see how it goes.

On March 20, 2012, there was a large magnitude-7.4 earthquake in Oaxaca, Mexico. (See the USGS record.) Due to the severity of the earthquake, it was felt in many locations in Southern Mexico and as far as 300 miles away in Mexico City.

Collecting data from multiple firehoses around a surprise event such as an earthquake can give a quick sense of the unfolding situation on the ground in the short term, as well as help us understand the long-term implications of destruction or injury. Social data use continues to evolve in natural disasters.

For this post, let’s limit the analysis to two questions:

  1. How does the volume of earthquake-related posts and tweets evolve over time?
  2. How rich or concise are social media conversations about the earthquake? To keep this simple, treat the size of the social media activities as a proxy for content richness.

In the past few months, Gnip has made new, rich firehoses such as Disqus, WordPress and Tumblr available. Each social media service attracts a different audience and has strengths for revealing different types of social interactions. For each topic of interest, you’ll want to understand the audience and the activities common to the publisher.

Out of the possible firehoses we can use to track the earthquake through social media activity, four firehoses will be used in this post:

  • Twitter
  • WordPress Posts
  • WordPress Comments
  • Newsgator (to compare to traditional media)

There are some common tasks and considerations for solving problems like looking at earthquake data across social media.  For examples, see this post on the taxonomy of data science. To work toward an answer, you will always need a strategy of attack to address each of the common data science challenges.

For this project, we will focus on the following steps needed to digest and analyze a social media firehose:

  1. Connect and stream data from the firehoses
  2. Apply filters to the incoming data to reduce to a manageable volume
  3. Store the data
  4. Parse and structure relevant data for analysis
  5. Count (descriptive statistics)
  6. Model
  7. Visualize
  8. Interpret

Ok, that’s the background. Let’s go.

1. Connect to the firehose. We are going to collect about a day’s worth of data. The simplest way to collect the data is with a cURL statement like this,

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"

This command opens a streaming HTTP connection to the Gnip firehose server and delivers a continuous stream of JSON-formatted, GZIP-compressed data from the stream named “pt1.json” to my analysis data collector. If everything goes as planned, this will collect data from the firehose until the process is manually stopped.

The depth of the URL (each level of …/…/…) is the RESTful way of defining the interface. Gnip provides a URL for every stream a user configures on the server. In this case, I configured a streaming Twitter server named “pt1.json.”

A more realistic application would guard against data loss by making this basic client connection more robust. For example, you may want to monitor the connection so that if it dies due to network latency or other network issues, it can be quickly re-established. Or, if you cannot afford to miss any activities, you may want to maintain redundant connections.
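
As a rough illustration, here is a minimal sketch of a reconnecting streaming client in Python, assuming the same Gnip endpoint as the cURL example above. The credentials and backoff policy are placeholders; a production client would also add redundant connections and monitoring.

# Minimal reconnecting stream consumer (sketch; credentials are placeholders).
import time
import requests

STREAM_URL = "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
AUTH = ("shendrickson@gnip.com", "password")  # placeholder credentials

def consume(handle_line):
    backoff = 1
    while True:
        try:
            with requests.get(STREAM_URL, auth=AUTH, stream=True, timeout=90) as resp:
                resp.raise_for_status()
                backoff = 1  # reset once we are successfully connected
                for line in resp.iter_lines():
                    if line:  # skip keep-alive newlines
                        handle_line(line)
        except requests.RequestException as exc:
            print("stream dropped (%s); reconnecting in %ss" % (exc, backoff))
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)  # exponential backoff, capped at 60s

if __name__ == "__main__":
    consume(lambda line: print(line[:80]))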

One of the network challenges is that volume from the firehoses changes on a daily cycle, on a weekly cycle and with surprise events such as earthquakes. These changes can be many times the average volume of posts. There are many strategies for dealing with volume variations and planning network and server capacities. For example, you may design graceful data-loss procedures or, if your data provider offers it, use content shaping such as buffering, prioritized rules or rule-production caps. In the case of content shaping, you may build an application to monitor rule-production volumes and react quickly to large changes in volume by restricting your rule set.

Here is short list of issues to keep in mind when planning your firehose connection:

  • Bandwidth must be sufficient for activity volume peaks rather than averages.
  • Latency can cause disconnects as well as having adverse effects on time-sensitive analysis.
  • Disconnects may occur due to bandwidth or latency as well as network outage or client congestion.
  • Implement redundancy and connection monitoring to ensure against activity loss.
  • Activity bursts may require additional hardware, bandwidth, processing power or filter updates. Volume can change by 10x or more.
  • Publisher terms of service may require additional filtering to comply with requirements as to how or when data may be used – for example, appropriately handling activities that were deleted or protected after your system received them.
  • De-duplicating repeated activities and identifying missing activities.

2. Preliminary filters. Generally, you will want to apply broad filter terms early in the process to enable faster, more manageable downstream processing. Many terms related to nearly any topic of interest will appear not only in the activities you are interested in, but also in unrelated activities (noise). The practical response is to continuously refine filter rules to exclude unwanted activities. This may be as simple as keyword filtering or as sophisticated as machine-learning identification of activity noise.

While it would help to use a carefully crafted rule set for our earthquake filter of the firehoses, it turns out that we can learn a lot with the two simple rules “quake” and “terramoto,” the English and Spanish terms commonly appearing in activities related to the earthquake. For our example analysis, we don’t get enough noise with these two terms to worry about additional filtering. So, each of the firehoses is initially filtered with these two keywords. With a simple filter added, our connection looks like this,

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
| grep -i -e "quake" -e "terramoto"

The “grep” command simply keeps activities containing the terms “quake” or “terramoto”; the “-i” flag makes the match case-insensitive.

The filter shown in the example will match activities in which either term appears in any part of the activity, including activity text, URLs, tagging, descriptions, captions, etc. In order to filter more precisely on, for example, only blog post content or only the Tweet user profile, we would need to parse the activity before filtering.

Alternatively, we can configure Gnip’s Powertrack filtering when we set up our server, with rules for restricting filtering to certain fields or for volume shaping. For example, to filter Tweets based on a Twitter user’s profile location setting, we might add the rule,

user_profile_location:"Mexico City"

Or, to shape the volume of matched Tweets for very common terms, we might add a rule to restrict output to 50% of matched Tweets with,

sample:50

For the earthquake example, we use all matched activities.

3. Store the data. Based on the desired analysis, there are a wide variety of choices for storing data. You may choose to create an historical archive, load a processing queue, or push the data to cluster storage for processing with, for example, Hadoop. Cloud-based key-value stores can be economical, but may not have the response characteristics required for solving your problem. Choices should be driven by precise business questions rather than technology buzz.

Continuing to work toward the earthquake analysis, we will store activities in a file to keep things simple.

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
| grep -i -e "quake" -e "terramoto"
> earthquake_data.json

Plans for moving and storing data should take into account typical activity volumes. Let’s look at some examples of firehose volumes. JSON-formatted activities compressed with GZIP come to roughly 25 gigabytes per 100 million Tweets. While this takes less than 2 minutes to transfer to disk at 300 MB/s (SATA II), it takes about 6 hours at 10 Mb/s (e.g. a typical congested Ethernet network). Firehose sizes vary, and one day of WordPress.com posts is a bit more manageable at 350 MB.
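
A quick back-of-the-envelope check of those transfer times:

# 25 GB of gzipped JSON (~100M Tweets) at two different transfer rates.
size_mb = 25 * 1000                      # ~25 GB in megabytes

sata_mb_per_s = 300.0                    # SATA II
congested_mb_per_s = 10.0 / 8            # 10 Mb/s link -> 1.25 MB/s

print(size_mb / sata_mb_per_s / 60, "minutes at 300 MB/s")       # ~1.4 minutes
print(size_mb / congested_mb_per_s / 3600, "hours at 10 Mb/s")   # ~5.6 hours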

The filtered earthquake data for the Twitter, WordPress and Newsgator firehoses comes to only a few gigabytes, so we will just work from local disk.

4. Parse and structure relevant data. This is the point where we make decisions about data structure and tools that best support the desired analysis. The data are time-ordered social media activities with a variety of metadata. On one hand, it may prove useful to load an HBase table to leverage the scalability of Hadoop; on the other, structuring a subset of the data or metadata and inserting it into a relational database to leverage the speed of indexes might be a good fit. There is no silver bullet for big data.

Keeping it simple for the earthquake data, use a Python script to parse the JSON-formatted activities and extract the date-time of each post.

import json

with open("earthquake_data.json") as f:
    for activity in f:
        print(json.loads(activity)["postedTime"])

Now we can analyze the time-evolution of activity volume by counting up the number of mentions appearing in a minute. Similarly, to estimate content complexity, we can add a few more lines of Python to count characters in the text of the activity.

import json

with open("earthquake_data.json") as f:
    for activity in f:
        print(len(json.loads(activity)["body"]))

5. Descriptive statistics. Essentially, this means counting the number of earthquake references. Now that we have extracted dates and sizes, our earthquake analysis is simple. A few ideas for more interesting analysis: understanding and identifying key players in the social network, extracting entities such as places or people from text, performing sentiment analysis, or watching the wave of Tweets move out from the earthquake epicenter.
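
As a concrete example of the counting, here is a short sketch that buckets the extracted timestamps into per-minute volumes, reading the same earthquake_data.json file produced above.

import json
from collections import Counter

minute_counts = Counter()
with open("earthquake_data.json") as f:
    for activity in f:
        posted = json.loads(activity)["postedTime"]   # e.g. "2012-03-20T18:04:05.000Z"
        minute_counts[posted[:16]] += 1               # truncate to YYYY-MM-DDTHH:MM

for minute in sorted(minute_counts):
    print(minute, minute_counts[minute])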

6. Model. Descriptive statistics are okay, but well-formed questions make explicit assumptions about correlation and/or causality and – importantly – are testable. The next step is to build a model and some related hypotheses we can test.

A simple model we can examine is that surprise events fit a “double-exponential” pulse in activity rate. The motivating idea is that news spreads more quickly as more people know about it (exponential growth) until nearly everyone who cares knows. After saturation, the discussion of a topic dies off exponentially (analogous to radioactive decay). If this hypothesis works out, we have a useful modelling tool that enables comparison of events and conversations between different types of events and across firehoses. To learn more about the attributes of the double exponential and fitting it to social media volume data, see Social Media Pulse.
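
To make the shape of the model concrete, here is a toy sketch of a double-exponential pulse with made-up parameters: exponential growth up to a peak, then exponential decay. Fitting real per-minute counts would be a separate step (for example with scipy.optimize.curve_fit).

# Double-exponential pulse with toy parameters (not fit to the earthquake data).
import math

def pulse(t, t_peak=30.0, peak=1000.0, rise=0.2, decay=0.05):
    """Activity rate (activities/minute) at t minutes after the event."""
    if t < t_peak:
        return peak * math.exp(rise * (t - t_peak))    # growth: news spreads
    return peak * math.exp(-decay * (t - t_peak))      # decay: audience saturates

for t in range(0, 121, 15):
    print("t = %3d min, rate ~ %7.1f activities/min" % (t, pulse(t)))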

7. Visualize. Finally, we are ready to visualize the results of earthquake-related activity. We will use the simple tactic of looking at time-dependent activity in Figure 1. Twitter reports start arriving immediately after the earthquake and the volume grows to a peak within minutes. Traditional media (Newsgator) stories peak about the same time and continue throughout the day while blogs and blog comments peak and continue into the following day.

Twitter Reaction to Earthquakes

8. Interpret. Referring to Figure 1, a few interesting features emerge. First, Twitter volume is significant within a minute or two of the earthquake and peaks in about half an hour. Within an hour, the story on Twitter starts to decay. In this case, the story’s natural decay is slowed by the continual release of news stories about the earthquake. Users continue to share new Tweets for nearly 12 hours.

The prominent bump in Tweet volume just before 1:00 UTC is the result of Tweets by members of the Brazilian boy band “Restart,” who appeared to be visiting Mexico City at the time of the earthquake and came online to inform their fanbase that they were okay. The band’s combined retweet volume during this period added a couple thousand Tweets to the background of earthquake Tweets (which also grew slightly during this period due to news coverage).

While it is not generally the case, traditional media earthquake coverage (represented by the bottom graph, the Newsgator firehose) peaks at about the same time as Tweet volume. We commonly see Tweet volume peaking minutes, hours and occasionally days before traditional media. In this case, the quake was very large, attracting attention and creating pressure to quickly answer questions about damage and injuries.

WordPress blog posts about the earthquake illustrate a common pattern for blog posts — they live on a daily cycle. Notice the second wave of WordPress posts starting around 9:00 UTC. Blogs take a little longer to peak because they typically contain analysis of the situation, photos and official statements. Also, many blog readers choose to check in on the blogs they follow in the morning and evening.

A couple of comments on the topic of content richness… As you might have guessed, Tweets are concise. Tweets are limited to 140 characters, but average in the mid-80s. Comments tend to be slightly longer than Tweets at about 250 characters. Posts have the widest range of sizes, with a few very large posts (up to 15,000 words). Posts also often contain rich media such as images or embedded video.

From our brief analysis, it is clear that different firehoses represent different audiences, a range of content richness and publisher-specific modes of user interaction. “What firehoses do I need?” To start to answer that, it may be useful to frame the comparison in terms of speed vs. content richness. Stories move at different speeds and have different content richness based on the firehose and the story. Your use case may require rapid reaction to the social conversation, or a nuanced understanding of long-term sentiment. Surprise stories don’t move in the same ways as conversations about expected events. Each firehose represents a different set of users with different demographics and sensibilities. You will need to understand the content and behavior of each firehose based on the questions you want to answer.

As you can see by getting your hands a little dirty, looking at how a specific event is being discussed in social media is fairly straightforward. Hopefully I have shown both a simple and direct route to getting answers and some useful context on the considerations for building a real-world social media application. Always remember to start with a good, well-formed question.

In Taming The Social Media Firehose, Part III, we will look at consuming the unique, rich and diverse content of the Tumblr Firehose.