The 4 Ways People Use Social Media in Natural Disasters

Earlier this year I interviewed Brooke Fisher Liu of the University of Maryland about her research on how people use social media during natural disasters. She broke it down like this:

During natural disasters people tend to use social media for four interrelated reasons: checking in with family and friends, obtaining emotional support and healing, determining disaster magnitude, and providing first-hand disaster accounts.

I was reflecting on this interview and how much I saw these four scenarios play out during the recent Boulder flood, whose aftermath many in the community are still dealing with. The Boulder flood provided a perfect lens for looking at how people use social media in natural disasters.

1) Checking in with family and friends
People used social media to let friends and loved ones know they were safe, share their current status, and offer (or solicit) help. Across the community, people offered help to those who needed it, whether strangers or family. For my part, I posted daily updates on Facebook so I could keep people informed and then focus on figuring out cleanup.

Social Media During Boulder Flood

2) Obtaining emotional support and healing

Twitter, Facebook and Instagram provided enormous amounts of emotional support along with concrete offers of help. The hashtag #BoulderStrong offered great encouragement to those suffering losses.

3) Determining disaster magnitude
Many of the people following the #BoulderFlood hashtag were also following official accounts, including @dailycamera (the Boulder newspaper), @mitchellbyar (a reporter at the Daily Camera), @boulderoem (Boulder Office of Emergency Management), and @bouldercounty. As a community, we were looking to hear how our homes, our schools, and our neighbors had fared. We were trying to understand just how damaged our community was and how long recovery would take. One of the more interesting aspects I saw was people working out road closures. While Boulder OEM was publishing its reports, many people turned to each other to figure out how to get in and out of Boulder. I can't help but think that social data can provide more accurate information and more real-time reporting than official sources.

4) Providing first-hand disaster accounts
While newspapers shared collections of horrifying images of the damage from the Boulder floods, we also looked to our contacts on social media for first-hand accounts. We used our networks on Twitter to confirm what we were hearing online, or even what we thought we were seeing.

Boulder Flood on Instagram

Our CTO Jud Valeski posted many shots of the flood on his Instagram account that were picked up by the media. In fact, Michael Davidson at Xconomy even wrote an article, "Gnip Co-founder Jud Valeski on His Flood Shots Seen Around the World."

The one aspect that seemed to be missing from Brooke Fisher Liu's research was the coordination taking place across social media. People offered to help strangers, organize cleanups, share tools, bottled water, and spare bedrooms, solicit donations, check on other people's houses, and help in a thousand other ways. Resource sharing was one of the major roles social media played in the Boulder flood.


Tweeting in the Rain, Part 3

(This is part 3 of our series looking at how social data can create a signal about major rain events. Part 1 examines whether local rain events produce a Twitter signal. Part 2 looks at the technology needed to detect a Twitter signal.) 

What opportunities do social networks bring to early-warning systems?

Social media networks are inherently real-time and mobile, making them a perfect match for early-warning systems. A key part of any early-warning system is its notification mechanisms. Accordingly, we wanted to explore the potential of Twitter as a communication platform for these systems (See Part 1 for an introduction to this project).

We started by surveying operators of early-warning systems about their current use of social media. Facebook and Twitter were the most commonly mentioned networks. The level of social network integration varied widely, depending largely on how much public communication was part of each agency's mission. Agencies with a public communications mission viewed social media as a potentially powerful channel for outreach. However, as of early 2013, most agencies surveyed had a minimal and static social media presence.

Some departments have little or no direct responsibility for public communications; their mission is focused on real-time environmental data collection. Such groups typically have elaborate private communication networks for system maintenance and infrastructure management, but serve mainly to provide accurate and timely meteorological data to other agencies charged with data analysis and modeling, such as the National Weather Service (NWS). These groups can be thought of as being on the "front line" of meteorological data collection, and they have minimal operational focus on networks outside their direct control. Their focus is commonly on radio transmissions, and dependence on the public internet is seen as an unnecessary risk to their core mission.

Meanwhile, other agencies have an explicit mission of broadcasting public notifications during significant weather events. Many groups that operate flood-warning systems act as control centers during extreme events, coordinating information between a variety of sources such as the National Weather Service (NWS), local police and transportation departments, and local media. Hydroelectric power generators have Federally-mandated requirements for timely public communications. Some operators interact with large recreational communities and frequently communicate about river levels and other weather observations including predictions and warnings. These types of agencies expressed strong interest in using Twitter to broadcast public safety notifications.

What are some example broadcast use-cases?

From our discussions with early-warning system operators, some general themes emerged. Operators work closely with other departments and agencies, and they are interested in social networks for generating and sharing data and information. Another common theme was the recognition that these networks are uniquely suited to reaching a mobile audience.

Social media networks provide a channel for efficiently sharing information from a wide variety of sources. A common goal is to broadcast information such as:

  • Transportation information about road closures and traffic hazards.

  • Real-time meteorological data, such as current water levels and rain time-series data.

Even when a significant weather event is not happening, there are other common use cases for social networks:

  • Scheduled reservoir releases for recreation/boating communities.

  • Water conservation and safety education.

[Below] is a great example from the Clark County Regional Flood Control District of using Twitter to broadcast real-time conditions. The Tweet contains location metadata, a promoted hashtag to target an interested audience, and links to more information.

— Regional Flood (@RegionalFlood) September 8, 2013

So, we tweet about the severe weather and its aftermath. Now what?

We also asked about significant rain events since 2008. (That year was our starting point since the first tweet was posted in 2006, and in 2008 Twitter was in its relative infancy. By 2009 there were approximately 15 million Tweets per day, while today there are approximately 400 million per day.) With this information we looked for a Twitter ‘signal’ around a single rain gauge. Part 2 presents the correlations we saw between hourly rain accumulations and hourly Twitter traffic during ten events.

These results suggest that there is an active public using Twitter to comment and share information about weather events as they happen. This provides the foundation for making Twitter a two-way communication platform during weather events. Accordingly, we also asked survey participants whether there was interest in monitoring communications coming in from the public. In general there was, along with a recognition that this piece of the puzzle is more difficult to implement. Efficiently listening to the public during extreme events requires significant effort in promoting Twitter accounts and hashtags. The [tweet to the left] is an example from the Las Vegas area, a region where it does not take a lot of rain to cause flash floods. The Clark County Regional Flood Control District detected this Tweet and retweeted it within a few minutes.

 

Any agency or department that sets out to integrate social networks into their early-warning system will find a variety of challenges. Some of these challenges are more technical in nature, while others are more policy-related and protocol-driven.

Many weather-event monitoring systems and infrastructures are operated on an ad hoc, or as-needed, basis. When severe weather occurs, many county and city agencies deploy temporary "emergency operations centers." During significant events, personnel are often already "maxed out" operating other data and infrastructure networks. There are also concerns over data privacy, worries that the public will misinterpret meteorological data, and the recognition that there is little ability to "curate" public reactions to shared event information. Yet another challenge cited was that some agencies have policies requiring special permissions just to access social networks.

There are also technical challenges when integrating social data. From automating the broadcasting of meteorological data to collecting data from social networks, there are many software and hardware details to implement. There are also many challenges in geo-referencing incoming data in order to identify Tweets of local interest. (These challenges are made a lot easier by the new Profile Location enrichments.)

Indeed, effectively integrating social networks requires effort and dedicated resources. The most successful agencies are likely to have personnel dedicated to public outreach via social media. While the Twitter signal we detected seems to have grown naturally without much ‘coaching’ from agencies, promotion of agency accounts and hashtags is critical. The public needs to know what Twitter accounts are available for public safety communications, and hashtags enable the public to find the information they need. Effective campaigns will likely attract followers using newsletters, utility bills, Public Service Announcements, and advertising. The Clark County Regional Flood Control District even mails a newsletter to new residents highlighting local flash flood areas while promoting specific hashtags and accounts used in the region.

The Twitter response to the hydrological events we examined was substantial. Agencies need to decide how to best use social networks to augment their public outreach programs. Through education and promotion, it is likely that social media users could be encouraged to communicate important public safety observations in real time, particularly if there is an understanding that their activities are being monitored during such events. Although there are considerable challenges, there is significant potential for effective two-way communication between a mobile public and agencies charged with public safety.

Special thanks to Mike Zucosky, Manager of Field Services, OneRain, Inc., my co-presenter at the 2013 National Hydrologic Warning Council Conference.


Tweeting in the Rain, Part 2

Searching for rainy tweets

To help assess the potential of using social media for early-warning and public safety communications, we wanted to explore whether there was a Twitter ‘signal’ from local rain events. Key to this challenge was seeing if there was enough geographic metadata in the data to detect it. As described in Part 1 of this series, we interviewed managers of early-warning systems across the United States, and with their help identified ten rain events of local significance. In our previous post we presented data from two events in Las Vegas that showed promise in finding a correlation between a local rain gauge and Twitter data.

We continue our discussion by looking at an extreme rain and flood event that occurred in Louisville, KY on August 4-5, 2009. During this storm rainfall rates of more than 8 inches per hour occurred, producing widespread flooding. In hydrologic terms, this event has been characterized as having a 1000-year return period.

During this 48-hour period in 2009, approximately 30 million tweets were posted from around the world. (While that may seem like a lot of tweets, keep in mind that there are now more than 400 million tweets per day.) Using "filtering" methods based on weather-related keywords and geographic metadata, we set out to find a local Twitter response to this particular rain event.

 

Domain-based Searching – Developing your business logic

Our first round of filtering focused on developing a set of "business logic" keywords around our domain of interest, in this case rain events. Deciding how you filter data from any social media firehose is an iterative process of analyzing the collected data and applying new insights. Since we were focusing on rain events, we searched for words containing the substring "rain," along with other weather-related words. Accordingly, we first searched with this set of keywords and substrings:

  • Keywords: weather, hail, lightning, pouring
  • Substrings: rain, storm, flood, precip

Applying these filters to the 30 million tweets resulted in approximately 630,000 matches. We soon found out that there are many, many tweets about training programs, brain dumps, and hundreds of other words containing the substring ‘rain.’ So, we made adjustments to our filters, including focusing on the specific keywords of interest: rain, raining, rainfall, and rained. By using these domain-specific words we were able to reduce the amount of non-rain ‘noise’ by over 28% and ended up with approximately 450,000 rain- and weather-related tweets from around the world. But how many were from the Louisville area?
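Before turning to geography, here is a minimal Python sketch (not the production filter we ran) of the difference between the two passes: a broad substring match picks up noise like "training" and "brain dump," while whole-word keyword matching does not. The term lists below are abbreviated versions of the ones above.

import re

SUBSTRINGS = ["rain", "storm", "flood", "precip"]
KEYWORDS = {"weather", "hail", "lightning", "pouring",
            "rain", "raining", "rainfall", "rained"}

def matches_substring(text):
    # Broad first pass: any weather substring anywhere in the text
    lowered = text.lower()
    return any(s in lowered for s in SUBSTRINGS)

def matches_keyword(text):
    # Refined pass: only whole weather-related words
    words = re.findall(r"[a-z]+", text.lower())
    return any(w in KEYWORDS for w in words)

print(matches_substring("Off to my training program"))       # True (noise)
print(matches_keyword("Off to my training program"))         # False
print(matches_keyword("It rained all night in Louisville"))  # True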

Finding Tweets at the County and City Level – Finding the needle in the haystack

The second step was mining this Twitter data for geographic metadata that would allow us to geo-reference these weather-related tweets to the Louisville, KY area. There are generally three methods for geo-referencing Twitter data:

  • Activity Location: tweets that are geo-tagged by the user.
  • Profile Location: parsing the Twitter Account Profile location provided by the user.
    • “I live in Louisville, home of the Derby!”
  • Mentioned Location: parsing the tweet message for geographic location.
    • “I’m in Louisville and it is raining cats and dogs”

Having a tweet explicitly tied to a specific location or a Twitter Place is extremely useful for any geographic analysis. However, the percentage of tweets with an Activity Location is less than 2%, and these were not available for this 2009 event. Given that, what chance did we have of correlating tweet activity with local rain events?

For this event we searched for any tweet that used one of our weather-related terms and either mentioned "Louisville" in the tweet or came from a Twitter account with a Profile Location setting including "Louisville." It's worth noting that since we live near Louisville, CO, we explicitly excluded account locations that mentioned "CO" or "Colorado." (By the way, the Twitter Profile Geo Enrichments announced yesterday would have really helped our efforts.)
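A minimal sketch of that geographic filter is below. The field names ("body" for the tweet text, "profile_location" for the account's Profile Location) are placeholders for whatever structure your activities actually use.

def is_louisville_ky(activity):
    # Keep weather activities tied to Louisville while excluding Louisville, CO
    body = (activity.get("body") or "").lower()
    location = (activity.get("profile_location") or "").lower()
    mentions_louisville = "louisville" in body or "louisville" in location
    is_colorado = "colorado" in location or location.endswith(", co")
    return mentions_louisville and not is_colorado

example = {"body": "I'm in Louisville and it is raining cats and dogs",
           "profile_location": "Louisville, home of the Derby!"}
print(is_louisville_ky(example))  # True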

After applying these geographic filters, the number of tweets went from 457,000 to 4,085. So, based on these tweets, did we have any success in finding a Twitter response to this extreme rain event in Louisville?

Did Louisville Tweet about this event?

Figure 1 compares tweets per hour with hourly rainfall from a gauge located just west of downtown Louisville on the Ohio River. As with the Las Vegas data presented previously, the tweets occurring during the rain event display a clear response, especially when compared to the “baseline” level of tweets before the event occurred. Tweets around this event spiked as the storm entered the Louisville area. The number of tweets per hour peaked as the heaviest rain hit central Louisville and remained elevated as the flooding aftermath unfolded.

 

Louisville Rain Event

Figure 1 – Louisville, KY, August 4-5, 2009. Event had 4085 activities, baseline had 178.
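If you want to quantify the visual relationship in Figure 1, a minimal sketch along these lines computes the correlation between the two hourly series. The numbers below are made up for illustration; they are not the Louisville gauge or tweet data.

import numpy as np

# Hypothetical hourly series for one event: rain-gauge accumulation (inches)
# and the matching count of local, weather-related tweets
rain = np.array([0.0, 0.1, 0.9, 1.6, 0.7, 0.2, 0.0, 0.0])
tweets = np.array([3, 5, 42, 118, 95, 40, 18, 9])

# Pearson correlation between the hourly rain and tweet-count series
r = np.corrcoef(rain, tweets)[0, 1]
print("hourly rain/tweet correlation: %.2f" % r)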

Other examples of Twitter signal compared with local rain gauges

Based on the ten events we analyzed it is clear that social media is a popular method of public communication during significant rain and flood events.

In Part 3, we’ll discuss the opportunities and challenges social media communication brings to government agencies charged with public safety and operating early-warning systems.


Social Data Mashups Following Natural Disasters

Exactly what role can social data play in natural disasters?

We collaborated with Gnip Plugged In partner Alteryx on data mashups around FEMA relief related to Hurricane Sandy for a recent presentation at the Glue Conference. Alteryx makes it easy for users to solve data analysis problems and make decisions.

Gnip and Alteryx created a data mashup for Hurricane Sandy using six different data sources, showing the kinds of tools we can create after natural disasters. Starting with mapping from TomTom, the mashup also included business data from Dun & Bradstreet, demographic data from Experian, registrations from FEMA, geotagged articles from Metacarta, and geotagged Tweets from Gnip. We concentrated on FEMA efforts and reactions during spring 2013.

This kind of data mashup allows us to drill down into multiple aspects of evacuation zones. One of the simplest examples is the ability to see what services and resources are available from businesses (via the Dun & Bradstreet data) while complementary official efforts are organized.

FEMA Hurricane Sandy Maps

Or it can help prioritize which areas to assist first by mashing population densities with registrations from FEMA.

FEMA Hurricane Sandy Registrations

FEMA Hurricane Sandy Density Map

Using geotagged social data from Twitter is another way to identify areas that need help, as well as monitor recovering areas. Combining sentiment analysis with Tweets provides instant feedback on the frustrations or successes that constituents are feeling and seeing.

Hurricane Sandy Social Tweets

We think this type of data mashup with Alteryx is just the beginning of what is possible with social data. If you have questions or ideas for data mashups, leave them in the comments!

Data Stories: Brooke Fisher Liu on Using Social Media in Natural Disasters

Data Stories is Gnip's project to tell the stories of how social data is being used. This week we're interviewing Brooke Fisher Liu from the University of Maryland about her research on how people use social media in natural disasters (PDF). You can follow Brooke on Twitter at @Bfliu. (Also, you can see our data scientist's post on Twitter's reaction to an earthquake in Mexico.)

Brooke Fisher Liu

Brooke Fisher Liu (photo courtesy of Anne McDonough)

1. When the wildfires broke out in Boulder, I found Twitter to be the best source of information hands down. What kind of information do you see people communicating about natural disasters?

During natural disasters people tend to use social media for four interrelated reasons: checking in with family and friends, obtaining emotional support and healing, determining disaster magnitude, and providing first-hand disaster accounts. A consistent research finding is that people are less likely to follow official, government sources on social media than their friends and family during disasters. I think that may change over time as government sources become more savvy about effectively using social media during disasters.

2. How is curated content such as Storify changing how people communicate during disasters?

This is one area where the research hasn’t caught up with practice yet. However, I think that social media sites that curate content such as Storify, Pinterest, or even Instagram are going to be major players in disaster communication in the future. One of the reasons people don’t turn to social media for disaster information is that the quantity of information is difficult to sift through and verify. Sites that curate content help cut through the sea of online information, and also provide a familiar, reliable source of information through online connections established before disasters.

3. You talked about people mobilizing on social media after natural disasters in your report. Do you ever see people respond in real time?

Absolutely. Real-time communication is one of the primary draws of social media during disasters. There are multiple examples of social media being the first source of disaster information such as for the 2011 Tuscaloosa tornadoes and the 2008 Mumbai terrorist attacks.

4. What surprised you the most about how people were using social media during natural disasters?

By far the biggest surprise is that people still turn to traditional media sources, especially broadcast journalism, as the most accurate source of disaster information. So, while they may first turn to social media, they still prefer traditional media during disasters. I think this may change over time, but it certainly was a surprise for me. Of course, journalists often rely on social media for disaster information, and I think over time we’ll see the distinction between traditional media and so-called new media blur even more.

5. How do you think the use of social media in natural disasters will evolve?

I think over time people will view social media as more trustworthy and thus turn to it as their primary source of information. I also think social media will continue to play a large role in facilitating disaster recovery by helping people connect with each other and rebuild communities. “Official sources” such as governments and the media will increasingly enhance their social media presence before disasters, which likely will position them to be not only the first, but also most trustworthy social media sources down the road. Perhaps most importantly I think social media will continue to surprise us by providing new communication capabilities during disasters that we can’t currently predict.


Social Data at the Eye of the Hurricane

At Gnip, we are fascinated by the possibilities of how social data can be used in natural disasters. If you're on Twitter, your newsfeed is likely flooded (pun intended) with tweets about Hurricane Sandy. Our customer VisionLink takes geotagged Tweets about Hurricane Sandy and overlays them on a map with Red Cross shelters. You can also overlay Tweets with the path of Hurricane Sandy. You can play around with their map here, but this screenshot gives you a snapshot of how social data can track events such as hurricanes.

Mapping Tweets and Red Cross shelters

This map makes it easier for emergency response teams to see where flooding and other potential hazards are happening. Using timestamps, social data can even show a timeline of how conditions worsen over time.

We expect that this technology will get increasingly sophisticated in a short amount of time. If you’re looking for more resources to follow on Twitter about Hurricane Sandy, check out these Twitter accounts.

Big Boulder Speakers Using Social Data in Innovative Ways

Big Boulder is next week and we’re excited to add four new speakers who are using social data in amazing ways, from disaster response and epidemic tracking to predicting the stock market and monitoring political developments.

If you want to follow the conversation about Big Boulder, be sure to follow the hashtag #BigBoulder, the Gnip blog for live blogging, and the pictures from the conference on our Facebook page.

Taming The Social Media Firehose, Part II

In part I, we discussed some high-level attributes of the social media firehose and what is needed to digest the data. Now let’s collect some data and perform a simple analysis to see how it goes.

On March 20, 2012, there was a large magnitude 7.4 earthquake in Oaxaca, Mexico. (See the USGS record.) Due to its severity, the earthquake was felt in many locations across southern Mexico and as far away as Mexico City, roughly 300 miles from the epicenter.

Collecting data from multiple firehoses around a surprise event such as an earthquake can give a quick sense of the unfolding situation on the ground in the short term, as well as help us understand the longer-term implications of destruction or injury. Social data use continues to evolve in natural disasters.

For this post, let’s limit the analysis to two questions:

  1. How does the volume of earthquake-related posts and tweets evolve over time?
  2. How rich or concise are social media conversations about the earthquake? To keep this simple, treat the size of the social media activities as a proxy for content richness.

In the past few months, Gnip has made new, rich firehoses such as Disqus, WordPress and Tumblr available. Each social media service attracts a different audience and has strengths for revealing different types of social interactions. For each topic of interest, you’ll want to understand the audience and the activities common to the publisher.

Out of the possible firehoses we can use to track the earthquake through social media activity, four firehoses will be used in this post:

  • Twitter
  • WordPress Posts
  • WordPress Comments
  • Newsgator (to compare to traditional media)

There are some common tasks and considerations for solving problems like looking at earthquake data across social media.  For examples, see this post on the taxonomy of data science. To work toward an answer, you will always need a strategy of attack to address each of the common data science challenges.

For this project, we will focus on the following steps needed to digest and analyze a social media firehose:

  1. Connect and stream data from the firehoses
  2. Apply filters to the incoming data to reduce to a manageable volume
  3. Store the data
  4. Parse and structure relevant data for analysis
  5. Count (descriptive statistics)
  6. Model
  7. Visualize
  8. Interpret

Ok, that’s the background. Let’s go.

1. Connect to the firehose. We are going to collect about a day’s worth of data. The simplest way to collect the data is with a cURL statement like this,

curl --compressed -s -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"

This command opens a streaming HTTP connection to the Gnip firehose server and delivers a continuous stream of JSON-formatted, GZIP-compressed data from the stream named “pt1.json” to my analysis data collector. If everything goes as planned, this will collect data from the firehose until the process is manually stopped.

The depth of the URL (each level of …/…/…) is the RESTful way of defining the interface. Gnip provides a URL for every stream a user configures on the server. In this case, I configured a streaming Twitter server named “pt1.json.”

A more realistic application would ensure against data loss by making this basic client connection more robust. For example, you may want to monitor the connection so that if it dies due to network latency or other network issues, the connection can be quickly re-established. Or, if you cannot afford to miss any activities, you may want to maintain redundant connections.
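As a sketch of what a more robust client might look like, the snippet below wraps a streaming read in a reconnect loop with exponential backoff. It uses the Python requests library, placeholder credentials, and the same stream URL as the cURL example; it is illustrative rather than a production-ready consumer.

import time
import requests

STREAM_URL = ("https://stream.gnip.com:443/accounts/client/publishers/"
              "twitter/streams/track/pt1.json")

def stream_with_reconnect(url, user, password, handle_line):
    # Consume a streaming HTTP endpoint, reconnecting with backoff on failure
    backoff = 1
    while True:
        try:
            with requests.get(url, auth=(user, password),
                              stream=True, timeout=90) as resp:
                resp.raise_for_status()
                backoff = 1  # reset the backoff once we are connected again
                for line in resp.iter_lines():
                    if line:  # skip keep-alive newlines
                        handle_line(line)
        except requests.RequestException as exc:
            print("connection dropped (%s); retrying in %ds" % (exc, backoff))
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)  # cap the exponential backoff

# Example usage:
# stream_with_reconnect(STREAM_URL, "user@example.com", "secret",
#                       lambda line: print(line[:80]))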

One of the network challenges is that volumes from the firehoses change with daily cycles, weekly cycles, and surprise events such as earthquakes. These changes can be many times the average volume of posts. There are many strategies for dealing with volume variations and planning network and server capacities. For example, you may design graceful data-loss procedures or, if your data provider offers it, use content shaping such as buffering, prioritized rules, or rule-production caps. In the case of content shaping, you may build an application to monitor rule-production volumes and react quickly to large changes in volume by restricting your rule set.

Here is a short list of issues to keep in mind when planning your firehose connection:

  • Bandwidth must be sufficient for activity volume peaks rather than averages.
  • Latency can cause disconnects as well as having adverse effects on time-sensitive analysis.
  • Disconnects may occur due to bandwidth or latency as well as network outage or client congestion.
  • Implement redundancy and connection monitoring to ensure against activity loss.
  • Activity bursts may require additional hardware, bandwidth, processing power or filter updates. Volume can change by 10x or more.
  • Publisher terms of service may require additional filtering to comply with requirements on how or when data may be used, for example, appropriately handling activities that were deleted or protected after your system received them.
  • De-duplicating repeated activities and identifying missing activities.

2. Preliminary filters. Generally, you will want to apply broad filter terms early in the process to enable faster, more manageable downstream processing. Many terms related to nearly any topic of interest will appear not only in the activities you are interested in, but also in unrelated activities (noise). The practical response is to continuously refine filter rules to exclude unwanted activities. This may be as simple as keyword filtering or as sophisticated as machine-learning identification of activity noise.

While it would help to use a carefully crafted rule set for our earthquake filter of the firehoses, it turns out that we can learn a lot with the two simple rules "quake" and "terramoto," the English and Spanish terms commonly appearing in activities related to the earthquake. For our example analysis, we don't get enough noise with these two terms to worry about additional filtering. So, each of the firehoses is initially filtered with these two keywords. With a simple filter added, our connection looks like this,

curl --compressed -s -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json" \
  | grep -i -e "quake" -e "terramoto"

The grep command keeps only the activities containing the terms "quake" or "terramoto"; the "-i" flag makes the matching case-insensitive.

The filter shown in the example will match activities in which either term appears in any part of the activity including activity text, URLs, tagging, descriptions, captions etc. In order to filter more precisely on, for example, only blog post content, or only tweet user profile, we would need to parse the activity before filtering.
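For instance, a small Python filter along these lines parses each activity and keeps it only when a search term appears in the body text itself; the "body" field name is an assumption about the activity format, and the script could stand in for the grep in the pipeline above.

import json
import sys

TERMS = ("quake", "terramoto")

# Read newline-delimited JSON activities on stdin; pass through only those
# whose body text (not URLs, tags, or profile fields) contains a search term
for raw in sys.stdin:
    try:
        activity = json.loads(raw)
    except ValueError:
        continue  # skip blank or malformed lines
    body = (activity.get("body") or "").lower()
    if any(term in body for term in TERMS):
        sys.stdout.write(raw)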

Alternatively, we can configure Gnip's Powertrack filtering when we set up our server, with rules for restricting filtering to certain fields or for volume shaping. For example, to filter tweets based on a Twitter user's profile location setting, we might add the rule,

user_profile_location:"Mexico City"

Or, to shape matched Tweet volume for very common terms, we might restrict output to 50% of matched Tweets with the rule,

sample:50

For the earthquake example, we use all matched activities.

3. Store the data. Based on the desired analysis, there are a wide variety of choices for storing data. You may choose to create an historical archive, load a processing queue, and push the data to cluster storage for processing with, for example, Hadoop. Cloud-based key-value stores can be economical, but may not have the response characteristics required for solving your problem. Choices should be driven by precise business questions rather than technology buzz.

Continuing to work toward the earthquake analysis, we will store activities in a file to keep things simple.

curl --compressed -s -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json" \
  | grep -i -e "quake" -e "terramoto" \
  > earthquake_data.json

Plans for moving and storing data should take into account typical activity volumes. Let's look at some examples of firehose volumes. JSON-formatted activities compressed with GZIP come to roughly 25 gigabytes per 100 million Tweets. While this takes less than 2 minutes to transfer to disk at 300 MB/s (SATA II), it takes about 6 hours at 10 Mb/s (e.g., a typical congested Ethernet network). Firehose sizes vary, and one day of WordPress.com posts is a bit more manageable at about 350 MB.
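The back-of-the-envelope arithmetic behind those transfer times, in case you want to rerun it for your own volumes and link speeds:

size_bytes = 25e9  # ~25 GB of GZIP-compressed JSON, roughly 100M Tweets

# SATA II disk at roughly 300 MB/s
print(size_bytes / 300e6 / 60, "minutes")      # about 1.4 minutes

# Congested network at 10 Mb/s (megabits, hence the factor of 8)
print(size_bytes * 8 / 10e6 / 3600, "hours")   # about 5.6 hours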

Filtered Earthquake data for the Twitter, WordPress and Newsgator firehoses is only a few gigabytes, so we will just work from local disk.

4. Parse and structure relevant data. This is the point where we make decisions about the data structures and tools that best support the desired analysis. The data are time-ordered social media activities with a variety of metadata. On one hand, it may prove useful to load the data into HBase to leverage the scalability of Hadoop; on the other, structuring a subset of the data or metadata and inserting it into a relational database to leverage the speed of indexes might be a good fit. There is no silver bullet for big data.

Keeping it simple for the earthquake data, we can use a short Python script to parse the JSON-formatted activities and extract the date-time of each post.

import json

with open("earthquake_data.json") as f:
    for activity in f:
        # Print the timestamp of each activity
        print(json.loads(activity)["postedTime"])

Now we can analyze the time-evolution of activity volume by counting up the number of mentions appearing in a minute. Similarly, to estimate content complexity, we can add a few more lines of Python to count characters in the text of the activity.

import json

with open("earthquake_data.json") as f:
    for activity in f:
        # Print the length of the activity text, a rough proxy for richness
        print(len(json.loads(activity)["body"]))

5. Descriptive statistics. Essentially, counting the number of earthquake references. Now that we have extracted dates and sizes, our earthquake analysis is simple. A few ideas for more interesting analysis: identifying key players in the social network, extracting entities such as places or people from the text, performing sentiment analysis, or watching the wave of tweets move out from the earthquake epicenter.
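A sketch of that counting step, bucketing activities by minute using the postedTime values extracted above (assuming postedTime is an ISO-8601 string along the lines of "2012-03-20T18:02:48.000Z"):

import json
from collections import Counter

per_minute = Counter()
with open("earthquake_data.json") as f:
    for line in f:
        posted = json.loads(line)["postedTime"]
        # Truncate the timestamp to minute resolution, "YYYY-MM-DDTHH:MM"
        per_minute[posted[:16]] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)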

6. Model. Descriptive statistics are okay, but well-formed questions make explicit assumptions about correlation and/or causality and, importantly, are testable. The next step is to build a model and a related hypothesis we can test.

A simple model we can examine is that surprise events fit a "double-exponential" pulse in activity rate. The motivating idea is that news spreads more quickly as more people know about it (exponential growth) until nearly everyone who cares knows. After saturation, the discussion of the topic dies off exponentially (analogous to radioactive decay). If this hypothesis works out, we have a useful modeling tool that enables comparison of conversations across different types of events and across firehoses. To learn more about the attributes of the double exponential and about fitting social media volume data to it, see Social Media Pulse.
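As a rough sketch of fitting such a pulse, the snippet below uses one possible parameterization (zero before an onset time, then a fast exponential rise followed by a slower exponential decay) and fits it to a synthetic per-minute series with SciPy. The exact functional form and fitting approach in the Social Media Pulse work may differ.

import numpy as np
from scipy.optimize import curve_fit

def pulse(t, a, t0, tau_rise, tau_decay):
    # Double-exponential pulse: zero before onset t0, fast rise, slower decay
    dt = np.clip(t - t0, 0.0, None)
    return a * (np.exp(-dt / tau_decay) - np.exp(-dt / tau_rise))

# Synthetic per-minute counts standing in for the series counted above
t = np.arange(0, 720, dtype=float)                   # 12 hours of minutes
rng = np.random.default_rng(0)
counts = rng.poisson(pulse(t, 400, 60, 3, 90) + 5)   # pulse plus a small background

params, _ = curve_fit(pulse, t, counts,
                      p0=[counts.max(), 50.0, 2.0, 60.0],
                      bounds=(0, np.inf))
print("fitted onset, rise, decay:", params[1], params[2], params[3])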

7. Visualize. Finally, we are ready to visualize the results of earthquake-related activity. We will use the simple tactic of looking at time-dependent activity in Figure 1. Twitter reports start arriving immediately after the earthquake and the volume grows to a peak within minutes. Traditional media (Newsgator) stories peak about the same time and continue throughout the day while blogs and blog comments peak and continue into the following day.

Twitter Reaction to Earthquakes

8. Interpret. Referring to Figure 1, a few interesting features emerge. First, Twitter volume is significant within a minute or two of the earthquake and peaks in about half an hour. Within an hour, the story on Twitter starts to decay. In this case, the story's natural decay is slowed by the continual release of news stories about the earthquake. Users continued to share new Tweets for nearly 12 hours.

The prominent bump in Tweet volume  just before 1:00 UTC is the result of Tweets by the members of the Brazilian boy-band “Restart,” who appeared to be visiting Mexico City at the time of the earthquake and came online to inform their fanbase that they were okay. The band’s combined retweet volume during this period added a couple thousand tweets to the background of earthquake tweets (which also grew slightly during this period due to news coverage).

While it is not generally the case, traditional media earthquake coverage (represented by the bottom graph, the Newsgator firehose) peaks at about the same time as Tweet volume. We commonly see Tweet volume peaking minutes, hours, and occasionally days before traditional media. In this case, the quake was very large, attracting immediate attention and pressure to answer questions about damage and injuries.

WordPress blog posts about the earthquake illustrate a common pattern for blog posts: they live on a daily cycle. Notice the second wave of WordPress posts starting around 9:00 UTC. Blogs take a little longer to peak because they typically contain analysis of the situation, photos, and official statements. Also, many blog readers choose to check in on the blogs they follow in the morning and evening.

A couple of comments on the topic of content richness: as you might have guessed, Tweets are concise. They are limited to 140 characters but average in the mid-80s. Comments tend to be slightly longer than Tweets, at about 250 characters. Posts have the widest range of sizes, with a few very large posts (up to 15,000 words). Posts also often contain rich media such as images or embedded video.

From our brief analysis, it is clear that different firehoses represent different audiences, a range of content richness, and publisher-specific modes of user interaction. So, "What firehoses do I need?" To start answering that, it may be useful to frame the comparison in terms of speed versus content richness. Stories move at different speeds and carry different content richness depending on the firehose and the story. Your use case may require rapid reaction to the social conversation, or a nuanced understanding of long-term sentiment. Surprise stories don't move in the same way as conversations about expected events. Each firehose represents a different set of users with different demographics and sensibilities. You will need to understand the content and behavior of each firehose based on the questions you want to answer.

As you can see by getting your hands a little dirty, looking at how a specific event is discussed in social media is fairly straightforward. Hopefully I have shown a simple and direct route to getting answers, as well as given some useful context on the considerations for building a real-world social media application. Always remember to start with a good, well-formed question.

In Taming The Social Media Firehose,  Part III, we will look at consuming the unique, rich and diverse content of the Tumblr Firehose.