From A to B: Visualizing Language for the Entire History of Twitter

It all started with a simple question: “How could we show the growth and change in languages on Twitter?”

Easy, right?

Well, several months later, here we are, finally ready to show off our final product. You can see a static image of the final viz below, and check out the full story and interactive version in The Evolution of Languages on Twitter.

Looking back on the process that led us here, I realized that we’d been through a huge range of ideas and wanted to share that experience with others.

Where Did We Get the Data?

As a data scientist, I walk into Gnip’s vast data playground excited to analyze, visualize and tell stories. For this project, I had access to the full archive of public Tweets that’s part of Gnip’s product offering – that’s every Tweet since the beginning of Twitter in March of 2006.

The next question is: “With this data set, what’s the best way to analyze language?” We had two options here – use Gnip’s language detection or use the language field that’s in every Twitter user’s account settings. Gnip’s language detection enrichment looks at the text of every Tweet and classifies the Tweet as one of 24 different languages. It’s a great enrichment, but for historical data it’s only available back to March 2012.

Since we wanted to tell the story back to the beginning of Twitter, we went with the account-settings field.


This field has been part of the Twitter account setup since the beginning, giving us the coverage we need to tell our story.

The First Cut

Having defined how we would determine language, we created our first visualization.



Interesting, but it doesn’t really tell the story we’re looking for.  This visualization tells the story of the growth of Twitter – it grew a lot. The challenge is that this growth obscures the presence of anything other than English, Japanese and Spanish. The sharp rise in volume also makes languages prior to 2010 impossible to see.

So we experimented with rank, language subsets, and other visualization techniques that could tell a broader story. At times, we dabbled in fugly.

Round Two

Moving through insights and iterations, we started to see each Twitter language become its own story. We chose relative rank as an important element, and the streams grew into individual banners waving from year-end marker poles like flags in the wind.


With this version, we felt like we were getting somewhere…

The Final Version

To get to the final version, we reintroduced the line width as a meaningful element to indicate the percent of Tweet volume, pared down the number of languages to focus the story, and used D3 to spiff up the presentation layer. The end result is a simple visualization that tells the story of how language has grown and changed on Twitter. 

What became clear to me in this process is that visualization is a hugely iterative process, and there’s not a single thing that leads to a successful end result. It’s a combination of the questions you ask, how you structure the data, the choices you make in what to show and what not to show, and finally the tools you use to display the result.

Let me know what you think…

The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows the evolution of languages of Tweets according to the language that the user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets and the vertical order is based on rank vs. other languages.

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English-speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language, but with only a 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese emerged as the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring – a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast-growing online population. Turkish starts to see growth, and we expect that growth will continue to spike after the Occupygezi movement. Or step back for a broader view of the timeline, and a story about the globalization of communication networks comes to mind. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post later tomorrow for that story.)

Geo Operators and the Oscars: Using Gnip’s Search API to Identify Tweets Originating from Hollywood’s Dolby Theater

At Gnip, we’ve long believed that the “geo” part of social data is hugely valuable. Geodata adds real-world context to help make sense of online conversations. We’re always looking for ways to make it easier to leverage the geo component of social data, and the geo features in our Search API provide powerful tools to do so. Gnip’s Search API now includes full support for all PowerTrack geo operators, which means analysts, marketers and academics can now get instantaneous answers to their questions about Twitter that require location.

In this post, we’ll walk through exactly what these Search API features look like, and then look at an example from the Oscars of where they add real value. We will find the people actually tweeting from inside Hollywood’s Dolby Theater and do deeper analysis of those Tweets.

What It Does

Gnip’s Search API for Twitter provides the fastest, easiest way to get up and running consuming full-fidelity Twitter data. Users can enter any query using our standard PowerTrack operators and get results instantaneously through a simple request/response API (no streaming required).

Using the Search API, you can run queries that make use of latitude/longitude data, specifically with our bounding_box and point_radius rules. Each rule lets you limit results to content within a box with sides of up to 25 miles, or a circle with a radius of up to 25 miles. These rules can be particularly helpful to cut down on the noise and identify the most relevant content for your query.

These rules can be used independently to filter both types of Twitter geodata available from Gnip:

  1. Twitter’s “geo” value: The native geodata provided by Twitter, based on metadata from mobile devices. This is present for about 2% of Tweets.
  2. Gnip’s “Profile Geo” value: Geocoded locations from users’ profiles, provided by Gnip. This is present for about 30% of Tweets.
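As a rough sketch, these rules are just plain strings in the PowerTrack syntax. The helper functions below are hypothetical conveniences, not part of any Gnip SDK; they follow the operator syntax described above and enforce the 25-mile limit explicitly:

```python
# Hypothetical helpers for composing PowerTrack geo rule strings.
# These functions are illustrative conveniences, not part of a Gnip SDK.

MAX_MILES = 25  # both operators cap the box side / circle radius at 25 miles

def point_radius(lon, lat, radius_mi):
    """Build a point_radius rule: a circle of radius_mi miles around (lon, lat)."""
    if radius_mi > MAX_MILES:
        raise ValueError("radius may not exceed 25 miles")
    return f"point_radius:[{lon} {lat} {radius_mi}mi]"

def bounding_box(west_lon, south_lat, east_lon, north_lat):
    """Build a bounding_box rule from its southwest and northeast corners."""
    return f"bounding_box:[{west_lon} {south_lat} {east_lon} {north_lat}]"

# A circle covering part of the Las Vegas valley, for example:
print(point_radius(-115.1398, 36.1699, 10))
# point_radius:[-115.1398 36.1699 10mi]
```

Either rule string can then be submitted on its own or combined with other PowerTrack operators in a query.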

An Oscars Example: What apps do the glitterati use to Tweet?

There were 17.1 million Oscar-related Tweets this year. Ellen DeGeneres’ selfie Tweet at the Oscars was the most Retweeted Tweet ever. If you run a search for “arm was longer” during the Oscars, you’ll get a result set that looks like this – over 1.5 million Retweets in the first hour following @theellenshow’s Tweet.


The Tweet created a bit of a stir since it was sponsored by Samsung and Ellen used her Galaxy S5 to tweet it, but then used an iPhone backstage shortly afterward. We’ve seen plenty of interest in the past in the iPhone vs. Android debate, so we thought we’d use the opportunity to check what the real “score” was inside the building. What applications were other celebrities using to post to Twitter?

Using the PowerTrack point_radius operator, we created a very narrow search for a 100 meter radius from the center of the Dolby Theater — roughly the size of the building — for the time period from 5:30-11:30 PM PST on Sunday night. It’s a single-clause query:

point_radius:[-118.3409742 34.1021528 0.1km]
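Assembling that request in Python might look like the sketch below. The endpoint URL is a placeholder (not a documented value), and the UTC window is our assumed translation of the PST times; the body shape (a query plus fromDate/toDate in YYYYMMDDHHmm form) follows the general pattern of the Search API, but check the docs for the authoritative field names:

```python
import json

# Sketch of a Search API request body for the Dolby Theater query.
# ENDPOINT is a placeholder, not a documented URL; fromDate/toDate
# are assumed UTC equivalents of the 5:30-11:30 PM PST window.
ENDPOINT = "https://search.gnip.com/accounts/<ACCOUNT>/search/<LABEL>.json"

payload = {
    "query": "point_radius:[-118.3409742 34.1021528 0.1km]",
    "fromDate": "201403030130",  # 5:30 PM PST, expressed in UTC
    "toDate": "201403030730",    # 11:30 PM PST, expressed in UTC
    "maxResults": 500,
}

body = json.dumps(payload)
# A client would POST `body` to ENDPOINT with its API credentials, e.g.:
# requests.post(ENDPOINT, data=body, auth=(username, password))
```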

The result set that came back included about 250 Tweets from 98 unique users during that narrow time period, and you can see from the spike during the “selfie” moment that it’s a much more refined data set:


Using this geo search parameter, you can quickly identify the people actually at the Oscars during that major Twitter moment.

Among them, we saw the following application usage:

Geotagged Tweets by Application

With all the chatter about the mobile device battles, it’s interesting to see so many Oscar attendees actually posting to Twitter from Instagram versus Twitter mobile clients. Foursquare also made a solid showing in terms of the attendees’ preferred Twitter posting mechanism.

This is just one example of how using Gnip’s geo PowerTrack operators in the Search API can help with doing more powerful social media analysis. For a more detailed overview of all the potential geo-related PowerTrack queries that can be created, check out our support docs here.

P.S. – More on Geo & Twitter Data at #SXSW

If you get excited about social data + geo and will be in Austin this weekend, come check out our session  “Beyond Dots on a Map: Visualizing 3 Billion Tweets” on Sunday at 1 pm. I’ll be on stage with our good friend Eric Gundersen from Mapbox talking about the work we did together to visualize a huge set of Twitter data from around the world.

A Quick Look at the Hashtags in Jimmy Fallon’s Skit

Jimmy Fallon and Jonah Hill did a new skit called #Hashtag2 where they express themselves with hashtags on Twitter. If you haven’t seen it, go take two minutes. I’ll wait.


After seeing this, I was curious if people actually used these hashtags in real life. Are we all really #blessed and #winning? It appears so. There were 784,207 uses of #blessed and 420,696 uses of #winning in the last 30 days on Twitter. But not many of us are living #livinlavidaloca with 580 uses. Maybe we’re not #livinlavidaloca because there have been only 83 mentions of #sippinonginandjuice.

Below are two charts that show real-world usage of the hashtags included in the video’s script. It’s rather amusing to see how prevalent (or not) these themes are in our daily lives!




Sampling: Not Just For Rappers Anymore

The Gnip data science team (myself, Dr. Josh Montague and Brian Lehman) has been thinking about firehose sampling in the last few weeks. We see research and hear stories based on analysis of a randomly sampled subset of the Twitter, Tumblr or other social data firehose. This whitepaper looks at the common trade-off we see in sampling: we want a data stream that represents the entire audience of a social platform, while controlling costs and limiting activity volume to match our analysis capacity.

Both Gnip’s customers and the greater social data ecosystem frequently use sampling to assess patterns in social data. We created a whitepaper that provides a step-by-step methodology to calculate social activity rates, confidence in our rate estimates and the ability to identify signals of emerging topics or stories. We wanted to know what we could detect and with what certainty.

The sampling whitepaper describes the trade-offs between the three key variables in sampling social data: rate of activities (e.g. the number of blog posts or Tweets over time), confidence levels around our estimates of that rate, and meaningful changes in those rates, i.e., signal. These three variables are interrelated and present a measurement challenge: given the choices or constraints imposed by two of these parameters, you can then calculate the third.

Sampling - Confidence vs Activities vs Signals

While the whitepaper deals with the tradeoffs of activity, signal and confidence when designing a measurement, that is a little abstract. To make this more concrete, think of the trade-off problem as a way of addressing questions like those below. If you’ve asked any of these questions in your own work with social data, we think this whitepaper might help.

  • The activity rate has doubled from five counts to ten counts between two of my measurements. Is this a significant change, or is it expected variation, e.g. due to low-frequency events?

  • I want to minimize the total number of activities that I consume (for reasons of cost, storage, etc.). How can I do this while still detecting a factor-of-two change in activity rate in one hour?

  • How long should I count activities to detect a change in rate of 5%?

  • How do I describe the trade-off between signal latency and rate uncertainty?

  • How do I define confidence levels on activity rate estimates for a time series with only twenty events per day?

  • I plan to bucket the data in order to estimate activity rate; how big (i.e. what duration) should the buckets be?

  • How many activities should I target to collect in each bucket in order to have 95% confidence that my activity rate estimate is accurate for each bucket?

Our summer data science intern, Jinsub Hong, and data scientist Brian Lehman created an animation to help visualize the relationship between confidence interval size, time of observation (or, alternatively, the number of activities observed), and the signal we can detect in a firehose of social data.

The animation below shows confidence intervals for different bin sizes. As the bin size increases, we count more events, so the rate estimate becomes increasingly certain. However, we have to wait longer to get the result (latency).

At what bin size can we be confident that the activity rate has changed significantly?  For short buckets of only a minute or two, the variation in the measured rate is large, comparable to the potential signal.  For longer buckets, the signal becomes more distinct, but the time we have to wait in order to make this conclusion goes up accordingly.

The first and last frame show representative potential signals. In the first frame, this potential signal is about the same size as the variability of the activity rate, so we can’t conclusively say the activity rate has changed. With the larger bin size in the final frame, the signal is much larger than the activity rate uncertainty. We can be confident this represents a real change in the activity rate.
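The bin-size effect can be sketched with basic Poisson statistics: a count N in a bin of duration T gives a rate estimate N/T whose relative uncertainty shrinks like 1/sqrt(N), so longer bins buy confidence at the cost of latency. The snippet below uses a simple normal approximation for the interval – a rough stand-in to illustrate the idea, not the whitepaper’s actual method:

```python
import math

def rate_with_ci(count, bin_minutes, z=1.96):
    """Estimate an activity rate (events/min) with an approximate 95% CI.

    Uses the normal approximation sigma = sqrt(count), which is crude
    for small counts; the whitepaper handles those cases more carefully.
    """
    rate = count / bin_minutes
    half_width = z * math.sqrt(count) / bin_minutes
    return rate, half_width

# The same underlying rate (~10 events/min) observed over longer bins:
for minutes in (1, 10, 100):
    rate, hw = rate_with_ci(10 * minutes, minutes)
    print(f"{minutes:>3} min bin: {rate:.1f} +/- {hw:.2f} events/min")
#   1 min bin: 10.0 +/- 6.20 events/min
#  10 min bin: 10.0 +/- 1.96 events/min
# 100 min bin: 10.0 +/- 0.62 events/min
```

As the printed intervals show, a doubling of a ~10 events/min rate would be lost in the noise of a one-minute bin but unmistakable in a 100-minute one – at the price of waiting 100 minutes for the answer.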

For full details, you can download the whitepaper. If you have questions about it, please leave a comment below.

Tweeting in the Rain, Part 3

(This is part 3 of our series looking at how social data can create a signal about major rain events. Part 1 examines whether local rain events produce a Twitter signal. Part 2 looks at the technology needed to detect a Twitter signal.) 

What opportunities do social networks bring to early-warning systems?

Social media networks are inherently real-time and mobile, making them a perfect match for early-warning systems. A key part of any early-warning system is its notification mechanisms. Accordingly, we wanted to explore the potential of Twitter as a communication platform for these systems (See Part 1 for an introduction to this project).

We started by surveying operators of early-warning systems about their current use of social media. Facebook and Twitter were the most mentioned social networks. The level of social network integration varied widely, depending largely on how much public communication was part of their mission. Agencies with a public communications mission viewed social media as a potentially powerful channel for public outreach. However, as of early 2013, most agencies surveyed had a minimal and static social media presence.

Some departments have little or no direct responsibility for public communications and have a mission focused on real-time environmental data collection. Such groups typically have elaborate private communication networks for system maintenance and infrastructure management, but serve mainly to provide accurate and timely meteorological data to other agencies charged with data analysis and modeling, such as the National Weather Service (NWS). Such groups can be thought of as being on the “front line” of meteorological data collection, and have minimal operational focus on networks outside their direct control. Their focus is commonly on radio transmissions, and dependence on the public internet is seen as an unnecessary risk to their core mission.

Meanwhile, other agencies have an explicit mission of broadcasting public notifications during significant weather events. Many groups that operate flood-warning systems act as control centers during extreme events, coordinating information between a variety of sources such as the National Weather Service (NWS), local police and transportation departments, and local media. Hydroelectric power generators have Federally-mandated requirements for timely public communications. Some operators interact with large recreational communities and frequently communicate about river levels and other weather observations including predictions and warnings. These types of agencies expressed strong interest in using Twitter to broadcast public safety notifications.

What are some example broadcast use-cases?

From our discussions with early-warning system operators, some general themes emerged. Early-warning system operators work closely with other departments and agencies, and are interested in social networks for generating and sharing data and information. Another general theme was the recognition that these networks are uniquely suited for reaching a mobile audience.

Social media networks provide a channel for efficiently sharing information from a wide variety of sources. A common goal is to broadcast information such as:

  • Transportation information about road closures and traffic hazards.

  • Real-time meteorological data, such as current water levels and rain time-series data.

Even when a significant weather event is not happening, there are other common use-cases for social networks:

  • Scheduled reservoir releases for recreation/boating communities.

  • Water conservation and safety education.

[Below] is a great example from the Clark County Regional Flood Control District of using Twitter to broadcast real-time conditions. The Tweet contains location metadata, a promoted hashtag to target an interested audience, and links to more information.

— Regional Flood (@RegionalFlood) September 8, 2013

So, we tweet about the severe weather and its aftermath, now what?

We also asked about significant rain events since 2008. (That year was our starting point since the first tweet was posted in 2006, and in 2008 Twitter was in its relative infancy. By 2009 there were approximately 15 million Tweets per day, while today there are approximately 400 million per day.) With this information we looked for a Twitter ‘signal’ around a single rain gauge. Part 2 presents the correlations we saw between hourly rain accumulations and hourly Twitter traffic during ten events.

These results suggest that there is an active public using Twitter to comment and share information about weather events as they happen. This provides the foundation to make Twitter a two-way communication platform during weather events. Accordingly, we also asked survey participants if there was interest in monitoring communications coming in from the public. In general, there was interest, along with a recognition that this piece of the puzzle is more difficult to implement. Efficiently listening to the public during extreme events requires significant effort in promoting Twitter accounts and hashtags. The [tweet to the left] is an example from the Las Vegas area, a region where it does not take a lot of rain to cause flash floods. The Clark County Regional Flood Control District detected this Tweet and retweeted it within a few minutes.


Any agency or department that sets out to integrate social networks into their early-warning system will find a variety of challenges. Some of these challenges are more technical in nature, while others are more policy-related and protocol-driven.

Many weather-event monitoring systems and infrastructures are operated on an ad hoc, or as-needed, basis. When severe weather occurs, many county and city agencies deploy temporary “emergency operations centers.” During significant events, personnel are often already “maxed out” operating other data and infrastructure networks. There are also concerns over data privacy, over the public misinterpreting meteorological data, and over the limited ability to “curate” public reactions to shared event information. Yet another challenge cited was that some agencies have policies that require special permissions to even access social networks.

There are also technical challenges when integrating social data. From automating the broadcasting of meteorological data to collecting data from social networks, there are many software and hardware details to implement. In order to identify Tweets of local interest, there are also many challenges in geo-referencing incoming data.  (Challenges made a lot easier by the new Profile Location enrichments.)

Indeed, effectively integrating social networks requires effort and dedicated resources. The most successful agencies are likely to have personnel dedicated to public outreach via social media. While the Twitter signal we detected seems to have grown naturally without much ‘coaching’ from agencies, promotion of agency accounts and hashtags is critical. The public needs to know what Twitter accounts are available for public safety communications, and hashtags enable the public to find the information they need. Effective campaigns will likely attract followers using newsletters, utility bills, Public Service Announcements, and advertising. The Clark County Regional Flood Control District even mails a newsletter to new residents highlighting local flash flood areas while promoting specific hashtags and accounts used in the region.

The Twitter response to the hydrological events we examined was substantial. Agencies need to decide how to best use social networks to augment their public outreach programs. Through education and promotion, it is likely that social media users could be encouraged to communicate important public safety observations in real time, particularly if there is an understanding that their activities are being monitored during such events. Although there are considerable challenges, there is significant potential for effective two-way communication between a mobile public and agencies charged with public safety.

Special thanks to Mike Zucosky, Manager of Field Services, OneRain, Inc., my co-presenter at the 2013 National Hydrologic Warning Council Conference.

Full Series: 

Data Story: Eric Colson of Stitch Fix

Data Stories is our blog series highlighting the cool and unusual ways people use data. I was intrigued by a presentation that Eric Colson gave at Strata about Stitch Fix, a personal shopping service that relies heavily on data alongside its personal stylists. This was a fun interview for us because several of my female colleagues order Stitch Fix boxes filled with items Stitch Fix thinks they might like. It’s amazing to see how data impacts even fashion. As a side note, this is Gnip’s 25th Data Story, so be on the lookout for a compilation of all of our amazing stories.

Eric Colson of Stitch Fix

1. Most people think of Stitch Fix as a personal shopping service, powered by professional stylists. But behind the scenes you are also using data and algorithms. Can you explain how this all works?

We use both machine processing and expert human judgment in our styling algorithm. Each resource plays a vital role. Our inventory is both diverse and vast. This is necessary to ensure we have relevant merchandise for each customer’s specific preferences. However, it is so vast that it is simply not feasible for a human stylist to search through it all. So, we use machine-learning algorithms to filter and rank-order all the inventory in the context of each customer. The results of this process are presented to the human stylist through a graphical interface that allows her to further refine the selections. By focusing her on only the most relevant merchandise, the stylist can apply her expert judgment. We’ve learned that, while machines are very fast at processing millions of data points, they still lack the prowess of the virtuoso. For example, machines often struggle with curating items around a unifying theme. In addition, machines are not capable of empathizing; they can’t detect when a customer has unarticulated preferences – say, a secret yearning to be pushed in a more edgy direction. In contrast, human stylists are great at these things. Yet, they are far more costly and slower in their processing. So, the two resources are very complementary! The machines narrow down the vast inventory to a highly relevant and qualified subset so that the more thoughtful and discerning human stylist can effectively apply her expert judgment.

2. What do you think would need to change if you ever began offering a similar service for men?

We would likely need entirely new algorithms and different sets of data.  Men are less self-aware of how things should fit on them or what styles would look good on them (at least, I am!). Men also shop less frequently, but typically indulge in bigger hauls when they do. Also, the styles are less fluid for men and we tend to be more loyal to what is tried & true.  In fact, a feature to “send me the same stuff I got last time” might do really well with men. In contrast, our female customers would be sorely disappointed if we ever sent them the same thing twice!

So, while the major technology pieces of our platform are general enough to scale into different categories, we’d still want to collect new data and develop different algorithms and features to accommodate men.

3. How did you use your background at Netflix to help Stitch Fix become such a data-driven company?

Data is in the DNA at Stitch Fix. Even before I joined (first as an advisor and later as an employee), they had already built a platform to capture extremely rich data. Powerful distinctions that describe the merchandise are captured and persisted as structured data attributes through both expert human judgment and transactional histories (e.g., how edgy is a piece of merchandise? How well does it do with moms?). This is a rare capability – one that even surpasses what Netflix had. And the customer data at Stitch Fix is unprecedented! We are able to collect so much more information about preferences because our customers know it’s critical to our efforts to personalize for them. I only wish I had this type of data while at Netflix!

So, in some ways Stitch Fix already had an edge over Netflix with respect to data. That said, the Netflix ethos of democratizing innovation has permeated the Stitch Fix culture. Like Netflix, we try not to let our biases and opinions blind us as we try new ideas. Instead, we take our beliefs about how to improve the customer experience and reformulate them as hypotheses. We then run an A/B test and let the data speak for itself. We either reject or accept the hypothesis based on the observed outcome. The process takes emotion and ego away and allows us to make better decisions.

Also, like Netflix, we invest heavily in our data and algorithms. Both companies recognize the differentiating value in finding relevant things for their customers. In fact, given our business model, algorithms are even more crucial to Stitch Fix than they are to Netflix. Yet, it was Netflix that pioneered the framework for establishing this capability as a strategic differentiator.

4. How else is Stitch Fix driven by data?

Given our unique data, we are able to pioneer new techniques for most business processes. For example, take the process of sourcing and procuring our inventory. Since we have the capability of getting the right merchandise in front of the right customer, we can do more targeted purchasing. We don’t need to make sweeping generalizations about our customer base. Instead, we can allow each customer to be unique. This allows us to buy more diverse inventory in smaller lots, since we know we will be able to send it only to the customers for which it is relevant.

We also have the inherent ability to improve over time. With each shipment, we get valuable feedback. Our customers tell us what they liked and didn’t like. They give us feedback on the overall experience and on every item they receive. This allows us to better personalize to them for the next shipment and even allows us to apply the learnings to other customers.

5. Your stylists will sometimes override machine-generated recommendations based on other information they have access to. For example, customers can put together a Pinterest board so that they can show the stylist things they like. Do you think machines will ever process this data?

No time soon! Processing unstructured data such as images and raw text is squarely in the purview of humans. Machines are notoriously challenged when it comes to extracting the meaning conveyed in this type of information. For example, when a customer pins a picture to a Pinterest board, often they are expressing fondness for a general concept, or even an aspiration, as opposed to desire for a specific item. While machine learning has made great strides in processing unstructured data, there is still a long way to go before machines can do this reliably.

Thanks to Eric for the interview! If you have suggestions for other Data Stories, please leave a comment! 


The Timing of a Joke (on Twitter) is Everything

At Gnip, we’re always curious about how news travels on social media, so when the Royal Baby was born, we wanted to find some of the most popular posts. While digging around, we found a Tweet on the Royal Baby that was a joke from account @_Snape_ on July 22, 2013.

With more than 53,000 Retweets and more than 19,000 Favorites, this Tweet certainly resonated with Snape’s million-plus followers. But while exploring the data, we saw an interesting pattern: people were Retweeting this joke before Snape used it. How was this possible? At first, we assumed that Snape was a joke-stealer, but going back several years, we saw that Snape had actually thrice Tweeted the same joke!

Interest in the joke varied with Snape’s delivery date. One indicator of how well a joke resonates with an audience is the length of time over which the Tweet is Retweeted. In general, content on Twitter has a short shelf life, meaning that people typically stop Retweeting it within a few hours. The graph below has dashed lines indicating the half-life for each time Snape delivered the joke, which we can use to see how much time passed before half of the total Retweets took place. For the first two uses of the joke, half of all Retweets took place within an hour. The third use has a significantly longer half-life, especially by Twitter’s standards. Uniquely timed with the actual birth of the newest Prince George, the July date coincided with all of the anticipation about the Royal Baby and created the perfect storm for this Half-Blood Prince joke to keep going…and going…and going. The timing was impeccable and shows that timing matters for humor on Twitter.
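The half-life measure is simple to compute from Retweet timestamps: sort the offsets from the original Tweet and take the time of the midpoint Retweet. Here is a sketch with synthetic data (we can’t reproduce Snape’s actual timestamps here):

```python
# Compute a Tweet's Retweet half-life: the elapsed time by which half
# of all observed Retweets had occurred. Offsets are minutes after the
# original Tweet; the sample data below is synthetic, for illustration.

def half_life(retweet_offsets_min):
    ordered = sorted(retweet_offsets_min)
    return ordered[len(ordered) // 2]  # time of the midpoint Retweet

# A typical fast-decaying Tweet: most Retweets arrive within the hour.
fast = [1, 2, 4, 5, 8, 15, 30, 90, 240, 600]
# A slow burn, like the well-timed third telling of the joke.
slow = [10, 30, 60, 120, 240, 360, 480, 720, 900, 1440]

print(half_life(fast))  # 15  (minutes)
print(half_life(slow))  # 360 (minutes)
```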

Pivotal HD Tips the Time-to-Value Ratio

For many companies, analyzing large internal data sets combined with realtime data, like social data, and extracting relevant insights isn’t easy. As a marketing director, you might need to look at several months of customer feedback on a product feature and compare those comments to ad mentions on Twitter for an ad-spend decision happening…tomorrow. Depending on the capabilities of your platform, this type of analysis could take a while, and you might not have the answers to inform the decision you need to make.

This spring, our Plugged In partner Pivotal HD, formerly EMC Greenplum, went through a bit of a transformation that facilitates analysis of big data sets combining both internal and external data. Spun out of EMC Greenplum and VMware, the newly formed Pivotal offers Pivotal HD, a Hadoop distribution that enables enterprise customers to combine both structured and unstructured data in a single cloud-powered platform rather than operating multiple systems. What does this mean for enterprises and their customers? In a word (or two), it means speed and accessibility. Through this platform, companies can more easily store and analyze giant data sets, including social data, without spending all of their time building data models to do the analysis.

If you’ve read any of our Data Stories, you know we love data analysis here at Gnip, so we thought it was worthwhile to learn more about what Pivotal HD does for big data analysis. We talked to Jim Totte, Director of Business Development, and Srivatsan Ramanujam, Senior Data Scientist, at Pivotal and captured some of the conversation in this short video clip:

Tweeting in the Rain, Part 2

Searching for rainy tweets

To help assess the potential of using social media for early-warning and public safety communications, we wanted to explore whether there was a Twitter ‘signal’ from local rain events. Key to this challenge was seeing if there was enough geographic metadata in the data to detect it. As described in Part 1 of this series, we interviewed managers of early-warning systems across the United States, and with their help identified ten rain events of local significance. In our previous post we presented data from two events in Las Vegas that showed promise in finding a correlation between a local rain gauge and Twitter data.

We continue our discussion by looking at an extreme rain and flood event that occurred in Louisville, KY on August 4-5, 2009. During this storm rainfall rates of more than 8 inches per hour occurred, producing widespread flooding. In hydrologic terms, this event has been characterized as having a 1000-year return period.

During this 48-hour period in 2009, approximately 30 million tweets were posted from around the world. (While that may seem like a lot of tweets, keep in mind that there are now more than 400 million Tweets per day.) Using “filtering” methods based on weather-related keywords and geographic metadata, we set out to find a local Twitter response to this particular rain event.


Domain-based Searching – Developing your business logic

Our first round of filtering focused on developing a set of “business logic” keywords around our domain of interest, in this case rain events. Building filters for any social media firehose is an iterative process of analyzing collected data and applying new insights. Since we were focusing on rain events, we searched for words containing the substring “rain,” along with other weather-related words. Accordingly, our first pass used this set of keywords and substrings:

  • Keywords: weather, hail, lightning, pouring
  • Substrings: rain, storm, flood, precip

Applying these filters to the 30 million tweets resulted in approximately 630,000 matches. We soon found out that there are many, many tweets about training programs, brain dumps, and hundreds of other words containing the substring ‘rain.’ So, we made adjustments to our filters, including focusing on the specific keywords of interest: rain, raining, rainfall, and rained. By using these domain-specific words we were able to reduce the amount of non-rain ‘noise’ by over 28% and ended up with approximately 450,000 rain- and weather-related tweets from around the world. But how many were from the Louisville area?
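To make the two-pass refinement concrete, here is a hypothetical sketch in Python. The keyword and substring sets come from the lists above; the matching logic is our simplification and not Gnip’s actual filtering syntax:

```python
import re

KEYWORDS = {"weather", "hail", "lightning", "pouring"}
SUBSTRINGS = ("rain", "storm", "flood", "precip")
RAIN_WORDS = {"rain", "raining", "rainfall", "rained"}  # refined second pass

def first_pass(text):
    """Match any keyword as a whole word, or any substring anywhere.
    The 'rain' substring also matches 'brain', 'training', etc."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & KEYWORDS) or any(s in text.lower() for s in SUBSTRINGS)

def refined_pass(text):
    """Replace the noisy 'rain' substring with exact rain words,
    keeping the less ambiguous substrings."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return (bool(words & (KEYWORDS | RAIN_WORDS))
            or any(s in text.lower() for s in ("storm", "flood", "precip")))
```

A Tweet like “My brain hurts after training” passes the first filter but not the refined one, which is exactly the noise reduction described above.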

Finding Tweets at the County and City Level – Finding the needle in the haystack

The second step was mining this Twitter data for geographic metadata that would allow us to geo-reference these weather-related tweets to the Louisville, KY area. There are generally three methods for geo-referencing Twitter data:

  • Activity Location: tweets that are geo-tagged by the user.
  • Profile Location: parsing the Twitter Account Profile location provided by the user.
    • “I live in Louisville, home of the Derby!”
  • Mentioned Location: parsing the tweet message for geographic location.
    • “I’m in Louisville and it is raining cats and dogs”

Having a tweet explicitly tied to a specific location or a Twitter Place is extremely useful for any geographic analysis. However, fewer than 2% of tweets carry an Activity Location, and these were not available for this 2009 event. Given that, what chance did we have of correlating tweet activity with local rain events?

For this event we searched for any tweet that used one of our weather-related terms and either mentioned “Louisville” in the tweet or came from a Twitter account with a Profile Location setting that included “Louisville.” It’s worth noting that since we live near Louisville, CO, we explicitly excluded account locations that mentioned “CO” or “Colorado.” (By the way, the Twitter Profile Geo Enrichments announced yesterday would have really helped our efforts.)
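Expressed as code, that geographic filter might look like the following hypothetical helper (the real matching ran against Gnip’s filtering rules; this is just the logic):

```python
import re

def from_louisville_ky(tweet_text, profile_location):
    """Keep a weather Tweet if it mentions Louisville, or if its author's
    Profile Location does -- excluding accounts located in Colorado."""
    text = tweet_text.lower()
    loc = (profile_location or "").lower()
    # Exclude account locations that mention "CO" or "Colorado"
    if "colorado" in loc or re.search(r"\bco\b", loc):
        return False
    return "louisville" in text or "louisville" in loc
```

The word-boundary check on “co” keeps the filter from discarding locations that merely contain those letters, while still catching “Louisville, CO.”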

After applying these geographic filters, the number of tweets went from 457,000 to 4,085. So, based on these tweets, did we have any success in finding a Twitter response to this extreme rain event in Louisville?

Did Louisville Tweet about this event?

Figure 1 compares tweets per hour with hourly rainfall from a gauge located just west of downtown Louisville on the Ohio River. As with the Las Vegas data presented previously, the tweets occurring during the rain event display a clear response, especially when compared to the “baseline” level of tweets before the event occurred. Tweets around this event spiked as the storm entered the Louisville area. The number of tweets per hour peaked as the heaviest rain hit central Louisville and remained elevated as the flooding aftermath unfolded.
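The hourly Tweet counts behind a figure like this come from simple time bucketing. A minimal sketch, assuming the Tweet timestamps are already parsed into `datetime` objects:

```python
from collections import Counter
from datetime import datetime

def tweets_per_hour(timestamps):
    """Count Tweets in each clock hour, for plotting
    alongside hourly rain-gauge readings."""
    hours = Counter(ts.replace(minute=0, second=0, microsecond=0)
                    for ts in timestamps)
    return dict(sorted(hours.items()))
```

Comparing this series against a gauge’s hourly rainfall is then a matter of joining the two on the hour key.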


Louisville Rain Event

Figure 1 – Louisville, KY, August 4-5, 2009. The event period had 4,085 activities; the baseline had 178.

Other examples of Twitter signal compared with local rain gauges

Based on the ten events we analyzed, it is clear that social media is a popular method of public communication during significant rain and flood events.

In Part 3, we’ll discuss the opportunities and challenges social media communication brings to government agencies charged with public safety and operating early-warning systems.

Full Series: