From A to B: Visualizing Language for the Entire History of Twitter

It all started with a simple question: “How could we show the growth and change in languages on Twitter?”

Easy, right?

Well, several months later, here we are, finally ready to show off our final product. You can see a static image of the final viz below and check out the full story and interactive version in The Evolution of Languages on Twitter.

Looking back on the process that led us here, I realized that we’d been through a huge range of ideas, and I wanted to share that experience with others.

Where Did We Get the Data?

As a data scientist, I walk into Gnip’s vast data playground excited to analyze, visualize and tell stories. For this project, I had access to the full archive of public Tweets that’s part of Gnip’s product offering – that’s every Tweet since the beginning of Twitter in March of 2006.

The next question is: “With this data set, what’s the best way to analyze language?” We had two options here – use Gnip’s language detection or use the language field that’s in every Twitter user’s account settings. Gnip’s language detection enrichment looks at the text of every Tweet and classifies the Tweet as one of 24 different languages. It’s a great enrichment, but for historical data it’s only available back to March 2012.

Since we wanted to tell the story back to the beginning of Twitter, we decided to use the language field that’s in every Twitter user’s account settings.

[Screenshot: the language field in Twitter account settings]

This field has been part of the Twitter account setup since the beginning, giving us the coverage we need to tell our story.

The First Cut

Having defined how we would determine language, we created our first visualization.

[Figure: streamgraph of Tweet volume by language]

Interesting, but it doesn’t really tell the story we’re looking for.  This visualization tells the story of the growth of Twitter – it grew a lot. The challenge is that this growth obscures the presence of anything other than English, Japanese and Spanish. The sharp rise in volume also makes languages prior to 2010 impossible to see.

So we experimented with rank, language subsets, and other visualization techniques that could tell a broader story. At times, we dabbled in fugly.

Round Two

Moving through insights and iterations, we started to see each Twitter language become its own story. We chose relative rank as an important element, and the streams grew into individual banners waving from year-end marker poles like flags in the wind.

[Figure: bump chart of language rank by year]

With this version, we felt like we were getting somewhere…

The Final Version

To get to the final version, we reintroduced the line width as a meaningful element to indicate the percent of Tweet volume, pared down the number of languages to focus the story, and used D3 to spiff up the presentation layer. The end result is a simple visualization that tells the story of how language has grown and changed on Twitter. 
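The two quantities the final chart encodes, percent of Tweet volume (line width) and relative rank (vertical order), are simple to derive once Tweets are counted per language per year. Here is a minimal sketch of that step in Python. It is not the code we used: the function name `share_and_rank` and the input layout are assumptions for illustration, and the counts are made up.

```python
def share_and_rank(counts):
    """Turn {year: {language: tweet_count}} into
    {year: [(language, share_of_year_total, rank), ...]}, rank 1 = largest."""
    result = {}
    for year, by_lang in counts.items():
        total = sum(by_lang.values())
        if total == 0:
            # No Tweets counted for this year; nothing to rank.
            result[year] = []
            continue
        # Sort by volume, largest first. Ties keep their input order.
        ordered = sorted(by_lang.items(), key=lambda kv: kv[1], reverse=True)
        result[year] = [
            (lang, n / total, rank)
            for rank, (lang, n) in enumerate(ordered, start=1)
        ]
    return result


# Toy counts for illustration only; these are not real Twitter numbers.
toy_counts = {
    2007: {"en": 8, "ja": 1, "es": 1},
    2008: {"en": 12, "ja": 5, "es": 3},
}

for year, rows in sorted(share_and_rank(toy_counts).items()):
    for lang, share, rank in rows:
        print(year, rank, lang, f"{share:.0%}")
```

Computing share within each year, rather than plotting raw volume, is what keeps the early years readable despite the later growth.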

What became clear to me in this process is that visualization is hugely iterative and that no single thing leads to a successful end result. It’s a combination of the questions you ask, how you structure the data, the choices you make about what to show and what not to show, and finally the tools you use to display the result.

Let me know what you think…

The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows the evolution of languages of Tweets according to the language that the user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets and the vertical order is based on rank vs. other languages.

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English-speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language, but with only a 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese emerged as the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring – a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast-growing online population. Turkish starts to see growth, and we expect that growth to keep climbing after the Occupygezi movement. Or step back for a broader view of the timeline, and the spread of globalized communication networks comes to mind. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post tomorrow for that story.)

Mapping Travel, Languages & Mobile OS Usage with Twitter Data

Some of the most compelling use cases we’ve seen for analyzing Twitter data involve geolocation. From NGOs looking at geotagged Tweets to help deploy resources after disasters, to brands paying attention to where their fans (or their disgruntled customers) are to help drive engagement and marketing strategies, location adds key value to Tweet content.

We’ve been fascinated by these use cases and have wondered what else could be done with this data. A couple months ago our Data Science team set out to explore these questions, and to create some resources at the same time that would help others study and make use of geotagged Tweets. We brought in the team at MapBox – including data artist Eric Fischer – to help us dig into the data and visualize what we found in fast, fully navigable geotagged Twitter maps that would let us and our readers really explore this data in depth.

The interactive maps we created together build on other recent analyses and visualizations of Twitter data done by others, including this great post about details of the data and these static maps from Twitter’s Visual Insights team. The results are stunning, and we hope they’re helpful for you to make the data more practical and accessible as you evaluate what else you could be doing with geolocation in Twitter.

Locals and Tourists (Round 2)

Where do people tweet relative to where they live?

In 2010, Eric Fischer made a static map he called “Locals and Tourists” that showed geolocation for both Tweets and Flickr photos side by side, with the data color coded to show when a post was by a “local” (a post at or near the user’s stated home location) or a “tourist” (a post far from the user’s home location). Twitter has matured significantly since then, and we wanted to see what we could learn from looking at just the Twitter data today, with the ability to browse at any local level around the world. We gathered a sample of Twitter data with unique geotagged Tweet locations from the past ~18 months to generate this new interactive map.
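To make the local/tourist distinction concrete, here is a minimal Python sketch of one way to label a Tweet by its distance from the user's home. It is an illustration, not the method used for the map: the 100 km cutoff is an assumed value (this post does not state one), and the sketch assumes the home location has already been converted to coordinates, which a free-text profile location would first require.

```python
import math

EARTH_RADIUS_KM = 6371.0
LOCAL_RADIUS_KM = 100.0  # assumed cutoff for "at or near home"; not from the post


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify(tweet_latlon, home_latlon, radius_km=LOCAL_RADIUS_KM):
    """Label a Tweet 'local' if it was posted within radius_km of home, else 'tourist'."""
    distance = haversine_km(*tweet_latlon, *home_latlon)
    return "local" if distance <= radius_km else "tourist"


# Approximate city-center coordinates, used only as an example.
philadelphia = (39.9526, -75.1652)
wilmington = (39.7391, -75.5398)
denver = (39.7392, -104.9903)

print(classify(wilmington, philadelphia))  # a short hop from home
print(classify(denver, philadelphia))      # far from home
```

The cutoff is the main design choice here: a smaller radius would turn ordinary commuters into tourists, while a larger one would hide regional travel.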

As the dynamic maps took shape, the new version of “Locals and Tourists” impressed us in a couple of ways. The first was simply how much resolution Twitter data provides. For instance, not only are primary and secondary roads clearly visible, but you can also distinguish roads taken by tourists from roads used for local commutes, as in this screenshot of I-95 snaking past Wilmington, DE and Philadelphia, PA in red across the bottom third of the image:

[Screenshot: I-95 in red past Wilmington, DE and Philadelphia, PA]

You can also clearly see the outlines of buildings like airports, sports stadiums, and major shopping malls that are frequented by tourists. Dig into your local area and see for yourself.

This map could be a resource for city planners, the travel industry, or for creative marketers thinking about how to localize their mobile advertising for different audiences.

Device Usage Patterns

This map shows off usage patterns for various mobile operating systems used to tweet around the world. Since geotagged Tweets require a Twitter client that includes GPS support, most geotagged Tweets come from handheld devices – and we can look at exactly which client was used in the “generator” metadata field provided by Twitter. Among other things, this visualization suggests correlations between mobile OS and income level in the US, and highlights just how widespread BlackBerry use is in Southeast Asia, Indonesia and the Middle East.
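As a rough illustration of how a client name can be turned into an operating system, here is a small Python sketch using keyword matching. The nested `generator` / `displayName` layout and the keyword table are assumptions made for this example, not a description of the pipeline behind the map.

```python
from collections import Counter

# Keyword -> operating system. An illustrative, incomplete table.
OS_KEYWORDS = [
    ("iphone", "iOS"),
    ("ipad", "iOS"),
    ("android", "Android"),
    ("blackberry", "BlackBerry"),
    ("windows phone", "Windows Phone"),
]


def mobile_os(generator_name):
    """Map a client name such as 'Twitter for iPhone' to a mobile OS label."""
    name = (generator_name or "").lower()
    for keyword, os_name in OS_KEYWORDS:
        if keyword in name:
            return os_name
    return "Other"


def count_by_os(tweets):
    """Count Tweets per OS. Each tweet is a dict; the nested
    generator/displayName layout is an assumption for this sketch."""
    counts = Counter()
    for tweet in tweets:
        generator = tweet.get("generator") or {}
        counts[mobile_os(generator.get("displayName"))] += 1
    return counts


# Hand-written sample records, not real Tweets.
sample = [
    {"generator": {"displayName": "Twitter for iPhone"}},
    {"generator": {"displayName": "Twitter for Android"}},
    {"generator": {"displayName": "Twitter for BlackBerry®"}},
    {"generator": {"displayName": "web"}},
]
print(count_by_os(sample))
```

Third-party clients often do not name their platform, so anything the table misses lands in "Other" rather than being guessed.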

Languages of the World

Using the same data sample, this final visualization plots where people tweeted in various languages, using metadata from the Gnip Language Detection Enrichment and the Chromium Compact Language Detector as a fallback.
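The "enrichment first, fallback second" idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the `gnip` / `language` / `value` layout, the `body` text field, and the `fallback_detector` callable, which stands in for whatever detector is available and is not the Compact Language Detector's actual API.

```python
def tweet_language(tweet, fallback_detector):
    """Return the enrichment's language code if present; otherwise ask the
    fallback detector to classify the Tweet text.

    The gnip/language/value layout and the 'body' key are assumed here.
    """
    enrichment = (tweet.get("gnip") or {}).get("language") or {}
    code = enrichment.get("value")
    if code:
        return code
    return fallback_detector(tweet.get("body", ""))


def toy_detector(text):
    """A stand-in detector for the example: knows one Spanish word, else 'und'."""
    return "es" if "hola" in text.lower() else "und"


# Hand-written sample records, not real Tweets.
with_enrichment = {"body": "good morning", "gnip": {"language": {"value": "en"}}}
without_enrichment = {"body": "Hola a todos"}

print(tweet_language(with_enrichment, toy_detector))
print(tweet_language(without_enrichment, toy_detector))
```

Passing the detector in as a function keeps the sketch independent of any particular language-detection library.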

For starters, this map makes clear that English is still the dominant language on Twitter around the world — toggling to the English-only view reveals nearly as much resolution in the global map as when all languages are enabled:

[Map: English only]

[Map: All languages]

What might come as more of a surprise, though, is just how many other languages are being spoken frequently, and particularly how much overlap there is in the United States:

[Map: Non-English Tweets across the US; Spanish in green]

A Note on the Data

These maps are created with a data set that was significantly culled down to remove locations that would create visual noise. From the original data set, the following were removed:

  • Multiple geotagged Tweets in the exact same location (we made no attempt to communicate density in these visualizations)
  • Geotagged Tweets from the same user in very close proximity to other Tweets from the same user
  • Geotagged Tweets from known or detectable bots
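The three removals above can be sketched as a single pass over the data. This Python example is a simplified illustration and not the actual culling code: the record layout, the `is_bot` callable, and the 0.001 degree proximity threshold are all assumptions.

```python
def cull(tweets, is_bot, min_gap_deg=0.001):
    """Keep at most one Tweet per exact location, drop Tweets that sit very
    close to an already-kept Tweet from the same user, and drop bot Tweets.

    Each tweet is a dict with 'user', 'lat' and 'lon' keys (assumed layout).
    min_gap_deg is an assumed threshold for "very close proximity".
    """
    seen_locations = set()   # exact (lat, lon) pairs already kept
    kept_by_user = {}        # user -> list of kept (lat, lon) pairs
    kept = []
    for tweet in tweets:
        user, lat, lon = tweet["user"], tweet["lat"], tweet["lon"]
        if is_bot(user):
            continue
        if (lat, lon) in seen_locations:
            continue
        too_close = any(
            abs(lat - other_lat) < min_gap_deg and abs(lon - other_lon) < min_gap_deg
            for other_lat, other_lon in kept_by_user.get(user, [])
        )
        if too_close:
            continue
        seen_locations.add((lat, lon))
        kept_by_user.setdefault(user, []).append((lat, lon))
        kept.append(tweet)
    return kept


# Hand-written sample records, not real Tweets.
sample_tweets = [
    {"id": "a", "user": "u1", "lat": 10.0, "lon": 20.0},
    {"id": "b", "user": "u2", "lat": 10.0, "lon": 20.0},      # exact duplicate location
    {"id": "c", "user": "u1", "lat": 10.0005, "lon": 20.0005},  # too close to u1's "a"
    {"id": "d", "user": "bot", "lat": 50.0, "lon": 50.0},     # flagged as a bot
    {"id": "e", "user": "u2", "lat": 30.0, "lon": 40.0},
]
survivors = cull(sample_tweets, is_bot=lambda user: user == "bot")
print([t["id"] for t in survivors])
```

Because only one point survives per location, the result says where people tweet but nothing about how often, which matches the note about not communicating density.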

Together these maps point to something powerful: looking at geolocation data from Twitter in the aggregate can yield insights that drive marketing, product development and crisis response, or even inform research and policy decisions. In the coming weeks, we’ll be digging in deeper here on the blog to explore other important aspects of geolocation in social data, which we hope will together build a picture of the opportunity that exists in understanding social data geospatially.

Find something compelling here or in any of the other maps? Tell us with a Tweet: @gnip.