From A to B: Visualizing Language for the Entire History of Twitter

It all started with a simple question: “How could we show the growth and change in languages on Twitter?”

Easy, right?

Well, several months later, here we are, finally ready to show off the finished product. You can see a static image of the viz below, and check out the full story and interactive version in The Evolution of Languages on Twitter.

Looking back on the process that led us here, I realized that we’d been through a huge range of ideas, and I wanted to share that experience with others.

Where Did We Get the Data?

As a data scientist, I walk into Gnip’s vast data playground excited to analyze, visualize and tell stories. For this project, I had access to the full archive of public Tweets that’s part of Gnip’s product offering – that’s every Tweet since the beginning of Twitter in March of 2006.

The next question was: “With this data set, what’s the best way to analyze language?” We had two options here: use Gnip’s language detection, or use the language field that’s in every Twitter user’s account settings. Gnip’s language detection enrichment looks at the text of every Tweet and classifies the Tweet as one of 24 different languages. It’s a great enrichment, but for historical data it’s only available back to March 2012.

Since we wanted to tell the story back to the beginning of Twitter, we decided to use the language field that’s in every Twitter user’s account settings.

Twitter_Account_Screenshot

This field has been part of the Twitter account setup since the beginning, giving us the coverage we need to tell our story.
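For anyone who wants to try something similar, here is a minimal sketch of the kind of tally this approach implies: counting Tweets per year by the author’s account language. The record layout (a `year` key and a `lang` key) and the function name `count_by_year_and_language` are assumptions for illustration, not Gnip’s actual payload or pipeline.

```python
from collections import Counter


def count_by_year_and_language(tweets):
    """Count Tweets per (year, account language) pair.

    Each item in `tweets` is assumed to be a dict with a "year" key and a
    "lang" key holding the author's account-settings language (hypothetical
    layout, chosen only to keep the sketch short).
    """
    counts = Counter()
    for tweet in tweets:
        counts[(tweet["year"], tweet["lang"])] += 1
    return counts


# Made-up records, for illustration only
sample = [
    {"year": 2007, "lang": "en"},
    {"year": 2007, "lang": "en"},
    {"year": 2007, "lang": "ja"},
    {"year": 2013, "lang": "es"},
]
counts = count_by_year_and_language(sample)
```

A `Counter` returns zero for any year and language pair it has never seen, which is handy when early years have no Tweets in a given language.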

The First Cut

Having defined how we would determine language, we created our first visualization.

streamgraph_2013-08-15_Volume_JeffsEdit

Interesting, but it doesn’t really tell the story we’re looking for. This visualization tells the story of the growth of Twitter: it grew a lot. The challenge is that this growth obscures the presence of anything other than English, Japanese and Spanish. The sharp rise in volume also makes it impossible to see any language’s volume prior to 2010.

So we experimented with rank, language subsets, and other visualization techniques that could tell a broader story. At times, we dabbled in fugly.

Round Two

Moving through insights and iterations, we started to see each Twitter language become its own story. We chose relative rank as an important element, and the streams grew into individual banners waving from year-end marker poles like flags in the wind.

1yr_bump

With this version, we felt like we were getting somewhere…

The Final Version

To get to the final version, we reintroduced line width as a meaningful element to indicate the percent of Tweet volume, pared down the number of languages to focus the story, and used D3 to spiff up the presentation layer. The end result is a simple visualization that tells the story of how language has grown and changed on Twitter.
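The two quantities the final version encodes, each language’s percent of Tweet volume and its rank within a year, can be sketched roughly as follows. This is not the code we used; the function name `share_and_rank`, the input layout, and the counts are all made up for illustration.

```python
def share_and_rank(counts_by_year):
    """Return {year: [(lang, share_percent, rank), ...]}, ordered by rank.

    `counts_by_year` is assumed to map each year to a dict of
    {language: Tweet count} (hypothetical layout).
    """
    result = {}
    for year, lang_counts in counts_by_year.items():
        total = sum(lang_counts.values())
        if total == 0:
            # No Tweets recorded for this year: nothing to rank
            result[year] = []
            continue
        # Highest count first; ties broken alphabetically by language code
        ordered = sorted(lang_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[year] = [
            (lang, 100.0 * n / total, rank)
            for rank, (lang, n) in enumerate(ordered, start=1)
        ]
    return result


# Made-up counts, for illustration only
example = {
    2007: {"en": 80, "ja": 15, "es": 5},
    2013: {"en": 50, "ja": 30, "es": 20},
}
table = share_and_rank(example)
```

Working in percent share rather than raw volume is what keeps the early years visible: every year sums to 100 no matter how much Twitter grew.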

What became clear to me along the way is that visualization is hugely iterative, and no single thing leads to a successful end result. It’s a combination of the questions you ask, how you structure the data, the choices you make about what to show and what to leave out, and finally the tools you use to display the result.

Let me know what you think…

The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows how the languages of Tweets have evolved, based on the language each user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets, and the vertical order is based on rank relative to other languages.

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English-speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language, but with only a 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese emerged as the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring, a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast-growing online population. Turkish starts to see growth, and we expect that growth to accelerate after the OccupyGezi movement. Or step back for a broader view of the timeline, and the spreading globalization of communication networks comes to mind. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post tomorrow for that story.)