The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows the evolution of languages of Tweets according to the language that the user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets and the vertical order is based on rank vs. other languages.

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English-speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language, but with only a 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese have emerged as the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring, a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast-growing online population. Turkish starts to see growth, and we expect that growth will continue after the Occupygezi movement. Or step back for a broader view of the timeline, and a story of the globalization of communication networks suggests itself. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post later tomorrow for that story.)

Gnip and Twitter Bringing Social Data to Academic Researchers

What if the next generation of data scientists could have access to social data for their research? And what if we could help increase the quantity and quality of published research using social data? Exploring what might be possible has led to an exciting new collaboration between Twitter and Gnip. Today, we’re announcing the pilot of the Twitter Data Grants program, a new initiative designed to support research and fuel innovation in our industry. (You can find Twitter’s announcement here.)

Beginning today, Twitter is accepting Data Grant proposals for non-commercial, academic research. The submission period will run through March 15th, and Twitter will select a small number of recipients on April 15th. Gnip will provide grantees with the data they need for their research projects, and Twitter engineers will be available for research support.

Gnip is committed to establishing the foundation for the long-term success of the social data industry. In 2012, we launched the Big Boulder Conference, bringing together the leaders in the social data ecosystem to discuss trends, best practices, and how social data is being used to shape current industries and create new ones. In 2013, we launched the Big Boulder Initiative to jointly address the key challenges facing us as an industry.

We’re excited to partner with Twitter on the Data Grants program and look forward to working more closely with researchers in the academic community. We also want to thank these researchers for the valuable insights and feedback that they’ve shared with us to date. This is an important first step and we’re just getting started.

To get updates and stay in touch with the Twitter Data Grants program, make sure to follow @TwitterEng, or email with questions.

Gnip Introduces Search API for Twitter: Instant and Complete Access to Recent Twitter Data

Today Gnip is announcing the public availability of a new product offering, our Search API for Twitter. We couldn’t be more excited about its disruptive potential to get social data into more tools, more products, and ultimately, more business decisions.

Ever since we announced our industry-first partnership with Twitter in 2010, customers have told us how important it is to them to deliver answers to questions about what has just happened on Twitter and to deliver these answers confidently and quickly. When the conversation about a brand or a product gets heated, brands immediately turn to the companies we serve to engage, measure, and respond. Our new Search API enables Gnip customers to build the product to meet these needs in a way that hasn’t been feasible before.

Although we’re announcing the public availability of this offering today, we’ve been working with a few close partners to integrate the Search API into their products and have seen the impact it has made for them. Realtime marketing leader Expion has used the Search API to build tools to help marketers instantly engage and react to marketing opportunities for their brands that are happening now. Simply Measured, provider of easy and powerful social media analytics, is using the Search API to enable their customers to quickly understand the broader context around conversations about their brands or products. In these early uses, we’ve heard from our customers about how important the “fast” and “instant” experience is for their needs.

Like all Gnip products, our Search API is built to be enterprise-ready and capable of handling the most demanding and business-critical use cases. When any Tweet can be the one that changes a decision, every Tweet matters. We’ve built the redundancy and robustness into the Search API to meet the needs of the most sophisticated and advanced social data uses. At the same time, the Search API also provides an easy path to get started for those new to social data.

With the Search API, the foundational first step in social analytics — counting Tweets over a timeframe about a brand or product — is easier than ever, and is now possible without having to tackle the traditional challenges of big data, streaming data, or even having to mess with storing data at all.
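That foundational counting step can be sketched in a few lines of plain Python, independent of any particular API. The brand term, field names, and timestamps below are invented for illustration:

```python
from datetime import datetime

def count_mentions(tweets, term, start, end):
    """Count tweets mentioning `term` that were posted in [start, end)."""
    term = term.lower()
    return sum(1 for t in tweets
               if start <= t["posted_at"] < end and term in t["text"].lower())

# Invented sample data.
tweets = [
    {"posted_at": datetime(2013, 11, 5, 9, 15), "text": "Loving my new Acme phone"},
    {"posted_at": datetime(2013, 11, 5, 22, 0), "text": "Acme customer service rocks"},
    {"posted_at": datetime(2013, 11, 6, 8, 30), "text": "Morning coffee"},
]
print(count_mentions(tweets, "acme",
                     datetime(2013, 11, 5), datetime(2013, 11, 6)))  # 2
```

The hard part in practice is not this loop but the collection and matching behind it, which is exactly what a search endpoint takes off your hands.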

When reliable, sustainable, and complete social data is easier to get and use, we expect even more business decisions to be made using this incredible data. We can’t wait to help accelerate this future and can’t wait to see what our customers and partners build with this data.

If you’re interested in learning more about our new Search API for Twitter, please visit or email

Quantifying Tweets: Trading on 140 Characters

Social media analysis… for traders? That statement would have elicited a much different response 5 years ago than it does now. The market now recognizes the importance – and the impact – of social media channels, and as such has recognized the need to monitor them and trade on the research created from that data.

One of the earliest and most important voices in this conversation was that of Joe Gits and the Social Market Analytics team. Which is why we are incredibly excited to announce that they’ve joined the Plugged In to Gnip partner program.

Social Market Analytics quantifies social data for traders, portfolio managers, hedge funds and risk managers. Their technology extracts, evaluates and calculates social data to generate directional sentiment and volatility indications on stocks, ETFs, sectors and indices – providing predictive indicators for clients. They have succeeded in turning qualitative text into quantitative indicators that can easily be incorporated into trading strategies – broadening the types of traders and firms who can now access and incorporate social signal into their decision making.

As shown in their recently announced agreement with the New York Stock Exchange (granting NYSE reseller and distribution rights to their sentiment feed), SMA is helping bring social analytics to a wider group of financial firms than has ever been possible. That client base requires the highest level of enterprise reliability in the products they buy, which means SMA’s product requires the strongest data foundation possible. And Gnip is honored to be the company providing the reliable and complete access to the social data that fuels this solution.

To see what their technology looks like, check out the webinar we recently held with them.

The Timing of a Joke (on Twitter) is Everything

At Gnip, we’re always curious about how news travels on social media, so when the Royal Baby was born, we wanted to find some of the most popular posts. While digging around, we found a Tweet on the Royal Baby that was a joke from account @_Snape_ on July 22, 2013.

With more than 53,000 Retweets and more than 19,000 Favorites, this Tweet certainly resonated with Snape’s million-plus followers. But while exploring the data, we saw an interesting pattern: people were Retweeting this joke before Snape used it. How was this possible? At first, we assumed that Snape was a joke-stealer, but going back several years, we saw that Snape had actually thrice Tweeted the same joke!

Interest in the joke varied with Snape’s delivery date. One indicator of how well a joke resonates with an audience is the length of time over which the Tweet is Retweeted. In general, content on Twitter has a short shelf life, meaning that people typically stop Retweeting the content within a few hours. The graph below has dashed lines indicating the half-life for each time Snape delivered the joke, which we can use to see how much time passed before half of the total Retweets took place. For the first two uses of the joke, half of all Retweets took place within an hour. The third use has a significantly longer half-life, especially by Twitter’s standards. Timed with the actual birth of the newest Prince George, the July date coincided with all of the anticipation about the Royal Baby and created the perfect storm for this Half-Blood Prince joke to keep going…and going…and going. The timing was impeccable and shows that timing matters for humor on Twitter.
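The half-life measure described above is simple to compute once you have Retweet timestamps. A minimal sketch, with invented timestamps standing in for real Retweet data:

```python
from datetime import datetime, timedelta

def retweet_half_life(timestamps):
    """Return the time elapsed, from the first Retweet, until half of
    all Retweets had occurred. `timestamps` is one datetime per Retweet."""
    ordered = sorted(timestamps)
    # Index of the Retweet that reaches the halfway mark.
    halfway = (len(ordered) + 1) // 2
    return ordered[halfway - 1] - ordered[0]

# Illustrative data: 6 Retweets, half of them within the first 45 minutes.
start = datetime(2013, 7, 22, 20, 0)
rts = [start + timedelta(minutes=m) for m in (0, 10, 45, 300, 900, 2000)]
print(retweet_half_life(rts))  # 0:45:00
```

A short half-life with a long tail of remaining Retweets is exactly the pattern that distinguishes the third delivery of the joke from the first two.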

Tweeting in the Rain, Part 2

Searching for rainy tweets

To help assess the potential of using social media for early-warning and public safety communications, we wanted to explore whether there was a Twitter ‘signal’ from local rain events. Key to this challenge was seeing if there was enough geographic metadata in the data to detect it. As described in Part 1 of this series, we interviewed managers of early-warning systems across the United States, and with their help identified ten rain events of local significance. In our previous post we presented data from two events in Las Vegas that showed promise in finding a correlation between a local rain gauge and Twitter data.

We continue our discussion by looking at an extreme rain and flood event that occurred in Louisville, KY on August 4-5, 2009. During this storm rainfall rates of more than 8 inches per hour occurred, producing widespread flooding. In hydrologic terms, this event has been characterized as having a 1000-year return period.

During this 48-hour period in 2009, there were approximately 30 million tweets posted from around the world. (While that may seem like a lot of tweets, keep in mind that there are now more than 400 million tweets per day.) Using “filtering” methods based on weather-related keywords and geographic metadata, we set off to find a local Twitter response to this particular rain event.


Domain-based Searching – Developing your business logic

Our first round of filtering focused on developing a set of “business logic” keywords around our domain of interest, in this case rain events. Developing how you filter data from any social media firehose is an iterative process involving analyzing collected data and applying new insights. Since we were focusing on rain events, words with the substring “rain” were searched for, along with other weather-related words. Accordingly, we first searched with this set of keywords and substrings:

  • Keywords: weather, hail, lightning, pouring
  • Substrings: rain, storm, flood, precip

Applying these filters to the 30 million tweets resulted in approximately 630,000 matches. We soon found out that there are many, many tweets about training programs, brain dumps, and hundreds of other words containing the substring ‘rain.’ So, we made adjustments to our filters, including focusing on the specific keywords of interest: rain, raining, rainfall, and rained. By using these domain-specific words we were able to reduce the amount of non-rain ‘noise’ by over 28% and ended up with approximately 450,000 rain- and weather-related tweets from around the world. But how many were from the Louisville area?
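A minimal sketch of this two-pass filter in Python, using the refined terms from above; the tweet texts are invented for illustration:

```python
# Refined filter: exact keywords catch the common rain words without
# matching "training" or "brain"; substrings are kept only for terms
# unlikely to appear inside unrelated words.
KEYWORDS = {"rain", "raining", "rainfall", "rained", "weather",
            "hail", "lightning", "pouring"}
SUBSTRINGS = ("storm", "flood", "precip")

def is_rain_related(text):
    lowered = text.lower()
    words = (w.strip(".,!?") for w in lowered.split())
    if any(w in KEYWORDS for w in words):
        return True
    return any(s in lowered for s in SUBSTRINGS)

tweets = ["It is raining cats and dogs in Louisville",
          "Great training session at the gym",
          "Flash flooding downtown!"]
print([t for t in tweets if is_rain_related(t)])
```

Note how the middle tweet, which would have matched the original “rain” substring via “training,” is now correctly excluded.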

Finding Tweets at the County and City Level – Finding the needle in the haystack

The second step was mining this Twitter data for geographic metadata that would allow us to geo-reference these weather-related tweets to the Louisville, KY area. There are generally three methods for geo-referencing Twitter data:

  • Activity Location: tweets that are geo-tagged by the user.
  • Profile Location: parsing the Twitter Account Profile location provided by the user.
    • “I live in Louisville, home of the Derby!”
  • Mentioned Location: parsing the tweet message for geographic location.
    • “I’m in Louisville and it is raining cats and dogs”

Having a tweet explicitly tied to a specific location or a Twitter Place is extremely useful for any geographic analysis. However, the percentage of tweets with an Activity Location is less than 2%, and these were not available for this 2009 event. Given that, what chance was there to be able to correlate tweet activity with local rain events?

For this event we searched for any tweet that used one of our weather-related terms, and either mentioned “Louisville” in the tweet or came from a Twitter account with a Profile Location setting including “Louisville.” It’s worth noting that since we live near Louisville, CO, we explicitly excluded account locations that mentioned “CO” or “Colorado.” (By the way, the Twitter Profile Geo Enrichments announced yesterday would have really helped our efforts.)

After applying these geographic filters, the number of tweets went from 457,000 to 4,085. So, based on these tweets, did we have any success in finding a Twitter response to this extreme rain event in Louisville?
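The geographic pass can be sketched the same way. The field names below are illustrative, not Gnip’s actual payload schema, and the Colorado check is a rough heuristic of the kind described above:

```python
def is_louisville(tweet):
    """Keep a tweet that mentions Louisville in its text (Mentioned
    Location) or profile (Profile Location), excluding Louisville, CO."""
    text = tweet.get("text", "").lower()
    profile = tweet.get("profile_location", "").lower()
    # Rough exclusion of Colorado accounts.
    if "colorado" in profile or " co" in profile:
        return False
    return "louisville" in text or "louisville" in profile

tweets = [
    {"text": "So much rain in Louisville right now", "profile_location": ""},
    {"text": "rainy day", "profile_location": "Louisville, KY"},
    {"text": "storm rolling in", "profile_location": "Louisville, CO"},
]
print([t["text"] for t in tweets if is_louisville(t)])
```

Parsing free-text profile locations is inherently noisy, which is part of why structured geo enrichments are so valuable for this kind of analysis.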

Did Louisville Tweet about this event?

Figure 1 compares tweets per hour with hourly rainfall from a gauge located just west of downtown Louisville on the Ohio River. As with the Las Vegas data presented previously, the tweets occurring during the rain event display a clear response, especially when compared to the “baseline” level of tweets before the event occurred. Tweets around this event spiked as the storm entered the Louisville area. The number of tweets per hour peaked as the heaviest rain hit central Louisville and remained elevated as the flooding aftermath unfolded.


Louisville Rain Event

Figure 1 – Louisville, KY, August 4-5, 2009. Event had 4085 activities, baseline had 178.

Other examples of Twitter signal compared with local rain gauges

Based on the ten events we analyzed it is clear that social media is a popular method of public communication during significant rain and flood events.

In Part 3, we’ll discuss the opportunities and challenges social media communication brings to government agencies charged with public safety and operating early-warning systems.


Cracking the Code to Discovering Insights in Consumer Conversations:

How Networked Insights helps brands make data-driven marketing decisions

While short-form content is good for predicting trends, long-form content carries distinctive elements that enable deeper, threaded analysis of ongoing conversations and commentary. This analysis can help marketers better understand pre- and post-sale behavior, identifying what moves consumers through the purchase funnel. No surprise that smart analytics companies are focusing on learning more about these long-form sources to complement short-form content sources, such as Twitter.

To fully serve the needs of their customers, Plugged In partner Networked Insights wanted to provide a comprehensive view of consumer conversations—which required tracking across different social media platforms, including more long-form content sources. To do this, it was important to Networked Insights to have complete access to the firehose of WordPress and IntenseDebate data from Gnip. Through their next-generation analytics platform SocialSense, Networked Insights helps brand marketers and CMOs better understand their consumers by providing insights from the social sphere—by focusing on three distinctive areas of insights: Audience, Content, and Media. These insights help the company’s clients gain a clearer picture of consumer behavior and affinities, uncover new audiences, optimize advertising spend, and inform a wide range of marketing decisions.

Audience Insight: Networked Insights collects and analyzes millions of data points a day from a myriad of sources. They understand that social data isn’t just about Twitter and Facebook, so they incorporate as much long-form content as possible from various blogs and forums. In fact, more than 20% of the long-form content Networked Insights consumes comes from WordPress— a source available through Gnip’s exclusive partnership with WordPress.

The chart below explores the classic data-mining marketing illustration of Dads shopping for beer and diapers, giving an example of how audience insights play into marketing decisions. Networked Insights looked at mentions of both diapers and beer during a nine-month period. The greatest correlation between the two topics occurred as Dads prepared for the Super Bowl. What can marketers and advertisers take away from this? It’s a great opportunity for diaper consumer packaged goods (CPG) brands to access the typically expensive-to-reach sports fan audience.

Beer vs Diapers on WordPress

Content Insight: Building from the audience insights they uncover, Networked Insights works with marketers to provide a 360-degree view of their brand, competitors, or ecosystem’s consumers. They accomplish this by identifying the key affinities (favorite celebrities, musicians, TV shows, etc.) within a brand’s key target audiences. This gives a huge advantage to marketers because millions of dollars are spent every year guessing at what content will resonate with their target consumers.

So how does long-form content lend a different kind of insight? One difference is the length of user engagement. The engagement that WordPress users create is more sustained than what Twitter users create, even though WordPress volume pales in comparison to Twitter volume. Here’s one way to see the difference using social mentions related to TV shows as an example. After an episode airs, the number of conversations on WordPress doesn’t decrease as quickly as it does on Twitter, as you can see in the two charts below. Since conversations are shared on WordPress longer, this results in an increased capacity for marketers to get deep, actionable insights.

Reality Programming: Twitter vs WordPress

Comedy Programming: Twitter vs WordPress

Media Insight: By leveraging audience insights with content affinities of the target consumers, marketers can now more effectively buy media and organically reach new consumers. A great example of this is work Networked Insights did with a consumer tech company to amplify the reach of their digital ads.  Through access to real-time blog data from WordPress, Networked Insights distinguished early tech trends and themes that would ultimately trickle down to the general consumer conversation. They did this by identifying a group of tech influencers—a specific blogging community that relies upon platforms including WordPress—and analyzed how these influencers engaged with tech-related products. Leveraging what they knew about the audience and their consumer behaviors and interests, Networked Insights was able to provide new insights into what content to promote and whom to target on digital, and as a result increased the effectiveness of the company’s cost per impression by over 30%.


Data Science: The Sexiest Profession Going

Data scientists Mohammad Shahangian of Pinterest, Kostas Tsioutsiouliklis of Twitter, and Adam Laiacano of Tumblr discuss the challenges and opportunities in social data.

Data Scientists at Big Boulder

As Gnip’s own data scientist Dr. Skippy was joined on stage by three data scientists representing three prolific social networks, Big Boulder Master of Ceremonies Lindsay Campbell couldn’t help gushing to the crowd, “This is by far the sexiest panel this year.” (A reference to the Harvard Business Review naming data science the sexiest profession of the 21st century.)

Physical appearance aside, there could hardly be a truer statement to Big Boulder attendees: a legion of self-proclaimed data nerds.

Scott Hendrickson, better known as Dr. Skippy, Data Scientist at Gnip, was joined on stage by Mohammad Shahangian of Pinterest, Kostas Tsioutsiouliklis of Twitter, and Adam Laiacano of Tumblr.

A Look at the Data Science Departments

The conversation began with each guest sharing the size of data science teams and roles at their respective organizations.

The data science team at Twitter is currently made up of 7-8 people, looking to build to a team of 20 in the near future (see open positions here). Data scientists at Twitter fall into two departments: a business intelligence and insights team of data scientists, and individual data scientists who are embedded into teams. Data scientists embedded into teams become key stakeholders in improving and evolving the product.

The business intelligence team works collaboratively to explore ideas and create reports, even if it is not always favorable to the company. As Kostas explains, data scientists are trusted at Twitter. It’s ok to report the truth.

At Pinterest, there are 8 full-time data scientists on the team. The primary goal for data scientists is to understand what users are doing, to put pinners first, a strong company value. Much like Twitter, Pinterest data scientists are integrated into other engineering teams. This blend of engineers and data scientists on the same team enables nimble product iterations. Since adding data scientists to the mix at Pinterest, teams are now requesting deeper and deeper metrics to measure success and plan product.

Tumblr’s team of data scientists is also eight strong, split across two roles: a six-person search and discovery team, and a two-person, highly self-reflective business intelligence team. The search and discovery team is tasked with maintaining the quality of the data, building products that make the data usable, and ensuring the end product is something users enjoy. The business intelligence team investigates the actions users take to determine which are indicative of long-term success; the outcome is most frequently reporting.

Data Science Impact on Product

At Tumblr, there is a significant amount of testing around registration and onboarding, what users see when they first land on the site. However, Adam is quick to add that Tumblr has a unique view on their research, stating, “You don’t have to do as much research on your product when you use it yourself.”

Data scientists at Twitter report metrics all the way to the top. The CEO and the executives ask questions about the data around the launch of a new product and value the input of data scientists.

By sharing data with product teams, Pinterest engineers are being driven by the data. Mohammad shares, “After exposing metrics to people, the first instinct is to want to make the metrics better. This brings a culture of people who come to the data science team and seek their input. They take the ideas of product and run some queries to see if the data validates it. We’ve made it very easy for product teams to set up experiments, we don’t even call them experiments anymore.” Expounding on this, he shares an anecdote from a recent rewrite of the entire website. When it launched, the data scientists noticed a dip in follows. Investigation by the team led to the understanding that the enhanced speed of the rewritten website had eliminated a small lag that followed a user’s Like, a window during which users had been following pinners on the site. By correcting the lag, follows went back up.

Who You Callin’ Sexy?

As Dr. Skippy joked about the popularity, ahem sexiness, of the data science title, conversation turned to the lack of an industry-standard definition for the role, noting there is often confusion and a lack of differentiation from business analyst and business intelligence roles.

Kostas began by noting that data science is not about analyzing but about prediction. Twitter data scientists are also engineers. Backgrounds of Twitter data scientists include statistics, data mining, machine learning, and engineering.

Further delineating the role from data analysts, Mohammad points out that analysts aren’t pulling their own data. Continuing, he added, “If you can’t pull your own data, how can you figure out what you want? A data scientist is skeptical. If results seem too good to be true, they will investigate. Question the data. Analysts will take the data as the data.”

Adam describes a good data scientist as an individual who can take data in any format and clean it up, who can take weird, fuzzy forms and see what information is available in them: connecting the puzzle and building a data set that is useful.

The Future For Analysis of Social Data

Much of data science to date has been ad hoc, but the panelists agree that as you look closely at what data scientists do, it’s templates and patterns. Over time this work will become progressively more standardized. With new, faster tools it will move away from ad hoc processes. Teams will build models and tools to solve recurring problems.

Adam of Tumblr added optimistically that the future is the work data scientists will do as they collect data across platforms and across multiple streams. It’s up to those developing third-party tools and resources to innovate using all the data.

Lastly, Mohammad chimed in that machine learning and prediction modeling are the sexy amongst the sexy, adding, “That’s what we’re all waiting for.”

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, Big Boulder on Storify and on Gnip’s Facebook page.

Twitter Certified Partners and International Expansion

An interview with Conway Chen and Zach Hofer-Shall of Twitter on Twitter Certified Partners and International Expansion.

Zach Hofer-Shall and Conway Chen of Twitter

As Chris Moody sat down with Conway Chen and Zach Hofer-Shall of Twitter this morning, the conversation began with shared optimism about the increased talk about Twitter data. All panelists were quick to praise the recent conversation with Twitter CEO Dick Costolo at All Things D, where the Twitter data stream was the star of the conversation.

Conway explained this emerging interest in data with an anecdote about Twitter’s early expectations when opening the data stream: expectations that were little to none. Instead, it is the innovation built using the data that is making Twitter infinitely more valuable.

Twitter Data Is Special

4 things set Twitter data apart:

1. It is real time

2. It is public

3. It is conversational; people aren’t just speaking into the ether, the conversation goes both ways

4. It is distributed

Honor Thy User

It is a delicate balance to simultaneously respect the users creating the data while also wanting to get data out there and ensure it is monetizable. Zach is quick to mention strict adherence to and support of one of Twitter’s core values: defend and respect the user’s voice. He continues by stating that if this goes wrong, the whole system falls apart.

Twitter has mindfully created a structure that honors this, a key component of which is data resellers. Data resellers enable Twitter to maintain values and still be able to scale. These partnerships have allowed Twitter to encourage and foster innovation in ways they would not have been able to.

Sustainability and Long-term Growth

Conway: We are absolutely committed to the success of Twitter data and the ecosystem around it. We keep asking: is the data we are pushing out correct? Is the way we are pushing it out helping resellers and developers innovate and build on it? Twitter data, and the strategy around it, is pivotal to how Twitter sees its growth.

Data is a core part of the business that wasn’t always seen as a core part of the business. We are so invested in the success of Twitter data long term that we are committed to seeing it scale. And a key part of that is improving efficiency.

There is an understanding now that Twitter data is important, which speaks volumes to the sustainability of the system. People don’t need to be sold on access to the data; they are instead interested in how resellers can make that data useful to them.

Twitter Certified Partner Program

Zach defines the Twitter Certified Partner Program as the answer to skeptics who say Twitter doesn’t like its ecosystem. The program was established to help the ecosystem grow, help partners succeed, and grant providers Twitter’s seal of approval.

The program ultimately acts as a tool to empower innovation on the Twitter stream. Twitter does not have the capacity to create these tools and resources independently. Less than a year old, the program has been adding 5 to 10 strategic companies each quarter. Factors when selecting certified partners include innovative uses of the data (beyond analytics and engagement) and strategic international partnerships.

Certified partners benefit from the instant credibility that membership in the program provides when talking to investors and customers, access to prioritized developer support, and promotion from the Twitter sales team. Twitter sales team members are trained on and knowledgeable about certified partner products. As the team sells promoted content, they are also able to suggest and recommend partners to fill needs Twitter cannot.

International Growth: Not Just Language Localization

Conway identifies two areas of growth that are current bright spots: Europe and Japan. In identifying new markets, Twitter is looking for existing ecosystems where they can bolster and support what’s already happening. Brazil, Japan, South Korea and India are four regions appealing to Twitter now.

Localization isn’t just localization in terms of language; there is localization of analytics and data types as well.

International tools looking to join the Twitter Certified Partner Program need to match the same high standards of other partners. Twitter works with products in new markets to bring them to their standards.


Conway calls for service providers to develop tools to empower advertisers to move to ROI driven decisions. He encourages developers to focus on tools to provide actionable insights to inform ad-spend.

The Future of Twitter Data

In a word: media. In the last year, Twitter has blossomed beyond the 140 characters to the media hung off those characters. Innovation in the data will include tapping into what is attached to the Tweet, not just the Tweet itself.


Self-defining as a mobile-first company, Conway identifies explaining why geodata volume remains so low as one of his biggest pain points. It is a balance between respecting users’ privacy first and acknowledging that delivering a better consumer experience depends on the inclusion of geodata. Ultimately, Conway categorizes it as a product-side problem: getting users to opt in to sharing their data.

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, Big Boulder on Storify and on Gnip’s Facebook page.

Big Boulder 2013

Big Boulder’s back for 2013 and better than ever.

The leaders in social data: Facebook, Twitter, Tumblr, Foursquare, Automattic, Disqus and many more are descending on Boulder again this summer to talk about the future of their platforms. Last year was a huge success and the expectations this year are even higher. We have a line-up that will deliver!

Headshots for Big Boulder

We’ll go deep into Asia and Latin America with speakers from China, Brazil and Japan, including the CEO of LINE, one of the fastest growing social networks on the planet. We’ll hear about non-traditional applications of Social Data with discussions on Finance, Government, Academic Research and Data Science. And to help us make sense of it all, we’ll have industry analysts discussing their views of the future. See the agenda and speakers pages for all the details.

In addition to all the great topics covered in the sessions, we’ve left plenty of time for networking with others in Social Data, including sunset cocktails with views of the Flatirons, a bicycle pub crawl, and since this is Boulder after all, morning yoga and hiking.

Big Boulder is an invite-only event for the leaders in the social data ecosystem. Space is filling up quickly so if you’re still thinking about it, sign up now before we hit capacity. Interested in coming but haven’t been invited? First check out our blog post about social data vs. social media. If you’re all about social data, email for information.