The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows the evolution of languages of Tweets according to the language that the user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets and the vertical order is based on rank vs. other languages.
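
The two quantities plotted here, share of total Tweets and rank among languages, can be computed along the following lines. This is a minimal sketch, not the code behind the visualization: the `language_shares` function and the sample language codes are hypothetical, and ties are broken alphabetically purely for illustration.

```python
from collections import Counter


def language_shares(tweet_languages):
    """Return (language, percent_of_total, rank) tuples, rank 1 = largest share."""
    counts = Counter(tweet_languages)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (lang, 100.0 * n / total, rank)
        for rank, (lang, n) in enumerate(ranked, start=1)
    ]


# Hypothetical sample: one profile-language code per Tweet.
sample = ["en"] * 6 + ["ja"] * 2 + ["es"] * 1 + ["pt"] * 1
shares = language_shares(sample)
for lang, pct, rank in shares:
    print(f"{rank}. {lang}: {pct:.0f}%")
```

Running the same calculation once per year gives the line heights (percent) and vertical order (rank) for each language over time.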

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language, but with only a 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese emerged as the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring – a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast growing online population. Turkish starts to see growth and we expect that growth will continue to spike after the Occupygezi movement. Or step back for a broader view of the timeline; the suggestion of a spread in the globalization of communication networks comes to mind. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post later tomorrow for that story.)

Energy Industry Use of Sentiment Analysis

Companies don’t rely solely on sentiment analysis to inform strategic decisions, but it is a powerful complement to traditional market intelligence. Like many things in the social data ecosystem, sentiment is a rapidly evolving tool. While challenges like accurately identifying and classifying irony, sarcasm, and emoticons exist, companies are meeting those challenges with increasingly sophisticated, Twitter syntax-specific tools.

IHS, a leading analysis and information provider – and one we’re excited to announce as a new Plugged In partner – built a sentiment intelligence tool to facilitate its clients’ use of social data. This past summer, IHS released the U.S. Sentiment Index, a tool that assesses realtime Tweets, providing a representation of the average mood of the United States.

We were curious to learn more about how sentiment analysis is being used across industries not typically known for their use of social data. IHS shared an example of how companies, in this case in the oil and gas industry, incorporated sentiment analysis to provide a deeper understanding of public opinion on hydraulic fracturing, commonly referred to as “fracking.” IHS looked at the sentiment of fracking-related Tweets globally, as well as in specific states like Colorado. The analysis determined in which states the most Tweets about fracking originated and which keywords are most commonly associated with the topic. Both findings contribute to the companies’ understanding of the drivers of public sentiment on fracking – valuable information.

To expose another layer of insight, IHS used network analysis to understand and measure the virality of messages. One of the takeaways from the research was that the content of a message is not as important as understanding which voices influence the dissemination of that message. Further, the number of followers an influencer has was not as important as whether one of those followers retweeted the message outside the influencer’s immediate circle of followers. These are just a few highlights from a more in-depth paper we wrote.
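
One simple way to picture the "outside the immediate circle" measurement is to split the accounts that retweeted a message into those that follow the original author and those that do not. The sketch below is an illustration only, not IHS's method: the `split_retweeters` function, the account names, and the follower map are all hypothetical.

```python
def split_retweeters(influencer, followers_of, retweeters):
    """Split retweeters into the influencer's own followers and everyone else.

    followers_of: dict mapping an account to the set of accounts following it.
    retweeters: iterable of accounts that retweeted the influencer's message.
    """
    circle = followers_of.get(influencer, set())
    inside = {a for a in retweeters if a in circle}
    # Accounts outside the circle most likely saw the message via someone's Retweet.
    outside = {a for a in retweeters if a not in circle and a != influencer}
    return inside, outside


# Hypothetical data: three followers, three retweeters.
followers_of = {"influencer": {"ann", "bob", "cat"}}
retweeters = ["ann", "dan", "eve"]
inside, outside = split_retweeters("influencer", followers_of, retweeters)
print(sorted(inside), sorted(outside))
```

Under this framing, a message with a large `outside` set has spread beyond the author's direct audience, regardless of how many followers the author has.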

Gnip Introduces Search API for Twitter: Instant and Complete Access to Recent Twitter Data

Today Gnip is announcing the public availability of a new product offering, our Search API for Twitter. We couldn’t be more excited about its disruptive potential to get social data into more tools, more products, and ultimately, more business decisions.

Ever since we announced our industry-first partnership with Twitter in 2010, customers have told us how important it is to them to deliver answers to questions about what has just happened on Twitter and to deliver these answers confidently and quickly. When the conversation about a brand or a product gets heated, brands immediately turn to the companies we serve to engage, measure, and respond. Our new Search API enables Gnip customers to build the product to meet these needs in a way that hasn’t been feasible before.

Although we’re announcing the public availability of this offering today, we’ve been working with a few close partners to integrate the Search API into their products and have seen the impact it has made for them. Realtime marketing leader Expion has used the Search API to build tools to help marketers instantly engage and react to marketing opportunities for their brands that are happening now. Simply Measured, provider of easy and powerful social media analytics, is using the Search API to enable their customers to quickly understand the broader context around conversations about their brands or products. In these early uses, we’ve heard from our customers about how important the “fast” and “instant” experience is for their needs.

Like all Gnip products, we’ve built our Search API to be enterprise-ready and capable of handling the most demanding and business-critical use cases. When any Tweet can be the one that changes a decision, every Tweet matters. We’ve built the redundancy and robustness into the Search API to meet the needs of the most sophisticated and advanced social data uses. At the same time, the Search API also provides an easy path to get started for those new to social data.

With the Search API, the foundational first step in social analytics (counting Tweets over a timeframe about a brand or product) is easier than ever, and is now possible without having to tackle the traditional challenges of big data, streaming data, or even storing data at all.

When reliable, sustainable, and complete social data is easier to get and use, we expect even more business decisions to be made using this incredible data. We can’t wait to help accelerate this future and can’t wait to see what our customers and partners build with this data.

If you’re interested in learning more about our new Search API for Twitter, please visit or email

Quantifying Tweets: Trading on 140 Characters

Social media analysis… for traders? That statement 5 years ago elicited a much different response than it does now. The market now recognizes the importance – and the impact – of social media channels, and as such, has recognized the need to monitor and trade off the research created from that data.

One of the earliest and most important voices in this conversation was that of Joe Gits and the Social Market Analytics team, which is why we are incredibly excited to announce that they’ve joined the Plugged In to Gnip partner program.

Social Market Analytics quantifies social data for traders, portfolio managers, hedge funds and risk managers. Their technology extracts, evaluates and calculates social data to generate directional sentiment and volatility indications on stocks, ETFs, sectors and indices – providing predictive indicators for clients. They have succeeded in turning qualitative text into quantitative indicators that can easily be incorporated into trading strategies – broadening the types of traders and firms who can now access and incorporate social signal into their decision making.

As shown in their recently announced agreement with the New York Stock Exchange (allowing NYSE reseller and distribution rights to their sentiment feed), SMA is helping bring social analytics to a wider group of financial firms than has ever been possible. That client base requires the highest level of enterprise reliability in the products they buy – which means SMA’s product requires the strongest data foundation possible. Gnip is honored to be the company providing the reliable and complete access to the social data that fuels this solution.

To see what their technology looks like, check out the webinar we recently held with them.

Sampling: Not Just For Rappers Anymore

The Gnip data science team (Dr. Josh Montague, Brian Lehman and myself) has been thinking about firehose sampling over the last few weeks. We see research and hear stories based on analysis of a randomly sampled subset of the Twitter, Tumblr or other social data firehose. This whitepaper looks at the common trade-off we see in sampling: we want a data stream that represents the entire audience of a social platform, while controlling costs and limiting activities to match analysis capacity.

Both Gnip’s customers and the greater social data ecosystem frequently use sampling to assess patterns in social data. We created a whitepaper that provides a step-by-step methodology to calculate social activity rates, confidence in our rate estimates and the ability to identify signals of emerging topics or stories. We wanted to know what we could detect and with what certainty.

The sampling whitepaper describes the tradeoffs between the three key variables in sampling social data: the rate of activities (e.g. the number of blog posts or Tweets over time), confidence levels around our estimates of that rate, and meaningful changes in those rates, i.e., signal. These three variables are interrelated and present a measurement challenge: given the choices or constraints imposed on two of these parameters, you then calculate the third.
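
To make the link between rate and confidence concrete, here is a minimal sketch that assumes activities arrive as a Poisson process and uses the normal approximation for the interval. Both assumptions are ours for illustration and may differ from the method in the whitepaper; the function name and the example numbers are hypothetical.

```python
import math


def rate_confidence_interval(count, duration, z=1.96):
    """Approximate confidence interval for a Poisson rate.

    count: number of activities observed.
    duration: length of the observation window (any time unit).
    z: normal quantile, 1.96 for roughly 95% confidence.
    """
    rate = count / duration
    # For a Poisson count the standard deviation is sqrt(count).
    half_width = z * math.sqrt(count) / duration
    return rate, max(0.0, rate - half_width), rate + half_width


# Hypothetical measurement: 400 activities counted in 60 minutes.
rate, low, high = rate_confidence_interval(count=400, duration=60.0)
print(f"rate = {rate:.2f}/min, approx. 95% CI = [{low:.2f}, {high:.2f}]")
```

The relative half-width is `z / sqrt(count)`, so it depends only on how many activities were counted. That is why the normal approximation becomes unreliable for very small counts, where an exact Poisson interval would be the safer choice.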

Sampling - Confidence vs Activities vs Signals

While the whitepaper deals with the tradeoffs of activity, signal and confidence when designing a measurement, that is a little abstract. To make this more concrete, think of the trade-off problem as a way of addressing questions like those below. If you’ve asked any of these questions in your own work with social data, we think this whitepaper might help.

  • The activity rate has doubled from five counts to ten counts between two of my measurements. Is this a significant change, or is this expected variation e.g. due to low-frequency events?

  • I want to minimize the total number of activities that I consume (for reasons of cost, storage, etc). How can I do this while still detecting a factor of two change in activity rate in one hour?

  • How long should I count activities to detect a change in rate of 5%?

  • How do I describe the trade-off between signal latency and rate uncertainty?

  • How do I define confidence levels on activity rate estimates for a time series with only twenty events per day?

  • I plan to bucket the data in order to estimate activity rate, how big (i.e. what duration) should the buckets be?

  • How many activities should I target to collect in each bucket in order to have 95% confidence that my activity rate estimate is accurate for each bucket?
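
Two of the questions above can be sketched numerically. The example assumes Poisson-distributed counts and equal-length measurement windows, and it uses a conditional binomial test and a normal-approximation sample size. These are standard techniques chosen for illustration, not necessarily those in the whitepaper, and the function names are hypothetical.

```python
import math


def increase_p_value(count_before, count_after):
    """One-sided p-value for seeing count_after or more by chance.

    If both windows share the same rate, then given the combined total n,
    the second count follows Binomial(n, 0.5).
    """
    n = count_before + count_after
    tail = sum(math.comb(n, k) for k in range(count_after, n + 1))
    return tail / 2 ** n


def counts_for_relative_error(relative_error, z=1.96):
    """Activities needed so the rate estimate is within +/- relative_error."""
    return math.ceil((z / relative_error) ** 2)


p = increase_p_value(5, 10)
print(f"five counts to ten counts: p = {p:.3f}")
print("counts for +/-5% at 95%:", counts_for_relative_error(0.05))
```

With only fifteen activities in total, a doubling from five to ten has a p-value of about 0.15, so it is consistent with ordinary variation. Pinning a rate down to within 5% at 95% confidence takes roughly 1,537 activities per bucket under these assumptions.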

Our summer data science intern, Jinsub Hong, and data scientist Brian Lehman created an animation to help visualize the relationship between confidence interval size, time of observation (or, alternatively, the number of activities observed), and the signal we can detect in a firehose of social data.

The animation below shows confidence intervals for different bin sizes. As the bin size increases, we count more events, so the rate estimate becomes increasingly certain. However, we have to wait longer to get the result (latency).

At what bin size can we be confident that the activity rate has changed significantly?  For short buckets of only a minute or two, the variation in the measured rate is large, comparable to the potential signal.  For longer buckets, the signal becomes more distinct, but the time we have to wait in order to make this conclusion goes up accordingly.

The first and last frame show representative potential signals. In the first frame, this potential signal is about the same size as the variability of the activity rate, so we can’t conclusively say the activity rate has changed. With the larger bin size in the final frame, the signal is much larger than the activity rate uncertainty. We can be confident this represents a real change in the activity rate.
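
The effect the animation illustrates can be approximated with a few lines of arithmetic. This sketch again assumes Poisson counts and the normal approximation. The activity rate, the signal size, and the crude "signal larger than half-width" criterion are hypothetical choices for illustration, not values taken from the animation.

```python
import math


def relative_half_width(rate_per_minute, bin_minutes, z=1.96):
    """Relative half-width of the approx. 95% interval for one bin's rate."""
    expected_count = rate_per_minute * bin_minutes
    return z / math.sqrt(expected_count)


RATE = 10.0    # hypothetical baseline: 10 activities per minute
SIGNAL = 0.30  # hypothetical signal: a 30% change in rate

for bin_minutes in (1, 5, 15, 60):
    hw = relative_half_width(RATE, bin_minutes)
    verdict = "distinct" if SIGNAL > hw else "lost in the noise"
    print(f"{bin_minutes:>2} min bins: +/-{hw:.0%} uncertainty, signal {verdict}")
```

Because the uncertainty shrinks only with the square root of the bin size, halving it requires waiting four times as long, which is the latency cost described above.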

For full details, you can download the paper at! If you have questions about the whitepaper, please leave a comment below.

Tweeting in the Rain, Part 3

(This is part 3 of our series looking at how social data can create a signal about major rain events. Part 1 examines whether local rain events produce a Twitter signal. Part 2 looks at the technology needed to detect a Twitter signal.) 

What opportunities do social networks bring to early-warning systems?

Social media networks are inherently real-time and mobile, making them a perfect match for early-warning systems. A key part of any early-warning system is its notification mechanisms. Accordingly, we wanted to explore the potential of Twitter as a communication platform for these systems (See Part 1 for an introduction to this project).

We started by surveying operators of early-warning systems about their current use of social media. Facebook and Twitter were the most mentioned social networks. The level of social network integration varied widely, depending largely on how much public communication was part of their mission. Agencies with a public communications mission viewed social media as a potentially powerful channel for public outreach. However, as of early 2013, most agencies surveyed had a minimal and static social media presence.

Some departments have little or no direct responsibility for public communications and have a mission focused on real-time environmental data collection. Such groups typically have elaborate private communication networks for system maintenance and infrastructure management, but serve mainly to provide accurate and timely meteorological data to other agencies charged with data analysis and modeling, such as the National Weather Service (NWS). Such groups can be thought of as being on the “front-line” of meteorological data collection, and have minimal operational focus on networks outside their direct control. Their focus is commonly on radio transmissions, and dependence on the public internet is seen as an unnecessary risk to their core mission.

Meanwhile, other agencies have an explicit mission of broadcasting public notifications during significant weather events. Many groups that operate flood-warning systems act as control centers during extreme events, coordinating information between a variety of sources such as the National Weather Service (NWS), local police and transportation departments, and local media. Hydroelectric power generators have Federally-mandated requirements for timely public communications. Some operators interact with large recreational communities and frequently communicate about river levels and other weather observations including predictions and warnings. These types of agencies expressed strong interest in using Twitter to broadcast public safety notifications.

What are some example broadcast use-cases?

From our discussions with early-warning system operators, some general themes emerged. Early-warning system operators work closely with other departments and agencies, and are interested in social networks for generating and sharing data and information. Another general theme was the recognition that these networks are uniquely suited for reaching a mobile audience.

Social media networks provide a channel for efficiently sharing information from a wide variety of sources. A common goal is to broadcast information such as:

  • Transportation Information about road closures and traffic hazards.

  • Real-time meteorological data, such as current water levels and rain time-series data.

Even when a significant weather event is not happening, there are other common use-cases for social networks:

  • Scheduled reservoir releases for recreation/boating communities.

  • Water conservation and safety education.

[Below] is a great example from the Clark County Regional Flood Control District of using Twitter to broadcast real-time conditions. The Tweet contains location metadata, a promoted hashtag to target an interested audience, and links to more information.

— Regional Flood (@RegionalFlood) September 8, 2013

So, we tweet about the severe weather and its aftermath, now what?

We also asked about significant rain events since 2008. (That year was our starting point since the first tweet was posted in 2006, and in 2008 Twitter was in its relative infancy. By 2009 there were approximately 15 million Tweets per day, while today there are approximately 400 million per day.) With this information we looked for a Twitter ‘signal’ around a single rain gauge. Part 2 presents the correlations we saw between hourly rain accumulations and hourly Twitter traffic during ten events.
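
As a reminder of what such a correlation measures, the sketch below computes a Pearson correlation coefficient between two hourly series. The rain and Tweet numbers are invented for illustration and are not data from this study. Part 2 describes the actual analysis, which may use a different statistic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly values around one rain gauge.
rain_inches = [0.0, 0.1, 0.5, 1.2, 0.8, 0.2]
tweet_counts = [3, 5, 20, 45, 30, 8]
r = pearson(rain_inches, tweet_counts)
print(f"correlation between hourly rain and hourly Tweets: {r:.2f}")
```

A value near 1 means Tweet volume rises and falls with rainfall hour by hour, while a value near 0 means no linear relationship.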

These results suggest that there is an active public using Twitter to comment and share information about weather events as they happen. This provides the foundation to make Twitter a two-way communication platform during weather events. Accordingly, we also asked survey participants if there was interest in also monitoring communications coming in from the public. In general, there was interest in this along with a recognition that this piece of the puzzle was more difficult to implement. Efficiently listening to the public during extreme events requires significant effort in promoting Twitter accounts and hashtags. The [tweet to the left] is an example from the Las Vegas area, a region where it does not require a lot of rain to cause flash floods. The Clark County Regional Flood Control District detected this Tweet and retweeted within a few minutes.


Any agency or department that sets out to integrate social networks into their early-warning system will find a variety of challenges. Some of these challenges are more technical in nature, while others are more policy-related and protocol-driven.

Many weather-event monitoring systems and infrastructures are operated on an ad hoc, or as-needed, basis. When severe weather occurs, many county and city agencies deploy temporary “emergency operations centers.” During significant events, personnel are often already “maxed out” operating other data and infrastructure networks. There are also concerns over data privacy, that the public will misinterpret meteorological data, and that there is little ability to “curate” the public reactions to shared event information. Yet another challenge cited was that some agencies have policies that require special permissions to even access social networks.

There are also technical challenges when integrating social data. From automating the broadcasting of meteorological data to collecting data from social networks, there are many software and hardware details to implement. In order to identify Tweets of local interest, there are also many challenges in geo-referencing incoming data. (These challenges are made a lot easier by the new Profile Location enrichments.)

Indeed, effectively integrating social networks requires effort and dedicated resources. The most successful agencies are likely to have personnel dedicated to public outreach via social media. While the Twitter signal we detected seems to have grown naturally without much ‘coaching’ from agencies, promotion of agency accounts and hashtags is critical. The public needs to know what Twitter accounts are available for public safety communications, and hashtags enable the public to find the information they need. Effective campaigns will likely attract followers using newsletters, utility bills, Public Service Announcements, and advertising. The Clark County Regional Flood Control District even mails a newsletter to new residents highlighting local flash flood areas while promoting specific hashtags and accounts used in the region.

The Twitter response to the hydrological events we examined was substantial. Agencies need to decide how to best use social networks to augment their public outreach programs. Through education and promotion, it is likely that social media users could be encouraged to communicate important public safety observations in real time, particularly if there is an understanding that their activities are being monitored during such events. Although there are considerable challenges, there is significant potential for effective two-way communication between a mobile public and agencies charged with public safety.

Special thanks to Mike Zucosky, Manager of Field Services, OneRain, Inc., my co-presenter at the 2013 National Hydrologic Warning Council Conference.


The Timing of a Joke (on Twitter) is Everything

At Gnip, we’re always curious about how news travels on social media, so when the Royal Baby was born, we wanted to find some of the most popular posts. While digging around, we found a joke about the Royal Baby Tweeted by the account @_Snape_ on July 22, 2013.

With more than 53,000 Retweets and more than 19,000 Favorites, this Tweet certainly resonated with Snape’s million-plus followers. But while exploring the data, we saw an interesting pattern: people were Retweeting this joke before Snape used it. How was this possible? At first, we assumed that Snape was a joke-stealer, but going back several years, we saw that Snape had actually thrice Tweeted the same joke!

Interest in the joke varied by Snape’s delivery date. An indicator for how well a joke resonates with an audience is the length of time over which the Tweet is Retweeted. In general, content on Twitter has a short shelf life, meaning that people typically stop Retweeting the content within a few hours. The graph below has dashed lines indicating the half-life for each time Snape delivered the joke, which we can use to see how much time passed before half of the total Retweets took place. For the first two uses of the joke, half of all Retweets took place within an hour. The third use has a significantly longer half-life, especially by Twitter’s standards. Timed with the actual birth of the new prince, Prince George, the July date coincided with all of the anticipation about the Royal Baby and created the perfect storm for this Half-blooded Prince joke to keep going…and going…and going. The timing was impeccable, and it shows that timing matters for humor on Twitter.
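
The half-life described above can be computed directly from Retweet timestamps. This is a minimal sketch of the definition given in the text, not the code used to produce the graph. The function name and the sample times (minutes after posting) are hypothetical.

```python
def retweet_half_life(posted_at, retweet_times):
    """Time from posting until half of all observed Retweets had taken place."""
    if not retweet_times:
        raise ValueError("need at least one Retweet")
    ordered = sorted(retweet_times)
    # Index of the first Retweet at which at least 50% of the total is reached.
    half_index = (len(ordered) + 1) // 2 - 1
    return ordered[half_index] - posted_at


# Hypothetical Retweet times, in minutes after the original Tweet.
times = [1, 2, 3, 5, 8, 30, 120, 600]
print("half-life:", retweet_half_life(0, times), "minutes")
```

In this made-up example the last Retweet arrives ten hours after posting, yet the half-life is only five minutes, which matches the short shelf life typical of Twitter content.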

4 Things You Need To Know About Migrating to Version 1.1 of the Twitter API

Access to Twitter data through their API has been evolving since its inception. Last September, Twitter announced their most recent changes, which will take effect this coming March 5. These changes make enhancements to feed delivery, while further limiting the number of Tweets you can get from the public Twitter API.

The old API was version 1.0 and the new one is version 1.1. If your business or app relies on Twitter’s public API, you may be asking yourself “What’s new in Twitter API 1.1?” or “What changed in Twitter API 1.1?” While there’s not much new, a lot has changed and there are several steps you need to take to ensure that you’re still able to access Twitter data after March 5th.

1. OAuth Connection Required
In Twitter API 1.1, access to the API requires authentication using OAuth. To get your Twitter OAuth token, you’ll need to fill out this form. Note that rate limits will be applied on a per-endpoint, per-OAuth token basis, and distributing your requests among multiple IP addresses will no longer work as a workaround. Requests to the API without OAuth authorization will not return data and will receive an HTTP 410 Gone response.

2. 80% Less Data
In version 1.0, the rate limit on the Twitter Search API was 1 request per second. In Twitter API 1.1, that changes to 1 request every 5 seconds. A more stark way to put this is that previously you could make 3600 requests/hour but you are now limited to 720 requests/hour for Twitter data. Combined with the existing limits on the number of results returned per request, it will be much more difficult to consume the volume or levels of data coverage you previously could through the Twitter API. If the new rate limit is an issue, you can get full-coverage, commercial-grade Twitter access through Gnip, which isn’t subject to rate limits.
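
The "80% less" figure follows directly from the two rate limits quoted above. The snippet below reproduces the arithmetic and adds a small pacing helper a client could use to stay under the new limit. The helper is a hypothetical illustration, not part of any Twitter or Gnip library.

```python
def requests_per_hour(seconds_per_request):
    """Maximum requests in one hour at one request per the given interval."""
    return 3600 / seconds_per_request


old_limit = requests_per_hour(1)        # version 1.0: 1 request per second
new_limit = requests_per_hour(5)        # version 1.1: 1 request every 5 seconds
reduction = 1 - new_limit / old_limit   # fraction of request capacity lost
print(f"{old_limit:.0f} -> {new_limit:.0f} requests/hour ({reduction:.0%} fewer)")


def seconds_to_wait(last_request_at, now, min_interval=5.0):
    """Hypothetical helper: how long to pause before the next request."""
    return max(0.0, min_interval - (now - last_request_at))
```

A polling loop would sleep for `seconds_to_wait(...)` before each call, so that successive requests are never closer together than the allowed interval.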

3. New Endpoint URLs
Twitter API 1.1 also has new endpoint URLs that you will need to direct your application to in order to access the data. If you try to access the old endpoints, you won’t receive any data and will receive an HTTP 410 Gone response.

4. Hello JSON. Goodbye XML.
Twitter has changed the format in which the data is delivered. In version 1.0 of the Twitter API, data was delivered in XML format. Twitter API 1.1 delivers data in JSON format only. Twitter has been slowly transitioning away from XML, starting with the Streaming API and Trend API. Going forward, all APIs will use JSON and not XML. The Twitter JSON API is a great step forward, as JSON is typically more compact than XML and maps directly onto the native data structures of most programming languages.
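
For applications that parsed XML responses, the change mostly affects the parsing layer. The sketch below reads the same fields from a JSON payload and an XML payload using only the Python standard library. Both payloads are simplified, hypothetical examples and do not reproduce the exact structure of Twitter's responses.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified payloads for the same Tweet.
json_payload = '{"id": 1, "text": "hello", "user": {"screen_name": "example"}}'
xml_payload = (
    "<status><id>1</id><text>hello</text>"
    "<user><screen_name>example</screen_name></user></status>"
)

# JSON decodes straight into dicts, lists, strings and numbers.
tweet = json.loads(json_payload)
json_id = tweet["id"]
json_name = tweet["user"]["screen_name"]

# XML needs path lookups, and every value comes back as text.
root = ET.fromstring(xml_payload)
xml_id = int(root.findtext("id"))
xml_name = root.findtext("user/screen_name")

print(json_id, json_name, xml_id, xml_name)
```

Code that previously walked an XML tree and converted types by hand can usually be replaced with plain dictionary access once responses arrive as JSON.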

All in all, some pretty impactful changes. If you’re looking for more information, we’ve provided some links below with more details. If you’re interested in getting full-coverage, commercial-grade access to Twitter data where rate limits are a thing of the past, check out the details of Gnip’s Twitter offerings. We have a variety of Twitter products, including realtime coverage and volume streams, as well as access to the entire archive of historical Tweets.

Update: Twitter has recently announced that the Twitter REST API v1.0 will officially retire on May 7, 2013. Between now and then they will continue to run blackout tests, and those who have not migrated will see interrupted coverage, so migrating as soon as possible is highly encouraged.

Helpful Links
Version 1.0 Retirement Post
Version 1.0 Retirement Final Dates
Changes coming in Twitter API 1.1
OAuth Application Form
REST API Version 1.1 Resources
Twitter API 1.1 FAQ
Twitter API 1.1 Discussion
Twitter Error Code Responses

Gnip Cagefight #2: Pumpkin Pie vs. Pecan Pie

Thanksgiving is a time for family gatherings, turkey with all the delicious fixings, football, and let’s not forget, pie! If your family is anything like mine, multiple pie flavors are required to satisfy the differing palates and strong opinions. So we wondered, which pies are people discussing for the holiday? What better way to celebrate and answer that question than with a Gnip Cagefight.

Welcome to the Battle of the Pies!

For those of you who have been in a pie eating contest or had a pie in the face, you know this one will be a fight all the way down to the very last crumb. In one corner (well, actually it is the Gnip Octagon, so can you really have corners? Oh well) we have The Traditionalist, pumpkin pie, and in the opposite corner, The Newcomer, pecan pie. Without further ado, Ladies and Gentlemen, Let’s Get Ready to Rumble! Wait, wrong sport. Let’s Fight!

Six Social Media Sources, Two Words, One Winner . . . And the Winner Is . . .


Source               Pumpkin Pie    Pecan Pie    Winning Ratio (Pumpkin Pie to Pecan Pie)
Twitter              X                           4:1
Facebook             X                           5:1
Google+              X                           6:1
Newsgator            X                           3:1
WordPress            X                           5:1
WordPress Comments   X                           2:1
Overall              +6 Winner!     +0 :(


We looked at one week’s worth of data across six of the top social media sources and determined that pumpkin pie “takes the cake” (so to speak) across every source.

In this case, it is interesting to point out that in sources like Twitter, Facebook, Google+ and WordPress we see higher winning ratios, while sources that tend to have higher latency such as Newsgator and WordPress Comments were a little more even. Is this because, on further consideration, pecan pie sounds pretty good? Or is it that everyone will have to have two pies and, with pecan as the traditional second, it is highly discussed?

Top Pie Recipes

Even though pumpkin pie was our clear winner, we thought it would be fun to share a few of the most popular holiday pie recipes by social media source:

  1. Twitter – Cook du Jour Gluten-Free Pumpkin Pie and Pecan Pie Video Recipe from
  2. Facebook – Ben Starr’s Pumpkin Bourbon Pecan Pie Recipe
  3. Newsgator – BlogHer’s Pumpkin Pecan Roulade with Orange Mascarpone Cream Pie Recipe
  4. WordPress and WordPress Comments – Chocolate Bourbon Pecan Pie from

Non-Traditional Thanksgiving Pies

Another interesting fact that came out of this Cagefight was the counts of non-traditional Thanksgiving pies that were mentioned across the social media sources we surveyed. Though we rarely find these useful for communicating numerical values effectively, you can’t not have a pie chart in this post.

Happy Thanksgiving!