Gnip recently launched Historical PowerTrack for Twitter. It gives Gnip’s customers the same filtering capabilities they have on real-time Twitter streams, but opens up the entire historical corpus of public Tweets, from 2006 onward, for research, campaign and event comparisons, and backtesting of models.
As with any new toy, we immediately needed to take Historical PowerTrack out for a spin. I wanted to see what we could find out about a topic that has been on Twitter from the beginning (yes, we passed over the opportunities offered by sex) to see what Historical PowerTrack added to the mix of data. I was delighted.
We collected all of the Tweets containing “sxsw” from Jan 2007 to September 2012. This resulted in 5,366,570 Tweets. There were only a few days in the 5 ½ year period without at least one Tweet containing SXSW.
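A per-day count like the one behind that observation can be sketched in a few lines of Python. This is only an illustration: it assumes each collected Tweet has already been reduced to an ISO 8601 timestamp string, and the function names and sample values are hypothetical rather than taken from our actual job.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def tweets_per_day(timestamps):
    """Count Tweets per calendar day from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        posted = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        # The day is taken in the timestamp's own offset (UTC in the sample).
        counts[posted.date()] += 1
    return counts


def days_without_tweets(counts, start, end):
    """Return every date in [start, end] that has no Tweets."""
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in all_days if counts[d] == 0]


# Made-up sample timestamps, for illustration only.
sample = [
    "2012-03-09T23:59:59Z",
    "2012-03-10T00:00:01Z",
    "2012-03-10T12:30:00Z",
    "2012-03-12T08:00:00Z",
]
counts = tweets_per_day(sample)
gaps = days_without_tweets(counts, date(2012, 3, 9), date(2012, 3, 12))
```

The same counts, keyed by day, are what a Tweets-per-day plot would be drawn from.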
The first realization was that analyzing historical streams differs from analyzing real-time streams in a couple of significant ways. While the result in each case is the filtered, time-ordered subset of all Twitter activities for the specified time period, the steps for analyzing the data are a little different.
Learning On the Fly vs. Working In A Well-Defined Context
When we analyze real-time streams, learning and adjusting as we go along is key. We continually analyze the stream to update filter rules, changing stream-shaping parameters both to match the current content coming through Twitter and to meet the business and technical needs of collecting, storing and analyzing data.
Historical analysis is different. In fact, you may decide to have a second, offline system for analysis, as feeding older data into your real-time data processing pipeline may cause more problems than it solves. For example, adding historical data to your existing stream processing increases the volume it has to handle. Your pipeline may also rely on the fact that live Twitter data arrives roughly in time order (k-sorted, see Snowflake). Historical data is sorted within its own set, but it covers a wide range of past times, so mixing it into the live stream breaks that assumption.
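To make the k-sorted point concrete, here is a minimal Python sketch. It is not Gnip's or Twitter's implementation. It assumes a consumer that restores exact order with a small buffer, and it uses plain integers as hypothetical stand-ins for Tweet timestamps. The name reorder_k_sorted and the window size k are illustrative.

```python
import heapq


def reorder_k_sorted(items, k):
    """Yield items in ascending order, assuming each item is at most
    k positions away from where it would sit in fully sorted order."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)
        # Once k + 1 items are buffered, the smallest one is safe to emit.
        if len(heap) > k:
            yield heapq.heappop(heap)
    # Drain whatever is still buffered at the end of the stream.
    while heap:
        yield heapq.heappop(heap)


# A live-like stream: every item is within one position of its sorted place.
live = [2, 1, 3, 5, 4, 6]
live_out = list(reorder_k_sorted(live, k=1))

# The same idea with one much older (historical) item mixed in.
mixed = [10, 11, 1, 12, 13]
mixed_out = list(reorder_k_sorted(mixed, k=1))

print(live_out == sorted(live))    # the small buffer is enough
print(mixed_out == sorted(mixed))  # the old item arrives too late to be placed
```

With a small window the buffer fully orders the live-like input, but the older item in the mixed input comes out after newer ones. That is the kind of surprise a separate offline system for historical data avoids.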
The way you set up rules will also change. While we often collect data on a broad range of topics simultaneously on the live stream, historical Tweets are usually collected for a narrower, specific project, to answer specific questions. That means, for example, you may want to run a couple of test jobs to understand the data set better before collecting 5M Tweets over 5+ years for a project.
The second realization was that, even for simple data sets like our SXSW data, the long-term trends are fascinating!
Figure 1: Tweets per Day containing the term “sxsw” from January 2007 through August 2012 plotted on the same y-axis. Note the growth in discussions from 2007 to 2011 and the drop from 2011 to 2012.
SXSW is a week of music, art, and conversation held in Austin, Texas each spring. Twitter traffic during the event grew rapidly over the years to 2011, reaching a peak of 150K Tweets per day in 2011.
While this graph shows the trend clearly, it fails to show the significant volume of Tweets in the early years, because the scale is dominated by the large number of Tweets in later years. The peak Tweets per day grew from 550 in 2007 to 4,012, 39,312, and 88,145 in successive years and finally 150,244 Tweets per day at the highest point in 2011. Interestingly, peak Tweets per day dropped in 2012 to 129,437. More on that in a bit.
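The year-over-year change in these peaks is simple arithmetic, sketched below in Python. The peak values are the ones quoted above, with the three middle figures assigned to 2008, 2009 and 2010 as "successive years" implies. The function name is just for illustration.

```python
# Peak Tweets per day for each SXSW, as quoted in the text above.
PEAKS = {
    2007: 550,
    2008: 4012,
    2009: 39312,
    2010: 88145,
    2011: 150244,
    2012: 129437,
}


def year_over_year_change(peaks):
    """Return the percent change in peak volume relative to the prior year."""
    years = sorted(peaks)
    return {
        year: (peaks[year] - peaks[prev]) / peaks[prev] * 100
        for prev, year in zip(years, years[1:])
    }


changes = year_over_year_change(PEAKS)
for year, pct in changes.items():
    print(f"{year}: {pct:+.1f}%")
```

The 2012 entry works out to roughly -13.8%, which is the decline discussed later in this post.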
The figure below shows each year on its own scale. One feature that stands out every year is the “echo” of Tweets in late summer. What is going on? We discovered that the panel selection process serves as a reminder to get back to tweeting about SXSW. As part of that process, prospective attendees vote for the panels they want to see, so at the end of the summer, next year’s panels are proposed and promoted, through tweeting as well as other means. (The bump for 2012 is just starting at the end of the data set.)
Figure 2: Tweets per Day containing the term “sxsw” from January 2007 through August 2012 with each year plotted on its own y-axis. Note the peak in conversation during the event in March and the echo during panel selection in August.
SXSW Tweets Declined in 2012
Despite the 2012 SXSW being the largest SXSW yet, there was a 13.8% decline in the peak daily Tweet volume during the event. While we don’t know exactly why this happened, there are several possibilities:
- As people have started using social media more and more, they feel less of a need to broadcast every single activity on Twitter.
- Cell service and WiFi continue to be awful.
In the spring, we’ll look at peak Tweet volume for SXSW 2013 to see if 2012 was an anomaly or part of a larger trend.
In the next post, we’ll analyze Tweets about specific trends within SXSW conversations over the years.