The Gnip data science team (myself, Dr. Josh Montague, Brian Lehman) has been thinking about firehose sampling over the last few weeks. We see research and hear stories based on analysis of a randomly sampled subset of the Twitter, Tumblr or other social data firehose. This whitepaper looks at the common trade-off we see in sampling: we want a data stream that represents the entire audience of a social platform, while controlling costs and limiting activity volume to match analysis capacity.
Both Gnip’s customers and the greater social data ecosystem frequently use sampling to assess patterns in social data. We created a whitepaper that provides a step-by-step methodology to calculate social activity rates, confidence in our rate estimates and the ability to identify signals of emerging topics or stories. We wanted to know what we could detect and with what certainty.
The sampling whitepaper describes the trade-offs between the three key variables in sampling social data: rate of activities (e.g. the number of blog posts or Tweets over time), confidence levels around our estimates of rate, and meaningful changes in those rates, i.e., signal. These three variables are interrelated and present a measurement challenge: given the choices or constraints imposed by two of these parameters, you then calculate the third.
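To make the interrelation concrete, here is a minimal sketch of the basic calculation that links two of the variables to the third. It assumes Poisson counting statistics (independent arrivals), so the standard error of a count n is sqrt(n); the whitepaper's exact methodology may differ, and the function name and numbers are illustrative only.

```python
import math

def rate_estimate(count, duration_s, z=1.96):
    """Estimate an activity rate (events/second) and an approximate
    95% confidence interval from a single bucket of counts.

    Assumes Poisson counting statistics: the standard error of the
    observed count is sqrt(count), so the CI half-width on the rate
    is z * sqrt(count) / duration.
    """
    rate = count / duration_s
    half_width = z * math.sqrt(count) / duration_s
    return rate, rate - half_width, rate + half_width

# e.g. 100 activities observed in a 60-second bucket:
rate, lo, hi = rate_estimate(100, 60)
```

Fixing any two quantities (say, the bucket duration and the rate) determines the third (the confidence interval width), which is the trade-off the whitepaper works through.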
While the whitepaper deals with the tradeoffs of activity, signal and confidence when designing a measurement, that is a little abstract. To make this more concrete, think of the trade-off problem as a way of addressing questions like those below. If you’ve asked any of these questions in your own work with social data, we think this whitepaper might help.
The activity rate has doubled from five counts to ten counts between two of my measurements. Is this a significant change, or is it expected variation, e.g. counting noise at low rates?
I want to minimize the total number of activities that I consume (for reasons of cost, storage, etc.). How can I do this while still detecting a factor-of-two change in activity rate within one hour?
How long should I count activities to detect a change in rate of 5%?
How do I describe the trade-off between signal latency and rate uncertainty?
How do I define confidence levels on activity rate estimates for a time series with only twenty events per day?
I plan to bucket the data in order to estimate activity rate, how big (i.e. what duration) should the buckets be?
How many activities should I target to collect in each bucket in order to have 95% confidence that my activity rate estimate is accurate for each bucket?
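The first question above can be sketched numerically. Under the same Poisson assumption (variances of independent counts add), a doubling from five to ten counts is compared against the combined uncertainty of the two measurements; the helper below is illustrative, not the whitepaper's exact procedure.

```python
import math

def change_significance(count_a, count_b):
    """z-score for the difference between two Poisson counts
    observed over equal-length buckets. The variance of the
    difference is the sum of the two counts."""
    diff = count_b - count_a
    sigma = math.sqrt(count_a + count_b)
    return diff / sigma

# Doubling from 5 to 10 counts: z = 5 / sqrt(15) ~= 1.29,
# below the ~1.96 threshold for 95% confidence -- not yet a signal.
# The same doubling from 50 to 100 counts gives z ~= 4.1 -- clearly a signal.
```

The same relative change becomes detectable only once the absolute counts are large enough, which is why bucket size (and hence latency) matters.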
Our summer data science intern, Jinsub Hong, and data scientist, Brian Lehman, created an animation to help visualize the relationship between confidence interval size, time of observation (or, alternatively, the number of activities observed), and the signal we can detect in a firehose of social data.
The animation below shows confidence intervals for different bin sizes. As the bin size increases, we count more events, so the rate estimate becomes increasingly certain. However, we have to wait longer to get the result (latency).
At what bin size can we be confident that the activity rate has changed significantly? For short buckets of only a minute or two, the variation in the measured rate is large, comparable to the potential signal. For longer buckets, the signal becomes more distinct, but the time we have to wait in order to make this conclusion goes up accordingly.
The first and last frame show representative potential signals. In the first frame, this potential signal is about the same size as the variability of the activity rate, so we can’t conclusively say the activity rate has changed. With the larger bin size in the final frame, the signal is much larger than the activity rate uncertainty. We can be confident this represents a real change in the activity rate.
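The shrinking uncertainty the animation illustrates follows directly from the counting statistics: the relative width of the confidence interval falls off as one over the square root of the expected count in a bucket. A small sketch, again assuming Poisson counts and a hypothetical 20-activities-per-minute stream:

```python
import math

def relative_uncertainty(rate_per_min, bin_minutes, z=1.96):
    """Half-width of the ~95% confidence interval, expressed as a
    fraction of the rate itself: z / sqrt(expected count in the bin)."""
    expected_count = rate_per_min * bin_minutes
    return z / math.sqrt(expected_count)

# For a hypothetical 20-activities-per-minute stream:
#   1-minute bins:  ~44% relative uncertainty -- a 2x change is hard to call
#   25-minute bins:  ~9% relative uncertainty -- a 2x change stands out
```

This is the latency trade-off in miniature: the 25-minute bin gives a much cleaner answer, but you wait 25 minutes to get it.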
For full details, you can download the paper at https://github.com/DrSkippy27/Gnip-Realtime-Social-Data-Sampling! If you have questions about the whitepaper, please leave a comment below.