We were recently asked to demonstrate some of the different scenarios where Historical PowerTrack for Twitter can be applied. Using a major local event, the Boulder Flood of 2013, this post will describe how the use of different filters can shape both the quality and quantity of resulting Tweets. It will also serve as a guide to constructing your own filters so that they produce the data you expect. If you’d like more detail on filtering capabilities than what is explained below, please see the Historical PowerTrack documentation.
Filtering by Dates
An important consideration to keep in mind when setting date ranges for a historical data request is that they are inclusive on the front end and exclusive on the tail end. For example, let’s say that we want to pull Twitter data on the Boulder Flood from 9/9/13 to 9/15/13. We would enter a start date at the moment we want our data retrieval to begin, and an end date at a point just beyond where we want the coverage to end. In this case our date range ends 9/16/13 at 00:00 UTC to ensure that we will receive all of the data through the last second of the day on September 15th.
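As a sketch, assuming the Historical PowerTrack job request format in which `fromDate` and `toDate` are UTC timestamps of the form `YYYYMMDDHHMM` (start inclusive, end exclusive), the window above would be expressed as:

```json
{
  "fromDate": "201309090000",
  "toDate": "201309160000"
}
```

Because the `toDate` boundary itself is excluded, the last Tweet this job can capture is one posted at 23:59:59 UTC on September 15th.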
Now that we’ve defined the window of time over which we’ll be applying our filters, let’s look at a few different ways that we could design the Historical PowerTrack rules to focus on only the data we’re seeking. In this example, we want to study flood-related media posted to Twitter by Boulder residents in order to assist with efforts to assess damages caused by the event.
From a quick scan on search.twitter.com, we know that the hashtag #BoulderFlood was commonly used in Tweets about the floods. We used Gnip’s documentation page to find the corresponding operator for this hashtag-based query and entered it into our ruleset.
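In PowerTrack rule syntax, a hashtag match is simply the hashtag itself, so our initial rule is just:

```
#BoulderFlood
```

Note that hashtag matching is not case-sensitive, so this also matches #boulderflood and #Boulderflood.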
As mentioned previously, we are aiming to extract Tweets only from Boulder residents for this study, so we add a couple of filters to exclude Tweets matching keywords about other neighboring towns that were also affected by the floods. We used the negation operator from Gnip’s documentation page to augment our existing rule.
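As a sketch of what that looks like, using the `-` negation operator and two neighboring towns affected by the floods (Longmont and Lyons) as illustrative exclusion keywords, the rule becomes:

```
#BoulderFlood -longmont -lyons
```

Substitute or extend the negated keywords to match whichever nearby locations your own study needs to exclude.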
With our current set of rules, we could potentially receive Tweets that contain text but no images or videos, but for purposes of our damage assessment, it is important that we only extract content with those media. Therefore, we added a few additional clauses to our existing ruleset to indicate that we want Tweets with #BoulderFlood that also contain native media, or else contain URLs associated with media-hosting properties of interest (e.g. Flickr or Instagram).
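Assuming the `has:media` and `url_contains:` operators described in Gnip’s documentation (this rule is illustrative, so check the operator reference for your stream version), the media requirement can be expressed as an OR group:

```
#BoulderFlood (has:media OR url_contains:instagram OR url_contains:flic.kr)
```

The parentheses matter here: without them, the OR clauses would not all be scoped under the #BoulderFlood requirement.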
Lastly, we want to restrict our results to Tweets with geographic origin data to help validate that the media are in fact from Boulder, so we created a bounding box using lat/long coordinates around the Boulder area. This geographic boundary will omit any data that falls outside our area of interest, and will also exclude any Tweets for which coordinates were not provided.
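Assuming the `bounding_box:` operator, which takes a box as [west_longitude south_latitude east_longitude north_latitude], a rough box around Boulder (these coordinates are approximate and purely illustrative) would look like:

```
bounding_box:[-105.30 39.96 -105.18 40.09]
```

Keep in mind that only Tweets that actually carry coordinates can match a geo operator, which is exactly the validation behavior we want here.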
Upon review, we are now confident that we have a ruleset that reflects each of the study’s requirements and should yield a relatively high-quality dataset. We submit the dates and filters for the request, and our estimate reveals that there are approximately 3,000 Tweets that match our criteria.
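Putting all of the pieces together, the complete rule, using the illustrative operators, exclusion keywords, and coordinates from the steps above, might read:

```
#BoulderFlood (has:media OR url_contains:instagram OR url_contains:flic.kr) -longmont -lyons bounding_box:[-105.30 39.96 -105.18 40.09]
```

Each clause maps directly to one study requirement: the hashtag for topical relevance, the OR group for media, the negations for neighboring towns, and the bounding box for geographic origin.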
We created Historical PowerTrack for Twitter to leverage rule-based queries that return exactly the data you need. When submitting a request of your own, it is always best to break down each requirement of the project and ensure that your ruleset accounts for each of those criteria. We hope that you enjoyed this quick walkthrough, and look for additional Gnip product tutorials in the near future.
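For reference, a full Historical PowerTrack job request bundles the dates and rules into a single JSON body POSTed to your account’s jobs endpoint. The sketch below follows the general shape of the documented request format, but treat the field values as illustrative rather than a verbatim API contract:

```json
{
  "publisher": "twitter",
  "streamType": "track",
  "dataFormat": "activity-streams",
  "fromDate": "201309090000",
  "toDate": "201309160000",
  "title": "Boulder Flood damage assessment",
  "rules": [
    {
      "value": "#BoulderFlood (has:media OR url_contains:instagram OR url_contains:flic.kr) -longmont -lyons bounding_box:[-105.30 39.96 -105.18 40.09]",
      "tag": "boulder_flood_media"
    }
  ]
}
```

The optional `tag` field is echoed back in matching results, which makes it easy to attribute Tweets to specific rules in jobs that use more than one.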