Data Stories: Sherry Emery on Social Data and Smoking Cessation

Sherry Emery is a Senior Research Scientist at the UIC Institute for Health Research and Policy, where she focuses on understanding how both traditional and new media influence health behavior. Sherry’s recent research has focused on social data and smoking cessation, looking at how people talk about smoking, their behaviors, and their reactions to smoking cessation campaigns on social media. Sherry works with Gnip’s client DiscoverText to access the Twitter firehose.

Sherry Emery of UIC

1. You’ve been studying the media and smoking for the past 15 years. What caused you to be interested in social data?


For a long time my research focused on TV advertising, but a few years ago I began to worry that our work was going to be less and less relevant unless we started to understand new media, including social.

2. How has your research with social data compared to previous research on other media, especially TV?
Researching social media and using social data is much harder: there’s more of it, and it’s way more complex. In the past, we were just worried about exposure to ads, and the measures were developed and widely accepted decades ago. Now we’re still worried about exposure, but also about searching for information and sharing information on social media, and with social data, it’s still the wild west for measurement. How do you measure exposure, search and exchange across social platforms, and how are these behaviors related to health behavior? In addition, with TV advertising, there was only an anti-tobacco message to measure. With social media, we need to figure out who’s talking about smoking cigarettes, and how to distinguish them from people talking about smoking ribs or smoking hot girls. And then we need to figure out if the information they are promoting/sharing is pro- or anti-smoking.

3. What are insights from your research on smoking cessation and social data?

We’ve learned so much! First, lots of people who are talking about smoking are not talking about cigarettes! One of our biggest challenges has been to refine our keywords and develop techniques to code Tweets and other content as tobacco-relevant. Early on in our process, Gnip’s own Charles Ince had the brilliant insight to introduce us to Stu Shulman, who developed DiscoverText, which is an invaluable tool that we rely upon for our data cleaning and analysis process. DiscoverText allows us to sort through and code the millions of tweets that contain some reference to ‘smoking’. Using DiscoverText gives us both the transparency and control that we need to make sure that the tweets we analyze are the tweets that are relevant to our research questions. We can use humans to code for tobacco relevance, and then a Boolean language recognition algorithm in DiscoverText can learn from the human coders and code literally tens of thousands of tweets, actually more accurately than humans could at that scale!

As part of this process, we’ve also learned that there are lots and lots of words people use to talk about cigarettes and smoking tobacco. That is an obvious statement, but one that has really important implications for searching for and measuring the content we’re interested in. No matter how thorough, broad and prospective we try to be, we cannot anticipate all the terms and keywords that turn out to be relevant. The ability to go back and look for content once we’ve identified key ideas will be critically important to our work.

Now that we’re getting a handle on how to deal with this massive and very complex data, we’re also learning a ton about how people are talking and thinking about smoking. In simplest terms, smoking weed is discussed much more favorably than smoking cigarettes. In the world of media campaign evaluation, we learned that the recent CDC anti-smoking media campaign really struck a chord with people: the effect of the graphic images was broad and deep. This was an important observation because the graphic approach of these ads was very controversial. By looking at the social media reaction, we could see that they achieved substantial engagement, rather than rejection of their message, which was a concern.
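To make the "humans code a sample, then an algorithm learns from them" idea concrete, here is a minimal sketch in Python. It is not DiscoverText’s actual algorithm, which the interview does not describe in detail. It substitutes a small naive Bayes word-count classifier as a stand-in, and the example tweets, the labels ("tobacco" and "other"), and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_tweets):
    """Count words per label from (text, label) pairs coded by humans."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of hand-coded tweets
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a label.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded sample (assumption: invented for illustration).
HUMAN_CODED = [
    ("trying to quit smoking cigarettes this week", "tobacco"),
    ("need a cigarette break, smoking too much lately", "tobacco"),
    ("that anti smoking ad made me want to quit", "tobacco"),
    ("smoking ribs on the grill all afternoon", "other"),
    ("she looked smoking hot at the party", "other"),
    ("smoking brisket for the barbecue tonight", "other"),
]

model = train(HUMAN_CODED)
uncoded = ["quit smoking cigarettes today", "smoking ribs for dinner"]
labels = [classify(tweet, model) for tweet in uncoded]
```

Every tweet in this toy sample contains the keyword "smoking", so the keyword alone cannot separate the two groups. The surrounding words ("quit", "cigarettes", "ribs") do the work, which mirrors the relevance-coding problem described above. A real project would need far more hand-coded examples and a check of the classifier against held-out human codes.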

4. People are less likely to be honest about bad habits on surveys. What are some of the advantages and disadvantages of using social data to capture life habits?
Social data reflects such spontaneous and generally unfiltered responses. It’s great to see and analyze what people are saying and claiming as their own. I think that surveys still have their place: there is a lot of individual-level information that is important, and which social data doesn’t reveal well. But it’s now critically important to understand what and how much people are saying, searching for, and passing along on social platforms. These data can give context to traditional survey data and can also guide the development of better, more relevant surveys.

5. Several years ago danah boyd talked about the class divisions between MySpace and Facebook, and how Facebook was for the “good” kids and MySpace was for the burnouts. How do you see the audiences matching up on segments you’re trying to study and the social data sources you’re using?
That’s an interesting question. So far, we’re pretty focused on Twitter data. We’re just beginning to explore Disqus and other social platforms, so I can’t really compare across platforms. We do see that there is particular language used on Twitter that seems to characterize different populations, such as the slang words for cigarettes. By understanding the slang, we can see regional differences, as well as cultural differences, in attitudes about tobacco.

6. How is the health world starting to use social data and what are some of the struggles they’re seeing?
The health world seems to be just starting to use social data. There’s some super cool work on developing social networks/data to monitor health conditions. I haven’t seen many other projects that are trying to wrangle massive social data similar to what we are doing. It’s hard to get your head around the variety and complexity of the data that is now available. We have been obsessed with data management and measure development. I think that’s the missing link for the public health world, and it’s one of our biggest challenges — translating how these data can answer questions the public health community is interested in.
