A Collection of 25 Data Stories

25 Data Stories from Gnip

At Gnip, we believe that social data has unlimited value and near limitless application. This Data Stories collection is a compilation of applications that we have found in our practice, highlighting both unique uses of social data and interesting discoveries along the way. Our initial Data Story describes an interview with the world’s first music data journalist, Liv Buli, and how she applied social data in her work. Her answers honestly blew us away, and to this day we continue to be surprised by the different ways people are applying social data in each new interview.

The 25 real-world examples detailed in this Data Stories compilation cover an incredible range of topics. Some of my favorite use cases are the weightier ones, such as the use of social data in epidemiology studies and in the aftermath of natural disasters. Others deal with the seemingly more ordinary, such as common recipe substitutions and traffic patterns within cities. One thing is consistent across them all, however: the understanding that social data is key to unlocking previously unknown insights and solutions.

One of the more exciting aspects of these new discoveries is the fact that we are just now learning where social data will take our research next. Today, social data is used in fields as disparate as journalism, academic research, financial markets, business intelligence and consumer taste sampling. Tomorrow? Only the future will tell, but we’re excited to be along for the ride.

We hope you enjoy this collection of Data Stories, and please continue to share with us the stories you find. We think the whole ecosystem benefits when we let the world know what social data is capable of producing.

Download the 25 Data Stories Here. 

Data Stories: Interview with Lada Adamic of University of Michigan

While looking at the speakers for the International Conference on Weblogs and Social Media, the premier academic conference for social media, I stumbled across the research of Lada Adamic. Not only was Lada one of the keynote speakers for the conference, but her research at the University of Michigan was just plain awesome. Lada’s research included understanding commonly used ingredient substitutions from the 40,000 recipes on Allrecipes.com, understanding how peers rate each other on Couchsurfing, Facebook memes, and more. You can check out all of her research on ladamic.com, follow her on Twitter at @ladamic and be sure to check out her hilarious blog.

Lada Adamic of the University of Michigan

1. Your background focuses on networks and how information spreads. You’ve done multiple projects with different data sources. What are some of the overarching trends you’ve seen?

The only sure thing is the unpredictability of information in a network. Sure, in aggregate some information will go viral, while most will not, but predicting what will go where, that’s not so simple. To complicate matters further, information is not only diffusing, but also evolving, while concurrently spurring changes in the social network itself. One trend I do keep seeing is that social networks’ greatest boosting effect is in the niche. There are lots of ways to find out about something widely popular, but information about that curious interest that you and your friends share — that is more likely to come through your friends.

2. What information do you get from looking at networks vs all the other sources you use?

I think it’s more a question of whether there are any data that I don’t try to represent as networks! All I have to do is identify connections between entities in the data, and presto, I have a network. It’s the structure of these connections that can turn up fascinating results: identifying experts from their online interactions, predicting which recipe is going to be rated more highly, or understanding the structure of federal law from the way it’s strung together.
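As a rough illustration of the "identify connections, and presto, a network" idea, here is a minimal Python sketch. It is not taken from Lada's research; the function names `build_network` and `degree_ranking` and the sample names in the usage note are hypothetical, and it assumes the connections have already been extracted as pairs of entities.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected network (adjacency map) from (entity, entity) pairs."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        # Each pair is one connection; record it in both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


def degree_ranking(network):
    """List entities from most to fewest connections, ties broken by name."""
    return sorted(network, key=lambda node: (-len(network[node]), node))
```

For example, `build_network([("ann", "bob"), ("ann", "cara")])` links "ann" to both other people, and `degree_ranking` on the result puts "ann" first. Counting connections is only the simplest possible structural measure; the findings described in the answer above rely on much richer analysis.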

3. What is useful, difficult and unique about connections found in social data?

Well, you’re dealing with data by and about humans. Humans are difficult. Humans interacting with other humans, that’s complicated… but also highly informative, because a lot of human interaction is about informing one another. And as they inform one another about what’s worthwhile, their location, their mood, etc., that data can be harnessed to detect trends and patterns in human behavior. And perhaps precisely because this data is so rich and powerful, it is important to be mindful of privacy.

4. You were able to determine commonly used ingredient substitutions by looking at 40,000 recipes from Allrecipes.com. How much did the comments in the recipes help determine substitutions and what other insights do you think could be pulled from recipe comments?

In the research paper we relied entirely on the comments in the user-supplied recipe reviews to figure out how often cooks substituted one ingredient for another in a recipe, whether ingredients can be cut or omitted, and, crucially, whether the recipe needs more or less garlic (our data showed, usually, more). Untapped kinds of information included in the reviews include who the recipe was a hit with (the kids, the husband etc.) and vetted improvements, e.g. “I put the dough in the fridge for 2 hours as the other reviewer suggested…”. I think this is a really fun example of harnessing our collective intelligence. Instead of each cook tweaking recipes in their own kitchen and sharing their recommendations with a few friends, now we can gather millions of tweaks and start to understand food and cooking systematically.
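To make the idea of mining review comments for substitutions concrete, here is a toy Python sketch. It is an assumption-laden illustration, not the method used in the research paper: the two phrase patterns, the `count_substitutions` name, and the sample reviews in the usage note are all hypothetical, and real reviews would need far more robust text processing.

```python
import re
from collections import Counter

# Hypothetical phrase patterns; real review text is much messier than this.
_END = r"(?=[.,;!]|$| and\b)"
PATTERNS = [
    re.compile(r"\bused (?P<new>[a-z ]+?) instead of (?P<old>[a-z ]+?)" + _END),
    re.compile(r"\bsubstituted (?P<new>[a-z ]+?) for (?P<old>[a-z ]+?)" + _END),
]


def count_substitutions(reviews):
    """Count (original ingredient, replacement) pairs mentioned in reviews."""
    counts = Counter()
    for review in reviews:
        text = review.lower()
        for pattern in PATTERNS:
            for match in pattern.finditer(text):
                pair = (match.group("old").strip(), match.group("new").strip())
                counts[pair] += 1
    return counts
```

Calling `count_substitutions(["I used applesauce instead of oil. Great!", "Substituted honey for sugar and it was great"])` tallies one mention each for the pairs `("oil", "applesauce")` and `("sugar", "honey")`. Aggregated over many reviews, tallies like these are what would let frequent substitutions stand out.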

5. You’ve used data from a wide variety of sources including Couchsurfing.com, Allrecipes.com, Facebook, etc. What do you look for in a data source?

I’m not too discriminating about data, though sometimes I have a question that only certain data can answer. For example, when my husband and I first started dating, I defended my reluctance to watch Sci-Fi movies by pulling their ratings distribution from the IMDb. On an only slightly more serious note, I turned to online recipes because they comprised lots of data about something that I had no clue about: cooking.

Other times you just know the data is good even if your questions about it are not (yet). Such was the case with the CouchSurfing dataset, which encompassed anonymized user-to-user trust and friendship ratings. The data was so rich, that even our initial stumbling steps led to some interesting results about rating human relationships. But it wasn’t until the 2nd and 3rd paper that we really got a handle on how the visibility of the ratings skews them, and some more fundamental insights about the relationship between friendship and trust that are rendered beautifully evident in such a large data set.

6. What study have you done that has surprised you the most? What projects do you see in the future that you think academia should focus on to better understand social data?

Some nice surprises actually came up as I was gathering data for my statistics class. When the Economist published an article about the U-curve of happiness vs. age, I thought, wait a sec, we see the same curve in CouchSurfing ratings: people in their 30s & 40s rate and are rated less enthusiastically than those either younger or older. Then my statistics class used the American Time Use Survey to see how much sleep people were getting, and it was the same curve. Coincidence? I think not!

Another happiness vs. age trend came up in the Adolescent Health data, also analyzed in my stats class. Teens having sex in 8th and 9th grade were less happy on average than their peers who were abstaining, but by senior year, the relationship was reversed. It goes to show that you never know which underused columns in existing data sets hold fun statistics (we also explored the “cheerleading”, “math team” and “wears braces” columns…).

To answer the second question: researchers have only started to take advantage of the abundance of social data. There are many long-standing questions in sociology that were previously studied in small groups. Now these questions can be tested on very large data, just at the time when we really do need to understand how they pertain to changing social interactions as they shift online. Among the questions I’m personally interested in are how online social networks shape media consumption, and how information evolves in social networks.

I should mention that the crucial bottleneck for academics doing this kind of research is access to the data. GNIP is certainly part of the solution (you guys have academic discounts, right?). To anyone else who has interesting data, please consider sharing it with data-starved academics.

Thanks to Lada for her interview (and yes, we’re looking at partner programs for academic researchers!). If you have any other suggestions for Data Stories, please leave a comment. 


Data Stories: Interview with Hilary Mason of bitly

Data Stories is Gnip’s opportunity to tell the cool stories about the data scientists, data journalists and other people who are working in data. This week we’re interviewing Hilary Mason, the chief data scientist of bitly. She is currently helping organize DataGotham, a celebration of New York’s data community happening Sept. 13-14. You can follow her on Twitter at @hmason and read her blog at HilaryMason.com.

Hilary Mason of bitly

1) How did you get started in your role as a data scientist?

I’m a computer scientist and have always had a keen interest in both algorithms and databases. It became clear to me in the last decade that the most interesting algorithms were those that worked on real data. When I found that there were opportunities to design math and infrastructure to build new types of applications, I couldn’t resist!

2) bitly users share 80 million links a day. What are some of the coolest insights and trends you’ve been able to see from these shared links?

We see all kinds of fascinating things in the data. For example, people who read about physics also read about fashion (http://bit.ly/vSa6AO) and people who use Kindles use them very differently than any other kind of device (http://bit.ly/wbRe6o). We’re always posting these things on our blog. For example, on July 4th we posted the most popular recipe by state for the holiday. Did you know that people in Florida enjoy Alligator Ribs (http://bit.ly/NwUEUL)?

3) bitly just updated its site making it even easier to share and curate links. As the chief data scientist, what excites you most about the new capabilities?

It’s wonderful to see bitly evolve from a utility into a truly social platform. We’re excited for bitly to become the central place for you to store, share, and analyze the things that you care about on the internet. We can then use the aggregate data that we collect to enhance that experience for you.

4) What are some of your favorite projects you’ve worked on while at bitly?

Our goal at bitly is to understand the internet’s attention, and to build systems that make that useful. It’s too hard just to pick one bit of it! I’m proud of some of the work that’s made it out into the world, like our post about the half life of links on various social networks (http://bit.ly/puUbzs) and our collaboration with Forbes on the interactive map of media influence (http://onforb.es/GFzphG). I’m also incredibly excited about a few product-oriented experiments that are going to be public shortly … stay tuned.

5) What tools are in your arsenal as a data scientist?

I’m a firm believer in finding the smartest people you can, and letting them use whatever works best. Personally, I’m a huge fan of the old skool unix utilities, and do more with grep and awk than I should probably admit.

Python is my current programming language of choice, though I’m not averse to C when necessary. A few people on my team have started to fall in love with Go, so that’s on my list to check out.

We use the best datastore for each challenge, and make heavy use of memcached, Redis, HDFS, and even text files.

In the non-tech world, I keep a moleskine notebook around and have fallen in love with the Hi-Tec-C .4mm pens from JetPens.

6) As the chief scientist, where do you think your team adds additional business value? How does data science help bitly make decisions it wouldn’t make otherwise?

My team plays a few roles within the company. We handle the business analytics, which can be answering very simple questions like, “How many new URLs did we see yesterday?” to complex questions like, “How do we value a URL being clicked from platform X vs platform Y over time?”.

We do research, pushing the boundaries of what we know to be possible with our data and systems. A few examples of these types of questions are, “Can we build a model of attention to any phrase people are actively clicking on?”, or “Can we predict opening weekend box office takes for movies that people are reading about via bitly links?”

Finally, we build products. Generally these are APIs, like the API that accepts a URL and returns the geographic distribution of attention to the URL, but sometimes they’re human-facing products. More on that shortly.

In summary, my team is responsible for pushing the boundaries of where bitly can go. It’s fun.

Thanks to Hilary for taking the time to talk to us about her work with bitly! Let us know in the comments if you have a suggestion for another Data Story.
