Data Story: Interview with Christyn Perras of YouTube

I have long wanted to interview someone from YouTube as I think their social data is fascinating and incredibly vast. Every minute, 100 hours of video are uploaded to YouTube. Christyn Perras, a quantitative analyst at YouTube, is talking with Gnip about the career path to being a data scientist, the tools in her arsenal, YouTube’s data-driven culture and Coursera. 

Christyn Perras

1. What was your path to becoming a quantitative analyst at YouTube? What would you recommend for others?

As an undergraduate, I studied psychology and was particularly drawn to the experimental side of the discipline. When I was considering an advanced degree, I concentrated on the aspects of psychology that I loved during my search for a graduate program. I eventually found a program that focused on applied statistics and experimental design at the University of Pennsylvania, where I received an MS and PhD. However, even after graduation, my career path remained unclear and the tech industry wasn’t even on the radar. It was when I started looking for jobs using search terms referring to my skill set rather than job titles that I saw a world of opportunity unfold in front of me.

My first job on the west coast was at Slide, a social gaming company. It was an amazing experience. At Slide, I used my psychology background to understand our users and the way they interacted with our products. In addition, my background in statistics and experimental design gave me the skills to study, test, quantify and interpret user behavior and to measure the impact of our influence. We sought answers to questions such as: Why were these people using our products? What made them come back? And what could we do to change their behavior and/or enhance their experience? I am now doing this at YouTube and concentrate my efforts on understanding our creators and continuing to improve their YouTube experience via foundational research and experiment analysis.

2. I’ve noticed that Google doesn’t tend to use the title of data scientist. Is there a reason for this?

Not that I’m aware of. Data scientist, quantitative analyst, statistician and decision support analyst are all fairly interchangeable terms in the tech industry. As I mentioned before, my job search was most successful when I used keywords related to my skills and interests (statistics, psychology, experiments) rather than searching job titles (statistician). However, I imagine with the rising popularity and awareness of the field, naming conventions for job titles will likely become more standardized.

3. What is one of the most surprising aspects you’ve learned about YouTube data?

Honestly, I was surprised by the sheer amount of data! It is staggering. I had to learn a number of new programming languages and techniques just to be able to get the data I needed for an analysis into a manageable format. During my time at Penn, SAS, SPSS and SQL were the preferred tools and were incorporated into the curriculum. Without a more extensive computer science background, areas such as MapReduce and Python were quite new to me. I’m also continually expanding my knowledge and experience with techniques used to manipulate, reshape and connect data on this scale. When working with billions of data points, you often need to think creatively.

4. How do quantitative analysts work with product managers to shape YouTube?

There is a strong data-driven culture at YouTube and, as a result, product managers and analysts work very closely. In the case of a product change or redesign, analysts are involved in the process from the start. Early involvement ensures, for example, that the data necessary for analysis are collected, experimental arms are set up correctly, and logging is accessible and bug-free. We discuss the goals and expectations of product changes in depth to make sure analyses are designed to answer the right question and will produce valid, actionable results. Analysts and product managers typically have a steady dialogue throughout the course of analysis. Once the analysis is complete, we discuss the results, interpret the meaning, consider the implications, and make decisions about the next steps.

5. How do you think companies such as Coursera are changing fields such as data science?

I love Coursera! My favorite courses include Data Analysis with Jeff Leek and Computing for Data Analysis with Roger Peng. Coursera is doing something truly great and I look forward to seeing how they grow and progress. Data science is a bit nebulous in terms of education (at least it was when I was in school). There wasn’t a “data science” major or anything like that, so it was necessary to piece it together yourself. I have an amazing team with wildly different backgrounds from physics to psychology to economics. I love bouncing ideas off my colleagues and am guaranteed wonderfully unique and clever perspectives. Companies like Coursera make dynamic teams like this possible by giving people from a wide variety of disciplines access to the additional education they need to shape their career path and be successful in their job.

Another amazing resource for future data scientists is OpenIntro, co-founded by my colleague David Diez. With OpenIntro, you’ll find a top-notch, open-source statistics textbook and a wealth of supporting material.

Thanks to Christyn for the interview! If you’re interested in reading more interviews, check out Gnip’s collection of 25 Data Stories for interviews with data scientists from Foursquare, Pinterest, Kaggle and more. 

Data Story: Michele Trevisiol on Image Analysis

Social media content is frequently shifting to a visual medium and companies are often having a harder time understanding and digesting visual content. So I loved happening upon Michele Trevisiol, PhD student at Universitat Pompeu Fabra and PhD intern at Yahoo Labs Barcelona, whose research focused on image analysis for videos and photos. 

Michele Trevisiol

1. Your research has always revolved around analyzing videos and photos, and this is an area that the rest of the world is struggling to figure out. Where do you see the future of research around this heading?

From my own experience, I can see a huge amount of work for research in the near future. Every day there are tons of new photos and videos uploaded online. Just think about Facebook, Flickr, YouTube, Vimeo and many other new services. Actually, in one of our projects we are working with Vine, a short-video app that allows you to record 6 seconds in a loop. Twitter bought it in January, and Vine has already reached 40 million users.

To give some numbers, recent studies estimated that about 880 billion photos will be uploaded in 2014, without considering other multimedia data. This volume of information makes it very hard for the user to explore such a large multimedia space and pick the most interesting items. Therefore, we need smart and efficient algorithms to fix this. The amount of data is growing every day, so researchers need to keep improving their systems to analyze, understand and rank the media items in the best way possible (often in a personalized way for the user).

Researchers have studied these topics from many different angles. Working with multimedia objects involves analyzing the content of the data (e.g., computer vision tasks like object detection and pattern recognition), understanding the textual information (e.g., meta-data, description, tags, comments), or studying how media is shared in the social network. This is a research space that still has many things left to explore.

2. You’ve previously researched how to determine geo data from video using a multi-modal approach based on how videos are tagged, user networks, and other meta-data. What are the advantages of understanding the geo location of videos?

You can see the problem from a different point of view. If you had complete knowledge about your multimedia items, like the visual content, the meta-data, the location, or even how they were made (technical details), this data would be easy for users to classify and discover. All the information about an item helps researchers to understand its properties and its importance for the users who are looking for it. However, very often this information is missing, incomplete or even wrong. In particular, the geo location is not provided for the vast majority of photos and videos online.

Only in recent years has there been an increase in cameras and phones with automatic geo-tagging (able to record the latitude and longitude where the photo was taken). As a result, only a few multimedia items have this information. Being able to discover the geo location of videos and images helps you to organize and classify them, and helps users to find items related to any specific location, improving their search and retrieval. We presented a multi-modal approach that takes into account various factors related to the given video, like the content, the textual information, the user’s profile, the user’s social network and her previously uploaded photos. With such information the accuracy of the geo location prediction improved dramatically.

Recently, researchers have been devoting more effort to this topic, mainly due to the increasing interest in the mobile world. In the near future, the activity in this area is destined to increase.

3. How are the browsing patterns different for people viewing photos than text? What motivates people to click on different photos?

The browsing patterns are strongly biased by the structure of the website and, of course, by the type of website. The first case is quite obvious as the users browse the links that are more evident on the page, therefore the way the website selects and shows the related items is really important. The latter case instead is related to the type of website.

Consider, for example, Many users land on the website from some search engine and read just one article. This means that the goal of the user is very focused, as she’s able to define the query, click on the right link, consume the information, and leave. But that’s not always the case, as there are also users who browse deep and look at many articles related to one topic (e.g., about TV series, episodes, actors, etc.). If you consider news websites instead, the behavior is different, as the user could enter only to spend some time, to take a break, or to get distracted with something interesting. A photo sharing website presents yet another behavior, often characterized by the social network structure. Many users interact mostly with the photos shared by friends or contacts, or they like to get involved in social groups, trying to get as much visibility and positive feedback as possible.

The main interest of any service provider is to keep the user engaged as long as possible on its pages. To do this, it shows the links with the highest interest for the users to keep them clicking and browsing. That’s what the user wants as well: she wants to find something interesting for her needs. The rationale is similar for photo sharing sites, but the power of the image to catch the interest at first glance is an important difference. For example, in Flickr there are “photostreams” (sets of photos) shown to the user for each image she is watching. Slideshows show image thumbnails in order to catch the interest of the user with the content of the recommended images. Recently, we conducted a study on these specific slideshows and found that users love to navigate from one slideshow to another instead of searching directly for images or browsing specific groups. We also tried recommending different slideshows instead of different photos, with positive and interesting results.

4. Much of your research has focused around Flickr. How does data science improve the user experience of Flickr?

Recently Flickr has improved a lot in this direction, for example with the new free accounts with 1TB of free storage, or the interface that has been recently refreshed. But the changes are just at the beginning.

In general, the data scientists need first to study and understand how the users are engaging with the websites, how they are searching and consuming the information, how they are socializing, and especially what they would like to have and to find on the website. In order to improve the user experience you need to know the users and then to work on what (and how) the website can offer to improve their navigation.

5. Based on your research around photo recommendations, what characteristics make photos most appealing to viewers?

This is a complex question, as appeal is subjective and changes for each user. In particular, the taste or, even better, the interest of the user changes over time. Some days you’re looking for something funnier, so maybe the aesthetics of the image are less important. Other days, instead, you get captured by very cool images that can be professional or just incredible amateur shots.

In the majority of cases each user is quite unique in terms of taste, so you need to know what she appreciated before and how her taste changed over time in order to show her the photos that she is more likely to enjoy. On the other hand, there are cases that can catch the interest of any user in an objective way. For example, in photos related to real-world events the content is highly informative, while the quality and the aesthetics are often ignored.

In a research work that we presented last year, we compared different ranking approaches as a function of various factors. One of these was the so-called external impact. With this feature we could measure how much interest the image has outside Flickr, in other words, how much visibility the image has on the Web. If an image uploaded by a Flickr user to her page has a huge number of visits coming from outside (e.g., other social networks, search engines), it means this image has a high attractiveness that needs to be considered even if inside the network it does not show particular popularity. We found that this could also be a relevant factor to be considered in the ranking, and we are still investigating this point.

If you’re interested in more data stories, please check out our collection of 25 Data Stories for interviews with data scientists from Pinterest, Foursquare, and more! 

Data Stories: Tyler Singletary of Klout

This Data Story is with Tyler Singletary, Director of Platform at Klout, and we’re talking about Klout Scores, Klout data, international influence and more. Klout is an extremely popular Gnip enrichment and it’s clear that the world is interested in Klout data, so we thought this would be a fun interview. You can keep up with Tyler on Twitter at @harmophone and on Klout at

Tyler Singletary from Klout

1. Gnip’s CTO Jud Valeski has a Klout Score of 56, our VP of Product Rob Johnson has a Klout Score of 50, and I have a Klout Score of 60. If you’re a business, who do you give an offer to?

It depends on what you’re looking for. While the Klout Score is an expression of a user’s potent network effects, Klout Topics are an expression of where and what the user drives engagement on, and to some degree, where their interests and passionate community lie. If I wanted to reach Engineers and Boulderites, Jud would be a good choice: his audience is very engaged with him on those topics. We’ve been leveraging this understanding and segmentation in our products for years now.

2. What has the introduction of Cinch meant for Klout? Will Cinch data be available through the Klout API?

Cinch is another way to look through the prism of what it means to influence someone, and is built on the idea of social authority being an important piece in the collaborative economy. To that end, again, Cinch is built on Klout Topics as an important way to quickly find subject-matter authority and trust, combined with a user’s personal network.

Cinch is in the early days as a product, but we can easily imagine it as part of the platform in the future. There’s already a wealth of good advice around lifestyle topics, and it would be a fantastic channel for businesses and consumers to integrate with to pose engaging questions and take a pulse on influencers recommending their products.

3. What is the potential for companies using Klout in a CRM?

Klout’s influence graph helps CRM users gain insight into users, enabling them to improve prioritization of issues and outreach, as well as customer satisfaction. By finding their most influential customers, these companies can also streamline outreach and word-of-mouth marketing to increase new business. As the truly “Social CRM” platform evolves, and inbound leads are generated from social feeds, it will become increasingly important to know how and what influence leads and customers have. It’s about finding more relevance and reach.

4. Klout is big in Japan. What are the challenges of defining influence in different countries?

I tend to think of things in terms of “units of influence.” Each network has a different set of actions that can be influential. You have an actor, an actee, an interaction, and a subject (or two or three). In terms of Japan, and other countries using different languages, you still have an actor, actee, and interactions– all of those units that go into the Klout Score. What we need to build is an interpreter for the subjects (effectively, Klout Topics), using the character sets, grammars, and dictionaries mapped to the common and unique meanings. This is not a trivial undertaking.

So the Klout Score is effective and applies easily to activity on the networks we’re already surveying in Japan and South Korea, and other countries. Klout Topics will take an investment of resources, but it’s not an impossible problem. One interesting other way to look at the problem– a short circuit– is to delve deeper into content. A URL is a URL in any language. If you can understand what that destination is about, as a subject of influence, you may be able to come to a solution for the Topic problem before tackling the language issues.

There’s also the diversity of networks, and the availability of data. From my research, platforms like LINE and Gree aren’t yet opening up APIs or partnering with the Gnips of the world. Getting access to the wealth of data in what might be a more dominant platform in foreign countries needs to be solved.

5. What does the future of Klout data look like?

Klout Topics are a constantly adapting and improving system. I hope to have us release different prisms to view them under, such as standard ontologies like the IAB, while still retaining the adaptive and “in the moment” nature that social data requires. I think you’ll see us encouraging developers and companies to build into the platform more and to derive new insights from aggregated data and around individual pieces of content, and you’ll see us make more inroads in our offering there. With Cinch we’re proving that there are broad use cases for influence data, and we’ve been encouraging the platform community to build on that premise.

6. What are common misconceptions people have about Klout?

There’s still this thought that we are only about the Klout Score. Topics have been around for several years now. There’s also this sense that it’s a value judgment, or a rank-order. We’re none of these things. We have scientists and social people working on the tough problems in social media. And really, in a society where money drives so much, people should be recognized and rewarded, even indirectly, for the impact they have in their networks.

7. If you’re a business, what is the first thing you should know about Klout?

Klout is the best platform for driving authentic earned media, and our data is the best lens with which to capture, catalog, and understand all of the earned media being generated around products, entertainment, services and brands.

Thanks to Tyler for the interview! If you’re interested in more data stories, check out our compilation of 25 Data Stories from Gnip!

A Collection of 25 Data Stories

25 Data Stories from Gnip

At Gnip, we believe that social data has unlimited value and near limitless application. This Data Stories collection is a compilation of applications that we have found in our practice, highlighting both unique uses of social data and interesting discoveries along the way. Our initial Data Story describes an interview with the world’s first music data journalist, Liv Buli, and how she applied social data in her work. Her answers honestly blew us away, and to this day we continue to be surprised by the different ways people are applying social data in each new interview.

The 25 different real-world examples detailed in this Data Stories compilation cover an incredible range of topics. Some of my favorite use cases are as compelling as the use of social data within epidemiology studies and how social data is used after natural disasters. Others deal with the seemingly more ordinary, such as common recipe substitutions and exploring traffic patterns within cities. One thing is consistent across them all, however — the understanding that social data is key to unlocking previously unknown insights and solutions.

One of the more exciting aspects of these new discoveries is the fact that we are just now learning where social data will take our research next. Today, social data is used in fields as disparate as journalism, academic research, financial markets, business intelligence and consumer taste sampling. Tomorrow? Only the future will tell, but we’re excited to be along for the ride.

We hope you enjoy this collection of Data Stories, and please continue to share with us the stories you find. We think the whole ecosystem benefits when we let the world know what social data is capable of producing.

Download the 25 Data Stories Here. 

The 4 Ways People Use Social Media in Natural Disasters

Earlier this year I interviewed Brooke Fisher Liu of the University of Maryland about her research on how people use social media during natural disasters. She broke it down like this:

During natural disasters people tend to use social media for four interrelated reasons: checking in with family and friends, obtaining emotional support and healing, determining disaster magnitude, and providing first-hand disaster accounts.

I was reflecting upon this interview and how much I saw these four scenarios during the recent Boulder flood, from whose aftermath many in the community are still suffering. The Boulder flood provided the perfect way to look at how people use social media in natural disasters.

1) Checking in with family and friends
People were using social media to let their friends and loved ones know they were safe, what their current status was, and offering (or soliciting) help. Across the community, there were people offering help to those who needed it whether they be strangers or family. For myself, I tried posting daily updates on Facebook, so I could keep people up to date and then focus on figuring out cleanup.

Social Media During Boulder Flood

2) Obtaining emotional support and healing

Twitter, Facebook and Instagram provided enormous amounts of emotional support along with concrete offers of help. The hashtag #BoulderStrong offered great encouragement to those suffering losses.

3) Determining disaster magnitude
Many of the people following the #BoulderFlood hashtag were following some of the official accounts, including @dailycamera (Boulder newspaper), @mitchellbyar (reporter at the Daily Camera), @boulderoem (Boulder Office of Emergency Management) and @bouldercounty. As a community we were looking to hear about our homes, our schools, our neighbors and how they fared. We were looking to understand just how damaged our community was and how long it would take to recover. One of the more interesting aspects I saw was people focused on determining road closures. While Boulder OEM was publishing their reports, many people were determining how to get in and out of Boulder. I can’t help but think how social data can represent more accurate information and real-time reporting than official sources.

4) Providing first-hand disaster accounts
While newspapers shared collections of horrifying images of the damage from the Boulder flood, we were looking to our contacts on social media for first-hand accounts too. We were using our networks on Twitter to confirm what we were hearing online or even what we thought we were seeing.

Boulder Flood on Instagram

Our CTO Jud Valeski posted many shots of the flood on his Instagram account that were picked up by the media. In fact, Michael Davidson at Xconomy even wrote an article, “Gnip Co-founder Jud Valeski on His Flood Shots Seen Around the World.”

The one aspect that really seemed to be missing from Brooke Fisher Liu’s research was the coordination that was taking place across social media. People were offering to help strangers, organize cleanups, share tools, share bottled water and spare bedrooms, solicit donations, check on other people’s houses and help in a thousand other ways. Resource sharing was one of the major ways that social media played a role in the Boulder flood.

Data Story: Eric Colson of Stitch Fix

Data Stories is our blog series highlighting the cool and unusual ways people use data. I was intrigued by a presentation that Eric Colson gave to Strata about Stitch Fix, a personal shopping site, that relied heavily on Stitch Fix data along with its personal shoppers. This was a fun interview for us because several of my female colleagues order Stitch Fix boxes filled with items Stitch Fix thinks they might like. It’s amazing to see how data impacts even fashion. As a side note, this is Gnip’s 25th Data Story, so be on the watch for a compilation of all of our amazing stories. 

Eric Colson of Stitch Fix

1. Most people think of Stitch Fix as personal shopping service, powered by professional stylists. But, behind the scenes you are also using data and algorithms. Can you explain how this all works?

We use both machine processing and expert-human judgment in our styling algorithm. Each resource plays a vital role. Our inventory is both diverse and vast. This is necessary to ensure we have relevant merchandise for each customer’s specific preferences. However, it is so vast that it is simply not feasible for a human stylist to search through it all. So, we use machine-learning algorithms to filter and rank-order all the inventory in the context of each customer. The results of this process are presented to the human stylist through a graphical interface that allows her to further refine the selections. By focusing her on only the most relevant merchandise, the stylist can apply her expert judgment. We’ve learned that, while machines are very fast at processing millions of data points of information, they still lack the prowess of the virtuoso. For example, machines often struggle with curating items around a unifying theme. In addition, machines are not capable of empathizing; they can’t detect when a customer has unarticulated preferences – say, a secret yearning to be pushed in a more edgy direction. In contrast, human stylists are great at these things. Yet, they are far more costly and slower in their processing. So, the two resources are very complementary! The machines narrow down the vast inventory to a highly relevant and qualified subset so that the more thoughtful and discerning human stylist can effectively apply her expert judgment.
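
The filter-then-rank step described above can be sketched roughly as follows. This is only an illustration and not Stitch Fix's actual algorithm: the `Item` and `Customer` classes, the `edginess` scale and the `shortlist` function are all hypothetical names, and a simple distance score stands in for a real machine-learning model. The point is the division of labor: the machine reduces the inventory to a short, ordered list, and the human stylist makes the final picks from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    size: str
    edginess: float  # hypothetical scale: 0.0 (classic) to 1.0 (edgy)


@dataclass(frozen=True)
class Customer:
    size: str
    preferred_edginess: float  # hypothetical stated preference, same scale


def shortlist(inventory, customer, k=3):
    """Return the k items most relevant to this customer (toy scoring)."""
    # Filter: drop anything that cannot fit the customer at all.
    candidates = [item for item in inventory if item.size == customer.size]
    # Rank: items closest to the customer's stated preference come first.
    candidates.sort(
        key=lambda item: abs(item.edginess - customer.preferred_edginess)
    )
    # Only this short list would be shown to the human stylist.
    return candidates[:k]


inventory = [
    Item("blazer", "M", 0.2),
    Item("moto jacket", "M", 0.9),
    Item("wrap dress", "S", 0.8),
    Item("graphic tee", "M", 0.7),
    Item("cardigan", "M", 0.1),
]
customer = Customer(size="M", preferred_edginess=0.85)
for item in shortlist(inventory, customer, k=2):
    print(item.name)
```

In a real system the score would presumably come from a trained model and many more attributes; the stylist's refinement step is deliberately left outside the code.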

2. What do you think would need to change if you ever began offering a similar service for men?

We would likely need entirely new algorithms and different sets of data.  Men are less self-aware of how things should fit on them or what styles would look good on them (at least, I am!). Men also shop less frequently, but typically indulge in bigger hauls when they do. Also, the styles are less fluid for men and we tend to be more loyal to what is tried & true.  In fact, a feature to “send me the same stuff I got last time” might do really well with men. In contrast, our female customers would be sorely disappointed if we ever sent them the same thing twice!

So, while the major technology pieces of our platform are general enough to scale into different categories, we’d still want to collect new data and develop different algorithms and features to accommodate men.

3. How did you use your background at Netflix to help Stitch Fix become such a data driven company?

Data is in the DNA at Stitch Fix. Even before I joined (first as an advisor and later as an employee), they had already built a platform to capture extremely rich data. Powerful distinctions that describe the merchandise are captured and persisted into structured data attributes through both expert human judgment and transactional histories (e.g., How edgy is a piece of merchandise? How well does it do with moms?). This is a rare capability – one that even surpasses what Netflix had. And the customer data at Stitch Fix is unprecedented! We are able to collect so much more information about preferences because our customers know it’s critical to our efforts to personalize for them. I only wish I had this type of data while at Netflix!

So, in some ways Stitch Fix already had an edge over Netflix with respect to data. That said, the Netflix ethos for democratizing innovation has permeated into the Stitch Fix culture. Like Netflix, we try not to let our biases and opinions blind us as we try new ideas. Instead, we take our beliefs for how to improve the customer experience and reformulate them as hypotheses. We then run an A/B test and let the data speak for itself. We either reject or accept the hypothesis based on the observed outcome. The process takes emotion and ego away and allows us to make better decisions.
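
One common way to "let the data speak" in an A/B test is a two-proportion z-test on conversion rates. The sketch below is an assumption about how such a comparison could be done, not a description of the tests Stitch Fix or Netflix actually run; the function name and the example counts are made up for illustration. It uses the pooled standard error and a two-sided p-value from the normal distribution.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns (z, two-sided p-value) under the null hypothesis that both
    variants share the same underlying conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 200 of 1,000 customers convert on A, 250 of 1,000 on B.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
alpha = 0.05  # significance threshold chosen before running the test
print(f"z = {z:.2f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```

Strictly speaking, a test like this either rejects the null hypothesis of "no difference" or fails to reject it; a non-significant result is not proof that the idea has no effect.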

Also, like Netflix, we invest heavily in our data and algorithms. Both companies recognize the differentiating value in finding relevant things for their customers. In fact, given our business model, algorithms are even more crucial to Stitch Fix than they are to Netflix. Yet, it was Netflix which pioneered the framework for establishing the capability as a strategic differentiator.

4. How else is Stitch Fix driven by data?

Given our unique data, we are able to pioneer new techniques for most business processes. For example, take the process of sourcing and procuring our inventory. Since we have the capability of getting the right merchandise in front of the right customer, we can do more targeted purchasing. We don’t need to make sweeping generalizations about our customer base. Instead, we can allow each customer to be unique. This allows us to buy more diverse inventory in smaller lots since we know we will be able to send it only to the customers for whom it is relevant.

We also have the inherent ability to improve over time. With each shipment, we get valuable feedback. Our customers tell us what they liked and didn’t like. They give us feedback on the overall experience and on every item they receive. This allows us to better personalize to them for the next shipment and even allows us to apply the learnings to other customers.

5.  Your stylists will sometimes override machine-generated recommendations based on other information they have access to. For example, customers can put together a Pinterest board so that they can show the stylist things they like. Do you think machines will ever process this data?

No time soon! Processing unstructured data such as images and raw text is squarely in the purview of humans. Machines are notoriously challenged when it comes to extracting the meaning that is conveyed in this type of information. For example, when a customer pins a picture to a Pinterest board, often they are expressing their fondness for a general concept, or even an aspiration, as opposed to the desire for a specific item. While machine learning has made great strides in processing unstructured data, there is still a long way to go before it can be reliable.

Thanks to Eric for the interview! If you have suggestions for other Data Stories, please leave a comment! 


Data Story: Looking at Social Media in India, Iran and Russia

We came across Navid Hassanpour’s research at Yale through his submission for a SXSW panel “Mapping Social Media Debates: India, Iran, Russia.” We were intrigued by the idea of looking at and comparing the use of social media in these three countries. You can vote for Navid and his colleagues’ SXSW presentation here, and you can follow Navid on Twitter at @navidhassanpour.

1. For your SXSW panel, you’re looking at the dynamics of social media exchange in relation to political activity in India, Iran and Russia. What were some of your more interesting findings?

I started by following the results of a daily phone poll among Iranians on the 2013 Iran election at IPOS in the days prior to the voting. There were candidates who were favorites, and those who caught up later on in the volatile game of Iranian electoral politics. Meanwhile I followed the Farsi election discussions on Twitter. In particular, I could see that some of the candidates had put resources into Twitter astroturfing, while others enjoyed more organic attention, and all this was happening in a country in which Twitter is banned and blocked. Nevertheless, most of the dignitaries, including the head of state, have active Twitter accounts.

I noticed that the mere volume of tweeting does not tell you what is “trending” or is about to trend; instead, what matters the most for the dynamics of proliferation is “how” the conversation is had. That is important because most of what we hear on sentiment analysis and similar topics is the dynamic frequency profile of tweets, not the type of conversation structure that develops through time. For example, Barack Obama tweeting to you as a follower, among 35 million others, will not have the same effect as one of your friends directly tweeting to you and starting a conversation on the same topic. The Tweet count can be the same.
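To make the distinction concrete, here is a minimal Python sketch. It is only an illustration, not the method used in the research: the (author, addressed_to) data model, the function name conversation_stats and the two toy datasets are all assumptions. Both datasets contain six tweets, but only one of them contains any back-and-forth.

```python
def conversation_stats(tweets):
    """Summarize a list of (author, addressed_to) pairs.

    addressed_to is None for a broadcast aimed at no one in particular.
    This data model is hypothetical and chosen only for illustration.
    """
    volume = len(tweets)
    # Distinct directed pairs: who addressed whom at least once.
    directed = {(a, b) for a, b in tweets if b is not None}
    # A pair is reciprocated when each side has addressed the other.
    reciprocated = {frozenset(p) for p in directed if (p[1], p[0]) in directed}
    return {
        "volume": volume,
        "directed_pairs": len(directed),
        "reciprocated_pairs": len(reciprocated),
    }


# Six broadcasts from one account: high volume, no conversation.
broadcast = [("leader", None)] * 6

# Six tweets among three friends who answer each other.
dialogue = [
    ("ana", "ben"), ("ben", "ana"),
    ("ana", "cy"), ("cy", "ana"),
    ("ben", "cy"), ("cy", "ben"),
]

print(conversation_stats(broadcast))
print(conversation_stats(dialogue))
```

A plain frequency count treats the two datasets as identical; the pair counts are what separate a broadcast from a conversation.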

2. How do you think social media has changed the politics for both leaders and citizens? For example, the Syrian Presidency now has its own account on Instagram.

The immediacy of engagement between leaders and their constituents on Twitter and the like is unprecedented, and that introduces a host of new potentials. A lot of emphasis is put on the side of the constituents, for example how they mobilize and advertise on Twitter, but I think the leaders’ side is just as interesting.

For example, during the Iranian election, as the competition became more and more fierce, it was fascinating to see how political figures, such as Rouhani and Aref, chose to actively tweet in a country where Twitter is blocked and banned. In the end, I learned the results of the election from Rouhani’s Twitter feed a few hours ahead of the official announcement.

Here there is hope and a cautionary tale: I remember going through Medvedev’s Facebook account and seeing questions such as “the enemies of the great Russian nation are telling us that the election was rigged, is that true? what do you think?” and all of a sudden what you have is five thousand comments acting as a direct barometer of the electorate’s sentiments. I think this trend is going to continue. When you do not know the exact implications and stakes of Twitter mobilization, why not be a part of it? It would be the best insurance mechanism. Traditionally authorities know that, on the ground, blackshirts work effectively against grassroots mobilization. Now this leveling of the playing field transforms censorship from its traditional mode of disruption to something more nuanced.

3. Part of your research revolved around censorship in social media. How have you found that citizens work their way around censorship and how do censors keep up with social media?

The situation in Iran, where Twitter is banned, verges on the bizarre. Social media is banned but then the leaders have Twitter accounts. In China censorship of online expression is outsourced to private entities, individuals who run a grassroots censorship campaign. The more the censors engage in tweeting and the like, the more important a study of the types of interaction on social media becomes. For example, it is timely to ask what the potential patterns of manipulation in Twitter discussions might be. How is an astroturfed discussion different from one that grows rapidly by itself? Ironically these are the types of questions that excite consumer marketers as much as electioneers.

4. What were some of your biggest takeaways working with Twitter data?

I work with a wonderful team of collaborators, Pablo Barbera and Joshua Tucker from NYU’s SMaPP (Social Media and Political Participation Lab), and Erik Borra from the Digital Methods Initiative at the University of Amsterdam, to extract patterns of conversation in Twitter data. To give you an example, what we can see in the Iran data is that network parameters of the conversation networks differ meaningfully from one candidate to another. For example, the discussion around Rouhani is more clustered than the others. For Rouhani himself, the cliquishness of the discussion around him increased when he started to surge in the polls.
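One common way to put a number on how “clustered” a conversation network is, is the average local clustering coefficient: for each account, the fraction of pairs of its conversation partners who also talk to each other. The Python sketch below is an illustration under assumptions, not the team’s actual measure; the interview does not say which network parameters were used, and the function name and toy edge lists are hypothetical.

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    edges is an iterable of (user_a, user_b) pairs, one per pair of
    accounts that exchanged tweets. Accounts with fewer than two
    partners contribute a coefficient of 0.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-replies
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if not neighbors:
        return 0.0

    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count pairs of this account's partners that are themselves linked.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        total += links / (k * (k - 1) / 2)
    return total / len(neighbors)


# A tight clique: everyone talks to everyone.
clique = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

# A star: three accounts each talk to a hub, never to each other.
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]

print(average_clustering(clique))
print(average_clustering(star))
```

Tracking a number like this over time, candidate by candidate, is one way a rise in cliquishness could show up in the data.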

It’s obvious that the platform imposes a certain type of diffusion grammar. For example, retweets always refer to the source tweet, and the trail of retweets is lost in the data we collect from the API, and I think that also influences how users retweet.

I have noticed that the network characteristics of the conversation patterns we see can differentiate contrived discussions from viral grassroots discussions. The dynamics of these interaction parameters have been eye-opening. If there is manipulation, the network structure of the forced conversation would be different from something that is not forced and we would be able to detect that. Manipulators leave a trace–even in something as chaotic as Twitter where there is so much noise.

There is much talk about users and their attributes, but conversation structure itself is just as important. If you know the structure of conversation around a candidate you can tell something meaningful about the potential of the campaign to go viral on Twitter–and hopefully on the ground.

A simple keyword frequency profile does not tell you about the swarming effect that defines Twitter. A tweet from CNN can generate millions of retweets; at the same time, something simmering in many clusters for a while could also lead to a sudden breakout. That hidden simmering is what I really like about Twitter data and would like to understand better. In the process we learn more about similar historical dynamics that we have no access to, such as public opinion during a revolution in the past, or quick reversals of religion and identity.

5. What other research possibilities around social media and politics excite you?

Understanding the dynamics of “trending” is what excites me the most. This has potential applications to the study of markets as well as elections. I am not saying that Twitter data reflects reality perfectly, but what is exciting is that now we have an opportunity to understand mechanisms that we had no idea about before, simply because the right exploratory platform for observation did not exist.

Thanks to Navid for doing the interview! If you have any suggestions for our next Data Story, please let us know in the comments!


Help Gnip Present at SXSW!

We had a great time at SXSW this year with our Big Boulder: Bourbon & Boots event, and we’ll be heading back again next year. We’ve submitted three speaker proposals this year, and if the topics below are ones you’d like to hear at SXSW, we’d love an upvote (or three!).

The Anatomy of a Twitter Rumor:
Solo presentation by Gnip’s lead data scientist Dr. Scott Hendrickson 

Like a match to a fireworks factory, the hacked AP account ignited rumors that President Obama had been hurt in a terrorist attack, causing a hundred-billion-dollar drop in the stock market. What was even more significant about the Hash Crash was the ability of Twitter users to suppress the rumor and cause the market to rally within minutes, despite how quickly and how far the rumor spread.

This session by Gnip data scientist, Dr. Scott Hendrickson, will look at the anatomy of a Twitter rumor, how it spreads, how Twitter users react with accurate information and how rumors die. Looking at a bank run, the rumors from Hurricane Sandy and the Hash Crash, we’ll see why Twitter users are good at ferreting out fact from fiction and how to recognize the difference on Twitter.

A look at the White House Hash Crash

Beyond Dots on a Map: The Future of Mapping Tweets
Ian Cairns of Gnip and Eric Gundersen of MapBox

Earlier this year Gnip and MapBox collaborated on three different maps using geotagged Tweets and this presentation is an extension of that work.

What can 3 billion geotagged Tweets collected over 18 months tell us? Turns out, a lot. Gnip collaborated with the team at Mapbox to study 3 billion geotagged Tweets in aggregate and visualize the results. That work led to 3 maps showing iOS vs Android usage, where tourists vs. locals hang out, and language usage patterns. From just these maps there were some surprising findings revealing demographic, cultural and social patterns down to city-level detail, across the entire world. For instance, in the US, Tweets from iOS showed where the wealthy live. The data has many other stories to tell as well. As Twitter use becomes more ubiquitous, it’s increasingly serving as a valid proxy not just for what’s happening “on social media,” but for what’s happening in the world in general. This is the first time social data has been mapped at this scale, and we’ll talk about both lessons gleaned from the data and what we learned about making a visualization this big.

Marketing’s Big Data Dissonance:
Duo Presentation by Rob Johnson of Gnip and Dan Neely of Networked Insights

Marketers know they need big data, but like the velvet rope blocking entrance to a SXSW music event, the perceived barrier is hard to overcome. The problem for the modern marketer: cutting through the noise of all of this data and zeroing in on insights that can help them better reach consumers. Big Data grows every day and marketers are faced with an additional challenge: keeping up with the speed at which new consumer data is created. The good news for marketers is that there’s no shortage of places to get information about consumers, from point-of-sale systems to mobile check-ins to consumer conversations across the social web. Together, all of these actions add up to an incredible mass of information known as Big Data for marketers. In this session, Networked Insights will be joined by Gnip to discuss the tools and techniques that marketers need in order to turn the mass of Big Data into actionable and understandable insights.


Data Stories: Stefan Papp of Blab on Predictive Social Intelligence

Data Stories is about telling cool stories about the amazing ways that social data is used. This week we’re interviewing Stefan Papp of Blab, a Gnip customer, about their predictive social analytics that are able to help customers understand the directions conversations online are headed in 24, 48, and 72 hour time frames. I absolutely loved this concept and wanted to understand more about how it worked.
Stefan Papp of Blab
1. Why did Blab decide to take a different approach instead of doing social media monitoring?

Typical social media monitoring tools use keyword-based searches – you put in your keyword and the tools return all historical results matching that keyword.  The catch is that you will only find the insights you are looking for. Our approach is to listen to the entire conversation as it organically evolves, allowing you to discover much more of what your users are discussing — including the unexpected insights you probably would not have thought to search for. Finding those non-obvious but vibrant discussions gives companies whole new opportunities to engage with their target audiences.

2. What makes social data a good source in predicting behavior?

When you monitor social data you have a unique opportunity to listen to the unfiltered personal voices of millions of users. This offers insight into not just what is trending in social but what people are really thinking and feeling; their beliefs and true opinions. Standard behavioral prediction methods use things like focus groups and polls. But these approaches have long been known to produce skewed data – both consciously and subconsciously, users tend to tailor their responses to what they think the “best” response should be. Social data has none of this user bias and as a result is an excellent source of raw unfiltered intelligence.

3. How do you think Blab fits into the trend of real-time marketing?

The challenge that real-time marketing presents is creating on-topic content and getting that content out the door before the conversation is old news. Blab not only monitors current conversations in real time but predicts which of those conversations are going to be hot 72 hours from now. This takes the reaction out of real-time marketing and for the first time gives control to the brand. Blab enables you to put relevant content in front of an engaged audience at the right time – before a conversation has grown so big that your voice can’t be heard. You can also strike a chord by engaging your audience in those non-obvious conversations happening right now that you would not have thought to join.

4. What are some surprising findings companies have made when using Blab?

One of the major insights that companies have gained from using our Predictive Social Intelligence engine is how conversations evolve online. So many companies only think about Facebook and Twitter when they think social, but social is so much more than that. Blab enables you to watch a conversation evolve across the entire spectrum of social networks. You can follow a conversation that begins with a YouTube post, which then drives a larger conversation on Twitter, and ends up being predicted to explode on Tumblr 72 hours from now.  Without a holistic view like this, companies can be led to believe that a conversation has ended when in fact it continues vibrantly on another, untracked social platform.

Another interesting finding is that our clients get an unadulterated view of their standing in social discourse. One client, a global technology concern, was surprised and chagrined to find that while there was lively discussion around their competitors, there was no discussion at all about them in their area of expertise. As humbling as that was, it became a call to action and fueled enthusiasm for engaging more effectively. Blab helped that company discover a negative and use that knowledge to improve their position.

5. Clearly, Blab has huge implications when it comes to crisis communications. One of the things that has amazed me about social media is that you don’t need a huge following to start a fuss about a brand. How does Blab separate the wheat from the chaff when it comes to determining conversations that might spike?

We use two unique methods to identify truly relevant conversations and to make accurate predictions on when a conversation will spike.  First, we throw NLP out the window and use a proprietary contextual classification approach to find the conversations that are related to a given topic. Rather than filtering out words like “got” we let our engine tell us if a term or phrase should be included. And guess what? There is a thriving “got” conversation among people who are passionate about “Game of Thrones.” We embrace acronyms, slang, abbreviations and sarcasm in a language-agnostic manner (from Klingon to emoticon). The result is that we give you a picture of the whole conversation, unfiltered yet relevant.

The second unique method is our proprietary approach to determining which conversations will spike or cool down. As conversations ebb and flow on the social canvas they establish patterns of historical facts. We’ve discovered that regardless of the topic, these patterns tend to repeat themselves. So while there are a huge number of them, the universe of conversation patterns is not infinite. When we see a familiar pattern we can predict, often with high confidence, how a conversation will progress up to 72 hours into the future.
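Blab’s prediction method is proprietary and is not described in this interview, so the Python sketch below is only a generic illustration of the idea of matching a familiar pattern: a nearest-neighbor lookup over past conversation volume curves. The function names, the peak normalization, and the toy numbers are all assumptions.

```python
import math


def normalize(series):
    """Scale a series by its peak so that shape, not raw volume, is compared."""
    peak = max(series)
    return [x / peak for x in series] if peak > 0 else list(series)


def predict_continuation(recent, history, horizon):
    """Forecast the next `horizon` points of `recent`.

    Finds the window in `history` (a list of past volume series) whose
    shape is closest to `recent`, then returns what followed that window,
    rescaled to the level of `recent`. Returns None if no series in
    `history` is long enough.
    """
    n = len(recent)
    target = normalize(recent)
    best_dist, best_next = math.inf, None
    for past in history:
        for start in range(len(past) - n - horizon + 1):
            window = past[start:start + n]
            dist = math.dist(target, normalize(window))
            if dist < best_dist:
                best_dist = dist
                scale = max(recent) / max(window) if max(window) > 0 else 1.0
                best_next = [x * scale for x in past[start + n:start + n + horizon]]
    return best_next


# Two past conversations: one that doubled and then faded, one that stayed flat.
history = [
    [1, 2, 4, 8, 16, 8, 4],
    [5, 5, 5, 5, 5, 5, 5],
]

# A current conversation that has doubled twice so far.
recent = [10, 20, 40]

print(predict_continuation(recent, history, horizon=2))
```

Because the curves are compared after scaling by their peaks, a small conversation and a large one with the same shape are treated as the same pattern, which is one simple reading of “regardless of the topic, these patterns tend to repeat themselves.”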

Taken together, Blab gives brands the ability to find potentially troubling conversations as they emerge; to determine if action is important by predicting which conversation is likely to take off; to engage that conversation or take other remedial action; and to know if the engagement is having an impact by watching to see if the prediction of growth turns into a prediction of decline for the troubling conversation.


Data Story: Oliver O'Brien on Open Data Maps

I stumbled across the most amazing set of open data maps for bike sharing cities and tracked down the creator in London to interview him for a Data Story. Oliver O’Brien is the creator of the maps, which track available bikes and open spaces at bike sharing stations in more than 100 cities across the world. We interviewed him about his work with open maps and his research trying to understand how people move about the city.

Ollie O'Brien Open Maps

1. What was the genesis for creating the maps?
It started from seeing the launch of London’s system in August 2010. It was at a time when I was working with Transport for London data on a project called MapTube. Transport for London had recently created a Developer portal for their datasets. When the London bikeshare launched, their map was not great (and still isn’t) – it was just a mass of white icons – so I took advantage of the data being provided on the Developer portal to create my own version, reusing some web code from an earlier map that showed General Election voting results in a fairer and clearer way. Once London’s was created, it proved to be a hit with people, as it could be used to see areas where bikes (or free spaces) might be in short supply. I was easily able to extend the map to Montreal and Minneapolis (the latter thanks to an enthusiastic local there) and then realised there was a whole world of bikesharing systems out there waiting to be mapped.

The maps act primarily as a “front-end” to the bikesharing data that I collect, for current and potential future research into the geomorphology of cities and their changing demographics and travel patterns, based on how the population uses bikesharing systems. However, I have continued to update the map as it has remained popular, adding cities whenever I discover their bikeshare datasets. After three years, I am now up to exactly 100 “live” cities, where the data is fresh to within a few minutes, plus around 50 where the data is no longer available.

2. Where did you get the information to build the maps?
Mainly from APIs provided by each city authority or bikesharing operating company, or, where this is not available (which is often the case for smaller systems), from their Google Map or other online mapping page that normally has the information in the HTML.

3. What is your background?
I’m an academic researcher and software developer at UCL’s Centre for Advanced Spatial Analysis. The lab specialises in urban modelling, and my current main project, EUNOIA, is aiming to build a travel mobility model, using social media as well as transport datasets, for the major European cities of London, Barcelona and Zurich. Bikesharing systems will form a key part of the overall travel model. Prior to CASA, I worked as a financial GUI technologist at one of the big City banks – before then, at university, I studied Physics.

4. What are you looking to build next?
I am looking to continue to add cities to the global map, particularly from large bikesharing systems that are appearing – I am looking forward to the San Francisco Bay Area’s system launching in August – and I’m working on creating London’s EUNOIA model, taking in the transport data and augmenting it with other geospatial information, including data from Twitter. I am also looking at more effective ways to visualise data and statistics that are emerging from the recent (2011) Census that we had in the UK – the results of which are being gradually made available.

5. What open-source maps do you think should be created next?
I am hopeful that soon an integrated map of all social media and sensor datasets will become easily available and widely used, partly to increase people’s awareness of the data that now surrounds them and partly to inform decision makers and other stakeholders in creating a better, more inclusive city landscape: the so-called “smart city”.

I would add that you may be interested in some of the other maps that we have created at UCL CASA, such as the Twitter Languages maps for London and New York: and …and also – these maps were all created mainly by my colleagues, with me just helping with the web work.

Bike sharing map in Boulder, CO

Thanks to Oliver for the interview! If you’re interested in more geo + social, check out our recent posts on Social Data Mashups Following Natural Disasters and Mapping Travel, Languages & Mobile OS Usage with Twitter Data.