Data Story: John Foreman of MailChimp on the Data Science Behind Emails

When I was in charge of email at my last startup, the MailChimp blog was a must-read. Their approach to email marketing is brilliant, so when my colleague suggested I interview MailChimp’s chief data scientist, John Foreman, for a Data Story, I was definitely on board. In addition to being a data scientist at MailChimp, John is also the author of Data Smart: Using Data Science to Transform Information into Insight. You can follow him on Twitter at @john4man.

1. People have a love/hate relationship with email. How can data science help people love email more and get more out of it?

Recently, people across industries seem to be waking up from their social-induced haze and rediscovering the effectiveness of direct email communication with their core audience.

Think about a true double-opted email subscription versus, say, a Facebook “like” of a product. When I like a product page on Facebook, do I really want to hear from them in my feed? In part, isn’t that “like” just an expression that’s meant for public display and not for 1-to-1 ongoing communication from the business?

Contrast that with email. If I opt into a newsletter, I’m not doing that for anyone but myself. Email is a private communication channel (I like the term “safe place”). And I want your business to speak to me there. That’s powerful. Now think of a company like MailChimp. We have billions of these subscriptions from billions of people all across the world. MailChimp’s interest data is unparalleled online.

OK, so that means that as a data scientist, I have some pretty meaty subscription data to work with. But I’ve also got individual engagement data. Email is set up perfectly to track individual engagement, both in the email, and as people leave the email to interact with a sender’s website.

So I use this engagement and interest data to build products — both weapons to fight bad actors as well as power tools to help companies effectively segment and target individuals with content that’s more relevant to the recipient. My goal is to make the email ecosystem a strong one, where unwanted marketing email goes away and the content that hits your mailbox is ideal for you.

For instance, MailChimp recently released a product called Discovered Segments that uses unsupervised learning to help users find hidden segments in their list. Using these segments, the sender can craft better content for their different communities of recipients. MailChimp uses the product ourselves; for example, rather than tell all our customers about our new transactional API, Mandrill, we used data mining to only send an announcement to a discovered segment of software developers who were likely to use it, resulting in a doubling of engagement on that campaign.
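
To make the idea of finding hidden segments with unsupervised learning concrete, here is a minimal sketch using plain k-means clustering. It is an illustration only and not MailChimp's actual Discovered Segments implementation: the choice of k-means, the two click-share features, and the subscriber values are all assumptions made up for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters with plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

# Hypothetical subscribers, described by two made-up features:
# (share of clicks on developer links, share of clicks on design links)
subscribers = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.05),
               (0.1, 0.9), (0.2, 0.8), (0.15, 0.95)]
labels, centers = kmeans(subscribers, k=2)
print(labels)
```

In this toy data the first three subscribers end up in one segment and the last three in another, so a sender could address the developer-leaning group separately from the design-leaning group.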

2. How is data science structured at MailChimp? How big is your team, and what departments do you work with?

MailChimp has three data scientists, and our job as a little cell is to deliver insights and products to our customers. That sounds like business-speak, so let me break it down.

By insights, I mean one-off research and analysis of large data sets that’s actionable for the customer. And by products, I mean tools that the customer can use to perform data analysis themselves. If the tool or product isn’t useful or required by the customer, we don’t build it. A data science team is not a research group at a university, nor is it a place just to show off technologies to investors. We’re not here to publish, and we’re not here to build “look at our data…ooooo” products for the media. Whenever a data science team is involved in those activities, I assume the business doesn’t actually know what to do with the technical resources they’ve hired.

Now, who is the “customer” in this mission? We serve other teams internally as well as MailChimp’s user base. So an example of a data product built for an internal customer would be Omnivore — our compliance AI model, while an example of a data product built for the general user population would be our Discovered Segments collaborative filtering tool.

We work very closely with the user experience team at MailChimp — the UX team is constantly interviewing and interacting with our users, so they generate a lot of hypotheses which we investigate using our data. The UX team, because their insight is built quickly from human interactions, can flit from thought to thought and project to project; when they think they’re onto something good, they kick the research idea to the lumbering beast that is the data science team. We can comb through our billions of records of sends, clicks, opens, http requests, user and campaign metadata, purchase data, etc. to quantitatively back or dismiss their new thinking.



3. Your book, Data Smart, is about helping to teach anyone to get value out of data. Why did you see a need for this book? 

I used to work as a consultant for lots of large organizations, such as the IRS, DoD, Coca-Cola, and Intercontinental Hotels. And when I thought about the semi-quantitative folks in the middle and upper rungs of those organizations (people more likely to still be using the phrase “business intelligence” as opposed to “data science”), I realized there was no way for those folks to dip their toe into data science. Most of the intro books made a lot of assumptions about the reader’s math education background, and they depended on R and Python, so the reader needed to learn to code at the same time they learned data science. Furthermore, most data science books were “script kiddy” books: the reader just loaded stuff like the SVM package, built an AI model, and didn’t really know how the AI algorithms worked.

I wanted to teach the algorithms in a code-free environment using tools the average “left behind” BI professional would be familiar with. So I chose to write all my tutorials in Data Smart using spreadsheets. At the same time though, I pride myself on writing a more mathematically deep intro text than what you find in many of the other intro data science texts. The book is guided learning; it’s not just a book about data science.

Now, I don’t leave the reader in Excel. I guide them into using R at the end of the book, but I only take them there after they understand the algorithms. Anything else would be sloppy.

Another reason I wrote the book is that the market didn’t have a broad data science book. Most books focus on one topic, such as supervised AI. Data Smart covers data mining, supervised AI, basic probability and statistics, optimization modeling, time series forecasting, simulation, and outlier detection. So by the time the reader finishes the book, they’ve got a Swiss Army knife of techniques in their pocket and they’re able to distinguish when to use one technique and when to use another. I think we need more well-rounded data scientists, rather than the specialists that PhD programs are geared to produce.

4. You’ve written a book, maintain a personal blog and write for MailChimp. How important have communication and writing skills become to data scientists?

I believe that communication skills, both writing and speaking, are vital to being an effective data scientist. Data science is a business practice, not an academic pursuit, which means that collaboration with the other business units in a company is essential. And how is that collaboration possible if the data scientist cannot translate problems from the high-level vague definition a marketing team or an executive might provide into actual math?

Others in an organization don’t know what’s mathematically possible or impossible when they identify problems, so the data science team cannot rely on them to fully articulate problems and “throw them over the fence” to a data science team ready-to-go. No, an effective data science team works as an internal, technical consultancy. The data science team knows what’s possible and they must communicate with colleagues and customers to understand processes and problems deeply, translate what they learn into something data can address, and then craft solutions that assist the customer.

5. Time for the Miss America question. If you had access to any data in the world, what is the question or problem you’d like to most solve?

I am a huge fan of Taco Bell. And I recognize that the restaurant actually has very few ingredients to work with — their menu is essentially an exercise in combinatorial math where ingredients are recombined in new formats to produce new menu items which are then tested in the marketplace. I’d love to get data on the success of each Taco Bell menu item. Combined with possible delivery format information, nutrition information, flavor data, and price elasticity data, I’d love to take a swing at algorithmically generating new menu items for testing in the market. If sales and elasticity data were timestamped, perhaps we could even generate menu items optimized for and only available during the stoner-friendly “fourthmeal.”
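
The "exercise in combinatorial math" can be sketched in a few lines. The ingredient list and the size bounds below are hypothetical, chosen only for illustration; a real generator would also need the sales, nutrition, flavor, and price data mentioned above to score each candidate.

```python
from itertools import combinations

# Hypothetical ingredient list, for illustration only.
ingredients = ["tortilla", "beef", "beans", "cheese", "lettuce", "sour cream"]

def candidate_items(ingredients, min_size=2, max_size=4):
    """Yield every unordered ingredient combination within the size bounds."""
    for size in range(min_size, max_size + 1):
        yield from combinations(ingredients, size)

candidates = list(candidate_items(ingredients))
print(len(candidates))  # C(6,2) + C(6,3) + C(6,4) = 15 + 20 + 15 = 50
```

Even six ingredients give 50 candidate items before considering delivery format, which is why some scoring model would be needed to decide which ones are worth testing in the market.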

Thanks to John for taking the time to speak with Gnip! If you’re interested in more Data Stories, please check out our collection of 25 Data Stories featuring interviews with data scientists from Kaggle, Foursquare, Pinterest, bitly and more! 

Data Story: Michele Trevisiol on Image Analysis

Social media content is increasingly shifting to a visual medium, and companies are often having a harder time understanding and digesting visual content. So I loved happening upon Michele Trevisiol, PhD student at Universitat Pompeu Fabra and PhD intern at Yahoo Labs Barcelona, whose research focuses on image analysis for videos and photos.


1. Your research has always revolved around analyzing videos and photos, and this is an area that the rest of the world is struggling to figure out. Where do you see the future of research around this heading?

From my own experience, I can see a huge amount of work for researchers in the near future. Every day there are tons of new photos and videos uploaded online. Just think about Facebook, Flickr, YouTube, Vimeo and many other new services. In fact, in one of our projects we are working with Vine, a short-video app that lets you record six-second looping videos. Twitter bought it in January, and Vine has already reached 40 million users.

To give some numbers, recent studies estimate that about 880 billion photos will be uploaded in 2014, without considering other multimedia data. This volume of information makes it very hard for users to explore such a large multimedia space and pick the most interesting items. Therefore, we need smart and efficient algorithms to address this. The amount of data is growing every day, so researchers need to keep improving their systems to analyze, understand and rank media items in the best way possible (often in a way personalized for the user).

Researchers have studied these topics from many different angles. Working with multimedia objects involves analyzing the content of the data (i.e., computer vision like object detection, pattern recognition, etc.), understanding the textual information (e.g., meta-data, description, tags, comments), or studying how media is shared in the social network. This is a research space that still has many things left to explore.

2. You’ve previously researched how to determine geo data from video using a multi-modal approach based on how videos are tagged, user networks, and other meta-data. What are the advantages of understanding the geo location of videos?

You can see the problem from a different point of view. If you have complete knowledge about your multimedia items, such as the visual content, the meta-data, the location, or even how they were made (technical details), this data can be easily classified and made discoverable for users. All the information about an item helps researchers understand its properties and its importance for the users who are looking for it. However, very often this information is missing, incomplete or even wrong. In particular, the geo location is not provided for the vast majority of photos and videos online.

Only in recent years has there been an increase in cameras and phones with automatic geo-tagging (able to record the latitude and longitude where the photo was taken). As a result, just a few multimedia items have this information. Being able to discover the geo location of videos and images helps you organize and classify them, and helps users find items related to any specific location, improving search and retrieval. We presented a multi-modal approach that takes into account various factors related to the given video, like the content, the textual information, the user’s profile, the user’s social network and her previously uploaded photos. With such information, the accuracy of the geo location prediction improved dramatically.

Recently, researchers have been putting more effort into this topic, mainly due to the increasing interest in the mobile world. In the near future, the activity in this area is destined to increase.

3. How are the browsing patterns different for people viewing photos than text? What motivates people to click on different photos?

The browsing patterns are strongly biased by the structure of the website and, of course, by the type of website. The first case is quite obvious as the users browse the links that are more evident on the page, therefore the way the website selects and shows the related items is really important. The latter case instead is related to the type of website.

Consider, for example, Many users land on the website from some search engine and read just one article. This means that the goal of the user is very focused, as she’s able to define the query, click on the right link, consume the information, and leave. But that’s not always the case, as there are also users who browse deep and look at many articles related to one topic (e.g., about TV series, episodes, actors, etc.). If you consider news websites instead, the behavior is different, as the user might visit only to spend some time, to take a break, or to get distracted with something interesting. A photo sharing website presents yet another behavior, often characterized by the social network structure. Many users interact mostly with the photos shared by friends or contacts, or they like to get involved in social groups, trying to get as much visibility and positive feedback as possible.

The main interest of any service provider is to keep the user engaged as long as possible on its pages. To do this, it shows the links with the highest interest for users to keep them clicking and browsing. That’s what the user wants as well: she wants to find something interesting for her needs. The rationale is similar for photo sharing sites, but the power of the image to catch the interest at first glance is an important difference. For example, in Flickr there are “photostreams” (sets of photos) shown to the user for each image she is viewing. Slideshows show image thumbnails in order to catch the interest of the user with the content of the recommended images. Recently, we conducted a study on these specific slideshows, and we found that users love to navigate from one slideshow to another instead of searching directly for images or browsing specific groups. We also tried recommending different slideshows instead of different photos, with positive and interesting results.

4. Much of your research has focused around Flickr. How does data science improve the user experience of Flickr?

Recently Flickr has improved a lot in this direction, for example with the new free accounts with 1TB of free storage, or the interface that has been recently refreshed. But the changes are just at the beginning.

In general, the data scientists need first to study and understand how the users are engaging with the websites, how they are searching and consuming the information, how they are socializing, and especially what they would like to have and to find on the website. In order to improve the user experience you need to know the users and then to work on what (and how) the website can offer to improve their navigation.

5. Based on your research around photo recommendations, what characteristics make photos most appealing to viewers?

This is a complex question, as appeal is subjective and differs for each user; in particular, the taste, or even better, the interest of the user changes over time. Some days you’re looking for something funnier, so maybe the aesthetics of the image are less important. Other days, instead, you get captured by very cool images that can be professional or just incredible amateur shots.

In the majority of cases each user is quite unique in terms of taste, so you need to know what she appreciated before and how her taste changed over time in order to show her the photos that she is more likely to enjoy. On the other hand, there are cases that can catch the interest of any user in an objective way. For example, in photos related to real world events the content is highly informative, while the quality and the aesthetics are often ignored.

In a research work that we presented last year, we compared different ranking approaches as a function of various factors. One of these was the so-called external impact. With this feature we could measure how much interest the image has outside Flickr, in other words, how much visibility the image has on the Web. If an image uploaded by a Flickr user to her page has a huge number of visits coming from outside (e.g., other social networks, search engines), it means this image has high attractiveness that needs to be considered even if inside the network it does not show particular popularity. We found that this could also be a relevant factor to be considered in the ranking, and we are still investigating this point.


Pivotal HD Tips the Time-to-Value Ratio

For many companies, analyzing large internal data sets combined with realtime data, like social data, and extracting relevant insights isn’t easy. As a marketing director, you might need to look at several months of customer feedback on a product feature and compare those comments to ad mentions on Twitter for an ad-spend decision happening…tomorrow. Depending on the capabilities of your platform, this type of analysis could take a while and you might not have the answers to inform the decision you need to make.

This spring, our Plugged In partner Pivotal HD, formerly EMC Greenplum, went through a bit of a transformation that facilitates analysis of big data sets that combine both internal and external data. A combination of EMC Greenplum and VMware, the newly formed Pivotal is a Hadoop distribution that enables enterprise customers to combine both structured and unstructured data in a single cloud-powered platform, rather than operating multiple systems. What does this mean for enterprises and their customers? In a word (or two) it means speed and accessibility. Through this platform, companies can more easily store and analyze giant data sets, including social data, without spending all of their time building data models to do the analysis.

If you’ve read any of our Data Stories, you know we love data analysis here at Gnip, so we thought it was worthwhile to learn more about what Pivotal HD does for big data analysis. We talked to Jim Totte, Director of Business Development and Srivatsan Ramanujam, Senior Data Scientist at Pivotal HD and captured some of the conversation in this short video clip:

Gnip is Now The Provider of GetGlue Social Data

During Big Boulder this year, Maya Harris from GetGlue talked about the community on GetGlue and how they banded together to keep the TV show Nikita on the air when it was on the verge of being cancelled. Maya also showed the strong correlation between the volume of GetGlue check-ins and Nielsen ratings for a show like “The Walking Dead.” With examples like these, we’re incredibly excited about the insights our customers will develop now that they have access to the full firehose of check-ins and comments from GetGlue.

GetGlue is a recognized leader in second screen engagement and social television. It’s a community of more than 4 million people bantering about their favorite shows and movies, the cliffhangers and the surprise endings. With GetGlue, users can use their phones and computers to check-in, like, comment and engage with other fans around the TV shows, movies and sports that they love. More than 75 major television networks and 25 movie studios use GetGlue to promote their shows and movies and engage with their fans.

The possibilities we see for analysis of this data are immense. Looking for a realtime measure of a TV show’s popularity? GetGlue check-ins are closely correlated to Nielsen ratings.


Want to get an early measure on the box office success of a big movie release? You can use the check-ins on GetGlue to get a realtime measure of the most popular movies on any given weekend. Trying to figure out if fans of Game of Thrones are also fans of The Walking Dead? Now you can.

Check out to learn more or shoot us an email at if you’d like to get in touch.

Data Stories: Interview with Kaggle Data Scientist Will Cukierski

Data Stories is how Gnip tells the stories of those doing cutting-edge work around data, and this week we have an interview with Will Cukierski, data scientist at Kaggle. I have loved watching the different types of Kaggle contests and how they make really interesting challenges available to people all over the world. Will was gracious enough to be interviewed about his data science background, Kaggle’s work and community. 

1. You entered many Kaggle contests before you started working for Kaggle. What were some of the biggest lessons you learned?

Indeed, many years back I competed in the Netflix prize. As looking at spreadsheets goes, it was a thrilling experience (albeit also quite humbling). I took out a $3,000 loan from my parents to buy a computer with enough RAM to even load the data. A few years later, I was in the final throes of my doctorate when Kaggle was founded. I made it a side hobby and spent my evenings trying to port what I researched in my biomedical engineering day job to all sorts of crazy problems.

The fact that I was able to get anywhere is evidence that domain expertise can be overstated when working within different fields. If I can price bonds, it’s not that I understand bond pricing; it’s that I can learn how bonds were priced in the past. This is not to say that domain expertise is not important or necessary to make progress, but that there is a set of statistical skills that support all data problems.

What are these skills? People make them sound more fancy than they really are. It’s not about knowing the latest, greatest, machine learning methods. Will it help? Sure, but you don’t need to train a gigantic deep learning net to solve problems. The lesson Kaggle reinforced for me was the importance of the scientific method applied to data. It was really basic, embarrassing things: e.g. When you do something many times, the results need to be the same. When you add a bit of noise to the input, the output shouldn’t change too much. If two models tell you the same thing, but a little differently, then you can blend them and do better. If two models tell you something completely different, then you have a bug–or even better, a massive flaw in your entire understanding of what you’re doing. Training on a lot of different perspectives of the data is better than training on one perspective of all the data. Look at pictures of what you’re doing! Write down the things that you try, because you will forget a few hours later! The competition format forces you to do all of these basic things right, more so than having them lectured at you, or reading them in a paper.
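
The point about blending two models that agree "but a little differently" can be checked with a small simulation. This is a minimal sketch under stated assumptions: the data is synthetic, and each hypothetical model is simply the true value plus its own independent noise, which is the idealized case where averaging helps most.

```python
import random

rng = random.Random(42)

# Synthetic ground truth, made up for this illustration.
truth = [rng.uniform(0, 10) for _ in range(1000)]

# Two hypothetical models: each is the truth plus its own independent noise.
model_a = [t + rng.gauss(0, 1) for t in truth]
model_b = [t + rng.gauss(0, 1) for t in truth]

# Blend by simple averaging of the two predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

def rmse(pred, actual):
    """Root mean squared error between predictions and actual values."""
    return (sum((p - y) ** 2 for p, y in zip(pred, actual)) / len(actual)) ** 0.5

print(rmse(model_a, truth), rmse(model_b, truth), rmse(blend, truth))
```

With independent errors of equal variance, averaging halves the error variance, so the blend's error comes out lower than either model's. Real models usually have correlated errors, so the gain in practice is smaller than in this idealized setup.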

I’m also happy to report that I have paid back the loan to my parents, though the jury is still out on whether I’m any wiser in the face of data. Humility is one of the most used tools in my arsenal!

2. Wired recently called attention to the fact that PhDs were leaving academia to become data scientists. At Kaggle, do you see this pattern? What kind of backgrounds do the 85,000 data scientists in the Kaggle community have?

It’s not hard to believe that PhDs are leaving to join data science positions. Academia is brutally competitive, and the difficulty is compounded by a dearth of grant funding. In data science, the disparity is flipped; companies are clamoring to hire data-literate people.

The information we have is mostly self-reported, so it’s difficult to make any real quantitative statement about a mass migration from academia to industry. What we can say about our userbase is that many thousands of them have PhDs, and that they are coming from all kinds of backgrounds. Physics, engineering, bioinformatics, actuarial science, you name it.

3. Do you see Kaggle as democratizing data science? We interviewed one of the winners of a Kaggle contest, and he was a student from Togliatti, Russia and was taking classes in data science on Coursera. I was blown away by him.

This is the fun part of sitting in the middle of a data labor market. I get to work with people who make me—presumably a not-unintelligent person…I hope…on my good days—realize how much I didn’t even know I don’t know.

Your question also brings up a controversial point. People have an understandable misconception about Kaggle’s democracy. Our critics are fond of saying that we are solving billion-dollar problems five times over and paying people a wooden nickel to do it. I think this reaction is partly a fear that smart people from anywhere, regardless of credentials, are given equal access to data problems, but I also think it’s a criticism that mistakes what our deliverable really represents. The fear over democratizing data science more parallels the old open-source software fallacy (how will we make money writing code if others give it out for free?!) than it does an outsourcing analogy.

Let’s take the problem of solving flight delay prediction. People immediately think “well that’s worth billions of dollars and if MyConsultingCorp were to solve that problem it would be for tens of millions in fees.” This stance is out of touch with what is really happening in these competitions. To wit:

  • People are solving singular problems for one company in one sector
  • The devil is in the implementation details
  • There are no constraints on absolute model performance, just relative rankings
  • The crowd always (p < 0.05) outperforms on accuracy, so when a business wants to optimize on accuracy, crowdsourcing gets chosen because it works well

Our asset is our community, not an outsourcing value proposition. To this end, we believe our efforts will actually increase the scope and amount of work available for people in analytics. Is this democracy? I think so. We sell in to companies, convince them of the merits of machine learning, isolate their problems, and open them up to the world.

The alternative is that DataDinosaur Corp. sells them on their proprietary Hadoop platform, cornering them into a big data pipe dream and leeching money via support contracts. The phrase “actionable intelligence” has never meant less than it does right now. It’s a scary, fake world out there in big-data land!

4. What data do you wish would be made available for a Kaggle contest?

I have a cancer research background. Much of the data from medical experiments is extremely shrouded in privacy fears. A lot of this fear is justified–it’s certainly nonnegotiable that we preserve patient privacy—but I believe the majority reason is that saying no means less work & bureaucracy and saying yes means new approvals & lawsuit risk. There is a tragic amount of health and pharmaceutical data that goes to waste because it lives (dies?) in institutional silos.

Access to data for health researchers is not a new problem, but I think the tragedy is especially exacerbated given what I’ve seen Kagglers do with data for other industries.

5. What is your favorite problem that a Kaggle contest has solved?

We ran a competition with Marinexplore and Cornell University to identify the sound of the endangered North Atlantic right whale in audio recordings. Researchers have a network of buoys positioned around the ocean, constantly recording with underwater microphones. When a whale is detected in the vicinity of the buoy, they warn shipping traffic to steer clear of the area.

Not only did the participants come up with an algorithm that was very close to perfect, but how about the opportunity to work on a data science problem with such a clear and unquestionably positive goal? We spend a lot of time as data scientists thinking about optimizing ads, predicting ratings, marketing widgets, etc. These are economically important and still quite interesting, but they lack the feel-good factor of sitting at your keyboard thousands of miles away and knowing that your work might trickle down to save the life of an endangered whale.



Gnip is hiring for a data scientist position if you’re looking to do your own cutting-edge work. 


Data Science: The Sexiest Profession Going

Data scientists Mohammad Shahangian of Pinterest, Kostas Tsioutsiouliklis of Twitter, and Adam Laiacano of Tumblr discuss the challenges and opportunities in social data.


As Gnip’s own data scientist Dr. Skippy was joined on stage by three data scientists representing three prolific social networks, Big Boulder Master of Ceremonies Lindsay Campbell couldn’t help herself gushing to the crowd, “This is by far the sexiest panel this year”. (Which was a reference to the Harvard Business Review naming data science the sexiest profession of the 21st century.)

Physical appearance aside, there could hardly be a truer statement to Big Boulder attendees: a legion of self-proclaimed data nerds.

Scott Hendrickson, better known as Dr. Skippy, Data Scientist at Gnip was joined on stage by Mohammad Shahangian of Pinterest, Kostas Tsioutsiouliklis of Twitter, and Adam Laiacano of Tumblr.

A Look at the Data Science Departments

The conversation began with each guest sharing the size of data science teams and roles at their respective organizations.

The data science team at Twitter currently comprises 7-8 people, and the company is looking to build it to a team of 20 in the near future. Data scientists at Twitter fall into two groups: a business intelligence and insights team of data scientists, and individual data scientists who are embedded into teams. Data scientists embedded into teams become key stakeholders in improving and evolving the product.

The business intelligence team works collaboratively to explore ideas and create reports, even if it is not always favorable to the company. As Kostas explains, data scientists are trusted at Twitter. It’s ok to report the truth.

At Pinterest, there are 8 full-time data scientists on the team. The primary goal for data scientists is to understand what users are doing in order to put pinners first, a strong company value. Much like Twitter, Pinterest data scientists are integrated into other engineering teams. This blend of engineers and data scientists on the same team enables nimble product iterations. Since adding data scientists to the mix, Pinterest teams are now requesting deeper and deeper metrics to measure success and plan product.

Tumblr’s team of data scientists is also eight strong, in two roles: a search and discovery team of six and a two-person, very self-reflective business intelligence team. The search and discovery team is tasked with maintaining the quality of the data, building products that make the data usable, and ensuring the end product is something users enjoy. The business intelligence team investigates the actions users take to determine which are indicative of long-term success. The outcome of that work is most frequently reporting.

Data Science Impact on Product

At Tumblr, there is a significant amount of testing around registration and onboarding, that is, what users see when they land on the site. However, Adam is quick to add that Tumblr has a unique view on its research, stating, “You don’t have to do as much research on your product when you use it yourself”.

Data scientists at Twitter report metrics all the way to the top. The CEO and the executives ask questions about the data around the launch of a new product and value the input of data scientists.

By sharing data with product teams, Pinterest engineers are being driven by the data. Mohammad shares, “After exposing metrics to people, the first instinct is to want to make the metrics better. This brings a culture of people who come to the data science team and seek their input. They take the ideas of product and run some queries to see if the data validates it. We’ve made it very easy for product teams to set up experiments, we don’t even call them experiments anymore.” Expanding on this, he shares an anecdote from a recent rewrite of the entire website. When it launched, the data scientists noticed a dip in follows. Investigation by the team led to the understanding that the enhanced speed of the rewritten website had eliminated a small lag that followed a user’s like, a window of time in which users had been following pinners on the site. Once the team corrected for the lag, follows went back up.

Who You Callin’ Sexy?

As Dr. Skippy joked about the popularity, ahem, sexiness, of the data science title, conversation turned to the lack of an industry-standard definition for the role. The panelists noted that there is often confusion with, and a lack of differentiation from, business analyst and business intelligence roles.

Kostas began by noting that data science is not about analyzing but about prediction. Twitter data scientists are also engineers. Backgrounds of Twitter data scientists include statistics, data mining, machine learning, and engineering.

Further distinguishing the role from that of data analysts, Mohammad points out that analysts typically do not pull their own data. He added, “If you can’t pull your own data, how can you figure out what you want? A data scientist is skeptical. If results seem too good to be true, they will investigate. Question the data. Analysts will take the data as the data.”

Adam describes a good data scientist as an individual who can get data in any format and clean it up, who can take weird, fuzzy forms and see the layout of the information that is available, and who can connect the puzzle pieces to build a data set that is useful.

The Future For Analysis of Social Data

Much of data science to date has been ad hoc, but the panelists agree that as you look closely at what data scientists do, it’s templates and patterns. Over time this work will become progressively more standardized. With new, faster tools it will move away from ad hoc processes. Teams will build models and tools to solve recurring problems.

Adam of Tumblr added optimistically that the future lies in the work data scientists will do as they collect data across platforms and across multiple streams. It’s up to those developing third-party tools and resources to innovate using all the data.

Lastly, Mohammad chimed in that machine learning and predictive modeling are the sexy amongst the sexy, adding, “That’s what we’re all waiting for”.

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, Big Boulder on Storify and on Gnip’s Facebook page.

Data Story: Mohammad Shahangian on Pinterest Data Science

At Gnip, we believe the value of social data is unlimited. Data Stories is how we bring this belief to life by showcasing how social data is used. This week we’re interviewing Pinterest’s data scientist Mohammad Shahangian about how the data science team works at Pinterest, surprising uses of Pinterest and data science as a career path. You can follow him on Pinterest at

Data Scientist at Pinterest

1. What do you see is your role as the data scientist for Pinterest?

The company’s focus is on helping millions of people discover things they love and get inspiration to go do those things in their life. For me, that means analyzing the rich data that is created by the millions of people interacting with billions of pins from across the web each day. I evaluate this data and provide insights that make data actionable. My team also prototypes and validates ideas, performs deep analysis and builds tools that allow us to answer our most frequent questions in seconds. We work with every team to answer Pinterest’s biggest questions and ensure that each decision positively impacts Pinners over the long term.

For example, we take a business question like “How should our web, tablet and phone experiences differ?” and present the results as insights like, “Many users use the mobile apps in the morning and again at night, but prefer the website during the day” and “Users prefer to use mobile apps to casually discover new content, whereas they use the web to curate and organize content.” We then work with the design and product teams to build features around these insights and measure their impact.

2. What are some of your favorite ways that people use Pinterest that people wouldn’t expect?

What makes Pinterest unique is that it’s a tool and the users really define its use cases. For me, Pinterest was really helpful when I was planning my wedding, and it made perfect sense to use it as a collaborative office shopping list. I would have never thought to use it as a tool for:

A collection of Stop signs from around the world
Daily Grommet gets their community to collaborate on a board to see things they want to sell
Vintage Driving - a collaborative board where users pin their favorite vintage cars
GE Badass machines featuring GE tech
Madewell’s Rainbow board
Michelle Obama’s MyPlate Recipes encourages healthy eating
Stunning virtual collections of minerals and shipwrecks
The “365 Days of Pinterest” challenge. She made a Pinterest project every day for a year!
Sammy Sosa awesomeness
Sony shows off their technology with food pictures shot with a Sony Camera
Pantone announces the color of the year
The National Pork Board

3. What category do you see as the most viral on Pinterest?

DIY and recipe pins generally go viral year round. Around the holidays, holiday-themed content across all categories tends to get the most traction.

4. How has data science added value to Pinterest?

We have this internal value we refer to as “knit.” It means that we have an open, curious culture where everyone in different disciplines—from engineering and design to marketing to community—works together. Data science is at the core of that. The search, recommendations and spam teams apply data science to improve the quality of content we put in front of Pinners. This is only a subset of how we apply data though; most of the decisions we make at Pinterest are actually backed by data.

Data is a universal language that teams across the company use to collaborate and make decisions. Each team has a set of performance metrics, and we hold a weekly meeting to understand the impact that each area is having on company-wide metrics. As data scientists we do more than just analyze data; we create rich data sources that we make available to other teams so they can do their own analysis. More than half of Pinterest employees run MapReduce jobs via Hive. Our metrics dashboards are accessible to everyone and our core metrics are emailed daily to the entire team. We also share our data studies and insights with the whole team.

We also use data just for fun. During our weekly happy hour, we share a weekly Data Fun Fact with the team. We present the fact in the form of a multiple choice question and have the team vote on the answer. For example, we asked, “How many days before Valentine’s day does the query ‘Valentine’s day ideas’ increase the most: 1, 3, 5 or 7 days?” (Hint for the curious reader: two*three/two).

5. What do you think someone should know before becoming a data scientist at a major web company like Pinterest?

I would say go for it! If you are hungry to extract value from real world data, you’re really going to enjoy it. I know that for a lot of really talented people in academia the only thing standing between them and the opportunity to solve a really interesting problem is the lack of rich data. My experience at Pinterest has been the exact opposite. Our team can’t grow fast enough to tap into a world of valuable insights that are sitting dormant within billions of records somewhere in the cloud.


Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see what posts were most likely to generate likes, and we interviewed Dmitrii Vlasov who came in second in the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up and coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle by accident, as always happens. I read some blog post about the Million Song Dataset Challenge provided by and bunch of other organizations. The task was to predict which songs would be liked by users based on their existing listening history. This immediately got me excited because I’m an active user and had been reflecting on what connections between people can be established based on their music preferences. But the contest was coming to an end, so I switched to the WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Right now I’m a senior CS student in Togliatty, Russia. I can’t say that I have a special background in data science. I took a course of more than a year in probability theory and mathematical statistics at university, taught myself some semantic analysis, and have a great love for Python as a tool for implementing ideas. I’ve also enrolled in the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts (80,000-90,000 characters) attract more likes. This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show the relative correlation between the number of photos, characters and videos in a post and the number of likes it received. A large relative “folks love” at a few prominent photo counts means that there were not many posts with that many photos, but most of them were high quality. A quick empirical analysis shows that these are a special type of post: “big photo posts”. They are usually photo reports, photo collections or scrapbooks. For such posts 10-15 photos are not enough, but at the same time 10-15 photos seem like too many for a normal post. The same can be said about a large amount of text in a post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80,000-90,000 characters are winners in the “heavyweight category”. These are big research pieces, novels, political reflections. The analysis is quite simple, but it shows that if you want to create media-rich or text-rich content, it should be really media-rich or text-rich. Otherwise you may fall into a hollow where the post suits neither format.
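For readers who want to try this kind of analysis themselves, here is a minimal sketch of the bucketing idea Dmitrii describes: group posts by photo count, then look at how many posts land in each bucket and how many likes they average. This is not his contest code. The function name `likes_by_bucket`, the bucket size, and the toy numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def likes_by_bucket(posts, bucket_size=10):
    """Group (photo_count, likes) pairs into photo-count buckets.

    Returns {bucket_start: (number_of_posts, mean_likes)}, ordered by bucket.
    """
    buckets = defaultdict(list)
    for photo_count, likes in posts:
        # Round the photo count down to the start of its bucket.
        start = (photo_count // bucket_size) * bucket_size
        buckets[start].append(likes)
    return {start: (len(likes), mean(likes))
            for start, likes in sorted(buckets.items())}


# Hypothetical toy data: (number of photos, number of likes) per post.
toy_posts = [(1, 4), (2, 6), (3, 5), (5, 5),
             (12, 1), (14, 3),
             (35, 20), (42, 24)]

summary = likes_by_bucket(toy_posts)
for start, (n_posts, avg_likes) in summary.items():
    print(f"{start}-{start + 9} photos: {n_posts} posts, mean likes {avg_likes}")
```

Reporting the post count next to the mean matters here: a bucket with a high average but only a handful of posts is the "few posts, but most of them good" situation described above, not evidence that adding photos to an ordinary post earns more likes.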

4. What else would you like to predict with social data if you got the chance?

Right now I’m working on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data, e.g. to predict the reaction to a given politician’s statement based on a user’s Twitter feed. Or to extract all the “real” theses of a politician based on all of his public speeches.

Data Stories: Dino Citraro of Periscopic on Data Visualization

The Periscopic team has a long-standing reputation for their excellent work in data visualizations, so we asked one of the founders, Dino Citraro, to participate in a Data Story about data visualizations. You can follow Dino on Twitter at @dinocitraro and check out their work at

Dino Citraro of Periscopic

1) Periscopic’s tagline is “Do good with data”. What are some of the Periscopic projects that embody that tagline?

We formed Periscopic with the hope that we could do good with data. To us that means helping people that share the ideals of progressive social change, sustainability, human rights, equality, environmentalism, and transparency to name a few. Most of our work enables insights and discussions in those areas. Some recent and/or notable projects are:

“VoteEasy” is a voter education tool that was designed to allow the general public to quickly and easily see how closely political candidates align with their views on key issues. It’s like for political candidates. It utilizes thousands of hours of research and a vast collection of data assembled by the nonpartisan group, Project Vote Smart. It is the most up-to-date resource for candidate political information, including voting records, interest groups ratings, campaign finances, and personal biography.

“The State of the Polar Bear”

The State of the Polar Bear is the authoritative source for the health and status of the world’s polar bears. This multipart data visualization was developed through an international partnership with the Polar Bear Specialist Group, a scientific collaboration of the five polar bear nations: Canada, Denmark, Norway, the USA, and Russia. It covers data related to pollution levels, tribal hunting, and population dynamics of the bears.

“Who’s Talking About Breast Cancer”

Developed for GE’s Healthymagination data visualization forum, this tool takes a realtime look at the discussions happening on Twitter around the topic of breast cancer. Tweets from all over the world are aggregated in a single location, allowing visitors to quickly understand the current topics, trends, and stories.

2) With infographics now being an over-hyped tool for marketing, what challenges does that create for a company actually trying to tell stories with data?

If they are done well, infographics can be a very effective story-telling device. Unfortunately, many of them seem to either lack an engaging metaphor, or don’t do a good job of letting the data be the story.  Since most of our work is interactive, we have an advantage over traditional infographics because we can reveal information in a user-directed way. The challenges we face are how to slowly introduce these stories in a way that is engaging for visitors, and not overwhelming.

3) What are the greatest opportunities right now for data visualization?

The greatest opportunities for data visualization probably relate to public data and personal data. Public data, because it has the greatest potential for good and efficiency. Personal data, because it is the thing that most people seem to find interesting. The Quantified Self movement has exploded, and along with it the desire to understand our social media behaviors, and the rise of the Quantified Social Self.

4) How do you separate the wheat from the chaff when it comes to good data? 

There is no such thing as “good data”, there is only good context. You can create a compelling data visualization out of any data source, as long as you use the right context.  For instance, one of our pieces uses the gaps in the data – the lack of data – as part of the story. Our client wanted to highlight the fact that they needed to increase the data collection efforts, and wanted public support for this effort. You could have a massive data set that is impeccably organized, but without the right context, it can go unnoticed.

5) How does good visualization help create data literacy?

To us, the issue is literacy in general. Like good design, data visualizations should be transparent and unnoticed. The epiphanies one gets from interacting with data are the things that should be retained, not the fact that an interface was unique, or the interactivity was sophisticated.

Having said that, the very process of interacting with data through a visualization tool brings an understanding of what is possible, and with that, the desire increases for more, and better experiences.


In The Future, The Data Scientist Will be Replaced by Tools

Some of you are celebrating. Some of you are muttering about how you could never be replaced by a machine.

What is the case for? What is the case against? How should we think about the investments in infrastructure, talent, education and tools that we hope will provide the competitive insights from “big data” everyone seems to be buzzing about?

First, you might ask why try to replace the data scientist with tools?  At least one reason is in the news: The looming talent gap.

Wired UK reports,

“Demand is already outstripping supply. A recent global survey from EMC found that 65 percent of data science professionals believe demand for data science talent will outpace supply over the next five years, while a report from last year by McKinsey identified the need in the US alone for at least 190,000 deep analytical data scientists in the coming years.”

Maybe we should turn to tools to replace some or all of what the data scientist does. Can you replace a data scientist with tools?  An emerging group of startups would like you to think this is already possible. For example, Metamarkets headlines their product page with “Data science as a service.” They go on to explain:

 Analyzing and understanding these data streams can increase revenue and improve user engagement, but only if you have the highly skilled data scientists necessary to turn data into useful information.

Metamarkets’ mission is to democratize data science by delivering powerful analytics that are easy and intuitive for everyone.

SriSatish Ambati of the early startup 0xdata (pronounced hex-data) goes a step further with the idea that “the scale of the underlying data and the complexity of running advanced analysis are details that need to be hidden.” (GigaOm article)

On the other side of the coin, Cathy O’Neil at Mathbabe set out the case in her blog a few weeks ago that not only can you not replace the data scientist with tools, you shouldn’t even allow the non-data-scientist near the data scientist’s tools:

 As I see it, there are three problems with the democratization of algorithms:

 1. As described already, it lets people who can load data and press a button describe themselves as data scientists.

 2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.

 3. Businesses might think they have awesome data scientists when they don’t. […] posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

If this is a topic that interests you, we’ve submitted a panel on this topic for SXSW this spring in Austin to discuss issues surrounding data science and tools. We will talk about what tools are available today, how they make us more effective as well as some of the pitfalls of tool use. And we will look into the future of tools to see where and if data scientists can be replaced by tools. Would love a vote!


  • John Myles White (@johnmyleswhite) – Coauthor of Machine Learning for Hackers and Ph.D. student in the Princeton Psychology Department, where he studies human decision-making.
  • Yael Garten (@yaelgarten) – Senior Data Scientist at LinkedIn.
  • James Dixon (@jamespentaho) – CTO at Pentaho, open source tools for business intelligence.

Update: One of our panelists, John Myles White, has provided some thoughtful analysis of companies that rely on automating or assisting data science tasks. See his blog post at