Data Story: John Foreman of MailChimp on the Data Science Behind Emails

When I was in charge of email at my last startup, the MailChimp blog was a must-read. Their approach to email marketing is brilliant, so when my colleague suggested I interview MailChimp’s chief data scientist, John Foreman, for a Data Story, I was definitely on board. In addition to being a data scientist at MailChimp, John is the author of Data Smart: Using Data Science to Transform Information into Insight. You can follow him on Twitter at @john4man.

1. People have a love/hate relationship with email. How can data science help people love email more and get more out of it?

Recently, people across industries seem to be waking up from their social-induced haze and rediscovering the effectiveness of direct email communication with their core audience.

Think about a true double opt-in email subscription versus, say, a Facebook “like” of a product. When I like a product page on Facebook, do I really want to hear from that business in my feed? In part, isn’t that “like” just an expression that’s meant for public display and not for 1-to-1 ongoing communication from the business?

Contrast that with email. If I opt into a newsletter, I’m not doing that for anyone but myself. Email is a private communication channel (I like the term “safe place”). And I want your business to speak to me there. That’s powerful. Now think of a company like MailChimp. We have billions of these subscriptions from billions of people all across the world. MailChimp’s interest data is unparalleled online.

OK, so that means that as a data scientist, I have some pretty meaty subscription data to work with. But I’ve also got individual engagement data. Email is set up perfectly to track individual engagement, both in the email, and as people leave the email to interact with a sender’s website.

So I use this engagement and interest data to build products — both weapons to fight bad actors as well as power tools to help companies effectively segment and target individuals with content that’s more relevant to the recipient. My goal is to make the email ecosystem a strong one, where unwanted marketing email goes away and the content that hits your mailbox is ideal for you.

For instance, MailChimp recently released a product called Discovered Segments that uses unsupervised learning to help users find hidden segments in their list. Using these segments, the sender can craft better content for their different communities of recipients. We use the product ourselves; for example, rather than tell all our customers about our new transactional API, Mandrill, we used data mining to send the announcement only to a discovered segment of software developers who were likely to use it, resulting in a doubling of engagement on that campaign.
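
MailChimp hasn’t published how Discovered Segments works under the hood, but the general idea (clustering subscribers by engagement and interest features) is easy to sketch. Below is a minimal, hypothetical example using k-means; the feature names and data are invented for illustration.

```python
# Minimal sketch of list segmentation via unsupervised learning (k-means).
# This is NOT MailChimp's actual algorithm; it only illustrates the idea.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One row per subscriber: open rate, click rate, and a crude
# "developer interest" score (hypothetical features).
X = rng.random((1000, 3))

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# Each cluster is a candidate "discovered segment"; a sender might target
# the cluster with the highest developer-interest score for an API launch.
for k in range(4):
    print(k, X[labels == k].mean(axis=0))
```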

2. How is data science structured at MailChimp? How big is your team, and what departments do you work with?

MailChimp has three data scientists, and our job as a little cell is to deliver insights and products to our customers. That sounds like business-speak, so let me break it down.

By insights, I mean one-off research and analysis of large data sets that’s actionable for the customer. And by products, I mean tools that the customer can use to perform data analysis themselves. If the tool or product isn’t useful or required by the customer, we don’t build it. A data science team is not a research group at a university, nor is it a place just to show off technologies to investors. We’re not here to publish, and we’re not here to build “look at our data…ooooo” products for the media. Whenever a data science team is involved in those activities, I assume the business doesn’t actually know what to do with the technical resources they’ve hired.

Now, who is the “customer” in this mission? We serve other teams internally as well as MailChimp’s user base. An example of a data product built for an internal customer would be Omnivore, our compliance AI model, while an example of a data product built for the general user population would be our Discovered Segments collaborative filtering tool.

We work very closely with the user experience team at MailChimp — the UX team is constantly interviewing and interacting with our users, so they generate a lot of hypotheses which we investigate using our data. The UX team, because their insight is built quickly from human interactions, can flit from thought to thought and project to project; when they think they’re onto something good, they kick the research idea to the lumbering beast that is the data science team. We can comb through our billions of records of sends, clicks, opens, http requests, user and campaign metadata, purchase data, etc. to quantitatively back or dismiss their new thinking.
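
That “quantitatively back or dismiss” step is classic hypothesis testing. As a minimal sketch (not MailChimp’s actual tooling), here is how a UX hypothesis about two email variants might be checked with a two-proportion z-test; the counts are invented.

```python
# Toy check of a UX hypothesis: do two email variants differ in click-through?
# Uses a two-proportion z-test; all numbers are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

clicks = [1250, 1100]   # clicks on variant A, variant B
sends = [20000, 20000]  # emails sent per variant
stat, pvalue = proportions_ztest(count=clicks, nobs=sends)
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
# A small p-value quantitatively backs the hypothesis that the variants
# differ; a large one is grounds to dismiss it (at the chosen threshold).
```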

[Images: the UX team, and the data science team as robots]

3. Your book, Data Smart, is about helping to teach anyone to get value out of data. Why did you see a need for this book? 

I used to work as a consultant for lots of large organizations, such as the IRS, DoD, Coca-Cola, and InterContinental Hotels. When I thought about the semi-quantitative folks in the middle and upper rungs of those organizations (people more likely to still use the phrase “business intelligence” rather than “data science”), I realized there was no way for them to dip a toe into data science. Most of the intro books assumed a lot about the reader’s math background, and they depended on R and Python, so the reader had to learn to code at the same time they learned data science. Furthermore, most data science books were “script kiddie” books: the reader loaded something like an SVM package, built an AI model, and never really learned how the algorithms worked.

I wanted to teach the algorithms in a code-free environment, using tools the average “left behind” BI professional would be familiar with. So I chose to write all my tutorials in Data Smart using spreadsheets. At the same time, though, I pride myself on having written a more mathematically deep intro text than many of the other intro data science texts. The book is guided learning — it’s not just a book about data science.

Now, I don’t leave the reader in Excel. I guide them into using R at the end of the book, but I only take them there after they understand the algorithms. Anything else would be sloppy.

Another reason I wrote the book is that the market didn’t have a broad data science book. Most books focus on one topic, such as supervised AI. Data Smart covers data mining, supervised AI, basic probability and statistics, optimization modeling, time series forecasting, simulation, and outlier detection. So by the time readers finish the book, they’ve got a Swiss Army knife of techniques in their pocket and can distinguish when to use one technique versus another. I think we need more well-rounded data scientists, rather than the specialists that PhD programs are geared to produce.

4. You’ve written a book, maintain a personal blog and write for MailChimp. How important have communication and writing skills become to data scientists?

I believe that communication skills, both writing and speaking, are vital to being an effective data scientist. Data science is a business practice, not an academic pursuit, which means that collaboration with the other business units in a company is essential. And how is that collaboration possible if the data scientist cannot translate problems from the high-level vague definition a marketing team or an executive might provide into actual math?

Others in an organization don’t know what’s mathematically possible or impossible when they identify problems, so the data science team cannot rely on them to fully articulate problems and “throw them over the fence” to a ready-and-waiting data science team. No, an effective data science team works as an internal, technical consultancy. The data science team knows what’s possible, and they must communicate with colleagues and customers to understand processes and problems deeply, translate what they learn into something data can address, and then craft solutions that assist the customer.

5. Time for the Miss America question. If you had access to any data in the world, what is the question or problem you’d like to most solve?

I am a huge fan of Taco Bell. And I recognize that the restaurant actually has very few ingredients to work with — their menu is essentially an exercise in combinatorial math where ingredients are recombined in new formats to produce new menu items which are then tested in the marketplace. I’d love to get data on the success of each Taco Bell menu item. Combined with possible delivery format information, nutrition information, flavor data, and price elasticity data, I’d love to take a swing at algorithmically generating new menu items for testing in the market. If sales and elasticity data were timestamped, perhaps we could even generate menu items optimized for and only available during the stoner-friendly “fourthmeal.”
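
As a toy illustration of that combinatorial view (the formats and ingredients below are invented, since the real menu data isn’t public), enumerating candidate menu items takes only a few lines:

```python
# Enumerate candidate menu items as (format, ingredient-combination) pairs.
# Purely illustrative; real data would attach sales/elasticity scores.
from itertools import combinations

formats = ["taco", "burrito", "crunchwrap", "quesadilla"]
ingredients = ["beef", "beans", "cheese", "lettuce", "tomato",
               "rice", "sour cream", "nacho cheese"]

candidates = [
    (fmt, combo)
    for fmt in formats
    for r in range(2, 5)                       # 2 to 4 ingredients per item
    for combo in combinations(ingredients, r)
]
print(len(candidates), "candidate menu items")
# With per-item sales, nutrition, and price-elasticity data, one could score
# these candidates and surface the most promising ones for market testing.
```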

Thanks to John for taking the time to speak with Gnip! If you’re interested in more Data Stories, please check out our collection of 25 Data Stories featuring interviews with data scientists from Kaggle, Foursquare, Pinterest, bitly and more! 

Data Stories: Interview with Kaggle Data Scientist Will Cukierski

Data Stories is how Gnip tells the stories of those doing cutting-edge work around data, and this week we have an interview with Will Cukierski, data scientist at Kaggle. I have loved watching the different types of Kaggle contests and how they make really interesting challenges available to people all over the world. Will was gracious enough to be interviewed about his data science background, Kaggle’s work and community. 

1. You entered many Kaggle contests before you started working for Kaggle. What were some of the biggest lessons you learned?

Indeed, many years back I competed in the Netflix prize. As looking at spreadsheets goes, it was a thrilling experience (albeit also quite humbling). I took out a $3,000 loan from my parents to buy a computer with enough RAM to even load the data. A few years later, I was in the final throes of my doctorate when Kaggle was founded. I made it a side hobby and spent my evenings trying to port what I researched in my biomedical engineering day job to all sorts of crazy problems.

The fact that I was able to get anywhere is evidence that domain expertise can be overstated when working within different fields. If I can price bonds, it’s not that I understand bond pricing; it’s that I can learn how bonds were priced in the past. This is not to say that domain expertise is not important or necessary to make progress, but that there is a set of statistical skills that support all data problems.

What are these skills? People make them sound fancier than they really are. It’s not about knowing the latest, greatest machine learning methods. Will that help? Sure, but you don’t need to train a gigantic deep learning net to solve problems. The lesson Kaggle reinforced for me was the importance of the scientific method applied to data. It was really basic, embarrassing things: when you do something many times, the results need to be the same. When you add a bit of noise to the input, the output shouldn’t change too much. If two models tell you the same thing, but a little differently, then you can blend them and do better. If two models tell you something completely different, then you have a bug, or even better, a massive flaw in your entire understanding of what you’re doing. Training on a lot of different perspectives of the data is better than training on one perspective of all the data. Look at pictures of what you’re doing! Write down the things that you try, because you will forget a few hours later! The competition format forces you to do all of these basic things right, more so than having them lectured at you or reading them in a paper.
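
Two of those checks are easy to make concrete. The sketch below (on synthetic data, not any competition’s actual code) blends two similar-but-different models and then verifies that predictions stay stable under a little input noise.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

m1 = RandomForestRegressor(random_state=0).fit(X, y)
m2 = GradientBoostingRegressor(random_state=0).fit(X, y)

# Check 1: two models that agree "a little differently" blend to do better.
p1, p2 = m1.predict(X), m2.predict(X)
blend = (p1 + p2) / 2
print("blend MSE:", np.mean((blend - y) ** 2))

# Check 2: a bit of input noise shouldn't change the output too much.
noisy = X + rng.normal(scale=0.01, size=X.shape)
print("max change under small noise:", np.abs(m1.predict(noisy) - p1).max())
```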

I’m also happy to report that I have paid back the loan to my parents, though the jury is still out on whether I’m any wiser in the face of data. Humility is one of the most used tools in my arsenal!

2. Wired recently called attention to the fact that PhDs were leaving academia to become data scientists. At Kaggle, do you see this pattern? What kind of backgrounds do the 85,000 data scientists in the Kaggle community have?

It’s not hard to believe that PhDs are leaving to join data science positions. Academia is brutally competitive, and the difficulty is compounded by a dearth of grant funding. In data science, the disparity is flipped; companies are clamoring to hire data-literate people.

The information we have is mostly self-reported, so it’s difficult to make any real quantitative statement about a mass migration from academia to industry. What we can say about our userbase is that many thousands of them have PhDs, and that they are coming from all kinds of backgrounds. Physics, engineering, bioinformatics, actuarial science, you name it.

3. Do you see Kaggle as democratizing data science? We interviewed one of the winners of a Kaggle contest, and he was a student from Togliatti, Russia and was taking classes in data science on Coursera. I was blown away by him.

This is the fun part of sitting in the middle of a data labor market. I get to work with people who make me—presumably a not-unintelligent person…I hope…on my good days—realize how much I didn’t even know I don’t know.

Your question also brings up a controversial point. People have an understandable misconception about Kaggle’s democracy. Our critics are fond of saying that we are solving billion-dollar problems five times over and paying people a wooden nickel to do it. I think this reaction is partly a fear that smart people from anywhere, regardless of credentials, are given equal access to data problems, but I also think it’s a criticism that mistakes what our deliverable really represents. The fear over democratizing data science more parallels the old open-source software fallacy (how will we make money writing code if others give it out for free?!) than it does an outsourcing analogy.

Let’s take the problem of solving flight delay prediction. People immediately think “well that’s worth billions of dollars and if MyConsultingCorp were to solve that problem it would be for tens of millions in fees.” This stance is out of touch with what is really happening in these competitions. To wit:

  • People are solving singular problems for one company in one sector
  • The devil is in the implementation details
  • There are no constraints on absolute model performance, just relative rankings
  • The crowd always (p < 0.05) outperforms on accuracy, so when a business wants to optimize on accuracy, crowdsourcing gets chosen because it works well

Our asset is our community, not an outsourcing value proposition. To this end, we believe our efforts will actually increase the scope and amount of work available for people in analytics. Is this democracy? I think so. We sell in to companies, convince them of the merits of machine learning, isolate their problems, and open them up to the world.

The alternative is that DataDinosaur Corp. sells them on their proprietary Hadoop platform, cornering them into a big data pipe dream and leeching money via support contracts. The phrase “actionable intelligence” has never meant less than it does right now. It’s a scary, fake world out there in big-data land!

4. What data do you wish would be made available for a Kaggle contest?

I have a cancer research background. Much of the data from medical experiments is shrouded in privacy fears. A lot of this fear is justified (it’s certainly nonnegotiable that we preserve patient privacy), but I believe the main reason is that saying no means less work and bureaucracy, while saying yes means new approvals and lawsuit risk. There is a tragic amount of health and pharmaceutical data that goes to waste because it lives (dies?) in institutional silos.

Access to data for health researchers is not a new problem, but I think the tragedy is especially exacerbated given what I’ve seen Kagglers do with data for other industries.

5. What is your favorite problem that a Kaggle contest has solved?

We ran a competition with Marinexplore and Cornell University to identify the sound of the endangered North Atlantic right whale in audio recordings. Researchers have a network of buoys positioned around the ocean, constantly recording with underwater microphones. When a whale is detected in the vicinity of a buoy, they warn shipping traffic to steer clear of the area.
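
As a very crude sketch of the detection task (nothing like the winning entries), one could flag clips with unusual energy in the band where right whale upcalls live; the roughly 50-250 Hz band and the threshold below are assumptions for illustration.

```python
# Toy baseline: flag a clip if spectrogram energy in the upcall band spikes.
import numpy as np
from scipy.signal import spectrogram

def flag_upcall(audio, fs, low=50.0, high=250.0, factor=3.0):
    freqs, _, sxx = spectrogram(audio, fs=fs, nperseg=256)
    band_energy = sxx[(freqs >= low) & (freqs <= high)].sum(axis=0)
    return bool((band_energy > factor * np.median(band_energy)).any())

# Two seconds of noise with a faint synthetic up-sweep injected mid-clip.
fs = 2000
t = np.arange(2 * fs) / fs
clip = 0.1 * np.random.randn(t.size)
sweep = np.sin(2 * np.pi * (100 + 50 * t) * t)  # rough stand-in for an upcall
clip[fs // 2 : fs // 2 + fs] += 0.5 * sweep[:fs]
print(flag_upcall(clip, fs=fs))
```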

Not only did the participants come up with an algorithm that was very close to perfect, but they also got to work on a data science problem with a clear and unquestionably positive goal. We spend a lot of time as data scientists thinking about optimizing ads, predicting ratings, marketing widgets, etc. These are economically important and still quite interesting, but they lack the feel-good factor of sitting at your keyboard thousands of miles away and knowing that your work might trickle down to save the life of an endangered whale.

North Atlantic right whale Kaggle competition

Gnip is hiring for a data scientist position if you’re looking to do your own cutting-edge work. 


Data Story: Mohammad Shahangian on Pinterest Data Science

At Gnip, we believe the value of social data is unlimited. Data Stories is how we bring this belief to life by showcasing how social data is used. This week we’re interviewing Pinterest data scientist Mohammad Shahangian about how the data science team works at Pinterest, surprising uses of Pinterest, and data science as a career path. You can follow him on Pinterest at pinterest.com/mshahang.

Data Scientist at Pinterest

1. What do you see as your role as the data scientist for Pinterest?

The company’s focus is on helping millions of people discover things they love and get the inspiration to go do those things in their lives. For me, that means analyzing the rich data created by millions of people interacting with billions of pins from across the web each day. I evaluate this data and provide insights that make it actionable. My team also prototypes and validates ideas, performs deep analysis and builds tools that allow us to answer our most frequent questions in seconds. We work with every team to answer Pinterest’s biggest questions and ensure that each decision positively impacts Pinners over the long term.

For example, we take a business question like “How should our web, tablet and phone experiences differ?” and present the results as insights like, “Many users use the mobile apps in the morning and again at night, but prefer the website during the day” and “Users prefer to use mobile apps to casually discover new content, whereas they use the web to curate and organize content.” We then work with the design and product teams to build features around these insights and measure their impact.
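
A sketch of the kind of analysis behind an insight like that: count events by platform and hour of day. The column and platform names below are hypothetical, not Pinterest’s actual schema.

```python
import pandas as pd

# Invented event log: one row per user action, tagged with platform and time.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "platform": ["mobile", "web", "mobile", "mobile", "web"],
    "ts": pd.to_datetime([
        "2013-07-01 07:10", "2013-07-01 13:45", "2013-07-01 08:02",
        "2013-07-01 22:30", "2013-07-01 14:20",
    ]),
})

by_hour = (events.assign(hour=events["ts"].dt.hour)
                 .groupby(["platform", "hour"]).size()
                 .unstack(fill_value=0))
print(by_hour)  # rows: platform; columns: hour of day; values: event counts
```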

2. What are some of your favorite unexpected ways that people use Pinterest?

What makes Pinterest unique is that it’s a tool, and the users really define its use cases. For me, Pinterest was really helpful when I was planning my wedding, and it made perfect sense to use as a collaborative office shopping list. I would have never thought to use it as a tool for:

  • A collection of stop signs from around the world
  • Daily Grommet gets their community to collaborate on a board to see things they want to sell
  • Vintage Driving, a collaborative board where users pin their favorite vintage cars
  • GE’s Badass Machines board featuring GE tech
  • Madewell’s Rainbow board
  • Michelle Obama’s MyPlate Recipes board, which encourages healthy eating
  • Stunning virtual collections of minerals and shipwrecks
  • The “365 Days of Pinterest” challenge, in which one user made a Pinterest project every day for a year
  • Sammy Sosa awesomeness
  • Sony showing off their technology with food pictures shot with a Sony camera
  • Pantone announcing the color of the year
  • The National Pork Board

3. What category do you see as the most viral on Pinterest?

DIY and recipe pins generally go viral year-round. Around the holidays, holiday-themed content across all categories tends to get the most traction.

4. How has data science added value to Pinterest?

We have this internal value we refer to as “knit.” It means that we have an open, curious culture where everyone in different disciplines—from engineering and design to marketing to community—works together. Data science is at the core of that. The search, recommendations and spam teams apply data science to improve the quality of content we put in front of Pinners. This is only a subset of how we apply data though; most of the decisions we make at Pinterest are actually backed by data.

Data is a universal language that teams across the company use to collaborate and make decisions. Each team has a set of performance metrics, and we hold a weekly meeting to understand the impact that each area is having on company-wide metrics. As data scientists we do more than just analyze data, we create rich data sources that we make available to other teams so they can do their own analysis. More than half of Pinterest employees run MapReduce jobs via Hive.  Our metrics dashboards are accessible to everyone and our core metrics are emailed daily to the entire team.  We also share our data studies and insights with the whole team.
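
As a hypothetical illustration of that kind of self-serve access (the host, table, and column names are invented, and Pinterest’s actual setup may differ), a Hive query run from Python might look like this:

```python
# Run a daily-counts Hive query from Python via the PyHive client.
# Everything here (host, table, columns) is made up for illustration.
from pyhive import hive

conn = hive.Connection(host="hive.internal.example", port=10000)
cursor = conn.cursor()
cursor.execute("""
    SELECT dt, COUNT(*) AS pins
    FROM pin_events
    WHERE dt >= '2013-07-01'
    GROUP BY dt
""")
for day, pins in cursor.fetchall():
    print(day, pins)
```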

We also use data just for fun. During our weekly happy hour, we share a weekly Data Fun Fact with the team. We present the fact in the form of a multiple choice question and have the team vote on the answer. For example, we asked, “How many days before Valentine’s day does the query ‘Valentine’s day ideas’ increase the most: 1, 3, 5 or 7 days?” (Hint for the curious reader: two*three/two).

5. What do you think someone should know before becoming a data scientist at a major web company like Pinterest?

I would say go for it! If you are hungry to extract value from real world data, you’re really going to enjoy it. I know that for a lot of really talented people in academia the only thing standing between them and the opportunity to solve a really interesting problem is the lack of rich data. My experience at Pinterest has been the exact opposite. Our team can’t grow fast enough to tap into a world of valuable insights that are sitting dormant within billions of records somewhere in the cloud.


In The Future, The Data Scientist Will Be Replaced by Tools


Some of you are celebrating. Some of you are muttering about how you could never be replaced by a machine.

What is the case for? What is the case against? How should we think about the investments in infrastructure, talent, education and tools that we hope will provide the competitive insights from “big data” everyone seems to be buzzing about?

First, you might ask: why try to replace the data scientist with tools? At least one reason is in the news: the looming talent gap.

Wired UK reports:

Demand is already outstripping supply. A recent global survey from EMC found that 65 percent of data science professionals believe demand for data science talent will outpace supply over the next five years, while a report from last year by McKinsey identified the need in the US alone for at least 190,000 deep analytical data scientists in the coming years.

Maybe we should turn to tools to replace some or all of what the data scientist does. Can you replace a data scientist with tools?  An emerging group of startups would like you to think this is already possible. For example, Metamarkets headlines their product page with “Data science as a service.” They go on to explain:

 Analyzing and understanding these data streams can increase revenue and improve user engagement, but only if you have the highly skilled data scientists necessary to turn data into useful information.

Metamarkets’ mission is to democratize data science by delivering powerful analytics that are easy and intuitive for everyone.

SriSatish Ambati of the early-stage startup 0xdata (pronounced hex-data) goes a step further with the idea that “the scale of the underlying data and the complexity of running advanced analysis are details that need to be hidden.” (GigaOm article)

On the other side of the coin, Cathy O’Neil at Mathbabe set out the case in her blog a few weeks ago that not only can you not replace the data scientist with tools, you shouldn’t even allow the non-data-scientist near the data scientist’s tools:

 As I see it, there are three problems with the democratization of algorithms:

 1. As described already, it lets people who can load data and press a button describe themselves as data scientists.

 2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.

 3. Businesses might think they have awesome data scientists when they don’t. [...] posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

If this topic interests you, we’ve submitted a panel for SXSW this spring in Austin to discuss issues surrounding data science and tools. We will talk about which tools are available today and how they make us more effective, as well as some of the pitfalls of tool use. And we will look into the future to see where, and whether, data scientists can be replaced by tools. We would love your vote!

Panelists:

  • John Myles White (@johnmyleswhite) – Coauthor of Machine Learning for Hackers and a Ph.D. student in the Princeton Psychology Department, where he studies human decision-making.
  • Yael Garten (@yaelgarten) – Senior Data Scientist at LinkedIn.
  • James Dixon (@jamespentaho) – CTO at Pentaho, which makes open source tools for business intelligence.

Update: One of our panelists, John Myles White, has provided some thoughtful analysis of companies that rely on automating or assisting data science tasks. See his blog post at http://www.johnmyleswhite.com/notebook/2012/08/28/will-data-scientists-be-replaced-by-tools

Data Stories: Interview with Hilary Mason of bitly

Data Stories is Gnip’s opportunity to tell the cool stories about the data scientists, data journalists and other people who are working in data. This week we’re interviewing Hilary Mason, the chief data scientist of bitly. She is currently helping organize DataGotham, a celebration of New York’s data community happening Sept. 13-14. You can follow her on Twitter at @hmason and read her blog at HilaryMason.com.

Hilary Mason of bitly

1) How did you get started in your role as a data scientist?

I’m a computer scientist and have always had a keen interest in both algorithms and databases. It became clear to me in the last decade that the most interesting algorithms were those that worked on real data. When I found that there were opportunities to design math and infrastructure to build new types of applications, I couldn’t resist!

2) bitly users share 80 million links a day. What are some of the coolest insights and trends you’ve been able to see from these shared links?

We see all kinds of fascinating things in the data. For example, people who read about physics also read about fashion (http://bit.ly/vSa6AO), and people who use Kindles use them very differently from users of any other kind of device (http://bit.ly/wbRe6o). We’re always posting these things on our blog. For example, on July 4th we posted the most popular recipe by state for the holiday. Did you know that people in Florida enjoy Alligator Ribs (http://bit.ly/NwUEUL)?

3) bitly just updated its site making it even easier to share and curate links. As the chief data scientist, what excites you most about the new capabilities?

It’s wonderful to see bitly evolve from a utility into a truly social platform. We’re excited for bitly to become the central place for you to store, share, and analyze the things that you care about on the internet. We can then use the aggregate data that we collect to enhance that experience for you.

4) What are some of your favorite projects you’ve worked on while at bitly?

Our goal at bitly is to understand the internet’s attention, and to build systems that make that useful. It’s too hard just to pick one bit of it! I’m proud of some of the work that’s made it out into the world, like our post about the half life of links on various social networks (http://bit.ly/puUbzs) and our collaboration with Forbes on the interactive map of media influence (http://onforb.es/GFzphG). I’m also incredibly excited about a few product-oriented experiments that are going to be public shortly … stay tuned.
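
The half-life study measured, for each link, the time it took to receive half of the clicks it would ever receive. A minimal sketch of that computation over observed click timestamps, on toy data:

```python
import numpy as np

def half_life(click_times):
    """Time by which 50% of the observed clicks have occurred."""
    t = np.sort(np.asarray(click_times))
    return t[int(np.ceil(len(t) / 2)) - 1]

# Toy data: clicks arriving with exponential decay (seconds after sharing).
clicks = np.random.exponential(scale=3 * 3600, size=10000)
print(f"half life: {half_life(clicks) / 3600:.2f} hours")
```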

5) What tools are in your arsenal as a data scientist?

I’m a firm believer in finding the smartest people you can, and letting them use whatever works best. Personally, I’m a huge fan of the old skool unix utilities, and do more with grep and awk than I should probably admit.

Python is my current programming language of choice, though I’m not averse to C when necessary. A few people on my team have started to fall in love with Go, so that’s on my list to check out.

We use the best datastore for each challenge, and make heavy use of memcached, Redis, HDFS, and even text files.

In the non-tech world, I keep a moleskine notebook around and have fallen in love with the Hi-Tec-C .4mm pens from JetPens.

6) As the chief scientist, where do you think your team adds additional business value? How does data science help bitly make decisions it wouldn’t make otherwise?

My team plays a few roles within the company. We handle the business analytics, which can range from answering very simple questions like, “How many new URLs did we see yesterday?” to complex questions like, “How do we value a URL being clicked from platform X vs. platform Y over time?”

We do research, pushing the boundaries of what we know to be possible with our data and systems. A few examples of these types of questions are, “Can we build a model of attention to any phrase people are actively clicking on?”, or “Can we predict opening weekend box office takes for movies that people are reading about via bitly links?”

Finally, we build products. Generally these are APIs, like the API that accepts a URL and returns the geographic distribution of attention to that URL, but sometimes they’re human-facing products. More on that shortly.

In summary, my team is responsible for pushing the boundaries of where bitly can go. It’s fun.

Thanks to Hilary for taking the time to talk to us about her work with bitly! Let us know in the comments if you have a suggestion for another Data Story.
