Data Stories: Interview with Kaggle Data Scientist Will Cukierski

Data Stories is how Gnip tells the stories of those doing cutting-edge work around data, and this week we have an interview with Will Cukierski, data scientist at Kaggle. I have loved watching the different types of Kaggle contests and how they make really interesting challenges available to people all over the world. Will was gracious enough to be interviewed about his data science background, Kaggle’s work and community. 

1. You entered many Kaggle contests before you started working for Kaggle. What were some of the biggest lessons you learned?

Indeed, many years back I competed in the Netflix prize. As looking at spreadsheets goes, it was a thrilling experience (albeit also quite humbling). I took out a $3,000 loan from my parents to buy a computer with enough RAM to even load the data. A few years later, I was in the final throes of my doctorate when Kaggle was founded. I made it a side hobby and spent my evenings trying to port what I researched in my biomedical engineering day job to all sorts of crazy problems.

The fact that I was able to get anywhere is evidence that domain expertise can be overstated when working within different fields. If I can price bonds, it’s not that I understand bond pricing; it’s that I can learn how bonds were priced in the past. This is not to say that domain expertise is not important or necessary to make progress, but that there is a set of statistical skills that support all data problems.

What are these skills? People make them sound more fancy than they really are. It’s not about knowing the latest, greatest machine learning methods. Will it help? Sure, but you don’t need to train a gigantic deep learning net to solve problems. The lesson Kaggle reinforced for me was the importance of the scientific method applied to data. It was really basic, embarrassing things. When you do something many times, the results need to be the same. When you add a bit of noise to the input, the output shouldn’t change too much. If two models tell you the same thing, but a little differently, then you can blend them and do better. If two models tell you something completely different, then you have a bug (or even better, a massive flaw in your entire understanding of what you’re doing). Training on a lot of different perspectives of the data is better than training on one perspective of all the data. Look at pictures of what you’re doing! Write down the things that you try, because you will forget a few hours later! The competition format forces you to do all of these basic things right, more so than having them lectured at you, or reading them in a paper.
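
As a rough illustration of the blending point above, here is a minimal Python sketch. The numbers, the two “models”, and the helper names mse and blend are all made up for illustration; none of them come from a Kaggle competition. It only shows that averaging two predictors whose errors fall in partly different places can lower the error.

```python
def mse(truth, predictions):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(truth, predictions)) / len(truth)


def blend(preds_a, preds_b):
    """Average two sets of predictions element by element."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Hypothetical data: both models are off by 1 everywhere,
# but they disagree on the direction for two of the four points.
truth = [3.0, 5.0, 7.0, 9.0]
model_a = [4.0, 4.0, 8.0, 8.0]
model_b = [2.0, 4.0, 8.0, 10.0]
blended = blend(model_a, model_b)

print(mse(truth, model_a), mse(truth, model_b), mse(truth, blended))
# prints: 1.0 1.0 0.5
```

Where the two models err in opposite directions the errors cancel, and where they err in the same direction the blend is no worse than either one. If the two models made identical mistakes, blending would gain nothing.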

I’m also happy to report that I have paid back the loan to my parents, though the jury is still out on whether I’m any wiser in the face of data. Humility is one of the most used tools in my arsenal!

2. Wired recently called attention to the fact that PhDs were leaving academia to become data scientists. At Kaggle, do you see this pattern? What kind of backgrounds do the 85,000 data scientists in the Kaggle community have?

It’s not hard to believe that PhDs are leaving to join data science positions. Academia is brutally competitive, and the difficulty is compounded by a dearth of grant funding. In data science, the disparity is flipped; companies are clamoring to hire data-literate people.

The information we have is mostly self-reported, so it’s difficult to make any real quantitative statement about a mass migration from academia to industry. What we can say about our userbase is that many thousands of them have PhDs, and that they are coming from all kinds of backgrounds: physics, engineering, bioinformatics, actuarial science, you name it.

3. Do you see Kaggle as democratizing data science? We interviewed one of the winners of a Kaggle contest, and he was a student from Togliatti, Russia and was taking classes in data science on Coursera. I was blown away by him.

This is the fun part of sitting in the middle of a data labor market. I get to work with people who make me—presumably a not-unintelligent person…I hope…on my good days—realize how much I didn’t even know I don’t know.

Your question also brings up a controversial point. People have an understandable misconception about Kaggle’s democracy. Our critics are fond of saying that we are solving billion-dollar problems five times over and paying people a wooden nickel to do it. I think this reaction is partly a fear that smart people from anywhere, regardless of credentials, are given equal access to data problems, but I also think it’s a criticism that mistakes what our deliverable really represents. The fear over democratizing data science parallels the old open-source software fallacy (how will we make money writing code if others give it out for free?!) more than it does an outsourcing analogy.

Let’s take the problem of solving flight delay prediction. People immediately think “well that’s worth billions of dollars and if MyConsultingCorp were to solve that problem it would be for tens of millions in fees.” This stance is out of touch with what is really happening in these competitions. To wit:

  • People are solving singular problems for one company in one sector
  • The devil is in the implementation details
  • There are no constraints on absolute model performance, just relative rankings
  • The crowd always (p < 0.05) outperforms on accuracy, so when a business wants to optimize on accuracy, crowdsourcing gets chosen because it works well

Our asset is our community, not an outsourcing value proposition. To this end, we believe our efforts will actually increase the scope and amount of work available for people in analytics. Is this democracy? I think so. We sell in to companies, convince them of the merits of machine learning, isolate their problems, and open them up to the world.

The alternative is that DataDinosaur Corp. sells them on their proprietary Hadoop platform, cornering them into a big data pipe dream and leeching money via support contracts. The phrase “actionable intelligence” has never meant less than it does right now. It’s a scary, fake world out there in big-data land!

4. What data do you wish would be made available for a Kaggle contest?

I have a cancer research background. Much of the data from medical experiments is shrouded in privacy fears. A lot of this fear is justified (it’s certainly nonnegotiable that we preserve patient privacy), but I believe the main reason is that saying no means less work & bureaucracy and saying yes means new approvals & lawsuit risk. There is a tragic amount of health and pharmaceutical data that goes to waste because it lives (dies?) in institutional silos.

Access to data for health researchers is not a new problem, but I think the tragedy is especially exacerbated given what I’ve seen Kagglers do with data for other industries.

5. What is your favorite problem that a Kaggle contest has solved?

We ran a competition with Marinexplore and Cornell University to identify the sound of the endangered North Atlantic right whale in audio recordings. Researchers have a network of buoys positioned around the ocean, constantly recording with underwater microphones. When a whale is detected in the vicinity of the buoy, they warn shipping traffic to steer clear of the area.

Not only did the participants come up with an algorithm that was very close to perfect, but how about the opportunity to work on a data science problem with such a clear and unquestionably positive goal? We spend a lot of time as data scientists thinking about optimizing ads, predicting ratings, marketing widgets, etc. These are economically important and still quite interesting, but they lack the feel-good factor of sitting at your keyboard thousands of miles away and knowing that your work might trickle down to save the life of an endangered whale.

North Atlantic Whale Kaggle Competition


Gnip is hiring for a data scientist position if you’re looking to do your own cutting-edge work. 

Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see what posts were most likely to generate likes, and we interviewed Dmitrii Vlasov, who came in second in the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up-and-coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle accidentally, as always happens. I read some blog post about the Million Song Dataset Challenge provided by and a bunch of other organizations. The task was to predict which songs will be liked by users based on their existing listening history. This immediately made me feel excited because I’m an active user and was reflecting on what connections between people can be established based on their music preferences. But the contest was coming to an end, so I switched to the WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Now I’m a senior CS student in Togliatti, Russia. Can’t say that I have a special background in data science. I had more than a year-long course in probability theory and mathematical statistics at university, have some self-taught skills in semantic analysis, and have a big love for Python as a tool for implementing ideas. Also, I’ve enrolled in the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts attract more likes (80,000-90,000 characters). This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show the relative correlation between the number of photos, characters and videos and the number of likes received. A big relative “folks love” at certain prominent photo counts means that there were not many posts with that many photos, but most of them were high quality. Quick empirical analysis shows that these are a special type of post: “big photo posts”. They are usually a photo report, photo collection or scrapbook. For such types of posts 10-15 photos are not enough, but at the same time 10-15 photos seem too overloaded for a normal post. The same can be said about a large amount of text in a post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80,000-90,000 characters are winners in the “heavyweight category”. These are big research pieces, novels, political contemplations. The analysis is quite simple, but it shows that if you want to create media-rich or text-rich content, it should be really media-rich or text-rich. Otherwise you may fall into a gap where the post suits neither category.

4. What else would you like to predict with social data if you got the chance?

Now I work on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data, e.g. to predict the reaction to a politician’s statement based on a user’s Twitter feed. Or to extract all the “real” theses of a politician based on all of his public speeches.