Data Stories is how Gnip tells the stories of those doing cutting-edge work around data, and this week we have an interview with Will Cukierski, data scientist at Kaggle. I have loved watching the different types of Kaggle contests and how they make really interesting challenges available to people all over the world. Will was gracious enough to be interviewed about his data science background and Kaggle’s work and community.
1. You entered many Kaggle contests before you started working for Kaggle. What were some of the biggest lessons you learned?
Indeed, many years back I competed in the Netflix prize. As looking at spreadsheets goes, it was a thrilling experience (albeit also quite humbling). I took out a $3,000 loan from my parents to buy a computer with enough RAM to even load the data. A few years later, I was in the final throes of my doctorate when Kaggle was founded. I made it a side hobby and spent my evenings trying to port what I researched in my biomedical engineering day job to all sorts of crazy problems.
The fact that I was able to get anywhere is evidence that domain expertise can be overstated when working within different fields. If I can price bonds, it’s not that I understand bond pricing; it’s that I can learn how bonds were priced in the past. This is not to say that domain expertise is not important or necessary to make progress, but that there is a set of statistical skills that support all data problems.
What are these skills? People make them sound fancier than they really are. It’s not about knowing the latest, greatest machine learning methods. Will that help? Sure, but you don’t need to train a gigantic deep learning net to solve problems. The lesson Kaggle reinforced for me was the importance of the scientific method applied to data. It came down to really basic, embarrassing things:

- When you do something many times, the results need to be the same.
- When you add a bit of noise to the input, the output shouldn’t change too much.
- If two models tell you the same thing, but a little differently, you can blend them and do better.
- If two models tell you something completely different, you have a bug, or even better, a massive flaw in your entire understanding of what you’re doing.
- Training on many different perspectives of the data is better than training on one perspective of all the data.
- Look at pictures of what you’re doing!
- Write down the things you try, because you will forget them a few hours later!

The competition format forces you to do all of these basic things right, more so than having them lectured at you or reading them in a paper.
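The blending point is worth making concrete. Here is a minimal sketch (not Kaggle code; the "models," data, and error margins are all invented for illustration) of why averaging two models that agree "a little differently" can beat either one alone:

```python
# Sketch of model blending: two imperfect predictors whose errors partly
# cancel when averaged. All numbers here are made up for illustration.

def rmse(preds, truth):
    """Root-mean-square error between predictions and ground truth."""
    return (sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)) ** 0.5

truth = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

# Two hypothetical "models": one biased high, one biased low.
model_a = [t + 0.8 for t in truth]   # consistently overpredicts
model_b = [t - 0.7 for t in truth]   # consistently underpredicts

# Blend by simple averaging of predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(model_a, truth))  # error of model A alone
print(rmse(model_b, truth))  # error of model B alone
print(rmse(blend, truth))    # the blend beats both
```

The cancellation here is artificially clean, but the underlying idea is real: when models are individually decent and their errors are not perfectly correlated, an average of their predictions tends to have lower error than either model by itself.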
I’m also happy to report that I have paid back the loan to my parents, though the jury is still out on whether I’m any wiser in the face of data. Humility is one of the most used tools in my arsenal!
2. Wired recently called attention to the fact that PhDs were leaving academia to become data scientists. At Kaggle, do you see this pattern? What kind of backgrounds do the 85,000 data scientists in the Kaggle community have?
It’s not hard to believe that PhDs are leaving to take data science positions. Academia is brutally competitive, and the difficulty is compounded by a dearth of grant funding. In data science, the dynamic is flipped; companies are clamoring to hire data-literate people.
The information we have is mostly self-reported, so it’s difficult to make any real quantitative statement about a mass migration from academia to industry. What we can say about our userbase is that many thousands of them have PhDs, and that they are coming from all kinds of backgrounds. Physics, engineering, bioinformatics, actuarial science, you name it.
3. Do you see Kaggle as democratizing data science? We interviewed one of the winners of a Kaggle contest, and he was a student from Togliatti, Russia and was taking classes in data science on Coursera. I was blown away by him.
This is the fun part of sitting in the middle of a data labor market. I get to work with people who make me—presumably a not-unintelligent person…I hope…on my good days—realize how much I didn’t even know I don’t know.
Your question also brings up a controversial point. People have an understandable misconception about Kaggle’s democracy. Our critics are fond of saying that we are solving billion-dollar problems five times over and paying people a wooden nickel to do it. I think this reaction is partly a fear that smart people from anywhere, regardless of credentials, are given equal access to data problems, but I also think it’s a criticism that mistakes what our deliverable really represents. The fear over democratizing data science more parallels the old open-source software fallacy (how will we make money writing code if others give it out for free?!) than it does an outsourcing analogy.
Let’s take the problem of solving flight delay prediction. People immediately think “well that’s worth billions of dollars and if MyConsultingCorp were to solve that problem it would be for tens of millions in fees.” This stance is out of touch with what is really happening in these competitions. To wit:
- People are solving singular problems for one company in one sector
- The devil is in the implementation details
- There are no constraints on absolute model performance, just relative rankings
- The crowd reliably (p < 0.05) outperforms on accuracy, so when a business wants to optimize for accuracy, crowdsourcing gets chosen because it works
Our asset is our community, not an outsourcing value proposition. To this end, we believe our efforts will actually increase the scope and amount of work available for people in analytics. Is this democracy? I think so. We sell in to companies, convince them of the merits of machine learning, isolate their problems, and open them up to the world.
The alternative is that DataDinosaur Corp. sells them on their proprietary Hadoop platform, cornering them into a big data pipe dream and leeching money via support contracts. The phrase “actionable intelligence” has never meant less than it does right now. It’s a scary, fake world out there in big-data land!
4. What data do you wish would be made available for a Kaggle contest?
I have a cancer research background. Much of the data from medical experiments is shrouded in privacy fears. A lot of this fear is justified; it’s certainly nonnegotiable that we preserve patient privacy. But I believe the main reason is that saying no means less work and bureaucracy, while saying yes means new approvals and lawsuit risk. There is a tragic amount of health and pharmaceutical data that goes to waste because it lives (dies?) in institutional silos.
Access to data for health researchers is not a new problem, but I think the tragedy is especially exacerbated given what I’ve seen Kagglers do with data for other industries.
5. What is your favorite problem that a Kaggle contest has solved?
We ran a competition with Marinexplore and Cornell University to identify the sound of the endangered North Atlantic right whale in audio recordings. Researchers have a network of buoys positioned around the ocean, constantly recording with underwater microphones. When a whale is detected in the vicinity of a buoy, they warn shipping traffic to steer clear of the area.
Not only did the participants come up with an algorithm that was very close to perfect, but we also got the chance to work on a data science problem with a clear and unquestionably positive goal. We spend a lot of time as data scientists thinking about optimizing ads, predicting ratings, marketing widgets, etc. These are economically important and still quite interesting, but they lack the feel-good factor of sitting at your keyboard thousands of miles away and knowing that your work might trickle down to save the life of an endangered whale.
Gnip is hiring for a data scientist position if you’re looking to do your own cutting-edge work.