O’Reilly Media’s Strata was held in NYC recently, just before Sandy arrived. The conference was sold out to building capacity. By this measure, it was the most successful Strata to date. Strata and Hadoop World were combined into a single conference. The week started with tutorials, meet-ups, a mini Maker Faire and Ignite Talks and ended with the more traditional conference format of keynotes, breakouts and an exhibit hall for vendors.
Some of the most interesting and relevant idea-driven keynotes included Mike Flowers talking about data used to understand building code violations in NYC, Rich Hickey addressing opportunities and challenges of adding back to big-data analysis platforms some of the traditional features like indexes, queries and transactions, and Shamila Mullighan’s recommendations for combining internal data with public APIs to enhance your data science.
Tim Estes gave a passionate talk about attention and the “responsibility to know,” and the growing gap between them, asserting that “Understanding is a great cause.” Doug Cutting talked about the future of Hadoop, and Julie Steele interviewed “Mathbabe” (Cathy O’Neil) about real-world vs. academic data science. Joe Hellerstein addressed the challenges of resources and attention that result from 80% of a typical data scientist’s activity being spent preparing and transforming data for analysis. Samantha Ravich wrapped up the keynotes with an appeal for data science tools that better match decision makers’ needs and modes of working.
Demand Outstrips Supply for Talent
Many speakers pointed to the dire need for more data science talent. In many cases, this was emphasized by pointing to the data going unanalyzed, answers going unfound, and sometimes open positions unfilled.
In what ways is the data scientist shortage directly problematic, and in what ways has it become shorthand for the larger problem that businesses don’t have the decision processes, managers, infrastructure, etc. needed to effectively make decisions from big data? There seemed to be some glossing over of the point that the data science talent shortage is largely an opportunity cost and, to the extent there is uneven use of data in your industry, a competitive issue, rather than an actual cost of doing business today. The shortage of data science talent is accompanied by a shortage of managers able to focus on good questions, direct resources to data science projects and make decisions based on data.
The data scientist is key, but is only successful in bringing competitive advantage when the context of good questions and data-driven decisions is in place. Concentrating on the shortage of data scientists when your management team is unprepared to participate in and leverage insights gained from good data science work seems sort of silly. Samantha Ravich’s talk had a clear example of this in how the Bush administration’s decision process regarding poppy production in Afghanistan went wrong.
Data Science Infrastructure
The other recurring theme was Data Science Infrastructure. Many data scientists have noticed that a huge proportion of their daily work is finding, shaping, loading, moving, transforming and connecting data. The product announcements as well as many of the idea-oriented talks pointed to and quantified this challenge. “Spend more time doing the science part of your job” is the idea behind Platfora, OpenChorus, Impala and Joe Hellerstein’s Data Wrangler.
For Gnip, a personal highlight was Greenplum announcing OpenChorus, a collaborative big data environment with integrations to Gnip, Kaggle and Tableau. Informatica announced their continued work toward a “no-code” environment for big-data analytics. MapR, SAS and other well-known players had their say at keynotes as well.
In general, the last two days of Strata seemed focused more on the line manager of big-data insights and infrastructure in the organization and less on the analysis or visualization practitioner. Some bright spots on the practitioner side were Donald Miner’s MapReduce patterns, Kim Rees on creating great visualizations and Cathy O’Neil on the realities of mining and making decisions based on weak signals in time-series data.