Strata 2012 – Strata Grows Up

O'Reilly Strata Conference Making Data Work

O’Reilly Media’s Strata was held in NYC recently, just before Hurricane Sandy arrived. The conference sold out to building capacity, which by that measure makes it the most successful Strata to date. Strata and Hadoop World were combined into a single conference. The week started with tutorials, meet-ups, a mini Maker Faire and Ignite Talks, and ended with the more traditional conference format of keynotes, breakouts and an exhibit hall for vendors.

Some of the most interesting and relevant idea-driven keynotes included Mike Flowers talking about data used to understand building code violations in NYC, Rich Hickey addressing the opportunities and challenges of restoring traditional database features such as indexes, queries and transactions to big-data analysis platforms, and Shamila Mullighan’s recommendations for combining internal data with public APIs to enhance your data science.

Tim Estes gave a passionate talk about attention and the “responsibility to know,” and the growing gap between them, asserting “Understanding is a great cause.” Doug Cutting talked about the future of Hadoop, and Julie Steele interviewed “Mathbabe” (Cathy O’Neil) about real-world vs. academic data science. Joe Hellerstein addressed the strain on resources and attention that results from a typical data scientist spending 80% of their time preparing and transforming data for analysis. Samantha Ravich wrapped up the keynotes with an appeal for data science tools that better match decision makers’ needs and modes of working.


Demand Outstrips Supply for Talent
Many speakers pointed to the dire need for more data science talent. They often made the case by pointing to data going unanalyzed, answers going unfound and, in some cases, open positions going unfilled.

In what ways is the data scientist shortage directly problematic, and in what ways has it become shorthand for the larger problem that businesses lack the decision processes, managers, infrastructure and so on needed to make effective decisions from big data? Speakers seemed to gloss over the point that the talent shortage is largely an opportunity cost and, to the extent that data is used unevenly in your industry, a competitive issue, rather than an actual cost of doing business today. The shortage of data science talent is accompanied by a shortage of managers able to focus on good questions, direct resources to data science projects and make decisions based on data.

The data scientist is key, but delivers competitive advantage only when the context of good questions and data-driven decisions is in place. Concentrating on the shortage of data scientists when your management team is unprepared to participate in and leverage the insights gained from good data science work seems sort of silly. Samantha Ravich’s talk had a clear example of this in how the Bush administration’s decision process regarding poppy production in Afghanistan went wrong.

Data Science Infrastructure
The other recurring theme was data science infrastructure. Many data scientists have noticed that a huge proportion of their daily work is finding, shaping, loading, moving, transforming and connecting data. The product announcements, as well as many of the idea-oriented talks, pointed to and quantified this challenge. “Spend more time doing the science part of your job” is the idea behind Platfora, OpenChorus, Impala and Joe Hellerstein’s Data Wrangler.

For Gnip, a highlight was Greenplum’s announcement of OpenChorus, a collaborative big-data environment with integrations to Gnip, Kaggle and Tableau. Informatica announced their continued work toward a “no-code” environment for big-data analytics. MapR, SAS and other well-known players had their say at keynotes as well.

In general, the last two days of Strata seemed focused more on the line manager of big-data insights and infrastructure in the organization and less on the analysis or visualization practitioner. Some bright spots on the practitioner side were Donald Miner’s MapReduce patterns, Kim Rees on creating great visualizations and Cathy O’Neil on the realities of mining and making decisions based on weak signals in time-series data.

Strata, Big Data & Big Streams

Gnip had a great time at O’Reilly’s Strata 2011 conference in California last week. We signed up several months ago as a big sponsor without knowing exactly how things were going to come together. The bet paid off and Strata was a huge success for us, and for the industry at large. We were blown away by the relevance of the topics discussed and the quality of the attendees and the discussions that were sparked. I was amazed at how much knowledge everyone now has surrounding big data set analysis and processing. Technologies that were immature and new just a few years ago are now baked into the ecosystem and have become tools of the trade (e.g. Hadoop). All very cool to see.


That said, there remains a distinct gap between big data set handling and high-volume/real-time data stream handling. We’ve come a long way in handling monster data set processing in batch or offline modes, but we have a long way to go when it comes to handling large streaming data set challenges. Hilary Mason hit this point squarely in her “What Data Tells Us” talk at Strata. With open-source tools, we can fan out ungodly amounts of processing… like piranha on fresh meat. However, blending that processing, and high-latency transactions, into real-time streams of thousands of activities per second is not as refined and well understood. Frankly, I’m shocked at the number of engineers I run into who simply don’t understand asynchronous programming at all.


The night before the conference started, Pete Warden drove BigDataCamp @Strata, where Mike Montano from BackType gave a high-level overview of their infrastructure. He laid out a few tiers, describing a “speed” tier that did a lot of work on high-volume streams and a “batch” tier that did its work in a more offline manner. The blend of approaches was an interesting teaser into how Big Stream challenges can be handled. Gnip’s own infrastructure has had to address these challenges, of course, and we went into more detail in our “Expanding The Twitter Firehose” post a while back.
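
As a rough illustration of the speed/batch split described above, here is a minimal Python sketch. It is not BackType’s or Gnip’s implementation; the class and method names (`TieredCounter`, `on_activity`, `run_batch`) are hypothetical, and an in-memory list stands in for durable storage. It only shows the general idea of answering a query by combining an offline-computed view with a small incremental one.

```python
from collections import Counter


class TieredCounter:
    """Toy blend of a batch tier and a speed tier (all names are hypothetical)."""

    def __init__(self):
        self.archive = []            # stand-in for durable storage of raw activities
        self.batch_view = Counter()  # rebuilt offline from the whole archive
        self.speed_view = Counter()  # updated as each activity streams in

    def on_activity(self, key):
        # Speed tier: archive the activity and apply a cheap incremental update.
        self.archive.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch tier: recompute counts from everything archived so far,
        # then clear the speed view because the batch view now covers it.
        self.batch_view = Counter(self.archive)
        self.speed_view.clear()

    def count(self, key):
        # A query merges the two views.
        return self.batch_view[key] + self.speed_view[key]


tiers = TieredCounter()
for activity in ["gnip", "strata", "gnip"]:
    tiers.on_activity(activity)
tiers.run_batch()
tiers.on_activity("gnip")
print(tiers.count("gnip"))
```

In this toy version the batch run and the stream updates happen in one thread; a real system would have to handle activities that arrive while a batch job is still running.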


Big Stream handling occupies a good part of my brain. I’d like to see Big Data discussion start to unravel Big Stream challenges as well.

Q1 2011 Data Analysis Conferences: Strata and PAW

Our business is data. We stream social media conversation data to businesses that need it for their products and applications. Our customers, too, are in the business of data. And so, when important conferences about the business of data happen, we’re there. 

This quarter, we’re proud sponsors of two important data conferences:

Strata: Making Data Work
February 1-3, 2011 | Santa Clara, CA

Because the business of data is growing so tremendously, O’Reilly is launching a brand new conference to capture and explore this exploding industry: the Strata Conference. Strata focuses on the future of the data industry, and how we can turn what’s academically interesting about data analysis into insights that drive fundamental business decisions. Gnip is a Gold Sponsor for this seminal event, so please look for us there.
Save 25% on registration using discount code “str11fsd” at Strata: Making Data Work

Predictive Analytics World (PAW)
March 14-15, 2011 | San Francisco, CA

Predictive analysts look at the past and the present to predict future trends and events. (How cool is that? These people are professional future-predictors!) These professionals come from large and small companies and share their methods and learnings at the PAW conference. And because we think there’s a lot to be learned from social media data, Gnip will be a Bronze Sponsor for PAW SF this year and we hope to see you there as well.
Save 15% on the conference program with code “GNIP1115” or get a complimentary social networking pass using code “GNIPSF11Comp” at Predictive Analytics World SF

In other words… we’ll be enjoying plenty of sweet California sunshine and palm trees this quarter while we talk with experts about our favorite topic of all — data. Hope to see you there!

Our Travel Schedule: Come Meet Us!

Some days we feel like a rock band. We go “on tour” regularly for client or publisher meetings and to attend conferences relevant to social data collection. In recent months, we’ve been to New York and San Francisco and attended TWTRCon, Enterprise 2.0 and Web 2.0 Expo, to name a few. We like to meet our fans (er, I mean, our customers) in person, so here’s our upcoming travel schedule…


October 18-19, 2010 – Washington, DC
Predictive Analytics World Conference (more info)

It is billed as “the business-focused event for predictive analytics professionals, managers and commercial practitioners,” and we’ll be there! Drop us a note if you’re attending or if your business is in the DC area and you’d like to meet us.

November 17-18, 2010 – Denver, CO
Defrag Conference (more info)
We’re a sponsor for Defrag, and we’d love to see you there. Register for Defrag and enter code “sponsor15” for a 15% registration discount.

November 17-18, 2010 – NY, NY
Client Visit Trip
While some of us are at Defrag, our Director of Business and Strategy will be in the Big Apple. If you’re in the City and you’d like to meet up, just drop him a note.

We don’t have many business trips planned over the holidays, but we are already getting excited about a conference coming up in 2011…

February 1-3, 2011 – Santa Clara, CA
O’Reilly’s Strata Conference: The Business of Data (more info)
We’re so excited about a conference all about The Business of Data (that’s us!) that we’ll be a Strata sponsor. We hope to see lots of businesses interested in social media data analysis there, so we’d love to hear about it if you expect to attend.


Please drop us a note if we’ll be in your neck of the woods anytime soon. It will be our pleasure to meet you.