Gnip had a great time at O’Reilly’s Strata 2011 conference in California last week. We signed up several months ago as a big sponsor without knowing exactly how things were going to come together. The bet paid off: Strata was a huge success for us and for the industry at large. We were blown away by the relevance of the topics and by the quality of the attendees and the discussions they sparked. I was amazed at how much knowledge everyone now has about analyzing and processing big data sets. Technologies that were new and immature just a few years ago are now baked into the ecosystem and have become tools of the trade (e.g. Hadoop). All very cool to see.
That said, there remains a distinct gap between big data set handling and high-volume, real-time data stream handling. We’ve come a long way in processing monster data sets in batch or offline modes, but we have a long way to go when it comes to large streaming data sets. Hilary Mason, of bit.ly, hit this point squarely in her “What Data Tells Us” talk at Strata. With open source tools we can fan out ungodly amounts of processing… like piranha on fresh meat. However, blending that processing, and high-latency transactions, into real-time streams of thousands of activities per second is not nearly as refined or well understood. Frankly, I’m shocked at the number of engineers I run into who simply don’t understand asynchronous programming at all.
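To make the asynchronous point concrete, here is a minimal sketch in Python of folding a slow transaction into a stream of activities without stalling the stream. It is an illustration only, not a description of Gnip's infrastructure: `enrich`, `process_stream`, `run_demo`, the 50 ms latency, and the in-flight limit are all hypothetical names and numbers chosen for the example, and the slow transaction is simulated with a sleep.

```python
import asyncio
import time


async def enrich(activity: dict, latency: float = 0.05) -> dict:
    """Hypothetical high-latency transaction (say, a remote lookup).

    The latency is simulated with a non-blocking sleep.
    """
    await asyncio.sleep(latency)
    return {**activity, "enriched": True}


async def process_stream(activities, max_in_flight: int = 100):
    """Enrich every activity, keeping at most max_in_flight slow calls pending."""
    gate = asyncio.Semaphore(max_in_flight)

    async def handle(activity):
        # While this call waits on the slow transaction, the event loop
        # is free to start or finish work on other activities.
        async with gate:
            return await enrich(activity)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(handle(a) for a in activities))


def run_demo(count: int = 200):
    """Push `count` activities through and report the wall-clock time."""
    activities = [{"id": i} for i in range(count)]
    started = time.perf_counter()
    results = asyncio.run(process_stream(activities))
    elapsed = time.perf_counter() - started
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} activities in {elapsed:.2f}s")
```

Handled one at a time, 200 activities at 50 ms each would take roughly ten seconds; with the waits overlapped, the total is bounded by a couple of round trips. The semaphore is the design choice worth noticing: it caps how many slow calls are outstanding, so a burst in the stream cannot swamp whatever service sits behind the transaction.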
The night before the conference started, Pete Warden drove BigDataCamp @Strata, where Mike Montano from BackType gave a high-level overview of their infrastructure. He laid out a few tiers: a “speed” tier that did a lot of work on high-volume streams, and a “batch” tier that did its work in a more offline manner. The blend of approaches was an interesting teaser for how Big Stream challenges can be handled. Gnip’s own infrastructure has had to address these challenges, of course, and we went into some detail on them in our Expanding The Twitter Firehose post a while back.
Big Stream handling occupies a good part of my brain. I’d like to see the Big Data discussion start to unravel Big Stream challenges as well.