Gnip moved into our new office yesterday (other end of the block from our old office). The transition provided an opportunity for me to think about where we’ve been, and where we’re going.
We continue to grow, primarily on the engineering side. Check out our jobs page if you’re interested in working on a hard problem, with smart people, in a beautiful place (Boulder, CO).
We’ve built a serious chunk of back-end infrastructure that I’d break into two general pieces: “the bus”, and “the pollers.”
Our back-end moves large volumes of relatively small (usually <~3k bytes) chunks of data from A to B in a hurry. Data is “published” into Gnip, we do some backflips with it, then spit it out the other side to consumers.
Our efforts to get Publishers to push directly into Gnip didn’t pan out the way we initially planned. As a result we had to change course and acquire data ourselves. The bummer here was that we set out on an altruistic mission to relieve the polling pain that the industry has been suffering from, but were met with such inertia that we didn’t get the coverage we wanted. The upside is that building polling infrastructure has allowed us to control more of our business destiny. We’ve gone through a few iterations on our approach to polling, from complex job scheduling and systems that “learn” & “adapt” to their surroundings, to dirt simple, mindless grinders that ignorantly eat APIs/endpoints all day long. We’re currently slanting heavily toward simplicity in the model. The idea is to take learnings from the simple model over time, and feed them into abstractions/re-factorings that make the system smarter.
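To give a feel for the simple end of that spectrum, here is a minimal sketch of a “mindless grinder.” It is an illustration only, not our production code: grind_once, fetch, publish, and seen are hypothetical names, and the fetch/publish callables are assumed to be supplied by the caller.

```python
import hashlib


def grind_once(endpoints, fetch, publish, seen):
    """Poll every endpoint once, in order, and publish payloads that changed.

    endpoints: iterable of endpoint identifiers (e.g. URLs)
    fetch:     hypothetical callable, fetch(url) -> bytes
    publish:   hypothetical callable, publish(url, payload)
    seen:      dict of url -> digest of the last payload published (mutated)
    """
    published = 0
    for url in endpoints:
        try:
            payload = fetch(url)
        except Exception:
            # No retries, no backoff, no learning: just move on.
            continue
        digest = hashlib.sha1(payload).hexdigest()
        if seen.get(url) == digest:
            continue  # nothing new since the last pass
        seen[url] = digest
        publish(url, payload)
        published += 1
    return published
```

Calling grind_once in a loop with a fixed sleep is about as dumb as a poller gets, which is the point: anything smarter (per-endpoint schedules, adaptive rates) can be layered on later once the simple version has shown where the pain is.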
We’re still in the cloud. Amazon’s EC2/S3 products have been a solid, highly flexible framework for us (albeit not necessarily the most cost effective when your CPU utilization isn’t in the 90%+ range per box); hats off to those guys.
“The Polling Problem”
It’s been great to see the industry wake up and acknowledge “the polling problem” over the past year. SUP (Simple Update Protocol) popped up to provide more efficient polling for systems that couldn’t, or wouldn’t, move to an event-driven model. It provides a compact change-log for pollers: you poll the change-log, and then go do heavier polls for only the stuff that has changed. PubSubHubbub popped up to provide the framework for a distributed Gnip (though lacking inherent normalization). A combination of polling and events spread across nodes allows for a more decentralized approach.
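The change-log idea is easy to sketch. The example below is a rough illustration, not a SUP client: it assumes the change-log is a JSON document with an "updates" list of [feed_id, update_id] pairs (loosely modeled on SUP), and changed_feeds, subscriptions, and last_seen are hypothetical names.

```python
import json


def changed_feeds(changelog_json, subscriptions, last_seen):
    """Return URLs of subscribed feeds that the change-log says have changed.

    changelog_json: JSON text with an "updates" list of [feed_id, update_id]
    subscriptions:  dict of feed_id -> feed URL we care about
    last_seen:      dict of feed_id -> last update_id processed (mutated)
    """
    changelog = json.loads(changelog_json)
    to_fetch = []
    for feed_id, update_id in changelog.get("updates", []):
        if feed_id not in subscriptions:
            continue  # a feed we do not follow
        if last_seen.get(feed_id) == update_id:
            continue  # we already handled this update
        last_seen[feed_id] = update_id
        url = subscriptions[feed_id]
        if url not in to_fetch:
            to_fetch.append(url)  # only these get the heavier poll
    return to_fetch


if __name__ == "__main__":
    subscriptions = {
        "a1": "http://example.com/joe.atom",
        "b2": "http://example.com/jane.atom",
    }
    last_seen = {}
    log = json.dumps({"updates": [["a1", "u1"], ["zz", "u9"]]})
    print(changed_feeds(log, subscriptions, last_seen))
    print(changed_feeds(log, subscriptions, last_seen))
```

One cheap poll of the change-log replaces a heavy poll of every feed; only the feeds it names get fetched in full. In the usage above, the second call returns nothing because the same update has already been handled.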
The Activity Streams initiative grew legs and is walking. As with any “standards” (or “standards-like”) initiative, things are only as good as their adoption. Building ideas in a silo without users makes for a fun exercise, but not much else. Uptake matters, and MySpace and Facebook (among many other smaller initiatives) have bitten off chunks of Activity Streams, and that’s a very big, good sign for the industry. Structural and semantic consistency matters for applications digesting a lot of information. Gnip provides highly structured and consistent data to its consumers via gnip.xsd.
In order to meet its business needs, and to adapt to the constantly moving industry around it, Gnip has adjusted its approach on several fronts. We moved to incorporate polling. We understand that there is more than one way of doing things, and will incorporate SUP and PubSubHubbub into our framework. Doing so will make our own polling efforts more effective, and will also give us more flexibility in how we deliver data to our consumers. While normalized data is nice for a large category of consumers, there is a large tier of customers that doesn’t need, or want, heavy normalization. Opaque message flow has significant value as well.
We set out to move mind-boggling amounts of information from A to B, and we’re doing that. Some of the nodes in the graph are shifting, but the model is sound. We’ve found there are primarily two types of data consumers: high-coverage of a small number of sources (“I need 100% of Joe, Jane, and Mike’s activity”), and “as high as you can get it”-coverage of a large number of sources (“I don’t need 100%, but I want very broad coverage”). Gnip’s adjusted to accommodate both.
We’ve had to shift our resources to better focus on the paying segments of our audience. We initially thought “life-stream aggregators” would be our biggest paying customer segment; however, data/media analytics firms have proven to be a significant segment as well. Catering to the customers who tell you “we have budget for that!” makes good business sense, and we’re attacking those opportunities.