Clusters & Silos

Gnip is nearing the one-year anniversary of its 2.0 product. We reset our direction several months ago, and as part of that shift we completely changed our architecture. I thought I'd write a bit about that experience.

Gnip 1.0

Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system was composed of a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach, as it obfuscated the ultimate consumer of their data; they wanted transparency.

Our initial motivation for this architecture was alleviating load pain on the Publishers. "Real-time" APIs were all the rage, and degraded real-time delivery was due in part to load on the Publishers' APIs. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy, were we wrong on that count. While Publishers love to complain about the load on their APIs, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification/PubSub solutions (SIP, SUP, PubSubHubbub). However, most simply scaled horizontally and threw money at the problem. Twitter has shone a light on streaming HTTP (or whatever you want to call it… there are so many monikers), which is "as good as it gets" (leaving protocol buffers and compressed HTTP streams as simply optimizations to the model). I digress.

The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers' sides. As a data-integration middle-man, Gnip couldn't afford to have antagonistic relations with data sources.
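To make the fan-out idea concrete, here's a minimal sketch (in Python) of the 1.0 shape: one connection to a publisher's centralized stream, with each activity copied onto a per-customer queue. The URL, customer names, and queue-based delivery are hypothetical illustration, not our actual implementation.

```python
# Hypothetical sketch of the 1.0 model: a single centralized publisher
# stream, fanned out to many consuming customers by the middle-man.
import queue
import threading

import requests  # assumes the `requests` library is installed

# One queue per consuming customer (names are invented placeholders).
CUSTOMER_QUEUES = {name: queue.Queue() for name in ("cust_a", "cust_b", "cust_c")}

def consume_publisher_stream(url):
    """Read the single upstream stream and fan each activity out."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue  # skip keep-alive newlines
            for q in CUSTOMER_QUEUES.values():
                q.put(line)  # every consuming customer gets a copy

if __name__ == "__main__":
    threading.Thread(
        target=consume_publisher_stream,
        args=("https://stream.example-publisher.com/activities",),  # hypothetical URL
        daemon=True,
    ).start()
```

The appeal, and the problem, are both visible here: the Publisher sees one connection from the aggregator, not the many customers queued up behind it.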

Gnip 2.0

Literally overnight, we walked away from further construction on our 1.0 platform. We still had paying customers on it, however, so we operated it for several months, migrating everyone we could to 2.0 before ultimately shutting it down. Gnip 2.0 counter-intuitively departed from a clustered environment and instead gave each consuming customer explicit, transparent integrations with Publishers, all via standalone instances of the software running on dedicated virtualized hardware (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials for the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they configure it with credentials they obtain themselves from the Publisher. Publishers have full transparency into, and control of, who's using their data.
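For contrast with the 1.0 sketch above, here's the same hypothetical treatment of the 2.0 shape: one standalone instance per customer, authenticated with credentials the customer obtained directly from the Publisher. The endpoint, class, and credential names are invented for illustration.

```python
# Hypothetical sketch of the 2.0 model: each customer runs their own
# instance, configured with their own Publisher-issued credentials.
import requests

class PublisherIntegration:
    """One standalone 2.0-style instance, serving exactly one customer."""

    def __init__(self, publisher_url, customer_key, customer_secret):
        # The credentials belong to the customer, obtained directly from
        # the Publisher; never shared, Gnip-owned credentials.
        self.publisher_url = publisher_url
        self.auth = (customer_key, customer_secret)  # HTTP Basic auth pair

    def poll(self):
        """Fetch the latest activities on behalf of this one customer."""
        resp = requests.get(self.publisher_url, auth=self.auth, timeout=30)
        resp.raise_for_status()
        return resp.json()

# One instance per customer, each on its own virtualized host (EC2 in
# our case); a failure here is contained to this single customer.
instance = PublisherIntegration(
    "https://api.example-publisher.com/activities",  # hypothetical endpoint
    customer_key="CUSTOMER_SUPPLIED_KEY",
    customer_secret="CUSTOMER_SUPPLIED_SECRET",
)
```

The point of the design is that, because the credentials are the customer's own, the Publisher's logs show exactly who is consuming; that's the transparency Publishers wanted all along.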

The result is an architecture that doesn't leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit fell out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I'd argue there are actually relatively few of them out there), the reality is that clusters are hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, all of your customers are exposed to one big SPOF. If something cracks in the system, all of your customers feel that pain. On the flip side, siloed instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are few and far between. So, the cost of not leveraging the hardware and software we're generally inclined to architect around is indeed higher, but a simplified system has its benefits, to be sure.

We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Building two such significant architectures under the same roof has been a fascinating thing to experience. The pros and cons of each are many. Where you wind up in your system is an interesting function of your technical propensities as well as your business constraints. One size never fits all.