Copyright © 2010 Gnip, Inc.
Gnip makes it easy to build social media tracking tools.

As software evolves, you constantly shift between the forest view and the tree view. Recently we’ve been stepping back from our Polling infrastructure to get the forest view. Today we spent the afternoon revisiting initial requirements, considering new requirements, and doing whiteboard math around sweet spots on curves. The conversation reminded me of a meeting we had with Bob Pasker nearly ten months ago.
We met with Bob as part of an investor’s due diligence process; Bob prodded us to a) opine on whether we were solving a worthy problem, and b) opine on whether we, as a team, were worth the investment risk (the investor gave us the money). Within sixty seconds of sitting down with him, he stopped our pitch and blurted out “you’re a queue!” In so many ways, he was right. I’ve always thought about Gnip as a series of pipes that yield a message bus, but today we were visited by the queuing theory ghosts of implementations past. If you look at our architecture from a distance, we have several queues in the system. If you’ve ever worked with a multi-queue system, you’ve grappled with some of our challenges.
Managing the flow of data, in “real-time,” from point A to point N in a reliable manner means you’re queuing it up at various points along the way. At each processing stage, you queue the data and digest it (usually FIFO for Gnip). The particular challenges we faced today were around queuing constraints in our Polling model. Linear, in-process FIFO queues are relatively straightforward (though complex in and of themselves when “pushes” outstrip “pops” for long durations). However, when you start prioritizing queue entries (changing the natural FIFO order) and adding rate limitations on “pops,” things get more challenging. When you start distributing queue “pop” consumption (e.g. worker threads) across NICs, you really start to feel the joy.
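To make the prioritized, rate-limited “pop” idea concrete, here is a minimal in-process sketch in Python. It is not Gnip’s implementation; the class name `RateLimitedPriorityQueue`, the `min_interval` parameter, and the injected `clock` are all assumptions made for illustration.

```python
import heapq
import itertools
import time


class RateLimitedPriorityQueue:
    """Hypothetical priority queue that allows at most one pop per `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self._heap = []
        # The counter breaks ties so entries with equal priority stay FIFO.
        self._counter = itertools.count()
        self._min_interval = min_interval
        self._clock = clock  # injectable so the limit can be tested without sleeping
        self._last_pop = None

    def push(self, priority, item):
        # Lower numbers pop first; this overrides plain FIFO order.
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def pop(self):
        """Return the next item, or None if the queue is empty or the rate limit applies."""
        if not self._heap:
            return None
        now = self._clock()
        if self._last_pop is not None and now - self._last_pop < self._min_interval:
            return None  # caller has to come back later
        self._last_pop = now
        return heapq.heappop(self._heap)[2]
```

Even in this toy form, a consumer can no longer assume that a non-empty queue yields work, which is part of what makes the distributed version harder.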
Gnip has a typical scheduler/job processing pattern in our Polling infrastructure. We push jobs on a queue in a prioritized manner, then off-box workers pop those jobs off the queue for processing. Currently those job processors are “dumb”; they know nothing other than how to process the job they just acquired. Getting that distributed grid of independently aware/scheduled job processors to coordinate and have shared state is the fun challenge du jour at Gnip.
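The scheduler/worker pattern can be sketched in a few lines of Python. This is an illustration only: threads in one process stand in for off-box workers, and the function name `run_jobs` and the `(priority, name)` job shape are assumptions, not Gnip’s actual design.

```python
import queue
import threading


def run_jobs(jobs, worker_count=3):
    """Push (priority, name) jobs on a shared queue and let "dumb" workers drain it.

    Returns the job names in the order they were processed.
    """
    job_queue = queue.PriorityQueue()
    for seq, (priority, name) in enumerate(jobs):
        # seq keeps jobs with equal priority in submission order.
        job_queue.put((priority, seq, name))

    processed = []
    lock = threading.Lock()

    def worker():
        # A "dumb" worker: it knows only how to take a job and process it.
        while True:
            try:
                _, _, name = job_queue.get_nowait()
            except queue.Empty:
                return
            with lock:
                processed.append(name)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

The workers share nothing beyond the queue itself, so any coordination between them (rate limits per source, for example) would need state that this sketch deliberately leaves out.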
Another general queuing pattern challenge arises when queues become overwhelmed due to a pause in processing downstream. Whether the pause is caused by a node outage or a flood in “pushes,” queue backup in an automated job creation environment can yield phenomenal data buildup. Imagine inbound data from a given Publisher arriving at a rate of 30 messages/sec. For every minute of processing that doesn’t occur, 1,800 messages back up on the queue. General traffic ebb-and-flow isn’t such a big deal, since you build your system to grow and shrink accordingly, but when there are major issues (a network outage downstream, a bug, a crash, whatever), physical hardware limitations come into view. Trade-offs have to be made between data loss (dropping data to keep things alive) and chaining “store and forward” queues several levels deep. Operationally at Gnip, we’ve had to drop data at times, and at others we’ve had to build additional queuing layers where queue backup is more common.
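The backlog arithmetic and the “drop data to stay alive” side of the trade-off can be sketched as follows. This is a hedged illustration: `backlog`, `DropOldestBuffer`, and the choice to drop the oldest entries first are assumptions made for the example, not a description of what Gnip runs.

```python
from collections import deque


def backlog(rate_per_sec, stalled_seconds):
    """Messages that pile up while nothing is being popped."""
    return rate_per_sec * stalled_seconds


class DropOldestBuffer:
    """Hypothetical bounded queue that discards the oldest entry when full."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # count of entries sacrificed to stay within capacity

    def push(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts from the left on append
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None


# One stalled minute at 30 messages/sec against a buffer that holds 1,000.
buffer = DropOldestBuffer(capacity=1000)
for message_id in range(backlog(30, 60)):
    buffer.push(message_id)
```

A “store and forward” layer would take the other side of the trade-off: instead of counting drops, it would spill the overflow to another queue, at the cost of more hardware and more moving parts.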
Just another day in the life of shuffling data around the network.