Thanks to everyone who has taken the time to visit the new community voting web application at http://gnip.uservoice.com to help us prioritize what services to integrate.
Here is a look at the top of the list:

Now that we have the beta of the new Gnip schema up in the new demo system, we are ready to roll out something we hope everyone in our partner and user community will be excited to participate in.
Today we have turned on a new web application that we are hosting at UserVoice, which specializes in hosting customer feedback forums. The new Gnip Forum gives anyone and everyone the chance to tell us which social media and business services we should integrate with, and in what priority.
You heard us, we want you to tell us what to do! Just go to http://gnip.uservoice.com. What you will find is a list that, as of right now, includes 352 different services in priority order based on the votes for those services. Just create an account and decide how to allocate your votes. Also, if there is a service we did not include, feel free to add it and tell us the URL, and it will be added automatically.
The list also includes status labels that will allow people to track our progress. There are a few new services in progress right now and after we complete the current beta on the new schema we expect to be adding publishers at a good clip.
We have flipped the switch to allow people to start working with our new schema at http://demo.gnip.com. In addition to standing up the site with the updated schema, we have moved over all the existing accounts from the current system, so your existing gnipcentral.com username and password also get you access to the demo system.
The following publishers are in the demo system and we plan to add more over the course of the next month during the beta period.
We will be posting additional examples of how we mapped these social media services to the updated Gnip Schema in the Gnip Community and will link to those examples from the blog as well as point to them in our standard release newsletter that will go out later today.
Based on the feedback we have received there is a lot of interest in the enhanced metadata in the new schema that can be used to support additional types of URLs, multiple tags, geo data, and rich media. Now, go grab some data and do something cool with it.
Today’s general data standards are akin to yesterday’s HTML/CSS browser support standards. The first rev of Gecko (not to be confused w/ the original Mosaic/Navigator rendering engine) at Netscape was truly standards compliant in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until “quirks-mode” became status quo. The abyss of broken data that machines, and humans, generate, eclipsed web pages back then, and it’s an ever present issue in the ATOM/RSS/XML available today.
Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. While the data is ugly to us, we normalize it to make our Customers’ lives better. Consumer facing aggregators (Plaxo/FF) beautify the picture for their display layers. Gnip beautifies the picture for its data consumption API. Cleaning up the mess that exists on the network today has been an eye opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We’re able to work closely with said producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling) it’s a rat’s nest of unstructured, cobbled-together, ill-formatted, and poorly-encoded/escaped data.
XML has provided self describing formats and structure, but it ends there. Thousands of pounds of wounded data show up on Gnip’s doorstep each day, and that’s where Gnip’s normalization heavy lifting work comes into play. I thought I’d share some of the more common bustage we see, along with a little commentary around the category of problem.
You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.
If everyone would clean up their data by the end of the day, that’d be great. Thanks.

As software evolves, you constantly adjust your view from forest perspective, to tree perspective. Recently we’ve been stepping back from our Polling infrastructure to get the forest view. Today we spent the afternoon revisiting initial requirements, considering new requirements, and doing whiteboard math around sweet spots on curves. The conversation reminded me of a meeting we had with Bob Pasker nearly ten months ago.
We met with Bob as part of an investor’s due diligence process; Bob prodded us to a) opine on whether or not we were solving a worthy problem, and b) opine on whether or not we, as a team, were worth the investment risk (the investor gave us the money). Within sixty seconds of sitting down with him, he stopped our pitch and blurted out “you’re a queue!” In so many ways he was right. I’ve always thought about Gnip as a series of pipes that yield a message bus, but today we were visited by the queuing theory ghosts of implementations past. If you look at our architecture from a distance, we have several queues in the system. If you’ve ever worked with a multi-queue system, you’ve grappled with some of our challenges.
Managing the flow of data, in “real-time,” from point A, to point N, in a reliable manner, means you’re queuing it up along the way at various points. At each processing stage, you queue the data, and digest it (usually FIFO for Gnip). The particular challenges we faced today were around queuing constraints in our Polling model. Linear, in process, FIFO queues are relatively straightforward (though complex in and of themselves when “pushes” outstrip “pops” for long durations). However, when you start prioritizing queue entries (changing the natural FIFO order), and adding rate limitations on “pops,” things get more challenging. When you start distributing queue “pop” consumption (e.g. worker threads) across NICs, you really start to feel the joy.
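To make the prioritized, rate-limited "pop" idea concrete, here is a minimal sketch in Python. It is not Gnip's implementation; the class name, the `max_pops_per_sec` parameter, and the injectable `clock` are all assumptions made for illustration. Entries with the same priority still come out in FIFO order, and a pop that arrives before the rate limit allows simply returns nothing.

```python
import heapq
import itertools
import time


class RateLimitedPriorityQueue:
    """Hypothetical prioritized queue whose pops are rate limited."""

    def __init__(self, max_pops_per_sec, clock=time.monotonic):
        self._heap = []
        # Sequence number breaks ties so equal priorities stay FIFO.
        self._seq = itertools.count()
        self._min_interval = 1.0 / max_pops_per_sec
        self._clock = clock
        self._next_allowed = clock()

    def push(self, item, priority=0):
        # Lower number means higher priority (an assumption of this sketch).
        heapq.heappush(self._heap, (priority, next(self._seq), item))

    def pop(self):
        """Return the next item, or None if empty or popping too soon."""
        now = self._clock()
        if not self._heap or now < self._next_allowed:
            return None
        self._next_allowed = now + self._min_interval
        _, _, item = heapq.heappop(self._heap)
        return item
```

A single process can get away with one clock and one heap like this. Once the consumers doing the popping live on different machines, both the ordering and the rate limit become shared state, which is where the real difficulty starts.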
Gnip has a typical scheduler/job processing pattern in our Polling infrastructure. We push jobs on a queue in a prioritized manner, then off-box workers pop those jobs off the queue for processing. Currently those job processors are “dumb”; they know nothing other than how to process the job they just acquired. Getting that distributed grid of independently aware/scheduled job processors to coordinate and have shared state is the fun challenge du jour at Gnip.
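The scheduler/worker pattern described above can be sketched roughly as follows. This is a toy, in-process stand-in that uses threads where the real workers are off-box; `run_jobs`, `handler`, and `num_workers` are hypothetical names, not anything from Gnip's code. Note that each worker is "dumb" in exactly the sense described: it pops a job, processes it, and knows nothing about what the other workers are doing.

```python
import queue
import threading


def run_jobs(jobs, handler, num_workers=3):
    """Process (priority, payload) jobs with a pool of 'dumb' workers.

    Lower priority numbers are popped first (an assumption of this sketch).
    """
    job_queue = queue.PriorityQueue()
    for seq, (priority, payload) in enumerate(jobs):
        # seq keeps FIFO order among jobs that share a priority.
        job_queue.put((priority, seq, payload))

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                _, _, payload = job_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = handler(payload)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With one worker the results come back in strict priority order; with several, completion order depends on scheduling, which hints at why coordinating independent processors is harder than it looks.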
Another general queuing pattern challenge arises when queues become overwhelmed due to a pause in processing downstream. Whether the pause is caused by a node outage, or a flood in “pushes,” queue backup in an automated job creation environment can yield phenomenal data buildup. Imagine inbound data from a given Publisher arriving at a rate of 30 messages/sec. For every minute of processing that doesn’t occur, 1,800 messages back up on the queue. General traffic ebb-and-flow isn’t such a big deal as you build your system to grow/shrink accordingly, but when there are major issues (a network outage downstream, a bug, a crash, whatever), physical hardware limitations come into view. Trade-offs between data-loss (dropping data to keep things alive) and chaining “store and forward” queues several levels deep have to be made. Operationally at Gnip, we’ve had to drop data at times, and at others we’ve had to build additional queuing layers where queue backup is more common.
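The backlog arithmetic and the "drop data to keep things alive" option can be illustrated with a small Python sketch. The 30 messages/sec rate and the one-minute stall come from the example above; the buffer capacity of 1,000 and the drop-oldest policy are assumptions picked purely for illustration, not a description of what Gnip actually runs.

```python
from collections import deque


def backlog(rate_per_sec, stalled_seconds):
    """Messages that pile up while downstream processing is paused."""
    return rate_per_sec * stalled_seconds


class DropOldestBuffer:
    """Hypothetical bounded queue that discards the oldest entry when full."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts from the left on overflow.
        self._items = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # count what we are about to lose
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    pending = backlog(rate_per_sec=30, stalled_seconds=60)
    buffer = DropOldestBuffer(capacity=1000)
    for message_id in range(pending):
        buffer.push(message_id)
    print(pending, len(buffer), buffer.dropped)
```

Dropping the oldest entries favors freshness over completeness. The alternative named above, chaining store-and-forward queues, trades that data loss for more moving parts and more storage.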
Just another day in the life of shuffling data around the network.