api.gnip.com has been up for a couple of weeks now and I thought I’d take a moment to update folks on how things are going.
One thing I’ll note early is that our original production environment remains our on-call monitored environment. While we’re honing api.gnip.com, its monitoring does not “page” us when there are issues. Plenty of internal alarms sound when things crater, but the “on-call” person doesn’t get paged.
The three new major components in this Beta:
- New Schema
- Polling Infrastructure
- Data Normalization
New Schema
We’ve vastly improved our schema. Now Gnip offers normalized meta-data across Publishers/services. We’re striving to minimize the guesswork you have to go through in order to display/process activities in your application. We digested our community feedback, and out popped the new schema. So far so good. A few minor issues have been brought to our attention, and we’ll address those in the next rev of the schema.
Polling Infrastructure
This one’s been fun. Along the road to “delivering the web’s data,” we realized we were going to need to acquire some of the data ourselves. Some Publishers aren’t interested in PUSHing to Gnip yet; however, demand for their data remains strong. As a result, we’ve embarked on our own polling infrastructure to ensure our Consumers’ needs are met (visit gnip.uservoice.com and tell us which services/data sources/Publishers you’d like to see Gnip support next). We’ve built a traditional job scheduling/queuing model to poll variable URL endpoints for data. Replete with various back-off algorithms and rate-limiting, the system grabs activities out on the network, normalizes the results, and publishes them back into Gnip’s PubSub API.
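To make the scheduling/queuing idea concrete, here is a minimal sketch in Python of a due-time queue with exponential back-off and a per-tick rate limit. It is an illustration under assumptions, not Gnip’s actual implementation: the names PollJob, PollScheduler, and tick, the doubling back-off, and the default intervals are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PollJob:
    # Jobs are ordered only by when they are next due (hypothetical model).
    next_due: float
    url: str = field(compare=False)
    base_interval: float = field(default=60.0, compare=False)
    max_interval: float = field(default=3600.0, compare=False)
    failures: int = field(default=0, compare=False)

    def reschedule(self, now, ok):
        # Reset to the base interval on success; double the delay
        # (up to max_interval) for each consecutive failure.
        self.failures = 0 if ok else self.failures + 1
        delay = min(self.base_interval * (2 ** self.failures), self.max_interval)
        self.next_due = now + delay
        return delay


class PollScheduler:
    def __init__(self, max_polls_per_tick):
        self.queue = []  # min-heap keyed on next_due
        self.max_polls_per_tick = max_polls_per_tick  # crude rate limit

    def add(self, job):
        heapq.heappush(self.queue, job)

    def tick(self, now, fetch):
        """Poll due jobs, most overdue first, up to the per-tick limit.

        fetch(url) is a caller-supplied function returning True on success.
        """
        polled = []
        while (self.queue
               and self.queue[0].next_due <= now
               and len(polled) < self.max_polls_per_tick):
            job = heapq.heappop(self.queue)
            ok = fetch(job.url)
            job.reschedule(now, ok)
            heapq.heappush(self.queue, job)
            polled.append(job.url)
        return polled
```

Jobs that are due but do not fit under the rate limit stay at the front of the queue for the next tick. When more jobs are due than the limit allows tick after tick, that backlog is where starvation and fairness problems start to show up.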
We’re plowing through bugs and the challenges that come with job scheduling (starvation, ordering, prioritization, fairness) at scale. We long for some of the linear, relatively simple, polling architectures some of our partners have built. Building polling infrastructure for a few hundred thousand endpoints is one thing; building polling infrastructure for a few hundred million endpoints is quite another.
Until we resolve these issues you may experience intermittent results from Polled Publishers in Gnip. Things can go from working smoothly to showing sporadic gaps in data. Bottom line is we have starvation issues we’re working to address. Bear with us.
Data Normalization
We’re vastly expanding the breadth of our Publisher offering by leveraging our Polling Infrastructure. For Polled Publishers we’ve built a layer that translates from arbitrary feed/API formats into Gnip XML. While our core is written in Java, we leverage Python’s Universal Feed Parser to ingest XML and map fields to Gnip XML. When UFP can’t handle things (even it has its limitations), we punt out to text processing with our own simple mapping language (with regex support). Our investment in this layer has highlighted two things really well. One, sadly, even with the RSS/ATOM/XML’ization of data, severe challenges remain with data handling on the web (rampant inconsistencies remain). Two, Gnip’s value proposition continues to grow. We handle the headache of this kind of mapping/parsing once, and many of you get to reap the rewards.
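As a rough illustration of the regex fallback idea, the sketch below maps fields out of raw feed text with one regular expression per target field and emits a small XML fragment. Everything in it is an assumption for illustration: the mapping format, the field names (actor, at, url), and the activity element are hypothetical and do not reflect Gnip’s real mapping language or schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping: target field name -> regex with one capture group.
MAPPING = {
    "actor": r"<author>\s*(.*?)\s*</author>",
    "at": r"<pubDate>\s*(.*?)\s*</pubDate>",
    "url": r"<link>\s*(.*?)\s*</link>",
}


def normalize(raw, mapping):
    """Pull each mapped field out of raw text; skip fields that don't match."""
    fields = {}
    for name, pattern in mapping.items():
        match = re.search(pattern, raw, re.DOTALL | re.IGNORECASE)
        if match:
            fields[name] = match.group(1)
    return fields


def to_xml(fields):
    """Serialize the normalized fields as a flat XML element."""
    activity = ET.Element("activity")
    for name, value in fields.items():
        ET.SubElement(activity, name).text = value
    return ET.tostring(activity, encoding="unicode")


raw_item = "<item><author> alice </author><link>http://example.com/1</link></item>"
print(to_xml(normalize(raw_item, MAPPING)))
```

Fields that are missing from the source are simply left out here. A real mapping layer would also have to decide what to do with malformed dates, entity encoding, and the other inconsistencies mentioned above.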
AWS/EC2
We remain exclusively in “the Cloud.” Some thoughts on our 12-month experience with it…
- Cross-instance latency remains high (say 5-10x higher than average non-Xen local interconnects). While we’ve been able to build a “real-time” system regardless, it’s certainly gotten in our way at times.
- Inbound packet transmission “into the Cloud” is slow. It’s not uncommon for sustained, large uploads (say moving builds onto EC2 instances) to average 70kB/s. This is lame and needs to be fixed by Amazon. Happy to pay more for throughput, just give me the option.
- Dedicated instances. One way to look at EC2 performance is that the cheaper/smaller the instance, the more issues you’re going to have. Basically, Amazon heavily slices and dices the cheaper boxes; the cheaper the box, the more loaded it can be with other apps. The larger instances (e.g. XL) equate to dedicated hardware just for you: your own CPUs, your own mem, your own NIC.
- I get asked “how many times have you lost instances?” all the time. The answer is “rarely enough that it’s never on anyone’s mind and I can’t remember the last time it happened.” Maybe 3 instances out of 50 continuously running over the course of 12 months.
- We continue to use the free version of RightScale to manage our deployments.
Please let us know what features you’d like to see going forward, or if you have questions. info@gnip.com