Gnip Inc.

888-777-7405

Company Blog

Beta 2 Technical Update

Posted on March 26, 2009 by Jud Valeski, Co-Founder and CEO


api.gnip.com has been up for a couple of weeks now and I thought I’d take a moment to update folks on how things are going.

One thing I’ll note early is that our original production environment remains our on-call monitored environment. While we’re honing api.gnip.com, its monitoring does not “page” us when there are issues. Plenty of internal alarms sound when things crater, but the “on-call” person doesn’t get paged.

The three new major components in this Beta:

  1. New Schema
  2. Polling Infrastructure
  3. Data Normalization

New Schema

We’ve vastly improved our schema. Now Gnip offers normalized meta-data across Publishers/services. We’re striving to minimize the guesswork you have to go through in order to display/process activities in your application. We digested our community feedback, and out popped the new schema. So far so good. A few minor issues have been brought to our attention, and we’ll address those in the next rev of the schema.

Polling Infrastructure

This one’s been fun. Along the road to “delivering the web’s data,” we realized we were going to need to acquire some of the data ourselves. Some Publishers aren’t interested in PUSHing to Gnip yet; however, demand for their data remains strong. As a result, we’ve embarked on our own polling infrastructure to ensure our Consumers’ needs are met (visit gnip.uservoice.com and tell us which services/data sources/Publishers you’d like to see Gnip support next). We’ve built a traditional job scheduling/queuing model to poll variable URL endpoints for data. Replete with various back-off algorithms and rate-limiting, the system grabs activities out on the network, normalizes the results, and publishes them back into Gnip’s PubSub API.
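To make the scheduling/queuing idea concrete, here is a minimal sketch of a polling queue with exponential back-off. It is not Gnip's implementation; the names `PollJob` and `PollScheduler`, the 60-second base interval, and the one-hour cap are all assumptions chosen for illustration, and rate-limiting is left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PollJob:
    # Only next_run takes part in ordering, so the heap sorts by due time.
    next_run: float
    url: str = field(compare=False)
    base_interval: float = field(compare=False, default=60.0)
    delay: float = field(compare=False, default=60.0)


class PollScheduler:
    """Min-heap of polling jobs ordered by due time (hypothetical sketch)."""

    def __init__(self, max_delay=3600.0):
        self.max_delay = max_delay
        self._heap = []

    def add(self, url, now, interval=60.0):
        # A new endpoint is due immediately.
        heapq.heappush(self._heap, PollJob(now, url, interval, interval))

    def due(self, now):
        """Pop and return every job whose due time has passed."""
        jobs = []
        while self._heap and self._heap[0].next_run <= now:
            jobs.append(heapq.heappop(self._heap))
        return jobs

    def reschedule(self, job, now, got_data):
        # Reset to the base interval when the poll returned data; otherwise
        # double the delay, capped at max_delay (exponential back-off).
        if got_data:
            job.delay = job.base_interval
        else:
            job.delay = min(job.delay * 2, self.max_delay)
        job.next_run = now + job.delay
        heapq.heappush(self._heap, job)
```

A worker loop would call `due(now)`, fetch each job's URL, and hand the job back through `reschedule` with the outcome. Passing `now` in explicitly, rather than reading the clock inside the scheduler, keeps the sketch easy to test.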

We’re plowing through bugs and the challenges that come with job scheduling (starvation, ordering, prioritization, fairness) at scale. We long for some of the linear, relatively simple, polling architectures some of our partners have built. Building polling infrastructure for a few hundred thousand endpoints is one thing; building polling infrastructure for a few hundred million endpoints is quite another.
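For readers unfamiliar with starvation in a priority queue, one textbook mitigation is "aging": a job's effective priority rises the longer it waits. The sketch below illustrates that general technique only; it is not a description of what Gnip runs, and the dict keys (`priority`, `enqueued_at`) and the `aging_rate` value are made up for the example.

```python
def pick_next(jobs, now, aging_rate=0.1):
    """Return the job with the highest effective priority.

    Effective priority = static priority + aging_rate * seconds waited,
    so a long-waiting low-priority job eventually outranks fresh
    high-priority ones instead of starving behind them.
    """
    return max(
        jobs,
        key=lambda job: job["priority"] + aging_rate * (now - job["enqueued_at"]),
    )
```

With `aging_rate=0.1`, a priority-1 job that has waited 100 seconds scores 11 and beats a priority-10 job that was just enqueued. The linear scan is fine for a sketch but would need a smarter structure at hundreds of millions of endpoints.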

Until we resolve these issues you may experience intermittent results from Polled Publishers in Gnip. Things can go from working smoothly to sporadic gaps in data. Bottom line is we have starvation issues we’re working to address. Bear with us.

Data Normalization

We’re vastly expanding the breadth of our Publisher offering by leveraging our Polling Infrastructure. For Polled Publishers we’ve built a layer that translates from arbitrary feed/API formats into Gnip XML. While our core is written in Java, we leverage Python’s Universal Feed Parser (UFP) to ingest XML and map fields to Gnip XML. When UFP can’t handle things (even it has its limitations), we punt out to text processing with our own simple mapping language (with regex support). Our investment in this layer has highlighted two things really well. One, sadly, even with the RSS/Atom/XML’ization of data, severe challenges remain with data handling on the web (rampant inconsistencies remain). Two, Gnip’s value proposition continues to grow. We handle the headache of this kind of mapping/parsing once, and many of you get to reap the rewards.
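As a rough illustration of the regex fallback idea, here is a minimal sketch that maps raw text to normalized fields using one pattern per field. The rule table, the field names (`actor`, `url`, `title`), and the `map_fields` helper are hypothetical; they do not reflect Gnip's actual mapping language or the Gnip XML schema.

```python
import re

# Hypothetical mapping rules: normalized field name -> regex with one
# capture group that extracts the value from the raw feed text.
RULES = {
    "actor": r"<author>\s*(.*?)\s*</author>",
    "url": r"<link>\s*(.*?)\s*</link>",
    "title": r"<title>\s*(.*?)\s*</title>",
}


def map_fields(raw, rules=RULES):
    """Apply each rule to the raw text and collect the captured values.

    Fields whose pattern does not match are simply left out, which is
    one way to tolerate inconsistent feeds.
    """
    out = {}
    for name, pattern in rules.items():
        match = re.search(pattern, raw, re.DOTALL | re.IGNORECASE)
        if match:
            out[name] = match.group(1)
    return out
```

Keeping the rules in a plain table means a new Publisher needs a new set of patterns rather than new code, which is presumably the appeal of a small mapping language in the first place.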

AWS/EC2

We remain exclusively in “the Cloud.” Some thoughts on our 12 month experience with it…

  • Cross-instance latency remains high (say 5-10x higher than average non-Xen local interconnects). While we’ve been able to build a “real-time” system regardless, it’s certainly gotten in our way at times.
  • Inbound packet transmission “into the Cloud” is slow. It’s not uncommon for sustained, large uploads (say moving builds onto EC2 instances) to average 70 kB/s. This is lame and needs to be fixed by Amazon. Happy to pay more for throughput, just give me the option.
  • Dedicated instances. One way to look at EC2 performance is that the cheaper/smaller the instance, the more issues you’re going to have. Basically, Amazon heavily slices and dices the cheaper boxes; the cheaper the box, the more loaded it can be with other apps. The larger instances (e.g. XL) equate to dedicated hardware just for you; your own CPUs, your own mem, your own NIC.
  • I get asked “how many times have you lost instances?” all the time. The answer is “rarely enough that it’s never on anyone’s mind and I can’t remember the last time it happened.” Maybe 3 instances out of 50 continuously running over the course of 12 months.
  • We continue to use the free version of RightScale to manage our deployments.

Please let us know what features you’d like to see going forward, or if you have questions. info@gnip.com

Categories: Development

Tagged: amazon, api, api rate limiting, api rate limits, atom, aws, data consumers, data feeds, data normalization, data publishers, data sources, ec2, gnip, java, metadata, normalization, poll, polling, push, python, realtime data, rightscale, rss, xml


Author: Jud Valeski, Co-Founder and CEO

Engineer in practice and neurological layout, CEO by trade. Married parent, mountain biker, runner, foodie, and a Boulderite.
  • @jvaleski


Copyright © 2013 Gnip, inc.

Gnip 1050 Walnut St, Suite 115 Boulder, CO 80302 info@gnip.com 888-777-7405