Clusters & Silos

Aug 25, 2010

Gnip is nearing the one-year anniversary of our 2.0 product. We reset our direction several months ago. As part of that shift, we completely changed our architecture. I thought I’d write about that experience a bit.

Gnip 1.0

Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system comprised a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach, as it obfuscated the ultimate consumer of their data; they wanted transparency.

Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were the rage, and degraded real-time delivery was due in part to load on the Publisher’s API. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy, were we wrong on that count. While Publishers love to complain about the load on their APIs, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification-like/PubSub solutions (SIP, SUP, PubSubHubbub). However, most simply scaled horizontally and threw money at the problem. Twitter has shone a light on streaming HTTP (or whatever you want to call it… there are so many monikers), which is “as good as it gets” (leaving protocol buffers and compressed HTTP streams as simply optimizations to the model).

I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middle-man, Gnip couldn’t afford to have antagonistic relations with data sources.

Gnip 2.0

Literally overnight, we walked away from further construction on our 1.0 platform. We had paying customers on it, however, so we operated it for several months before ultimately shutting it down, after migrating everyone we could to 2.0. Gnip 2.0 counterintuitively departed from a clustered environment and instead started providing a consuming customer with explicit, transparent integrations with Publishers, all via standalone instances of the software running on standalone virtualized hardware instances (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials to the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they use credentials they obtain themselves, from the Publisher, to configure the instance. Publishers have full transparency into, and control of, who’s using their data.

The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I’d argue there are actually relatively few of them out there), the reality with clusters is that they’re hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, you have all of your customers leveraged against a big SPOF. If something cracks in the system, all of your customers feel that pain. On the flip side, silo’d instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are very few and far between. So, the cost of not leveraging hardware/software that we’re generally inclined to want to architect for and leverage is indeed higher, but a simplified system has its benefits to be sure.

We now find ourselves promoting Publisher integration best practices, and they advocate our usage. Having two such significant architectures built under the same roof has been a fascinating thing to experience. The pros and cons to each are many. Where you wind up in your system is an interesting function of what your propensity is technically, as well as what the business constraints are. One size never fits all.

I just spent a couple of days optimizing a customer’s data collection on a Gnip instance, for a specific social media data source API. It had been a while since I’d done this level of tuning, and it reminded me of just how many variables must be considered when optimally polling a source API for data.

Requests Per Second (RPS) Limits

Most services have a rate limit that a given IP address (or API key/login) cannot break. If you hit an endpoint too hard, the API backs you off and/or blocks you. Don’t confuse RPS with concurrent connections, however; they’re measured differently and each has its own limitations for a given API. In this particular case I was able to parallelize three requests because the total response time per request was ~3 seconds. The result was that a given IP address was not violating the API’s RPS limitations. Had the API been measuring concurrent connections, that would have been a different story.
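The arithmetic behind that parallelization can be sketched roughly as follows. This is only an illustration: the function names are made up, and the 1 request/second limit is an assumption inferred from the numbers above (three requests in flight, each taking about 3 seconds), not a documented limit of any particular API.

```python
import math


def effective_rps(parallel_requests: int, response_time_s: float) -> float:
    """Average requests per second when each of N workers issues a new
    request as soon as its previous one completes."""
    return parallel_requests / response_time_s


def max_parallel_requests(rps_limit: float, response_time_s: float) -> int:
    """Largest number of always-busy workers that stays within an RPS limit.

    A result of 0 means even one back-to-back worker would exceed the
    limit, so it would need to pause between requests.
    """
    return math.floor(rps_limit * response_time_s)


# Assumed limit of 1 request/second and ~3 second responses.
print(max_parallel_requests(rps_limit=1.0, response_time_s=3.0))
print(effective_rps(parallel_requests=3, response_time_s=3.0))
```

Note that this only models an RPS limit. An API that caps concurrent connections would need a separate check on the number of workers itself.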

Document/Page/Result-set Size

Impacting my ability to parallelize my requests was the document size I was requesting of the API. Smaller document sizes (e.g. 10 activities instead of 1000) meant faster response times, which, when parallelized, run the risk of violating the RPS limits. On the other hand, larger document sizes take more time to get, whether because they’re simply bigger and take longer to transfer over the wire, or because the API you’re accessing is taking a long time to assemble the document on the backend.

Cycle Time

The particular API I was working with was a “keyword”-based API, meaning that I was polling for search terms/keywords. In Gnip parlance, we call these terms or keywords “rules” in order to generalize the terminology. A rule-set’s “cycle time” is how long it takes a Gnip Data Collector to poll for a given rule-set once. For example, if a rule-set size is 1,000, and the API’s RPS limit is 1, that rule-set’s cycle time would be 1,000 seconds; every 1,000 seconds, each rule in the set has been polled. Obviously, the cycle time would increase if the server took longer than a second to respond to each request.
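As a rough sketch of that calculation, assuming rules are polled one at a time (no parallel requests) and that each poll takes whichever is longer, the rate-limit interval or the server's response time. The function name and parameters are illustrative and are not part of any Gnip interface.

```python
def cycle_time_s(rule_count: int, rps_limit: float,
                 response_time_s: float = 0.0) -> float:
    """Seconds needed to poll every rule in a rule-set once, sequentially."""
    # The rate limit allows one request every 1/rps_limit seconds; a slow
    # server can stretch that interval further.
    seconds_per_request = max(1.0 / rps_limit, response_time_s)
    return rule_count * seconds_per_request


# The example from the text: 1,000 rules at 1 request per second.
print(cycle_time_s(rule_count=1000, rps_limit=1.0))
# Same rule-set, but the server takes 2 seconds to answer each request.
print(cycle_time_s(rule_count=1000, rps_limit=1.0, response_time_s=2.0))
```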

Skipping (missing data)

A given rule “skips” data during polling (meaning, you will miss data because you’re not covering enough ground) when one of the following conditions is true. ARU (activity update rate) is the rate at which activities/events occur on the given rule (e.g. the number of times per second someone uploads a picture with the tag “foo”).

  • ARU is greater than the RPS limit (RPS represented as 1/RPS) multiplied by the document size.
  • ARU is greater than the rule-set’s cycle time
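One way to read these conditions, offered here as an interpretation rather than Gnip's actual logic: each rule is polled once per cycle, each poll returns at most one document's worth of activities, and so a rule skips data when more activities arrive during a cycle than a single document can hold. The names below are hypothetical.

```python
def max_sustainable_aru(document_size: int, cycle_time_s: float) -> float:
    """Highest activity rate (activities/second) one rule can sustain
    without skipping, if it is polled once per cycle and each poll
    returns at most `document_size` activities."""
    return document_size / cycle_time_s


def would_skip(aru: float, document_size: int, cycle_time_s: float) -> bool:
    """True if activities arrive faster than one poll per cycle can collect."""
    return aru > max_sustainable_aru(document_size, cycle_time_s)


# Assumed numbers: 100 activities per document, 1,000 second cycle time.
print(max_sustainable_aru(document_size=100, cycle_time_s=1000.0))
print(would_skip(aru=0.5, document_size=100, cycle_time_s=1000.0))
print(would_skip(aru=0.05, document_size=100, cycle_time_s=1000.0))
```

Under this reading, the levers are the ones discussed above: request larger documents, shrink the rule-set, or shorten the cycle time through parallel requests where the API's limits allow.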

In order to optimally collect the data you need, in a timely manner, you have to balance all of these variables, and adjust them based on the activity update rate for the rule-set you’re interested in. While the variables make for engaging engineering exercises, do you want to spend time sorting these out, or spend time working on the core business issues you’re trying to solve? Gnip provides visibility into these variables to ensure data is most effectively collected.

Gnip’s doing great in the SMM (Social Media Monitoring) marketplace. However, we want more. We attended the Gov 2.0 Expo a few months ago, and we’ll also be at the upcoming Gov 2.0 Summit in Sept. Watching markets evolve their understanding of new technologies, concepts and solutions is always fascinating. The world of government projects, technologies, contracts, and vendors is vastly different from the world we tend to work in day-to-day. Adoption and understanding take a lot longer than what those of us more in the “web space” are used to, and policy often has significant impact on how/when something can be incorporated. Yet, there is an incredible market opportunity in front of social media related firms.

Government spending is obviously a tremendous force, and while sales/adoption cycles are long, it needs to be tapped. Thankfully, government agency awareness around social media is rising. From technology stack understanding, to communication paradigm shifts (e.g. Twitter & Facebook), gov. firms and teams are realizing the need for integration and use. Whether it’s the Defense Department’s need to apply predictive algorithms to new communication streams, or disaster recovery organizations needing to tap into crowdsourcing when catastrophe strikes, a vast array of teams are engaging at an increasing rate. A friend of mine lit up a room at the recent Emergency American Red Cross Summit when he showed them how communication (messaging and photos) can be mashed up onto a map in real time (via Gnip btw); highly relevant when considering disaster situations. “Who’s there?” and “What’s the situation?” are questions easily answered when social data streams are tapped and blended.

The social media echo chamber we live in is broadening to include significant government agencies, and the fruits that are falling from today’s social applications are landing in good places. I’m looking forward to participating in the burgeoning conversation around social media and government’s digestion of it. I encourage you to dive in as well, though be prepared for a relatively slow pace. Don’t expect the same turnaround times we’ve become accustomed to; rather, consider back-grounding some time in the space, and consider it an investment with a longer term payoff.

Posted in comedy by Jud

One of the things I love about dev teams is all the righteousness that falls out of day-to-day conversation, particularly around how things are supposed to work. We’re considering hiring a lobby group and “paying to play” in order to fix a few things in our modern world. We’ve been jotting down the following list of legislative items over the past few months and it’s time to float them. Ultimately, a US Constitutional amendment might be in order.

  • Copy-Paste operations MUST adopt surrounding destination format
  • Audio MUST NOT play by default when visiting a URL
  • Application upgrades MUST NOT cause OS restarts
  • Tab ordering MUST start in username field on documents that have forms
  • Tabbing from password field MUST go to the submit button, and skip the “remember me” checkbox
  • If a computer has a user facing camera, that camera MUST be used to visually recognize the user and auto-login the user everywhere login is required. Supersedes previous two items
  • Printers MUST NOT print bogus/fragmented last page
  • Selecting the “remember me” checkbox MUST actually remember me
  • Email services MUST provide a new tier/folder for transactional email categorization. Said categorization is between “inbox” and “spam” and users will look for missing email there instead of spam where all the spam is
  • Progress meters MUST NOT hang at the end
  • Progress meters MUST NOT be used for indeterminate periods

Today we’re excited to announce the integration of the Google Buzz firehose into Gnip’s social media data offering. Google Buzz data has been available via Gnip for some time, but today Gnip became one of the first official providers of the Google Buzz firehose.

The Google Buzz firehose is a stream of all public Buzz posts (excluding Twitter tweets) from all Google Buzz users. If you’re interested in the Google Buzz firehose, here are some things to know:

  • Google delivers it via Pubsubhubbub. If you don’t want to consume it via Pubsubhubbub, Gnip makes it available in any of our supported delivery methods: Polling (HTTP GET), Streaming HTTP (Comet), or Outbound HTTP Post (Webhooks).
  • The format of the Firehose is XML Activity Streams. Gnip loves Activity Streams and we’re excited to see Google continue to push this standard forward.
  • Google Buzz activities are Geo-enabled. If the end user attaches a geolocation on a Buzz post (either from a mobile Google Buzz client or through an import from another geo-enabled service), that location will be included in the Buzz activity.

We’re excited to bring the Google Buzz firehose to the Social Media Monitoring and Business Intelligence community through the power of the Gnip platform.

Here’s how to access the Google Buzz firehose. If you’re already a Gnip customer, just log in to your Gnip account and with 3 clicks you can have the Buzz firehose flowing into your system. If you’re not yet using Gnip and you’d like to try out the Buzz firehose to get a sense of volume, latency, and other key metrics, grab a free 3 day trial at http://try.gnip.com and check it out along with the 100 or so other feeds available through Gnip’s social media API.

Posted in How-To by sam

If you’re looking for social media data, you’ve got a lot of options: social media monitoring companies provide end-user brand tracking tools, some businesses provide deep-dive analyses of social data, other companies provide reputation scores for individual users, and still other services specialize in geographic social media display, to name just a few.

Some organizations ultimately decide to build internal tools for social media data analysis. Then they must decide between outsourcing the social data collection bit so they can focus their efforts on analyzing and visualizing the data, or building everything — including API connections to each individual publisher — internally. Establishing and maintaining those API connections over time can be costly. If your team has the money and resources to build your own social media integrations, then go for it!

But if you’re shopping for raw social media data, you should consider a social media API – that is, a single API that aggregates raw data from dozens of different social media publishers – instead of making connections to each one of those dozens of social media APIs individually. And in the social media API market, there is only a small handful of companies for you to choose from. We are one of them and we would love to work with you. But we know that you’ll probably want to shop your options before making a decision, so we’d like to offer our advice to help you understand some of the most important factors in selecting a social media API provider.

Here are some good questions for you to ask every social media API solution you consider (including your own internal engineers, if you’re considering hiring them for the job):

Are your data collection methods in compliance with all social media publishers’ terms of use?

–> Here’s why it matters: by working with a company that violates any publisher’s terms of use, you risk unstable (or sudden loss of) access to the violated publisher’s data, not to mention the potential legal consequences of using black market data in your product. Conversely, if you work with a company that has a strong relationship with the social media publishers, our experience shows that you not only get stable, reliable data access, but you just might get rewarded with *extra* data access every now and then. (In case you’re wondering, Gnip’s methods are in compliance with each of our social media publishers’ terms of use.)

Do you provide results and allow parameter modifications via API, and do you maintain those API connections over time?

–> In our experience, establishing a single API connection to collect data from a single publisher isn’t hard. But! Establishing many API connections to various social media publishers and – this is key – maintaining those connections over time is really quite a chore. So much so, we made a whole long list of API-related difficulties associated with that integration work, based on our own experiences. Make sure that whoever you work with understands the ongoing work involved and is prepared to maintain your access to all of the social media APIs you care about over time.

How many data sources do you provide access to?

–> Even if you only want access to Twitter and Facebook today, it’s a good idea to think ahead. How much incremental work will be involved for you to integrate additional sources a few months down the line? Our own answer to this question is this: using Gnip’s social media API, once you’re set up to receive your first feed from Gnip via API, it takes about 1 minute for you to configure Gnip to send you data from a 2nd feed. Ten minutes later, you’re collecting data from 10 different feeds, all at no extra charge. Since you can configure Gnip to send all of your data in one format, you only need to create one parser and all the data you want gets streamed into your product. You can even start getting data from a new social media source, decide it’s not useful for your product, and replace it with a different feed from a different source, all in a matter of seconds. We’re pretty proud that we’ve made it so fast and simple for you to receive data from new sources… (blush)… and we hope you’ll find it to be useful, too.

What format is your data delivered in?

–> Ten different social media sources might provide data in 10 different formats. And that means you have to write 10 different parsers to get all the data into your product. Gnip allows you to normalize all the social media data you want into one single format — Activity Streams — so you can collect all your results via one API and feed them into your product with just one parser.

Hope this helps! If you’ve got additional questions to suggest for our list, don’t hesitate to drop us a note. We’d love to hear from you.

Recent changes to Facebook’s crawling policies have had a detrimental impact on some of our customers who were relying on other (black/grey market) third parties for their Facebook social media data collection strategy. I wrote about this dynamic and its potential fallout last month. These recent events underscore the risk and reality around relying on black market data in your business.

Gnip works with publishers, like Facebook and Twitter, to ensure our software is in clear, transparent compliance with their terms of use/service in order to drastically reduce risks like this. All of Gnip’s data collection is above board and respects the policies set forth by the Publishers you want data from. When you use Gnip for your collection needs, you don’t have to live in fear of being “turned off” or sued over data you shouldn’t have.

For specifics on our Facebook integrations, check out yesterday’s post.

Have no fear, Gnip’s here to help. Give our numerous Facebook and other social media feeds a try at http://try.gnip.com or give us a shout at info@gnip.com.

One of our most requested features has long been Facebook support. While customers have had beta access for a while now, today we’re officially announcing support for several new Facebook Graph API feeds. As with the other feeds available through Gnip, Facebook data is available in Activity Streams format (as well as original if you so desire), and you can choose your own delivery method (polling, webhook POSTing, or streaming). Gnip integrates with Facebook on your behalf, in a fully transparent manner, in order to feed you the Facebook data you’ve been longing for.

As with most services, Facebook’s APIs are in constant flux. Integrating with Gnip shields you from the ever-shifting sands of service integration. You don’t have to worry about authentication implementation changes or delivery method shifts.

Use-case Highlight

Discovery is hard. If you’re monitoring a brand or keyword for popularity (positive or negative sentiment), it’s challenging to keep track of fan pages that crop up without notice. With Gnip, you can receive real-time notification when one of your search terms is found within a fan page. Discover when a community is forming around a given topic, product, or brand before others do.

We currently support the following endpoints, and will be adding more based on customer demand.
  • Keyword Search – Search over all public objects in the Facebook social graph.
  • Lookup Fan Pages by Keyword – Look up IDs for Fan Pages with titles containing your search terms.
  • Fan Page Feed (with or without comments) – Receive wall posts from a list of Facebook Fan Pages you define.
  • Fan Page Posts (by page owner, without comments) – Receive wall posts from a list of Facebook Fan Pages you define. Only shows wall posts made by the page owner.
  • Fan Page Photos (without comments) – Get photos for a list of Facebook Fan Pages.
  • Fan Page Info – Get information including fan count, mission, and products for a list of Fan Pages.

Give Facebook via Gnip a try (http://try.gnip.com), and let us know what you think at info@gnip.com.

Activity Streams

Jun 8, 2010

Gnip pledges allegiance to Activity Streams.

Consuming data from APIs with heterogeneous response formats is a pain. From basic format differences (XML vs JSON) to the semantics around structure and element meaning (custom XML structure, Atom, RSS), if you’re consuming data from multiple APIs, you have to handle each API’s responses differently. Gnip minimizes this pain by normalizing data from across services into Activity Streams. Activity Streams allows you to consistently digest responses from many services, using a single parsing routine in your code; no more special casing.
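To illustrate what a single parsing routine might look like, here is a minimal sketch that reads one Atom entry carrying Activity Streams elements. The sample entry is invented for illustration and is not actual Gnip output; the field selection and the `parse_activity` name are assumptions. The namespaces are the standard Atom one and the Activity Streams 1.0 Atom extension one.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "activity": "http://activitystrea.ms/spec/1.0/",
}

# Hypothetical entry, written by hand for this example.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:activity="http://activitystrea.ms/spec/1.0/">
  <id>tag:example.com,2010:activity/1</id>
  <title>Alice posted a note</title>
  <published>2010-06-08T12:00:00Z</published>
  <author><name>Alice</name></author>
  <activity:verb>http://activitystrea.ms/schema/1.0/post</activity:verb>
  <activity:object>
    <activity:object-type>http://activitystrea.ms/schema/1.0/note</activity:object-type>
    <content>Hello from a hypothetical service</content>
  </activity:object>
</entry>"""


def parse_activity(entry_xml: str) -> dict:
    """Extract actor, verb, and object fields from one activity entry."""
    entry = ET.fromstring(entry_xml)

    def text(path: str):
        # Return the stripped text of the first match, or None if absent.
        node = entry.find(path, NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "id": text("atom:id"),
        "published": text("atom:published"),
        "actor": text("atom:author/atom:name"),
        "verb": text("activity:verb"),
        "object_type": text("activity:object/activity:object-type"),
        "content": text("activity:object/atom:content"),
    }


activity = parse_activity(SAMPLE_ENTRY)
print(activity["actor"], activity["verb"], activity["content"])
```

Because every source is normalized to the same structure, the same function would apply regardless of which service the activity originally came from.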

Gnip’s history with Activity Streams runs long and deep. We contributed to one of the first service/activity/verb mapping proposals, and have been implementing aspects of Activity Streams over the past couple of years. Over the past several months Activity Streams has gained enough traction that the decision for it to be Gnip’s canonical normalization format was only natural. We’ve flipped the switch and are proud to be part of such a useful standard.

The Activity Streams initiative is in the process of getting its JSON version together, so for now, we offer the XML version. As JSON crystallizes, we’ll offer that as well.

Have you scheduled your engineering work to move to OAuth for Twitter’s interfaces? If you’re a Gnip user, you don’t have to! Gnip users don’t have to know anything about OAuth’s implementation in order to keep their data flowing, and for all of them, that’s a huge relief. OAuth is non-trivial to set up and support, and is an API authentication/authorization mechanism that most data consumers shouldn’t have to worry about. That’s where Gnip steps in! One of our value-adds is that many API integration shifts, like this one, are hidden from our customers. You can merrily consume data without having to sink expensive resources into adapting to the constantly shifting sands of data provider APIs.

If you’re consuming data via Gnip, when Twitter makes the switch to affected APIs, all you’ll need to do is provide Gnip with your OAuth tokens (the new “username” and “password”; just more secure and controllable), and off you go! You don’t have to worry about query param ordering, hashing, signing, and associated handshakes.
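For a sense of what the param ordering, hashing, and signing involve, here is a minimal sketch of an OAuth 1.0a HMAC-SHA1 signature using only the Python standard library. It assumes the HMAC-SHA1 signature method; the URL, parameter values, and secrets are placeholders, and a real client would also generate nonces and timestamps and build the Authorization header.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value: str) -> str:
    """Percent-encode per OAuth 1.0a: only unreserved characters stay as-is."""
    return quote(value, safe="~")


def signature_base_string(method: str, url: str, params: dict) -> str:
    """Join method, base URL, and the sorted, encoded parameters.

    `params` should hold every request parameter (query, form body, and
    oauth_* values) except the signature itself; `url` excludes the query.
    """
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign(method: str, url: str, params: dict,
         consumer_secret: str, token_secret: str) -> str:
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode("ascii")
    base = signature_base_string(method, url, params).encode("ascii")
    digest = hmac.new(key, base, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder request; none of these values are real credentials.
params = {"b": "2", "a": "1 x"}
print(signature_base_string("get", "http://example.com/r", params))
print(sign("get", "http://example.com/r", params, "consumer-secret", "token-secret"))
```

Getting any of these steps slightly wrong (ordering, double encoding, key construction) yields a rejected request, which is the kind of detail a customer consuming data via Gnip does not have to handle.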