So You Want Some Social Data

If your product or service needs social data in today’s API marketplace, there are a few things you need to consider in order to consume that data effectively.


I need all the data

First, double-check your needs. Data consumers often think they need “all the data,” when in fact they don’t. You may need “all the data” for a given set of entities (e.g. keywords or users) on a particular service, but don’t confuse that with needing “all the data” a service generates. When it comes to high-volume services (such as Twitter), consuming “all of the data” amounts to a resource-intensive engineering exercise on your end, and there are often non-trivial scaling challenges involved in handling large data sets. Do some math and determine whether statistical sampling will give you all you need; the answer is usually “yes.” If the answer is “no,” be ready for an uphill (technical, financial, or business-model) battle with service providers; they don’t necessarily want all of their data floating around out there.
Social data APIs are generally designed to prohibit access to “all of the data,” either technically or through terms-of-service agreements. However, they usually provide great access to narrow sets of data. Consider whether you need “100% of the data” for a relatively narrow slice of information; most social data APIs support this use case quite well.
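To make the “do some math” step concrete, here’s a quick sketch of the standard sample-size calculation for estimating a proportion. The confidence level and margin of error are assumptions you’d swap for your own; the point is that the required sample barely depends on how big the full firehose is.

```python
import math

def sample_size(z: float = 1.96, margin_of_error: float = 0.01, p: float = 0.5) -> int:
    """Classic sample-size formula for estimating a proportion:
    n = z^2 * p * (1 - p) / e^2, using the worst case p = 0.5."""
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

# Roughly 9,604 items gives a 1% margin of error at 95% confidence,
# whether the service produces 1M or 100M items a day.
print(sample_size())
```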


Ingestion


Connectivity

There are three general styles you’ll wind up using to access an API, all of them HTTP-based: event-driven inbound POSTs (e.g. PubSubHubbub/WebHooks), polling via GET, or streaming via GET/POST. Each of these has its pros and cons. I’m avoiding XMPP in this post only because it hasn’t seen widespread adoption (yet). Each style requires a different level of operational and programmatic understanding.
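To make those styles concrete, here’s a minimal sketch of the two pull-based ones (polling and streaming) using Python’s requests library. The URLs, parameters, and response shapes are placeholders, not any particular provider’s API; the event-driven style inverts the arrangement, with you hosting an HTTP endpoint that the provider POSTs to.

```python
import json
import time

import requests  # third-party: pip install requests

POLL_URL = "https://api.example.com/search"    # hypothetical polling endpoint
STREAM_URL = "https://api.example.com/stream"  # hypothetical streaming endpoint

def poll(keyword: str, interval_seconds: int = 60):
    """GET/polling style: ask for new results on a fixed schedule."""
    while True:
        resp = requests.get(POLL_URL, params={"q": keyword}, timeout=30)
        resp.raise_for_status()
        for item in resp.json().get("results", []):
            handle(item)
        time.sleep(interval_seconds)

def stream(keyword: str):
    """GET/POST streaming style: hold one long-lived connection open and
    read newline-delimited JSON as the provider pushes it."""
    with requests.get(STREAM_URL, params={"q": keyword}, stream=True, timeout=90) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # keep-alive newlines arrive empty
                handle(json.loads(line))

def handle(item: dict):
    print(item)
```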


Authentication/Authorization

APIs usually have publicly available versions (generally limited in capability), as well as versions that require registration for subsequent authenticated connections. The authC and authZ semantics around APIs range from simple to complex. You’ll need to understand the access characteristics of the specific services you want to use. Some require a hands-on, human justification process before the “right level of access” is granted to you and your product; others are simple automated online registration forms that directly yield the account credentials necessary for API access.
HTTP Basic authentication, not surprisingly, is the predominant authentication scheme, and authorization levels are conveniently tied to your account by the service provider. OAuth (proper and 2-legged) is gaining steam, however. You’ll also find that API keys (passed as URL parameters or HTTP headers) are still widely used.
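For the simpler schemes, the client side amounts to very little code. Here’s a sketch of HTTP Basic and API-key access using Python’s requests library; the endpoint, header name, and credential values are placeholders, and for OAuth you’d typically reach for a library such as requests-oauthlib rather than signing requests by hand.

```python
import requests

ENDPOINT = "https://api.example.com/v1/items"  # hypothetical endpoint

# HTTP Basic: credentials tied to the account you registered with the provider.
resp = requests.get(ENDPOINT, auth=("my_username", "my_password"), timeout=30)

# API key passed as a URL parameter...
resp = requests.get(ENDPOINT, params={"api_key": "MY_KEY"}, timeout=30)

# ...or as an HTTP header; the header name varies by provider.
resp = requests.get(ENDPOINT, headers={"X-Api-Key": "MY_KEY"}, timeout=30)
```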


Processing

How you process data once you receive it is certainly affected by which connection style you use. Note that most APIs don’t give you an option in how you connect to them; the provider decides for you. Processing data in the same step as receiving it can cause bottlenecks in your system, and ultimately put you on bad terms with the API provider you’re connecting to. An analogy would be drinking from the proverbial firehose. If you connect the firehose to your mouth, you might get a gulp or two down before you’re overwhelmed by the amount of water actually coming at you. You’ll either cause the firehose to back up on you, or you’ll start leaking water all over the place. Either way, you won’t be able to keep up. If your average ability to process data is slower than the rate at which it arrives, you’ll have a queueing challenge to contend with. Consider offline, or out-of-band, processing of data as it becomes available: for example, write it to disk or a database and have parallelized worker threads/processes parse and handle it from there. The point is, in that case, don’t process it in the moment.
Many APIs don’t produce enough data to warrant out-of-band processing, so inline processing is often just fine. It all depends on what operations you’re trying to perform, the speed at which your technology stack can accomplish those operations, and the rate at which data arrives.
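When you do need the out-of-band approach, here’s a rough sketch of its shape using an in-process bounded queue and worker threads; in practice the buffer might be disk, a database, or a message broker, but the receive-cheaply, process-elsewhere split is the same.

```python
import json
import queue
import threading

raw_items = queue.Queue(maxsize=10000)  # bounded so ingestion can't outrun memory

def ingest(lines):
    """Receive step: do as little as possible, just enqueue the raw payload."""
    for line in lines:
        raw_items.put(line)  # blocks if workers fall behind, i.e. back-pressure

def worker():
    """Process step: runs out-of-band, parallelized across N threads."""
    while True:
        line = raw_items.get()
        item = json.loads(line)
        # ...expensive parsing, enrichment, and storage would go here...
        raw_items.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
```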


Reporting

If you don’t care about reporting initially, you will in short order. How much data are you receiving? What are peak volume periods? Which of the things you’re looking for are generating the most results?
API integrations inherently bind your software to someone else’s. Understanding how that relationship is functioning at any given moment is crucial to your day-to-day operations.
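Even something as crude as the counter sketch below, bucketing received items per keyword per hour, answers that first round of questions. The “keyword” field name is hypothetical, and a real deployment would push these numbers into whatever metrics system you already run.

```python
import time
from collections import Counter, defaultdict

# hour bucket -> keyword -> item count
volume = defaultdict(Counter)

def record(item: dict):
    """Call once per received item; 'keyword' is a hypothetical field name."""
    hour = time.strftime("%Y-%m-%d %H:00", time.gmtime())
    volume[hour][item.get("keyword", "unknown")] += 1

def report():
    """Print total volume and the top keywords for each hour seen so far."""
    for hour, counts in sorted(volume.items()):
        total = sum(counts.values())
        print(f"{hour}: {total} items, top keywords: {counts.most_common(3)}")
```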


Monitoring

Reporting’s close sibling is monitoring. Understanding when an integration has gone south is just as important as knowing when your product is having issues; they’re one and the same. Integrating with an API means you’re dependent on someone else’s software, and that software can have any number of issues. From bugs to planned upgrades or API changes, you’ll need to know when things change and take appropriate action.
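One cheap, high-value monitor is a watchdog on data arrival itself: a feed that goes quiet is often the first visible symptom of an upstream outage, an expired credential, or an unannounced API change. A rough sketch, with the alerting hook left as a placeholder for email, paging, or whatever you already use:

```python
import threading
import time

last_item_at = time.monotonic()
STALL_THRESHOLD_SECONDS = 300  # tune to the feed's normal cadence

def note_item_received():
    """Call from your ingestion path every time an item arrives."""
    global last_item_at
    last_item_at = time.monotonic()

def watchdog():
    """Alert when the feed has been silent longer than the threshold."""
    while True:
        time.sleep(30)
        silence = time.monotonic() - last_item_at
        if silence > STALL_THRESHOLD_SECONDS:
            alert(f"No data received for {int(silence)}s")

def alert(message: str):
    print("ALERT:", message)  # placeholder for a real notification channel

threading.Thread(target=watchdog, daemon=True).start()
```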


Web services/APIs are usually incredibly easy to “sample,” but truly integrating and operationalizing them is another, more challenging, process.