So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit time is constrained by Twitter’s rate limits. Some queries are so popular (hello, “Justin Bieber”) that it can be impossible to make enough requests for that query alone to keep up with the stream. And this is only the beginning of the problem, as no monitoring or analytics vendor is interested in just one term; many have hundreds or thousands of brands or products to monitor.
Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. Usually there might be fewer than 100 matching Tweets per second, but if there’s a spike (say the term becomes a trending topic after a Super Bowl commercial), there will likely be more than 100 per second. If Twitter’s rate limits only allow you to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.
Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.
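To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The request rate, term count, and per-request result cap are illustrative assumptions drawn from the scenario above, not Twitter’s actual limits.

```python
# Back-of-the-envelope polling math for the scenario above.
# All numbers are illustrative assumptions, not actual Twitter limits.
REQUESTS_PER_SECOND = 1      # assumed allowance: one Search API call per second
TERMS_TRACKED = 1000         # "Coca Cola" plus 999 other terms
RESULTS_PER_REQUEST = 100    # assumed maximum Tweets returned per call

seconds_between_polls = TERMS_TRACKED / REQUESTS_PER_SECOND  # 1,000 seconds per term

def tweets_missed(tweets_per_second):
    """Tweets produced between two polls of a term, beyond what one request returns."""
    produced = tweets_per_second * seconds_between_polls
    return max(0, produced - RESULTS_PER_REQUEST)

# Even a modest 1 Tweet/sec means ~1,000 Tweets accumulate between polls,
# of which at most 100 come back: roughly 900 are never retrieved.
print(tweets_missed(1))
```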
Because of these issues, data collection strategies that rely exclusively on the Search API will frequently deliver poor coverage of Twitter data. And be forewarned: if a monitoring or analytics vendor claims full Twitter coverage but uses the Search API exclusively, you’re being misled.
Although coverage is not complete, one great thing about the Twitter Search API is the complex operator support it offers, such as Boolean queries and geo filtering. For that reason, some people opt to use the Search API to collect a sampling of Tweets that match their search terms. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).
So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)
But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full coverage Twitter firehose filtering. (See our Power Track feed if you’d like more info on that.)
Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…
Understanding Twitter’s Public APIs . . . You Mean There is More than One?
In fact, there are three Twitter APIs: the REST API, the Streaming API, and the Search API. Within the world of social media monitoring and social media analytics, we need to focus primarily on the latter two.
- Search API – The Twitter Search API is a dedicated API for running searches against the index of recent Tweets
- Streaming API – The Twitter Streaming API allows high-throughput, near-realtime access to various subsets of Twitter data (e.g. a 1% random sampling of Tweets, filtering for up to 400 keywords, etc.); see the sketch below for how the two access patterns differ in practice
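For a rough sense of how the two differ, here is a minimal sketch. The endpoints, parameters, and credentials shown are illustrative assumptions based on the public documentation at the time of writing; treat them as placeholders rather than a definitive integration.

```python
# Minimal sketch contrasting the two access patterns.
# Endpoints, parameters, and credentials are illustrative assumptions based on
# the public documentation at the time of writing, not a definitive integration.
import requests

# Search API: you poll, and each response is a finite page of recent Tweets.
search = requests.get(
    "http://search.twitter.com/search.json",
    params={"q": "gnip", "rpp": 100},  # rpp: assumed "results per page" parameter
)
for result in search.json().get("results", []):
    print(result["text"])

# Streaming API: you hold one connection open and matching Tweets arrive as they happen.
stream = requests.post(
    "https://stream.twitter.com/1/statuses/filter.json",
    data={"track": "gnip"},
    auth=("username", "password"),  # placeholder credentials
    stream=True,
)
for line in stream.iter_lines():
    if line:
        print(line)
```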
Whether you get your Twitter data from the Search API, the Streaming API, or through Gnip, only public statuses are available (and NOT protected Tweets). Additionally, before Tweets are made available to both of these APIs and Gnip, Twitter applies a quality filter to weed out spam.
So now that you have a general understanding of Twitter’s APIs . . . stay tuned for Part 2, where we will take a deeper dive into understanding Twitter’s Search API, coming next week…
Sometimes we’re asked why it makes sense to access social data from Gnip and not through direct access to the publicly accessible APIs. (We usually get this question from people who have never tried to access data from various social media APIs; those who have tried it understand how tedious and time-intensive data collection is and they can’t wait to hand their social data collection over to Gnip to manage for them.)
So, if you’ve never tried collecting data from multiple social media APIs at once… why would you use Gnip instead of connecting directly to the publicly accessible APIs? Here are 10 of the reasons…
#10 – Customer Support
The development teams behind most public APIs are busy, so they’re tough (if not impossible) for outside developers to reach with questions. At Gnip, we actually want to talk to you. We offer enterprise-level support so clients can contact us at all odd hours and receive a thoughtful, thorough response. And we work closely with a variety of sources, so we can reach out to them directly if necessary.
#9 – Reliability
#8 – Rate limit recommendations
Instead of having to figure out rate limits for the various sources on your own, Gnip can recommend rate limits based on our own extensive experience with the various APIs.
#7 – Delivery in your protocol of choice: never poll for data again
A lot of developers think polling for data is tedious… and unfortunately, most APIs are polling-based. So if you go to the sources directly, you have to poll their servers for the data. By using Gnip, you can choose between polling for your data or having your data streamed to you.
#6 – New feed setup in seconds
Without Gnip, it can take many hours (or days) of a developer’s time to set up a new API connection, parse the new feed, and start bringing data into your system. With Gnip, it can take as little as 30 seconds and no dev effort at all to start consuming the data.
#5 – Gnip is the only source for some data
Gnip can offer access to some data that’s not available from any other source (e.g. premium Twitter volume-based feeds like our Decahose and Halfhose).
#4 – Established premium data partnerships
Established partnerships with premium data publishers (Twitter, BackType, WordPress, etc.) make it quick and easy for Gnip customers to test and add premium data feeds.
#3 – Established relationships with all publishers
Because we manage data collection for customers all day every day, we’re among the earliest to know when API changes happen and the fastest to make any necessary changes to keep your data flowing.
#2 – APIs are generally hard to manage
Publishers change their APIs sometimes. Some APIs change frequently and without warning or documentation (cough, Facebook, cough) while others change less frequently. But no matter what, change is inevitable. Gnip manages your social media data delivery over time so you can keep your data flowing smoothly and reliably with minimal effort.
#1 – Enrichments
A variety of enrichments, or added metadata and features, come included with feeds delivered through Gnip data collectors. Some of the most popular enrichments include format normalization across sources (so you only have to write one parser for all your social media data), Klout Score inclusion (currently available for premium Twitter feeds), and language detection and filtering via a proprietary Gnip algorithm. We add enrichments all the time, so look for lots more to come.
We think Gnip is pretty cool (yes, we’re biased)… but even we know that Gnip isn’t for everyone. If you only need 1 feed from 1 source, the data you need is available through a publicly accessible API, you have an engineer who can monitor and optimize your data consumption regularly, and you’re certain that you will never need any other feeds forever and ever, then Gnip probably isn’t the right choice for you.
But if you’d like to ensure you’re receiving top-quality premium data access without requiring your engineering team to invest lots of time in data collection, we’d like to invite you to give Gnip a try. We’ve got lots of happy customers already and we just might prove valuable to you, too.
Not too long ago Gnip celebrated its third birthday. I am celebrating my one week anniversary with the company today. To say a lot happened before my time at Gnip would be the ultimate understatement, and yet it is easy for me to see the results produced from those three years of effort. Some of those results include:
Gnip’s social media API offering is the clear leader in the industry. Gnip is delivering over half a billion social media activities daily from dozens of sources. That certainly sounds impressive, but how can I be so confident Gnip is the leader? Because the most important social media monitoring companies rely on our services to deliver results to their customers every single day. For example, Gnip currently works with 8 of the top 9 enterprise social media monitoring companies, and the rate at which we are adding enterprise-focused companies is accelerating.
Another obvious result is the strong partnerships that have been cultivated. Some of our partnerships, such as those with Twitter and Klout, were well publicized when the agreements were put in place. However, having strong strategic partners takes a lot more than just a signed agreement. It takes a lot of dedication, investment, and hard work by both parties to deliver on the full promise of the agreement. It is obvious to me that Gnip has amazing partnerships that run deep and are built upon a foundation of mutual trust and respect.
The talent level at Gnip is mind blowing, but it isn’t the skills of the people that have stood out the most for me so far. It is the dedication of each individual to doing the right thing for our customers and our partners that has made the biggest impression. When it comes to gathering and delivering social media data, there are a lot of shortcuts that can be taken in order to save time, money, and effort. Unfortunately, these shortcuts can often come at the expense of publishers, customers, or both. The team at Gnip has no interest in shortcuts and that comes across in every individual discussion and in every meeting. If I were going to describe this value in one word, the word would be “integrity”.
In my new role as President & COO, I’m responsible for helping the company grow quickly and smoothly while maintaining the great values that have been in place since the company’s inception. The growth has already started, and I couldn’t be more pleased with the talent of the people who have joined the organization within the last week alone: Bill Adkins, Seth McGuire, Charles Ince, and Brad Bokal. And we are hiring more! In fact, it is worth highlighting one particular open position for a Customer Support Engineer. I’m hard-pressed to think of a higher-impact role at our company because we consider supporting our customers to be such an important priority. If you have 2+ years of coding experience, including working with RESTful Web APIs, and you love delivering over-the-top customer service, Gnip offers a rare opportunity to work in an environment where your skills will be truly appreciated. Apply today!
I look forward to helping Gnip grow on top of a strong foundation of product, partners, and people. If you have any questions, I can be reached at chris [at] gnip.com.
Late last week, several members of the Gnip team attended the 2011 Glue Conference, including Gnip’s very own CEO Jud Valeski, who delivered a keynote presentation on High-Volume, Realtime Data Stream Handling. Check out Audrey Watters’ article, Gnip CEO on the Challenges of Handling the Real-Time, Big Data Firehose, on ReadWriteCloud; it does a great job of summing up Jud’s presentation.
We were thrilled to once again sponsor such an innovative and informative event dedicated to the bits and pieces, APIs and metadata, standards, and connectors that help “glue” together the applications of the web. We would like to congratulate Eric Norlin (@defrag) and team for putting on a great conference.
It was exciting to see several of our customers and partners at the conference, including ReportGrid, a realtime analytics startup that used Gnip data to power its demo. For those of you that we met at the conference, it was a pleasure! For those of you that we missed, give us a call or shoot us an email; we would love to hear from you. See you next year at the 2012 Glue Conference!
Albert Einstein was once asked what he considered to be the most important discoveries. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons. Instead he said: “Compound interest is the greatest mathematical discovery of all time.”
I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let’s walk through an example. Say you owe your credit card company $1,000 and your interest rate is 16%. To keep it simple, assume the credit card company only requires you to pay 1% as your minimum payment every year, so the balance effectively grows 15% per year. After 30 years of compounding, you owe more than $60,000!
Without compounding, you’d owe just a little over 5 grand!
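For the skeptical, here is a minimal sketch of the math above, using the simplified 15%-effective-rate model from the example:

```python
# A minimal sketch of the debt example above: $1,000 growing at an effective
# 15% per year for 30 years, with and without compounding.
principal = 1000.0
rate = 0.15
years = 30

compounded = principal * (1 + rate) ** years   # roughly $66,000 under this model
simple = principal * (1 + rate * years)        # $5,500: "a little over 5 grand"

print(f"with compounding:    ${compounded:,.0f}")
print(f"without compounding: ${simple:,.0f}")
```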
What I find truly bizarre, though, is that when we software engineers throw around words like “technological debt”, the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing (I’ll get back to that later), we are asked to come up with the quick hack that will make it work tomorrow and deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.
And we are even speaking the language of finance by calling it “debt”! We could have said something like, “Well, it would take us roughly one week longer to integrate our current J2EE backend with this 3rd-party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP.” But no, we keep it simple and refer to the custom XML parser as “technological debt”, and still to no avail.
Now, the next time you have this conversation with your boss, show him the plot above and label the y-axis with “lines of code we have to maintain”, and the x-axis with “development iterations”, and perhaps a bell will go off.
Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:
- Refactor early and often
- Outsource as much as possible of what you don’t consider your core competency.
For instance, if you have to consume millions of tweets every day, but your core competency does not contain:
- developing high performance code that is distributed in the cloud
- writing parsers processing real time social activity data
- maintaining OAuth client code and access tokens
- keeping up with squishy rate limits and evolving social activity APIs
then it might be time for you to talk to us at Gnip!
Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.
Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.
What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.
Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?
We do those things, and on top of all that,
We put all your results in just one format.
You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)
The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…
Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.
We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.
At Gnip, we spend a large part of our days integrating with third party APIs in the Social Media space. As part of this effort, we’ve come up with some API design best practices.
Use Standard HTTP Response Codes
HTTP has been around since the early ’90s, and standard HTTP response codes have been around for quite some time. For example, 200-level codes mean success, 400-level codes mean a client-side error, and 500-level codes indicate a server error. If there was an error during an API call to your service, please don’t send us back a 200 response and expect us to parse the response body for error details. And if you want to rate limit us, please don’t send us back a 500; that makes us freak out.
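As a minimal sketch of that convention (not any particular publisher’s implementation), here is how an endpoint might pick its status code from the outcome of the request instead of burying errors in a 200:

```python
# A minimal sketch of the convention above: pick the status code from the
# outcome of the request instead of wrapping every error in a 200.
from http import HTTPStatus

def respond(outcome):
    """Map an API outcome to (status_code, body)."""
    if outcome == "ok":
        return HTTPStatus.OK, {"results": []}                       # 200: success
    if outcome == "bad_query":
        return HTTPStatus.BAD_REQUEST, {"error": "unparseable q"}   # 4xx: caller's fault
    if outcome == "rate_limited":
        # Being rate limited is a client-side condition, not a server failure,
        # so a 4xx (429 Too Many Requests) beats a 500.
        return HTTPStatus.TOO_MANY_REQUESTS, {"error": "slow down"}
    return HTTPStatus.INTERNAL_SERVER_ERROR, {"error": "our fault"} # 5xx: server error

code, body = respond("rate_limited")
print(code.value, body)  # 429 {'error': 'slow down'}
```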
Publish Your Rate Limits
We get it. You want the right to scale back your rate limits without a horde of angry developers wielding virtual pitchforks showing up on your mailing list. But it would make everyone’s lives easier if you published your rate limits rather than having developers play a constant guessing game. Bonus points if you describe how your rate limits work. Do you limit per set of credentials, per API key, or per IP address?
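Here is a minimal sketch of one common way to publish limits: advertise the limit, the remaining budget, and the reset time with every response. The header names and numbers are illustrative assumptions, not any specific API’s contract.

```python
# A minimal sketch of "publish your rate limits": advertise the limit, the
# remaining budget, and the reset time with every response. Header names and
# numbers are illustrative assumptions, not any specific API's contract.
import time

RATE_LIMIT = 150         # assumed: requests allowed per window, per set of credentials
WINDOW_SECONDS = 3600    # assumed: one-hour window

def rate_limit_headers(used_this_window, window_started_at):
    """Headers a client can read instead of guessing at the limits."""
    return {
        "X-RateLimit-Limit": str(RATE_LIMIT),
        "X-RateLimit-Remaining": str(max(0, RATE_LIMIT - used_this_window)),
        "X-RateLimit-Reset": str(int(window_started_at + WINDOW_SECONDS)),
    }

print(rate_limit_headers(used_this_window=42, window_started_at=time.time()))
```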
Use Friendly Ids, Not System Ids
We understand that it’s a common pattern to have an ugly system id (e.g. 17134916) backing a human readable id (e.g. ericwryan). As users of your API, we really don’t want to remember system ids, so why not go the extra mile and let us hit your API with friendly ids?
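A minimal sketch of that idea, reusing the example ids above (the lookup tables are made up for illustration):

```python
# A minimal sketch: resolve either the system id or the friendly id to the
# same record. The lookup tables are made up for illustration.
USERS = {17134916: {"screen_name": "ericwryan"}}
BY_NAME = {user["screen_name"]: uid for uid, user in USERS.items()}

def lookup_user(ref):
    """Accept '17134916' or 'ericwryan' and return the same user record."""
    uid = int(ref) if ref.isdigit() else BY_NAME.get(ref)
    return USERS.get(uid)

print(lookup_user("ericwryan") == lookup_user("17134916"))  # True
```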
Allow Us to Limit Response Data
Let’s say your rate limit is pretty generous. What if Joe User is hammering your API once a second and retrieving 100 items with every request, even though on average he will only see one new item per day? Joe has just wasted a lot of your precious CPU, memory, and bandwidth. Protect your users: allow them to ask for everything since the last id or timestamp they received.
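A minimal sketch of that pattern, using an assumed `since_id` parameter and an in-memory store for illustration:

```python
# A minimal sketch of "everything since the last id you received".
# The in-memory store and the since_id parameter are illustrative assumptions.
ITEMS = [
    {"id": 101, "text": "yesterday's item"},
    {"id": 102, "text": "also old news"},
    {"id": 103, "text": "the one new item today"},
]

def get_items(since_id=0, count=100):
    """Return at most `count` items newer than `since_id`, not everything."""
    newer = [item for item in ITEMS if item["id"] > since_id]
    return newer[:count]

# Joe remembers the last id he saw (102) and only pulls the delta.
print(get_items(since_id=102))  # [{'id': 103, 'text': 'the one new item today'}]
```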
Keep Your Docs Up to Date
Who has time to update their docs when you have customers banging on your door for bug fixes and new features? Well, you would probably have fewer customers banging on your door if they had a better understanding of how to use your product. Keep your docs up to date with your code.
Publish Your Search Parameter Constraints
Search endpoints are very common these days. Do you have one? How do we go about searching your data? Do you split search terms on whitespace? Do you split on punctuation? How does quoting affect your query terms? Do you allow boolean operators?
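If you do document those rules, a short example goes a long way. Here is a minimal sketch of one possible set of rules (split on whitespace, keep double-quoted phrases together); the rules themselves are illustrative assumptions, not any real search endpoint’s behavior.

```python
# A minimal sketch of query-parsing rules worth documenting: split on
# whitespace, but keep double-quoted phrases together. These rules are
# illustrative, not any real search endpoint's behavior.
import re

def parse_query(query):
    """Return search terms; quoted phrases survive as single terms."""
    return [phrase or word for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query)]

print(parse_query('coca cola'))           # ['coca', 'cola']
print(parse_query('"coca cola" super'))   # ['coca cola', 'super']
```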
Use Your Mailing List
Do you have a community mailing list? Great! Then use it. Is there an unavoidable, breaking change coming in a future release? Let your users know as soon as possible. Do you keep a changelog of features and bug fixes? Why not publish this information for your users to see?
We consider this to be a fairly complete list of guidelines for designing an API that is easy to work with. Feel free to yell at us (info at gnip) if you see us lacking in any of these departments.
Gnip is nearing its one-year anniversary of our 2.0 product. We reset our direction several months ago. As part of that shift, we completely changed our architecture. I thought I’d write about that experience a bit.
Our initial implementation is best described as a clustered, non-relational-DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system comprised a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach as it obfuscated the ultimate consumer of their data; they wanted transparency. Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were the rage, and degraded real-time delivery was due in part to load on the Publisher’s API. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy were we wrong on that count. While Publishers love to complain about the load on their API, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification-like/PubSub solutions (SIP, SUP, PubSubHubBub). However, most simply scaled horizontally and threw money at the problem. Twitter has shone a light on streaming HTTP (or whatever you want to call it… there are so many monikers), which is “as good as it gets” (leaving protocol buffers and compressed HTTP streams as simply optimizations to the model). I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middle-man, Gnip couldn’t afford to have antagonistic relations with data sources.
Literally overnight, we walked away from further construction on our 1.0 platform. We had paying customers on it, however, so we operated it for several months before ultimately shutting it down, after migrating everyone we could to 2.0. Gnip 2.0 counter-intuitively departed from a clustered environment and instead provided each consuming customer with explicit, transparent integrations with Publishers, all via standalone instances of the software running on separate virtualized hardware instances (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials to the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they configure it with credentials they obtain themselves from the Publisher. Publishers have full transparency into, and control of, who’s using their data.
The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (and I’d argue there are actually relatively few of those out there), the reality is that clusters are hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, all of your customers are leveraged against one big SPOF. If something cracks in the system, all of your customers feel that pain. On the flip side, siloed instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are very few and far between. So the cost of not leveraging the hardware and software patterns we’re naturally inclined to architect around is indeed higher, but a simplified system has its benefits, to be sure.
We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Building two such significant architectures under the same roof has been a fascinating experience. The pros and cons of each are many. Where you wind up is an interesting function of your technical propensities as well as your business constraints. One size never fits all.