The response to the commercial Twitter streams we’ve made available has been outstanding. We’ve talked to hundreds of companies who are building growing businesses that analyze conversations on Twitter and other social media sites. As Twitter’s firehose continues to grow (now over 110 million Tweets per day), we’re hearing more and more requests for a way to filter the firehose down to the Tweets that matter.
Today, we’re announcing a new commercial Twitter product called Power Track. This is a keyword-based filter of the full firehose that provides 100% coverage of a stream you define. Power Track customers no longer have to deal with polling rate limits on the Search API or volume limits on the Streaming API.
In addition to keyword-based filters, Power Track also supports boolean operators and many of the custom operators allowed on the Twitter Search API. With Power Track, companies and developers can define the precise slice of the Twitter stream they need and be confident they’re getting every Tweet, without worrying about volume restrictions.
Currently we support operators for narrowing the stream to a set of users, matching against unwound URLs, filtering by location, and more. We’ll continue to add support for more ways for our customers to filter the content relevant to them in the future. Check the documentation to see the technical details of these operators and more.
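To make the filtering model concrete, here's a minimal sketch of the kind of boolean keyword matching Power Track performs server-side against the firehose. The rule grammar here is a simplified, hypothetical one for illustration; see the documentation for Gnip's actual operator syntax.

```python
# Illustrative sketch only: Power Track evaluates rules like this
# server-side against the full firehose. The rule grammar below is
# hypothetical; consult Gnip's docs for the real operators.

def matches(rule_terms, tweet_text):
    """Return True if every required term appears and no negated term does.

    rule_terms is a list like ["coffee", "-decaf"]: plain terms are
    ANDed together, and a term prefixed with "-" must be absent.
    """
    text = tweet_text.lower()
    for term in rule_terms:
        if term.startswith("-"):
            if term[1:].lower() in text:
                return False
        elif term.lower() not in text:
            return False
    return True

print(matches(["coffee", "-decaf"], "Best coffee in Boulder"))    # True
print(matches(["coffee", "-decaf"], "Decaf coffee is fine too"))  # False
```

The value of Power Track is that this evaluation happens before delivery, so you receive only the slice of the firehose your rules define.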
Gnip is here to ensure the enterprise marketplace gets the depth, breadth, and reliability of social media data it requires. Please contact us at firstname.lastname@example.org to find out more.
We at Gnip have been waiting a long time to write the following sentence: Gnip and Twitter have partnered to make Twitter data commercially available through Gnip’s Social Media API. I remember consuming the full firehose back in 2008 over XMPP. Twitter was breaking ground in realtime social streams at a then mind-blowing ~6 (six) Tweets per second. Today we see many more Tweets and a greater need for commercial access to higher volumes of Twitter data.
There’s enormous corporate demand for better monitoring and analytics tools, which help companies listen to their customers on Twitter and understand conversations about their brands and products. Twitter has partnered with Gnip to sublicense access to public Tweets, which is great news for developers interested in analyzing large amounts of this data. This partnership opens the door to developers who want to use Twitter streams to create monitoring and analytics tools for the non-display market.
Today, Gnip is announcing three new Twitter feeds with more on the way:
We are excited about how this partnership will make realtime social media analysis more accessible, reliable, and sustainable for businesses everywhere.
Gnip is nearing its one-year anniversary of our 2.0 product. We reset our direction several months ago. As part of that shift, we completely changed our architecture. I thought I’d write about that experience a bit.
Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system consisted of a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach because it obfuscated the ultimate consumer of their data; they wanted transparency. Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were all the rage, and load on a Publisher’s API was part of what degraded real-time delivery. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy were we wrong on that count. While Publishers love to complain about the load on their API, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification/PubSub solutions (SIP, SUP, PubSubHubbub). Most, however, simply scaled horizontally and threw money at the problem. Twitter has since shined a light on streaming HTTP (or whatever you want to call it; there are many monikers), which is about as good as it gets (leaving protocol buffers and compressed HTTP streams as mere optimizations to the model). I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middleman, Gnip couldn’t afford antagonistic relations with its data sources.
Literally overnight, we walked away from further construction on our 1.0 platform. We had paying customers on it, however, so we operated it for several months, migrating everyone we could to 2.0 before ultimately shutting it down. Gnip 2.0 counterintuitively departed from a clustered environment and instead gave each consuming customer explicit, transparent integrations with Publishers, all via standalone instances of the software running on standalone virtualized hardware (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials for the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they configure it with credentials they obtain themselves from the Publisher. Publishers have full transparency into, and control of, who’s using their data.
The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I’d argue there are actually relatively few of them out there), the reality is that clusters are hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, you have all of your customers leveraged against one big SPOF: if something cracks in the system, every customer feels the pain. On the flip side, siloed instances rarely suffer systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are few and far between. So the cost of forgoing the hardware and software we’re generally inclined to architect around is indeed higher, but a simplified system has its benefits, to be sure.
We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Building two such significant architectures under the same roof has been a fascinating thing to experience. The pros and cons of each are many. Where you wind up with your system is an interesting function of your technical propensities as well as your business constraints. One size never fits all.
Gnip’s doing great in the SMM (Social Media Monitoring) marketplace. However, we want more. We attended the Gov 2.0 Expo a few months ago, and we’ll also be at the upcoming Gov 2.0 Summit in September. Watching markets evolve their understanding of new technologies, concepts, and solutions is always fascinating. The world of government projects, technologies, contracts, and vendors is vastly different from the world we tend to work in day-to-day. Adoption and understanding take a lot longer than what those of us in the “web space” are used to, and policy often has a significant impact on how and when something can be incorporated. Yet there is an incredible market opportunity in front of social media firms.
Government spending is obviously a tremendous force, and while sales and adoption cycles are long, it needs to be tapped. Thankfully, government agency awareness of social media is rising. From understanding the technology stack to grasping communication paradigm shifts (e.g. Twitter and Facebook), government firms and teams are realizing the need for integration and use. Whether it’s the Defense Department’s need to apply predictive algorithms to new communication streams, or disaster recovery organizations needing to tap into crowdsourcing when catastrophe strikes, a vast array of teams are engaging at an increasing rate. A friend of mine lit up a room at the recent Emergency American Red Cross Summit when he showed how communication (messaging and photos) can be mashed up onto a map in real time (via Gnip, by the way); highly relevant when considering disaster situations. “Who’s there?” and “What’s the situation?” are questions easily answered when social data streams are tapped and blended.
The social media echo chamber we live in is broadening to include significant government agencies, and the fruits falling from today’s social applications are landing in good places. I’m looking forward to participating in the burgeoning conversation around social media and government’s digestion of it. I encourage you to dive in as well, though be prepared for a relatively slow pace. Don’t expect the same turnaround times we’ve become accustomed to; rather, consider backgrounding some time in the space, and treat it as an investment with a longer-term payoff.
Today we’re excited to announce the integration of the Google Buzz firehose into Gnip’s social media data offering. Google Buzz data has been available via Gnip for some time, but today Gnip became one of the first official providers of the Google Buzz firehose.
The Google Buzz firehose is a stream of all public Buzz posts (excluding Twitter tweets) from all Google Buzz users. If you’re interested in the Google Buzz firehose, here are some things to know:
We’re excited to bring the Google Buzz firehose to the Social Media Monitoring and Business Intelligence community through the power of the Gnip platform.
Here’s how to access the Google Buzz firehose. If you’re already a Gnip customer, just log in to your Gnip account and with three clicks you can have the Buzz firehose flowing into your system. If you’re not yet using Gnip and you’d like to try out the Buzz firehose to get a sense of volume, latency, and other key metrics, grab a free 3-day trial at http://try.gnip.com and check it out along with the 100 or so other feeds available through Gnip’s social media API.
Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few partners and the world was happy. These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed. Over time, numerous factors caused Twitter to cease offering the firehose. There was much wailing and gnashing of teeth on that day, I can tell you!
At roughly the same time, Twitter bought real-time search company Summize and began offering everyone access to what is now known as the Search API. Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword. Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it), and Twitter had to deal with a constantly growing number of developers connected to an inherently inefficient interface.
Last year, Twitter announced that they were developing the spiritual successor to the firehose: a real-time stream that could be filtered on a per-customer basis and provide the low-latency results people wanted. By August of last year, alpha customers had access to various components of the firehose (spritzer, gardenhose, track, birddog, etc.) and provided feedback that helped shape and solidify Twitter’s Streaming API.
A month ago, Twitter engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API”. In the post, he detailed several reasons why the move is beneficial to developers. Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are being rate limited by the Search API. Thus, both the carrot and the stick have been laid out.
The Streaming API is going to be a boon for companies that collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers. In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.
Question 1: Do I need to make the switch?
Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query? If you don’t, keep using the Search API. If you do, get thee to the Streaming API.
Question 2: Why should I make the switch?
Question 3: Will I have to change my API integration?
Twitter’s Streaming API uses streaming HTTP
Streaming API consumers need to perform more filtering on their end
Throttling is performed differently
Start tracking deletes
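The last three points above can be sketched together. With the Streaming API, all of your tracked keywords arrive interleaved on a single connection as newline-delimited JSON, with delete notices mixed in, so the client must re-match each status against its own keywords and honor deletes. A minimal sketch (field names follow the Streaming API's JSON; the `store` structure is our own illustrative choice):

```python
import json

# Sketch of client-side handling for Twitter's Streaming API: tracked
# keywords are ORed server-side and arrive on one connection, so the
# consumer must re-filter each status and honor delete notices of the
# form {"delete": {"status": {"id": ...}}}.

def handle_line(line, keywords, store):
    """Process one newline-delimited JSON message from the stream.

    store maps keyword -> list of matching statuses; a delete notice
    removes the deleted status id from every bucket.
    """
    msg = json.loads(line)
    if "delete" in msg:
        doomed = msg["delete"]["status"]["id"]
        for bucket in store.values():
            bucket[:] = [s for s in bucket if s["id"] != doomed]
        return
    text = msg.get("text", "").lower()
    for kw in keywords:
        if kw.lower() in text:
            store.setdefault(kw, []).append(msg)

store = {}
handle_line('{"id": 1, "text": "I love coffee"}', ["coffee", "tea"], store)
handle_line('{"id": 2, "text": "tea time"}', ["coffee", "tea"], store)
handle_line('{"delete": {"status": {"id": 1}}}', ["coffee", "tea"], store)
```

Note that honoring deletes is not optional: Twitter's terms require consumers to remove deleted tweets from their stores.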
Question 4: What if I want historical data too?
Twitter’s Streaming API is forward-looking, so when you add a new keyword you’ll only receive tweets from that point on. Depending on your use case, you may need some historical data to kick things off. If so, you’ll want to make one simultaneous query to the Search API. This means you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data plus low-latency, high-reliability future data.
And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.
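A sketch of building that one-time backfill: at 100 results per page (the "rpp" parameter) across 15 pages, you reach the ~1,500 results mentioned above. This builds the page URLs without fetching them; the endpoint shown is the search.json interface of this era.

```python
from urllib.parse import urlencode

# Sketch of a historical backfill using the Search API's "page" and
# "rpp" parameters: 100 results/page x 15 pages = the ~1,500 results
# available per keyword.

def backfill_urls(query, rpp=100, max_pages=15):
    """Return the list of Search API page URLs for a one-time backfill."""
    base = "http://search.twitter.com/search.json"
    return [
        base + "?" + urlencode({"q": query, "rpp": rpp, "page": page})
        for page in range(1, max_pages + 1)
    ]

urls = backfill_urls("gnip")
print(len(urls))  # 15
print(urls[0])    # http://search.twitter.com/search.json?q=gnip&rpp=100&page=1
```

In practice you'd fetch these pages once per new keyword, dedupe against what the stream has already delivered (tweet ids make this easy), and then rely on the Streaming API going forward.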
Question 5: What if I need more help?
Streaming HTTP resources:
Gnip: shipping and handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a service is necessary. Several times we’ve come up against folks who say either a) “our (random publisher X) data is already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.
If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either unable to afford a “web engineer” or simply focused elsewhere technically (if at all), need help accessing the data they need to make business decisions. Furthermore, while you may do your own integrations, how robust is your error reporting, monitoring, and management of your overall strategy? Odds are you haven’t given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you were integrated with went down? Could you more efficiently receive the same data from a publisher, while relieving load from your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.
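One simple answer to the "did my stream stop because of my bug or the publisher?" question is a watchdog that tracks when data last arrived. A minimal sketch; the threshold and the injected clock are illustrative choices, not part of any Gnip API:

```python
import time

# Sketch of the kind of monitoring described above: record when data
# last arrived so silence can be flagged before customers notice.
# Injecting the clock keeps the example testable without real waiting.

class StreamWatchdog:
    def __init__(self, silence_threshold_s=60, clock=time.monotonic):
        self.threshold = silence_threshold_s
        self.clock = clock
        self.last_seen = clock()

    def record_activity(self):
        """Call this every time a message arrives on the stream."""
        self.last_seen = self.clock()

    def is_stalled(self):
        """True if nothing has arrived within the silence threshold."""
        return self.clock() - self.last_seen > self.threshold

# Simulated clock so the example runs instantly.
now = [0.0]
wd = StreamWatchdog(silence_threshold_s=60, clock=lambda: now[0])
wd.record_activity()
now[0] = 30.0
print(wd.is_stalled())  # False
now[0] = 120.0
print(wd.is_stalled())  # True
```

A stalled watchdog on your side, paired with a check of the publisher's status endpoint, tells you quickly which side of the integration broke.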
As the social media/data space has evolved over the past couple of years, the necessity of a managed solution-as-a-service has become clear. As expected, the number of data consumers continues to explode, while the proportion of consumers with the technical capability to reliably integrate with publishers keeps shrinking.
Finally some good technical/formatting standards are catching on (PubSubHubbub, WebHooks, HTTP-long-polling/streaming/Comet (thanks Twitter), ActivityStreams), which is giving everyone a vocabulary and common conceptual understanding to use when discussing how/when real-time data is produced/consumed.
In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve, I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less technically inclined players in the marketplace access to the data they need. For a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build with geo? No. I prefer a one-stop shop.
The Gnip product offerings are growing today as we officially announce a new Push API Service that will help companies more quickly and effectively deliver data to their customers, partners, and affiliates. (See the TechCrunch article: “Gnip Launches Push API To Create Real-Time Stream Of Business Data”.)
This new offering leverages the Gnip SaaS Integration Platform but is provided as a complete white-label and embeddable solution adding real-time push to an existing infrastructure. The main capabilities include the following:
A company would be able to add the Push API Service to their existing infrastructure in hours or days with a few steps.
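As a sketch of what the receiving side of such a push integration might look like: rather than polling, the Push API Service would POST batches of data to an endpoint the customer registers. The payload shape below is hypothetical; consult Gnip for the actual format.

```python
import json

# Sketch of a push-delivery handler. The Push API Service would POST
# batches to a registered endpoint; the "activities" payload shape is
# a hypothetical illustration, not Gnip's documented format.

def handle_push(body_bytes):
    """Parse one pushed batch and return the activities it contains."""
    batch = json.loads(body_bytes.decode("utf-8"))
    return batch.get("activities", [])

sample = json.dumps({
    "activities": [
        {"id": "1", "body": "hello"},
        {"id": "2", "body": "world"},
    ]
}).encode("utf-8")

print(len(handle_push(sample)))  # 2
```

In production this handler would sit behind an HTTP server, and the efficiency win is exactly what the post describes: one POST per batch replaces many repetitive polling requests.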
If your company is interested in learning more about how Gnip can help move your existing repetitive API and website traffic to a more efficient push-based approach, contact us at email@example.com.
This post is meant to provide a reminder and additional guidance for Gnip platform users as we transition to the new Twitter Streaming API at the end of the week. We have a lot going on and want to make sure companies and developers are keeping up with the moving parts.
Helpful information about the new Twitter Streaming API:
PS: The planned Facebook integration is coming along, and we have our internal prototype completed. We’re driving toward the beta and should have more details in the next week or two.
PPS: We would still appreciate any feedback on your Twitter data integration needs; please take the survey.