Strata, Big Data & Big Streams

Gnip had a great time at O’Reilly’s Strata 2011 conference in California last week. We signed up several months ago as a big sponsor without knowing exactly how things would come together. The bet paid off: Strata was a huge success for us and for the industry at large. We were blown away by the relevance of the topics and the quality of the attendees and the discussions they sparked. I was amazed at how much knowledge everyone now has around big data set analysis and processing. Technologies that were immature and new just a few years ago are now baked into the ecosystem and have become tools of the trade (e.g. Hadoop). All very cool to see.


That said, there remains a distinct gap between big data set handling and high-volume, real-time data stream handling. We’ve come a long way in processing monster data sets in batch or offline modes, but we have a long way to go when it comes to handling large streaming data sets. Hilary Mason, of bit.ly, hit this point squarely in her “What Data Tells Us” talk at Strata. With open source tooling we can fan out ungodly amounts of processing… like piranhas on fresh meat. However, blending that processing, and high-latency transactions, into real-time streams of thousands of activities per second is not as refined or well understood. Frankly, I’m shocked at the number of engineers I run into who simply don’t understand asynchronous programming at all.


The night before the conference started, Pete Warden ran BigDataCamp @Strata, where Mike Montano from BackType gave a high-level overview of their infrastructure. He laid out a few tiers and described a “speed” tier that did a lot of work on high-volume streams, and a “batch” tier that did its work in a more offline manner. The blend of approaches was an interesting teaser into how Big Stream challenges can be handled. Gnip’s own infrastructure has had to address these challenges of course, and we went into a thread of detail in our Expanding The Twitter Firehose post a while back.


Big Stream handling occupies a good part of my brain. I’d like to see Big Data discussion start to unravel Big Stream challenges as well.

Announcing Power Track – Full Firehose filtering for the Tweets you want

The response to the commercial Twitter streams we’ve made available has been outstanding. We’ve talked to hundreds of companies who are building growing businesses that analyze conversations on Twitter and other social media sites. As Twitter’s firehose continues to grow (now over 110 million Tweets per day), we’re hearing more and more requests for a way to filter the firehose down to the Tweets that matter.

Today, we’re announcing a new commercial Twitter product called Power Track. It is a keyword-based filter of the full firehose that provides 100% coverage of a stream you define. Power Track customers no longer have to deal with polling rate limits on the Search API or volume limits on the Streaming API.

In addition to keyword-based filters, Power Track also supports boolean operators and many of the custom operators allowed on the Twitter Search API. With Power Track, companies and developers can define the precise slice of the Twitter stream they need and be confident they’re getting every Tweet, without worrying about volume restrictions.

Currently we support operators for narrowing the stream to a set of users, matching against unwound URLs, filtering by location, and more. We’ll continue to add ways for our customers to filter down to the content relevant to them. Check the documentation for the technical details of these operators and more.
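
For a sense of what consuming a filtered stream looks like on the client side, here is a minimal Python sketch. The endpoint URL and credentials are hypothetical placeholders rather than the literal Power Track interface; the real connection details and rule syntax are in the documentation mentioned above.

```python
import requests  # third-party HTTP library (pip install requests)

# Hypothetical placeholders -- substitute your real Gnip account details and
# consult the Power Track documentation for the actual endpoint and rule setup.
STREAM_URL = "https://stream.gnip.example.com/accounts/ACCOUNT/powertrack.json"
AUTH = ("you@example.com", "your-password")

# Open a long-lived streaming HTTP connection and read activities line by line.
with requests.get(STREAM_URL, auth=AUTH, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:                     # keep-alive newlines arrive periodically
            continue
        print(line.decode("utf-8"))      # hand each activity to your own pipeline
```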

Gnip is here to ensure the enterprise marketplace gets the depth, breadth, and reliability of social media data it requires. Please contact us at info@gnip.com to find out more.

New Gnip & Twitter Partnership

We at Gnip have been waiting a long time to write the following sentence: Gnip and Twitter have partnered to make Twitter data commercially available through Gnip’s Social Media API. I remember consuming the full firehose back in 2008 over XMPP. Twitter was breaking ground in realtime social streams at a then mind-blowing ~6 (six) Tweets per second. Today we see many more Tweets and a greater need for commercial access to higher volumes of Twitter data.

There’s enormous corporate demand for better monitoring and analytics tools, which help companies listen to their customers on Twitter and understand conversations about their brands and products. Twitter has partnered with Gnip to sublicense access to public Tweets, which is great news for developers interested in analyzing large amounts of this data. This partnership opens the door to developers who want to use Twitter streams to create monitoring and analytics tools for the non-display market.

Today, Gnip is announcing three new Twitter feeds with more on the way:

  • Twitter Halfhose. This volume-based feed consists of 50% of the full firehose.
  • Twitter Mentionhose. This coverage-based feed provides the realtime stream of all Tweets that mention a user, including @replies and retweets. We expect this to be very interesting to businesses studying the conversational graph on Twitter to determine influencers, engagement, and trending content.
  • Twitter Decahose. This volume-based product consists of 10% of the full firehose. Starting today, developers who want to access this sample rate will access it via Gnip instead of Twitter. Twitter will also begin to transition non-display developers with existing Twitter Gardenhose access over to Gnip.

We are excited about how this partnership will make realtime social media analysis more accessible, reliable, and sustainable for businesses everywhere.

To learn more about these premium Twitter products, visit http://gnip.com/twitter, send us an email at info@gnip.com, or appropriately, find us on Twitter @gnip.

Clusters & Silos

Gnip is nearing the one-year anniversary of our 2.0 product. We reset our direction several months ago, and as part of that shift we completely changed our architecture. I thought I’d write a bit about that experience.

Gnip 1.0

Our initial implementation is best described as a clustered, non-relational DB (aka NoSQL) data aggregation service. We built and ran this product for about a year and a half. The system consisted of a centralized cluster of machines that divvied up load, centralized streams of publisher data, and then fanned that data out to many customers. Publishers did not like this approach as it obfuscated the ultimate consumer of their data; they wanted transparency. Our initial motivation for this architecture was alleviating load pain on the Publishers. “Real-time” APIs were all the rage, and load on a Publisher’s API was part of what degraded real-time delivery. A single stream of data to Gnip, with Gnip handling the fan-out via a system built for such demand, was part of the solution we sold. We thought we could charge Publishers for alleviating their load pain. Boy were we wrong on that count. While Publishers love to complain about the load on their APIs, effectively none of them wanted to do anything about it. Some smartly built caching proxies, and others built homegrown notification/PubSub solutions (SIP, SUP, PubSubHubbub). However, most simply scaled horizontally and threw money at the problem. Twitter has shone a light on streaming HTTP (or whatever you want to call it… there are so many monikers), which is “as good as it gets” (leaving protocol buffers and compressed HTTP streams as optimizations to the model). I digress. The 1.0 platform was a fantastic engineering feat, ahead of its time, and unfortunately a thorn in Publishers’ sides. As a data integration middle-man, Gnip couldn’t afford antagonistic relations with data sources.

Gnip 2.0

Literally overnight, we walked away from further construction on our 1.0 platform. We still had paying customers on it, however, so we operated it for several months, migrating everyone we could to 2.0, before ultimately shutting it down. Gnip 2.0 counter-intuitively departed from a clustered environment and instead gave each consuming customer explicit, transparent integrations with Publishers, all via standalone instances of the software running on standalone virtualized hardware instances (EC2). Whereas 1.0 would sometimes leverage Gnip-owned authentication/app credentials to the benefit of many consuming customers, 2.0 was architected explicitly not to support this. For each 2.0 instance a customer runs, they configure it with credentials they obtain themselves from the Publisher. Publishers have full transparency into, and control of, who’s using their data.

The result is an architecture that doesn’t leverage certain data structures an engineer would naturally wish to use. That said, an unexpected operational benefit has fallen out of the 2.0 system. Self-healing, zero-SPOF (single point of failure) clusters aside (I’d argue there are actually relatively few of them out there), the reality is that clusters are hard to build in a fault-tolerant manner, and SPOFs find their way in. From there, all of your customers are leveraged against one big SPOF. If something cracks in the system, all of your customers feel that pain. On the flip side, silo’d instances rarely suffer from systemic failure. Sure, operational issues arise, but you can treat each case uniquely and react accordingly. The circumstances in which all of your customers feel pain simultaneously are few and far between. So, the cost of not leveraging the hardware and software we’re generally inclined to architect for is indeed higher, but a simplified system has its benefits, to be sure.

We now find ourselves promoting Publisher integration best practices, and Publishers advocate our usage. Building two such significant architectures under the same roof has been a fascinating thing to experience. The pros and cons of each are many. Where you wind up with your system is an interesting function of your technical propensities as well as your business constraints. One size never fits all.

Gov 2.0 & Social Media

Gnip’s doing great in the SMM (Social Media Monitoring) marketplace. However, we want more. We attended the Gov 2.0 Expo a few months ago, and we’ll also be at the upcoming Gov 2.0 Summit in September. Watching markets evolve their understanding of new technologies, concepts, and solutions is always fascinating. The world of government projects, technologies, contracts, and vendors is vastly different from the world we tend to work in day-to-day. Adoption and understanding take a lot longer than what those of us in the “web space” are used to, and policy often has significant impact on how and when something can be incorporated. Yet there is an incredible market opportunity in front of social media related firms.

Government spending is obviously a tremendous force, and while sales and adoption cycles are long, it needs to be tapped. Thankfully, government agency awareness of social media is rising. From understanding the technology stack to absorbing communication paradigm shifts (e.g. Twitter & Facebook), government firms and teams are realizing the need for integration and use. Whether it’s the Defense Department’s need to apply predictive algorithms to new communication streams, or disaster recovery organizations needing to tap into crowdsourcing when catastrophe strikes, a vast array of teams are engaging at an increasing rate. A friend of mine lit up a room at the recent Emergency American Red Cross Summit when he showed how communication (messages and photos) can be mashed up onto a map in real time (via Gnip, btw); highly relevant when considering disaster situations. “Who’s there?” and “What’s the situation?” are questions easily answered when social data streams are tapped and blended.

The social media echo chamber we live in is broadening to include significant government agencies, and the fruits falling from today’s social applications are landing in good places. I’m looking forward to participating in the burgeoning conversation around social media and government’s digestion of it. I encourage you to dive in as well, though be prepared for a relatively slow pace. Don’t expect the turnaround times we’ve become accustomed to; rather, consider backgrounding some time in the space and treating it as an investment with a longer-term payoff.

Official Google Buzz Firehose Added to Gnip’s Social Media API

Today we’re excited to announce the integration of the Google Buzz firehose into Gnip’s social media data offering. Google Buzz data has been available via Gnip for some time, but today Gnip became one of the first official providers of the Google Buzz firehose.

The Google Buzz firehose is a stream of all public Buzz posts (excluding Twitter tweets) from all Google Buzz users. If you’re interested in the Google Buzz firehose, here are some things to know:

  • Google delivers it via PubSubHubbub. If you don’t want to consume it via PubSubHubbub, Gnip makes it available in any of our supported delivery methods: Polling (HTTP GET), Streaming HTTP (Comet), or Outbound HTTP POST (Webhooks).
  • The format of the Firehose is XML Activity Streams. Gnip loves Activity Streams and we’re excited to see Google continue to push this standard forward.
  • Google Buzz activities are geo-enabled. If the end user attaches a geolocation to a Buzz post (either from a mobile Google Buzz client or through an import from another geo-enabled service), that location will be included in the Buzz activity.
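
As a rough illustration of the webhook (Outbound HTTP POST) delivery option mentioned above, here is a minimal Python sketch that accepts a POSTed XML Activity Streams payload and prints each Atom entry’s title. The port, the lack of authentication, and the assumption that each POST carries a full Atom feed are simplifications for illustration; check the Gnip documentation for the exact payload and delivery details.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace

class BuzzWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the XML Activity Streams payload that gets POSTed to us.
        length = int(self.headers.get("Content-Length", 0))
        feed = ET.fromstring(self.rfile.read(length))

        # Each Atom <entry> is one Buzz activity; print its title as a demo.
        for entry in feed.iter(ATOM + "entry"):
            print("Buzz activity:", entry.findtext(ATOM + "title", default=""))

        self.send_response(200)  # acknowledge receipt quickly
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), BuzzWebhookHandler).serve_forever()
```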

We’re excited to bring the Google Buzz firehose to the Social Media Monitoring and Business Intelligence community through the power of the Gnip platform.

Here’s how to access the Google Buzz firehose. If you’re already a Gnip customer, just log in to your Gnip account and with 3 clicks you can have the Buzz firehose flowing into your system. If you’re not yet using Gnip and you’d like to try out the Buzz firehose to get a sense of volume, latency, and other key metrics, grab a free 3 day trial at http://try.gnip.com and check it out along with the 100 or so other feeds available through Gnip’s social media API.

Migrating to the Twitter Streaming API: A Primer

Some context:

Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few partners and the world was happy. These startups were awash in real-time data and got spoiled, some might say, by the embarrassment of riches that came through the real-time feed. Over time, numerous factors caused Twitter to cease offering the firehose. There was much wailing and gnashing of teeth on that day, I can tell you!

At roughly the same time, Twitter bought the real-time search company Summize and began offering everyone access to what is now known as the Search API. Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword. Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it), and Twitter had to deal with a constantly growing number of developers connected to an inherently inefficient interface.

Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, the gardenhose, track, birddog, etc) and provided feedback that helped shape and solidify Twitter’s Streaming API.

A month ago Twitter Engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API”. In the post, he detailed several reasons why the move is beneficial to developers. Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API. Thus, both the carrot and the stick have been laid out.

The streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

Question 1:  Do I need to make the switch?

Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

Examples:

  • Use the Streaming API any time you are tracking a keyword over time or sending notifications or summaries to a subscriber.
  • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
  • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
  • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache.
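
A rough sketch of that store-first pattern for a widget backend; `local_store` and `search_twitter` are hypothetical stand-ins for your own database layer and Search API client.

```python
def load_widget_results(query, local_store, search_twitter, max_age_seconds=300):
    """Serve widget queries from your own store; only fall through to Twitter."""
    cached = local_store.get(query)          # returns None on a miss
    if cached is not None and cached["age_seconds"] < max_age_seconds:
        return cached["tweets"]              # no call to Twitter at all

    # Cache miss or stale entry: one Search API call, then refresh the store
    # so repeated widget loads for the same query don't hammer the API.
    tweets = search_twitter(query)
    local_store.put(query, tweets)
    return tweets
```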

Question 2: Why should I make the switch?

  • First and foremost, you’ll get relevant tweets significantly faster. Polling an API or RSS feed for a given set of keywords creates latency that grows linearly with the number of keywords. Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours! (A back-of-the-envelope sketch of this arithmetic follows the list.) With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.
  • With traditional API polling, each query returns N results regardless of whether any results are new since your last request. This puts the onus of deduping squarely on your shoulders. It sounds simple — cache the last N result IDs in memory and ignore anything that’s been seen before. At scale, though, high-frequency keywords consume the cache and low-frequency keywords quickly age out, which means you’ll invariably have to hit the disk and begin thrashing your database. Thankfully, Twitter has already obviated much of this in the Search API with the optional “since_id” query parameter, but plenty of folks either ignore the option or have never read the docs, and they end up with serious deduplication work. With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
  • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
  • We all complain about the limited backwards view of Twitter’s search corpus. On any given day, you’ll have access to somewhere between seven and 14 days’ worth of historical data (somewhere between a quarter and a half billion tweets), which is of limited value when trying to discover historical trends. Additionally, for high-volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system. While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index. As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
  • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).
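
To make the latency arithmetic in the first bullet concrete, here is the back-of-the-envelope calculation: if you cycle through your keyword list at one query per second, a newly posted tweet waits, on average, half a full polling cycle before you ask for it.

```python
def average_polling_latency_seconds(num_keywords, queries_per_second=1):
    """Average wait before a newly posted tweet is picked up by polling."""
    full_cycle = num_keywords / queries_per_second   # time to poll every keyword once
    return full_cycle / 2                            # a tweet appears mid-cycle on average

for n in (1_000, 100_000):
    hours = average_polling_latency_seconds(n) / 3600
    print(f"{n:>7,} keywords -> ~{hours:.1f} hours average latency")
# 1,000 keywords   -> ~0.1 hours (a little over eight minutes)
# 100,000 keywords -> ~13.9 hours (almost 14 hours)
```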

Question 3: Will I have to change my API integration?

Twitter’s Streaming API uses streaming HTTP

  • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
  • That said, you’ll need to reset the connection every time you change keywords. With the Streaming API, you upload the entire set of keywords when establishing a connection. If you have a large number of keywords, it can take several minutes to upload all of them, and during that time you won’t get any streaming results. The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data. To adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two-minute chunks.
  • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.
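
A minimal sketch of that decoupling, using a thread and an in-memory queue: the collector does nothing but read raw lines and enqueue them, while a separate worker drains the queue and does the slower transformation work. `read_stream_lines` and `handle` are placeholders for your streaming connection and your processing logic.

```python
import queue
import threading

backlog = queue.Queue()

def read_stream_lines():
    """Placeholder generator standing in for your Streaming API connection."""
    yield from ()   # replace with real streaming reads

def handle(raw_line):
    """Placeholder for parsing, indexing, routing, etc."""

def collector():
    # Keep this loop as cheap as possible so reads never fall behind the stream.
    for raw_line in read_stream_lines():
        backlog.put(raw_line)

def processor():
    # The slow work happens here, safely decoupled from data collection.
    while True:
        handle(backlog.get())
        backlog.task_done()

threading.Thread(target=collector, daemon=True).start()
threading.Thread(target=processor, daemon=True).start()
# backlog.qsize() gives a rough, continuously updated measure of your backlog.
```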

Streaming API consumers need to perform more filtering on their end

  • Twitter’s Streaming API only accepts single-term rules; no more complex queries. Say goodbye to ANDs, ORs and NOTs. This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end. From now on, you’ll add one or more of the required keywords (Avatar and/or Movie) to the Streaming API and then filter the results yourself, dropping anything that doesn’t contain both keywords or that contains the word “Game”.
  • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
  • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.
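
A short sketch of that client-side filtering for the “Avatar Movie -Game” example: the rules uploaded to the Streaming API are the single words, and the full boolean logic is reapplied locally against every tweet that arrives.

```python
# Rules sent to the Streaming API: the single words "avatar" and "movie".
# The original Search API boolean query ("Avatar Movie -Game") is enforced here.
REQUIRED = {"avatar", "movie"}
EXCLUDED = {"game"}

def matches(tweet_text):
    words = set(tweet_text.lower().split())
    return REQUIRED <= words and not (EXCLUDED & words)

def route(tweet_text):
    # There is no <keyword> element in the payload, so inspect the content
    # yourself to decide which internal bucket(s) a tweet belongs to.
    text = tweet_text.lower()
    return [keyword for keyword in sorted(REQUIRED) if keyword in text]

print(matches("Loved the Avatar movie last night"))    # True
print(matches("New Avatar movie video game trailer"))  # False: contains "game"
```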

Throttling is performed differently

  • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
  • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
  • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
  • The best place to begin your quest for increased Streaming API access is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for the coming Commercial Accounts.)
  • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track. I know, I know, who would do such a thing? Not you, certainly. But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it”, in order to get a ton of tweets flowing to you. You would be sorely disappointed, because Twitter enforces a secondary throttle: a percentage of the firehose available to each access level. The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume. Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor. No soup for you! You should monitor this by watching for “limit” notifications. If you find yourself regularly receiving these, either tighten up your keywords or request greater access from Twitter.
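
One way to watch for those limit notices, sketched below. It assumes each line of the stream is a JSON object and that a limit notice looks like {"limit": {"track": N}}, with N counting undelivered tweets; treat that exact shape as an assumption and confirm it against Twitter’s Streaming API documentation.

```python
import json

def handle_stream_line(raw_line):
    message = json.loads(raw_line)

    # A limit notice means matching tweets were dropped because you exceeded
    # your access level's share of the firehose.
    if "limit" in message:
        dropped = message["limit"].get("track", 0)
        print(f"Throttled: roughly {dropped} matching tweets were not delivered")
        # Seeing these regularly? Tighten your keywords or ask Twitter for more access.
        return

    process_tweet(message)

def process_tweet(tweet):
    """Placeholder for your normal tweet handling."""
```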

Start tracking deletes

  • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.
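
Handling deletion notices can look something like the sketch below, assuming they arrive on the same stream as {"delete": {"status": {"id": ...}}} messages (again, confirm the exact shape against Twitter’s documentation) and that `store` is a hypothetical persistence layer of your own.

```python
import json

def handle_stream_line(raw_line, store):
    message = json.loads(raw_line)

    # A deletion notice asks you to strike the tweet from anything you've stored.
    if "delete" in message:
        status_id = message["delete"]["status"]["id"]
        store.remove_tweet(status_id)   # your persistence layer's delete hook
        return

    store.save_tweet(message)           # normal case: a regular tweet
```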

Question 4: What if I want historical data too?


Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword. Depending on your use case, you may need some historical data to kick things off. If so, you’ll want to make a simultaneous query to the Search API. This means you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data plus low-latency, high-reliability future data.

And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.
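
A hedged sketch of that one-time historical pull, walking the “page” parameter of the search.twitter.com JSON endpoint until roughly 1,500 results have been collected; the parameter names and response shape reflect the Search API of the time and should be double-checked against current documentation.

```python
import requests  # third-party HTTP library (pip install requests)

SEARCH_URL = "http://search.twitter.com/search.json"   # historical endpoint

def backfill(keyword, max_pages=15, per_page=100):
    """One-time historical pull (~1,500 tweets) before relying on the stream."""
    collected = []
    for page in range(1, max_pages + 1):
        resp = requests.get(SEARCH_URL,
                            params={"q": keyword, "rpp": per_page, "page": page})
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:       # ran out of available history early
            break
        collected.extend(results)
    return collected
```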

Question 5: What if I need more help?

Twitter resources:

Streaming HTTP resources:

Gnip help:

  • Ask questions in the comments below and we’ll respond inline
  • Send email to eric@gnip.com to ask the Gnip team direct questions

Social Data in a Marketplace

Gnip: shipping & handling for data. Since our inception a couple of years ago, this is one of the ways we’ve described ourselves. What many folks in the social data space (publishers and consumers alike) surprisingly don’t understand, however, is that such a thing is necessary. Several times we’ve come up against folks who indicate that either a) “our (random publisher X) data’s already freely available through an API” or b) “I (random consumer Y) have free access to their data through their API.” While both statements are often true, they’re shortsighted.

If you’re a “web engineer” versed in HTTP and XHR with time on your hands, then accessing data from a social media publisher (e.g. Twitter, Facebook, MySpace, Digg, etc.) may be relatively straightforward. However, while API integration might be “easy” for you, keep in mind that you’re in the minority. Thousands of companies, either unable to afford a “web engineer” or simply technically focused elsewhere (if at all), need help accessing the data they need to make business decisions. Furthermore, even if you do your own integrations, how robust are your error reporting, monitoring, and overall management of your strategy? Odds are you have not given those areas the attention they require. Did your stream of data stop because of a bug in your code, or because the service you were integrated with went down? Could you receive the same data from a publisher more efficiently, while relieving load on your (and the publisher’s) system? Do you have live charts that depict how data is moving through the system (not just the publisher’s side of the house)? This is where Gnip Data Collection as a Service steps in.

As the social media/data space has evolved over the past couple of years, the necessity of a managed solution-as-a-service has become clear. As expected, the number of data consumers continues to explode, while the number of consumers with the technical capability to reliably integrate with the publishers, as a ratio of the total, is shrinking.

Finally, some good technical and formatting standards are catching on (PubSubHubbub, WebHooks, HTTP long-polling/streaming/Comet (thanks Twitter), Activity Streams), which gives everyone a vocabulary and a common conceptual understanding to use when discussing how and when real-time data is produced and consumed.

In 2010 we’re going to see the beginnings of maturation in the otherwise Wild West of social data. As things evolve I hope innovation doesn’t suffer (mass availability of data has done wonderful things), but I do look forward to giving other, less technically inclined players in the marketplace access to the data they need. For a highly focused example of this kind of maturation happening before our eyes, check out SimpleGeo. Can I do geo stuff as an engineer? Yes. Do I want to collect the thousand sources of light to build what I want to build around geo? No. I prefer a one-stop shop.

New Gnip Push API Service

The Gnip product offerings are growing today as we officially announce a new Push API Service that will help companies more quickly and effectively deliver data to their customers, partners, and affiliates. (See the TechCrunch article: Gnip Launches Push API To Create Real-Time Stream Of Business Data.)

This new offering leverages the Gnip SaaS Integration Platform but is provided as a complete white-label, embeddable solution that adds real-time push to an existing infrastructure. The main capabilities include the following:

  • Push Endpoint Management: Easily register service endpoints and APIs to create alternative Push endpoints that are powered by the Gnip platform.
  • Real-time Data Delivery: A complete white-label approach allows company-defined URLs to be enhanced for real-time data delivery. Reduce your data latency and infrastructure costs while maintaining control of data access and offloading delivery to Gnip.
  • Reporting Dashboard: Access important metrics and usage information for service endpoints through a statistics API or a web-based dashboard.

A company can add the Push API Service to its existing infrastructure in hours or days with a few steps:

  1. Tell us how you want Gnip to access your data and APIs as we have several methods depending on your infrastructure
  2. Integrate the Gnip Push API Service to your website with complete control of the user experience and branding
  3. CNAME the subdomain in order to seamlessly add the Push API Service to your existing infrastructure
  4. Track usage of the new Push Service API using a web-based console or stats API

If your company is interested in learning more about how Gnip can help move your existing repetitive API and website traffic to a more efficient push-based approach, contact us at info@gnip.com.

Reminder: Gnip Platform Updates This Friday

This post is meant to provide a reminder and additional guidance for Gnip platform users as we transition to the new Twitter Streaming API at the end of the week. We have a lot going on and want to make sure companies and developers are keeping up with the moving parts.

  • Friday, June 19th:  Twitter is turning off the original XMPP firehose that we have used as the default “Twitter Data Publisher” in the Community Edition of the platform.
  • Starting on Friday, June 19th, the new default “Twitter Data Publisher” in the Community Edition of the platform will be integrated with the new “spritzer” tier of the Twitter Streaming API. Spritzer is a sample of the Twitter stream, not a “firehose”. This is the default publicly available stream that Twitter is allowing Gnip to make available for anyone to integrate.
  • All Gnip users will be able to access full-data filters with the updated Twitter Data Publisher.
  • If your company has an authorized Twitter account for the gardenhose, shadow, or birddog tiers and does not want to build and maintain this integration, contact us by email at info@gnip.com or shane@gnip.com to discuss how Gnip can provide a solution.

Helpful information about the new Twitter Streaming API:

PS: The planned Facebook integration is coming along and we have our internal prototype completed. We’re driving toward the beta and should have more details in the next week or two.

PPS: We would still appreciate any feedback people can provide on their Twitter data integration needs – take the survey