Streaming Data Just Got Easier: Announcing Gnip’s New Connector for Amazon Kinesis

I’m happy to announce a new solution we’ve built to make it simple to get massive amounts of social data into the AWS cloud environment. I’m here in London for the AWS Summit where Stephen E. Schmidt, Vice President of Amazon Web Services, just announced that Gnip’s new Kinesis Connector is available as a free AMI starting today in the AWS Marketplace. This new application takes care of ingesting streaming social data from Gnip into Amazon Kinesis. Spinning up a new instance of the Gnip Kinesis Connector takes about five minutes, and once you’re done, you can focus on writing your own applications that make use of social data instead of spending time writing code to consume it.


Amazon Kinesis is AWS’s managed service for processing streaming data. It has its own client libraries that enable developers to build streaming data processing applications and get data into AWS services like Amazon DynamoDB, Amazon S3 and Amazon Redshift for use in analytics and business intelligence applications. You can read an in-depth description of Amazon Kinesis and its benefits on the AWS blog.

We were excited when Amazon Kinesis launched last November because it helps solve key challenges that we know our customers face. At Gnip, we understand the challenges of streaming massive amounts of data much better than most. Some of the biggest hurdles – especially for high-volume streams – include maintaining a consistent connection, recovering data after a dropped connection, and keeping up with reading from a stream during large spikes of inbound data. The combination of Gnip’s Kinesis Connector and Amazon Kinesis provides a “best practice” solution for social data integration with Gnip’s streaming APIs that helps address all of these hurdles.

Gnip’s Kinesis Connector and the high-availability Amazon AWS environment provide a seamless “out-of-the-box” solution to maintain full fidelity data without worrying about HTTP streaming connections. If and when connections do drop (it’s impossible to maintain an HTTP streaming connection forever), Gnip’s Kinesis Connector automatically reconnects as quickly as possible and uses Gnip’s Backfill feature to ingest data you would have otherwise missed. And due to the durable nature of data in Amazon Kinesis, you can pick right back up where you left off reading from Amazon Kinesis if your consumer application needs to restart.
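To make this concrete, here’s a minimal sketch of the consume-and-forward pattern the connector automates, written with boto3 and requests. The stream URL, credentials, Kinesis stream name, and region below are placeholders, and real deployments should use the AMI rather than hand-rolled code like this:

import time

import boto3
import requests

# Placeholder values; substitute your own Gnip stream URL and Kinesis stream.
STREAM_URL = "https://stream.gnip.com/accounts/ACCOUNT/publishers/twitter/streams/track/prod.json"
KINESIS_STREAM = "gnip-social-data"

kinesis = boto3.client("kinesis", region_name="us-east-1")

def consume_forever(auth):
    """Read Gnip's streaming HTTP API and forward each activity to Kinesis,
    reconnecting with backoff whenever the connection drops."""
    backoff = 1
    while True:
        try:
            with requests.get(STREAM_URL, auth=auth, stream=True, timeout=90) as resp:
                resp.raise_for_status()
                backoff = 1  # healthy connection: reset the backoff
                for line in resp.iter_lines():
                    if not line:
                        continue  # Gnip sends blank keep-alive lines
                    kinesis.put_record(
                        StreamName=KINESIS_STREAM,
                        Data=line,
                        PartitionKey=str(hash(line) % 1000),  # spread records across shards
                    )
        except requests.RequestException:
            # Reconnect quickly; Gnip's Backfill feature replays recently
            # missed data once the connection is reestablished.
            time.sleep(backoff)
            backoff = min(backoff * 2, 30)

consume_forever(auth=("user@example.com", "password"))  # placeholder credentials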

In addition to these features, one of the biggest benefits of Amazon Kinesis is its low cost. To give you a sense for what that low cost looks like, a Twitter Decahose stream delivers about 50MM messages in a day. Between Amazon Kinesis shard costs and HTTP PUT costs, it would cost about $2.12 per day to put all this data into Amazon Kinesis (plus Amazon EC2 costs for the instance).
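The back-of-the-envelope arithmetic behind that figure, assuming Kinesis’s launch-era pricing of $0.015 per shard-hour and $0.028 per million PUT operations, with two shards provisioned for headroom (check current AWS pricing before relying on these numbers):

messages_per_day = 50_000_000   # approximate Decahose volume
shard_hour_cost = 0.015         # USD per shard-hour (launch-era pricing)
put_cost_per_million = 0.028    # USD per million PUT operations
shards = 2                      # headroom for spikes in inbound volume

shard_cost = shards * 24 * shard_hour_cost                  # $0.72/day
put_cost = (messages_per_day / 1e6) * put_cost_per_million  # $1.40/day
print(f"${shard_cost + put_cost:.2f} per day")              # $2.12 per day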

Gnip’s Kinesis Connector is ready to use starting today for any Twitter PowerTrack or Decahose stream. We’re excited about the many new, different applications this will make possible for our customers. We hope you’ll take it for a test drive and share feedback with us about how it helps you and your business do more with social data.


Plugging In Deeper: New Brandwatch API Integration Brings Gnip Data to Brands

Earlier today, our partner Brandwatch made an announcement that we expect to be a big deal for the social data ecosystem. Brandwatch has become Gnip’s first Plugged In partner to offer an API integration that allows their customers to get full Twitter data from Gnip, using the methods and functionality of the new Brandwatch Premium API. Brandwatch’s customers can now apply all the power of Brandwatch Analytics – including their query building tools, custom dashboard visualizations, sentiment, demographics, influence data and more – to reliable, complete access to the Twitter firehose from Gnip. With this first-of-its-kind integration, brands and agencies have the opportunity to get social media analytics from a leading provider together with full Twitter data from Gnip, using one seamless API.


Brands and agencies are increasingly using social data to make business decisions outside the marketing department, and Brandwatch’s new API offering fills an important gap that will make this much easier. We’ve seen an uptick in demand from brands wanting to use social data outside of their social listening services: to power CRM applications, to incorporate social data into business intelligence tools alongside other business data, and to build custom dashboards that combine social data with other key metrics. At the same time, these brands often face a challenge. They’ve invested significant time and resources using their social listening services to home in on the data that’s most important to them. Additionally, their social listening services provide valuable analytics and additional metadata that brands rely on to help make sense of social data. But when it comes time to consume social data in other applications, until now they’ve needed to build a separate integration with Gnip. Our new integration with Brandwatch’s Premium API gives their customers a “best of both” solution. It provides a seamless way to combine the powerful social media listening and analytics service they’ve come to rely on with full Twitter data from the world’s most trusted social data provider.

For folks interested in the technical details, the way this works is simple. When Brandwatch customers make API calls for Twitter data, they get routed through a Gnip app that fetches Brandwatch data and merges it with data from Gnip. This means Brandwatch customers can have full assurance that they’re getting licensed Twitter data directly from Gnip.
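As a rough illustration of that routing-and-merging pattern (and only an illustration: the endpoints, field names, and merge key below are hypothetical, not Brandwatch’s or Gnip’s actual implementation):

import requests

def fetch_merged_mentions(query_id, bw_token, gnip_auth):
    """Hypothetical sketch: fetch Brandwatch analytics for a query, then
    attach the full, Gnip-licensed Tweet payload to each mention."""
    # Both URLs are illustrative placeholders, not real endpoints.
    bw = requests.get(
        f"https://api.brandwatch.example/queries/{query_id}/mentions",
        headers={"Authorization": f"Bearer {bw_token}"},
    ).json()

    tweet_ids = [m["tweet_id"] for m in bw["mentions"]]
    gnip = requests.post(
        "https://api.gnip.example/tweets/lookup",
        auth=gnip_auth,
        json={"ids": tweet_ids},
    ).json()

    full_tweets = {t["id"]: t for t in gnip["tweets"]}
    # Merge: every mention carries its licensed, full-fidelity Tweet.
    return [dict(m, tweet=full_tweets.get(m["tweet_id"]))
            for m in bw["mentions"]]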

We know a straightforward, integrated solution like this is something brands have been asking for, and we’re glad it’s finally here. To learn more about how the Brandwatch Premium API works, join our joint webinar next week or contact us at info@gnip.com.


Geo Operators and the Oscars: Using Gnip’s Search API to Identify Tweets Originating from Hollywood’s Dolby Theater

At Gnip, we’ve long believed that the “geo” part of social data is hugely valuable. Geodata adds real-world context to help make sense of online conversations. We’re always looking for ways to make it easier to leverage the geo component of social data, and the geo features in our Search API provide powerful tools to do so. Gnip’s Search API now includes full support for all PowerTrack geo operators, which means analysts, marketers and academics can now get instantaneous answers to their questions about Twitter that require location.

In this post, we’ll walk through exactly what these Search API features look like, and then look at an example from the Oscars of where they add real value. We will find the people actually tweeting from inside Hollywood’s Dolby Theater and do deeper analysis of those Tweets.

What It Does

Gnip’s Search API for Twitter provides the fastest, easiest way to get up and running consuming full-fidelity Twitter data. Users can enter any query using our standard PowerTrack operators and get results instantaneously through a simple request/response API (no streaming required).

Using the Search API, you can run queries that make use of latitude/longitude data, specifically with our bounding_box and point_radius rules. These rules let you limit results to content within a box with sides of up to 25 miles, or within a circle with a radius of up to 25 miles. They can be particularly helpful for cutting down on noise and identifying the most relevant content for your query.

These rules work against both types of Twitter geodata available from Gnip (example rules follow the list below):

  1. Twitter’s “geo” value: The native geodata provided by Twitter, based on metadata from mobile devices. This is present for about 2% of Tweets.
  2. Gnip’s “Profile Geo” value: Geocoded locations from users’ profiles, provided by Gnip. This is present for about 30% of Tweets.
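For example, here’s what these rules look like in practice. The coordinates are illustrative; the first two rules match on native Tweet geodata, while the profile_point_radius variant (per Gnip’s PowerTrack operator docs) applies the same geometry to Profile Geo locations:

point_radius:[-105.27346 40.01924 10.0mi]
bounding_box:[-105.301758 39.964069 -105.178505 40.094551]
profile_point_radius:[-105.27346 40.01924 10.0mi]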

An Oscars Example: What apps do the glitterati use to tweet?

There were 17.1 million Oscar-related Tweets this year. Ellen DeGeneres’ selfie Tweet at the Oscars was the most retweeted Tweet ever. If you run a search for “arm was longer” during the Oscars, you’ll get a result set that looks like this — over 1.5 million retweets in the first hour following @theellenshow’s Tweet.

[Chart: Tweet volume for “arm was longer” in the hour after @theellenshow’s selfie Tweet]

The Tweet created a bit of a stir since it was sponsored by Samsung and Ellen used a Samsung Galaxy to tweet it, but then used an iPhone backstage shortly afterward. We’ve seen plenty of interest in the past in the iPhone vs. Android debate, so we thought we’d use the opportunity to check what the real “score” was inside the building. What applications were other celebrities using to post to Twitter?

Using the PowerTrack point_radius operator, we created a very narrow search for a 100 meter radius from the center of the Dolby Theater — roughly the size of the building — for the time period from 5:30-11:30 PM PST on Sunday night. It’s a single-clause query:

point_radius:[-118.3409742 34.1021528 0.1km]
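A minimal sketch of running that rule through the Search API with Python’s requests library; the account name, stream label, and credentials are placeholders, and the endpoint shape follows Gnip’s documented search pattern but should be treated as illustrative:

import requests

# Placeholder account name and stream label.
url = "https://search.gnip.com/accounts/ACCOUNT/search/prod.json"
params = {
    "q": "point_radius:[-118.3409742 34.1021528 0.1km]",
    "fromDate": "201403030130",  # 5:30 PM PST on Oscar night, as UTC
    "toDate": "201403030730",    # 11:30 PM PST, as UTC
    "maxResults": 500,
}
resp = requests.get(url, params=params, auth=("user@example.com", "password"))
tweets = resp.json().get("results", [])
print(len(tweets), "Tweets returned")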

The result set that came back included about 250 Tweets from 98 unique users during that narrow time period, and you can see from the spike during the “selfie” moment that it’s a much more refined data set:

[Chart: geotagged Tweets originating inside the Dolby Theater during the ceremony]

Using this geo search parameter, you can quickly identify the people actually at the Oscars during that major Twitter moment.

Among them, we saw the following application usage:

[Chart: Geotagged Tweets by Application]

With all the chatter about the mobile device battles, it’s interesting to see so many Oscar attendees actually posting to Twitter from Instagram versus Twitter mobile clients. Foursquare also made a solid showing in terms of the attendees’ preferred Twitter posting mechanism.

This is just one example of how using Gnip’s geo PowerTrack operators in the Search API can help with doing more powerful social media analysis. For a more detailed overview of all the potential geo-related PowerTrack queries that can be created, check out our support docs here.

P.S. – More on Geo & Twitter Data at #SXSW

If you get excited about social data + geo and will be in Austin this weekend, come check out our session “Beyond Dots on a Map: Visualizing 3 Billion Tweets” on Sunday at 1 pm. I’ll be on stage with our good friend Eric Gundersen from Mapbox talking about the work we did together to visualize a huge set of Twitter data from around the world.

Get More Twitter Geodata From Gnip With Our New Profile Geo Enrichment

[Map: Giants fans in the US Tweeting from the stadium]

When it comes to analyzing social data, “where” matters. After the topics of conversations, perhaps the strongest connection between social conversations online and the offline world is location. Location is an implicit part of what we do, who we know, what we need, etc. For years now at Gnip, the most requested feature for our existing data products has been “more geodata” to help our customers understand the offline locations that are relevant to online conversations. Today we’re pleased to announce a major step toward meeting that demand: the public beta launch of our new Profile Geo enrichment.

The Profile Geo enrichment is simple. Location data is provided publicly by millions of users in their profiles on social networks, but it’s rarely delivered in a normalized format with consistent latitude/longitude coordinates that are necessary for software to ingest the data and make use of it. The Profile Geo enrichment from Gnip normalizes this data to common geographies (for instance, “NYC,” “Manhattan,” etc. all map to “New York City, NY, US”) and provides latitude/longitude coordinates for those places so it’s easy to plot social data on a map.
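As an illustration, a Tweet whose author lists “NYC” as their profile location might carry an enrichment along these lines. The field names echo Gnip’s Activity Streams conventions, but this is a simplified sketch rather than the exact schema:

# Simplified sketch of a Profile Geo enrichment on an activity payload.
activity = {
    "body": "Best bagels in town",
    "actor": {"location": {"displayName": "NYC"}},
    "gnip": {
        "profileLocations": [{
            "displayName": "New York City, New York, United States",
            "geo": {"type": "point", "coordinates": [-74.0060, 40.7128]},
        }],
    },
}

# Pulling out plottable coordinates:
loc = activity["gnip"]["profileLocations"][0]
lng, lat = loc["geo"]["coordinates"]
print(loc["displayName"], lat, lng)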

Our customers are hungry to analyze Twitter through a geographic lens. As a brand, it can be great to know that people are talking about my brand and products online, but few things make those conversations more actionable than knowing where those conversations are taking place. Do we need to change our marketing campaign in a region? Focus on improving customer service? As government or civil society organizations responding to crises, location is the key to identifying need in an actionable way and then deploying resources effectively. It may be obvious we need clean water and blankets, but where is the most important place to send them?

For this new enrichment, we started with Twitter because it offers the biggest initial gain for our customers. While less than 2% of Tweets in the Twitter Firehose contain latitude/longitude coordinates for Twitter’s “geotagged” Tweets, more than half of all Tweets contain a profile location value from a user. And while just 1% of users generate approximately two-thirds of all geotagged Tweets (according to this helpful paper from our friend Kalev Leetaru and his colleagues), profile location data is much more evenly distributed. In that way, looking at profile location data “democratizes” the data that appear when mapping Twitter content – our customers can now hear from the whole world of Twitter users and not just this 1%.

This new premium enrichment from Gnip provides several key benefits for social data analysis. First, it increases the amount of usable Twitter geodata available for analysis by more than 15x. Second, it adds a kind of geodata distinct from what is natively available from social sources. To understand this second benefit, it’s important to think about the three different types of location that exist in social media.

  • Activity Location: Where the activity (Tweet, Check-in, etc.) directly came from, via GPS signal on a user’s device or association with a known venue location. This is the kind of location that provides latitude/longitude natively in Twitter’s or Foursquare’s firehoses.

  • Profile Location: The place the user provides as their location in their profile. They may or may not be there when posting to a social network.

  • Mentioned Locations: Places the user talks about in a post or check-in. These places may not have anything to do with where the person lives or where the person is when posting, e.g. “I can’t wait for Gnip to open its new office in the Maldives.” (The Maldives in this case might as well be a fictitious place considering the likelihood that will happen.)

Profile location data can be used to unlock demographic data and other information that is not otherwise possible with activity location. For instance, US Census Bureau statistics are aggregated at the locality level and can provide basic stats like household income. Profile location is also a strong indicator of activity location when the latter isn’t provided.

To get a sense of the impact of the Profile Geo enrichment in practice, we worked with the team at MapBox again to create a map of Tweets about the San Francisco Giants over the past few weeks (PS: check out the other maps we made together if you haven’t seen them). During that time period, over two thousand Tweets occurred at AT&T Park that were geotagged with the activity location. With the addition of the Profile Geo enrichment for the same Tweets, it’s now possible to quickly create a map that shows the relationship between activity location (all in the Park), and profile location – where those people came from to watch the game. Next time the Giants franchise wants to think about tourist attendance numbers, they’ll have a new way to do so. Check it out.

SF Giants Tweets from the stadium (center point of the orange lines) link to the profile locations of those users around the globe, showing how far they traveled. Click on the “USA” toggle to see the whole world. Hover over states/countries to see total counts.

The Profile Geo enrichment is now available to all Gnip customers as an option on their Twitter data products in this beta release. We’re looking forward to seeing how this enrichment changes what can be done with location and social data.

If you’re interested in learning more, please visit gnip.com/enrichments or hit us up at info@gnip.com.

Klout Scores & Topics Together in a New Twitter Enrichment from Gnip


Today we’re announcing a new stage in our partnership with Klout. Gnip is now the exclusive provider of Klout Topics to the social data ecosystem. Along with this new data, we’re also able to provide Klout data under enterprise licensing terms specifically designed to support our customers’ needs and use cases. Together, Klout Scores, Klout Topics and enterprise licensing terms create a more powerful online influence product that is uniquely suited for social media monitoring, engagement and analytics use cases.

With this announcement, we’d also like to welcome Klout to the Plugged In to Gnip partner program. For Klout, this shows they’re committed to building their solutions on sustainable, complete and reliable social data and are joining fellow industry leaders in this pursuit.

The Klout Score established Klout as “the standard in online influence,” and our Klout Score Enrichment is one of Gnip’s most popular data products – more than half our customers get Klout Scores delivered alongside Twitter data. The addition of Klout Topics will help our customers understand and evaluate online influence through a different lens. Where Scores tell you how much potential a user has to drive engagement online, Topics tell you what a user’s influence is about so you can better connect with and target the right users.

For example, consider a sports brand that sells golf equipment. The brand wants to identify users who tweet about their products and try to turn them into devoted fans. In a stream of thousands of Tweets, they want to identify the influencers talking about their brand so they can focus their engagement on these people – and they want to do this with smart software, not by hand. Klout Scores can help them identify the difference between my good friend Justin (not likely to drive much engagement) and Justin Bieber, whose Tweets regularly get retweeted 50,000-100,000 times. But while Justin Bieber may be a major influencer, his online influence doesn’t have much to do with golf, or really with sports in general. Using Klout Topics, the brand’s software can now quickly home in on influencers with a Klout Score above a certain level who are influential in relevant Topics like “Golf,” “PGA Tour,” “Masters Golf Tournament,” “Nike Golf,” “Sergio Garcia,” etc. These users probably won’t have millions of teenage girls ready to engage with them, but their audiences may be more likely to make buying decisions that affect the brand.
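In code, that filtering step might look like the sketch below. The payload field names are modeled on Gnip’s Klout enrichments but simplified, and the score threshold and topic list are arbitrary:

GOLF_TOPICS = {"Golf", "PGA Tour", "Masters Golf Tournament", "Nike Golf"}
MIN_SCORE = 60  # arbitrary threshold for "major influencer"

def is_golf_influencer(activity):
    """Keep authors with a high Klout Score who are influential
    about golf-related Klout Topics."""
    gnip = activity.get("gnip", {})
    score = gnip.get("klout_score", 0)
    topics = {t.get("displayName") for t in
              gnip.get("klout_profile", {}).get("topics", [])}
    return score >= MIN_SCORE and bool(topics & GOLF_TOPICS)

activities = [  # stand-in for a stream of enriched activities
    {"gnip": {"klout_score": 72,
              "klout_profile": {"topics": [{"displayName": "Golf"}]}}},
    {"gnip": {"klout_score": 90,
              "klout_profile": {"topics": [{"displayName": "Pop Music"}]}}},
]
print(sum(is_golf_influencer(a) for a in activities))  # -> 1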

There are more than 7,000 Klout Topics, covering everything from broadly relevant categories like “Social Media” or “Software Development” to highly specific entities like the names of brands, products, movies, TV shows, or celebrities. Approximately two-thirds of users with Klout Scores are categorized with at least one Topic.

While Klout Topics are the heart of the new data we’re providing, we’ve also addressed some important licensing limitations to meet our customers’ need for sustainable Klout data suitable for commercial use. Gnip can now serve Klout Scores & Topics together under an enterprise license that goes beyond Klout’s basic Developer Terms of Service, including the important ability for our customers to save and use data for longer than seven days.

Online influence data provides critical insight for engagement and CRM use cases. We’re looking forward to what this new Klout data combo will do for these applications, as well as others in the world of social media analytics.

To learn more, check out Gnip.com/Enrichments.

Mapping Travel, Languages & Mobile OS Usage with Twitter Data

Some of the most compelling use cases we’ve seen for analyzing Twitter data involve geolocation. From NGOs looking at geotagged Tweets to help deploy resources after disasters, to brands paying attention to where their fans are (or their disgruntled customers) to help drive engagement and marketing strategies, location adds key value to Tweet content.

We’ve been fascinated by these use cases and have wondered what else could be done with this data. A couple months ago our Data Science team set out to explore these questions, and to create some resources at the same time that would help others study and make use of geotagged Tweets. We brought in the team at MapBox – including data artist Eric Fischer – to help us dig into the data and visualize what we found in fast, fully navigable geotagged Twitter maps that would let us and our readers really explore this data in depth.

The interactive maps we created together build on other recent analyses and visualizations of Twitter data done by others, including this great post about details of the data and these static maps from Twitter’s Visual Insights team. The results are stunning, and we hope they’re helpful for you to make the data more practical and accessible as you evaluate what else you could be doing with geolocation in Twitter.

Locals and Tourists (Round 2)

Where do people tweet relative to where they live?

In 2010, Eric Fischer made a static map he called “Locals and Tourists” that showed geolocation for both Tweets and Flickr photos side by side, with the data color coded to show when a post was by a “local” (a post at or near the user’s stated home location) or a “tourist” (a post far from the user’s home location). Twitter has matured significantly since then, and we wanted to see what we could learn from looking at just the Twitter data today, with the ability to browse at any local level around the world. We gathered a sample of Twitter data with unique geotagged Tweet locations from the past ~18 months to generate this new interactive map.
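Under the hood, the local-vs.-tourist distinction reduces to a distance check between a Tweet’s coordinates and the user’s geocoded home location. A minimal sketch, where the 100 km threshold is our assumption for illustration:

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def classify(tweet_lat, tweet_lng, home_lat, home_lng, threshold_km=100):
    """Label a geotagged Tweet 'local' if it's near the user's stated
    home location, and 'tourist' otherwise."""
    dist = haversine_km(tweet_lat, tweet_lng, home_lat, home_lng)
    return "local" if dist <= threshold_km else "tourist"

# A Tweet from Times Square by a user whose profile geocodes to Philadelphia:
print(classify(40.7580, -73.9855, 39.9526, -75.1652))  # -> tourist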

As the dynamic maps took shape, the new version of “Locals and Tourists” impressed us in a couple of ways. The first was simply how much resolution Twitter data provides. For instance, not only are primary and secondary roads clearly visible, but you can also distinguish roads taken by tourists from roads used for local commutes, like I-95 snaking past Wilmington, DE and Philadelphia, PA in red across the bottom third of the image below:

[Map: Locals and Tourists view of I-95 between Wilmington, DE and Philadelphia, PA]

You can also clearly see the outlines of buildings like airports, sports stadiums, and major shopping malls that are frequented by tourists. Dig into your local area and see for yourself.

This map could be a resource for city planners, the travel industry, or for creative marketers thinking about how to localize their mobile advertising for different audiences.

Device Usage Patterns

This map shows off usage patterns for various mobile operating systems used to tweet around the world. Since geotagged Tweets require a Twitter client that includes GPS support, most geotagged Tweets come from handheld devices – and we can look at exactly which client was used in the “generator” metadata field provided by Twitter. Among other things, this visualization suggests correlations between mobile OS and income level in the US, and highlights just how prolific Blackberry use is in Southeast Asia, Indonesia and the Middle East.
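Tallying the clients is straightforward once you have the activities in hand. In Gnip’s Activity Streams format the posting client appears in the “generator” field; a minimal sketch with inline sample data:

from collections import Counter

activities = [  # stand-in sample; real data comes from a Gnip stream
    {"generator": {"displayName": "Twitter for iPhone"}},
    {"generator": {"displayName": "Twitter for Android"}},
    {"generator": {"displayName": "Twitter for iPhone"}},
    {"generator": {"displayName": "Twitter for BlackBerry"}},
]

clients = Counter(a.get("generator", {}).get("displayName", "unknown")
                  for a in activities)
print(clients.most_common())  # [('Twitter for iPhone', 2), ...]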

Languages of the World

Using the same data sample, this final visualization plots where people tweeted in various languages, using metadata from the Gnip Language Detection Enrichment and the Chromium Compact Language Detector as a fallback.
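The fallback logic is simple: prefer the language tag from Gnip’s enrichment when present, and only run the text through CLD when it isn’t. A sketch using the pycld2 bindings for the Chromium detector, with the field path simplified:

import pycld2  # Python bindings for the Chromium Compact Language Detector

def detect_language(activity):
    """Prefer Gnip's Language Detection enrichment; fall back to CLD."""
    tagged = activity.get("gnip", {}).get("language", {}).get("value")
    if tagged:
        return tagged
    is_reliable, _, details = pycld2.detect(activity.get("body", ""))
    # details holds (languageName, languageCode, percent, score) entries
    return details[0][1] if is_reliable else "und"

print(detect_language({"body": "¿Dónde está la biblioteca?", "gnip": {}}))  # -> es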

For starters, this map makes clear that English is still the dominant language on Twitter around the world — toggling to the English-only view reveals nearly as much resolution in the global map as when all languages are enabled:

[Map: Twitter visualization, English-language Tweets only]

[Map: Twitter visualization, all languages]

What might come as more of a surprise though is just how many other languages are being spoken frequently, and particularly how much overlap there is in the United States:

[Map: Non-English Tweets across the US; Spanish in green]

A Note on the Data

These maps are created with a data set that was significantly culled down to remove locations that would create visual noise. From the original data set, the following were removed (a sketch of this culling pass follows the list):

  • Multiple geotagged Tweets in the exact same location (we made no attempt to communicate density in these visualizations)
  • Geotagged Tweets from the same user in very close proximity to other Tweets from the same user
  • Geotagged Tweets from known or detectable bots
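A minimal sketch of that culling pass; the roughly 1 km per-user proximity threshold is an assumption for illustration, and bot detection is reduced to a simple lookup:

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def cull(tweets, known_bots, per_user_km=1.0):
    """Drop exact-duplicate coordinates, near-duplicate points from the
    same user, and Tweets from known bot accounts."""
    seen_coords, last_by_user, kept = set(), {}, []
    for user, lat, lng in tweets:
        if user in known_bots or (lat, lng) in seen_coords:
            continue
        prev = last_by_user.get(user)
        if prev and haversine_km(lat, lng, *prev) < per_user_km:
            continue
        seen_coords.add((lat, lng))
        last_by_user[user] = (lat, lng)
        kept.append((user, lat, lng))
    return kept

points = [("a", 40.0, -105.0), ("a", 40.001, -105.0), ("b", 40.0, -105.0)]
print(cull(points, known_bots=set()))  # -> [('a', 40.0, -105.0)]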

Together these maps point to something powerful – by looking at geolocation data from Twitter in the aggregate, important understanding can be gained to drive marketing, product development, crisis response, or even inform research and policy decisions. In the coming weeks, we’ll be digging in deeper here on the blog to explore other important aspects of geolocation in social data that we hope together will build a picture of the opportunity that exists in understanding social data geospatially.

Find something compelling here or in any of the other maps? Tell us with a Tweet: @gnip.

Sentiment Analysis & Social Data — Picking the Right Tool for the Job


We were recently at the Sentiment Analysis Symposium with many of the leaders in the sentiment space. Conference presenters spanned a spectrum: academics, engineers and executives involved in researching and developing sentiment analysis tools. The event was full of great content, and we wanted to share some takeaways.

It was impressive to see the variety of use cases and solutions that were presented — both signs of a rapidly maturing market. Because of that variety, it was also clear that a one-size-fits-all solution for sentiment analysis of social data is not on the immediate horizon. An overarching message from the event was that the different use cases for sentiment analysis and the types of social content being analyzed demand tailored sentiment tools purpose-built for the job.

I was also struck by just how central social data was to the conversation, with most of the speakers talking about analyzing social data in some form. Sentiment analysis has clearly become important to social data analysis, and the reverse is true as well: social data is really attractive to companies and researchers seeking ways to understand the moods and opinions of broad populations of consumers and citizens.

Here are some of the highlights of the business cases that were presented at the conference:

  • Mood Analysis Informs Market Decisions at Thomson Reuters — Scanning a broad swath of social content and news allows Thomson Reuters to build their MarketPsych Indices, including their “Gloom Index,” which they’ve demonstrated provides a leading indicator for market drops. Thomson Reuters’ Aleksander Sobczyk made an interesting point for our readers: the social content they focus on is long form only (i.e. blogs), and not short form content like Twitter. Developing the depth of insight and certainty essential to their current solution requires a bigger text sample than Twitter provides.
  • Guiding Product Development at Dell — Through its Social Media Listening Command Center, Dell picks up on customer sentiment and uses it to drill in and learn more. For instance, they picked up negative sentiment about a new Alienware laptop right after a release: units were overheating. The negative sentiment tipped them off, and their product team took action to learn more from users. Through engaging their customers online, they were able to discover an idiosyncrasy in laptop use among power users. When plugged into big auxiliary monitors, users were keeping the laptop screen nearly closed, blocking the exhaust fans. Dell fixed the fan placement and eliminated the problem for future customers. Knowing very specifically what the negative sentiment was about was critical to catching a problem spot among loyal fans and power users.
  • Understanding Weibo Emotions (了解微博的情感) — Detecting sentiment in other languages with different linguistic structures and use habits can be difficult, which is why Soshio has tried to normalize around universal emotions to help Western brands make sense of what Chinese weibo (microblog) users are saying about their brands.  According to Ken Hu, Soshio’s founder, they’re actually having to translate written language “into sound” to address sentiment challenges like written puns that only make sense when pronounced audibly.

These are just a sample of the stories we heard from the companies at the event, and what we hear from our customers building sentiment tools. Some serve customers who need to know about general sentiment (i.e. positive or negative), while others are interested in specific moods or opinions. Some support tailoring to a specific cultural context or industry vertical, while others help evaluate conversation in a broad, general context. A few are working hard to analyze sentiment in a global context — including native natural language processing for different languages, each with their own norms related to sarcasm, puns, irony, etc. Content format is a major focus as well. Evaluating 140 characters rich with hashtags and URLs on Twitter requires a much different approach than evaluating a WordPress blog post. These are all valid use cases for sentiment analysis, and all come with unique nuances and benefit from different techniques.

Sentiment analysis is a critical category of tool for broader social data analysis, but at the same time, it’s not a monolithic category. If you’re building a product to analyze social data and want to evaluate sentiment, the good news is that you can pick from a wide range of approaches and tools (or maybe even choose more than one) suited to the business problem you’re trying to address and the content you’re trying to analyze.

If you’re interested to learn more about the event and some other interesting developments and business cases in the sentiment analysis world, check out our recent interview with Seth Grimes, the lead organizer for the Sentiment Analysis Symposium.