Introducing PowerTrack for Tumblr

Gnip is introducing PowerTrack for Tumblr, a solution that makes it even easier to find the Tumblr data you want by filtering for specific content. Tumblr is an unbelievable source of social data, with more than 70 million new posts every day. Similar to PowerTrack for Twitter, PowerTrack for Tumblr will deliver full coverage of the posts you want based on the filtering criteria you create.

With PowerTrack for Tumblr, you can filter Tumblr data not only by keyword, but also by specific Tumblr blogs. This means if a brand wants to track the content on its Tumblr, it can track activity and reblogs around that specific Tumblr. In addition, users can track specific URLs, so if you’re a brand you can watch for all links to your company page. With Tumblr’s seven post types (text, photo, quote, link, chat, audio, video), it’s now possible to home in on specific post types. If you’re only interested in posts that share a song, it’s now even easier to track them. You can read about all of the available filtering rules in Gnip’s documentation.
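To make the filtering dimensions concrete, here is a minimal client-side sketch of the same logic. Real PowerTrack filtering happens server-side using Gnip's rule syntax; the activity fields below (`body`, `blog`, `links`, `type`) are illustrative assumptions, not the actual firehose schema.

```python
# Illustrative sketch of filtering by keyword, blog, URL and post type.
# All field names are assumed for the example; they are not the real schema.

def matches(post, keywords=(), blogs=(), url_fragments=(), post_types=()):
    """Return True if a Tumblr activity dict satisfies every supplied filter."""
    text = (post.get("body", "") + " " + post.get("title", "")).lower()
    if keywords and not any(k.lower() in text for k in keywords):
        return False
    if blogs and post.get("blog") not in blogs:
        return False
    if url_fragments and not any(
        frag in link for frag in url_fragments for link in post.get("links", [])
    ):
        return False
    if post_types and post.get("type") not in post_types:
        return False
    return True

post = {"type": "audio", "blog": "example-blog", "title": "New single",
        "body": "Love this track", "links": ["http://example.com/store"]}
matches(post, keywords=["track"], post_types=["audio"])  # -> True
```

The same shape applies to each filter dimension described above: an unset filter matches everything, and combining filters narrows the stream.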

Since we announced the launch of the Tumblr firehose in April of this year, we’ve been blown away by the uniqueness and richness of Tumblr content and by how active its community is. And the Tumblr analytics space is evolving quickly with the launch of Union Metrics for Tumblr several weeks ago. We know that content on Tumblr can be incredibly viral and sticky due to the ability to reblog and tag content. Whether brands have an official presence on Tumblr or not, they’re going to be mentioned there. We’re excited to offer a product that makes it easier to find those mentions and that content. Check out the details on your own or email us to learn more.

Tumblr Analytics: It’s a Whole New World

Union Metrics has been with Gnip since the early days, using our social data in their flagship product, TweetReach. Earlier this year, when we announced the availability of social data from Tumblr, we were excited that Union Metrics moved quickly to start building a new product based on that data. Last week, Union Metrics launched Union Metrics for Tumblr and was named Tumblr’s preferred analytics provider.

We’re big believers in Tumblr and the value of the conversations taking place there. As we’ve talked about in the social cocktail, Tumblr content has unique properties. Our data science shows that Tumblr content is inherently viral – able to amplify conversations about any topic – and even more than that, the content on Tumblr has incredible staying power.

And we’re not the only believers in Tumblr. Brands like Adidas and Coca-Cola have been actively engaging and advertising on Tumblr since the launch of Tumblr’s advertising platform earlier this year.

Congrats to the team at Union Metrics! This is exciting news and we’re only at the beginning.

You can read more in AdWeek, The Next Web and GigaOm.

A Moment in History: Access the Full Archive of Public Tweets

We are proud to announce that, for the first time, access to the entire historical archive of public Tweets, dating back to @Jack’s very first Tweet more than 7 years ago, is now available via our new product, Historical PowerTrack for Twitter. This product has been years in the making, and we can’t wait to see what the world will build with this data.


We believe that social data has unlimited value and near limitless application. The fast, viral nature and the newness of social conversations have naturally directed focus to realtime applications. However, as the world becomes more reliant on realtime social data and the amount of social data created grows exponentially, the need to put this information into historical context has become increasingly important. Often, companies are looking at the realtime reaction in social data and asking “is this good or bad?” This is one of the main questions historical data can answer. For example, if an auto manufacturer launches a new model and 25% of the social conversation is determined to be negative, is that healthy? Knowing that the last model launched to record sales despite 40% negativity helps put the new realtime data into context.

Historical data can also be highly informative to predictions about the future. Researchers have suggested to us that they can predict the outcome of a revolution by studying past revolutions online such as the “Arab Spring”.  Likewise, we’re seeing hedge funds make a real commitment to incorporating social data into their trading algorithms. It is critical for these funds to be able to refine their predictive trading models by studying vast quantities of historical data.

Until now, all this promise of social data has had a foundational limitation: very little reliable and complete historical data has been available. And as we know, historical analysis is only as good as the quality of the underlying data. You can’t provide complete context if you only have part of the data.  That’s why we are so excited to be the first company to offer complete coverage of all public Tweets from the beginning of time.

We’re able to deliver the full historical corpus via our long-standing partnership with Twitter. We helped Twitter deliver the full archive of Tweets to the Library of Congress. That was a massive effort that took a long time. The rest of the social data ecosystem can benefit from that effort starting today.

This level of access has never been available and we know it is really going to accelerate the rate of innovation going forward. We think there are new products and businesses that will now be possible with access to a “social layer” of historical data. We frequently ask ourselves “If you could know what the world was saying at any moment in time about any topic, what could you build?”

We’ve already been working with companies like Esri, Union Metrics, Brandwatch, Waggener Edstrom Worldwide, and Texifter during our early access period and it’s been incredible to see how fast they are innovating with this new data.

Gnip aspires to be the source of record for all public conversation. That’s a lofty goal. We’re taking a major step forward with today’s announcement.

Want to learn more about Historical PowerTrack for Twitter? Email us.

The Staying Power of Tumblr

It took two days for the poll to pop.

Three days after the pride cookie, Houston radio station KTRH dropped a question for its listeners.

“The cookie your grandfather loved has ‘gone gay!’” the station wrote on its website, “What Do You Think? Does This Rainbow Flag Cookie Bother You?”

It bothered becausegretchensaidso (now using Tumblr username Gretchenisincognito). Well, at least, the question did. That day, the user left a tumble for followers:

“This poll is from a conservative news radio station,” the user wrote, “Let’s surprise them with overwhelming results in favor of equality.”

The post trickled out, gathering almost a hundred reblogs in a 24-hour period. Then it flatlined, holding without major gains through the morning of the episode’s fifth day.

And that’s when it burst. On the evening of June 30th, with Tumblr Oreo chatter sloping back to normal, becausegretchensaidso’s message went vertical — a full 48 hours after publication. Close to 300 users shared the post in a matter of hours. A day later, that number had doubled. Over at KTRH, the poll was tilting for the pride cookie.

Content lingers on Tumblr. becausegretchensaidso and another user waited for days before posts went viral. Others watched as posts drifted forward, adding one or two reblogs each day.
Figure 1 presents the accumulation of reblogs by content posted by different Tumblr users. Excluded from the picture is palahniukandchocolate. During the Oreo episode, fewer than 10 Tumblr users originated content that drove the explosion of the story.

It’s different on Twitter. Not only were the total volumes lower (as we saw in earlier posts here and here), but the share rates of the story’s top drivers fell precipitously and sequentially as each piece of content yielded to the freshest meme. Traffic mapped a Social Media Pulse, the picture of social decay for unanticipated events. Across the nine users who drove most of the conversation on Twitter, user retweets — an analog for reblogs on Tumblr — did not display the endurance of a Tumblr conversation.

Figure 2 presents the rate of retweets by hour for content posted by top drivers of the Oreo conversation on Twitter.

For brands, the implications are clear: Conversations — promoted or unprovoked — endure on Tumblr through reblogging. That can heighten the returns to network engagement — and the risk of allowing negative perceptions to form.

Tumblr also has a sharing dynamic that can dominate a moment: During the height of the Oreo episode, reblogs made up more than 90 percent of tumbles related to the pride cookie. On Twitter, retweets rarely rose above 50 percent of the tweet volume.

Figure 3 presents the shares of Tumblr and Twitter conversations related to Oreo, at the episode’s peak, driven by shared content.

In a sense, then, on Tumblr, the creator is king: The network offers those who would speak an unprecedented platform, engineered for replication and amplification. It falls to brands to take advantage of the behavior on this platform by creating content users want to associate themselves with and pass along.

Continue reading

The Social Cocktail, Part 3: Many Publishers Build One Story

In the first post, we looked at high-level attributes of the social media publishers. Then, we spent time looking at the social media responses to expected and unexpected events. To end this series, let’s dive into an example of the evolution of a single story across a mix of publishers. This will provide some intuition into how the social cocktail works when examining a real-world event— in this case, the JPMorgan-Chase $2+ billion loss announcement on May 10, 2012.

JPMorgan-Chase Trading Loss

Twitter: Fast and Concise
On May 10, 2012, immediately after market closing, JPMorgan-Chase CEO Jamie Dimon held a shareholder call to announce a $2 billion trading loss. While traditional news agencies reported the call announcement late in the afternoon, Twitter led the way with reports from call participants who started tweeting while on the call a few minutes after it started.

To see how the volume on Twitter evolved, see Figure 1. In each case, the points represent activity volumes on the topic of JPMorgan and “loss” while the lines represent function fits to either the Social Media Pulse or a Gaussian curve (a simple approximation for expected event traffic when averaging over the daily cycle).
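Two pieces of the figures above are easy to sketch in code: the normalization (each publisher's volume is scaled so its peak equals 1) and the Gaussian model used for expected-event traffic. The hourly counts below are synthetic, purely for illustration.

```python
import math

def normalize(volumes):
    """Scale a volume series so its peak equals exactly 1."""
    peak = max(volumes)
    return [v / peak for v in volumes]

def gaussian(t, t_peak, width):
    """Simple Gaussian approximation for expected-event traffic (peak = 1)."""
    return math.exp(-((t - t_peak) ** 2) / (2 * width ** 2))

hourly = [120, 480, 950, 700, 310, 90]   # synthetic activity counts per hour
scaled = normalize(hourly)               # peak is now 1.0, comparable across publishers
model = [gaussian(t, t_peak=2, width=1.5) for t in range(6)]
```

A Social Media Pulse fit would replace the Gaussian with a sharp-onset, exponentially decaying curve; the normalization step is the same either way.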

As Reuters and others released news stories and Europe started to wake up, a second Twitter pulse is visible. Toward the right-hand side of the graph, the daily cycle of Tweets dominates the conversation about JPMorgan and “loss” with a curve more characteristic of broadly reported, expected events.

Figure 1. Twitter and StockTwits audiences comment on JPMorgan and “loss” after the announcement of a $2B trading loss on the evening of May 10, 2012. Volumes are normalized so that peak volume = 1 for each publisher.

StockTwits: Fast and Concise, Focused

Much of the analysis that applies to Twitter applies to StockTwits; the major exceptions are the expertise of the users and the focus of the content. The StockTwits service serves traders, and participants are mostly professional investors. Because the audience and the content are curated, there is very little off-topic chatter. Further, much of the content is specific analysis of JPMorgan’s loss, analysis of the stock price movement following the announcement, and information about after-hours price indicators.
On Friday (May 11th), discussion of the loss reaches only about 40% of the peak of the night before. This is likely due to the message rapidly saturating the highly connected community on StockTwits.

Comments: Both Fast and Slow, Concise

Because there was a lot of financial news attention on the story, news stories started to appear soon after the call and these attracted comments immediately (this was the fast response). The data shown in Figure 2 includes comments from both Automattic and Disqus. These comment platforms are used both on personal blogs and on news stories posted online by news organizations, so there is a mix of comments on news stories as well as personal analysis.

Figure 2. Commenters on blogs and news stories react to the announcement of a $2B trading loss on the evening of 10 May 2012, and an even stronger contingent reacts early on 11 May. Volumes are normalized so that peak volume = 1.

More-considered news and blog stories appeared on Friday morning, May 11th, and these spurred a second (slower) pulse of comment responses.

An additional pattern that is often seen in comments is that people tend to read blogs at certain times of day (e.g. morning or evening) by habit. Because of this, we sometimes see comment volumes spiking at the start or end of the day in very active timezones.

Tumblr: Medium and Very Rich

The Tumblr audience reacted to the news as if the story broke on Tumblr rather than in traditional news. This is unique among the publishers studied here. The pattern of slowly growing traffic during the first few hours after the shareholder call may indicate the nature of the conversation on Tumblr. Rather than an event-response reaction, as on Twitter, or a considered reaction, as with blogs, the reaction of the audience on Tumblr accelerates as the type of content Tumblr users reblog appears in the network. While the initial posts on Tumblr refer to news stories, the spread of the story through reblogging ramps up to the peak over a few hours.

The following day, the Tumblr story evolves like an expected event.

Not only is the timeline unique, but Tumblr content is also unique. Early posts have rich media including political cartoons and more right-brained political commentary and humor than the text-comment crowd. Adding Tumblr to your social media mix may present additional challenges in evaluating and analyzing the content, but the sensibilities as well as the activity of this audience adds a dimension not found in the content from the other publishers.

Blogs: Medium and Rich

A few quick, factual reports from the call were published in the form of blog posts as can be seen by the slight “heaviness” in the curve at the end of the day (May 10th). However, the large majority of the blog traffic is the traditional, considered and refined reactions published throughout the following day. The traffic on May 11th follows the pattern of an event everyone already knows about.  The discussion here is analysis and commentary as people explore the implications of the story.

The large majority of the blog content is text or text with a picture of Mr. Dimon. Stories vary from dozens of words to a few thousand.

Figure 3. Content-rich and text-rich reactions to the announcement of a $2B trading loss on the evening of 10 May 2012. In-depth analysis continues with heavy posting during the day on 11 May. Volumes are normalized so that peak volume = 1 for each publisher.

Finally, take a look at these timelines shown together in Figure 4.  This view gives a clear indication of the timing of reactions between the publishers.

Social media reaction to JPMorgan trading loss

 Figure 4. The points show the normalized volume of activities about “JPMorgan” and “loss” following the May 10th announcement from Jamie Dimon. Lines represent fits to models of typical social media reactions.  Volumes are normalized so that peak volume = 1 for each publisher.

This example story demonstrates the potential of mixing perspectives, audiences and styles of conversation to create a full description of the social media response to events. With the right mix, we can identify stories and emerging topics within minutes and quickly characterize the relative size and speed of a story. We can identify user engagement, dig into deeper analysis, and measure the rate and focus of content sharing. With this mix of social data, we might be getting close to the perfect cocktail.

A Roundup of the Pacific Northwest BI Summit

I recently attended the Pacific Northwest BI Summit, an exclusive, invite-only Business Intelligence summit held in Grants Pass, Ore., by Scott Humphrey of Humphrey Strategic Communications. The conference came at an interesting juncture, as Business Intelligence is in the early stages of figuring out how best to take advantage of and incorporate social data into its Big Data solution offerings.

With such an intimate conference, we had the opportunity to have in-depth conversations about what is happening in Business Intelligence. Here are some of the themes that came out of the conference:

Big Data Analytics
Big Data is a buzzword that gets used frequently, but as we talked about it, it became clear that no one knows exactly how to define what Big Data is. What was generally agreed upon was that perhaps Big Data Analytics is what should ultimately be talked about: how you analyze the data sets you do possess. It is more important to understand how people are effectively using Big Data than to bandy the term around without understanding its meaning.

Business Intelligence Incorporating Social Data
Social data has the opportunity to play an important role in Business Intelligence, but social media data is still siloed from other data sources. Social media analytics needs to be incorporated into the rest of Business Intelligence, but the enterprise is still struggling with this. Business Intelligence is trying to incorporate social media while also fighting a perception that social media is overhyped. Some believe that companies will only incorporate machine data (e.g. satellite imagery), but Business Intelligence is best when it incorporates multiple sources of data. It’s not a winner-take-all approach; more value comes from a hybrid of multiple sources.

Data Scientists in the Enterprise
The data scientist role within the enterprise is evolving from a single data scientist to a team of data scientists. Like social data, data scientists can be siloed, but we’re starting to see data scientists working with multiple departments to share insights.

Collaboration on the Go
One discussion centered around the mobility of workforces and how this plays into Business Intelligence. What many people forget when discussing mobility is that a laptop is a mobile device. With Business Intelligence, there is the opportunity to make information available on the go, but the downside is the many interfaces of mobile to design for — multiple versions of phones and laptops. What is needed is a simple interface to provide insights for a workforce that is increasingly mobile.

At the end of the conference, we gave our predictions for 2012 and TechTarget covered our Business Intelligence predictions.

I really liked this quote:

“The winner will be social data,” Shawn Rogers of EMA said. “It’s going to become a first-class data citizen for most enterprises. We’ll see stories that go well beyond the silos they’re stuck in now with customer care, brand analysis and marketing.”

Ultimately, it was great to have in-depth discussions about Business Intelligence with thought leaders from some of the leading Business Intelligence organizations and some of the most respected BI analysts, in a beautiful location with a truly unique agenda that made for an incredible experience.

The Social Cocktail, Part 1: Mixology

Gnip’s Chris Moody has been talking about the “Social Cocktail” recently, both at Strata and Big Boulder. At Gnip, we talk about the social cocktail a lot, mainly because people like cocktails. But it is also an apt metaphor for thinking about which social data is useful to our customers. Which audiences and modes of conversation are needed to build out an understanding of your market, your customers, perceptions of your product and the evolution of your message?

The fundamental question it answers is: Why analyze social data from more than one publisher?

Each social media publisher brings distinct capabilities and audiences, and encourages unique ways for users to interact and express themselves. The overlap in audience between some publishers is low, so adding publishers helps broaden topic coverage and audience perspective. Microblogs (e.g. Twitter) are fast and concise, making it easier to tease out breaking stories and emerging conversations. Blog comments indicate engagement and controversy, and therefore point back to interesting blog posts, where the in-depth analysis is found. Votes and likes provide additional signals of reader engagement–indications of the quality and the pitch of conversation.

To get the right mix, it is essential to understand some of the properties of each publisher’s firehose.  In this post, we’ll look at high-level attributes of the social media publishers.  In the next post, we will dive into a brief description of the social media responses to expected and unexpected events. Finally, in the third post, we’ll end with an example of the social cocktail in examining a real-world event—the JPMorgan-Chase $2+ billion loss announcement in May 2012.

One revealing way to compare publishers is to understand their relative speed and content richness.  In this case, fast content means that a statistically relevant sample of activity arrives shortly after the event or topic happens in the real world.  “Shortly” can mean tweets follow the event by less than a minute for some topics on Twitter (e.g. earthquakes).  In contrast to the speediest media responses, posts about the 2008 banking crisis in major US banks are probably still being written in 2012 as we continue to examine and discuss bank regulation.

Another telling dimension for comparison is content richness. While Tweets are very fast, they are also concise. To be a rapid responder to an earthquake or other immediate event, you only have time for the barest facts: 30 seconds to respond with 140 characters. “Just felt an earthquake in DC” would be a typical response. On the other hand, a publisher such as Tumblr encourages rich media sharing, with Spotify plugins, support for video and audio formats, very simple photo uploading and sharing capabilities, and a tradition of users appreciating and sharing creative and artistic photography.  Blogs on Automattic’s WordPress platform can range from tens of words to 10,000, giving ample opportunity for a writer to explore subtle ideas and complex analyses.

A few properties of Social Media Firehoses explained

Publisher       Speed            Content
Twitter         Fast             Concise
StockTwits      Fast             Concise, focused
Blog comments   Fast and slow    Concise
Tumblr          Medium           Very rich
Blogs           Medium           Rich


Table 1. Comparison of publisher ingredients.

Both Speed and Content Size can be quantified. We often use measures such as the time for the story activity to peak or the ½-life of a story to characterize speed. See Social Media Pulse for discussion and analytical details.
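The two speed metrics named above are straightforward to compute from an hourly volume series. A hedged sketch, using synthetic counts: time-to-peak is the hour with maximum volume, and the half-life here is taken as the time after the peak for volume to first drop below half of the peak.

```python
def time_to_peak(volumes):
    """Index (hour) at which the story's activity peaks."""
    return volumes.index(max(volumes))

def half_life(volumes):
    """Hours after the peak until volume first drops below half the peak."""
    peak_hour = time_to_peak(volumes)
    half = max(volumes) / 2.0
    for hour in range(peak_hour, len(volumes)):
        if volumes[hour] < half:
            return hour - peak_hour
    return None  # the story has not yet decayed to half its peak

hourly = [40, 300, 900, 620, 500, 380, 210]  # synthetic hourly counts
time_to_peak(hourly)  # -> 2 (third hour)
half_life(hourly)     # -> 3 (volume falls below 450 at hour 5)
```

Other definitions of half-life are possible (e.g. time to accumulate half of all activity); see the Social Media Pulse discussion for the analytical details.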

Content richness can be simply characterized by the number of characters. While this is fairly indicative of the balance of rich vs. concise information in a stream for text content, it overlooks media such as audio, photos, interactive applications, video and music. Other measures of richness might include audience participation, user-network interactions, amount of back-and-forth in a conversation and many higher-level measurements from textual analysis.
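The character-count measure of richness is equally simple. The activities below are synthetic stand-ins for firehose records:

```python
def mean_length(activities):
    """Average number of characters across activity bodies: a crude richness proxy."""
    return sum(len(a) for a in activities) / len(activities)

tweets = ["Just felt an earthquake in DC", "JPM down after hours"]
blog_posts = ["A multi-thousand-word analysis of the trading loss... " * 40]

mean_length(tweets) < mean_length(blog_posts)  # concise vs. rich, in characters
```

As noted above, this proxy ignores audio, photos, video and interactivity, so it understates the richness of media-heavy publishers like Tumblr.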

Surprising and satisfying cocktails come from carefully mixing quality, and sometimes unexpected, ingredients. With practice and the right combination of publishers, you can mix a social cocktail to enrich your understanding of the conversations between customers, prospects, partners and pundits. You will want to experiment with the mix to match your business use case. Getting the mix just right is always rewarding.

Next week: The Social Media Cocktail, Part 2 – Expected vs. Unexpected Events

Get the Disqus Firehose With New Filtering Options

In February, we announced that the full Disqus firehose of public comments is available through Gnip. Our customers love the conversations in Disqus, but have asked for tools to filter the stream so they receive only the conversations they want. Today, we’re announcing our new Disqus PowerTrack offering. Similar to our Twitter PowerTrack product, Disqus PowerTrack offers powerful filtering so customers can filter the full Disqus firehose of public comments to extract the specific conversations they’re looking for. With over 500,000 comments created each day on Disqus, there are a huge range of conversations taking place and you don’t want to miss the ones about your brand or products.

With Disqus PowerTrack, you have a wide array of filtering options. You can filter for specific keywords. You can constrain that filter to specific websites. Or you can look for just the mentions that have links. So, if you’re looking for brand mentions of Apple, you can track conversations about the iPhone or brand mentions in general. You can also monitor for comments mentioning the iPhone that have links in them so you can understand what online stores are being promoted along with your products. See the full list of Disqus PowerTrack Operators in our documentation.
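A minimal client-side sketch of the three filter dimensions just described (keyword, website, links). The real filtering is done server-side with Disqus PowerTrack operators, and the comment fields here are illustrative assumptions:

```python
# Illustrative only: field names and the link check are assumptions,
# not the actual Disqus activity schema or PowerTrack operator syntax.

def match_comment(comment, keyword=None, site=None, require_link=False):
    """Return True if a comment dict passes every supplied filter."""
    if keyword and keyword.lower() not in comment["text"].lower():
        return False
    if site and comment.get("site") != site:
        return False
    if require_link and "http" not in comment["text"]:
        return False
    return True

c = {"text": "The new iPhone is on sale at http://store.example.com",
     "site": "techblog.example.com"}
match_comment(c, keyword="iPhone", require_link=True)  # -> True
```

The iPhone-with-links case from the paragraph above maps directly to `keyword="iPhone", require_link=True`.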

To see the power of the full Disqus firehose, check out this graph showing all mentions of Apple on Disqus. On a normal weekday, there are almost 10,000 comments about Apple. For big events, like WWDC, you see a spike to almost 40,000 comments per day. That’s a lot of conversations.


We’re big proponents of the conversations that happen in comments, and we’re committed to making it easier for companies to understand and be able to participate. Our new Disqus PowerTrack makes it easier than ever to understand the types of conversations happening in comments.

If you have any questions about the new Disqus capabilities, please contact your sales rep or our sales team.

Taming The Social Media Firehose, Part III – Tumblr

In Part I, I discussed high-level attributes of the social media firehose. In Part II, I examined a single event by looking at activities from four firehoses for the earthquake in Mexico earlier this year. In Part III, I wrap up this series with some guidelines for using unique rich content from social media firehoses that may be less familiar. To keep it real, I use examples from the Tumblr firehose.

Since the Twitter APIs and firehoses have been available for years, you may be very familiar with many analysis strategies you can apply to the Twitter data and metadata.  I illustrated a couple of very simple ideas in the last post. With Twitter data and metadata, the opportunities to understand tweets in the context of time, timezone, geolocation, language, social graph, etc. are as big as your imagination.

Due to the popularity of blogging for both personal and corporate communication, many of you will also understand some of the opportunities of the WordPress firehose.  With the addition of firehoses of comments, you have the capabilities of connecting threads of conversation to realize another possible analysis strategy. “Likes” and Disqus “votes” provide additional hints about user reaction and engagement–yet another way to filter and understand posts and comments.

Why go to the effort and expense of adding a new firehose?
There are three benefits from investing your efforts in learning to integrate these differences. Users of social networks choose to participate in Twitter, Tumblr or other social networks based on their affinities and preferences. Integrating additional active social media sources gives:

  1. Richer audience demographics
  2. More diverse perspective and preference
  3. Broader topic coverage.

Here’s an example.


The newest firehose from Gnip became available earlier in 2012. Tumblr is exciting because its unique, rich content provides a complementary perspective and a distinct form of conversation. Tumblr is important because of its unique audience and the modes of interaction common within this audience and platform.

With a firehose of over 50 million new posts a day from web users, Tumblr is a source with strong social sharing features and an active network of users where discussions can reach a large audience quickly.  Some Tumblr posts have been reblogged more than a million times and stories regularly travel to thousands of readers in a couple of days.

Before jumping into consuming the Tumblr firehose in the next section, it may help to understand some of what makes it different and valuable. These questions provide a useful framework when approaching any unfamiliar stream of social data.

What is unique about the Tumblr firehose?

1. Demographics. The user community on Tumblr skews young, over-indexing strongly in the 18-24 demographic of trend setters and cool hunters.

2. Communication and Activity Style. As you are thinking about filtering and mining the Tumblr firehose, realize conversations on Tumblr are often quite different from what you’ll find on other social platforms. As you start to interpret the data from Tumblr, it’s important to note that Tumblr has an inside language. For example, many sites contain f**kyeah___ in their name and URL. When you start to home in on your topic, you will need to understand the inside language used for both positive and negative responses. Terms you consider negative on one platform may have positive connotations on another. Be sure to review a subset of your data to get a feel for the nuances before drawing larger conclusions.

3. Rich Content. Content is rich in that there are many types of media and a wide range of depth. Users post audio, video, animated gifs and simple photos as well as short and long text posts.

You’ll also see 7 different Post Types on Tumblr. These represent the different types of content that users can post on Tumblr. They break out as follows:

Text, Photo, Quote, Link, Chat, Audio, Video

Table 1 – Tumblr post type breakdown.

To answer questions like these, we often rely on filters based on text, since these are the simplest filters to think about and create.  The textual data and metadata available in the Tumblr firehose include titles, tags and image captions in addition to the text of the body of the post. Including all of this content allows us to filter approximately 20% of the Tumblr firehose based on text. Additional strategies include looking at reblog and “like” activity, as well as reblog and “like” relationships between users.  More sophisticated strategies, such as applying character or object recognition to images, open up the tens of millions of daily activities for mining and exploration.
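Pulling all of those text-bearing fields into one searchable string is the key trick for text filtering on a media-heavy platform. A sketch under an assumed activity layout (the real firehose schema may differ):

```python
# Field names (title, body, caption, tags) are assumptions for illustration.

def searchable_text(activity):
    """Concatenate every text-bearing field of a Tumblr activity."""
    parts = [activity.get("title", ""), activity.get("body", ""),
             activity.get("caption", "")]
    parts.extend(activity.get("tags", []))
    return " ".join(parts).lower()

def matches_any(activity, keywords):
    """True if any keyword appears in any textual field."""
    text = searchable_text(activity)
    return any(k.lower() in text for k in keywords)

photo_post = {"title": "", "body": "", "caption": "New sneakers!",
              "tags": ["fashion", "streetwear"]}
matches_any(photo_post, ["sneakers"])  # -> True, matched via the caption
```

Note that a photo post with an empty body would be invisible to a body-only filter; including captions and tags is what gets text filtering to roughly the 20% coverage mentioned above.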

4. Rich Topics. In addition to diverse content forms, Tumblr has attracted many active conversations on a wide variety of topics. This content is often very complementary to other social media platforms due to differences in audience, tone, volume or perspective. With more than 20 billion total posts to date, there is content about almost anything you can imagine.  Some examples include:

  • Brands. Any brand you can think of is being discussed right now on Tumblr. Big brands with an official presence on Tumblr include Coca-Cola, Nike, IBM, Target, Urban Outfitters, Puma, Huggies, Lufthansa, Mac Cosmetics and many more. NPR and the President of the United States have their own presences on Tumblr.
  • Fashion and Cosmetics. Because of the visual nature of the medium and cool-hunting audience it attracts, there is a large volume of content related to cosmetics and fashion.
  • Music and Movies. With Spotify music plugins and easy upload and sharing of visual content, pop culture plays a big role in the interests and attention of many of the active users on Tumblr. Information, analysis and fan content is rich, creative and travels through the community rapidly.

5. Reblogs and Likes. Tumblr is all about engagement! The primary interaction activities are Reblogs and Likes. Some entries are reblogged thousands of times within a day or two. When a user reblogs a post, the original post is placed into their blog along with any changes they make. A list of all the notes (likes and reblogs) associated with a post is appended to that post wherever it appears on Tumblr. Each post activity record in the firehose can contain reblog info: a count, a link to the blog this entry was reblogged from, and a link to the root entry. To build the note list a user would see at the bottom of a liked or reblogged entry, you have to trace each entry in the stream (i.e. keep a history, or know in advance what you want to watch) or scrape the notes section of a page.

Filtering and Mining The Tumblr Firehose

Volume. There are a number of metrics we can use to describe the volume of the Tumblr firehose. The three gating resources we run up against most often are network bandwidth, network latency and storage (disk space). Tumblr activities are delivered compressed, so for estimating purposes, the bandwidth and disk space requirements can be based on the same numbers. The Tumblr firehose averages about 900 MB/hour of compressed volume during peak hours, falling to a minimum of about 300 MB/hour during slower periods of the day.

To store the firehose on disk, plan on ~16 GB/day based on current volumes. For bandwidth, you want headroom of 2-5x the average peak hourly bandwidth (4 to 10 Mbps), depending on your tolerance for disconnects during peak events.
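The sizing arithmetic above works out as follows, as a back-of-envelope sketch using the figures quoted in the text:

```python
# Back-of-envelope sizing for the compressed Tumblr firehose,
# using the figures quoted above (~900 MB/hour compressed at peak).
PEAK_MB_PER_HOUR = 900

# MB/hour -> Mb/s: multiply by 8 bits/byte, divide by 3600 s/hour.
peak_mbps = PEAK_MB_PER_HOUR * 8 / 3600          # ~2 Mbps sustained at peak
headroom_mbps = (2 * peak_mbps, 5 * peak_mbps)   # 2-5x headroom: ~4 to ~10 Mbps

# Daily and yearly storage if you keep the compressed stream on disk,
# assuming the day averages out to roughly 16 GB as stated above.
daily_gb = 16
yearly_tb = daily_gb * 365 / 1000                # ~5.8 TB/year

print(peak_mbps, headroom_mbps, round(yearly_tb, 2))
```

The same numbers serve for both bandwidth and disk planning because the stream is delivered compressed.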

The other consideration is end-to-end network latency, as discussed in Consuming the Firehose, Part II. Very simplistically, latency can limit the throughput of your network (regardless of bandwidth) by using up too much time negotiating connections and acknowledging packets. (For a detailed calculation, see, for example, The TCP Window, Latency, and the Bandwidth Delay Product.) The theoretical latency limit for 20 Mbps throughput is 50-70 ms (depending on TCP window size), but in practice you will want to observe consistently less than this (< 50 ms) to achieve reliable network performance.
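The bandwidth-delay relationship can be sketched numerically. The 128 KB window used here is an illustrative assumption; actual window sizes depend on OS settings and TCP window scaling:

```python
# Rough TCP throughput ceiling from the bandwidth-delay product:
# max_throughput = window_size / round_trip_time.
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Theoretical per-connection throughput ceiling in Mbps."""
    return window_bytes * 8 / rtt_seconds / 1e6

window = 128 * 1024  # 128 KB window (illustrative assumption)
for rtt_ms in (25, 50, 70):
    mbps = max_throughput_mbps(window, rtt_ms / 1000)
    print(f"RTT {rtt_ms} ms -> ceiling ~{mbps:.0f} Mbps")
```

With a 128 KB window, a 50 ms round trip caps a single connection at roughly 21 Mbps, which is why latencies much above that range threaten a 20 Mbps stream regardless of raw bandwidth.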

Metadata. A firehose is a time-ordered, near real-time stream of user activities. While this structure is clearly powerful for identifying emerging trends around brands or news stories, the time-ordered stream is not the optimal structure for looking at other things, like the structure of social networks to discover, e.g., influencers. Fortunately, the Tumblr firehose activities contain a lot of helpful metadata about place, time, and social network to get answers to these questions.
Each activity has a post objectType as discussed above, as well as links to resources referred to in the post such as image files, video files and audio files. Each activity has a source link that takes you back to the original post on Tumblr. If the post is a reblog, it will also have records like the JSON example below, describing the number of reblogs, the root blog and the blog this post was reblogged from.

"tumblrRebloggedFrom" :
         "author" :
               "displayName" : "A Glimpse",
               "link" : ""
         "link" : ""
"tumblrRebloggedRoot" :
         "author" :
                "displayName" : "Armed With A Mind",
                "link" : ""
         "link" : ""

To assemble the entire reblog chain, you must connect the reblog activities within the firehose using this metadata.
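A minimal sketch of that chain assembly, assuming each activity exposes its own post link plus the tumblrRebloggedFrom metadata shown above (the field access is illustrative, and you must retain a history of observed posts for the chain to resolve):

```python
# Sketch: assemble reblog chains from firehose activities you have
# retained. Field names mirror the JSON example above; exact paths
# in the real activity format may differ.
def index_posts(activities):
    """Map each post's link to the link of the post it reblogged."""
    parent_of = {}
    for act in activities:
        link = act.get("link")
        reblogged = act.get("tumblrRebloggedFrom", {})
        if link and reblogged.get("link"):
            parent_of[link] = reblogged["link"]
    return parent_of

def reblog_chain(leaf_link, parent_of):
    """Walk from a reblog back toward the root post."""
    chain = [leaf_link]
    # Stop at the root, or if the chain would revisit a post (cycle guard).
    while chain[-1] in parent_of and parent_of[chain[-1]] not in chain:
        chain.append(parent_of[chain[-1]])
    return chain
```

Any post whose parent never appeared in your retained history simply terminates the chain early, which is why watching the stream continuously (or knowing in advance what to watch) matters.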

Additional engagement metadata is available in the form of likes (hearts in the Tumblr interface) in a separate Tumblr engagement firehose.

Tumblr Likes Metadata

Non-Text Based Filters. Not all non-text post types have enough textual context (captions, titles and tags) to identify a topic or analyze sentiment through simple text filtering. You will want to develop strategies for dealing with some ambiguity around the meaning of posts with very little text content; this ambiguity is difficult to reduce unless you have audio or image analysis capabilities (e.g. OCR or audio transcription). Approximately 20% of all posts (about 15M activities per day) can be filtered effectively on text, URL text, tags and captions.

Memes. Another consideration, related to how language is used on Tumblr, is that official brand sites as well as many bloggers tend to promote a style or overall image rather than a catalog of particular products. As a result, you will match the brand name against a lot of cool content, but may see specific product names and descriptions much less frequently. There are many memes within Tumblr that will lead you to influencers and sentiment, but searching on “catalog” terms won’t be the most effective path.

I hope I have uncovered some of the mysteries of successfully consuming social media firehoses. I have only suggested a handful of questions one might try to answer with social media data. The community of professionals providing text analysis, image analysis, machine learning for prediction, classification and recommendation, and many other wonders is continuing to invent and refine ways to model and predict real-world behavior based on billions of social media interactions. The start of this process is always a great question. Best of luck (and the benefits of all of Gnip’s experience and technology) to you as you jump into consuming the social media firehose.

Full Series:

Taming The Social Media Firehose, Part I – High-level attributes of a firehose

Taming The Social Media Firehose, Part II – Looking at a single event through four firehoses

Taming The Social Media Firehose, Part III – Tumblr


Rich Comment Data from Disqus Now Available Through Gnip

Imagine going to a dinner party and listening to the first thing each person said. You’d learn a few things, but you’d miss out on the meat of the conversation that happens in the give and take of the dialogue.

In the world of online public social conversation, blog posts are the monologue and comments provide the dialogue. Each is valuable on its own, but to see the complete picture, you need both. Conversations happen in comments, and it has been a huge struggle for brands to keep up with comments and fill in their understanding of this key piece of the conversation.

I’m excited to announce that we’re making it easier to access these public conversations with the addition of the full Disqus firehose to our publisher portfolio. As the largest third-party commenting platform in the world with 70 million commenter profiles, the Disqus firehose provides coverage of more than 500,000 comments every day, spanning almost every topic imaginable and reaching over 700 million readers each month.

Comments last forever. They appear in search results and remain part of the discussion long after the day they were written. With their staying power and depth of discussion, the commenting ecosystem provides an important — and different — social signal. Disqus deepens this engagement by allowing users to react to others’ comments with up or down “votes.” The 2 million “votes” cast on Disqus each day provide insight into which comments are generating the most reaction.

Our Disqus API partnership provides authorized access, for the first time ever, to full firehoses of discussion content and interaction across the Disqus network. To the extent that any of this data has been available before, it has come from technologies like content scraping and crawling that pulled pieces of the discussion but did not guarantee full coverage in real time on a publisher-safe, consistent and reliable basis. Because this new service is provided via a direct partnership with Disqus, Gnip’s full firehose gives you low-latency streams with full coverage, backed by the publisher to ensure the availability of the data over the long term.

We’re thrilled to have data from Disqus available on our platform and can’t wait to see the amazing ways that our customers are able to apply it to their businesses. Email us at to learn more about Disqus and set up a trial so you can see the data for yourself.