Blog and Comment Data Answers the Why

Shoppers Tweet about what they bought, but they turn to blogs and comments to share why they bought.

This is only one example of what makes the long-form data from blog and commenting platforms valuable to any company looking to better understand why their customers and prospects make the decisions they do. Simply put, blogs and comments are opinion rich. And when it comes to product development, sales, brand management, and more, these opinions provide a unique and critical lens into the nuanced thinking behind customer decisions.

In the past, some social media monitoring providers have used scraping solutions to include blog and comment data in their offerings. While this can get you the data, scraping has several fundamental challenges. The data can be days or weeks old. Scraping solutions often ignore terms of service and user intent, meaning the data can disappear at a moment’s notice when the scraper gets blocked. The data can come in a range of formats that make it very difficult to parse and analyze. And with scraped data, you only get results from the blogs and comments that you know you should be looking at, missing important discussions that surface in new and unexpected places.

It’s because of these challenges that we’re introducing Gnip for Blogs, combining content from four of the most popular long-form blog and comment sources. This first-of-its-kind package of data from Disqus, Tumblr, WordPress and IntenseDebate gives realtime, normalized, terms of service-compliant access to the rich conversations happening across a huge swath of the Internet. With Gnip for Blogs, customers are able to easily and confidently build their business applications on multiple sources of long-form data knowing it won’t suddenly disappear tomorrow.

Each of these sources has a story to tell on its own, but by looking at them all together, brands are able to draw insights from an enormous range of discussion. This includes the mass-market reach of WordPress, which powers 19% of the web; the high volume of brand mentions on Tumblr; the highly engaged audience on IntenseDebate; and the enormous reach and quality of the conversations on Disqus.

One of our customers, Networked Insights, recently used realtime WordPress data to identify early technology trends based on influencer blog conversations. They then used this content to refine and focus a targeted online promotion for one of their customers. The end result? A 30% lift in ROI for their online ad spend. And this is only the beginning.

For more information, check out the Gnip for Blogs on our website or contact us at

Chumming for Insights: A Social Take on Sharknado

For a brief moment, the term Sharknado took the social universe by storm. If you haven’t heard about the Syfy channel original TV movie, let us inform you. The biggest actor attached to the film is Tara Reid. With an estimated budget of $1 million, the marketing push behind this made-for-TV movie must have been incredibly small. Yet at one point during the movie’s first airing, the term “Sharknado” hit 5,000 Tweets per minute and became a trending topic. According to Nielsen, approximately 12.3% of all Tweets related to TV were about Sharknado on the day it aired, twice as many Tweets as the next most Tweeted TV event, the return of Derek Jeter in the Yankees vs Kansas City game.

Companies spend millions of dollars promoting hashtags in commercials, and yet this movie, with a budget smaller than what many companies spend on a single commercial, became an instant sensation. In the end, it is results that matter, and in this case the results are viewers. Sharknado drew an impressive 1.37 million viewers. To compare, NBC during primetime on the same day maxed out at 1.15 million viewers. So how does a small cable channel like Syfy get more viewers than the big boys like NBC?

While Twitter was the dominant focus of the conversation for Sharknado, we thought we would look at how that conversation translated onto other social channels. Was Sharknado spreading like wildfire on Tumblr the way it was on Twitter? Were people blogging about it and discussing it on WordPress and Disqus?

Let’s take a look at Sharknado Social Media:

The white line in each graph marks the first airing of Sharknado.

These graphs show that, outside of Twitter, conversation about Sharknado behaved mostly as expected, except on Tumblr. WordPress and Disqus saw their peak of activity after the movie aired. People were likely writing long-form reviews on WordPress blogs and then using Disqus comments to further the discussion, which is typical for these sources of data.

But the really interesting graph is the Tumblr graph:

There are a couple of interesting things to note about how Sharknado conversation happened on Tumblr:

  • The initial spike on July 7, which is due to a teaser animated GIF that got picked up and reblogged 3,000 times an hour.
  • The spike in activity on July 10, which marks the release of the official trailer for Sharknado on YouTube and its spread on Tumblr. Tumblr users picked this up and shared it at an impressive 5,500 posts per hour at its peak.
  • The consistent stream of posts related to Sharknado since the airing. While all other networks, including Twitter, have seen a significant drop-off, Tumblr users are sharing more Sharknado-related content after the initial airing than before it.

What this means is that social conversation online doesn’t just happen where you intend for it to, and it doesn’t just happen where you are looking. Analyzing the conversation across social networks gives you a full picture of the social conversation and greater visibility into the results of your marketing push. Rumor has it Sharknado has a sequel in the works. Our bet is that you’ll find the first glimpses of its virality on Tumblr and that you’ll see it last there until the first glimpses of Sharknado 3.

Unleashing the Creative Expression of 100 Million People

An interview with Derek Gottfrid and Danielle Strle of Tumblr about the unique experience behind Tumblr and its 100 million users.

Derek Gottfrid and Danielle Strle at Big Boulder

What is Tumblr?

Tumblr is not a social network.

Derek Gottfrid explains that Tumblr is about the content, not relationships and relationship-building. Hence Tumblr is a media network — with a focus on content propelled by user passion for that media.

Broken down, the Tumblr platform fills two roles around media: letting users share and letting them consume. First, a Tumblr is a channel for a user to create, post and share to the world in an unlimited way. Second, the dashboard is an incredible media consumption tool.

What Makes Tumblr Special?

It is the range of formats for sharing different types of content (photo, video, text, links, quotes, chats, audio), all available in one place. With 7 post types, it’s easier to take part in the community because sharing doesn’t mean having to fill a big white text box. Users can choose to share photos, videos, quotes or even reblog content. This removes the intimidation many users find in long-form text blogging.

When it comes to the incredibly viral nature of posts on Tumblr, there is no question that the platform’s most distinctive and valuable asset is the reblog.

Derek explains the reblog as a unique, nuanced feature of the platform that encourages users to adopt content as their own. He adds quickly that a reblog is much like clothing. You could make your own clothes, or you could buy them and wear them. Either way, you make them your own when you put them on.

To fully understand the impact of the reblog feature, consider this: 85 to 90% of the posts on Tumblr each day are reblogs. The Tumblr team has seen single posts reblogged 10,000, even 100,000, times in one day.

How Are Brands Utilizing Tumblr?

As advertising and brands on Tumblr celebrate their one-year anniversary, Derek is quick to note that the development of Tumblr was not motivated by creating a space for brands. Instead, the Tumblr team has taken time and care to figure out the best way brands can contribute to users’ content streams. Successful brands are utilizing the tools available on Tumblr to tell robust stories. The result is brand content that is a thoughtful, mindful addition to users’ streams and that users can adopt as their own content.

What Are The Untapped Opportunities Available With Tumblr Data?

Tumblr provides a huge volume of data, yet much of it remains unexplored and poorly understood. As Derek explains, the next-level deep dive into analytics on Tumblr is a huge opportunity: exploring and understanding the power of reblogs, specifically how they travel through the userbase. Another untapped opportunity is tools for understanding the massive volume of data going through the system.

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, at Big Boulder on Storify and on Gnip’s Facebook page.


Building the Location Layer of the Internet With Mike Harkey of Foursquare

Mike Harkey, the Head of Platform Business Development at Foursquare, talks about how Foursquare is building the location layer of the Internet. 

Mike Harkey of Foursquare

To kick things off at Big Boulder, Gnip’s VP of Product, Rob Johnson, interviewed Mike Harkey. As the Head of Platform Business Development at Foursquare, Mike talked about the evolution of Foursquare during the past four years. First introduced as the “check-in app,” Foursquare is now becoming known for its location recommendation services.

Merchant Applications

As Mike stated, “the company is growing dramatically.” Foursquare received $41 million in funding in April 2013, and that is certainly shaping its growth. On the consumer side, check-ins and active uniques have grown 10% every month. However, Foursquare is really focused on providing real-world applications for merchants, whose use has quadrupled in the past 6 months.

Foursquare has always offered a free solution for merchants to claim their business and run offers and specials within the app. Users can also follow merchants to keep an eye on these offers. However, at the end of the day this won’t matter if a merchant can’t see what needle Foursquare is moving for them. Enter merchant dashboards: Through the merchant API, merchants can track the value and success of their media campaigns and how Foursquare is influencing them.

The Location Layer

Just as Facebook is the social layer of the internet, Foursquare has built the location layer. With 4 billion check-ins and 50 million places worldwide, it’s not hard to see why this data is so valuable and practical. And there’s something that’s fundamentally unique about Foursquare, in their ability to see real-time actions.

Foursquare is the first to find out when a venue opens and closes. This signal is not only beneficial for the application, but also for 3rd party platforms that rely on it. Maintaining the quality of data when it’s user-based is challenging, but Foursquare has learned which levers to pull. A community of super users has the rights to edit and update data, helping to “vet and validate” its quality. This further fuels the consumer application of Foursquare.

Using the Data

Foursquare check-ins show the pulse of New York City and Tokyo from Foursquare on Vimeo.

Foursquare holds itself to a higher standard with its data. The company believes this data is not just theoretical, but has practical, real-world applications. For merchants, this means validating their presence on the app: according to Mike, 20% of users check in to a place discovered by the recommendation service within 36 hours of discovery.

Since the founding of the company, people have wanted to access the data Foursquare provides. The API has always been open, but Foursquare has wanted to be careful about allowing access to the data. Gnip’s partnership with Foursquare to allow access to its firehose has tremendous possibilities for businesses. Examples include seeing how individual users act during specific events. During Hurricane Sandy, Foursquare released visualizations of how people behaved during and after a crisis.

Globally, using this data for good has been a priority for Foursquare. In Turkey, the company saw activity it didn’t expect during the recent riots. It had representatives on the ground during the riots and could see users posting photos and information, as this was the only viable mechanism to expose this information.

The Future of Foursquare

Foursquare believes the applications for this data are virtually limitless, whether it’s making the data available for research or business applications. Foursquare is excited to see what people will build with the anonymized data available through its partnership with Gnip. Foursquare has a number of products that will be introduced this year. Soon, small businesses will be able to advertise through Foursquare and make the most out of this service. They will have the ability to turn offers on and off and reach long-term consumers.

Oreo, Tumblr and a Network's Power to Amplify

Really, it was bigger than Oreo.

When Nabisco posted an image supporting gay pride, Tumblr blew it up. Users took the statement of a single snack manufacturer and made a cause that touched many companies.

In this, the second part of a trilogy, major brands find themselves roped to a conversation about love in America. Part one talked about how Oreo cannonballed into the social web by posting an image of a rainbow Oreo in support of gay pride. Part three will use the episode to highlight conversation dynamics unique to the Tumblr network.

It began with maskedman.

“Gay oreo? Oreo suppoert Gays/??” the user wrote, “Never evating cookie again. … Disgustedng. THis is AMERICA, not HOMERICA.”

The post, which would ultimately accumulate some 1,500 notes, landed a day after Oreo’s image and touched off a wave of support for the company.

One user, palahniukandchocolate, made a list.

“Dear people boycotting Oreos for supporting gay rights: The following companies also support gay rights,” she wrote, adding the names of 37 companies, among them Allstate, Gap, Nike and Starbucks.

A day later, monkaroo retooled the tactic:

“Yes, please boycott Oreo for their support of gay rights,” monkaroo wrote before invoking two dozen companies aligned with Oreo, “We’ll all appreciate you going on a diet … [D]o us all a favor, don’t take it all out on a festive cookie… Just stay home and boycott everything.”

The note from palahniukandchocolate ran close to 900 characters. monkaroo’s topped out over 1,800. Together, they used the freedom of Tumblr’s platform to find a community in an ideology. They grabbed allies — and by doing so, they blew up the question.

The notes caught.

By the evening of the 26th, palahniukandchocolate’s message was pulling down hundreds of reblogs per hour. Indeed, that night, the note would lay claim to 75 percent of Tumblr’s Oreo conversation.

Graph Showing Oreo Mentions Spike on Tumblr

Figure 1 presents hourly Tumblr activity about Oreos (blue) and hourly reblogs of user palahniukandchocolate (orange).

The action spread elsewhere. Starbucks had seen a median 11 tumbles per hour in the two weeks leading up to the 24th. Pepsi had seen 14. On the night of the 26th, palahniukandchocolate lifted both brands, driving each to a network peak of more than 400 posts per hour.

Microsoft also bounced, rising to the 400 peak from 15 posts per hour and holding triple digits as late as the afternoon of the 29th. Costco, with barely a pulse on the network the week before, found itself in 7,100 tumbles the day after the cookie.

Figure 2 presents hourly Tumblr activity around Costco, McDonald’s, Microsoft, Pepsi, Sears and Starbucks. Association with Oreo’s pride cookie drove heightened activity for each brand.

palahniukandchocolate named 37 brands in her defense of Oreo. For most, including Coca-Cola, Levi’s, Nike and Walgreens, that single association dominated the brand’s Tumblr presence in the second half of June.

Tumblr’s platform made that possible. Figure 3 shows four brands that bounced on Tumblr thanks to the Oreo affair. None saw pickup on Twitter in the wake of the image — the platform has no room for periphery.

Graph Showing Cookie Brand Mentions on Tumblr
Figure 3 presents hourly Twitter volumes for four brands that popped on Tumblr in the wake of Oreo’s image. Microsoft’s acquisition of Yammer drove the brand’s heightened activity pictured here.

In part, it’s not surprising that the Oreo story could cast so long a shadow over so many brands. Tumblr’s largely an extraprofessional platform; presence on the network requires personal connections between users and brands. Figure 4 presents average daily Tumblr volumes for corporate titans. The flows are thin, technology superbrands notwithstanding.
Graph of Brand Activity on Tumblr

Figure 4 presents average daily Tumblr activity around a subset of the 50 largest corporations by market capitalization (ranked Aug. 18, 2012).

Brands with little network presence risk leaving definition in the hands of others. And Tumblr encourages association: The platform provides flexibility in media and speeds the replication of conversation.

The series’ last installment dives into conversation dynamics on the network. If you like trace diagrams, this next one’s for you.

Twist, Lick, Dunk: A Tumblr Story

Oreo Showing Pride

Tumblr won’t soon forget the day America’s favorite cookie came out.

On June 25th, to promote the year of Oreo’s 100th birthday, Nabisco lent its cookie some currency: The company tweeted the image of a six-layered cookie, with crèmes the color of the rainbow, above a simple caption – “Pride.”

“We feel the Oreo ad is a fun reflection of our values,” a Kraft spokesman later told reporters. The cookie, the company said, illustrated ‘in a fun and playful way’ an issue that was making history.

The image lit up the social web. This post, and two that follow, explore conversations on Tumblr through the lens of Oreo. Part Two looks at how the episode touched other brands on the network. Part Three dives into the dynamics of Tumblr conversations and how they diverge from other platforms.

The image itself touched a vein. Opponents of marriage equality took to Oreo’s accounts on Facebook and Twitter to slam Nabisco and threaten boycott.

“[U]nliking oreo, cleaning out cupboard, changing buying habits, no more Oreo’s, and it’s parent company,” one user wrote.

“I will never eat an oreo again! ew!” said another.

Those comments, and others, drew counter-protests, among them:

“[W]onderful job Oreo on supporting equal rights, just for that, now I’ll buy a pack today.”

“I believe I’m going to go buy every package of Oreos I see when I go grocery shopping. Kudos!!”

Within hours, Oreo found itself the subject of some 7,500 tweets. The conversation ramped to midnight EST, when the brand was pulling back some 2,000 tweets per hour.
Graph Demonstrating Twitter Volume Around Pride Oreo
Figure 1 shows hourly Twitter volumes around Oreo between June 18 and July 2.

Tumblr followed on the 26th. In three hours that night, the company drew more than 300 textual posts on the network, double what the brand had done each day the week before.

The talk stayed political: “Way to go Kraft!,” one post read, “However it is also eye-opening to see how many people are proud to show their hate, or belief that all Americans do not deserve equal rights.”

Graph Showing Tumblr Volume Around the Pride Oreo
Figure 2 shows hourly Tumblr volumes around Oreo between June 18 and July 2.

By then, the story had spilled. ABC, NBC, Reuters and the Washington Post amplified news of the flap. A conservative family group urged supporters to look elsewhere for cookies. Meanwhile, the image was slowly amassing more than 60,000 Facebook comments and close to 300,000 likes. Two social analytics companies would later call that conversation overwhelmingly positive – for Oreo.

For days on Tumblr, the story echoed. Median hourly Twitter volumes had returned to normal by the fracas’ fourth day. But on Tumblr, a full week after Oreo’s image went live, chatter remained triple the cookie’s prior volume.

In that way, the image marked a breakthrough for Oreo on Tumblr. At peak, the pride cookie generated 2.6 times Oreo’s median Twitter volume from the week prior. For Tumblr, that figure was 19.8.
Graph Demonstrating Increase in Tumblr Traffic After the Pride Oreo
Figure 3 shows the ratio between hourly platform volume around Oreo and typical hourly platform volumes between June 18 and July 2.
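
The ratio plotted in Figure 3 can be sketched in a few lines. This is only an illustration of the arithmetic, assuming "typical" volume means the median hourly count over a baseline window; the function names and the sample counts below are hypothetical and are not the data behind the figure.

```python
from statistics import median


def hourly_ratios(baseline_hourly, event_hourly):
    """Return each event-hour count divided by the baseline median."""
    typical = median(baseline_hourly)
    if typical == 0:
        raise ValueError("baseline median is zero; ratio is undefined")
    return [count / typical for count in event_hourly]


def peak_lift(baseline_hourly, event_hourly):
    """Return the largest hourly ratio, i.e. the peak relative to typical volume."""
    return max(hourly_ratios(baseline_hourly, event_hourly))


# Made-up hourly post counts, for illustration only.
baseline = [4, 6, 5, 7, 5]   # a quiet week: median is 5 posts per hour
event = [5, 20, 50, 30]      # hours around a hypothetical spike

print(peak_lift(baseline, event))
```

Normalizing by each platform's own median is what makes a small network comparable with a large one: the same absolute spike counts for far more where the baseline is thin.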

Oreo had long been a social brand. Before the pride cookie, it counted 26 million Facebook fans and tens of thousands of Twitter followers. On Tumblr, the cookie already outstripped its rivals. And in a move that may help the company retain that lead, Oreo can rely on the brand’s official Tumblr presence. Its first posted image? June 25 – the pride cookie.

Graph Showing Oreo Compared to Other Cookie Brands on Tumblr

Figure 4 shows Oreo’s Tumblr lead over major cookie brands in the United States between June 18 and July 2.

But Oreo’s Tumblr story rippled beyond the cookie alone. That broadening – a central quality of the Tumblr platform – has implications for brands linked by product, demographic or, in this case, ideology. Return for more in Part Two.

Taming The Social Media Firehose, Part III – Tumblr

In Part I, I discussed high-level attributes of the social media firehose. In Part II, I examined a single event by looking at activities from four firehoses for the earthquake in Mexico earlier this year. In Part III, I wrap up this series with some guidelines for using unique rich content from social media firehoses that may be less familiar. To keep it real, I use examples from the Tumblr firehose.

Since the Twitter APIs and firehoses have been available for years, you may be very familiar with many analysis strategies you can apply to the Twitter data and metadata.  I illustrated a couple of very simple ideas in the last post. With Twitter data and metadata, the opportunities to understand tweets in the context of time, timezone, geolocation, language, social graph, etc. are as big as your imagination.

Due to the popularity of blogging for both personal and corporate communication, many of you will also understand some of the opportunities of the WordPress firehose.  With the addition of firehoses of comments, you have the capabilities of connecting threads of conversation to realize another possible analysis strategy. “Likes” and Disqus “votes” provide additional hints about user reaction and engagement–yet another way to filter and understand posts and comments.

Why go to the effort and expense of adding a new firehose?
There are three benefits from investing your efforts in learning to integrate these differences. Users of social networks choose to participate in Twitter, Tumblr or other social networks based on their affinities and preferences. Integrating additional active social media sources gives:

  1. Richer audience demographics
  2. More diverse perspective and preference
  3. Broader topic coverage.

Here’s an example.


The newest firehose from Gnip became available earlier in 2012. Tumblr’s exciting because the unique, rich content from Tumblr provides a complementary perspective and a distinct form of conversation. Tumblr is important because of the unique audience and modes of interaction common within this audience and platform.

With a firehose of over 50 million new posts a day from web users, Tumblr is a source with strong social sharing features and an active network of users where discussions can reach a large audience quickly.  Some Tumblr posts have been reblogged more than a million times and stories regularly travel to thousands of readers in a couple of days.

Before jumping into consuming the Tumblr firehose in the next section, it may help to understand some of what makes it different and valuable. These questions provide a useful framework when approaching any unfamiliar stream of social data.

What is unique about the Tumblr firehose?

1. Demographics. The user community on Tumblr skews young, over-indexing strongly in the 18-24 demographic of trend setters and cool hunters.

2. Communication and Activity Style. As you think about filtering and mining the Tumblr firehose, realize that conversations on Tumblr are often quite different from what you’ll find on other social platforms. Tumblr has an inside language. For example, many sites contain f**kyeah___ in their name and URL. When you start to home in on your topic, you will need to understand the inside language used for both positive and negative responses. Terms you consider negative on one platform may have positive connotations on another. Be sure to review a subset of your data to get a feel for the nuances before drawing larger conclusions.

3. Rich Content. Content is rich in that there are many types of media and a wide range of depth. Users will post audio, video, animated gifs and simple photos, as well as short and long text posts.

You’ll also see 7 different Post Types on Tumblr. These represent the different types of content that users can post on Tumblr. They break out as follows:

Table of Post Types on Tumblr

Table 1 – Tumblr post type breakdown.

To answer the questions, we often rely on filters based on text since these are the simplest filters to think about and create.  The textual data and metadata available in the Tumblr firehose include titles, tags and image captions in addition to the text of the body of the post. Including all of this content allows us to filter approximately 20% of the Tumblr firehose based on text. Additional strategies include looking at reblog and “like” activity, as well as reblog and “like” relationships between users.  More sophisticated strategies such as applying character or object recognition to images open up the tens of millions of activities daily for mining and exploration.

4. Rich Topics. In addition to diverse content forms, Tumblr has attracted many active conversations on a wide variety of topics. This content is often very complementary to other social media platforms due to differences in audience, tone, volume or perspective. With more than 20 billion total posts to date, there is content about almost anything you can imagine. Some examples include:

  • Brands. Any brand you can think of is being discussed right now on Tumblr. Big brands with an official presence on Tumblr include Coca-Cola, Nike, IBM, Target, Urban Outfitters, Puma, Huggies, Lufthansa, Mac Cosmetics and many more. NPR and the President of the United States have their own presences on Tumblr.
  • Fashion and Cosmetics. Because of the visual nature of the medium and cool-hunting audience it attracts, there is a large volume of content related to cosmetics and fashion.
  • Music and Movies. With Spotify music plugins and easy upload and sharing of visual content, pop culture plays a big role in the interests and attention of many of the active users on Tumblr. Information, analysis and fan content is rich, creative and travels through the community rapidly.

5. Reblogs and Likes. Tumblr is all about engagement! The primary user activities for interactions are Reblogs and Likes. Some entries are reblogged thousands of times in a day or two. When a user reblogs a post, it places the other user’s post into their own blog, along with any changes they make. A list of all of the notes (likes, reblogs) associated with a post is appended to that post wherever it shows up on Tumblr. Each post activity record in the firehose can contain reblog info: a count, a link to the blog this entry was reblogged from and a link to the root entry. To build the note list that a user would see at the bottom of a liked or reblogged entry, you have to trace each entry in the stream (i.e. keep a history or know what you want to watch) or scrape the notes section of a page.

Filtering and Mining The Tumblr Firehose

Volume. There are a number of metrics we can use to talk about the volume of the Tumblr firehose. The three gating resources that we run up against most often are related to the network (bandwidth and latency) and storage (e.g. disk space). Tumblr activities are delivered compressed, so for estimating, the bandwidth and disk space requirements can be based on the same numbers. The Tumblr firehose averages about 900 MB/hour compressed volume during peak hours, falling to a minimum of 300 MB/hour during slower periods of the day.

To store the firehose on disk, plan on ~16 GB/day based on current volumes. Planning for bandwidth, you want headroom of 2-5 x average peak hourly bandwidth (4 to 10 Mbps) depending on your tolerance for disconnects during peak events.
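
The arithmetic behind these planning numbers can be checked with a short sketch. It uses only the figures quoted above (900 MB/hour at peak, a 2-5x headroom factor, ~16 GB/day); the function names are hypothetical, and 1 MB is taken as 8 megabits.

```python
# Figures quoted in the text above.
PEAK_MB_PER_HOUR = 900   # compressed volume during peak hours
DAILY_STORAGE_GB = 16    # approximate disk footprint per day


def mb_per_hour_to_mbps(mb_per_hour):
    """Convert megabytes per hour to megabits per second."""
    return mb_per_hour * 8 / 3600


def headroom_range_mbps(peak_mb_per_hour, low=2, high=5):
    """Return the (low, high) bandwidth to provision for a headroom factor range."""
    peak_mbps = mb_per_hour_to_mbps(peak_mb_per_hour)
    return low * peak_mbps, high * peak_mbps


def storage_gb(days, gb_per_day=DAILY_STORAGE_GB):
    """Estimate disk space needed to retain the firehose for a number of days."""
    return days * gb_per_day


print(mb_per_hour_to_mbps(PEAK_MB_PER_HOUR))   # average peak rate in Mbps
print(headroom_range_mbps(PEAK_MB_PER_HOUR))   # provisioning range in Mbps
print(storage_gb(30))                          # GB for 30 days of retention
```

A 900 MB/hour peak works out to 2 Mbps, which is where the 4 to 10 Mbps provisioning range comes from.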

The other consideration is end-to-end network latency, as discussed in Consuming the Firehose, Part II. Very simplistically, latency can limit the throughput of your network (regardless of bandwidth) by using up too much time negotiating connections and acknowledging packets. (For a detailed calculation, see, for example, The TCP Window, Latency, and the Bandwidth Delay Product.) The theoretical round-trip latency limit for sustaining 20 Mbps throughput is 50-70 ms (depending on TCP window size), but in practice you will want to reliably observe less than this (< 50 ms) to realize reliable network performance.
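
As a rough illustration of the bandwidth-delay relationship, a single TCP connection can move at most one window of data per round trip, so throughput is bounded by window size divided by round-trip time. The sketch below applies that simplification. The window sizes (125,000 and 175,000 bytes) are assumptions chosen because they reproduce the 50-70 ms range quoted above; they are not values taken from the firehose documentation, and the function names are hypothetical.

```python
def max_rtt_ms(window_bytes, target_mbps):
    """Largest round-trip time (ms) at which one window per round trip
    still sustains the target throughput."""
    window_bits = window_bytes * 8
    return window_bits / (target_mbps * 1000)


def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound on throughput (Mbps) for a given window and round-trip time."""
    window_bits = window_bytes * 8
    return window_bits / (rtt_ms * 1000)


# Assumed TCP window sizes, in bytes.
for window in (125_000, 175_000):
    print(window, max_rtt_ms(window, target_mbps=20))
```

Real connections also lose time to packet loss and slow start, which is one reason to aim below the theoretical limit.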

Metadata. A firehose is a time-ordered, near real-time stream of user activities. While this structure is clearly powerful for identifying emerging trends around brands or news stories, the time-ordered stream is not the optimal structure for looking at other things, such as the structure of social networks to discover, e.g., influencers. Fortunately, the Tumblr firehose activities contain a lot of helpful metadata about place, time, and social network to get answers to these questions.
Each activity has a post objectType, as discussed above, as well as links to resources referred to in the post, such as image files, video files and audio files. Each activity has a source link that takes you back to the original post on Tumblr. If the post is a reblog, it will also have records like the JSON example below, describing the number of reblogs, the root blog and the blog from which this post was reblogged.

"tumblrRebloggedFrom" : {
         "author" : {
               "displayName" : "A Glimpse",
               "link" : ""
         },
         "link" : ""
},
"tumblrRebloggedRoot" : {
         "author" : {
                "displayName" : "Armed With A Mind",
                "link" : ""
         },
         "link" : ""
}

To assemble the entire reblog chain, you must connect the reblog activities within the firehose using this metadata.
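
One way to connect reblog activities is to index each post by the link of the post it was reblogged from, then walk those parent pointers. The sketch below is a minimal illustration, not Gnip's method. It assumes each activity is a dict carrying its own top-level "link" (an assumption) plus the "tumblrRebloggedFrom" record shown above. The example.com URLs are placeholders, since the links in the sample record are empty.

```python
from collections import defaultdict


def index_reblogs(activities):
    """Build parent and child lookups from a batch of activities.

    Returns (parent_of, children_of), both keyed by post link.
    """
    parent_of = {}
    children_of = defaultdict(list)
    for activity in activities:
        reblogged_from = activity.get("tumblrRebloggedFrom")
        if not reblogged_from:
            continue  # an original post, not a reblog
        child = activity["link"]          # assumed top-level field
        parent = reblogged_from["link"]
        parent_of[child] = parent
        children_of[parent].append(child)
    return parent_of, children_of


def chain_to_root(link, parent_of):
    """Follow parent pointers from a post back to the earliest known ancestor."""
    chain = [link]
    seen = {link}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            break  # guard against malformed, cyclic data
        chain.append(parent)
        seen.add(parent)
    return chain


# Placeholder activities for illustration only.
sample = [
    {"link": "https://example.com/post/a"},
    {"link": "https://example.com/post/b",
     "tumblrRebloggedFrom": {"link": "https://example.com/post/a"}},
    {"link": "https://example.com/post/c",
     "tumblrRebloggedFrom": {"link": "https://example.com/post/b"}},
]
parents, children = index_reblogs(sample)
print(chain_to_root("https://example.com/post/c", parents))
```

Because a stream only contains the activities seen since you started listening, a chain built this way stops at the earliest ancestor in your history. The "tumblrRebloggedRoot" link tells you where the full chain actually begins.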

Additional engagement metadata is available in the form of likes (hearts in the Tumblr interface) in a separate Tumblr engagement firehose.

Tumblr Likes Metadata

Non-Text Based Filters. Not all non-text post types have enough textual context (captions, titles and tags) to identify a topic or analyze sentiment through simple text filtering. You will want to develop strategies for dealing with some ambiguity around the meaning of posts with very little text content. This ambiguity cannot be reduced unless you have audio or image analysis capabilities (e.g. OCR or audio transcription). Approximately 20% of all posts (about 15M activities per day) can be filtered effectively with text-based filtering of body text, URL text, tags and captions.
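
A simple text-based filter over those fields might look like the sketch below. The field names ("title", "body", "caption", "url", "tags") are hypothetical stand-ins for whatever keys the activity payload actually uses, and whole-word, case-insensitive keyword matching is just one possible rule.

```python
import re

# Hypothetical field names; the real payload may use different keys.
TEXT_FIELDS = ("title", "body", "caption", "url")


def searchable_text(activity):
    """Join every text-bearing field of an activity into one string."""
    parts = [str(activity.get(field) or "") for field in TEXT_FIELDS]
    parts.extend(activity.get("tags") or [])
    return " ".join(part for part in parts if part)


def build_matcher(keywords):
    """Compile a case-insensitive, whole-word pattern for the keywords."""
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)


def filter_activities(activities, keywords):
    """Keep only activities whose text fields mention any keyword."""
    matcher = build_matcher(keywords)
    return [a for a in activities if matcher.search(searchable_text(a))]


# Placeholder posts for illustration only.
posts = [
    {"title": "New kicks", "tags": ["Nike", "shoes"]},
    {"caption": "sunset", "tags": []},
    {"body": "I love my NIKE runners"},
]
print(len(filter_activities(posts, ["nike"])))
```

Posts with no usable text simply fall through a filter like this, which is why the remaining share of the firehose needs other strategies.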

Memes. Another consideration related to the Tumblr language is that official brand sites, as well as many bloggers, tend to promote a style or overall image more than they provide a catalog of particular products. As a result, a filter on the brand name will match a lot of cool stuff, but specific product names and descriptions will show up much less frequently. There are many memes within Tumblr that will lead you to influencers and sentiment, but looking at “catalog” terms won’t be the most effective path.

I hope I have uncovered some of the mysteries of successfully consuming social media firehoses.  I have only suggested a handful of questions one might try to answer with the social media data. The community of professionals providing text analysis, image analysis, machine learning for prediction, classification and recommendation, and many other wonders is continuing to invent and refine ways to model and predict real-world behavior based on billions of social media interactions.  The start of this process is always a great question.  Best of luck (and the benefits of all of Gnip’s experience and technology) to you as you jump into consuming the social media firehose.

Full Series:

Taming The Social Media Firehose, Part I – High-level attributes of a firehose

Taming The Social Media Firehose, Part II – Looking at a single event through four firehoses

Taming The Social Media Firehose, Part III – Tumblr


Tumblr Firehose Now Available Exclusively from Gnip

I’m thrilled to announce that the full firehose of public Tumblr posts is now available exclusively from Gnip. Tumblr is one of the fastest growing social networks in the world. Much of this growth is fueled by the enormous number of conversations that are unique to the Tumblr community. These conversations cover a huge range of subjects, from movies, TV shows and fashion to business, apparel and consumer products. Check out these stats to get a feel for the volume of discussion on Tumblr:

  • 50 million new posts every day
  • 15 billion page views every month
  • 20 billion total posts
  • 300% traffic growth last year

While some social platforms react quickly to news and other events, Tumblr conversations often spread around concepts and trends. Take the example of Urban Outfitters where a photographer posted a picture to her personal Tumblr of a piece from one of their new collections. That post received over 1,000 notes and almost no mention elsewhere. In the case of Land Rover, the company posted a picture of a dog riding in a Land Rover to their Tumblr that received more than 5,000 notes and very little mention on other networks.

It doesn’t take a large leap to see the impact this type of information can have on brand management and product development. The conversations on Tumblr are rich in images and discussion about brands and products, from simply sharing a picture of a favorite pair of shoes to reblogging news about a favorite brand. And given the highly social nature of the Tumblr community, these discussions move quickly and broadly through the community. You often see posts that are shared tens of thousands of times. For brands, every conversation matters, and access to the full firehose ensures they won’t miss a thing.

We’re excited to be able to offer Tumblr to our customers and can’t wait to see what other intriguing use cases they find for this data.

Drop us a line at to learn more.