Blog and Comment Data Answers the Why

Shoppers Tweet about what they bought, but they turn to blogs and comments to share why they bought.

This is only one example of what makes the long-form data from blog and commenting platforms valuable to any company looking to better understand why their customers and prospects make the decisions they do. Simply put, blogs and comments are opinion rich. And when it comes to product development, sales, brand management, and more, these opinions provide a unique and critical lens into the nuanced thinking behind customer decisions.

In the past, some social media monitoring providers have used scraping solutions to include blog and comment data in their offerings. While this can get you the data, scraping has several fundamental challenges. The data can be days or weeks old. Scraping solutions often ignore terms of service and user intent, meaning the data can disappear at a moment’s notice when the scraper gets blocked. The data can come in a range of formats that make it very difficult to parse and analyze. And with scraped data, you only get results from the blogs and comments that you know you should be looking at, missing important discussions that surface in new and unexpected places.

It’s because of these challenges that we’re introducing Gnip for Blogs, combining content from four of the most popular long-form blog and comment sources. This first-of-its-kind package of data from Disqus, Tumblr, WordPress and IntenseDebate gives realtime, normalized, terms of service-compliant access to the rich conversations happening across a huge swath of the Internet. With Gnip for Blogs, customers are able to easily and confidently build their business applications on multiple sources of long-form data knowing it won’t suddenly disappear tomorrow.

Each of these sources has a story to tell on its own, but by looking at them all together, brands are able to draw insights from an enormous range of discussion. This includes the mass-market reach provided by WordPress, which powers 19% of the web; the high volume of brand mentions on Tumblr; the highly engaged audience on IntenseDebate; and the enormous reach and quality of the conversations on Disqus.

One of our customers, Networked Insights, recently used realtime WordPress data to identify early technology trends based on influencer blog conversations. They then used this content to refine and focus a targeted online promotion for one of their customers. The end result? A 30% lift in ROI for their online ad spend. And this is only the beginning.

For more information, check out Gnip for Blogs on our website or contact us at

Launching Automattic’s Certified Products Program

“Now that social data is becoming business-critical, it must become enterprise-class.” – Susan Etlinger, “The Emerging Social Data Ecosystem”, 9/18/2013

At Gnip, our customers’ experience is that social data has been “business-critical” for a while. But Susan Etlinger’s point is worth reiterating. While businesses have spent the last few years trying to figure out the “what” and the “why” of social data, they are increasingly focusing on “where” their data comes from. And the most sophisticated businesses know that scraped data is fundamentally different from a rate-limited public API, which is in turn different from a full firehose.

The WordPress platform now represents an amazing 19% of the web. Which means if you want to know what people are saying about your brand on the web, you have to consider content from WordPress. With today’s announcement, we are putting a stake in the ground and saying that if your WordPress content is coming from elsewhere it is not suitable for building an enterprise-class product.

mBlast and Networked Insights, the inaugural members of the Automattic Certified Products Program, have always been strong advocates of including high quality blogs and comments among the wide variety of sources they offer their clients. Likewise, at Gnip we’ve long championed the “social cocktail”: the idea that the social data universe extends well beyond Twitter and Facebook and that any company trying to understand what is being said about their brand and industry has to consider other sources of data in order to get the full story.

Long-form content not only offers rich and deep expression; the half-life of the conversations that take place on a platform like WordPress is also substantially longer than that of the equivalent conversations on microblogs, where conversations often begin and end in a matter of minutes or hours.

Until Gnip and Automattic brought the firehose to the market in late 2011, the only solution for even attempting to capture these conversations was building elaborate and often expensive crawlers in order to capture data from specific URLs. But if you’re in the business of realtime discovery and analytics, your customers tend not to be impressed if your crawled solution returns a conversation hours or even days after it happened and you cannot guarantee that you’ve provided them with everything.

To learn more about how mBlast and Networked Insights are using the WordPress firehose to provide valuable insights to their customers, check out their blog posts here and here.

For more info about how you can get access to realtime WordPress data via Gnip or become a member of the Certified Products Program, email

Cracking the Code to Discovering Insights in Consumer Conversations:

How Networked Insights helps brands make data-driven marketing decisions

While short-form content is good for predicting trends, long-form content carries distinctive elements that enable deeper, threaded analysis of ongoing conversations and commentary. This analysis can help marketers better understand pre- and post-sale behavior, identifying what moves consumers through the purchase funnel. No surprise that smart analytics companies are focusing on learning more about these long-form sources to complement short-form content sources, such as Twitter.

To fully serve the needs of their customers, Plugged In partner Networked Insights wanted to provide a comprehensive view of consumer conversations, which required tracking across different social media platforms, including more long-form content sources. To do this, it was important to Networked Insights to have complete access to the firehose of WordPress and IntenseDebate data from Gnip. Through their next-generation analytics platform SocialSense, Networked Insights helps brand marketers and CMOs better understand their consumers by providing insights from the social sphere in three distinctive areas: Audience, Content, and Media. These insights help the company’s clients gain a clearer picture of consumer behavior and affinities, uncover new audiences, optimize advertising spend, and inform a wide range of marketing decisions.

Audience Insight: Networked Insights collects and analyzes millions of data points a day from a myriad of sources. They understand that social data isn’t just about Twitter and Facebook, so they incorporate as much long-form content as possible from various blogs and forums. In fact, more than 20% of the long-form content Networked Insights consumes comes from WordPress, a source available through Gnip’s exclusive partnership with WordPress.

The chart below explores the classic data-mining marketing illustration of dads shopping for beer and diapers, giving an example of how audience insights play into marketing decisions. Networked Insights looked at mentions of both diapers and beer during a nine-month period. The greatest correlation between the two topics occurred as dads prepared for the Super Bowl. What can marketers and advertisers take away from this? It’s a great opportunity for diaper consumer packaged goods (CPG) brands to access the typically expensive-to-reach sports fan audience.

Beer vs Diapers on WordPress

Content Insight: Building from the audience insights they uncover, Networked Insights works with marketers to provide a 360-degree view of their brand, competitors, or ecosystem’s consumers. They accomplish this through identifying the key affinities (favorite celebrities, musicians, TV shows, etc.) within a brand’s key target audiences. This gives a huge advantage to marketers because millions of dollars are spent every year guessing on content that would resonate with their target consumers.

So how does long-form content lend a different kind of insight? One difference is the length of user engagement. The engagement that WordPress users create is more sustained than what Twitter users create, even though WordPress volume pales in comparison to Twitter volume. Here’s one way to see the difference using social mentions related to TV shows as an example. After an episode airs, the number of conversations on WordPress doesn’t decrease as quickly as it does on Twitter, as you can see in the two charts below. Since conversations are shared on WordPress longer, this results in an increased capacity for marketers to get deep, actionable insights.

Reality Programming: Twitter vs WordPress

Comedy Programming: Twitter vs WordPress

Media Insight: By leveraging audience insights with content affinities of the target consumers, marketers can now more effectively buy media and organically reach new consumers. A great example of this is work Networked Insights did with a consumer tech company to amplify the reach of their digital ads.  Through access to real-time blog data from WordPress, Networked Insights distinguished early tech trends and themes that would ultimately trickle down to the general consumer conversation. They did this by identifying a group of tech influencers—a specific blogging community that relies upon platforms including WordPress—and analyzed how these influencers engaged with tech-related products. Leveraging what they knew about the audience and their consumer behaviors and interests, Networked Insights was able to provide new insights into what content to promote and whom to target on digital, and as a result increased the effectiveness of the company’s cost per impression by over 30%.


Creating and Sharing Content on WordPress

An interview with Paul Maiorana, Vice President of Platform Services at Automattic, about creating and sharing content on WordPress. 


There are a lot of names for the WordPress/Automattic group, so it’s important to distinguish who is who. WordPress, which celebrated its 10-year anniversary in May, is an open source platform, free to use and free to download. Automattic (named for its founder, Matt Mullenweg) is the organization providing services around WordPress and handling its infrastructure. Lastly, Jetpack is the plugin used to add features to a WordPress site, powered by the cloud infrastructure.

Paul Maiorana, Automattic’s VP of Platform Services, dove into VIP, a solution for large media organizations and enterprises. You can run WordPress anywhere in the world, and Automattic is the largest user of and contributor to the open source platform. They’ve built a significant amount of knowledge around scaling the product and now provide this knowledge to enterprises. Huge organizations like Turner Broadcasting, federal agencies and a wide spectrum of other groups are customers.

“Biggest Home of Users on the Web”

WordPress has a philosophy when building their open source software – the idea of the independent web. Paul says they like to think of WordPress as a digital hub and your home on the web. At the end of the day, they try to give you (the user) the tools to create and export content and put it where you want. The user will always own WordPress as much as the company does. “A place on the web you can call your own, where you own the data, you own the experience,” says Paul, is part of the DNA at WordPress. More than 18% of the top 10 million websites are WordPress, and 70 million WordPress websites are hosted between and other sources.

Blogging and Enterprise

While WordPress’ roots have always been in blogging, they see themselves as more of a content management system. The perception of WordPress as just a blogging tool has persisted because of its reputation. But over the last couple of years, they’ve expanded on those roots, adding tools that let users customize their sites and make them more than just a blog. More and more organizations are using WordPress as a CMS these days instead of just a blog. On an enterprise level, major websites like CBS are using WordPress as a CMS. It’s a testament to how the tool has evolved over recent years.

Product Roadmap

Paul says product decisions have an interesting relationship with the open source portion of WordPress. At the end of the day, WordPress has little control over what happens on that side. Unlike other CMS platforms, WordPress updates three times a year. It is updated without breaks to make it seamless for people to use the best WordPress there is. Within Automattic, they’ve built a lot of enterprise solutions and open source solutions to help make WordPress better for everyone.

Mobile is also a huge focus and will continue to shape their roadmap. For now, it’s a big initiative in two ways: the front-end user experience and the dashboard admin experience. The past three releases have each focused on a default theme that is responsive, and they will continue to do so. For the admin experience, mobile is perfect for “of the moment” publishing. With apps for iOS, Android, BlackBerry, and Windows, more content publishers will be able to publish efficiently on the go. They’ve seen real-world use cases too, with reporters catching stories first because they were able to publish from a mobile device.

WordPress and Social

Blogging is inherently social, and it’s no accident that comments are an important part of the WordPress software. The conversation is an important part of publishing on the web. Paul said WordPress spends a lot of time thinking about additional social features they can add (likes, re-blogging, following, subscribing to updates). Looking forward, they’re hoping to expose the idea of consuming content within WordPress. They’re experimenting with a reader interface and giving users the ability to subscribe to content they like from topics or specific blogs and then see it all in one place and interact with it socially.

Big Boulder is the world’s first social data conference. Follow along at #BigBoulder, on the blog under Big Boulder, Big Boulder on Storify, and on Gnip’s Facebook page.

Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see what posts were most likely to generate likes, and we interviewed Dmitrii Vlasov who came in second in the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up and coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle accidentally as it always happens. I read some blog post about the Million Song Dataset Challenge provided by and bunch of other organizations. The task was to predict which songs will be liked by users based on their existing listening history. This immediately made me feel excited because I’m an active user and was reflecting about what connections between people can be established based on their music preferences. But the contest was coming to end and so I switched to WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Now I’m a senior CS student in Togliatty, Russia. Can’t say that I have a special background in Data Science – I had more than a year-long course of probability theory and math statistics in university, some self-learned skills about semantic analysis and have big love to Python as a tool for implementing ideas. Also, I’ve entered the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts attract more likes (80,000-90,000 characters). This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show relative correlation between amount of photos, characters and videos and the amount of likes received. Big relative “folks love” on several prominent amount of photos means that there were not so many posts with such amount of photos but most of them were qualitative. Quick empirical analysis shows that these are special type of posts – “big photo posts”. They usually are photo report, photo collection or scrapbook. For such types of posts 10-15 photos are not enough but at the same time 10-15 photos seem too overloaded for normal post. The same can be said about big amount of text in post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80-90 thousands are winners in “heavyweight category”. These are big researches, novels, political contemplation. Analyse is quite simple but it shows that if you want to create media-rich or text-rich content it should be really media-text-rich. Or you may fall in a hollow of not suitableness.

4. What else would you like to predict with social data if you got the chance?

Now I work on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data e.g. to predict reaction on one or another politician’s statement based on a user’s Twitter feed. Or to extract all “real” thesis of politician based on all of his public speeches.

Social Media Knows As Much About The Holidays As Santa Does

The holidays are an exciting time at Gnip…and not just because our CEO loves bringing random bottles of excellent Scotch to the office. Around this time of year we get some visibility into the incredible ways our retail and consumer product clients are using social data. In fact, Mashable recently highlighted a study by Mr. Youth (a marketing firm) with an incredible stat that helps prove how valuable social data in holiday shopping truly is:

“66% of respondents who bought something on Black Friday did so as a direct result of social media interactions with friends and family.”

While that stat speaks to the impact social media has upon us as individuals, think more broadly about how powerful it is to analyze that data in aggregate, in real-time. Companies are leveraging data from WordPress blogs, Twitter mentions, Facebook likes and multiple other sources to inform critical realtime decisions for inventory management and operational planning, sales and marketing planning, revenue forecasting, and many others.

Example Scenario for Using Social Data: It’s holiday time, 2011. Your company begins to aggregate ‘mentions’ of a new product from Twitter, Facebook, and WordPress blogs in realtime. You take that data and analyze it for # of mentions about the new product, geography of posts (where available), demographic information within user profiles (what keywords are most consistent within Twitter user profiles that mentioned your product?), etc.

You spread that data among multiple divisions, providing additional forecast, regional buying pattern, and customer habit data. Your teams use that to:

  1. Manage supply chain: Redirect inventory to areas with highest potential sales and (depending on how far out you are) use as a data point in the S&OP system for manufacturing forecasts to keep ahead of the holiday demand.
  2. Target marketing spend: Use regional buying patterns and customer habit data to inform what demographic you are, and aren’t, hitting. Do you need to reposition your marketing plan?
  3. Incorporate product feedback: Are there consistent reasons why people are buying your product – or why they aren’t? Information on quality, packaging, price, etc will be incredibly valuable for future products.
  4. Calibrate investor expectations: Inform your IR team of potential positive/negative performance feedback to give them running room ahead of any announcements.
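To make the aggregation step in this scenario a little more concrete, here is a minimal Python sketch that counts mentions by source, by geography (where available), and by keywords in user profiles. The record shape, the field names, and the sample values are all made up for illustration; real activities from any of these sources will look different.

```python
from collections import Counter

# Hypothetical, hand-written sample records (not a real data format).
mentions = [
    {"source": "twitter", "region": "MN", "bio": "dad coffee runner"},
    {"source": "wordpress", "region": "CO", "bio": "runner and blogger"},
    {"source": "facebook", "region": None, "bio": ""},
    {"source": "twitter", "region": "CO", "bio": "coffee lover"},
]


def summarize(records):
    """Count mentions overall, per source, per region, and per bio keyword."""
    by_source = Counter(r["source"] for r in records)
    # Geography is only available on some posts, so skip missing values.
    by_region = Counter(r["region"] for r in records if r["region"])
    bio_words = Counter(
        word for r in records for word in r["bio"].lower().split()
    )
    return {
        "total": len(records),
        "by_source": by_source,
        "by_region": by_region,
        "bio_words": bio_words,
    }


summary = summarize(mentions)
print(summary["total"], summary["by_region"].most_common(1))
```

A summary like this is the kind of thing each division could then slice in its own way, for example by region for inventory or by profile keyword for marketing.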

Those are just some of the more common use cases we’re seeing. But new opportunities are popping up on a daily basis. We spotted this gem in a recent WSJ article about finding a parking space during crazy shopping times:

Bud Kleppe, a real-estate agent in St. Paul, Minn., watches Mall of America’s Twitter feed for parking updates. (The mall sends them out under the hash tag #moaparking.)

Imagine collecting data from update systems like this and using it to measure parking turnover across prime shopping days. Now, overlay the turnover of spots in specific sections against a map of stores and you have some interesting potential for data on economic performance and forecasting. When incorporated with other traditional retail data and compared on a store-to-store basis, you’ve built a unique and realtime analysis tool.

You’re only limited by your imagination in how you can apply social media data to your business. The more software developers, corporations, and people use social media, and the more things they use it for (like parking updates!), the greater the possible use cases for analysis of that data and the more valuable it becomes.

Gnip Cagefight #2: Pumpkin Pie vs. Pecan Pie

Thanksgiving is a time for family gatherings, turkey with all the delicious fixings, football, and let’s not forget, pie! If your family is anything like mine, multiple pie flavors are required to satisfy the differing palates and strong opinions. So we wondered, which pies are people discussing for the holiday? What better way to celebrate and answer that question than with a Gnip Cagefight.

Welcome to the Battle of the Pies!

For those of you who have been in a pie eating contest or had a pie in the face, you know this one will be a fight all the way down to the very last crumb. In one corner (well, actually it is the Gnip Octagon, so can you really have corners? Oh well) we have The Traditionalist, pumpkin pie, and in the opposite corner, The Newcomer, pecan pie. Without further ado, Ladies and Gentlemen, Let’s Get Ready to Rumble, wait, wrong sport. Let’s Fight!

Six Social Media Sources, Two Words, One Winner . . . And the Winner Is . . .


Source              Pumpkin Pie   Pecan Pie   Winning Ratio (Pumpkin Pie to Pecan Pie)
Twitter             X                         4:1
Facebook            X                         5:1
Google+             X                         6:1
Newsgator           X                         3:1
WordPress           X                         5:1
WordPress Comments  X                         2:1
Overall             +6 Winner!    +0 :(


We looked at one week’s worth of data across six of the top social media sources and determined that pumpkin pie “takes the cake” (so to speak) across every source.

In this case, it is interesting to point out that in sources like Twitter, Facebook, Google+ and WordPress we see higher winning ratios, while sources that tend to have higher latency such as Newsgator and WordPress Comments were a little more even. Is this because, on further consideration, pecan pie sounds pretty good? Or is it that everyone will have to have two pies and, with pecan as the traditional second, it is highly discussed?

Top Pie Recipes

Even though pumpkin pie was our clear winner, we thought it would be fun to share a few of the most popular holiday pie recipes by social media source:

  1. Twitter – Cook du Jour Gluten-Free Pumpkin Pie and Pecan Pie Video Recipe from
  2. Facebook – Ben Starr’s Pumpkin Bourbon Pecan Pie Recipe
  3. Newsgator – BlogHer’s Pumpkin Pecan Roulade with Orange Mascarpone Cream Pie Recipe
  4. WordPress and WordPress Comments – Chocolate Bourbon Pecan Pie from

Non-Traditional Thanksgiving Pies

Another interesting fact that came out of this Cagefight was the counts of non-traditional Thanksgiving pies that were mentioned across the social media sources we surveyed. Though we rarely find these useful for communicating numerical values effectively, you can’t not have a pie chart in this post.

Happy Thanksgiving!

Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we also learned that, for many reasons, we need to work with services via a polling approach as well. For this reason our latest v2.1 includes the Gnip Service Polling feature so that we can work with any service using a push, poll, or mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every end user developer or company has the option to tell Gnip where to push the data for which they have set up filters or have a subscription. We also realize not everyone has an IT setup to handle push, so we have always provided HTTP GET support that lets people grab data for their filters from a Gnip-generated URL.

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (e.g. Bob posted a photo, Susie made a comment, etc.) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic thoughts to set expectations.
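As a rough illustration of that definition of latency, here is a minimal Python sketch. The function name and the timestamps are made up for the example and are not part of the Gnip platform.

```python
from datetime import datetime, timezone


def latency_seconds(activity_time, received_time):
    """Seconds between an activity happening and its arrival on the platform."""
    return (received_time - activity_time).total_seconds()


# Hypothetical timestamps: a comment posted at noon UTC that arrives 45 seconds later.
posted = datetime(2009, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
received = datetime(2009, 6, 1, 12, 0, 45, tzinfo=timezone.utc)
print(latency_seconds(posted, received))
```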

PUSH services: When we have push services, the latency is usually under 60 seconds, but we know that this is not always the case, since sometimes the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60 second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (say every 10 minutes). Right now, with the Community Edition of the Gnip platform, we set rule execution at 10 minute intervals by default, and people need to keep this in mind when setting their expectations for data flow from any given publisher.
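To see how factors (a) and (b) interact, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that each request to an endpoint executes exactly one rule; the function name and the figures are hypothetical and do not describe the actual Gnip scheduler.

```python
def full_cycle_seconds(rule_count, requests_per_second):
    """Time to execute every rule once at a fixed endpoint request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return rule_count / requests_per_second


# Hypothetical figures: 3,000 rules against an endpoint hit 5 times per second.
print(full_cycle_seconds(3000, 5))  # 600.0 seconds, i.e. 10 minutes
```

Under that assumption, the more rules there are to schedule against one endpoint, the longer any single rule has to wait for its next turn.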

Expectations for POLLING in the Community Edition: So I am sure some people who just read the above stopped and said “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10 minute interval is for the Community Edition (aka the free version). We have the complete ability to turn the dial and, using the smarts built into the polling service feature, execute the right rules faster (e.g. every 60 seconds or faster for popular terms and every 10, 20, etc. minutes or more for less popular ones). The key issue here is that for very prolific posters or very common keyword rules (e.g. “obama”, “http”, “google”) there can be more posts in the 10 minute default time-frame than we can collect in a single poll from the service endpoint.
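The trade-off described here can be sketched in a few lines of Python. This is only an illustration of the idea of "turning the dial", not how the polling service actually decides: the function names, the per-poll result limit, and the 60 and 600 second bounds are assumptions chosen to match the numbers mentioned above.

```python
def choose_interval(posts_per_minute, max_results_per_poll,
                    fastest=60, slowest=600):
    """Pick a polling interval in seconds for one rule (hypothetical policy)."""
    if posts_per_minute <= 0:
        return slowest
    # Longest interval whose expected new posts still fit in a single poll.
    fits_in_one_poll = 60.0 * max_results_per_poll / posts_per_minute
    return max(fastest, min(slowest, fits_in_one_poll))


def may_miss_posts(posts_per_minute, max_results_per_poll, interval):
    """True when more posts are expected per interval than one poll returns."""
    return posts_per_minute * interval / 60.0 > max_results_per_poll


# A quiet rule keeps the 10 minute default; a busy one is polled more often.
print(choose_interval(0.5, 100))
print(choose_interval(30, 100))
# A very common keyword can overflow a single poll even at the fastest setting.
print(may_miss_posts(1000, 100, choose_interval(1000, 100)))
```

In this sketch a rule producing 30 posts a minute would overflow a 100-result poll at the default 10 minute interval, which is exactly the situation where a faster interval is needed.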

For now the default expectation for our Community edition platform users should be a 10 minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta.    If your project or company needs something a bit more snappy with the data publishers that are polled then contact us at or contact me directly at as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include:  WordPress,, Intense Debate, Twitter, Seesmic,  Digg, and Delicious

Current polled services on the platform include:   Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, Friendfeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, Smugmug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube

Newest Gnip Data Publisher: WordPress

We are pleased to announce an agreement with Automattic, Inc. that allows us to add as our newest data publisher in the Standard Edition of the Gnip platform.

Gnip now provides access to the WordPress XMPP firehose for posts and comments.   The firehose is designed for companies who would like to ingest a real-time stream of new posts and comments the second they get published and access is via subscription only.   For more information contact Gnip at