Author: Scott Hendrickson, Data Science

Scott is a Data Scientist at Gnip. Before joining Gnip, Scott worked with startups and established software companies on data analysis, machine learning, data visualization and data-centric strategy projects. After completing a PhD in Physics at the University of Colorado, where he simulated beam particle-field interactions, Scott joined an early Internet startup working with Johnson & Johnson Healthcare Systems to create the first on-line version of their health risk assessment. Since then, his projects have included a system for realtime search result ranking using social media data and co-founding a building energy monitoring and analysis company.

Aspirational Brands & Tumblr: Lexus vs. Toyota

Gnip conducted a brief analysis of the Toyota family of brands (Toyota, 4Runner, Camry, Highlander, Lexus, Prius, Rav4, Scion, Sequoia, Tacoma, Tundra) on multiple social media platforms. We looked at brand mentions on Tumblr, Twitter, WordPress and WordPress comments during the period of Oct. 15 to Nov. 15, 2012.

As you would expect, Toyota was the most frequently mentioned brand on each social platform, with one enormous exception: Tumblr. Lexus had 5 times as many mentions on Tumblr as Toyota. This highlights how aspirational brands do exceptionally well on Tumblr, where niche communities of fans often form around brands. (Attention brand managers: this happens whether the company is involved or not.) A central component of Tumblr is visual content, which also plays well with aspirational brands. Furthermore, Tumblr content is extremely viral and has a long shelf life, meaning that content shared on Tumblr can be shared for longer periods of time and jump to more diverse sub-groups within the network than on other social networks. During the month Gnip tracked mentions, Lexus received more than 200,000 mentions while Toyota received 40,000.

In social media, it is easy to rely on Twitter as a kind of alert system for when content is being shared, but at Gnip we’ve seen time and again that content that pops up elsewhere doesn’t always pop up on Twitter. Each social media network has its own attributes, audience and modes of interaction. Because of likes, reblogging, and the way timelines are read by Tumblr users, Tumblr has active communities that aren’t found elsewhere.

Lexus on Tumblr

Four Themes From the Visualized Conference

The first Visualized conference was held in Midtown Manhattan last week. Even with Sandy and a nor’easter, the conference went off with only a few minor hiccups. The idea behind Visualized is a TED-like objective of exploring the intersection of big data, storytelling and design. It worked.

Throwing designers and techies together is one of my favorite forums because of what is common and what is different. On one hand, artists are increasingly skilled with technical tools; on the other, these people are often coming at things from very different perspectives.

The advantages of mixing these people at Visualized go beyond simple idea sharing. Each person specializes, leading to amazing expertise, skill, and focused perspective, but also leaving something out. It is not that everyone can learn to do everything, but rather that, by sharing projects, methods and tools, we can learn what to ask and who to seek out for collaboration. The advantage of this mix is that it is the most reliable way to produce projects that evoke emotion with story, design and data to engage and inform.

We were treated to amazing technical talent and creativity, evident in, for example, Cedric Kiefer’s generative dancer reproduction “unnamed soundsculpture.” To create the basic model, his team started with a song and a dancer, knitting together the 3D surface images from three Microsoft Kinect cameras. They regenerated the movie of the dance by simulating the individual particles captured in the imaging and then enhancing these to generate more particles under the influence of “gravity” and “wind” driven by the music.

unnamed soundsculpture from Daniel Franke on Vimeo.

Cedric and his team radically expand ideas of numeric visualization by capturing and building on organic physical data in complex and subtle ways, generating a whole, engrossing new experience from the familiar elements.

Four themes surfaced repeatedly in the ideas and presentations of the speakers:

Teams

Most of the projects were produced by teams made up of people with a handful of diverse skills and affinities. I heard descriptions of teams such as, “we have a designer (color, composition and proportion sense, works in Illustrator, Photoshop, pen and paper…), a data scientist (data munging, machine learning, statistical analysis…), a data visualization artist (JavaScript, D3 skills, web API mashup skills…), someone who is driven by narrative and storytelling (journalist, marketing project lead…), a database guy, etc.”

Assembling and honing these teams of technical and artistic creatives is probably a rare skill in itself and the result is a powerful engine of exploration, creativity and communication.

Hilary Mason from Bit.ly summed up the second-level data scientist talent shortage clearly: “Every company I know is looking to hire a data scientist; every data scientist I know is looking to hire a data artist.” As broad as data scientists’ skills are, many are recognizing the value of talented designers with the appropriate programming skills for crafting a clear, engaging message.

The New York Times teams (two different teams presented), the WNYC team of two, Bit.ly, and many others showed the power of teams creating together and bringing diverse talents to projects.

There were a couple of notable individual efforts. My favorite was Santiago Ortiz’s beautiful, complex and functional visualization and navigation of his personal Knowledgebase. His design elegantly uses the 7-set Venn diagram, and his deep insights into searching by category and time come together perfectly.

Journalism

Sniffing out the story is fundamental to projects that evoke emotion with story, design and data to engage and inform. Journalists can smell drama and conflict and ask lots of questions. They have a sense of where to dig deeper. They are able to stick to the thread of the story and have a valuable work ethic around finding the details and tying up loose ends.

A large part of the success of Shan Carter and his team in creating the New York Times paths to the White House win visualization comes from their ability to return over and over to the basic idea of making a relevant, accurate and understandable visualization of the various likely outcomes in each of the battleground states. This visualization went through 257 iterations checked into their GitHub repository, with a few evident cycles of creative expansion followed by refocusing on the story.

Data Mashups

API-mashup skills found their best examples in the news teams. WNYC’s accomplishments in creating data/visualization mashups to communicate evacuation zones, subway outages and flood zone information updated in real time during the storms, along with other embeddable web widgets, were amazing. While their designs didn’t have the polish of some of the “slower” work presented, they produced great, accurate and timely results in days and sometimes hours.

Design

“A visualization should clarify in ways that words cannot.” (Sven Ehrmann)

This summed up what I found awe-inspiring and satisfying in the design work. Since I primarily work with data visualization, I often rely on the graph-reading skills of my audience rather than optimized design. This may be necessary for many business applications, but when the message is important and the investment you can reasonably expect from your audience is uneven, to take shortcuts on design is to completely miss opportunities to engage and inform. Great designers are masters at creating memory because they are able to reliably create “emotion linked to experience” (Ciel Hunter).

Jake Porway summed up data, team, story and design in the observations section of his presentation:

  • Data without analysis isn’t doing anything
  • Interdisciplinary teams are required
  • Visualization is a process (see the example from Shan Carter at NY Times above)
  • Tools make amazing outcomes possible with limited resources
  • There is a lot of potential to do a great deal of good when we learn to evoke emotion with story, design and data to engage and inform.

 

Strata 2012 – Strata Grows Up

O'Reilly Strata Conference Making Data Work

O’Reilly Media’s Strata was held in NYC recently, just before Sandy arrived.  The conference was sold out to building capacity.  By this measure, it was the most successful Strata to date. Strata and Hadoop World were combined into a single conference. The week started with tutorials, meet-ups, a mini Maker Faire and Ignite Talks and ended with the more traditional conference format of keynotes, breakouts and an exhibit hall for vendors.

Some of the most interesting and relevant idea-driven keynotes included Mike Flowers talking about data used to understand building code violations in NYC, Rich Hickey addressing opportunities and challenges of adding back to big-data analysis platforms some of the traditional features like indexes, queries and transactions, and Shamila Mullighan’s recommendations for combining internal data with public APIs to enhance your data science.

Tim Estes gave a passionate talk about attention and the “responsibility to know” (and the growing gap between them), asserting “Understanding is a great cause.” Doug Cutting talked about the future of Hadoop, and Julie Steele interviewed “Mathbabe” (Cathy O’Neil) about real-world vs. academic data science. Joe Hellerstein addressed the challenges of resources and attention resulting from 80% of a typical data scientist’s activity being spent preparing and transforming data for analysis. Samantha Ravich wrapped up the keynotes with an appeal for data science tools that better match decision makers’ needs and modes of working.

Themes

Demand Outstrips Supply for Talent
Many speakers pointed to the dire need for more data science talent. In many cases, this was emphasized by pointing to the data going unanalyzed, answers going unfound, and sometimes open positions unfilled.

In what ways is the data scientist shortage directly problematic, and in what ways has it become shorthand for the larger problem that businesses don’t have the decision processes, managers, infrastructure, etc. needed to effectively make decisions from big data? There seemed to be some glossing over the point that the data science talent shortage is largely an opportunity cost and, to the extent there is uneven use of data in your industry, a competitive issue, rather than an actual cost of doing business today. The shortage of data science talent is accompanied by a shortage of managers able to focus on good questions, direct resources to data science projects and make decisions based on data.

The data scientist is key, but is only successful in bringing competitive advantages when the context of good questions and data-driven decisions is in place. Concentrating on the shortage of data scientists when your management team is unprepared to participate in and leverage insights gained from good data science work seems sort of silly. Samantha Ravich’s talk had a clear example of this in how the Bush administration decision process regarding poppy production in Afghanistan went wrong.

Data Science Infrastructure
The other recurring theme was Data Science Infrastructure. Many data scientists have noticed that a huge proportion of their daily work is finding, shaping, loading, moving, transforming and connecting data. The product announcements as well as many of the idea-oriented talks pointed to and quantified this challenge. “Spend more time doing the science part of your job” is the idea behind Platfora, OpenChorus, Impala and Joe Hellerstein’s Data Wrangler.

For Gnip, a personal highlight was Greenplum announcing OpenChorus, a collaborative big data environment with integrations to Gnip, Kaggle and Tableau. Informatica announced their continued work toward a “no-code” environment for big-data analytics. MapR, SAS and other well-known players had their say at keynotes as well.

In general, the last two days of Strata seemed focused more on the line manager of big-data insights and infrastructure in the organization and less on the analysis or visualization practitioner. Some bright spots on the practitioner side were Donald Miner’s MapReduce patterns, Kim Rees on creating great visualizations and Cathy O’Neil on the realities of mining and making decisions based on weak signals in time-series data.

Twitter and SXSW: Barometer of Trends

Last week we talked about tracking SXSW from 2007 to 2012 using Gnip’s Historical PowerTrack for Twitter. This gave us insight into year-over-year trends in SXSW Tweets and now we’re going to look at how SXSW trends have changed over time.

With every square inch of Austin packed with the social media influential, SXSW provides an interesting avenue to examine trends, big and small, to see what people are talking about on Twitter. Now that companies can use Gnip’s Historical PowerTrack for Twitter to baseline events, it provides a whole other avenue to determine trends.

Party vs. Panel
People have such a love/hate relationship with SXSW. Some people love it for its networking opportunities and great sessions, while other people decry it as one giant party. Letting the data speak for itself, it seems that in earlier years of the conference, people came for the panels, hopefully to learn something from their peers. But by 2011, Tweets mentioning “party” outnumbered Tweets mentioning “panel” by more than 10,000. People were talking about the best places to meet people rather than the best places to learn. That same year, there were 13,072 mentions of the word RSVP in SXSW Tweets, with people making plans to find the best parties and likely indulging in the practice of RSVPing for 136 events and actually attending 12 of them.

Geo-location Wars
While Twitter is useful for helping understand how cultural events are changing, the use cases extend further into helping understand the rise and fall of startups. The launch of Foursquare and Gowalla at SXSW in 2009 was the beginning of the so-called geo-location wars. Many people have wondered how Foursquare ended up the winner, and SXSW provides interesting insight into how Foursquare came out on top. Back in 2009, SXSW Tweets would have told you it was anyone’s game: surprisingly, Foursquare received only a little more than 100 Tweets more than Gowalla. By 2010, Foursquare had been more clearly marked as the winner, receiving nearly double the Tweets that Gowalla was receiving. At that point, everyone was still writing posts to determine the pros and cons of each service, but the social data was clear: Foursquare had the buzz that year, due in part to their ability to easily publish updates, badges and mayorships on Twitter and perhaps even their rogue game of Foursquare outside the Convention Center. By 2011, Foursquare had completely suckerpunched Gowalla, taking the lion’s share of public voice with nearly 65,000 Tweets to Gowalla’s nearly 8,000 Tweets. By the end of 2011, Facebook ended up acqui-hiring the Gowalla team.

BBQ vs. Tacos
This next trend might seem silly: who cares if more people are interested in BBQ or Tacos? I mean, what significant impact can this social data have? But if you’re a restaurant chain or looking to start a new franchise chain, it would be interesting to know about cultural food trends, such as the rise of cupcakes, as they are happening.

While many have long suspected that Austin was a BBQ kind of town, the social data has shown that at the last SXSW, Tacos overtook BBQ as the most talked about grub to grab. More data science would have to be done to determine if the Taco is becoming a more widespread cultural trend, but when all other Tweet volumes were falling in 2012, the term Tacos was charging full-steam ahead.

This is just the beginning of what social data can tell companies about trends and market research. We think historical social data will prove invaluable to market research, given the sheer volume of conversations that are happening on Twitter.

In The Future, The Data Scientist Will be Replaced by Tools


Some of you are celebrating. Some of you are muttering about how you could never be replaced by a machine.

What is the case for? What is the case against? How should we think about the investments in infrastructure, talent, education and tools that we hope will provide the competitive insights from “big data” everyone seems to be buzzing about?

First, you might ask why try to replace the data scientist with tools?  At least one reason is in the news: The looming talent gap.

WireUK reports,

Demand is already outstripping supply. A recent global survey from EMC found that 65 percent of data science professionals believe demand for data science talent will outpace supply over the next five years, while a report from last year by McKinsey identified the need in the US alone for at least 190,000 deep analytical data scientists in the coming years.

Maybe we should turn to tools to replace some or all of what the data scientist does. Can you replace a data scientist with tools?  An emerging group of startups would like you to think this is already possible. For example, Metamarkets headlines their product page with “Data science as a service.” They go on to explain:

 Analyzing and understanding these data streams can increase revenue and improve user engagement, but only if you have the highly skilled data scientists necessary to turn data into useful information.

Metamarkets’ mission is to democratize data science by delivering powerful analytics that are easy and intuitive for everyone.

SriSatish Ambati of the early startup 0xdata (pronounced hex-data) goes a step further with the idea that “the scale of the underlying data and the complexity of running advanced analysis are details that need to be hidden.” (GigaOm article)

On the other side of the coin, Cathy O’Neil at Mathbabe set out the case in her blog a few weeks ago that not only can you not replace the data scientist with tools, you shouldn’t even allow the non-data-scientist near the data scientist’s tools:

 As I see it, there are three problems with the democratization of algorithms:

 1. As described already, it lets people who can load data and press a button describe themselves as data scientists.

 2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.

 3. Businesses might think they have awesome data scientists when they don’t. [...] posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

If this is a topic that interests you, we’ve submitted a panel on this topic for SXSW this spring in Austin to discuss issues surrounding data science and tools. We will talk about what tools are available today, how they make us more effective as well as some of the pitfalls of tool use. And we will look into the future of tools to see where and if data scientists can be replaced by tools. Would love a vote!

Panelists:

  • John Myles White (@johnmyleswhite) – Coauthor of Machine Learning for Hackers and Ph.D. student in the Princeton Psychology Department, where he studies human decision-making.
  • Yael Garten (@yaelgarten) – Senior Data Scientist at LinkedIn.
  • James Dixon (@jamespentaho) – CTO at Pentaho, open source tools for business intelligence.

Update: One of our panelists, John Myles White, has provided some thoughtful analysis of companies that rely on automating or assisting data science tasks. See his blog post at http://www.johnmyleswhite.com/notebook/2012/08/28/will-data-scientists-be-replaced-by-tools

The Social Cocktail, Part 3: Many Publishers Build One Story

In the first post, we looked at high-level attributes of the social media publishers. Then, we spent time looking at the social media responses to expected and unexpected events. To end this series, let’s dive into an example of the evolution of a single story across a mix of publishers. This will provide some intuition into how the social cocktail works when examining a real-world event: in this case, the JPMorgan-Chase $2+ billion loss announcement on May 10, 2012.

JPMorgan-Chase Trading Loss

Twitter: Fast and Concise
On May 10, 2012, immediately after market closing, JPMorgan-Chase CEO Jamie Dimon held a shareholder call to announce a $2 billion trading loss. While traditional news agencies reported the call announcement late in the afternoon, Twitter led the way with reports from call participants who started tweeting while on the call a few minutes after it started.

To see how the volume on Twitter evolved, see figure 1. In each case, the points represent activity volumes on the topic of JPMorgan and “loss” while the lines represent function fits to either the Social Media Pulse or a Gaussian curve (a simple approximation for expected event traffic when averaging over the daily cycle.)

As Reuters and others released news stories and Europe started to wake up, a second Twitter pulse is visible. Toward the right-hand side of the graph, the daily cycle of Tweets dominates the conversation about JPMorgan and “loss” with a curve more characteristic of broadly reported, expected events.

Twitter Reacts to JPMorgan Trading Loss

Figure 1. Twitter and StockTwits audiences comment on JPMorgan and “loss” after the announcement of a $2B trading loss on the evening of May 10, 2012. Volumes are normalized so that peak volume = 1 for each publisher.
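
The per-publisher normalization mentioned in the caption can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to produce the figures; the function name `normalize_to_peak` and the counts below are made up for the example.

```python
def normalize_to_peak(volumes):
    """Scale a list of activity counts so the largest value becomes 1.0."""
    peak = max(volumes)
    if peak == 0:
        # Nothing happened in any bucket; avoid dividing by zero.
        return [0.0 for _ in volumes]
    return [v / peak for v in volumes]


# Hypothetical activity counts per time bucket (not real data).
counts = {
    "twitter": [120, 900, 450, 300],
    "stocktwits": [4, 25, 10, 10],
}

# Each publisher is scaled by its own peak, so curves with very
# different absolute volumes can share one vertical axis.
normalized = {name: normalize_to_peak(series) for name, series in counts.items()}
```

Normalizing each publisher separately discards the difference in absolute volume, which is why the curves are useful for comparing timing and shape but not size.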

StockTwits: Fast and Concise, Focused

Much of the analysis that applies to Twitter applies to StockTwits; the major exceptions are in the expertise of the users and focus of the content. The StockTwits service serves traders, and participants are mostly professional investors. Because the audience and the content are curated, there is very little off-topic chatter. Further, much of the content is specific analysis of JPMorgan’s loss, analysis of the stock price movement following the announcement and information about after-hours price indicators.

On Friday (May 11th), discussion of the loss reached only about 40% of the peak of the night before. This is likely due to the message rapidly saturating the highly connected community on StockTwits.

Comments: Both Fast and Slow, Concise

Because there was a lot of financial news attention on the story, news stories started to appear soon after the call and these attracted comments immediately (this was the fast response). The data shown in Figure 2 includes comments from both Automattic and Disqus. These comment platforms are used for comments on both personal blogs and on news stories posted online by news organizations, so there is a mix of comments on news stories as well as personal analysis.

A graph about comments on the JPMorgan trading loss

Figure 2. Commenters on blogs and news stories react to the announcement of a $2B trading loss on the evening of 10 May 2012, and an even stronger contingent reacts early on 11 May. Volumes are normalized so that peak volume = 1.

More-considered news and blog stories appeared on Friday morning, May 11th, and these spurred a second (slower) pulse of comment responses.

An additional pattern that is often seen in comments is that people tend to read blogs at certain times of day (e.g. morning or evening) by habit. Because of this, we sometimes see comment volumes spiking at the start or end of the day in very active timezones.

Tumblr: Medium and Very Rich

The Tumblr audience reacted to the news as if the story was broken on Tumblr rather than broken on traditional news. This is unique among the publishers studied here. This pattern of slowly growing traffic during the first few hours after the shareholder call may indicate the nature of the conversation on Tumblr. Rather than an event-response reaction, as on Twitter, or a considered reaction, as with blogs, the reaction of the audience on Tumblr accelerates as the type of content Tumblr users reblog appears in the network. While the initial posts on Tumblr refer to news stories, the spread of the story through reblogging happens as a ramp up to the peak over a few hours.

The following day, the Tumblr story evolves like an expected event.

Not only is the timeline unique, but Tumblr content is also unique. Early posts have rich media including political cartoons and more right-brained political commentary and humor than the text-comment crowd. Adding Tumblr to your social media mix may present additional challenges in evaluating and analyzing the content, but the sensibilities as well as the activity of this audience add a dimension not found in the content from the other publishers.

Blogs: Medium and Rich

A few quick, factual reports from the call were published in the form of blog posts as can be seen by the slight “heaviness” in the curve at the end of the day (May 10th). However, the large majority of the blog traffic is the traditional, considered and refined reactions published throughout the following day. The traffic on May 11th follows the pattern of an event everyone already knows about.  The discussion here is analysis and commentary as people explore the implications of the story.

The large majority of the blog content is text or text with a picture of Mr. Dimon. Stories vary from dozens of words to a few thousand.

Graph Showing Blog Reactions to JPMorgan Trading Loss

Figure 3. Content-rich and text-rich reactions to the announcement of a $2B trading loss on the evening of 10 May 2012. In-depth analysis continues with heavy posting during the day on 11 May. Volumes are normalized so that peak volume = 1 for each publisher.

Finally, take a look at these timelines shown together in Figure 4.  This view gives a clear indication of the timing of reactions between the publishers.

Social media reaction to JPMorgan trading loss

 Figure 4. The points show the normalized volume of activities about “JPMorgan” and “loss” following the May 10th announcement from Jamie Dimon. Lines represent fits to models of typical social media reactions.  Volumes are normalized so that peak volume = 1 for each publisher.

Conclusion
This example story demonstrates the potential of mixing perspectives, audience and styles of conversation in creating a full description of the social media response to events. With the right mix, we can identify stories and emerging topics within minutes and we can quickly characterize the relative size and speed of a story. We can identify user engagement, dig into deeper analysis, and measure the rate and focus of content sharing. With this mix of social data, we might be getting close to the perfect cocktail.

The Social Cocktail, Part 2: Expected vs. Unexpected Events

In the last post on social cocktails, we looked at some high-level attributes of the social media publishers, and how these attributes might help you choose the right mix of topic coverage, audience, speed and depth. In this post, I will pay a little more attention to the speed dimension by diving into a description of the social media responses to expected and unexpected events. Finally, in the third post, we’ll end with an example of the social cocktail in examining a real-world event: the JPMorgan-Chase $2+ billion loss announcement in May 2012.

When we want to quantify the social data response to an event, we often start by looking at the volume of activities around that event. Applying some filtering allows us to group activities on a topic and look at how the volume of the stream of these activities evolves with time.
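
As a rough illustration of that filtering and grouping step, the Python sketch below keeps activities whose text contains every term of interest and counts them per minute. The function name `volume_by_minute`, the simple substring matching, and the sample activities (with invented timestamps) are all assumptions made for the example; real filtering rules are usually richer than this.

```python
from collections import Counter
from datetime import datetime


def volume_by_minute(activities, required_terms):
    """Count activities per minute whose text contains every required term."""
    terms = [term.lower() for term in required_terms]
    counts = Counter()
    for timestamp, text in activities:
        lowered = text.lower()
        if all(term in lowered for term in terms):
            # Truncate the timestamp to the start of its minute.
            bucket = timestamp.replace(second=0, microsecond=0)
            counts[bucket] += 1
    return sorted(counts.items())


# Made-up sample activities: (timestamp, text) pairs.
sample = [
    (datetime(2012, 1, 1, 12, 0, 12), "Just felt an earthquake in DC"),
    (datetime(2012, 1, 1, 12, 0, 48), "Whoa, EARTHQUAKE!"),
    (datetime(2012, 1, 1, 12, 1, 5), "Lunch was great today"),
    (datetime(2012, 1, 1, 12, 2, 30), "Was that an earthquake?"),
]

series = volume_by_minute(sample, ["earthquake"])
```

The result is a time-ordered list of (minute, count) pairs. Minutes with no matching activity are simply absent here, so a plotting step would need to fill them with zeros.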

Time-series volume measurements of social data generally show three distinct patterns for breaking events. These patterns can be related to the user’s expectations of the event. The rate of spread and reach of a story also depend on the level of interest of the audience. When looking at the time series of activity volume, we see the patterns characterized in Table 1.

[table id=4 /]
Table 1.  Social data reaction event types.

Expected events often drive significant volume on social media because people want to comment, observe, banter, trash talk, and analyse. Twitter volumes surge during the World Cup, the Super Bowl, the Video Music Awards, etc. However, these events show a gradual growth and decay around the event rather than abrupt changes. The bump in volume may last from a few hours to a few days. It is often somewhat symmetric in its growth and decay. Figure 1 shows some examples of social data volume around expected events including the VMAs and Hurricane Irene.

Figure 1. Expected events show smooth and fairly symmetric growth and decay of social media volume over time. Examples from the VMAs and Hurricane Irene.

Unexpected events result in abrupt spikes in volume. On Twitter, these spikes may reach tens of thousands of related tweets per minute within 5 minutes of an event. Social data volume around unexpected events usually grows rapidly until the networks for related users are saturated with the information, then the volume decays exponentially.  These spikes have well-defined growth, peak and decay half-lives. See Social Media Pulse for discussion and analytical details.
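
To make the decay half-life concrete, here is a minimal Python sketch that estimates it from post-peak volumes by fitting a straight line to the logarithm of the volume. It assumes a plain exponential decay and strictly positive volumes; it is not the Social Media Pulse model itself, and the function name and the synthetic numbers are invented for illustration.

```python
import math


def decay_half_life(times, volumes):
    """Estimate the half-life of an exponentially decaying series.

    Fits log(volume) = a - rate * t by ordinary least squares and
    returns ln(2) / rate, in the same units as `times`.
    All volumes must be greater than zero.
    """
    logs = [math.log(v) for v in volumes]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx  # negative for a decaying series
    return math.log(2) / -slope


# Synthetic post-peak volumes that halve every 10 minutes (not real data).
minutes = [0, 5, 10, 15, 20, 25, 30]
tweets_per_minute = [8000 * 0.5 ** (t / 10) for t in minutes]

half_life = decay_half_life(minutes, tweets_per_minute)
```

With real data the fit would be applied only to the points after the peak, and noisy or zero-volume buckets would need handling before taking logarithms.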

There is a key difference between events that are witnessed by many social media users simultaneously and news that breaks exclusively on social media. Examples of the former include a spectacular goal in the World Cup, an earthquake, or Beyoncé performing pregnant at the VMAs. Because users see these events at the same time, the social data volume instantly jumps to high levels; there is very little ramp up. See Figures 2a and 2b for examples of simultaneous unexpected event response curves.

Figure 2a – Earthquake in Mexico.


Figure 2b – Steve Jobs Resigns as CEO of Apple


News that breaks partially or exclusively on social media has an observable ramp up as the news spreads through the network of related users. This spread can depend on the number of followers, the credibility of the source, etc. These events have a convex ramp up as shown in Figure 3.

Figure 3. The sad news of Steve Jobs passing away was originally broken through traditional news sources. But the story quickly travelled on Twitter to many users who were not watching the news, so it took a shape somewhat like a story breaking on Twitter.

You are now equipped to understand the dynamics of an event at a deeper level. Quantifying the speed of a story will help you consistently characterize and compare the impact of events and the response of the audience. With these tools in hand, you are now ready to add a twist to your social cocktail with the garnish of recognizing activity patterns over time.

The Social Cocktail, Part 1: Mixology

Gnip’s Chris Moody has been talking about the “Social Cocktail” recently, both at Strata and Big Boulder. At Gnip, we talk about the social cocktail a lot, mainly because people like cocktails. But it is also an apt metaphor for thinking about what social data is useful for our customers. What audiences and modes of conversation are needed to build an understanding of your market, your customers, perceptions of your product and the evolution of your message?

The fundamental question it answers is: Why analyze social data from more than one publisher?

Each social media publisher brings distinct capabilities and audiences, and encourages unique ways for users to interact and express themselves. The overlap in audience between some publishers is low, so adding publishers helps broaden topic coverage and audience perspective. Microblogs (e.g. Twitter) are fast and concise, making it easier to tease out breaking stories and emerging conversations. Blog comments indicate engagement and controversy, and therefore point back to interesting blog posts, where the in-depth analysis is found. Votes and likes provide additional signals of reader engagement–indications of the quality and the pitch of conversation.

To get the right mix, it is essential to understand some of the properties of each publisher’s firehose.  In this post, we’ll look at high-level attributes of the social media publishers.  In the next post, we will dive into a brief description of the social media responses to expected and unexpected events. Finally, in the third post, we’ll end with an example of the social cocktail in examining a real-world event—the JPMorgan-Chase $2+ billion loss announcement in May 2012.

One revealing way to compare publishers is to understand their relative speed and content richness.  In this case, fast content means that a statistically relevant sample of activity arrives shortly after the event or topic happens in the real world.  “Shortly” can mean tweets follow the event by less than a minute for some topics on Twitter (e.g. earthquakes).  In contrast to the speediest media responses, posts about the 2008 banking crisis in major US banks are probably still being written in 2012 as we continue to examine and discuss bank regulation.

Another telling dimension for comparison is content richness. While Tweets are very fast, they are also concise. To be a rapid responder to an earthquake or other immediate event, you only have time for the barest facts because you have only 30 seconds to respond with 140 characters. “Just felt an earthquake in DC,” would be a typical response. On the other hand, a publisher such as Tumblr encourages rich media sharing with Spotify plugins, support for video and audio formats, very simple photo uploading and sharing capabilities and a tradition of the users appreciating and sharing creative and artistic photography.  Blogs on Automattic’s WordPress platform can range from 10s of words to 10,000, giving ample opportunity for a writer to explore subtle ideas and complex analyses.

A few properties of Social Media Firehoses explained

[table id=2 /]

 

Table 1. Comparison of publisher ingredients.

Both Speed and Content Size can be quantified. We often use measures such as the time for the story activity to peak or the ½-life of a story to characterize speed. See Social Media Pulse for discussion and analytical details.
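As a rough illustration of the two speed measures named above, here is a minimal sketch that computes time-to-peak and a simple half-life from a series of per-minute activity counts. The counts are made up, and the half-life definition used here (time after the peak until activity first drops to half the peak value) is my simplifying assumption, not necessarily the model described in Social Media Pulse.

```python
from typing import Optional, Sequence


def time_to_peak(counts: Sequence[int], bucket_minutes: float = 1.0) -> float:
    """Minutes from the first bucket to the bucket with the most activity."""
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    return peak_index * bucket_minutes


def half_life(counts: Sequence[int], bucket_minutes: float = 1.0) -> Optional[float]:
    """Minutes after the peak until activity first falls to half the peak or less.

    Returns None if activity never drops that far within the series.
    """
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    half = counts[peak_index] / 2.0
    for i in range(peak_index + 1, len(counts)):
        if counts[i] <= half:
            return (i - peak_index) * bucket_minutes
    return None


# Made-up per-minute activity counts for a single story.
story = [2, 10, 40, 100, 80, 60, 45, 30, 20]
print("time to peak (min):", time_to_peak(story))
print("half-life (min):", half_life(story))
```

Comparing these two numbers across publishers for the same story gives a quick, consistent way to say which firehose reacted first and which one kept the conversation going longest.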

Content richness can be simply characterized by the number of characters. While this is fairly indicative of the balance of rich vs. concise information in a stream for text content, it overlooks media such as audio, photos, interactive applications, video and music. Other measures of richness might include audience participation, user-network interactions, amount of back-and-forth in a conversation and many higher-level measurements from textual analysis.
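The character-count measure is simple enough to sketch in a few lines. The example below groups posts by publisher and reports the median length in characters. The sample posts and publisher labels are invented for illustration, and as noted above, this says nothing about non-text media.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def median_length_by_publisher(posts: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each publisher to the median character count of its posts."""
    lengths = defaultdict(list)
    for publisher, text in posts:
        lengths[publisher].append(len(text))
    return {publisher: median(values) for publisher, values in lengths.items()}


# Invented sample: (publisher, post text) pairs.
sample_posts = [
    ("twitter", "Just felt an earthquake in DC"),
    ("twitter", "Whoa, earthquake!"),
    ("wordpress", "A long analysis of the event. " * 40),
]
richness = median_length_by_publisher(sample_posts)
print(richness)
```

The median is used rather than the mean so that a handful of very long posts does not dominate the comparison.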

Surprising and satisfying cocktails come from a careful mixing of quality and–sometimes unexpected–ingredients. With practice and the right combination of publishers, you can mix a social cocktail to enrich your understanding of the conversations between customers, prospects, partners and pundits. You want to experiment with the mix to match your business use case. Getting the mix just right is always rewarding.

Next week: The Social Media Cocktail, Part 2 – Expected vs. Unexpected Events

Taming The Social Media Firehose, Part III – Tumblr

In Part I, I discussed high-level attributes of the social media firehose. In Part II, I examined a single event by looking at activities from four firehoses for the earthquake in Mexico earlier this year. In Part III, I wrap up this series with some guidelines for using unique rich content from social media firehoses that may be less familiar. To keep it real, I used examples from the Tumblr firehose.

Since the Twitter APIs and firehoses have been available for years, you may be very familiar with many analysis strategies you can apply to the Twitter data and metadata.  I illustrated a couple of very simple ideas in the last post. With Twitter data and metadata, the opportunities to understand tweets in the context of time, timezone, geolocation, language, social graph, etc. are as big as your imagination.

Due to the popularity of blogging for both personal and corporate communication, many of you will also understand some of the opportunities of the WordPress firehose.  With the addition of firehoses of comments, you have the capabilities of connecting threads of conversation to realize another possible analysis strategy. “Likes” and Disqus “votes” provide additional hints about user reaction and engagement–yet another way to filter and understand posts and comments.

Why go to the effort and expense of adding a new firehose?
There are three benefits to investing the effort to integrate these different sources. Users of social networks choose to participate in Twitter, Tumblr or other social networks based on their affinities and preferences. Integrating additional active social media sources gives:

  1. Richer audience demographics
  2. More diverse perspective and preference
  3. Broader topic coverage.

Here’s an example.

Tumblr

The newest firehose from Gnip became available earlier in 2012. Tumblr’s exciting because the unique, rich content from Tumblr provides a complementary perspective and a distinct form of conversation. Tumblr is important because of the unique audience and modes of interaction common within this audience and platform.

With a firehose of over 50 million new posts a day from web users, Tumblr is a source with strong social sharing features and an active network of users where discussions can reach a large audience quickly.  Some Tumblr posts have been reblogged more than a million times and stories regularly travel to thousands of readers in a couple of days.

Before jumping into consuming the Tumblr firehose in the next section, it may help to understand some of what makes it different and valuable. These questions provide a useful framework when approaching any unfamiliar stream of social data.

What is unique about the Tumblr firehose?

1. Demographics. The user community on Tumblr skews young, over-indexing strongly in the 18-24 demographic of trend setters and cool hunters.

2. Communication and Activity Style. As you are thinking about filtering and mining the Tumblr firehose, realize conversations on Tumblr are often quite different from what you’ll find on other social platforms. As you start to interpret the data from Tumblr it’s important to note that Tumblr has an inside language. For example, many sites contain f**kyeah___ in their name and URL. When you start to home in on your topic, you will need to understand the inside language used for both positive and negative responses. Terms you consider negative on one platform may have positive connotations on another. Be sure to review a subset of your data to get a feel for the nuances before drawing larger conclusions.

3. Rich Content. Content is rich in that there are many types of media and a wide range of depth. Users will post audio, video, animated gifs, simple photos as well as short and long text posts.

You’ll also see 7 different Post Types on Tumblr. These represent the different types of content that users can post on Tumblr. They break out as follows:

Table of Post Types on Tumblr

Table 1 – Tumblr post type breakdown.

To answer the questions, we often rely on filters based on text since these are the simplest filters to think about and create.  The textual data and metadata available in the Tumblr firehose include titles, tags and image captions in addition to the text of the body of the post. Including all of this content allows us to filter approximately 20% of the Tumblr firehose based on text. Additional strategies include looking at reblog and “like” activity, as well as reblog and “like” relationships between users.  More sophisticated strategies such as applying character or object recognition to images open up the tens of millions of activities daily for mining and exploration.

4. Rich Topics. In addition to diverse content forms, Tumblr has attracted many active conversations on a wide variety of topics. This content is often very complementary to other social media platforms due to differences in audience, tone, volume or perspective. With more than 20 billion total posts to date, there is content about almost anything you can imagine.  Some examples include:

  • Brands. Any brand you can think of is being discussed right now on Tumblr. Big brands with an official presence on Tumblr include Coca-Cola, Nike, IBM, Target, Urban Outfitters, Puma, Huggies, Lufthansa, Mac Cosmetics and many more. NPR and the President of the United States have their own presences on Tumblr.
  • Fashion and Cosmetics. Because of the visual nature of the medium and cool-hunting audience it attracts, there is a large volume of content related to cosmetics and fashion.
  • Music and Movies. With Spotify music plugins and easy upload and sharing of visual content, pop culture plays a big role in the interests and attention of many of the active users on Tumblr. Information, analysis and fan content are rich and creative, and travel through the community rapidly.

5. Reblogs and Likes. Tumblr is all about engagement! The primary user activities for interactions are Reblogs and Likes. Some entries are reblogged thousands of times in a day or two. When a user reblogs a post, it places the other user’s post into their own blog with any changes they make. There is a list of all of the notes (likes, reblogs) associated with a post appended to that post wherever it shows up on Tumblr. Each post activity record in the firehose can contain reblog info. It will have a count, a link to the blog this entry was a reblog of and a link to the root entry. To build the blog note list that a user would see at the bottom of a liked or reblogged entry, you have to trace each entry in the stream (i.e. keep a history or know what you want to watch) or scrape the notes section of a page.

Filtering and Mining The Tumblr Firehose

Volume. There are a number of metrics we can use to talk about the volume of the Tumblr firehose. The three gating resources that we run up against most often are related to the network (bandwidth and latency) and storage (e.g. disk space). Tumblr activities are delivered compressed, so for estimating, the bandwidth and disk space requirements can be based on the same numbers. The Tumblr firehose averages about 900 MB/hour compressed volume during peak hours, falling to a minimum of 300 MB/hour during slower periods of the day.

To store the firehose on disk, plan on ~16 GB/day based on current volumes. When planning for bandwidth, you want headroom of 2-5x the average peak hourly bandwidth (4 to 10 Mbps), depending on your tolerance for disconnects during peak events.
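The arithmetic behind these planning numbers can be checked with a short sketch. It uses only the figures quoted above (900 MB/hour peak, ~16 GB/day, 2-5x headroom). The use of decimal units (1 GB = 1000 MB) is my assumption.

```python
PEAK_MB_PER_HOUR = 900    # from the text: compressed volume during peak hours
STORAGE_GB_PER_DAY = 16   # from the text: daily storage estimate


def mb_per_hour_to_mbps(mb_per_hour: float) -> float:
    """Convert megabytes per hour to megabits per second (1 byte = 8 bits)."""
    return mb_per_hour * 8 / 3600


# Peak bandwidth and the 2x to 5x headroom range.
peak_mbps = mb_per_hour_to_mbps(PEAK_MB_PER_HOUR)
headroom_mbps = (2 * peak_mbps, 5 * peak_mbps)

# Average hourly volume implied by the daily storage figure (decimal units assumed).
avg_mb_per_hour = STORAGE_GB_PER_DAY * 1000 / 24

print(f"peak: {peak_mbps:.1f} Mbps, headroom: {headroom_mbps[0]:.0f} to {headroom_mbps[1]:.0f} Mbps")
print(f"implied average volume: {avg_mb_per_hour:.0f} MB/hour")
```

The peak rate works out to 2 Mbps, which is where the 4 to 10 Mbps headroom range comes from, and the implied average hourly volume falls between the quoted minimum and peak rates.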

The other consideration is end-to-end network latency as discussed in Consuming the Firehose, Part II.  Very simplistically, latency can limit the throughput of your network (regardless of bandwidth) by using up too much time negotiating connections and acknowledging packets. (For a detailed calculation, see, for example, The TCP Window, Latency, and the Bandwidth Delay Product.)  The theoretical latency limit for 20 Mbps throughput is 50-70 ms (depending on TCP window size), but in practice you will want to reliably observe less than this (< 50 ms) to realize reliable network performance.
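To make the latency limit concrete, here is a minimal sketch of the standard relationship for a single TCP connection: throughput can be at most one window of data per round trip. The window sizes below are illustrative assumptions on my part; your actual window depends on your operating system and TCP settings.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def max_rtt_ms(window_bytes: int, target_mbps: float) -> float:
    """Largest round-trip time that can still sustain the target throughput."""
    return window_bytes * 8 / (target_mbps * 1e6) * 1000


TARGET_MBPS = 20  # throughput figure used in the text

# Illustrative window sizes only (assumptions, not measured values).
for window_bytes in (131_072, 175_000):
    limit = max_rtt_ms(window_bytes, TARGET_MBPS)
    print(f"window {window_bytes} bytes -> latency limit {limit:.1f} ms")
```

With these assumed windows the limit lands at roughly 52 ms and 70 ms. A smaller window lowers the limit further, which is one reason to aim for observed latency comfortably below the theoretical number.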

Metadata. A firehose is a time-ordered, near real-time stream of user activities. While this structure is clearly powerful for identifying emerging trends around brands or news stories, the time-ordered stream is not the optimal structure for looking at other things like the structure of social networks to discover, e.g., influencers. Fortunately, the Tumblr firehose activities contain a lot of helpful metadata about place, time, and social network to get answers to these questions.

Each activity has a post objectType as discussed above as well as links to resources referred to in the post such as image files, video files and audio files. Each activity has a source link that takes you back to the original post on Tumblr. If the post is a reblog, it will also have records like the JSON example below, describing the number of reblogs, the root blog and the blog this post was reblogged from.

"tumblrRebloggedFrom" :
    {
         "author" :
         {
               "displayName" : "A Glimpse",
               "link" : "http://onlybutaglimpse.tumblr.com/"
         },
         "link" : "http://onlybutaglimpse.tumblr.com/post/24141204872"
    },
"tumblrRebloggedRoot" :
    {
         "author" :
         {
                "displayName" : "Armed With A Mind",
                "link" : "http://lizard-skin.tumblr.com/"
         },
         "link" : "http://lizard-skin.tumblr.com/post/16004808098/the-nautilus-car-from-the-league-of-extraordinary"
    },

To assemble the entire reblog chain, you must connect the reblog activities within the firehose using this metadata.
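Here is a minimal sketch of one way to connect those activities. It assumes each activity has been parsed into a dict with a top-level "link" key for the post's own URL (a hypothetical field name) and, for reblogs, the "tumblrRebloggedFrom" record shown above. The example URLs are made up. It also assumes the links match exactly, so in practice you may need to normalize URLs first.

```python
from typing import Dict, Iterable, List


def build_parent_map(activities: Iterable[dict]) -> Dict[str, str]:
    """Map each reblog's own link to the link of the post it was reblogged from."""
    parents = {}
    for activity in activities:
        reblogged_from = activity.get("tumblrRebloggedFrom")
        if reblogged_from:
            # "link" on the activity itself is a hypothetical field name.
            parents[activity["link"]] = reblogged_from["link"]
    return parents


def reblog_chain(post_link: str, parents: Dict[str, str]) -> List[str]:
    """Walk from a post back toward the root, returning links newest first."""
    chain = [post_link]
    seen = {post_link}
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in seen:  # guard against malformed, cyclic data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


# Made-up activities: post 1 is the root, 2 reblogs 1, and 3 reblogs 2.
activities = [
    {"link": "http://a.example/post/1"},
    {"link": "http://b.example/post/2",
     "tumblrRebloggedFrom": {"link": "http://a.example/post/1"}},
    {"link": "http://c.example/post/3",
     "tumblrRebloggedFrom": {"link": "http://b.example/post/2"}},
]
parents = build_parent_map(activities)
chain = reblog_chain("http://c.example/post/3", parents)
print(chain)
```

Because the parent map only knows about reblogs you have actually seen, the chain stops early if an intermediate reblog is missing from your history. The "tumblrRebloggedRoot" record can then tell you how far from the root you are.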

Additional engagement metadata is available in the form of likes (hearts in the Tumblr interface) in a separate Tumblr engagement firehose.

Tumblr Likes Metadata

Non-Text Based Filters. Not all non-text post types have enough textual context (captions, title and tags) to identify a topic or analyze sentiment through simple text filtering. You will want to develop strategies for dealing with some ambiguity around the meaning of posts with very little text content. This ambiguity cannot be reduced unless you have audio or image analysis capabilities (e.g. OCR or audio transcription). Approximately 20% of all posts (about 15M activities per day) can be filtered effectively with text-based filtering of text, URL text, tags and captions.
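A simple text-based filter over these fields might look like the sketch below. The field names ("title", "caption", "body", "tags") are hypothetical stand-ins for wherever that text lives in your parsed activities, and the sample posts are invented.

```python
import re
from typing import Iterable

TEXT_FIELDS = ("title", "caption", "body")  # hypothetical field names


def searchable_text(activity: dict) -> str:
    """Join all available text fields and tags into one lowercase string."""
    parts = [activity.get(field) or "" for field in TEXT_FIELDS]
    parts.extend(activity.get("tags") or [])
    return " ".join(parts).lower()


def matches(activity: dict, terms: Iterable[str]) -> bool:
    """True if any term appears as a whole word in the activity's text."""
    text = searchable_text(activity)
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
        for term in terms
    )


# Invented examples: a tagged photo post, and a photo post with no text at all.
tagged_photo = {"caption": "Sunday drive", "tags": ["Lexus", "cars"]}
bare_photo = {"tags": []}

print(matches(tagged_photo, ["lexus"]))
print(matches(bare_photo, ["lexus"]))
```

Posts like the second example are the ones that fall outside text filtering entirely, which is where the image and audio strategies mentioned above come in.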

Memes. Another consideration related to the Tumblr language is that official brand sites as well as many bloggers tend to promote a style or overall image more than providing a catalog of particular products. As a result, you will match the brand name with a lot of cool stuff, but may see specific product names and descriptions much less frequently. There are many memes within Tumblr that will lead you to influencers and sentiment, but looking at “catalog” terms won’t be the most effective path.

I hope I have uncovered some of the mysteries of successfully consuming social media firehoses.  I have only suggested a handful of questions one might try to answer with the social media data. The community of professionals providing text analysis, image analysis, machine learning for prediction, classification and recommendation, and many other wonders is continuing to invent and refine ways to model and predict real-world behavior based on billions of social media interactions.  The start of this process is always a great question.  Best of luck (and the benefits of all of Gnip’s experience and technology) to you as you jump into consuming the social media firehose.

Full Series:

Taming The Social Media Firehose, Part I – High-level attributes of a firehose

Taming The Social Media Firehose, Part II – Looking at a single event through four firehoses

Taming The Social Media Firehose, Part III – Tumblr