Data Story: Mohammad Shahangian on Pinterest Data Science

At Gnip, we believe the value of social data is unlimited. Data Stories is how we bring this belief to life by showcasing how social data is used. This week we’re interviewing data scientist Mohammad Shahangian of Pinterest about how the data science team works at Pinterest, surprising uses of Pinterest and data science as a career path. You can follow him on Pinterest at pinterest.com/mshahang

Data Scientist at Pinterest

1. What do you see as your role as the data scientist for Pinterest?

The company’s focus is on helping millions of people discover things they love and get inspiration to go do those things in their life. For me, that means analyzing the rich data that is created by the millions of people interacting with billions of pins from across the web each day. I evaluate this data and provide insights that make data actionable. My team also prototypes and validates ideas, performs deep analysis and builds tools that allow us to answer our most frequent questions in seconds. We work with every team to answer Pinterest’s biggest questions and ensure that each decision positively impacts Pinners over the long term.

For example, we take a business question like “How should our web, tablet and phone experiences differ?” and present the results as insights like, “Many users use the mobile apps in the morning and again at night, but prefer the website during the day” and “Users prefer to use mobile apps to casually discover new content, whereas they use the web to curate and organize content.” We then work with the design and product teams to build features around these insights and measure their impact.

2. What are some of your favorite ways that people use Pinterest that people wouldn’t expect?

What makes Pinterest unique is that it’s a tool and the users really define its use cases. For me, Pinterest was really helpful when I was planning my wedding and it made perfect sense to use as a collaborative office shopping list. I would have never thought to use it as a tool for:

A collection of Stop signs from around the world
Daily Grommet gets their community to collaborate on a board to see things they want to sell
Vintage Driving - a collaborative board where users pin their favorite vintage cars:
GE Badass machines featuring GE tech
Madewell’s Rainbow board
Michelle Obama’s MyPlate Recipes encourages healthy eating
Stunning virtual collections of minerals and shipwrecks
The “365 Days of Pinterest” challenge. She made a Pinterest project every day for a year!
Sammy Sosa awesomeness
Sony shows off their technology with food pictures shot with a Sony Camera
Pantone announces the color of the year
The National Pork Board

3. What category do you see as the most viral on Pinterest?

DIY and recipe pins generally go viral year-round. Around the holidays, holiday-themed content across all categories tends to get the most traction.

4. How has data science added value to Pinterest?

We have this internal value we refer to as “knit.” It means that we have an open, curious culture where everyone in different disciplines—from engineering and design to marketing to community—works together. Data science is at the core of that. The search, recommendations and spam teams apply data science to improve the quality of content we put in front of Pinners. This is only a subset of how we apply data though; most of the decisions we make at Pinterest are actually backed by data.

Data is a universal language that teams across the company use to collaborate and make decisions. Each team has a set of performance metrics, and we hold a weekly meeting to understand the impact that each area is having on company-wide metrics. As data scientists we do more than just analyze data, we create rich data sources that we make available to other teams so they can do their own analysis. More than half of Pinterest employees run MapReduce jobs via Hive.  Our metrics dashboards are accessible to everyone and our core metrics are emailed daily to the entire team.  We also share our data studies and insights with the whole team.

We also use data just for fun. During our weekly happy hour, we share a weekly Data Fun Fact with the team. We present the fact in the form of a multiple choice question and have the team vote on the answer. For example, we asked, “How many days before Valentine’s day does the query ‘Valentine’s day ideas’ increase the most: 1, 3, 5 or 7 days?” (Hint for the curious reader: two*three/two).

5. What do you think someone should know before becoming a data scientist at a major web company like Pinterest?

I would say go for it! If you are hungry to extract value from real world data, you’re really going to enjoy it. I know that for a lot of really talented people in academia the only thing standing between them and the opportunity to solve a really interesting problem is the lack of rich data. My experience at Pinterest has been the exact opposite. Our team can’t grow fast enough to tap into a world of valuable insights that are sitting dormant within billions of records somewhere in the cloud.

Commercial Evolution of Social Networks

Over the past four years Gnip has seen many social services come and go. Not surprisingly, a pattern has emerged in how they evolve, and the degree to which our customers need their public data. There are generally three distinct phases a social service goes through, and how the service does in each phase affects how it ultimately participates in the broader public social data ecosystem, which can complete a full commercial cycle. That cycle combines consumer use (often buying intent, or expression) with commercial engagement (identifying need in time of natural disaster, or ad buying).

Phase 1: Consumer Engagement
A social service must engage us; the end-users/consumers. Whether via a homegrown social graph, or leveraging someone else’s (e.g. Facebook Connect), in order for a social service to become useful, it needs users. From there, those users need to participate in self-expression (from posting a comment, to retweeting a tweet) and generate activity on the service. There are a variety of ways to compel us users to engage in a social service, but the social service itself is solely responsible for the first experience. The vision of the service’s founders yields a web-app or mobile interface that allows us to take action, leveraging the expressions laid out by the app itself (e.g. sharing a photo). If users like the expressions, discovery methods, and sense of “connectedness,” you’ve got a relevant social service on your hands.

Phase 2: APIs; Outsourcing Engagement
At some point a successful social service realizes the potential for outsourcing the expression metaphors that make the service successful & useful, and they construct an API that allows others to RESTfully engage with the service. In some instances the API is read-only. In some instances the API is write-only; sometimes both. What is key is that nine times out of ten, the API is meant to drive core service engagement via other user-facing applications. A classic example of this would be the zillions of non-Twitter Inc clients that “Tweet” on our behalves every day. One look at the endless number of Tweet “sources” that flow through the Firehose and you’ll realize this engagement potential.

The exceptional API is one that has broader social data engagement ecosystem consumption in its DNA. Typical social services consider themselves the center of the universe, assuming that not only will they capture all consumer engagement, but that they will be the root of all broader ecosystem engagement as well. However, success with Consumer Engagement does not guarantee commercial engagement; not by a long shot.

Some services execute phase 1 and 2 simultaneously these days.

Phase 3: Activity Transparency; Commercial Engagement
Allowing other applications & developers to inject activities into the core service is obviously valuable, however it is only part of the picture. Social services with broad social and commercial impact have achieved this by addressing commercial needs for complete, raw, activity availability. For example, in order for someone to deploy resources in a disaster relief scenario effectively, they need to make their own determination as to what victims need, where they are located, and general conditions surrounding the event. The social service limiting access to the activities taking place on the service, by definition, yields an incomplete picture to downstream commercial consumers of the content. The result is a fragmented & hobbled experience for commerce engagement.

Another key component to commercial engagement is realizing that the ecosystem of data analytics and insights is well established, complex, and interwoven. Massive investments have been made in the market over the years, and brands want to leverage that fact. It is illogical for a social service to address the endless needs of the enterprise by building their own tools. Attempts to supplement this market come at the potential expense of losing focus on building a great consumer experience.

The most impactful, useful, and valuable social services that Gnip customers leverage for their needs (ad buying, campaign running, stock trading, disaster relief), are those that acknowledge that they are not an island in the ecosystem. They complete the cycle by providing unfettered access to one of their most significant assets. In trade, the relevance of the social service itself is maximized because commerce can engage with it.

A good example of how impactful this transparency can be is Twitter. Consider how Twitter is used across new, as well as traditional, media. They’ve completed the cycle with a strong offering of Phase 3.

All three phases are not required for success, but all three are indeed required for success in the broader public commercial social data ecosystem.

Data Story: Dan Lynn of Full Contact

Data Stories is Gnip’s way to talk about the many amazing ways that data is used. Today on the blog we’re speaking with Dan Lynn, a cofounder and CTO of FullContact. FullContact is trying to solve the world’s contact information problem, which is no small feat. We thought the dilemmas this team faces in dealing with disparate and decaying data make for a great story. You can follow Dan on Twitter at @DanKLynn

Dan Lynn of FullContact

1. What problem is Full Contact trying to solve with data?

At FullContact, we’re solving the world’s contact information problem, which is that your contact information is a mess. In address books like GMail, Outlook, SalesForce and customer lists, you’ve got missing details, duplicate entries, and the same person fractured across multiple cloud systems. We’re using data to help you clean all that up and keep those address books in sync, up to date, and duplicate-free.

2. What do you see as the advantages of combining social data with contact information? Do people make deeper connections if they have social data?

When I was growing up, I had 3 or fewer ways I could contact my friends: street address, phone (usually their parents’!) and, later, email. As the Internet took off, they added instant messenger accounts, eBay usernames, Twitter handles, Facebook accounts, LinkedIn profiles and dozens more. These are all valid means of contacting someone, but most people prefer some over others, and it’s great to have that choice.

While it’s awesome for me to find out who among my contacts have Twitter accounts that I’m not yet following, using social data is very helpful for me (or a computer) to tell two similar-but-different contacts apart. Social profiles are starting to act more and more as a person’s public identifier, much like a Social Security number that you would actually *want* people to have. Filling-out my contacts with social data makes it that much easier to merge duplicates, tell the difference between John Smith Jr. and John Smith Sr., and contact people in ways other than email, phone, or snail mail.

3. What do you wish you knew a year ago about how people archive and share contact information?

Honestly, a year ago, the problem was staring us straight in the face: people *don’t* really archive and share contact information. Sharing has been too error prone for people to trust an automated system not to screw up their contacts. I’ve lost count of the number of times people share contact information by reading phone numbers from each other’s phone, yelling email addresses across the room, or emailing contact info back and forth with subject lines like “Bart Lorang’s phone number”. The problem is hard, and everyone has different expectations around the idea of sharing contact information. Many people want their contacts automatically kept up to date with changes in their co-workers’ address books. Others only want updates if the contact publicly changes his/her information. What should an automated system do if two of your colleagues share conflicting changes to one of your contacts? Ultimately we all just want the best way to get in touch with someone at a given time.

4. Contact information is considered decaying data. What are the challenges of working with decaying information?

The idea of decaying data is that the data you have *right* now is only a snapshot of the world at a given time. You could say that your data “decayed” if the real world has moved on and your database hasn’t caught up. This is a real problem with contact information. It changes constantly. People change jobs, change names, move, change phone carriers, and more. The challenge is keeping your address book up to date with all these changes. Many companies that work with contact information in bulk simply “punt” and apply a simple rule to their data by reducing their confidence in it some percentage every year. I think that’s too heavy-handed and doesn’t work for the end-user. At FullContact, we fundamentally believe that a person’s contact information is current until we find some other, newer, piece of contact information that suggests otherwise. That means that we’re constantly searching the internet for up-to-date information about your contacts.

5. How do you think Full Contact fits into the world of social media and how people are already obtaining contact information? 

For the last couple years, we’ve been seeing the social networks clamp down on their users’ contact information (often for good reason). We remember the spat between Google and Facebook over the ability to export your friends’ information. It’s easy to agree philosophically with elements of both arguments. To Facebook’s point, a person should be in control of her own contact information. To Google’s point, a person should be in control of her contacts, and has a reasonable expectation to get the same data back from a service that she put in. We think FullContact helps bridge this gap. We believe that you own your address book, but we also believe that you have a right to control what information about you is floating around out there on the Internet. We want you to have the most up-to-date picture of your contacts, but we want to give your contacts control over their own information.

Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see what posts were most likely to generate likes, and we interviewed Dmitrii Vlasov who came in second in the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up and coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle accidentally as it always happens. I read some blog post about the Million Song Dataset Challenge provided by Last.fm and bunch of other organizations. The task was to predict which songs will be liked by users based on their existing listening history. This immediately made me feel excited because I’m an active Last.fm user and was reflecting about what connections between people can be established based on their music preferences. But the contest was coming to end and so I switched to WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Now I’m a senior CS student in Togliatty, Russia. Can’t say that I have a special background in Data Science – I had more than a year-long course of probability theory and math statistics in university, some self-learned skills about semantic analysis and have big love to Python as a tool for implementing ideas. Also, I’ve entered the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts also attract more likes (80,000-90,000 characters). This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show relative correlation between amount of photos, characters and videos and the amount of likes received. Big relative “folks love” on several prominent amount of photos means that there were not so many posts with such amount of photos but most of them were qualitative. Quick empirical analysis shows that these are special type of posts – “big photo posts”. They usually are photo report, photo collection or scrapbook. For such types of posts 10-15 photos are not enough but at the same time 10-15 photos seem too overloaded for normal post. The same can be said about big amount of text in post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80-90 thousands are winners in “heavyweight category”. These are big researches, novels, political contemplation. Analyse is quite simple but it shows that if you want to create media-rich or text-rich content it should be really media-text-rich. Or you may fall in a hollow of not suitableness.

4. What else would you like to predict with social data if you got the chance?

Now I work on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data e.g. to predict reaction on one or another politician’s statement based on a user’s Twitter feed. Or to extract all “real” thesis of politician based on all of his public speeches.

Bad Data, the Right Data and Le Data

Gnip believes social data can change the world and our leadership team has been writing about data in O’Reilly, speaking at the Sentiment Symposium and at LeWeb. We wanted to share what they were talking about.

Bad Data by O'Reilly

Gnip CEO Jud Valeski wrote a chapter in the recently released O’Reilly handbook “Bad Data” by Ethan McCullum. Jud wrote the chapter called “Social Data: Erasable Ink?” about how the evolving social media landscape is challenging expectations about how people interact with social data and who owns it. Gnip is committed to providing terms-of-service compliant social data and this chapter talks about the expectations around social data and how the various players are managing them.

Our COO Chris Moody speaking at the Sentiment Symposium on “Building Sentiment Analysis on the Right Social Data”

Building Sentiment Analysis on the Right Social Data (Chris Moody, Gnip) from Seth Grimes on Vimeo.

Jud being interviewed by Robert Scoble at LeWeb

Social Data and The Election

If you’re excitedly awaiting the results of the election and want to keep an eye on what people are saying about Election 2012 on social media, we have a list of resources below:

  • I Voted Map – A realtime map by the good people at Foursquare allowing people to check in and say “I voted.” The map compiles checkins from voters.
  • The Twitter Political Index – Twitter’s official coverage measuring the sentiment between each Presidential candidate and trending topics related to the election.
  • Tumbling the Election – Coverage from Tumblr’s editorial team tracking top election related hashtags.
  • Facebook Stories on the Election – Watch Americans that said they voted on Facebook in real time.
  • Tumblr Election 2012 – Union Metrics visualization of trending election-related tags on Tumblr. Shows how many posts per second are about the election.
  • Infinigon Group – Tracking realtime political sentiment on social media.
  • 2012 Election Mood Meter – Netbase election mood meter tracks sentiment on social media for both the Presidential and Vice Presidential candidates and breaks it down further by gender.
  • Electoral Map based on Tweets – The Guardian has created a map to show who would win the election based on Tweets.
  • Election Day on Twitter – Al Jazeera and Flowics show buzz volume on Twitter about each of the Presidential candidates as well as a live stream about Tweets on each candidate.
  • Rock The Vote Real-Time Politics - Splunk and MTV have created a visualization of hashtags on Twitter about each Presidential candidate.
  • The Crowdwire – Bluefin Labs has created what they’re calling a “Social Exit Poll” looking at how people are talking about how they voted on social media.
  • Yahoo! Election Control Room –  Attensity has teamed up with Yahoo! to show how America is feeling about the election and sharing select Tweets.
  • US Electoral Compass 2012 – Brandwatch allows you to select a state and date range to show what political issues each state is talking about.
  • NBC Politics – Using Crimson Hexagon to power their social media analysis.
  • Twitter Sentiment Analysis – USC is tracking sentiment around each candidate.

Who else are we missing? Also, as a bonus you can see our interview with Gabriel Banos of Zauber Labs on predicting the election with social data and Union Metrics comparison of the candidates using Tumblr data.

Social Data Around Voter Values

What's In The [Social] Data?

I was reading about Higgs Boson this morning and came across this cartoon explanation of matter, particle acceleration, and Higgs Boson. The very last pic in the cartoon (below) reminded me of the cartoon that’s been in my head for years; the one that pops into my head when I go to work. It drives what we do at Gnip. All of our energy is focused on helping our customers answer that question: “What’s in the data?” We do this by reliably collecting, filtering, enriching, and delivering billions of public social activities (social data) to our customers with business critical data needs, every day.

What's In the Data

What's In The Data slide from PHD Comics - http://www.phdcomics.com/comics.php?f=1489

Geosocial Data: Patterns of Everyday Life

My love for checking in and thus, geolocation, began after SXSW of 2009 while I racked up points and worked hard to become the leader of Boulder, ultimately losing to Eric Wu. Since then, my views on geolocation have evolved, and I have become especially enamored with the way geosocial data allows us to leave trails of the lives we and others are living. At its best, geolocation + social connects us to friends we are close to by letting us know who is near and collectively, social data can identify common interests and patterns of behavior we couldn’t see in the past.

Since 2008, Foursquare has evolved into a service with 50 million users and two billion check-ins, with a facelift launching tomorrow; Twitter has opened up a geolocation API; Facebook Places launched and continues to evolve; Highlight launched; and Gowalla was acquired by Facebook. All of these advancements have happened in a couple of short years. Geotagging allows this new crop of social networks to add your geographic location via metadata, and now you can add location to tweets, photos, videos, etc.

Patterns of My Life

Every time I check in and share my location, I start leaving a trail of my day-to-day life. This trail, at its most basic, serves as a virtual diary of where I went and with whom. Timehop emails me each day to tell me what I did a year ago, while services such as Rewind.Me allow me to search my patterns and how I stack up against others.

Tripmeter lets me see my virtual trail and how I travel throughout the day based on Foursquare and Facebook checkins, similar to what Route does. Where Do You Go even lets you heatmap where you most often visit (hint: I hate South Boulder).

Foursquare Heat Map

Checkins Are a Moving Census

But collectively, the patterns woven by geosocial data are incredibly telling and act as a living census. Intriguingly, researchers from Carnegie Mellon have created what they call “Livehoods,” which are neighborhoods defined not only by geographic proximity, but also by social geotagged data. Essentially, the similarities are based on where people check in. While the data only includes those using geolocation, it shows that people who check into a local restaurant and a similar bar create cultural neighborhoods. This data is more than just an intellectual curiosity. Companies can analyze customer patterns to focus marketing efforts, identify companies to partner with and determine new brick-and-mortar locations.

Example of Livehood Data

I particularly love the idea of an app using Foursquare data called “When Should I Visit?” that tells you when is a good time to visit London tourist attractions based on Foursquare checkins. Other use cases for this type of social data could tell people when to visit high-traffic destinations such as the DMV. I love knowing when not to be somewhere as much as knowing what locations and parties are trending.

HealthMap uses geosocial data and news reports to help track epidemics as they pop up. The mapping system was created by a team of researchers, epidemiologists and software developers from Children’s Hospital Boston to monitor real-time epidemics as they break out. Rumi Chunara worked on this project and also helped use geosocial data to track how cholera spread in Haiti. (Rumi will be speaking at Gnip’s social data conference, Big Boulder, about social data in public service.) Geosocial data has unlimited uses in the cases of health epidemics and natural disasters.

Companies are starting to create passive geolocation checkins such as EpicMix from Vail Resorts, which enables skiers to automatically check in using the RFID tags on their ski lifts. The system tells users how much they skied, where they skied, their vertical ascents and where their friends are on the mountain. During the last Coachella, 30,000 concertgoers used RFID bands from Intellix to checkin and update their Facebook status on various portals spaced throughout concert grounds. Near field communication is another way social data provides amazing patterns.

Geosocial data allows us insight into the patterns of everyday people, and the applications for this are endless.

Taming The Social Media Firehose, Part II

In part I, we discussed some high-level attributes of the social media firehose and what is needed to digest the data. Now let’s collect some data and perform a simple analysis to see how it goes.

On March 20, 2012, there was a large magnitude 7.4 earthquake in Oaxaca, Mexico. (See the USGS record.) Due to the severity of the earthquake, it was felt in many locations in Southern Mexico and as far as 300 miles away in Mexico City.

Collecting data from multiple firehoses around a surprise event such as an earthquake can give a quick sense of the unfolding situation on the ground in the short term as well as helping us understand the long-term implications of destruction or injury. Social data use continues to evolve in natural disasters.

For this post, let’s limit the analysis to two questions:

  1. How does the volume of earthquake-related posts and tweets evolve over time?
  2. How rich or concise are social media conversations about the earthquake? To keep this simple, treat the size of the social media activities as a proxy for content richness.
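Both questions reduce to simple bookkeeping once activities arrive as one JSON object per line. The following is a minimal sketch, not the analysis code used for this post: the function name `summarize` and the timestamp field name `postedTime` are assumptions for illustration, and the timestamp is assumed to be an ISO 8601 string. It counts activities per minute and reports the mean activity size in bytes as the richness proxy.

```python
import json
from collections import defaultdict


def summarize(lines, time_field="postedTime"):
    """Return {minute: (count, mean_size_bytes)} for JSON activity lines.

    time_field is a hypothetical field name; adjust it to the publisher's
    actual schema.
    """
    counts = defaultdict(int)
    total_bytes = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        activity = json.loads(line)
        # Assumed ISO 8601 timestamp, e.g. "2012-03-20T18:02:47.000Z";
        # the first 16 characters give YYYY-MM-DDTHH:MM.
        minute = activity[time_field][:16]
        counts[minute] += 1
        total_bytes[minute] += len(line.encode("utf-8"))
    return {m: (counts[m], total_bytes[m] / counts[m]) for m in sorted(counts)}
```

Bucketing by minute is an arbitrary choice here; a coarser bucket may suit slower sources such as blog posts.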

In the past few months, Gnip has made new, rich firehoses such as Disqus, WordPress and Tumblr available. Each social media service attracts a different audience and has strengths for revealing different types of social interactions. For each topic of interest, you’ll want to understand the audience and the activities common to the publisher.

Out of the possible firehoses we can use to track the earthquake through social media activity, four firehoses will be used in this post:

  • Twitter
  • WordPress Posts
  • WordPress Comments
  • Newsgator (to compare to traditional media)

There are some common tasks and considerations for solving problems like looking at earthquake data across social media.  For examples, see this post on the taxonomy of data science. To work toward an answer, you will always need a strategy of attack to address each of the common data science challenges.

For this project, we will focus on the following steps needed to digest and analyze a social media firehose:

  1. Connect and stream data from the firehoses
  2. Apply filters to the incoming data to reduce to a manageable volume
  3. Store the data
  4. Parse and structure relevant data for analysis
  5. Count (descriptive statistics)
  6. Model
  7. Visualize
  8. Interpret

Ok, that’s the background. Let’s go.

1. Connect to the firehose. We are going to collect about a day’s worth of data. The simplest way to collect the data is with a cURL statement like this,

curl --compressed -s -u shendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"

This command opens a streaming HTTP connection to the Gnip firehose server and delivers a continuous stream of JSON-formatted, GZIP-compressed data from the stream named “pt1.json” to my analysis data collector. If everything goes as planned, this will collect data from the firehose until the process is manually stopped.

The depth of the URL (each level of …/…/…) is the RESTful way of defining the interface. Gnip provides a URL for every stream a user configures on the server. In this case, I configured a streaming Twitter server named “pt1.json.”

A more realistic application would ensure against data loss by making this basic client connection more robust. For example, you may want to monitor the connection so that if it dies due to network latency or other network issues, the connection can be quickly re-established. Or, if you cannot afford to miss any activities, you may want to maintain redundant connections.
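One way to make the connection more robust is to wrap it in a loop that reconnects after a failure, waiting a little longer after each consecutive failure. The sketch below is illustrative only: `stream_with_reconnect`, `connect` and `handle` are hypothetical names, and `connect` stands in for whatever opens the HTTP stream and returns an iterable of lines.

```python
import time


def stream_with_reconnect(connect, handle, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Consume lines from connect(), reconnecting with exponential backoff.

    connect: hypothetical callable returning an iterable of lines; it (or the
             iteration) may raise OSError when the network fails.
    handle:  callable invoked once per received line.
    Returns the number of lines handled; gives up after max_attempts
    consecutive failures.
    """
    handled = 0
    failures = 0
    while failures < max_attempts:
        try:
            for line in connect():
                handle(line)
                handled += 1
                failures = 0  # any received data resets the backoff
            # The stream ended without an error; treat that as a disconnect.
            failures += 1
        except OSError:
            failures += 1
        if failures < max_attempts:
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (failures - 1))
    return handled
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. A redundant setup would run two such loops against separate connections and de-duplicate downstream.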

One of the network challenges is that volumes from the firehoses change with a daily cycle, a weekly cycle and surprise events such as earthquakes. These changes can be many times the average volume of posts. There are many strategies for dealing with volume variations and planning network and server capacities. For example, you may design graceful data loss procedures or, if your data provider provides it, content shaping such as buffering, prioritized rules or rule-production caps. In the case of content shaping, you may build an application to monitor rule-production volumes and react quickly to large changes in volume by restricting your rule set.
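A volume monitor of this kind can be quite small. The following sketch assumes you already count activities per fixed interval (say, per minute); the class name `VolumeMonitor`, the window length and the burst factor are all illustrative assumptions, not values from any Gnip product.

```python
from collections import deque


class VolumeMonitor:
    """Flag intervals whose activity count far exceeds the recent average."""

    def __init__(self, window=60, burst_factor=10.0):
        self.history = deque(maxlen=window)  # counts for recent intervals
        self.burst_factor = burst_factor

    def observe(self, count):
        """Record one interval's count; return True if it looks like a burst."""
        baseline = (sum(self.history) / len(self.history)
                    if self.history else None)
        is_burst = (baseline is not None and baseline > 0
                    and count >= self.burst_factor * baseline)
        self.history.append(count)
        return is_burst
```

When `observe` returns True, the application could restrict its rule set or alert an operator. Note that burst counts enter the history too, so a sustained burst gradually becomes the new baseline.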

Here is a short list of issues to keep in mind when planning your firehose connection:

  • Bandwidth must be sufficient for activity volume peaks rather than averages.
  • Latency can cause disconnects as well as having adverse effects on time-sensitive analysis.
  • Disconnects may occur due to bandwidth or latency as well as network outage or client congestion.
  • Implement redundancy and connection monitoring to ensure against activity loss.
  • Activity bursts may require additional hardware, bandwidth, processing power or filter updates. Volume can change by 10x or more.
  • Publisher terms of service may require additional filtering to comply with requirements on how or when data may be used, for example, appropriately handling activities that were deleted or protected after your system received them.
  • Repeated activities may need to be de-duplicated, and missing activities identified.
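As one illustration of the last item, de-duplication can be sketched in a few lines of Python. The sketch assumes each activity arrives as one JSON object per line and carries a unique "id" field; that field name is an assumption made for the example.

```python
import json


def deduplicate(lines, seen_ids=None):
    """Yield each activity once, skipping ids that were already seen.

    Assumes one JSON object per line with a unique "id" field; the
    field name is an assumption for this sketch.
    """
    seen_ids = set() if seen_ids is None else seen_ids
    for line in lines:
        activity = json.loads(line)
        activity_id = activity["id"]
        if activity_id in seen_ids:
            continue  # repeated activity, e.g. from a redundant connection
        seen_ids.add(activity_id)
        yield activity
```

The set of seen ids grows without bound here; a long-running consumer would likely keep only a recent window of ids.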

2. Preliminary filters. Generally, you will want to apply broad filter terms early in the process to enable faster, more manageable downstream processing. Many terms related to nearly any topic of interest will appear not only in the activities you are interested in, but also in unrelated activities (noise). The practical response is to continuously refine filter rules to exclude unwanted activities. This may be as simple as keyword filtering, or as sophisticated as machine learning identification of activity noise.

While it would help to use a carefully crafted rule set for our earthquake filter of the firehoses, it turns out that we can learn a lot with the two simple rules “quake” and “terramoto,” the English and Spanish terms commonly appearing in activities related to the earthquake. For our example analysis, we don’t get enough noise with these two terms to worry about additional filtering. So, each of the firehoses is initially filtered with these two keywords. With a simple filter added, our connection looks like this:

curl --compressed -s -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json" \
  | grep -i -e "quake" -e "terramoto"

The “grep” command keeps only the activities that contain the term “quake” or “terramoto”; the “-i” makes the match case-insensitive.

The filter shown in the example will match activities in which either term appears in any part of the activity including activity text, URLs, tagging, descriptions, captions etc. In order to filter more precisely on, for example, only blog post content, or only tweet user profile, we would need to parse the activity before filtering.
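Parsing before filtering can be sketched as follows. This minimal Python example checks only the activity text, assuming that text lives in the “body” field of each JSON activity (the same field used for the character counts later in this post); the function name is hypothetical.

```python
import json

TERMS = ("quake", "terramoto")


def matches_body(line, terms=TERMS):
    """Return True if any term appears in the activity's "body" text only."""
    activity = json.loads(line)
    body = activity.get("body", "").lower()
    return any(term in body for term in terms)
```

Unlike the grep filter, an activity whose URL or tags mention “quake” but whose text does not is rejected.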

Alternatively, we can configure Gnip’s Powertrack filtering when we set up our server, with rules that restrict filtering to certain fields or shape volume. For example, to filter Tweets based on a Twitter user’s profile location setting, we might add the rule,

user_profile_location:"Mexico City"

Or, to shape matched Tweets volume for very common terms, we might add the rule to restrict output to 50% of matched Tweets with,

sample:50

For the earthquake example, we use all matched activities.

3. Store the data. Based on the desired analysis, there are a wide variety of choices for storing data. You may choose to create a historical archive, load a processing queue, or push the data to cluster storage for processing with, for example, Hadoop. Cloud-based key-value stores can be economical, but may not have the response characteristics required for solving your problem. Choices should be driven by precise business questions rather than technology buzz.

Continuing to work toward the earthquake analysis, we will store activities to a file to keep things simple.

curl --compressed -s -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json" \
  | grep -i -e "quake" -e "terramoto" \
  > earthquake_data.json

Plans for moving and storing data should take into account typical activity volumes. Let’s look at some examples of firehose volumes. As JSON-formatted activities compressed with GZIP, 100M Tweets come to roughly 25 gigabytes. While this takes less than 2 minutes to transfer to disk at 300 MB/s (SATA II), it takes about 6 hours at 10 Mb/s (e.g. a typical congested Ethernet network). Firehose sizes vary, and one day of WordPress.com posts is a bit more manageable at 350 MB.
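The transfer times quoted above follow from simple arithmetic, sketched here in Python. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6 bits) and ignores protocol overhead.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Time to move size_bytes over a link of the given bit rate."""
    return size_bytes * 8 / rate_bits_per_second


SIZE = 25e9           # 25 gigabytes of compressed activities
SATA_II = 300e6 * 8   # 300 MB/s expressed in bits per second
ETHERNET = 10e6       # 10 Mb/s

print(transfer_seconds(SIZE, SATA_II) / 60)     # minutes, about 1.4
print(transfer_seconds(SIZE, ETHERNET) / 3600)  # hours, about 5.6
```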

Filtered Earthquake data for the Twitter, WordPress and Newsgator firehoses is only a few gigabytes, so we will just work from local disk.

4. Parse and structure relevant data. This is the point where we make decisions about data structure and tools that best support the desired analysis. The data are time-ordered social media activities with a variety of metadata. On one hand, it may prove useful to load the data into HBase to leverage the scalability of Hadoop, while on the other, structuring a subset of the data or metadata and inserting it into a relational database to leverage the speed of indexes might be a good fit. There is no silver bullet for big data.

Keeping it simple for the earthquake data, we use a Python script to parse the JSON-formatted activities and extract the date-time of each post.

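import json

# Each line of the file holds one JSON-formatted activity.
with open("earthquake_data.json") as f:
    for activity in f:
        # Print the date-time at which the activity was posted.
        print(json.loads(activity)["postedTime"])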

Now we can analyze the time-evolution of activity volume by counting up the number of mentions appearing in a minute. Similarly, to estimate content complexity, we can add a few more lines of Python to count characters in the text of the activity.

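import json

# Each line of the file holds one JSON-formatted activity.
with open("earthquake_data.json") as f:
    for activity in f:
        # Print the number of characters in the activity text.
        print(len(json.loads(activity)["body"]))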

5. Descriptive statistics. Essentially, counting the number of earthquake references. Now that we have extracted dates and sizes, our earthquake analysis is simple. A few ideas for more interesting analysis could be understanding and identifying key players in the social network, extracting entities from text such as places or people, or performing sentiment analysis, or watching the wave of tweets move out from the earthquake epicenter.
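Counting mentions per minute can be sketched with a few more lines of Python. The sketch assumes “postedTime” holds an ISO 8601 timestamp such as "2012-01-01T00:00:30.000Z", so that its first 16 characters identify the minute; the function name is hypothetical.

```python
import json
from collections import Counter


def mentions_per_minute(lines):
    """Count activities per minute from their "postedTime" values.

    Assumes ISO 8601 timestamps such as "2012-01-01T00:00:30.000Z", so
    the first 16 characters ("2012-01-01T00:00") identify the minute.
    """
    counts = Counter()
    for line in lines:
        posted = json.loads(line)["postedTime"]
        counts[posted[:16]] += 1
    return counts
```

Passing the open file earthquake_data.json to this function and printing the sorted items gives the time series of activity volume.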

6. Model. Descriptive statistics are okay, but well-formed questions make explicit assumptions about correlation and/or causality and–importantly–are testable. The next step is to build a model and some related hypotheses we can test.

A simple model we can examine is that surprise events fit a “double-exponential” pulse in activity rate. The motivating idea is that news spreads more quickly as more people know about it (exponential growth) until nearly everyone who cares knows. After saturation, the discussion of a topic dies off exponentially (analogously to radioactive decay). If this hypothesis works out, we have a useful modelling tool that enables comparison of events and conversations between different types of events and across firehoses. To learn more about the attributes and fitting social media volume data to the double exponential, see Social Media Pulse.
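To make the shape of such a pulse concrete, here is a minimal Python sketch. It uses a difference of two exponentials, which gives a fast rise followed by a slower exponential decay; this parameterization and the names `amplitude`, `alpha` and `beta` are illustrative assumptions and may differ from the form used in Social Media Pulse.

```python
import math


def pulse_rate(t, amplitude, alpha, beta):
    """Activity rate at time t after the event (zero before it).

    Difference-of-exponentials pulse: the rise is governed by alpha and
    the decay by beta, with alpha > beta > 0. This form is an assumption
    made for illustration.
    """
    if t < 0:
        return 0.0
    return amplitude * (math.exp(-beta * t) - math.exp(-alpha * t))


def peak_time(alpha, beta):
    """Time at which pulse_rate is largest (where its derivative is zero)."""
    return math.log(alpha / beta) / (alpha - beta)
```

Fitting the parameters to the per-minute counts for each firehose would then allow events to be compared by their rise and decay rates.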

7. Visualize. Finally, we are ready to visualize the results of earthquake-related activity. We will use the simple tactic of looking at time-dependent activity in Figure 1. Twitter reports start arriving immediately after the earthquake and the volume grows to a peak within minutes. Traditional media (Newsgator) stories peak about the same time and continue throughout the day while blogs and blog comments peak and continue into the following day.

Twitter Reaction to Earthquakes

8. Interpret. Referring to Figure 1, a few interesting features emerge. First, Twitter volume is significant within a minute or two of the earthquake and peaks in about half an hour. Within an hour, the story on Twitter starts to decay. In this case, the story’s natural decay is slowed by the continual release of news stories about the earthquake. Users continue to share new Tweets for nearly 12 hours.

The prominent bump in Tweet volume  just before 1:00 UTC is the result of Tweets by the members of the Brazilian boy-band “Restart,” who appeared to be visiting Mexico City at the time of the earthquake and came online to inform their fanbase that they were okay. The band’s combined retweet volume during this period added a couple thousand tweets to the background of earthquake tweets (which also grew slightly during this period due to news coverage).

While it is not in general the case, traditional media earthquake coverage (represented by the bottom graph of the Newsgator firehose) peaks at about the same time as Tweet volume. We commonly see Tweet volume peaking minutes, hours and occasionally days before traditional media. In this case, the quake was very large, attracting attention and creating pressure to answer questions about damage and injuries.

WordPress blog posts about the earthquake illustrate a common pattern for blog posts: they live on a daily cycle. Notice the second wave of WordPress posts starting at around 9:00 UTC. Blogs take a little longer to peak because they typically contain analysis of the situation, photos and official statements. Also, many blog readers choose to check in on the blogs they follow in the morning and evening.

A couple of comments on the topic of content richness… As you might have guessed, Tweets are concise. Tweets are limited to 140 characters, but average in the mid 80s. Comments tend to be slightly longer than Tweets at about 250 characters. Posts have the widest range of sizes, with a few very large posts (up to 15,000 words). Posts also often contain rich media such as images or embedded video.

From our brief analysis, it is clear that different firehoses represent different audiences, a range of content richness and publisher-specific user modes of interaction. “What firehoses do I need?” To start to answer, it may be useful to frame the comparison in terms of speed vs. content richness. Stories move at different speeds and have different content richness based on the firehose and the story. Your use case may require rapid reaction to the social conversation, or nuanced understanding of long-term sentiment. Surprise stories don’t move in the same ways as conversations about expected events.  Each firehose represents a different set of users with different demographics and sensibilities. You will need to understand the content and behavior of each firehose based on the questions you want to answer.

As you can see by getting your hands a little dirty, looking at how a specific event is being discussed in social media is fairly straightforward. Hopefully I have shown a simple and direct route to getting answers and given some useful context on the considerations for building a real-world social media application. Always remember to start with a good, well-formed question.

In Taming The Social Media Firehose,  Part III, we will look at consuming the unique, rich and diverse content of the Tumblr Firehose.

Taming The Social Media Firehose, Part I

This is the first post in our series on what a social media “firehose” (e.g. streaming API) is and what it takes to turn it into useful information for your organization. Here I outline some of the high-level challenges and considerations when consuming the social media firehose; in Parts II and III, I will give more practical examples.

Social Media Firehose

Why consume the social media firehose?

The idea of consuming large amounts of social data is to get small data–to gain insights and answer questions, to guide strategy and help with decision making. To accomplish these objectives, you are not only going to collect data from the firehose, but you are going to have to parse it, scrub and structure it based on the analysis you will pursue. (If you’re not familiar with the term “parse,” it means machines are working to understand the structure and contents of the social media activity data.) This might mean analyzing text for sentiment, looking at the time-series of the volume of mentions of your brand on Tumblr, following the trail of political reactions on the social network of commenters or any of thousands of other possibilities.

What do we mean by a social media firehose?

Gnip offers social media data from Twitter, Tumblr, Disqus and Automattic (WordPress blogs) in the form of “firehoses.”  In each case, the firehose is a continuous stream of flexibly structured social media activities arriving in near-real time. Consuming that sounds like it might be a little tricky. While the technology required to consume and analyze social media firehoses is not new, the synthesis of tools and ideas needed to successfully consume the firehose deserves some consideration.

It may help to start by contrasting firehoses with a more common way of looking at the API world–the plain vanilla HTTP request and response. The explosion of SOAPy (Simple Object Access Protocol) and RESTful APIs has enabled the integration and functional ecosystem of nearly every application on the Web. At the core of web services is a pair of simple ideas: that we can leverage the simple infrastructure of HTTP requests (the biggest advantage may be that we can build on existing web servers, load balancers, etc.), and that scalable applications can be built on simple stateless request/response pairs exchanging bite-sized chunks of data in standard formats.

Firehoses are a little different in that, while we may choose to use HTTP for many of the reasons REST and SOAP did, we don’t plan to get responses in mere bite-sized chunks. With a firehose, we intend to open a connection to the server once and stream data indefinitely.

Once you are consuming the firehose, and–even more importantly–with some analysis in mind, you will choose a structure that adequately supports your approach. With any luck (more likely smart people and hard work), you will end up not with Big Data, but rather with simple insights–simple to understand and clearly prescriptive for improving products, building stronger customer relationships, preventing the spread of disease, or any other outcome you can imagine.

The Elements Of a Firehose

Now that we have a why, let’s zero in on consuming the firehose. Returning to the definition above, here is what we need to address:

Continuous. For example, the Twitter full firehose delivers over 300M activities per day. That is an average of 3,500 activities/second or 1 activity every 290 microseconds. The WordPress firehose delivers nearly 400K activities per day. While this is a much more leisurely 4.6 activities/second, there still isn’t much time to sleep when an activity arrives every 0.22 s. And if your system isn’t continuously pulling data out of the firehose, much can be lost in a short time.
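These rates are straightforward to check. The short Python sketch below reproduces the arithmetic from the rounded daily volumes given above, so its results differ slightly from the rounded figures in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(activities_per_day):
    """Average activities per second for a given daily volume."""
    return activities_per_day / SECONDS_PER_DAY


twitter = per_second(300e6)    # about 3,472 activities/second
wordpress = per_second(400e3)  # about 4.6 activities/second

print(1e6 / twitter)   # microseconds between Tweets, about 288
print(1 / wordpress)   # seconds between posts, about 0.22
```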

Streams. As mentioned above, the intention is to make a firehose connection and consume the stream of social media activities indefinitely. Gnip delivers the social media stream over HTTP. The consumer of data needs to build their HTTP client so that it can decompress and process the buffer without waiting for the end of the response. This isn’t your traditional request-response paradigm (that’s why we’re not called Ping–and also, that name was taken).
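Decompressing and processing the buffer as it arrives can be sketched with Python's standard zlib module. This is a minimal sketch, assuming the stream is GZIP-compressed with one activity per line; `chunks` stands in for whatever byte chunks your HTTP client hands over, and the function name is hypothetical.

```python
import zlib


def stream_lines(chunks):
    """Yield complete lines from gzip-compressed byte chunks as they arrive."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        # Hand over every finished line; keep the partial tail for later.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():  # skip blank keep-alive lines
                yield line.decode("utf-8")
    if buffer.strip():
        yield buffer.decode("utf-8")  # final line without a newline
```

Because the function is a generator, each activity is available for parsing as soon as its line is complete, without waiting for the response to end.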

Unstructured data. I prefer “flexibly structured” because there is plenty of structure in the JSON or XML formatted activities contained in the firehose. While you can simply and quickly get to the data and metadata for the activity, you will need to parse and filter the activity. You will need to make choices about how to store activity data in the structure that best supports your modeling and analysis. It is not so much what tool is good or popular, but rather what question you want to answer with the data.

Time-ordered activities done by people. The primary structure of the firehose data is that it represents the individual activities of people rather than summaries or aggregations. The stream of data in the firehose describes activities such as:

  • Tweets, micro-blogs
  • Blog/rich-media posts
  • Comments/threaded discussions
  • Rich media-sharing (urls, reposts)
  • Location data (place, long/lat)
  • Friend/follower relationships
  • Engagement (e.g. Likes, up- and down-votes, reputation)
  • Tagging

Real-time. Activities can be delivered soon after they are created by the user (this is referred to as low latency). (Paul Kedrosky points out that a 70s station wagon full of DVDs has about the same bandwidth as the internet, but an inconvenient coast-to-coast latency of about 4 days.) Both bandwidth and latency are measures of speed. Many people know how to worry about bandwidth, but latency issues can really mess up real-time communications even if you have plenty of bandwidth. When consuming the Twitter firehose, it is common to see latency (measured as the time from Tweet creation to parsing the Tweet coming from the firehose) of ~1.6 s and as low as 300 milliseconds. WordPress posts and comments arrive 2.5 seconds after they are created on average.

So there are a lot of activities and they are coming fast. And they never stop, so you never want to close your connection or stop processing activities.

However, in real life “indefinitely” is more of an ideal than a regular achievement. The stream of data may be interrupted by any number of variations in the network and server capabilities along the line between Justin Bieber tweeting and my analyzing what brand of hair gel teenaged girls are going to be talking their boyfriends into using next week.
We need to work around practicalities such as high network latency, limited bandwidth, running out of disk space, service provider outages, etc. In the real world, we need connection monitoring, dynamic shaping of the firehose, redundant connections and historical replay to get at missed data.

In Part II we make this all more concrete. We will collect data from the firehose and analyze it. Along the way, we will address particular challenges of consuming the firehose and discuss some strategies for dealing with them.