Data Story: Interview with Christyn Perras of YouTube

I have long wanted to interview someone from YouTube as I think their social data is fascinating and incredibly vast. Every minute, 100 hours of video are uploaded to YouTube. Christyn Perras, a quantitative analyst at YouTube, is talking with Gnip about the career path to being a data scientist, the tools in her arsenal, YouTube’s data-driven culture and Coursera. 

Christyn Perras

1. What was your path to becoming a quantitative analyst at YouTube? What would you recommend for others?

As an undergraduate, I studied psychology and was particularly drawn to the experimental side of the discipline. When I was considering an advanced degree, I concentrated on the aspects of psychology that I loved during my search for a graduate program. I eventually found a program that focused on applied statistics and experimental design at the University of Pennsylvania, where I received an MS and PhD. However, even after graduation, my career path remained unclear and the tech industry wasn’t even on the radar. It was when I started looking for jobs using search terms referring to my skill set rather than job titles that I saw a world of opportunity unfold in front of me.

My first job on the west coast was at Slide, a social gaming company. It was an amazing experience. At Slide, I used my psychology background to understand our users and the way they interacted with our products. In addition, my background in statistics and experimental design gave me the skills to study, test, quantify and interpret user behavior and to measure the impact of our influence. We sought answers to questions such as: Why were these people using our products? What made them come back? And what could we do to change their behavior and/or enhance their experience? I am now doing this at YouTube and concentrate my efforts on understanding our creators and continuing to improve their YouTube experience via foundational research and experiment analysis.

2. I’ve noticed that Google doesn’t tend to use the title of data scientist. Is there a reason for this?

Not that I’m aware of. Data scientist, quantitative analyst, statistician and decision support analyst are all fairly interchangeable terms in the tech industry. As I mentioned before, my job search was most successful when I used keywords related to my skills and interests (statistics, psychology, experiments) rather than searching job titles (statistician). However, I imagine with the rising popularity and awareness of the field, naming conventions for job titles will likely become more standardized.

3. What is one of the most surprising aspects you’ve learned about YouTube data?

Honestly, I was surprised by the sheer amount of data! It is staggering. I had to learn a number of new programming languages and techniques just to be able to get the data I needed for an analysis into a manageable format. During my time at Penn, SAS, SPSS and SQL were the preferred tools and were incorporated into the curriculum. Without a more extensive computer science background, tools such as MapReduce and Python were quite new to me. I’m also continually expanding my knowledge and experience with techniques used to manipulate, reshape and connect data on this scale. When working with billions of data points, you often need to think creatively.

4. How do quantitative analysts work with product managers to shape YouTube?

There is a strong data-driven culture at YouTube and, as a result, product managers and analysts work very closely. In the case of a product change or redesign, analysts are involved in the process from the start. Early involvement ensures, for example, that the data necessary for analysis are collected, experimental arms are set up correctly, and logging is accessible and bug-free. We discuss the goals and expectations of product changes in depth to make sure analyses are designed to answer the right question and will produce valid, actionable results. Analysts and product managers typically have a steady dialogue throughout the course of analysis. Once the analysis is complete, we discuss the results, interpret the meaning, consider the implications, and make decisions about the next steps.

5. How do you think companies such as Coursera are changing fields such as data science?

I love Coursera! My favorite courses include Data Analysis with Jeff Leek and Computing for Data Analysis with Roger Peng. Coursera is doing something truly great and I look forward to seeing how they grow and progress. Data science is a bit nebulous in terms of education (at least it was when I was in school). There wasn’t a “data science” major or anything like that, so it was necessary to piece it together yourself. I have an amazing team with wildly different backgrounds from physics to psychology to economics. I love bouncing ideas off my colleagues and am guaranteed wonderfully unique and clever perspectives. Companies like Coursera make dynamic teams like this possible by giving people from a wide variety of disciplines access to the additional education they need to shape their career path and be successful in their job.

Another amazing resource for future data scientists is OpenIntro (openintro.org), co-founded by my colleague David Diez. With OpenIntro, you’ll find a top-notch, open-source statistics textbook and a wealth of supporting material.

Thanks to Christyn for the interview! If you’re interested in reading more interviews, check out Gnip’s collection of 25 Data Stories for interviews with data scientists from Foursquare, Pinterest, Kaggle and more. 

Gnip Launches YouTube Comments API

One of the main advantages of YouTube is that videos often continue to see views long after they are posted. With comments accumulating just as long, it’s tough for social media managers to stay on top of them all. Popular brands often have hundreds of videos, and monitoring comments can be tedious.

This is why Gnip added access to the YouTube Comments API as part of our Enterprise Data Collector. With the YouTube Comments API, Gnip customers can easily track the comments for all the videos they care about long after the video is posted.

To look at the type of content that populates YouTube comments, we wanted to do a little bit of fun research. Our data scientist, Scott Hendrickson, looked at the most popular videos of some of the most popular brands on YouTube: Nike, Sephora, Volkswagen, Barbie and Nerdist. We were interested in the language people use in YouTube comments, so we looked at both light-hearted words and words indicating sentiment and calibrated a “Lol index”: how often the words lol, good, bad, love, hate, omg, lmao and wtf showed up as a percentage of total YouTube comments for each video.
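
As a rough sketch of how such an index can be computed (the word list comes from this post; the function and sample comments are illustrative, not the actual analysis code):

```python
# A rough sketch of the "Lol index": for each tracked word, the percentage
# of a video's comments that contain it. The word list comes from the post;
# the function and sample comments are illustrative.
TRACKED_WORDS = ["lol", "good", "bad", "love", "hate", "omg", "lmao", "wtf"]

def lol_index(comments):
    """Return {word: percentage of comments that contain the word}."""
    if not comments:
        return {word: 0.0 for word in TRACKED_WORDS}
    return {
        word: 100.0 * sum(word in c.lower() for c in comments) / len(comments)
        for word in TRACKED_WORDS
    }

# Sillier videos skew toward "lol"; inspirational ones skew toward "love".
print(lol_index(["lol that was great", "I love this", "lol lol", "so bad"]))
```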

Despite YouTube comments’ reputation for negativity, we found that the word love was often used far more frequently than hate. Lol was the most frequently used term in sillier videos such as the Barbie Dreamhouse video, while comments on Nike’s inspirational video were more likely to include the word love.

Ultimately, we’re excited to launch a product that makes it easier for brands to monitor their YouTube comments.

Sephora’s most popular video is “Sephora Presents How to use Violent Lips.”

[Chart: Sephora Violent Lips Lol Index]

Nerdist’s most popular video is “Sexy Jedi Bubblebath! Saber 2: Return of the Body Wash.”

[Chart: Nerdist YouTube Lol Index]

Nike’s most popular video is “Find Your Greatness.”

[Chart: Nike YouTube Comments Lol Index]

Volkswagen’s most popular video is “The Force.”

[Chart: Volkswagen YouTube Lol Index]

Barbie™’s most popular video is “Life in the Dreamhouse — Happy Birthday Chelsea.”

[Chart: Barbie YouTube Lol Index]

Delivering 30 Billion Social Media Activities Monthly . . . and Counting

I’m excited to announce that, as of the end of October, Gnip is delivering over 30 billion paid social media activities per month to our customers. This is the largest number of paid social media activities ever distributed in a 30-day period.

Over the past year, we’ve seen extraordinary growth in the number of paid social media activities we deliver. At the start of 2011, Gnip was delivering 300 million activities per month. By May, that number was up to 3 billion activities per month. And in October, we delivered 30 billion. In essence, we’ve been growing by a factor of 10 every 5 months. At this rate, we’ll be delivering 300 billion activities per month by March of next year.
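
As a back-of-the-envelope check, the projection is simple compound growth; a quick sketch of the arithmetic in Python:

```python
# The growth claim above as simple compound arithmetic: 10x every 5 months,
# starting from 300 million activities/month at the start of 2011.
volume = 300e6                     # activities per month, January 2011
for months in (5, 10, 15):         # roughly May, October, next March
    projected = volume * 10 ** (months / 5)
    print(f"+{months} months: {projected:.0e} activities/month")
# +5 months:  3e+09  (3 billion)
# +10 months: 3e+10  (30 billion)
# +15 months: 3e+11  (300 billion, the projection above)
```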

Cool numbers, but what’s driving this growth?

We’re seeing three key drivers behind this number. First, we’re signing on new customers at an increasing rate, as more and more companies see the possibilities in social media data. Second, we’re seeing increased interest in our Twitter firehose products. From hedge funds using social data to drive trading strategies to business intelligence companies layering social data onto their existing structured data sources, interest in volume products from Twitter is consistently increasing. And finally, we’re seeing a marked increase in the number of customers using multiple sources to enrich their product capabilities. From boards and forums to YouTube and Facebook, our customers are seeing the potential in the many other social data sources we offer.

So, 300 billion per month by March? It’s a big number, but the way things are going, I’ll take the over.

Customer Spotlight – MutualMind

 
For startups seeking to enter and capitalize on the rising social media marketplace, timing is everything, and MutualMind was no exception: getting their enterprise social media management product to market in a timely manner was crucial to the success of their business. MutualMind provides an enterprise social media intelligence and management system that monitors, analyzes, and promotes brands on social networks and helps increase social media ROI. The platform enables customers to listen to discussion on the social web, gauge sentiment, track competitors, identify and engage with influencers, and use resulting insights to improve their overall brand strategy.

“Through their social media API, Gnip helped us push our product to market six months ahead of schedule, enabling us to capitalize on the social media intelligence space. This allowed MutualMind to focus on the core value it adds by providing advanced analytics, seamless engagement, and enterprise-grade social management capabilities.”

- Babar Bhatti
CEO, MutualMind

By selecting Gnip as their data delivery partner, MutualMind was able to get their product to market six months ahead of schedule. Today, MutualMind processes tens of millions of data activities per month using multiple sources from Gnip including premium Twitter data, YouTube, Flickr, and more.
 
For the full details, read the success story here.

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

Social Media in Natural Disasters

Gnip is located in Boulder, CO, and we’re unfortunately experiencing a spate of serious wildfires as we wind summer down. Social media has been a crucial source of information for the community here over the past week as we have collectively Tweeted, Flickred, YouTubed and Facebooked our experiences. Mashups depicting the fires and associated social media quickly started emerging after the fires started. VisionLink (a Gnip customer) produced the most useful aggregated map of official boundary & placemark data, coupled with social media delivered by Gnip (click the “Feeds” section along the left side to toggle social media); screenshot below.

[Screenshot: VisionLink Gnip Social Media Map]

With Gnip, they started displaying geo-located Tweets, then added Flickr photos with the flip of a switch. There were no new messy integrations requiring them to learn a new API with all of its rate limiting, formatting, and delivery protocol nuances; just simple selection of the data sources they deemed relevant to informing a community reacting, in real time, to a disaster.

It was great to see a firm focus on their core value proposition (official disaster relief data), and quickly integrate relevant social media without all the fuss.

Our thoughts are with everyone who was impacted by the fires.

Response Code Nuances

While fixing a bug yesterday, I plowed through the code that does Gnip’s HTTP response code special case handling. The scenarios we’re handling illustrate the complexities around doing integrations with many web APIs. It was a reminder of how much we all want standards to work, and how often they only partially do so. Here are a few nuances you should consider if you’re doing API integrations by hand.

“retry-after”

When doing a polling-based integration with a “real-time” API, you’re inclined to poll it a lot. That has caused some service providers to tell you to slow down using the “retry-after” HTTP header. Some providers use other, not-so-standard ways to cool you down, but those are beyond the scope of this post. When you get a non-200-level response back from a server, you should consider looking for the retry-after header, regardless of whether it was a 503 or a 300-level code (per the HTTP 1.1 specification). Generally, when a service sends a retry-after, its intention is clear, and you should respect the value that comes back. The format of that value can be either “seconds” or a more verbose time format that tells you when you should wait “until” before trying the request again. In practice, we’ve never seen the latter; only the “seconds” version. When we see retry-after, we sleep for that duration; you should probably do the same.
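
A minimal sketch of honoring that header, assuming the Python `requests` library (the URL is a placeholder):

```python
# A sketch of honoring Retry-After. The header value is either an integer
# number of seconds or an HTTP-date to wait "until"; in practice we've only
# ever seen the seconds form.
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests

def wait_for_retry_after(response):
    """Sleep for however long a response's Retry-After header asks."""
    value = response.headers.get("Retry-After")
    if value is None:
        return
    try:
        seconds = int(value)                # the common "seconds" form
    except ValueError:                      # the rarer HTTP-date "until" form
        until = parsedate_to_datetime(value)
        seconds = (until - datetime.now(timezone.utc)).total_seconds()
    time.sleep(max(seconds, 0))

response = requests.get("https://api.example.com/activities")  # placeholder URL
if response.status_code != 200:
    wait_for_retry_after(response)
```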

HTTP Response Code “999”

You can look for it in the spec, but you won’t find it. Delicious likes to send a “999” back when you’re hitting them too hard. Consider backing off for several minutes if you see this from them.

non-200 HTTP Response Bodies

While many services don’t bother sending response bodies back for non-200s (and those that do often don’t provide anything actionable), some do send useful ones. It’s a good idea to write those bodies to a log file (or at least the first n-hundred bytes) for human inspection. There can be some useful information in there to help you build a more effective and efficient integration.
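
A sketch that folds this advice together with the “999” note above, assuming a requests-style response object; the log file name, byte limit and five-minute back-off are illustrative:

```python
# Log the first few hundred bytes of any non-200 body for human inspection,
# and back off hard on Delicious's nonstandard 999. Names and durations
# here are illustrative, not Gnip's actual handling code.
import logging
import time

logging.basicConfig(filename="non200.log", level=logging.INFO)

def handle_non_200(response, max_bytes=500):
    if response.status_code == 200:
        return
    # Keep only the first n-hundred bytes; bodies can be large and useless.
    snippet = response.content[:max_bytes]
    logging.info("HTTP %s from %s: %r", response.status_code, response.url, snippet)
    if response.status_code == 999:         # "slow down" in Delicious dialect
        time.sleep(5 * 60)                  # back off for several minutes
```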

The matrix of services-to-response codes, and how you should respond to them, is big. The above is just a small slice of the scenarios your integrations will encounter, and that you’ll need to solve for.

While a service’s documentation is always some degree out of date, and you can only truly learn the behavioral characteristics through long nights of debugging, here are some pointers to service-specific response codes that you might find useful.

Gnip Platform Update – Now For Authenticated Data Services

The Gnip Platform was originally built to support accessing public services and data. In response to customer requests, we soft-launched support for authenticated data services over the summer, and now we have fully rolled out the new service. The difference between public and authenticated data services seems trivial, but in practice it is very important, since authenticated services represent either business-level arrangements between companies or private data access. The new Gnip capabilities support both of these scenarios.

As part of the new service, Gnip also provides dedicated integration capacity for companies: we can now segment individually managed nodes on our platform for specific company accounts. This means that a company with a developer key on Flickr, a whitelist account on Twitter, an application key on Facebook and a developer key on YouTube receives dedicated capacity on the Gnip platform to support all of its data integration requirements.

Gnip will also continue to maintain the existing public data integration services, which do not require authentication for access and distribution, and we expect most companies will use a blend of our data integration services.

Using the new support for authenticated data services requires contacting us at sales@gnip.com so we can enable your account. Please contact us today to leverage your existing whitelisted or authenticated account on Flickr, YouTube, Twitter or other APIs and feeds.

Pushing and Polling Data: Differences in Approach on the Gnip Platform

Obviously we have some understanding of the concepts of pushing and polling data from service endpoints, since we basically founded a company on the premise that the world needed a middleware push data service. Over the last year we have had a lot of success with the push model, but we also learned that, for many reasons, we need to work with services via a polling approach as well. For this reason our latest v2.1 release includes the Gnip Service Polling feature, so that we can work with any service using a push, poll or mixed approach.

Now, the really great thing for users of the Gnip platform is that how Gnip collects data is mostly abstracted away. Every developer or company has the option to tell Gnip where to push data for the filters or subscriptions they have set up. We also realize not everyone has an IT setup that can handle push, so we have always provided HTTP GET support that lets people grab data from a Gnip-generated URL for their filters.
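
For illustration only, here is a minimal sketch of that HTTP GET option, assuming the Python `requests` library; the URL and the “activities”/“id” fields are placeholders, not Gnip’s actual schema:

```python
# A sketch of polling a Gnip-generated URL for a filter and skipping
# activities already seen. URL and response fields are hypothetical.
import time

import requests

FILTER_URL = "https://api.gnip.com/filters/example/activity.json"  # hypothetical

def process(activity):
    print(activity)                         # stand-in for your real handler

seen = set()
while True:
    for activity in requests.get(FILTER_URL).json().get("activities", []):
        if activity["id"] not in seen:      # dedupe across polls
            seen.add(activity["id"])
            process(activity)
    time.sleep(600)                         # Community edition: ~10-minute cadence
```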

One place where the way Gnip collects data can make a difference for our users, at this time, is the expected latency of data. Latency here refers to the time between the activity happening (e.g. Bob posted a photo, Susie made a comment) and the time it hits the Gnip platform to be delivered to our awaiting users. Here are some basic expectation-setting thoughts.

PUSH services: When we have push services, the latency experienced is usually under 60 seconds, but we know that this is not always the case, since the services can back up during heavy usage and latency can spike to minutes or even hours. Still, when the services that push to us are running normally, it is reasonable to expect 60-second latency or better, and this is consistent for both the Community and Standard Editions of the Gnip platform.

POLLED services: When Gnip is using our polling service to collect data, the latency can vary from service to service based on a few factors:

a) How often we hit an endpoint (say 5 times per second)

b) How many rules we have to schedule for execution against the endpoint (say over 70 million on YouTube)

c) How often we execute a specific rule (e.g. every 10 minutes). Right now, with the Community edition of the Gnip platform, we set rule execution by default at 10-minute intervals, and people need to keep this in mind when setting expectations for data flow from any given publisher.

Expectations for POLLING in the Community Edition: I am sure some people who just read the above stopped and said, “Why 10 minutes?” Well, we chose to focus on “breadth of data” as the initial use case for polling. Also, the 10-minute interval is for the Community edition (aka the free version). We have the ability to turn the dial and, using the smarts built into the polling service feature, execute the right rules faster (e.g. every 60 seconds or faster for popular terms and every 10, 20, etc. minutes for less popular ones). The key issue here is that for very prolific posters or very common keyword rules (e.g. “obama”, “http”, “google”), there can be more posts in the 10-minute default time-frame than we can collect in a single poll from the service endpoint.
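
A minimal sketch of that dial, with illustrative intervals and a hypothetical per-rule popularity flag (not Gnip’s actual scheduler):

```python
# A sketch of interval-based rule scheduling: popular rules are polled more
# often than quiet ones. Intervals and the "popular" flag are illustrative.
import heapq
import itertools
import time

def schedule(rules):
    """Run each rule at its own interval, using a min-heap of due times."""
    counter = itertools.count()             # tie-breaker so dicts never compare
    queue = [(time.time(), next(counter), rule) for rule in rules]
    heapq.heapify(queue)
    while queue:
        due, _, rule = heapq.heappop(queue)
        time.sleep(max(0.0, due - time.time()))
        rule["poll"]()                      # hit the service endpoint once
        interval = 60 if rule["popular"] else 600   # turn the dial per rule
        heapq.heappush(queue, (time.time() + interval, next(counter), rule))

rules = [
    {"poll": lambda: print("poll 'obama'"), "popular": True},
    {"poll": lambda: print("poll 'knitting'"), "popular": False},
]
# schedule(rules)  # runs forever: popular rules every 60s, the rest every 600s
```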

For now, the default expectation for our Community edition platform users should be a 10-minute execution interval for all rules when using any data publisher that is polled, which is consistent with the experience during our v2.1 Beta. If your project or company needs something a bit snappier with the data publishers that are polled, then contact us at info@gnip.com or contact me directly at shane@gnip.com, as these use cases require the Standard Edition of the Gnip platform.

Current pushed services on the platform include: WordPress, Identi.ca, Intense Debate, Twitter, Seesmic, Digg, and Delicious.

Current polled services on the platform include: Clipmarks, Dailymotion, deviantART, diigo, Flickr, Flixster, Fotolog, FriendFeed, Gamespot, Hulu, iLike, Multiply, Photobucket, Plurk, reddit, SlideShare, SmugMug, StumbleUpon, Tumblr, Vimeo, Webshots, Xanga, and YouTube.

New Gnip Publishers: FriendFeed, YouTube and Hulu

We continue to push out new publishers to the beta http://api.gnip.com environment as we work to finish up the release and put the final touches on lots of new features.

The new publishers this week include the following:

  • FriendFeed-search: Supports the KEYWORD rule-type and works with the standard FriendFeed Search interface for tracking conversations
  • Hulu: Supports the ACTOR rule-type and works with the standard Hulu interface for tracking conversations
  • Hulu-search: Supports the KEYWORD rule-type and works with the standard Hulu Search interface
  • YouTube: Supports the ACTOR and TAG rule-types, works with the standard YouTube interface and tracks “uploads”
  • YouTube-search: Supports the KEYWORD rule-type and works with the standard YouTube-search interface

OK, now go grab some data from these or any of our other 20+ data publishers in the system. Or read up on the new features at http://www.gnip.com/docs.
