Data Stories: Gilad Lotan of betaworks

Gilad Lotan is the chief data scientist for betaworks, which has launched some pretty incredible companies including SocialFlow and bitly. I was very excited to interview him because it’s so rare for a data scientist to get to peek under the hood of so many different companies. He’s also speaking at the upcoming SXSW on “Algorithms, Journalism and Democracy.” You can follow him on Twitter at @gilgul.

(Gnip is hosting a SXSW event for those involved in social data, email for an invite.) 

Gilad Lotan of Betaworks

1. As the chief data scientist for betaworks, how do you divide your time amongst all of the companies? Basically, how do you even have time for coffee?

First of all, we take coffee very seriously at betaworks, especially team digg, who got us all hooked on the chemex drip. There’s always someone making really good coffee somewhere in the office.

Now beyond our amazing coffee, betaworks is such a unique place. We both invest in early stage companies, and incubate products, many of which have a strong data component. We’ve successfully launched companies over the past years, including Tweetdeck, Bitly, Chartbeat and SocialFlow. There are currently around 10 incubations at betaworks, at various stages and sizes.

Earlier this year, we decided to set up a central data team at betaworks rather than separate data teams within the incubations. Our hypothesis was that leveraging data wrangling skills and infrastructure as a common resource would be more efficient, and provide our incubations with a competitive advantage. Many companies face similar data-related problem-sets, especially in their early stages. From generating a trends list to building recommendation systems, the underlying methodology stays similar even when using different data streams. Our hope was that we could re-use solutions provided in one company for similar features in another. On top of that, we’re taking advantage of common data infrastructure, and using data streams to power features in existing products or new experiments.

When working with data, much of the time you’re building something you’ve never done before. Even if the general methodology might be known, when applied to a new dataset or within a new context there’s lots of tweaking to be done. For example, even if you’ve used naive Bayes classifiers in the past, when they are applied to a new data stream, results might not be good enough. So planning data features within product releases is challenging, as it is hard to predict how long development will take. And then there’s knowing when to stop, which isn’t necessarily intuitive. When is your model “good enough”? When are results from a recommendation system “good enough”?

The data team is focused on building data-driven features into incubated products. One week I’ll be working on scoring content from RSS feeds, and the following week I might be analyzing weather data clusters. We prioritize based on the stage of the company and the importance of this data for the product’s growth. We tend to focus on companies that have high volumes of data, or are seeking to build features that rely on large amounts of data munching. But we keep it fairly loose. We’re small and nimble enough at betaworks that prioritization between companies has not been an issue yet.

I’m aware that it will become more challenging, especially as companies grow in size.

2. Previously, you were the VP of Research & Development at SocialFlow. How did data science figure into product development?

At SocialFlow the data team built systems that mine massive amounts of data, including the Twitter public firehose. From distributed ingestion to analytics and visualization, there were a few ways in which our work fed into the product.

The first, and most obvious, was based on the product development roadmap. In an ideal situation, the data team’s work is always a few steps ahead of the product’s immediate needs, and the team can integrate its work into the product when needed. At SocialFlow, we powered a number of features including real-time audience trends, and personalized predictions for performance of content on social media. In both cases the modules were developed as a part of the product launch cycle and continuously maintained by the data team.

The second way in which we affected product development was by constantly running experiments. Continuous experimentation was a key way in which we innovated around our data. We would take time to test out hypotheses and explore different visualization techniques as a way to make better sense of our data. One of our best decisions was to bring in data science interns over the summer. They were all in the midst of their PhDs and incredibly passionate about data analysis, especially data from social networks. The summer was an opportunity for them to run experiments at a massive scale, using our data and infrastructure. As they learned our systems, each chose a focus and spent the rest of their time running analyses and experiments. Much of their work was integrated into the product in some manner. Additionally, several published their findings in academic journals and conference proceedings. Exploratory data analysis may be counter-productive, especially when there are no upper bounds set on when to stop the experimentation. But with strict deadlines, it may be invaluable.

The third, and most surprising for me, was storytelling. We made sure to always blog about interesting phenomena that we were observing within our data. Some were directly related to our business – publishers, brands and marketers – but much was simply interesting to the general public. We added data visualization to make them more accessible. From the Osama Bin Laden raid, to the spread of Invisible Children’s Kony2012 video, we were generating a sizable amount of PR for SocialFlow just by blogging about interesting things we identified in our data. While the attention was nice, there were some great business and product opportunities that came because of that.

Working with interesting data? Always be telling stories!

3. In your SXSW session “Algorithms, Journalism and Democracy,” you’ll speak to the bias behind algorithms. What concerns you about these biases and what do people need to know?

There are a growing number of online spaces which are a product of an automated algorithmic process. These are the trending topics lists we see across media and social networks, or the personalized recommendations we get on retail sites. But how do we define what constitutes a “trend” or what piece of content should be included in our “hottest” list? Oftentimes, it is not simply the most read or clicked on item. What’s hot is an intuitive and very human assessment, yet potentially a mathematically complex formula, if at all possible to produce. We’ve already seen numerous examples where algorithmically generated results led to awkward outcomes, such as Amazon’s $23,698,655.93 priced book about flies or Siri’s inability to find abortion clinics in New York City.

These are not Google, Apple, Amazon or Twitter conspiracies, but rather the unintended consequences of algorithmic recommendations being misaligned with people’s value systems and expectations of how the technology should work. The larger the gap between people’s expectations and the algorithmic output, the more user trust will be violated. As designers and builders of these technologies, we need to make sure our users understand enough about the choices we encode into our algorithms, but not so much that they can game the systems. People’s perception affects trust. And once trust is violated, it is incredibly difficult to gain back. There’s a misplaced faith in the algorithm, assuming it is fair, unbiased, and should accurately represent what we think is “truth”.

While it is clear to technologists that algorithmic systems are always biased, the public perception is that of neutrality. It’s math, right? And math is honest and “true”. But when building these systems there are specific choices made, and biases encoded due to these choices.

In my upcoming SXSW session with Poynter Institute’s Kelly McBride, we’ll be untangling some of these topics. Please come!

4. When marketers typically look at algorithms, most try to game the system. How does the cycle of programmers trying to stop gaming and those trying to game it play into creating biases?

Indeed. In spaces where there’s a perceived value, people will always try to game the system. But this is not new. The practice of search engine optimization is effectively “gaming the system”, yet it’s been around for many years and is considered an important part of launching a website. SEO is what we have to do if we want to optimize traffic to our websites. Like a game of cat and mouse, as builders we have to constantly adjust the parameters of our systems in order to make them harder to game, while making sure they preserve their value. With Google search, for example, while we have a general sense of what affects results, we don’t know precisely, and they constantly change it up!

Another example is Twitter’s trending topics algorithm. There’s clear value in reaching the trending topics list on Twitter: visibility. In the early days of Twitter, Justin Bieber used to consistently dominate the trends list. In response, team Twitter implemented TFIDF (Term Frequency Inverse Document Frequency) as a way to make sure this didn’t happen – making it harder for consistently popular content to trend. As a result, only new and “spiky” topics (those with a rapid acceleration of shares) make it to the trends list. This means that events such as Kim Kardashian’s wedding or Steve Jobs’ death are much more likely to trend, compared to topics that are a part of an ongoing story, such as the economy or immigration. I published a long blog post on why #OccupyWallStreet never trended in NYC explaining precisely this phenomenon.
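To make the “spiky versus steady” idea concrete, here is a minimal sketch of TF-IDF applied to hashtags. It is not Twitter’s actual implementation, which is not described in this interview. It assumes each time window of hashtags is treated as one “document”; the function name `tfidf_trend_scores` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_trend_scores(current_window, past_windows):
    """Score each hashtag in the current window with TF-IDF.

    Assumption: every time window counts as one 'document', so a tag
    that shows up in every past window gets a low IDF.
    """
    counts = Counter(current_window)
    total = sum(counts.values())
    n_docs = len(past_windows) + 1  # past windows plus the current one
    scores = {}
    for tag, count in counts.items():
        tf = count / total  # term frequency in the current window
        # document frequency: windows (including this one) containing the tag
        df = 1 + sum(1 for window in past_windows if tag in window)
        idf = math.log(n_docs / df)
        scores[tag] = tf * idf
    return scores

# Hypothetical data: one tag is always popular, the other is brand new.
past = [
    ["#bieber", "#bieber", "#economy"],
    ["#bieber", "#economy", "#bieber"],
    ["#bieber", "#bieber", "#bieber"],
]
now = ["#bieber", "#bieber", "#bieber", "#stevejobs", "#stevejobs"]

scores = tfidf_trend_scores(now, past)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data the always-present tag scores zero even though it has the most mentions in the current window, while the new tag ranks first. That is the behavior described above: steady popularity is discounted and novelty is rewarded.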

Every event organizer wants to get their conference to trend on Twitter. Hence a unique hashtag is typically chosen to be used. We know enough about what gets something to trend, but it is still difficult to completely game the system. In a way, there’s always a natural progression with algorithmic products. The more successful these spaces are, the more perceived value they attain, the more people will try to game them. Changes in the system are necessary in order to keep it from being abused.

5. If you could have any other data scientist’s job, which one would you want?

I’m pretty sure I have the best data science gig out there. I get to work with passionate smart people, on creative and innovative approaches to use data in products that people love. Hard to beat that!

The Need for a Data Science Masters Program

Data science is a new profession and thus, there isn’t a clear educational or career path for data scientists. One of the questions we ask most often in our Data Stories series is about the career path people took to become data scientists. On Gnip’s own data science team, three of our members have PhDs in Physics and one has a masters in mathematics. So I am definitely interested in how universities are creating their own data science programs. To that end, I wanted to interview Annalee Saxenian, the dean of UC Berkeley’s School of Information, about their masters program for data scientists.

This is part of our Data Stories series leading to SXSW. Dean Saxenian is speaking on “The Future Belongs to Data Scientists.” Gnip is hosting a SXSW event for those involved in social data, email for an invite. 

Dean Annalee Saxenian

1. Why create a masters program specifically for data scientists?

There is huge demand for people who can work with data, at large as well as small scales, using the new tools and technologies that are becoming available for data storage, analytics, and visualization. While data science has been pioneered by technology companies like Google, LinkedIn, and Facebook, we believe that every organization today (large and small, profit and non-profit, in every industry) has new sources of data that it can use to inform decision making and to develop new products and services. This new data, which comes from click streams, online sales receipts, sensor networks, mobile devices, and social media, is not only available at very large scales, but is also largely unstructured or semi-structured. This makes the analysis of the data fundamentally different from analysis of the smaller and more structured data sets of the past.

2. Data scientist is a new career path. What are the advantages of receiving formal training through a program such as Berkeley’s School of Information versus real-world experience?

Most organizations don’t have the resources or the commitment to systematically expose employees to the range of new tools, technologies, and skills required of a data scientist. Even leading technology companies only provide very limited on-the-job training in the relevant skills to their employees. They are looking for employees who already have expertise in areas like statistics and data analytics.

One of the advantages of a Master’s degree program like the Master of Information and Data Science (MIDS) at Berkeley is that our faculty has built a complete curriculum from the ground up–designing both the individual courses we think are essential to practicing data scientists as well as building the dependencies between the courses so that the whole is greater than the sum of the individual parts. The curriculum covers the full life cycle of data science. We offer courses devoted to research design, data storage and retrieval, statistical analysis, machine learning, data visualization and communication, data privacy and ethics, field experiments, and scaling and parallelism. In addition, we require that students gain experience working in teams. In short, formal education like the MIDS program offers comprehensive exposure to the field of data science.

3. Why did Berkeley decide to make the I School an online program?

The School of Information faculty decided to offer the program online for several reasons. On one hand, we are growing our existing programs and are outgrowing our facilities on the Berkeley campus. Offering an online program relieves us of the need to compete for scarce space on campus. We also believe that, as a School of Information, we should be experimenting with new educational technologies, and that since most of our graduates will be working in teams and online settings, we should play a leadership role in this space. Last, but not least, by offering the degree online we are able to reach a much wider range of students than we can with our face-to-face programs. We are providing access to a Berkeley-quality degree to people who can’t move to Berkeley for family or work reasons and to those who need to continue working while they seek further education.

4. What characteristics do you think make for the most successful data scientists?

Data scientists do need a set of technical and analytical skills and mastery of certain tools and technologies, but just as important are the soft skills. The most successful data scientists can think creatively about trends in data, collaborate well on teams, and communicate the findings from data to nonspecialists. So they need to be clear thinkers, good collaborators and communicators, and they need to be able to think creatively about what they see in the data.

5. What do you think are the upsides and downsides for companies for dealing with data that previously wasn’t accessible?

The upsides for companies: the new data can be used to enhance business decision making as well as to develop new products and services. Companies are using previously inaccessible data to learn more about customer behavior and about market trends. They are designing regular online experiments that generate data and let them learn in real time about trade-offs in design and other business decisions.

The downsides: most companies still don’t have people with the relevant skills to learn from the new data, and they will need to reorganize in order to take full advantage of the new data. The established silos in established companies mean that data is managed by a different group than those who are able to analyze it or who are developing products, and they in turn are not well connected to senior decision makers. Taking full advantage of the new data will require much closer interaction between these different parts of the organization.

Occupy Gezi: How Twitter Facilitated a Social Movement in Turkey

Last summer in Turkey, a small protest over the removal of trees in Gezi park started a large movement to protect one of the last green spaces remaining in the heart of Istanbul. The movement resulted in 1,900 arrests, and nearly as many people were reported injured. Social media served as the primary source of information for citizens. We interviewed Yalçın Pembecioğlu of Bigumigu about how the movement sparked and what it means for social media in Turkey. This is part of our SXSW Data Stories, where we’re interviewing presenters about their data talks. Yalçın is presenting on “#Occupygezi Movement: A Turkish Twitter Revolution.”

Yalçın Pembecioğlu of Bigumigu

1. How did Twitter help create the #Occupygezi movement?

At the start of the events, none of the broadcast networks covered #OccupyGezi. Not even a little bit. I guess this encouraged people to take control and be their own media. Suddenly, everybody started to take and distribute pictures and videos from the places where events were taking place. The content from citizen media went viral in seconds.

2. Why would people choose Twitter over mainstream media as a source of information?

If the information is coming from someone you trust, it is very important information. During #OccupyGezi, people saw the cold, brutal face of the mainstream media. Our friends were on the streets and they were telling unbelievable stories. We decided to believe our friends and families instead of the mainstream media.

3. How did the #Occupygezi movement respond to rumors via social media?

It was emotionally devastating to see the police brutality towards the protestors. At those kinds of times, I guess people become more tolerant of biased information. But many of us, including me, spent hours on the computer distilling the dirty data into real bits of information. Hence the term “kesin bilgi mi?” was born on the Turkish internet. It means “is this confirmed information?” During #OccupyGezi, when important information came up, we all asked questions to confirm it. If it was confirmed, we spread it; if it was not, we warned the source to double-check the data. Now it’s a common meme on the internet to reply to any joke with “kesin bilgi mi?” We learned very quickly not to trust everything on the internet.

4. After #Occupygezi, how did the use of social media change in Turkey?

The most popular social media platform in Turkey is Facebook, with over 30 million active users. After #OccupyGezi, the penetration of Twitter accelerated. It is said that nearly 1 million new accounts were created during the #OccupyGezi weeks. It is believed that Twitter has around 9 million active accounts in Turkey. A society that was very comfortable in symmetrical networks discovered the power and potential of asymmetrical networks.

5. Overall, what does social media mean for revolutions?

Social media is the place for individual voices. It is very important for revolutions, because via social media we see there are thousands, millions of people out there just like us. I am not sure whether social media will be spoiled by power holders in the future, but for now, it can be the single place where an individual’s voice can be heard.

Eric Swayne on the Fundamentals of Good Data Narration

Leading up to SXSW, we’ll be doing Data Stories with SXSW presenters, starting with MutualMind’s Director of Product, Eric Swayne, who is speaking on “Scientist to Storyteller: How to Narrate Data.” (Gnip is hosting a SXSW event for those involved in social data; email for an invite that goes out in a month or so.)

Eric Swayne

1. In your SXSW session, you’re going to talk about being more than a “data janitor.” What do you mean by this?

Data Janitor is a term that resonates with many Analysts currently, as they’re basically being used for maintaining the facilities: scrubbing data sources, pushing data into prescribed buckets, rolling out the same reports because they’re the reports “we’ve always done.”  These are still important, but we have to aspire to more. People that live in data analysis have the crucial opportunity for extracting meaning that transforms businesses through data-driven decisions, and it takes much more than just pushing out the monthly graphs and charts.

2. How important is visualization to good data narration?

Great data visualizations put incredibly powerful tools in the hands of Data Narrators that enable them to tell better stories, as well as extract insights from incomprehensible data sets.  However, it’s critical not to confuse the visualization WITH the insight – they are distinctly separate, and not necessarily dependent upon each other.  A simple pie chart that tells a CEO exactly what they need to know to make a good decision isn’t any visual tour de force, but it clearly gets the job done.  In all cases, visualizations should serve the story: the string of insights that lead to data-driven decisions.  When a good picture makes a good idea stick, that’s when you know the Data Narrator has done their job.

3. What are the trademarks of good data narration?

You’ll often see three key hallmarks:

1. True Insights – An insight tells me something I don’t know, that I need to know, and that I can do something about.  If the data story doesn’t include these three elements, it’s factual or irrelevant, not insightful.

2. User-Centric Approach – Human Interface Design isn’t just for UX professionals – Analysts need to become more adept at its principles as well.  Everything we say in a report or dashboard through form, color, size or spatial relationships carries meaning – whether we intended it to or not.  Data Narrators not only understand their story but also their audience: what they’re used to seeing, how they might be biased against certain ideas, and what assumptions they’re making based on what they see and hear.

3. Idea Inception – I would call this “stickiness”, but we have a much better term for it now, thanks to Christopher Nolan and Leo DiCaprio! Work from great Data Narrators shows an intent to focus the audience on an idea, and make sure they remember it. Data Narrators often focus not on the meeting where they present their work, but the NEXT meeting their audience is having, and whether they remember what was said and use it.

4. When marketers misinterpret data, what are the ramifications that you see?

Of course the ultimate impact of misinterpreted data is bad decisions, but it usually starts by creating bad stories. Urban legends pervade businesses just like any other culture, and they often sound enough like real data that they’re not questioned. “Our site visitors click on blue more often,” “We’ve never had a good Q4 for product X,” “Twitter hasn’t driven sales for us like Facebook,” and on and on. These “data-ish” stories are particularly insidious because they’re often unquestioned assumptions, and voices that seek to pick them apart are often quashed as trying to “rock the boat.” Storytelling isn’t just a tool to be used for good or ill – it’s the default processing protocol for human brains. Where good, data-driven stories are NOT created, it leaves a vacuum that others will fill with whatever they remember.

5. What are your data pet peeves? What is the data equivalent of driving slow in the fast lane?

  • Confusing correlation with causation. I know, I know, we say this maxim so many times it should be the Data Scientist’s Golden Rule. But the fact is that it’s tremendously hard for humans to avoid this trap, particularly when correlations appear to validate the opinions we already have. This is why it’s incredibly important for us to question each other’s assumptions, and to be open to ours being questioned.
  • “Perfect” data. When charts show a straight line, or scatterplots neatly cluster, or r^2 results are incredibly high, I get suspicious. Nothing in nature is perfect, particularly when humans are involved – the reality is that while many of our behaviors can be consistent, they aren’t absolute.  It’s incredibly important that we use analytics and statistics to tell the story that the data tells us, not the one we want to say.
  • Trophy numbers.  When I start a new client engagement, I like to ask them what their Trophy Numbers are.  These are the stats and figures that are used to report upwards (and often justify jobs), but that we know have no inherent value.  Pageviews, Hits, Impressions, Potential Reach, and Asset Views are all often found in this category.  While these may be good symptoms of success, they almost always aren’t the way your business wins in the world.  Data Narrators don’t ignore these, but rather they lead clients on a journey from here to better KPIs that indicate real business success.

If you’re interested in more Data Stories, check out Gnip’s collection of 25 Data Stories. 

Data Story: Adam Sadilek on Tracking Food Poisoning With Social Data

Adam Sadilek has done some pretty groundbreaking research around social data, including tracking food poisoning with social data. When he was a Ph.D. student at the University of Rochester, he led a team that found that geotagged Tweets about foodborne illnesses closely aligned with restaurants that had poor scores from the health department. Adam is now a researcher at Google, and you can follow him on Twitter at @Sadilek.

Tracking Food Poisoning via Twitter


1. Where did your interest in identifying health trends on Twitter come from?

First, it was studying how Twitter can predict flu outbreaks and then looking at identifying food poisoning outbreaks too.

We were interested in how much we can learn about our environment by sifting through the vast amounts of day-to-day chatter online. It turns out that machine learning can identify strong signals that can be used to make predictions about individuals as well as venues they visit. For example, in our project, we predicted how likely a Twitter user is to become sick based on how many symptomatic people he or she met recently. We leveraged geotags within the Tweets to estimate people’s encounters. In the nEmesis project, our model identified Twitter users who got sick after eating at a restaurant, which enabled us to rank food establishments by cleanliness.
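As a rough illustration of the last step, ranking venues once sick visitors have been identified, here is a toy sketch. It is not the nEmesis model itself. It assumes a classifier has already labeled each geotagged visit with whether the visitor later tweeted about being ill, and it swaps in a simple smoothed illness rate as the score. The function `rank_venues`, the smoothing parameters and the venue names are all hypothetical.

```python
from collections import defaultdict

def rank_venues(visits, prior_sick=1, prior_visits=20):
    """Rank venues by a smoothed share of visits followed by an illness report.

    visits: iterable of (venue, reported_sick) pairs.
    The priors (an assumption, not from the interview) pull venues with
    very few visits toward a low baseline so one report does not dominate.
    """
    sick = defaultdict(int)
    total = defaultdict(int)
    for venue, reported_sick in visits:
        total[venue] += 1
        sick[venue] += int(reported_sick)
    rates = {
        venue: (sick[venue] + prior_sick) / (total[venue] + prior_visits)
        for venue in total
    }
    # Highest estimated risk first
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical labeled visits
visits = (
    [("Cafe A", True)] * 4 + [("Cafe A", False)] * 16
    + [("Diner B", True)] * 1 + [("Diner B", False)] * 39
    + [("Bistro C", True)] * 1
)
ranking = rank_venues(visits)
for venue, rate in ranking:
    print(f"{venue}: {rate:.3f}")
```

The smoothing is a design choice to cope with the noise in social data: a venue with a single visit and a single illness report is flagged, but it does not outrank a venue with a consistent pattern across many visits.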

2. Using Twitter data, your machine learning model can assign restaurants food-poisoning risk scores that match the Health Department’s. Is there any way to make nEmesis data public or offer it as an add-on to services such as Yelp?

There certainly is. Henry Kautz’s group at the University of Rochester is working on extending GermTracker to capture foodborne illness in real time as well.

3. What are the benefits and disadvantages of using social data over more traditional research on health patterns?

Online social media is very noisy, but significantly more timely. Many months pass between inspections of a typical restaurant. If they get a delivery of spoiled chicken a day after an A+ inspection, it will make their patrons sick anyway. Systems like nEmesis, on the other hand, can detect there is something going on very quickly. The flip side is that it’s hard to be certain on the basis of 140 characters. Therefore, we advocate for a hybrid approach, where inspectors use nEmesis to make better informed decisions. We can replace the current basically random inspections with a more adaptive workflow to detect dangerous venues faster.

4. What else do you think Twitter can tell us about public health?

We did a number of studies, focusing on multiple aspects of our health that can be informed by data mining online social media. Beyond flu and food poisoning, we looked at exposure to air pollution, mental health, commuting behavior, and other lifestyle habits. You can take a look at our publications at

If you’re interested in additional interviews with people using social data in research, check out our 25 Data Stories to hear about how researchers used social data to track cholera after Haiti’s earthquake. 

Data Story: John Foreman of MailChimp on the Data Science Behind Emails

When I was in charge of email at my last startup, the MailChimp blog was a must read. Their approach to email marketing is brilliant, so when my colleague suggested I interview MailChimp’s chief data scientist, John Foreman, for a Data Story, I was definitely on board. In addition to being a data scientist at MailChimp, John is also the author of Data Smart: Using Data Science to Transform Information into Insight. You can follow him on Twitter at @john4man.

1. People have a love/hate relationship with email. How can data science help people love email more and get more out of it?

Recently, people across industries seem to be waking up from their social-induced haze and rediscovering the effectiveness of direct email communication with their core audience.

Think about a true double-opted email subscription versus, say, a Facebook “like” of a product. When I like a product page on Facebook, do I really want to hear from them in my feed? In part, isn’t that “like” just an expression that’s meant for public display and not for 1-to-1 ongoing communication from the business?

Contrast that with email. If I opt into a newsletter, I’m not doing that for anyone but myself. Email is a private communication channel (I like the term “safe place”). And I want your business to speak to me there. That’s powerful. Now think of a company like MailChimp. We have billions of these subscriptions from billions of people all across the world. MailChimp’s interest data is unparalleled online.

OK, so that means that as a data scientist, I have some pretty meaty subscription data to work with. But I’ve also got individual engagement data. Email is set up perfectly to track individual engagement, both in the email, and as people leave the email to interact with a sender’s website.

So I use this engagement and interest data to build products — both weapons to fight bad actors as well as power tools to help companies effectively segment and target individuals with content that’s more relevant to the recipient. My goal is to make the email ecosystem a strong one, where unwanted marketing email goes away and the content that hits your mailbox is ideal for you.

For instance, MailChimp recently released a product called Discovered Segments that uses unsupervised learning to help users find hidden segments in their list. Using these segments, the sender can craft better content for their different communities of recipients. MailChimp uses the product ourselves; for example, rather than tell all our customers about our new transactional API, Mandrill, we used data mining to only send an announcement to a discovered segment of software developers who were likely to use it, resulting in a doubling of engagement on that campaign.
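The interview does not say which algorithm powers Discovered Segments, so the following is only a sketch of the general idea of finding segments with unsupervised learning. It swaps in plain k-means on two made-up engagement features per subscriber. The function `kmeans`, the feature definitions and the starting centroids are assumptions for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid in place
            for cluster, centroid in zip(clusters, centroids)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels

# Hypothetical features per subscriber:
# (click rate on developer content, click rate on design content)
subscribers = [
    (0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
    (0.1, 0.9), (0.2, 0.8), (0.15, 0.7),
]
centroids, labels = kmeans(subscribers, centroids=[subscribers[0], subscribers[3]])
print(labels)
```

A sender could then target only the subscribers in the first segment with a developer-focused announcement, much like the Mandrill example above. Fixed starting centroids keep the sketch deterministic; real systems typically use randomized or smarter initialization.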

2. How is data science structured at MailChimp? How big is your team, and what departments do you work with?

MailChimp has three data scientists, and our job as a little cell is to deliver insights and products to our customers. That sounds like business-speak, so let me break it down.

By insights, I mean one-off research and analysis of large data sets that’s actionable for the customer. And by products, I mean tools that the customer can use to perform data analysis themselves. If the tool or product isn’t useful or required by the customer, we don’t build it. A data science team is not a research group at a university, nor is it a place just to show off technologies to investors. We’re not here to publish, and we’re not here to build “look at our data…ooooo” products for the media. Whenever a data science team is involved in those activities, I assume the business doesn’t actually know what to do with the technical resources they’ve hired.

Now, who is the “customer” in this mission? We serve other teams internally as well as MailChimp’s user base. So an example of a data product built for an internal customer would be Omnivore – our compliance AI model, while an example of a data product built for the general user population would be our Discovered Segments collaborative filtering tool.

We work very closely with the user experience team at MailChimp — the UX team is constantly interviewing and interacting with our users, so they generate a lot of hypotheses which we investigate using our data. The UX team, because their insight is built quickly from human interactions, can flit from thought to thought and project to project; when they think they’re onto something good, they kick the research idea to the lumbering beast that is the data science team. We can comb through our billions of records of sends, clicks, opens, http requests, user and campaign metadata, purchase data, etc. to quantitatively back or dismiss their new thinking.


Data Science Team as Robots <– DATA SCIENCE TEAM

3. Your book, Data Smart, is about helping to teach anyone to get value out of data. Why did you see a need for this book? 

I used to work as a consultant for lots of large organizations, such as the IRS, DoD, Coca-Cola, and Intercontinental Hotels. And when I thought about the semi-quantitative folks in the middle and upper rungs of those organizations (people more likely to still be using the phrase “business intelligence” as opposed to “data science”), I realized there was no way for those folks to dip their toe into data science. Most of the intro books made a lot of assumptions about the reader’s math education background, and they depended on R and Python, so the reader needed to learn to code at the same time they learned data science. Furthermore, most data science books were “script kiddy” books: the reader just loaded stuff like the SVM package, built an AI model, and didn’t really know how the AI algorithms worked.

I wanted to teach the algorithms in a code-free environment using tools the average “left behind” BI professional would be familiar with. So I chose to write all my tutorials in Data Smart using spreadsheets. At the same time though, I pride myself on writing a more mathematically deep intro text than what you find in many of the other intro data science texts. The book is guided learning; it’s not just a book about data science.

Now, I don’t leave the reader in Excel. I guide them into using R at the end of the book, but I only take them there after they understand the algorithms. Anything else would be sloppy.

Another reason I wrote the book is because the market didn’t have a broad data science book. Most books focus on one topic, such as supervised AI. Data Smart covers data mining, supervised AI, basic probability and statistics, optimization modeling, time series forecasting, simulation, and outlier detection. So by the time the reader finishes the book, they’ve got a Swiss Army knife of techniques in their pocket and they’re able to distinguish when you use one technique and when you use another. I think we need more well-rounded data scientists, rather than the specialists that PhD programs are geared to produce.

4. You’ve written a book, maintain a personal blog and write for MailChimp. How important have communication and writing skills become to data scientists?

I believe that communication skills, both writing and speaking, are vital to being an effective data scientist. Data science is a business practice, not an academic pursuit, which means that collaboration with the other business units in a company is essential. And how is that collaboration possible if the data scientist cannot translate problems from the high-level vague definition a marketing team or an executive might provide into actual math?

Others in an organization don’t know what’s mathematically possible or impossible when they identify problems, so the data science team cannot rely on them to fully articulate problems and “throw them over the fence” to a data science team ready-to-go. No, an effective data science team works as an internal, technical consultancy. The data science team knows what’s possible and they must communicate with colleagues and customers to understand processes and problems deeply, translate what they learn into something data can address, and then craft solutions that assist the customer.

5. Time for the Miss America question. If you had access to any data in the world, what is the question or problem you’d most like to solve?

I am a huge fan of Taco Bell. And I recognize that the restaurant actually has very few ingredients to work with — their menu is essentially an exercise in combinatorial math where ingredients are recombined in new formats to produce new menu items which are then tested in the marketplace. I’d love to get data on the success of each Taco Bell menu item. Combined with possible delivery format information, nutrition information, flavor data, and price elasticity data, I’d love to take a swing at algorithmically generating new menu items for testing in the market. If sales and elasticity data were timestamped, perhaps we could even generate menu items optimized for and only available during the stoner-friendly “fourthmeal.”
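
The “exercise in combinatorial math” can be sketched in a few lines. This is a toy illustration only: the ingredient list, the delivery formats, and the rule of exactly three fillings per item are made-up assumptions, and a real generator would then score each candidate against the sales, nutrition, and price data described above.

```python
from itertools import combinations, product

# Hypothetical ingredients and delivery formats, invented for illustration.
ingredients = ["beef", "beans", "cheese", "lettuce", "rice"]
formats = ["taco", "burrito", "bowl"]


def candidate_items(ingredients, formats, fillings_per_item=3):
    """Yield every (format, fillings) pair with a fixed number of fillings."""
    for fmt, fillings in product(formats, combinations(ingredients, fillings_per_item)):
        yield fmt, fillings


candidates = list(candidate_items(ingredients, formats))
print(len(candidates))  # 3 formats * C(5, 3) combinations = 30
print(candidates[0])    # ('taco', ('beef', 'beans', 'cheese'))
```

Even five ingredients and three formats give 30 candidates, which is why some scoring model is needed to pick the few worth testing in the market.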

Thanks to John for taking the time to speak with Gnip! If you’re interested in more Data Stories, please check out our collection of 25 Data Stories featuring interviews with data scientists from Kaggle, Foursquare, Pinterest, bitly and more! 

Data Story: How Microsoft Research is Using Social Data to Understand Depression

Sometimes the use cases for social data go far beyond what you would expect is ever possible. Such is the case with Microsoft Research, which has done some really groundbreaking work on using social data to study depression and on whether you can tell if someone is depressed from their activity on Twitter. We interviewed Dr. Munmun de Choudhury of Microsoft Research to ask about their research using social data to study mental health, the privacy implications and how social data can improve mental health. Dr. De Choudhury will be joining Georgia Tech’s School of Interactive Computing as an assistant professor this Spring.

Munmun de Choudhury of Microsoft Research

1. What are the high-level takeaways you found on using Twitter to research depression?

This research direction has revealed, for the first time, how social media activity such as that on Twitter can provide valuable indicators of mental health conditions, e.g., depression. Twitter has much noise, but it is promising to see that there are signals hidden in there too that can tell us about important issues such as health and lifestyle, both at the level of an individual and at the scale of larger populations. The most prominent signals of depression lie in people’s social activity (i.e., to what extent they post, what kind of posts they share, when they mostly post), their social network structure (e.g., how they are connected to their friends and friends of friends), and the linguistic style of the content they share. That these rather implicit signals (e.g., a person may never explicitly mention they are “depressed”) can indicate people’s mental and behavioral issues was rather a surprise to us, though when we consulted with psychologists (in fact one of the collaborators in this project was a psychologist), we learned that mental health may manifest itself via various nuances in people’s everyday behavior. This gives us hope that observing people’s social media use over time (something that is increasingly gaining popularity) can be used to build tools, forecasting algorithms, interventions, and prevention strategies, for both individuals themselves and policymakers, to help them deal with and manage this medical condition in a better way.

2. What are the privacy concerns of studying mental health with social data?

Studying mental health with online social data is extremely attractive, and can have widespread implications in enabling better healthcare; however it comes with its own set of privacy and ethics related challenges which cannot be ignored. A number of questions may arise: Can we design effective interventions for people, whom we have inferred to be vulnerable to a certain mental illness, in a way that is private, while raising awareness of this vulnerability to themselves and trusted others (doctors, family, friends)? In extreme situations, when an individual’s inferred vulnerability to a mental illness is alarmingly high (e.g., if the individual is suicide-prone), what should be our responsibility as a research community? For instance, should there be other kinds of special interventions where appropriate counseling communities or organizations are engaged? That is, finding the right types of interventions that can actually make a positive impact on people’s behavioral state as well as abide by adequate privacy and ethical norms is a research question on its own. I hope this line of work triggers conversations and involvement with the ethics and medical community to investigate opportunities and caution in this regard.

Additionally, as a community, we need to be aware of the limits up to which such inferences about illness or disability can be deemed to be safe for an individual’s professional and societal identity. In a sense, we need to ensure that such measurements do not introduce new means of discrimination or inequality given that we now have a mechanism to infer traditionally stigmatic conditions which are otherwise chosen to be kept private. These and other potential consequences such as revealing nuanced aspects of behavior and mental health conditions to insurance companies or employers make resolution of these ethical questions critical to the successful use of these new data sources and the research direction.

3. If you’re able to identify depressed individuals with social media, what does prevention and intervention look like?

In terms of prevention, the ability to automatically and privately infer concerns in people’s mental health can enable health professionals to be more proactive and make arrangements that improve the access of at-risk individuals to appropriate medical help. At the same time it can help policymakers better understand the incidence of different diseases, such as depression, which is extremely underreported and considered socially stigmatic, so that people can benefit from better healthcare practices. Further, the population-scale trends of depression over geography, time or gender may be a mechanism to trigger public health inquiry programs to take appropriate and needful measures, or allocate resources in a timely manner.

In terms of intervention, since our estimates of depression can be made considerably more frequently than conventional surveys such as those by the Centers for Disease Control (CDC), the estimates can be utilized from time to time to enable early detection and rapid treatment of depression in sufferers. At the individual level, a variety of personalized and private tools may be developed that may help individuals better manage depression as well as help them seek social and emotional support easily.

4. What made you interested in studying “collective human behavior as manifested via our online footprints”?

I have always been fascinated by how new and emergent technologies online (e.g., Facebook and Twitter) are increasingly getting into the mainstream of our lives. While we know and realize that our actions on these platforms serve as a reflection of characteristics in the physical world (e.g., several of our Facebook friends are actually friends in real life), I was curious as to whether the increasing use of these platforms is impacting our behavior in some way, that is, if the reverse is true. For example, does it affect the way we emote, interact, or build social ties with others? That is one reason I became very interested in exploring deeply the aspects of our behavior that show in what we say and what we do online.

The other motivation lies in my inherent penchant to study people. The web and particularly social networks and social media provide us with a very powerful tool to do so, in a way that the behavioral findings are derived non-intrusively from people’s day-to-day activities, and because of the scale of the data are mostly generalizable. Lastly on a humorous note, computer scientists are often labeled to be socially awkward; so perhaps you can assume that this particular computer scientist intends to show that “hey, even we can be socially cool too, and even make sense of your social actions on the web!”.

5. Where do you see the future of health research and social data going?

People are increasingly joining social media sites with a goal to remain connected as well as learn about what is going on around them, and there are people who have been using these sites for years now. As Twitter and Facebook’s penetration increases, it will give us a rich source through which we can observe individual-centric behavior over time, and consequently use those trends to understand when and where unexpected or anomalous behavior, e.g., concerning health issues, may emerge. At the population level, large-scale naturalistic data obtained from the web may provide rich insights into understanding health concerns and health outcomes which may not be possible with traditional survey methods. This is because surveys are often retrospective, and hence lack the immediacy of the context in which policies may be changed or influenced, or interventions made for enabling better healthcare.

Even more so, I hope that in the future, social media use can be leveraged to identify health issues in difficult-to-reach populations, or populations who would otherwise not reveal a condition due to social stigma. For instance, one of the many challenges hindering the global response to some of the extremely deadly diseases like AIDS is the difficulty of collecting reliable information about the populations who are most at risk for the disease. Since social media use is consistently gaining more and more ground, these sites might be the new platform wherein activity traces may be utilized to identify, with appropriate privacy policies enforced, particular vulnerable populations, and enable them to receive better healthcare and help as the need may arise.

If you’re interested in more data stories, please check out our collection of 25 Data Stories for interviews with data scientists from Pinterest, Foursquare, and more! 


Data Story: Interview with Christyn Perras of YouTube

I have long wanted to interview someone from YouTube as I think their social data is fascinating and incredibly vast. Every minute, 100 hours of video are uploaded to YouTube. Christyn Perras, a quantitative analyst at YouTube, is talking with Gnip about the career path to being a data scientist, the tools in her arsenal, YouTube’s data-driven culture and Coursera. 

Christyn Perras

1. What was your path to becoming a quantitative analyst at YouTube? What would you recommend for others?

As an undergraduate, I studied psychology and was particularly drawn to the experimental side of the discipline. When I was considering an advanced degree, I concentrated on the aspects of psychology that I loved during my search for a graduate program. I eventually found a program that focused on applied statistics and experimental design at the University of Pennsylvania, where I received an MS and PhD. However, even after graduation, my career path remained unclear and the tech industry wasn’t even on the radar. It was when I started looking for jobs using search terms referring to my skill set rather than job titles that I saw a world of opportunity unfold in front of me.

My first job on the west coast was at Slide, a social gaming company. It was an amazing experience. At Slide, I used my psychology background to understand our users and the way they interacted with our products. In addition, my background in statistics and experimental design gave me the skills to study, test, quantify and interpret user behavior and to measure the impact of our influence. We sought answers to questions such as: Why were these people using our products? What made them come back? And what could we do to change their behavior and/or enhance their experience? I am now doing this at YouTube and concentrate my efforts on understanding our creators and continuing to improve their YouTube experience via foundational research and experiment analysis.

2. I’ve noticed that Google doesn’t tend to use the title of data scientist. Is there a reason for this?

Not that I’m aware of. Data scientist, quantitative analyst, statistician and decision support analyst are all fairly interchangeable terms in the tech industry. As I mentioned before, my job search was most successful when I used keywords related to my skills and interests (statistics, psychology, experiments) rather than searching job titles (statistician). However, I imagine with the rising popularity and awareness of the field, naming conventions for job titles will likely become more standardized.

3. What is one of the most surprising aspects you’ve learned about YouTube data?

Honestly, I was surprised by the sheer amount of data! It is staggering. I had to learn a number of new programming languages and techniques just to be able to get the data I needed for an analysis into a manageable format. During my time at Penn, SAS, SPSS and SQL were the preferred tools and were incorporated into the curriculum. Without a more extensive computer science background, areas such as MapReduce and Python were quite new to me. I’m also continually expanding my knowledge and experience with techniques used to manipulate, reshape and connect data on this scale. When working with billions of data points, you often need to think creatively.

4. How do quantitative analysts work with product managers to shape YouTube?

There is a strong data-driven culture at YouTube and, as a result, product managers and analysts work very closely. In the case of a product change or redesign, analysts are involved in the process from the start. Early involvement ensures, for example, that the data necessary for analysis are collected, experimental arms are set up correctly, and logging is accessible and bug-free. We discuss the goals and expectations of product changes in depth to make sure analyses are designed to answer the right question and will produce valid, actionable results. Analysts and product managers typically have a steady dialogue throughout the course of analysis. Once the analysis is complete, we discuss the results, interpret the meaning, consider the implications, and make decisions about the next steps.

5. How do you think companies such as Coursera are changing fields such as data science?

I love Coursera! My favorite courses include Data Analysis with Jeff Leek and Computing for Data Analysis with Roger Peng. Coursera is doing something truly great and I look forward to seeing how they grow and progress. Data science is a bit nebulous in terms of education (at least it was when I was in school). There wasn’t a “data science” major or anything like that, so it was necessary to piece it together yourself. I have an amazing team with wildly different backgrounds from physics to psychology to economics. I love bouncing ideas off my colleagues and am guaranteed wonderfully unique and clever perspectives. Companies like Coursera make dynamic teams like this possible by giving people from a wide variety of disciplines access to the additional education they need to shape their career path and be successful in their job.

Another amazing resource for future data scientists is OpenIntro, co-founded by my colleague David Diez. With OpenIntro, you’ll find a top-notch, open-source statistics textbook and a wealth of supporting material.

Thanks to Christyn for the interview! If you’re interested in reading more interviews, check out Gnip’s collection of 25 Data Stories for interviews with data scientists from Foursquare, Pinterest, Kaggle and more. 

Data Story: Michele Trevisiol on Image Analysis

Social media content is frequently shifting to a visual medium and companies are often having a harder time understanding and digesting visual content. So I loved happening upon Michele Trevisiol, PhD student at Universitat Pompeu Fabra and PhD intern at Yahoo Labs Barcelona, whose research focused on image analysis for videos and photos. 

Michele Trevisiol

1. Your research has always revolved around analyzing videos and photos, and this is an area that the rest of the world is struggling to figure out. Where do you see the future of research around this heading?

From my own experience, I can see a huge amount of work for research in the near future. Every day there are tons of new photos and videos uploaded online. Just think about Facebook, Flickr, YouTube, Vimeo and many other new services. Actually, in one of our projects we are working with Vine, a short-video app that allows you to record six-second looping videos. Twitter bought it in January, and Vine has already reached 40 million users.

Just to give some numbers, recent studies estimated that about 880 billion photos will be uploaded in 2014, without considering other multimedia data. This volume of information makes it very hard for the user to explore such a large multimedia space and pick the most interesting items. Therefore, we need smart and efficient algorithms to fix this. The amount of data is growing every day, so researchers need to keep improving their systems to analyze, understand and rank the media items in the best way possible (often in a personalized way for the user).

Researchers have studied these topics from many different angles. Working with multimedia objects involves analyzing the content of the data (i.e., computer vision tasks like object detection, pattern recognition, etc.), understanding the textual information (e.g., meta-data, description, tags, comments), or studying how media is shared in the social network. This is a research space that still has many things left to explore.

2. You’ve previously researched how to determine geo data from video based on a multi-modal approach based on how videos are tagged, user networks, and other meta-data. What are the advantages of understanding the geo location of videos?

You can see the problem from a different point of view. If you have complete knowledge about the multimedia items you have, like the visual content, the meta-data, the location, or even how they were made (technical details), then this data would be easy to classify and to discover for the users. All the information about an item helps researchers to understand its properties and its importance for the users who are looking for it. However, very often this information is missing, incomplete or even wrong. In particular, the geo location is not provided on the vast majority of photos and videos online.

Only in recent years has there been an increase in cameras and phones with automatic geo-tagging (able to record the latitude and longitude where the photo was taken). As a result, just a few multimedia items have this information. Being able to discover the geo location of videos and images helps you to organize and classify them, and helps users to find items related to any specific location, improving search and retrieval. We presented a multi-modal approach that takes into account various factors related to the given video, like the content, the textual information, the user’s profile, the user’s social network and her previously uploaded photos. With such information, the accuracy of the geo location prediction improved dramatically.

Recently, researchers have been spending more effort on this topic, mainly due to the increasing interest in the mobile world. In the near future, the activity in this area is destined to increase.

3. How are the browsing patterns different for people viewing photos than text? What motivates people to click on different photos?

The browsing patterns are strongly biased by the structure of the website and, of course, by the type of website. The first case is quite obvious as the users browse the links that are more evident on the page, therefore the way the website selects and shows the related items is really important. The latter case instead is related to the type of website.

Consider, for example, a site where many users land from some search engine and read just one article. This means that the goal of the user is very focused, as she’s able to define the query, click on the right link, consume the information, and leave. But that’s not always the case, as there are also users who browse deep and look at many articles related to one topic (e.g., about TV series, episodes, actors, etc.). If you consider news websites instead, the behavior is different, as the user could enter only to spend some time, to take a break, to get distracted with something interesting. A photo sharing website presents yet another kind of behavior, often characterized by the social network structure. Many users interact mostly with the photos shared by friends or contacts, or they like to get involved in social groups, trying to get as much visibility and positive feedback as possible.

The main interest of any service provider is to keep the user engaged as long as possible on its pages. To do this, it shows the links with the highest interest for the users to keep them clicking and browsing. That’s what the user wants as well: she wants to find something interesting for her needs. The rationale is similar for photo sharing sites, but the power of the image to catch the interest at first glance is an important difference. For example, in Flickr there are “photostreams” (sets of photos) shown to the user for each image she is watching. Slideshows show image thumbnails in order to catch the interest of the user with the content of the recommended images. Recently, we ran a study on these specific slideshows and found that users love to navigate from one slideshow to another instead of searching directly for images or browsing specific groups. We also tried recommending different slideshows instead of different photos, with positive and interesting results.

4. Much of your research has focused around Flickr. How does data science improve the user experience of Flickr?

Recently Flickr has improved a lot in this direction, for example with the new free accounts with 1TB of free storage, or the interface that has been recently refreshed. But the changes are just at the beginning.

In general, the data scientists need first to study and understand how the users are engaging with the websites, how they are searching and consuming the information, how they are socializing, and especially what they would like to have and to find on the website. In order to improve the user experience you need to know the users and then to work on what (and how) the website can offer to improve their navigation.

5. Based on your research around photo recommendations, what characteristics make photos most appealing to viewers?

This is a complex question, as appeal is subjective and changes for each user; in particular the taste, or even better, the interest of the user changes over time. Some days you’re looking for something funnier, so maybe the aesthetics of the image are less important. Other days instead, you get captured by very cool images that can be professional or just incredible amateur shots.

In the majority of cases each user is quite unique in terms of taste, so you need to know what she appreciated before and how her taste changed over time in order to show her the photos that she is most likely to enjoy. On the other hand, there are cases that can catch the interest of any user in an objective way. For example, in photos related to real-world events the content is highly informative, while the quality and the aesthetics are often ignored.

In a research work that we presented last year, we compared different ranking approaches as a function of various factors. One of these was the so-called external impact. With this feature we could measure how much interest the image has outside Flickr, in other words, how much visibility the image has on the Web. If an image uploaded by a Flickr user to her page has a huge number of visits coming from outside (e.g., other social networks, search engines), it means this image has high attractiveness that needs to be considered even if inside the network it does not show particular popularity. We found that this could also be a relevant factor to be considered in the ranking, and we are still investigating this point.
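
A minimal sketch of how an external-impact signal might be folded into a ranking follows. It is not the method from the research described above: the `Photo` fields, the definition of external impact as the share of visits arriving from outside the site, and the linear blend controlled by `weight` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    photo_id: str
    internal_views: int  # visits from inside the photo-sharing site
    external_views: int  # visits from other social networks, search engines, etc.


def external_impact(photo):
    """Share of a photo's visits that come from outside the site (assumed definition)."""
    total = photo.internal_views + photo.external_views
    return photo.external_views / total if total else 0.0


def rank(photos, weight=0.5):
    """Order photos by a blend of in-network popularity and external impact."""
    max_internal = max((p.internal_views for p in photos), default=0) or 1

    def score(p):
        popularity = p.internal_views / max_internal  # normalized to [0, 1]
        return (1 - weight) * popularity + weight * external_impact(p)

    return sorted(photos, key=score, reverse=True)


# Hypothetical data: "b" is unremarkable inside the network but heavily visited from outside.
photos = [
    Photo("a", internal_views=1000, external_views=50),
    Photo("b", internal_views=100, external_views=900),
    Photo("c", internal_views=10, external_views=0),
]

print([p.photo_id for p in rank(photos, weight=0.0)])  # popularity only
print([p.photo_id for p in rank(photos, weight=0.8)])  # external impact dominates
```

With `weight=0.0` the ranking follows in-network popularity alone, so photo "a" leads; raising the weight lets photo "b" surface even though it shows no particular popularity inside the network.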


Data Stories: Tyler Singletary of Klout

This Data Story is with Tyler Singletary, Director of Platform at Klout, and we’re talking about Klout Scores, Klout data, international influence and more. Klout is an extremely popular Gnip enrichment and it’s clear that the world is interested in Klout data, so we thought this would be a fun interview. You can keep up with Tyler on Twitter at @harmophone.

Tyler Singletary from Klout

1. Gnip’s CTO Jud Valeski has a Klout Score of 56, our VP of Product Rob Johnson has a Klout Score of 50, and I have a Klout Score of 60. If you’re a business, who do you give an offer to?

It depends on what you’re looking for. While the Klout Score is an expression of a user’s potent network effects, Klout Topics are an expression of where and what the user drives engagement on, and to some degree, where their interests and passionate community lie. If I wanted to reach Engineers and Boulderites, Jud would be a good choice: his audience is very engaged with him on those topics. We’ve been leveraging this understanding and segmentation in our products for years now.

2. What has the introduction of Cinch meant for Klout? Will Cinch data be available through the Klout API?

Cinch is another way to look through the prism of what it means to influence someone, and is built on the idea of social authority being an important piece in the collaborative economy. To that end, again, Cinch is built on Klout Topics as an important way to quickly find subject-matter authority and trust, combined with a user’s personal network.

Cinch is in the early days as a product, but we can easily imagine it as part of the platform in the future. There’s already a wealth of good advice around lifestyle topics, and it would be a fantastic channel for businesses and consumers to integrate with to pose engaging questions and take a pulse on influencers recommending their products.

3. What is the potential for companies using Klout in a CRM?

Klout’s influence graph helps CRM users gain insight into users, enabling them to improve prioritization of issues and outreach, as well as customer satisfaction. By finding their most influential customers, these companies can also streamline outreach and word-of-mouth marketing to increase new business. As the truly “Social CRM” platform evolves, and inbound leads are generated from social feeds, it will become increasingly important to know how and what influence leads and customers have. It’s about finding more relevance and reach.

4. Klout is big in Japan. What are the challenges of defining influence in different countries?

I tend to think of things in terms of “units of influence.” Each network has a different set of actions that can be influential. You have an actor, an actee, an interaction, and a subject (or two or three). In terms of Japan, and other countries using different languages, you still have an actor, actee, and interactions: all of the units that go into the Klout Score. What we need to build is an interpreter for the subjects (effectively, Klout Topics), using the character sets, grammars, and dictionaries mapped to the common and unique meanings. This is not a trivial undertaking.
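
To make the “units of influence” idea concrete, here is a minimal sketch of one unit as a record. The class name, field names, and sample users are illustrative assumptions, not Klout’s actual data model. It also assumes the actor is whoever performs the interaction and the actee is whoever receives it. The point it shows is that counting interactions needs no language processing, while subjects do.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InfluenceUnit:
    """One hypothetical 'unit of influence': who acted on whom, how, and about what."""
    actor: str            # assumed: the user performing the interaction
    actee: str            # assumed: the user on the receiving end
    interaction: str      # network-specific action, e.g. "retweet" or "like"
    subjects: tuple = ()  # one or more subjects; empty until the text is interpreted


def interactions_received(units):
    """Count units per actee; this step is independent of the post's language."""
    return Counter(unit.actee for unit in units)


# Made-up sample data for illustration only.
units = [
    InfluenceUnit("alice", "jud", "retweet", ("engineering",)),
    InfluenceUnit("bob", "jud", "reply", ("boulder",)),
    InfluenceUnit("carol", "rob", "like"),  # subject not yet interpreted
]
print(interactions_received(units))
```

The third unit has no subjects, which mirrors the situation described for Japanese-language activity: the interaction still counts, but the topic is unknown until an interpreter exists.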

So the Klout Score is effective and applies easily to activity on the networks we’re already surveying in Japan, South Korea, and other countries. Klout Topics will take an investment of resources, but it’s not an impossible problem. Another interesting way to look at the problem, a short circuit, is to delve deeper into content. A URL is a URL in any language. If you can understand what that destination is about, as a subject of influence, you may be able to come to a solution for the Topic problem before tackling the language issues.
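
The URL short circuit can be sketched as a lookup keyed on the link’s destination rather than on the post’s text. The host-to-topic table and the function name below are made up for illustration. A real system would need some way of classifying destinations, which the interview does not describe.

```python
from urllib.parse import urlparse

# Hypothetical table mapping a link's host to a topic (illustrative values only).
TOPIC_BY_HOST = {
    "github.com": "software engineering",
    "espn.com": "sports",
}


def topic_from_url(url):
    """Return a topic for the URL's destination, or None if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    return TOPIC_BY_HOST.get(host)


# The post text can be in any language; only the link is inspected.
post = {"text": "これは面白い", "url": "https://www.github.com/example/repo"}
print(topic_from_url(post["url"]))
```

Unknown hosts return `None` rather than a guess, so the language-dependent interpreter would still be needed for posts without a recognizable link.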

There’s also the diversity of networks, and the availability of data. From my research, platforms like LINE and Gree aren’t yet opening up APIs or partnering with the Gnips of the world. Access to the wealth of data on what might be the more dominant platforms in other countries is a problem that still needs to be solved.

5. What does the future of Klout data look like?

Klout Topics are a constantly adapting and improving system. I hope to have us release different prisms to view them through, such as standard ontologies like the IAB, while still retaining the adaptive and “in the moment” nature that social data requires. I think you’ll see us encouraging developers and companies to build into the platform more and to derive new insights from aggregated data and around individual pieces of content, and you’ll see us make more inroads in our offering there. With Cinch we’re proving that there are broad use cases for influence data, and we’ve been encouraging the platform community to build on that premise.

6. What are common misconceptions people have about Klout?

There’s still this thought that we are only about the Klout Score. Topics have been around for several years now. There’s also this sense that it’s a value judgment, or a rank-order. We’re none of these things. We have scientists and social people working on the tough problems in social media, and really, in a society where money drives so much, people should be recognized and rewarded, even indirectly, for the impact they have in their networks.

7. If you’re a business, what is the first thing you should know about Klout?

Klout is the best platform for driving authentic earned media, and our data is the best lens with which to capture, catalog, and understand all of the earned media being generated around products, entertainment, services and brands.

Thanks to Tyler for the interview! If you’re interested in more data stories, check out our compilation of 25 Data Stories from Gnip!