Gilad Lotan is the chief data scientist for betaworks, which has launched some pretty incredible companies including SocialFlow and bitly. I was very excited to interview him because it’s so rare for a data scientist to get to peek under the hood of so many different companies. He’s also speaking at the upcoming SXSW on “Algorithms, Journalism and Democracy.” You can follow him on Twitter at @gilgul.
(Gnip is hosting a SXSW event for those involved in social data. Email email@example.com for an invite.)
1. As the chief data scientist for betaworks, how do you divide your time amongst all of the companies? Basically, how do you even have time for coffee?
First of all, we take coffee very seriously at betaworks, especially team digg, who got us all hooked on the chemex drip. There’s always someone making really good coffee somewhere in the office.
Now beyond our amazing coffee, betaworks is such a unique place. We both invest in early stage companies, and incubate products, many of which have a strong data component. We’ve successfully launched companies over the past years, including Tweetdeck, Bitly, Chartbeat and SocialFlow. There are currently around 10 incubations at betaworks, at various stages and sizes.
Earlier this year, we decided to set up a central data team at betaworks rather than separate data teams within the incubations. Our hypothesis was that leveraging data wrangling skills and infrastructure as a common resource would be more efficient, and provide our incubations with a competitive advantage. Many companies face similar data-related problem-sets, especially in their early stages. From generating a trends list to building recommendation systems, the underlying methodology stays similar even when using different data streams. Our hope was that we could re-use solutions provided in one company for similar features in another. On top of that, we’re taking advantage of common data infrastructure, and using data streams to power features in existing products or new experiments.
When working with data, much of the time you’re building something you’ve never done before. Even if the general methodology might be known, when applied to a new dataset or within a new context there’s lots of tweaking to be done. For example, even if you’ve used naive bayes classifiers in the past, when applied towards a new data stream, results might not be good enough. So planning data features within product releases is challenging, as it is hard to predict how long development will take. And then there’s knowing when to stop, which isn’t necessarily intuitive. When is your model “good enough”? When are results from a recommendation system “good enough”?
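One way to make the "good enough" question concrete is to fix an accuracy target on held-out data before tuning begins. The sketch below is only an illustration, not anything betaworks is known to use: it is a tiny naive Bayes text classifier with Laplace smoothing, and the toy documents, the labels, and the 0.9 threshold are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(model["words"][label].values())
        for word in text.lower().split():
            count = model["words"][label][word]
            # Laplace (add-one) smoothing so unseen words do not zero out a label.
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def good_enough(model, held_out, threshold):
    """Return (accuracy, passed) for held-out (text, label) pairs."""
    correct = sum(1 for text, label in held_out if predict(model, text) == label)
    accuracy = correct / len(held_out)
    return accuracy, accuracy >= threshold


# Hypothetical toy data, invented for this sketch.
train = [
    ("rain storm flood warning", "weather"),
    ("sunny warm forecast", "weather"),
    ("stocks market rally", "finance"),
    ("bank earnings market", "finance"),
]
held_out = [("storm forecast", "weather"), ("market earnings", "finance")]

model = train_naive_bayes(train)
accuracy, passed = good_enough(model, held_out, threshold=0.9)
```

The threshold itself is a product decision rather than a statistical one, which is part of why it is hard to schedule: the same classifier can clear the bar on one data stream and miss it on the next.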
The data team is focused on building data-driven features into incubated products. One week I’ll be working on scoring content from RSS feeds, and the following week I might be analyzing weather data clusters. We prioritize based on the stage of the company and the importance of this data for the product’s growth. We tend to focus on companies that have high volumes of data, or are seeking to build features that rely on large amounts of data munching. But we keep it fairly loose. We’re small and nimble enough at betaworks that prioritization between companies has not been an issue yet.
I’m aware that it will become more challenging, especially as companies grow in size.
2. Previously, you were the VP of Research & Development at SocialFlow. How did data science figure into product development?
At SocialFlow the data team built systems that mine massive amounts of data, including the Twitter public firehose. From distributed ingestion to analytics and visualization, there were a few ways in which our work fed into the product.
The first, and most obvious, was based on the product development roadmap. In an ideal situation, the data team’s work is always a few steps ahead of the product’s immediate needs, and the team can integrate its work into the product when needed. At SocialFlow, we powered a number of features including real-time audience trends, and personalized predictions for performance of content on social media. In both cases the modules were developed as a part of the product launch cycle and continuously maintained by the data team.
The second way in which we affected product development was by constantly running experiments. Continuous experimentation was a key way in which we innovated around our data. We would take time to test out hypotheses and explore different visualization techniques as a way to make better sense of our data. One of our best decisions was to bring in data science interns over the summer. They were all in the midst of their PhDs and incredibly passionate about data analysis, especially data from social networks. The summer was an opportunity for them to run experiments at a massive scale, using our data and infrastructure. As they learned our systems, each chose a focus and spent the rest of their time running analyses and experiments. Much of their work was integrated into the product in some manner. Additionally, several published their findings in academic journals and conference proceedings. Exploratory data analysis may be counter-productive, especially when there are no upper bounds set on when to stop the experimentation. But with strict deadlines, it may be invaluable.
The third, and most surprising for me, was storytelling. We made sure to always blog about interesting phenomena that we were observing within our data. Some were directly related to our business (publishers, brands and marketers), but much was simply interesting to the general public. We added data visualizations to make the posts more accessible. From the Osama Bin Laden raid, to the spread of Invisible Children’s Kony2012 video, we were generating a sizable amount of PR for SocialFlow just by blogging about interesting things we identified in our data. The attention was nice, and some great business and product opportunities came out of it as well.
Working with interesting data? Always be telling stories!
3. In your SXSW session “Algorithms, Journalism and Democracy,” you’ll speak to the bias behind algorithms. What concerns you about these biases and what do people need to know?
There are a growing number of online spaces which are a product of an automated algorithmic process. These are the trending topics lists we see across media and social networks, or the personalized recommendations we get on retail sites. But how do we define what constitutes a “trend” or what piece of content should be included in our “hottest” list? Oftentimes, it is not simply the most read or most clicked-on item. What’s hot is an intuitive and very human assessment, yet capturing it in a formula is mathematically complex, if it is possible at all. We’ve already seen numerous examples where algorithmically generated results led to awkward outcomes, such as Amazon’s $23,698,655.93 priced book about flies or Siri’s inability to find abortion clinics in New York City.
These are not Google, Apple, Amazon or Twitter conspiracies, but rather the unintended consequences of algorithmic recommendations being misaligned with people’s value systems and expectations of how the technology should work. The larger the gap between people’s expectations and the algorithmic output, the more user trust will be violated. As designers and builders of these technologies, we need to make sure our users understand enough about the choices we encode into our algorithms, but not so much that they can game the systems. People’s perception affects trust. And once trust is violated, it is incredibly difficult to gain back. There’s a misplaced faith in the algorithm: an assumption that it is fair and unbiased, and that it accurately represents what we think is “truth”.
While it is clear to technologists that algorithmic systems are always biased, the public perception is one of neutrality. It’s math, right? And math is honest and “true”. But when building these systems there are specific choices made, and biases encoded due to these choices.
In my upcoming SXSW session with Poynter Institute’s Kelly McBride, we’ll be untangling some of these topics. Please come!
4. When marketers typically look at algorithms, most try to game the system. How does the cycle of programmers trying to stop gaming and those trying to game it play into creating biases?
Indeed. In spaces where there’s a perceived value, people will always try to game the system. But this is not new. The practice of search engine optimization is effectively “gaming the system”, yet it’s been around for many years and is considered an important part of launching a website. SEO is what we have to do if we want to optimize traffic to our websites. Like a game of cat and mouse, as builders we have to constantly adjust the parameters of our systems in order to make them harder to game, while making sure they preserve their value. With Google search, for example, while we have a general sense of what affects results, we don’t know precisely, and they constantly change it up!
Another example is Twitter’s trending topics algorithm. There’s clear value in reaching the trending topics list on Twitter: visibility. In the early days of Twitter, Justin Bieber used to consistently dominate the trends list. In response, team Twitter implemented TF-IDF (Term Frequency Inverse Document Frequency) as a way to make sure this didn’t happen, making it harder for consistently popular content to trend. As a result, only new and “spiky” topics (those with a rapid acceleration of shares) make it to the trends list. This means that events such as Kim Kardashian’s wedding or Steve Jobs’ death are much more likely to trend than topics that are a part of an ongoing story, such as the economy or immigration. I published a long blog post on why #OccupyWallStreet never trended in NYC explaining precisely this phenomenon.
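To see why a TF-IDF style weighting favors spiky topics over steadily popular ones, here is a rough sketch. It is not Twitter's actual implementation: treating each time window as a "document", the function name trend_scores, and the toy term counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def trend_scores(current_window, past_windows):
    """Score each term in the current window by TF-IDF against earlier windows.

    current_window: list of terms observed in the latest time window
    past_windows: list of term lists, one per earlier time window
    """
    term_freq = Counter(current_window)
    past_sets = [set(window) for window in past_windows]
    n_windows = len(past_sets) + 1  # earlier windows plus the current one
    scores = {}
    for term, count in term_freq.items():
        # Document frequency: windows containing the term (the current one always does).
        doc_freq = 1 + sum(1 for window in past_sets if term in window)
        idf = math.log(n_windows / doc_freq)
        scores[term] = count * idf
    return scores


# Hypothetical toy data: "bieber" is huge but ever-present, "wedding" is new.
past = [["bieber", "economy"], ["bieber", "economy"], ["bieber"]]
current = ["bieber"] * 50 + ["economy"] * 5 + ["wedding"] * 10

scores = trend_scores(current, past)
top_term = max(scores, key=scores.get)
```

In this toy setup "bieber" appears in every window, so its inverse document frequency is log(1) = 0 and its score is zero despite having the highest raw count, while the brand-new "wedding" term comes out on top. An ongoing story like "economy" sits in between, which matches the behavior described above.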
Every event organizer wants to get their conference to trend on Twitter, so they typically choose a unique hashtag. We know enough about what gets something to trend, but it is still difficult to completely game the system. In a way, there’s always a natural progression with algorithmic products. The more successful these spaces are, the more perceived value they attain, and the more people will try to game them. Changes in the system are necessary in order to keep it from being abused.
5. If you could have any other data scientist’s job, which one would you want?
I’m pretty sure I have the best data science gig out there. I get to work with passionate smart people, on creative and innovative approaches to use data in products that people love. Hard to beat that!