Data Stories: Gilad Lotan of betaworks

Gilad Lotan is the chief data scientist for betaworks, which has launched some pretty incredible companies including SocialFlow and bitly. I was very excited to interview him because it’s so rare for a data scientist to get to peek under the hood of so many different companies. He’s also speaking at the upcoming SXSW on “Algorithms, Journalism and Democracy.” You can follow him on Twitter at @gilgul.

(Gnip is hosting a SXSW event for those involved in social data; email events@gnip.com for an invite.)

Gilad Lotan of betaworks

1. As the chief data scientist for betaworks, how do you divide your time amongst all of the companies? Basically, how do you even have time for coffee?

First of all, we take coffee very seriously at betaworks, especially team Digg, who got us all hooked on the Chemex drip. There’s always someone making really good coffee somewhere in the office.

Now, beyond our amazing coffee, betaworks is such a unique place. We both invest in early-stage companies and incubate products, many of which have a strong data component. We’ve successfully launched companies over the past few years, including TweetDeck, bitly, Chartbeat and SocialFlow. There are currently around 10 incubations at betaworks, at various stages and sizes.

Earlier this year, we decided to set up a central data team at betaworks rather than separate data teams within the incubations. Our hypothesis was that leveraging data-wrangling skills and infrastructure as a common resource would be more efficient, and would give our incubations a competitive advantage. Many companies face similar data-related problems, especially in their early stages. From generating a trends list to building recommendation systems, the underlying methodology stays similar even when the data streams differ. Our hope was that we could reuse solutions built for one company for similar features in another. On top of that, we’re taking advantage of common data infrastructure, and using data streams to power features in existing products or new experiments.
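
To make that reuse concrete, here is a minimal sketch (our illustration, not betaworks’ actual code): a trending function written against a generic stream of (label, timestamp) pairs, so the same code can be pointed at tweet hashtags one week and RSS keywords the next.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_trending(items, now, window=timedelta(hours=1), k=10):
    """Rank labels by frequency inside a recent time window.
    `items` is any iterable of (label, timestamp) pairs, so the same
    function can serve tweets, bitly clicks, or RSS entries."""
    recent = Counter(label for label, ts in items if now - ts <= window)
    return recent.most_common(k)

# Hypothetical reuse across two incubations' streams:
# top_trending(tweet_hashtags, datetime.utcnow())
# top_trending(rss_keywords, datetime.utcnow())
```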

When working with data, much of the time you’re building something you’ve never built before. Even if the general methodology is known, applying it to a new dataset or within a new context requires lots of tweaking. For example, even if you’ve used Naive Bayes classifiers in the past, when applied to a new data stream the results might not be good enough. So planning data features within product releases is challenging, as it is hard to predict how long development will take. And then there’s knowing when to stop, which isn’t necessarily intuitive. When is your model “good enough”? When are results from a recommendation system “good enough”?
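
To illustrate the “good enough” question, here is a minimal sketch using scikit-learn, assuming a labeled sample of the new stream exists: the familiar Naive Bayes pipeline carries over unchanged, but the quality bar is a product decision, not a mathematical one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def evaluate_on_new_stream(texts, labels, good_enough=0.85):
    """Retrain a familiar Naive Bayes pipeline on a new stream's
    labeled sample and check whether it clears a pre-agreed bar.
    The 0.85 threshold is an arbitrary, hypothetical product call."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=0)
    vec = CountVectorizer()
    model = MultinomialNB().fit(vec.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, model.predict(vec.transform(X_test)))
    return acc, acc >= good_enough
```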

The data team is focused on building data-driven features into incubated products. One week I’ll be working on scoring content from RSS feeds, and the following week I might be analyzing weather data clusters. We prioritize based on the stage of the company and how important the data is to the product’s growth. We tend to focus on companies that have high volumes of data, or that are seeking to build features that rely on large amounts of data munging. But we keep it fairly loose. We’re small and nimble enough at betaworks that prioritization between companies has not been an issue yet, though I’m aware it will become more challenging as the companies grow in size.

2. Previously, you were the VP of Research & Development at SocialFlow. How did data science figure into product development?

At SocialFlow the data team built systems that mine massive amounts of data, including the Twitter public firehose. From distributed ingestion to analytics and visualization, there were a few ways in which our work fed into the product.

The first, and most obvious, was based on the product development roadmap. In an ideal situation, the data team’s work is always a few steps ahead of the product’s immediate needs, and the team can integrate its work into the product when needed. At SocialFlow, we powered a number of features, including real-time audience trends and personalized predictions for how content would perform on social media. In both cases the modules were developed as part of the product launch cycle and continuously maintained by the data team.

The second way in which we affected product development was by constantly running experiments. Continuous experimentation was a key way in which we innovated around our data. We would take time to test hypotheses and explore different visualization techniques as a way to make better sense of our data. One of our best decisions was to bring in data science interns over the summer. They were all in the midst of their PhDs and incredibly passionate about data analysis, especially data from social networks. The summer was an opportunity for them to run experiments at massive scale, using our data and infrastructure. As they learned our systems, each chose a focus and spent the rest of their time running analyses and experiments. Much of their work was integrated into the product in some manner, and several published their findings in academic journals and conference proceedings. Exploratory data analysis can be counterproductive when there’s no upper bound on when to stop experimenting, but with strict deadlines it can be invaluable.

The third, and most surprising for me, was storytelling. We made sure to always blog about interesting phenomena we were observing in our data. Some were directly relevant to our business – publishers, brands and marketers – but much of it was simply interesting to the general public. We added data visualizations to make the posts more accessible. From the Osama bin Laden raid to the spread of Invisible Children’s Kony2012 video, we generated a sizable amount of PR for SocialFlow just by blogging about interesting things we identified in our data. And beyond the attention, some great business and product opportunities came out of it.

Working with interesting data? Always be telling stories!

3. In your SXSW session “Algorithms, Journalism and Democracy,” you’ll speak to the bias behind algorithms. What concerns you about these biases and what do people need to know?

There are a growing number of online spaces that are the product of an automated algorithmic process: the trending topics lists we see across media and social networks, or the personalized recommendations we get on retail sites. But how do we define what constitutes a “trend,” or which piece of content should be included in a “hottest” list? Oftentimes it is not simply the most-read or most-clicked item. What’s hot is an intuitive and very human assessment, yet a mathematically complex one to formalize, if it can be formalized at all. We’ve already seen numerous examples where algorithmically generated results led to awkward outcomes, such as Amazon’s $23,698,655.93 book about flies or Siri’s inability to find abortion clinics in New York City.

These are not Google, Apple, Amazon or Twitter conspiracies, but rather the unintended consequences of algorithmic recommendations being misaligned with people’s value systems and expectations of how the technology should work. The larger the gap between people’s expectations and the algorithmic output, the more user trust will be violated. As designers and builders of these technologies, we need to make sure our users understand enough about the choices we encode into our algorithms, but not so much that they can game the systems. People’s perception affects trust, and once trust is violated, it is incredibly difficult to gain back. There’s a misplaced faith in the algorithm: an assumption that it is fair and unbiased, and that it accurately represents what we think is “truth”.

While it is clear to technologists that algorithmic systems are always biased, the public perception is one of neutrality. It’s math, right? And math is honest and “true”. But when building these systems, specific choices are made, and biases are encoded through those choices.

In my upcoming SXSW session with Poynter Institute’s Kelly McBride, we’ll be untangling some of these topics. Please come!

4. When marketers look at algorithms, most try to game the system. How does the cycle between programmers trying to stop gaming and those trying to game it play into creating biases?

Indeed. In spaces where there’s perceived value, people will always try to game the system. But this is not new. The practice of search engine optimization is effectively “gaming the system,” yet it’s been around for many years and is considered an important part of launching a website. SEO is what we have to do if we want to optimize traffic to our websites. Like a game of cat and mouse, as builders we have to constantly adjust the parameters of our systems to make them harder to game, while making sure they preserve their value. With Google search, for example, we have a general sense of what affects results, but we don’t know precisely, and they constantly change it up!

Another example is Twitter’s trending topics algorithm. There’s clear value in reaching the trending topics list on Twitter: visibility. In the early days of Twitter, Justin Bieber used to consistently dominate the trends list. In response, team Twitter implemented TF-IDF (term frequency-inverse document frequency) weighting to make sure this didn’t happen – making it harder for consistently popular content to trend. As a result, only new and “spiky” topics (those with a rapid acceleration of shares) make it to the trends list. This means that events such as Kim Kardashian’s wedding or Steve Jobs’ death are much more likely to trend than topics that are part of an ongoing story, such as the economy or immigration. I published a long blog post on why #OccupyWallStreet never trended in NYC, explaining precisely this phenomenon.
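
Twitter hasn’t published its exact formula, but a toy version of the idea looks something like this (an illustrative sketch, not Twitter’s actual algorithm): a term’s current volume is discounted by how often it appeared in past time windows, so perennial favorites score low while spiky newcomers score high.

```python
import math
from collections import Counter

def trend_scores(current_counts, historical_counts, total_windows):
    """Toy TF-IDF-style trend scoring. A term's current count (the
    'TF' part) is discounted by how many past windows it appeared in
    (the 'IDF' part), so a term that is always popular scores near
    zero no matter how large its current volume."""
    scores = {}
    for term, count in current_counts.items():
        windows_seen = historical_counts.get(term, 0)
        idf = math.log(total_windows / (1 + windows_seen))
        scores[term] = count * idf
    return Counter(scores).most_common()

# A term present in nearly every historical window gets an IDF near
# log(1) = 0, which is why "bieber" stops trending under this scheme.
```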

Every event organizer wants to get their conference to trend on Twitter; hence a unique hashtag is typically chosen for the event. We know enough about what gets something to trend, but it is still difficult to completely game the system. In a way, there’s a natural progression with algorithmic products: the more successful these spaces are, the more perceived value they attain, and the more people will try to game them. Changes in the system are necessary to keep it from being abused.

5. If you could have any other data scientist’s job, which one would you want?

I’m pretty sure I have the best data science gig out there. I get to work with passionate, smart people on creative and innovative ways to use data in products that people love. Hard to beat that!

Data Stories: Interview with Hilary Mason of bitly

Data Stories is Gnip’s opportunity to tell the cool stories about the data scientists, data journalists and other people who are working in data. This week we’re interviewing Hilary Mason, the chief data scientist of bitly. She is currently helping organize DataGotham, a celebration of New York’s data community happening Sept. 13-14. You can follow her on Twitter at @hmason and read her blog at HilaryMason.com.

Hilary Mason of bitly

1) How did you get started in your role as a data scientist?

I’m a computer scientist and have always had a keen interest in both algorithms and databases. It became clear to me in the last decade that the most interesting algorithms were those that worked on real data. When I found that there were opportunities to design math and infrastructure to build new types of applications, I couldn’t resist!

2) bitly users share 80 million links a day. What are some of the coolest insights and trends you’ve been able to see from these shared links?

We see all kinds of fascinating things in the data. For example, people who read about physics also read about fashion (http://bit.ly/vSa6AO), and people who use Kindles use them very differently than any other kind of device (http://bit.ly/wbRe6o). We’re always posting these things on our blog; on July 4th, for instance, we posted the most popular recipe by state for the holiday. Did you know that people in Florida enjoy Alligator Ribs (http://bit.ly/NwUEUL)?

3) bitly just updated its site making it even easier to share and curate links. As the chief data scientist, what excites you most about the new capabilities?

It’s wonderful to see bitly evolve from a utility into a truly social platform. We’re excited for bitly to become the central place for you to store, share, and analyze the things that you care about on the internet. We can then use the aggregate data that we collect to enhance that experience for you.

4) What are some of your favorite projects you’ve worked on while at bitly?

Our goal at bitly is to understand the internet’s attention, and to build systems that make that useful. It’s too hard to pick just one bit of it! I’m proud of some of the work that’s made it out into the world, like our post about the half-life of links on various social networks (http://bit.ly/puUbzs) and our collaboration with Forbes on the interactive map of media influence (http://onforb.es/GFzphG). I’m also incredibly excited about a few product-oriented experiments that are going to be public shortly … stay tuned.

5) What tools are in your arsenal as a data scientist?

I’m a firm believer in finding the smartest people you can and letting them use whatever works best. Personally, I’m a huge fan of the old skool Unix utilities, and do more with grep and awk than I should probably admit.

Python is my current programming language of choice, though I’m not averse to C when necessary. A few people on my team have started to fall in love with Go, so that’s on my list to check out.

We use the best datastore for each challenge, and make heavy use of memcached, Redis, HDFS, and even text files.

In the non-tech world, I keep a Moleskine notebook around and have fallen in love with the Hi-Tec-C 0.4mm pens from JetPens.

6) As the chief scientist, where do you think your team adds additional business value? How does data science help bitly make decisions it wouldn’t make otherwise?

My team plays a few roles within the company. We handle the business analytics, which can range from answering very simple questions like, “How many new URLs did we see yesterday?” to complex questions like, “How do we value a URL being clicked from platform X vs. platform Y over time?”
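
The simple end of that spectrum is genuinely simple. Here is a sketch of the “new URLs yesterday” count, with hypothetical data shapes rather than bitly’s internals:

```python
def count_new_urls(day, events, seen_before):
    """Count URLs first observed on `day`. `events` is an iterable of
    (url, date) pairs for that day's traffic, and `seen_before` is the
    set of URLs observed on any earlier day. (Hypothetical shapes.)"""
    todays = {url for url, d in events if d == day}
    return len(todays - seen_before)
```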

We do research, pushing the boundaries of what we know to be possible with our data and systems. A few examples of these types of questions are, “Can we build a model of attention to any phrase people are actively clicking on?”, or “Can we predict opening weekend box office takes for movies that people are reading about via bitly links?”
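
A research question like the box-office one might start as simply as regressing opening-weekend gross on pre-release click attention. The sketch below uses invented numbers purely for illustration, not bitly data or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented toy data: clicks on a film's links in the week before
# release vs. its opening-weekend gross in $ millions.
clicks = np.array([[12_000], [45_000], [3_100], [88_000]])
gross = np.array([8.2, 31.5, 1.9, 64.0])

model = LinearRegression().fit(clicks, gross)
print(model.predict(np.array([[50_000]])))  # rough forecast for a new film
```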

Finally, we build products. Generally these are APIs, like the one that accepts a URL and returns the geographic distribution of attention to it, but sometimes they’re human-facing products. More on that shortly.
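
For a sense of what such an API might look like, here is a hypothetical endpoint shape in Flask; the route, store, and response format are all made up for illustration and are not bitly’s actual API.

```python
from collections import Counter
from flask import Flask, jsonify, request

app = Flask(__name__)
CLICKS = {}  # hypothetical store: url -> list of click country codes

@app.route("/geo")
def geo_distribution():
    """Given ?url=..., return each country's share of clicks on it."""
    url = request.args.get("url", "")
    countries = CLICKS.get(url, [])
    total = len(countries) or 1
    shares = {c: n / total for c, n in Counter(countries).items()}
    return jsonify({"url": url, "distribution": shares})
```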

In summary, my team is responsible for pushing the boundaries of where bitly can go. It’s fun.

Thanks to Hilary for taking the time to talk to us about her work with bitly! Let us know in the comments if you have a suggestion for another Data Stories interview.
