Social Data: What’s Next in Finance?

After a couple exciting years in social finance and some major events, we’re back with an update to our previous paper “Social Media in Markets: The New Frontier”. We’re excited to be able to provide this broad update on a rapidly evolving and increasingly important segment of financial service.

Social media analytics for finance has lagged brand analytics by 3 to 4 years despite being an enormous potential for profit through investing based on social insights. Our whitepaper explains why that gap has existed and what has changed in the social media ecosystem that is causing that gap to close. Twitter conversation around tagged equities has grown by more than 500% since 2011. The whitepaper explores what that means for investors.


We examine the finance specific tools that have emerged as well as outline a framework for unlocking the value in social data for tools that are yet to be created. Then we provide an overview of changes in academic research, social content, and social analytics for finance providers that will help financial firms figure out how to capitalize on opportunities to generate alpha.

Download our new whitepaper.

Twitter, you've come a long way baby… #8years

Like a child’s first steps or your first experiment with pop rocks candy, the first ever Tweet went down in the Internet history books eight years ago today. On March 21st, 2006, Jack Dorsey, co-founder of Twitter published this.


Twttr, (the service’s first name), was launched to the public on July 15, 2006 where it was recognized for “good execution on a simple but viral idea.” Eight years later, that seems to have held true.

It has become the digital watering hole, the newsroom, the customer service do’s and don’ts, a place to store your witty jargon that would just be weird to say openly at your desk. And then there is that overly happy person you thought couldn’t actually exist, standing in front of you in line, and you just favorited their selfie #blessed. Well, this is awkward.

Just eight months after their release, the company made a sweeping entrance into SXSW 2007 sparking the platforms usage to balloon from 20,000 to 60,000 Tweets per day. Thus beginning the era of our public everyday lives being archived in 140 character tidbits. The manual “RT” turned into a click of a button, and favorites became the digital head nod. I see you.

In April 2009, Twitter launched the Trending Topics sidebar, identifying popular current world events and modish hashtags. Verified accounts became available that summer; Athletes, actors, and icons alike began to display the “verified account” tag on their Twitter pages. This increasingly became a necessity in recognizing the real Miley Cyrus vs. Justin Bieber. If differences do exist.

The Twitter Firehose launched in March 2010. By giving Gnip access, a new door had opened into the social data industry and come November, filtered access to social data was born. Twitter turned to Gnip to be their first partner serving the commercial market. By offering complete access to the full firehose of publicly-available Tweets under enterprise terms, this partnership enabled companies to build more advanced analytics solutions with the knowledge that they would have ongoing access to the underlying data. This was a key inflection point in the growth of the social data ecosystem. By April, Gnip played a key role in the delivering past and future Twitter data to the Library of Congress for historic preservation in the archives.

July 31, 2010, Twitter hit their 20 billionth Tweet milestone, or as we like to call it, twilestone. It is the platform of hashtags and Retweets, celebrities and nobodies, at-replies, political rants, entertainment 411 and “pics or it didn’t happen.” By June 1st, 2011, Twitter allowed just that as it broke into the photo sharing space, allowing users to upload their photos straight to their personal handle.

One of the most highly requested features was the ability to get historical Tweets. In March 2012, Gnip delivered just that by making every public Tweet available starting from March 21, 2006 by Mr. Dorsey himself.

Fast forward 8 years, Twitter is reporting over 500 million Tweets per day. That’s more than 25,000 times the amount of Tweets-per-day in just 8 years! With over 2 billion accounts, over a quarter of the world’s population, Twitter ranks high among the top websites visited everyday. Here’s to the times where we write our Twitter handle on our conference name tags instead of our birth names, and prefer to be tweeted at than texted. Voicemails? Ain’t nobody got time for that.

Twitter launched a special surprise for its 8th birthday. Want to check out your first tweet?

“There’s a #FirstTweet for everything.” Happy Anniversary!


See more memorable Twitter milestones

Plugging In Deeper: New Brandwatch API Integration Brings Gnip Data to Brands

Earlier today, our partner Brandwatch made an announcement that we expect to be a big deal for the social data ecosystem. Brandwatch has become Gnip’s first Plugged In partner to offer an API integration that allows their customers to get full Twitter data from Gnip, using the methods and functionality of the new Brandwatch Premium API. Brandwatch’s customers can now apply all the power of Brandwatch Analytics – including their query building tools, custom dashboard visualizations, sentiment, demographics, influence data and more – to reliable, complete access to the Twitter firehose from Gnip. With this first-of-its-kind integration, brands and agencies have the opportunity to get social media analytics from a leading provider together with full Twitter data from Gnip, using one seamless API.

Brandwatch Gnip Premium API Twitter Integration.png

Brands and agencies are increasingly using social data to make business decisions outside the marketing department, and Brandwatch’s new API offering fills an important gap that will make this much easier. We’ve seen an uptick in demand from brands wanting to use social data outside of their social listening services to power CRM applications, to incorporate social data into business intelligence tools to study alongside other business data, and to build custom dashboards that combine social with other important business data. At the same time, these brands often face a challenge. They’ve invested significant time and resources using their social listening services to hone in and find the data that’s most important to them. Additionally, their social listening services provide valuable analytics and additional metadata that brands rely on to help make sense of social data. When it’s come time to consume social data for use in other applications, until now they’ve needed to integrate with Gnip separately. Our new integration with Brandwatch’s Premium API gives their customers a “best of both” solution. It provides a seamless way to combine the powerful social media listening and analytics service they’ve come to rely on with full Twitter data from the world’s most trusted social data provider.

For folks interested in the technical details, the way this works is simple. When Brandwatch customers make API calls for Twitter data, they get routed through a Gnip app that fetches Brandwatch data and merges it with data from Gnip. This means Brandwatch customers can have full assurance that they’re getting licensed Twitter data directly from Gnip.

We know a straightforward, integrated solution like this is something brands have been asking for, and we’re glad it’s finally here. To learn more about how the Brandwatch Premium API works, join our joint webinar next week or contact us at


Putting the Data in Data Discovery – Qliktech & Gnip Partner Up

Gnip is excited to announce that Qliktech is the newest member of our Plugged In partner program. While we partner with many different types of companies – ranging from innovative social analytics products to well-known big data services and software providers – Qliktech is a unique and exciting addition to our program.
Qliktech is discovery software that combines key features of data analysis with intuitive decision-making features, including (to name a few):

  • The ability to consolidate data from multiple sources
  • An easy search function across all datasets and visualizations
  • State-of-the-art graphics for visualization and data discovery
  • Support for social decision-making through secure, real-time collaboration
  • Mobile device data capture and analysis

Our partnership means that joint Qliktech and Gnip clients can easily marry social data with internal datasets to create nuanced visualizations that surface performance indicators and real-time changes that can impact the decisions those clients are making.

To put the powerful capabilities of this new partnership to good use, Gnip will be co-sponsoring a partner hackathon on April 6th at Qonnections– the Qliktech Partner Summit.

Along with HP Vertica and Qliktech, we’ll enable partners to hack on behalf of Medair, Swiss based humanitarian organization that provides support for health, nutrition, water and sanitation, hygiene, and shelter initiatives to countries experiencing natural disasters or emergencies.

A series of recent academic papers have highlighted the usefulness that social media plays in obtaining real-time information following sudden natural disasters. This hackathon will follow in those steps, using Twitter data from during Typhoon Haiyan, which landed in the Philippines on Nov 8th, 2013. Using Gnip’s Profile Geo enhancement, we’ll provide data from the Philippines during that period, allowing other Qliktech partners to experiment with how Medair could leverage this data, within Qliktech, in future situations that require real-time analysis and response.

It will be a great time, but more importantly, will harness the power of the Gnip and Qliktech relationship to accomplish something everyone can be proud of. And that’s a pretty good start to a new partnership!

The History Of Social Data In One Beautiful Timeline

As a company that’s constantly innovating and driving forward, it’s sometimes easy to forget everything that’s led us to where we are today. When Gnip was founded 6 years ago, social data was in its infancy. Twitter produced only 300,000 Tweets per day; social data APIs were either non-existent or unreliable; and nobody had any idea what a selfie was.

Today social data analytics drives decisions in every industry you can imagine, from consumer brands to finance to the public sector to industrial goods. From then to now, there have been dozens of milestones that have helped create the social data industry and we thought it would be fun to highlight and detail all of them in one place.


The story begins humbly in Boulder, Colorado with the concept of changing the way data was gathered from public APIs of social networks. Normally, one would ‘ping’ the API and ask for data, Gnip wanted to reverse that structure (hence our name). In these early days, we focused on simplifying access to existing public APIs but our customers constantly asked us how they could get more and better access to social data. In November of 2010, we were finally able to better meet their needs when we partnered with Twitter to provide access to full Firehose of public Tweets, the first partnership of its kind.

This is when Gnip started to build the tools that have shaped the social data industry. While getting a Firehose of Tweets was great for the industry, the reality was our customers didn’t need 100% of all Tweets, they needed 100% of relevant Tweets. We created PowerTrack to enable sophisticated filtering on the full Firehose of Tweets and solve that problem. We also built valuable enrichments, reliability products, and historical data access to create the most robust Twitter data access available.

While Twitter data was where the industry started, our customers wanted data from other social networks as well. We soon created partnerships with Klout, StockTwits, WordPress, Disqus, Tumblr, Foursquare and others to be the first to bring their data to the market. Our work didn’t end there though. We have been continually adding in new sources, new enrichments, and new products. We also launched the first conference dedicated to social data as well as the first industry organization for social data. Things have come a long way in 6 years and we can’t wait to see the developments in the next 6 years.

Check out our interactive timeline for the full list of milestones and details.


From A to B: Visualizing Language for the Entire History of Twitter

It all started with a simple question: “How could we show the growth and change in languages on Twitter?”

Easy, right?

Well, several months later, here we are; finally ready to show off our final product. You can see a static image of the final viz below and check out the full story and interactive version in The Evolution of Languages on Twitter.

Looking back on the process that led us here, I realized that we’d been through an huge range of ideas and wanted to share that experience with others.

Where Did We Get the Data?

As a data scientist, I walk into Gnip’s vast data playground excited to analyze, visualize and tell stories. For this project, I had access to the full archive of public Tweets that’s part of Gnip’s product offering – that’s every Tweet since the beginning of Twitter in March of 2006.

The next question is: “With this data set, what’s the best way to analyze language?” We had two options here – use Gnip’s language detection or use the language field that’s in every Twitter user’s account settings. Gnip’s language detection enrichment looks at the text of every Tweet and classifies the Tweet as one of 24 different languages. It’s a great enrichment, but for historical data it’s only available back to March 2012.

Since we wanted to tell the story back to the beginning of Twitter, we decided to use the language field that’s in every Twitter user’s account settings.


This field has been part of the Twitter account setup since the beginning, giving us the coverage we need to tell our story.

The First Cut

Having defined how we would determine language, we created our first visualization.



Interesting, but it doesn’t really tell the story we’re looking for.  This visualization tells the story of the growth of Twitter – it grew a lot. The challenge is that this growth obscures the presence of anything other than English, Japanese and Spanish. The sharp rise in volume also makes languages prior to 2010 impossible to see.

So we experimented with rank, language subsets, and other visualization techniques that could tell a broader story. At times, we dabbled in fugly.

Round Two

Moving through insights and iterations, we started to see each Twitter language become its own story. We chose relative rank as an important element and the streams grew into individual banners waving from year end marker poles like flags in the wind.


With this version, we felt like we were getting somewhere…

The Final Version

To get to the final version, we reintroduced the line width as a meaningful element to indicate the percent of Tweet volume, pared down the number of languages to focus the story, and used D3 to spiff up the presentation layer. The end result is a simple visualization that tells the story of how language has grown and changed on Twitter. 

What became clear to me in this process is that visualization is a hugely iterative process and there’s not a single thing that leads to a successful end result. It’s a combination of the questions you ask, how you structure the data, the choices you make in what to show and what not to show and finally the tools you use to display the result.

Let me know what you think…

The Evolution of Languages on Twitter

Revolution. Global economy. Internet access. What story do you see?

This interactive visualization shows the evolution of languages of Tweets according to the language that the user selected in their Twitter profile. The height of the line reflects the percentage of total Tweets and the vertical order is based on rank vs. other languages.

Check it out. Hover over it. See how the languages you care about have changed since the beginning of Twitter.

As you’d expect, Twitter was predominantly English speaking in the early days, but that’s changed as Twitter has grown its adoption globally. English is still the dominant language but with only 51% share in 2013 vs. 79% in 2007. Japanese, Spanish and Portuguese emerged the consistent number two, three and four languages. Beyond that, you can see that relative rankings change dramatically year over year.

In this data, you can see several different stories. The sharp rise in Arabic reflects the impact of the Arab Spring – a series of revolutionary events that made use of Twitter. A spike in Indonesian is indicative of a country with a fast growing online population. Turkish starts to see growth and we expect that growth will continue to spike after the Occupygezi movement. Or step back for a broader view of the timeline; the suggestion of a spread in the globalization of communication networks comes to mind. Each potential story could be a reason to drill down further, expose new ideas and explore the facts.

Adding your own perspective, what story do you see?

(Curious about how we created this viz? Look for our blog post later tomorrow for that story.)

Geo Operators and the Oscars: Using Gnip’s Search API to Identify Tweets Originating from Hollywood’s Dolby Theater

At Gnip, we’ve long believed that the “geo” part of social data is hugely valuable. Geodata adds real-world context to help make sense of online conversations. We’re always looking for ways to make it easier to leverage the geo component of social data, and the geo features in our Search API provide powerful tools to do so. Gnip’s Search API now includes full support for all PowerTrack geo operators, which means analysts, marketers and academics can now get instantaneous answers to their questions about Twitter that require location.

In this post, we’ll walk through exactly what these Search API features look like, and then look at an example from the Oscars of where they add real value. We will find the people actually tweeting from inside Hollywood’s Dolby Theater and do deeper analysis of those Tweets.

What It Does

Gnip’s Search API for Twitter provides the fastest, easiest way to get up and running consuming full-fidelity Twitter data. Users can enter any query using our standard PowerTrack operators and get results instantaneously through a simple request/response API (no streaming required).

Using the Search API, you can run queries that make use of latitude/longitude data, specifically with our bounding_box and point_radius rules. Each rule lets you limit results to content within a box or circle with up to 25-mile sides or radius. These rules can be particularly helpful to cut down on the noise and identify the most relevant content for your query.

These rules can be used independently to filter both types of Twitter geodata available from Gnip:

  1. Twitter’s “geo” value: The native geodata provided by Twitter, based on metadata from mobile devices. This is present for about 2% of Tweets.
  2. Gnip’s “Profile Geo” value: Geocoded locations from users’ profiles, provided by Gnip. This is present for about 30% of Tweets.

An Oscars’ Example: What apps do the glitterati use to Tweet?

There were 17.1 million Oscar-related Tweets this year. Ellen Degeneres’ selfie Tweet at the Oscars was the most retweeted Tweet ever. If you run a search for “arm was longer” during the Oscars, you’ll get a result set that looks like this — over 1.5 million retweets in the first hour following @theellenshow’s Tweet.


The Tweet created a bit of a stir since it was sponsored by Samsung and Ellen used her Galaxy S5 to tweet it, but then used an iPhone backstage shortly afterward. We’ve seen plenty of interest in the past in the iPhone vs. Android debate, so we thought we’d use the opportunity to check what the real “score” was inside the building. What applications were other celebrities using to post to Twitter?

Using the PowerTrack point_radius operator, we created a very narrow search for a 100 meter radius from the center of the Dolby Theater — roughly the size of the building — for the time period from 5:30-11:30 PM PST on Sunday night. It’s a single-clause query:

point_radius:[-118.3409742 34.1021528 0.1km]

The result set that came back included about 250 Tweets from 98 unique users during that narrow time period, and you can see from the spike during the “selfie” moment that it’s a much more refined data set:


Using this geo search parameter, you can quickly identify the people actually at the Oscars during that major Twitter moment.

Among them, we saw the following application usage:

Geotagged Tweets by Application

With all the chatter about the mobile device battles, it’s interesting to see so many Oscar attendees actually posting to Twitter from Instagram versus Twitter mobile clients. Foursquare also made a solid showing in terms of the attendees’ preferred Twitter posting mechanism.

This is just one example of how using Gnip’s geo PowerTrack operators in the Search API can help with doing more powerful social media analysis. For a more detailed overview of all the potential geo-related PowerTrack queries that can be created, check out our support docs here.

P.S. – More on Geo & Twitter Data at #SXSW

If you get excited about social data + geo and will be in Austin this weekend, come check out our session  “Beyond Dots on a Map: Visualizing 3 Billion Tweets” on Sunday at 1 pm. I’ll be on stage with our good friend Eric Gundersen from Mapbox talking about the work we did together to visualize a huge set of Twitter data from around the world.

Gnip's Social Data Picks for SXSW

If you’re one of 30,000 headed to SXSW, we’ve got our social data and data science panel picks for SXSW that you should attend between BBQ and breakfast tacos. Also, if you’re interested in hanging with Gnip, we’ve listed the places we’ll have a presence at!

Also, we’ll be helping put on the Big Boulder: Boots & Bourbon party at SXSW for folks in the social data industry. Send an email to for an invite.


What Social Media Analytics Can’t Tell You
Friday, March 7 at 3:30 PM to 4:30 PM: Sheraton Austin, EFGH

Great panel with Vision Critical, Crowd Companies and more. “Whether you’re looking for fresh insight on what makes social media users tick, or trying to expand your own monitoring and analytics program, this session will give you a first look at the latest data and research methods.”

Book Signing – John Foreman, Chief Data Scientist at MailChimp
Friday, March 7 at 3:50 to 4:10 PM: Austin Convention Center, Ballroom D Foyer

During an interview with Gnip, John said that the data challenge he’d most like to solve is the Taco Bell menu. You should definitely get his book and get it signed.


Truth Will Set You Free but Data Will Piss You Off
Saturday, March 8 from 3:30 to 4:30 PM: Sheraton Austin Creekside

All-star speakers from DataKind, Periscopic and more talking about “the issues and ethics around data visualization–a subject of recent debate in the data visualization community–and suggest how we can use data in tandem with social responsibility.”

Keeping Score in Social: It’s More than Likes
Saturday, March 8 from 5:15 to 5:30 PM: Austin Convention Center, Ballroom F

Jim Rudden, the CMO of Spredfast, brands will talk about “what it takes to move beyond measuring likes to measuring real social impact.”


Mentor Session: Emi Hofmeister
Sunday, March 9 at 11 AM to 12 PM: Hilton Garden Inn, 10th Floor Atrium

Meet with Emi Hofmeister, the senior product marketing manager at Adobe Social. All sessions appear to be booked but keep an eye out for cancellations. Sign up here:

The Science of Predicting Earned Media
Sunday, March 9 at 12:30 to 1:30 PM: Sheraton Austin, EFGH

“In this panel session, renowned video advertising expert Brian Shin, Founder and CEO at Visible Measures, Seraj Bharwani, Chief Analytics Officer at Visible Measures, along with Kate Sirkin, Executive Vice President, Global Research at Starcom MediaVest Group, will go through the models built to quantify the impact of earned media, so that brands can not only plan for it, but optimize and repeat it.”

GNIP EVENT: Beyond Dots on a Map: Visualizing 3 Billion Tweets
Sunday, March 9 at 1:00-1:15 PM: Austin Convention Center, Ballroom E

Gnip’s product manager, Ian Cairns, will be speaking about the massive Twitter visualization Mapbox and Gnip created and what 3 billion geotagged Tweets can tell us.

Mentor Session: Jenn Deering Davis
Sunday, March 9 at 5 to 6 PM: Hilton Garden Inn, 10th Floor Atrium

SIgn up for a mentoring session with Jenn Deering Davis, the co-founder of Union Metrics. Sign up here –

Algorithms, Journalism & Democracy
Sunday, March 9 from 5 to 6 PM: Austin Convention Center, Room 12AB

Read our interview with Gilad Lotan of betaworks on his SXSW session and data science. Gilad will be joined by Kelly McBride of the Poynter Institute about the ways algorithms are biased in ways that we might not think about it. “Understanding how algorithms control and manipulate your world is key to becoming truly literate in today’s world.”


Scientist to Storyteller: How to Narrate Data
Monday, March 10 at 12:30 – 1:30 PM: Four Seasons Ballroom

See our interview with Eric Swayne about this SXSW session and data narration. On the session, “We will understand what a data-driven insight truly IS, and how we can help organizations not only understand it, but act on it.”

#Occupygezi Movement: A Turkish Twitter Revolution
Monday, March 10 at 12:30 – 1:30 PM:  Austin Convention Center, Room 5ABC

See our interview with Yalçin Pembeciogli about how the Occupygezi movement was affected by the use of Twitter. “We hope to show you the social and political side of the movements and explain how social media enabled this movement to be organic and leaderless with many cases and stories.”

GNIP EVENT: Dive Into Social Media Analytics
Monday, March 10 at 3:30 – 4:30 PM: Hilton Austin Downtown, Salon B

Gnip’s VP of Product, Rob Johnson, will be speaking alongside IBM about “how startups can push the boundaries of what is possible by capturing and analyzing data and using the insights gained to transform the business while blowing away the competition.”


Measure This; Change the World
Tuesday, March 11 at 11 AM to 12 PM: Sheraton Austin, EFGH

A panel with folks from Intel, Cornell, Knowable Research, etc. looking at what we can learn from social scientists and how they measure vs how marketers measure.

Make Love with Your Data
Tuesday, March 11 at 3:30 to 4:30 PM: Sheraton Austin, Capitol ABCD

This session is from the founder of OkCupid, Christian Rudder. I interviewed Christian previously and am a big fan. “We’ll interweave the story of our company with the story of our users, and by the end you will leave with a better understanding of not just OkCupid and data, but of human nature.”

Data Stories: Gilad Lotan of betaworks

Gilad Lotan is the chief data scientist for betaworks, which has launched some pretty incredible companies including SocialFlow and bitly. I was very excited to interview him because it’s so rare for a data scientist to get to peek under the hood of so many different companies. He’s also speaking at the upcoming SXSW on “Algorithms, Journalism and Democracy.” You can follow him on Twitter at @gilgul

(Gnip is hosting a SXSW event for those involved in social data, email for an invite.) 

Gilad Lotan of Betaworks1. As the chief data scientist for betaworks, how do you divide your time amongst all of the companies? Basically, how do you even have time for coffee?

First of all, we take coffee very seriously at betaworks, especially team digg, who got us all hooked on the chemex drip. There’s always someone making really good coffee somewhere in the office.

Now beyond our amazing coffee, betaworks is such a unique place. We both invest in early stage companies, and incubate products, many of which have a strong data component. We’ve successfully launched companies over the past years, including Tweetdeck, Bitly, Chartbeat and SocialFlow. There are currently around 10 incubations at betaworks, at various stages and sizes.

Earlier this year, we decided to set up a central data team at betaworks rather than separate data teams within the incubations. Our hypothesis was that leveraging data wrangling skills and infrastructure as a common resource would be more efficient, and provide our incubations with a competitive advantage. Many companies face similar data-related problem-sets, especially in their early stages. From generating a trends list to building recommendation systems, the underlying methodology stays similar even when using different data streams. Our hope was that we could re-use solutions provided in one company for similar features in another. On top of that, we’re taking advantage of common data infrastructure, and using data streams to power features in existing products or new experiments.

When working with data, much of the time you’re building something you’ve never done before. Even if the general methodology might be known, when applied to a new dataset or within a new context there’s lots of tweaking to be done. For example, even if you’ve used naive bayes classifiers in the past, when applied towards a new data stream, results might not be good enough. So planning data features within product releases is challenging, as it is hard to predict how long development will take. And then there’s knowing when to stop, which isn’t necessarily intuitive. When is your model “good enough”? When are results from a recommendation system “good enough”?

The data team is focused on building data-driven features into incubated products. One week I’ll be working on scoring content from RSS feeds, and the following week I might be analyzing weather data clusters. We prioritize based on the stage the company and the importance of this data for the product’s growth. We tend to focus on companies that have high volumes of data, or are seeking to build features that rely on large amounts of data munching. But we keep it fairly loose. We’re small and nimble enough at betaworks that prioritization between companies has not been an issue yet.

I’m aware that it will become more challenging, especially as companies grow in size.

2. Previously, you were the VP of Research & Development at SocialFlow. How did data science figure into product development?

At SocialFlow the data team built systems that mine massive amounts of data, including the Twitter public firehose. From distributed ingestion to analytics and visualization, there were a few ways in which our work fed into the product.

The first and most obvious, was based on the product development roadmap. In an ideal situation, the data team’s work is always a few steps ahead of the product’s immediate needs, and can integrate its’ work into the product when needed. At SocialFlow, we powered a number of features including real-time audience trends, and personalized predictions for performance of content on social media. In both cases the modules were developed as a part of the product launch cycle and continuously maintained by the data team.

The second way in which we affected product development was by constantly running experiments. Continuous experimentation was a key way in which we innovated around our data. We would take time to test out hypothesis and explore different visualization techniques as a way to make better sense of our data. One of our best decisions was to bring in data science interns over the summer. They were all in the midst of their phd’s and incredibly passionate about data analysis, especially data from social networks. The summer was an opportunity for them to run experiments at a massive scale, using our data and infrastructure. As they learned our systems, each chose a focus and spent the rest of their time running analyses and experiments. Much of their work was integrated into the product in some manner. Additionally, several published their findings in academic journals and conference proceedings. Exploratory data analysis may be counter-productive, especially when there are no upper bounds set on when to stop the experimentation. But with strict deadlines, it may be invaluable.

The third, and most surprising for me, was storytelling. We made sure to always blog about interesting phenomenons that we were observing within our data. Some were directly related to our business – publishers, brands and marketers – but much was simply interesting to the general public. We added data visualization to make them more accessible. From the Osama Bin Laden raid, to the spread of Invisible Children’s Kony2012 video, we were generating a sizable amount of PR for SocialFlow just by blogging about interesting things we identified in our data. While the attention was nice, there were some great business and product opportunities that came because of that.

Working with interesting data? Always be telling stories!

3. In your SXSW session “Algorithms, Journalism and Democracy,” you’ll speak to the bias behind algorithms. What concerns you about these biases and what do people need to know?

There are a growing number of online spaces which are a product of an automated algorithmic process. These are the trending topics lists we see across media and social networks, or the personalized recommendations we get on retail sites. But how do we define what constitutes a “trend” or what piece of content should be included in our “hottest” list? Often times, it is not simply the top most read or clicked on item. What’s hot is an intuitive and very humane assessment, yet potentially a mathematically complex formula, if at all possible to produce. We’ve already seen numerous examples where algorithmically generated results led to awkward outcomes, such as Amazon’s $23,698,655.93 priced book about flies or Siri’s inability to find abortion clinics in New York City.

These are not Google, Apple, Amazon or Twitter conspiracies, but rather the unintended consequences of algorithmic recommendations being misaligned with people’s value systems and expectations of how the technology should work. The larger the gap between people’s expectations and the algorithmic output, the more user trust will be violated. As designers and builders of these technologies, we need to make sure our users understand enough about the choices we encode into our algorithms, but not too much to enable them to game the systems. People’s perception affects trust. And once trust is violated, it is incredibly difficult to gain back. There’s a misplaced faith in the algorithm, assuming it is fair, unbiased, and should accurately represent what we think is “truth”.

While it is clear for technologists that algorithmic systems are always biased, the public perception is that of neutrality. It’s math right? And math is honest and “true”. But when building these systems there are specific choices made, and biases encoded due to these choices.

In my upcoming SXSW session with Poynter Institute’s Kelly McBride, we’ll be untangling some of these topics. Please come!

4. When marketers typically look at algorithms, most try to game the system. How does the cycle of programmers trying to stop gaming and those trying to game it play into creating biases?

Indeed. In spaces where there’s a perceived value, people will always try to game the system. But this is not new. The practice of search engine optimization is effectively “gaming the system”, yet its’ been around for many years and is considered an important part of launching a website. SEO is what we have to do if we want to optimize traffic to our websites. Like a game of cat and mouse, as builders we have to constantly adjust the parameters of our systems in order to make them harder to game, while making sure they preserve their value. With Google search, for example, while we have a general sense of what affects results, we don’t know precisely, and they constantly change it up!

Another example is Twitter’s trending topics algorithm. There’s clear value in reaching the trending topics list on Twitter: visibility. In the early days of Twitter, Justin Bieber used to consistently dominate the trends list. In response, team Twitter implemented TFIDF (Term Frequency Inverse Document Frequency) as a way to make sure this didn’t happen – making it harder for popular content to trend. As a result, only new and “spiky” (rapid acceleration of shares) make it to the trends list. This means that trends such as Kim Kardashian’s wedding or Steve Jobs’ death are much more likely to trend, compared to topics that are a part of an ongoing story, such as the economy or immigration. I published a long blog post on why #OccupyWallStreet never trended in NYC explaining precisely this phenomenon.

Every event organizer wants to get their conference to trend on Twitter. Hence a unique hashtag is typically chosen to be used. We know enough about what gets something to trend, but it is still difficult to completely game the system. In a way, there’s always a natural progression with algorithmic products. The more successful these spaces are, the more perceived value they attain, the more people will try to game them. Changes in the system are necessary in order to keep it from being abused.

5. If you could have any other data scientist’s job, which one would you want?

I’m pretty sure I have the best data science gig out there. I get to work with passionate smart people, on creative and innovative approaches to use data in products that people love. Hard to beat that!