Gnip's Social Data Picks for SXSW

If you’re one of the 30,000 headed to SXSW, we’ve got the social data and data science panel picks you should attend between BBQ and breakfast tacos. And if you’re interested in hanging out with Gnip, we’ve listed the places where we’ll have a presence!

Also, we’ll be helping put on the Big Boulder: Boots & Bourbon party at SXSW for folks in the social data industry. Send an email to bre@bigboulderinitiative.org for an invite.

FRIDAY 

What Social Media Analytics Can’t Tell You
Friday, March 7 from 3:30 to 4:30 PM: Sheraton Austin, EFGH

Great panel with Vision Critical, Crowd Companies and more. “Whether you’re looking for fresh insight on what makes social media users tick, or trying to expand your own monitoring and analytics program, this session will give you a first look at the latest data and research methods.”

Book Signing – John Foreman, Chief Data Scientist at MailChimp
Friday, March 7 from 3:50 to 4:10 PM: Austin Convention Center, Ballroom D Foyer

During an interview with Gnip, John said that the data challenge he’d most like to solve is the Taco Bell menu. You should definitely get his book and get it signed.

SATURDAY

Truth Will Set You Free but Data Will Piss You Off
Saturday, March 8 from 3:30 to 4:30 PM: Sheraton Austin Creekside

All-star speakers from DataKind, Periscopic and more talking about “the issues and ethics around data visualization–a subject of recent debate in the data visualization community–and suggest how we can use data in tandem with social responsibility.”

Keeping Score in Social: It’s More than Likes
Saturday, March 8 from 5:15 to 5:30 PM: Austin Convention Center, Ballroom F

Jim Rudden, CMO of Spredfast, will talk about “what it takes to move beyond measuring likes to measuring real social impact.”

SUNDAY

Mentor Session: Emi Hofmeister
Sunday, March 9 from 11 AM to 12 PM: Hilton Garden Inn, 10th Floor Atrium

Meet with Emi Hofmeister, senior product marketing manager at Adobe Social. All sessions appear to be booked, but keep an eye out for cancellations. Sign up here: http://mentor.sxsw.com/mentors/316

The Science of Predicting Earned Media
Sunday, March 9 from 12:30 to 1:30 PM: Sheraton Austin, EFGH

“In this panel session, renowned video advertising expert Brian Shin, Founder and CEO at Visible Measures, Seraj Bharwani, Chief Analytics Officer at Visible Measures, along with Kate Sirkin, Executive Vice President, Global Research at Starcom MediaVest Group, will go through the models built to quantify the impact of earned media, so that brands can not only plan for it, but optimize and repeat it.”

GNIP EVENT: Beyond Dots on a Map: Visualizing 3 Billion Tweets
Sunday, March 9 from 1:00 to 1:15 PM: Austin Convention Center, Ballroom E

Gnip’s product manager, Ian Cairns, will be speaking about the massive Twitter visualization Mapbox and Gnip created and what 3 billion geotagged Tweets can tell us.

Mentor Session: Jenn Deering Davis
Sunday, March 9 from 5 to 6 PM: Hilton Garden Inn, 10th Floor Atrium

Sign up for a mentoring session with Jenn Deering Davis, co-founder of Union Metrics, here: http://mentor.sxsw.com/mentors/329

Algorithms, Journalism & Democracy
Sunday, March 9 from 5 to 6 PM: Austin Convention Center, Room 12AB

Read our interview with Gilad Lotan of betaworks on his SXSW session and data science. Gilad will be joined by Kelly McBride of the Poynter Institute to discuss the ways algorithms are biased that we might not think about. “Understanding how algorithms control and manipulate your world is key to becoming truly literate in today’s world.”

MONDAY

Scientist to Storyteller: How to Narrate Data
Monday, March 10 from 12:30 to 1:30 PM: Four Seasons Ballroom

See our interview with Eric Swayne about this SXSW session and data narration. Of the session: “We will understand what a data-driven insight truly IS, and how we can help organizations not only understand it, but act on it.”

#Occupygezi Movement: A Turkish Twitter Revolution
Monday, March 10 from 12:30 to 1:30 PM: Austin Convention Center, Room 5ABC

See our interview with Yalçın Pembecioğlu about how the Occupygezi movement was affected by the use of Twitter. “We hope to show you the social and political side of the movements and explain how social media enabled this movement to be organic and leaderless with many cases and stories.”

GNIP EVENT: Dive Into Social Media Analytics
Monday, March 10 from 3:30 to 4:30 PM: Hilton Austin Downtown, Salon B

Gnip’s VP of Product, Rob Johnson, will be speaking alongside IBM about “how startups can push the boundaries of what is possible by capturing and analyzing data and using the insights gained to transform the business while blowing away the competition.”

TUESDAY

Measure This; Change the World
Tuesday, March 11 from 11 AM to 12 PM: Sheraton Austin, EFGH

A panel with folks from Intel, Cornell, Knowable Research, and others looking at what we can learn from social scientists and how they measure versus how marketers measure.

Make Love with Your Data
Tuesday, March 11 from 3:30 to 4:30 PM: Sheraton Austin, Capitol ABCD

This session is from the founder of OkCupid, Christian Rudder. I interviewed Christian previously and am a big fan. “We’ll interweave the story of our company with the story of our users, and by the end you will leave with a better understanding of not just OkCupid and data, but of human nature.”

Data Stories: Gilad Lotan of betaworks

Gilad Lotan is the chief data scientist for betaworks, which has launched some pretty incredible companies, including SocialFlow and bitly. I was very excited to interview him because it’s so rare for a data scientist to get to peek under the hood of so many different companies. He’s also speaking at the upcoming SXSW on “Algorithms, Journalism and Democracy.” You can follow him on Twitter at @gilgul.

(Gnip is hosting a SXSW event for those involved in social data, email events@gnip.com for an invite.) 

Gilad Lotan of betaworks

1. As the chief data scientist for betaworks, how do you divide your time amongst all of the companies? Basically, how do you even have time for coffee?

First of all, we take coffee very seriously at betaworks, especially team Digg, who got us all hooked on the Chemex drip. There’s always someone making really good coffee somewhere in the office.

Now beyond our amazing coffee, betaworks is such a unique place. We both invest in early-stage companies and incubate products, many of which have a strong data component. We’ve successfully launched companies over the past few years, including Tweetdeck, Bitly, Chartbeat and SocialFlow. There are currently around 10 incubations at betaworks, at various stages and sizes.

Earlier this year, we decided to set up a central data team at betaworks rather than separate data teams within the incubations. Our hypothesis was that leveraging data wrangling skills and infrastructure as a common resource would be more efficient, and provide our incubations with a competitive advantage. Many companies face similar data-related problem-sets, especially in their early stages. From generating a trends list to building recommendation systems, the underlying methodology stays similar even when using different data streams. Our hope was that we could re-use solutions provided in one company for similar features in another. On top of that, we’re taking advantage of common data infrastructure, and using data streams to power features in existing products or new experiments.

When working with data, much of the time you’re building something you’ve never built before. Even if the general methodology might be known, when applied to a new dataset or within a new context there’s lots of tweaking to be done. For example, even if you’ve used naive Bayes classifiers in the past, when applied to a new data stream, results might not be good enough. So planning data features within product releases is challenging, as it is hard to predict how long development will take. And then there’s knowing when to stop, which isn’t necessarily intuitive. When is your model “good enough”? When are results from a recommendation system “good enough”?
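The “good enough” question can be made concrete by agreeing on an accuracy threshold before tuning starts. Here’s a minimal sketch (not betaworks code; the toy tweets and the 0.9 threshold are invented for illustration) of a tiny naive Bayes classifier checked against a holdout set:

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with Laplace smoothing."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, docs):
        for text, label in docs:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + smoothed log likelihood of each word
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy data: is a tweet about sports or food?
train = [("great game tonight", "sports"), ("the team won big", "sports"),
         ("tacos and bbq for lunch", "food"), ("best breakfast tacos ever", "food")]
holdout = [("the game was great", "sports"), ("bbq for dinner", "food")]

nb = NaiveBayes()
nb.train(train)
accuracy = sum(nb.predict(t) == y for t, y in holdout) / len(holdout)
GOOD_ENOUGH = 0.9  # threshold agreed with the product team before tuning starts
print(accuracy >= GOOD_ENOUGH)
```

The point is less the classifier than the stopping rule: once holdout accuracy clears the pre-agreed bar, the feature ships and tuning stops.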

The data team is focused on building data-driven features into incubated products. One week I’ll be working on scoring content from RSS feeds, and the following week I might be analyzing weather data clusters. We prioritize based on the stage of the company and the importance of the data for the product’s growth. We tend to focus on companies that have high volumes of data, or are seeking to build features that rely on large amounts of data munging. But we keep it fairly loose. We’re small and nimble enough at betaworks that prioritization between companies has not been an issue yet.

I’m aware that it will become more challenging, especially as companies grow in size.

2. Previously, you were the VP of Research & Development at SocialFlow. How did data science figure into product development?

At SocialFlow the data team built systems that mine massive amounts of data, including the Twitter public firehose. From distributed ingestion to analytics and visualization, there were a few ways in which our work fed into the product.

The first, and most obvious, was based on the product development roadmap. In an ideal situation, the data team’s work is always a few steps ahead of the product’s immediate needs, and can be integrated into the product when needed. At SocialFlow, we powered a number of features, including real-time audience trends and personalized predictions for performance of content on social media. In both cases the modules were developed as part of the product launch cycle and continuously maintained by the data team.

The second way in which we affected product development was by constantly running experiments. Continuous experimentation was a key way in which we innovated around our data. We would take time to test out hypotheses and explore different visualization techniques as a way to make better sense of our data. One of our best decisions was to bring in data science interns over the summer. They were all in the midst of their PhDs and incredibly passionate about data analysis, especially data from social networks. The summer was an opportunity for them to run experiments at a massive scale, using our data and infrastructure. As they learned our systems, each chose a focus and spent the rest of their time running analyses and experiments. Much of their work was integrated into the product in some manner. Additionally, several published their findings in academic journals and conference proceedings. Exploratory data analysis may be counter-productive, especially when there are no upper bounds set on when to stop experimenting. But with strict deadlines, it may be invaluable.

The third, and most surprising for me, was storytelling. We made sure to always blog about interesting phenomena that we were observing within our data. Some were directly related to our business – publishers, brands and marketers – but much was simply interesting to the general public. We added data visualizations to make them more accessible. From the Osama bin Laden raid to the spread of Invisible Children’s Kony2012 video, we were generating a sizable amount of PR for SocialFlow just by blogging about interesting things we identified in our data. While the attention was nice, some great business and product opportunities also came out of it.

Working with interesting data? Always be telling stories!

3. In your SXSW session “Algorithms, Journalism and Democracy,” you’ll speak to the bias behind algorithms. What concerns you about these biases and what do people need to know?

There are a growing number of online spaces which are the product of an automated algorithmic process. These are the trending topics lists we see across media and social networks, or the personalized recommendations we get on retail sites. But how do we define what constitutes a “trend” or what piece of content should be included in our “hottest” list? Oftentimes, it is not simply the most read or most clicked-on item. What’s hot is an intuitive and very human assessment, yet potentially a mathematically complex formula, if it is possible to produce at all. We’ve already seen numerous examples where algorithmically generated results led to awkward outcomes, such as Amazon’s $23,698,655.93 book about flies or Siri’s inability to find abortion clinics in New York City.

These are not Google, Apple, Amazon or Twitter conspiracies, but rather the unintended consequences of algorithmic recommendations being misaligned with people’s value systems and expectations of how the technology should work. The larger the gap between people’s expectations and the algorithmic output, the more user trust will be violated. As designers and builders of these technologies, we need to make sure our users understand enough about the choices we encode into our algorithms, but not so much that they can game the systems. People’s perception affects trust. And once trust is violated, it is incredibly difficult to gain back. There’s a misplaced faith in the algorithm, assuming it is fair, unbiased, and should accurately represent what we think is “truth”.

While it is clear to technologists that algorithmic systems are always biased, the public perception is one of neutrality. It’s math, right? And math is honest and “true”. But when building these systems there are specific choices made, and biases encoded through those choices.

In my upcoming SXSW session with Poynter Institute’s Kelly McBride, we’ll be untangling some of these topics. Please come!

4. When marketers typically look at algorithms, most try to game the system. How does the cycle of programmers trying to stop gaming and those trying to game it play into creating biases?

Indeed. In spaces where there’s perceived value, people will always try to game the system. But this is not new. The practice of search engine optimization is effectively “gaming the system”, yet it’s been around for many years and is considered an important part of launching a website. SEO is what we have to do if we want to optimize traffic to our websites. Like a game of cat and mouse, as builders we have to constantly adjust the parameters of our systems in order to make them harder to game, while making sure they preserve their value. With Google search, for example, while we have a general sense of what affects results, we don’t know precisely, and they constantly change it up!

Another example is Twitter’s trending topics algorithm. There’s clear value in reaching the trending topics list on Twitter: visibility. In the early days of Twitter, Justin Bieber used to consistently dominate the trends list. In response, team Twitter implemented TF-IDF (term frequency-inverse document frequency) weighting as a way to make sure this didn’t happen – making it harder for perennially popular content to trend. As a result, only new and “spiky” topics (those with a rapid acceleration of shares) make it to the trends list. This means that trends such as Kim Kardashian’s wedding or Steve Jobs’ death are much more likely to trend than topics that are part of an ongoing story, such as the economy or immigration. I published a long blog post on why #OccupyWallStreet never trended in NYC explaining precisely this phenomenon.
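Twitter’s actual algorithm isn’t public, but the TF-IDF idea described here can be sketched: weight a hashtag’s current frequency by the inverse of its historical frequency, so perennially popular terms are damped while sudden spikes surface. All counts below are invented for illustration:

```python
import math
from collections import Counter

def trending(current_window, history, top_n=3):
    """Rank hashtags by spikiness: frequent now, rare historically.

    current_window / history: lists of hashtags observed in each period.
    A term that is always popular (high historical count) gets a low
    weight, mirroring the inverse-document-frequency idea.
    """
    now = Counter(current_window)
    past = Counter(history)
    total_past = len(history)
    scores = {}
    for tag, tf in now.items():
        # rarer in history -> higher weight (the "IDF" analogue)
        idf = math.log(total_past / (1 + past[tag]))
        scores[tag] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Invented data: #bieber is always popular; #newsflash just spiked.
history = ["#bieber"] * 50 + ["#economy"] * 10 + ["#newsflash"] * 1
current = ["#bieber"] * 20 + ["#newsflash"] * 15 + ["#economy"] * 5

print(trending(current, history))
```

With these made-up counts, the always-popular #bieber ranks last despite having the most mentions in the current window, while the sudden #newsflash spike tops the list.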

Every event organizer wants to get their conference to trend on Twitter. Hence a unique hashtag is typically chosen. We know enough about what gets something to trend, but it is still difficult to completely game the system. In a way, there’s a natural progression with algorithmic products: the more successful these spaces are, the more perceived value they attain, and the more people will try to game them. Changes to the system are necessary in order to keep it from being abused.

5. If you could have any other data scientist’s job, which one would you want?

I’m pretty sure I have the best data science gig out there. I get to work with passionate smart people, on creative and innovative approaches to use data in products that people love. Hard to beat that!

A Quick Look at the Hashtags in Jimmy Fallon’s Skit

Jimmy Fallon and Jonah Hill did a new skit called #Hashtag2 where they express themselves with hashtags on Twitter. If you haven’t seen it, go take two minutes. I’ll wait.


After seeing this, I was curious if people actually used these hashtags in real life. Are we all really #blessed and #winning? It appears so. There were 784,207 uses of #blessed and 420,696 uses of #winning in the last 30 days on Twitter. But not many of us are living #livinlavidaloca with 580 uses. Maybe we’re not #livinlavidaloca because there have been only 83 mentions of #sippinonginandjuice.
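Counts like these boil down to extracting hashtags from a window of tweet text and tallying them. A minimal sketch (the sample tweets are invented; a real tally would run over 30 days of tweets from a Twitter data stream):

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def count_hashtags(tweets):
    """Tally hashtag usage across a collection of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

# Invented sample; a real run would iterate over millions of tweets.
tweets = [
    "Just got promoted #blessed #winning",
    "Sunday brunch with the fam #blessed",
    "Crushed my 5k this morning #winning",
]
counts = count_hashtags(tweets)
print(counts.most_common(2))
```

Lowercasing before counting matters here, since #Blessed and #blessed are the same hashtag to Twitter users even if not to a string comparison.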

Below are two charts that show real-world usage of the hashtags included in the video’s script. It’s rather amusing to see how prevalent (or not) these themes are in our daily lives!

[Chart: usage of hashtags from Jimmy Fallon’s skit]

[Chart: hashtags in Jimmy Fallon’s video]

The Need for a Data Science Masters Program

Data science is a new profession and thus there isn’t a clear educational or career path for data scientists. One of the questions we ask most frequently in our Data Stories series is about the career path people took to become data scientists. On Gnip’s own data science team, three members have PhDs in physics and one has a master’s in mathematics. So I am definitely interested in how universities are creating their own data science programs. To that end, I interviewed Annalee Saxenian, the dean of UC Berkeley’s School of Information, about their master’s program for data scientists.

This is part of our Data Stories series leading up to SXSW. Dean Saxenian is speaking on “The Future Belongs to Data Scientists.” Gnip is hosting a SXSW event for those involved in social data; email events@gnip.com for an invite.

Dean Annalee Saxenian

1. Why create a masters program specifically for data scientists?

There is huge demand for people who can work with data, at large as well as small scales, using the new tools and technologies that are becoming available for data storage, analytics, and visualization. While data science has been pioneered by technology companies like Google, LinkedIn, and Facebook, we believe that every organization today (large and small, for-profit and non-profit, in every industry) has new sources of data that it can use to inform decision making and to develop new products and services. This new data, which comes from click streams, online sales receipts, sensor networks, mobile devices, and social media, is not only available at very large scales, but is also largely unstructured or semi-structured. This makes the analysis of the data fundamentally different from analysis of the smaller and more structured data sets of the past.

2. Data scientist is a new career path. What are the advantages of receiving formal training through a program such as Berkeley’s School of Information versus real-world experience?

Most organizations don’t have the resources or the commitment to systematically expose employees to the range of new tools, technologies, and skills required of a data scientist. Even leading technology companies provide only very limited on-the-job training in the relevant skills. They are looking for employees who already have expertise in areas like statistics and data analytics.

One of the advantages of a master’s degree program like the Master of Information and Data Science (MIDS) at Berkeley is that our faculty has built a complete curriculum from the ground up: designing both the individual courses we think are essential to practicing data scientists and the dependencies between the courses, so that the whole is greater than the sum of its parts. The curriculum covers the full life cycle of data science. We offer courses devoted to research design, data storage and retrieval, statistical analysis, machine learning, data visualization and communication, data privacy and ethics, field experiments, and scaling and parallelism. In addition, we require that students gain experience working in teams. In short, formal education like the MIDS program offers comprehensive exposure to the field of data science.

3. Why did Berkeley decide to make the I School an online program?

The School of Information faculty decided to offer the program online for several reasons. For one, we are growing our existing programs and outgrowing our facilities on the Berkeley campus; offering an online program relieves us of the need to compete for scarce space. We also believe that, as a School of Information, we should be experimenting with new educational technologies, and that since most of our graduates will be working in teams and online settings, we should play a leadership role in this space. Last, but not least, by offering the degree online we are able to reach a much wider range of students than we can with our face-to-face programs. We are providing access to a Berkeley-quality degree to people who can’t move to Berkeley for family or work reasons and to those who need to continue working while they seek further education.

4. What characteristics do you think make for the most successful data scientists?

Data scientists do need a set of technical and analytical skills and mastery of certain tools and technologies, but just as important are the soft skills. The most successful data scientists can think creatively about trends in data, collaborate well on teams, and communicate findings from data to non-specialists. So they need to be clear thinkers, good collaborators and communicators, and able to think creatively about what they see in the data.

5. What do you think are the upsides and downsides for companies of dealing with data that previously wasn’t accessible?

The upsides for companies: the new data can be used to enhance business decision making as well as to develop new products and services. Companies are using previously inaccessible data to learn more about customer behavior and market trends. They are running regular online experiments that generate data, allowing them to learn in real time about trade-offs in design and other business decisions.

The downsides: most companies still don’t have people with the relevant skills to learn from the new data, and they will need to reorganize in order to take full advantage of it. Entrenched silos in established companies mean that data is managed by one group, analyzed by another, and built into products by a third, and none of these groups is well connected to senior decision makers. Taking full advantage of the new data will require much closer interaction between these different parts of the organization.

Occupy Gezi: How Twitter Facilitated a Social Movement in Turkey

Last summer in Turkey, a small protest over the removal of trees in Gezi Park grew into a large movement to protect one of the last green spaces remaining in the heart of Istanbul. The movement resulted in 1,900 arrests, with nearly as many people reported injured. Social media served as the primary source of information for citizens. We interviewed Yalçın Pembecioğlu of Bigumigu about how the movement was sparked and what it means for social media in Turkey. This is part of our SXSW Data Stories, where we’re interviewing presenters about their data talks. Yalçın is presenting on “#Occupygezi Movement: A Turkish Twitter Revolution.”

(Gnip is hosting a SXSW event for those involved in social data, email events@gnip.com for an invite.) 

Yalçın Pembecioğlu of Bigumigu

1. How did Twitter help create the #Occupygezi movement?

During the start of the events, none of the broadcast networks covered #OccupyGezi. Not even a little bit. I guess this encouraged people to take control and be their own media. Suddenly, everybody started to take and distribute pictures and videos from the places where events took place. The content from citizen media went viral in seconds.

2. Why would people choose Twitter over mainstream media as a source of information?

If the information is coming from someone you trust, it is very important information. During #OccupyGezi, people saw the cold, brutal face of the mainstream media. Our friends were on the streets and they were telling unbelievable stories. We decided to believe our friends and families instead of the mainstream media.

3. How did the #Occupygezi movement respond to rumors via social media?

It was emotionally devastating to see the police brutality towards the protestors. At times like those, I guess people become more tolerant of biased information. But many of us, including me, spent hours on the computer deciphering the dirty data into real bits of information. Hence the term “kesin bilgi mi?” was born on the Turkish internet. It means “is this confirmed information?”. During #OccupyGezi, when important information came up, we would all ask questions to confirm it; if it was confirmed, we would spread the information, and if not, we would warn the source to double-check the data. Now it’s a common meme on the internet to reply to any joke with “kesin bilgi mi?”. It was a crash course in not trusting everything on the internet.

4. After #Occupygezi, how did the use of social media change in Turkey?

The most popular social media platform in Turkey is Facebook, with over 30 million active users. After #OccupyGezi, the penetration of Twitter accelerated. It is said that nearly 1 million new accounts were created during the #OccupyGezi weeks, and it is believed that Twitter now has around 9 million active accounts in Turkey. A society that was very comfortable in symmetrical networks discovered the power and potential of asymmetrical networks.

5. Overall, what does social media mean for revolutions?

Social media is the place for individual voices. It is very important for revolutions, because via social media we see there are thousands, even millions, of people out there just like us. I am not sure whether social media will be spoiled by power holders in the future, but for now, it can be the one channel through which an individual’s voice can be heard.

Eric Swayne on the Fundamentals of Good Data Narration

Leading up to SXSW, we’ll be doing Data Stories with SXSW presenters, starting with Eric Swayne, Director of Product at MutualMind, who is speaking on “Scientist to Storyteller: How to Narrate Data.” (Gnip is hosting a SXSW event for those involved in social data; email events@gnip.com for an invite that goes out in a month or so.)

Eric Swayne

1. In your SXSW session, you’re going to talk about being more than a “data janitor.” What do you mean by this?

Data Janitor is a term that resonates with many analysts currently, as they’re basically being used to maintain the facilities: scrubbing data sources, pushing data into prescribed buckets, rolling out the same reports because they’re the reports “we’ve always done.” These tasks are still important, but we have to aspire to more. People who live in data analysis have the crucial opportunity to extract meaning that transforms businesses through data-driven decisions, and it takes much more than just pushing out the monthly graphs and charts.

2. How important is visualization to good data narration?

Great data visualizations put incredibly powerful tools in the hands of Data Narrators that enable them to tell better stories, as well as extract insights from incomprehensible data sets.  However, it’s critical not to confuse the visualization WITH the insight – they are distinctly separate, and not necessarily dependent upon each other.  A simple pie chart that tells a CEO exactly what they need to know to make a good decision isn’t any visual tour de force, but it clearly gets the job done.  In all cases, visualizations should serve the story: the string of insights that lead to data-driven decisions.  When a good picture makes a good idea stick, that’s when you know the Data Narrator has done their job.

3. What are the trademarks of good data narration?

You’ll often see three key hallmarks:

1. True Insights – An insight tells me something I don’t know, that I need to know, and that I can do something about.  If the data story doesn’t include these three elements, it’s factual or irrelevant, not insightful.

2. User-Centric Approach – Human Interface Design isn’t just for UX professionals – Analysts need to become more adept at its principles as well.  Everything we say in a report or dashboard through form, color, size or spatial relationships carries meaning – whether we intended it to or not.  Data Narrators not only understand their story but also their audience: what they’re used to seeing, how they might be biased against certain ideas, and what assumptions they’re making based on what they see and hear.

3. Idea Inception – I would call this “stickiness”, but we have a much better term for it now, thanks to Christopher Nolan and Leo DiCaprio! Work from great Data Narrators shows an intent to focus the audience on an idea and make sure they remember it. Data Narrators often focus not on the meeting where they present their work, but on the NEXT meeting their audience is having, and whether they remember what was said and use it.

4. When marketers misinterpret data, what are the ramifications that you see?

Of course the ultimate impact of misinterpreted data is bad decisions, but it usually starts by creating bad stories. Urban legends pervade businesses just like any other culture, and they often sound enough like real data that they’re not questioned. “Our site visitors click on blue more often,” “We’ve never had a good Q4 for product X,” “Twitter hasn’t driven sales for us like Facebook,” and on and on. These “data-ish” stories are particularly insidious because they’re often unquestioned assumptions, and voices that seek to pick them apart are often quashed as trying to “rock the boat.” Storytelling isn’t just a tool to be used for good or ill – it’s the default processing protocol for human brains. Where good, data-driven stories are NOT created, it leaves a vacuum that others will fill with whatever they remember.

5. What are your data pet peeves? What is the data equivalent of driving slow in the fast lane?

  • Confusing correlation with causation. I know, I know, we say this maxim so many times it should be the Data Scientist’s Golden Rule. But the fact is that it’s tremendously hard for humans to avoid this trap, particularly when correlations appear to validate the opinions we already have. This is why it’s incredibly important for us to question each other’s assumptions, and to be open to ours being questioned.
  • “Perfect” data. When charts show a straight line, or scatterplots neatly cluster, or r^2 results are incredibly high, I get suspicious. Nothing in nature is perfect, particularly when humans are involved – the reality is that while many of our behaviors can be consistent, they aren’t absolute. It’s incredibly important that we use analytics and statistics to tell the story the data tells us, not the one we want to tell.
  • Trophy numbers.  When I start a new client engagement, I like to ask them what their Trophy Numbers are.  These are the stats and figures that are used to report upwards (and often justify jobs), but that we know have no inherent value.  Pageviews, Hits, Impressions, Potential Reach, and Asset Views are all often found in this category.  While these may be good symptoms of success, they almost always aren’t the way your business wins in the world.  Data Narrators don’t ignore these, but rather they lead clients on a journey from here to better KPIs that indicate real business success.
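The first two peeves above are easy to demonstrate together: two invented series that merely both trend upward yield a near-perfect r^2 with no causal link anywhere in sight. A toy sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two unrelated quantities that both happened to grow year over year
# (numbers invented for illustration):
ice_cream_sales = [100, 120, 140, 165, 190]
shark_sightings = [10, 13, 15, 18, 20]

r = pearson_r(ice_cream_sales, shark_sightings)
print(round(r ** 2, 2))  # r^2 near 1 despite no causal link
```

Any two series with a shared upward drift will correlate this way, which is exactly why a suspiciously high r^2 deserves scrutiny rather than celebration.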

If you’re interested in more Data Stories, check out Gnip’s collection of 25 Data Stories. 

Gnip’s 2013 Highlight Reel

Gnip 2013 Holiday Card

At the end of last year, we reflected upon the year’s happenings, and wanted to do a similar recap of Gnip’s 2013.

Three years ago, Gnip partnered with Twitter to license their data. Since then, social data has continued on its incredible ride. This year marked both Twitter’s successful IPO and Tumblr’s acquisition by Yahoo. This past year has been a great year for social data and Gnip, but we’re even more excited for what’s ahead in 2014!

January

Gnip interviews Harper Reed, the former CTO of Obama for America.

February

Time Magazine writes about Gnip’s work with The Library of Congress to make all public Tweets part of their archive.

March

Gnip takes Big Boulder to Austin with Big Boulder Bourbon & Boots and Derek Gottfrid from Tumblr speaks.

April

Gnip launches 6 new data collectors — Instagram, Reddit, bitly, Stack Overflow, Panoramio, and Plurk.

Gnip launches a new firehose for Estimize, a crowdsourced earnings estimates platform.

May

Gnip becomes the first and exclusive provider of anonymized Foursquare data.

June

Gnip holds the second annual Big Boulder, the world’s first conference dedicated to social data. More than 200 people attended the 16 sessions with speakers from Twitter, Facebook, Pinterest and more. The Big Boulder Initiative launches as a collaborative industry group to discuss the challenges and obstacles facing social data.

Chris Moody becomes CEO of Gnip.

Gnip and MapBox collaborate to create three Twitter visualizations showing where locals vs. tourists hang out, languages on Twitter, and mobile device usage.

July

Gnip becomes the provider of GetGlue social data.

August

Gnip is named a top 10 place to work in the country by Outside Magazine.

Gnip becomes the exclusive provider of Klout Topics.

Gnip launches its Profile Geo Enrichment, which significantly increases the amount of geo data available from Twitter.

September

Gnip introduces Backfill, which simplifies and automates the process of collecting data that would otherwise be missed during brief disconnects.

Gnip launches the Search API for Twitter, making it easy for Gnip customers to incorporate search into their products.

October

Gnip names Hottolink its exclusive sales agent in Japan.

Gnip works alongside Automattic to launch Automattic’s Certified Partner Program. Networked Insights and mBlast are the first companies to be certified.

November

Gnip interviews more than 25 people for its Data Stories project showing how people use social data in epidemiology, natural disasters, politics, product development and much more.

Gnip launches Gnip for Blogs, the only way to get Tumblr, WordPress and Disqus data in one easy-to-consume package.

December

Gnip’s partner program, Plugged In, turns one. In the first year, Plugged In added 23 companies, featured partners in 9 Gnip blog posts, wrote 4 partner whitepapers (FirstRain, mBLAST, TrendyBuzz, Clarabridge), held 3 partner webinars (SMA, Infomart/Infochimps, mBLAST), produced 2 partner videos (Union Metrics, Pivotal), and co-presented at 2 conferences! Phew.

The Big Boulder Initiative takes a big step forward with workshops in four different cities to discuss the future of social data.

And yesterday, Gnip published its 100th blog post for the year.

 

Data Story: Adam Sadilek on Tracking Food Poisoning With Social Data

Adam Sadilek has done some pretty groundbreaking research with social data, including tracking food poisoning. When he was a Ph.D. student at the University of Rochester, he led a team that found that geotagged Tweets about foodborne illness closely aligned with restaurants that received poor scores from the health department. Adam is now a researcher at Google, and you can follow him on Twitter at @Sadilek.

Tracking Food Poisoning via Twitter

 

1. Where did your interest in identifying health trends on Twitter come from?

First, it was studying how Twitter can predict flu outbreaks, and then looking at identifying food poisoning outbreaks as well.

We were interested in how much we could learn about our environment by sifting through the vast amounts of day-to-day chatter online. It turns out that machine learning can identify strong signals that can be used to make predictions about individuals as well as the venues they visit. For example, in our GermTracker.org project, we predicted how likely a Twitter user is to become sick based on how many symptomatic people he or she met recently. We leveraged geotags within the Tweets to estimate people’s encounters. In the nEmesis project, our model identified Twitter users who got sick after eating at a restaurant, which enabled us to rank food establishments by cleanliness.
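
The venue-ranking step Adam describes can be sketched at a toy scale. Everything below (the records, field names, and the sickness flag) is hypothetical, purely to illustrate the shape of the computation; the real nEmesis model infers sickness from Tweet language with machine learning:

```python
from collections import defaultdict

# Hypothetical records: (user, venue visited, did the user post
# sickness-related Tweets soon after the visit?)
visits = [
    ("ann", "Taqueria X", True),
    ("bob", "Taqueria X", True),
    ("cat", "Taqueria X", False),
    ("dan", "Diner Y", False),
    ("eve", "Diner Y", False),
    ("fay", "Diner Y", True),
]

def rank_venues(records):
    """Rank venues by the fraction of visitors who later reported symptoms."""
    sick, total = defaultdict(int), defaultdict(int)
    for _, venue, got_sick in records:
        total[venue] += 1
        sick[venue] += got_sick
    return sorted(total, key=lambda v: sick[v] / total[v], reverse=True)

print(rank_venues(visits))  # riskiest venue first
```

A ranking like this is exactly the kind of prioritized list an inspector could use in the adaptive workflow Adam mentions below.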

2. Your machine learning can assign scores to restaurants, based on Twitter data, that indicate the chances of food poisoning and match the Health Department’s ratings. Is there any way to make nEmesis data public, or to offer it as an add-on to services such as Yelp?

There certainly is — Henry Kautz’s group at the University of Rochester is working on extending GermTracker to capture foodborne illness in real time as well.

3. What are the benefits and disadvantages of using social data over more traditional research on health patterns?

Online social media is very noisy, but significantly more timely. Many months pass between inspections of a typical restaurant. If they get a delivery of spoiled chicken a day after an A+ inspection, it will make their patrons sick anyway. Systems like nEmesis, on the other hand, can detect that something is going on very quickly. The flip side is that it’s hard to be certain on the basis of 140 characters. Therefore, we advocate for a hybrid approach, where inspectors use nEmesis to make better-informed decisions. We can replace the current, basically random inspections with a more adaptive workflow to detect dangerous venues faster.

4. What else do you think Twitter can tell us about public health?

We did a number of studies, focusing on multiple aspects of our health that can be informed by data mining online social media. Beyond flu and food poisoning, we looked at exposure to air pollution, mental health, commuting behavior, and other lifestyle habits. You can take a look at our publications at http://www.cs.rochester.edu/~sadilek/research/

If you’re interested in additional interviews with people using social data in research, check out our 25 Data Stories to hear about how researchers used social data to track cholera after Haiti’s earthquake. 

Data Story: John Foreman of MailChimp on the Data Science Behind Emails

When I was in charge of email at my last startup, the MailChimp blog was a must-read. Their approach to email marketing is brilliant, so when my colleague suggested I interview MailChimp’s chief data scientist, John Foreman, for a Data Story, I was definitely onboard. In addition to being a data scientist at MailChimp, John is also the author of Data Smart: Using Data Science to Transform Information into Insight. You can follow him on Twitter at @john4man.

1. People have a love/hate relationship with email. How can data science help people love email more and get more out of it?

Recently, people across industries seem to be waking up from their social-induced haze and rediscovering the effectiveness of direct email communication with their core audience.

Think about a true double-opted email subscription versus, say, a Facebook “like” of a product. When I like a product page on Facebook, do I really want to hear from them in my feed? In part, isn’t that “like” just an expression that’s meant for public display and not for 1-to-1 ongoing communication from the business?

Contrast that with email. If I opt into a newsletter, I’m not doing that for anyone but myself. Email is a private communication channel (I like the term “safe place”). And I want your business to speak to me there. That’s powerful. Now think of a company like MailChimp. We have billions of these subscriptions from billions of people all across the world. MailChimp’s interest data is unparalleled online.

OK, so that means that as a data scientist, I have some pretty meaty subscription data to work with. But I’ve also got individual engagement data. Email is set up perfectly to track individual engagement, both in the email, and as people leave the email to interact with a sender’s website.

So I use this engagement and interest data to build products — both weapons to fight bad actors as well as power tools to help companies effectively segment and target individuals with content that’s more relevant to the recipient. My goal is to make the email ecosystem a strong one, where unwanted marketing email goes away and the content that hits your mailbox is ideal for you.

For instance, MailChimp recently released a product called Discovered Segments that uses unsupervised learning to help users find hidden segments in their list. Using these segments, the sender can craft better content for their different communities of recipients. MailChimp uses the product ourselves; for example, rather than tell all our customers about our new transactional API, Mandrill, we used data mining to only send an announcement to a discovered segment of software developers who were likely to use it, resulting in a doubling of engagement on that campaign.
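
As a rough illustration of the segmentation idea (not MailChimp’s actual implementation), an unsupervised pass over per-subscriber engagement features can surface groups no one defined by hand. A minimal 1-D k-means sketch over made-up open rates:

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Tiny 1-D k-means: returns k cluster centers, sorted."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        # Assign each value to its nearest center.
        buckets = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            buckets[i].append(v)
        # Move each center to the mean of its bucket.
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

# Hypothetical open rates: a disengaged group and an enthusiast group.
open_rates = [0.02, 0.05, 0.04, 0.03, 0.71, 0.68, 0.75, 0.70]
print(kmeans_1d(open_rates, k=2))  # one center per hidden segment
```

Real list segmentation would cluster on many features at once (opens, clicks, purchase history), but the mechanic is the same: the segments fall out of the data rather than being specified upfront.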

2. How is data science structured at MailChimp? How big is your team, and what departments do you work with?

MailChimp has three data scientists, and our job as a little cell is to deliver insights and products to our customers. That sounds like business-speak, so let me break it down.

By insights, I mean one-off research and analysis of large data sets that’s actionable for the customer. And by products, I mean tools that the customer can use to perform data analysis themselves. If the tool or product isn’t useful or required by the customer, we don’t build it. A data science team is not a research group at a university, nor is it a place just to show off technologies to investors. We’re not here to publish, and we’re not here to build “look at our data…ooooo” products for the media. Whenever a data science team is involved in those activities, I assume the business doesn’t actually know what to do with the technical resources they’ve hired.

Now, who is the “customer” in this mission? We serve other teams internally as well as MailChimp’s user base. So an example of a data product built for an internal customer would be Omnivore, our compliance AI model, while an example of a data product built for the general user population would be our Discovered Segments collaborative filtering tool.

We work very closely with the user experience team at MailChimp — the UX team is constantly interviewing and interacting with our users, so they generate a lot of hypotheses which we investigate using our data. The UX team, because their insight is built quickly from human interactions, can flit from thought to thought and project to project; when they think they’re onto something good, they kick the research idea to the lumbering beast that is the data science team. We can comb through our billions of records of sends, clicks, opens, http requests, user and campaign metadata, purchase data, etc. to quantitatively back or dismiss their new thinking.

3. Your book, Data Smart, is about helping to teach anyone to get value out of data. Why did you see a need for this book? 

I used to work as a consultant for lots of large organizations, such as the IRS, DoD, Coca-Cola, and Intercontinental Hotels. And when I thought about the semi-quantitative folks in the middle and upper rungs of those organizations (people more likely to still be using the phrase “business intelligence” as opposed to “data science”), I realized there was no way for those folks to dip their toe into data science. Most of the intro books made a lot of assumptions about the reader’s math education background, and they depended on R and Python, so the reader needed to learn to code at the same time they learned data science. Furthermore, most data science books were “script kiddy” books: the reader just loaded something like an SVM package, built an AI model, and didn’t really learn how the AI algorithms worked.

I wanted to teach the algorithms in a code-free environment using tools the average “left behind” BI professional would be familiar with. So I chose to write all my tutorials in Data Smart using spreadsheets. At the same time, though, I pride myself on writing a more mathematically deep intro text than what you find in many of the other intro data science texts. The book is guided learning — it’s not just a book about data science.

Now, I don’t leave the reader in Excel. I guide them into using R at the end of the book, but I only take them there after they understand the algorithms. Anything else would be sloppy.

Another reason I wrote the book is because the market didn’t have a broad data science book. Most books focus on one topic — such as supervised AI. Data Smart covers data mining, supervised AI, basic probability and statistics, optimization modeling, time series forecasting, simulation, and outlier detection. So by the time the reader finishes the book, they’ve got a Swiss Army knife of techniques in their pocket and they’re able to distinguish when to use one technique and when to use another. I think we need more well-rounded data scientists, rather than the specialists that PhD programs are geared to produce.

4. You’ve written a book, maintain a personal blog and write for MailChimp. How important have communication and writing skills become to data scientists?

I believe that communication skills, both writing and speaking, are vital to being an effective data scientist. Data science is a business practice, not an academic pursuit, which means that collaboration with the other business units in a company is essential. And how is that collaboration possible if the data scientist cannot translate problems from the high-level vague definition a marketing team or an executive might provide into actual math?

Others in an organization don’t know what’s mathematically possible or impossible when they identify problems, so the data science team cannot rely on them to fully articulate problems and “throw them over the fence” to a ready-to-go data science team. No, an effective data science team works as an internal, technical consultancy. The data science team knows what’s possible, and its members must communicate with colleagues and customers to understand processes and problems deeply, translate what they learn into something data can address, and then craft solutions that assist the customer.

5. Time for the Miss America question. If you had access to any data in the world, what is the question or problem you’d like to most solve?

I am a huge fan of Taco Bell. And I recognize that the restaurant actually has very few ingredients to work with — their menu is essentially an exercise in combinatorial math where ingredients are recombined in new formats to produce new menu items which are then tested in the marketplace. I’d love to get data on the success of each Taco Bell menu item. Combined with possible delivery format information, nutrition information, flavor data, and price elasticity data, I’d love to take a swing at algorithmically generating new menu items for testing in the market. If sales and elasticity data were timestamped, perhaps we could even generate menu items optimized for and only available during the stoner-friendly “fourthmeal.”
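
The combinatorial framing John describes is easy to make concrete. The ingredients, formats, and three-filling constraint below are all invented for illustration:

```python
from itertools import combinations, product

# Hypothetical core ingredients and delivery formats.
ingredients = ["beef", "beans", "cheese", "lettuce", "rice"]
formats = ["taco", "burrito", "crunchwrap"]

def candidate_items(ingredients, formats, fillings_per_item=3):
    """Enumerate every format/filling combination as a candidate menu item."""
    for fmt, combo in product(formats, combinations(ingredients, fillings_per_item)):
        yield f"{fmt} with {', '.join(combo)}"

items = list(candidate_items(ingredients, formats))
print(len(items))  # 3 formats x C(5,3) = 10 fillings -> 30 candidates
```

With sales, elasticity, and nutrition data attached to each candidate, the generation problem John imagines becomes a search over this enumerated space.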

Thanks to John for taking the time to speak with Gnip! If you’re interested in more Data Stories, please check out our collection of 25 Data Stories featuring interviews with data scientists from Kaggle, Foursquare, Pinterest, bitly and more! 

Data Story: How Microsoft Research is Using Social Data to Understand Depression

Sometimes the use cases for social data go far beyond what you would expect is possible. Such is the case with Microsoft Research, which has done some truly groundbreaking work using social data to study depression, including whether someone’s activity on Twitter can indicate that they are depressed. We interviewed Dr. Munmun de Choudhury of Microsoft Research to ask about their research using social data to study mental health, the privacy implications, and how social data can improve mental health. Dr. De Choudhury will be joining Georgia Tech’s School of Interactive Computing as an assistant professor this spring.

Munmun de Choudhury of Microsoft Research

1. What are the high-level takeaways you found on using Twitter to research depression?

This research direction has revealed for the first time how social media activity, such as on Twitter, can provide valuable indicators of mental health issues like depression. Twitter is noisy, but it is promising to see that there are signals hidden in there too that can tell us about important issues such as health and lifestyle, both at the level of an individual and at the scope of larger populations. The most prominent signals of depression lie in people’s social activity (i.e., to what extent they post, what kind of posts they share, and when they mostly post), their social network structure (e.g., how they are connected to their friends and friends of friends), and the linguistic style of the content they share. That these rather implicit signals (e.g., a person may never explicitly mention being “depressed”) can indicate people’s mental and behavioral issues was rather a surprise to us; though when we consulted with psychologists (in fact, one of the collaborators on this project was a psychologist), we learned that mental health may manifest itself via various nuances in people’s everyday behavior. This gives us hope that observing people’s social media use over time—something that is increasingly gaining popularity—can be used to build tools, forecasting algorithms, interventions, and prevention strategies for both individuals themselves and policymakers, to help them deal with and manage this medical condition in a better way.
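
The three signal families Dr. De Choudhury describes (activity, network structure, and linguistic style) can be approximated with very simple per-user features. This is a hedged sketch: the post format, the late-night window, and the pronoun list are invented here, and the published models use far richer features:

```python
# Hypothetical per-user feature extraction over a list of posts,
# where each post is (hour_posted, text).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def user_features(posts):
    """Toy activity and linguistic-style features for one user."""
    late_night = sum(1 for hour, _ in posts if hour >= 23 or hour < 5)
    words = [w.strip(".,!?").lower() for _, text in posts for w in text.split()]
    fp = sum(1 for w in words if w in FIRST_PERSON)
    return {
        "posts": len(posts),
        "late_night_share": late_night / len(posts),   # activity signal
        "first_person_rate": fp / max(len(words), 1),  # linguistic signal
    }

posts = [(2, "I can't sleep again"), (14, "Great lunch today"),
         (23, "Why do I feel like this")]
print(user_features(posts))
```

Features like these would feed a classifier rather than being read directly; no single feature, and certainly not this toy pair, indicates depression on its own.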

2. What are the privacy concerns of studying mental health with social data?

Studying mental health with online social data is extremely attractive, and can have widespread implications in enabling better healthcare; however it comes with its own set of privacy and ethics related challenges which cannot be ignored. A number of questions may arise: Can we design effective interventions for people, whom we have inferred to be vulnerable to a certain mental illness, in a way that is private, while raising awareness of this vulnerability to themselves and trusted others (doctors, family, friends)? In extreme situations, when an individual’s inferred vulnerability to a mental illness is alarmingly high (e.g., if the individual is suicide-prone), what should be our responsibility as a research community? For instance, should there be other kinds of special interventions where appropriate counseling communities or organizations are engaged? That is, finding the right types of interventions that can actually make a positive impact on people’s behavioral state as well as abide by adequate privacy and ethical norms is a research question on its own. I hope this line of work triggers conversations and involvement with the ethics and medical community to investigate opportunities and caution in this regard.

Additionally, as a community, we need to be aware of the limits up to which such inferences about illness or disability can be deemed to be safe for an individual’s professional and societal identity. In a sense, we need to ensure that such measurements do not introduce new means of discrimination or inequality given that we now have a mechanism to infer traditionally stigmatic conditions which are otherwise chosen to be kept private. These and other potential consequences such as revealing nuanced aspects of behavior and mental health conditions to insurance companies or employers make resolution of these ethical questions critical to the successful use of these new data sources and the research direction.

3. If you’re able to identify depressed individuals with social media, what does prevention and intervention look like?

In terms of prevention, the ability to automatically and privately infer concerns about people’s mental health can enable health professionals to be more proactive and to make arrangements that improve the access of at-risk individuals to appropriate medical help. At the same time, it can help policymakers better understand the incidence of different diseases, such as depression, which is extremely underreported and considered socially stigmatic, so that people can benefit from better healthcare practices. Further, the population-scale trends of depression over geography, time or gender may be a mechanism to trigger public health inquiry programs to take appropriate and needed measures, or to allocate resources in a timely manner.

In terms of intervention, since our estimates of depression can be made considerably more frequently than conventional surveys such as those by the Centers for Disease Control (CDC), the estimates can be utilized from time to time to enable early detection and rapid treatment of depression in sufferers. At the individual level, a variety of personalized and private tools may be developed that could help individuals better manage depression as well as help them seek social and emotional support easily.

4. What made you interested in studying “collective human behavior as manifested via our online footprints”?

I have always been fascinated by how new and emergent technologies online (e.g., Facebook and Twitter) are increasingly getting into the mainstream of our lives. While we know and realize that our actions on these platforms serve as a reflection of characteristics in the physical world (e.g., several of our Facebook friends are actually friends in real life), I was curious as to whether the increasing use of these platforms is impacting our behavior in some way, that is, whether the reverse is true. For example, does it affect the way we emote, interact, or build social ties with others? That is one reason I became very interested in digging deep into understanding aspects of our behavior based on what we say and what we do online.

The other motivation lies in my inherent penchant to study people. The web, and particularly social networks and social media, provide us with a very powerful tool to do so, in a way that the behavioral findings are derived non-intrusively from people’s day-to-day activities and, because of the scale of the data, are mostly generalizable. Lastly, on a humorous note, computer scientists are often labeled as socially awkward; so perhaps you can assume that this particular computer scientist intends to show that “hey, even we can be socially cool too, and even make sense of your social actions on the web!”

5. Where do you see the future of health research and social data going?

As people increasingly join social media sites with the goal of remaining connected as well as learning about what is going on around them, there are people who have been using these sites for years now. As Twitter and Facebook’s penetration increases, they will lend us a rich source through which we can observe individual-centric behavior over time, and consequently use those trends to understand when and where unexpected or anomalous behavior, e.g., concerning health issues, may emerge. At the population level, large-scale naturalistic data obtained from the web may provide rich insights into understanding health concerns and health outcomes which may not be possible with traditional survey methods. This is because surveys are often retrospective, and hence lack the immediacy of the context in which policies may be changed or influenced, or interventions made for enabling better healthcare.

Even more so, I hope that in the future, social media use can be leveraged to identify health issues in difficult-to-reach populations, or populations who would otherwise not reveal a condition due to social stigma. For instance, one of the many challenges hindering the global response to some of the extremely deadly diseases like AIDS is the difficulty of collecting reliable information about the populations who are most at risk for the disease. Since social media use is consistently gaining more and more ground, these platforms might be the new place wherein activity traces may be utilized to identify, with appropriate privacy policies enforced, particularly vulnerable populations, and enable them to receive better healthcare and help as the need may arise.

If you’re interested in more data stories, please check out our collection of 25 Data Stories for interviews with data scientists from Pinterest, Foursquare, and more!