Data Story: Mohammad Shahangian on Pinterest Data Science

At Gnip, we believe the value of social data is unlimited. Data Stories is how we bring this belief to life by showcasing how social data is used. This week we’re interviewing data scientist Mohammad Shahangian of Pinterest about how the data science team works at Pinterest, surprising uses of Pinterest and data science as a career path. You can follow him on Pinterest at pinterest.com/mshahang.

Data Scientist at Pinterest

1. What do you see as your role as the data scientist for Pinterest?

The company’s focus is on helping millions of people discover things they love and get inspiration to go do those things in their life. For me, that means analyzing the rich data that is created by the millions of people interacting with billions of pins from across the web each day. I evaluate this data and provide insights that make data actionable. My team also prototypes and validates ideas, performs deep analysis and builds tools that allow us to answer our most frequent questions in seconds. We work with every team to answer Pinterest’s biggest questions and ensure that each decision positively impacts Pinners over the long term.

For example, we take a business question like “How should our web, tablet and phone experiences differ?” and present the results as insights like, “Many users use the mobile apps in the morning and again at night, but prefer the website during the day” and “Users prefer to use mobile apps to casually discover new content, whereas they use the web to curate and organize content.” We then work with the design and product teams to build features around these insights and measure their impact.

2. What are some of your favorite ways that people use Pinterest that people wouldn’t expect?

What makes Pinterest unique is that it’s a tool and the users really define its use cases. For me, Pinterest was really helpful when I was planning my wedding, and it made perfect sense to use it as a collaborative office shopping list. I would have never thought to use it as a tool for:

A collection of Stop signs from around the world
Daily Grommet gets their community to collaborate on a board to see things they want to sell
Vintage Driving - a collaborative board where users pin their favorite vintage cars
GE Badass machines featuring GE tech
Madewell’s Rainbow board
Michelle Obama’s MyPlate Recipes encourages healthy eating
Stunning virtual collections of minerals and shipwrecks
The “365 Days of Pinterest” challenge. She made a Pinterest project every day for a year!
Sammy Sosa awesomeness
Sony shows off their technology with food pictures shot with a Sony Camera
Pantone announces the color of the year
The National Pork Board

3. What category do you see as the most viral on Pinterest?

DIY and recipe pins generally go viral year round. Around the holidays, holiday-themed content across all categories tends to get the most traction.

4. How has data science added value to Pinterest?

We have this internal value we refer to as “knit.” It means that we have an open, curious culture where everyone in different disciplines—from engineering and design to marketing to community—works together. Data science is at the core of that. The search, recommendations and spam teams apply data science to improve the quality of content we put in front of Pinners. This is only a subset of how we apply data though; most of the decisions we make at Pinterest are actually backed by data.

Data is a universal language that teams across the company use to collaborate and make decisions. Each team has a set of performance metrics, and we hold a weekly meeting to understand the impact that each area is having on company-wide metrics. As data scientists we do more than just analyze data; we create rich data sources that we make available to other teams so they can do their own analysis. More than half of Pinterest employees run MapReduce jobs via Hive. Our metrics dashboards are accessible to everyone and our core metrics are emailed daily to the entire team. We also share our data studies and insights with the whole team.

We also use data just for fun. During our weekly happy hour, we share a weekly Data Fun Fact with the team. We present the fact in the form of a multiple choice question and have the team vote on the answer. For example, we asked, “How many days before Valentine’s day does the query ‘Valentine’s day ideas’ increase the most: 1, 3, 5 or 7 days?” (Hint for the curious reader: two*three/two).

5. What do you think someone should know before becoming a data scientist at a major web company like Pinterest?

I would say go for it! If you are hungry to extract value from real world data, you’re really going to enjoy it. I know that for a lot of really talented people in academia the only thing standing between them and the opportunity to solve a really interesting problem is the lack of rich data. My experience at Pinterest has been the exact opposite. Our team can’t grow fast enough to tap into a world of valuable insights that are sitting dormant within billions of records somewhere in the cloud.

Data Stories: Brooke Fisher Liu on Using Social Media in Natural Disasters

Data Stories is Gnip’s project to tell the stories of how social data is being used. This week we’re interviewing Brooke Fisher Liu from the University of Maryland about her research on how people use social media in natural disasters (PDF). You can follow Brooke on Twitter at @Bfliu. (Also, you can see our data scientists post on Twitter’s reaction to an earthquake in Mexico.)

Brooke Fisher Liu (photo courtesy of Anne McDonough)

1. When the wildfires broke out in Boulder, I found Twitter to be the best source of information hands down. What kind of information do you see people communicating about natural disasters?

During natural disasters people tend to use social media for four interrelated reasons: checking in with family and friends, obtaining emotional support and healing, determining disaster magnitude, and providing first-hand disaster accounts. A consistent research finding is that people are less likely to follow official, government sources on social media than their friends and family during disasters. I think that may change over time as government sources become more savvy about effectively using social media during disasters.

2. How is curated content such as Storify changing how people communicate during disasters?

This is one area where the research hasn’t caught up with practice yet. However, I think that social media sites that curate content such as Storify, Pinterest, or even Instagram are going to be major players in disaster communication in the future. One of the reasons people don’t turn to social media for disaster information is that the quantity of information is difficult to sift through and verify. Sites that curate content help cut through the sea of online information, and also provide a familiar, reliable source of information through online connections established before disasters.

3. You talked about people mobilizing on social media after natural disasters in your report. Do you ever see people respond in real time?

Absolutely. Real-time communication is one of the primary draws of social media during disasters. There are multiple examples of social media being the first source of disaster information, such as during the 2011 Tuscaloosa tornadoes and the 2008 Mumbai terrorist attacks.

4. What surprised you the most about how people were using social media during natural disasters?

By far the biggest surprise is that people still turn to traditional media sources, especially broadcast journalism, as the most accurate source of disaster information. So, while they may first turn to social media, they still prefer traditional media during disasters. I think this may change over time, but it certainly was a surprise for me. Of course, journalists often rely on social media for disaster information, and I think over time we’ll see the distinction between traditional media and so-called new media blur even more.

5. How do you think the use of social media in natural disasters will evolve?

I think over time people will view social media as more trustworthy and thus turn to it as their primary source of information. I also think social media will continue to play a large role in facilitating disaster recovery by helping people connect with each other and rebuild communities. “Official sources” such as governments and the media will increasingly enhance their social media presence before disasters, which likely will position them to be not only the first, but also most trustworthy social media sources down the road. Perhaps most importantly I think social media will continue to surprise us by providing new communication capabilities during disasters that we can’t currently predict.

Aspirational Brands & Tumblr: Lexus vs. Toyota

Gnip conducted a brief analysis of the Toyota family of brands (Toyota, 4Runner, Camry, Highlander, Lexus, Prius, Rav4, Scion, Sequoia, Tacoma, Tundra) on multiple social media platforms. We looked at brand mentions on Tumblr, Twitter, WordPress and WordPress comments during the period of Oct. 15 to Nov. 15, 2012.

As you would expect, Toyota was the most frequently mentioned brand on each social platform, with one enormous exception: Tumblr. Lexus had 5 times as many mentions on Tumblr as Toyota. This highlights how aspirational brands do exceptionally well on Tumblr, where niche communities of fans often form around brands. (Attention brand managers: this happens whether the company is involved or not.) A central component of Tumblr is visual content, which also plays well with aspirational brands. Furthermore, Tumblr content is both extremely viral and long-lived, meaning that content shared on Tumblr can be shared for longer periods of time and jump to more diverse sub-groups within the network than on other social networks. During the month Gnip tracked mentions, Lexus received more than 200,000 mentions on Tumblr while Toyota received 40,000.
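To make the arithmetic behind the comparison concrete, here is a minimal Python sketch. The brand list, the whole-word matching rule and the function names are assumptions made for illustration; this is not Gnip's actual pipeline. Only the two monthly totals come from the figures above.

```python
import re
from collections import Counter

# Hypothetical brand list for illustration; not Gnip's actual tracking rules.
BRANDS = ["toyota", "lexus"]

def count_brand_mentions(posts, brands=BRANDS):
    """Count how many posts mention each brand (case-insensitive, whole words)."""
    counts = Counter({brand: 0 for brand in brands})
    for text in posts:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for brand in brands:
            if brand in words:
                counts[brand] += 1
    return counts

def mention_ratio(counts, brand_a, brand_b):
    """How many times more often brand_a is mentioned than brand_b."""
    return counts[brand_a] / counts[brand_b]

# Monthly Tumblr totals reported in the post. Lexus is "more than" 200,000,
# so the resulting ratio is a lower bound.
tumblr_totals = {"lexus": 200_000, "toyota": 40_000}
print(mention_ratio(tumblr_totals, "lexus", "toyota"))  # 5.0
```

The same `count_brand_mentions` helper could be run per platform to see whether the ranking of brands changes from one network to the next.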

In social media, it is easy to rely on Twitter as a kind of alert system for when content is being shared, but at Gnip we’ve seen time and time again that content that pops up elsewhere doesn’t always pop up on Twitter. Each social media network has its own attributes, audience and modes of interaction. Because of likes, reblogging, and the way timelines are read by Tumblr users, Tumblr has active communities that aren’t found elsewhere.

Lexus on Tumblr

Observations On Disqus: The Spread of Words

Marketers and communicators all share a similar goal: to become part of the conversation. Comments in reaction to blogs and news stories are a fantastic place to discover the topics that are driving conversation. To dig deeper, we recently looked at public comments from Disqus, the world’s largest discussion platform, to see what was getting online chatter at the end of 2012. With 70,000 comments published on Disqus every hour, you can find insights and conversations that can’t be found elsewhere.

What we found is that communicators often use a language set that the audience does not share. In discussion, lowest-common-denominator language dominates. Let’s look at a couple of examples.

The Fiscal Cliff

Social Media Discussion of Fiscal Cliff

At the end of 2012, one topic that dominated mainstream publications and political blogs was the Fiscal Cliff, when a series of tax cuts for the United States were expected to expire at the end of the year. Since this was a topic of contention between the Democrats and Republicans, you would have expected this to be a passionate point of conversation during the Elections. As it turns out, this wasn’t exactly the case. When did the Fiscal Cliff talk start? The day after the Election. And the discussion was couched in broader terms than just the acute “Fiscal Cliff” crisis. So while Washington operates and speaks in continual crisis mode, the public thinks of these challenges in broader, more systemic terms.

Disqus Conversations on Taxes and Medicare

While the Fiscal Cliff wasn’t a hot topic until after the election, taxes and Medicare saw consistent conversations before and after the election.

Timing is everything when it comes to starting conversations. While the Election focused on what happened in the past four years and what would happen in the next four years, the day after the Election homed in on what was immediately down the road: the Fiscal Cliff.

Skyfall vs Breaking Dawn vs Twilight

Skyfall vs Breaking Dawn vs Twilight on Disqus

Moving from politics to pop culture, we were curious what would generate more conversation — a bunch of sparkly vampires driving Volvos (the movie Breaking Dawn, the fourth installment in the Twilight series) or the eponymous spy from England (Skyfall). We were initially surprised to see that Skyfall generated more chatter around its premiere on Nov. 9 than Breaking Dawn saw for its premiere on Nov. 18. However, when we took a closer look by adding the term Twilight into the mix, we found that Twilight created more chatter than Skyfall.

Comments are an excellent barometer of buzz around upcoming events and launches. Even more than that, comments can help companies understand what terms people use about an event. In this example, if you were the studio marketer using content marketing to promote the release of Breaking Dawn, your odds would improve by using Twilight in your headline.

Movie vs Libya vs Benghazi

While searching for popular movies on Disqus, we found an interesting spike for the term “movie” in mid-September, but couldn’t attribute it to a popular movie. After some digging, we realized that this was related to “Innocence of Muslims,” the controversial movie that spoofed Islam. While the movie was originally uploaded to YouTube in July, it aired on an Egyptian network on Sept. 9, which immediately sparked protests that quickly spread to Libya. On Sept. 11, four Americans including the Ambassador were killed in Benghazi, Libya. While the terms Libya and movie spiked immediately, Benghazi built up momentum more slowly, spiking right before the election as it became part of the political debate between the two parties.

Buzz around current events doesn’t always spike right after the event. As new facts and information are disseminated, the current of conversation can change. In this scenario, a new and more specific term, “Benghazi,” came to dominate the conversation as it slowly became shorthand for the overall issue. What carries conversation is language that accelerates understanding and lowers the barrier for participation.
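A comparison like this comes down to counting, per day, how many comments mention each term and then looking at when each series peaks. The Python sketch below illustrates the idea under stated assumptions: the sample comments and dates are made up, matching is a simple case-insensitive substring check, and none of this reflects the Disqus API or Gnip's tooling.

```python
from collections import defaultdict

def daily_term_counts(comments, terms):
    """Return {term: {day: count}} for comments that mention each term.

    `comments` is an iterable of (day, text) pairs. Matching is a simple
    case-insensitive substring check, so "movie" also matches "movies".
    """
    counts = {term: defaultdict(int) for term in terms}
    for day, text in comments:
        lowered = text.lower()
        for term in terms:
            if term in lowered:
                counts[term][day] += 1
    return counts

def peak_day(series):
    """Day with the highest count in one term's {day: count} series."""
    return max(series, key=series.get)

# Made-up sample comments, for illustration only.
sample_comments = [
    ("2012-09-12", "That movie caused all of this"),
    ("2012-09-12", "What is happening in Libya?"),
    ("2012-09-12", "The movie was on YouTube for months"),
    ("2012-09-13", "Libya again"),
    ("2012-09-13", "Benghazi is in Libya"),
    ("2012-10-30", "Benghazi will decide my vote"),
    ("2012-10-30", "More questions about Benghazi"),
]

counts = daily_term_counts(sample_comments, ["movie", "libya", "benghazi"])
for term, series in counts.items():
    print(term, peak_day(series))
```

With real comment volumes, plotting each series over time would show which terms spike right away and which build slowly.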

Ultimately, comments are windows into not only what people are talking about but also when topics tip over into public consciousness and what drives conversations to peak. In the same way that communicators deploy search engine optimization to target searchers, they need to also incorporate conversation optimization strategies to become part of the conversation.

Data Stories: Harper Reed, Former CTO of Obama for America

Data Stories is Gnip’s project to tell the stories behind how people use data and why it matters. This week we interviewed Harper Reed, the former CTO of Obama for America, about the technology behind the scenes of elections, civic data and more. You can follow Harper on Twitter at @Harper.

Harper Reed, Photo by Joi Ito

1. In the next four years there will be massive changes in technology. What do you think will change in how campaigns use data in the 2016 election?

The big innovation this cycle was the analytics and how we found answers from the data we had. This follows the arc of the big data movement. When I first got involved in 2007/8 the conversations were all about collection and storage of data. Recently we have seen people stop worrying about that because it is largely solved. Now, the big data space seems to have, thankfully, shifted to concentrate on gaining insight and getting answers from the data.

I think that this arc will continue. 2016 will be more about the answers that we will get from the data. Aggressively using modeling and data analytics to help make sure that there are no missteps.

2. Obama 2012 worked hard to remove the silos between tech and digital. What are some of the lessons you’ve learned about sharing data between departments?

This is obviously a work in progress for every organization. We made sure that there was a close physical proximity. That helps a lot.

Personally, the best lesson I learned was taught to me by John Maeda. He came in for a whirlwind visit and said “Manage by your Outbox. Not by your Inbox.” He then left. It was amazing and exactly the right amount of info. Later, he told me that this knowledge came from Larry Bacow.

Anyway, the idea is that you can work through political struggles, silos, etc by making sure that you are communicating out. Don’t expect or judge by the incoming communication.

This, for me, was the number one way to break down silos.

3. Political campaigns previously weren’t known for being full of tech savvy people. How do you reconcile the needs of campaign strategists and translate it to your team? Essentially, what did you learn about product development for campaigns?

Campaigns are organized like emergency response. Very top down. Lots of volunteers. Lots of downward delegation. Lots of managing and communicating up.

This type of organization does not mix well with standard software methodologies. It is hard to have a product manager when the stakeholders are unwilling to cede responsibility for the project to the PM.

Part of this is a lack of trust. Technology does not have a good history with campaigns (hopefully we helped instill more trust). Part of this is that in a top down type of environment, you need to negotiate more.

We found success by iterating on our process as quickly as we iterated on our software. We also would not have been able to do this without the great product team we had (Carol Davidsen, Mari Huertas, David Osborne, Jason Kunesh). They took the iteration seriously and made sure that the product development was successful no matter what.

4. What role did data scientists play in the reelection campaign?

We had an amazing set of data scientists – led by Rayid Ghani. They played the role of every scientist – thinking of crazy awesome things to test, testing experiments, making some of the coolest and more important of our discoveries and driving me crazy.

Working with them was awesome.

5. As one of your hacking projects, you took data from Chicago Transit Authority’s bus tracker and made it public. What other information would you like cities to make public?

I am going to pull the data hippie card here and say: ALL THE DATA!

I try and lead a very transparent and free life. There is not a lot of data that I think should not be public. Obviously personal data is a bit different (I would really like to have all my financial data, etc be available). But if it was available for everyone – it wouldn’t have such a stigma around it.

More realistically – the more civic data that is available, the better and more informed civic decisions we can make.

6. What’s next for you?

I have a small team of amazing people who I am working with. We are focusing on business tools. It should be fun.

Data Stories: Dmitrii Vlasov on Kaggle Contests

At Gnip, we’re big fans of what the team at Kaggle is doing and have a fun time keeping tabs on their contests. One contest that I loved was held by WordPress and GigaOm to see what posts were most likely to generate likes, and we interviewed Dmitrii Vlasov who came in second in the Splunk Innovation Prospect and sixth overall. For me, it was interesting to speak to an up and coming data scientist who isn’t well known yet. Follow him at @yablokoff.

Dmitrii Vlasov of the GigaOm WordPress contest

1. You were recognized for your work in the first Kaggle contest you ever entered. What attracted you to Kaggle, and specifically the WordPress competition?

I came to Kaggle accidentally, as always happens. I read a blog post about the Million Song Dataset Challenge provided by Last.fm and a bunch of other organizations. The task was to predict which songs users would like based on their existing listening history. This immediately excited me because I’m an active Last.fm user and had been reflecting on what connections between people can be established based on their music preferences. But the contest was coming to an end, so I switched to the WordPress GigaOm contest and got 6th place there. Well, it is always interesting to predict something you already use.

2. What is your background in data science?

Now I’m a senior CS student in Togliatty, Russia. I can’t say that I have a special background in data science: I had a more than year-long course on probability theory and mathematical statistics at university, have some self-taught skills in semantic analysis and have a big love for Python as a tool for implementing ideas. Also, I’ve enrolled in the Machine Learning course on Coursera.

3. You found that blog posts with 30 to 50 pictures were more likely to be popular. You also found that longer blog posts attract more likes (80,000-90,000 characters). This struck my marketing team as really high and was contrary to your hypothesis that longer content might be less viral. Why do you think this is?

Well, my numbers show the relative correlation between the number of photos, characters and videos and the number of likes received. A big relative “folks love” at certain photo counts means that there were not many posts with that many photos, but most of them were high quality. Quick empirical analysis shows that these are a special type of post: “big photo posts.” They are usually photo reports, photo collections or scrapbooks. For such posts 10-15 photos are not enough, but at the same time 10-15 photos seem overloaded for a normal post. The same can be said about a large amount of text in a post. Of course, the most “likeable” posts contain 1,000-3,000 characters, but posts with 80-90 thousand characters are winners in the “heavyweight category.” These are big research pieces, novels and political contemplations. The analysis is quite simple, but it shows that if you want to create media-rich or text-rich content, it should be really media- or text-rich. Otherwise you may fall into an unsuitable hollow in between.
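One simple way to look for this kind of pattern is to group posts into buckets by photo count and compare the average likes per bucket. The Python sketch below is an illustration only: the sample data is invented, the bucket size is arbitrary, and this is not the method or data used in the contest entry.

```python
from collections import defaultdict

def mean_likes_by_bucket(posts, bucket_size=10):
    """Average likes for posts grouped by photo count.

    `posts` is an iterable of (photo_count, likes) pairs. Each post lands in
    the bucket that starts at the nearest lower multiple of `bucket_size`,
    e.g. 35 photos goes to bucket 30 when bucket_size is 10.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [sum of likes, number of posts]
    for photo_count, likes in posts:
        start = (photo_count // bucket_size) * bucket_size
        totals[start][0] += likes
        totals[start][1] += 1
    return {start: likes_sum / n for start, (likes_sum, n) in sorted(totals.items())}

# Invented sample data: (photo_count, likes).
sample_posts = [(2, 10), (5, 14), (12, 3), (14, 5), (35, 40), (42, 60)]

for bucket, avg in mean_likes_by_bucket(sample_posts).items():
    print(f"{bucket}-{bucket + 9} photos: {avg:.1f} likes on average")
```

In this made-up sample, the middle bucket has the lowest average, which is the kind of "hollow" the answer describes. The same bucketing could be applied to character counts.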

4. What else would you like to predict with social data if you got the chance?

Now I work on romantic and friend relationships that could be established based on people’s music preferences (it’s a privately held startup in alpha). This is a really interesting and deep area! Also, I’d like to work with some political data, e.g. to predict the reaction to a politician’s statement based on a user’s Twitter feed. Or to extract all the “real” theses of a politician based on all of his public speeches.

Data Stories: Dino Citraro of Periscopic on Data Visualization

The Periscopic team has a long-standing reputation for their excellent work in data visualizations, so we asked one of the founders, Dino Citraro, to participate in a Data Story about data visualizations. You can follow Dino on Twitter at @dinocitraro and check out their work at Periscopic.com.

Dino Citraro of Periscopic

1) Periscopic’s tagline is “Do good with data”. What are some of the Periscopic projects that embody that tagline?

We formed Periscopic with the hope that we could do good with data. To us that means helping people that share the ideals of progressive social change, sustainability, human rights, equality, environmentalism, and transparency to name a few. Most of our work enables insights and discussions in those areas. Some recent and/or notable projects are:

“VoteEasy”

VoteEasy.org is a voter education tool that was designed to allow the general public to quickly and easily see how closely political candidates align with their views on key issues. It’s like Match.com for political candidates. It utilizes thousands of hours of research and a vast collection of data assembled by the nonpartisan group, Project Vote Smart. It is the most up-to-date resource for candidate political information, including voting records, interest groups ratings, campaign finances, and personal biography.

http://www.periscopic.com/#/work/voteeasy

“The State of the Polar Bear”

The State of the Polar Bear is the authoritative source for the health and status of the world’s polar bears. This multipart data visualization was developed through an international partnership with the Polar Bear Specialist Group, a scientific collaboration of the five polar bear nations: Canada, Denmark, Norway, the USA, and Russia. It covers data related to pollution levels, tribal hunting, and population dynamics of the bears.

http://www.periscopic.com/#/work/pbsg

“Who’s Talking About Breast Cancer”

Developed for GE’s Healthymagination data visualization forum, this tool takes a realtime look at the discussions happening on Twitter around the topic of breast cancer. Tweets from all over the world are aggregated in a single location, allowing visitors to quickly understand the current topics, trends, and stories.

http://www.periscopic.com/#/work/ge-breast-cancer

2) With infographics now being an over-hyped tool for marketing, what challenges does that create for a company actually trying to tell stories with data?

If they are done well, infographics can be a very effective story-telling device. Unfortunately, many of them seem to either lack an engaging metaphor, or don’t do a good job of letting the data be the story.  Since most of our work is interactive, we have an advantage over traditional infographics because we can reveal information in a user-directed way. The challenges we face are how to slowly introduce these stories in a way that is engaging for visitors, and not overwhelming.

3) What are the greatest opportunities right now for data visualization?

The greatest opportunities for data visualization probably relate to public data and personal data. Public data, because it has the greatest potential for good and efficiency. Personal data, because it is the thing that most people seem to find interesting. The Quantified Self movement has exploded, and along with it the desire to understand our social media behaviors, and the rise of the Quantified Social Self.

4) How do you separate the wheat from the chaff when it comes to good data? 

There is no such thing as “good data”, there is only good context. You can create a compelling data visualization out of any data source, as long as you use the right context.  For instance, one of our pieces uses the gaps in the data – the lack of data – as part of the story. Our client wanted to highlight the fact that they needed to increase the data collection efforts, and wanted public support for this effort. You could have a massive data set that is impeccably organized, but without the right context, it can go unnoticed.

5) How does good visualization help create data literacy?

To us, the issue is literacy in general. Like good design, data visualizations should be transparent and unnoticed. The epiphanies one gets from interacting with data are the things that should be retained, not the fact that an interface was unique, or the interactivity was sophisticated.

Having said that, the very process of interacting with data through a visualization tool brings an understanding of what is possible, and with that, the desire increases for more, and better experiences.

Data Stories: Interview with Data Scientist Blake Shaw of Foursquare

At Gnip, we believe the value of social data is unlimited. Data Stories is how we bring this belief to life by showcasing how social data is used. This week we’re interviewing data scientist Blake Shaw of Foursquare about how data science is not only shaping Foursquare and its recommendations, but how Foursquare can be a “microscope for cities.” You can follow Blake on Twitter at @metablake and check out Foursquare’s blog for more data science. 

Data Scientist Blake Shaw of Foursquare

1. Your team has found a correlation between warm days and ice cream consumption in NYC. At some point, do you envision Foursquare being able to trigger offers based on different correlations your data science has found?

Yes! In fact, we currently trigger recommendations (which often contain deals and offers) based on a ton of different contextual signals that the team here has identified as useful. These signals include where you are, the places you like to go, the time of the day, the preferences of your friends, and what is popular around you. Mapping all of these signals to good recommendations requires finding correlations in massive amounts of data. Some of these correlations are simple, like when it’s the morning, people like to get coffee, and some are more complex, like when it’s cold out in New York, people are more likely to go to ramen and noodle shops.

2. One of my favorite features of Explore is that when you check into a city, Foursquare lets you know the locations where both locals and out-of-towners like to go. How do data science and product work together to make recommendations such as these?

Tourist recommendations is definitely one of my favorite features of Explore as well. In general, there is a healthy mix of product-driven and data-driven development at Foursquare. We will often work together to brainstorm not only what would be best to build from a product perspective but also what data we should be investigating further. Tourist recommendations came from the data; we realized that it would be easy to identify places that had a statistically high proportion of tourists and surface them to Explore users who find themselves in unfamiliar areas.  The results are fantastic — it’s like having millions of people creating a travel guide, just by walking around a city and checking in.
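As a rough illustration of what "a statistically high proportion of tourists" could mean, the Python sketch below compares each venue's tourist share with a citywide baseline using a z-score for a proportion. Everything here is an assumption made for the example: the venue names, the counts, the baseline, the threshold and the choice of test. Foursquare's actual method is not described in the interview.

```python
import math

def tourist_z_score(tourist_checkins, total_checkins, baseline_share):
    """z-score of a venue's tourist share against the citywide baseline share.

    Uses the normal approximation for a proportion, which is only rough
    when a venue has few check-ins.
    """
    share = tourist_checkins / total_checkins
    std_err = math.sqrt(baseline_share * (1 - baseline_share) / total_checkins)
    return (share - baseline_share) / std_err

def tourist_heavy_venues(venues, baseline_share, z_threshold=3.0):
    """Names of venues whose tourist share is well above the baseline."""
    return [
        name
        for name, (tourists, total) in venues.items()
        if tourist_z_score(tourists, total, baseline_share) >= z_threshold
    ]

# Hypothetical venues: name -> (check-ins by out-of-towners, all check-ins).
venues = {
    "Observation deck": (600, 1000),
    "Corner deli": (90, 1000),
    "Tiny gallery": (2, 5),
}

# Assume 10% of all check-ins in the city come from out-of-towners.
print(tourist_heavy_venues(venues, baseline_share=0.10))  # ['Observation deck']
```

The tiny gallery has a high tourist share but too few check-ins to clear the threshold, which is the point of testing against sample size rather than ranking on raw proportion alone.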

3. Foursquare got its start in NYC. What are interesting observations you’ve seen on how people use Foursquare in smaller cities such as Boulder and Denver?

I feel like Foursquare is more of a necessity in big cities like New York, where new places are opening all the time and it’s hard to keep track of them all.  That said, we see strong usage in places like Boulder and Denver as well. As expected, users in smaller cities such as these are more interested in old favorites rather than exploring new places.

4. What signals does Foursquare use to recommend places to people?

I can’t reveal all of the signals we use to rank places, but we believe that place recommendation should be highly personalized, so we heavily weight signals about your tastes and the tastes of your friends.  We also think that from all of this data about where people are going we can discern which are the best places.  Imagine being able to ask everyone who has been to a restaurant if they would go back. We believe that by measuring signals about places such as loyalty, expertise, and sentiment we can tease out the best places. This is the idea behind our recently launched Foursquare ratings.  People are voting with their feet in the real world, not simply leaving a star or a like on a website.

5. Do you see a correlation between Foursquare sharing check-ins and badges on other social sites and increased usage of Foursquare? For example, if someone chooses to share a checkin on Twitter or Facebook, does that increase the likelihood of other people checking in?

Yes we do. Roughly a quarter of all check-ins are shared to wider audiences on Twitter and Facebook.  These in turn help spread awareness and adoption of Foursquare.

6. Foursquare recently showed a visualization of how check-ins in NYC were affected by Hurricane Sandy. How else do you see check-in data being useful other than for powering your recommendation engine?

Visualization of Foursquare Checkins Before and After Hurricane Sandy

One of my favorite aspects of working at Foursquare is getting to study this data from a larger sociological perspective. We are capturing this amazing signal about what millions of people are doing in the real world at every moment of the day in cities all around the globe. We have seen that when we aggregate check-in patterns across many individuals, we can measure features of cities at a higher resolution than was ever possible before.  I think this data can act almost like a “microscope for cities.”  If you look at how the storm affected NYC, you can see how this incredibly powerful force disrupted the natural rhythm of the city. It’s striking how predictable these patterns are, and how precisely we can identify unusual events. For example, in this plot we see how check-ins at grocery stores went up more than 200% in the days before the storm.  I see this real-time pulse or “EKG” of a city being a valuable resource in the future for understanding cities, giving us a larger view of the collective movement patterns of millions of people.
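
To make the "more than 200%" figure concrete, here is a small Python sketch of comparing a day's check-in count with a baseline of typical days. The counts, the day labels and the function names `percent_change` and `spike_days` are invented for illustration, and this is only one simple way such a comparison could be done.

```python
from statistics import mean


def percent_change(baseline_counts, observed_count):
    """Percent change of one day's count relative to the mean of baseline days."""
    baseline = mean(baseline_counts)
    if baseline == 0:
        raise ValueError("baseline mean must be non-zero")
    return (observed_count - baseline) * 100.0 / baseline


def spike_days(daily_counts, baseline_counts, threshold_pct=200.0):
    """Return (day, percent_change) for days at least threshold_pct above baseline."""
    spikes = []
    for day, count in daily_counts:
        change = percent_change(baseline_counts, count)
        if change >= threshold_pct:
            spikes.append((day, change))
    return spikes


# Made-up grocery-store check-in counts: four ordinary days, then the run-up to a storm.
typical_days = [100, 110, 90, 100]
storm_week = [("Fri", 105), ("Sat", 320), ("Sun", 330)]
spikes = spike_days(storm_week, typical_days)
```

Note that an increase of 200% means the count is three times the baseline, not twice it, which is why Saturday and Sunday are flagged in this sample and Friday is not.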

Four Themes From the Visualized Conference

The first Visualized conference was held in Midtown Manhattan last week. Even with Sandy and a nor’easter, the conference went off with only a few minor hiccups. The idea behind Visualized is a TED-like objective of exploring the intersection of big data, storytelling and design. It worked.

Throwing designers and techies together makes for one of my favorite forums because of what they have in common and where they differ. On the one hand, artists are increasingly skilled with technical tools; on the other, these people often come at things from very different perspectives.

The advantages of mixing these people at Visualized go beyond simple idea sharing. Each person specializes, leading to amazing expertise, skill, and focused perspective, but also leaving something out. It is not that everyone can learn to do everything, but rather, by sharing projects, methods and tools, we can learn what to ask and whom to seek out for collaboration. The advantage of this mix is that it is the most reliable way to produce projects that evoke emotion with story, design and data to engage and inform.

We were treated to amazing technical talent and creativity, evident in, for example, Cedric Kiefer’s generative dancer reproduction “unnamed soundsculpture.” To create the basic model, his team started with a song and a dancer, knitting together the 3D surface images from three Microsoft Kinect cameras. They re-generated the movie of the dance by simulating the individual particles captured in the imaging and then enhancing these to generate more particles under the influence of “gravity” and “wind” driven by the music.

unnamed soundsculpture from Daniel Franke on Vimeo.

Cedric and his team radically expand ideas of numeric visualization by capturing and building on organic physical data in complex and subtle ways, generating a whole, engrossing new experience from the familiar elements.

Four themes surfaced repeatedly in the ideas and presentations of the speakers:

Teams

Most of the projects were produced by teams made up of people with a handful of diverse skills and affinities. I heard descriptions of teams such as, “we have a designer (color, composition and proportion sense, works in Illustrator, Photoshop, pen and paper…), a data scientist (data munging, machine learning, statistical analysis…), a data visualization artist (JavaScript, D3 skills, web API mashup skills…), someone who is driven by narrative and storytelling (journalist, marketing project lead…), a database guy, etc.”

Assembling and honing these teams of technical and artistic creatives is probably a rare skill in itself and the result is a powerful engine of exploration, creativity and communication.

Hilary Mason from Bit.ly summed up the second-level data scientist talent shortage clearly: “Every company I know is looking to hire a data scientist; every data scientist I know is looking to hire a data artist.” As broad as data scientists’ skills are, many are recognizing the value of talented designers with the appropriate programming skills for crafting a clear, engaging message.

The New York Times teams (two different teams presented), the WNYC team of two, Bit.ly, and many others showed the power of teams creating together and bringing diverse talents to projects.

There were a couple of notable individual efforts. My favorite was Santiago Ortiz’s beautiful, complex and functional visualization and navigation of his personal Knowledgebase. His design elegantly uses the 7-set Venn diagram, and his deep insights into searching by category and time come together perfectly.

Journalism

Sniffing out the story is fundamental to projects that evoke emotion with story, design and data to engage and inform. Journalists can smell drama and conflict and ask lots of questions. They have a sense of where to dig deeper. They are able to stick to the thread of the story and have a valuable work ethic around finding the details and tying up loose ends.

A large part of the success of Shan Carter and his team in creating the New York Times paths to the White House win visualization comes from their ability to return over and over to the basic idea of making a relevant, accurate and understandable visualization of the various likely outcomes in each of the battleground states. This visualization went through 257 iterations checked into their GitHub repository, with a few evident cycles of creative expansion followed by refocusing on the story.

Data Mashups

API-mashup skills found their best examples in the news teams. WNYC’s accomplishments in creating data/visualization mashups to communicate evacuation zones, subway outages and flood zone information updated in real time during the storms, along with other embeddable web widgets, were amazing. While their designs didn’t have the polish of some of the “slower” work presented, they produced great, accurate and timely results in days and sometimes hours.

Design

“A visualization should clarify in ways that words cannot.” (Sven Ehrmann)

This summed up what I found awe-inspiring and satisfying in the design work. Since I primarily work with data visualization, I often rely on the graph-reading skills of my audience rather than optimized design. This may be necessary for many business applications, but when the message is important and the investment you can reasonably expect from your audience is uneven, to take shortcuts on design is to completely miss opportunities to engage and inform. Great designers are masters at creating memory because they are able to reliably create “emotion linked to experience” (Ciel Hunter).

Jake Porway summed up data, team, story and design in the observations section of his presentation:

  • Data without analysis isn’t doing anything
  • Interdisciplinary teams are required
  • Visualization is a process (see the example from Shan Carter at NY Times above)
  • Tools make amazing outcomes possible with limited resources
  • There is a lot of potential to do a great deal of good when we learn to evoke emotion with story, design and data to engage and inform.

 

Data Stories: Interview with Simon Rogers of the Guardian Data Blog

If you’re into data visualizations at all, then you’re going to be familiar with Simon Rogers of the Guardian data blog. They tell incredible stories using data and they’re a leader in the industry for data journalism. I was elated when Simon agreed to be interviewed for a Data Story to talk about his work in data journalism. 

Simon Rogers of Guardian Data Blog

1. How does your department find its data and choose which sources to use?
It really varies. Sometimes it’s breaking news, such as Hurricane Sandy. That led us to do this map showing every verified event as we felt the raw information was too difficult for most people to find. Sometimes there’s a dataset that’s been released that we feel really needs questioning and investigation. Other times it’s down to a hunch that it might be interesting. After the Denver shootings we did this post looking at gun ownership and homicide rates around the world; so it can be something as serious as that – or as weird as a list of Doctor Who villains (personal obsessions can come into the process…)

2. What do you find first? The stories you want to tell or data that can tell stories?
It’s all about the stories, so I would say it goes that way round. It’s good to start examining a datasource with an idea in mind of what you’re looking for. Otherwise the whole thing just gets too unmanageable.

3. You majored in journalism. How did you end up pursuing data journalism, and what skills did you need to learn along the way? 
After 9/11 (which was my second day on the newsdesk) I was told to work with the graphics team to help tell those stories. And I found myself coming back to that role in between editing the science section. During that process, I started getting better at working with spreadsheets and just collecting data, often just to make my job easier. You don’t want to keep having to search for carbon emissions data anew each time you’re doing a story on climate change. In around 2006 or 2007, Adrian Holovaty came and gave a talk at the Guardian to staff and I thought ‘that sounds like a job – and it’s not a long way away from what I’m doing already’. So, when we launched the Datablog in early 2009, it was just a matter of surfacing data we already had. In the meantime, I’ve learnt a load of tools, but mainly work with Excel, Google Refine, Fusion Tables and free viz tools like Tableau and Datawrapper.

4. The Guardian has incredible visualizations. As a data journalist, how do you work with your graphic artists to tell a story?
We can’t each do everything. I can make a map, but it will be miles better if a designer does it and the Guardian has some great graphic designers and a brilliant graphic team. But what I can do is get the right information they need and get it in the right format, and really help with telling the story.

5. The Guardian is trying to be a repository for all open government data. What data do you wish was more readily available?
Basic spending data. It should be easy to find, but just getting a total amount that each UK government department spends is always a nightmare involving PDFs. It’s no good having ultra-granular spending figures if we can’t get the totals.

6. You recently told a story about how people are using homophobic language on Twitter (The No Homophobes guide to language on Twitter). What other stories would you like to tell around social data?
That was actually showcasing amazing work on the web – which we do a lot now. I’m fascinated by the way that people use social media and how they use it in conjunction with other media: people tweeting while they’re watching TV, for instance. The way that people use Twitter in a crisis is fascinating – and how we share images too.
