Data Story: Oliver O'Brien on Open Data Maps

I stumbled across the most amazing set of open data maps for bike sharing cities and tracked down their creator in London to interview him for a Data Story. Oliver O’Brien is the creator of the maps, which track available bikes and open spaces at bike sharing stations in more than 100 cities across the world. We interviewed him about his work with open maps and his research into understanding how people move about the city.

Ollie O'Brien Open Maps

1. What was the genesis for creating the maps?
It started from seeing the launch of London’s system in August 2010. It was at a time when I was working with Transport for London data on a project called MapTube. Transport for London had recently created a Developer portal for their datasets. When the London bikeshare launched, their map was not great (and still isn’t) – it was just a mass of white icons – so I took advantage of the data being provided on the Developer portal to create my own version, reusing some web code from an earlier map that showed General Election voting results in a fairer and clearer way. Once London’s was created, it proved to be a hit with people, as it could be used to see areas where bikes (or free spaces) might be in short supply. I was easily able to extend the map to Montreal and Minneapolis (the latter thanks to an enthusiastic local there) and then realised there was a whole world of bikesharing systems out there waiting to be mapped.

The maps act primarily as a “front-end” to the bikesharing data that I collect, for current and potential future research into the geomorphology of cities and their changing demographics and travel patterns, based on how the population uses bikesharing systems. However, I have continued to update the map as it has remained popular, adding cities whenever I discover their bikeshare datasets. After three years, I am now up to exactly 100 “live” cities, where the data is fresh to within a few minutes, plus around 50 where the data is no longer available.

2. Where did you get the information to build the maps?
Mainly from APIs provided by each city authority or bikesharing operating company, or, where this is not available (which is often the case for smaller systems), from their Google Map or other online mapping page that normally has the information in the HTML.

3. What is your background?
I’m an academic researcher and software developer at UCL’s Centre for Advanced Spatial Analysis. The lab specialises in urban modelling, and my current main project, EUNOIA, is aiming to build a travel mobility model, using social media as well as transport datasets, for the major European cities of London, Barcelona and Zurich. Bikesharing systems will form a key part of the overall travel model. Prior to CASA I worked as a financial GUI technologist at one of the big City banks – before then, at university, I studied Physics.

4. What are you looking to build next?
I am looking to continue to add cities to the global map, particularly from large bikesharing systems that are appearing – I am looking forward to the San Francisco Bay Area’s system launching in August – and I’m working on creating London’s EUNOIA model, taking in the transport data and augmenting it with other geospatial information, including data from Twitter. I am also looking at more effective ways to visualise data and statistics that are emerging from the recent (2011) Census that we had in the UK – the results of which are being gradually made available.

5. What open-source maps do you think should be created next?
I am hopeful that an integrated map of all social media and sensor datasets will soon become easily available and widely used – partly to increase people’s awareness of the data that now surrounds them, and partly to inform decision makers and other stakeholders in creating a better, more inclusive city landscape: the so-called “smart city”.

I would add that you may be interested in some of the other maps that we have created at UCL CASA, such as the Twitter Languages maps for London and New York:
http://twitter.mappinglondon.co.uk/ and http://ny.spatial.ly/ …and also http://life.mappinglondon.co.uk/ – these maps were all created mainly by my colleagues, with me just helping with the web work.

Bike sharing map in Boulder, CO

Thanks to Oliver for the interview! If you’re interested in more geo + social, check out our recent posts on Social Data Mashups Following Natural Disasters and Mapping Travel, Languages & Mobile OS Usage with Twitter Data.

Dreamforce Hackathon Winner: Enterprise Mood Monitor

As we wrote in our last post, Gnip co-sponsored the 2011 Dreamforce Hackathon, where teams of developers from all over the world competed for the top three overall cash prizes as well as prizes in multiple categories.  Our very own Rob Johnson (@robjohnson), VP of Product and Strategy, helped judge the entries, selecting the Enterprise Mood Monitor as winner of the Gnip category.

The Enterprise Mood Monitor pulls in data from a variety of social media sources, including the Gnip API, to provide realtime and historical information about the emotional health of employees. It shows both individual and overall company emotional climate over time and can send SMS messages to a manager when the mood level drops below a threshold. In addition, HR departments can use this data to get insights into employee morale and satisfaction over time, eliminating the need to conduct standard employee satisfaction surveys. This mood analysis data can also be correlated with business metrics such as Sales and Support KPIs to identify drivers of business performance.
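To make the alerting idea concrete, here is a minimal sketch in Python of the threshold logic described above. It is purely illustrative: the real Enterprise Mood Monitor was built in Apex on Force.com, and the score range, threshold value, and function names below are our own assumptions, with `send_sms` standing in for the team's Twilio integration.

```python
import statistics

# Illustrative sketch only: the real app was built in Apex on Force.com.
# The score range, threshold, and function names here are hypothetical.
MOOD_THRESHOLD = -0.3  # sentiment scores assumed to fall in [-1.0, 1.0]

def team_mood(sentiment_scores):
    """Average the latest per-employee sentiment scores into one team-level number."""
    return statistics.mean(sentiment_scores)

def check_and_alert(sentiment_scores, send_sms):
    """Send an SMS to the manager if the team mood drops below the threshold.

    `send_sms` stands in for an SMS gateway call (the hackathon team used Twilio).
    """
    mood = team_mood(sentiment_scores)
    if mood < MOOD_THRESHOLD:
        send_sms(f"Team mood has dropped to {mood:.2f} - consider checking in.")
    return mood

# Example: scores pulled from social activity for five employees
print(check_and_alert([-0.6, -0.4, -0.5, 0.1, -0.2], send_sms=print))
```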

Pretty cool stuff.

The three developers (Shamil Arsunukayev, Ivan Melnikov, and Gaziz Tazhenov) from Comity Designs behind this idea set out to create a cloud app for the social enterprise built on one of Salesforce’s platforms. They spent two days brainstorming the possibilities before diving into two days of rigorous coding. The result was the Enterprise Mood Monitor, built on the Force.com platform using Apex, Visualforce, and the following technologies: Facebook API (Graph API), Twitter API, Twitter Sentiment API, LinkedIn API, Gnip API, Twilio, Chatter, and Google Visualization API. The team entered their Enterprise Mood Monitor into the Twilio and Gnip categories. We would like to congratulate the guys on their “double-dip” win as they took third place overall and won the Gnip category prize!

Have a fun and creative way you’ve used data from Gnip? Drop us an email or give us a call at 888.777.7405 and you could be featured in our next blog post.

Guide to the Twitter API – Part 2 of 3: An Overview of Twitter’s Search API

The Twitter Search API can theoretically provide full coverage of ongoing streams of Tweets. That means it can, in theory, deliver 100% of Tweets that match the search terms you specify, almost in realtime. But in reality, the Search API is not intended for, and does not fully support, the repeated, constant searches that would be required to deliver 100% coverage. Twitter has indicated that the Search API is primarily intended to help end users surface interesting and relevant Tweets that are happening now.

Since the Search API is a polling-based API, the rate limits that Twitter has in place impact the ability to get full coverage streams for monitoring and analytics use cases. To get data from the Search API, your system repeatedly asks Twitter’s servers for the most recent results that match one of your search queries. On each request, Twitter returns a limited number of results (for example, the “latest 100 Tweets”). If more than 100 matching Tweets have been created since the last time you sent the request, some of them will be lost.
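To make the polling mechanics concrete, here is a minimal sketch of that request loop. The endpoint URL, parameter names, and page size are illustrative placeholders rather than the Search API’s actual interface; the point is simply that each poll returns at most one page of results, so anything created beyond that page between polls is never seen.

```python
import time
import requests

# Hypothetical polling loop against a search-style API. The URL and parameter
# names are placeholders; the real Search API's details differ.
SEARCH_URL = "https://api.example.com/search.json"
PAGE_SIZE = 100            # "latest 100 Tweets" per response
POLL_INTERVAL_SECONDS = 1  # constrained by the provider's rate limits

def poll_forever(query):
    since_id = None
    while True:
        params = {"q": query, "count": PAGE_SIZE}
        if since_id is not None:
            params["since_id"] = since_id   # only ask for results newer than the last one seen
        results = requests.get(SEARCH_URL, params=params).json().get("results", [])
        if results:
            since_id = results[0]["id"]     # assume newest-first ordering
        # If more than PAGE_SIZE matching Tweets were created since the last
        # poll, the overflow never appears in any response: coverage is lost.
        yield results
        time.sleep(POLL_INTERVAL_SECONDS)
```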

So . . . can you just make requests for results more frequently? Well, yes, you can, but the total number of requests you’re allowed to make per unit time is constrained by Twitter’s rate limits. Some queries are so popular (hello “Justin Bieber”) that it can be impossible to make enough requests to Twitter for that query alone to keep up with the stream. And this is only the beginning of the problem, as no monitoring or analytics vendor is interested in just one term; many have hundreds to thousands of brands or products to monitor.

Let’s consider a couple of examples to clarify. First, say you want all Tweets mentioning “Coca Cola” and only that one term. There might usually be fewer than 100 matching Tweets per second — but if there’s a spike (say the term becomes a trending topic after a Super Bowl commercial), there will likely be more than 100 per second. If, because of Twitter’s rate limits, you’re only allowed to send one request per second, you will have missed some of the Tweets generated at the most critical moment of all.

Now, let’s be realistic: you’re probably not tracking just one term. Most of our customers are interested in tracking somewhere between dozens and hundreds of thousands of terms. If you add 999 more terms to your list, then you’ll only be checking for Tweets matching “Coca Cola” once every 1,000 seconds. And in 1,000 seconds, there could easily be more than 100 Tweets mentioning your keyword, even on an average day. (Keep in mind that there are over a billion Tweets per week nowadays.) So, in this scenario, you could easily miss Tweets if you’re using the Twitter Search API. It’s also worth bearing in mind that the Tweets you do receive won’t arrive in realtime because you’re only querying for the Tweets every 1,000 seconds.
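If you want to see the arithmetic spelled out, here is the back-of-the-envelope version of that scenario (the per-term volume is a made-up number, chosen only to show how the gap appears):

```python
terms = 1000                  # number of keywords you track
requests_per_second = 1       # illustrative rate limit from the example above
page_size = 100               # results returned per request

seconds_between_polls_per_term = terms / requests_per_second   # 1,000 s between polls per term

tweets_per_poll_window = 250  # hypothetical volume for one keyword over those 1,000 s
missed_per_window = max(0, tweets_per_poll_window - page_size)

print(seconds_between_polls_per_term)  # 1000.0 seconds between polls for any one term
print(missed_per_window)               # 150 Tweets lost every window for that term
```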

Because of these issues related to the monitoring use cases, data collection strategies relying exclusively on the Search API will frequently deliver poor coverage of Twitter data. Also, be forewarned, if you are working with a monitoring or analytics vendor who claims full Twitter coverage but is using the Search API exclusively, you’re being misled.

Although its coverage is not complete, one great thing about the Twitter Search API is the rich operator support it offers, such as Boolean queries and geo filtering. For that reason, some people opt to use the Search API to collect a sampling of Tweets that match their search terms, despite the coverage limits. Because these filtering features have been so well liked, Gnip has replicated many of them in our own premium Twitter API (made even more powerful by the full coverage and unique data enrichments we offer).
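As a rough illustration of what those operators look like in practice, the snippet below builds a query that combines quoted phrases, OR, negation, and a geo constraint. Operator syntax and parameter names vary by API and version, so treat this as a sketch rather than a reference:

```python
from urllib.parse import urlencode

# Boolean operators live inside the query string itself; the geo constraint is
# typically a separate parameter ("lat,long,radius" here is illustrative).
query = '("coca cola" OR coke) -pepsi'
params = {
    "q": query,
    "geocode": "39.7392,-104.9903,25mi",  # roughly "within 25 miles of Denver"
}
print("?" + urlencode(params))
```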

So, to recap, the Twitter Search API offers great operator support but you should know that you’ll generally only see a portion of the total Tweets that match your keywords and your data might arrive with some delay. To simplify access to the Twitter Search API, consider trying out Gnip’s Enterprise Data Collector; our “Keyword Notices” feed retrieves, normalizes, and deduplicates data delivered through the Search API. We can also stream it to you so you don’t have to poll for your results. (“Gnip” reverses the “ping,” get it?)

But the only way to ensure you receive full coverage of Tweets that match your filtering criteria is to work with a premium data provider (like us! blush…) for full coverage Twitter firehose filtering. (See our Power Track feed if you’d like more info on that.)

Stay tuned for Part 3, our overview of Twitter’s Streaming API coming next week…

Guide to the Twitter API – Part 1 of 3: An Introduction to Twitter’s APIs

You may find yourself wondering . . . “What’s the best way to access the Twitter data I need?” Well, the answer depends on the type and amount of data you are trying to access. Given that there are multiple options, we have designed a three-part series of blog posts that explains the differences between the coverage the general public can access and the coverage available through Twitter’s resyndication agreement with Gnip. Let’s dive in . . .

Understanding Twitter’s Public APIs . . . You Mean There is More than One?

In fact, there are three Twitter APIs: the REST API, the Streaming API, and the Search API. Within the world of social media monitoring and social media analytics, we need to focus primarily on the latter two.

  1. Search API – The Twitter Search API is a dedicated API for running searches against the index of recent Tweets.
  2. Streaming API – The Twitter Streaming API allows high-throughput, near-realtime access to various subsets of Twitter data (e.g. a 1% random sampling of Tweets, filtering for up to 400 keywords, etc.) – see the sketch below.
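To give a feel for the second model, here is a minimal, hypothetical sketch of consuming a streaming filter: instead of repeatedly asking for results, you hold a single long-lived connection open and matching activities are pushed to you as they happen. The URL, parameters, and authentication below are placeholders, not a drop-in Twitter client.

```python
import json
import requests

# Hypothetical streaming-filter endpoint; the real Streaming API's URL,
# parameters, and auth scheme are different and change over time.
STREAM_URL = "https://stream.example.com/statuses/filter.json"

def stream_keywords(keywords, auth):
    """Hold one long-lived connection open and yield matching activities as they arrive."""
    with requests.post(STREAM_URL,
                       data={"track": ",".join(keywords)},
                       auth=auth,
                       stream=True) as response:
        for line in response.iter_lines():
            if line:                      # the stream sends keep-alive blank lines
                yield json.loads(line)

# Usage (credentials omitted):
# for activity in stream_keywords(["coca cola", "coke"], auth=("user", "pass")):
#     handle(activity)
```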

Whether you get your Twitter data from the Search API, the Streaming API, or through Gnip, only public statuses are available (and NOT protected Tweets). Additionally, before Tweets are made available to both of these APIs and Gnip, Twitter applies a quality filter to weed out spam.

So now that you have a general understanding of Twitter’s APIs . . . stay tuned for Part 2, where we will take a deeper dive into understanding Twitter’s Search API, coming next week…

 

Letter From The New Guy

Not too long ago Gnip celebrated its third birthday. I am celebrating my one-week anniversary with the company today. To say a lot happened before my time at Gnip would be the ultimate understatement, and yet it is easy for me to see the results produced from those three years of effort. Some of those results include:

The Product

Gnip’s social media API offering is the clear leader in the industry. Gnip is delivering over half a billion social media activities daily from dozens of sources. That certainly sounds impressive, but how can I be so confident Gnip is the leader? Because the most important social media monitoring companies rely on our services to deliver results to their customers every single day. For example, Gnip currently works with 8 of the top 9 enterprise social media monitoring companies, and the rate at which we are adding enterprise-focused companies is accelerating.

The Partners

Another obvious result is the strong partnerships that have been cultivated.  Some of our partnerships such as Twitter and Klout were well publicized when the agreements were put in place.  However, having strong strategic partners takes a lot more than just a signed agreement.  It takes a lot of dedication, investment, and hard work by both parties in order to deliver on the full promise of the agreement.  It is obvious to me that Gnip has amazing partnerships that run deep and are built upon a foundation of mutual trust and respect.

The People

The talent level at Gnip is mind blowing, but it isn’t the skills of the people that have stood out the most for me so far.  It is the dedication of each individual to doing the right thing for our customers and our partners that has made the biggest impression.  When it comes to gathering and delivering social media data, there are a lot of shortcuts that can be taken in order to save time, money, and effort.  Unfortunately, these shortcuts can often come at the expense of publishers, customers, or both.  The team at Gnip has no interest in shortcuts and that comes across in every individual discussion and in every meeting.  If I were going to describe this value in one word, the word would be “integrity”.

In my new role as President & COO, I’m responsible for helping the company grow quickly and smoothly while maintaining the great values that have been established from the company’s inception.  The growth has already started and I couldn’t be more pleased with the talent of the people who have recently joined the organization including: Bill Adkins, Seth McGuire, Charles Ince, and Brad Bokal who have all joined Gnip within the last week.  And, we are hiring more! In fact, it is worth highlighting one particular open position for a Customer Support Engineer.  I’m hard pressed to think of a higher impact role at our company because we consider supporting our customers to be such an important priority.  If you have 2+ years of coding experience including working with RESTful Web APIs and you love delivering over-the-top customer service, Gnip offers a rare opportunity to work in an environment where your skills will be truly appreciated.  Apply today!

I look forward to helping Gnip grow on top of a strong foundation of product, partners, and people.  If you have any questions, I can be reached at chris [at] gnip.com.

What Does Compound Interest Have to do with Evolving APIs?

Albert Einstein was once asked what he considered the most important discoveries. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons – instead he said: “Compound interest is the greatest mathematical discovery of all time.”

I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let’s walk through an example: say you owe your credit card company $1,000, and your interest rate is 16%. To keep it simple, we assume the credit card company only requires a minimum payment of 1% every year, so the effective interest rate is 15%. After 30 years of compound interest you owe about $66,000!

Compound Interest Graph

If there were no compounding, you’d owe just a little over 5 grand!
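If you want to check those numbers yourself, they fall straight out of the standard formulas: compounding multiplies the balance by (1 + r) each year, while simple interest only ever accrues on the original principal. A quick sanity check, assuming annual compounding at the 15% effective rate:

```python
principal = 1_000
rate = 0.15          # effective annual rate after the 1% minimum payment
years = 30

compounded = principal * (1 + rate) ** years       # balance grows multiplicatively
simple = principal + principal * rate * years      # interest accrues on the principal only

print(round(compounded))  # ~66,212
print(round(simple))      # 5,500
```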

What I find truly bizarre, though, is that when we software engineers throw around words like “technological debt”, the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing – I’ll get back to that later – we are asked to come up with the quick hack that will make it work tomorrow and deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.

And we are even staying within financial terminology by using the word “debt”! We could have said something like: “Well, it would take us roughly one week longer to integrate our current J2EE backend with this 3rd-party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP.” But no, we keep it simple and refer to the custom XML parser as “technological debt” – to no avail.

Now, the next time you have this conversation with your boss, show him the plot above and label the y-axis with “lines of code we have to maintain”, and the x-axis with “development iterations”, and perhaps a bell will go off.

Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:

  1. Refactor early and often
  2. Outsource as much as possible of what you don’t consider your core competency.

For instance, if you have to consume millions of tweets every day, but your core competency does not include:

  • developing high performance code that is distributed in the cloud
  • writing parsers processing real time social activity data
  • maintaining OAuth client code and access tokens
  • keeping up with squishy rate limits and evolving social activity APIs

then it might be time for you to talk to us at Gnip!

Our Poem for Mountain.rb

Hello and Greetings, Our Ruby Dev Friends,
Mountain.rb we were pleased to attend.

Perhaps we did meet you! Perhaps we did not.
We hope, either way, you’ll give our tools a shot.

What do we do? Manage API feeds.
We fight the rate limits, dedupe all those tweets.

Need to know where those bit.ly’s point to?
Want to choose polling or streaming, do you?

We do those things, and on top of all that,
We put all your results in just one format.

You write only one parser for all of our feeds.
(We’ve got over 100 to meet your needs.)

The Facebook, The Twitter, The YouTube and More
If mass data collection makes your head sore…

Do not curse publishers, don’t make a fuss.
Just go to the Internet and visit us.

We’re not the best poets. Data’s more our thing.
So when you face APIs… give us a ring.

From API Consumers to API Designers: A Wish List

At Gnip, we spend a large part of our days integrating with third party APIs in the Social Media space. As part of this effort, we’ve come up with some API design best practices.

Use Standard HTTP Response Codes

HTTP has been around since the early 90’s, and standard HTTP response codes have been around for nearly as long. For example, 200-level codes mean success, 400-level codes mean a client-side error, and 500-level codes indicate a server error. If there was an error during an API call to your service, please don’t send us back a 200 response and expect us to parse the response body for error details. And if you want to rate limit us, please don’t send us back a 500 – that makes us freak out.
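From the API consumer’s side, standard codes are what make generic, predictable error handling possible. Here is a rough sketch of the branching we would like to be able to write; the rate-limit code and retry policy are assumptions, so check what the service actually documents:

```python
import time
import requests

def call_api(url, max_retries=3):
    """Branch on standard HTTP status classes instead of parsing error prose."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if 200 <= response.status_code < 300:
            return response.json()                 # success: the body is real data
        if response.status_code == 429:            # rate limited (or whatever code is documented)
            time.sleep(2 ** attempt)               # back off and retry
            continue
        if 400 <= response.status_code < 500:
            raise ValueError(f"Client error {response.status_code}: fix the request")
        if 500 <= response.status_code < 600:
            time.sleep(2 ** attempt)               # server error: a retry is reasonable
            continue
    raise RuntimeError("Gave up after repeated failures")
```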

Publish Your Rate Limits
We get it. You want the right to scale back your rate limits without a horde of angry developers wielding virtual pitchforks showing up on your mailing list. Still, it would make everyone’s lives easier if you published your rate limits rather than having developers play a constant guessing game. Bonus points if you describe how your rate limits work: do you limit per set of credentials, per API key, or per IP address?
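One low-effort way to publish this, beyond a docs page, is to echo the limits back on every response. The X-RateLimit-* headers below follow a common convention, but the names are not standardized and are only illustrative:

```python
# Illustrative server-side sketch: attach rate-limit headers so clients can see
# the policy instead of guessing. Adjust the names to whatever you document.
def add_rate_limit_headers(response_headers, limit, remaining, reset_epoch_seconds):
    response_headers["X-RateLimit-Limit"] = str(limit)                 # requests allowed per window
    response_headers["X-RateLimit-Remaining"] = str(remaining)         # requests left in this window
    response_headers["X-RateLimit-Reset"] = str(reset_epoch_seconds)   # when the window resets
    return response_headers

print(add_rate_limit_headers({}, limit=150, remaining=148, reset_epoch_seconds=1_300_000_000))
```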

Use Friendly Ids, Not System Ids
We understand that it’s a common pattern to have an ugly system id (e.g. 17134916) backing a human readable id (e.g. ericwryan). As users of your API, we really don’t want to remember system ids, so why not go the extra mile and let us hit your API with friendly ids?

Allow Us to Limit Response Data
Let’s say your rate limit is pretty generous. What if Joe User is hammering your API once a second and retrieving 100 items with every request, even though, on average, he will only see one new item per day? Joe has just wasted a lot of your precious CPU, memory, and bandwidth. Protect your users: allow them to ask for everything since the last id or timestamp they received.
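Supporting this on the server side is a small amount of work. Here is a minimal, hypothetical sketch of honoring a since_id parameter (assuming newest-first items with increasing integer ids):

```python
def items_since(items, since_id=None, limit=100):
    """Return only items newer than `since_id`, newest first, capped at `limit`.

    `items` is assumed to be sorted newest-first with monotonically increasing
    integer ids - an illustrative simplification.
    """
    if since_id is not None:
        items = [item for item in items if item["id"] > since_id]
    return items[:limit]

# Joe User's once-a-second poll now returns one new item instead of the same 100.
feed = [{"id": 103}, {"id": 102}, {"id": 101}]
print(items_since(feed, since_id=102))  # [{'id': 103}]
```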

Keep Your Docs Up to Date
Who has time to update their docs when customers are banging on your door for bug fixes and new features? Well, you would probably have fewer customers banging on your door if they had a better understanding of how to use your product. Keep your docs up to date with your code.

Publish Your Search Parameter Constraints
Search endpoints are very common these days. Do you have one? How do we go about searching your data? Do you split search terms on whitespace? Do you split on punctuation? How does quoting affect your query terms? Do you allow boolean operators?

Use Your Mailing List
Do you have a community mailing list? Great! Then use it. Is there an unavoidable, breaking change coming in a future release? Let your users know as soon as possible. Do you keep a changelog of features and bug fixes? Why not publish this information for your users to see?

We consider this to be a fairly complete list of guidelines for designing an API that is easy to work with. Feel free to yell at us (info at gnip) if you see us lacking in any of these departments.