All These Things That I’ve Done

A little over two years ago, Jud and I hatched an audacious plan — pair a deep data guy with a consumer guy to launch an enterprise company. We would build an incredible data service with the polish of a consumer app, then attack a market generally known for being rather dull with a combination of substance and style.

Over the last two years, Jud has done an amazing job serving as Gnip’s CTO and implicitly as VP of Engineering. Under his leadership, the engineering team has delivered a product that turns the process of integrating with dozens of diverse APIs into a push-button experience. The team he assembled is fantastically talented and passionate about making real-time data more easily consumed. My own team has performed equally well, adding much-needed process to Gnip’s sales and marketing.

Two years ago, if you asked Corporate America to define “social media,” they probably would have said “the blogs.” Last year, they would have probably answered “the blogs and Twitter” and this year they’re adding Facebook to their collective consciousness. The time is better than ever to bring Gnip’s platform to the enterprise and, ultimately, I’m not the CEO to do it. Our plan to have a consumer guy lead an enterprise company ended up having a few holes. For Gnip to thrive in the enterprise, it needs to be squarely in the hands of people who have previously succeeded in that space. So as of today, I’m stepping down as CEO and leaving the company. Jud is taking over as CEO.

I am honored to have worked with Jud and it has been a privilege to work with my team for the last two years. Anything that Gnip has accomplished so far has been because of them. Any criticisms that the company could have accomplished more in the last two years can be directed squarely at me. I look forward to seeing Jud and the team do great things in the years ahead.

Gnip Integrates Facebook Search API

At F8 this week Facebook publicly announced the new Open Graph API.  As part of their announcement, Facebook also released a Search API against all public stream data created on Facebook.  Gnip is now excited to announce support for this Search API.  If your company has been looking to track conversations about your brand, product, or competitors on Facebook, you can now do so with the Gnip platform.

We’re happy to make this data available to existing Gnip customers and other developers eager to get access to public data.  Public data on Facebook currently includes posts and comments on fan pages and public user wall posts and status updates.
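
For developers who want a feel for the underlying data, here is a minimal sketch of what querying Facebook's public post search directly might look like. It assumes the Graph API's search endpoint and the Python "requests" library; the endpoint, parameters and response fields shown are our assumptions and are illustrative only, not a description of the Gnip integration.

    # Illustrative sketch only -- endpoint, parameters and fields are assumptions
    # based on our reading of the early Graph API, not the Gnip integration itself.
    import requests

    SEARCH_URL = "https://graph.facebook.com/search"  # assumed public search endpoint

    def search_public_posts(keyword, limit=25):
        """Fetch recent public posts and status updates mentioning `keyword`."""
        resp = requests.get(SEARCH_URL,
                            params={"q": keyword, "type": "post", "limit": limit})
        resp.raise_for_status()
        return resp.json().get("data", [])

    for post in search_public_posts("your brand"):
        # Each item is a public post, wall post or status update.
        print(post.get("from", {}).get("name"), "-", post.get("message", ""))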

We’re super proud of our engineering team and all the work they’ve put into the Gnip platform to make it possible for us to quickly support the Facebook Search API and all other new APIs on the web.  If you’d like to get the public Facebook stream for a set of keywords, please get in touch with us at info@gnip.com.

Thanks to the Facebook team for making this public data available.  We look forward to continually making the case for more public data to help enterprises listen, understand, and respond to their customers.

Welcome aboard, Rob Johnson

One of the highlights of Boulder’s startup scene is the community aspect, what Micah Baldwin describes as competitive cooperation.  In many areas, startups see success as a zero-sum game and the ecosystem feels pretty hostile.  In other places I’ve lived (I’m looking at you, Bay Area), there’s a general feeling of goodwill amongst startups, but everyone is so freaking busy that true community fails to gel and what you get is a thin veneer of positivity.  In Boulder, we genuinely care about our startup brethren.

Case in point, EventVue.

I met Rob when he was a member of the first TechStars class in Boulder.  Ostensibly I was a mentor in the program, but I learned a ton from him from the very beginning.  Over the next several years, Rob showed his skills in a number of ways — EventVue holds the record for fastest investment close after “graduating” from TechStars; the company counted as customers a number of top-tier companies including Cisco and IDG; he and Josh Frasier relentlessly tackled technical and business obstacles in pursuit of success. At one point EventVue was a Gnip customer and at a later point it was not (but Rob and Josh always offered great feedback).

Ultimately, EventVue wasn’t the success Rob was hoping for.

Given the knowledge he gained, it’s not surprising that companies all over the country were competing to bring him on board.  Rob is a tenacious entrepreneur with a wide variety of skills in product and sales. He’s one of the most personable cats I’ve ever known — he can strike up a conversation with anyone on the street and instantly develop a level of rapport that I envy.  More importantly, he listens like no one I’ve ever known.  For early stage companies, listening to your customers is one of the best ways to orient your product correctly.

If Rob had taken a job offer outside of Boulder, our entire tech scene would have been worse for it.  Thus, I’m doubly excited to welcome Rob as Gnip’s new Director of Business and Strategy.  In the last couple of months he has worked as a consultant for Gnip and he’s already left an indelible mark on the company.  Rob is doing great things for us and I look forward to that continuing for a long, long time.  Just as importantly, I’m confident that someday soon (but not too soon), Rob will start another company in Boulder and perhaps I’ll find our roles reversed. In the meantime, I’m happy that he’s a part of Gnip and that he remains a part of the Boulder entrepreneurial scene.

Marketing is from Mars, Business Intelligence is from… Betelgeuse?

Beetlejuice! John Battelle wrote a great post last week titled “What Marketers Want from Twitter Metrics” in which he recounts a conversation with Twitter COO Dick Costolo and lists some data he hopes we’ll soon see from Twitter.  These metrics include:

  • How many people *really* see a tweet.  Even though @gnipsupport has 150 followers, it’s unlikely that they all saw our tweet about this post.
  • Better information around engagement, such as retweets and co-incidence data.  There’s a classic VC saying: “the first time I hear about something I don’t notice; the second time, I take an interest and the third time I take action.”

For me, marketing is about sending a signal into the marketplace and then measuring how effectively it is received.  For instance, Gnip is trying to better engage with companies that use third-party APIs, and since we’re a startup, low cost matters.  One mechanism is this blog and the article you’re reading now.  That’s the “sending a signal” part.  While you’re reading this, I’m likely logged into Google Analytics, monitoring how people find this article, and watching Twitter to see if anyone mentions this post.  That’s the “measuring effectiveness” part.  And this isn’t a static, one-time cycle.  Based upon the feedback I get (some direct, some inferred), I’ll write and promote future posts a little differently.

I am positive that Twitter and other forms of social media will be hugely beneficial to marketing and the surrounding fields of sales, advertising and customer service. Highly measurable, disintermediated, low-friction customer interactions with the marketplace are a wonderful thing.  However, if five years from now we’re still primarily talking about social media in terms of marketing, then an opportunity has been squandered.

If marketing is a company sending a signal to the marketplace and measuring how it is received, then business intelligence (from a product perspective) is the process of measuring and acting on the signal that the marketplace itself is sending.  For instance, last holiday season, a major discount chain wanted to know why, in the midst of a recession, many of their traditional customers were opting to shop at more expensive competitors.  After examining Twitter, Facebook and other social services, they discovered that customers were unhappy with their stores’ lack of parking and cashiers.  Apparently, even in a financial crunch, convenience trumps price.  The store added more cashiers and sales immediately increased.  THIS is where I’d like to see more emphasis in social media.

It’s a function of magnitude

With marketing, the product or service has already been created and success is now predicated on successfully engaging as many people as possible with your pitch.  The primary question is “How do we take this product and make it sound as appealing as possible to the market?”  Great marketing can create far greater demand than shoddy marketing, but by that point, the product is fairly static.  Sales is plotted on a continuum defined as “customer need multiplied by customer awareness,” where need is static and awareness is the variable.  What if you could change the scale of customer need?

When the product or service is still being defined, the size of the opportunity is extremely fluid.  A product that doesn’t address a customer need isn’t going to sell a ton, regardless of how well it’s marketed.  A product that addresses a massive customer need can still fail with poor marketing, but it will be a game changer with the right guidance.  Business intelligence is crucial to the process of identifying the biggest need in a market and building the appropriate solution.

Steve Ballmer is very vocal about how he only cares about ideas that will move his stock price a dollar.  But to move his stock price by even $0.10 at today’s P/E is to increase earnings (earnings!) by almost $100MM annually.  In other words, if you’re a startup whose product can’t generate a billion dollars, then it’s not worth Microsoft’s time to talk to you.  And if you’re an MS product manager who isn’t working on a billion-dollar product, you might want to put in a transfer request.  Or better yet, listen to the market and retool what you’re currently building, because no amount of marketing is going to save you.

Yeah, “billion” with a “b”

Typically, entrepreneurs use personal experience and anecdotal evidence to design their offering.  Larger companies may conduct market research panels or send out surveys to better understand a market.  We are now blessed with the ability to directly interact with the marketplace at a scale never previously imagined.  The market is broadcasting desire and intent through a billion antennae every day, yet product managers are still turning a deaf ear.  Maybe we need better tools and data so that the business world can start tuning in.

First off, when you’re launching a product, you ought to know what the market looks like.  We need better access to user demographics, both at the service level (who uses Twitter) and at the individual level (who just tweeted X).  A number of companies are starting to serve this need (folks like Klout, who offers reputation data for Twitter users, and Rapleaf, who offers social and demographic data based on email address) but there is still a long way to go.  I would kill for the ability to derive aggregated demographics — tell me about all the people who tweeted Y in the last year.

Secondly, access to historical data is critical.  When deciding whether to even begin planning a new product, it’s important to know whether the marketplace’s need is acute or a long-standing problem.  Right now, it’s nearly impossible to access data about something from before the moment you realize you should be tracking it.   This has led to all sorts of “data hoarding” as social media monitoring services attempt to squirrel away as much data as possible just in case they should need it in the future.  The world would be so much better with mature search interfaces.  Think about your average OLAP interface and then think about Facebook Search.  Twitter has already said that they are taking steps to increase the size of their search corpus; let’s make sure they know this is important and let’s encourage other social services to make historical data available as well.

The best part of all this is that marketers and product managers need many of the same things — they’re in the same universe, you might say.  The best companies engage marketing as the product is being defined, and as a result, a lot of these metrics will benefit product managers and marketers alike.

Dell selling $6 million of computers on Twitter?  That’s pretty great.  Dell identifying a new $600M market because of signals sent on Twitter… that’s simply amazing.  And that’s the level of impact I hope to see social media have in the next few years.  Got your own ideas on how we can get there from here?  Post ‘em in the comments.

(Thanks to Brad Feld, Eric Norlin and Om Malik for helping me edit this post into something more readable and accurate.)

Google Buzz — Yeah, we got that!

Google made a big splash two weeks ago when they introduced their new social product. Buzz had 9 million posts and comments in its first two days, and given that it’s integrated into Gmail, which is used by more than 120 million people, we expect it to grow into a major source of data on the web.

We are proud to announce today that Gnip has begun offering Buzz data to our customers. If you would like to painlessly add this new online signal source, then just reach out to info@gnip.com and we’ll get you connected.

Migrating to the Twitter Streaming API: A Primer

Some context:

Long, long ago, in a galaxy far, far away, Twitter provided a firehose of data to a few partners and the world was happy.  These startups were awash in real-time data and they got spoiled, some might say, by the embarrassment of riches that came through the real-time feed.  Over time, numerous factors caused Twitter to cease offering the firehose.  There was much wailing and gnashing of teeth on that day, I can tell you!

At roughly the same time, Twitter bought real-time search company Summize and began offering everyone access to what is now known as the Search API.  Unlike Twitter’s existing REST API, which was based around usernames, the Search API enabled companies to query for recent data about a specific keyword.  Because of the nature of polling, companies had to contend with latency (the time between when someone performs an action and when an API consumer learns about it) and Twitter had to deal with a constantly growing number of developers connected to an inherently inefficient interface.

Last year, Twitter announced that they were developing the spiritual successor to the firehose — a real-time stream that could be filtered on a per-customer basis and provide the real-time, zero-latency results people wanted.  By August of last year, alpha customers had access to various components of the firehose (spritzer, gardenhose, track, birddog, etc.) and provided feedback that helped shape and solidify Twitter’s Streaming API.

A month ago, Twitter engineer John Kalucki (@jkalucki) posted on the Twitter API Announcements group that “High-Volume and Repeated Queries Should Migrate to Streaming API”.  In the post, he detailed several reasons why the move is beneficial to developers.  Two weeks later, another Twitter developer announced a new error code, 420, to let developers identify when they are getting rate limited by the Search API.  Thus, both the carrot and the stick have been laid out.

The Streaming API is going to be a boon for companies who collect keyword-relevant content from the Twitter stream, but it does require some work on the part of developers.  In this post, we’ll help explain who will benefit from using Twitter’s new Streaming API and some ways to make the migration easier.

Question 1:  Do I need to make the switch?

Let me answer your question with another question — Do you have a predictable set of keywords that you habitually query?  If you don’t, keep using the Search API.  If you do, get thee to the Streaming API.

Examples:

  • Use the Streaming API any time you are tracking a keyword over time or sending notifications / summaries to a subscriber.
  • Use the Streaming API if you need to get *all* the tweets about a specific keyword.
  • Use the Search API for visualization and search tools where a user enters a non-predictable search query for a one-time view of results.
  • What if you offer a configurable blog-based search widget? You may have gotten away with beating up the Search API so far, but I’d suggest setting up a centralized data store and using it as your first look-up location when loading content — it’s bad karma to force a data provider to act as your edge cache.  (A minimal sketch of that lookup pattern follows this list.)
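
As a rough illustration of that “first look-up location” idea, here is a minimal cache-first sketch in Python. The in-memory dict and the freshness window are stand-ins for whatever data store and policy you actually use; only the Search API endpoint and its "q"/"rpp" parameters come from Twitter.

    # Minimal cache-first lookup sketch; the dict and TTL are stand-ins for your real store.
    import time
    import requests

    SEARCH_URL = "http://search.twitter.com/search.json"  # Twitter Search API
    CACHE_TTL = 300   # seconds to serve cached results before re-querying (assumed policy)
    _cache = {}       # {query: (fetched_at, results)}

    def widget_results(query):
        """Serve widget results from the local store when fresh; otherwise hit the Search API."""
        cached = _cache.get(query)
        if cached and time.time() - cached[0] < CACHE_TTL:
            return cached[1]
        resp = requests.get(SEARCH_URL, params={"q": query, "rpp": 50})
        resp.raise_for_status()
        results = resp.json().get("results", [])
        _cache[query] = (time.time(), results)
        return results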

Question 2: Why should I make the switch?

  • First and foremost, you’ll get relevant tweets significantly faster.  Polling an API or RSS feed one keyword at a time creates latency that grows linearly with the size of your keyword set.  Assuming one query per second, the average latency for 1,000 keywords is a little over eight minutes; the average latency for 100,000 keywords is almost 14 hours!  With the Streaming API, you get near-real-time (usually within one second) results, regardless of the number of keywords you track.
  • With traditional API polling, each query returns N results regardless of whether any results are new since your last request.  This puts the onus of deduping squarely on your shoulders.  This sounds like it should be simple — cache the last N resultIDs in memory and ignore anything that’s been seen before.  At scale, high-frequency keywords will consume the cache and low-frequency keywords will quickly age out.  This means you’ll invariably have to hit the disk and begin thrashing your database.  Thankfully, Twitter has already obviated much of this in the Search API with an optional “since_id” query parameter, but plenty of folks either ignore the option or have never read the docs and end up with serious deduplication work (see the polling sketch after this list).  With Twitter’s Streaming API, you get a stream of tweets with very little duplication.
  • You will no longer be able to get full fidelity (aka all the tweets for a given keyword) from the Search API.  Twitter is placing increased weight on relevance, which means that, among other things, the Search API’s results will no longer be chronologically ordered.  This is great news from a user-facing functionality perspective, but it also means that if you query the Search API for a given keyword every N seconds, you’re no longer guaranteed to receive the new tweets each time.
  • We all complain about the limited backwards view of Twitter’s search corpus.  On any given day, you’ll have access to somewhere between seven and 14 days’ worth of historical data (somewhere between one quarter and one half billion tweets), which is of limited value when trying to discover historical trends.  Additionally, for high-volume keywords (think Obama or iPhone or Toyota), you may only have access to an hour of historical data, due to the limited number of results accessible through Twitter’s paging system.  While there is no direct correlation between the number of queries against a database and the amount of data that can be indexed, there IS a direct correlation between devoting resources to handle ever-growing query demands and not having resources to work on growing the index.  As persistent queries move to the Streaming API, Twitter will be able to devote more resources to growing the index of data available via the Search API (see Question 4, below).
  • Lastly, you don’t really have a choice.  While Twitter has not yet begun to heavily enforce rate limiting (Gnip’s customers currently see few errors at 3,600 queries per hour), you should expect the Search API’s performance profile to eventually align with the REST API (currently 150 queries per hour, reportedly moving to 1,500 in the near future).
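
To make the latency and deduplication points concrete, here is a minimal polling sketch in Python (using the "requests" library) that cycles through a keyword list once per second and relies on "since_id" to skip tweets it has already seen. The arithmetic in the comments mirrors the math above; treat the code as an illustration, not a production poller.

    # Polling sketch: one query per second across the keyword list, deduped via since_id.
    # A full cycle over 1,000 keywords takes ~1,000s, so a new tweet waits ~500s on
    # average (a little over eight minutes) before you see it; at 100,000 keywords the
    # average wait is ~50,000s, almost 14 hours.
    import time
    import requests

    SEARCH_URL = "http://search.twitter.com/search.json"
    keywords = ["gnip", "streaming api"]        # your predictable keyword set
    since_ids = {kw: 0 for kw in keywords}      # high-water mark per keyword

    def poll_once(keyword):
        params = {"q": keyword, "rpp": 100}
        if since_ids[keyword]:
            params["since_id"] = since_ids[keyword]   # only return tweets we haven't seen
        resp = requests.get(SEARCH_URL, params=params)
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if results:
            since_ids[keyword] = max(r["id"] for r in results)
        return results

    while True:
        for kw in keywords:
            for tweet in poll_once(kw):
                print(kw, tweet["text"])
            time.sleep(1)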

Question 3: Will I have to change my API integration?

Twitter’s Streaming API uses streaming HTTP

  • With traditional HTTP requests, you initiate a connection to a web server, the server sends results and the connection is closed.  With streaming HTTP, the connection is maintained and new data gets sent over a single long-held response.  It’s not unusual to see a Streaming API connection last for two or three days before it gets reset.
  • That said, you’ll need to reset the connection every time you change keywords.  With the Streaming API, you upload the entire set of keywords when establishing a connection.  If you have a large number of keywords, it can take several minutes to upload all of them, and during that time you won’t get any streaming results.  The way to work around this is to initiate a second Streaming API connection, then terminate the original connection once the new one starts receiving data.  In order to adhere to Twitter’s request that you not initiate a connection more than once every couple of minutes, highly volatile rule sets will need to batch changes into two-minute chunks.
  • You’ll need to decouple data collection from data processing.  If you fall behind in reading data from the stream, there is no way to go back and get it (barring making a request from the Search API).  The best way to ensure that you are always able to keep up with the flow of streaming data is to place incoming data into a separate process for transformation, indexing and other work.  As a bonus, decoupling enables you to more accurately measure the size of your backlog.  (A minimal sketch of this pattern follows this list.)
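
Here is a minimal sketch of both points: one thread holds the long-lived connection open and pushes raw lines onto a queue, while a separate worker does the parsing and indexing, so slow processing never causes you to fall behind the stream. The endpoint, basic-auth credentials and keywords are placeholders, and the Python "requests" library is assumed.

    # Streaming consumer sketch: collection and processing are decoupled by a queue.
    import json
    import queue
    import threading
    import requests

    STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"
    USERNAME, PASSWORD = "your_account", "your_password"   # placeholders
    keywords = ["gnip", "streaming"]

    backlog = queue.Queue()   # checking qsize() tells you how far behind you are

    def collect():
        """Hold one connection open and hand every line straight to the queue."""
        resp = requests.post(STREAM_URL,
                             data={"track": ",".join(keywords)},
                             auth=(USERNAME, PASSWORD),
                             stream=True)
        for line in resp.iter_lines():
            if line:                       # skip keep-alive newlines
                backlog.put(line)

    def process():
        """Transform / index tweets without ever blocking the collector."""
        while True:
            tweet = json.loads(backlog.get())
            # ... route, index, store ...
            print("backlog:", backlog.qsize(), "-", tweet.get("text", ""))

    threading.Thread(target=collect, daemon=True).start()
    process()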

Streaming API consumers need to perform more filtering on their end

  • Twitter’s Streaming API only accepts single-term rules; no more complex queries.  Say goodbye to ANDs, ORs and NOTs.  This means that if you previously hit the Search API looking for “Avatar Movie -Game”, you’ve got some serious filtering to do on your end.  From now on, you’ll send the Streaming API one or more of the required keywords (Avatar and/or Movie) and then, on your side, discard any result that doesn’t contain both keywords or that does contain the word “Game” (see the filtering sketch after this list).
  • You may have previously relied on the query terms you sent to Twitter’s Search API to help you route the results internally, but now the onus is 100% on you.  Think of it this way: Twitter is sending you a personalized firehose based upon your one-word rules.  Twitter’s schema doesn’t include a <keyword> element, so you don’t know which of your keywords are contained in a given Tweet.  You’ll have to inspect the content of the tweet in order to route appropriately.
  • And remember, duplicates are the exception, not the rule, with the Streaming API, so if a given tweet matches multiple keywords, you’ll still only receive it once.  It’s important that you don’t terminate your filtering algo on your first keyword or filter match; test against every keyword, every time.
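
Here is a minimal filtering and routing sketch along those lines. The rule table, its names and the routing print are ours, purely for illustration; the point is that the boolean logic (“Avatar Movie -Game”) and the keyword-to-handler mapping now live entirely in your code, and every rule is tested for every tweet.

    # Client-side filtering sketch: the stream only matched single terms, so the
    # ANDs, ORs and NOTs are re-applied here, and routing keys off our own rules.
    RULES = {
        # rule name -> (required terms, excluded terms)   (hypothetical examples)
        "avatar-movie": (["avatar", "movie"], ["game"]),
        "gnip-mentions": (["gnip"], []),
    }

    def matching_rules(tweet_text):
        """Return every rule a tweet satisfies -- never stop at the first match."""
        text = tweet_text.lower()
        matched = []
        for name, (required, excluded) in RULES.items():
            if all(t in text for t in required) and not any(t in text for t in excluded):
                matched.append(name)
        return matched

    def route(tweet):
        for rule in matching_rules(tweet.get("text", "")):
            print("routing tweet", tweet.get("id"), "to handler for", rule)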

Throttling is performed differently

  • Twitter throttles their Search API by IP address based upon the number of queries per second.  In a world of real-time streaming results, this whole concept is moot.  Instead, throttling is defined by the number of keywords a given account can track and the overall percentage of the firehose you can receive.
  • The default access to the Streaming API is 200 keywords; just plug in your username and password and off you go.  Currently, Twitter offers approved customers access to 10,000 keywords (restricted track) and 200,000 keywords (partner track).  If you need to track more than 200,000 keywords, Twitter may bind “partner track” access to multiple accounts, giving you access to 400,000 keywords or even more.
  • In addition to keyword-based streams, Twitter makes available several specific-use streams, including the link stream (All tweets with a URL) and the retweet stream (all retweets).  There are also various levels of userid-based streams (follow, shadow and birddog) and the overall firehose (spritzer, gardenhose and firehose), but they are outside the bounds of this post.
  • The best place to begin your quest for increased Streaming API access is an email to api@twitter.com — briefly describe your company and use case along with the requested access levels. (This process will likely change for the coming Commercial Accounts.)
  • Twitter’s Streaming API is throttled at the overall stream level. Imagine that you’ve decided to try to get as many tweets as you can using track.  I know, I know, who would do such a thing?  Not you, certainly.  But imagine that you did — you entered 200 stop words, like “and”, “or”, “the” and “it” in order to get a ton of tweets flowing to you.  You would be sorely disappointed, because Twitter enforces a secondary throttle, a percentage of the firehose available to each access level.  The higher the access level (partner track vs. restricted track vs. default track), the greater the percentage you can consume.  Once you reach that amount, you will be momentarily throttled and all matching tweets will be dropped on the floor.  No soup for you!  You should monitor this by watching for “limit” notifications.  If you find yourself regularly receiving these, either tighten up your keywords or request greater access from Twitter.

Start tracking deletes

  • Twitter sends deletion notices down the pipe when a user deletes one of their own tweets.  While Twitter does not enforce adoption of this feature, please do the right thing and implement it.  When a user deletes a tweet, they want it stricken from the public record.  Remember, “it ain’t complete if you don’t delete.”  We just made that up.  Just now.  We’re pretty excited about it.  (A small dispatch sketch follows below.)
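
Below is a small dispatch sketch showing where delete handling fits alongside normal tweets and the “limit” notices mentioned above. The message shapes reflect our reading of the streaming documentation and the "store" object is a stand-in for your own persistence layer, so verify the field names against Twitter's docs before relying on them.

    # Message dispatch sketch: tweets, deletion notices and limit notices.
    # Field names are our reading of the streaming docs; `store` is a stand-in
    # for whatever persistence layer you use.
    import json

    def handle_message(raw_line, store):
        msg = json.loads(raw_line)
        if "delete" in msg:
            # The author removed this tweet -- strike it from your copy of the record.
            store.delete(msg["delete"]["status"]["id"])
        elif "limit" in msg:
            # Matching tweets were dropped because you hit your firehose percentage.
            print("limited: %d matching tweets not delivered" % msg["limit"]["track"])
        elif "text" in msg:
            store.save(msg)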

Question 4: What if I want historical data too?


Twitter’s Streaming API is forward-looking, so you’ll only get new tweets when you add a new keyword.  Depending on your use case you may need some historical data to kick things off.  If so, you’ll want to make one simultaneous query to the Search API.  This means that you’ll need to maintain two integrations with Twitter APIs (three, if you’re taking advantage of Twitter’s REST API for tracking specific users), but the benefit is historical data + low-latency / high-reliability future data.

And as described before, the general migration to the Streaming API should result in deeper results from the Search API, but even now you can get around 1,500 results for a keyword if you get acquainted with the “page” query parameter.
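
Here is a minimal backfill sketch: when a keyword is first added, page through the Search API (roughly 15 pages of 100 results at the moment) and then let the Streaming API take over for everything new. The page and per-page limits are our understanding at the time of writing and may change.

    # One-time backfill when a new keyword is added; the stream handles the future.
    import requests

    SEARCH_URL = "http://search.twitter.com/search.json"

    def backfill(keyword, max_pages=15, per_page=100):
        """Walk the Search API's paging to pull roughly 1,500 recent tweets."""
        tweets = []
        for page in range(1, max_pages + 1):
            resp = requests.get(SEARCH_URL,
                                params={"q": keyword, "rpp": per_page, "page": page})
            resp.raise_for_status()
            results = resp.json().get("results", [])
            if not results:
                break            # ran off the end of the available search corpus
            tweets.extend(results)
        return tweets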

Question 5: What if I need more help?

Twitter resources:

Streaming HTTP resources:

Gnip help:

  • Ask questions in the comments below and we’ll respond inline
  • Send email to eric@gnip.com to ask the Gnip team direct questions

The Only Constant is Change

As a few people have mentioned online, Gnip laid off seven team members today. It was a horrible thing to have to do and my very best wishes go out to each team member who was let go.  If you’re in Boulder and need a Java or PHP developer, an HR/office manager or an inside salesperson, send an email to eric@gnip.com and I’ll connect you with some truly awesome people.

I would like to address a few specific points for our partners, customers and friends:

  1. We believe as strongly as ever in providing data aggregation solutions for our customers.  If we didn’t, we would have returned to our investors the year of funding we have in the bank (now two years).
  2. We are still delivering the same data as yesterday. The existing platform is highly stable and will continue to churn out data as long as we want it to.
  3. The changes in personnel revolve around rebuilding the technology stack to allow for faster, more iterative releases. We’ve been hamstrung by a technology platform that was built under a very different set of assumptions more than a year ago. While exceptionally fast and stable, it is also a beast to extend.  The next rev will be far more flexible and able to accommodate the many smart feature requests we receive.

To Alex, Shane, Ingrid, JL, Jenna, Chris and Jen, it has been an honor working with you and I hope to have the privilege to do so again some day.

To our partners and customers, Gnip’s future is brighter than ever and we look forward to serving your social data needs for many years to come.

Sincerely,

Eric Marcoullier, CEO

Gnip Platform Update – Now For Authenticated Data Services

The Gnip Platform was originally built to support accessing public services and data.  In response to customer requests, we soft-launched support for authenticated data services over the summer, and now we have fully rolled out the new service.  The difference between public and authenticated data services seems trivial, but in practice it is very important, since authenticated services represent either business-level arrangements between companies or private data access.  The new Gnip capabilities support both of these scenarios.

As part of the new service, Gnip also provides dedicated integration capacity for companies, as we are now able to segment individually managed nodes on our platform for specific company accounts.  This means that a company with a developer key on Flickr, a whitelist account on Twitter, an application key on Facebook and a developer key on YouTube receives dedicated capacity on the Gnip platform to support all of its data integration requirements.

Gnip will also continue to maintain the existing public data integration services, which do not require authentication for access and distribution, and we expect most companies will use a blend of our data integration services.

Using the new support for authenticated data services requires contacting us at sales@gnip.com so we can enable your account. Please contact us today to leverage your existing whitelisted or authenticated account on Flickr, YouTube, Twitter or other APIs and feeds.

Gnip License Changes this Friday, Aug 28th

As we posted last month, there are some changes coming to the way we license use of the Gnip platform.  See: Gnip Licensing Changes Coming in August.

These updates will be put in place this Friday, August 28th.  The impact of the new licensing on our existing users will be as follows:

  1. The Gnip Community Edition license will be disabled, as it is no longer being offered.  Accounts that were created before August 1st will be set to inactive and will no longer be able to access the Gnip API or Developer Website.  If your company is in the process of evaluating Gnip for a commercial project and needs more time to complete your evaluation, please contact us at info@gnip.com and we can extend your trial.
  2. Gnip Standard Edition user accounts using the Commercial, Non-profit and Startup partner license options will continue to be available, as they are not impacted by the change on Friday.  If you are a Standard Edition user and we accidentally disable your account on Friday, please contact us at info@gnip.com and we will reactivate the account.
  3. New users who created an account starting after August 1st will receive an email notification on the day their 30-day trial expires informing them they need to contact Gnip to obtain the appropriate license for their commercial, non-profit or partner use case.

We appreciate all the companies and developers who have built solutions using Gnip and look forward to continuing to deliver real-time data to power these solutions.   By making these adjustments in our licensing we will be able to focus on innovating the Gnip Platform and supporting the many companies and partners we are fortunate to work with every day.