Author Archive
It took a while, but Gnip is now a Boulder Chamber of Commerce (@boulderchamber) member. We joined after seeing a pattern of clear value to our particular industry. In August of this year they hosted an event that put us face-to-face with the U.S. Department of Commerce Under Secretary for International Trade (Francisco Sánchez) and Colorado Congressman Jared Polis, where we discussed software patent issues as well as the immigration visa challenges the U.S. tech industry faces. Tonight I’m attending an event with Congressman Polis and a local software venture capitalist (Jason Mendelson) to talk about the challenges of hiring technical talent, locally and globally.
These are topics with significant political/legislative dynamics, and the Chamber has given us, a local software firm, access to relevant forums in which we can get our point of view on the table; thank you.
Whether or not the Chamber has been providing this kind of relevant access all along, I don’t know (my perception is otherwise). I do know that the impact they’re having on us as a local software business, as well as the channel they’re giving Gnip to get its perspective heard in the broader (national) forum, is significant. I’d encourage other Boulder software/technology firms to support their efforts, contribute to their events, and help them build an agenda that, in the end, helps us be more effective software/technical businesses.
Join us, in joining the Chamber.
I’m excited to announce that, as of the end of October, Gnip is delivering over 30 billion paid social media activities per month to our customers. This is the largest number of paid social media activities that have ever been distributed in a 30-day period.
Over the past year, we’ve seen extraordinary growth in the number of paid social media activities we deliver. At the start of 2011, Gnip was delivering 300 million activities per month. By May, that number was up to 3 billion activities per month. And in October, we delivered 30 billion activities. In essence, we’ve been growing by a factor of 10 every 5 months. At this rate, we’ll be delivering 300 billion activities per month by March of next year.
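For anyone who wants to check the projection, here is a minimal sketch of the extrapolation. It assumes the growth rate stays at a steady factor of 10 every 5 months, starting from the 30 billion figure for October 2011; the function names are made up for this example.

```python
def add_months(year, month, months):
    """Return (year, month) shifted forward by `months` calendar months."""
    index = year * 12 + (month - 1) + months
    return index // 12, index % 12 + 1


def project(volume, year, month, periods, factor=10, period_months=5):
    """Yield (year, month, volume) after each growth period.

    Assumes a constant growth factor per period, which is only an
    extrapolation of the figures quoted in the post.
    """
    for _ in range(periods):
        year, month = add_months(year, month, period_months)
        volume *= factor
        yield year, month, volume


# Starting point from the post: 30 billion activities in October 2011.
for y, m, v in project(30_000_000_000, 2011, 10, periods=1):
    print(f"{y}-{m:02d}: {v:,} activities/month")
```

One period forward from October 2011 lands on March 2012 at 300 billion activities per month.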
Cool numbers, but what’s driving this growth?
We’re seeing three key areas that are driving this number. First, we’re signing on new customers at an increasing rate, as more and more companies are seeing the possibilities in social media data. Second, we’re seeing increased interest in our Twitter firehose products. From hedge funds using social data to drive trading strategies to business intelligence companies layering social data onto their existing structured data sources, interest in volume products from Twitter is consistently increasing. And finally, we’re seeing a marked increase in the number of customers using multiple sources to enrich their product capabilities. From boards and forums to YouTube and Facebook, our customers are seeing the potential in the many other social media sources we offer.
So, 300 billion per month by March? It’s a big number, but the way things are going, I’ll take the over.
The social ecosystem has become the pulse of the world. From delivering breaking news like the death of Osama Bin Laden before it hit mainstream media to helping President Obama host the first Twitter Town Hall, the realtime social web is flooded with valuable information just waiting to be analyzed and acted upon. With millions of users and billions of social activities passing through the ever-growing realtime social web each day, it is no wonder that companies need to reevaluate their traditional business models to take advantage of this valuable data.
But as the social web keeps growing, massive amounts of data are pouring into and out of social media publishers’ websites and APIs every second. In a talk I gave at GlueCon a couple of months ago, I ran down some math to put things into perspective. The numbers are a little dated, but the impact is the same. At that time there were approximately 155,000,000 Tweets per day and the average size of a Tweet was approximately 2,500 Bytes (keep in mind this could include Retweets).
A Little Bit of Arithmetic
155,000,000 Tweets/day × 2,500 Bytes/Tweet = 387,500,000,000 Bytes/day
387,500,000,000 Bytes/day ÷ 24 hours/day = 16,145,833,333 Bytes/hour
16,145,833,333 Bytes/hour ÷ 60 minutes/hour = 269,097,222 Bytes/minute
269,097,222 Bytes/minute ÷ 60 seconds/minute = 4,484,953 Bytes/second
4,484,953 Bytes/second ÷ 1,048,576 Bytes/Megabyte ≈ 4.28 Megabytes/second
And in terms of data transfer rates . . .
1 Megabyte/second = 8 Megabits/second
So . . .
4.28 Megabytes/second × 8 Megabits/Megabyte ≈ 34.2 Megabits/second
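The same back-of-the-envelope calculation can be written as a short script. This is just a sketch using the two inputs quoted above (Tweets per day and average Tweet size); the constant and function names are invented for the example.

```python
TWEETS_PER_DAY = 155_000_000
BYTES_PER_TWEET = 2_500
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_MEGABYTE = 1_048_576  # 2**20
BITS_PER_BYTE = 8


def firehose_rates(tweets_per_day=TWEETS_PER_DAY, bytes_per_tweet=BYTES_PER_TWEET):
    """Return (bytes/second, megabytes/second, megabits/second)."""
    bytes_per_second = tweets_per_day * bytes_per_tweet / SECONDS_PER_DAY
    megabytes_per_second = bytes_per_second / BYTES_PER_MEGABYTE
    megabits_per_second = megabytes_per_second * BITS_PER_BYTE
    return bytes_per_second, megabytes_per_second, megabits_per_second


bps, mbytes, mbits = firehose_rates()
print(f"{bps:,.0f} Bytes/second")
print(f"{mbytes:.2f} Megabytes/second")
print(f"{mbits:.1f} Megabits/second")
```

Carrying full precision through every step, this comes out to roughly 4.28 Megabytes/second, or about 34.2 Megabits/second, as a sustained average. Peaks will be higher.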
That’s a Lot of Data
So what does this mean for the data consumers, the companies wanting to reevaluate their traditional business models to take advantage of vast amounts of Twitter data? At Gnip we’ve learned that some of the collective industry data processing tools simply don’t work at this scale: out-of-the-box HTTP servers/configs aren’t sufficient to move the data, out-of-the-box config’d TCP stacks can’t deliver this much data, and consumption via typical synchronous GET request handling isn’t applicable. So we’ve built our own proprietary data handling mechanisms to capture and process mass amounts of realtime social data for our clients.
Twitter is just one example. We’re seeing more activity on today’s popular social media platforms and a simultaneous increase in the number of popular social media platforms. We’re dedicated to seamless social data delivery to our enterprise customer base and we’re looking forward to the next data processing challenge.
One of the more interesting components of Twitter streams is the links within the Tweets themselves. Not only are links one way to bridge from traditional web trend analysis to social media, but they are also a window into what people are sharing.
Gnip provides three mechanisms to get at links in Tweets.
- Link Stream. The link stream provides you with 100% of the Tweets that contain links. Furthermore, Gnip enriches the stream with unwound URLs, so you don’t have to bother with an unwind-farm on your end.
- Power Track’s ‘has:links’ operator. Through Power Track, you can refine your complex queries (including substring matching) to collect only Tweets that contain links.
- Power Track’s ‘url_contains:’ operator. The ‘url_contains:’ operator allows you to filter the 100% Firehose for Tweets that have links and contain the substring you provide. It filters against both short, and long, URLs.
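To make the difference between the two operators concrete, here is a rough local model of their matching behavior. It is not Gnip’s implementation: the Tweet is a plain dict, and the `urls`, `short`, and `expanded` field names are assumptions made up for this sketch.

```python
def has_links(tweet):
    """Model of 'has:links': true when the Tweet carries at least one URL."""
    return bool(tweet.get("urls"))


def url_contains(tweet, substring):
    """Model of 'url_contains:': true when the substring appears in either
    the short or the unwound (long) form of any URL in the Tweet."""
    return any(
        substring in url.get(form, "")
        for url in tweet.get("urls", [])
        for form in ("short", "expanded")  # hypothetical field names
    )


tweet = {
    "text": "Reading the Gnip blog",
    "urls": [{"short": "http://bit.ly/abc123", "expanded": "http://blog.gnip.com/post"}],
}

print(has_links(tweet))                  # True
print(url_contains(tweet, "gnip.com"))   # True, matched on the long URL
print(url_contains(tweet, "bit.ly"))     # True, matched on the short URL
print(url_contains(tweet, "youtube"))    # False
```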
Happy filtering!
Monday’s deploy brought the ‘has:geo’ operator to Gnip’s Twitter Power Track. ‘has:geo’ gives you access to geo-coded Tweets (any Tweet with lat/lng coordinates). Geo-coded Tweets have been one of the most demanded streams/substreams to-date. We’re really excited to bring this feature to light.
Some usage examples:
- “has:geo” – alone, gives you the complete stream of all geo-coded Tweets
- “coffee has:geo” – gives you the complete stream of all geo-coded Tweets that contain the word “coffee”
- “fire has:geo” – gives you the complete stream of all geo-coded Tweets that contain the word “fire”
For a complete listing of Power Track operators, see the documentation. As with all commercial Twitter data products brought to you by Gnip, Power Track is only for use in non-public-display and non-programmatic resyndication use cases. If you want to do at-scale, full-coverage analysis of Twitter streams, we’re here to help. Contact us at info@gnip.com for more info.
Gnip’s Twitter Power Track feed has been a raging success! One of the fun things about Power Track is its expandability. We’ve been adding features left and right over the past few weeks to ensure you’re getting the Tweet filtering precision you need, across the 100% Twitter Firehose, with no volume limits.
As of today’s deploy, we’ve added support for the following new features:
General
Stream compression (optionally set via the typical “Accept-Encoding: gzip” client header). At volume, bandwidth costs are very real, not to mention the operational challenges of dealing with fat pipes. By enabling compression you can dramatically reduce your bandwidth costs and potentially avoid expensive connection upgrades. For more info see the documentation.
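On the client side, a compressed stream has to be decompressed incrementally as chunks arrive, rather than after the whole response is in hand. Below is a minimal sketch of that idea using only Python’s standard library. The stream is simulated in memory, so no endpoint or Gnip-specific API is assumed; only the “Accept-Encoding: gzip” header comes from the post.

```python
import gzip
import zlib

# Header a client would send to ask for a compressed stream.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def iter_decompressed(chunks):
    """Incrementally decompress an iterable of gzip-compressed byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Simulated stream: a made-up activity payload, compressed and split into
# small chunks the way bytes might arrive off a socket.
payload = b'{"body": "example activity"}\n' * 1000
compressed = gzip.compress(payload)
chunks = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]

restored = b"".join(iter_decompressed(chunks))
print(len(payload), len(compressed), restored == payload)
```

The repetitive sample payload compresses far better than real Tweets would, so treat the size ratio here as an illustration only.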
Operators
- contains:. sub-string matching. You can now expand your scope to include sub-strings. e.g. “contains: bam” grabs Tweets that include “Obama”.
- has:mentions. You can now narrow your scope to include only Tweets that include mentions of other Twitter accounts.
- has:hashtags. You can now narrow your scope to include only Tweets that include hashtags.
- has:links. Ensure the Tweets you’re looking for have links in them.
- has:geo. Ensure the Tweets you’re looking for are geo-coded. We’re soon going to enrich all Tweets that aren’t natively geocoded with geocoding (when possible based on content extrapolation).
- For more info check out the documentation.
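If it helps to think about the operators as predicates, here is a small local model of how a few of them narrow or expand a match. It is only an illustration of the semantics described above, not how Power Track evaluates rules, and the Tweet dict fields (`text`, `mentions`, `hashtags`, `coordinates`) are assumptions for this sketch.

```python
def contains(substring):
    """Model of 'contains:': plain sub-string match against the Tweet text."""
    return lambda tweet: substring in tweet["text"]


def has_mentions(tweet):
    """Model of 'has:mentions'."""
    return bool(tweet.get("mentions"))


def has_hashtags(tweet):
    """Model of 'has:hashtags'."""
    return bool(tweet.get("hashtags"))


def has_geo(tweet):
    """Model of 'has:geo': the Tweet carries lat/lng coordinates."""
    return tweet.get("coordinates") is not None


def matches(tweet, *predicates):
    """A Tweet matches a rule when every predicate in the rule holds."""
    return all(predicate(tweet) for predicate in predicates)


tweet = {
    "text": "Obama speaks in Boulder #colorado",
    "mentions": [],
    "hashtags": ["colorado"],
    "coordinates": (40.015, -105.2705),
}

print(matches(tweet, contains("bam")))                          # True
print(matches(tweet, contains("bam"), has_hashtags, has_geo))   # True
print(matches(tweet, contains("bam"), has_mentions))            # False
```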
Feel free to reach out to us at info@gnip.com or our Gnip Google Group.
Happy filtering.
Gnip’s business is growing heartily. As a result, we need to field current demand, refine our existing product offering, and expand into completely new areas in order to deliver the web’s data. From a business standpoint we need to grow our existing sales team in order to capture as much of our traditional market as possible, as fast as possible. We also need to leverage established footholds in new verticals, and turn those into businesses as big as, or hopefully bigger than, our current primary market. The sales and business-line expansion at Gnip is in full swing, and we need more people on the sales and business team to help us achieve our goals.
From a technical standpoint I don’t know where to begin. We have a large existing customer base that we need to keep informed, help optimize, and generally support; we’re hiring technical support engineers. Our existing system scales just fine, but software was meant to iterate, and we have learned a lot about handling large volumes of real-time data streams, across many protocols and formats, for ultimate delivery to large numbers of customers. We want to evolve the current system to even better leverage computing resources, and provide a more streamlined customer experience. We’ve also bitten off a historical data set indexing challenge that is well… of true historical proportion. The historical beast needs feeding, and it needs big brains to feast on. We need folks who know Java very well and have search, indexing, and large data-set management backgrounds.
On the system administration side of things… if you like to twiddle IP tables, tune MTUs for broad geographic region high-bandwidth data flow optimization, and handle high-volume/bandwidth streaming content, then we’d like to hear from you. We need even more sys admin firepower.
Gnip is a technical product, with a technical sale. Our growth has us looking to offload a lot of the Sales Engineering support that the dev team currently takes on. Consequently, we’re looking to hire a Sales Engineer as well.
Gnip has a thriving business. We have a dedicated, passionate, intelligent team that knows how to execute. We’re building hard technology that has become a critical piece of the social media ecosystem. Gnip is also located in downtown Boulder, CO.
http://gnip.com/careers
Gnip had a great time at O’Reilly’s Strata 2011 conference in California last week. We signed up several months ago as a big sponsor without knowing exactly how things were going to come together. The bet paid off and Strata was a huge success for us, and the industry at large. We were blown away by the relevance of the topics discussed and the quality of the attendees and discussions that were sparked. I was amazed at how much knowledge everyone now has surrounding big data set analysis and processing. Technologies that were immature and new just a few years ago are now baked into the ecosystem and have become tools of the trade (e.g. Hadoop). All very cool to see.
That said, there remains a distinct gap between big data set handling and high-volume/real-time data stream handling. We’ve come a long way in handling monster data set processing in batch or offline modes, but we have a long way to go when it comes to handling large streaming data set challenges.
Hilary Mason, of bit.ly, hit this point squarely in her “What Data Tells Us” talk at Strata. We can open sourcely fan out ungodly amounts of processing… like piranha on fresh meat. However, blending that processing, and high-latency transactions, into real-time streams of thousands of activities per second is not as refined and well understood. Frankly, I’m shocked at the number of engineers I run into who simply don’t understand asynchronous programming at all.
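To give a flavor of what asynchronous handling buys you, here is a toy sketch using Python’s asyncio. It is not a description of anyone’s production system: `enrich` is a made-up stand-in for a high-latency transaction (a network lookup, say), and a finite list stands in for a live stream.

```python
import asyncio


async def enrich(activity, latency=0.01):
    """Stand-in for a high-latency call; sleeps instead of doing real I/O."""
    await asyncio.sleep(latency)
    return {**activity, "enriched": True}


async def process_stream(activities, max_in_flight=100):
    """Run the slow step for many activities concurrently, capping how many
    are in flight at once so a burst cannot exhaust resources."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(activity):
        async with semaphore:
            return await enrich(activity)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(a) for a in activities))


activities = [{"id": i} for i in range(1000)]
results = asyncio.run(process_stream(activities))
print(len(results), results[0])
```

Handled one at a time, 1,000 activities at 10 ms each would take about 10 seconds. With up to 100 in flight, the waits overlap and the same work finishes in a fraction of that.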
The night before the conference started, Pete Warden drove BigDataCamp @Strata, where Mike Montano from BackType gave a high-level overview of their infrastructure. He laid out a few tiers and described the “speed” tier as something that did a lot of work on high-volume streams, and a “batch” tier that did stuff in a more offline manner. The blend of approaches was an interesting teaser into how Big Stream challenges can be handled. Gnip’s own infrastructure has had to address these challenges of course, and we launched into a thread of detail in our Expanding The Twitter Firehose post awhile back.
Big Stream handling occupies a good part of my brain. I’d like to see Big Data discussion start to unravel Big Stream challenges as well.
About a month ago Twitter announced they will be shutting off XML for stream-based endpoints on Dec. 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you’ll be forced to get your head around something new. You’ll have to get smart, find the right JSON lib, change your code to use it (and any associated dependencies you weren’t already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.
Gnip’s format normalization shields you from all of this as it turns out. Gnip customers get to stay focused on delivering value to their customers. Others integrating directly, and consuming stream data from Twitter in XML, have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).
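As a rough illustration of what a normalization layer saves you from, here is a sketch that parses the same Tweet from XML and from JSON into one common shape. The payloads and field names are simplified, made-up examples, and the output dict is not Gnip’s Activity Streams format; the point is only that downstream code sees one structure regardless of the wire format.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, hypothetical payloads for the same Tweet in both formats.
XML_TWEET = (
    "<status><id>1</id><text>hello</text>"
    "<user><screen_name>gnip</screen_name></user></status>"
)
JSON_TWEET = '{"id": 1, "text": "hello", "user": {"screen_name": "gnip"}}'


def from_xml(payload):
    """Parse the XML form into the common shape."""
    root = ET.fromstring(payload)
    return {
        "id": int(root.findtext("id")),
        "text": root.findtext("text"),
        "author": root.findtext("user/screen_name"),
    }


def from_json(payload):
    """Parse the JSON form into the same common shape."""
    doc = json.loads(payload)
    return {
        "id": doc["id"],
        "text": doc["text"],
        "author": doc["user"]["screen_name"],
    }


print(from_xml(XML_TWEET) == from_json(JSON_TWEET))  # True
```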
From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).
Fun.
We at Gnip have been waiting a long time to write the following sentence: Gnip and Twitter have partnered to make Twitter data commercially available through Gnip’s Social Media API. I remember consuming the full firehose back in 2008 over XMPP. Twitter was breaking ground in realtime social streams at a then mind-blowing ~6 (six) Tweets per second. Today we see many more Tweets and a greater need for commercial access to higher volumes of Twitter data.
There’s enormous corporate demand for better monitoring and analytics tools, which help companies listen to their customers on Twitter and understand conversations about their brands and products. Twitter has partnered with Gnip to sublicense access to public Tweets, which is great news for developers interested in analyzing large amounts of this data. This partnership opens the door to developers who want to use Twitter streams to create monitoring and analytics tools for the non-display market.
Today, Gnip is announcing three new Twitter feeds with more on the way:
- Twitter Halfhose. This volume-based feed is comprised of 50% of the full firehose.
- Twitter Mentionhose. This coverage-based feed provides the realtime stream of all Tweets that mention a user, including @replies and retweets. We expect this to be very interesting to businesses studying the conversational graph on Twitter to determine influencers, engagement, and trending content.
- Twitter Decahose. This volume-based product is comprised of 10% of the full firehose. Starting today, developers who want to access this sample rate will access it via Gnip instead of Twitter. Twitter will also begin to transition non-display developers with existing Twitter Gardenhose access over to Gnip.
We are excited about how this partnership will make realtime social media analysis more accessible, reliable, and sustainable for businesses everywhere.
To learn more about these premium Twitter products, visit http://gnip.com/twitter, send us an email at info@gnip.com, or appropriately, find us on Twitter @gnip.