Twitter, you've come a long way baby… #8years

Like a child's first steps or your first experiment with Pop Rocks candy, the first-ever Tweet went down in the Internet history books eight years ago today. On March 21, 2006, Jack Dorsey, co-founder of Twitter, published this.

 

Twttr (the service's first name) was launched to the public on July 15, 2006, where it was recognized for “good execution on a simple but viral idea.” Eight years later, that seems to have held true.

It has become the digital watering hole, the newsroom, the customer service dos and don'ts, a place to park the witty one-liners that would just be weird to say out loud at your desk. And then there is that overly happy person you thought couldn't actually exist, standing in front of you in line, and you just favorited their selfie #blessed. Well, this is awkward.

Just eight months after its release, the company made a sweeping entrance into SXSW 2007, sparking the platform's usage to balloon from 20,000 to 60,000 Tweets per day. Thus began the era of our public everyday lives being archived in 140-character tidbits. The manual “RT” turned into the click of a button, and favorites became the digital head nod. I see you.

In April 2009, Twitter launched the Trending Topics sidebar, identifying popular current world events and modish hashtags. Verified accounts became available that summer; athletes, actors, and icons alike began to display the “verified account” badge on their Twitter pages. This increasingly became a necessity for telling the real Miley Cyrus from the real Justin Bieber. If differences do exist.

The Twitter Firehose launched in March 2010. By giving Gnip access, Twitter opened a new door into the social data industry, and come November, filtered access to social data was born. Twitter turned to Gnip as its first partner serving the commercial market. By offering complete access to the full firehose of publicly available Tweets under enterprise terms, this partnership enabled companies to build more advanced analytics solutions with the knowledge that they would have ongoing access to the underlying data. This was a key inflection point in the growth of the social data ecosystem. By April, Gnip played a key role in delivering past and future Twitter data to the Library of Congress for historic preservation in its archives.

On July 31, 2010, Twitter hit its 20 billionth Tweet milestone, or as we like to call it, a twilestone. It is the platform of hashtags and Retweets, celebrities and nobodies, at-replies, political rants, entertainment 411 and “pics or it didn’t happen.” On June 1, 2011, Twitter enabled just that as it broke into the photo-sharing space, letting users upload photos straight to their personal handles.

One of the most highly requested features was the ability to get historical Tweets. In March 2012, Gnip delivered just that, making every public Tweet available, going all the way back to the March 21, 2006 Tweet from Mr. Dorsey himself.

Fast forward eight years, and Twitter is reporting over 500 million Tweets per day. That's more than 25,000 times its pre-SXSW volume, in just eight years! With over 2 billion accounts, more than a quarter of the world's population, Twitter ranks high among the top websites visited every day. Here's to the times when we write our Twitter handle on our conference name tags instead of our birth names, and prefer to be tweeted at rather than texted. Voicemails? Ain't nobody got time for that.

Twitter launched a special surprise for its 8th birthday. Want to check out your first tweet?

“There’s a #FirstTweet for everything.” Happy Anniversary!

 

See more memorable Twitter milestones

Taming The Social Media Firehose, Part II

In part I, we discussed some high-level attributes of the social media firehose and what is needed to digest the data. Now let’s collect some data and perform a simple analysis to see how it goes.

On March 20, 2012, there was a large magnitude-7.4 earthquake in Oaxaca, Mexico. (See the USGS record.) Due to the severity of the earthquake, it was felt in many locations across southern Mexico and as far as 300 miles away in Mexico City.

Collecting data from multiple firehoses around a surprise event such as an earthquake can give a quick sense of the unfolding situation on the ground in the short term, as well as help us understand the long-term implications of destruction or injury. Social data use continues to evolve in natural disasters.

For this post, let’s limit the analysis to two questions:

  1. How does the volume of earthquake-related posts and tweets evolve over time?
  2. How rich or concise are social media conversations about the earthquake? To keep this simple, treat the size of the social media activities as a proxy for content richness.

In the past few months, Gnip has made new, rich firehoses such as Disqus, WordPress and Tumblr available. Each social media service attracts a different audience and has strengths for revealing different types of social interactions. For each topic of interest, you’ll want to understand the audience and the activities common to the publisher.

Out of the possible firehoses we can use to track the earthquake through social media activity, four firehoses will be used in this post:

  • Twitter
  • WordPress Posts
  • WordPress Comments
  • Newsgator (to compare to traditional media)

There are some common tasks and considerations for solving problems like looking at earthquake data across social media.  For examples, see this post on the taxonomy of data science. To work toward an answer, you will always need a strategy of attack to address each of the common data science challenges.

For this project, we will focus on the following steps needed to digest and analyze a social media firehose:

  1. Connect and stream data from the firehoses
  2. Apply filters to the incoming data to reduce to a manageable volume
  3. Store the data
  4. Parse and structure relevant data for analysis
  5. Count (descriptive statistics)
  6. Model
  7. Visualize
  8. Interpret

Ok, that’s the background. Let’s go.

1. Connect to the firehose. We are going to collect about a day’s worth of data. The simplest way to collect the data is with a cURL statement like this,

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"

This command opens a streaming HTTP connection to the Gnip firehose server and delivers a continuous stream of JSON-formatted, GZIP-compressed data from the stream named “pt1.json” to my analysis data collector. If everything goes as planned, this will collect data from the firehose until the process is manually stopped.

The depth of the URL (each level of …/…/…) is the RESTful way of defining the interface. Gnip provides a URL for every stream a user configures on the server. In this case, I configured a streaming Twitter server named “pt1.json.”

A more realistic application would ensure against data loss by making this basic client connection more robust. For example, you may want to monitor the connection so that if it dies due to network latency or other network issues, the connection can be quickly re-established. Or, if you cannot afford to miss any activities, you may want to maintain redundant connections.
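A rough sketch of such a client in Python, using the requests library, is below. The password, timeout and backoff policy shown here are illustrative assumptions, not Gnip's reference client.

import time
import requests

URL = "https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
AUTH = ("shendrickson@gnip.com", "********")  # placeholder credentials

def consume(handle_line):
    backoff = 1
    while True:
        try:
            with requests.get(URL, auth=AUTH, stream=True, timeout=90) as resp:
                resp.raise_for_status()
                backoff = 1  # healthy connection, reset the backoff
                for line in resp.iter_lines():  # requests decompresses gzip transparently
                    if line:
                        handle_line(line)
        except requests.RequestException:
            # Disconnect or network hiccup: wait briefly, then re-establish the stream.
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)

Calling consume(print) simply echoes activities as they arrive; a real client would hand each activity off to the filtering and storage steps described below, and a second copy of this loop could run elsewhere if you cannot afford to miss activities.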

One of the network challenges is that volumes from the firehoses change with daily cycles, weekly cycles and surprise events such as earthquakes. These changes can be many times the average volume of posts. There are many strategies for dealing with volume variations and for planning network and server capacity. For example, you may design graceful data-loss procedures or, if your data provider offers it, use content shaping such as buffering, prioritized rules or rule-production caps. In the case of content shaping, you may build an application that monitors rule-production volumes and reacts quickly to large changes in volume by restricting your rule set.

Here is a short list of issues to keep in mind when planning your firehose connection:

  • Bandwidth must be sufficient for activity volume peaks rather than averages.
  • Latency can cause disconnects as well as having adverse effects on time-sensitive analysis.
  • Disconnects may occur due to bandwidth or latency as well as network outage or client congestion.
  • Implement redundancy and connection monitoring to ensure against activity loss.
  • Activity bursts may require additional hardware, bandwidth, processing power or filter updates. Volume can change by 10x or more.
  • Publisher terms of service may require additional filtering to comply with requirements on how or when data may be used; for example, appropriately handling activities that were deleted or protected after your system received them.
  • De-duplicating repeated activities; identifying missing activities

2. Preliminary filters. Generally, you will want to apply broad filter terms early in the process to enable faster, more manageable downstream processing. Many terms related to nearly any topic of interest will appear not only in the activities you are interested in, but also in unrelated activities (noise). The practical response is to continuously refine filter rules to exclude unwanted activities. This may be as simple as keyword filtering or as sophisticated as machine-learning identification of activity noise.

While it would help to use a carefully crafted rule set for our earthquake filter of the firehoses, it turns out that we can learn a lot with the two simple rules “quake” and “terramoto,” the English and Spanish terms commonly appearing in activities related to the earthquake. For our example analysis, we don’t get enough noise with these two terms to worry about additional filtering. So, each of the firehoses is initially filtered with these two keywords. With a simple filter added, our connection looks like this,

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
| grep -i -e "quake" -e "terramoto"

The “grep” command simply keeps activities containing the terms “quake” or “terramoto”; the “-i” flag makes the match case-insensitive.

The filter shown in the example will match activities in which either term appears in any part of the activity, including activity text, URLs, tagging, descriptions, captions, etc. In order to filter more precisely on, for example, only blog post content, or only a Tweet's user profile, we would need to parse the activity before filtering.
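As a sketch of what that downstream, field-level filtering might look like once activities are parsed (treat it as illustrative rather than Gnip tooling; the body field is the one used in the parsing examples later in this post):

import json

TERMS = ("quake", "terramoto")

def body_matches(raw_line):
    try:
        activity = json.loads(raw_line)
    except ValueError:
        return False  # skip keep-alive blank lines or malformed records
    body = activity.get("body", "").lower()
    return any(term in body for term in TERMS)

with open("earthquake_data.json") as f:
    for line in f:
        if body_matches(line):
            print(line.strip())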

Alternatively, we can configure Gnip’s PowerTrack filtering when we set up our server, with rules that restrict filtering to certain fields or that shape volume. For example, to filter Tweets based on a Twitter user’s profile location setting, we might add the rule,

user_profile_location:"Mexico City"

Or, to shape matched Tweet volume for very common terms, we might restrict output to 50% of matched Tweets by adding the rule,

sample:50

For the earthquake example, we use all matched activities.

3. Store the data. Based on the desired analysis, there are a wide variety of choices for storing data. You may choose to create a historical archive, load a processing queue, or push the data to cluster storage for processing with, for example, Hadoop. Cloud-based key-value stores can be economical, but may not have the response characteristics required for solving your problem. Choices should be driven by precise business questions rather than technology buzz.

Continuing to work toward the earthquake analysis, we will store activities to a file to keep things simple.

curl --compressed -s -ushendrickson@gnip.com
"https://stream.gnip.com:443/accounts/client/publishers/twitter/streams/track/pt1.json"
| grep -i -e "quake" -e "terramoto"
> earthquake_data.json

Plans for moving and storing data should take into account typical activity volumes. Let’s look at some examples of firehose volumes. JSON-formatted activities compressed with GZIP come to roughly 25 gigabytes per 100 million Tweets. While this takes less than 2 minutes to transfer to disk at 300 MB/s (SATA II), it takes about 6 hours at 10 Mb/s (e.g. a typical congested Ethernet network). Firehose sizes vary; one day of WordPress.com posts is a bit more manageable at 350 MB.
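The arithmetic behind those estimates is worth a quick sanity check:

# Back-of-the-envelope transfer times for ~25 GB of compressed Tweets.
size_mb = 25 * 1024          # ~25 GB expressed in megabytes
sata_mb_per_s = 300          # SATA II throughput, megabytes per second
ethernet_mb_per_s = 10 / 8.0 # 10 megabits per second expressed in megabytes per second

print(size_mb / sata_mb_per_s / 60)        # ~1.4 minutes
print(size_mb / ethernet_mb_per_s / 3600)  # ~5.7 hours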

Filtered Earthquake data for the Twitter, WordPress and Newsgator firehoses is only a few gigabytes, so we will just work from local disk.

4. Parse and structure relevant data. This is the point where we make decisions about the data structure and tools that best support the desired analysis. The data are time-ordered social media activities with a variety of metadata. On one hand, it may prove useful to load the data into HBase to leverage the scalability of Hadoop; on the other, structuring a subset of the data or metadata and inserting it into a relational database to leverage the speed of indexes might be a good fit. There is no silver bullet for big data.

Keeping it simple for the earthquake data, we use a Python script to parse the JSON-formatted activities and extract the date-time of each post.

import json
with open("earthquake_data.json") as f:
    for activity in f:
        print(json.loads(activity)["postedTime"])

Now we can analyze the time evolution of activity volume by counting up the number of mentions appearing in each minute. Similarly, to estimate content complexity, we can add a few more lines of Python to count the characters in the text of each activity.

import json
with open("earthquake_data.json") as f:
    for activity in f:
        print(len(json.loads(activity)["body"]))

5. Descriptive statistics. Essentially, counting the number of earthquake references. Now that we have extracted dates and sizes, our earthquake analysis is simple. A few ideas for more interesting analysis: understanding and identifying key players in the social network, extracting entities such as places or people from the text, performing sentiment analysis, or watching the wave of Tweets move out from the earthquake epicenter.
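A minimal sketch of that counting step, assuming the ISO 8601 postedTime format shown in the parsing example above:

import json
from collections import Counter

per_minute = Counter()
with open("earthquake_data.json") as f:
    for activity in f:
        posted = json.loads(activity)["postedTime"]  # e.g. "2012-03-20T18:02:47.000Z"
        per_minute[posted[:16]] += 1                 # truncate to YYYY-MM-DDTHH:MM

for minute, count in sorted(per_minute.items()):
    print(minute, count)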

6. Model. Descriptive statistics are okay, but well-formed questions make explicit assumptions about correlation and/or causality and, importantly, are testable. The next step is to build a model and some related hypotheses we can test.

A simple model we can examine is that surprise events fit a “double-exponential” pulse in activity rate. The motivating idea is that news spreads more quickly as more people know about it (exponential growth) until nearly everyone who cares knows. After saturation, the discussion of a topic dies off exponentially (analogously to radioactive decay). If this hypothesis works out, we have a useful modeling tool that enables comparison of conversations between different types of events and across firehoses. To learn more about the attributes of the double exponential and about fitting social media volume data to it, see Social Media Pulse.
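As a rough illustration (not necessarily the exact functional form from the Social Media Pulse write-up), fitting a two-sided exponential pulse to per-minute counts with SciPy might look like this:

import numpy as np
from scipy.optimize import curve_fit

def pulse(t, peak, t0, rise, decay):
    # Exponential growth up to the peak time t0, exponential decay afterwards.
    return np.where(t < t0,
                    peak * np.exp((t - t0) / rise),
                    peak * np.exp(-(t - t0) / decay))

# `counts` would be the per-minute volumes from step 5; synthetic data keeps the sketch self-contained.
t = np.arange(720, dtype=float)                    # 12 hours of one-minute bins
counts = pulse(t, 400, 30, 5, 90) + np.random.poisson(5, t.size)

p0 = [counts.max(), float(np.argmax(counts)), 5.0, 60.0]
params, _ = curve_fit(pulse, t, counts, p0=p0)
print(dict(zip(["peak", "t0", "rise", "decay"], np.round(params, 2))))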

7. Visualize. Finally, we are ready to visualize the results of earthquake-related activity. We will use the simple tactic of looking at time-dependent activity in Figure 1. Twitter reports start arriving immediately after the earthquake and the volume grows to a peak within minutes. Traditional media (Newsgator) stories peak about the same time and continue throughout the day while blogs and blog comments peak and continue into the following day.

Twitter Reaction to Earthquakes
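For the curious, a similar view of the data can be sketched in a few lines of matplotlib from the per-minute counts (a quick way to get a comparable chart, not necessarily how the figure above was produced):

import json
from collections import Counter
import matplotlib.pyplot as plt

per_minute = Counter()
with open("earthquake_data.json") as f:
    for activity in f:
        per_minute[json.loads(activity)["postedTime"][:16]] += 1

minutes = sorted(per_minute)
plt.plot([per_minute[m] for m in minutes])
plt.xlabel("One-minute bins since start of collection")
plt.ylabel("Earthquake-related activities per minute")
plt.show()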

8. Interpret. Referring to Figure 1, a few interesting features emerge. First, Twitter volume is significant within a minute or two of the earthquake and peaks in about half an hour. Within an hour, the story on Twitter starts to decay. In this case, the story's natural decay is slowed by the continual release of news stories about the earthquake. Users continue to share new Tweets for nearly 12 hours.

The prominent bump in Tweet volume  just before 1:00 UTC is the result of Tweets by the members of the Brazilian boy-band “Restart,” who appeared to be visiting Mexico City at the time of the earthquake and came online to inform their fanbase that they were okay. The band’s combined retweet volume during this period added a couple thousand tweets to the background of earthquake tweets (which also grew slightly during this period due to news coverage).

While it is not generally the case, traditional media earthquake coverage (represented by the bottom graph, the Newsgator firehose) peaks at about the same time as Tweet volume. We commonly see Tweet volume peaking minutes, hours and occasionally days before traditional media. In this case, the quake was very large, attracting attention and pressure to answer questions about damage and injuries.

WordPress blog posts about the earthquake illustrate a common pattern for blog posts: they live on a daily cycle. Notice the second wave of WordPress posts starting around 9:00 UTC. Blogs take a little longer to peak because they typically contain analysis of the situation, photos and official statements. Also, many blog readers choose to check in on the blogs they follow in the morning and evening.

A couple of comments on the topic of content richness: as you might have guessed, Tweets are concise. Tweets are limited to 140 characters, but average in the mid-80s. Comments tend to be slightly longer than Tweets, at about 250 characters. Posts have the widest range of sizes, with a few very large posts (up to 15,000 words). Posts also often contain rich media such as images or embedded video.

From our brief analysis, it is clear that different firehoses represent different audiences, a range of content richness and publisher-specific user modes of interaction. “What firehoses do I need?” To start to answer, it may be useful to frame the comparison in terms of speed vs. content richness. Stories move at different speeds and have different content richness based on the firehose and the story. Your use case may require rapid reaction to the social conversation, or nuanced understanding of long-term sentiment. Surprise stories don’t move in the same ways as conversations about expected events.  Each firehose represents a different set of users with different demographics and sensibilities. You will need to understand the content and behavior of each firehose based on the questions you want to answer.

As you can see by getting your hands a little dirty, looking at how a specific event is discussed in social media is fairly straightforward. Hopefully I have shown a simple and direct route to getting answers, as well as given some useful context on the considerations for building a real-world social media application. Always remember to start with a good, well-formed question.

In Taming The Social Media Firehose, Part III, we will look at consuming the unique, rich and diverse content of the Tumblr Firehose.

Taming The Social Media Firehose, Part I

This is the first post in our series on what a social media “firehose” (e.g. streaming api) is and what it takes to turn it into useful information for your organization.  Here I outline some of the high-level challenges and considerations when consuming the social media firehose; in Parts II and III, I will give more practical examples.

Social Media Firehose

Why consume the social media firehose?

The idea of consuming large amounts of social data is to get small data: to gain insights and answer questions, to guide strategy and help with decision making. To accomplish these objectives, you are not only going to collect data from the firehose, but you are also going to have to parse, scrub and structure it based on the analysis you will pursue. (If you’re not familiar with the term “parse,” it means machines are working to understand the structure and contents of the social media activity data.) This might mean analyzing text for sentiment, looking at the time series of the volume of mentions of your brand on Tumblr, following the trail of political reactions through the social network of commenters, or any of thousands of other possibilities.

What do we mean by a social media firehose?

Gnip offers social media data from Twitter, Tumblr, Disqus and Automattic (WordPress blogs) in the form of “firehoses.”  In each case, the firehose is a continuous stream of flexibly structured social media activities arriving in near-real time. Consuming that sounds like it might be a little tricky. While the technology required to consume and analyze social media firehoses is not new, the synthesis of tools and ideas needed to successfully consume the firehose deserves some consideration.

It may help to start by contrasting firehoses with a more common way of looking at the API world: the plain vanilla HTTP request and response. The explosion of SOAPy (Simple Object Access Protocol) and RESTful APIs has enabled the integration and functional ecosystem of nearly every application on the Web. At the core of web services is a pair of simple ideas: that we can leverage the simple infrastructure of HTTP requests (the biggest advantage may be that we can build on existing web servers, load balancers, etc.), and that scalable applications can be built on simple stateless request/response pairs exchanging bite-sized chunks of data in standard formats.

Firehoses are a little different in that, while we may choose to use HTTP for many of the reasons REST and SOAP did, we don’t plan to get responses in mere bite-sized chunks. With a firehose, we intend to open a connection to the server once and stream data indefinitely.

Once you are consuming the firehose, and, even more importantly, with some analysis in mind, you will choose a structure that adequately supports your approach. With any luck (more likely, smart people and hard work), you will end up not with Big Data, but rather with simple insights: simple to understand and clearly prescriptive for improving products, building stronger customer relationships, preventing the spread of disease, or any other outcome you can imagine.

The Elements Of a Firehose

Now that we have a why, let’s zero in on consuming the firehose. Returning to the definition above, here is what we need to address:

Continuous. For example, the Twitter full firehose delivers over 300M activities per day. That is an average of about 3,500 activities/second, or one activity every 290 microseconds. The WordPress firehose delivers nearly 400K activities a day. While this is a much more leisurely 4.6 activities/second, there still isn’t much time to sleep between activities arriving every 0.22 s on average. And if your system isn’t continuously pulling data out of the firehose, much can be lost in a short time.

Streams. As mentioned above, the intention is to make a firehose connection and consume the stream of social media activities indefinitely. Gnip delivers the social media stream over HTTP. The consumer of data needs to build their HTTP client so that it can decompress and process the buffer without waiting for the end of the response. This isn’t your traditional request-response paradigm (that’s why we’re not called Ping–and also, that name was taken).
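A bare-bones sketch of that kind of client, reading the compressed stream chunk by chunk and emitting complete activities as soon as they arrive (the URL and credentials are placeholders, and the requests/zlib approach is just one reasonable choice):

import zlib
import requests

URL = "https://stream.gnip.com:443/accounts/client/publishers/wordpress/streams/posts/prod.json"  # placeholder
resp = requests.get(URL, auth=("user@example.com", "********"), stream=True,
                    headers={"Accept-Encoding": "gzip"})

decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect a gzip wrapper
buffer = b""
for chunk in resp.raw.stream(8192, decode_content=False):  # raw compressed bytes as they arrive
    buffer += decompressor.decompress(chunk)
    # Activities are newline-delimited JSON; emit each complete line immediately.
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            print(line.decode("utf-8"))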

Unstructured data. I prefer “flexibly structured” because there is plenty of structure in the JSON or XML formatted activities contained in the firehose. While you can simply and quickly get to the data and metadata for the activity, you will need to parse and filter the activity. You will need to make choices about how to store activity data in the structure that best supports your modeling and analysis. It is not so much what tool is good or popular, but rather what question you want to answer with the data.

Time-ordered activities done by people. The primary structure of the firehose data is that it represents the individual activities of people rather than summaries or aggregations. The stream of data in the firehose describes activities such as:

  • Tweets, micro-blogs
  • Blog/rich-media posts
  • Comments/threaded discussions
  • Rich media-sharing (urls, reposts)
  • Location data (place, long/lat)
  • Friend/follower relationships
  • Engagement (e.g. Likes, up- and down-votes, reputation)
  • Tagging

Real-time. Activities can be delivered soon after they are created by the user (this is referred to as low latency). (Paul Kedrosky points out that a ’70s station wagon full of DVDs has about the same bandwidth as the Internet, but an inconvenient coast-to-coast latency of about 4 days.) Both bandwidth and latency are measures of speed. Many people know how to worry about bandwidth, but latency issues can really mess up real-time communications even if you have plenty of bandwidth. When consuming the Twitter firehose, it is common to see latency (measured as the time from Tweet creation to parsing the Tweet as it comes off the firehose) of ~1.6 s, and as low as 300 milliseconds. WordPress posts and comments arrive, on average, 2.5 seconds after they are created.
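Measuring that latency on the consuming side is straightforward. A small sketch, assuming the ISO 8601 postedTime field used in the Part II examples:

import json
from datetime import datetime, timezone

def latency_seconds(raw_activity):
    # Latency = time the activity reached us minus the time it was created.
    posted = json.loads(raw_activity)["postedTime"]  # e.g. "2012-03-20T18:02:47.000Z"
    created = datetime.strptime(posted, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - created).total_seconds()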

So there are a lot of activities and they are coming fast. And they never stop, so you never want to close your connection or stop processing activities.

However, in real life “indefinitely” is more of an ideal than a regular achievement. The stream of data may be interrupted by any number of variations in the network and server capabilities along the line between Justin Bieber tweeting and my analyzing what brand of hair gel teenaged girls are going to be talking their boyfriends into using next week.
We need to work around practicalities such as high network latency, limited bandwidth, running out of disk space, service provider outages, etc. In the real world, we need connection monitoring, dynamic shaping of the firehose, redundant connections and historical replay to get at missed data.

In Part II we make this all more concrete. We will collect data from the firehose and analyze it. Along the way, we will address particular challenges of consuming the firehose and discuss some strategies for dealing with them.

 

Social Data: A Beacon in Natural Disasters

Social media provides a valuable window into what is happening on the ground during natural disasters and health epidemics. Social media data can help track diseases and serve as a powerful method of communication, letting people know what is happening. Healthcare is just beginning to understand the value of big data.

Google first gained attention for helping to predict influenza outbreaks based on search volume, but scientists from Bristol University took it a step further with Twitter: they analyzed 50 million geo-tagged Tweets related to flu, and their results correlated almost perfectly with national health statistics from the CDC.

Social data went a step further when scientists used Twitter to predict cholera outbreaks in Haiti. In 2010, a magnitude-7.0 earthquake ravaged Haiti, a third-world country with meager infrastructure. The country began preparing for disease outbreaks, specifically cholera. During the height of the cholera outbreak, Haiti was losing 50 people a day, and more than 3,000 people were killed that year. Scientists began tracking cholera outbreaks via Twitter and found that social data could beat the official reports by up to two weeks. As one of the scientists noted, “Not all cholera patients go to hospitals to be counted officially.”

This isn’t the first time that social data has played an important role after an earthquake. During the aftermath of Japan’s earthquake in 2011, social media helped people identify how to obtain medicine for chronic illnesses that still needed treatment when so many of the traditional pharmaceutical supply chains were cut off.

As in the Japanese earthquake, there is immense potential for social data to help emergency responders during a disaster. During the Fourmile Canyon Fire in Boulder in 2010, VisionLink gave fire crews and managers a real-time view into what was happening on the ground by layering geo-tagged Tweets and Flickr images from Gnip onto a Google map of the area. With this information, emergency workers had a much clearer picture of the situation as it unfolded.

Studies have shown that people rely on information communicated via social media in emergencies. A study from the University of Western Sydney found that people relied on both official news channels and social media, and shared and re-tweeted the most useful information. The authors of the study believe that social media helps people feel less alone as well as helping to get the message out. This echoes a Red Cross study showing that one-third of the U.S. population said they would use social media to let their loved ones know they were safe, which is exactly why the Red Cross has created its own social media control room, powered by Radian6.

If you’d like to learn more about social data in natural disasters, contact info@gnip.com.

VisionLink Map of Boulder Fires

Enhanced Filtering for Power Track

Gnip is always looking for ways to improve its filtering capabilities, and customer feedback plays a huge role in these efforts. We are excited to announce enhancements to our PowerTrack product that allow for more precise filtering of the Twitter Firehose, a feature enhancement requested directly by you, our customers.

Gnip PowerTrack rules now support OR and grouping using parentheses. We have also loosened limitations on the number of characters and the number of clauses per rule. Specifically, a single rule can now include up to 10 positive clauses and up to 50 negative clauses (previously 10 total clauses). Additionally, the character limit per rule has grown from 255 characters to 1024.
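For those constructing rules programmatically, a naive client-side sanity check against these limits might look like the sketch below (an approximation for illustration only, not Gnip's actual rule parser):

def check_rule(rule):
    # Very rough clause counting: strip parentheses, drop ORs, split on whitespace.
    problems = []
    if len(rule) > 1024:
        problems.append("rule exceeds 1024 characters")
    clauses = [c for c in rule.replace("(", " ").replace(")", " ").split()
               if c.upper() != "OR"]
    positive = [c for c in clauses if not c.startswith("-")]
    negative = [c for c in clauses if c.startswith("-")]
    if len(positive) > 10:
        problems.append("more than 10 positive clauses")
    if len(negative) > 50:
        problems.append("more than 50 negative clauses")
    return problems

print(check_rule("(apple OR mac) (computer OR monitor) new -fruit"))  # prints []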

With these changes, we are now able to offer our customers a much more robust and precise filtering language to ensure you receive the Tweets that matter most to you and your business.  However, these improvements bring their own set of specific constraints that are important to be aware of.  Examples and details on these limitations are as follows:

OR and Grouping Examples

  • apple OR microsoft
  • apple (iphone OR ipad)
  • apple computer -(fruit OR green)
  • (apple OR mac) (computer OR monitor) new –fruit
  • (apple OR android) (ipad OR tablet) -(fruit green microsoft)

Character Limitations

  • A single rule may contain up to 1024 characters including operators and spaces.

Limitations

  • A single rule must contain at least 1 positive clause
  • A single rule supports a max of 10 positive clauses throughout the rule
  • A single rule supports a max of 50 negative clauses throughout the rule
  • Negated ORs are not allowed. The following are examples of invalid rules:
  • -iphone OR ipad
  • ipad OR -(iphone OR ipod)

Precedence

  • An implied “AND” takes precedence in rule evaluation over an OR

For example a rule of:

  • android OR iphone ipad would be evaluated as android OR (iphone ipad)
  • ipad iphone OR android would be evaluated as (ipad iphone) OR android

You can find full details of the Gnip PowerTrack filtering changes in our online documentation.

Know of another way we can improve our filtering to meet your needs?  Let us know in the comments below.

Boulder Chamber of Commerce: Why Gnip Joined

It took a while, but Gnip is now a Boulder Chamber of Commerce (@boulderchamber) member. We joined after the value to our particular industry became clear. In August of this year they hosted an event that put us face-to-face with the U.S. Department of Commerce Under Secretary for International Trade (Francisco Sánchez) and a Colorado Congressman (Jared Polis), where we discussed software patent issues as well as the immigration visa challenges the U.S. tech industry faces. Tonight I’m attending an event with Congressman Polis and a local software venture capitalist (Jason Mendelson) to talk about challenges surrounding the hiring of technical talent, locally and globally.

These are topics with significant political/legislative dynamics, and the Chamber has given us, a local software firm, access to relevant forums in which we can get our point of view on the table; thank you.

Whether or not the Chamber has been providing this kind of relevant access all along, I don’t know (my perception is otherwise). I do know that the impact they’re having on us as a local software business, as well as the channel they’re giving Gnip to get its perspective heard in the broader (national) forum, is significant. I’d encourage other Boulder software/technology firms to support their efforts, participate in their events, and help them build an agenda that, in the end, helps us be more effective software/technical businesses.

Join us, in joining the Chamber.

Simplicity Wins

It seems like every once in a while we all have to re-learn certain lessons.

As part of our daily processing, Gnip stores many terabytes of data in millions of keys on Amazon’s S3. Various aspects of serving our customers require that we pore over those keys and the data behind them regularly.

As an example, every 24 hours we construct usage reports that provide visibility into how our customers are using our service. Are they consuming a lot or a little volume? Did their usage profile change? Are they not using us at all? So on and so on. We also have what we affectionately refer to as the “dude, where’s my Tweet” challenge: of the billion activities we deliver each day to our customers, inevitably someone says “hey, I didn’t receive Tweet ‘X’, what gives?” Answering that question requires that we store the ID of every Tweet a customer ever receives. Poring over all this data every 24 hours is a challenge.

As we started on the project, it seemed like a good fit for Hadoop. It involves pulling in lots of small-ish files, doing some slicing, aggregating the results, and spitting them out the other end. Because we’re hosted in Amazon, it was natural to use their Elastic MapReduce (EMR) service.

Conceptually the code was straightforward and easy to understand. The logic fit the MapReduce programming model well. It requires a lot of text processing and sorts well into various stages and buckets. It was up and running quickly.

As the size of the input grew, it started to have various problems, many of which came down to configuration: Hadoop options, JVM options, open file limits, number and size of instances, number of reducers, etc. We went through various rounds of tweaking settings and throwing more machines into the cluster, and it would run well for a while longer.

But it still occasionally had problems. Plus there was that nagging feeling that it just shouldn’t take this much processing power to do the work. Operational costs started to pop up on the radar.

So we ran a small test to check the feasibility of getting all the necessary files from S3 onto a single EC2 instance and processing them with standard old *nix tools. After promising results, we decided to pull the job out of EMR. It took several days to re-write, but we’ve now got a simple Ruby script using various *nix goodies like cut, sort, grep and their friends. The script is parallelized via JRuby threads at the points where it makes sense (downloading multiple files at once and processing the files independently once they’ve been bucketed).
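To give a flavor of the pattern (the real script is JRuby; the bucket layout, key names and AWS CLI dependency below are invented for illustration), the approach looks roughly like this:

import subprocess
from concurrent.futures import ThreadPoolExecutor

BUCKET = "s3://example-usage-bucket/2011-09-01/"                # hypothetical layout
KEYS = ["deliveries-%02d.log.gz" % hour for hour in range(24)]  # hypothetical key names

def fetch(key):
    # Pull one key down with the AWS CLI (assumed installed and configured).
    subprocess.run(["aws", "s3", "cp", BUCKET + key, key], check=True)
    return key

# Download multiple files at once, as the JRuby script does with threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    local_files = list(pool.map(fetch, KEYS))

# Then let the classic tools do the aggregation: cut out the first column,
# sort it, and collapse duplicates into per-customer counts.
pipeline = "zcat " + " ".join(local_files) + " | cut -f1 | sort | uniq -c | sort -rn"
subprocess.run(["bash", "-c", pipeline], check=True)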

In the end it runs in less time than it did on EMR, on a single modest instance, is much simpler to debug and maintain, and costs far less money to run.

We landed in a somewhat counter-intuitive place. There’s great technology available these days for processing large amounts of data; we continue to use Hadoop for other projects. But as we bring these tools into our toolset, we have to be careful not to forget the power of straightforward, traditional tools.

Simplicity wins.

Dreamforce Hackathon Winner: Enterprise Mood Monitor

As we wrote in our last post, Gnip co-sponsored the 2011 Dreamforce Hackathon, where teams of developers from all over the world competed for the top three overall cash prizes as well as prizes in multiple categories.  Our very own Rob Johnson (@robjohnson), VP of Product and Strategy, helped judge the entries, selecting the Enterprise Mood Monitor as winner of the Gnip category.

The Enterprise Mood Monitor pulls in data from a variety of social media sources, including the Gnip API, to provide realtime and historical information about the emotional health of the employees. It shows both individual and overall company emotional climate over time and can send SMS messages to a manager in cases when the mood level goes below a threshold. In addition, HR departments can use this data to get insights into employee morale and satisfaction over time, eliminating the need to conduct the standard employee satisfaction surveys. This mood analysis data can also be correlated with business metrics such as Sales and Support KPIs to identify drivers of business performance.

Pretty cool stuff.

The three developers behind this idea (Shamil Arsunukayev, Ivan Melnikov and Gaziz Tazhenov) from Comity Designs set out to create a cloud app for the social enterprise built on one of Salesforce’s platforms. They spent two days brainstorming the possibilities before diving into two days of rigorous coding. The result was the Enterprise Mood Monitor, built on the Force.com platform using Apex, Visualforce, and the following technologies: Facebook API (Graph API), Twitter API, Twitter Sentiment API, LinkedIn API, Gnip API, Twilio, Chatter, and the Google Visualization API. The team entered the Enterprise Mood Monitor into the Twilio and Gnip categories. We would like to congratulate the guys on their “double-dip” win as they took third place overall and won the Gnip category prize!

Have a fun and creative way you’ve used data from Gnip? Drop us an email or give us a call at 888.777.7405 and you could be featured in our next blog post.

Who Knew First: Steve Jobs or Aron Pinson?

While Steve Jobs’ resignation yesterday had investors anxiously watching how $AAPL fared in trading, we at Gnip were having fun watching a different ticker: the realtime Twitter feed.

As you can see from the graph below (these represent the number of “Steve Jobs” mentions per minute*), Twitter showed an incredible spike almost immediately. Apple-specific activity peaked 11 minutes after the news broke, showing how quickly the word spread. Honors for first tweet go to @AronPinson, who must have some blazing fast typing skills.

Once again, it’s incredible to see how social media is quickly becoming a trusted means of accessing and delivering realtime information.

*For more details on how we conducted this search across the millions of real-time tweets we have access to, contact us!

Incredible Innovation in Boulder Valley – Results from the IQ Awards

One of the reasons we at Gnip love working in the Boulder Valley is the incredible and talented companies that make up the local business ecosystem. Given the depth and quality of innovative organizations that make Boulder their home, we’re extremely excited and very honored to announce today that we’ve won the Boulder County Business Report (BCBR) Innovative Quotient (IQ) Award for Social Media/Mobile Applications.

Presented by the BCBR, the IQ Awards is an annual event that honors the most innovative new products and services developed by companies and organizations, with a special emphasis on advanced technologies, innovations within a particular business sector and sustainable business practices.

Congratulations to all of last night’s winners, with a big shout-out to our fellow Foundry family member Standing Cloud, which won the award in the Internet/Web category. Below is a list of the companies that were recognized and their respective categories:


  • Green/Sustainability: OPX Biotechnologies Inc.
  • Social Media/Mobile Applications: Gnip Inc.
  • Nonprofits: Safehouse Progressive Alliance for Nonviolence
  • Software: Accurence Inc.
  • Natural Products: Cooper Tea Co. and Third Street Chai
  • Sports & Outdoors: Crescent Moon Snowshoes
  • Consumer Products & Services: Agloves
  • Internet/Web: Standing Cloud Inc.
  • Innovation Accelerator: Boulder Innovation Center and Longmont Entrepreneurial Network
  • Business Products & Services: Radish Systems LLC



Thank you to the Boulder County Business Report for recognizing the amazing innovation that exists in our community and congrats again to all of our fellow winners! Keep the innovation flowing, Boulder.

For more info, check out our press release.