Twitter XML, JSON & Activity Streams at Gnip

About a month ago Twitter announced that they will be shutting off XML for stream-based endpoints on December 6th, 2010, in order to exclusively support JSON. While JSON users/supporters are cheering, for some developers this is a non-trivial change. Tweet parsers around the world have to change from XML to JSON. If your brain, and code, only work in XML, you'll be forced to get your head around something new. You'll have to get smart, find the right JSON lib, change your code to use it (and any associated dependencies you weren't already relying on), remove obsolete dependencies, test everything again, and ultimately get comfortable with a new format.
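
To make the switch concrete, here's a minimal Python sketch of the kind of change a tweet parser faces: the same field pulled from a simplified XML status versus its JSON counterpart. The payloads are stripped-down illustrations, not Twitter's full schema.

import json
import xml.etree.ElementTree as ET

xml_payload = "<status><id>123</id><text>hello world</text></status>"
json_payload = '{"id": 123, "text": "hello world"}'

# The XML-based extraction being retired...
status = ET.fromstring(xml_payload)
print(status.findtext("text"))            # -> hello world

# ...and the JSON-based extraction that replaces it.
print(json.loads(json_payload)["text"])   # -> hello world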

As it turns out, Gnip's format normalization shields you from all of this. Gnip customers get to stay focused on delivering value to their customers. Others integrating directly, and consuming stream data from Twitter in XML, have to make a change (arguably a good one from a pure format standpoint, but change takes time regardless).

From day one, Gnip has been working to shield data consumers from the inevitable API shifts (protocols, formats) that occur in the market at large. Today we ran a query to see what percentage of our customers would benefit from this shield; today we smiled. We’re going to sleep well tonight knowing all of our customers digesting our Activity Streams normalization get to stay focused on what matters to them most (namely NOT data collection intricacies).

Fun.

What Does Compound Interest Have to do with Evolving APIs?

Albert Einstein was once asked which discoveries he considered most important. His answer did not mention physics, relativity theory, or fun stuff like Higgs bosons. Instead he said: "Compound interest is the greatest mathematical discovery of all time."

I trust that most of you understand compound interest when it comes to investing or debt, but humor me and let's walk through an example: say you owe your credit card company $1,000, and your interest rate is 16%. To keep it simple, we assume the credit card company only requires you to pay 1% as your minimum payment every year, so the effective interest rate is 15%. After 30 years of compound interest you owe almost $60,000!

Compound Interest Graph

If there were no compounding, you'd owe just a little over 5 grand!
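
For the curious, the arithmetic behind the plot fits in a few lines of Python. The exact compounded figure depends on how many compounding periods you count, so treat the numbers as ballpark.

principal = 1000.0
rate = 0.15        # effective annual rate after the 1% minimum payment
years = 30

compounded = principal * (1 + rate) ** years       # roughly $58k-$66k for 29-30 periods
simple = principal + principal * rate * years      # about $5.5k with no compounding

print(round(compounded), round(simple))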

What I find truly bizarre, though, is that when we software engineers throw around words like "technological debt," the eyes of our project managers or CEOs frequently just glaze over. Instead of doing the right thing (I'll get back to that later), we are asked to come up with the quick hack that will make it work tomorrow and deal with the fallout later. Really? Sounds like we are using one credit card to pay off the other.

And we even stay within their terminology by using "debt"! We could have said something like "Well, it would take us roughly 1 week longer to integrate our current J2EE backend with this 3rd party SOAP API instead of expanding our current custom XML parser, but then we would be done for good with maintaining that (POS) part of the app and could focus on our core IP." But no, we keep it simple and refer to the custom XML parser as "technological debt," to no avail.

Now, the next time you have this conversation with your boss, show him the plot above and label the y-axis with “lines of code we have to maintain”, and the x-axis with “development iterations”, and perhaps a bell will go off.

Coming back to doing the right thing: unfortunately, determining what the right thing is can be hard, but here are two strategies that in my experience decrease technological debt almost immediately:

  1. Refactor early and often
  2. Outsource as much as possible of what you don’t consider your core competency.

For instance, if you have to consume millions of tweets every day, but your core competency does not include:

  • developing high performance code that is distributed in the cloud
  • writing parsers processing real time social activity data
  • maintaining OAuth client code and access tokens
  • keeping up with squishy rate limits and evolving social activity APIs

then it might be time for you to talk to us at Gnip!

Official Google Buzz Firehose Added to Gnip’s Social Media API

Today we’re excited to announce the integration of the Google Buzz firehose into Gnip’s social media data offering. Google Buzz data has been available via Gnip for some time, but today Gnip became one of the first official providers of the Google Buzz firehose.

The Google Buzz firehose is a stream of all public Buzz posts (excluding Twitter tweets) from all Google Buzz users. If you’re interested in the Google Buzz firehose, here are some things to know:

  • Google delivers it via PubSubHubbub. If you don't want to consume it via PubSubHubbub, Gnip makes it available in any of our supported delivery methods: Polling (HTTP GET), Streaming HTTP (Comet), or Outbound HTTP Post (Webhooks); a minimal receiver sketch follows this list.
  • The format of the Firehose is XML Activity Streams. Gnip loves Activity Streams and we’re excited to see Google continue to push this standard forward.
  • Google Buzz activities are geo-enabled. If the end user attaches a geolocation to a Buzz post (either from a mobile Google Buzz client or through an import from another geo-enabled service), that location will be included in the Buzz activity.
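
If you go the Webhooks route, the receiving side can be very small: an HTTP endpoint that accepts the POSTed Activity Streams XML and pulls out what it needs. Here's an illustrative Python sketch; the port and the assumption that the raw XML arrives as the POST body are ours, not an official contract.

from http.server import BaseHTTPRequestHandler, HTTPServer
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

class BuzzHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        feed = ET.fromstring(body)
        for entry in feed.iter(ATOM + "entry"):
            print(entry.findtext(ATOM + "title"))   # hand off to your pipeline here
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), BuzzHandler).serve_forever()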

We’re excited to bring the Google Buzz firehose to the Social Media Monitoring and Business Intelligence community through the power of the Gnip platform.

Here's how to access the Google Buzz firehose. If you're already a Gnip customer, just log in to your Gnip account and with 3 clicks you can have the Buzz firehose flowing into your system. If you're not yet using Gnip and you'd like to try out the Buzz firehose to get a sense of volume, latency, and other key metrics, grab a free 3-day trial at http://try.gnip.com and check it out along with the 100 or so other feeds available through Gnip's social media API.

Activity Streams

Gnip pledges allegiance to Activity Streams.

Consuming data from APIs with heterogeneous response formats is a pain. From basic format differences (XML vs JSON) to the semantics around structure and element meaning (custom XML structure, Atom, RSS), if you’re consuming data from multiple APIs, you have to handle each API’s responses differently. Gnip minimizes this pain by normalizing data from across services into Activity Streams. Activity Streams allows you to consistently digest responses from many services, using a single parsing routine in your code; no more special casing.
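
As a deliberately simplified illustration of "one parsing routine," the Python below reads actor, verb, and title out of an Activity Streams Atom feed regardless of which service the activities came from. The namespaces are the Atom and Activity Streams 1.0 namespaces; the function name is ours, and the sketch is a starting point rather than a reference implementation.

import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "activity": "http://activitystrea.ms/spec/1.0/",
}

def activities(feed_xml):
    # Yield (actor, verb, title) for every entry, whatever service produced it.
    feed = ET.fromstring(feed_xml)
    for entry in feed.findall("atom:entry", NS):
        actor = entry.findtext("atom:author/atom:name", default="", namespaces=NS)
        verb = entry.findtext("activity:verb", default="", namespaces=NS)
        title = entry.findtext("atom:title", default="", namespaces=NS)
        yield actor, verb, title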

Gnip’s history with Activity Streams runs long and deep. We contributed to one of the first service/activity/verb mapping proposals, and have been implementing aspects of Activity Streams over the past couple of years. Over the past several months Activity Streams has gained enough traction that the decision for it to be Gnip’s canonical normalization format was only natural. We’ve flipped the switch and are proud to be part of such a useful standard.

The Activity Streams initiative is in the process of getting its JSON version together, so for now we offer the XML version. As JSON crystallizes, we'll offer that as well.

Swiss Army Knives: cURL & tidy

Iterating quickly is what makes modern software initiatives work, and the mantra applies to everything in the stack. From planning your work to builds, things have to move fast, and feedback loops need to be short and sweet. In the realm of REST[-like] API integration, writing an application to visually validate the API you're interacting with is overkill. At the end of the day, web services boil down to HTTP requests, which can be rapidly tested with a tight little application called cURL. You can test just about anything with cURL (yes, including HTTP streaming/Comet/long-poll interactions), and its configurability is endless. You'll have to read the man page to get all the bells and whistles, but I'll provide a few samples of common Gnip use cases here. At the end of this post I'll clue you in to cURL's indispensable cohort in web service slaying, 'tidy.'

cURL power

cURL can generate custom HTTP client requests with any HTTP method you'd like. ProTip: the biggest gotcha I've seen trip people up is leaving the URL unquoted. Many URLs don't need quotes when being fed to cURL, but many do, and you should just get in the habit of quoting every one; otherwise you'll spend far too long debugging a driver error. There are tons of great cURL tutorials out on the network; I won't try to recreate those here.

POSTing

Some APIs want data POSTed to them. There are two forms of this.

Inline

curl -v -d "some=data" "http://blah.com/cool/api"

From File

curl -v -d @filename "http://blah.com/cool/api"

In either case, cURL defaults the Content-Type to the ubiquitous "application/x-www-form-urlencoded". While this is often the correct default, there are a couple of things to keep in mind. One: it assumes that the data you're inlining, or that is in your file, is indeed formatted as such (e.g. key=value pairs). Two: when the API you're working with does NOT want data in this format, you need to explicitly override the Content-Type header like so.

curl -v -d "someotherkindofdata" "http://blah.com/cool/api" --header "Content-Type: foo"

Authentication

Passing HTTP-basic authentication credentials along is easy.

curl -v -uUSERNAME[:PASSWORD] "http://blah.com/cool/api"

You can inline the password, but keep in mind your password will end up in your shell history.

Show Me Everything

You'll notice I'm using the "-v" option on all of my requests. "-v" shows all of the HTTP-level interaction (method, headers, etc.), with the exception of the request POST body, and that visibility is crucial for debugging interaction issues. You'll also need to use "-v" to watch streaming data fly by.

Crossing the Streams (cURL + tidy)

Most web services these days spew XML-formatted data, and it is often not whitespace-formatted in a way a human can read easily. Enter tidy. If you pipe your cURL output to tidy, all of life's problems will melt away like a fallen ice-cream scoop on a hot summer sidewalk.

cURL’d web service API without tidy

curl -v "http://rss.clipmarks.com/tags/flower/"
...
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?><rss versi
on="2.0"><channel><title>Clipmarks | Flower Clips</title><link>http://clipmarks.com/tags/flower/</link><feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl><ttl>15</ttl
><description>Clip, tag and save information that's important to you. Bookmarks save entire pages...Clipmarks save the specific content that matters to you!</description><
language>en-us</language><item><title>Flower Shop in Parsippany NJ</title><link>http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link><description>&lt;
b&gt;clipped by:&lt;/b&gt; &lt;a href="http://clipmarks.com/clipper/dunguschariang/"&gt;dunguschariang&lt;/a&gt;&lt;br&gt;&lt;b&gt;clipper's remarks:&lt;/b&gt;  Send Dishg
ardens in New Jersey, NJ with the top rated FTD florist in Parsippany Avas specializes in Fruit Baskets, Gourmet Baskets, Dishgardens and Floral Arrangments for every Holi
day. Family Owned and Opperated for over 30 years. &lt;br&gt;&lt;div border="2" style="margin-top: 10px; border:#000000 1px solid;" width="90%"&gt;&lt;div style="backgroun
d-color:"&gt;&lt;div align="center" width="100%" style="padding:4px;margin-bottom:4px;background-color:#666666;overflow:hidden;"&gt;&lt;span style="color:#FFFFFF;f
...

cURL’d web service API with tidy

curl -v "http://rss.clipmarks.com/tags/flower/" | tidy -xml -utf8 -i
...
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/style/rss/rss_feed.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="/style/rss/rss_feed.css" type="text/css" media="screen" ?>
<rss version="2.0">
   <channel>
     <title>Clipmarks | Flower Clips</title>
     <link>http://clipmarks.com/tags/flower/</link>
     <feedUrl>http://rss.clipmarks.com/tags/flower/</feedUrl>
     <ttl>15</ttl>
     <description>Clip, tag and save information that's important to
       you. Bookmarks save entire pages...Clipmarks save the specific
       content that matters to you!</description>
     <language>en-us</language>
     <item>
       <title>Flower Shop in Parsippany NJ</title>
       <link>

http://clipmarks.com/clipmark/CAD213A7-0392-4F1D-A7BB-19195D3467FD/</link>

       <description>&lt;b&gt;clipped by:&lt;/b&gt; &lt;a
...

I know which one you'd prefer. So what's going on? We're piping the output to tidy and telling tidy to treat the document as XML (use XML structural parsing rules), treat encodings as UTF-8 (so it doesn't barf on non-Latin character sets), and finally "-i" indicates that you want it indented (pretty printed, essentially).

Right Tools for the Job

If you spend a lot of time whacking through the web service API forest, be sure you have a sharp machete. cURL and tidy make for a very sharp machete. Test driving a web service API before you start laying down code is essential. These tools allow you to create tight feedback loops at the integration level before you lay any code down; saving everyone time, energy and money.

xml.to_json

Gnip spends an inordinate amount of time slicing and dicing data for our customers. Normalizing the web's data is something we've been doing for a long time now, and we've gone through many iterations of it. While you can usually find a way from format A to format B (assuming both are inherently extensible, as XML and JSON are), you often bastardize one or the other in the process. DeWitt Clinton (Googler) recently posted a clear and concise outline of the challenges around moving between various formats. I've been wanting to write a post using the above title for a couple of weeks, so a thank you to DeWitt for providing the inadvertent nudge.

Parsing

Here at Gnip we've done the rounds with respect to how to parse a formatted document. From homegrown regex'ing to framework-specific parsing libraries, the decisions around how and when to parse a document aren't always obvious. Layer in the need to performantly parse large documents in real-time, and the challenge becomes palpable. Offline document parsing/processing (traditional Google crawler/index-style) allows you to push off many of the real-time processing challenges. I'm curious to see how Google's real-time index (their "demo" PubSubHubbub hub implementation) fares with potentially hundreds of billions of events moving through it per day, in "real-time," in the years to come.

When do you parse?

If you're parsing structured documents in "real-time" (e.g. XML or JSON), one of the first questions you need to answer is when you actually parse. Whether you parse when the data arrives at your system's front door or when it's on its way out can make or break your app. An assumption throughout this post is that you are dealing with "real-time" data, as opposed to data that can be processed "offline" for future on-demand use.

A good rule of thumb is to parse data on the way in when the ratio of outbound consumption to inbound data is greater than 1. If you have lots of consumers of your parsed/processed content, do the work once, up front, so it can be leveraged across all of the consumption (diagram below).

If the relationship between in and out is purely 1-to-1, then it doesn't really matter, and other factors around your architecture will likely guide you. If the consumption dynamic is such that not all the information will be consumed 100% of the time (e.g. 1-to-something-less-than-1), then parsing on the outbound side generally makes sense (diagram below).
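
A toy Python sketch of that rule of thumb: with high fan-out you parse once on ingest and hand every consumer the same parsed result; with partial consumption you keep raw documents and parse lazily. The class and method names are illustrative.

import xml.etree.ElementTree as ET

class ParseOnIngest:
    # High fan-out: pay the parsing cost once, on the way in.
    def __init__(self):
        self.parsed = []

    def ingest(self, raw_xml):
        self.parsed.append(ET.fromstring(raw_xml))

    def consume(self):
        return self.parsed                       # every consumer reuses the same parse

class ParseOnEgress:
    # Partial consumption: keep raw documents, parse only what is actually requested.
    def __init__(self):
        self.raw = []

    def ingest(self, raw_xml):
        self.raw.append(raw_xml)                 # cheap append, no parsing yet

    def consume(self, index):
        return ET.fromstring(self.raw[index])    # parse lazily, on demand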

Synchronous vs. Asynchronous Processing

When handling large volumes of constantly changing data, you may have to sacrifice the simplicity of serial/synchronous data processing in favor of parallel/asynchronous data processing. If your inbound processing flow becomes a bottleneck and things start queuing up to an unacceptable degree, you'll need to move processing out of band and apply multiple processors to the single stream of inbound data; that's asynchronous processing.
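
In practice that can be as simple as putting a queue between ingest and parsing and letting a pool of workers drain it. A hedged Python sketch; thread counts and queue sizes are illustrative.

import queue
import threading
import xml.etree.ElementTree as ET

inbound = queue.Queue(maxsize=10000)       # back-pressure: ingest blocks if workers fall behind

def worker():
    while True:
        raw = inbound.get()
        try:
            doc = ET.fromstring(raw)       # the expensive part, now off the ingest path
            # ... hand `doc` to downstream processing here ...
        finally:
            inbound.task_done()

for _ in range(4):                         # several processors working one inbound stream
    threading.Thread(target=worker, daemon=True).start()

def on_inbound(raw_xml):
    inbound.put(raw_xml)                   # the ingest path just enqueues and returns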

How do you parse?

Regex parsing: While old-school, regex can get you a long way, performantly. However, this assumes you're good at writing regular expressions. Simple missteps can make regex'ing perform incredibly slowly.

DOM-based parsing: While the APIs around DOM-based parsers are oh so tempting to use, that higher-level interface comes at a cost. DOM parsers often construct heavy object models around everything they find in a document and, most of the time, you won't use more than 10% of it. Most are configurable WRT how they parse, but often not to the degree of giving you just what you need. All have their own bugs you'll learn to work through/around. Gnip currently uses Nokogiri for much of its XML document parsing.

SAX-style parsing: It doesn't get much faster. The trade-off with this kind of parsing is complexity. One of the crucial benefits of DOM-style parsing is that the node graph is constructed and maintained for you; SAX-style parsing requires that you build and track that structure yourself, and it often isn't fun or pretty.
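
To make the trade-off concrete, here's a small contrast between the two styles. We use Nokogiri in Ruby for much of this; the Python below is only an illustration, and the sample document is made up.

import io
import xml.etree.ElementTree as ET

raw = "<feed>" + "<entry><title>hi</title></entry>" * 100000 + "</feed>"

# DOM style: the whole document becomes an object graph before you touch any of it.
tree = ET.fromstring(raw)
titles = [e.findtext("title") for e in tree.findall("entry")]

# Streaming (SAX-flavored) style: handle each element as it completes, then throw it away.
count = 0
for event, elem in ET.iterparse(io.StringIO(raw), events=("end",)):
    if elem.tag == "entry":
        count += 1           # any state you need, you track yourself
        elem.clear()         # keep memory flat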

Transformation

Whether you're moving between formats (e.g. XML and JSON) or making structural changes to the content, the promises around ease of transformation that were made by XSLT were never kept. For starters, no one moved beyond the 1.0 spec, which is grossly underpowered. Developers have come to rely on home-grown mapping languages (Gnip 1.0 employed a complete custom language for moving between arbitrary inbound XML documents and a known outbound structure), conveniences provided by the underlying parsing libraries, or the language frameworks they're building in. For example, Ruby has ".to_json" methods sprinkled throughout many classes. While the method works much of the time for serializing an object of known structure, its output on more complex objects, like arbitrarily structured XML, is highly variable and not necessarily what you want in the end.

An example of where simple .to_json falls short is the handling of XML namespaces. While structural integrity is indeed maintained, and namespaces are translated, they're meaningless in the world of JSON. So, if your requirement is one-way transformation, the resulting JSON is cluttered when you use out-of-the-box transformation methods. Of course, as DeWitt points out, if you need round-trip integrity, then the clutter is necessary.
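
Here's a small, hedged illustration of that clutter: a deliberately naive Python element-to-dict conversion (roughly the kind of generic transform a .to_json helper performs) applied to a namespaced Atom fragment. The structure survives, but the JSON keys drag the namespace URIs along as baggage.

import json
import xml.etree.ElementTree as ET

raw = """<entry xmlns="http://www.w3.org/2005/Atom"
                xmlns:activity="http://activitystrea.ms/spec/1.0/">
  <title>hello</title>
  <activity:verb>http://activitystrea.ms/schema/1.0/post</activity:verb>
</entry>"""

def to_dict(elem):
    # naive conversion: tag -> text, or tag -> merged children
    if len(elem) == 0:
        return {elem.tag: (elem.text or "").strip()}
    merged = {}
    for child in elem:
        merged.update(to_dict(child))
    return {elem.tag: merged}

print(json.dumps(to_dict(ET.fromstring(raw)), indent=2))
# keys come out as "{http://www.w3.org/2005/Atom}title" and
# "{http://activitystrea.ms/spec/1.0/}verb" -- meaningless clutter in JSON land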

While custom mapping languages give you flexibility, they also require upkeep (bugs and features). Convenience-lib transformation routines are often written to a baseline specification and a strict set of structural rules, which real-world documents routinely violate.

Integrity

Simple transformations are… simple; they generally "just work." The more complex the documents, however, the harder your transformation logic gets pushed and the more things start to break (if not on the implementation side then on the format side). Sure, you can beat a namespace-, attribute-, and element-laden XML document into JSON submission, but in doing so you'll likely defeat the purpose of JSON altogether (fast, small wire cost, easy JS objectification). Still, while you might lose some of the format-specific benefits, the end may justify the means in this case. Sure it's ugly, but in order to move the world closer to JSON, ugly XML-to-JSON transformers may need to exist for a while. Not everyone with an XML-spewing back-end can afford to build true JSON output into their systems (think Enterprise apps, for one).

In the End

Gnip's working to normalize many sources of data into succinct, predictable streams of data. While taking on this step is part of our value proposition to customers, the ecosystem at large can benefit significantly from native JSON sources of data (in addition to prolific XML). XML has been a great, necessary stepping stone for the industry, but 9 times out of 10 tighter JSON suffices. And finally, if anyone builds an XSLT 2.0 spec-compliant processor for Ruby, we'll use it!

HOW-TO: Twitter Search Publisher

There has been some confusion around how to leverage Gnip’s Twitter Search (“twitter-search”) Publisher. We have work to do in order to clarify this use case from a usability/documentation standpoint, but in the meantime hopefully the following clarifies things a bit.

First off, "twitter-search" is a Polled Publisher, which means it is subject to high latencies as well as gaps in coverage. Secondly, we overload the "keyword" rule type in Filters in order to provide a mechanism for you to enter queries compatible with http://search.twitter.com (see http://search.twitter.com/operators for more information). Any query you can run on http://search.twitter.com can be added to your Gnip filter as a "keyword" rule.

For example, if you search Twitter for “Boulder, CO” (including the quotes), Twitter considers that a literal, case-insensitive, phrase search; and so will Gnip. “Boulder, CO” (excluding the quotes), yields an OR search on Twitter; and hence the same in Gnip. If you search for “cars AND trucks” you get Boolean search operator behavior in Twitter, and subsequently in Gnip as well.

In short, we pass through the literal queries/strings that you hand Gnip, straight on through to Twitter. The “keywords” are opaque to Gnip. The only trick is in ensuring your “keywords” are entered into Gnip appropriately.

Through Gnip's web interface, you can add comma-separated keywords to a Filter. This is usually straightforward; in the twitter-search Publisher case, however, it takes extra care to get the results you want, especially when your queries include commas or quotes. As a result, the keywords entered in a twitter-search Publisher Filter must conform to CSV quoting rules to ensure your queries get executed properly.

Through Gnip's REST interface, you encapsulate the keywords within XML <rule> elements, so the CSV quoting rules can be ignored.
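
If you're unsure how a query with embedded commas or quotes should look in the web interface, you can let a CSV library do the quoting for you and paste the result in. A quick Python sketch; the queries are just examples.

import csv
import io

queries = ['"Boulder, CO"', 'cars AND trucks', 'gnip']

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(queries)
print(buf.getvalue().strip())
# -> """Boulder, CO""","cars AND trucks","gnip"
# standard CSV quoting: embedded quotes are doubled and each keyword is wrapped,
# which matches the quoting the twitter-search Filter expects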

For some further examples of how to add twitter-search keywords, see the Gnip API documentation.

As a final note, the overload of “keyword” rule types in Filters is something we’re experimenting with and is subject to change.

Guest Post, Rick Boykin: Gnip C# .NET Convenience Library

Now that the new Gnip convenience libraries have been published for a few weeks on GitHub, I'm going to tell you a bit about the libraries that I'm currently responsible for: the .NET libraries. So, let's dive in, shall we… The latest versions of the .NET libraries are heavily based on the previous version of the Java libraries, with a bit of .NET style thrown in. What that means is that I used Microsoft's Java Language Conversion Assistant as a starting point, mixed in some shell scripting (Bash, sed and Perl) to fix the comments and some of the messy parts that did not translate very well. I then made it more C#-like by removing Java annotations, adding .NET attributes, taking advantage of .NET's native XML serializer, utilizing System.Net.HttpWebRequest for communications, etc. It actually went fairly quickly. The next task was to start the unit testing deep dive.

I have to say, I really didn't know anything about the Gnip model, how it worked, or what it really was at first. It just looked like an interesting project and some good folks. Unit testing, however, is one place where you learn about the details of how each little piece of a system really works. And since hardly any of my tests passed out of the gate (and I was not really convinced that I even had enough tests in place), I decided it was best to go at it till I was convinced. The library components are easy enough. The code is really separated into two parts. The first component is the Data Model, or Resources, which directly map to the Gnip XML model and live in the Gnip.Client.Resource namespace. The second component is the Data Access Layer, or GnipConnection. The GnipConnection, when configured, is responsible for passing data to, and receiving data from, the Gnip servers. So there are really only two main pieces to this code. Pretty simple: Resources and GnipConnection. The other code is just convenience and utility code to help make things a little more orderly and to reduce the amount of code.

So yeah, the testing… I used NUnit so folks could run the tests with the free version of Visual Studio, or even from the command line if you want. I included a Gnip.build NAnt file so that you can compile, run the tests, and create a zipped distribution of the code. I've also included an NUnit project file in the Gnip.ClientTest root (gnip.nunit) that you can open with the NUnit UI to get things going. To help configure the tests, there is an App.config file in the root of the test project that is used to set all the configuration parameters.

The tests, like the code, are divided into the Resource object tests and the GnipConnection tests (plus a few utility tests). The premise of the Resource object tests is to first ensure that the Resource objects are cool. These are simple data objects with very little logic built in (which is not to say that testing them thoroughly isn't of the utmost importance). There is a unit test for each one of the data objects, and they proceed by ensuring that the properties work properly, the DeepEquals methods work properly, and that the marshalling to and from XML works properly. The DeepEquals methods are used extensively by the tests, so it is essential that we can trust them. As such, they are fairly comprehensive. The marshalling and un-marshalling tests are less so. They do a decent job; they just do not exercise every permutation of the XML elements and attributes. I do feel that they are sufficient to convince me that things are okay.

The GnipConnection is responsible for creating, retrieving, updating and deleting Publishers and Filters, and for retrieving and publishing Activities and Notifications. There is also a mechanism built into the GnipConnection to get the time from the Gnip server and to use that value to calculate the time offset between the calling client machine and the Gnip server. Since the Gnip server publishes activities and notifications in 1-minute-wide addressable 'buckets', it is nice to know what the time is on the Gnip server with some degree of accuracy. No attempt is made to adjust for network latency, but we get pretty close to predicting the real Gnip time. That's it. That little bit is realized in 25 or so methods on the GnipConnection class. Some of those methods are just different signatures of methods that do the same thing, only with a more convenient set of parameters. The GnipConnection tests try to exercise every API call with several permutations of data. They are not completely comprehensive; there are a lot of permutations. But I believe they hit every major corner case.

In testing all this, one thing I wanted to do was run my tests and have the de-serialization of the XML validate against the XML Schema file I got from the good folks at Gnip. If I could de-serialize and then serialize a sufficiently diverse set of XML streams, while validating that those streams adhere to the XML Schema, then that was another bit of ammo for trusting that this thing works in situations beyond the test harness. In the Gnip.Client.Uti namespace there is a helper class called XmlHelper that contains a singleton of itself. There is a property called ValidateXml that can be reached like this: XmlHelper.Instance.ValidateXml. Setting that to true will cause the XML to be validated any time it is de-serialized, either in the tests or from the server. It is set to true in the tests. But it doesn't work with the stock Xsd distributed by Gnip. That Xsd does not include an element definition for each element at the top level, which is required when validating against a schema. I had to create one that did. It is semantically identical to the Gnip version; it just pulls things out to the top level. You can find the custom version in the Gnip.Client/Xsd folder. By default it is compiled into the Gnip.Client.dll.

One of the last things I did, which had nothing really to do with testing, was to create the IGnipConnection interface. Use it if you want. If you use some kind of Inversion of Control container like Unity, or like to code to interfaces, it should come in handy.
That’s all for now. Enjoy!

Rick is a Software Engineer and Technical Director at Mondo Robot in Boulder, Colorado. He has been designing and writing software professionally since 1989, and working with .NET for the last 4 years. He is a regular fixture at the Boulder .NET users group meetings and is a member of Boulder Digital Arts.

Data Standards?

Today's general data standards are akin to yesterday's HTML/CSS browser support standards. The first rev of Gecko (not to be confused w/ the original Mosaic/Navigator rendering engine) at Netscape was truly standards-compliant in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until "quirks mode" became the status quo. The abyss of broken data that machines, and humans, generate eclipsed web pages back then, and it's an ever-present issue in the ATOM/RSS/XML available today.

Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. The data is often ugly when it reaches us, but we normalize it to make our Customers' lives better. Consumer-facing aggregators (Plaxo/FF) beautify the picture for their display layers. Gnip beautifies the picture for its data consumption API. Cleaning up the mess that exists on the network today has been an eye-opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We're able to work closely with said producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling), it's a rat's nest of unstructured, cobbled-together, ill-formatted, and poorly-encoded/escaped data.

XML has provided self-describing formats and structure, but it ends there. Thousands of pounds of wounded data show up on Gnip's doorstep each day, and that's where Gnip's normalization heavy lifting comes into play. I thought I'd share some of the more common bustage we see, along with a little commentary around each category of problem:

  • <![CDATA[ ]]> is akin to void* and is way overused. The result is magical custom parsing of something that someone couldn't fit into some higher-level structure.

    • If you’re back-dooring data/functions into an otherwise “content” payload, you should revisit your overall model. Just like void*, CDATA usually suggests an opaque box you’re trying to jam through the system.
  • Character-limited message bodies (e.g. microblogging services) wind up providing data to Gnip that has escaped HTML sequences chopped in half, leaving the data consumer (Gnip in this case) guessing at what to do with a broken encoding. If I give you "&a", you have to decide whether to treat it literally, expand it to "&amp;", or drop it. None of which was intended by the user who generated the original content; they just typed '&' into a text field somewhere.

    • Facebook has taken a swing at how to categorize “body”/”message” sizes which is nice, but clients need to do a better job truncating by taking downstream encoding/decoding/expansion realities into consideration.
  • Document bodies that have been escaped/encoded multiple times subsequently leave us deciphering how many times to run them through the un-escape/decode channel (see the sketch after this list).

    • _Lazy_. Pay attention to how you’re treating data, and be consistent.
  • Illegal characters in XML attribute/element values.

    • _LAZY_. Pay attention.
  • Custom extensions to “standard” formats (XMPP, RSS, ATOM). You think you’re doing the right thing by “extending” the format to do what you want, but you often wind up throwing a wrench in downstream processing. Widely used libs don’t understand your extensions, and much of the time, the extension wasn’t well constructed to begin with.

    • Sort of akin to CDATA, however, legitimate use cases exist for this. Keep in mind that by doing this, there are many libraries in the ecosystem that will not understand what you’ve done. You have to be confident that your data consumers are something you can control and ensure they’re using a lib/extension that can handle your stuff. Avoid extensions, or if you have to use them, get it right.
  • Namespace case-sensitivity/insensitivity assumptions differ from service to service.

    • Case-sensitivity rules were polluted with the advent of MS-DOS, and have been propagated over the years by end-user expectations. Inconsistency stinks, but this one's here to stay.
  • UTF-8, ASCII encoding bugs/misuse/misunderstanding. Often data claims to be encoded one way, when in fact it was encoded differently.

    • Understand your tool chain, and who’s modifying what, and when. Ensure consistency from top to bottom. Take the time to get it right.
  • UTF-16… don’t go there.

    • uh huh.
  • Libraries in the field to handle all of the above each make their own inconsistent assumptions.

    • It's conceivable to me that Gnip winds up advancing the state of the art in XML processing libs, whether by doing it ourselves or by contributing to existing code trees. Lots of good work out there, none of it great.
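
For the multiply-escaped bodies called out above, the defensive move ends up looking roughly like the Python below: keep un-escaping until the text stops changing. A sketch, not our production code, and it knowingly flattens content that legitimately contains escaped entities.

import html

def fully_unescape(text, max_rounds=5):
    # Un-escape repeatedly until the text stops changing (or we give up).
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            return text              # stable: no more layers of escaping
        text = unescaped
    return text                      # still changing after max_rounds; give up and flag it

print(fully_unescape("&amp;amp;lt;b&amp;amp;gt;hi&amp;amp;lt;/b&amp;amp;gt;"))   # -> <b>hi</b>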

You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.

If everyone would clean up their data by the end of the day, that’d be great. Thanks.