Gnip spends an inordinate amount of time slicing and dicing data for our customers. Normalizing the web’s data is something we’ve been doing for a long time now, and we’ve gone through many incarnations of it. While you can usually find a way from format A to format B (assuming both are inherently extensible, as XML and JSON are), you often bastardize one or the other in the process. DeWitt Clinton (Googler) recently posted a clear and concise outline of the challenges around moving between various formats. I’ve been wanting to write a post using the above title for a couple of weeks, so thank you to DeWitt for providing the inadvertent nudge.
Here at Gnip we’ve done the rounds with respect to how to parse a formatted document. From homegrown regex’ing to framework-specific parsing libraries, the decisions around how and when to parse a document aren’t always obvious. Layer in the need to performantly parse large documents in real-time, and the challenge becomes palpable. Offline document parsing/processing (traditional Google crawler/index-style) allows you to push off many of the real-time processing challenges. I’m curious to see how Google’s real-time index (their “demo” PubSubHubbub hub implementation) fares with potentially hundreds of billions of events moving through it per day, in “real-time,” in the years to come.
When do you parse?
If you’re parsing structured documents in “real-time” (e.g. XML or JSON), one of the first questions you need to answer is when you actually parse. Whether you parse when the data arrives at your system’s front door or when it’s on its way out can make or break your app. An assumption throughout this post is that you are dealing with “real-time” data, as opposed to data that can be processed “offline” for future on-demand use.
A good rule of thumb is to parse data on the way in when the ratio of outbound consumption to inbound documents is greater than 1, that is, when each inbound document is consumed more than once. If you have lots of consumers of your parsed/processed content, do the work once, up-front, so it can be leveraged across all of the consumption (diagram below).
If the relationship between in/out is purely 1-to-1, then it doesn’t really matter, and other factors around your architecture will likely guide you. If the consumption dynamic is such that not all the information will be consumed 100% of the time (e.g. 1-to-something-less-than-1), then parsing on the outbound side generally makes sense (diagram below).
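As a rough illustration of the rule of thumb above, here is a minimal Python sketch that counts how many parses each strategy performs. The class and function names (ParseOnIngest, ParseOnDelivery, simulate) are made up for this example and are not Gnip code; JSON stands in for whatever format you actually receive.

```python
import json


class ParseOnIngest:
    """Parse each document once on arrival; all consumers share the result."""

    def __init__(self):
        self.parse_count = 0
        self._parsed = []

    def receive(self, raw):
        self.parse_count += 1
        self._parsed.append(json.loads(raw))

    def consume(self, index):
        return self._parsed[index]


class ParseOnDelivery:
    """Store the raw payload; parse only when a consumer asks for it."""

    def __init__(self):
        self.parse_count = 0
        self._raw = []

    def receive(self, raw):
        self._raw.append(raw)

    def consume(self, index):
        self.parse_count += 1
        return json.loads(self._raw[index])


def simulate(store, raw_docs, reads):
    """Feed raw_docs in, serve the requested reads, return the parse count."""
    for raw in raw_docs:
        store.receive(raw)
    for index in reads:
        store.consume(index)
    return store.parse_count


docs = ['{"id": 1}', '{"id": 2}', '{"id": 3}']
fan_out = [0, 1, 2] * 3  # every document is consumed three times
sparse = [0]             # only one of the three documents is ever consumed

# Fan-out: parsing on the way in does 3 parses, parsing on the way out does 9.
print(simulate(ParseOnIngest(), docs, fan_out), simulate(ParseOnDelivery(), docs, fan_out))
# Sparse consumption: parsing on the way in still does 3, on the way out only 1.
print(simulate(ParseOnIngest(), docs, sparse), simulate(ParseOnDelivery(), docs, sparse))
```

The sketch ignores caching on the outbound side, which would narrow the gap in the fan-out case at the cost of extra memory and bookkeeping.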
Synchronous vs. Asynchronous Processing
When handling large volumes of constantly changing data you may have to sacrifice the simplicity of serial/synchronous data processing in favor of parallel/asynchronous data processing. If your inbound processing flow becomes a bottleneck, and things start queuing up to an unacceptable degree, you’ll need to move processing out of band and apply multiple processors to the single stream of inbound data. That is asynchronous processing.
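One common shape for this is a queue feeding a pool of workers. The sketch below is only an assumption about how you might wire it up, using Python’s standard queue and threading modules; process_stream and worker_count are names invented for the example. Note that in CPython, threads will not speed up CPU-bound parsing, so a real system would more likely use separate processes or machines. The structure is the point here.

```python
import json
import queue
import threading


def process_stream(raw_docs, worker_count=4):
    """Fan a single inbound stream out to several parser workers."""
    inbound = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            raw = inbound.get()
            if raw is None:  # sentinel: no more work for this worker
                break
            parsed = json.loads(raw)
            with results_lock:
                results.append(parsed)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()

    # The receiving side only enqueues; it never blocks on parsing.
    for raw in raw_docs:
        inbound.put(raw)
    for _ in threads:
        inbound.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


parsed = process_stream(['{"id": %d}' % i for i in range(10)])
print(sorted(doc["id"] for doc in parsed))
```

The price of going asynchronous shows up immediately: results come back in whatever order the workers finish, so anything that depends on arrival order has to carry its own sequence information.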
How do you parse?
Regex parsing: While old-school, regex can get you a long way, performantly. However, this assumes you’re good at writing regular expressions. Simple missteps (nested quantifiers that trigger heavy backtracking, for example) can make regex’ing perform incredibly slowly.
DOM-based parsing: While the APIs around DOM-based parsers are oh so tempting to use, that higher-level interface comes at a cost. DOM parsers often construct heavy object models around everything they find in a document and, most of the time, you won’t use but 10% of it. Most are configurable with respect to how they parse, but often not to the degree of giving you just what you need. All have their own bugs you’ll learn to work through/around. Gnip currently uses Nokogiri for much of its XML document parsing.
SAX-style parsing: It doesn’t get much faster. The trade-off to this kind of parsing is complexity. One of the crucial benefits of DOM-style parsing is that the node graph is constructed and maintained for you. SAX-style parsing requires that you keep track of your position in the tree yourself, and it often isn’t fun or pretty.
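To make the three approaches concrete, here is a small sketch that pulls entry titles out of the same document each way. It uses Python’s standard library (re, xml.dom.minidom, xml.sax) as a stand-in for whatever your platform offers; it is not Gnip’s code and does not use Nokogiri. SAMPLE and the function names are invented for the example, and the regex assumes titles contain no nested markup, attributes, or CDATA.

```python
import re
import xml.sax
from xml.dom import minidom

SAMPLE = (
    "<feed><title>feed title</title>"
    "<entry><title>one</title></entry>"
    "<entry><title>two</title></entry></feed>"
)

# Regex: fast and short, but only as correct as the assumptions baked in.
# The negated class [^<]* cannot backtrack across tags.
ENTRY_TITLE_RE = re.compile(r"<entry>\s*<title>([^<]*)</title>")


def titles_via_regex(xml_text):
    return ENTRY_TITLE_RE.findall(xml_text)


# DOM: the whole tree is built for you, even the parts you never touch.
def titles_via_dom(xml_text):
    doc = minidom.parseString(xml_text)
    titles = []
    for entry in doc.getElementsByTagName("entry"):
        for title in entry.getElementsByTagName("title"):
            titles.append("".join(
                child.data for child in title.childNodes
                if child.nodeType == child.TEXT_NODE))
    return titles


# SAX: nothing is built for you, so the handler tracks its own position.
class EntryTitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []     # stack of currently open element names
        self.buffer = []   # text chunks of the title being read
        self.titles = []

    def _in_entry_title(self):
        return self.path[-2:] == ["entry", "title"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_entry_title():
            self.buffer = []

    def characters(self, content):
        if self._in_entry_title():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_entry_title():
            self.titles.append("".join(self.buffer))
        self.path.pop()


def titles_via_sax(xml_text):
    handler = EntryTitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


for extract in (titles_via_regex, titles_via_dom, titles_via_sax):
    print(extract.__name__, extract(SAMPLE))
```

All three return the two entry titles and skip the feed-level title. The differences are in what each one costs: the regex breaks as soon as the markup varies, the DOM version allocates a node for everything, and the SAX version needs the path stack just to tell an entry title from a feed title.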
Whether you’re moving between different formats (e.g. XML and JSON), or making structural changes to the content, the promises around ease of transformation that were made by XSLT were never kept. For starters, no one moved beyond the 1.0 spec, which is grossly underpowered. Developers have come to rely on home-grown mapping languages (Gnip 1.0 employed a completely custom language for moving between arbitrary inbound XML documents and a known outbound structure), on conveniences provided by the underlying parsing libraries, or on the language frameworks they’re building in. For example, Ruby has “.to_json” methods sprinkled throughout many classes. While the method works much of the time for serializing an object of known structure, its output on more complex objects, like arbitrarily structured XML, is highly variable and not necessarily what you want in the end.
An example of where a simple .to_json falls short is the handling of XML namespaces. While structural integrity is indeed maintained, and namespaces are translated, they’re meaningless in the world of JSON. So, if you only need a one-way transformation, out-of-the-box transformation methods leave you with cluttered JSON. Of course, as DeWitt points out, if you need round-trip integrity, then the clutter is necessary.
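The namespace trade-off is easy to see in a few lines. The sketch below is a deliberately naive converter written in Python with xml.etree.ElementTree; it is not what Ruby’s .to_json does and not Gnip’s transformer. The function names and the example.com namespace are hypothetical, and attributes and mixed content are ignored entirely.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical document: Atom default namespace plus a made-up "geo" one.
SAMPLE = (
    '<entry xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:geo="http://example.com/geo">'
    "<title>hi</title><geo:point>1 2</geo:point></entry>"
)


def local_name(tag):
    """ElementTree spells qualified tags as '{uri}local'; drop the '{uri}'."""
    return tag.rsplit("}", 1)[-1]


def to_plain(elem, keep_namespaces):
    """Return (key, value) for an element, recursing into its children."""
    key = elem.tag if keep_namespaces else local_name(elem.tag)
    children = list(elem)
    if not children:
        return key, (elem.text or "").strip()
    body = {}
    for child in children:
        child_key, child_value = to_plain(child, keep_namespaces)
        body.setdefault(child_key, []).append(child_value)
    # Collapse single-item lists so unrepeated elements stay scalar.
    return key, {k: v[0] if len(v) == 1 else v for k, v in body.items()}


def xml_to_json(xml_text, keep_namespaces=False):
    key, value = to_plain(ET.fromstring(xml_text), keep_namespaces)
    return json.dumps({key: value}, sort_keys=True)


print(xml_to_json(SAMPLE))                        # compact, one-way only
print(xml_to_json(SAMPLE, keep_namespaces=True))  # cluttered, but reversible in principle
```

With namespaces stripped you get the tight JSON most consumers want, but two elements with the same local name from different namespaces would silently merge into one key. Keeping the namespaces avoids that collision and is what a round trip would need, at the cost of keys nobody enjoys typing.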
While custom mapping languages give you flexibility, they also require upkeep (bugs and features). Convenience-library transformation routines are often written to a baseline specification and a strict set of structural rules, which real-world documents often violate.
Simple transformations are… simple; they generally “just work.” The more complex the documents, however, the harder your transformation logic gets pushed and the more things start to break (if not on the implementation side, then on the format side). Sure, you can beat a namespace-, attribute-, and element-laden XML document into JSON submission, but in doing so you’ll likely defeat the purpose of JSON altogether (fast, small wire cost, easy JS objectification). While you might lose some of the format-specific benefits, the end may justify the means in this case. Sure it’s ugly, but in order to move the world closer to JSON, ugly XML-to-JSON transformers may need to exist for a while. Not everyone with an XML-spewing back-end can afford to build true JSON output into their systems (think Enterprise apps for one).
In the End
Gnip’s working to normalize many sources of data into succinct, predictable streams of data. While taking on this step is part of our value proposition to customers, the ecosystem at large can benefit significantly from native JSON sources of data (in addition to prolific XML). XML’s been a great, necessary stepping stone for the industry, but 9 times out of 10 tighter JSON suffices. And finally, if anyone builds an XSLT 2.0 spec-compliant parser for Ruby, we’ll use it!