Today’s general data standards are akin to yesterday’s HTML/CSS browser support standards. The first rev of Gecko (not to be confused w/ the original Mosaic/Navigator rendering engine) at Netscape was truly standards compliant in that it did not provide backwards compatibility for the years of web content that had been built up; that idea made it an Alpha or two into the release cycle, until “quirks-mode” became status quo. The abyss of broken data that machines, and humans, generate, eclipsed web pages back then, and it’s an ever present issue in the ATOM/RSS/XML available today.
Gnip, along with social data aggregators like Plaxo and FriendFeed, has a unique view of the data world. While ugly to us, we normalize data to make our Customers’ lives better. Consumer facing aggregators (Plaxo/FF) beautify the picture for their display layers. Gnip beautifies the picture for it’s data consumption API. Cleaning up the mess that exists on the network today has been an eye opening process. When our data producers (publishers) PUSH data in Gnip XML, life is great. We’re able to work closely with said producers to ensure properly structured, formatted, encoded, and escaped data comes into the system. When, data comes into the system through any other means (e.g. XMPP feeds, RSS/ATOM polling) it’s a rats nest of unstructured, cobbled-together, ill-formated, and poorly-encoded/escaped data.
XML has provided self describing formats and structure, but it ends there. Thousands of pounds of wounded data shows up on Gnip’s doorstep each day, and that’s where Gnip’s normalization heavy lifting work comes into play. I thought I’d share some of the more common bustage we see, along with a little commentary around the category of problem
- ![CDATA] is akin to void* and is way overused. The result is magical custom parsing of something that someone couldn’t fit into some higher-level structure.
- If you’re back-dooring data/functions into an otherwise “content” payload, you should revisit your overall model. Just like void*, CDATA usually suggests an opaque box you’re trying to jam through the system.
- Character limited message bodies (e.g. microblogging services) wind up providing data to Gnip that has escaped HTML sequences chopped in half, leaving the data consumer (Gnip in this case) guessing at what to do with a broken encoding. If I give you “&a”, you have to decide whether to consider it literally, expand it to “&”, or to drop it. None of which was intended by the user that generated the original content, they just typed ‘&’ into a text field somewhere.
- Facebook has taken a swing at how to categorize “body”/”message” sizes which is nice, but clients need to do a better job truncating by taking downstream encoding/decoding/expansion realities into consideration.
- Document bodies that have been escaped/encoded multiple times, subsequently leave us deciphering how many times to run them through the un-escape/decode channel.
- _Lazy_. Pay attention to how you’re treating data, and be consistent.
- Illegal characters in XML attribute/element values.
- _LAZY_. Pay attention.
- Custom extensions to “standard” formats (XMPP, RSS, ATOM). You think you’re doing the right thing by “extending” the format to do what you want, but you often wind up throwing a wrench in downstream processing. Widely used libs don’t understand your extensions, and much of the time, the extension wasn’t well constructed to begin with.
- Sort of akin to CDATA, however, legitimate use cases exist for this. Keep in mind that by doing this, there are many libraries in the ecosystem that will not understand what you’ve done. You have to be confident that your data consumers are something you can control and ensure they’re using a lib/extension that can handle your stuff. Avoid extensions, or if you have to use them, get it right.
- Namespace case-sensitivity/insensitivity assumptions differ from service to service.
- Case-sensitivity rules were polluted with the advent of MS-DOS, and have been propagated over the years by end-user expectations. Inconsistency stinks, but this one’s around forever.
- UTF-8, ASCII encoding bugs/misuse/misunderstanding. Often data claims to be encoded one way, when in fact it was encoded differently.
- Understand your tool chain, and who’s modifying what, and when. Ensure consistency from top to bottom. Take the time to get it right.
- UTF-16… don’t go there.
- uh huh.
- Libraries in the field to handle all of the above each make their own inconsistent assumptions.
- It’s conceivable to me that Gnip winds up stating the art in XML processing libs, whether by doing it ourselves, or contributing to existing code trees. Lots of good work out there, none of it great.
You’re probably wondering about the quality of the XML structure itself. By volume, the bulk of data that comes into Gnip validates out of the box. Shocking, but true. As you could probably guess, most of our energy is spent resolving the above data quality issues. The unfortunate reality for Gnip is that the “edge” cases consume lots of cycles. As a Gnip consumer, you get to draft off of our efforts, and we’re happy to do it in order to make your lives better.
If everyone would clean up their data by the end of the day, that’d be great. Thanks.