Processing 1,200 DocBook XML Files

XML gets a bad rap, and I am not going to debate its merits here. That has been done ad nauseam over the years, and still no one has a better alternative when it comes to tagging anything beyond the simplest books. Yes, there is JSON, but as soon as a title has a piece of art or an index I need linked, JSON is no longer the best tool. I have my own opinions about it, but the fact remains that XML is the lingua franca for tagged content in my industry, and I don’t have so many issues with it that I feel compelled to propose my own fixes for it.

The tools that exist to work with XML, however, I think speak to XML’s bad rap as much as anything. There is no single “go to” for all XML work. By that I mean, whereas one can use Eclipse all day to code C or Java, or Xcode for Objective-C and C, there is no single app that does everything for XML. In addition, the two technologies provided for manipulating XML content, XSL and XQuery, feel as though they were developed from two completely different directions and simply thrown into the XML package.

My chosen tools and their uses

Diagnose and Repair

oXygen XML Editor
I forget what landed me on oXygen’s doorstep, but when manually inspecting an XML file for the first time, this is my “go to.” The errors that are returned from parsers like Xerces can be intensely cryptic, but oXygen provides a useful interface to make drilling down to certain errors relatively straightforward. XML-specific editing hooks like automagically closing tags and validating against the declared DTD make working with XML worth the purchase price. There are times, however, when some UTF-8 encoding issue prevents even opening the file, at which point I move on to Plan B: xmllint.

xmllint
xmllint goes where oXygen fears to tread. If I have any error that prevents oXygen from opening a file for any reason, xmllint will tell me exactly where that error is. I’d like to think that an editor as robust as oXygen could handle the same functionality as xmllint, but it doesn’t. I don’t use the command line for anything except the simplest of edits (no need to torture myself with vim or emacs if I don’t need to), so I then move on to the next tool for the fix: BBEdit.
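To make "exactly where that error is" concrete, here is a minimal sketch of the same idea in Python. It is a stand-in for illustration only, not what xmllint does internally and not part of my own toolchain: the function name first_xml_error is made up, and the standard library parser checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the underlying parser.
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical sample: the closing tag on the second line is misspelled.
broken = "<book>\n  <title>Example</titel>\n</book>"
print(first_xml_error(broken))
```

The parser stops at the first problem it hits, so a badly damaged file has to be fixed and rechecked one error at a time.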

BBEdit
BBEdit is the stuff of legend on the Mac, and I don’t think I need to sing its praises to the choir here. While it doesn’t have the XML-specific hooks of oXygen, it does open those files that oXygen barfs on, and has killer search and replace features for fixing problems. One of the best parts of BBEdit is that even if it does come across a UTF-8 encoding issue, it will open the file anyway, which means I can make the fix and move on to transformations.
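When the problem is a bad UTF-8 sequence, it helps to know where to look before opening the file. The following is a rough Python sketch of one way to find the first offending byte. The function name find_invalid_utf8 and the sample bytes are assumptions for illustration, and this is not how BBEdit or oXygen handle the situation.

```python
def find_invalid_utf8(data):
    """Return (byte_offset, line_number, offending_bytes) for the first
    invalid UTF-8 sequence in data, or None if data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count newlines before the bad byte to get a 1-based line number.
        line_number = data.count(b"\n", 0, err.start) + 1
        return err.start, line_number, data[err.start:err.end]
    return None


# Hypothetical sample: a Latin-1 e-acute (0xE9) inside an otherwise UTF-8 file.
sample = b"<para>caf\xe9</para>\n"
print(find_invalid_utf8(sample))
```

For a real file, the bytes would come from something like open(path, "rb").read(), and the reported line number is the place to jump to in the editor.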

Transformations

Updating an XML document’s structure is inevitable when prepping content. XML offers XSL, but I rarely work in a vacuum. Typically, I am mashing some content with some other content, and for that, I need to be able to manage a content store which XSL doesn’t allow on its own. Enter Cocoa.

Xcode
I hate to say it, because other developers might (will) cringe, but I use Xcode as much as a deep, rich scripting platform as I do for making applications. If I need to deploy a tool for my team, I can do so in no time, but more often than not, I am the one developing and executing the solution. I’ve developed a couple of strategies around this.

First, all solutions begin as command line applications. XML work almost never needs an interface, so I develop all the logic in controllers that link to the main function. If I need an interface, then adding one is a cinch coming from a command line app (but not the other way around).

Second, I develop with scalability in mind. If I do something with one file, chances are very high that I will need to do the same to other files as well. I have developed over time a class called OCFileParser that is the bridge between the directory system and editing logic.
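The general shape of that bridge can be sketched in a few lines. This is not OCFileParser, which is a Cocoa class; it is a Python illustration of the idea, and every name in it (process_xml_tree, rename_sect1_to_section) is hypothetical. One loud caveat: Python's ElementTree drops the DOCTYPE and comments when it writes a file back out, so a real DocBook job would need a parser that preserves them.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def process_xml_tree(root_dir, edit):
    """Apply edit(tree) to every .xml file under root_dir.

    edit returns True when it changed the tree. Returns (changed, failed):
    the paths that were rewritten, and (path, error) pairs for files that
    could not be parsed.
    """
    changed, failed = [], []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            tree = ET.parse(path)
        except ET.ParseError as err:
            # Record the failure and keep going so one bad file
            # does not stop the whole batch.
            failed.append((path, str(err)))
            continue
        if edit(tree):
            tree.write(path, encoding="utf-8", xml_declaration=True)
            changed.append(path)
    return changed, failed


def rename_sect1_to_section(tree):
    """Hypothetical edit: rename every <sect1> element to <section>."""
    hits = list(tree.getroot().iter("sect1"))
    for element in hits:
        element.tag = "section"
    return bool(hits)
```

Calling process_xml_tree("titles", rename_sect1_to_section) would run that one edit across a whole directory (the directory name is made up). Keeping the directory walk separate from the editing logic is the point: the same walker can be reused with a different edit function for the next job.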

The benefit to all of this is that using Cocoa’s NSXMLParser and NSXMLDocument classes makes working with XML incredibly flexible and fast. XSLT has its uses, but it doesn’t have the same hooks as a full-fledged programming language.

The One Big Problem with those classes, however, is that they can crash with EXC_BAD_ACCESS on well-formed, valid XML. Out of the 1,200 titles I am working on, there are around 90 that exhibit this behavior, and they are a real mystery. Everything else in the toolbox has no problem with them, but the NSXMLDocument class just barfs on them. I am still trying to sort out whether there is some bug deep in the bowels of the classes (others exist, so this is entirely in the realm of possibility); whether how they link to external files is a problem; or whether this is a memory issue. 1,200 XML documents is a lot, especially since I am relying on ARC for memory management to speed development.

That’s the setup. I have one or two other apps I have to work with, but I save those for when I get truly desperate, and they are not really worth mentioning here (though I am getting close given that last bit). I have a couple more blog posts on how the whole thing works in practice that I am looking to get posted before the next semester begins.