All posts by Philip Regan

Processing 1,200 DocBook XML Files

XML gets a bad rap, and I am not going to debate its merits here. That has been done ad nauseam over the years, and still no one has a better alternative when it comes to tagging anything beyond the simplest books. Yes, there is JSON, but as soon as a title has a piece of art or an index I need linked, JSON is no longer the best tool. I have my own opinions about it, but the fact remains that XML is the lingua franca for tagged content in my industry, and I don’t have so many issues with it that I feel compelled to propose my own fixes for it.

The tools that exist to work with XML, however, I think speak to XML’s bad rap as much as anything. There is no single “go to” for all XML work. By that I mean, whereas one can use Eclipse all day to code C or Java, or Xcode for Objective-C and C, there is no single app that does everything for XML. In addition, the two standard technologies for manipulating XML content, XSLT and XQuery, feel as though they were developed from two completely different directions and simply thrown into the XML package.

My chosen tools and their uses

Diagnose and Repair

oXygen XML Editor
I forget what landed me on oXygen’s doorstep, but when manually inspecting an XML file for the first time, this is my “go to.” The errors returned from parsers like Xerces can be intensely cryptic, but oXygen provides a useful interface that makes drilling down to specific errors relatively straightforward. XML-specific editing hooks, like automagically closing tags and validating against the declared DTD, make working with XML worth the purchase price. There are times, however, when some UTF-8 encoding issue prevents even opening the file, at which point I move on to Plan B: xmllint.

xmllint goes where oXygen fears to tread. If I have any error that prevents oXygen from opening a file for any reason, xmllint will tell me exactly where that error is. I’d like to think that an editor as robust as oXygen could handle the same functionality as xmllint, but it doesn’t. I don’t use the command line for anything except the simplest of edits (no need to torture myself with vim or emacs if I don’t need to), so I then move on to the next tool for the fix: BBEdit.
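As a sketch of that diagnostic step (the file and its contents here are hypothetical, just to demonstrate the output format), xmllint reports the exact line of the first parse error it hits:

```shell
# Create a deliberately malformed sample file to demo the output.
cat > broken.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<book><title>Processing XML</chapter></book>
EOF

# --noout suppresses echoing the document; parse errors still print
# in the form file:line: parser error : <description>.
xmllint --noout broken.xml 2>&1 || true
```

Here xmllint flags line 2 with a tag-mismatch parser error, which is exactly the kind of pointer you need when an editor refuses to open the file at all.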

BBEdit is the stuff of legend on the Mac, and I don’t think I need to sing its praises to the choir here. While it doesn’t have the XML-specific hooks of oXygen, it does open those files that oXygen barfs on, and has killer search and replace features for fixing problems. One of the best parts of BBEdit is that even if it does come across a UTF-8 encoding issue, it will open the file anyway, which means I can make the fix and move on to transformations.


Transform

Updating an XML document’s structure is inevitable when prepping content. XML offers XSLT, but I rarely work in a vacuum. Typically, I am mashing some content together with some other content, and for that I need to be able to manage a content store, which XSLT doesn’t offer on its own. Enter Cocoa.

I hate to say it, because other developers might (will) cringe, but I use Xcode as much as a deep, rich scripting platform as I do for making applications. If I need to deploy a tool for my team, I can do so in no time, but more often than not, I am the one developing and executing the solution. I’ve developed a couple of strategies around this.

First, all solutions begin as command line applications. XML work almost never needs an interface, so I develop all the logic in controllers that link to the main function. If I need an interface, then adding one is a cinch coming from a command line app (but not the other way around).

Second, I develop with scalability in mind. If I do something with one file, chances are very high that I will need to do the same to other files as well. Over time I have developed a class called OCFileParser that bridges the directory system and the editing logic.

The benefit to all of this is that Cocoa’s NSXMLParser and NSXMLDocument classes make working with XML incredibly flexible and fast. XSLT has its uses, but it doesn’t have the same hooks as a full-fledged programming language.

The One Big Problem with those classes, however, is that they can crash with EXC_BAD_ACCESS on well-formed, valid XML. Out of the 1,200 titles I am working with, around 90 exhibit this behavior, and they are a real mystery. Everything else in the toolbox has no problem with them, but the NSXMLDocument class just barfs on them. I am still trying to sort out whether there is some bug deep in the bowels of the classes (others exist, so this is entirely in the realm of possibility); whether how they link to external files is a problem; or whether this is a memory issue. 1,200 XML documents is a lot, especially since I am relying on ARC for memory management to speed development.

That’s the setup. I have one or two other apps I have to work with, but I save those for when I get truly desperate, and they’re not really worth mentioning here (though I am getting close, given that last bit). I have a couple more blog posts on how the whole thing works in practice that I am looking to get posted before the next semester begins.

A fatal flaw of Wall Street in one sentence

Emphasis mine:

JetBlue distinguished itself by providing decent, fee-free service for everyone, an approach that seemed to be working: passengers liked the airline, and it made a consistent profit. Wall Street analysts, however, accused JetBlue of being “overly brand-conscious and customer-focussed.”
The New Yorker: Why Airlines Want to Make You Suffer

Like software, and so many other things, all airlines suck, but some airlines suck less than others. JetBlue sucked less than most of the others: they sold a decent product (treating people like humans, making a cramped, stressful experience as tolerable as reasonable) for a decent price, made money doing it, and they are being punished for it. Having watched firsthand the changes that pressure brings, I’d say “maximizing shareholder value” is easily one of the worst things to happen to any business. Now JetBlue is selling the same crap as all the other airlines, which means they are no longer our “go to” airline for trips. Now, we’ll just shop around like everyone else.

Microsoft vs. LaTeX

Ed: This WordPress theme makes the titles all-caps, thus mangling “LaTeX.” My analytics should get interesting in a little while.

I haven’t read this entire article yet, but the opening paragraph has the best comparison of Word and LaTeX I’ve seen yet:

Microsoft Word is based on a principle called “What you see is what you get” (WYSIWYG), which means that the user immediately sees the document on the screen as it will appear on the printed page. LaTeX, in contrast, embodies the principle of “What you get is what you mean” (WYGIWYM), which implies that the document is not directly displayed on the screen and changes, such as format settings, are not immediately visible.
An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development

Between work and school, I deal with Word and LaTeX a lot. I use LaTeX less than Word, given Word’s ease of use for everyone, but I work with enough math content at work that I needed to learn at least the basics. Once I got the hang of LaTeX, though, it became my “go to” for document preparation, despite the state of LaTeX being a lot more crunchy than I think it needs to be (that’s a separate blog post entirely). Still, even after using LaTeX consistently for a few years, I find it hard to explain to someone who hasn’t so much as seen it.
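A minimal LaTeX source makes the WYGIWYM point concrete: you declare what each piece of text is, and the typesetting decisions happen at compile time. (The content below is just a hypothetical illustration.)

```latex
\documentclass{article}
\begin{document}

\section{Introduction} % declares the role "section"; numbering, font,
                       % and spacing are decided by the class at compile time

The quadratic formula, $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$,
is typeset from its meaning, not drawn by hand in an equation editor.

\end{document}
```

Nothing in that source says what the section heading or the formula will look like on the page, which is precisely what makes it so hard to show to someone expecting Word.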

This bit in the abstract is interesting as well:

We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems.

I really need to read the article to find out why this is true, but two things come to mind immediately:

  • Know your tools. If LaTeX is a core requirement for submissions, then take the time to really learn it.
  • Always double-check your work. There are no excuses for not checking work before submission.

Coca-Cola Disconnects Voice Mail at Headquarters

Coca-Cola is one of the biggest companies yet to ditch its old-style voice mail, which requires users to push buttons to scroll through messages and listen to them one at a time. Landline voice mail is increasingly redundant now that smartphones are ubiquitous and texting is as routine as talking.

I stopped answering arbitrary phone calls both at home and at work a few years ago, and it’s been one of the best productivity hacks I’ve ever done. At home, I can count the number of people for whom I will answer the phone on one hand and have fingers to spare. At work, if you want to talk to me, set up a meeting and I will be more than happy to show up, or to call in if I’m off-site. Our voicemails get sent to our email inboxes, but those get lost in the flurry of everything else. Rarely is anything so urgent as to need my immediate attention, and oftentimes the work I do requires enough concentration that interruptions like the phone are devastating to my productivity. The voicemail system at work is just wasted on me.

One of the weird quirks of AppleScript

I am prepping my company’s XML archive for uploading into MarkLogic (I know, English, right? I’ll post more about this in the near future). But within the archive is a bunch of PDF, EPUB, and image files I don’t want, around 30,000 or so files that need deleting. (There are about 1,400 XML files to keep.)

I’m using AppleScript to crawl through the folder hierarchy and delete the files I don’t want, and I came across this weird bug. It turns out that this one-liner fails at some point as memory usage balloons, and not even wrapping it in a try block helps:

tell application "Finder" to delete target_file

But this works as expected, even wrapped in a try block, regardless of the environment:

tell application "Finder"
	delete target_file
end tell
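For a bulk delete like this, one alternative worth noting is skipping the Finder (and its Apple event traffic) entirely and letting the shell do it. A sketch, with a hypothetical archive path, assuming it is acceptable to delete the PDF, EPUB, and image files permanently rather than move them to the Trash:

```shell
# Demo setup: a tiny stand-in for the real archive (hypothetical layout).
mkdir -p archive/title01
touch archive/title01/book.xml archive/title01/book.pdf archive/title01/cover.jpg

# Recursively and permanently delete the unwanted formats.
# Unlike Finder's delete, -delete does not use the Trash, so test on a copy first.
find ./archive -type f \
    \( -name '*.pdf' -o -name '*.epub' -o -name '*.jpg' -o -name '*.png' \) \
    -delete

ls archive/title01   # only book.xml remains
```

Because find streams through the hierarchy one entry at a time, it sidesteps the memory growth that seems to trip up the one-line Finder form.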

Mucking about with WordPress Themes

Things might get a bit wobbly for a bit tonight. Maybe a few surprises around some corners, but nothing dangerous (I don’t think).

UPDATE: I didn’t change the theme. I want something that uses Bootstrap because I used it for a web programming class just this past semester and I really liked it. I understand that it’s really meant for non-designing developers to work up quick, useful interfaces for their work, but it has a very clean, readable style and is useful for simple sites like this one. But none of the themes I found worked out.

One OSS version I found required command line apps that I won’t use on my server. It’s a theme, guys; I shouldn’t need Grunt and all of its dependencies just to get it going. But I might revisit this one this weekend, since it appears to be the most robust of the bunch I tried.

The themes delivered through WordPress all exhibited the same problem of completely messing up the menu system in two ways. First, the menu system was displayed flat, without its hierarchy. I have a lot of pages on this site, organized so that the default layout doesn’t get completely cluttered with the thirty or so links. This goes back to my complaint about the latest theme to come from WordPress, which destroys the hierarchy. Hierarchical menus are not trivial (they’re not exactly hard, either), and they can take some work to build and maintain. A theme that does not respect that hierarchy is useless to anyone who has one.

Second, the menu system was displayed twice no matter where I activated it. So thirty-ish flat links turned into sixty-ish flat links. Even if I could have tolerated the flat menus (I tried to reconcile them but couldn’t), I couldn’t tolerate having everything displayed twice. I’m guessing this is a problem with WordPress itself, since all of the themes I tried exhibited the exact same behavior. Weird and disappointing. Oh, well.

“We have to tell the police that they are us and we are them.”

If the army can arbitrarily kill thousands in Iraq, why can’t they kill a few people in Staten Island, Missouri, or Ohio? You “support the troops” why don’t you support us, they ask. . . Fair question. There is an answer. We made a bad mistake. Now we understand. We have to unwind this. We have to tell the police that they are us and we are them. When they kill us they are killing themselves. . . We can support the troops by honoring their sacrifice. By caring for them when they come home. Or caring for their families if they don’t. But don’t expect to get a pass when you break the law. Police must be held to a higher standard, because of the power we give them. Certainly not a lower one.
NYPD are the people

Even though the people grant officials special privileges, officials are still beholden to the same fundamental rights and principles that govern all citizens. Officials must be willing to have their decisions reviewed, criticized, and adjusted as required by the people that put them in that position of authority. We have a shared responsibility to exercise our right to free speech when necessary to ensure that our other rights are not infringed. Ultimately, transparency is maintained through free speech by the people’s demand of it, and those in authority understanding they share that duty as fellow citizens.

Everything those in authority have is a privilege, neither a right nor immutable, and is open to scrutiny and adjustment. If those in authority do not see themselves as beholden to the same laws that apply to everyone else, then they do not see themselves as fellow citizens, and therefore have no business being in the position of authority entrusted to them by the people that placed them there.

“We cannot have a society in which some dictator can start imposing censorship here in the United States”

“We cannot have a society in which some dictator can start imposing censorship here in the United States, because if someone is going to intimidate someone from releasing a satirical movie, imagine what would happen if they see a documentary they don’t like, or news they don’t like,” the President added, expressing concern for the idea that some movie producers might fall victim to “self-censorship” to avoid angering another country. . . “That’s not who we are and that’s not what America is about,” Obama continued.
Ars Technica: Obama thinks Sony “made a mistake” pulling The Interview after hack

My understanding is that it was the theater owners who initially pulled the film, not Sony, but the point still stands. I can understand them being skittish after what happened in Denver, but come on. If an attack were being staged large enough to hit multiple theaters in multiple cities simultaneously, I’m confident the federal authorities would have picked up on it well before it could be executed. Regardless of who actually pulled off the hack and made the threat, the theater owners completely caved to an anonymous, baseless threat, plain and simple. We’ll be feeling the implications of this for years to come.

Sony has a chance to redeem their industry by using their own, and other, channels to distribute the film, but that redemption is timed. Time heals all wounds, which means there will come a point when releasing the film just won’t matter, and that would be a very muted showing of our resilience. The big “fuck you” would be to release the film digitally through as many distribution channels as possible before the end of the year. The wheels of big corporations tend to move slowly, but I’m sure that if Sony’s CEO were to pick up the phone and call the CEOs of Netflix, Apple, and Google, this movie would be out within a matter of days.