Debating Formats for Open Access Articles

Andy Powell asks if the ubiquitous Portable Document Format is the right choice for academic publishing online, and especially for open access journals. He suggests that HTML might be the right way to go instead.

I don’t know much about the history of PDF, but I understand it began as Adobe’s proprietary descendant of PostScript. Obviously PDF has been wildly successful. Interestingly, as Peter Sefton points out in the comments, PDF is no longer a proprietary format. Adobe opened up the standard, giving away both the instructions for making PDF documents and the rights to do so to anybody who wants to build software for it. Why? I dunno. PDF was a big success as an Adobe-only format, and you have to wonder what convinced them to give up the exclusive legal right to sell software that makes and reads it. Some governments very sensibly require that their documents be published in non-proprietary formats (although enforcement of that requirement seems pretty thin in places). That’s a sensible requirement both because open standard formats are more accessible to people who can’t or don’t pay for a monopoly company’s software, and because they are more likely to remain readable in years to come, when the responsible company may have long since stopped selling readers compatible with current operating systems and such. Perhaps Adobe was afraid that somebody else’s open standard would rise to supremacy on the strength of that government requirement and the similar demands of accessibility-minded private citizens. In any case, they made the bold move of opening PDF up to the world. Nowadays Foxit, among others, makes software for creating and reading PDFs, and arguably does a better job than Adobe. Certainly their software is less of a bully to your computer, regardless of the quality of the documents it produces.

Peter Sefton also points out that HTML, which was designed not to do layout, is inevitably bad at document layout, and that can matter for the readability of table- and diagram-centric research documents. I would add that all the little readability details of font and kerning and such are also a bit of a wreck in HTML. PDF gives the publisher the potential for control of those things, and sometimes control is a good thing.

He also argues for XML as a secondary format for open access articles to be published in. That would allow full semantic machine-fu goodness. I think he’s implying that the XML version would be the canonical one, which is an interesting and compelling idea. No one is going to read a document in native XML, of course (XML not really being designed to be read in its raw form), so this would have the odd fallout that the authoritative version of the document wouldn’t necessarily be seen by humans. Of course, properly implemented, XML is easily machine-translatable into any readable format. I hope and trust that there are existing software engines to automatically do just that.
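
To make the machine-translation step a little more concrete, here is a minimal sketch in Python of turning an XML article into HTML. Everything about the markup is an assumption on my part: the element names (article, title, author, section, para) and the heading attribute are invented for illustration and don't come from any real journal schema, and the function name article_to_html is likewise hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical article markup; these element names are made up for this
# sketch and do not follow any real publishing schema.
ARTICLE_XML = """<article>
  <title>Formats for Open Access</title>
  <author>A. Researcher</author>
  <section heading="Introduction">
    <para>PDF controls layout; HTML is easy to skim.</para>
    <para>XML carries the semantics.</para>
  </section>
</article>"""


def article_to_html(xml_text: str) -> str:
    """Render the hypothetical article XML as a simple HTML fragment."""
    root = ET.fromstring(xml_text)
    # Escape all text so characters like & and < stay valid in the HTML.
    title = html.escape(root.findtext("title", default="Untitled"))
    author = html.escape(root.findtext("author", default="Unknown"))
    parts = [f"<h1>{title}</h1>", f'<p class="author">{author}</p>']
    for section in root.findall("section"):
        heading = html.escape(section.get("heading", ""))
        parts.append(f"<h2>{heading}</h2>")
        for para in section.findall("para"):
            parts.append(f"<p>{html.escape(para.text or '')}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(article_to_html(ARTICLE_XML))
```

A real pipeline would more likely lean on XSLT stylesheets and an established document schema than on hand-rolled code like this, but the principle is the same: the XML says what each piece of the article is, and a separate step decides how it should look.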

So maybe researchers should publish their articles in (at least) three versions: an authoritative, future-proof, easily catalog-able, semantically illustrated XML version; a PDF or other hard-pixeled version for printed human consumption; and an HTML version for cursory online use. That would be kind of a work-flow version of what LaTeX does for document creation on a humbler scale. The .pdf and .html could be easily auto-generated to the journal’s norms from the .xml, and further customized for layout by authors and editors who have the time and the inclination. It all seems like a good idea to me.
