Debating Formats for Open Access Articles

Andy Powell asks if the ubiquitous Portable Document Format is the right choice for academic publishing online, and especially for open access journals. He suggests that HTML might be the right way to go instead.

I don’t know much about the history of PDF, but I understand it was Adobe’s proprietary descendant of PostScript. Obviously PDF has been wildly successful. Interestingly, as Peter Sefton points out in the comments, PDF is no longer a proprietary format. Adobe opened up the standard, giving away both the instructions for making PDF documents and the rights to do so to anybody who wants to build the software for it. Why? I dunno. It seems like PDF was a big success as an Adobe-only format, and you have to wonder what convinced them to give up the legal right to be the only people who can sell software to make and read it. Some governments very sensibly require that their documents be published in non-proprietary formats (although enforcement of that requirement seems pretty thin in places). That’s a sensible requirement both because open standard formats are more accessible to people who can’t or don’t pay for a monopoly company’s software, and because they are more likely to remain accessible in years to come, when the responsible company may have long ceased to sell readers compatible with current operating systems and such. Perhaps Adobe was afraid that somebody else’s open standard would rise to supremacy on the back of that government requirement and the similar demands of accessibility-oriented private citizens. In any case, they made the bold move of opening PDF up to the world. Nowadays Foxit, among others, makes software for creating and reading .pdfs (and in the case of Foxit, arguably does a better job than Adobe. Certainly their software is less of a bully to your computer, regardless of the quality of the documents it produces).

Peter Sefton also points out that HTML, which was designed not to do layout, is inevitably bad at document layout. That can matter for the readability of table- and diagram-centric research documents. I would add that all the little readability details of font and kerning and such are also a bit of a wreck in HTML. PDF gives the publisher the potential for control of those things, and sometimes control is a good thing.

He also argues for XML as a secondary format for open access articles to be published in. That would allow full semantic machine-fu goodness. I think he’s implying that the XML version would be the canonical one, which is an interesting and compelling idea. No one is going to read a document in native XML of course (XML not really being designed to be read in its raw form), so this would have the odd fallout that the authoritative version of the document wouldn’t necessarily be seen by humans. Of course, properly implemented, XML is easily machine-translatable into any readable format. I hope and trust that there are existing software engines to automatically do just that.
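To make the "machine-translatable" point a bit more concrete, here is a minimal sketch in Python that turns a tiny XML article into HTML using only the standard library. The element names (`article`, `title`, `author`, `section`, `para`) and the `heading` attribute are made up for illustration; they are not taken from any real journal schema. For real work, an established tool such as XSLT would probably be the more natural choice.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical article markup, invented for this example.
# A real publishing schema would be far richer than this.
ARTICLE_XML = """<article>
  <title>Debating Formats</title>
  <author>A. Researcher</author>
  <section heading="Introduction">
    <para>PDF &amp; HTML each have strengths.</para>
  </section>
</article>"""


def article_to_html(xml_text: str) -> str:
    """Render the toy article XML above as a bare-bones HTML page."""
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    author = escape(root.findtext("author", default=""))

    parts = ["<html><body>"]
    parts.append(f"<h1>{title}</h1>")
    parts.append(f'<p class="author">{author}</p>')

    # Each <section> becomes a heading followed by its paragraphs.
    for section in root.findall("section"):
        heading = escape(section.get("heading", ""))
        parts.append(f"<h2>{heading}</h2>")
        for para in section.findall("para"):
            parts.append(f"<p>{escape(para.text or '')}</p>")

    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(article_to_html(ARTICLE_XML))
```

The XML stays the single source of truth here, and the HTML is just a disposable rendering of it. A PDF rendering would need a separate layout step that this sketch doesn't attempt.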

So maybe researchers should publish their articles in (at least) three versions: an authoritative, future-proof, easily catalog-able, semantically illustrated XML version, a PDF or other hard-pixeled version for printed human consumption, and an HTML version for cursory online use. That would be kind of a work-flow version of what LaTeX does for document creation on a humbler scale. The .pdf and .html could be easily auto-generated to the journal’s norms from the .xml, and further customized for layout by authors/editors who have the time and the inclination. All seems like a good idea to me.
