Refurbishing the Camelot of Scholarship: How to Improve the Digital Contribution of the PDF Research Article
This work is licensed under a Creative Commons Attribution 3.0 License. Please contact firstname.lastname@example.org to use this work in a way not covered by the license.
For more information, read Michigan Publishing's access and usage policy.
This paper was refereed by the Journal of Electronic Publishing’s peer reviewers.
The Portable Document Format (PDF) has become the standard and preferred form for the digital edition of scholarly journal articles. Originally created as a solution to the need to “view and print anywhere,” this technology has steadily evolved since the 1990s. However, its current use among scholarly publishers has been largely restricted to making research articles print-ready, and this greatly limits the potential capacity of the PDF research article to form a greater part of a digital knowledge ecology. While this article considers historical issues of design and format in scholarly publishing, it also takes a very practical approach, providing demonstrations and examples to assist publishers and scholars in finding greater scholarly value in the way the PDF is used for journal articles. This involves but is not limited to graphic design and bibliographic linking, the deployment of metadata and research data, and the ability to combine elements of improved machine and human readability.
The Portable Document Format (PDF) was released by Adobe Systems in 1993 to facilitate the electronic distribution of documents. It was created to assist the circulation of digital documents among the newly networked computers that were spreading through offices, whether in local area networks (LAN) or through the Internet. What had become apparent was that documents were being prepared by various word-processing programs, each with their own proprietary file format. With networking racing ahead of file compatibility, John Warnock, Adobe Systems cofounder, in 1991 initiated what he called the Camelot Project in order to solve the “view and print anywhere” problem, as he neatly characterized it (1991, p. 1). Nearly a decade earlier, in 1982, the resourceful Warnock, working with Charles Geschke, figured they had solved the same problem with PostScript (marking the beginning of Adobe Systems). However, PostScript was itself not proving universally applicable. It required “powerful desktop machines,” as Warnock put it, as well as PostScript printers (1991, pp. 1–2).
The goal of Camelot was to develop a lightweight file format that would serve the broadest possible range of users, at least until widespread computing power caught up with the demands of PostScript. Camelot was intended, then, as a temporary, transitional solution to the view-and-print-anywhere problem. Its history and success proved otherwise. When launched in 1993, the file format’s poetic Camelot moniker was replaced by the prosaic “portable document format,” now universally known as PDF. In 2008, Adobe released the PDF as an open standard for others to develop applications for writing and reading it, in what we might think of as the new twenty-first-century corporate spirit of open standards and open source software.
In scholarly communication, the PDF has become the standard file format for research articles published in the electronic edition of peer-reviewed journals. Although many journals also publish an HTML version of their articles along with a PDF, the bulk of the research literature is now available in PDF. Over the last decade, the majority of researchers have switched to reading the online edition of journals available through their library’s electronic collections (King, Tenopir, Choemprayong, and Wu, 2009, p. 131; Hemminger, Lu, Vaughn, and Adams, 2007). While finding articles online is becoming a common practice, most academic faculty print out a good proportion of the PDFs they wish to read, while younger and more research-oriented scholars lead the way in reading articles on their computer screens. The PDF, originally a handy solution to the view-and-print-anywhere challenge for digitally prepared documents nearly two decades ago, has now become one of the principal vehicles of scholarly communication in the digital era.
This ubiquity is a great credit to the versatility of Warnock’s invention of the 1990s. Yet to use the PDF in that earlier fashion—as simply a view-and-print-anywhere file format—reflects a continuing failure to keep up with increased networking capacities for scholarly communication. The problem here is not entirely with the technical limits of the PDF, although those are being challenged by some pushing for a more dynamic format (e.g., Pettifer et al., 2011). However, we contend that the short- and medium-term problem is that the PDF has technical and graphical capacities that, if fully utilized, could serve science and scholarship far better in advancing the flow of ideas and the circulation of knowledge. The challenge we have set for ourselves with this article, as students of scholarly communication, is to set out the various ways in which journal publishers and readers can use the PDF far more effectively to deliver, circulate, read, and utilize the literature so vital to the advancement of knowledge.
Plays Well With Others?
The file compatibility issue of the 1990s has evolved, with the growing sophistication of the web, into one of effectively discovering, accessing, linking, and extracting the information necessary to make the most of the published literature in advancing research and scholarship. To date, broadly two responses to working with the PDF in this new environment have emerged: either make the best of long-standing PDF practices, or look to the development of a new format. In the first instance, in what amounts to a work-around, PDF articles are being “crawled,” “parsed,” and “scraped” by such innovative projects as CiteSeer (Giles, Bollacker, and Lawrence, 1998) and Google Scholar to obtain as much as possible of the metadata required to index the articles. These automated data-mining systems are currently able to extract sufficient indexing and bibliographic information to form rough-and-ready citation indexes that enable scholars not only to see who cited whom, but also to see what was cited in what context. As well, PDF management systems, such as Zotero and Mendeley, enable individual scholars to manage their own collections of PDF articles by similarly extracting bibliographic information from the articles. The very metaphors in play here—crawling of PDFs in hopes of scraping or extracting (and yes, it can be like pulling teeth) some bibliographic data from them—suggest how crudely this approach currently works, leading to irregularities in the data, whether in the form of, say, a missing journal title or the same article listed twice because of a variation in the extraction. As we will explore, this is largely because publishers treat the PDF as a print format, paying little regard to the PDF’s embedded metadata and hyperlink capacities.
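To make concrete how rough and ready this scraping can be, here is a minimal sketch in Python of the kind of pattern-matching such extraction relies on. It is a hypothetical illustration (the regular expressions and sample text are ours, not those used by CiteSeer or Google Scholar), but it shows why a slightly unusual layout yields a missing field or a duplicate record:

```python
import re

def scrape_metadata(text):
    """Naively guess bibliographic fields from the raw text of a PDF.

    Pattern-matching of this kind is fragile by design: if a field is
    labeled unconventionally or laid out unusually, it simply goes
    missing from the record.
    """
    meta = {}
    # A DOI has a well-defined shape, so it is the easiest field to spot.
    doi = re.search(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+', text)
    if doi:
        meta['doi'] = doi.group(0)
    # Volume and year are guessed from loose typographic conventions.
    vol = re.search(r'[Vv]ol(?:ume|\.)?\s*(\d+)', text)
    if vol:
        meta['volume'] = vol.group(1)
    year = re.search(r'\((19|20)\d{2}\)', text)
    if year:
        meta['year'] = year.group(0).strip('()')
    return meta

# An invented citation line, for illustration only.
sample = "Journal of Hypothetical Studies, Vol. 14 (2011). doi:10.1234/example.5678"
print(scrape_metadata(sample))
```

Even on this invented example, a missing period or a reordered line would silently drop a field, which is precisely the irregularity publishers can eliminate by supplying metadata directly.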
The second response to current limits in the use of the PDF for scholarly publishing has reached a critical mass with the “Beyond the PDF” movement, with an initial workshop held on this theme in 2011. Here, the goal is to create a whole new open source file format built from the ground up, “to be used by scholars to accelerate data and knowledge sharing and discovery,” as the workshop web page phrases it. Others working in this area of scholarly communication are looking for entirely new forms of publishing, beyond the traditional blind-peer-reviewed research article. A good example of this is the Public Library of Science’s Currents, with other article-format experiments underway in journal publishing (Priem and Hemminger, 2011; Groth et al., 2010; Fitzpatrick, 2009). Annotum is another radical instance, in which authors compose their articles inside the journal website, which uses no more than a special plugin for WordPress designed for rapid review and publication. However, the cultural resistance to change existing systems and practices should not be underestimated (Ackerman, 2009).
While we applaud the innovative indexing and management of PDFs, as well as explorations of more open formats and new designs for scholarly communication, what follows is concerned with improving the current use of the PDF within the traditional journal context. We think that for the short term the PDF will continue to form a “transitional” solution for scholarly publishing. We are particularly interested in helping what amounts to a renaissance of scholar-publishers, who are taking advantage of online systems to run independent journals, to make better use of PDFs in the service of their authors and readers (Edgar and Willinsky, 2010). The majority of these journals are in the Global South and do not have the human or financial resources to publish in HTML or to experiment with new technical forms and formats. They depend upon Open Office, Microsoft Office, PageMaker, and other programs, usually in older versions, to render their articles in PDF. In light of this widespread use, we offer a number of ways of making more of the PDF for scholarly publishing. We have reason to believe that PDFs can do more to assist reader-researchers, as well as others, in the comprehension, evaluation, and utilization of research and scholarship.
The PDF has capacities for graphic design, hyperlinking, and metadata that are not being fully exploited because of the continuing focus on PDF’s print-anywhere qualities. By making the PDF article good for printing alone, rather than also for reading and connecting on the screen, the journal publisher encourages readers, in effect, to print out the PDF article. Clicking the “print” button cuts the article off, obviously, from the web, where there is potentially a much richer reading experience of the sort that would seem poised—as an article is linked to those that gave rise to it, or its data set, or more recent related articles—to advance research and scholarship. We will review how it is that a PDF read on the screen has the potential to do all of that, as well as to be printed by those so inclined.
To assist publishers, large and small, in taking greater advantage of the PDF’s current capacities to advance scholarly communication, we examine (1) the design of the PDF article, including its graphic and functional design, (2) data portability, including linked metadata and research data, and (3) PDF future developments, including some advanced functionality that could be built on the progress made in the first two sections, if the Great PDF Replacement Format (GPDFRF) does not arrive before then. These three steps are graduated in the sense that the first can be implemented by publishers, including the modestly resourced scholar-publisher, this afternoon, while the second and third are progressively more demanding but still, we believe, well within reach among those seeking to see the quality of scholarly communication advanced.
We provide both descriptions and demonstrations of how publishers can improve the use of the PDF for scholarly communication, without giving up the capacity to print the article. The recommendations that we provide for the use of PDF with the research article are based on our combined expertise as educator, information scientist, and graphic designer, respectively. We seek to help publishers refurbish Camelot. Our goal is to see research articles operate within the larger hyperactive web of knowledge, where critical and fascinating connections can be made that are at once virtually nowhere and everywhere, anytime. Such has always been the promise of Camelot.
1. The Design of the PDF Article
The layout of the journal article page has changed little enough over the course of the last century, despite the transformation of the publishing medium from print to digital. One advantage of the PDF over HTML for the electronic edition of the journal is that the PDF has made it easier to recreate a familiar-looking edition of the journal page, if necessarily rescaled to fit the standard letter-size paper used by computer printers. This familiarity was necessary to reassure those concerned about the status, prestige, and worth of the electronic edition of a print journal. Yet among some journals, this has simply brought forward in the digital edition poor design principles from the printed journal page, needlessly sustaining less than ideal legibility for the article, and thus the ability to work with the article as effectively as possible. It is worth briefly considering the history of the journal page as the tradition has evolved away, in our estimation, from some basically sound principles.
When the first English-language scientific journal, Philosophical Transactions of the Royal Society of London, appeared on March 6, 1665, the page, in size and layout of type, resembled, not surprisingly, that of the common book (fig. 1). The book arts, by which printers and scribes set out the proportions of page size, text, and margins to achieve the eminently readable page, had already had a long history by the seventeenth century, reaching back into the era of the manuscript, during which such scholarly apparatus as the index and footnote were introduced. The graphical design principles of the original issues of the Philosophical Transactions, based on the book page of the day, have been generally preserved in the humanities, as demonstrated by a page drawn from a 2010 article in the Philosophical Review, a quarterly of roughly 150-page issues, published by Duke University Press (fig. 2). The earlier page design in print has transferred well to the PDF, rescaled to fit the standard letter-size of printer paper.
However, in the sciences, the journal page has gone through, in many cases, what might be called content intensification, with a corresponding loss of legibility over the centuries. A prime example of this is found with the New England Journal of Medicine (NEJM). It began as the Boston Medical and Surgical Journal, which in 1900 was already a 24-page weekly (fig. 3). By the end of the century, it had developed into a 100-page weekly. When a journal reaches that size and frequency in print, the cost of printing, paper, and postage understandably leads to a highly compressed page design, which had otherwise become the mark of highly cost-conscious daily newspapers in the nineteenth century. Skip ahead to the twenty-first century, and you have a QWERTY-like legacy for the electronic edition of the journal, at least when it comes to the PDF: a lowest-common-denominator design whose reliability is due almost entirely to its familiarity. The requirements of an old technology such as print and its associated cost structures, which no longer pertain, continue to dictate and constrain aspects of the reader’s experience, namely by limiting readers, in effect, to printouts for efficient and convenient reading.
Articles in the online edition of the NEJM are available in HTML and PDF. The PDF article format is a somewhat updated version of the 1900 print edition, with increased margins; narrower column width; and a cleaner, lighter font (fig. 4). What makes this PDF difficult to read on the computer screen, let alone a smart phone, is the two-column design, which requires scrolling up and down, as well as keeping track of which side of the page you are on (fig. 5). For this reason, one imagines, the NEJM’s HTML version uses a single-column design (fig. 6). What the NEJM has achieved with HTML could, of course, be executed in PDF to make for a readable text in print and on the screen, and such is our recommendation, as we will explain in more detail.
In light of the PDF increasingly becoming the version of choice for publishers and readers working with scholarly literature, and as many journals do not have the technical or human resources to produce HTML versions, the graphic design and technical utilization of the PDF need to assume greater versatility. They need to bring together the best of the PDF and HTML versions of the journal article, as currently represented by leading journals such as the NEJM. That is, readers should still be able to print the PDF research article on a common computer printer, but this requirement does not preclude a more functional and readable PDF for use on the variety of screens by which readers increasingly read and utilize research literature. This calls for authors and editors to pay greater attention to a number of textual elements, including (a) legibility, (b) references or works cited, and (c) annotations. For each, we provide relatively easy steps to increase the quality of reading and engagement.
For laying out the PDF version of a research article, we recommend that there be no more than 55 to 70 characters per column of text (Dyson, 2001; 2004). For other reasons as well, owing to mobile devices with limited screen real estate and the limitations of some PDF annotation tools, we would go even further and recommend that documents contain only a single column of text per page. Some will immediately object that this will require extra paper on printing; indeed, there are many dual-column formats in continued, effective use. We would counter that it is the current less-than-screen-friendly quality of many PDFs that encourages at least some printing. The extra number of pages resulting from printing a single-column layout—with the example we provide below amounting to a 25 percent increase in length between one- and two-column layouts—may also give people further pause before printing. The single-column PDF offers several advantages. It provides space for marginal comments, whether on paper or on the screen, without blocking the text. The text itself takes up less screen real estate, which in turn makes it easier to do side-by-side comparisons of texts, tables, and figures, which contribute to evaluating and utilizing the text (Olive, Rouet, François, and Zampa, 2008). At the same time, in libraries and Internet cafes where printing is not an option, the article will be read online, and that experience would also be improved by these design considerations, which can be readily imitated by publisher or author in preparing the Word or Open Office version of the article before it is saved and published as a PDF.
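The 55-to-70-character measure can even be checked mechanically before a layout is committed to. As a minimal sketch (assuming the manuscript’s plain text is at hand), the following uses Python’s standard textwrap module to preview how a paragraph sits within the recommended measure:

```python
import textwrap

def preview_measure(paragraph, width=66):
    """Re-wrap a paragraph at a target measure (66 characters here,
    inside the recommended 55-70 range) and report the longest line."""
    lines = textwrap.wrap(paragraph, width=width)
    longest = max(len(line) for line in lines)
    return lines, longest

paragraph = ("The single-column PDF offers several advantages. It provides "
             "space for marginal comments, whether on paper or on the "
             "screen, without blocking the text.")
lines, longest = preview_measure(paragraph)
print(len(lines), longest)
```

A publisher preparing a template could run such a check against a sample page to confirm that the chosen font size and column width keep lines within the measure.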
To demonstrate the difference that additional care with a more legible graphic design can make—including the use of a single, relatively narrow column—we have taken the liberty of graphically transforming an existing PDF article to improve the reading experience. The example we have used is a 2010 article by Abdul Rahman Abdul Malik on the “greening” of the architecture curriculum in Malaysian Institutes of Higher Learning, from the open access journal ArchNet: International Journal of Architectural Research, which is published by MIT with support from the Aga Khan Trust for Culture. In the original PDF of the article on the ArchNet website, the text of the article is presented in two columns with tables, figures, and references. With our redesigned PDF Article Demo—which we have posted online for download—we ask readers to consider the increased readability, achieved by revising the graphic design to include a single-column layout, additional space between lines (referred to as secondary leading), and navigational aids such as within-document anchor links to figures and images, as well as to citations. We will return to this redesigned PDF demonstration article to consider additional features for enhancing the PDF.
The formatting of the references or works cited included by the authors in an article follows, in principle, a strict set of conventions which have been refined over the years, whether by the Modern Language Association (MLA), the American Psychological Association (APA), or the Chicago Manual of Style, to name just three of many standards. The purpose of such conventions is to help readers track down sources reliably and with some efficiency. With the introduction of hyperlinks and a growing proportion of the literature available online, it is possible for readers to consult the sources cited as they are reading the article. This represents a significant gain for the evaluation, general integrity, and idea-generation associated with this literature, and as such needs to be part of the PDF vision for scholarly journals. The concept of linked references has given rise to utopian visions and practical suggestions for a universal bibliographic database.
On the practical side, CrossRef, an organization of scholarly publishers, employs an article DOI (Digital Object Identifier) system to provide a distinct link for each of close to 50 million registered citations, a number that continues to grow, at admittedly varying rates across disciplines (Davison and Douglas, 1998; Paskin, 2008). The DOI from CrossRef is being used by publishers, along with other databases, to provide a link to the referenced source. While CrossRef gathers references from across the disciplines, in the life sciences the National Library of Medicine’s PubMed, the leading life sciences database, provides a similar identifier service with its PMID, and it is common in life science journals to include links with each reference that utilize the DOI and PMID. This, with few exceptions, is not being done with PDFs. To use the example of PLoS Biology, in the HTML version the references include a link to three look-up options: CrossRef, PubMed, and Google Scholar, while no links are included in the PDF version (fig. 7).
The PDF version can and should include similar look-up options for each reference entry included in an article, especially for journals that do not provide an HTML version. Authors can be encouraged to provide any available identifiers (e.g., URLs) for the references they cite, such as the PubMed ID. In addition, DOIs can be readily obtained for an article’s entire set of references, using CrossRef’s batch look-up service, whether by the author or the publisher. With Google Scholar, one can build a search query for each reference using the first author’s last name and the title. These look-up links can be discreetly embedded in a series of icons or acronyms representing CrossRef, Google Scholar, PubMed, ISI Web of Science, or other relevant databases, and placed at the beginning or end of each reference. It is worth noting the importance of including a Google Scholar search link in addition to other databases such as CrossRef and PubMed, something that PLoS Biology does, for example, but the NEJM does not. What Google Scholar provides is a way to look up both the publisher’s version, by way of CrossRef or PubMed (should the reader have subscription or library access to the journal in which the article appears), as well as any open access versions of the article that have been made available in the authors’ institutional repository or on a personal website. We have placed discreet “search” icons with each citation listed in the References in our PDF Article Demo. These icons contain URLs with simple search scripts for the cited work in Google Scholar (based on the title and first author of the work). In this way, the link may well increase the reach of access for readers, while encouraging the checking of sources. This capability may provide an element of quality control on the liberties that authors may take, but more importantly, it adds to the educational contribution the article can make by connecting it directly to related literature.
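Such search scripts are simple to construct. The sketch below is a hypothetical Python helper (not the actual script behind our demo icons) that builds a Google Scholar query URL from a cited work’s title and first author’s surname:

```python
from urllib.parse import urlencode

def scholar_search_url(title, first_author_surname):
    """Build a Google Scholar query URL for a cited work, pairing the
    article title (quoted as a phrase) with the first author's surname."""
    query = '"{0}" {1}'.format(title, first_author_surname)
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})

# An invented citation, for illustration only.
url = scholar_search_url("Greening the Architecture Curriculum", "Malik")
print(url)
```

The resulting URL can then be attached to a small icon next to the reference entry when the PDF is laid out, so that a click sends the reader to both the publisher’s version and any open access copies.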
The ability to annotate and gloss a text is among the most important aspects of scholarly reading, with a long-standing history dating back to medieval manuscript culture. The history of marginal gloss takes a fascinating turn, with one aspect developing into what were initially reader-contributed footnotes and the other accumulating into a commentary tradition that remains a mainstay of the humanities (Clemens and Graham, 2007; Teeuwen, 2010). While it may be assumed that readers print out PDFs for the ease of marking up an article, it is simple enough for publishers to support the digital annotation of PDFs, which have the advantage over print of being readily sharable among research teams—portable, and hypertextual, without sacrificing the ability to print out and preserve the notations.
Numerous open source and freeware annotation programs for PDFs have been developed, and journals can alert readers to the availability of such tools for use with the journal’s articles. On the other hand, it seems safe to assume that the overwhelming majority of readers use Adobe’s free Reader application to read PDFs and that this will be the extent of their PDF tool set. As it turns out, special allowances have to be made for annotating with Reader. With all that Adobe has done to protect and secure the PDF, in support of Digital Rights Management (DRM), the ability to mark up or annotate a PDF cannot be taken for granted. Only more recent versions of Adobe Reader allow readers to annotate a PDF, and only if that right has been enabled in the creation of the PDF. We have included some examples of such annotations in a second version of our PDF Article Demo. Further, readers can respond to such comments; they are laid out to demonstrate how the narrower line length can accommodate comments without reducing the readability of the article, and the comments appear both on the screen and in a printout of the article. Thus, in preparing articles as PDFs, it is critical for journal publishers to utilize the Enable Commenting function when saving the PDF. With these three relatively easy aspects of upgrading the graphic design, hyperlinking, and annotation of the PDF in hand, we turn to more complicated, challenging, and potentially more rewarding elements in improving the state of the scholarly PDF.
2. Data and the Article
One of print’s great breakthroughs in the field of scientific publishing was its ability to reliably reproduce data in copy after copy (compared to manuscript culture). Elizabeth Eisenstein went so far, in her historical recovery of science’s “unacknowledged revolution” (the European introduction of the printing press), as to suggest that “‘the data available’ to all book readers . . . could help to explain why systems of charting the planets, mapping the earth, synchronizing chronologies, codifying laws and compiling bibliographies were all revolutionized before the end of the sixteenth century” (1979, p. 113). Eisenstein concludes that “typographical fixity is a basic prerequisite for the rapid advancement of learning” (ibid.). On the other hand, we want to suggest, if without yet having the historical perspective afforded to Eisenstein, that the digital era will be marked by the more dynamic qualities of data, both in the case of (a) the machine-intelligible data available about the article (i.e., metadata), and (b) the actual research data, which forms the evidence on which an article is reporting.
(a) Article Metadata
An article’s metadata refers to the bibliographic information that identifies an article, including the author, title, publication, issue, year, keywords, and so on. This metadata has obviously always been a part of the article, but what is in question in the digital era is the accuracy with which that data is machine-readable—particularly given the crucial role it plays in the identity of the work now that it is so readily loosed from its original site of journal publication. Standards, conventions, and protocols for digitally marking up a document’s metadata have been developed to ensure that it is properly identified in an online environment. However, this is often ignored in preparing the PDF version of a research article, given that it is designed not to live on the web but to be printed, where readers rely on visual and semantic clues to identify the author, title, and journal—clues which can pose considerable challenges to machines. We have already noted the current state of work-arounds, by which PDFs are more or less effectively scraped for metadata by CiteSeer and Google Scholar. To take another example, the PDF page-header text, typically containing the journal’s title, is functionally indistinguishable from the article body in machine-readable terms, making the process of mining text from PDFs unnaturally laborious, not unlike the extraction of “messy” data from web pages (Baroni et al., 2008).
Over the years, Adobe has updated the architecture of the PDF so that it now has containers for carrying structured, supplemental data associated with the content of the file, such as article metadata that would produce more reliable indexing and citation records, as well as more efficient bibliographic management. Free tools are available for entering PDF metadata into these containers, when the PDF is created as well as after it is downloaded (more on that later), notably pdftk and iText. We have included the article’s metadata in our PDF Article Demo. In most cases, however, journal article PDFs contain spurious information (e.g., whatever name the Word file used to create the PDF was saved under) or no data at all. Some publishers have used these containers to include additional data, such as the table of contents for the issue in which the article appeared, but we feel that they are ideally suited for the article metadata (fig. 8).
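As a concrete sketch of this workflow using pdftk, one of the free tools just mentioned: pdftk stamps document-information fields into a PDF from a plain-text data file, applied with a command along the lines of `pdftk article.pdf update_info info.txt output article-meta.pdf`. The helper below generates such a data file in pdftk’s InfoKey/InfoValue format; the metadata values shown are illustrative, not the actual record of any published article.

```python
def pdftk_info_records(metadata):
    """Render a dict of document-information fields in the InfoKey/
    InfoValue format read by pdftk's update_info operation, e.g.:

        pdftk article.pdf update_info info.txt output article-meta.pdf
    """
    lines = []
    for key, value in metadata.items():
        lines.append("InfoBegin")
        lines.append("InfoKey: " + key)
        lines.append("InfoValue: " + value)
    return "\n".join(lines) + "\n"

# Illustrative field values only.
record = pdftk_info_records({
    "Title": "Refurbishing the Camelot of Scholarship",
    "Author": "A. Author and B. Author",
    "Subject": "Journal of Electronic Publishing, 14(1)",
})
print(record)
```

A publisher could generate this file directly from the journal management system’s article record, so that every published PDF carries the same metadata as the journal’s website and indexing feeds.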
More recently, Nature has begun to include complete article metadata in its PDFs with the help of the Charlesworth Group, which describes its XML preparation of the journal’s articles as a way to “futureproof” its PDFs for “use in new web applications being developed for the semantic web,” as the news release stated in 2009 (fig. 9). The emphasis on the future reflects how the integration of metadata containers within PDFs seems to have done little to alter the way in which this document information is provided by publishers or extracted by services that support the scholarly literature. As things stand, citation management software, such as Zotero and Mendeley, employs a mixed-method solution for extracting the author’s name from the article text (as well as the date and place of publication, etc.), which is then cross-referenced against a stable resource, such as Google Scholar, which is itself working from its own best guess of scraped data from PDFs (Marinai, 2009). This has created something of a chicken-and-egg situation in the pursuit of accurate record-keeping and effective digital management of resources.
A pressing question, then, is whether publishers should first demonstrate their willingness to add complete article metadata to their PDFs, or whether scholarly services should first be set up to extract accurate metadata from the PDF container in order to demonstrate how it will lead to improved services. From where we stand, both journal publishers and bibliographic service providers need to begin implementing this managed metadata approach, given its value for the management of digital libraries (Witten et al., 2010). Although PDF’s relatively brief three-year history as an open standard is probably to blame for the relative lack of attention paid to it by metadata standards committees, Adobe has not done much to lead by example on this front. Despite the advent of new open source tools for the newly opened PDF specification, most users would be hard pressed to find a document that is less open—particularly in terms of being editable by end users without proprietary software—than the average PDF.
Any truly effective new PDF metadata strategy needs to be driven from a document repository toward some immediately obvious added value. All of the opportunities to add new features or new content to a PDF originate at the point of publication; all embedded links back to web content could not hope for a more stable host. This would seem to be a natural goal for universities’ institutional repositories, many of which have been criticized for failing to incentivize students and faculty to deposit published works despite their stated research goals (Johnston, 2010). If an institutional repository were able to provide better metadata than is found in pre-print editions created by the “Save As” menus of individual authors—perhaps cross-referenced with a campus-wide research monitor such as BibApp—faculty would no longer have any need to messily self-archive on an individual basis. “Green” open access, based on institutional archiving of published research, would be profoundly strengthened.
(b) Research Data, Tables, and Figures
Some of PDF’s most ardent critics are those interested in reconceptualizing the role of data in publishing, seeing the value in making the data public for (re)use: for reanalysis and replication studies, for aggregating, archiving, linking, and sharing (Fenner, 2011). For them, the printed-page presentation of PDF is its worst feature, for the way it binds quantitative data—or, in some cases, code—to plain text (Pettifer et al., 2011). As it stands, journal editors take considerable pains to professionally lay out data tables, focusing on aesthetics at the expense of reusability. The table summarizes the richness of the data and demonstrates the results of selected statistical analyses of it. Fellow researchers, whether engaged in formal peer review or in other critical engagement with the literature in their field, would be far better served, as would the standards of science, by having the complete data set in a spreadsheet drawn directly from an archive set up for data deposit.
For publishers to encourage authors to provide the data set in a spreadsheet format along with the article is bound to be a challenge initially. It is not currently part of publishing culture, except in a few fields such as genomics, where the data’s value has historically been more widely accepted. In this area, data sets are typically provided via the U.S. National Library of Medicine’s GenBank, and journals require deposit prior to publication. Here the connection has been made between responsible science and data deposit. In addition, there are projects underway, such as the Dataverse Network and Dryad, that are attempting to provide the ready means and citation credit for researchers creating access to their data (King 2007). In the case of PDFs, providing a link from the static table to the interactive spreadsheet of the data—whether it resides on the publisher’s server, the author’s repository, or a dedicated data archive, and whether or not it requires application and qualification from the requester—would be easy to implement and would not, we believe, take long to establish its value as a means of increasing the quality of science at very little additional expense.
A second category of research data emerges from the use of graphical images, visualizations, archival materials, and other sources of data associated with the humanities. The PDF can do an excellent job with purely graphical content, given its compatibility with the most popular image editing suites, such as Adobe Photoshop or GIMP. But here again, laying out an article to fit the letter-size printer page acts as a constraint on the quality, size, and integration of these figures. To preserve the full potential of high-resolution and richly detailed images within the digital setting of the PDF, these images can be inserted at the end of a given document in the highest resolution and at any suitable scale, as an appendix that may be skipped or only partially represented in printing. Among the many long-supported yet neglected features of the PDF is, in fact, the ability to insert anchor links or “bookmarks,” much as has been done in HTML since academic publishing first took to the Internet. A link to “Figure X,” along with perhaps a thumbnail in the document margins, can point to a full-resolution image that travels with the document itself, more reliably than a link out to the web.
A common thread to all of this is to find ways of extending the PDF’s place as an integral part of reading environments that are deeply linked to external documents, websites, and other resources, and to put an end to its standing as an entirely self-contained and discrete entity, awaiting its printing and stacking in the pile of other PDFs that one often sees on a colleague’s or one’s own desk, presumably awaiting filing. To continue to leave these capacities of the PDF unexplored and under-utilized unnecessarily reduces the value of publications whose principal contribution lies in some novel data generation or analysis. As a result, the article serves as a placeholder for the real achievement of the research, which is available to reader-researchers only through direct appeals to the authors. This is no less troubling for research in the social sciences and humanities, where qualitative and archival data often end up distributed through publication of irreproducible PDF histograms, while rich digital images are constrained by a stock paper size for the assurance that they will travel as freely as the PDF, which is their academic totem. We need to publish PDFs that look and act more like hypertext—deeply integrated into the knowledge ecology of the web—while still being printable in this transitional period.
3. PDF Futures
With this graduated list of suggested PDF improvements, moving from easy and immediate to somewhat more challenging and far-reaching, we come to our final section on possible future developments for the PDF. This stage assumes that the PDF has a somewhat longer-range future, which does not seem unlikely. The improvements we set out here call for a greater investment, and we consider four of them: (a) the use of scholarly tools, (b) the XML markup of bibliographies, (c) further uses of this formatted markup, and (d) new tools for reformatting PDFs.
(a) Advanced Hyperlink Tools
The HTML version of the NEJM article, like those of many journals published by major publishing houses, also includes a series of supports or tools (fig. 10). These supports are intended to provide a context in which to read, evaluate, and use the article that can be of value to expert and novice readers alike (Willinsky 2003). This can take the form of an automatic keyword search derived from text in the article abstract, or simply a link to other articles published by the same journal or parent entity. These supports include tabs for viewing references, citing articles, and letters to the editor, as well as tools for collecting the citation, picking up slides, seeking reprint permissions, emailing the article to colleagues, and a list of related articles with links to them.
Journal publishers can establish a dynamic service that will place links to these supports and tools, as well as current counts, such as the number of citations of the article, in each PDF. This information can be placed in the PDF’s metadata container or appended to a more immediately obvious pre-title page (fig. 11) at the point of download, ensuring that the document is not orphaned or cut off from its scholarly context once it begins to circulate beyond the journal. If one receives a PDF from a colleague in an email, it can still provide the original supports, connections, and tools that the journal that published it provides, even if some of the data on the PDF pre-title page will be dated. This, of course, is at odds with absolute portability; we could provide only a DOI link in the article itself, and nothing would be dated. Providing instead a list of recently published related articles would thus be a double-edged sword.
(b) Advanced Bibliographic Support
The bibliographic efforts of authors, in representing their exhaustive search and utilization of the literature, can be made more valuable to readers by treating this process as no less a matter of research that is productive of valuable data. This involves turning the list of references provided by the author into a searchable, well-structured database. After all, the author carefully structures (most of the time) the bibliographic information in the references list. For example, with entries following the APA format, the first element in a reference will be the author’s last name, followed by the author’s first and second initials—then comes the year of publication in parentheses, the article’s title, the name of the journal, and so on. Given this structure, a bibliography can be turned into, or marked up as, a database. Because of the predictability of this structure, the references can be parsed, prior to the article being made into a PDF, by a script that identifies each structural element of the citation, tagging it with a standardized XML (Extensible Markup Language) schema, such as the one developed by the National Library of Medicine (fig. 12).
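As a rough sketch of what such a parsing script might look like (the regular expression and the tag names here are illustrative assumptions of our own, simplified well beyond the actual NLM schema):

```python
import re

# Illustrative pattern for a simple APA-style journal reference:
# "Lastname, F. M. (2007). Article title. Journal Name, 36(2), 173-199."
APA_PATTERN = re.compile(
    r"(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>[^.]+)\.\s+(?P<journal>[^,]+),"
)

def tag_reference(ref):
    """Wrap the structural elements of one reference in XML-style tags."""
    m = APA_PATTERN.match(ref)
    if m is None:
        return None  # hand off to a human, or to a more forgiving parser
    return ("<citation><authors>{authors}</authors>"
            "<year>{year}</year><article-title>{title}</article-title>"
            "<source>{journal}</source></citation>").format(**m.groupdict())

print(tag_reference(
    "King, G. (2007). An introduction to the Dataverse Network. "
    "Sociological Methods Research, 36(2), 173-199."))
```

A production script, such as the Lemon8-XML parser discussed by Suhonos (2009), must of course tolerate far messier input than this single happy path.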
The XML-wrapped-and-tagged bibliography can then be placed, along with the article’s own set of bibliographic metadata, in the PDF container that will be created for the well-formatted article, while the article, without these tags and metadata, appears in the PDF ready to be read or, alas, printed. This involves using the PDF’s container, which is provided for holding additional structural or metadata information related to the file. Where the PDF freezes the page with everything in its predictable, and thus easy-to-read, place, this XML bibliographic database in the PDF’s container enables the article’s references to take on new value (Priem, 2011). Readers can build reliable, highly usable bibliographies from the articles they are reading; the citation counts and search queries of services such as Google Scholar can be improved (e.g., find articles that cite Foucault in specific genres or disciplines; find articles that have been read by other readers of Foucault; and find articles that have been frequently cited alongside articles by Foucault).
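Once in the container, the tagged references become queryable with any standard XML toolkit. A minimal sketch, using Python’s standard library and invented element names rather than the NLM schema itself:

```python
import xml.etree.ElementTree as ET

# A fragment of the kind of tagged bibliography that might live in the
# PDF's container; the element names are illustrative, not the NLM schema.
refs_xml = """<ref-list>
  <citation><authors>Foucault, M.</authors><year>1975</year></citation>
  <citation><authors>Priem, J.</authors><year>2011</year></citation>
</ref-list>"""

tree = ET.fromstring(refs_xml)
# e.g., a service answering "which authors does this article cite?"
cited = [(c.findtext("authors"), c.findtext("year"))
         for c in tree.iter("citation")]
print(cited)  # [('Foucault, M.', '1975'), ('Priem, J.', '2011')]
```

The same few lines serve equally well for a reader’s reference manager or for an indexing service’s citation counts.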
(c) Advanced XML Utilization
There are additional, if somewhat futuristic, advantages to pursuing this combination of XML and PDF for readers’ experiences. The software tools used in handling manuscripts store details such as the author’s name or the article title as structured data (i.e., existing in some field independent of the body text). Such a system could just as easily append this formatted data to the appropriate container within the PDF. A sophisticated PDF reading environment—a web browser, a stand-alone bibliographic management suite, or a mobile device—could then present this information (much as within-document headers and chapter titles are currently presented as anchor links) in an easily browsable manner. These links could additionally be mirrored in inline citations, connecting a user to Google Scholar or PubMed on a simple mouse-over. While our focus is on improving the PDF itself, this is one instance where we could see an equal improvement from not only redesigning the document but then using such advances to redesign the reading environment (fig. 13).
(d) Advanced Reformatting
None of this is to say that we do not enjoy and benefit from being able to hand out a printed PDF document at meetings, or any of the other ways in which PDF likewise benefits from being directly comparable to a physical volume (Marshall, 2010, p. 185). If we were being cynical, we could easily suggest that it is exactly PDF’s stodgy inflexibility that has secured its success, and we will, for that matter, always have some need for a stodgy, inflexible document. For instance, it has come to be seen as a professional flourish, if not an expectation, to traffic one’s resume in PDF, simply because PDF is well-suited for static reading. In other words, it seems at times that we are obliged to retain a communication medium that is on technological par with a fax machine—by design—as a nostalgic measure of trust in the “printed” page.
We have outlined recommendations for improving the PDF, but in fact, any of these could as easily be applied to PDF’s successor. Among the many candidates touted here and elsewhere, none do all that the PDF is capable of. It is not clear whether the PDF will be able, in its current form, to draw due attention to its accumulated feature set, but it will almost certainly have all the time it needs. It will not disappear overnight. Because PDFs are so seldom edited post hoc, a better PDF must originate from a better point of publication. With the genie of universal PDF printing out of the bottle, it would appear that we need to either discourage authors from submitting their work in PDF rather than in Word, Rich Text Format (.rtf), or another work-in-progress format, or else vastly improve existing solutions for parsing and reformatting PDF documents. The former solution seems far simpler, and indeed is already in use at PLoS and elsewhere, though as we have discussed, none of these workflows has yet been standardized or made effectively open source. Document layout is a terribly important thing for both form and function, yet the overwhelming majority of scholarly journal layout methods and engines in use appear not to have been improved since the transition to electronic publishing.
Some relatively large stakeholder in academic-content discovery must take the lead in this endeavor if we are to see a real improvement. University institutional repositories in search of added value would make fine candidates, as would the developers of open-source publishing platforms. Certainly, research and development is underway in pursuit of PDF’s successor, but there remains an infinitely greater contingent of researchers who both lack the time or expertise for tool-building and see little reason for a less-than-strictly-necessary migration from PDF. It falls to those of us involved in advancing electronic publishing to recognize as much, even as we seek ways of contributing to the value of Camelot projects, old and new.
John Willinsky is the Khosla Family Professor at the Stanford University School of Education. He is a fellow of the Royal Society of Canada, and the director of the Simon Fraser University-based Public Knowledge Project, dedicated to exploring how open-source technologies can be used to improve the professional and public value of scholarly research. He is the author of Technologies of Knowing, If Only We Knew: Increasing the Public Value of Social Science Research, and The Access Principle: The Case for Open Access to Research and Scholarship.
Alex Garnett is a PhD student in information science at the University of British Columbia and a researcher at the Electronic Textual Cultures Lab at the University of Victoria. He likes social network analysis, open access publishing, learning analytics, document standards, social design principles, and libraries. He holds an MLIS from the University of British Columbia and a BA in Cognitive Science from the University of Connecticut. His website is http://webambler.com, or follow him on Twitter: @axfelix.
Angela Pan Wong is an Austin-based designer. She works in narrative information design, picture books, and designing meaningful learning experiences through technology and digital media. She holds a BA/BFA in Design from the University of Texas at Austin and a MA in Education from Stanford University through the Learning, Design & Technology program.
- Ackerman, R. “The subjective feelings of comprehension and remembering accompanying text-learning on-screen.” In Learning in the Technological Era III: Proceedings of the 2009 CHAIS Conference, 2009.
- Baroni, M., F. Chantree, A. Kilgarriff, and S. Sharoff. “Cleaneval: A competition for cleaning web pages.” In Proceedings of LREC, 2008.
- Binkley, Robert C. Manual on Methods of Reproducing Research Materials. Ann Arbor, MI: Edwards Brothers, 1936.
- Cameron, Robert D. “Towards Universal Serial Item Names.” Journal of Digital Information 1, no. 3 (1998): 10–11.
- Campbell, D., L. Dirks, O. Naim, and A. Wade. Article Authoring Add-in for Word. Microsoft Research, n.d., http://research.microsoft.com/en-us/projects/authoring/.
- Cheng, X., C. Dale, and J. Liu. “Understanding the characteristics of internet short video sharing: YouTube as a case study”, Technical Report arXiv:0707.3670v1 (2007), http://arxiv.org/PS_cache/arxiv/pdf/0707/0707.3670v1.pdf.
- Clemens, Raymond, and Timothy Graham. Introduction to Manuscript Studies. Ithaca, NY: Cornell University Press, 2007.
- Dyson, M. “The influence of reading speed and line length on the effectiveness of reading from screen.” International Journal of Human-Computer Studies 54, no. 4 (2001): 585–612, http://dx.doi.org/10.1006/ijhc.2001.0458.
- Dyson, Mary. “How physical text layout affects reading from screen.” Behaviour & Information Technology 23, no. 6 (2004): 377–93, http://dx.doi.org/10.1080/01449290410001715714.
- Fenner, M. “A very brief History of Scholarly HTML.” In Gobbledygook (blog), 2011, http://blogs.plos.org/mfenner/2011/03/19/a-very-brief-history-of-scholarly-html/.
- Fitzpatrick, K. “Peer-to-Peer Review and the Future of Scholarly Authority.” Cinema Journal 48 (2009): 124–29, http://dx.doi.org/10.1353/cj.0.0095.
- Giles, C. L., K. D. Bollacker, and S. Lawrence. “CiteSeer: An automatic citation indexing system.” In ACM Conference on Digital Libraries, 1998, 89–98, http://dx.doi.org/10.1145/276675.276685.
- Groth, P., A. Gibson, and J. Velterop. “The Anatomy of a Nano-publication.” Journal of Information Services and Use 30, no. 1–2 (2010).
- Hemminger, Bradley M., Dihui Lu, KTL Vaughan, and Stephanie J. Adams. “Information seeking behavior of academic scientists.” Journal of the American Society for Information Science and Technology 58, no. 14 (2007): 2205–2225, http://dx.doi.org/10.1002/asi.20686.
- Johns, Adrian. The Nature of the Book: Print and Knowledge in the Making. Chicago: University of Chicago Press, 1998.
- Johnston, M. “Changing the paradigm: The role of self-archiving and institutional repositories in facilitating global open access to knowledge.” Access to Knowledge 2, no. 2 (2010).
- King, Donald W., Carol Tenopir, Songphan Choemprayong, and Lei Wu. “Scholarly journal information-seeking and reading patterns of faculty at five US universities.” Learned Publishing 22, no. 2 (2009): 126–44, http://dx.doi.org/10.1087/2009208.
- King, Gary. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing,” Sociological Methods Research 36, no. 2 (2007): 173–99, http://dx.doi.org/10.1177/0049124107306660.
- Lacy, Norris J. “Camelot.” In The New Arthurian Encyclopedia, edited by N. J. Lacy. New York: Garland, 1991, 66–67.
- Malik, Abdul Rahman Abdul. “Greening the Architectural Curriculum in all the Malaysian Institutes of Higher Learning—It is not an option.” ArchNet-IJAR: International Journal of Architectural Research 4, no. 2–3 (2010): 44–53, http://archnet.org/library/pubdownloader/pdf/11146/doc/DPC2067.pdf.
- Mariani, S. “Metadata Extraction from PDF Papers for Digital Library Ingest.” In Proceedings of the 10th International Conference on Document Analysis and Recognition, 2009, http://dx.doi.org/10.1109/ICDAR.2009.232.
- Martin, D. Book Design: A Practical Introduction. New York: Van Nostrand, 1990.
- Olive, T., J-F Rouet, E. François, and V. Zampa. “Summarizing digital documents: effects of alternate or simultaneous window display.” Applied Cognitive Psychology 22, no. 4 (2008): 541–58, http://dx.doi.org/10.1002/acp.1380.
- Paskin, N. “Digital Object Identifier (DOI) System.” In Encyclopedia of Library and Information Sciences. New York: Taylor and Francis, 2008, http://www.doi.info/overview/DOI-ELIS-Paskin.pdf.
- Pettifer, S., P. McDermott, J. Marsh, D. Thorne, A. Villeger, and T.K. Attwood. “Ceci n’est pas un hamburger: modelling and representing the scholarly article.” Learned Publishing 24, no. 3 (2011), http://dx.doi.org/10.1087/20110309.
- Priem, J. “Has journal commenting failed?” In jasonpriem.com, 2011. http://jasonpriem.com/2011/01/has-journal-article-commenting-failed/.
- Priem, J., and B. M. Hemminger. “Decoupling the Scholarly Journal.” Frontiers in Computational Neuroscience: Special Issue on Beyond Open Access: Visions for Open Evaluation of Scientific Papers by Post-Publication Peer Review (Under review, 2011), https://docs.google.com/document/d/1xDOy9GXXrUFc9TUIR2C470DTau8JEgZ9k-SMNIx5pb8/edit?hl=en_US&authkey=CMeCqOYD&pli=1.
- Shotton, David. “The Five Stars of Online Journal Articles.” Open Citations and Semantic Publishing, October 17, 2011, http://opencitations.wordpress.com/2011/10/17/the-five-stars-of-online-journal-articles-3/
- Suhonos, M. J. “Semi-automatic Citation Correction with Lemon8-XML.” Code4Lib Journal 6 (2009), http://journal.code4lib.org/articles/1011.
- Teeuwen, Mariken J. “Glossing in Close Co-operation: Examples from Ninth-Century Martianus Capella.” In Practice in Learning: The Transfer of Encyclopaedic Knowledge in the Early Middle Ages, edited by Rolf H. Bremmer Jr. and Kees Dekker, 85–99. Paris: Peeters, 2010.
- Tenopir, Carol, Donald W. King, Jesse Spencer, and Lei Wu. “Variations in article seeking and reading patterns of academics: What makes a difference?” Library & Information Science Research 31, no. 3 (September 2009): 139–48, http://dx.doi.org/10.1016/j.lisr.2009.02.002.
- Warnock, J. “The Camelot Project.” PlanetPDF (1991).
- Willinsky, J. “Open access: Reading (research) in the Age of Information.” In 51st National Reading Conference Yearbook, edited by D. L. Schallert, C. M. Fairbanks, J. Worthy, B. Maloch, and J. V. Hoffman. Oak Creek, WI: National Reading Conference, 2003, 32–46.
- Witten, I. A., D. Bainbridge, and D. M. Nichols. How to Build a Digital Library. Burlington, VT: Morgan-Kaufmann, 2010.
Manual on Methods of Reproducing Research Materials
This article’s laboring over the design and use of the PDF in scholarly communication is not without historical precedent. In 1936, the Joint Committee on Materials for Research of the Social Science Research Council issued the Manual on Methods of Reproducing Research Materials, under the direction of Robert C. Binkley. It is a remarkable document in the scope and detail of its concerns with the methods of publishing. It does nothing less than demonstrate by example the various means of distributing and publishing scholarly materials of the day, with actual instances displayed and bound into the book, whether of typescript, mimeograph, blueprint, photostat, photo-offset, microcopy, or, more obscurely, gelatin hectograph and dexigraph (for example, the hectograph copies are numbered: “This is the 149 copy from the master sheet”; p. 44, fig. 24). The manual includes a detailed analysis of costs (using 100,000-words-by-copies as the cost unit), as well as evaluations of permanence and legibility. For example, Binkley deals with formulas for calculating typographical line-lengths for “efficient reading performances” (p. 14, fig. 3), before concluding, with some relevance to the PDF, that “the standard typewriter page of 8½” wide must be used with wide margins if the script is to be legible” (p. 40). On this and other questions, Binkley attests to “a certain disharmony between the interests of economy and legibility” (p. 40). Yet out of this science of scholarly communication, he envisions a “broadening of the base of the pyramid of scholarly activity . . . and the democratization... of scholarship” (pp. 201–2). For our part, we believe that improving the readability and functionality of the PDF research article, by using its existing capacities, could contribute to Binkley’s vision, especially in combination with increased public and open access to that research.
Whether this will “lead the whole population toward participation in a new cultural design,” as Binkley optimistically concludes his impressive report/exhibition, remains an open question (p. 202).
The Camelot title speaks to the fictional seat of authority of King Arthur’s fantastical court. The Wikipedia entry on “Camelot” cites Norris J. Lacy on the point of portability, from the very year Warnock first used the project title: “Camelot, located nowhere in particular, can be anywhere” (1991, p. 66).
The pdfforge project is an example of a simple, open-source PDF generator that takes advantage of the format’s open, no-longer-proprietary standard.
See the Beyond the PDF conference website, which provides a snapshot of post-PDF publishing solutions as of early 2011.
Further examples of alternative file formats include the International Digital Publishing Forum’s ePub, Wolfram Alpha’s CDF, and the Mayo Clinic’s NGP. Yet the achievement of an established standard like PDF is not to be taken lightly. A decade ago, mounting a two-minute video clip was plagued by proprietary encoding standards, calling for RealPlayer, QuickTime, or Windows Media decoders, to name just a few. YouTube put an end to a decade of confusion virtually overnight by using Flash—itself now on the chopping block (Cheng et al., 2007).
We work with the Public Knowledge Project, the developer of Open Journal Systems, an open source publishing platform with over 10,000 installations. In Open Journal Systems installations, PDF article views appear to outstrip HTML by a factor of at least 9 to 1. This statistic has been anecdotally corroborated by other electronic document repositories, with the notable exception of the PLoS journals, whose public statistics attest to numbers that are almost exactly opposite. We can only speculate as to why this might be, although it likely has something to do with the fact that PLoS’ HTML is both expertly designed and given primacy of presentation. As the entire suite of journals is public on the open web, there is no need for an article abstract landing page, and all DOIs point directly toward HTML full text.
David Shotton (2011) has recently proposed a “Five Star” rating system for online journal articles, with stars awarded for open access, enriched content, peer review, machine-readable metadata, and available datasets, which suggests a common set of standards that we are seeking to ensure the PDF article can meet.
For example, Douglas Martin, in Book Design: A Practical Introduction, speaks of a “secret canon, which underlies many late medieval manuscripts and incunabula [early printed books]” that was discovered by Jan Tschichold in 1953: “Page proportion 2:3. Text and page area of the same proportions. Height of the text area equal to the width of the page” (1990, p. 46). Martin then allows that “unfortunately such proportional canons can hardly ever be used in a pure form today because they tend both to require specific non-standard page formats as a starting point and to yield deluxe margins which are out of place in the modern world” (p. 46).
A universal bibliographic database that would generate bibliographies on the fly, based on each published work having a single reference number that links it back to that citation in the universal database, is a concept that was introduced well over a decade ago (Cameron, 1998).
The DOI provided by Crossref (e.g., doi:10.1093/rheumatology/ken311), can be rendered a hyperlink by appending http://dx.doi.org/ to the beginning of it, which will lead to the publishers’ site for the article. CrossRef offers a batch-look-up service that will identify the DOIs for a set of an article’s references that can be used by authors or publishers.
To create a PubMed lookup, a PMID is added to a NLM URL in this way: http://www.ncbi.nlm.nih.gov/pubmed/18276894.
For the Google Scholar look-up, the first author and the full title of the article are used. In the example given in fig. 7, the resulting URL is: http://scholar.google.com/scholar?hl=en&safe=off&q=author%3AMay+%22Tropical%20arthropod%20species,%20more%20or%20less?%22.
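The three look-up patterns described in these notes amount to simple string construction. A sketch (the function names are our own; the identifiers are those used in the examples above, and the exact percent-encoding here differs slightly from the fig. 7 URL):

```python
from urllib.parse import quote

def doi_link(doi):
    # A CrossRef DOI resolves by prefixing the dx.doi.org resolver
    return "http://dx.doi.org/" + doi

def pubmed_link(pmid):
    # A PubMed record resolves by appending the PMID to the NLM URL
    return "http://www.ncbi.nlm.nih.gov/pubmed/%s" % pmid

def scholar_link(first_author, title):
    # A Google Scholar look-up by first author and quoted full title
    query = 'author:%s "%s"' % (first_author, title)
    return "http://scholar.google.com/scholar?q=" + quote(query)

print(doi_link("10.1093/rheumatology/ken311"))
print(pubmed_link(18276894))
print(scholar_link("May", "Tropical arthropod species, more or less?"))
```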
For example, if Adobe Acrobat (Version 10) is used to prepare the PDF, then the publisher needs to ensure that in saving the Word document as a PDF, the steps include clicking on (a) Save As . . . ; (b) Reader Extended PDF . . . ; and (c) Enable Commenting and Measuring.
The PDF’s 2008 release as an open standard (http://www.adobe.com/pdf/release_pdf_faq.html) has not yet led it to be adapted into any OAI metadata format.
It is worth noting that the Association for Computing Machinery’s Special Interest Group on Computer-Human Interaction (ACM SIGCHI), by comparison, favors a distinctive landscape view for its articles, allowing for greater flexibility with non-text content, but leading to complications in nonstandard reading environments.
Among hyperlink-unfriendly PDF software, Adobe Reader unrelentingly warns a user that any given link may be malware and will not be followed without confirmation, and Google Chrome’s native PDF viewer does not colorize or underline text links, so they are often simply not noticed.
The U.S. National Library of Medicine’s NLM-XML specification, which drives PubMed data, has been used to structure scholarly content in many disciplines (i.e., not only biomedical research), and would make a fine candidate for broader standardization. See also Campbell et al. (2009).
Under a PDF’s Document Properties, using Adobe Acrobat (Version 10), there is an “Additional Metadata . . .” button in the Description tab. It leads to Adobe’s Extensible Metadata Platform (XMP), which provides a template for the basic metadata for the document (title, author, description, copyright status, etc.). This metadata needs to be “injected” into the container, potentially by utilizing open source software such as pdftk and iText.
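The injection step presupposes an XMP packet to inject. A minimal sketch of building one with the Python standard library (the Dublin Core fields and the flat packet structure are simplifying assumptions on our part; a full XMP packet wraps this RDF in an x:xmpmeta element):

```python
import xml.etree.ElementTree as ET

# Namespaces used in XMP packets: RDF as the carrier, Dublin Core for
# the basic bibliographic fields (title, creator, etc.)
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def build_xmp(title, creator):
    """Build a minimal XMP-style RDF packet for a journal article."""
    rdf = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(rdf, "{%s}Description" % RDF)
    ET.SubElement(desc, "{%s}title" % DC).text = title
    ET.SubElement(desc, "{%s}creator" % DC).text = creator
    return ET.tostring(rdf, encoding="unicode")

packet = build_xmp("Refurbishing the Camelot of Scholarship",
                   "Willinsky, J.")
print(packet)
```

Handing the resulting string to a tool such as pdftk or iText would then complete the injection into the PDF’s container.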
One of the authors reportedly keeps a fax machine in his closet, and every so often takes it out to hug it in order to remind himself that it is still there. He acutely notes: “If nothing else, PDF is fortunate in that it would have a difficult time being the recipient of so much physical affection. Heaven knows that I trust it the way I do any cherished vestigial organ—a pinky toe, if not an appendix, waiting uselessly to burst and kill me without warning.”