Scientific knowledge is increasingly being created and recorded in electronic forms, yet today's computer systems are poorly suited for the long-term retention of information. Unless conscious efforts are made, important knowledge will be lost to future scientists and historians.

The general question of preservation of digital information has recently emerged as a major topic for research. The underlying issues were first elaborated by the Research Libraries Group's Task Force on Archiving of Digital Information; several projects of the Digital Libraries Initiative emphasize preservation, including Cornell University's Prism project on information integrity.

Most of the work to date seeks general principles that apply to a wide range of preservation challenges. This paper takes the opposite approach. It is an outgrowth of a discussion paper that was prepared for a meeting at the Council on Library and Information Resources in September 1999 to discuss preservation of journals in digital form. It makes no attempt to address the general problems of preservation, but concentrates on three case studies: the ACM Digital Library, the Internet RFC series, and D-Lib Magazine. Those examples were chosen as typical publications where the definitive versions are already in electronic formats and maintained online. The ACM Digital Library provides the electronic versions of journals that mainly originated in print. The other two are novel forms of digital publication that were made possible by the development of the Internet. (Conversely, the development of the Internet was greatly helped by open access to the Internet RFCs.)

This paper asks what can be done today that will help to preserve the information contained in these three examples for scientists and historians a hundred years from now. The answers are partly technical and partly organizational.

The Case Studies

The ACM Digital Library

The Association for Computing Machinery (ACM) is a professional society that publishes research journals and magazines in computer science. It also organizes a wide variety of conferences, many of which publish proceedings. ACM is typical of the publishers that have moved rapidly into electronic publication of conventional journals. In 1993, the ACM decided that its future publication process would be a computer system that creates a database of journal articles, conference proceedings, magazines and newsletters, all marked up in SGML. Subsequently, ACM also decided to convert large numbers of its older journals and build a digital library covering its publications from 1985. The digital library will eventually extend back to ACM's foundation in 1948. (See Bernard Rous in the June 1999 Journal of Electronic Publishing.)

The main collection came online in 1997. It has a Web interface that offers readers the opportunity to browse through the contents pages of the journals and to search by author, keyword and subject classification. Behind the Web interface lies a relational database, which is accessed through a set of CGI scripts.

The ACM Digital Library is available only from ACM. For performance reasons, ACM is negotiating to deliver its information to users through a private company that will use a private network to mirror the publications. That will greatly reduce access delays, particularly outside North America, but the mirroring is purely for performance, not for preservation.

For most of its publications, ACM continues to provide printed versions, which are generated from the SGML database. Since the ACM Digital Library became available online, demand for the online service has exceeded every forecast, while demand for the printed versions of the same journals has dropped sharply. The association expects to abandon the printed versions if and when the demand drops to uneconomic levels. No date has been set; a reasonable guess is that the printed versions of most journals will be withdrawn within the next five to ten years.

The Internet RFC Series

The Internet RFCs are the heart of the primary literature that documents the technology of the Internet. The initials "RFC" once stood for "Request for Comment" but that name long ago ceased to be appropriate. The 2,700 RFCs form a series that goes back thirty years. They include the formal specification of the TCP/IP protocols, Internet mail, components of the World Wide Web, and many more technical standards. Moreover, they are the only records of the technical discussions behind the development of much of modern networking.

RFCs have never been published on paper, though in the early years hard copy was available on request. Originally they were available over the Internet by FTP, more recently by the Web. Most are text-only with no graphical or other formats; a few have PostScript versions with additional graphics. Various indexes have been developed, but they are generated automatically. No metadata is provided beyond a number, category, a list of authors, and the title. Until fairly recently the older RFCs were not collected systematically and some of the older RFCs have been lost.

The organization behind the RFCs is complex. The RFCs are the official publications of the Internet Engineering Task Force (IETF), but responsibility for publication of the RFCs lies with the Internet Society (ISOC). The secretariat of the IETF coordinates the Internet Draft process that leads to the creation of the RFCs, but the RFC Editor maintains the RFC series. At present the secretariat is based at the Corporation for National Research Initiatives (CNRI), with services provided by Foretec Seminars, a subsidiary of CNRI. The RFC Editor is at the Information Sciences Institute, a semi-autonomous unit of the University of Southern California.

D-Lib Magazine

D-Lib Magazine is a monthly magazine that publishes articles about digital library innovation and research. Since its first issue in July 1995 it has become one of the primary sources for information about digital libraries. D-Lib Magazine is representative of a number of important Web serials. The following comments are broadly applicable to other open-access serials, such as the Journal of Electronic Publishing, RLG DigiNews [formerly http://www.rlg.org/preserv/diginews/], First Monday, iMP [formerly http://www.cisp.org/imp/], Ariadne, and many more.

D-Lib Magazine uses basic Web technology. Articles are formatted in HTML, with images and other material in the standard Web formats of today. Efforts are made to ensure that the magazine is accessible from standard Web browsers, but there is no systematic enforcement of any markup standards. Recently, D-Lib Magazine has been introducing new metadata methods. Each article has a Digital Object Identifier (DOI) and an associated file containing simple metadata. The metadata uses fields from the Dublin Core and is marked up in XML.
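A simple metadata file of this kind can be pictured with a short Python sketch. The article does not give the layout of the magazine's actual files, so the wrapper element name and all field values here are assumptions made up for illustration; only the Dublin Core namespace and the element names (title, creator, identifier, date) come from the Dublin Core element set itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return an XML string with one Dublin Core element per (name, value) pair."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values; the identifier is a placeholder, not a real DOI.
record = build_dc_record([
    ("title", "An Example Article"),
    ("creator", "A. N. Author"),
    ("identifier", "doi:10.0000/example"),
    ("date", "1999-12"),
])
print(record)
```

Because a record like this is small, plain text, and tied to a public element set, it could be regenerated from the article itself if it were ever lost.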

Work is beginning on automatic reference linking from the magazine to other technical literature. That will partially address the problem of links from the magazine being broken. At present internal links are maintained carefully after publication, but external links are not monitored; over time some are broken and references become invalid.
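The distinction between internal and external links can be illustrated with a minimal Python sketch that lists the links in a page and separates those on the publisher's own host from those pointing elsewhere. This is not the magazine's own tooling; the page URL and HTML fragment in the usage note are hypothetical, and the sketch only lists links without testing whether they still resolve.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, page_url):
    """Return (internal, external) lists of absolute URLs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        if urlparse(absolute).netloc == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

For example, calling split_links on a page at the made-up address http://www.dlib.org/dlib/example/issue.html that links to "article.html" and to http://www.acm.org/dl/ would place the first link in the internal list and the second in the external list. The external list is the part that would need periodic monitoring.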

From its origin, D-Lib Magazine has been supported by funds from DARPA grants. It is published by CNRI, a not-for-profit corporation with activities centered on the development of network-based information infrastructure. Currently, the magazine is edited for content by two of us at Cornell University, with production by CNRI.

Implications for Long-term Preservation

Selection

The desire to preserve publications for a hundred years highlights some interesting themes. The first is understanding what should be preserved. The primary information of science comes from many sources. The Internet RFC series, the Genome Database, NASA's photographic archives, D-Lib Magazine and the Journal of Electronic Publishing are not conventional journals, but they are primary sources in their fields.

Publishers and librarians often equate primary information with conventional peer-reviewed journals, but practicing scientists recognize that that is far from accurate. The review process that turns an Internet Draft into a standards track RFC is more thorough than almost any peer review. Conversely, peer-reviewed journals vary greatly in quality, from fundamental importance to an embarrassment.

Requirements

For scientific information, three possible levels of preservation have been proposed. They can be labeled conservation, preservation of access, and preservation of content.

The most demanding is conservation of the full look and feel of the publication. Museums and archives distinguish between conservation of artifacts and preservation of content. Is it sufficient to preserve the scientific information or is it important to conserve the look and feel of those early electronic publications for their historic interest? An early edition of Physical Review is of interest today as a historical artifact as well as for the physics it contains. We must expect that future generations will value publications such as these three examples as the incunabula of electronic publication; they will be of interest for how they use the Internet as much as for the scientific information they contain.

The second level is preservation of access, maintaining both the underlying material and an effective system of access. The ACM Digital Library and D-Lib Magazine both support fairly complex Web sites. Those sites have indexes, search engines, sets of metadata, guidelines to authors, and other materials beyond the actual published articles.

If the objective is to preserve only the scientific content, then a simple warehouse of the articles with minimal metadata is sufficient. Thus, the third and least demanding level of preservation is preservation of content. For example, Elsevier Science maintains a basic warehouse of journal articles that is independent of the various delivery systems that are provided. If the content is preserved, then the scientific knowledge is not lost, but access may be awkward.

Publishers as Archivists

One approach to long-term preservation is to rely on the publisher. If the publisher is actively maintaining a serial, there may be no need for other organizations to duplicate the technical work of preservation.

Studies of preservation emphasize the need to refresh data by periodically copying it from older magnetic storage, and to migrate information to keep current with modern formats and operating environments. When a publisher is actively managing materials, refreshing and migration become routine data processing. All three examples are published by organizations that have strong computing staffs. They regularly replace old hardware and transfer data to the new. They upgrade software packages (such as operating systems and databases) periodically and run tests to ensure that the new systems work with the old data.
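Refreshing can be sketched as a copy followed by a check that the new copy is bit-for-bit identical to the old one. The Python below is a minimal illustration under that assumption, not a description of any of the three publishers' procedures; the function names are invented for the example, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source, destination):
    """Copy source to destination and confirm that the two files match."""
    source, destination = Path(source), Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    before = sha256_of(source)
    shutil.copy2(source, destination)  # copies the data and file timestamps
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch copying {source} to {destination}")
    return after
```

Refreshing of this kind protects only the bits. Migration to new formats is a separate step that changes the bits deliberately, so it cannot be verified by a checksum alone.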

By a strange twist, while active management by the publisher is likely to preserve both content and access, it also increases the need for conscious planning if the original look and feel is to be conserved. Migration to take advantage of new technology preserves the content and often improves current services, but frequently discards the design of early systems. D-Lib Magazine treats each monthly issue as its own package, complete with graphic design elements. That means that a reader who accesses an early issue will see the original design. A more common approach, however, is for the design of a publication to be described by a single package that contains a style sheet and a set of graphical elements. Changes to that package are reflected in all issues of the publication, thus changing the appearance of back issues. While no publisher would reformat its backlist of printed journals, migration of content frequently alters the design of all materials, losing the older design and the organization of the materials forever.

If a publisher is to be relied on for long-term preservation, it must be financially sound and technically skilled. The publisher must value the materials either as a business asset or as an archive that it keeps for the public good. The organizational stability and the commitment of the publisher become major considerations, but no organization is completely safe. This is a time of prosperity in the United States; the next hundred years will surely see financial and political crises, wars, corruption, incompetence, and natural disasters. Tomorrow we could see the National Library of Medicine abolished by Congress, Elsevier dismantled by a corporate raider, the Royal Society declared bankrupt, or the University of Michigan Press destroyed by a meteor. All are highly unlikely, but over a long period of time unlikely events will happen.

Stability for the Next Century

The three examples differ greatly in organizational stability. The scientific and library communities can be reasonably confident that ACM will continue to look after its Digital Library so long as the association exists. As a professional association, ACM sees the Digital Library as one of its great assets. If ACM should ever go out of business or merge with another organization, the Digital Library would be an important asset. The association is prosperous, with 80,000 members and significant financial reserves. ACM is more than fifty years old and could well be active a hundred years from now.

The organizational arrangements for the Internet RFC series are essentially short-term. They work well at present, but surely they will not remain unchanged for a hundred years. While the RFCs remain the working documentation of the Internet, the technical community will look after them, but there is nothing in the present structure that will preserve the RFCs when they cease to be current and become part of the history of science. The Internet Engineering Task Force is a remarkable organization, but the informality that has made it successful is a risk when planning for the long term.

The organizational stability of open-access serials varies greatly. For example, the University of Michigan Press, which publishes the Journal of Electronic Publishing, presumably pays attention to the long term, while SAIC, the commercial company that publishes iMP, makes no long-term promises. CNRI, the publisher of D-Lib Magazine, depends on grant funding. If funding ceased, CNRI might well stop publishing the magazine and freeze the Web site. If sometime in the next hundred years CNRI were to cease operations, such a frozen Web site could easily be lost.

Copyright

In these three examples copyright does not appear to be a barrier to preservation, but for different reasons. The authors of most, but not all, materials in the ACM Digital Library have transferred copyright to ACM. Even where ACM does not own the copyright, it has the rights needed to publish, convert to different formats, and archive the materials. In any possible changes of its copyright policy, ACM would ensure that it had sufficient rights to allow any reasonable preservation policy.

ACM is more generous about copyright than most publishers, but still does not permit copies to be made of the entire library. The RFC series and D-Lib Magazine explicitly allow copies of the materials to be made, at least for noncommercial purposes. Legally no permission is needed to build a complete archive of these serials. The RFCs are often treated as public documents. Authors of RFCs grant very broad rights to ISOC and the IETF. ISOC holds copyright in the recent standards-track documents, but provides them to all users with essentially no restrictions. Authors retain copyright in the materials that appear in D-Lib Magazine. Each author provides CNRI with a release that permits CNRI to publish and maintain the magazine. The magazine is open access and broad permission is given for all noncommercial uses of the material. The entire Web site is mirrored at several sites around the world.

For preservation, those distinctions are more apparent than real. In each case the publisher has all the rights needed to preserve the materials, including the rights to work with other noncommercial organizations for long-term preservation. All of the publishers would be happy to discuss preservation with a well-intentioned library that planned to establish an archive of the serials for future generations.

Technology and Standards

Some of those materials are in such simple formats that refreshing the bits can preserve the content. Others are more complex. The ACM Digital Library is the only one of the three examples to use a standard system of markup, SGML, yet it is the most vulnerable to technical obsolescence. The system is complex technically and it uses a variety of formats (SGML, PDF, and HTML). The use of SGML poses particular problems for preservation. The DTD is specific to ACM and the special-purpose algorithms used to render mathematics from SGML are an essential part of the system. Indeed, the difficulties experienced in rendering mathematics from SGML have led some ACM members to urge the use of a language (TeX) that represents the appearance of mathematics directly. ACM uses a relational database to store the Digital Library material. The database also stores the metadata needed to manage the collection and provide access. The database schema and the metadata are specially designed and not tied to any standard. Thus the ACM Digital Library is dependent on ACM-specific specifications (a DTD, CGI scripts, database schemas, and rendering algorithms). The ACM-specific tools change periodically as ACM improves its system.

At the other extreme, the RFCs were deliberately designed to be extremely simple technically. Each RFC is a single file of ASCII text. RFCs have a carefully controlled layout and the basic descriptive metadata is easily extracted from the text. The short-lived experiment with PostScript was not seen as a success; the few PostScript versions are highly vulnerable to obsolescence since PostScript has many variants.
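How such metadata might be pulled from the text can be shown with a small Python sketch. It assumes a first-page layout in which a header block carries "Request for Comments:" and "Category:" labels and the title is the first non-empty line after that block. The sample text is invented for illustration (12345 is a made-up number), and the heuristics are assumptions that will not fit every RFC, particularly the earliest ones or those with multi-line titles.

```python
import re

# Invented sample in the style of an RFC first page; not a real document.
SAMPLE = """\
Network Working Group                                          A. Author
Request for Comments: 12345                         Example Organization
Category: Informational                                     January 1999


                     An Example Document Title

Status of this Memo
"""


def extract_rfc_metadata(text):
    """Return number, category, and title using simple layout heuristics."""
    lines = text.splitlines()
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1  # skip leading blank lines
    end = start
    while end < len(lines) and lines[end].strip():
        end += 1  # the header block runs to the first blank line
    header = "\n".join(lines[start:end])
    number = re.search(r"Request for Comments:\s*(\d+)", header)
    # Words separated by single spaces; a run of spaces ends the category.
    category = re.search(r"Category:\s*(\S+(?: \S+)*)", header)
    title = next((line.strip() for line in lines[end:] if line.strip()), None)
    return {
        "number": int(number.group(1)) if number else None,
        "category": category.group(1) if category else None,
        "title": title,
    }


metadata = extract_rfc_metadata(SAMPLE)
print(metadata)
```

The point of the sketch is that a few lines of pattern matching suffice when the layout is controlled, which is why the loss of a separate index would not be serious for this series.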

The technical issues of long-term preservation of D-Lib Magazine are shared with many other Web sites. The magazine is fairly simple. It uses no JavaScript or Java applets and tries to avoid the more abstruse aspects of HTML. Therefore, little of the content will be lost even when Web browsers cease to support current versions of the various formats. There should be safety in numbers, too: because a huge number of Web sites use the same formats, migration tools will almost certainly be widely available (e.g., tools to convert HTML to XML).

In these three examples, metadata is not crucial to preservation because most vital metadata is embedded within the documents. The descriptive metadata that exists can easily be recreated if necessary. Preservation of structural metadata is slightly more important. A database schema is used to manage the ACM Digital Library. That is essential for preservation of access, but not for preservation of content; it will surely be modified with time. D-Lib Magazine, in common with other Web publications, is highly dependent on internal hyperlinks, and hence on the directory structure used to store the magazine.

Strategies

Long-term preservation requires organizations that are committed to the long term. Candidates include the national libraries, scholarly societies, charitable foundations, and major university libraries. It is no accident that these are all not-for-profit. Preservation is a service to the future that cannot depend on financial rewards.

For the scientific serials discussed in this paper, preservation appears likely to go through two phases, a period of active management by the publisher followed by preservation independent of the original publisher. For the ACM Digital Library, the duration of the first phase may be measured in centuries. For the Internet RFCs and D-Lib Magazine the first phase is unlikely to continue for more than a few decades, and could be less.

Partnerships with Publishers

A theme that runs through the examples is the need for the scientific and library communities to build partnerships with the publishers. One way to minimize the risk that valuable information will be lost is for publishers to make arrangements with libraries to replicate their collections. That ensures that separate copies of the publications always exist. If a publisher should subsequently cease to maintain its collections, those libraries become the long-term preservationists. The American Physical Society is discussing such a collaboration with Cornell University Library, and Elsevier has recently announced a policy that should lead to all its journals being protected in the same way.

An option that is sometimes proposed is for the publisher (e.g., ACM) to take the primary role in maintaining, upgrading, and migrating the system and its content. At the same time, one or more libraries accept snapshots of the system and its content on a regular basis, to be archived as protection against catastrophic events. A legal agreement would be drawn up between publisher and the libraries listing the circumstances under which the archives can be made available to general users. A side benefit of snapshots is that they conserve the design of the site at specific moments. Almost certainly, a hundred years from now such a snapshot would not be immediately usable, but it would provide the raw material for the digital archeologist. It is the digital analog of a dusty box of papers found in an attic.
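One way a library might record what each snapshot contains is a manifest that maps every file to a checksum, so that two snapshots can be compared later. The Python below is a sketch under that assumption; the manifest format and function names are invented for the example and are not drawn from any existing agreement between a publisher and a library.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            manifest[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def compare_manifests(older, newer):
    """List the files added, removed, or changed between two snapshots."""
    shared = set(older) & set(newer)
    return {
        "added": sorted(set(newer) - set(older)),
        "removed": sorted(set(older) - set(newer)),
        "changed": sorted(p for p in shared if older[p] != newer[p]),
    }
```

A manifest stored alongside each snapshot would also tell a future reader whether the files in the box are still the ones that were deposited.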

Snapshots alone are not attractive to libraries, however, unless they are combined with current access to the publications. A more attractive alternative is for the library to maintain an up-to-date copy of the materials for both local access and preservation. That could even be managed as a mirror site for the publisher. Each publisher could collaborate with one or two libraries and the total effort would not be excessive.

Preservation Independent of the Original Publisher

Over time, the volume of material that is no longer actively maintained by its original publisher will grow. Some publications have sufficient financial value that others may be prepared to maintain them and generate revenue from them. JSTOR, which digitizes backruns of important journals, is the leading example of such an initiative. Hopefully, such activities will flourish, but they will not cover everything. Many sources of scientific information will have to be preserved for their cultural value alone. Preservation will have to be by organizations that are independent of the publisher. That is a function of research libraries that continues naturally into the digital future.

The Library of Congress could play a special role. A prime function of the Library of Congress is to collect the cultural and intellectual output of today for the benefit of future generations. No legal changes are needed for the library to extend its mission to collecting and preserving information that is created in digital formats. It could be accomplished by partnerships with selected publishers, combined with acquiring and preserving materials that are not actively maintained by others. As a first step, the Library of Congress has an innovative agreement with UMI (now known as Bell and Howell Information and Learning) in which the library essentially designates that company as the long-term archive for most university theses. As yet, however, the library has paid little attention to information that is created in digital forms. Its digital-library efforts focus on converting physical artifacts to digital form.

Preservation of open-access materials like the Internet RFCs can proceed independently of partnerships with publishers. The Internet RFCs are now maintained by organizations that do not have long-term preservation as a priority. Because they are technically very simple, a possible strategy for preservation is for one or more libraries to announce a public commitment to acquire every RFC as it is published and to preserve the content for the long term. While it is sensible for those libraries to work with ISOC and the RFC Editor, it is perfectly possible for them to act independently.
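A library making such a commitment would need a routine check that its holdings are complete. The Python sketch below lists the numbers missing from a local collection, assuming files are named in the pattern rfcNNNN.txt and that the highest published number is known; both the naming pattern and the function name are assumptions made for the example.

```python
import re


def missing_rfc_numbers(filenames, highest):
    """Return the numbers from 1 to highest with no matching rfcNNNN.txt file."""
    held = set()
    for name in filenames:
        match = re.fullmatch(r"rfc(\d+)\.txt", name)
        if match:
            held.add(int(match.group(1)))
    return [n for n in range(1, highest + 1) if n not in held]


# Hypothetical holdings: numbers 3 and 5 are absent.
print(missing_rfc_numbers(["rfc1.txt", "rfc2.txt", "rfc4.txt", "notes.txt"], 5))
```

A gap in the list is only a prompt for investigation, since a number may never have been issued rather than lost.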

Exhortation

Preservation of digital information is not a single issue with a single universal solution. As these three examples demonstrate, much can be done immediately.

Information is especially vulnerable when the original publisher ceases to maintain it actively. That transition can be abrupt, caused by a natural catastrophe, war, bankruptcy, or other disaster. More often, however, the publisher slowly loses interest. Physical media decay, software systems become obsolete, and the expertise needed to manage the collections disperses. Those are already problems with research information mounted on Web sites.

The solution lies in preparation during the period of active management, so that the technical and legal arrangements for subsequent preservation are already in place. Hopefully, the planned partnership between the American Physical Society and the Cornell University Library will be one of many. If libraries and publishers work together, it should be possible to preserve all the primary materials of science for future generations.



William Y. Arms is Professor of Computer Science at Cornell University. He has a background in mathematics, operational research, and computing, with degrees from Oxford University, the London School of Economics, Sussex University and Dartmouth College. He has very broad experience in applying computing to academic activities, notably educational computing, computer networks, digital libraries, and electronic publishing. His career includes positions as Vice Provost for computing at Dartmouth College, Vice President for Computing at Carnegie Mellon University, and Vice President of the Corporation for National Research Initiatives. He is currently chair of the Publications Board of the Association for Computing Machinery, editor-in-chief of D-Lib Magazine and on the Management Board of the MIT Press. MIT Press published his book, Digital Libraries, in November 1999. You may contact him by e-mail at wya@cs.cornell.edu.


Links from this article:

ACM Digital Library, Association for Computing Machinery, http://www.acm.org/dl/.

Ariadne, The UK Office for Library and Information Networking, http://www.ariadne.ac.uk/.

Cornell University, Project Prism, Digital Libraries Initiative Phase 2, http://www.library.cornell.edu/iris/research/prism/.

D-Lib Magazine, http://www.dlib.org/.

First Monday, http://www.firstmonday.dk/.

iMP: Information Impacts, Center for Information Strategy and Policy (CISP), [formerly http://www.cisp.org/imp/].

Internet Engineering Task Force, Requests for Comments, http://www.ietf.org/rfc.html.

Journal of Electronic Publishing, University of Michigan Press, http://www.journalofelectronicpublishing.org/.

JSTOR: Journal Storage Redefining Access to Scholarly Literature, http://www.jstor.org/.

RLG DigiNews, [formerly http://www.rlg.org/preserv/diginews/].

Research Libraries Group Task Force on Archiving of Digital Information, Preserving Digital Information: Final Report and Recommendations, March 1996, http://www.rlg.org/legacy/ftpd/pub/archtf/final-report.pdf.

Rous, Bernard. ACM: A Case Study. Journal of Electronic Publishing 4(4) June 1999.


Acknowledgments

This article is based on a discussion paper prepared for a meeting of the Council on Library and Information Resources on September 27, 1999. Many of the ideas raised at the meeting are reflected in this new version, though naturally I take responsibility for errors and omissions. The meeting was attended by Scott Bennett (Yale University), Pieter Bolman (Academic Press), Robert Bovenschulte (American Chemical Society), Martin Blume (American Physical Society), William Gosling (University of Michigan), Rebecca Graham (Digital Library Federation), Kevin Guthrie (JSTOR), Karen Hunter (Elsevier Science), Michael Keller (Stanford University), Richard Lucier (University of California), Clifford Lynch (Coalition for Networked Information), Deanna Marcum (Council on Library and Information Resources), Elaine Sloan (Columbia University), Abby Smith (Council on Library and Information Resources), Michael Spinella (American Association for the Advancement of Science), Winston Tabb (Library of Congress), Sarah Thomas (Cornell University), Donald Waters (Andrew W. Mellon Foundation) and myself.

This work has been supported in part under DARPA Grant No. N66001-98-1-8908.