Electronic formats are the latest in a long line of media that humankind has used to record knowledge and experience. The revolution in publishing wrought by the Internet has presented libraries with mighty challenges in carrying out their business of collecting, describing, storing, providing access to, and preserving information for current and future researchers. National and other deposit libraries have a particularly unenviable task in preserving national heritage in published formats, because it is their business to ensure the survival of as many online publications as possible. This is the task that my colleagues and I have undertaken at The National Library of Australia.

When deposit libraries consider preservation of an item of national heritage, it is not just a matter of ensuring it is available for five, ten or fifty years. Such libraries work in terms of hundreds, even thousands, of years. This is a difficult enough brief in relation to traditional library materials such as works on paper, microfilm, and film. For works in digital formats, it is even more difficult, and the library and archival communities have not yet clearly established how it might be done.

For instance, the hardware and software required to read electronic documents are evolving fast, and the technology may be so different in ten years that it will not work with older documents. All Web documents rely on software and hardware for their creation and display, and that software and hardware become obsolete as new versions and models are brought to market. While there is sometimes an attempt to make software backward compatible (that is, able to handle earlier versions), that usually accommodates only one or two generations of changes. With the typical 18-month life cycle for hardware and software, within ten years we will not be able to display many older documents. In addition, documents on the Internet are particularly ephemeral, with frequent changes of content and location. Often they disappear altogether. To ensure that important information published electronically is preserved for future use, we must identify and collect (archive) it, and we must develop means for preserving it.

There are two broad categories of electronic publications: those published in physical format such as floppy disk or CD-ROM, and those published online, as on the World Wide Web. They have some preservation needs and characteristics in common, but also have some that are quite different. This paper discusses issues that the National Library of Australia considers to be crucial to achieving long-term access to online publications and describes the work that is being undertaken here toward that end.

Ensuring long-term access to online publications is a two-step process. First, the materials have to be identified, collected and made accessible in their current format (the archiving process). Second, the materials have to be managed in such a way that they remain accessible as technology changes (the preservation process). To date, it is the first step that has been the main focus of attention at the National Library of Australia.

Online publications have characteristics that more traditional library materials usually do not have, and that present libraries with new challenges in attempting to manage them. Foremost among these is their dynamic nature: They are subject to changing content, changing location, or disappearance altogether from the Internet. While some online publications undergo an editorial process or peer review, many more do not go through the quality control that has been common in print publishing. The number of online publications, especially in relation to the paucity of resources allocated to managing them, is another major obstacle to those libraries with responsibility for preservation of the national output.

Those characteristics of online publications have important implications for long-term access. It is technically complex and therefore costly to collect, store, provide access to, maintain, and ultimately preserve online publications. The costs are related not only to the day-to-day operational work but also to the necessity of building a sophisticated technical infrastructure. Because electronic publications are easier to copy and alter than print publications, there are sensitive issues to be dealt with relating to intellectual-property rights and authenticity. The volatility of online publications requires mechanisms for version control and location such as unique identifiers and permanent naming.

The National Library of Australia, through its PANDORA Project (Preserving and Accessing Networked Documentary Resources of Australia), has been working at two levels in its efforts to ensure long-term access to Australian online publications. At a conceptual level, the Library has defined its business processes in a Business Process Model, and identified the data that will need to be collected for current and future management of each title in a Logical Data Model. In addition, in December 1998, the Library published its Digital Services Information Paper, which sets out requirements for a technical infrastructure to collect, store, provide access to, and manage its PANDORA Archive of Australian online publications, as well as to support the management of other digital and paper-based collections.

Concurrently, the Library has been working at a practical level, implementing the business principles by developing selection guidelines, liaising with publishers, and building a small archive of titles, which by the end of April 1999, numbered 190 and occupied approximately five gigabytes of storage space.

Our purpose is to make Australia's cultural heritage available to future generations, as well as to today's scholars and researchers. Because to date there is little commercial publishing on the Internet in Australia, we have not yet had to deal with the complications of archiving subscription-only publications. We have, however, developed principles for managing commercial publications and have begun discussion with publishers on implementation.

The National Library has received no additional funding to undertake this new business. When we began our experimentation with online publications, we had no resources for the work. However, to move forward and to increase our knowledge of what is involved, we took staff from other tasks and developed an unsophisticated set of software tools that we are still using. That results in frustrating periodic system failure and labor-intensive operations, but we have learned a lot and demonstrated that at least a start can be made with few resources.

Our process today consists of collecting data, managing metadata, categorizing our collections, assuring quality, and providing free access to the information.

Collecting

To collect online publications, we use a robot to gather desired items from the Internet, or we arrange for the publisher to send us the files on CD-ROM or zipped disk or to transfer them over the Internet. Most titles are gathered, with the permission of the publisher, using version 1.2 of the Harvest [formerly http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html] software developed by the University of Colorado.

As Harvest is an indexing rather than a gathering tool, it has been modified by our IT Section to suit our specific requirements. For example, one of its attributes is the identification and gathering only of those files that have changed since the last visit to a given site. However, we require all the files at a specified site at the time of harvesting, not just the ones that have changed, because we do not want our staff to have to integrate the new files with the old. Harvest also overwrites the old files with the new and discards files over a certain age, which obviously is unsuitable to a situation where the aim is to retain the historical version of a publication. We have reconfigured Harvest to bring back all of the files at the specified site, and store them in the archive as a new version of the title alongside earlier versions, and to retain all files indefinitely.
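The storage behaviour described above (every gather kept as a complete new version beside the earlier ones, with nothing overwritten or discarded) can be illustrated with a minimal sketch. This is not the Library's modified Harvest code. The function names, the dated directory layout, and the title identifier are all assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import List


def store_snapshot(gathered_dir: Path, archive_root: Path,
                   title_id: str, gathered_on: date) -> Path:
    """Copy every gathered file into a new, dated version directory.

    Hypothetical layout: <archive_root>/<title_id>/<YYYY-MM-DD>/...
    Earlier versions are never overwritten or deleted.
    """
    version_dir = archive_root / title_id / gathered_on.isoformat()
    if version_dir.exists():
        # Refuse to overwrite: each gather must become its own version.
        raise FileExistsError(f"version already archived: {version_dir}")
    # copytree creates version_dir (and any missing parents) and copies
    # the whole tree, not only the files that changed since the last visit.
    shutil.copytree(gathered_dir, version_dir)
    return version_dir


def list_versions(archive_root: Path, title_id: str) -> List[str]:
    """Return the archived version labels for a title, oldest first."""
    title_dir = archive_root / title_id
    return sorted(p.name for p in title_dir.iterdir() if p.is_dir())
```

Because each version directory is a full copy, staff never have to merge new files into old ones, at the cost of storing unchanged files more than once.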

To facilitate the process of gathering many files with different gathering schedules, our IT Section has written a user interface to Harvest, enabling us to submit archiving requests, log problems, and communicate with the programmer for problem solving. Our interface maintains a searchable archive of problem reports and solutions that can be used as a reference to solve new or recurring problems.

Some types of publications do not respond well to Harvest, so we also use WebZip software. This resides on an MS-DOS computer and staff members must unzip and move output files to the UNIX platform where PANDORA resides.

Management of Metadata

Future preservation strategies for online publications will require detailed information about the nature of the item and how it has been treated over time. Future researchers may also want historical information about the items they are using: what format it was originally in, and whether anything has been lost in the capture and preservation process. Day-to-day management of titles for the Archive also requires administrative information such as whether the publisher has given permission to archive.

To date we have no facility for recording the full complement of metadata required for each title as outlined in the Logical Data Model. We await the implementation of a full archive management system. In the meantime, to enable us to document the administrative history of the titles being archived, our IT Section created the PANDORA Archive Management System (PAMS) database. PAMS is rather a grandiose title for what is only a small metadata repository. Yet while it does not provide for all of the data elements that are required for long-term preservation, it does provide us with sufficient information about a title to manage archiving. Once an archive management system is available, the data from PAMS can be migrated to it.

Comprehensive or Selective?

One of the daunting aspects of collecting and managing online publications is the large volume of them vis-a-vis the staff and technical resources that libraries have available to manage them. National libraries are testing two different approaches to the collection of online publications: The Royal Library, Sweden (National Library of Sweden), through its Kulturarw3 Heritage Project, and the University of Helsinki (National Library of Finland) are taking regular snapshots of their nations' entire domain, while others — like us — are being selective.

The National Library of Australia's pragmatic decision to take a selective approach was based on two factors. First, managing online publications is labor intensive and therefore expensive. Second, to the best of our professional judgment, many online publications have low current or future research value. We consider that it is better to identify those titles most likely to be of future research value and to apply our limited resources to managing them to a high standard.

While we do not rule out the possibility of taking a snapshot of the complete Australian domain at some time in the future, when our technical infrastructure can support it, we do not consider it to be a viable option at present.

Regardless of whether a library adopts a comprehensive or selective policy in relation to archiving online documents, the reality is that technical constraints prevent some types of online publications from being downloaded. Those that are constructed as databases, containing programs that create pages on the fly, resist the efforts of currently available robots to gather them.

Identification

Most national libraries have been supported in their task of building heritage collections of print publications by legal-deposit provisions, an important element in ensuring the survival of cultural material. Most legal-deposit legislation was framed long before electronic publications were known, and therefore has not covered library materials in digital formats. That is changing gradually. In Australia, the Commonwealth legislation that governs legal deposit does not yet include electronic publications, although the act is being reviewed and a draft bill is expected soon. The lack of provision for legal deposit of electronic publications to date has meant that we have had to fall back on more labor-intensive means of identifying them and negotiating with the publishers for permission to archive them.

The Library plans to introduce a Services to Publishers page on its Information Server. A component of that page, when legal deposit provisions apply, will be a way for publishers to notify us about titles available for deposit. The Library will accept those that conform to its selection guidelines.

In the meantime, staff of the Australian Electronic Unit whose task it is to build the PANDORA Archive spend time each day surfing the Internet to identify online publications of national significance. A small number of titles are also discovered through print magazines and newspapers, and from subscription to discussion lists. Titles that are identified as potential candidates for archiving are assessed against the selection guidelines.

To be selected for national preservation, a significant proportion of a work should be about Australia, be on a subject of social, political, cultural, scientific, or economic significance and relevance to Australia, or be written by an Australian of recognized authority and constitute a contribution to international knowledge. In addition to items of research value in their own right, we are also selecting a representative sample of publications that, collectively, will give future researchers insight into how Australians have used the Internet to disseminate information about their lives, interests, and concerns. For example, we have included in our collection Trishan's OZ because we think it is important to reflect the interests of young people and their enthusiasm for and skill in using the Internet.

We contact publishers for permission to copy their publications and to store them on the Library's server. Where titles are available free of charge, publishers have been happy to allow us to provide immediate access to the versions in the PANDORA Archive. We include options to link to publishers' sites both from the catalogue record for each item and again from the Archive.

As already mentioned, there is, to date, little commercial publishing on the Internet in Australia. Publishers of commercial titles are understandably concerned about the effect that use of their titles through libraries may have on their financial return. Libraries do not want to put publishers out of business; it is not in our interests to do so, as they are a crucial agent in the information chain. When the National Library does select a commercial title for the Archive, we hope to negotiate with the publisher for permission to archive the title and to provide immediate access to readers within the Library building only. Access to readers outside the building would not be enabled until the agreed period of commercial viability for a particular title had passed. (This would have to be negotiated on a title-by-title basis.) In the interim, external users would therefore be obliged to go to the publisher's site and pay for access. At some time in the future, publishers will go out of business, or will no longer be interested in maintaining titles for which there is little demand. It is important that the deposit libraries keep their publications available, as part of the national documentary record.

Contact with the publisher also helps us sort out technical problems. For particularly complex or large publications, the publisher transfers the files over the Internet or sends them on CD-ROM.

If the publisher grants permission, an archive request is submitted to the management system. Each title selected for archiving is allocated to an appropriate gathering schedule: one-off, weekly, monthly, quarterly, half-yearly, nine-monthly, or annually. We determine the schedule from the title's publication pattern and the stability of its host site, and the system then manages the gathering.
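As an illustration of how a system might turn these schedules into dates, the following is a minimal sketch that computes the next gathering date from the last one. It is not the Library's management system. The function names, the treatment of month-end dates, and the idea of counting from the last gather are assumptions.

```python
import calendar
from datetime import date, timedelta
from typing import Optional

# Month interval for each schedule named in the text.
# None means the title is gathered once and not revisited.
SCHEDULE_MONTHS = {
    "one-off": None,
    "monthly": 1,
    "quarterly": 3,
    "half-yearly": 6,
    "nine-monthly": 9,
    "annually": 12,
}


def add_months(start: date, months: int) -> date:
    """Move forward by whole months, clamping to the end of short months."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_gather(last_gathered: date, schedule: str) -> Optional[date]:
    """Return the next gathering date, or None for a one-off title."""
    if schedule == "weekly":
        return last_gathered + timedelta(weeks=1)
    months = SCHEDULE_MONTHS[schedule]  # KeyError for an unknown schedule
    if months is None:
        return None
    return add_months(last_gathered, months)
```

For example, `next_gather(date(1999, 1, 31), "monthly")` gives 28 February 1999, because the sketch clamps to the last day of a shorter month rather than skipping it.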

Quality Control

After the title has been gathered, our staff compares the archived version with the publisher's site for quality control. Our goal is to achieve an archived version that is identical in every way to the one on the publisher's site, both in terms of content and look and feel. For technical reasons, that is not always achievable, for instance, when the publisher's site includes search software for accessing back issues. Nevertheless, if we cannot attain a perfect result, we still consider that it is desirable to archive, and ultimately preserve, the intellectual content of a resource, even if some of the functionality is lost. As a final step, the publisher is invited to review the result and comment.

Access

One of our fundamental business principles is that we provide free access to the titles in the archive. This access will be managed, as described earlier, so that commercial publishers are not disadvantaged.

Another important business principle also relates to access. Currently, the primary means of access to the Archive is via a search of the Library's Online Public Access Catalogue [formerly http://ilms.nla.gov.au/webpac/] (OPAC) or the National Bibliographic Database (NBD). The user can link directly from the catalogue record to the publication in the Archive. (A link to the publisher's site is also provided.) We believe that it is important to integrate the discovery of electronic publications with all other types of library materials.

We also strongly believe that because electronic publications form part of our national imprint, they should be included in our national bibliography, described using full level MARC (Machine Readable Cataloguing) records, as we do for traditional library materials. We are encouraging other Australian deposit libraries to take the same approach as we work toward a distributed national collection of Australian electronic publications.

A second means of access to the Archive currently available is from a list of titles on the PANDORA Project Home Page.

The catalogue record describes the publication at the 'whole item' level. That is, it describes the electronic journal, not an article within it. In the future, we plan to provide a third means of access to the Archive: access to the content of titles (for instance, to articles within an issue of a journal) via a search facility of an associated metadata repository.

Costs

In our experience, acquiring an online publication is much more expensive than acquiring a print item. With the technology currently available to us, we estimate that it takes one person one full working day to accomplish all of the tasks associated with acquiring the first version of an online publication, which makes it five times more labor intensive than for a print item. In addition to the operational costs, there are the huge costs associated with the development and maintenance of the technical infrastructure required to collect, store, provide access to, and manage online publications.

And that does not include activities associated with preservation. Online publications will need frequent preservation intervention in the form of backup, refreshing of media, migration to new technology platforms, and other preservation strategies, in order to ensure their long-term access.

The British Library published a sobering analysis of the cost involved in the preservation of electronic publications. Although it acknowledged the difficulty of comparing print and digital publications, the study suggests that the cost of managing/preserving a digital publication over a 25-year period is about twenty times greater than it is for print. Those calculations were based on the management of 500 to 1,000 items per year. Once the volume rises to 10,000 or more per year, the unit cost could drop to perhaps five times the cost for print items.
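The effect of volume on these figures can be worked through with a small calculation. This sketch uses only the two multipliers quoted above and expresses totals in units of "the cost of one print item". It treats "twenty times greater" as a factor of 20, makes no assumption about volumes between the two stated points, and the names in it are illustrative only.

```python
# Cost of a digital item relative to a print item over 25 years,
# at the two annual volumes reported in the text.
DIGITAL_TO_PRINT_RATIO = {
    1_000: 20,   # 500 to 1,000 items per year: about twenty times print
    10_000: 5,   # 10,000 or more per year: perhaps five times print
}


def total_cost_in_print_units(items_per_year: int) -> int:
    """Total cost of one year's intake, in print-item cost units."""
    return items_per_year * DIGITAL_TO_PRINT_RATIO[items_per_year]


small = total_cost_in_print_units(1_000)    # 1,000 items at 20x
large = total_cost_in_print_units(10_000)   # 10,000 items at 5x
print(small, large, large / small)          # prints: 20000 50000 2.5
```

On these figures, a tenfold increase in intake would raise the total cost only two and a half times, although each digital item would still cost several times more than its print equivalent.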

Permanent Naming

Unique identifiers were an invention of the print-publishing industry to facilitate the precise identification of a publication or an edition of a publication for distribution and inventory control. The International Standard Book Number (ISBN) and the International Standard Serial Number (ISSN) have been used for electronic publications, but their application rules prevent them from being applied to some types of Internet publications. In any case, they lack an online resolution service, which is vital to Web publications. A permanent naming system for online publications is essential to libraries as well as publishers.

The Uniform Resource Locator (URL), the Internet address of a publication, can change, and something as simple as the reorganization of the publisher's Web site can result in broken links. Nor does the URL uniquely identify a publication. Some titles are located on more than one site (a publisher's main site and a mirror site, for instance), and it may be critical for a user to know whether the versions are identical.

A solution to this problem has been developed by the Internet Engineering Task Force (IETF) URN Working Group in the form of the Uniform Resource Name (URN). Both publishers and libraries are interested in applying this technology. The Digital Object Identifier (DOI) has been developed by the International DOI Foundation on behalf of the publishing industry to provide a framework for managing intellectual content, link customers with publishers, facilitate electronic commerce, and enable automated copyright management. Deposit libraries are interested in an application of URN being explored by the Nordic Metadata Project. It would enable national libraries to create a type of URN based on National Bibliography Numbers (NBNs) to permanently and uniquely name the versions of publications that they have archived, and also to offer as a service to small and noncommercial publishers the creation of URNs for the titles on their sites.
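To make the idea of an NBN-based name concrete, here is a minimal sketch that builds and parses such a URN. The exact syntax is not given in this article; the sketch assumes a `urn:nbn:<country code>-<locally assigned string>` shape, and the function names and the sample identifier are hypothetical.

```python
import re
from typing import Tuple

# Assumed shape: urn:nbn:<two-letter country code>-<local identifier>
_NBN_URN = re.compile(
    r"^urn:nbn:([a-z]{2})-([a-z0-9][a-z0-9._-]*)$", re.IGNORECASE
)


def make_nbn_urn(country_code: str, nbn: str) -> str:
    """Build a URN from a country code and a locally assigned number."""
    urn = f"urn:nbn:{country_code.lower()}-{nbn}"
    if not _NBN_URN.match(urn):
        raise ValueError(f"not a valid NBN-style URN: {urn!r}")
    return urn


def parse_nbn_urn(urn: str) -> Tuple[str, str]:
    """Split an NBN-style URN into (country code, local identifier)."""
    match = _NBN_URN.match(urn)
    if match is None:
        raise ValueError(f"not a valid NBN-style URN: {urn!r}")
    return match.group(1).lower(), match.group(2)


# "example-0001" is a made-up identifier, not a real archived title.
print(make_nbn_urn("AU", "example-0001"))  # prints: urn:nbn:au-example-0001
```

Unlike a URL, a name of this kind says nothing about where the publication is stored, so it can stay the same when files move. A separate resolution service would be needed to map the name to a current location.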

Preservation

As mentioned earlier, ensuring access to online publications is a two-step process: the archiving phase and the preservation phase. This article has dealt almost exclusively with the archiving phase. At the National Library we are aware of the need for preservation activity and have framed our archiving work and our technical specifications for an archive-management system to accommodate the requirements of preservation. We are currently considering what will be needed, both conceptually and practically, to maximize the chances of preserving online publications in the face of rapidly changing technology, especially for large collections of heterogeneous formats.

No generally accepted, proven method for preserving digital objects has yet emerged. Strategies such as migration, emulation, and other models such as Digital Tablets (Kranch 1998) and the Digital Rosetta Stone (Heminger and Robertson) are being researched and each approach has its strengths and weaknesses. The creation of museums of redundant hardware and software is another possible approach, but the National Library, among others, does not consider this a viable strategy. It is likely that libraries will adopt a mix of preservation methods, selecting solutions according to the nature and requirements of the collections involved.

Conclusion

In Australia, the six States and two Territories share responsibility with the National Library for collecting and preserving our documentary heritage at a national, state, and local level. In the print environment, collection building has been supported by legal-deposit legislation at the State and Commonwealth levels, as well as by collecting agreements between the libraries to ensure the development of a comprehensive, distributed national collection of Australian research materials.

Because of the large volume of online materials and the high cost of collecting and managing them, the National Library considers that this cooperative approach to collection development must be extended to electronic resources as well. The Library has conceptualized the underlying principles of a national model, a blueprint for the way in which Australian libraries (and other collecting institutions) might work together to ensure coordinated and long-term access to online publications. Ultimately the PANDORA Archive will become, with the participation of other libraries, the distributed National Collection of Australian Electronic Publications. The basic components of the national model are:

  • A set of collecting agreements that outline the areas in which each library will be responsible for selecting Australian online titles for archiving and future preservation;
  • Agreement from all participants to catalogue onto the National Bibliographic Database all titles selected;
  • Commitment from all participants to undertake a preservation role, rather than just an archiving role;
  • Agreement from all participants to negotiate with publishers arrangements that will ultimately, with the passage of time, ensure open, networked, and gratis access to titles in the archive.

Ultimately each heritage library would maintain its own archive on its own server. While libraries work to this end, the National Library is willing to support other participating libraries by hosting other archives on its own server.

Although the system that we have developed for archiving and managing online publications is unsophisticated and still in the proof-of-concept stage, we believe that we have made an excellent start and learned a great deal about preserving the national cultural heritage as it moves to digital formats.



Margaret E. Phillips has worked in libraries since 1976 and joined the staff of the National Library of Australia in 1987. In 1994 she became the manager of the Acquisitions Section, where she dealt increasingly with electronic materials. In 1996, as manager of the newly created Electronic Unit, she began to devote full-time attention to both online and physical-format electronic publications. As a member of the PANDORA Project team and the Digital Services Project working group, she has been closely involved with the establishment of policy and procedures for ensuring long-term access to Australian Internet publications and with the development of an archive-management facility. You may contact her by e-mail at mphillip@nla.gov.au.


Works Cited:

The Digital Services Information Paper http://www.nla.gov.au/dsp/

Harvest software [formerly http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html]

Heminger, A.R., and S.B. Robertson. Digital Rosetta Stone: A Conceptual Model for Maintaining Long-term Access to Digital Documents. [formerly http://www.ercim.org/publication/ws-proceedings/DELOS6/rosetta.pdf]

Hendley, Tony, The Preservation of Digital Material, [London], The British Library Research and Development Department, 1996, pp. 116-117.

Kranch, D. A. 1998. Beyond Migration: Preserving Electronic Documents with Digital Tablets, in Information Technology and Libraries 17: 138-48.

Kulturarw3 Heritage Project at the National Library of Sweden [formerly http://www.kb.se/kw3/ENG/]

Nordic Metadata Project proposal for Uniform Resource Names http://www.lub.lu.se/metadata/URN-help.html

Online Public Access Catalogue of the National Library of Australia [formerly http://ilms.nla.gov.au/webpac/]

PANDORA Project

Trishan's Oz http://purl.nla.gov.au/slv/pandora/trishan

WebZip software http://spidersoft.com/

