The Unsettled State of Archiving
The old paradigm of publisher as sole distributor and librarian as sole preserver is ending, although no one is certain who will assume which roles. The activities and responsibilities for archiving scholarly publications are in a state of flux with many decisions yet to be made.
We are all stakeholders in preserving scientific and technical knowledge. All society's progress — perhaps even its ultimate survival — rests on whether science continues to develop. Scientists will discover the cures for disease, hunger, global warming, and other threats to humanity by building on research conducted to date, but only if that research continues to be available to them.
Beyond altruism in supporting science and human progress, publishers have another reason to archive. The changes in distribution and the fiscal uncertainties of electronic publishing make it especially important to capitalize on potential derivative—and collaborative—products. Archiving provides the fodder for that kind of product development.
In the long term, archiving should yield both societal and financial rewards. In the short term, however, both librarians and publishers confront financial challenges. Librarians are in a bind because they face demands for more and more services, while their purchasing power has declined. For publishers, costs are escalating for traditional supplier services and new technology costs are appearing and recurring, yet publishers' constituents are less willing or able to pay increased fees for both traditional and new products.
We should be concerned about who does archiving and how, because in the last few years we have experienced technical difficulties in accessing data that were stored electronically as recently as the 1970s. (For example, NASA satellite photography files from that period are lost to engineers who no longer have the software to read those files.) With the emergence of completely electronic publications, the function of archiving, once the sole province of the library, is receiving increased attention from others in the scholarly communication process. We have a bits-versus-atoms issue: publishing in the ether versus publishing on paper. Information in electronic formats forces publishers and authors to focus on creating the same stability for future access to those digital bits as we have had for the physical page. Scholarly publishers want to calm authors' concerns about future generations' inability to read their work because of technical obsolescence. Thus, the publisher must devise methods not just to produce information and knowledge but to preserve it in a form that can be used in the future.
As Negroponte (1995) pointed out, media will become digitally driven by the combined forces of convenience, economic imperative, and deregulation. And it will happen fast. There is no escaping from the fundamental and practical reality of the need to address archiving in a totally different perspective and to prepare now for the challenge of yet another medium in the future — whatever lies beyond bits.
We, like others, have no crystal ball, so we will not attempt to predict how the current unsettled state of archiving will be resolved. Instead, through this review paper, we hope to stimulate increased dialogue among the participants in the scholarly communication process about the importance of protecting our published information for future generations.
Rationale for Archiving
Archiving is the systematic development and retention of information preserved to assure that people will have access to it at any point in the future. Librarians, who traditionally have been the keepers of archives, consider integrity, accuracy, and predictability the most important attributes of archiving (ARL 1994). The functions of preservation and access are essential components of knowledge development, because we build our knowledge base incrementally from the work of those who came before us. The need to maintain an unbroken record of science is such a key scientific principle that hardly anyone questions the need for archiving. Instead, in the 1990s, as roles have changed and the issues surrounding archiving have become more complex, several questions have arisen. What should be saved? Who will be responsible for archiving? How can it be accomplished? How can archiving be financed?
Threats to Preservation
Archivists attempting to preserve paper face natural threats such as light, moisture, and heat; organic threats such as rodents and insects; and chemical threats such as acid content, air pollution, and plastic. Since the Commission on Preservation and Access documented the brittle paper syndrome in 1968 (Marcum 1996), responsible publishers have tried to avoid acid paper, and librarians have launched major campaigns to maintain the content that had been published in that medium, mostly by copying to microform or acid-free paper.
Ironically, the threats to preservation seem to have increased with the advent of the digital age. Not only are electronic products vulnerable to the same enemies that threaten paper, they face new problems. The most serious of these is obsolescence. In a 1998 survey of member institutions, the Research Libraries Group found that nearly half (15 of 36) of the respondents with digital holdings have lost access to some of their materials because they lack the operational or technical capacity to mount, read, or access files (Hedstrom and Montgomery 1998). Brand noted that, although one would presume computer technology would solve digital discontinuity, the fast-moving technology itself is the problem. The great creator becomes the great eraser, he said. Although we can still read manuscripts that are a thousand years old, we cannot read some materials less than 20 years old. Brand (1998) suggested that we are in the dawn of the digital dark age.
Data degradation is a concern, as is the ease with which a document can be changed, either deliberately or by accident. Human carelessness and maliciousness, always threats to preservation of a document, may be even more dangerous for electronic products. How can we be certain which is the authentic version? There are no standards yet in place for maintaining digital information or for recording the incremental changes that are inserted because the technology enables the delivery of fluid documents.
When Hedstrom and Montgomery (1998) surveyed librarians for the Research Libraries Group, respondents added two other threats: insufficient resources and insufficient planning. Collection managers noted that top-level administrators often do not understand the strain that digital collections place on their resources or the need for standards, planning, infrastructure, and models of best practices.
Mandel (1996) identified three aspects of preserving digital information: content, context, and means of access. The actual substance of the content may be no different from a print product, although the presentation may be very different if it includes video, animation, 3-D, and so forth. For a print product, the context, or provenance, is apparent in the container: the book, journal, or reference work. Provenance is not as clear for electronic products. Because digital information is so fluid — so easily changed without evidence — authentication is particularly important. Readers must be assured that the content is indeed the version it is represented to be: the article published in a specific volume and issue of a specific journal, for example. They must be assured that it has been protected from any subsequent tampering. Mandel also noted that maintaining the original look and feel of a document will be increasingly difficult as the information is migrated to new media with new potential for visual images.
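One way to picture the kind of assurance described above is a fixity check, in which a digest of the content is recorded at publication and compared against any later copy. The following is a minimal sketch only: the use of SHA-256, the function names, and the sample text are assumptions made for illustration and are not drawn from Mandel or from any archiving standard.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of a document's bytes as a hex string."""
    return hashlib.sha256(content).hexdigest()

def is_unaltered(content: bytes, recorded_digest: str) -> bool:
    """Compare a copy's current digest with the digest recorded at publication."""
    return fingerprint(content) == recorded_digest

# Hypothetical article text; the digest is recorded when the issue is published.
published = b"Example article, volume 1, issue 1"
recorded = fingerprint(published)

# Later, an archive checks two copies against the recorded digest.
tampered = published.replace(b"issue 1", b"issue 2")
print(is_unaltered(published, recorded))  # True
print(is_unaltered(tampered, recorded))   # False
```

A check of this sort can show only that a copy differs from the recorded version. It does not establish who changed it or which version should be treated as the publication of record.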
Deciding What to Preserve
Not every document, on paper or in digital form, can be saved — nor should it be. Librarians have long made archival decisions based on criteria such as usage, quality, and substantiation. Publishers can develop a similar set of criteria to accept or reject material for an archive. Saving multiple versions of the same basic document only confuses readers. Many ephemeral documents, such as informal publications or the contents of electronic bulletin boards, may not warrant saving.
The problem is that someone must make decisions, and recognizing the material that will have long-term value is extraordinarily difficult. Determining what should be retained requires intellectual judgment from a broad perspective. Unfortunately, many archiving decisions in history were made from a narrower viewpoint. For example, the French archivists who assessed the Vatican archives Napoleon ransacked and sent back to Paris saved only those items of immediate and apparent interest. As a result, many important documents that would be of great interest to us today were lost. Stille (1999) pointed out that only seven of the more than 120 plays Sophocles is known to have written survive.
The Task Force on Archiving of Digital Information, established by the Commission on Preservation and Access and the Research Libraries Group, found that many of the criteria librarians have used to select content on paper for archives apply equally well to digital information (Research Libraries Group 1996). Those criteria include:
- appraisal of the subject and discipline in relationship to the institution's collection goals
- the quality and uniqueness of the content
- its value now and for the future.
For digital information, accessibility in terms of hardware and software is another criterion. Addressing criteria from a publisher's point of view, we suggested in 1997 that those bodies of work for which the publisher has already enhanced the value by editing, designing, producing, and marketing are good candidates for archiving. Those works might include journals, books, reference works, and other substantive materials (Meyers and Beebe 1997). Indeed, primary journals have been called journals of record to indicate the expectation that their content and context will be preserved.
Although the word systematic is often used in conjunction with archiving, there are no national or international standards or agreements by discipline for how material slated for preservation should be selected. In the end, happenstance may play more of a part in determining what is preserved than we would like to think. Popularity and ongoing use are almost always guarantors that works will be preserved. For those materials that are sought after, the redundancy that is essential to assure survival occurs naturally as multiple copies are made and kept. On the other hand, as Stille said, some major works — including those of Plato — survive today only because one or two scholars saved them along the way. If the Irish in the 5th century had not carried great works from Rome back to their isolated country, where monks just learning to read copied and recopied them, much of our most treasured literature would have been lost (Cahill 1995).
Roles and Responsibilities
Traditionally, libraries have borne responsibility for archiving; in fact, the term research library is often used synonymously with archives. Publishers, authors, and the community at large have looked to libraries to maintain the entire corpus of our knowledge. While scientific knowledge grew rapidly, particularly in the 40-plus years after Sputnik in 1957, library resources declined. Thus, as we noted in an earlier study, we created a body of knowledge that surpassed our institutional capacity to absorb (Meyers and Beebe 1999). At the same time, the growth of digital information exacerbated the demands on libraries.
At the end of the 20th century, publishers must assume some of the burden of maintaining archives. They may take on the responsibility alone, or perhaps in partnership with others in the business of scholarly communication, such as aggregators. There are significant advantages to the publisher who takes on the role of archivist. For example, at a point when many authors and some librarians are questioning the value the publisher brings to the business, assuring perpetual access is a strong reinforcement of the added value the publisher provides. A commitment to ongoing access signals a clear commitment to the discipline the publisher serves and to the individuals in that discipline. There are other business reasons for archiving as well. For example, developing an archive of materials that can be put together in various ways for derivative products makes it possible to reach different market segments with a different combination of the same information, perhaps in a different medium. That is not to say that taking on the responsibilities of archiving is either inexpensive or profitable, just that some returns are possible.
Solutions for the unsettled state of archiving will involve not just technology, but legal and social change. And there are many hurdles to leap, including lack of standardization and resolution of ownership issues. The technology solutions are likely to be the easiest, and social change the most difficult. As Grycz (1997) remarked, consistency across the industry is not one of publishing's outstanding characteristics.
The Record of Scholarship
Authors of scholarly articles are motivated by many compelling reasons: claim to discovery, tenure, promotion, and peer recognition. They also want their work to join the larger corpus of knowledge within a discipline, to become a part of history. Contribution to scholarship has become a standard for the career quest of the academic researcher, and to some extent for those toiling in industry and government.
Every scholar predicates his or her own line of inquiry on an awareness of previous research. That awareness can come only from access to a well-preserved archive of subject literature. The archive of a scholarly discipline constitutes a larger knowledge base: an unbroken record of research upon which successive generations can build.
How well the record of scholarship will be preserved depends on resolving some critical issues. New definitions must be created to accommodate the new media. We need to develop a standard for which version of an article will be included in a discipline's archival literature — preprint, publication, or reprint, should those terms continue to apply. Also, if the article is published in both print and digital formats, which version is the publication of record? How do we correct errors, by published errata or by changes in the original text? And if an author alters the digital copy, can that ever be considered the publication of record? The parties involved in scholarly publishing must decide which group will bear the ultimate responsibility for monitoring the record of scholarship in terms of accessibility and veracity.
Historically, at least in the United States, the archive for any discrete subdiscipline could be found in the library of record designated by mutual agreement among the members of the Research Libraries Group. With the economic and physical constraints of the last few decades weighing heavily on libraries, it can no longer be assumed that those institutions will accept the burden of archiving all information in all media, and be responsible for migrating it from technology to technology over the years.
As we move more into electronic publishing, and as archiving becomes more complex, the financial issues also become more complicated. Two related questions emerge: How will publishers earn sufficient revenues to support archiving and continued innovation? And how will librarians, with decreasing budgets and lack of administrative support, pay for materials? Thus far, publishers have not developed effective business models. Three methods of payment for content have been discussed:
- the traditional subscription, in which an institution pays to receive a certain level of content (six issues, for example) over a certain period of time.
- site licenses, in which the parties negotiate how material will be received and how it can be distributed to library patrons.
- pay per view, in which each piece of content extracted carries a price.
There have been other suggestions. For example, Walker (1998) called for an all-electronic environment with free access. He suggested that the costs of editing (including peer review and revision), composing, and archiving could be financed by libraries with redirected subscription costs. Stevan Harnad, following up on the proposal, called for an abolition of all purchasing schemes and proposed that the publishing of scientific information be financed by page charges (September Forum, August 26, 1998). Those page charges would be funded by the university's savings on subscriptions.
Experimenting with the various models of payment, publishers are uncertain about their future revenues. They can no longer plan as they have in the past for how they will be paid, nor what levels of revenue they can anticipate.
Yet at the same time the financial risks are increasing for publishers, who must pay for forays into new media even as they begin to assume responsibility for preserving their content. Authors who suggest that they can maintain their own archives claim the Internet permits cheap, easily maintained archives. However, they overlook the costs of new equipment, robust archival servers, training, migration of data, and such business costs as development and capital. In addition to the outright financial risks associated with making the investment in archiving, Raney (1998) noted that publishers risk losing opportunities for more profitable ventures when they divert resources to archiving.
Over time, such costs undoubtedly will decrease. Odlyzko (1996) suggested that publishers should be worrying less about the life of an archival medium, noting that in ten years, we will be able to save 100 times the information for the same cost. Consequently, publishers should simply plan to migrate their information regularly. Nonetheless, preparing stable information formats and migrating the information is not likely to be cheap.
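The arithmetic behind Odlyzko's figure can be made explicit. The sketch below assumes, purely for illustration, that the hundredfold gain compounds at a steady yearly rate, and it uses a hypothetical five-year interval between migrations. Neither the steady rate nor the function names come from Odlyzko.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)

def capacity_gain(total_factor: float, years: float, elapsed: float) -> float:
    """Storage obtainable for the same cost after `elapsed` years, relative to today."""
    return annual_growth_factor(total_factor, years) ** elapsed

# 100 times the information for the same cost in ten years (Odlyzko's figure).
yearly = annual_growth_factor(100, 10)
print(round(yearly, 3))                     # 1.585

# Gain available at a hypothetical five-year migration interval.
print(round(capacity_gain(100, 10, 5), 1))  # 10.0
```

Under that assumption, the same budget stores about 58 percent more each year and roughly ten times as much at each five-year migration. The calculation covers storage capacity only and says nothing about the labor of preparing and converting formats.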
Potential Derivative Products
One clear advantage of archiving information in well-tagged, platform-independent formats is the potential for creating derivative products. Peter Boyce (e-mail communication, October 28, 1998) noted that a prime electronic copy serves as a robust archival copy as well, and it can be updated regularly at relatively low cost. If publishers understand the needs of their audiences, they can assemble new products from their digital archives. For example, a publisher might assemble a manual on a specific topic from articles published in its journals and reference works.
Rights and Ownership Issues
Although this arena contains myriad permutations, two major themes emerge. First, there is the issue of relations between the author and the publisher. Second, there is the question of whether libraries are purchasing access or ownership.
Publishers, faced with the increased financial risks associated with electronic publishing and archiving, need to obtain all rights in all media to protect their opportunities to obtain a return on their investment. They believe it is essential that they control the distribution of a product to avoid wholesale copying and piracy that impinges on their revenues. They also want to protect the value they add in bringing the product to market. The publishers' added value traditionally starts with gathering and selection; continues with editing, production, marketing, and dissemination; and is completed with rights and permissions.
Authors, on the other hand, are increasingly reluctant to sign over their ownership to publishers. The litany heard in various forums has been that the university finances the research, the author invests time and intellectual capital, and the publisher reaps the reward. With feelings rising, Steven Koonin, provost of the California Institute of Technology, raised the idea of the university and author jointly retaining all copyrights (Guernsey 1998). Authors would publish only in journals that agreed to the proposal. Although there has not been widespread support for Koonin's proposition, authors are pressing to retain more rights. Among the items on their wish list are
- the right to announce results in advance of publication
- the right to post or update articles to an e-print server
- the right to circulate an article from a Web page
- the right to distribute paper or electronic reprints.
Publishers traditionally have been leery of any free distribution that might encroach on their sales; however, many are now sanctioning more distribution by authors. Because publisher policies vary widely at this point, it is likely to be some time before a clear standard evolves. And how that standard affects the economics of publishing may not be evident for even longer.
Issues of access and ownership are even more complicated. In the past, when libraries purchased subscriptions to print products that resided on their shelves, there was no question of ownership. Even if the library discontinued a subscription, the volumes they had received were theirs to keep forever. When libraries began to purchase electronic products, they found that they were buying access to information that would be theirs only as long as they continued to subscribe. If they discontinued the subscription, they could no longer make even the old information, for which they had paid, available to their patrons. Clearly, they lost the ability to serve as an archive. As publishers and librarians have negotiated site licenses and debated the issues, there has been movement to establish a standard for maintaining access to the information previously obtained when subscriptions lapse.
The impact of the Digital Millennium Copyright Act and the Copyright Term Extension Act, both passed in 1998, will be felt throughout scholarly communication. Of particular relevance to archiving is the Section 108 Update, which revises the rules for archival preservation. This is the first change since the 1976 response to photocopiers was incorporated into the section. The current update allows for digital formats to be stored and preserved, and also sets forth a procedure for the preservation of materials not originally in digital form. Lutzker has written a Primer on the Digital Millennium in two parts that is an excellent guide to the new terms.
Conflicting Market Demands
Consumer studies suggest that there will be an ever-increasing demand that consumers' expectations be fulfilled. Scholars in different disciplines have specific idiosyncrasies, and most products have some unique features. However, consumers do have general expectations for all digital products, including the following:
- greater functionality
- more rapid delivery
- more value for cost
- cross-publisher databases
- streamlined licensing agreements
- ability to identify fragments of content
- clear understanding of any limitations.
As new technologies become more commonplace, customer expectations rise exponentially. Customers want high quality information, and they want it cheap and fast. We need only look as far as our own keyboard and screen to realize how our level of impatience has risen. A mere 10 seconds of waiting is often considered excessive. Information that now arrives after a few keystrokes would, in the past, have required travel to a library and minutes, if not hours, of searching.
Along with speed, however, we must consider quality. The age-old production triangle of time, quality, and cost does not change across media. To produce a timely, peer-reviewed, high-quality publication, whether the container is wood pulp or plastic, will still demand attention to process and, therefore, a price that reflects the cost of that process. With the "information wants to be free" attitude that prevails on the Internet, distributing a scholarly journal's content electronically becomes more than a technological challenge for today's access and tomorrow's archive; it is a financially dubious prospect. That financial uncertainty is underlined by the differently evolving acceptance of electronic publishing in different disciplines. Indeed, market research must be done discipline by discipline. What is acceptable to a historical scholar may not satisfy an entomologist, or a civil engineer, or a cardiologist.
Moreover, the acceptance of electronic publishing may depend on the age of the customer, even within the discipline. Resh (1998) contrasted the attitudes of authors, editors, and users: The main issue of the transition from paper to electronic publishing comes down to a simple fact: scientific journals are most intensely read by young researchers, but decisions about how these journals communicate information are made by much older editors. Thus, changes are being made according to the perceptions of the producers rather than what the consumers need, expect, and are ready to use. Publishers are not doing sufficient market research based on discipline differences.
Much of market research has focused on the development of products to disseminate information electronically today. Researchers have tended not to include questions about access and use tomorrow. As a result, most publishers do not know whether customers believe that an archive has value or what formats are most valuable.
As Resh observed: Librarians appreciate the archival tradition of the library much more than the average scientist does. Librarians have argued for keeping serial subscriptions even when cost/use ratios . . . are in the hundreds of dollars. They are also reluctant to give up the paper version of journals in case the electronic revolution fizzles out. Plus, there is still a huge demand by students and public users who do not have access to computers, printers, or appropriate software.
There would be little argument with Resh's acknowledgment that a librarian's concern for an archive is greater than that of the average scientist. But publishers must find out how much their average scientist, physician, engineer, or scholar appreciates the archival tradition. Otherwise, the publisher has no blueprint for archiving.
Use of Older Materials
Many proponents of e-journals feel secure in their opinions that the archival component will be of little interest to future scholars. If today's researchers are any measure, then at least for disciplines based on historical records, that will not be the case. In a 1998 survey of its members' attitudes about electronic publishing (not archiving specifically), the Seismological Society of America (SSA) found that members used the back volumes of the Bulletin of the SSA from 1995 back to 1980 often and from 1979 back to 1960 sometimes. Just over half of the respondents sometimes used volumes back to 1940 and one-third of the responding members reported some use of volumes back to the Bulletin's start in 1911. If those scientists were asked directly if archiving electronic information is important, it is safe to guess that their responses would be in the affirmative.
Technology and Tools
As technology for electronic communication evolved through the 1970s, 1980s, and early 1990s, we imagined that it would solve all our communications problems. Paper shortages, postal strikes, bulging bookshelves, and escalating costs would recede from our memories. We would create magic kingdoms, data warehouses that would enable us to maintain all of our accumulated knowledge and deliver important new products effortlessly. Authors continue to promulgate this "it's a snap" mythology in their arguments that publishers are no longer needed. (See September Forum 1998 for examples.) The reality is soberingly different.
Constantly Changing Environment
Technology is changing so rapidly that it is no longer possible for anyone to keep up. Specialists can have in-depth knowledge of only very narrow fields, whereas generalists can hope only to recognize terminology when it comes into popular use and perhaps recall that they knew specific advances were under way.
Computer obsolescence occurs with frightening speed. Levin (1996) wrote a tongue-in-cheek seven-step guide to maintaining an up-to-date system that began with establishing a location with a tractor trailer and a huge pile of credit cards and ended with "repeat every six months." He was, unfortunately, close to the mark. As difficult as the rapid changes are for consumers, they are much more difficult for organizations such as publishers and libraries that must consider not only their own needs and usage patterns, but those of diverse audiences. Wang dedicated word processing equipment is but a faint memory, as are several operating systems. A two-year-old computer is nearly an antique. To maintain archives and develop new systems, publishers and librarians are finding that they need to upgrade equipment regularly. And each upgrade, whether it is just a component or an entirely new system, introduces the need for training and an allowance for a learning curve, both of which are expensive. In addition, software and hardware do not always follow the same development cycles.
The need for regular upgrades in equipment and software has a major impact on planning for preservation and access to information in years to come. Mandel (1996) commented that maintaining a museum of hardware and software to assure access is impractical. As we noted earlier, a substantial collection of digital information is no longer accessible because the equipment is no longer in use. If paper copies do not exist, the information may never be retrieved.
To assure that future readers will be able to obtain the information, archivists — whether they are publishers, librarians, or others — must plan to migrate the data to a new medium on a regular basis. The Task Force on Archiving of Digital Information defined migration as a set of organized tasks designed to achieve the periodic transfer of digital material from one hardware/software configuration to another or from one generation of computer technology to a subsequent one. The task force outlined the following strategies:
Change Media. Archivists can move materials from a less stable medium to a more stable one. For example, the task force recommended moving data from a digital medium to paper or microfilm. It warned, however, that copying may result in losses in the form or structure of digital information. Losses will occur, for example, in computation capabilities, graphics, and indexing.
Change Format. For large archives with multiple formats, the task force suggested moving items to a smaller, more manageable number of standard formats. This strategy, the group said, will simplify, but not eliminate, the need for periodic migration.
Incorporate Standards. Widespread standards will make references and interchange much easier. The implementation of electronic data interchange (EDI), for example, depends on systematic implementation of a standard, which does not currently exist. Standards will facilitate migration to new media.
Build Migration Paths. Developing protocols for online deposits of information will simplify the process of archiving. The task force recommended that standards be established to create systems for backward compatibility from the initial design.
Use Processing Centers. Because so much information exists today in nonstandard formats, processing centers that specialize in reformatting materials could provide a cost-effective method for migrating information. The task force suggested that a national laboratory for digital preservation could provide a cost-effective alternative.
It appears that publishers, or libraries if they maintain the archives, can expect to migrate data fairly regularly. The American Astronomical Society, one of the few publishers to put all of its legacy information in electronic form, has established a plan and budget to migrate the entire corpus to a new medium every five years (Peter Boyce, personal communication, February 1998). Mandel (1996) pointed out that the engineering task is increasingly complicated as content and functionality are made inseparable.
Archivists have a variety of options when they are considering formats for preservation. Some of these formats are very new, and others, such as microform, have been in existence for decades.
Microform is a medium (microfilm or microfiche) that stores large amounts of print information in a condensed format. Microform is most often used to archive newspapers, magazines, and journals. Libraries collect microforms to conserve space, especially for rarely used serials. Microform also provides a means to acquire a copy of out-of-print publications (for example, back issues of journals); some documents are produced only on microfiche or microfilm for distribution. All archiving institutions, such as national, state, and local archives, historical societies, libraries, and museums, rely on microforms to preserve their print collections.
Bell & Howell Information and Learning (formerly UMI) has noted on its Web site that the older microforms get, the more people appreciate their unique attributes. Only microfilm and microfiche save large quantities of full-image information in a small space ... for decades. Thus, when compared to some electronic media, the microform is distinguished as an archiving medium in that it will provide proven access for many years into the future.
Portable Document Format (PDF), a derivative of Adobe's PostScript page description language, allows publishers to produce electronic documents with all the original formats, fonts, and page layout. Often producers of documents delivered on the Web will offer versions in both Hypertext Markup Language (HTML) and PDF to satisfy the consumer's desire for the look and feel of a printed page.
PDF was created to facilitate the exchange of information between computer systems, and was based on the assumption that the sender would want the receiver to see the document exactly as it was created (Kasdorf 1998). Some publishers have used it for archiving to assure that they can continue to deliver the same document no matter where it is printed. Although PDF files can be linked and searched, PDF does not offer the robust structuring capacity of Standard Generalized Markup Language (SGML). Consequently, Kasdorf has argued that both formats are important.
The most efficient way to assure that migration can occur is to use portable, platform-independent formats such as SGML. Because SGML-tagged information can be used in any medium, SGML greatly enhances the potential for long-term access. Kasdorf described several advantages to SGML: it offers powerful and flexible structuring capabilities; it can capture and organize metadata; and it is a hedge against technological change. In addition, SGML can be parsed; that is, software can automatically verify that all the coding is correct.
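The idea that markup can be checked by software can be illustrated with a minimal sketch. Python's standard library has no SGML parser, so this example substitutes XML (discussed below) and checks only that the tags are well formed, not that the document is valid against a DTD. The function name and the sample tag names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The parser rejects unclosed or mismatched tags.
        return False


# Hypothetical article fragments; the tag names are illustrative only.
good = "<article><title>Archiving</title><author>Beebe</author></article>"
bad = "<article><title>Archiving</author></article>"  # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

A full check of the kind Kasdorf described would also compare the document against its DTD, which requires a validating parser rather than the well-formedness test shown here.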
All of these advantages come at a price. First, SGML requires a Document Type Definition (DTD) for every type of publication: a book, a journal, or a report. Because of the level of detail and the quality of analysis required, DTDs can be expensive to write; they must be built and documented carefully. Second, SGML is a language, not a coding scheme; therefore, producers very likely will need to make up their own codes. Finally, SGML must be applied. Publishers incorporate SGML at different stages in the publication process. Some apply it in the editing phase, asking editors to tag the document. Some have added it to the composition phase, and still others tag materials for electronic products after the paper product has been printed.
Authors and some publishers have suggested that Hypertext Markup Language (HTML), which is much simpler to use, is a suitable substitute. HTML has been extremely successful when used to define headers and other parts of documents designed for delivery on the World Wide Web. However, its simplicity makes it less useful for complex materials such as journals and books.
Extensible Markup Language (XML) is something of a cross between HTML and SGML: less complex and flexible than SGML, but more structured than HTML. While noting that XML will revolutionize coding and structuring information for the Web, Kasdorf cautioned that SGML is the core. Publishers with good SGML archives, he said, will be positioned to move into any medium.
SGML is an International Organization for Standardization standard (ISO 8879, adopted in 1986), and there is also a standard (ISO 12083) for DTDs. However, most publishers have treated DTDs almost as house style, creating a unique DTD for their own journal program, and some have created new DTDs for individual journals. As we move more and more toward journal databases, the lack of standardization in identifying items such as references may create some complexities in linking and assembling databases.
As keepers of archives assemble more and more information in digital form, the management of the data becomes an increasingly complex task. These digital archives can be large data warehouses that contain information from a variety of publications or databases, in addition to all of the business, marketing, and promotional data.
Managing this storehouse of information requires that we have a way to know what is in it and how the information is defined. This need leads us to metadata, or data about data. Sen and Jacob described three types of metadata:
- Operations-level metadata, which define the structure of the information in the original format
- Warehouse-level metadata, which define how the data are interpreted when they are integrated into the warehouse
- Business-level metadata, which link the warehouse data to business applications (Sen and Jacob 1998).
For example, the operations-level metadata might contain the title, the author, and other information specifically related to the original content. Warehouse-level data might include when permissions were received or when the content was updated. Business-level data could include the various forms in which the content has been published.
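The three levels can be pictured as one record with three parts. The following is a minimal sketch, not a scheme taken from Sen and Jacob: the class names, field names, and sample values (including the dates and published forms) are all hypothetical and chosen only to mirror the examples in the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OperationsMetadata:
    """Describes the original content itself."""
    title: str
    authors: List[str]


@dataclass
class WarehouseMetadata:
    """Describes how the item is handled once it is in the warehouse."""
    permissions_received: Optional[str] = None  # ISO date string (hypothetical)
    last_updated: Optional[str] = None          # ISO date string (hypothetical)


@dataclass
class BusinessMetadata:
    """Links the item to the products in which it has appeared."""
    published_forms: List[str] = field(default_factory=list)


@dataclass
class ArchiveRecord:
    operations: OperationsMetadata
    warehouse: WarehouseMetadata
    business: BusinessMetadata


# Illustrative values only.
record = ArchiveRecord(
    operations=OperationsMetadata(
        title="Sample Journal Article",
        authors=["A. Author", "B. Author"],
    ),
    warehouse=WarehouseMetadata(
        permissions_received="1998-10-01",
        last_updated="1999-03-01",
    ),
    business=BusinessMetadata(published_forms=["print", "HTML", "PDF"]),
)

print(record.operations.title)
print(record.business.published_forms)
```

Keeping the three levels in separate structures means that a change in business use, such as a new product, does not require touching the description of the original content.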
If an organization does not plan carefully, the administration of its warehouse can consume time that should be spent on product development and delivery. In describing how to build an effective data warehouse, Sigal issued a strong call for standardizing the equipment you use in your own operation to avoid incompatibilities and project failure, as well as cost overruns. Standardizing and avoiding proprietary systems enables producers to build strong repositories that will yield derivative and collaborative products economically (Sigal 1998).
Storage capacity for very large amounts of digital information has become less of a concern as the costs of storage have decreased dramatically. Nonetheless, it is still an issue. The Research Libraries Group's Task Force on Archiving of Digital Information suggested three ways of handling digital storage. For those materials in high demand, online storage is the answer, and multiple locations may be necessary in a distributed environment. Documents in occasional use can be stored near-line on optical or tape media in a jukebox. Materials that may be used infrequently can be stored most efficiently offline, usually on tape. The task force cautioned that redundancy is an essential ingredient in effective storage. If multiple copies are not retained, failure in any part of the system could result in loss or corruption of digital information. Redundancy involves having multiple copies in multiple storage units. A lack of redundancy for all archived materials is an ongoing concern for librarians.
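The three storage tiers and the redundancy rule can be sketched in a few lines. This is an illustration under stated assumptions, not the task force's specification: the demand thresholds, the minimum of two copies, and all names and storage-unit labels are hypothetical.

```python
from enum import Enum
from typing import List


class Tier(Enum):
    ONLINE = "online"
    NEAR_LINE = "near-line"
    OFFLINE = "offline"


def choose_tier(requests_per_year: int, high: int = 100, occasional: int = 5) -> Tier:
    """Pick a storage tier from demand; the thresholds are assumed values."""
    if requests_per_year >= high:
        return Tier.ONLINE       # high demand
    if requests_per_year >= occasional:
        return Tier.NEAR_LINE    # occasional use, e.g. optical or tape jukebox
    return Tier.OFFLINE          # infrequent use, usually tape


def is_redundant(copy_locations: List[str], minimum: int = 2) -> bool:
    """True if copies exist in at least `minimum` distinct storage units."""
    return len(set(copy_locations)) >= minimum


print(choose_tier(500).value)
print(choose_tier(10).value)
print(choose_tier(0).value)
print(is_redundant(["vault-a", "vault-a"]))  # two copies, but one storage unit
print(is_redundant(["vault-a", "vault-b"]))
```

Counting distinct storage units rather than copies reflects the task force's point that redundancy means multiple copies in multiple units.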
National Register of Microform Masters
The National Register of Microform Masters (NRMM) reports on materials held in microform depositories. These materials have publication dates primarily from the late 19th and early 20th centuries; in addition, 16 percent were published before 1850, some as early as the 13th century. Over 40 percent of the microforms are in languages other than English, primarily Western European. The National Register of Microform Masters is itself a landmark in library cooperation, noted Reed-Scott (1998). Six institutions, namely the Library of Congress, Harvard University, the New York Public Library, UMI, General Microfilm Company, and Research Publications, hold most of the materials documented in the NRMM reports. (The Library of Congress is the largest noncommercial producer of microfilm.) However, a spectrum of academic libraries, historical societies, and archives, as well as state and public libraries, also provide a significant number of titles to the Register.
First proposed in 1936 by Keyes Metcalf, NRMM was not started until 1965, after an Association of Research Libraries study by Wesley Simonton. Since 1986, ARL and the Library of Congress have joined forces to create machine-readable versions of the NRMM records with funding from the National Endowment for the Humanities and The Andrew W. Mellon Foundation. The online NRMM enhances productivity by improving access to preservation microfilm masters. NRMM's size alone indicates its place of importance in our nation's archiving efforts.
NRMM Machine-Readable Statistics: (Reed-Scott 1998)
- 549,147 monographs, over 12,000 in non-Roman languages
- 22,729 serials from Harvard, NYPL, and the Library of Congress
- 7,083 musical scores
The National Digital Library Program
The National Digital Library Program began on May 1, 1995 when Librarian of Congress James Billington signed the National Digital Library Federation Agreement with the Commission on Preservation and Access and 14 other research libraries and archives. NDLP builds on the experience of two earlier projects, the Optical Disk Pilot Project and the American Memory project.
The American Memory project, the primary precursor to NDLP, emphasized digitizing the nation's memory. According to Fleischhauer, the pilot, which ran from 1990 to 1995, identified audiences for digital collections, established technical procedures, wrestled with intellectual property issues, demonstrated options for distribution, and began institutionalizing a digital effort at the Library of Congress (Arms 1996a).
The technology the American Memory staff chose mirrored the digital diversity found in the online environment. The Library's development of an SGML scheme is particularly noteworthy as it serves NDLP for historical texts and documents and conforms to the Text Encoding Initiative, an international standards effort.
Basically NDLP will convert into digital form, from the Library of Congress's print and non-print material holdings, a core of primary source material that reflects American history and culture. In addition to reaching a much broader audience, NDLP affords the opportunity for the Library along with other agencies and organizations to mount research and development projects directed to real problems of indexing, managing, presenting, and storing such diverse collections.
Arms (1996a) noted that the versions made for this digital library have the potential to serve as preservation copies. Microfilming is the most accepted preservation method for textual materials, and photographic reproduction is used for pictorial materials. But as digital techniques are applied, the question arises of whether to add enhancements to facilitate access to the content, and there is the challenge to develop general principles guiding decisions for enhancements. As Arms pointed out: Although the NDLP team is developing standardized approaches, the conversion of each collection presents special circumstances, sometimes because of the physical condition of the originals or because scanning must be coordinated with curatorial activities for preservation or description of the collection.
Other challenges include the establishment of naming schemes, the fact that uniformity of description is not practical, the question of where descriptive information should be stored, and the issue of how to design the groupings of digital objects. In addition, NDLP must provide navigational tools to support access. NDLP's solution for navigational tools is the development of finding aids or documents used in searching to locate pertinent materials in the collection.
NDLP has the same concern with access and use of a collection that every library does. In a second article, Arms noted that, with more primary texts online, NDLP is still considering how it can deliver long documents in a way that is convenient to use. According to Arms, "The challenge is to support rapid navigation without losing the sense of context provided by a physical book" (Arms 1996b).
As the NDLP staff members continue their work, they face a number of formidable tasks. They have identified ten challenges, clustered into five major issue categories.
|Challenges to Building an Effective Digital Library|
|Building the Resources||Challenge One:||Develop improved technology for digitizing analog materials.|
|Building the Resources||Challenge Two:||Design search and retrieval tools that compensate for abbreviated or incomplete cataloging or descriptive information.|
|Building the Resources||Challenge Three:||Design tools that facilitate the enhancement of cataloging or descriptive information by incorporating the contributions of users.|
|Interoperability||Challenge Four:||Establish protocols and standards to facilitate the assembly of distributed digital libraries.|
|Intellectual Property||Challenge Five:||Address legal concerns associated with access, copying, and dissemination of physical and digital materials.|
|Effective Access||Challenge Six:||Integrate access to both digital and physical materials.|
|Effective Access||Challenge Seven:||Develop approaches that can present heterogeneous resources in a coherent way.|
|Effective Access||Challenge Eight:||Make the National Digital Library useful to different communities of users and for different purposes.|
|Effective Access||Challenge Nine:||Provide more efficient and more flexible tools for transforming digital content to suit the needs of end-users.|
|Sustaining the Resource||Challenge Ten:||Develop economic models for the support of the National Digital Library.|
Digital Vault Initiative
At the 1998 annual conference of the American Library Association, UMI (now Bell & Howell Information and Learning) announced its plan to create the largest digital collection of printed works in the world. For its Digital Vault Initiative [formerly http://www.il.proquest.com/hp/Features/DVault/], the company will convert all of its vast microform holdings to electronic format. Fifteenth-century literature, 19th-century newspapers, and current business magazines are just some of the items in the collection of 5.5 billion page images, all stored in three temperature-controlled vaults at the company's headquarters in Ann Arbor. UMI built the original vault in 1965 to protect the integrity of the collection, and followed it with a second and a third structure. Strict environmental controls prevail; they include maintaining a constant temperature of 70 degrees Fahrenheit and a 40 percent humidity level while filtering systems extract damaging elements from the air.
Bell & Howell Information and Learning (formerly UMI) expects that it will take several years to scan all of the holdings into a digital format. The work of putting old documents into the Digital Vault Initiative will run parallel to the annual addition of 37 million images of current information. Bell & Howell intends that its online archive will provide greater, richer depth of content than any other online resource. Library patrons will have access through the company's online service, ProQuest Direct, to entire documents, including illustrations and photographs, with context provided by adjacent stories and advertisements.
Some Publisher Practices
When we wrote a white paper for The Sheridan Press in 1997, we found most of the published literature on archiving still addressing library practices rather than those now being undertaken in the publishing community. To help rectify the dearth of information concerning publishers' practices, we undertook an informal poll of publishers. The results have been presented to the Council of Biology Editors' annual meeting and at a seminar for the Society for Scholarly Publishing. We developed and co-chaired the SSP seminar in October 1998 to bring together librarians and publishers so they could share information and talk about how best to collaborate on archives in the future. The Institute of Electrical and Electronics Engineers (IEEE), the Ecological Society of America (ESA), and JSTOR were among the organizations reporting at the seminar.
At the October 1998 seminar, Scott McFarland, Director of Electronic Product Development for the Institute of Electrical and Electronics Engineers, reported on IEEE's global archive approach. At that time the institute had a production archive of the following in SGML:
- 20,000 magazine pages
- 240,000 transactions and journal pages since 1995
- 800 current standards totaling 58,000 pages.
The SGML files totaled 318,000 archived pages. In addition, all IEEE content from 1988 to the present has been archived in PDF format on CD-ROM, magnetic tape, and magnetic storage. In October 1998, the IEEE Archive contained more than 446,000 documents.
The IEEE Electronic Products Committee adopted the policy that to ensure continuity in the transition to electronic publishing, the IEEE remains committed . . . to accepting full responsibility for ensuring that all published material is archived. The Institute will archive its publications' content and work with customers to meet their needs. With this commitment IEEE faces major economic issues surrounding a continually expanding archive, including the following concerns:
- Costs of storage and maintenance shift from customer to producer.
- Size determines costs of migration to new storage technologies.
- Demand decreases as information ages.
JSTOR takes an image approach to maintaining the integrity of archival files for publishers. The Ecological Society of America (ESA) has found the JSTOR approach effective in dealing with the scary issues for a mid-size society, Dr. Mary Barber, Director of Science and Sustainable Initiative Programs for ESA, reported at the SSP Archiving Issues Seminar in October 1998. At that seminar Sarah Sully, JSTOR's Director of Publisher Relations, explained that there is a moving wall for archival materials that is always three to five years behind current issues of the journals, as determined by the publisher. JSTOR is available to academic institutions via site licenses and has institutions participating in the United States, Canada, and the United Kingdom.
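The moving wall can be expressed as a simple date comparison. This is a minimal sketch, not JSTOR's actual rule: the function name is hypothetical, and counting by calendar year with an inclusive cutoff is an assumption made for illustration.

```python
def in_archive(issue_year: int, current_year: int, wall_years: int) -> bool:
    """Return True if an issue falls behind the moving wall.

    wall_years is the gap chosen by the publisher (three to five years
    in the description above). The inclusive, calendar-year cutoff is an
    assumption for this sketch.
    """
    return issue_year <= current_year - wall_years


# With a five-year wall in 1999, issues through 1994 are in the archive.
print(in_archive(1994, 1999, 5))
print(in_archive(1995, 1999, 5))

# A publisher choosing a three-year wall exposes issues through 1996.
print(in_archive(1996, 1999, 3))
```

Because the cutoff depends on the current year, the archive grows by one year of content each year without any change to the publisher's setting.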
According to private e-mail from Sully, March 1, 1999, JSTOR has
- 378 participating institutions
- 77 participating publishers
- 131 participating titles
- 78 titles available online.
Beebe/Meyers Publishers Survey
After writing the white paper on archiving, we were still intrigued by the myriad of issues surrounding the topic. To augment our research in the archiving literature, we sent a survey via e-mail and fax to 101 publishers. The selection of the publishers was a convenience sample comprising colleagues whose names were found in membership directories and on meeting registration lists, many with publications programs that encompass both books and journals. Fourteen disciplines were represented; twenty-seven publishers were in biomedical, medical, and life sciences, and twenty were in the behavioral and social sciences. The rest served the fields of engineering or the physical sciences. There was a fairly even split across commercial, society, and university press publishers. We received a 38 percent response.
When publishers were questioned about their attitudes toward archiving and future access to the literature they published, the most frequent response (one of every five publishers surveyed) was that the publisher was involved in archiving current material and planning for the future. They also see an expanding role for publishers in the area of archiving. Thirteen percent of the publishers maintain their publications in electronic format, but are not presently offering public electronic access to those files.
How these publishers are preserving their journal literature today breaks down as follows:
- print archives on-site (24)
- electronic archives on-site (22)
- print archives on-site and electronic with vendor (8)
- formal arrangement with a library (6)
- print and electronic archives with vendor (2)
- no print archive; electronic files with vendor (2)
- no electronic files (1)
- no response (5).
Establishing an archive is only the first step to protect the future of any discipline's literature. Guaranteeing access to that archive is the second. To assure access to their journals, the greatest number of publishers (22) are researching the potential for data migration across future technologies. Several (13) are actively providing access to archival materials based on current technology and 10 are providing limited access to electronic archives. Some of the publishers (5) have no plans for any access.
A few publishers gave us additional information. One reported constructing an electronic archive and planning for a consortial-publishing arrangement. Another has made new translations of the entire corpus and is setting aside dollars to migrate the information to new technologies regularly. A third publisher's approach is to keep their archives for five years, then pass the literature files over to a library or OCLC to provide access in perpetuity.
Changing Roles and the Future
The roles are shifting, and publishers are increasingly expected to preserve their own content. Librarians, still leery about how adequately publishers will perform the role, are nonetheless relinquishing some of the responsibility. They have no choice when they do not own the information, as is the case for many electronic products. Even when the access versus ownership issue is resolved, the sheer volume of information to be preserved and the cost of archiving will lead to shared responsibilities.
At the October 1998 archiving seminar presented by the Society for Scholarly Publishing, Mary Case, Director of the Office of Scholarly Communication for the Association of Research Libraries, gave a cogent description of expectations for publishers and librarians, summarized in the table below.
|Expectations of Publishers||Expectations of Librarians|
|Take responsibility for archiving products.||Actively participate in developing standards and practices.|
|Keep preservation in mind when developing products.||Help others to identify the information that should be preserved.|
|Adopt current best standards and practices.||Raise awareness of issues, advocate for federal funding to support research and development.|
|Document the standards used.||Involve computer scientists on university campuses in developing solutions.|
|Migrate information when appropriate.||Plan to pay for preservation efforts.|
|Back-up materials routinely.||Identify archival institutions.|
|Archive materials on industry-standard media regularly; store off-site.|
In addition, Case declared that publishers need to plan for handing responsibility over to another organization should they no longer be willing or able to maintain their archives. It has always seemed unlikely that publishers facing bankruptcy or the dissolution of a business would spend the time and care needed to assure survival of the content they had published. Instead, publishers need to establish contingency plans for transferring files and implementing access terms as a matter of course.
We tend to think of publishers and librarians as the key players in this drama. However, many others, including government at all levels, universities, other public and private institutions, aggregators, and authors play a key role. No one has written the entire script, but it is clear that partnerships will be required if we are going to preserve the world's knowledge base for future use.
The people engaged in different parts of the scholarly communication process have different visions for the future. Some see only disaster ahead. They point to our lack of clarity on the basic issues: What will we preserve? How will we do it? How will we pay for it? They point out the lack of redundancy for those materials that are already preserved, and they predict a calamity on the order of the destruction of the Alexandria Library. The end of civilization, they say, could be at hand. At the other end of the spectrum are those who see only a golden future. Digitize, they say, and we can save everything forever. Nirvana is reachable, if not at hand.
The majority of us are probably somewhere in between. We see how rapidly technology is changing and improving, and we hope that we can develop the necessary partnerships and make the requisite social and legal changes. All of us would prefer more clarity, especially around roles and lasting media. With the current unsettled state, it is difficult to plan for either content or finances. However, the state of archiving is likely to be unsettled for the next several years at least. We will build our perpetual archives slowly, incrementally with many zigs and zags — just as we have built the knowledge that they will hold.
Linda Beebe is president of Parachute Publishing Services, which provides project and program management and support for all publication phases from content development through promotion, distribution and evaluation. The former director of the NASW Press, a division of the National Association of Social Workers, she has worked for and consulted with professional and trade associations, consumer groups, colleges and universities, and others to develop and deliver communications in print and electronic media. Together she and Barbara Meyers have 55 years experience in scholarly publishing and are well known for their active participation in a number of industry groups such as AAP/PSP, CBE, NFAIS, and SSP. Linda Beebe may be reached at email@example.com.
Barbara Meyers is president of Meyers Consulting Services (MCS) [formerly http://www.mcsone.com/], which provides expertise in management, marketing, planning, and research to professional societies, publishers, and commercial firms. Formerly she worked for the American Chemical Society, the National Academy of Sciences, and the Chamber of Commerce of the U.S. She was one of the founders of the Society for Scholarly Publishing and was President of the Council of Biology Editors from 1997 through 1998. Together she and Linda Beebe have 55 years experience in scholarly publishing and are well known for their active participation in a number of industry groups such as AAP/PSP, CBE, NFAIS, and SSP. Barbara Meyers may be reached at firstname.lastname@example.org.
Arms, C.R. 1996a. Historical collections for the National Digital Library: Lessons and challenges at the Library of Congress. D-Lib Magazine April: Part 1. [doi: cnri.dlib/april96-c.arms]
Arms, C.R. 1996b. Historical collections for the National Digital Library: Lessons and challenges at the Library of Congress. D-Lib Magazine May: Part 2. [doi: cnri.dlib/may96-c.arms]
Association of Research Libraries. 1994. 2001: A space reality. Strategies for obtaining funding for new library space. Systems and Procedures Exchange Center. Flyer 200.
Brand, S. 1998. Written on the wind. Civilization October/November: 70-72.
Cahill, T. 1995. How the Irish Saved Civilization. New York: Nan A. Talese/Doubleday.
Fleischhauer, C. 1995. A periodical report from the National Digital Library Program, The Library of Congress, November/December. http://lcweb.loc.gov/ndl/nov-dec.html
Guernsey, L. 1998. Colloquy: A provost challenges his faculty to keep copyright on journal articles. The Chronicle of Higher Education September 11. http://chronicle.com/colloquy/98/copyright/background.htm
Grycz, C. 1997. Professional and Scholarly Publishing in the Digital Age. New York: Association of American Publishers.
Hedstrom, M., and S. Montgomery. 1998. Digital preservation needs and requirements in RLG member institutions. [formerly http://www.rlg.org/preserv/digpres.html].
Levin, M. 1996. Avoiding computer obsolescence. Project Galactic Guide.
Library of Congress. 1998. Challenges to building an effective digital library. http://memory.loc.gov/ammem/dli2/html/cbedl.html
Lutzker, A.P. (No date.) Primer on the digital millennium: What the digital millennium copyright act and the copyright term extension act mean for the library community. Association of Research Libraries. http://www.arl.org/info/frn/copy/primer.html
Mandel, C. 1996. Enduring access to digital information: Understanding the challenge. European Research Libraries Cooperation: The LIBER Quarterly 6:453-64.
Marcum, D.B. 1996. The preservation of digital information. Journal of Academic Librarianship 22:451-54. [doi: 10.1016/S0099-1333(96)90006-3]
Meyers, B., and L. Beebe. 1999. The Future of Print Journals: A White Paper Prepared for The Sheridan Press. Hanover, Pa.: The Sheridan Press.
Meyers, B., and L. Beebe. 1997. Archiving from a Publisher's Point of View: A White Paper Prepared for The Sheridan Press. Hanover, Pa.: The Sheridan Press.
Negroponte, N. 1995. Being Digital. New York: Alfred A. Knopf, 13.
Odlyzko, A. 1996. On the road to electronic publishing. Available at http://math.albany.edu:8800/hm/emj/papers/road.
Reed-Scott, J. 1998. Recon project for preservation microfilm masters completed. ARL Newsletter. 96. http://www.arl.org/newsltr/196/nrmm.html
Research Libraries Group. 1996. Preserving digital information: Report of the Task Force on Archiving of Digital Information. http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED395602&ERICExtSearch_SearchType_0=no&accno=ED395602
Resh, V.H. 1998. Science and Communication: An Author/Editor/User's Perspective on the Transition from Paper to Electronic Publishing. http://www.library.ucsb.edu/istl/98-summer/article3.html
Seismological Society of America. 1998. Publishing Survey Results. http://www.seismosoc.org/news/pubsurvey_summ.html
Sen, A., and V. Jacob. 1998. Industrial-strength data warehousing. Communications of the ACM 41:9, 29-31. [doi: 10.1145/285070.285076]
September Forum Archives. 1998. American Scientist. Available: http://amsci-forum.amsci.org/archives/september-forum.html.
Sigal, M. 1998. A common sense development strategy. Communications of the ACM 41:9, 42-43. [doi: 10.1145/285070.286430]
Stille, A. 1999. Overload: High tech dept. The New Yorker March 8: 38-44.
Walker, T. 1998. Free Internet access to traditional journals. American Scientist 86:5. http://amsci-forum.amsci.org/archives/september-forum.html.
All URLs were active when this article was completed in mid-May 1999; however, changing domains is a common problem with Web references. We were unable to use a handful of references we had intended to include because they were no longer available.
The genesis for this article was a white paper, Archiving from a Publisher's Point of View, published by The Sheridan Press in September 1997. The authors would like to thank The Sheridan Press for their support of our initial endeavor in writing about archiving and for the helpful reviews from Barry Davis, Fred Fowler, Chris Gohn, Kevin Pirkey, Craig Rineman, Greg Suprock, and Joan Weisman. Readers interested in obtaining a copy of the white paper should contact The Sheridan Press directly at email@example.com. Also see http://www.sheridan.com.
Links from this article
American Memory http://lcweb2.loc.gov/ammem/amhome.html
Copyright Term Extension Act http://lcweb.loc.gov/copyright/legislation/s505.pdf
Digital Millennium Copyright Act http://lcweb.loc.gov/copyright/legislation/hr2281.pdf
Digital Vault Initiative at Bell & Howell Information and Learning (formerly known as UMI) http://www.il.proquest.com/hp/Features/DVault/
Text Encoding Initiative http://www.tei-c.org/