Authors : | Gabrielle V. Michalek, Jennie Benford |
Title: | The Pittsburgh Jewish Newspapers Project: Digitizing Archival Newspapers for Full-Text Searching |
Publication info: | Ann Arbor, MI: MPublishing, University of Michigan Library, Fall 2009 |
Rights/Permissions: |
This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact [email protected] for more information. |
Source: | The Pittsburgh Jewish Newspapers Project: Digitizing Archival Newspapers for Full-Text Searching Gabrielle V. Michalek, Jennie Benford vol. 12, no. 1, Fall 2009 |
URL: | http://hdl.handle.net/2027/spo.3310410.0012.106 |
The Pittsburgh Jewish Newspapers Project: Digitizing Archival Newspapers for Full-Text Searching
Carnegie Mellon University Libraries
Abstract
The Carnegie Mellon University Libraries were asked by the library director of a local synagogue to digitize a singular collection of three historic Jewish newspapers, all of which were published in Pittsburgh. These three newspapers, the runs of which span the 19th to the 21st century, posed unique digitization challenges, both as unwieldy, fragile physical objects and as unindexed, dense analog datasets. Using a combination of revamped workflows and repurposed equipment, the Carnegie Mellon Library staff rose to the challenge, creating a full-text-searchable set of 160,000 digital surrogates for under $100,000. As of 2009, fundraising to secure another $100,000 is underway to support creation of 75,000 more images, yielding a searchable database of approximately 235,000 images by project end. In its current incarnation, the Jewish Newspapers Project provides open access to a one-of-a-kind collection of social, genealogical and demographic information about the Jewish communities in and beyond 20th century Pittsburgh. This article is an examination of how a successful archival digitization project was launched by retrofitting earlier workflows, utilizing existing equipment from other projects, and cobbling together a like-minded group of participants, all of whom have some stake in the end product.
Why a Jewish Newspaper Project?
Pittsburgh prides itself on its history and that history is often spoken of in terms of ethnicity. As 19th century America evolved from an agricultural society into an industrial superpower, people from all nations came to Pittsburgh to work first in glass factories and iron mills and then, by the late 1800s, to run the huge steel mills that lined the city’s three rivers. Interest in that heritage is strong, both from descendants of those immigrants and from academic researchers around the world. The demand for primary resources about Pittsburgh’s past is also strong, but those resources tend to be rare, fragile, locally held and, in the case of newspapers, largely unindexed. Finding a way to access information within sole copies of embrittled newspapers without destroying them, much less making that information widely available, would be a task requiring several stakeholders, many of whom are part of Pittsburgh’s latest incarnation as a burgeoning center of digital technology and innovation.
The Jewish community of Pittsburgh can trace its history back to the arrival of Jews from Germany and Poland in 1840. In 1991, Carnegie Mellon University undertook the first large scale archival digitization project the archival community had seen, digitizing the papers of the late Senator H. John Heinz III. [1] To say both the Jewish and technology communities have flourished in Pittsburgh since these watershed moments would be an understatement. The Jewish community of Pittsburgh has a long, influential history in and beyond the city limits and is a vital part of the city to this day. On the technology front, recent years have seen the Smoky City evolve into a High Tech hotspot. Talent from local universities and businesses, as well as monies from forward thinking foundations, has come together for important digitization projects of local archival materials. The story of the Pittsburgh Jewish Newspaper project is the tale of how these two sectors have worked together, using today’s innovations to better illuminate the details of yesterday’s Jewish experience. [2] This paper details the Jewish Newspaper project, briefly outlining the source material, and focusing on creation of the workflow and processes necessary to digitize the historic newspapers in question.
Collaboration is Key
In 2003, Anne Molloy, chair of the Library and Archives of Rodef Shalom Congregation, began to make inquiries concerning potential funding for a project to digitize the Jewish newspapers of Pittsburgh. The three newspapers in question, The Jewish Criterion, The Jewish Outlook, and The Jewish Chronicle, covered both world and regional news from a Jewish standpoint from 1895 until the present day. Archives of these papers were to be found in various repositories both in and outside of Pittsburgh. All three papers were in strong demand by researchers and, as none of the three had been indexed, searches had to be conducted page by page, often by repository staff and, in the absence of microfilm, to the detriment of fragile original copies. The Rodef Shalom Congregation itself maintains a serviceable run of the Jewish Criterion within its archives, but both the incompleteness of that run and the time required of their one-person archives staff to physically search the unindexed issues begged for a solution.
In the fall of 2004, Anne Molloy contacted Dr. Gloriana St. Clair, Dean of the University Libraries at Carnegie Mellon University, to explore the possibility of directing technology configured for another project completed at the Carnegie Mellon University Libraries toward the newspaper indexing problem. The earlier project, called the Posner Memorial Collection, [3] resulted in the digitization of and digital access to over 1,000 rare books. The Posner Memorial Collection project put the Carnegie Mellon University Libraries in a good position to work with the newspapers in question, as the handling needs of embrittled newspapers are similar in many ways to the treatment required of rare books. It is important, for example, not to strain the bindings of the books and not to expose each page to UV light for an extended period of time. The Carnegie Mellon University Libraries already had staff trained in such work, not to mention non-invasive scanning equipment and a dedicated, secure room for the work.
At the time the Jewish Newspapers Project was started, similar newspaper digitization projects were being undertaken at other institutions using significant grant monies. The Utah Digital Newspapers Program, for example, began in 2002 with the first of three Library Services and Technology Act grants, which was followed by a $1 million grant a year later from the Institute for Museum and Library Services. The million-dollar grant was to fund the digitization of 264,000 newspaper pages, complete with inclusion of metadata, to augment the existing searchable Utah Digital Newspapers Program Database. [4] Not only did the Carnegie Mellon University Libraries have the equipment and know-how in place for the Jewish Newspaper Project, the workflows from earlier projects allowed the project to get underway without the infusion of major soft money. The Jewish Newspaper Project, started in 2003 and slated for completion in 2011, has so far cost only $100,000. And while a major effort is underway to obtain $100,000 to complete the scanning work, the cost of the end product will still be a fifth that of similar projects.
It was agreed that the Carnegie Mellon University Libraries would reconfigure a workflow that was originally created for the Posner Project and therefore take on the task not only of indexing the Jewish Criterion but of digitizing and providing digital access to the three newspapers, specifically the Criterion 1895-1962, the American Jewish Outlook 1934-1962, and the Jewish Chronicle 1962-present. The project would be undertaken by Carnegie Mellon University Digital Library Initiatives, a University Libraries’ department affiliated with the Carnegie Mellon University Archives.
In preparing for the digitization, creating the fullest run of each paper proved to be a significant challenge. The bound copies of the Jewish Criterion held at the Rodef Shalom Congregation Archives and of the American Jewish Outlook held at the Rauh Jewish Archives of the Historical Society of Western Pennsylvania were both incomplete series. Back issues of the Jewish Chronicle are located at the publication office of the paper, where additional copies are being added as the newspaper is published weekly. The Criterion presented special problems because a complete original set in paper does not exist. Rodef Shalom Temple had the majority of the bound volumes but there were still large gaps in the collection, especially for the years 1895-1901 and 1906-1914. The Carnegie Library of Pittsburgh was able to contribute 1911-1914 in paper format for the project. The only other mostly complete set was a microfilm version held by the New York Public Library. Although digitizing the original paper format is preferable because of image quality, the project team decided to digitize the microfilm of the missing issues in order to provide as comprehensive a digital collection as possible.
Inclusion of the microfilm brought the number of repositories with whom Carnegie Mellon would need to coordinate archival access for the project up to four, a significant number of partners in a project where the loaned items are unique and fragile. Establishing confidence in the lenders of these rare newspapers took a great deal of coordination and what might be classified as “customer service.” As the project progressed, part of the workflow was to keep lenders updated on the project, including updates on where their items were in the work queue and how they were being used. Lenders would only loan three to four volumes of newspapers at a time, not an unusual precaution among archives that loan rare materials. During this time, lenders were in contact with project coordinators, receiving project updates and able to check in as they wished. It should be noted that scanning an archival collection does not result in disposal of the original hard copy collection. This was not a project where the hardcopies could be thrown out after the scans were made; part of the agreement with the partnering repositories was that their originals would be returned in the same condition in which they were sent. As the project progressed, Carnegie Mellon gained the trust of lenders who saw their materials coming back in good shape and in a timely fashion. This trust ensured that all available copies of the papers were eventually scanned.
Intellectual Property
Locating the actual physical newspapers was not the only obstacle to access around which the project had to navigate. Issues of intellectual property also needed to be addressed. From their inception until 1962, the Jewish Criterion and the American Jewish Outlook were independently owned and operated. They were, in fact, in direct competition with one another, covering the same community and events. In 1962, both newspapers were acquired and closed by the Pittsburgh Jewish Publication and Education Foundation, which was created with the express purpose of publishing a new weekly publication called The Jewish Chronicle. All intellectual property rights for the Jewish Criterion, the American Jewish Outlook, and The Jewish Chronicle were transferred to and held by the Foundation. During the digitization process and prior to publication, Carnegie Mellon University Libraries received non-exclusive electronic rights to digitize all three publications and make them freely readable on the World Wide Web.
Scanning the Collection
Once the legal details of the project were put to rest, questions about how to scan these physically unwieldy items needed to be addressed. All three publications came bound in oversized format volumes, the largest measuring 11 ½” by 17”. The earliest issues were created strictly in black and white with color covers being introduced in the Criterion in March 1940. Beyond the basic challenges of handling such large objects, specific format details created other problems. Although all three publications are considered newspapers, the majority of issues resemble magazines, printed not on newsprint but on coated paper that reflects light, making the scanning process that much more difficult. Furthermore, the pages of all three periodicals are printed in columns and consist of an amalgamation of graphics, advertisements and a variety of font types. While this format does not affect the quality of the actual scan it does make the automated searching of that scan, and the application of metadata to the information within the scan, that much more difficult—and it is this automated searching that makes the Jewish Newspaper Project valuable to researchers, allowing keyword searches across thousands of unindexed articles.
Scanning the microfilm images posed a different set of challenges. The original microfilming was done on 35 mm film, filmed two images per frame. As luck would have it, the camera used to photograph the New York Public Library run of the Criterion had a burned-out light bulb, leaving all of the images on the left side of each frame unevenly dark. In an effort to improve image quality, an attempt was made to automatically color correct the images; however, this attempt proved unsuccessful and was abandoned.
Other problems with the collections that hindered the project included missing pages, pages that were partially cut out, large, dense text blocks with little or no border, tightly bound materials, and inconsistent dates and volume numbering. At times a supplemental issue was released without a date or with a confusing volume and issue number. Such anomalies challenged not only the scanner operators who worked to get the best image possible from compromised materials, but also meant that those people inputting the metadata for each scan (usually the same scanner operators) needed to address difficult cataloging decisions in the middle of the project workflow.
Workflow
A workflow is developed for all digital projects that enter the Carnegie Mellon University Digital Library Initiative work queue. The workflow informs project members of the required steps for each project as well as the directional movement of data and materials. See Figure 1 below for the workflow process for the Jewish newspapers project.
![[figure] [figure]](/j/jahc/images/3310410.0012.10600.jpg)
Diagram of the workflow for the Pittsburgh Jewish Newspaper Project.
The workflow for The Jewish Newspapers Project shows the steps taken to transform the paper/microfilm collection into a keyword, full text searchable database. What follows is a more detailed description of that workflow starting at the point of scanning.
Digitization: Equipment and Software
Digitization of the paper collections is done on the Zeutschel Omniscan 6000 color scanner. [5] The Zeutschel Omniscan 6000 is designed to accommodate rare books in nonstandard formats, and was originally purchased for use with the Posner Memorial Collection scanning project. To scan the newspapers on the Zeutschel, we use a book cradle with a glass-plate top to protect the bindings of the bound volumes. In addition, handmade book cradles were constructed for use on those volumes that could not be opened to the full 180 degree angle. This existing scanner at Carnegie Mellon, paired with the book cradles, proved to be a good fit for the problematic, oversized and tightly bound volumes of newspapers.
The hardware and platform for the scanning device changed over time. As of fall 2007 we were using a Dell Optiplex GX620 PC with Pentium D 3.20 Ghz processor with 2 GB of RAM, running Windows XP. The machine has a 21” monitor. There are two storage disks—one containing 148 GB and the other containing 225 GB of storage space.
The imaging software is Omniscan 6.03, which provides contrast enhancement, color correction, rotation, despeckling, deskewing, cropping, and masking. The software will also remove gutter curvature and shadows. [6] The software allows the scanner operator to quickly capture images and correct flaws in image quality.
Digitization: The Process
The scanning operator begins the scanning process by creating a file folder name. (The file naming convention is described later in the article.) Each issue of all three serials within the Jewish Newspaper Project receives its own file folder. Once the file folder name is created the scanning operator begins the digitization process, scanning approximately one image every two minutes. Scanning time is slower than scanning materials from the general collection because the materials are tightly bound and often the text block goes into the gutter. The scanning operator has to be careful to get a clear image without capturing the gutter while getting all of the information found in the text block. During downtime, at lunch, and at shift’s end, the scanning operator transfers the scanned image files through the network to a middleware server called Wolfpack, also described later.
The scanning operator is responsible for creating collection condition reports and sending these to the appropriate curator. The scanning operator also keeps scanning statistics on each issue in the collection and details the number of images created. Finally, and most important, the scanning operator tracks the location and movement of the source materials, in this case the three serials used in the project. This activity is critical when collaborating with librarians and archivists from other repositories who are concerned about the care and maintenance of their collections. Losing materials erodes confidence in the project team and impedes progress and, with four separate repositories involved in the project, this stewardship aspect of the scanning process is critical.
Bibliographic Metadata, Persistent URLs and File Naming Conventions
Carnegie Mellon University Libraries create and maintain a variety of digitized archival collections, the contents of which number well over one million individual digital items. The system applied to these collections was transposed onto the Jewish Newspaper Project, specifically the use of persistent URLs. [7] As an academic library system, the Carnegie Mellon University Libraries use persistent URLs to ensure that data provided by the Carnegie Mellon Library system is consistent and reliable—a must if this information is going to be cited in scholarly research and publications. Any time information or data are moved from one part of the server to another part, or even onto a new server, the URL always points end users to the information they are seeking. This is done by setting up a middle-layer server, the sole purpose of which is to take the request the end user is making and look for where the data reside. This allows movement of the data from server to server, provided someone pays attention to the maintenance of the middle-layer server so that it is always updated with current data locations. The newspapers in question could have been scanned much more quickly and inexpensively without this middle-layer server and the personnel time employed to create the metadata behind the scans. The result, however, would have been more amateur than professional. Anyone who has used “homegrown” websites for research will know firsthand the frustration of good information lost via broken links. And anyone who has conducted academic research knows that such broken links can negate the reliability of an otherwise rich source of information.
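The middle-layer lookup described above can be sketched as a simple table mapping each persistent identifier to the data's current location. This is an illustrative sketch only, not the project's actual implementation; the identifiers, host names, and function names are hypothetical.

```python
# Minimal sketch of a persistent-URL middle layer: a table maps each
# persistent identifier to the current storage location, so data can be
# moved between servers without breaking published URLs. All identifiers
# and addresses below are hypothetical examples.

LOCATIONS = {
    "CRI_1939_094_015_08191939": "http://server2.example.edu/jnp/CRI_1939_094_015_08191939",
}

def resolve(identifier):
    """Return the current location for a persistent identifier."""
    try:
        return LOCATIONS[identifier]
    except KeyError:
        raise LookupError(f"unknown identifier: {identifier}")

def move(identifier, new_location):
    """Record a new location when data moves; published URLs keep working."""
    LOCATIONS[identifier] = new_location
```

The published URL never changes; only the table entry behind it does, which is why keeping that one table current is the entire maintenance burden of the scheme.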
The idea of persistence extends to the database as a whole. In taking on the Jewish Newspapers Project, the Carnegie Mellon University Libraries promised to maintain the database and its website in perpetuity, adding yet another task to the day-to-day responsibilities of the Carnegie Mellon University Libraries’ IT staff. The task of maintaining the server is a policy issue, supported through the infusion of human resources from the IT staff, as needed to stay on top of the task. Making changes to one server is less confusing and costly than having to make changes to dozens of locations. Another advantage that persistence offers is that it allows the end user to access each digital object from the web directly by using the URL. This allows for direct links from bibliographic citations and footnotes.
When digitizing a serial with multiple issues the question arises as to how to describe the serial and each of its parts. Each of the three newspapers received a MARC record and was catalogued in OCLC. The MARC record describes each collection as a whole and does not describe individual issues within a collection. It was possible to create a MARC record with a persistent URL in the 856 field for each issue, but this would have been cumbersome. [8] Instead, a web page was created for the project where each of the journals was described and given a persistent URL placed in the 856 field in the MARC record that points to the web page.
For all of the digital projects housed at Carnegie Mellon University Libraries, a PostgreSQL database is employed along with the Apache module mod_rewrite, which rewrites the URL on the fly. This allows a seamless transition from the old URL to the new URL for the end user. To create persistent URLs, the technical members of the project team at Carnegie Mellon made a conscious choice to loosely base the database upon the CNRI Handle System DOI model, so that the project could adopt the standard should it survive. [9] However, they do not feel confident that the model will survive over the long term and believe it is important to keep the architecture flexible in case the DOI model disappears. The database itself is small and does not require a designated server.
Another area where policy and technology issues intersect is in the file naming schema. It is imperative that file naming be consistent and compatible across collections and operating systems. The schema developed for the project uses a portion of the journal name, year, volume number, issue number, and date. Each issue is required to have a separate folder. Images are placed in a folder named with the corresponding identifier (e.g., an identifier for an issue will result in a folder with the same name, containing 00000001.tif through 0000000N.tif). Here is an example illustrating how an issue of the Criterion from August 19, 1939, Volume 94, Issue 15 would look using the naming schema:
Title – XXX (CRI)
Year – XXXX (1939)
Volume Number – XXX (094)
Issue Number – XXX (015)
Date — mmddyyyy (08191939)
Underscore for delimiters – ( _ )
Resulting file folder name would be: CRI_1939_094_015_08191939
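The schema above is simple enough to express as a small helper function. The function name is ours, but the fields, zero-padding, and underscore delimiters follow the convention described in the text.

```python
def folder_name(title, year, volume, issue, month, day):
    """Build an issue folder name in the form TITLE_YYYY_VVV_III_MMDDYYYY.

    `title` is the three-letter title code (e.g. CRI for the Criterion);
    numeric fields are zero-padded to the widths given in the schema.
    """
    return "_".join([
        title,
        f"{year:04d}",                       # four-digit year
        f"{volume:03d}",                     # zero-padded volume number
        f"{issue:03d}",                      # zero-padded issue number
        f"{month:02d}{day:02d}{year:04d}",   # mmddyyyy date
    ])

# The Criterion issue of August 19, 1939 (Volume 94, Issue 15):
# folder_name("CRI", 1939, 94, 15, 8, 19) -> "CRI_1939_094_015_08191939"
```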
System Administration for Preservation and Access
The heavy lifting for the project comes in the form of system administration, where files are moved, organized, copied, and backed up for preservation. Preservation requirements are different from requirements for searching and display, so a preservation master image is created for each scan, and then derivative formats are created from the master image for searching and display. The digital masters, intended for preservation, were scanned at 300-400 DPI, 24 bit color and saved as TIFFs using Group IV compression, an international standard. Images were written in sequential order, with corresponding 8.3 file names, e.g., 00000001.tif as the first image in a volume sequence and 00000341.tif as the 341st. The current size of the Jewish Newspapers Project database is approximately 1.5 Terabytes, with an estimate that, by project’s end, the database will top about 3.0 Terabytes. To date, over 158,000 TIFF images have been created, with that number expected to double by the time the project is completed. Each image is converted to multiple derivative formats; in the case of the Jewish Newspapers Project that includes a text file and a JPEG. When the project is complete the system administrator will be responsible for managing approximately 900,000 electronic files. The system administrator is also responsible for performing back-ups of the information and verifying the integrity of the data as they are converted using Wolfpack technology.
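The sequential 8.3 naming and the master-to-derivative relationship described above can be sketched in a few lines. The function names and the derivative extensions shown are illustrative; the text states only that a text file and a JPEG are produced for each master.

```python
def master_name(sequence):
    """Sequential 8.3 file name for a preservation master TIFF,
    e.g. the 341st image in a volume is 00000341.tif."""
    return f"{sequence:08d}.tif"

def derivative_names(tiff_name):
    """Names of the derivatives produced from one master:
    a text file (for searching) and a JPEG (for display)."""
    stem = tiff_name.rsplit(".", 1)[0]
    return [stem + ".txt", stem + ".jpg"]
```

With a master and two derivatives per page, roughly 300,000 masters at project's end yields on the order of the 900,000 files cited in the text.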
Wolfpack Technology
Wolfpack, an open source system developed at Carnegie Mellon University Libraries, is a distributed file conversion system. The system allows conversion of files from one format to another in an automated manner, reducing the amount of human labor and increasing the speed at which we can make materials available on the web after digitization. As mentioned earlier in this paper, in the Jewish Newspapers project images are digitized and turned into TIFF files. TIFFs are similar to photographic images and are extremely limited in their functionality. They present two problems in that they are large and cumbersome, and cannot be searched. To solve these problems we must convert the TIFF images to formats that are more suitable for searching and display. The Wolfpack technology allows the system administrator to batch process or transfer a large number of images into the system for automated conversion. The Wolfpack server can be set to determine which conversions need to be done. For the Jewish Newspapers Project, TIFF images are converted to text files for searching and JPEGs for display. In the near future, based on user feedback, the TIFF images will be converted to searchable PDFs that will allow the search terms to be highlighted in the display file.
The Wolfpack system consists of three pieces of software:
- a crawler to determine which files are missing, and add that information to a database,
- a server program to hand out unprocessed files to clients and get the completed files back,
- a client program which gets work from the server, creates the missing file, and sends it back.
The crawler program looks through the file system to determine which files need to be created for a given collection. It contains a set of rules which determine the required files. (It is similar to the UNIX “make” or Java “ant” programs.) For example, if one has a TIFF image, and the rules say that every TIFF needs to have a corresponding text file, the crawler adds a new “TIFF to text” conversion to the database for that file.
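The crawler's make-style dependency check could look like the following sketch. The rule set shown ("every TIFF needs a corresponding text file and JPEG") matches the conversions named in the text, but the function name and data structures are our illustration, not Wolfpack's actual code.

```python
def missing_conversions(existing_files, rules):
    """List the conversion jobs whose output files do not yet exist,
    mirroring the crawler's make/ant-style dependency check.

    `rules` maps a source extension to the derivative extensions that
    every file of that type must have (illustrative rule format).
    Returns (source_file, missing_target_file) pairs to be queued.
    """
    existing = set(existing_files)
    todo = []
    for name in existing_files:
        stem, _, ext = name.rpartition(".")
        for target_ext in rules.get(ext, []):
            target = f"{stem}.{target_ext}"
            if target not in existing:
                todo.append((name, target))  # queue this conversion
    return todo
```

Each pair returned corresponds to one row the crawler would add to the shared database for the server to hand out to clients.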
The server program is responsible for handing out work to the clients and getting the results back. The server uses the same database as the crawler process to determine what work needs to be performed. When a client asks the server for work, the server looks in the database to see what conversions are needed, and passes the corresponding input file to the client. When the client has completed its work, it sends the resulting file back to the server. The server adds the new file to the file system and removes that conversion from the database. In this manner, all the conversions which the crawler adds to the database are performed.
The client program is where the actual conversion is performed. Different clients may be able to perform different types of conversions; for example, only clients with OCR (optical character recognition) software installed can perform OCR. When the client asks the server for work, it tells the server what conversions it can perform. The server finds a suitable conversion in the database and gives the client the corresponding input file. The client then runs the conversion program to get the correct output file, and then sends the output file back to the server.
The running of the conversion program is the sole purpose of the client. The actual conversion programs are outside of this system, but are abstracted using a “wrapper” which takes the input file, runs a script, and gets the resulting output file. The script can be something as simple as calling a command-line program, or something as complicated as running a Windows batch file program which opens a Windows program and simulates a user clicking buttons and typing keys. The client program also calculates a checksum for its output file, and this is used to ensure that the output file is reliably transferred back to the server.
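A minimal version of such a wrapper, including the checksum step, might look like this. The external conversion command is whatever wrapped tool performs the work (OCR, JPEG creation, and so on); the function name and the choice of MD5 are assumptions for illustration, as the text does not specify the checksum algorithm.

```python
import hashlib
import subprocess

def convert_with_checksum(command, input_path, output_path):
    """Run an external conversion program on one file, then compute a
    checksum of the output so the server can verify that the file was
    transferred back intact. `command` is the wrapped conversion tool
    as an argument list, e.g. a command-line converter (illustrative).
    """
    # Invoke the wrapped tool; raise if it exits with an error.
    subprocess.run(command + [input_path, output_path], check=True)

    # Checksum the output file in chunks to handle large images.
    digest = hashlib.md5()
    with open(output_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()  # returned to the server alongside the file
```

The server recomputes the checksum on its copy of the returned file; a mismatch means the transfer failed and the conversion is retried.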
Open-source and CMU-developed programs are used for many of the conversions such as “TIFF to JPEG”, “TIFF to BZIP”, and “multiple PDF files into a single PDF”. We currently use commercial programs for PDF creation and for Optical Character Recognition. We also use a commercial scripting program for Windows which automates the running of Windows programs which need mouse input, such as the OCR software.
DIVA
After materials have been digitized and converted using Wolfpack they are added to the DIVA system and made available to end users. DIVA is an Oracle-based information system that was written and customized by Carnegie Mellon University Libraries. It allows end users to search, browse, view and print digital images found within the Jewish Newspapers Project. DIVA provides conventional access to library and archival materials, and adds powerful new functions for searching and retrieving documents, supporting multimedia, and customizing the structure and presentation of collections. The current operating system for DIVA is Windows 2003 and it resides on a Dell server in a controlled environment. See Figure 2 below for an example of a browsable page in DIVA.
![[figure] [figure]](/j/jahc/images/3310410.0012.10601.jpg)
Example of Browse page in DIVA.
Website Design and Usability Testing
Staff members in the Digital Library Initiatives Department at Carnegie Mellon created and enhanced the Jewish Newspapers web presence, the design of which is intended to be welcoming and attractive. The primary goal of the website is to provide functionality with a display that makes using the database easy and enjoyable while providing outreach to as broad an audience as possible. See Figure 3 for a screen shot of the project homepage.
![[figure] [figure]](/j/jahc/images/3310410.0012.10602.jpg)
Project Home Page Screen Shot
To date the website has undergone two series of usability testing. The first series of testing was fairly informal, with Carnegie Mellon University Library staff and students as well as the archivists for the Rodef Shalom and Rauh Archives evaluating the site for its look and feel. Feedback was collected by the project director who then created a list of recommendations that would be implemented by the Digital Library Initiatives Group. The primary object of the second series of testing was to determine if new functionality developed for the project, called “snippets,” was useful to end users. [10] See Figure 4 below for an example of this new functionality.
![[figure] [figure]](/j/jahc/images/3310410.0012.10603.jpg)
Example of Snippet Search Results Page in DIVA
In addition, testing provided the feedback needed to make recommendations for the redesign of the display interface that would make snippet functionality clear and easy to use. Guided by a Human Factors Researcher, the second series of testing employed “think-aloud” protocols where subjects were asked to verbalize their thoughts as they moved through a series of tasks using the snippet technology. Subjects were drawn from the Carnegie Mellon University Library staff and student body. Recommendations were implemented by the Carnegie Mellon University Libraries Research and Development team. The current online incarnation of the project features these changes.
Preservation
As a collaboration between two archives and an academic library, The Jewish Newspapers Project has always counted preservation of the digital system and its files as an essential component of the project. The two main approaches to preservation for this project have been adherence to digital library best practices and standards for metadata as well as image creation and capture, and information back-up and redundancy.
As discussed earlier, the descriptive metadata uses the MARC format, with persistent URLs based upon a flexible application of the emerging Handle System model. [11] The file naming schema follows digital library best practices and the 8.3 file naming format. [12] The imaging portion of the project meets or exceeds DLF imaging standards. [13]
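An 8.3-compliant naming schema packs identifying information into at most eight characters plus a three-character extension. The sketch below shows one hypothetical way such names might be generated; the specific fields (title code, year, issue, page) and the `.tif` extension are our assumptions, not the project's documented convention.

```python
def page_filename(title_code, year, issue, page):
    """Build an 8.3-style file name for a scanned newspaper page.

    Hypothetical scheme: 2-char title code + 2-digit year +
    2-digit issue + 2-digit page number, e.g. 'jc050312.tif'.
    """
    name = f"{title_code:.2}{year % 100:02d}{issue:02d}{page:02d}"
    assert len(name) <= 8, "8.3 format allows at most 8 name characters"
    return name + ".tif"

print(page_filename("jc", 1905, 3, 12))
# → jc050312.tif
```

Encoding the sort order into the name itself means the files list chronologically even on systems with no metadata support, which is part of why the 8.3 convention persists in digital library practice.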
To ensure long-term preservation of the digital images, all of the files are regularly backed up on LTO-2 tapes. [14] Tape technology is changing quickly to keep pace with the ever increasing capacity of hard drives. This is a good thing, because it forces the system administrator to refresh tapes regularly as new technology emerges. System administrators are confident that the tape backup system can store and, when necessary, restore the data. The tapes are kept in controlled environments in two different locations. As Carnegie Mellon University develops campuses and programs overseas, it will become possible to expand our storage locations to other countries and continents for security and preservation reasons.
Staffing and Other Costs
The complexity of the project requires the involvement of several departments within the Carnegie Mellon University Libraries. A team was assembled that includes a library department head (for project oversight), a system administrator (for file management and backups), three members of Library Instructional Technology R&D (to modify the DIVA system design, adjust Wolfpack Technology to meet project needs, and maintain persistent URLs), a digitization projects manager (for oversight of the website design), a usability researcher (for website testing and design), and a scanning operator (for digitization and file transfer). All of these team members contribute a portion of their workday to the project, except the scanning operator, who works on it full time. In addition, members of the Heinz History Center, the Rauh Jewish Archives, the Rodef Shalom Temple Archives, the Carnegie Public Library, and The Jewish Chronicle of Pittsburgh all contribute time to the project.
The Carnegie Mellon University Libraries participate in the Jewish Newspapers Project largely because of their relationship with major donors and friends of the University. Before the project began, it was determined that it would require a budget of approximately $200,000. At the start of the project the American Jewish Federation contributed $45,000. Even though not all of the funding was in place, the Carnegie Mellon University Libraries decided to proceed on a shoestring budget, using a combination of cost sharing and in-kind donations.
Popularity of the project has resulted in an ongoing critique of its search engine. While a lack of funding has not allowed for the additional resources needed for proposed upgrades, the commentary generated by users has provided several items for a continuing “wish list” of ways to make the database even more effective.
Usage
It is important to know whether and how the Jewish Newspaper Project is meeting the needs of its users. The Carnegie Mellon University Libraries use transaction log analysis to understand how the data are being used, and how often and by whom the project is being accessed. Transaction log analysis is considered another best practice by the digital library community. The program used is Analog 6.0, a widely used freeware log file analyzer. The software tracks what people access on the website and creates reports with easy-to-read summary statistics. While it does not identify the precise person using the system (nor do we want to know this information), it does report how many unique computers accessed the database, so if one person, or robot, were making a million requests we would know it. For many of our digital collections the number one user is Googlebot. [15]
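The core of this kind of analysis is simple: tally requests per client host from the web server's access log. The sketch below, our own illustration rather than anything from Analog itself (which is a compiled program with its own report formats), shows the idea on a few common-log-format lines; the sample IP addresses are invented.

```python
from collections import Counter

def unique_hosts(log_lines):
    """Count requests per client host from common-log-format lines.

    The host is the first whitespace-delimited field; tallying it
    mirrors the 'unique computers' summary a log analyzer reports.
    """
    return Counter(line.split()[0] for line in log_lines if line.strip())

sample = [
    '66.249.66.1 - - [01/Jan/2007:10:00:00] "GET /pjn/ HTTP/1.0" 200 512',
    '128.2.20.5 - - [01/Jan/2007:10:01:00] "GET /pjn/search HTTP/1.0" 200 1024',
    '66.249.66.1 - - [01/Jan/2007:10:02:00] "GET /pjn/p2 HTTP/1.0" 200 2048',
]
counts = unique_hosts(sample)
print(len(counts))            # → 2 (distinct hosts)
print(counts.most_common(1))  # heaviest requester, e.g. a crawler's address
```

Because the tally is per host rather than per person, a single heavy user or crawler stands out immediately, which is exactly how Googlebot shows up as the top "user" of many collections.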
In the first month after release the database received over 11,000 unique requests per day. Once the excitement of the initial release had passed, the number of unique searches dropped but remained high. For example, in the second six-month period of 2007, 35,416 unique searches were conducted, an average of 5,903 searches per month, or about 195 per day.
So, What Does It All Mean?
The success of the Jewish Newspaper Project provides clear evidence that digitizing locally held newspapers and granting digital access to them drastically increases their usage. Searchable digital surrogates give users powerful search tools that enhance the value of their research while protecting the fragile hard copies from handling. That this project was completed on a shoestring budget, using reconfigured technology and workflows, is proof that equipment, software, workflows, and staff knowledge gained for specific archival digitization projects can be successfully repurposed. The Jewish Newspaper Project demonstrates this success in the number of users it serves daily and the positive feedback it receives from them.
Notes
1. "The Senator H. John Heinz III Archives." http://www.library.cmu.edu/Research/Archives/Heinz/ (accessed 03/07/2009).
2. "Pittsburgh Jewish Newspaper Project." http://pjn.library.cmu.edu/ (accessed 03/07/2009).
3. "Posner Memorial Collection." http://posner.library.cmu.edu/Posner/ (accessed 03/07/2009).
4. Kenning Arlitsch and John Herbert, “Microfilm, Paper, and OCR: Issues in Newspaper Digitization; The Utah Digital Newspapers Program,” Microform and Imaging Review, Vol. 33, No. 2: 59-60.
5. The equipment was originally purchased in 2000 for the Posner Memorial Collection in Online Format Project which reached completion in 2005.
6. When digitizing tightly bound books the edge of the scanned image may show some curvature in the shape of a black line due to its proximity to the gutter.
7. Persistent URLs guarantee that a URL will work in perpetuity. This is done by making the URL for a web resource point to a service which automatically redirects the user to the current location of the resource.
8. The MARC record 856 field provides the electronic location and access information to an electronic resource.
9. The Handle System DOI model is a method of creating persistent URLs, http://www.doi.org/
10. A sentence in which the user’s search terms appear is presented along with a link to the page in which it was found. This provides context as to how the search term is used on the page and reduces the amount of time end users need to reach their desired results.
11. The Handle System DOI model is a method of creating persistent URLs, http://www.doi.org/
12. A standardized file naming convention limiting names to eight characters plus a three-character extension.
13. DLF imaging standards can be found at: http://www.diglib.org/standards/bmarkfin.htm
14. Data storage tapes used for backup because of their high storage capacity.
15. A web crawler used by Google to collect web pages to be indexed and made available through its search service.