Technical

About This Technical Section

The success of CRMS lies in its review process and the technical infrastructure that supports it. The CRMS interface presents a scanned image of a work in HathiTrust; the reviewer makes a copyright determination on the volume using the interface and the research tools the interface makes available (Stanford Copyright Renewal Database, the Virtual International Authority File [VIAF], etc.). The system stores a record of that determination, and, when appropriate, the system then exports the determination to the HathiTrust Rights Database. The system also includes methods for verifying reviews and determinations.

Access to scanned images of works in HathiTrust is essential to CRMS, as it would be to any copyright determination project at a comparable scale. This technical section therefore presumes that your project will be working with digital scans. Physical volumes are time-consuming and inefficient to manage by comparison.

Background

Copyright determination at the University of Michigan Library did not begin with CRMS. By the time the first version of CRMS went online in 2009, the staff of the Electronic Resource Access Unit had already conducted rights research on over 55,000 volumes in HathiTrust. This would have been an impressive accomplishment in and of itself, but the reviewers were working “manually” with only Excel spreadsheets and cumbersome automation. Their rights determinations were exported to the HathiTrust Rights Database monthly.

The first IMLS grant allowed the CRMS project team to streamline the rights research process by consolidating everything required for a copyright determination into one online interface. Reviewers had easy access to the scanned volume, several information resources to assist in making a determination, and a searchable database of all past rights determinations. The design of the system ensured the reliability of the determinations by requiring at least two reviewers for each volume and introduced a “conflicts” interface for expert reviewers who could adjudicate whenever reviewers disagreed. An automated processing script exported determinations to the HathiTrust Rights Database each night. After seven months of development, the first version of CRMS-US went live in July 2009.[65] The development of a training site for CRMS reviewers in May 2010 was also an opportunity to add functionality to allow system access for reviewers from Indiana University, the University of Minnesota, and the University of Wisconsin, all of whom began contributing work the following July.

The second IMLS grant allowed the CRMS project team to adapt the CRMS interface for rights research on non-US works. Development of the CRMS-World interface required five months. Testing in late April 2012 had the new version ready in time for the first CRMS-World training summit in early May. More rounds of testing and development followed that summer. A large part of the development effort for CRMS-World was concerned with migrating nonshared information from the source HTML and Perl code into the database and configuration files and making it possible for these to be extracted and used at runtime. An example of this is the list of information sources made available to reviewers for copyright research: CRMS-US uses the Stanford Copyright Renewal Database, whereas CRMS-World includes a number of other tools such as VIAF. The goal was to have everything differentiating the two systems be part of the database or configuration file, avoiding hardcoding to the greatest extent possible.

The development of CRMS-World had the advantage of starting from what was by then a mature CRMS codebase. The system could detect which “mode” (US or World) to run in and dynamically choose the interface and backend logic components that were appropriate for each reviewer. This shared codebase reduced maintenance costs because a tool written for one mode would work largely unchanged in the other. In a very real sense, CRMS-US and CRMS-World were one system that “came in two flavors,” one formality-based and the other author-based.[66]

A Glossary of Terms Useful for Copyright Determination

Like any complex project, CRMS has acquired its own vocabulary. Here, we provide definitions of our terms in four main categories:

  1. Objects being reviewed
  2. User roles
  3. Interface/system
  4. Rights determination

This glossary can also be found in the appendices, with terms listed in alphabetical order.

1. Objects Being Reviewed (“Candidate Pool”)

The architecture of a digital library adds complexity to the concept of a “book,” so many of the terms used to describe objects being reviewed do not in fact make it easy to talk about “how many books were reviewed.” In order to accurately associate rights codes with a specific physical object and to reduce duplicate reviewing of different copies of the same item, CRMS makes use of metadata to distinguish relationships. The nature of these relationships often makes it difficult to accurately count “books” as a statistic. Instead we deal with unique scanned objects that become eligible or ineligible for system consideration based on their accompanying metadata. (The following definitions build on each other and thus are presented in conceptual order rather than alphabetically.)

Volume: A volume in HathiTrust is not a “book” in the normal sense of that word but a unit of measurement indicating the unique scan representing one physical item. In line with common library binding practice, it may represent a discrete monograph, a single volume from a monographic series, or several items bound together. Scans of the same work but from different physical copies are treated as unique volumes, and each one receives its own volume ID. Copyright determinations are made at the volume level.

Volume ID: The volume ID is an alphanumeric identifier assigned by HathiTrust and Zephir to a volume (e.g., mdp.39015005731453). Each scan representing a different physical copy of a work is assigned a unique volume ID.
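For projects that handle volume IDs in code, a minimal sketch of splitting one into its component parts may be useful. It assumes the ID is a short prefix and an item identifier separated by the first period; the field names `namespace` and `local_id` are illustrative labels, not terms defined in this section.

```python
from typing import NamedTuple


class VolumeId(NamedTuple):
    namespace: str  # prefix before the first period, e.g. "mdp" (label is an assumption)
    local_id: str   # everything after the first period (label is an assumption)


def parse_volume_id(volume_id: str) -> VolumeId:
    """Split a volume ID at its first period into prefix and item identifier."""
    namespace, sep, local_id = volume_id.partition(".")
    if not sep or not namespace or not local_id:
        raise ValueError(f"not a valid volume ID: {volume_id!r}")
    return VolumeId(namespace, local_id)


parts = parse_volume_id("mdp.39015005731453")
print(parts.namespace, parts.local_id)
```

Splitting only at the first period keeps any later periods inside the item identifier intact.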

Figure 4 A breakdown of the component parts of a Volume ID

Catalog ID: The catalog ID is a unique identifier assigned by HathiTrust and Zephir that joins together related volume IDs of a particular work in the same edition. Each catalog ID in Zephir may have one or more than one volume ID associated with it, depending on how many copies of that work in that same edition are in HathiTrust. This relationship can be used to assign rights codes to duplicate volumes; however, a catalog ID may also represent volumes in a multipart monograph. In this case, the catalog ID does not indicate volumes that are exactly the same and should not be used for rights code inheritance without determination of individual parts.

Figure 5 Relationship between a Catalog ID and Volume IDs

Candidates (pool): The candidates pool is a subset of volumes within HathiTrust whose metadata (date and place of publication, country of origin, current rights, etc.) indicate they are within scope for a defined CRMS copyright review project. The candidates pool will trend toward zero as work progresses; however, it may remain level or even increase as HathiTrust ingests new volumes that match the scope. Candidates are updated each night by a query run against the HathiTrust Rights Database. In some cases, volumes are dropped from candidates due to a change in eligibility often stemming from a correction to their bibliographic metadata.

Active volume: A volume in the candidates queue becomes active whenever someone reviews it. Active volumes are given precedence by the queuing algorithm because work has already been done on them. A volume ceases to be active when all parts of the review process are complete.

Source volume: A source volume is the specific scan that has undergone manual review. A volume ID represents the source volume. Once one copy is reviewed in CRMS and becomes a source volume, then all the other copies associated with that particular catalog ID in Zephir may become “inheriting volumes,” provided there is no indication of enum/chron (enumeration and chronology) in the catalog record.

Inheriting volume(s): Inheriting volumes are all duplicate copies of a work (in that particular edition) in HathiTrust. After a source volume’s rights code is exported to the HathiTrust Rights Database, volumes eligible for inheritance are automatically given the same rights code. Inheritance takes place when a CRMS determination is exported to the Rights Database.
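The inheritance rule described in these definitions can be sketched roughly as follows. This is an illustration under stated assumptions, not the CRMS implementation: the `Volume` class, its field names, and the `test.*` identifiers are hypothetical, and the sketch treats any enum/chron value on the shared catalog record as blocking inheritance for the whole record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    volume_id: str
    catalog_id: str
    enum_chron: Optional[str] = None   # e.g. "v.2"; None when absent
    rights_code: Optional[str] = None  # "attribute/reason", e.g. "pd/crms"


def inherit_rights(source: Volume, volumes: List[Volume]) -> List[Volume]:
    """Copy the source volume's rights code to eligible duplicates.

    Returns the volumes that inherited. Nothing inherits if the source has
    no rights code or if any volume on the same catalog record carries
    enum/chron metadata (assumed here to signal a multipart monograph).
    """
    same_record = [v for v in volumes if v.catalog_id == source.catalog_id]
    if source.rights_code is None or source.enum_chron:
        return []
    if any(v.enum_chron for v in same_record):
        return []
    inheriting = []
    for vol in same_record:
        if vol.volume_id != source.volume_id:
            vol.rights_code = source.rights_code
            inheriting.append(vol)
    return inheriting
```

In a production system this step would run as part of the export, since inheritance takes place when a determination is sent to the Rights Database.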

Figure 6 Inheritance IDs

Inserts: Component parts in a larger work that were written or created by other authors and may be subject to different copyright terms. Illustrations, articles, quotations, lyrics, and diagrams are examples of “component parts” that could turn out to be inserts. An insert could be an extensive part of a larger work, but even a brief insert can be significant. The presence of an insert is one of the more common reasons why a CRMS reviewer may decide a volume should be set aside as “undetermined.”

Multipart monograph: A work composed of more than one part in which the parts have been published over a span of time (usually several years). A multipart monograph can be a special problem in copyright determination because the parts of the work may be subject to different copyright laws—for example, a US work in which the first part was published in 1920, the second part in 1925, and the third in 1930. As a result, the individual parts have to be reviewed independently, even though technically they belong to the same work.

Enum/chron (enumeration and chronology): These are standard metadata used in library catalogs for serial publications and multipart monographs. The presence of enum/chron metadata in a record prevents inheritance of rights codes in CRMS because volumes that are part of a multipart monograph may be subject to different rights.

2. User Roles

Roles are the basis for determining the kinds of privileges people have within CRMS, the interface features available to them, and the levels of access they have to works in the system. In some cases a person may have more than one role.

Reviewer/advanced reviewer: A reviewer is a person authorized to perform copyright determinations. A reviewer is moved up to the status of an advanced reviewer after demonstrating consistent and reliable understanding of the process. Advanced status requires less oversight of a reviewer’s work.

Expert reviewer: An expert reviewer is a reviewer who is specially trained to adjudicate conflicting reviews. Experts are selected from top-performing reviewers to address conflicts generated by reviewers.

External admin: An external admin is a liaison from a partner institution that may not have authorization to perform copyright determinations but requires access to performance statistics of reviewers from their institution in order to make supervisory decisions.

Admin: An admin is someone entitled to see all project dashboards, statistics, and user information in order to run the project, assess performance, and track activity. An admin cannot override the constraints of the system to change the rights status of a volume.

Super admin: A super admin has the highest level of permissions and may override system logic in order to review any volume, not constrained by the scope of any given candidate pool. Formal legal training is a consideration in granting this role. The system developer also has this role.

3. Interface and System

PageTurner: A HathiTrust application that enables authorized reviewers to view scanned page images. CRMS embeds a version of PageTurner in its interface, but it is a separate application owned and maintained by HathiTrust. HathiTrust access and authentication modules confirm whether a user is authorized to access it. If a request for access does not come from an approved IP address, PageTurner will restrict access to works in the public domain. For more details about the application, see http://www.hathitrust.org/access_determination.

Priority: Priority codes route a volume through the CRMS system so it will be displayed to the appropriate user and in some cases restricted from view to other users. The majority of volumes are given Priority 0, which enables any reviewer to see them. Some volumes receive higher priority to ensure they will be reviewed more quickly and/or by a more experienced reviewer.

Status: Status codes indicate how far a volume has progressed through the review process and, to some degree, which path that volume is taking through the system (e.g., whether both reviewers agreed). Each volume in the queue has a status code, with 0 being the default. The following are the status codes currently used in CRMS-World. There is no Status 1: the code went unused during the early development of CRMS, and that practice persisted. Volumes progress from Status 0 to another status depending on the result of the review process.

Status  Short explanation
0       Awaiting review or not yet processed
2       Conflict
3       Match pending expert review
4       Match
5       Reviewed by expert
6       HathiTrust issue reported
7       Status 3 expert review completed
8       Partial match resolved by system
9       System-generated review for rights inheritance

Validation/invalidation rate: A validation rate is the percentage of an individual’s reviews that either match other reviewers’ judgments or are deemed correct by experts. The statistic is shown as a validation rate in the reviewer’s personal display and as its complement, the invalidation rate, in the management team’s display. The validation rate is a broad measurement of how closely a reviewer is aligned with the CRMS review process. Adjudications where an expert elects to apply the Swiss option do not count against a reviewer’s validation rate. Instead, they are counted separately, influencing neither validation nor invalidation.

Swiss option: The Swiss option is an alternative to invalidation, which an expert reviewer may employ during adjudication to grant a neutral mark to a nonconforming review. Without this option, any reviews that do not match the expert’s would count as errors in the reviewer’s personal statistics. A Swiss option neutralizes the issue and avoids invalidating either reviewer. It is primarily useful in situations where there is complexity or a judgment call beyond the bounds of routine work.
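One way to compute the statistic is sketched below. The function name and arguments are hypothetical, and the sketch assumes that reviews given the Swiss option are left out of both the numerator and the denominator, which is one reading of “influencing neither validation nor invalidation.”

```python
from typing import Optional


def validation_rate(validated: int, invalidated: int, swiss: int = 0) -> Optional[float]:
    """Return the validation rate as a percentage, or None if nothing counts yet.

    `swiss` reviews are tallied separately by the caller; they are accepted
    here only to make explicit that they do not enter the calculation
    (an assumption of this sketch).
    """
    if min(validated, invalidated, swiss) < 0:
        raise ValueError("counts must be non-negative")
    counted = validated + invalidated
    if counted == 0:
        return None
    return 100.0 * validated / counted


rate = validation_rate(validated=90, invalidated=10, swiss=5)
print(rate)          # validation, as shown to the reviewer
print(100.0 - rate)  # invalidation, as shown to the management team
```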

4. Rights Determination

Review: A review is an individual reviewer’s judgment about the copyright status of a work. The reason for that judgment is stored in the system with a corresponding rights code. Depending on how a volume moves through the CRMS process, two or three reviews may accrue before a final determination is reached.

Conflict: A conflict occurs when two reviews for a volume disagree on one or more critical pieces of information that would affect access to the work. For example, two independent reviews of the same work are in conflict where one reviewer selects “public domain” and the other selects “in copyright.”

Final determination: A final determination is the collective result of all reviews done on a volume (including, if necessary, an expert’s adjudication). It is the result when that process is complete.

Exported determinations: Not all final determinations are sent to the HathiTrust Rights Database. Exported determinations are a subset of final determinations that meet criteria for export.

Attribute: A rights code is composed of two parts. The first half is called the attribute, and it represents the copyright status of the work and facilitates access control. Examples of attributes used by CRMS are “ic,” “icus,” “pd,” “pdus,” and “und.” There are twenty-six attributes (as of this writing), though most are not used in copyright determination. A list of attributes can be found at http://www.hathitrust.org/rights_database.

Reason: A rights code is composed of two parts. The second half is called the “reason,” and it accounts for why the volume was given that copyright status. There are eighteen “reasons” (as of this writing) accounting for a number of different situations. A list of reasons can be found at http://www.hathitrust.org/rights_database.

Rights code: A shorthand term representing both the attribute and reason code of a determination.

Rights database: The repository of rights information for each digitized volume in HathiTrust. The Rights Database should not be confused with the CRMS database, which is a separate repository that includes more detailed metadata necessary for rights research. For further details, see https://www.hathitrust.org/rights_database.

Technical Components

A rights determination system is complex because it must meet stringent requirements pertaining to copyright law, security best practices, reliable data management, and flexible user management and access. This section will provide a detailed discussion of the system components we have implemented to address these concerns.

At its core, CRMS is a web-hosted application using MySQL as a data store. Two database tables are especially important: the queue and the review table. The queue is the set of volumes waiting for or in the process of review, and the review table stores the data entered by users submitting reviews. Data in both tables are moved to other database tables when the review process is completed, so these tables are constantly in flux.

The review interface embeds many research resources within its limited screen real estate. When a reviewer visits the interface, the queuing algorithm automatically assigns volumes for review and ensures that two different users review each volume. If there is a disagreement, then an expert resolves the conflict with a third review. Finally, the resulting copyright determinations are exported to the HathiTrust Rights Database daily.

This section has been divided into three parts: “Core Elements,” “Critical Advanced Elements,” and “Recommended Elements.” Core elements are essential to the rights determination process and must be included in any copyright review project. Critical advanced elements, while not essential to the rights determination process as such, are necessary to maintain the security and efficiency of a rights determination system at scale. Recommended elements are valuable features that further improve the system’s flexibility, efficiency, and usability.

Core Elements

Web-Based Application Infrastructure

CRMS was designed as a web-based application so that trained librarians and staff at partner institutions could access a secure, hosted space on the University of Michigan infrastructure and participate in copyright determination. Users can access the CRMS interface via commonly used browsers, including Firefox, Chrome, or Opera. This approach allows us to be platform agnostic.

The underlying code of CRMS is composed of Perl CGI scripts and JavaScript. The various displayed pages of the interface are created using Template Toolkit (http://www.template-toolkit.org) because it integrates seamlessly with Perl.

CRMS Database

The CRMS database stores and provides access to review and determination results within the system. In addition to the queue and review table, CRMS also stores a candidate pool (volumes that will eventually be in the queue), historical reviews that have already been used to make copyright determinations, and data on those determinations. Various secondary tables also store statistics on system and user activity, precalculated to reduce page load times.

MySQL has been a reliable database management system for a user base of over fifty reviewers contributing hundreds of reviews each day; it has also been seamless in handling complex queries across large tables. MySQL has full support in the University of Michigan Library infrastructure, where it is considered significantly easier to maintain than Oracle.

The most important thing for the developer to keep in mind when working on database communication is to follow—to the greatest extent possible—best security practices in sanitizing all external inputs. CRMS follows the practice of using “bind parameters” with Perl’s DBI drivers.
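The idea behind bind parameters can be illustrated with a short sketch. CRMS itself uses Perl’s DBI with MySQL; the example below swaps in Python’s built-in `sqlite3` module so that it is self-contained, and the table and column names are hypothetical.

```python
import sqlite3


def reviews_for_volume(conn: sqlite3.Connection, volume_id: str):
    """Look up reviews with a bound parameter instead of string interpolation."""
    cur = conn.execute(
        "SELECT reviewer, attr FROM reviews WHERE volume_id = ? ORDER BY reviewer",
        (volume_id,),  # the driver passes this value separately from the SQL text
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (volume_id TEXT, reviewer TEXT, attr TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [("test.001", "alice", "pd"), ("test.001", "bob", "ic")],
)

print(reviews_for_volume(conn, "test.001"))
# A hostile value is treated as data, not SQL, so it simply matches nothing.
print(reviews_for_volume(conn, "x' OR '1'='1"))
```

Because the value never becomes part of the SQL string, quoting tricks in user input cannot change the shape of the query.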

Algorithms/Heuristics for Identifying Which Works Are In-Scope

Large digital libraries such as HathiTrust include works that are subject to different copyright regimes depending on their country of origin and other factors. The project will need heuristics and algorithms to translate the goals of the rights determination project into a reasonably sized “pool of candidates” for copyright review. If the project is ongoing and the candidate pool is open-ended, the algorithms must also identify works that have recently become candidates as a result of new library accessions. CRMS relies on time stamps from the HathiTrust Rights Database to identify volumes added or modified since the previous check.

The bibliographic metadata of volumes in the digital library is used to determine which of them will be in scope for the project. The review system requires access to that metadata, including publication date, country of origin, and/or others as appropriate for the copyright regime in question. The developer may find it helpful to have access to someone with cataloging expertise to aid in parsing record formats like MARC.

Another issue for CRMS-World concerned date ranges in the MARC 008 fixed field. Each volume of a multivolume or multiyear work potentially has its own date among the enum/chron metadata, and together these dates might be represented on the catalog record in the form of a range. The project team discussed the possibility of trying to parse a single publication date from the enum/chron metadata, but we were not able to find a reliable method for translating human-readable enum/chron metadata into a machine-readable form. We decided instead to exclude volumes with ranges for publication dates from our candidate pool.
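A scope test along these lines might look like the following sketch. The record fields, the country code, and the year window are illustrative assumptions and do not reflect the actual CRMS-World scope; the only rule taken from the text is that records with a range of publication dates are excluded.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class BibRecord:
    volume_id: str
    country: str                 # country-of-publication code from the record
    date1: Optional[int]         # first publication date, if parseable
    date2: Optional[int] = None  # second date; set when the record gives a range


def in_scope(rec: BibRecord, countries: Set[str], earliest: int, latest: int) -> bool:
    """Return True if the record qualifies for the candidate pool."""
    if rec.date1 is None:
        return False
    if rec.date2 is not None and rec.date2 != rec.date1:
        return False  # date range: no single publication date to reason about
    return rec.country in countries and earliest <= rec.date1 <= latest


# Hypothetical scope: one country code and an arbitrary window of years.
scope = dict(countries={"enk"}, earliest=1880, latest=1940)
print(in_scope(BibRecord("test.001", "enk", 1925), **scope))
print(in_scope(BibRecord("test.002", "enk", 1920, 1930), **scope))
```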

A Queuing Algorithm That Presents the Right Volumes to the Right People

CRMS was designed with a separate queue and candidate pool—the former being much smaller than the latter—for the sake of having greater flexibility to customize the presentation of volumes to reviewers without the potential inefficiency of manipulating a large database table. So long as the queue is set at a size beyond what reviewers can reasonably accomplish in a single day, it can be repopulated from the pool each night with no negative impact on productivity.

The most important tasks for the CRMS queuing algorithm are to (1) make sure the same user does not review the same volume twice, (2) prioritize volumes that already have one review, and (3) prevent volumes from receiving more than the required two reviews.[67]

The algorithm uses a locking mechanism to prevent simultaneous review. It “locks” a volume (setting a flag in the queue entry for the volume) whenever a reviewer is working on it and unlocks it when the review is submitted. This prevents a third reviewer from seeing the volume during its second review. And because the algorithm always checks review counts, a volume cannot be presented again after its second review.
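The three tasks and the locking mechanism can be sketched in a few lines. This is a simplified in-memory illustration, not the CRMS code: the class and function names are hypothetical, and a real system would perform the lock as an atomic database update.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

REQUIRED_REVIEWS = 2


@dataclass
class QueueEntry:
    volume_id: str
    reviewers: Set[str] = field(default_factory=set)  # users with a submitted review
    locked_by: Optional[str] = None                   # lock flag for a review in progress


def next_volume(queue: List[QueueEntry], user: str) -> Optional[QueueEntry]:
    """Pick and lock the next volume for `user`, or return None."""
    eligible = [
        e for e in queue
        if e.locked_by is None                    # nobody is reviewing it right now
        and user not in e.reviewers               # (1) never the same user twice
        and len(e.reviewers) < REQUIRED_REVIEWS   # (3) never more than two reviews
    ]
    if not eligible:
        return None
    # (2) prefer volumes that already have one review; ties keep queue order
    entry = max(eligible, key=lambda e: len(e.reviewers))
    entry.locked_by = user
    return entry


def submit_review(entry: QueueEntry, user: str) -> None:
    """Record the review and release the lock."""
    if entry.locked_by != user:
        raise PermissionError(f"{entry.volume_id} is not locked by {user}")
    entry.reviewers.add(user)
    entry.locked_by = None
```

Because a locked volume is invisible to everyone else, a third reviewer cannot pick it up while its second review is still in progress.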

The queuing algorithm also controls other noncore functions, including priority and projects (both discussed below).

Review Interface with Information Resources Appropriate to the Research

The review interface provides a scanned view of the work and allows the reviewer to enter information relevant to that work’s copyright status. It also allows the reviewer either to confirm the system’s recommended rights determination or to select a different determination based on additional information discovered during the review.

Figure 7 CRMS-US interface
Figure 8 CRMS-World interface

The left side of the interface (the “operational pane”) displays a summary of the volume’s bibliographic metadata, options for adjusting the display of the scan and for setting display defaults, and radio buttons for selecting a rights determination. A text box and a drop-down menu with note categories allow the reviewer to add notes about the volume, including additional author death dates or possible inserts.

The interface streamlines the review process by providing single-click access to online resources such as the Virtual International Authority File (VIAF), the Library of Congress Authorities, and Wikipedia. In CRMS, the reviewer can toggle between a view of the scanned volume and a view of a selected resource with a single click. If an embedded resource has a discoverable URL scheme, it can be “presearched” for the user by crafting a URL based on bibliographic information. This means that search results of system-generated keywords are already displayed by the time the interface is toggled to the resource. Almost all the resources available in CRMS support this feature.
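A “presearch” URL of this kind can be built with standard URL encoding. In the sketch below, the base URL and the `query` parameter name are placeholders; each real resource has its own URL scheme that would have to be looked up.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint, not a real research resource.
EXAMPLE_RESOURCE = "https://authority.example.org/search"


def presearch_url(base_url: str, author: str, title: Optional[str] = None) -> str:
    """Craft a search URL from bibliographic fields so results load pre-filled."""
    terms = " ".join(part for part in (author, title) if part)
    return f"{base_url}?{urlencode({'query': terms})}"


print(presearch_url(EXAMPLE_RESOURCE, "Woolf, Virginia"))
```

`urlencode` escapes commas, spaces, and non-ASCII characters, which matters because author headings often contain all three.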

A Way to Export Determinations

A mechanism is needed in order to make determinations available for use. What form that mechanism takes depends on the way your institution implements rights determinations.

In the case of CRMS, there was already a HathiTrust protocol for submitting text files with rights determinations for automatic processing. The submission format is a simple tab-delimited file that contains the rights attribute, reason, and originating system (CRMS-US or CRMS-World). This provided a convenient way for CRMS to share determinations with the HathiTrust Rights Database.
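Writing such a file takes only a few lines. The sketch below is illustrative: the column order, the presence of a leading volume ID column, and the exact string used for the originating system are assumptions, and the HathiTrust submission protocol should be consulted for the real layout.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_export(
    determinations: Iterable[Tuple[str, str, str]],  # (volume_id, attribute, reason)
    out: TextIO,
    source: str,
) -> None:
    """Write one tab-delimited line per determination."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for volume_id, attr, reason in determinations:
        writer.writerow([volume_id, attr, reason, source])


buf = io.StringIO()
write_export([("test.001", "pd", "crms"), ("test.002", "und", "crms")], buf, "CRMS-World")
print(buf.getvalue(), end="")
```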

A consequence of this approach (as opposed to having the HathiTrust database request determinations via an API) is that the CRMS database is a “black box” to the outside world. The HathiTrust database receives rights determinations, but it cannot access other metadata (such as author death dates) that would explain or justify those determinations.

If the decision is made to implement an API, developers will need to consider carefully which data can be queried. Sensitive data, such as personally identifying information, must be protected. Access controls around the API must conform to institutional policy.

Critical Advanced Elements

While not core to the rights determination process, the following elements are extremely important for any copyright determination project and should be included in the system’s design.

Appropriate Access Controls

Rights determination projects by definition require access to potentially copyrighted works, so their design must give the highest priority to restricting access to that material only to authorized reviewers.

This may not be a simple task. Access control in copyright determination systems will need to achieve three major goals:

  1. Seamless integration of the review system and the digital library, both of which may have their own authentication systems with different levels of authorization
  2. Management of users having a variety of privileges
  3. Reliable and secure export of rights determination data from the review system to the digital library

Developers are accustomed to dealing with security concerns, but copyright determination will be subject to particularly intense scrutiny from rights holders concerned about the protection of copyrighted material. Even experienced users find navigating through multiple layers of access challenging, but the design team may only be able to streamline that experience to a limited degree. Reviewers will need carefully worded, step-by-step instructions—and possibly online user support—to guide them through the authorization process.

In the case of CRMS, there are five broad levels of access:

  1. The library system (U-M Library, the host infrastructure for the other layers)
  2. The review system (CRMS)
  3. Content subject to copyright (hosted in HathiTrust)
  4. Administrative functions (in CRMS, accessible only to developers and administrators)
  5. Development system (in CRMS, accessible only to developers and testers)

A user’s access depends on the user’s status among the CRMS user types. The list below details the set of user privileges within CRMS; it is not strictly a hierarchy. Significant privileges are extended only to users who require them, and access to in-copyright works and the ability to submit reviews are the most tightly controlled of all.

  • Reviewer. A new user who has recently completed training and is in a probationary period. If the two reviews for a volume are both provided by new reviewers, their work is double-checked by an expert even if their judgments match. This provides an additional degree of oversight for users who are still in the learning process.
  • Advanced reviewer. This designation is for reviewers who have fully completed their training process. Experts do not adjudicate advanced reviewer judgments unless they conflict.
  • Expert (or “expert reviewer”). Experts are chosen when they exhibit sufficient experience and mastery of process to adjudicate conflicts between reviewers and advanced reviewers. Experts receive additional training before being assigned this privilege.
  • External admin. Reserved for supervisors at partner institutions who wish to monitor the progress of their own reviewers. An external admin can view statistics of all reviewers at their institution but cannot view information about any other reviewers and cannot submit reviews.
  • Admin. The access level extended to members of the project team. This privilege includes access to statistics for all reviewers and the ability to add volumes to the queue.
  • Super admin. The highest level of access that may be necessary for the primary developer and the project’s principal investigator. Functionality exposed by this privilege is used mainly for debugging, and only rarely.

An Algorithm to Provide Recommended Judgments

The workflow of a rights determination project is based on the copyright laws applicable to the works under review. In most cases, copyright duration is based on the life of the author plus a specific number of years. When assessing whether a particular volume has entered the public domain, a limited number of mathematical calculations are necessary. Individual reviewers can perform these, but a better option is to translate the law into algorithms when possible.

For CRMS-World, we introduced an algorithm that selects the appropriate rights code for reviewers after they have entered sufficient information to make the prediction. This has the advantage of freeing reviewers from doing date arithmetic and encapsulating the logic in a program that can be carefully inspected to ensure correctness. For example, when determining the public domain status for a single-author work published in the UK, our system can take the death date of the author of the volume and apply the UK’s “life of the author + 70 years” copyright duration to the work.

Since a copyright in a single-author work continues until the last day of the “life + 70” term, the first year a work enters the public domain is actually the “year of the author’s death + 71.” This is a textbook example of something that should be done algorithmically to avoid inevitable “off by one” errors by reviewers.
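The date arithmetic is small enough to show in full. The sketch below covers only the single-author, life-plus-term case described above; the function names are hypothetical, and real determinations depend on additional facts such as inserts or multiple authors.

```python
def first_public_domain_year(death_year: int, term_years: int = 70) -> int:
    """First calendar year in which the work is in the public domain.

    Copyright runs through the last day of (death_year + term_years),
    so the work is free from the following year.
    """
    return death_year + term_years + 1


def is_public_domain(death_year: int, current_year: int, term_years: int = 70) -> bool:
    """True once the current year has reached the first public-domain year."""
    return current_year >= first_public_domain_year(death_year, term_years)


print(first_public_domain_year(1941))  # 1941 + 70 + 1
```

For an author who died in 1941, the term runs through the end of 2011, so the first public-domain year is 2012.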

When a work passes through CRMS-World, the system’s recommended judgment is visible to the reviewer in the interface. The reviewer can either confirm that recommendation or decide to change it based on additional information discovered during the review. The presence of third-party authored material (i.e., inserts) within the work is the most common situation that prompts the reviewer to override the system recommendation.

A Mechanism for Resolving Conflicting Reviews

Any system that employs a two-review process will generate conflicting reviews and should have a mechanism for addressing them. Resolving conflicts helps maintain the integrity of the copyright review process and provides an opportunity to educate reviewers when their reviews fall outside of accepted practice. Conflict resolution can be accomplished through the oversight of an expert reviewer.

Copyright review at a large scale results in hundreds of daily determinations. Managing conflicts can quickly become a grueling process unless experts have a mechanism for organizing and working with relevant conflicting reviews. In the case of CRMS, we provided a “conflicts page” for aggregating reviews in conflict so the experts can easily adjudicate them and give them final determinations.

The CRMS approach to conflicts has evolved over time; reviews of a work must agree on the rights attribute (“public domain” or “in copyright”), but our systems do not require them to match in every detail (e.g., author death dates, copyright renewal numbers, and dates). Expert reviewers are only required to address conflicts when their resolution will determine whether a volume will be opened or remain closed. This has the effect of significantly reducing an expert reviewer’s workload without compromising the reliability of the review process.

Conflicts that do not have an impact on access can be left for resolution in the future. For example, if a conflict involves only ic and und attributes, the system automatically gives it a und/crms final determination. This acknowledges the fact that no matter which attribute the expert would have selected (ic or und), exporting the determination to the Rights Database would have the same result: the work remains closed.
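A rule of this kind might be expressed as follows. This is only a sketch under stated assumptions: the function name and the representation of reviews as a list of attribute strings are hypothetical, and the set of "closed" attributes is limited to the ic and und example given above.

```python
# Attributes that leave a volume closed, per the ic/und example in the text.
CLOSED_ATTRIBUTES = frozenset({"ic", "und"})


def auto_resolve(attributes):
    """Return ("und", "crms") when a disagreement cannot affect access.

    Returns None when the reviews agree (there is no conflict) or when at
    least one review would open the volume (an expert must adjudicate).
    """
    distinct = set(attributes)
    if len(distinct) > 1 and distinct <= CLOSED_ATTRIBUTES:
        return ("und", "crms")
    return None
```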

Recommended Elements

Recommended elements are valuable features that further improve the system’s flexibility, efficiency, and usability.

A Way to Link a Given Determination with a Set of Reviews

If reviews and their associated determinations are stored in separate tables, it is useful to have an explicit identifier linking them. CRMS uses an auto-incrementing group identifier to associate all the reviews that contributed to a determination. Use of a “foreign key” such as this is common in database programming. Since volumes do occasionally get re-reviewed (for example, when a copyright term expires), it is necessary to be able to distinguish unambiguously the reviews that contributed to each determination without resorting to fuzzy time stamp logic.
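The linking scheme can be illustrated with a small in-memory database. The sketch below uses Python's sqlite3 module rather than the CRMS database, and the table layout, column names, and sample values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE determinations (
    gid       INTEGER PRIMARY KEY AUTOINCREMENT,  -- group identifier
    volume_id TEXT NOT NULL,
    attr      TEXT NOT NULL,
    reason    TEXT NOT NULL
);
CREATE TABLE historicalreviews (
    volume_id TEXT NOT NULL,
    reviewer  TEXT NOT NULL,
    attr      TEXT NOT NULL,
    gid       INTEGER REFERENCES determinations(gid)  -- foreign key
);
""")


def record_determination(volume_id, attr, reason, reviews):
    """Store a determination and tag each contributing review with its gid."""
    cur = conn.execute(
        "INSERT INTO determinations (volume_id, attr, reason) VALUES (?, ?, ?)",
        (volume_id, attr, reason),
    )
    gid = cur.lastrowid
    conn.executemany(
        "INSERT INTO historicalreviews (volume_id, reviewer, attr, gid) "
        "VALUES (?, ?, ?, ?)",
        [(volume_id, reviewer, review_attr, gid) for reviewer, review_attr in reviews],
    )
    return gid


def reviews_for(gid):
    """Return the (reviewer, attr) pairs that contributed to one determination."""
    rows = conn.execute(
        "SELECT reviewer, attr FROM historicalreviews WHERE gid = ? ORDER BY reviewer",
        (gid,),
    )
    return rows.fetchall()
```

When the same volume is reviewed a second time, it receives a new group identifier, so the two sets of reviews never need to be separated by comparing time stamps.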

A Means for Reviewers to Put Their Review Temporarily “On Hold”

A hold period allows a reviewer to temporarily set aside a partially completed review in order to submit a question to the project team about a point of copyright law or some other part of the research process. Once the reviewer has an answer, the review is easy to retrieve, edit, and submit. The hold period should allow a reasonable span of time for the project team to respond to the matter in question.

Inheriting Rights Determinations on Otherwise Identical Volumes

A mechanism to minimize duplicative review effort is important when working with large-scale collections. CRMS attempts to keep only one representative volume from a catalog record in the candidate pool. Once a determination is made for that volume, other volumes associated with that catalog record are eligible to inherit the same determination.

If new volumes are added to your project, it is important to identify those that have already been reviewed. A second form of inheritance, “candidate inheritance,” applies when a volume enters the candidate pool either because it was recently ingested by HathiTrust or due to a bibliographic correction. The system searches for other volumes’ completed determinations on the same catalog record, and if it finds that a determination already exists, then the new volume is eligible to inherit that determination. The new candidate can be removed, as there is no need for a review.
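Candidate inheritance amounts to a lookup by catalog record. The following sketch assumes two hypothetical in-memory mappings (volume to catalog record, and volume to completed determination); a production system would query its database instead.

```python
def candidate_inheritance(new_volume, catalog_of, determinations):
    """Return a determination the new volume can inherit, or None.

    catalog_of:     dict mapping volume id -> catalog record id
    determinations: dict mapping volume id -> (attribute, reason)
    """
    record = catalog_of[new_volume]
    for volume_id, determination in determinations.items():
        if volume_id != new_volume and catalog_of.get(volume_id) == record:
            return determination  # eligible to inherit; no review needed
    return None  # nothing to inherit; the volume stays in the candidate pool
```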

A “Subproject” Mechanism That Allows Assignment of Volumes and Reviewers to Specific Sets of Works for Review

At the beginning of a copyright review project, reviewers are frequently tasked with performing one type of review on a single pool of candidates. Our experience has been that librarians, users, and administrators may identify specific populations of works for review, which must be prioritized and reviewed separately from the main candidate pool. Consequently, your project team may be asked to take on special subprojects featuring their own candidate pools.

The “subproject” mechanism allows us to select specific volumes for separate review only by a designated subset of reviewers. Once defined, administrators should be able to assign reviewers to a given subproject based on criteria appropriate for that project. This may in some cases mean a reviewer could be assigned to more than one subproject. Some projects may require a narrower, more specialized group of reviewers. For example, a subproject composed of Spanish works may be best suited for reviewers with fluency in Spanish.

A Mechanism to Detect When Re-review Is Likely to Be Profitable

A work identified as in copyright by a rights determination project can be scheduled for re-review when its metadata indicate it has crossed a date boundary that may put it in the public domain. If your project collects author death dates and/or publication dates, it will be possible to conduct an annual search of previously determined volumes and identify those that have likely entered the public domain. Those eligible can then be queued for re-review.
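An annual search of this kind could look like the sketch below. It assumes a hypothetical list of stored determinations carrying an author death year, and a “life + 70” term; other jurisdictions or publication-date rules would need their own boundary calculation.

```python
def due_for_rereview(determinations, current_year, term_years=70):
    """Return ids of in-copyright volumes that have likely crossed into the public domain.

    determinations: iterable of dicts with "id", "attr", and "death_year" keys
    (death_year may be None when no death date was collected).
    """
    due = []
    for d in determinations:
        if d["attr"] != "ic" or d.get("death_year") is None:
            continue
        # First public domain year is the year of death + term + 1.
        if current_year >= d["death_year"] + term_years + 1:
            due.append(d["id"])
    return due
```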

Tools for Searching Various Categories of Reviews

Search features in a copyright review system must allow reviewers and administrators to find volumes and reviews using selected criteria. These search features should include historical reviews (i.e., finished and exported) and unprocessed reviews (i.e., still editable). Users rely on these tools to refresh their memories when reviewing a volume with an issue similar to one they encountered before. These tools can also aid self-training by allowing reviewers to consult expert adjudication notes. Finally, access to unprocessed reviews allows reviewers to find and edit their reviews from earlier in the day.

Figure 9 CRMS-US historical reviews table

Reviewer Performance Statistics Pages

Statistics reports to track reviewers’ performance (i.e., validation rates) should be accessible to the reviewers and to their supervisors (i.e., external administrators) at their respective institutions. We found this access helped communicate the importance of CRMS to the supervisors and give them a concrete set of metrics by which to evaluate the work.

Figure 10 Reviews statistics table

Business intelligence–style dashboards can provide useful statistics for tracking the project. Dashboards can also be a form of advertising, giving potential new participants an opportunity to see what the project has accomplished in a form that is appealing and easy to understand.

Figure 11 CRMS-US dashboard
Figure 12 CRMS-World dashboard

Priority

It is occasionally useful to bypass the normal order in which the system selects volumes by prioritizing a volume for review. A priority system in the queue allows administrators to accelerate review of one or more volumes to respond to time-sensitive requests. In general, having fine-grained priority levels grants nuanced control over volumes as they move through the review process. As part of this, it is likely that an interface for administrators to manually add volumes to the queue will be useful.

A Mechanism for Overseeing New Reviewer Performance

It may be useful to oversee reviewers who have recently completed training to ensure their early reviews consistently reflect the project’s established standards. A “Provisional Match” page can collect the work of newly trained reviewers so that experts can evaluate it.

The CRMS Review Processes

This part of the technical section addresses how the technical components described above work together in practice. Here we present the review process in a roughly chronological form, moving from our methods for identifying review candidate volumes through to the export of CRMS determinations. Given its focus on the practical application of CRMS, this part will also identify and describe a few noteworthy differences between the CRMS-US and CRMS-World projects.

Zephir and the HathiTrust Rights Database

For a work to be reviewed by CRMS, it must first be included in HathiTrust and in Zephir (HathiTrust’s “bibliographic metadata management system,” which can be accessed through the digital library’s online catalog). At present, there are over thirteen million volumes in Zephir. Given the size of HathiTrust, the CRMS project had to take care when establishing the scope of our inquiry or risk having a pool of candidates beyond the limits of even our well-funded effort.

An equally important resource for CRMS is the HathiTrust Rights Database, which tracks each volume’s current rights status as well as any changes to its status. Due to the “one-to-many” relationship between a catalog record and its component volumes (which may have different rights), the decision was made to keep the data stand-alone, outside the catalog.

CRMS has read-only access to the Rights Database, and this allows CRMS to query the Rights Database for newly deposited or newly changed items that are in scope for rights determination. Each rights entry has a time stamp, so CRMS can limit its query to only the volumes modified or added since its previous query.

Criteria for Identifying In-Scope Volumes

For a copyright review project drawing from a digital library on the scale of HathiTrust, it is essential to develop criteria for selecting volumes to be reviewed.

“Country of origin” was a major influence on the scope of each CRMS project. The chosen country determines which copyright laws will apply to the works in scope, and it also determines the potential size of that pool. The decision of the first CRMS project to focus on works published in the United States between 1923 and 1963 meant we would eventually be dealing with a pool of over three hundred thousand works.

The differences between US copyright law and the laws in Australia, Canada, and the UK meant that different criteria would be needed for the research methods in CRMS-US and CRMS-World. These criteria determined the metadata that each version of CRMS used to create its own pool of candidates.

A volume was a candidate for CRMS-US if it matched the following criteria:

  • Rights of “ic/bib” (“in copyright/bibliographically derived by automatic processes”)
  • Bibliographic format of “bk” (book; MARC leader[6] in {a,t} and leader[7] in {a,c,d,m})[68]
  • Published 1923–63 inclusive (based on 008 copyright year)
  • Published in the United States (i.e., not a foreign work; based on 008[15-17])
  • In English (based on 008[35-37])
  • Not a government document (based on a number of heuristics; see appendices)
  • Not a translation (041:a set to “eng” and 041:h set to a different language code, or “translat{ion,ed}” found in 245:c or 500:a)
  • Not a dissertation (“thes{e,i}s” or “diss” found in 500:a or 502:a)
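The CRMS-US criteria above can be read as a single yes/no test on a volume's rights and MARC data. The sketch below is illustrative only: the record is a hypothetical dictionary of pre-extracted values, the helper make_008 exists only to build sample data, the use of 008[7-10] as the date and the test that US place codes end in “u” are assumptions, and the government-document heuristics are omitted.

```python
import re

TRANSLATION_RE = re.compile(r"translat(ion|ed)", re.IGNORECASE)
DISSERTATION_RE = re.compile(r"thes[ei]s|diss", re.IGNORECASE)


def make_008(year, place, lang):
    """Build a 40-character sample 008 field (for illustration only)."""
    return "000000" + "s" + year + "    " + place + " " * 17 + lang + "  "


def is_crms_us_candidate(rec):
    """rec is a hypothetical dict of pre-extracted rights and MARC values."""
    leader, f008 = rec["leader"], rec["008"]
    if rec["rights"] != "ic/bib":
        return False
    if leader[6] not in {"a", "t"} or leader[7] not in {"a", "c", "d", "m"}:
        return False  # not a book
    year = f008[7:11]  # 008[7-10]; assumed to hold the date used for scope
    if not (year.isdigit() and 1923 <= int(year) <= 1963):
        return False
    if f008[17] != "u":  # assumption: US place codes in 008[15-17] end in "u"
        return False
    if f008[35:38] != "eng":
        return False
    lang_a, lang_h = rec.get("041a"), rec.get("041h")
    if lang_a == "eng" and lang_h and lang_h != "eng":
        return False  # coded as a translation into English
    if TRANSLATION_RE.search(rec.get("245c", "") + " " + rec.get("500a", "")):
        return False
    if DISSERTATION_RE.search(rec.get("500a", "") + " " + rec.get("502a", "")):
        return False
    return True
```

Each test returns early, so the cheapest checks (rights and format) run before the text searches.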

A volume was a candidate for CRMS-World if it matched the following criteria:

  • Rights of “ic/bib” or “pdus/bib”[69]
  • Published in Australia, Canada, or the UK[70]
  • Published between the following spans[71]
      • 1871–1941 (UK)
      • 1891–1961 (Australia or Canada)
  • In English (based on 008[35-37])
  • Not a translation (041:a set to “eng” and 041:h set to a different language code, or “translat{ion,ed}” found in 245:c or 500:a)
  • Single publication/copyright date (for now; based on 008[6], 008[7-10], and 008[11-14])
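The country-specific date spans for CRMS-World lend themselves to a small lookup table. In this sketch the country labels and function name are hypothetical, and the spans are assumed to be inclusive at both ends.

```python
# Publication date spans by country, taken from the criteria above.
WORLD_DATE_SPANS = {
    "UK": (1871, 1941),
    "Australia": (1891, 1961),
    "Canada": (1891, 1961),
}


def in_world_date_span(country, year):
    """True when the publication year falls inside the span for the country."""
    span = WORLD_DATE_SPANS.get(country)
    return span is not None and span[0] <= year <= span[1]
```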

The Candidates Pool

When a volume has been identified as a candidate for CRMS review, it must be added to the particular pool of candidates matching its bibliographic criteria. (“Pool” is the term CRMS commonly uses, but “stack” would be more apt, technically.) This is one of several tasks “overnight processing” addresses.

Overnight processing is a script that runs each night in several phases and handles tasks that are important to nearly every step in the CRMS review process, from selecting volumes for review to exporting determinations to the HathiTrust Rights Database.

The overnight processing phase called “candidate import” is responsible for adding volumes to the candidate pool. It first compiles a list of all volumes in the HathiTrust Rights Database that have been added or changed in the previous twenty-four hours. Then it examines each volume’s current rights and its bibliographic metadata stored in Zephir. With that information, the system is able to tell whether a volume ought to go through CRMS. If it should, then the system adds the volume to the pool as a candidate and copies relevant parts of its metadata into the CRMS database; otherwise, the logic simply moves on to the next volume. It also occasionally detects when a previously added candidate no longer meets the requirements for candidacy (typically due to a bibliographic metadata correction) and quietly removes it from the pool.

A volume is not allowed into the candidate pool if the system discovers it has been through CRMS already. When a volume has been reviewed, the system adds it to the “historical reviews” database table, so the system will ignore any potential candidates that already have a listing there. If the system is running correctly, there is no way for a previously reviewed volume to get back into CRMS without some kind of human intervention.[72]
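The candidate import phase might be sketched as follows. The data structures here are hypothetical simplifications (a list of rights rows with time stamps, a set for the pool, a set of previously reviewed ids), and the candidacy test is passed in as a function so the sketch does not depend on any particular criteria.

```python
from datetime import datetime, timedelta


def candidate_import(rights_rows, now, pool, reviewed, is_candidate):
    """Update the candidate pool from rights rows changed in the last 24 hours.

    rights_rows:  iterable of dicts with "id" and "time" (datetime) keys
    pool:         set of volume ids currently in the candidate pool
    reviewed:     set of volume ids already in the historical reviews table
    is_candidate: function(row) -> bool applying the project's criteria
    """
    cutoff = now - timedelta(hours=24)
    for row in rights_rows:
        if row["time"] < cutoff:
            continue  # unchanged since the previous run
        volume_id = row["id"]
        if volume_id in reviewed:
            continue  # already been through review; never re-enters automatically
        if is_candidate(row):
            pool.add(volume_id)
        else:
            pool.discard(volume_id)  # quietly drop a candidate that no longer qualifies
    return pool
```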

Before a volume enters the review process, the system will draw it from the pool of candidates into the queue.

The Queue

The queue is a subset of the candidate pool containing volumes that are next in line for the reviewers. While it is not absolutely necessary for the review process, the queue provides a smaller and more predictable set of volumes, and this makes it easier to work with than the candidate pool itself. The queue can be set to a specific number of volumes and provides an easier target for tracking statistics for daily and monthly reports.

The queue is stored in its own table in the CRMS database, which means it can include more metadata than the relatively limited set that is stored in the candidate pool. The queue table tracks each volume’s priority level, who added it, and where it came from.[73] The metadata also includes a locking mechanism to prevent a volume from being reviewed by more than one person at a time.
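One common way to implement such a lock is a conditional update that succeeds only when the row is currently unlocked. The sketch below uses Python's sqlite3 module and a hypothetical two-column queue table; storing the reviewer's id in the locked column is an assumption, not a description of the CRMS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id TEXT PRIMARY KEY, locked TEXT)")
conn.execute("INSERT INTO queue (id, locked) VALUES ('vol1', NULL)")


def lock_volume(volume_id, reviewer):
    """Try to lock a volume for one reviewer; return True on success."""
    cur = conn.execute(
        "UPDATE queue SET locked = ? WHERE id = ? AND locked IS NULL",
        (reviewer, volume_id),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows changed means someone else holds the lock


def unlock_volume(volume_id, reviewer):
    """Release the lock, but only if this reviewer holds it."""
    conn.execute(
        "UPDATE queue SET locked = NULL WHERE id = ? AND locked = ?",
        (volume_id, reviewer),
    )
    conn.commit()
```

Because the check and the update happen in a single statement, two reviewers cannot both acquire the same volume.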

Both CRMS-US and CRMS-World have their own queue. Each night, overnight processing removes the volumes that have been reviewed that day and then replenishes each queue with enough candidates from its corresponding pool to bring the queue back up to its designated number of volumes.
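The nightly replenishment step can be sketched as a simple top-up. The list and set structures and the parameter names are hypothetical, and the sketch ignores priorities for brevity.

```python
def replenish_queue(queue, pool, completed_today, target_size):
    """Drop finished volumes, then refill the queue from the pool.

    queue:           list of volume ids currently queued (order preserved)
    pool:            iterable of candidate volume ids, in selection order
    completed_today: set of volume ids whose review finished today
    target_size:     designated number of volumes for this queue
    """
    refreshed = [v for v in queue if v not in completed_today]
    queued = set(refreshed)
    for volume_id in pool:
        if len(refreshed) >= target_size:
            break
        if volume_id not in queued and volume_id not in completed_today:
            refreshed.append(volume_id)
            queued.add(volume_id)
    return refreshed
```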

In the listing below, each instance of userid stands for the reviewer’s CRMS id (e-mail address or Michigan uniqname). In practice, these would be placeholders in the query, with the values passed as parameters through the DBI module to guard against SQL injection attacks.

    SELECT q.id,
           (SELECT COUNT(*) FROM reviews r WHERE r.id=q.id) AS cnt,
           SHA2(CONCAT(userid,q.id),0) AS hash,
           q.priority
    FROM queue q
    WHERE q.priority<3
      AND q.priority!=1
      AND q.locked IS NULL
      AND q.status<2
      AND NOT EXISTS (SELECT * FROM reviews r2 WHERE r2.id=q.id AND r2.user=userid)
      AND NOT EXISTS (SELECT * FROM historicalreviews h WHERE h.id=q.id AND h.user=userid)
    HAVING cnt<2
    ORDER BY q.priority DESC, cnt DESC, hash, q.time ASC
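To illustrate why the reviewer's id is passed as a bound parameter rather than pasted into the query text, the sketch below runs a simplified version of the query with Python's sqlite3 module. It stands in for the DBI module and the production database; the tables, sample rows, and function name are hypothetical, and the hash and count columns of the full listing are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queue (id TEXT PRIMARY KEY, priority INTEGER, locked TEXT, status INTEGER);
CREATE TABLE reviews (id TEXT, user TEXT);
INSERT INTO queue VALUES ('vol1', 0, NULL, 0), ('vol2', 0, NULL, 0);
INSERT INTO reviews VALUES ('vol1', 'alice');
""")


def reviewable_ids(userid):
    """Return queued volumes this reviewer has not yet reviewed."""
    sql = """
        SELECT q.id FROM queue q
        WHERE q.locked IS NULL
          AND q.status < 2
          AND NOT EXISTS (SELECT 1 FROM reviews r WHERE r.id = q.id AND r.user = ?)
        ORDER BY q.priority DESC, q.id
    """
    # The "?" placeholder sends userid as data, never as part of the SQL text.
    return [row[0] for row in conn.execute(sql, (userid,))]
```

With a placeholder, a value such as x' OR '1'='1 is compared literally against the user column and matches no one, so it cannot change which rows the query returns.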

The HathiTrust PageTurner Access and Authentication Modules

In order for a reviewer to do her work, that reviewer must be authorized to view in-copyright works in HathiTrust.

Four access and authentication modules in the HathiTrust PageTurner program perform this security function. The modules check the reviewer’s profile and confirm that the reviewer is permitted to see copyrighted material for the purposes of copyright research. If the reviewer does not have that permission, PageTurner will refuse access to that reviewer and display only a message that the reviewer is not allowed to view copyrighted content. Unless that reviewer’s permissions are changed, she will not be able to see in-copyright material.[74]

Access to in-copyright material is strictly enforced in CRMS. Reviewers must complete and submit a form called “Statement for Access to In-Copyright Works in HathiTrust” before they will be authorized for participation in CRMS.

Figure 13 Statement for Access form

The Review Process

Once the reviewer is confirmed for access to in-copyright works, the review process can begin. There are several tools to help guide reviewers through the process; the most significant is the CRMS interface.

The reviewer must be logged into the CRMS interface and have her browser pointed to its “review” pane in order to see the scanned image of the volume under review. The interface provides relevant catalog information and review tools adjacent to the scan. A CRMS reviewer may review as much or as little of the work as necessary to make an accurate copyright determination, but in most cases the front matter of the volume (from title page to table of contents) provides the most relevant copyright-related information.

Each version of CRMS (-US, -World) has an associated decision tree, as do subprojects such as CRMS-Spain. Each decision tree lays out the research process as a step-by-step flowchart. This approach ensures that the reviewer considers every relevant factor and does so in a specific order. For a determination to be complete in CRMS, the reviewers must come to a compatible decision about a work (“pd,” “pdus,” “ic,” “icus,” or “und”). If two reviewers come to incompatible decisions, then their reviews are “in conflict,” requiring an expert to adjudicate between them.

System Response to Matches and Conflicts

Once two reviewers have submitted their judgments, the system checks for conflicts between the two reviews and responds accordingly. The system does this through the use of status codes.

Conflicts (Status 2) If the two reviewers disagree, either about the attribute or about the reason, then the volume will move to Status 2 and be added to the “conflicts page” in the interface.

If the reviewers agree that the attribute should be ic, icus, or und, but then disagree about which reason should apply, then the volume will not move to Status 2 or be added to the conflicts page.[75] Because the final result will be that the volume will remain closed, refining the specific reason for the closure is not an effective use of time. This means that volumes in the conflicts page will always have at least one review that recommends either pd or pdus.

Provisional Matches (Status 3) All work done by nonadvanced reviewers who have only recently completed their training is automatically assigned a Status 3 and added to a “provisional match” page where an expert can confirm it. The reviewer versus advanced reviewer distinction provides a period for new reviewers to demonstrate their consistent and reliable understanding of the process. Status 3 is also used for minor (typically author death date) mismatches between advanced reviewers that are not important enough to be considered Status 2 conflicts. (However, even this step will be skipped if both advanced reviewers have selected und/nfi.)

Matches (Status 4) If the two reviewers agree—if both reviewers select the same rights (a.k.a. attribute) and the same reason—then the volume will move to Status 4 and be included in the export process that evening.
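The status rules described in this section can be summarized in a short function. This is a simplified sketch and not the CRMS logic itself: the dictionary keys and function name are hypothetical, minor mismatches between advanced reviewers (such as author death dates) and the und/nfi exception are not modeled, and disagreements that leave the volume closed return None because the text does not state which status code they receive.

```python
CONFLICT, PROVISIONAL, MATCH = 2, 3, 4
CLOSED = {"ic", "icus", "und"}  # attributes that leave a volume closed


def classify(first, second):
    """Classify two reviews, each a dict with "attr", "reason", and "advanced" keys."""
    if first["attr"] == second["attr"] and first["reason"] == second["reason"]:
        if first["advanced"] and second["advanced"]:
            return MATCH
        return PROVISIONAL  # an expert confirms work by nonadvanced reviewers
    if first["attr"] in CLOSED and second["attr"] in CLOSED:
        return None  # the volume stays closed either way; not sent to the conflicts page
    return CONFLICT
```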

Expert Adjudication

Two reviews will typically be sufficient for an exportable copyright determination. In some cases, however, an expert will need to intervene in a conflict or a provisional match.

Experts can access conflicts and provisional match pages in the CRMS interface from a drop-down menu. Each page contains a list with each row representing one review of a volume (a typical volume will have two rows until an expert makes an adjudication). The lists make it easy for an expert to see at a glance all the review work done on a volume.

Each row also includes a link to the scan so the expert can access it and get a better understanding of how the reviewer reached that judgment. This takes place within a review interface that features radio buttons to allow the expert to toggle back and forth between the two reviews. When the expert is ready to make an adjudication, the modified interface will also allow her to import a preferred review’s data and notes into her own review, saving her some keystrokes and allowing her to add comments to the previous work.

Figure 16 Conflicts table
Figure 17 Provisional matches table

The expert examines the conflicting reviews and other data pertaining to that volume, adds comments or corrections as necessary, and then submits her own review. The expert’s judgment will be exported to the HathiTrust Rights Database that evening, except in cases where an und/nfi determination would inappropriately prevent US access.

Overnight Processing

That evening, the overnight processing script responds to the work done that day.

First, Status 0 volumes with two reviews are moved to Status 2, 3, or 4, depending on whether they are a conflict, a provisional match, or a match.

Next, Status 4 (or higher) volumes are moved from the reviews table to historical reviews (indicating they have completed the review process) and the determinations table. Determinations eligible for export are written to a text file for the Rights Database to read.

Overnight processing also updates user statistics, including monthly review counts and validation numbers, and updates export statistics. Finally, overnight processing replenishes the queue to a predetermined amount greater than the number of reviews that can be completed in one day.

Inheritance

The overnight processing phase “export inheritance” takes each volume that has been added to the determinations table in the last twenty-four hours and identifies all the other volumes associated with its catalog ID. These copies now become inheriting volumes and will inherit the same determinations as their corresponding source volumes (the specific scanned copies that actually went through the CRMS review process).

Here is an example of export inheritance: A volume—for instance, an edition of Kwaidan by Lafcadio Hearn from 1907—receives two reviews (both of them in complete agreement). That evening, overnight processing checks the corresponding HathiTrust catalog record and finds another copy of that edition of Kwaidan (not yet reviewed by CRMS) associated with that record. This other volume now becomes an “inheriting volume” and inherits the same rights determination as the first volume.
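Export inheritance can be sketched as a fan-out from each newly determined volume to the other volumes on its catalog record. The mappings and the function name below are hypothetical stand-ins for database queries.

```python
def export_inheritance(new_determinations, catalog_of, volumes_by_record):
    """Return determinations to be inherited by sibling volumes.

    new_determinations: dict volume id -> (attribute, reason), added in the last day
    catalog_of:         dict volume id -> catalog record id
    volumes_by_record:  dict catalog record id -> list of volume ids
    """
    inherited = {}
    for source, determination in new_determinations.items():
        for volume_id in volumes_by_record[catalog_of[source]]:
            # Skip the source itself and any volume that was reviewed directly.
            if volume_id != source and volume_id not in new_determinations:
                inherited[volume_id] = determination
    return inherited
```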

“Candidates inheritance” is a mirror process to “export inheritance” that addresses the opposite situation. Export inheritance matches a newly determined source with its inheriting volumes, while candidates inheritance matches a new inheriting volume with an old source.

For example, three months have passed since the export inheritance example above, and a new institution joins the HathiTrust community with a copy of the same edition of Kwaidan in its library. “Candidates inheritance” checks the new volume and discovers it to be a match for the same catalog record as the earlier two copies of Kwaidan. The process identifies the new copy as an inheriting volume, automatically generates a determination for it, and then exports that determination to the HathiTrust Rights Database.

CRMS Exports and the HathiTrust Rights Database

At this point, the work of CRMS is done except for exporting the determination to the Rights Database.

CRMS sends its determinations to HathiTrust in the form of a text file, and HathiTrust uses these determinations to update the volumes’ rights information in the HathiTrust Rights Database.

Rejections of CRMS determinations are exceptionally rare, though they do happen—usually when HathiTrust has information that was not available to CRMS reviewers at the time reviewers made a given determination.