16.5 Study of Users
The remainder of this paper discusses some of what we have learned about the users. The first interesting point is a relation among technology, behavior, and attitudes. We expected that the technology, as it grew, would influence the attitudes of scholars, both faculty and students, which in turn would influence their behavior. However, we tracked attitudes carefully over the entire study and saw only the smallest movement towards believing that online books are a better way to do one's scholarly work. This forces us to conclude that, in fact, technology effectively influences behavior and that attitudes simply have to catch up. This may mean that scholars are moved to technology by a subliminal perception of benefits, which they cannot articulate. On the other hand, it may mean that fashions in scholarly behavior are simply no more rational than any other kinds of fashion.
Analysis of individual use
A key innovation in the Columbia online books project was the introduction, in 1997, of the ability to identify the activity of unique users. This was a fortunate byproduct of the security system, developed to permit people to read online books from home. To maintain confidentiality of the users, system analysts replaced the identities of individual users with uninformative labels.
With anonymity ensured, we were permitted to link usage to administrative files containing demographic information about the users. Typical results are those shown in Table 16.3, reporting the distribution of the status of individual users at the time they first used a particular resource. The resource in this case was the online version of the Oxford English Dictionary. While we had a number of reference works available online and, by the close of the project, close to 200 books in online form, the total usage of the OED represented approximately 50 percent of all online usage, and so it is used here to illustrate the types of analyses that we performed.
There were 3,600 individuals who used the OED during the study period. Just over 2,000 of these were undergraduate students at the time of first use. Nearly 300 were graduate students and close to 140 were faculty members.
We analyzed the ways in which individual users used the resource. To do this we introduced the rule that an inactive period of 15 minutes or more was considered to mark the end of a session. This is a reasonable rule based on detailed analysis, which showed that there was a natural break in the distribution (over all users) of the interval between "clicks" at somewhere around 10-15 minutes. We interpret this as meaning that continuation of a session over a break of this duration will be a rare event, which we can safely ignore. We also studied the total amount of use that individuals made of specific resources. This is illustrated by data on the OED. The modal (that is, most common) number of clicks that an individual user made on the OED is somewhere between 2 and 3. Above that value, the number of users making a given number of clicks drops exponentially. The rate of the drop is such that the chance to go on to two more clicks is about 2/3 at any time. (The chance to add one more click is the square root of this number, or about 83%.)
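The session rule described above can be sketched in a few lines. This is a minimal illustration, not the project's actual log-processing code: the function name `split_into_sessions`, the constant `SESSION_GAP`, and the sample timestamps are all hypothetical. It assumes one user's clicks are available as a list of timestamps.

```python
from datetime import datetime, timedelta

# Inactivity threshold taken from the text: 15 minutes or more ends a session.
SESSION_GAP = timedelta(minutes=15)


def split_into_sessions(click_times, gap=SESSION_GAP):
    """Group one user's click timestamps into sessions.

    A new session starts whenever the pause since the previous click
    is `gap` or longer.
    """
    sessions = []
    for t in sorted(click_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # pause shorter than the gap: same session
        else:
            sessions.append([t])     # first click, or pause of 15+ minutes
    return sessions


# Hypothetical clicks, given as minutes after an arbitrary start time.
base = datetime(1999, 3, 1, 9, 0)
clicks = [base + timedelta(minutes=m) for m in (0, 2, 5, 25, 26, 60)]
session_sizes = [len(s) for s in split_into_sessions(clicks)]
print(session_sizes)  # [3, 2, 1]
```

Choosing the threshold at the natural break in the inter-click distribution means that moving it a few minutes in either direction would change few session boundaries.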
As shown in Figure 16.3, the time spent using the OED online follows an exponential distribution. This indicates that at any time in the course of using the OED an individual has a constant probability of just quitting and deciding never to use it again (roughly 100%-83%=17%).
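The constant-quit-probability reading can be checked with a little arithmetic. The following is a minimal sketch, assuming each click behaves like an independent trial with a fixed continuation probability (a geometric model); the names `P_TWO_MORE` and `survival` are ours, not the project's. Taking the two-click figure of 2/3 literally gives a one-click probability of about 0.82, so the 83% quoted in the text presumably reflects an unrounded fitted value.

```python
import math

# Figure quoted in the text: chance of going on to two more clicks.
P_TWO_MORE = 2 / 3

# Under a constant per-click continuation probability p, p * p = P_TWO_MORE.
p_one_more = math.sqrt(P_TWO_MORE)
p_quit = 1 - p_one_more


def survival(k, p=p_one_more):
    """Probability that a user makes at least k further clicks."""
    return p ** k


print(f"one more click: {p_one_more:.3f}, quit now: {p_quit:.3f}")

# Memorylessness: the chance of 3 more clicks does not depend on
# how many clicks the user has already made.
already_made = 5
conditional = survival(already_made + 3) / survival(already_made)
print(math.isclose(conditional, survival(3)))
```

The memoryless property is what distinguishes an exponential (or geometric) decline from other decreasing distributions: past use carries no information about whether the user will continue.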
This apparently exponential behavior is intriguing and we pursued it in another way. Since we could anonymously track individual users, we could plot how much an individual used the resource against how long it was since the first time that the individual used it. With 100% adoption this graph would be roughly linear. We show the actual data for the OED (which had heavy use) in Figure 16.4.
Figure 16.4: Scatter plot of total use against time since first use
Figure 16.4 is a scatter plot. Each point represents one individual user. The y-coordinate of the point represents the number of sessions that an individual had with the OED and the x-coordinate represents the number of days since that individual first used the OED. The steep line represents the expected usage relationship if adopters continued to use the resource at a steady rate. In fact, a regression analysis shows that the best fit is nearly horizontal, which indicates there is little ongoing use by individuals. It is apparent that many observations are not well-predicted by this model, and indeed, that some usage did persist.
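The comparison between the steady-use line and the fitted line comes down to the slope of an ordinary least-squares fit. The sketch below is an illustration under stated assumptions, not the analysis actually run: the helper `least_squares_slope` is hypothetical, and the five data points are invented solely to show what a nearly horizontal fit looks like.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented data, for illustration only: one point per user.
days_since_first_use = [10, 100, 200, 300, 400]
sessions_per_user = [2, 1, 3, 2, 2]

slope = least_squares_slope(days_since_first_use, sessions_per_user)
print(f"fitted sessions per day: {slope:.4f}")

# A user who returned at a steady rate, say once every 10 days,
# would trace a line with slope 0.1 instead.
steady_rate_slope = 1 / 10
print(slope < steady_rate_slope)
```

A slope near zero means that users who first arrived long ago have accumulated hardly more sessions than recent arrivals, which is the pattern the text describes.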
We can plot this data in a more familiar form by showing the distribution of time since first use, without paying attention to how much use there has been. We do so by projecting the preceding figure onto a horizontal axis; see Figure 16.5. We see, as have most researchers in the academic setting before us, that it is very easy to discover the existence of the semester. Each of the five peaks in this graph corresponds to an academic semester. There might be some cause for optimism in the fact that the leftmost peak, which represents the most recent surge in use, spring 1999, seems to rise higher than any of the earlier ones. However, we don't know quite what to make of the fact that the one before it (fall 1998) represents a drop from the preceding fall.
Online Versus Paper: Usage Data
Our data (based on comparison between the online book usage figures and data collected through circulation statistics and slips placed in corresponding reference titles in the library) suggest that online books were used more than their print counterparts. If we count circulation alone we find that there were about three times as many accesses per book online as for the paper version. After consultation with librarians we believe that a reasonable correction for in-house use is to increase circulation by 50%. This would reduce the ratio to twice as many online uses per book.
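The correction for in-house use is simple arithmetic, shown here as a small sketch. The function name `corrected_ratio` is ours; the inputs are the two figures given in the text (three times as many online accesses as circulations, and a 50% uplift to circulation for in-house use).

```python
def corrected_ratio(online_per_circulation, in_house_uplift):
    """Online-to-paper use ratio after inflating circulation for in-house use.

    online_per_circulation: online accesses per recorded circulation
    in_house_uplift: fractional increase applied to circulation (0.5 = 50%)
    """
    return online_per_circulation / (1 + in_house_uplift)


ratio = corrected_ratio(3.0, 0.5)
print(ratio)  # 2.0
```

The result is sensitive to the uplift chosen: only if in-house use were estimated at twice circulation (an uplift of 2.0) would the corrected ratio fall to parity.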
Figure 16.5: Histogram of time since first use
NOTE: Height of the bar is the number of sessions logged by users starting the indicated number of days before data collection.
We conjecture that higher usage for online books is due to lower convenience costs than for other access options. Having purchased a paper copy for the library does not ensure that the book is available. The book might be in circulation, or missing from the shelf. If the library is closed, the paper copy of the book is not available to a user. A common access option is an online public access catalog (OPAC). However, an OPAC does not support even the roughest form of browsing into the book until the book itself is put online. An OPAC provides so little information about a book that a scholar might not be aware that it contains material relevant to his work. If so, the mere ownership of that book by his library does not make it truly available to him. Catalog records enhanced with tables of contents and book indexes are a relatively new offering and a major asset to the scholar in locating books relevant to his or her research, but do not eliminate the higher convenience costs of accessing the physical book at the library.
Hence, the online access to a full book represents a quantum leap in the availability of the contents of that book, and, we believe, lowers the barriers to access for many modalities. Perhaps the only modality for which it is not clear that online access is preferable is "plain old reading at length."
We were also interested in studying patterns of access when readers use online books. We have approached this in two different ways. One is essentially qualitative, in which we asked people in surveys and in interviews how they used online books. In doing that we were able to identify at least the following kinds of activity: browsing, grazing (that is, reading portions of text scattered through the book, punctuated by visits to the index or table of contents), citation checking, the finding of individual facts or quotations, reading on reserve for a course, determining the need for a paper copy, printing (that is, turning the online book into paper), and directly reading online.
We have also, because we can track individual users, been able to break some new ground in quantitative analysis of how people use books online. Generally, each chapter is a separate file, and hence a separate entry in the web server log. Thus, by analyzing the sequence of clicks on chapters, we are able to distinguish a number of different ways in which individuals use online books. The first style we characterize as linear use: an individual reads chapters of a book in exactly the same order in which they appear in the printed volume. The second pattern of use is quasi-linear, in which the sections of the book are visited in some personalized order but each section is read once and only once. We also observe a pattern we call hyper-linear, in which sections are visited in an arbitrary order and some sections are visited more than once. Hyper-linear usage occurs about 12% of the time. See Figure 16.6.
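The three patterns can be told apart mechanically from the order of chapter requests. The following is a minimal sketch, assuming each session has already been reduced to a list of chapter numbers in the order clicked; the function `classify_path` is hypothetical. It also assumes that "linear" means strictly increasing chapter order (skipped chapters allowed) and that any repeat visit, including a reload, counts as hyper-linear. The project's own rules may have differed on both points.

```python
def classify_path(chapters):
    """Classify a sequence of chapter numbers, listed in the order clicked.

    linear       : chapters visited in printed order, none repeated
    quasi-linear : some other order, but each chapter visited only once
    hyper-linear : at least one chapter visited more than once
    """
    if len(set(chapters)) < len(chapters):
        return "hyper-linear"
    if all(later > earlier for earlier, later in zip(chapters, chapters[1:])):
        return "linear"
    return "quasi-linear"


# Hypothetical sessions for illustration.
print(classify_path([1, 2, 3]))      # linear
print(classify_path([3, 1, 2]))      # quasi-linear
print(classify_path([1, 3, 1, 2]))   # hyper-linear
```

Under these assumptions a session that touches a single chapter is classified as linear, so very short sessions might be better reported separately.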
Figure 16.6: Patterns of motion in online books
Figure 16.7: Use of index in online books
There are several ways that a use pattern may involve use of the index (or, more generally, search tools); see Figure 16.7. The first format is to use a search tool once, at the outset, and then to view portions of the book in some linear or quasi-linear order. Another possibility involves using the index, going to a section, and then going back to the index and out to another section and continuing in this pattern. Whether this is a natural behavior evolving in the presence of online books or an artifact introduced by the fact that returning to some index or search tool may be the easiest way to get to the next section is something we don't know at this point. In thinking about these patterns of use, we may compare them to what a person might do with the book in hand, at the library shelf, or with access to the catalog, in some online format.
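The two index-driven patterns described above can also be distinguished from the click sequence. This is an illustrative sketch only: the label `INDEX`, the function `classify_index_use`, and the category names are hypothetical, and it assumes index or search pages can be recognized in the log and mapped to a single label.

```python
INDEX = "index"  # hypothetical label for any index or search-tool page


def classify_index_use(path):
    """Classify how a click sequence uses the index.

    path: list of page labels in the order clicked, where index or
    search pages carry the label INDEX and other pages are sections.
    """
    index_positions = [i for i, page in enumerate(path) if page == INDEX]
    if not index_positions:
        return "no index use"
    if index_positions == [0]:
        return "index once at outset"
    # Strict alternation: index, section, index, section, ...
    starts_at_index = all(page == INDEX for page in path[0::2])
    sections_between = all(page != INDEX for page in path[1::2])
    if starts_at_index and sections_between:
        return "index as hub"
    return "other"


# Hypothetical sessions for illustration.
print(classify_index_use(["index", "ch3", "ch4"]))           # index once at outset
print(classify_index_use(["index", "ch3", "index", "ch7"]))  # index as hub
print(classify_index_use(["ch1", "ch2"]))                    # no index use
```

A classifier of this kind cannot by itself separate the two explanations raised in the text. A reader who prefers to navigate by index and a reader who has no other route to the next section leave the same trail in the log.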