    Analysis of individual use

    A key innovation in the Columbia online books project was the introduction, in 1997, of the ability to identify the activity of unique users. This was a fortunate byproduct of the security system, developed to permit people to read online books from home. To maintain confidentiality of the users, system analysts replaced the identities of individual users with uninformative labels.

    Table 16.3: Status of users at time of first use
    Status                   Frequency   Percent
    Undergraduate Student         2088      58.0
    Other                          607      16.9
    Missing                        328       9.1
    Graduate Student               295       8.2
    Other Student                  145       4.0
    Faculty                        136       3.8
    TOTAL                         3599     100.0

    With anonymity ensured, we were permitted to link usage to administrative files containing demographic information about the users. Typical results are those shown in Table 16.3, reporting the distribution of the status of individual users at the time they first used a particular resource. The resource in this case was the online version of the Oxford English Dictionary. While we had a number of reference works available online and, by the close of the project, close to 200 books in online form, the total usage of the OED represented approximately 50 percent of all online usage, and so it is used here to illustrate the types of analyses that we performed.

    N Stem Leaves
    1049.00 0 00000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111
    491.00 0 22222222222222222222222222333333333333333333
    265.00 0 444444444444455555555555
    202.00 0 666666666667777777
    156.00 0 88888889999999
    140.00 1 0000000111111
    92.00 1 22223333
    102.00 1 4444455555
    86.00 1 66667777
    62.00 1 888999
    68.00 2 0000111
    49.00 2 22233
    57.00 2 44455
    48.00 2 6677
    38.00 2 899
    38.00 3 001
    41.00 3 2233
    29.00 3 45
    35.00 3 667
    32.00 3 899
    34.00 4 011
    25.00 4 23
    18.00 4 45
    20.00 4 67
    16.00 4 89
    11.00 5 0&
    Figure 16.3: Stem and Leaf diagram of time spent viewing the OED
    NOTE: This figure is a stem-and-leaf diagram that represents a histogram of use. It will look like an ordinary histogram if rotated 90 degrees to the left, but contains rather more detail, as individual values are represented by numbers rather than marks. Each leaf digit represents an observation. The value of the observation in minutes of usage is equal to 10*Stem+Leaf. The last row represents fifty or more minutes of usage. The value in the first column (N) is the total number of observations summarized in that row.

    Approximately 3,600 individuals (3,599 in Table 16.3) used the OED during the study period. Just over 2,000 of these were undergraduate students at the time of first use. Nearly 300 were graduate students and close to 140 were faculty members.

    We analyzed the ways in which individual users used the resource. To do this we introduced the rule that an inactive period of 15 minutes or more marks the end of a session. This rule rests on detailed analysis, which showed a natural break in the distribution (over all users) of the interval between "clicks" at somewhere around 10-15 minutes. We interpret this as meaning that continuation of a session over a break of this duration is a rare event, which we can safely ignore.

    We also studied the total amount of use that individuals made of specific resources. This is illustrated by data on the OED. The modal (that is, most common) number of clicks that an individual user made on the OED is somewhere between 2 and 3. Above that value, the number of users making a given number of clicks drops off exponentially. The rate of the drop is such that, at any point, the chance that a user goes on to make two more clicks is about 2/3. (The chance of adding one more click is the square root of this number, or about 83%.)
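    The 15-minute session rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual processing code: the function name split_into_sessions and the sample click times are hypothetical, and the log is assumed to hold one timestamp per click for a single anonymized user.

```python
from datetime import datetime, timedelta

# Inactivity threshold taken from the text: a gap of 15 minutes or more
# is treated as the end of a session.
SESSION_GAP = timedelta(minutes=15)


def split_into_sessions(click_times, gap=SESSION_GAP):
    """Group one user's click timestamps into sessions.

    A new session starts whenever the interval since the previous
    click is `gap` or longer. (Hypothetical helper, for illustration.)
    """
    sessions = []
    for t in sorted(click_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still inside the current session
        else:
            sessions.append([t])     # first click, or gap >= threshold
    return sessions


# Made-up click log for a single user.
clicks = [
    datetime(1999, 3, 1, 10, 0),
    datetime(1999, 3, 1, 10, 4),
    datetime(1999, 3, 1, 10, 30),   # 26 minutes after previous: new session
    datetime(1999, 3, 1, 10, 44),   # 14 minutes after previous: same session
    datetime(1999, 3, 1, 10, 59),   # exactly 15 minutes: new session
]
sessions = split_into_sessions(clicks)
print([len(s) for s in sessions])
```

    The comparison is strict (less than the gap) so that an interval of exactly 15 minutes closes the session, matching the phrase "15 minutes or more".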

    As shown in Figure 16.3, the time spent using the OED online follows an exponential distribution. This indicates that at any time in the course of using the OED an individual has a constant probability of just quitting and deciding never to use it again (roughly 100%-83%=17%).
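    The arithmetic behind the constant stopping probability can be checked with a short sketch. It assumes the two-click continuation rate is exactly 2/3 (the text gives it only as "about 2/3") and that each click is an independent trial; the names p_one_more, p_stop, and survival are illustrative only.

```python
import math

# Chance of going on to make two more clicks, as quoted in the text.
# Treated as exactly 2/3 here, which is an assumption.
p_two_more = 2 / 3

# If every click carries the same continuation chance p, then
# p * p = p_two_more, so p is the square root.
p_one_more = math.sqrt(p_two_more)

# Constant chance of stopping at any given click.
p_stop = 1 - p_one_more


def survival(k, p=p_one_more):
    """Probability that a user makes at least k further clicks."""
    return p ** k


print(round(p_one_more, 3), round(p_stop, 3))
print([round(survival(k), 3) for k in range(6)])
```

    With exactly 2/3 as the input, the per-click continuation chance comes out at about 0.82, close to the roughly 83% quoted above; the survival values shrink by the same factor at every step, which is what an exponential (geometric) drop means.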

    This apparently exponential behavior is intriguing and we pursued it in another way. Since we could anonymously track individual users, we could plot how much an individual used the resource against how long it was since the first time that the individual used it. With 100% adoption this graph would be roughly linear. We show the actual data for the OED (which had heavy use) in Figure 16.4.

    Figure 16.4: Scatter plot of total use against time since first use

    Figure 16.4 is a scatter plot. Each point represents one individual user. The y-coordinate of the point represents the number of sessions that an individual had with the OED and the x-coordinate represents the number of days since that individual first used the OED. The steep line represents the expected usage relationship if adopters continued to use the resource at a steady rate.[3] In fact, a regression analysis shows that the best fit is nearly horizontal, which indicates there is little ongoing use by individuals. It is apparent that many observations are not well-predicted by this model, and indeed, that some usage did persist.
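    The kind of regression described above can be sketched as an ordinary least-squares fit of sessions against days since first use. The data below are invented purely for illustration and are not the project's observations; the helper least_squares is likewise hypothetical.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example: one point per user.
days_since_first_use = [10, 50, 120, 200, 300, 400]
sessions_per_user = [2, 1, 3, 2, 1, 3]

slope, intercept = least_squares(days_since_first_use, sessions_per_user)
print(f"fitted sessions per day: {slope:.4f}")
```

    A slope near zero, as in this made-up sample, corresponds to the nearly horizontal best fit reported in the text: users who first arrived long ago have accumulated hardly more sessions than recent arrivals.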

    We can plot this data in a more familiar form by showing the distribution of time since first use, without paying attention to how much use there has been. We do so by projecting the preceding figure onto a horizontal axis; see Figure 16.5. We see, as have most researchers in the academic setting before us, that it is very easy to discover the existence of the semester. Each of the five peaks in this graph corresponds to an academic semester. There might be some cause for optimism in the fact that the leftmost peak, which represents the most recent surge in use, spring 1999, seems to rise higher than any of the earlier ones. However, we don't know quite what to make of the fact that the one before it (fall 1998) represents a drop from the preceding fall.
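    Projecting the scatter plot onto the horizontal axis amounts to counting users by days since first use and ignoring the session counts. A minimal sketch follows, with an assumed bin width of 30 days and invented sample values; neither comes from the study.

```python
from collections import Counter


def histogram(days_since_first_use, bin_width=30):
    """Count users per bin of days since first use.

    Returns a dict mapping each bin's starting day to its user count.
    The 30-day default bin width is an arbitrary choice for this sketch.
    """
    counts = Counter(d // bin_width for d in days_since_first_use)
    return {b * bin_width: counts[b] for b in sorted(counts)}


# Invented values: one entry per user.
days = [3, 12, 25, 40, 170, 175, 181, 190, 200, 365]
print(histogram(days))
```

    In real data of this kind, clusters of well-populated bins would show up as the semester peaks discussed above.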