KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching
Over the last several years, full-text searching of large text corpora has placed an extraordinarily powerful tool in the hands of humanities students and scholars. Use of these corpora is now entering mainstream research and, not surprisingly, is affecting research methods and the nature and quality of research outcomes. To what extent does the availability of new and copious sources of full text—along with the tools to mine them—relieve mental economy, freeing individuals from committing to memory not only names and facts but complex thoughts? Are we finally proceeding from a traditional (and obsolete?) "just in case" paradigm to a long-overdue "just in time" model for learning and scholarship? Using evidence from the literary record and from current research in human cognition, the author points to certain disjunctions between the machine processes that enable full-text searching and the subtle cognitive processes that underlie human learning and reasoning. Like all powerful tools, full-text searching requires circumspect use—and in no way relieves humanists and other researchers of the need to read extensively and think deeply.
The great Google announcement of December 2004 was greeted by expressions of immense joy from a general public normally unmoved by library matters. Newsweek, for example, celebrated a "transition to a new era of history":
If it weren't for the war, and the terrorism and the election, 2004 might well be remembered as the Year of Search. Maybe it will anyway. If we get through these rocky times with civilization's underpinnings intact, our descendants, swimming in total information, might be required to memorize the date of last August's Google IPO as a cultural milestone. Except that in the post-Google era, memorization will be obsolete, because even the most obscure fact will be instantly retrievable (Levy 2004).
The breathlessness of Newsweek's prose notwithstanding, there can be no question that Google's plans represent a high-water mark in the rising flood of searchable full text—surpassing in ambition, certainly, and someday maybe even in fact, the huge commercial and non-commercial digitization initiatives that have been announced or that have come to fruition over the last several years. These have already been extraordinary. The largest such undertaking has probably been Gale's digitization of the entire corpus of British 18th Century published books: 150,000 titles, 33,000,000 pages, readable in facsimile and fully searchable. But competitors such as ProQuest and Readex are in the same league. And we should not overlook the massive backfile conversion projects of journal publishers such as Elsevier, Wiley, and Springer. Over the last three years, with the new availability of such huge corpora, full-text searching has entered the mainstream of education and research in the humanities.
Yet reading the Newsweek article may also bring to mind both the tenor and the content of Jorge Luis Borges's classic story, "The Library of Babel." There we read: "When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. . . . The universe was justified, the universe suddenly usurped the unlimited dimensions of hope. . . . Thousands . . . abandoned their sweet native hexagons and rushed up the stairways, urged on by the vain intention of finding their Vindication" (Borges 1964b).
In this library of Borges's imagination, however, the outcome disappointed the unlimited hopes; indeed developments were disastrous beyond what anyone could have imagined: "[The] pilgrims disputed in the narrow corridors, proffered dark curses, strangled each other on the divine stairways, flung the deceptive books into the air shafts, [and] met their death cast down in a similar fashion by the inhabitants of remote regions. Others went mad" (Borges 1964b).
What now of our new Library of Babel, the sum total of all of these digitization efforts under way throughout the world? What will come of the total searchability of the human record? Will memory indeed become obsolete, as that Newsweek journalist suggested, because everything worth remembering will be on hand for immediate retrieval whenever needed? Has human memory just been a stopgap all these years, to serve us only until e-memory makes it superfluous? What does the human mind bring to the process of memory that machines do not? And if that contribution is substantive, what are the risks of an ever greater reliance on machine memory? Could its insertion into the teaching and research enterprise perhaps even impede the transmission of knowledge and the creation of new knowledge in subtle ways we do not as yet appreciate and perhaps do not wish to appreciate, since the prospect of finding anything, anytime, anywhere is so exciting, so intoxicating? At the end of the day, will we, too, know less than we do now? Will we, like the surviving inhabitants of the Library of Babel, gradually all go mad?
This paper focuses on keyword searching as our generation's answer to the problem of information glut. Together with various mechanical sorting and ranking algorithms, it is the dyke we have erected against the flood of information that the progressive digitization of all texts is creating. Or perhaps more accurately, it is the faucet we have put in that dyke. Recent studies suggest that students and information consumers are quite satisfied with what they find through keyword searches (De Rosa, Dempsey, and Wilson 2004; Marcum 2005; Fast and Campbell 2004). Why then are they not the perfect answer?
Let me start with a parable from the past, the very distant past. Around 500 B.C., a boxer by the name of Skopas commissioned the poet Simonides of Keos to write and perform a song in his honor, an epinikion, in celebration of an important victory (Weinrich 2004). In all versions of this story, and there are many, we hear that at the end of the celebration, Skopas was not pleased with the work, because two thirds of it was devoted to praising the twin deities Castor and Pollux and only one third dwelt on Skopas and his memorable triumph. So Skopas told the poet that he would pay only one third of the honorarium and Simonides could feel free to petition the gods for the rest. The end of the story is well known. During the banquet a messenger calls Simonides out of the hall, saying two young men wish to speak with him. Simonides leaves the room, but there is no one outside. At that very moment, though, the roof of the hall collapses, crushing Skopas and all his guests. In this way, it is said, the gods reward Simonides and punish the stolid literalist Skopas.
Most readers today would judge differently than Skopas and say that the entire poem was probably actually about him, but that doesn't mean that every section has to invoke his name or dwell exclusively on the prosaic details of the boxing match. Modern readers would probably also agree that the gratitude that Castor and Pollux showed Simonides was probably equally poorly informed, since the praise they heard and appreciated so much was probably nothing but elaborate rhetorical ornamentation. (Gods are suckers for that.)
The quality of language that was at issue between Skopas and Simonides is not at all unique to poetic speech. As Chomsky and other transformational linguists made clear in the 1960s, an extraordinarily subtle and intricate process relates speaker meanings to language output in all natural (i.e., human) language. Individual words and even complete sentences therefore do not necessarily map one-to-one to phenomena of the world. To suggest otherwise is to return us to a pre-linguistic notion of language such as that of the School of Port-Royal in the 17th Century and Diderot and the encyclopedists of the 18th, who believed that words and things were related to one another in a direct, binary relationship. In the Logique de Port-Royal, for example, we read the following: "The sign encloses two ideas, one of the thing representing, the other of the thing represented; and its nature consists in exciting the first by means of the second" (quoted in Foucault 1994). In other words, a thing in nature excites the production of the word that stands for it. A hundred years later, Diderot perpetuated the belief in a direct identification of words with meanings: "The language of a people gives us its vocabulary, and its vocabulary is a sufficiently faithful and authoritative record of all the knowledge of that people" (quoted in Foucault 1994). This is, we would observe today, a stupendous simplification. Words do not simply signify. In literature as in speech as in other textual forms, the occurrence of certain words may be meaningful, but only in highly complex ways. For the same reason, the omission of certain words can characterize arguments of enormous weight where such words might otherwise be expected to occur.
Michel Foucault's foundational work on meaning and signifying, The Order of Things, can be said to be all about the French Revolution, and yet it's possible—and I haven't checked—that the word string "French Revolution" does not occur a single time in the entire book. Consider the following (typically Foucauldian) sentence in that book, which advances a powerful claim about the Revolution:
The last years of the eighteenth century are broken by a discontinuity similar to that which destroyed Renaissance thought at the beginning of the seventeenth . . . It is a radical event that is distributed across the entire visible surface of knowledge. (Foucault 1994)
Foucault does not mention the French Revolution as being this "discontinuity" or a watershed of any kind. Instead it is one of those surface phenomena that do not appear to be significant within his archaeology. I don't believe The Order of Things is in ebrary or NetLibrary, but if it were, it would most definitely not rise to the top of relevant texts for an understanding of the French Revolution. This is because the relevance algorithms most commonly used in these online libraries for full-text searching are based on a blind count of mechanically harvested keywords.
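The effect of such a blind count can be made concrete with a minimal sketch. It assumes the simplest possible relevance measure, a literal count of the search phrase, which is only a stand-in and not the actual algorithm of any particular online library. The "survey" text is invented for illustration; the second text is the sentence from The Order of Things quoted above, and the names `keyword_score` and `rank` are hypothetical.

```python
import re


def keyword_score(text: str, phrase: str) -> int:
    """Count literal, case-insensitive occurrences of phrase in text."""
    return len(re.findall(re.escape(phrase.lower()), text.lower()))


def rank(docs: dict, phrase: str) -> list:
    """Order documents by raw phrase count, highest first."""
    scored = [(title, keyword_score(text, phrase)) for title, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    # Invented two-sentence survey text (an assumption for illustration).
    "survey": "The French Revolution began in 1789. "
              "The French Revolution changed Europe.",
    # Sentence from The Order of Things as quoted in this article.
    "foucault_passage": "The last years of the eighteenth century are broken "
                        "by a discontinuity similar to that which destroyed "
                        "Renaissance thought at the beginning of the seventeenth",
}

ranking = rank(docs, "French Revolution")
for title, score in ranking:
    print(title, score)
```

Under this measure the Foucault sentence scores zero for "French Revolution," however much it may have to say about the period, while the thin survey text ranks first.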
The disconnect between vocabulary on the one hand—in other words, language regarded as a cumulation of keywords—and phenomena on the other is not restricted to imaginative literature, where we may speak meaningfully (as Borges has) of "transparent tigers and towers of blood" (Borges 1964c), or the unorthodox works of certain French cultural historians. This might be claimed by the proponents of so-called scientific language, who say that theirs is a more precise and literal language type, one therefore more amenable to keyword searching. Granted, on a certain level there is a more exact correlation in the sciences between terminology and things. Go to any journal in the sciences, and you will encounter terms such as "thermoelastic transformation" (as in materials science) or "presynaptic ATP-sensitive potassium channels" (as in neuroscience), which appear to have totally clear and unique meanings and always occur when they are being discussed or argued about. But even in the sciences, things are not as they may seem. In a recent book entitled Metaphor and Knowledge: The Challenges of Writing Science, Ken Baake, a researcher in the rhetoric of scientific literature, reveals the extent to which metaphor—that phenomenon of language that so totally confounds the keyword searcher—is not just decorative, but in fact constitutive of scientific argument (Baake 2003). This echoes earlier arguments of linguists George Lakoff and Mark Johnson, who in their classic of the 1980s, Metaphors We Live By, described metaphors as key to the way all people grapple with reality: "[M]ost people think they can get along perfectly well without metaphor," they write. "We have found, on the contrary, that . . . our ordinary conceptual system, in terms of which we both think and act, is fundamentally metaphorical in nature" (Lakoff and Johnson 1980).
Then there is the fact that word meanings respond to their contexts. Even in the sciences, macro- and microcontextual pressures have an impact on word meaning. Sociologists and philosophers refer to this phenomenon as indexicality, "the fact that the meaning of speech and action depends on the . . . situation in which it occurs" (Johnson 2000). What this means in practice is that keyword searches retrieve both commensurate and incommensurate occurrences of words—incommensurate in relation to the content the keyword is supposed to represent. Compare, for example, the meaning of the word "noise" in everyday speech and in communications theory, where it has a highly technical meaning (Hargrave 2001). Similarly, contexts can push entirely different words other than searched-for keywords into the semantic vicinity of a desired meaning. These interlopers, these semantic adaptations meaningful only in limited contexts, risk going entirely undetected in full-text searching.
In other words, articles, books, texts in general—especially in the humanities but not restricted to them—are like complex organisms. They have integrity and an internal coherence that does not map in any direct way to particular phenomena or the world in general. Individual word occurrences, even those that seem most set in their meanings, are indexical in nature, responding to pressures upon them by their local or disciplinary environments. This is why Paul Saenger of the Newberry Library in Chicago refers to this dipping in and back out of texts—what we do when we deracinate keywords and carry them away from their native settings with some additional word material still clinging to them, like dirt to roots—as "intrusive consultation" (Saenger 1997). "Intrusive consultation" is exactly the form of information acquisition that keyword searching encourages and even imposes. Or we could call this the Frankenstein Fallacy: You pull a beating heart out of a body and put it somewhere else, and indeed, it still is the heart, yet in any meaningful way it is the heart no longer.
The most renowned intrusive consultant of texts was not a real individual, but a literary figure, and of course he, too, was a creation of that great library metaphysician, Jorge Luis Borges. It was Ireneo Funes, "Funes the Memorious," who never forgot any detail he was ever exposed to, could call these details up instantly, but who was incapable of linking any of the millions of details he could recall to one another in any coherent way. As the narrator in Borges's story writes of Funes, "he was not very capable of thought. To think is to forget differences, generalize, make abstractions. In the teeming world of Funes, there were only details, almost immediate in their presence" (Borges 1964a). "Immediate," yes, and instantly retrievable, but incoherent almost in the original meaning of the word, as in not able to "cleave, stick, or hang together." Or as Funes said it perhaps best himself: "My memory, sir, is like a garbage heap" (Borges 1964a).
As we will examine below, the lack of coherence of the results of full-text searching is fundamentally at odds with natural patterns of knowledge acquisition. Other problems of full-text searching that are more formal than epistemological will not be dealt with here to the same extent, but they also deserve mention. These include: orthographic irregularities in early modern English, which can lead to significant omissions or skewing of search results; the different vocabulary registers in 17th, 18th, even 19th century English; and thirdly the inaccessibility of foreign-language texts to English keyword searching—no minor problem in the ECCO (Eighteenth Century Collections Online) database, by the way, which includes 4322 complete works in French, 3712 in Latin, and even 443 in Welsh.
On the threshold between "technical" and "epistemologically fundamental" problems with keyword searching is the fact that trawling for certain keywords in any large database almost always retrieves something, regardless of how misinformed the selection of keyword may be. As an example, search for "univeristy" in an OPAC or in Google. Northwestern University's OPAC currently retrieves 299 hits for "univeristy." In Google, this misspelling occurs in 1,460,000 documents. Granted, very few students will assume this to be the correct spelling of the word and base an honors thesis on what they find, but nonetheless the wealth of retrieved records even for totally misinformed full-text searches confirms the wisdom of the German saying: Wie man in den Wald hineinruft, so schallt es heraus ("As one calls into the forest, so it echoes back"). In a recent New York Times article, Stanford's Geoffrey Nunberg discusses several non-trivial examples of this phenomenon. "Search engines," he writes, "make it all too easy to filter information in ways that reinforce pre-existing biases. A Google search on 'voting machine fraud,' for example, will turn up popular Web pages that feature those words prominently, most of which will support the view that voting machines make election fraud easier; opposing sites won't tend to feature that language, so will be missed in the search" (Nunberg 2005).
But we should return to the issue of coherence and the decontextualization of the results of full-text searching, as this is the area in which more fundamental problems exist.
The kind of instantaneous keyword searching we use so casually today is unlike any information gathering tool our species has ever known. What can compare to it? Indexing or cataloging a book is a high-level intellectual undertaking, often leading the human agent to invent terms for a work's metatextual carapace that do not naturally occur in the text. Perhaps the more mechanical concordance creation of earlier generations comes closest to what we have today—but for the speed with which this work can now be accomplished:
. . . the first concordance—of the Vulgate, completed in the early 13th century—required the labor of 500 Dominican friars. Even in more modern times, those who began concordances knew that they might not live long enough to see them completed. This was the case for the first directors of the Chaucer concordance, which took 50 years before reaching publication in 1927.
Today, we can do the same work in minutes. But whether slow as then or fast as now, what does reliance on a concordance, or on lists of keyword occurrences, do to the process of textual analysis? As Deborah Friedell observed in a 2005 New York Times article entitled "The Word Crunchers":
To read a concordance is to enter a world in which all the included words are weighted equally, each receiving just one entry per appearance. While Amazon's concordance can show us the frequency of the words "day" and "shall" in Whitman, "contain" and "multitudes" don't make the top 100. Neither does "be" in Hamlet, nor "damn" in "Gone with the Wind." The force of these words goes undetected by even the most powerful computers. (Friedell 2005)
By relying on machine concordances and full-text searching, we are staking much of the future of textual analysis on the results of a relentless, almost instantaneous, but ultimately dumb process performed by machines.
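What a machine concordance does, and what it leaves out, can be shown in a few lines. The following is a minimal sketch of a keyword-in-context (KWIC) listing and a word-frequency count, not a reconstruction of Amazon's or any other vendor's tool; the function names `kwic` and `frequencies` and the simple tokenizer are assumptions made for illustration. The sample is the familiar line from Hamlet.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)


def kwic(text, keyword, width=2):
    """Return one (left context, keyword, right context) entry per occurrence."""
    tokens = tokenize(text)
    entries = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            entries.append((left, token, right))
    return entries


def frequencies(text):
    """Count every token once per appearance, all weighted equally."""
    return Counter(token.lower() for token in tokenize(text))


sample = "To be, or not to be, that is the question"

for left, word, right in kwic(sample, "be"):
    print(f"{left:>10} [{word}] {right}")

print(frequencies(sample).most_common(3))
```

Every occurrence receives exactly one entry and one count. Nothing in the output distinguishes the "be" of this line from any other "be" in the play, which is the point Friedell makes.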
Others have warned us against too great a reliance on machine intelligence when working with information conveyed through texts. One of the earliest was the originator of mechanically assisted text processing, search, and retrieval, Vannevar Bush, whose seminal essay of 1945 entitled "As We May Think" first hypothesized a relationship between individual researchers on the one hand and "the sum of our knowledge" on the other. Bush had no problem with "the reduction of mathematical transformations to machine processes," but only "simple selection" of information, as he put it, would yield to these mechanical search algorithms. Knowledge and information wrapped up in texts pose a problem of an entirely different type. "Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing," he wrote. With this, Bush was referring to the computer's inability to search for things that are not identical or nearly identical, in a literal sense, to a search expression. "The human mind does not work that way," he wrote. Instead "[i]t operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain" (Bush 1945).
This assessment of human cognition is at odds with much of what we read today. It's very trendy to describe human thought in terms of computer function. As Hillel Schwartz puts it in his 1996 book, The Culture of the Copy:
The metaphors of machine reasoning and computer linkage have become our own: our brains are parallel processors, our thinking is signal processing, our lives are databanks, and we need "downtime." In the company of laser copiers, faxes, and scanners, we are drawn to assume that what we copy instantly we know intimately. (Schwartz 1996)
But this is not so. As Vannevar Bush explains, the creation of what he calls "associative trails" characterizes human thought and sets it apart from machine processes. These associative trails can be grouped and placed into relationship to one another through complex metaphors, and indeed, "cognitive science shows us that we cannot reason independently of metaphor" (Borowicz 2001). Metaphors that aid us to remember and to reason can also take the form of plotlines and stories—as argued by Mark Turner in The Literary Mind, in which he claims that "narrative imagining," the construction of stories by our brain, is "the fundamental instrument of thought" (Turner 1996). Or these complex metaphors may take the form of landscapes or buildings, historically the path taken by mnemonics, the art of memory (Spence 1985; Yates 1992). All the talk about online searching replacing the need for human memory fails to take these idiosyncrasies of human cognition into account. We can't retain long lists of keywords in context in our minds, but the mind is almost limitlessly capacious if words are placed into a meaningful context of story, verse, or argument. Indeed—and here we move on to the next important point—any attempt to rely on keyword searching to replace prolonged reading as a tool of knowledge acquisition—a function of memory—will fail and in fact impede human thought and learning. At least for us humans, the universe, as the poet Muriel Rukeyser once said, "is made of stories, not of atoms" (quoted in Schmundt 2002).
We've talked now about memory, but what about forgetting? Why waste time reading a whole book or article, a student (or a Newsweek reporter) might ask, if we are going to forget most of what we read anyway? Keyword searching then becomes a kind of just-in-time information acquisition strategy, far preferable, some might claim, to the presumably wasteful just-in-case strategies of traditional humanistic education. Here, too, an unfortunate computer analogy seems to be operative. Are things that are read and then "forgotten" as gone as, say, something we have deleted from our hard drive? To answer this question, we can turn to an entire book on the topic, Harald Weinrich's Lethe: The Art and Critique of Forgetting. Weinrich marshals a host of literary evidence to explain why forgetting cannot be compared with file deletion. We can start with the title of a book by Uruguayan writer Mario Benedetti: El olvido está lleno de memoria: "Forgetting is full of memory" (Benedetti 1995). Or the poem "A Reader" by Borges, which includes these lines: "Having learned and having forgotten Latin is a possession, because forgetting is a form of memory, its broad basement, the secret other side of the coin" (cf. Weinrich 2004). Augustine, too, writes Weinrich, marveled "at the fact that even forgetting is found among the multitude of memory's contents (inesse oblivionem in memoria mea)" (Weinrich 2004). This, according to Augustine (via Weinrich), is why we can remember that we have forgotten something—the essential precondition for querying a database to recover something we once knew or think we did. This is also why broad exposure to what we call today "full texts," coupled with that strange species of human memory known as forgetting, is essential to information retrieval and to learning more generally.
Cognitive science is also not silent on the nature of human forgetting. It, too, asks how forgetting differs from file deletion—or from never having been exposed to a text in the first place. We read that "[t]he conventional wisdom in cognitive science is that at least 95 percent of all thought is unconscious" (Borowicz 2001). This "95 percent below the surface of conscious awareness shapes and structures all conscious thought. If the cognitive unconscious were not there doing this shaping, there would be no conscious thought" (Lakoff and Johnson 1999). What we forget therefore is not actually deleted from memory, but instead becomes part of this "cognitive unconscious." Granted, it may not be a photographic reproduction of what was once learned, but it informs conscious thought and can itself become conscious again upon reflection.
In the dialog Meno, Plato famously develops the concept of anamnesis, or re-collection, as the source of learning. Plato attributed knowledge of first principles to an innate understanding that must merely be drawn out by the teacher. Learning, then, is nothing other than remembering. Akin to this Platonic notion, modern cognitive science confirms that much learning does indeed take place against the background of the cognitive unconscious. How often are we surprised that we already know something we believe to be hearing for the first time? This may be accompanied by the realization "Hey, I've read that somewhere before." But without all the tedium of reading long texts, later anamnetic re-collection is precluded. Structuring an encounter with a text through a list of keywords-in-context may result in some astonishing finds, especially dazzling for a teacher who may have spent an entire career looking for just such a passage or word occurrence, but it impedes the natural learning process. Without reading a whole work or substantial parts thereof, at the end our minds are nothing but Funesian garbage heaps. This opens the door to despair and insanity—as it does for the librarians in Borges's "Library of Babel." Users who want to learn will recoil in horror from such a library—if they are not seduced by it first and made mad, a very real consequence of the "epistemological despair" that can befall those who stand before too great a chaos of information—and do not flee in time (Garrett 1999).
Voltaire once wrote that "a great library has the quality of frightening those who look upon it." Entering such a library—or for that matter any large database like EEBO or ECCO, which is much the same thing—is an existentially complex experience. There is a "delicate balance that exists between our delight in the experience of monumentality and abundance—and our fear of being crushed, as it were, by a thing, a place, that is simply too vast to comprehend" (Garrett 2004). Therefore, before turning ourselves or the students we serve loose on full-text databases, it may be a good idea to acquire or provide at least a general orientation to the field about to be entered. Then and only then can keyword occurrences be placed into a meaningful context. The power of keyword searching becomes a tool then rather than a dazzling—but empty—Funesian surrogate for learning and for thinking.
Lest the views and the arguments of this article appear too negative: It is no longer possible to do serious work in the humanities without the power to search the Internet or the very large, proprietary textual databases most research libraries put at the disposal of their users. Bibliographic software such as EndNote, which meshes smoothly with online sources of information, is also a huge asset in humanities research: it is one of those "things that make us smart," as the title of a book by Donald A. Norman puts it pithily (Norman 1993), for such bibliographic utilities, used consistently, allow researchers to re-locate lost thoughts, quotations, or references they may have once had in active memory, but which naturally tend to slip out of the active conscious and into that special category of memory called the cognitive unconscious.
We should also give credit to the role played by bibliographic records in overcoming the "dumbness" of purely mechanical keyword searching. In an article published in College & Research Libraries in 2005, Tina Gross and Arlene G. Taylor document the huge benefits in both recall and concision brought about through the metalanguage present in subject headings:
It was found that more than one-third of records retrieved by successful keyword searches would be lost if subject headings were not present, and many individual cases exist in which 80, 90, and even 100 percent of the retrieved records would not be retrieved in the absence of subject headings. (Gross and Taylor 2005)
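The mechanism behind the finding of Gross and Taylor can be sketched as follows. The three catalog records are invented for illustration, as are the names `Record` and `search`; the sketch assumes a naive substring match over title words and, optionally, subject headings, which is far simpler than a real OPAC.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    title: str
    subject_headings: list = field(default_factory=list)


def search(records, keyword, use_headings=True):
    """Return records whose title (and optionally headings) contain keyword."""
    needle = keyword.lower()
    hits = []
    for record in records:
        searchable = [record.title]
        if use_headings:
            searchable.extend(record.subject_headings)
        if any(needle in text.lower() for text in searchable):
            hits.append(record)
    return hits


# Hypothetical records: only the third uses the search term in its title.
records = [
    Record("A treatise on fevers and effluvia", ["Sanitation", "Public health"]),
    Record("Observations on rats in the city", ["Sanitation"]),
    Record("Sanitation in the modern town", ["Sanitation"]),
]

with_headings = search(records, "sanitation")
without_headings = search(records, "sanitation", use_headings=False)
share_lost = 1 - len(without_headings) / len(with_headings)
print(len(with_headings), len(without_headings), round(share_lost, 2))
```

In this toy catalog, two of the three hits depend entirely on a heading supplied by a cataloger. The proportion is an artifact of the invented data and says nothing about the percentages reported in the study.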
Although their study analyzed only searches of bibliographic data in OPACs and not the consequences of adding subject headings to full-text collections, it is likely that a significant increase in the intelligence of keyword searching could be achieved through such an enhancement. This is behind at least one project to add abundant subject headings to full-text databases—subject headings that make explicit the "associative trails" Vannevar Bush spoke of so longingly, adding word-searchable metacontent to the full texts that might never occur in the work otherwise. A team of librarians from the universities of Michigan, North Carolina, Northwestern, California-Riverside, and Yale is currently working to add subject headings to tens of thousands of ECCO records, this in an effort to add some human intelligence to the domain of machines:
Man cannot hope fully to duplicate this [intricate web of trails carried by the cells of the brain] artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency. The first idea, however, to be drawn from the analogy concerns selection. Selection by association, rather than by indexing, may yet be mechanized. One cannot hope thus to equal the speed and flexibility with which the mind follows an associative trail, but it should be possible to beat the mind decisively in regard to the permanence and clarity of the items resurrected from storage. (Bush 1945)
As with so many other innovations of science, full-text searching can be used to enormous positive effect, can in fact be essential for serious work—or it can be abused to dumb down the educational enterprise in ways no earlier generation could have ever dreamed possible. Librarians, as teachers and mediators, as catalogers and interpreters of content, are the associative "trail blazers" that Vannevar Bush conjured up 60 years ago, they "who find delight in the task of establishing useful trails through the enormous mass of the common record" (Bush 1945). In this age, it is we in the library profession who have the mission to humanize the machine and make it serve us and our communities on our own terms.
This article is based on a paper presented at the program "Old Texts Made New: EEBO, ECCO, and the Impact on Literary Scholarship," organized by the Literatures in English Section (LES) of the Association of College and Research Libraries in Chicago on June 25, 2005. The author is indebted to Kristine J. Anderson (Purdue University), who convened the program and encouraged this contribution, and to his two co-presenters, professors Jesse Lander (University of Notre Dame) and Helen Thompson (Northwestern University), for their many interesting comments and suggestions, not all of which could be explicitly acknowledged in this paper. Jeffrey Garrett, Northwestern University Library, 1970 Campus Drive, Evanston, Illinois 60208; Telephone 847-467-5675; Fax 847-467-7899; E-mail firstname.lastname@example.org.
During 2004, Gale was scanning and digitizing text from microfilm at a rate of over 700,000 pages per week (Ray Bankoski, Thomson Gale, personal communication). The non-commercial Text Creation Partnership, or TCP, based in Ann Arbor, has digitized close to 400 million words of text from Chadwyck-Healey's Early English Books Online, or EEBO. This dwarfs Chadwyck-Healey's English Poetry Database, which has about 100 million words, and the other "CD-ROM power-tools" of the 1990s, as Nicholson Baker called them (Baker 1997). The final TCP release will almost surely top one billion words (Martin Mueller, Northwestern University, personal communication).
The philosopher Wittgenstein was famous for such idiosyncrasies of word choice. In his Tractatus Logico-Philosophicus, for example, he consistently refers to languages and grammars as "toolboxes," words as "tools," and aspects of the similarity between words as "family resemblances" (cf. Wittgenstein 1974).
Jeffrey Knapp, in his recent Shakespeare's Tribe, bases an entire line of analysis on occurrences of the word "rogue" in Elizabethan English, overlooking an important spelling variant, "roge," that has quite different meanings and connotations (Knapp 2002). (I am grateful to Professor Jesse Lander, University of Notre Dame, for this example.)
For example, the word "sanitation" does not occur even once in the 150,000 volumes of ECCO, the Eighteenth Century Collections Online, although hygiene was certainly a concern in that century. Searchers are better advised to search for terms such as "fever," "rats," or "effluvia" to find useful texts.
Baake, Ken. 2003. Metaphor and Knowledge: The Challenges of Writing Science, Studies in Scientific and Technical Communication. Albany: State University of New York Press.
Baker, Nicholson. 1997. Lumber. In The Size of Thoughts. New York: Vintage.
Benedetti, Mario. 1995. El olvido está lleno de memoria. Montevideo, Uruguay: Cal y Canto.
Borges, Jorge Luis. 1964a. Funes, the Memorious. In Labyrinths: Selected Stories and Other Writings, edited by D. A. Yates and J. E. Irby. New York: New Directions.
———. 1964b. The Library of Babel. In Labyrinths: Selected Stories and Other Writings, edited by D. A. Yates and J. E. Irby. New York: New Directions.
———. 1964c. Tlön, Uqbar, Orbis Tertius. In Labyrinths: Selected Stories and Other Writings, edited by D. A. Yates and J. E. Irby. New York: New Directions.
———. 1977. Un lector. In Obra poética. Buenos Aires: Edición Emecé Editores.
Borowicz, Jon. 2001. The Body of a Philosopher: Embodied Thought as Physical and Social Activity. In Sagesse du Corps: Actes du colloque interdisciplinaire organisé au Collège dominicain de philosophie et de théologie à Ottawa les 29 et 30 septembre 2000, edited by G. Csepregi. Aylmer (Québec): Éditions du scribe.
Bullokar, John. 1616. An English expositor teaching the interpretation of the hardest words vsed in our language. With sundry explications, descriptions, and discourses. By I.B. Doctor of Phisicke. London Printed by Iohn Legatt.
Bush, Vannevar. 1945. As We May Think. Atlantic Monthly: 101-108.
De Rosa, Cathy, Lorcan Dempsey, and Alane Wilson. 2004. The 2003 OCLC Environmental Scan: Pattern Recognition: A Report to the OCLC Membership. Dublin, Ohio: OCLC.
Fast, Karl V., and D. Grant Campbell. 2004. 'I Still Prefer Google': University Student Perceptions of Searching OPACs and the Web. In ASIST 2004 Annual Meeting, "Managing and Enhancing Information: Cultures and Conflicts." Providence, Rhode Island: American Society for Information Science and Technology.
Foucault, Michel. 1994. The Order of Things. Translated by A. Sheridan. New York: Vintage.
Friedell, Deborah. 2005. The Word Crunchers. New York Times, June 5, 47.
Garrett, Jeffrey. 1999. Redefining Order in the German Library, 1775-1825. Eighteenth-Century Studies 33 (1 (Fall)):103-23.
———. 2004. The Legacy of the Baroque in Virtual Representations of Library Space. Library Quarterly 74 (1):42-62. [doi: 10.1086/380853]
Gross, Tina, and Arlene G. Taylor. 2005. What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results. College & Research Libraries 66 (3):212-230.
Hargrave, Frank. 2001. Noise. In Hargrave's Communications Dictionary: Wiley.
Johnson, Allan G. 2000. The Blackwell Dictionary of Sociology: A User's Guide to Sociological Language. Malden, Mass.: Blackwell Publishers.
Knapp, Jeffrey. 2002. Shakespeare's Tribe: Church, Nation, and Theater in Renaissance England. Chicago: University of Chicago Press.
Lakoff, George, and Mark Johnson. 1980. Metaphors We Live By. Chicago: University of Chicago Press.
———. 1999. Philosophy in the Flesh. New York: Basic Books.
Levy, Steven. 2004. Google's Two Revolutions. Newsweek, December 27, 70.
Marcum, Deanna B. 2005. The Future of Cataloging: Address to the Ebsco Leadership Seminar. In ALA Midwinter Meeting. Boston, Mass.
Markoff, John, and Edward Wyatt. 2004. Google Is Adding Major Libraries to Its Database. New York Times, December 14, 1.
Norman, Donald A. 1993. Things That Make Us Smart: Defending Human Attributes in the Age of the Machine. Reading, Mass.: Addison-Wesley.
Nunberg, Geoffrey. 2005. Teaching Students to Swim in the Online Sea. New York Times, February 13, 4.
Saenger, Paul Henry. 1997. Space Between Words: The Origins of Silent Reading, Figurae. Stanford, Calif.: Stanford University Press.
Schmundt, Hilmar. 2002. Hightechmärchen: die schönsten Mythen aus dem Morgen-Land. Berlin: Argon.
Schwartz, Hillel. 1996. The Culture of the Copy: Striking Likenesses, Unreasonable Facsimiles. New York: Zone Books.
Spence, Jonathan D. 1985. The Memory Palace of Matteo Ricci. New York, London: Penguin.
Turner, Mark. 1996. The Literary Mind. New York: Oxford University Press.
Wegmann, Nikolaus. 2000. Bücherlabyrinthe: Suchen und Finden im alexandrinischen Zeitalter. Köln: Böhlau.
Weinrich, Harald. 2004. Lethe: The Art and Critique of Forgetting. Translated by S. Rendall. Ithaca: Cornell University Press.
Wittgenstein, Ludwig. 1974. Tractatus logico-philosophicus. 1st pbk. ed. London: Routledge and Kegan Paul.
Yates, Frances Amelia. 1992. The Art of Memory. London: Pimlico.