Authors: Timothy Allen, Charles Cooney, Stéphane Douard, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, Robert Voyer
Title: Plundering Philosophers: Identifying Sources of the Encyclopédie
Publication info: Ann Arbor, MI: MPublishing, University of Michigan Library, Spring 2010
Rights/Permissions: This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact mpub-help@umich.edu for more information.
Source: Plundering Philosophers: Identifying Sources of the Encyclopédie. Timothy Allen, Charles Cooney, Stéphane Douard, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, Robert Voyer. vol. 13, no. 1, Spring 2010
Article Type: Article
URL: http://hdl.handle.net/2027/spo.3310410.0013.107
Plundering Philosophers: Identifying Sources of the Encyclopédie
Abstract
Denis Diderot and Jean le Rond d’Alembert’s Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers stands as one of the crowning achievements of the French Enlightenment. This monumental work, containing some 77,000 articles written by no fewer than 140 contributors, was published in Paris between 1751 and 1772 in seventeen in-folio volumes of text and eleven volumes of engravings. As with all reference works, the authors and editors of the Encyclopédie made extensive use of a vast array of contemporary reference works and scholarship to complete their massive compendium of enlightened knowledge. The identification of the source material used by the philosophes is a massive undertaking in itself, as the authors rarely acknowledged the works upon which they relied in writing their contributions. This paper describes two different experiments to identify sources of the Encyclopédie. The first applies the "Vector Space Model" (VSM) to identify articles that may have been borrowed from the Dictionnaire de Trévoux (1743), an intellectual rival of the Encyclopédie compiled by French Jesuits in the first half of the 18th century. We find that the Vector Space Model can be an effective means of identifying "similar" passages in documents, in this case, potentially borrowed articles that were then examined by human evaluators. Overall, we conclude that 5.32 percent of all of the articles in the Encyclopédie that were examined were borrowed from the Jesuit critics of the philosophes. The second experiment, building on the first, applies what we call Pairwise Alignment of Intertextual Relations (PAIR) to detect passages borrowed from another important predecessor of the Encyclopédie, Louis Moréri's popular Grand dictionnaire historique (1671-1759), which was also a product of Jesuit scholarship.
Despite the genealogical character of the Moréri dictionary, which represented an understanding of knowledge radically different from that of the encyclopédistes, we were nonetheless able to identify more than 400 shared passages between the two works using the PAIR approach. These findings shed new light on the composition process of the Encyclopédie and suggest that the intellectual battle lines between the Jesuits and the philosophes may not have been as firmly established as previously understood. We conclude by outlining improvements to both the VSM and PAIR models, which we expect will make further identification of similar passages more effective.
Keywords: similarity, algorithms, plagiarism, intertextuality, Encyclopédie, Trévoux, Moréri, Diderot, d'Alembert, Enlightenment, Jesuits.
Introduction
Denis Diderot and Jean le Rond d’Alembert’s Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers stands as one of the crowning achievements of the French Enlightenment. This monumental work, containing some 77,000 articles written by no fewer than 140 contributors, was published in Paris between 1751 and 1772 in seventeen in-folio volumes of text and eleven volumes of engravings. [1A] As with all reference works, the authors and editors of the Encyclopédie made extensive use of a vast array of contemporary reference works and scholarship to complete their massive compendium of Enlightenment knowledge. Widespread copying or paraphrasing from one reference work into another, with limited if any acknowledgment, was not an uncommon practice at the time of the Encyclopédie's publication. Indeed, it is perhaps a practice that cannot be avoided in the creation of encyclopedias. [1] The "reuse" of knowledge is not simply a case of "plagiarism," if by that we mean unacknowledged direct copying of significant passages from other works, since the authors and editors would frequently rework language from previous works to suit their stylistic tastes or ideological orientations. Rather, this may be considered a process of intellectual "plundering," defined by the Oxford English Dictionary as "to take material from (literature, artistic or academic work, etc.) for one's own purposes." The plundering, or reworking, of previous scholarship makes systematic identification of the sources used by the philosophes a rather more difficult problem than the simple detection of plagiarism, given the extensive variations made to the originating passages.
One of the sources of which the philosophes were reputed to have made extensive use was the Dictionnaire universel françois et latin vulgairement appelé de Trévoux (colloquially known as the Dictionnaire de Trévoux) [2], an intellectual rival of the Encyclopédie compiled by French Jesuits in the first half of the 18th century [3]. Indeed, Jesuit critics of the Encyclopédie complained loudly of the extent to which entries were copied from earlier works, although among the possible sources of plagiarism the Trévoux dictionary was never explicitly mentioned. Due to the scale of the problem there has not yet been a systematic attempt to identify the extent to which the philosophes plundered the Trévoux, or any other earlier work for that matter. In order to detect possible borrowings, we use a general document similarity measure, the Vector Space Model (VSM), to identify articles in the Encyclopédie that may have been borrowed, in whole or in part, from the Trévoux. As we report below, this simple technique is surprisingly effective. Our procedure enlisted several researchers familiar with the period and materials to evaluate possible borrowed passages and select those objects that were, in their judgment, most probably borrowed. Overall, we found that 5.32 percent of the examined articles (80-85% of the entire work) in the Encyclopédie were borrowed from the Jesuit Dictionnaire de Trévoux. We further concluded that VSM works best when comparing identified blocks of text, in this case articles, but was less effective when borrowings occurred in smaller, unidentified passages, such as sentences or paragraphs within a larger article.
Expanding on the success of the VSM experiment to identify borrowed articles in the Encyclopédie, we applied a completely different approach to the identification of similar passages borrowed from another Jesuit reference work, Louis Moréri's Grand dictionnaire historique, published in 20 editions between 1671 and 1759 [4]. As an historical dictionary, oriented around the genealogies of European nobility and the Roman Catholic Church, Moréri's work would certainly seem antithetical to the Enlightenment project of the philosophes, who sought to undermine superstition and religious intolerance through the expansion of knowledge in the Arts and Sciences. It would therefore be somewhat surprising to find a strong presence of Moréri material, at least superficially, in the pages of the Encyclopédie. Whereas the VSM compares whole articles without regard to word order, our Pairwise Alignment of Intertextual Relations (PAIR) approach uses clusters of shared "n-grams" — sequences of two or more words — as a way to identify shared passages. This technique, borrowed from genetic sequence algorithms in the biological sciences as well as plagiarism detection systems, is very effective in identifying common passages between two texts, from small fragments of sentences to entire articles. Using the PAIR approach, we compared all 17 volumes of text from the Encyclopédie to the 10-volume 1759 edition of Moréri's Grand dictionnaire historique and identified some 580 possible shared n-gram sequences of varying lengths. Upon close inspection, we determined that about 72% of these instances (418 n-gram sequences) were indeed common to both works.
These findings shed additional light on the composition process of the Encyclopédie and suggest that the intellectual battle lines between the Jesuits and the philosophes may not have been as firmly established as previously understood. As we further evaluate our use of the VSM and PAIR approaches to common passage identification, we look to improve their overall performance and perhaps expand their functionality. Finally, we suggest that these approaches are in fact complementary, and can serve as effective techniques for identifying similar passages across many works in large digital collections.
Background
In November 1753, the Reverend Father Guillaume-François Berthier wrote in the Mémoires pour l'Histoire des Sciences & des Beaux-Arts (commonly known as the Journal de Trévoux, a wide-ranging literary journal published in France between January 1701 and December 1767 and, in essence, a Jesuit New York Review of Books for an 18th-century readership) that he and his fellow Jesuit editors were engaged in a "war over the Encyclopedia" that they had not wanted to start and that they did not wish to continue [5]. This intellectual “war,” if that is indeed how it was perceived, had begun two years earlier, when the Journal de Trévoux began publishing criticisms of various aspects of Diderot's Encyclopédie, the first volume of which appeared in June of 1751. Among these criticisms was the notion that entire articles or sections of articles in the Encyclopédie had been copied without proper citation, sometimes word for word, from reference works that preceded the Encyclopédie's publication. The Journal de Trévoux published lists of the Encyclopédie entries that its editors had "discovered" to be plagiarized [6] from other works, including many from Louis Moréri's Grand dictionnaire historique and the Jesuits' own Dictionnaire universel françois et latin (more commonly known as the Dictionnaire de Trévoux), which had been published in new, expanded editions in the previous decade. These actions prompted Diderot to respond in the Foreword of the third volume of the Encyclopédie (1753) that such borrowings often were, among other things, not actual borrowings and simply represented the unimpeachable fact that dictionaries cannot write very differently about certain topics: "[I]ls ne sauroient faire autrement" "[Indeed they cannot do otherwise]" [7].
Diderot takes particular umbrage at the Dictionnaire de Trévoux, a work purportedly published by the same Jesuit editors of the Journal de Trévoux, asserting that this dictionary was itself a copy of the Basnage edition of Antoine Furetière's 17th-century Dictionnaire universel (See appendix).
While Diderot was probably correct in his description of the Dictionnaire de Trévoux's connection to the Basnage [8], it is unclear as to what extent the Jesuits of the Journal de Trévoux had a hand in the creation and publication of the Dictionnaire de Trévoux. Berthier himself denies a connection between the two works and even goes so far as to condemn the Dictionnaire de Trévoux, if indeed it had also engaged in plagiarism [9]. In any case, though there may be no explicit connection between the two publications from the small town of Trévoux, and no conclusive evidence that the Dictionnaire belonged to the Jesuits, it is clear that the Trévoux dictionary was considered by many in 18th-century France (among them Voltaire) to be a publication of notable Jesuit inspiration [10].
As suggested above, many reference works and compilations of this period made extensive use of previously published works. It is more interesting to note that the Dictionnaire de Trévoux (hereafter Trévoux), as a primarily Jesuit work, was in a sense philosophically opposed to the Enlightenment flagship Encyclopédie. Considering this opposition, then, one might be surprised that Diderot and the Encyclopédistes would refer to, or borrow from, the Trévoux at all. Yet several Encyclopédie scholars have identified numerous "borrowed" passages [11] in addition to those located by the editors of the Journal de Trévoux. The inventory of these borrowings is somewhat instructive when considering the underlying intellectual relationship between the Encyclopédie and the Trévoux. To date, however, these studies have limited their scope to relatively small portions of both works. For example, Leca-Tsiomis's most extensive list of borrowings concerns all of the entries that begin with the letters "JO" in both works. The daunting scale of the problem may have restricted researchers to limited forays in the systematic exploration of the relationship between the Trévoux and the Encyclopédie.
The Encyclopédie [12] and the Trévoux are both multi-volume, large-scale reference works, each containing just over 77,000 articles and subarticles. Thus, any attempt to systematically compare all of the articles in both works by hand would represent a daunting task. However, much of this work can be accomplished through the application of text mining techniques. Using the Vector Space Model (VSM) implementation of PhiloMine, the machine learning extension to our PhiloLogic search and analysis package, we performed such systematic comparisons [12A]. In this experiment, the system compared many thousands of articles from both works to one another, presenting to the user "similar" articles and numeric scores indicating the degree of similarity. While the system can propose article pairs which are most probably related, human evaluation is nevertheless required to determine whether a particular article pair is related due to the presence of "borrowed" material or simply, as Diderot suggested, due to the treatment of the same subject matter. Thus, the problem is rather more complicated than simple plagiarism detection, in which one could use a repeating phrase analysis to identify borrowings, because the authors and editors of the Encyclopédie would often recast a borrowed entry or incorporate it in a longer article. Finally, identification of "borrowings" is not always clear-cut, but is rather a judgment call that requires human consideration and may be subject, in some cases, to further debate.
Vector Space and Similarity
First described in Gerard Salton's 1975 paper, “A Vector Space Model for Automatic Indexing,” the Vector Space Model has since become one of the core approaches used in information retrieval [13]. At its base, the VSM approach searches large collections of documents and returns a list of citations that are deemed to be "most relevant" to a user query. Over the years, there have been numerous enhancements and extensions of the VSM, most of which have attempted to improve performance of user-initiated querying [14]. The Vector Space Model is a very general algorithm that has been applied in a wide variety of applications, from nearest neighbor document classification systems to topic-based document segmentation systems [15]. Our application of VSM is somewhat unusual in that we are using it to compare relatively small parts of documents directly in order to find the most similar objects. Indeed, this kind of use of the VSM was proposed by Salton and Singhal [16] in a paper presented months before Salton's death. In it, they demonstrated the use of VSM to produce links between parts of documents, forming a type of automatic hypertext:
The capability of generating weighted vectors for arbitrary texts also makes it possible to decompose individual documents into pieces and explore the relationships between these text pieces [...] Such insights can be used for picking only the "good" parts of the document to be presented to the reader.
Salton and Singhal further argued that manual link creation would be impractical for huge amounts of text, but these conclusions may have had limited influence given the general interest at that time in human-generated hypertext links on the WWW.
Our VSM implementation in PhiloMine uses the standard "bag of words" approach that represents a collection of documents as a very sparse matrix, which looks rather like a spreadsheet. Each column corresponds to a particular word (or other feature such as a lemma [17]) and each row represents a single document. The value for each cell in the matrix is either the frequency of the word for that column in a document or some weight assigned for that word or feature. The size of the "spreadsheet" can be very large, depending on the number of unique words or features retained, but it is nonetheless very sparse because most of the cells are empty (have value 0). A vector space search compares an input vector to all of the vectors (rows) and measures similarity by the angle between the two vectors, usually expressed as the cosine of that angle in n-dimensional space. In practice, this measure of similarity assigns a score of between 0 (no match) and 1 (exact match), with the results being sorted by the cosines. At the application level, the user sets up a task to compare all of the vectors (rows) in one subset of the collection to all of the vectors in another subset, which also permits a comparison of the same subset or entire collection. As noted above, this simple approach has been demonstrated to be an effective way of identifying similar documents and small parts of documents. Thus, rather than searching for something, the vector space approach allows us to compare large sets of vectors against each other looking for the most similar entries.
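The comparison described above can be illustrated with a minimal Python sketch. It is not PhiloMine's code: the function names, the whitespace tokenizer, and the 0.7 default threshold are assumptions made for illustration only. Each text becomes a sparse word-frequency vector, and every vector in one subset is compared with every vector in another by cosine.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sparse word-frequency vector; words absent from the text count as 0."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(queries, candidates, threshold=0.7):
    """Compare every query text with every candidate text.

    `queries` and `candidates` map an identifier (e.g. a headword) to a text.
    Returns (query_id, candidate_id, score) tuples at or above the
    threshold, sorted by descending cosine. The threshold is illustrative.
    """
    query_vecs = {k: bag_of_words(t) for k, t in queries.items()}
    cand_vecs = {k: bag_of_words(t) for k, t in candidates.items()}
    pairs = []
    for q_id, q_vec in query_vecs.items():
        for c_id, c_vec in cand_vecs.items():
            score = cosine_similarity(q_vec, c_vec)
            if score >= threshold:
                pairs.append((q_id, c_id, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Because word frequencies are never negative, the cosine computed this way always falls between 0 and 1, which is why the scores can be read directly as degrees of similarity.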
Encyclopédie and Trévoux
As mentioned above, ARTFL’s PhiloMine is a Web-based system that allows users to submit a variety of large-scale supervised and unsupervised machine learning tasks on collections built using the PhiloLogic search and retrieval engine. For this work, we have used PhiloMine to compare all of the articles beginning with a particular letter in both the Trévoux and the Encyclopédie, with a select few parameters such as minimum similarity threshold to display, size range of entries, and a word count filter (maximum and minimum document percentages). In all, we compared articles between 15 and 500 words, limiting the vocabulary to words that occur in less than 90% of the articles (to eliminate function words) and more than 1% (to reduce the size of the vectors), and with a similarity threshold of between .70 and .80. The resulting comparisons ranged from fewer than 1,000 articles (for rare initial letters, such as U) to more than 20,000 (for frequent ones, such as R). We adopted the letter-by-letter approach in order to subdivide the task among the team, and with the added expectation that most borrowings would occur in articles with the same or similar headwords. The resulting report, which can be viewed as a WWW page [18], lists all of the entries from the Encyclopédie in the comparison task and displays those entries from the Trévoux exceeding the threshold score. Links to the articles are provided in order to allow human evaluators to access the possibly borrowed articles. This approach filters out the vast majority of articles, leaving the human evaluator free to examine salient entries. Figure One shows a short extract of a comparison of 11,000 articles beginning with "P", with one highly similar entry, "PINEAU."
- Query words in: PINDE, le in Encyclopedie [volume12].
- Query words in: PINDENISSUS in Encyclopedie [volume12].
- Query words in: PINEALE, Glande pinéale in Encyclopedie [volume12].
- Query words in: PINEAU in Encyclopedie [volume12].
- PINEAU, in Dictionnaire de Trévoux [trevoux29]. [0.982607368881035]
- Query words in: PINEY ou PIGNEY in Encyclopedie [volume12].
- Query words in: PING - PU in Encyclopedie [volume12].
- Query words in: PINGUICULA in Encyclopedie [volume12].
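The size and vocabulary settings reported for this comparison (articles of 15 to 500 words; words occurring in more than 1% and less than 90% of the articles) can be sketched in Python as follows. The function names and the whitespace tokenizer are assumptions for illustration and do not reproduce PhiloMine's own interface.

```python
from collections import Counter


def select_articles(articles, min_words=15, max_words=500):
    """Keep articles whose word count lies within [min_words, max_words].

    `articles` maps a headword to the article text.
    """
    return {
        head: text
        for head, text in articles.items()
        if min_words <= len(text.split()) <= max_words
    }


def select_vocabulary(articles, min_df=0.01, max_df=0.90):
    """Keep words found in more than min_df and less than max_df of articles.

    The upper bound drops function words; the lower bound drops rare
    words and so keeps the vectors smaller.
    """
    n_docs = len(articles)
    doc_freq = Counter()
    for text in articles.values():
        # Count each word once per article (document frequency).
        doc_freq.update(set(text.lower().split()))
    return {w for w, df in doc_freq.items() if min_df < df / n_docs < max_df}
```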
Reviewing the report in this way allowed us to scan long lists rapidly and to focus attention on the most probable matches. As an aid to human evaluators, we added a secondary measure of similarity based on the resemblance of the headwords in possible matches, using a simple string edit distance measurement [19]. Possible matches that received high VSM similarity scores along with highly similar headwords were then highlighted.
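A minimal sketch of such a secondary headword check is given below, using the classic Levenshtein edit distance. The normalization to a 0-1 similarity and both cut-off values (`min_vsm`, `min_head`) are assumptions chosen for illustration; the cut-offs actually used for highlighting are not specified here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def headword_similarity(head_a, head_b):
    """Scale the edit distance to 0..1, where 1 means identical headwords."""
    a, b = head_a.strip().lower(), head_b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def should_highlight(vsm_score, head_a, head_b, min_vsm=0.9, min_head=0.8):
    """Flag a pair when both the VSM score and headword similarity are high.

    Both thresholds are hypothetical values for this sketch.
    """
    return vsm_score >= min_vsm and headword_similarity(head_a, head_b) >= min_head
```

Under these illustrative settings, a pair such as "PINEAU" / "PINEAU" with a score above 0.98 would be flagged, whereas a pair with clearly different headwords would be left for the evaluator to judge from the VSM score alone.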
As a rule of thumb, we found that entry pairs scoring greater than 0.9 and having very similar or exact match headwords were almost certainly "borrowed." The articles "PINEAU" represent what we consider a direct borrowing from the Trévoux, and indeed, the only real differences between the articles are either orthographic (e.g., "auvernat" vs. "Auvernas"), or additions by the Encyclopédie’s editors such as class of knowledge (Agriculture) and authorship (D.J.):
Encyclopédie article:
PINEAU, s. m. (Agriculture.) c' est un raisin fort noir, qui vient en Auvergne, & qui est un des plus doux & des meilleurs à manger: le vin qu' on en tire s'appelle auvernat à Orléans, dans d' autres endroits morillon, & pineau en Auvergne: les Poitevins font beaucoup de cas du vin pineau. Trévoux. (D. J.)
Trévoux article:
PINEAU, s. m. C'est un raisin fort noir qui vient en Auvergne, & qui est un des plus doux & des meilleurs à manger. Le vin qu'on en tire s'appelle Auvernas à Orléans, dans d'autres endroits morillon ; & Pineau en Auvergne. Les Poitevins font beaucoup de cas du vin Pineau.
It is important to note that in the above case the author of the Encyclopédie article cites the Trévoux as a reference, a fact that would explain the similarity of the entries; however, this sort of direct (or indirect) citation is present in fewer than half of the borrowed articles.
Aside from the articles that are identical or nearly identical, in which cases the presence of referential citations can help us determine the validity of the vector space matches, many of the other "borrowed" articles fall into different categories of similarity and offer no common citations either in the Trévoux or the Encyclopédie. We have found sub-articles from the Trévoux that have been incorporated as single paragraphs in a larger Encyclopédie article on the same subject, which while scoring lower than exact matches such as "PINEAU" (normally in the 0.8 - 0.89 range), reveal borrowings that are still significant enough to be drawn out by the vector space function. A prime example of this sort of borrowing can be found in the Encyclopédie article "DÉFLORATION" which was scored as having a strong similarity with the Trévoux article "DÉPUCELER" although they have different headwords and only share distinct segments of text rather than the entire entry:
Encyclopédie article:
DÉFLORATION, s. f. (Hist. mod.) action par laquelle on enleve de force la virginité à une fille. Voyez Virginité. La mort ou le mariage sont l'alternative ordonnée par les juges, pour réparer le crime de défloration. Plusieurs anatomistes faisoient de l'hymen la véritable preuve de la virginité; persuadés que quand on ne le trouve point, il faut que la fille ait été déflorée. Voyez Hymen.
Les anciens avoient tant de respect pour les vierges, qu'on ne les faisoit point mourir sans leur avoir auparavant ôté leur virginité. Tacite l'assûre de la fille encore jeune de Sejan, que le bourreau viola dans la prison avant que de la faire mourir. On attribue aux habitans de la côte de Malabar la bisarre coûtume de payer des étrangers pour venir déflorer leurs femmes, c'est-à-dire en prendre la premiere fleur.
Chez les Ecossois, c'étoit un droit de seigneur de déflorer la nouvelle mariée; droit qui leur fut, diton, accordé par leur roi Evenus, qu'on ne trouve pas néanmoins dans la liste que nous en avons. On prétend que ce droit leur fut ôté par Malenne, qui permit qu'on s'en rachetât pour un certain prix qu'on appelloit morcheta, ou un certain nombre de vaches par allusion au mot de marck, qui dans les langues du Nord signifie un cheval. Buchanan dit aussi qu'on s'en rachetoit pour un demi-marc d' argent.
Cette coûtume a eu lieu dans la Flandre, dans la Frise, & en quelques lieux d' Allemagne, si l'on en croit différens auteurs.
Par la coûtume d' Anjou & du Maine, une fille après vingt-cinq ans se peut faire déflorer, sans pouvoir être exhérédée par son pere.
Ducange cite un arrêt du 19 Mars 1409, obtenu par les habitans d' Abbeville contre l' évêque d' Amiens, qui faisoit racheter pour une certaine somme d' argent la défense qu' il avoit faite de consommer le mariage les trois premieres nuits des noces: ce qui étoit fondé sur le quatrieme concile de Carthage, qui l' avoit ordonné pour la révérence de la bénédiction matrimoniale. Chambers. (G)
Trévoux article:
DÉPUCELER, v. act. Il dépucelle, il dépucellera, il a dépucellé. Oter la fleur de virginité à une personne. Vitiare virginem. Les Anciens avoient tant de respect pour les vierges, qu'on ne les faisoit point mourir, sans les avoir fait dépuceler. Ceux de la côte de Malabar payent les étrangers pour venir dépuceler leurs femmes, & en prendre la première fleur. Chez les Ecossois c'étoit un droit des Seigneurs de dépuceler la nouvelle mariée, qui leur fut accordé par Evenus leur Roi, & qui leur fut ôté par Malcome, qui permit qu'on s'en rachetât pour un certain prix qu'on appelloit marcheta, ou un certain nombre de vaches par allusion au mot de march, qui signifioit chez eux un cheval : Buchanan dit aussi, qu'on s'en rachetoit pour un demi-marc d'argent, qu'on appelloit marchette. Cela a eu lieu aussi dans la Flandre, dans la Frise, & en quelques lieux d'Allemagne. Par la coutume d'Anjou & du Maine, une fille après 25 ans se peut faire dépuceler, sans pouvoir être exhérédée par son père. Du Cange cite un Arrêt du 19. Mars 1409. obtenu par les habitans d'Abbeville contre l'Evêque d'Amiens, qui faisoit racheter par une certaine somme d'argent la défense qu'il avoit faite de dépuceler les nouvelles mariées les trois premières nuits de leurs noces : ce qui étoit fondé sur le IVe Concile de Carthage qui l'avoit ordonné pour la révérence de la bénédiction matrimoniale.
Dépuceler, se dit aussi en parlant des choses qu'on fait la première fois. Cet Avocat a plaidé sa première cause, le voilà dépucelé.
Dépucelé, ée. part.
We have also identified a third kind of borrowing, one that scores in the range of 0.7 - 0.79, or the lower bound of our testing parameters. These entries usually share the same basic vocabulary or even several sentences on the same topic, but are overall significantly different from each other. The article "GAMBIT" is an example of this kind of limited borrowing, wherein the Encyclopédie article seemingly incorporates a sentence from the Trévoux article, which is longer and otherwise distinct:
Encyclopédie article:
GAMBIT, s. m. c' est, aux Echecs, une méthode particuliere de joüer, selon laquelle, après avoir poussé le pion du roi ou de la dame deux cases le premier coup qu' on joue, on fait ensuite avancer également de deux cases le pion de leur fou; c' est ce que le Calabrois appelle gambetto dans son traité sur les échecs, où il rassemble toutes les manieres de jouer le gambetto. Le traducteur françois a rendu le mot italien par celui de gambit, que nos joüeurs d' échecs ont adopté, tout barbare qu' il est dans notre langue. (D. J.)
Trévoux article:
GAMBIT s. m. Ce terme est en usage parmi les Joueurs d'échecs, pour signifier la maniére de jouer, où après avoir poussé le pion du Roi, ou de la Dame, deux cases le premier coup qu'on joue, on fait ensuite autant avancer le pion de leur fou : ce que le Calabrois appelle Gambetto. Son Traducteur dit, dans son Avertissement, que comme il est impossible de trouver aucune signification de ce mot, qui puisse quadrer à son sujet, il a été obligé de l'habiller à la Françoise, & de l'appeller gambit, ainsi qu'il se nomme parmi tous ceux qui sçavent & pratiquent ce jeu.
Il me semble pourtant que le mot Italien gambetto pourroit être rendu en notre langue par celui d'enjambée, les pions doublant alors leurs pas. Toutes les maniéres de jouer le gambit sont rassemblées dans le second Livre de la Traduction du Jeu des Echecs de Gioachino Greco Calabrois
For the human evaluator these two articles appear clearly related both thematically and in terms of vocabulary, but this sort of case represents a problem for the Vector Space algorithm, which scores the similarity of the two articles based solely on shared vocabularies, leading to the possibility of missed connections when the shared passages occur in articles of varying length. While we were able to identify the shared passages in the GAMBIT articles, a similarly small passage within a much longer article would presumably be missed by the Vector Space comparison.
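The effect of article length on the score can be shown with a toy calculation in Python. The passage and the filler tokens below are invented for the illustration and are not data from the study: a short passage is compared with a host article that contains it verbatim, padded with an increasing number of unrelated words.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of two texts using raw word frequencies."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(n * b[w] for w, n in a.items())
    norm = (math.sqrt(sum(n * n for n in a.values()))
            * math.sqrt(sum(n * n for n in b.values())))
    return dot / norm if norm else 0.0


def dilution_scores(passage, filler_sizes):
    """Score the passage against hosts that embed it among unrelated words."""
    scores = []
    for size in filler_sizes:
        # Invented filler tokens that share no vocabulary with the passage.
        filler = " ".join(f"filler{i}" for i in range(size))
        host = f"{passage} {filler}".strip()
        scores.append(cosine(passage, host))
    return scores


scores = dilution_scores("pion roi dame cases fou gambetto", [0, 6, 60, 600])
```

With six distinct shared words, the score is 1 when the host is the passage itself, stays just above 0.7 when the host is twice as long, and falls well below the 0.7 threshold once the host is ten or more times longer, even though the passage is still present word for word.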
These reservations aside, we were nonetheless very impressed by the overall performance of the Vector Space model, which ultimately led to the successful identification of more than 3,300 articles "borrowed" from the Dictionnaire de Trévoux by the compilers of the Encyclopédie. The systematic examination of the more than 140,000 articles between these two reference works would have been all but impossible without the help of the text mining algorithms outlined above, the output of which would have been equally impossible to evaluate without the judicious effort of human scholars. We have made our results available to the public in the form of several HTML spreadsheets [20]. Entries in the spreadsheets represent suggested matches between Encyclopédie and Trévoux articles (listed by headwords) along with additional information (classes of knowledge, authors, references, etc.) and the similarity scores.
VSM Limitations and PAIR
As we have seen, the Vector Space Model is an effective and robust approach for the identification of similar parts of documents. It has several advantages in terms of computational efficiency, simple scoring measures, and tolerance for noisy data, such as uncorrected output from optical character recognition (OCR) systems and variations in orthography. VSM does not, however, take into account word order, is limited to identifiable blocks of text (in this case articles), works best when comparing objects of roughly similar size, and does not do well with the identification of smaller extracts or similar passages within larger blocks of text. Based on some recent experimentation, we believe that the identification of topic-based segments or some other form of document "chunk" would help resolve some of the issues related to the VSM’s requirement of predefined textual blocks, but certainly not all.
These limitations of the Vector Space Model have led us to begin work on a second, complementary approach for the identification of similar passages that we call Pairwise Alignment of Intertextual Relations (PAIR). This approach is based on work in the field of bio-informatics as well as plagiarism detection and may be considered a variant of common sequence alignment algorithms [21]. For any pair of texts, we identify regions in both texts that share a statistically significant number of "n-grams," where n-grams are sequences of two, three, or more, content words (function words removed) or lemmas (root forms of words). Tri-grams of content words or lemmas, for example, are very uncommon (save for some named entities), so clusters of even small numbers are of particular interest. This allows us to anchor sequential matching rapidly from identified pairs of n-grams. In our preliminary implementation, we specify the number of words or bytes that follow or precede an n-gram, in document order, as well as the minimum number of n-gram matches to be considered a similar passage. For example, we can set the maximum span between matching n-grams at 25 characters and require 4 or more n-grams in a chain to be considered a possible similar passage. Modification of these two parameters allows us to adjust the matching algorithm depending on the qualities of the data and the demands of the task.
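The chaining idea can be sketched in Python as follows. This is a simplified illustration and not the PAIR implementation: the stoplist is a tiny placeholder, the gap is counted in content-word positions rather than characters or bytes, a repeated n-gram is aligned only with its first occurrence in the second text, and all names and defaults are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist; a real run would need a fuller list of
# French function words (or lemmatization).
FUNCTION_WORDS = {
    "le", "la", "les", "l", "de", "du", "des", "d", "un", "une", "et", "ou",
    "qui", "que", "qu", "en", "à", "au", "aux", "est", "c", "plus",
}


def content_words(text):
    """Lowercase, tokenize, and drop function words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in FUNCTION_WORDS]


def ngram_index(words, n):
    """Map each n-gram of content words to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(words) - n + 1):
        index[tuple(words[i:i + n])].append(i)
    return index


def shared_passages(text_a, text_b, n=3, max_gap=5, min_matches=4):
    """Return chains of shared n-grams that advance together in both texts.

    Each chain is a list of (position_in_a, position_in_b) pairs. A chain
    is broken when consecutive matches lie more than max_gap content words
    apart in either text, or when text B does not advance in document order.
    """
    words_a = content_words(text_a)
    index_b = ngram_index(content_words(text_b), n)
    matches = []
    for pos_a in range(len(words_a) - n + 1):
        gram = tuple(words_a[pos_a:pos_a + n])
        if gram in index_b:
            matches.append((pos_a, index_b[gram][0]))  # first occurrence only
    passages, chain = [], []
    for match in matches:
        if chain:
            gap_a = match[0] - chain[-1][0]
            gap_b = match[1] - chain[-1][1]
            if not (gap_a <= max_gap and 0 < gap_b <= max_gap):
                if len(chain) >= min_matches:
                    passages.append(chain)
                chain = []
        chain.append(match)
    if len(chain) >= min_matches:
        passages.append(chain)
    return passages
```

Raising `min_matches` or lowering `max_gap` makes the matcher stricter, while the opposite settings tolerate more rewording between the anchoring n-grams. Because this sketch requires both texts to advance in the same order, a borrowing whose sentences have been rearranged would surface as several shorter chains rather than one.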
We are currently using a proof-of-concept implementation of PAIR to identify borrowed passages in the Encyclopédie from other reference works as well as classic 18th-century political and philosophical texts. Montesquieu, for one, was frequently cited and borrowed from by the authors of the Encyclopédie. For example, the underlined passage below from the article "Famille" by the Chevalier de Jaucourt – author of some 17,000 Encyclopédie articles and known textual “plunderer”:
Il est si vrai que la famille est une sorte de propriété, qu'un homme qui a des enfans du sexe qui ne la perpétue pas, n'est jamais content qu'il n'en ait de celui qui la perpétue: ainsi la loi qui fixe la famille dans une suite de personnes de même sexe, contribue beaucoup, indépendamment des premiers motifs, à la propagation de l'espece humaine; ajoûtons que les noms qui donnent aux hommes l'idée d' une chose qui semble ne devoir pas périr, sont très-propres à inspirer à chaque famille le desir d'étendre sa durée; c'est pourquoi nous approuverions davantage l'usage des peuples chez qui les noms même distinguent les familles, que de ceux chez lesquels ils ne distinguent que les personnes.
appears to have been borrowed from Montesquieu's De l'esprit des loix (vol 3):
Des familles. Il est presque reçu partout que la femme passe dans la famille du mari. Le contraire est, sans aucun inconvénient, établi à Formose, où le mari va former celle de la femme.
Cette loi, qui fixe la famille dans une suite de personnes du même sexe, contribue beaucoup, indépendamment des premiers motifs, à la propagation de l'espèce humaine. La famille est une sorte de propriété: un homme qui a des enfans du sexe qui ne la perpétue pas, n'est jamais content qu'il n'en ait de celui qui la perpétue.
Les noms, qui donnent aux hommes l'idée d'une chose qui semble ne devoir pas périr, sont très propres à inspirer à chaque famille le désir d'étendre sa durée. Il y a des peuples chez lesquels les noms distinguent les familles: il y en a où ils ne distinguent que les personnes: ce qui n'est pas si bien.
The algorithm identified a cluster of 26 tri-grams from "loi_fixe_famille" to "inspirer_chaque_famille" [22] spanning a paragraph break in Montesquieu’s text. Many authors, most notably Jaucourt, borrowed liberally from Montesquieu, often taking passages from numerous chapters to compose single articles and often significantly reworking passages. The approach also works in noisier data environments. For example, the passage:
savoir le baptême, les ordres, & l'eucharistie, mêlant de si grands abus dans l'administration du baptême, qu'en une même église il y a différentes formes de baptiser, ce qui rend le baptême nul. Aussi l'archevêque Menesès rebaptisa- t il en secret la plûpart de ces peuples. 4°. Ils ne se servent point des saintes huiles dans l'administration du baptême, & ils oignent seulement les enfans d' un onguent composé d'huile de noix d' Inde, sans aucune bénédiction.
in the article Chrétiens de S. Thomas was borrowed from Louis Moréri's Grand Dictionnaire historique, in the article of the same name:
favoir, le baptême , les ordres Se l'eucharistie. Ils mêlent même de Si grands abus dans l'administration du baptême , qu'en une même église , il y a différentes formes de baptiser, ce qui rend le baptême nul. C'est pourquoi l'archevêque Ménefés rebaprifa en secret la plupart de ces peuples. 4. Us ne Se Servent point des Saintes huiles en donnant le baptême , & ils oignent Seulement les enfans d'un onguent composé d'huile de noix d'Inde, sans aucune bénédiction.
Note the combination of Optical Character Recognition (OCR) errors and differences in orthography (e.g., "plûpart" and "plupart"). This simple implementation thus allows us to adjust the algorithm's flexibility in order to compensate for noisier data, such as uncorrected OCR [23].
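One common way to reduce the effect of orthographic variation before n-grams are compared is to fold accented characters to their base letters. The sketch below is an assumption offered for illustration and is not the mechanism described in this paper. The function name `fold_accents` is hypothetical.

```python
import unicodedata

def fold_accents(word):
    """Lower-case a word and strip combining diacritics (e.g. û -> u, è -> e)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Variant spellings collapse to the same key.
print(fold_accents("plûpart") == fold_accents("plupart"))
```

Folding of this kind addresses spelling variants only. OCR substitutions such as "f" for "s" would still need a tolerant matching strategy.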
While the approach adopted in our implementation of PAIR has several distinct advantages over the Vector Space Model in the identification of "similar passages," most notably the ability to identify much smaller sub-matches not based on predetermined "chunks" of documents and a respect for document order, it is considerably more computationally expensive. Also, in a “noisier” environment, it tends to identify smaller parts of larger matches in order, rather than identify a single, larger match. This may be a factor that can be adjusted using existing parameters, or through the use of a simple heuristic to merge matches that occur close to one another. Finally, PAIR is less suited to finding related passages that deal with similar questions or more general topics than the Vector Space approach.
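A merging heuristic of the kind mentioned above might look like the following sketch. It is hypothetical: the function name `merge_adjacent`, the span representation, and the default distance of 100 characters are all assumptions chosen for illustration.

```python
def merge_adjacent(matches, max_distance=100):
    """Merge matches that lie close together, in the same order, in both texts.

    Each match is ((a_start, a_end), (b_start, b_end)) in character offsets.
    """
    merged = []
    for (a_start, a_end), (b_start, b_end) in sorted(matches):
        if merged:
            (pa_start, pa_end), (pb_start, pb_end) = merged[-1]
            close_in_a = a_start - pa_end <= max_distance
            close_in_b = b_start >= pb_start and b_start - pb_end <= max_distance
            if close_in_a and close_in_b:
                # Extend the previous match to cover this one.
                merged[-1] = ((pa_start, max(pa_end, a_end)),
                              (pb_start, max(pb_end, b_end)))
                continue
        merged.append(((a_start, a_end), (b_start, b_end)))
    return merged
```

Requiring closeness in both texts keeps the merge from joining fragments that happen to be adjacent in one document but come from unrelated places in the other.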
Encyclopédie and Moréri
Using the PAIR approach, we conducted a second experiment which included another 18th-century predecessor to the Encyclopédie: Louis Moréri's Grand dictionnaire historique, which was edited and published more than 20 times from 1671 to 1759 and whose final 10-volume edition (1759) we had previously built as a full-text database using uncorrected optical character recognition (OCR) [24]. Whereas the Dictionnaire de Trévoux was primarily a language dictionary, which made it a useful resource for any writer, including the encyclopédistes, Moréri's Dictionnaire was more genealogical and historical in nature and its subject matter is perhaps best grasped in its full title:
Le grand dictionaire historique : ou, Le mélange curieux de l'histoire sacrée et profane ; qui contient en abregé, les vies et les actions remarquables des patriarches, des juges, des rois des juifs, des papes ... des empereurs, des rois, des princes illustres, & des grands capitaines ... l'etablissement et le progrès des ordres religieux & militaires, & la vie de leurs fondateurs : les genealogies de plusieurs familles illustres de France & d'autres païs. L'histoire fabuleuse des dieux, & des héros de l'antiquité payenne : La description des empires, royaumes, républiques ... etc.
[The Great Historical, Geographical, Genealogical and Poetical Dictionary: Being A Curious Miscellany of Sacred and Prophane History. Containing, in short, the Lives and most Remarkable Actions of the Patriarchs, Judges, Popes, and Kings of the Jews ... Emperors, Kings, Illustrious Princes, and Great Captains ... The Establishment and Progress of the Religious and Military Orders and the Life of their Founders: The Genealogies of many Illustrious Families from France and Elsewhere. The Fabled History of the Gods and Heroes of Pagan Antiquity: The Descriptions of Empires, Kingdoms, Republics ... etc.]
It was precisely this sort of historical perspective, informed greatly by the Roman Catholic worldview, that the Encyclopédie's "Project for Enlightenment" sought to overcome. Indeed, the editors even went so far as to inform their readers that the sort of history embodied by Moréri's dictionary had no place in their work: "Au reste, on observera que les articles d'Histoire de notre Encyclopédie ne s'étendent pas aux noms de Rois, de Savans, & de Peuples, qui sont l'objet particulier du Dictionnaire de Moreri, & qui auroient presque doublé le nôtre" [As for the rest [of the articles], one can see that the History articles of our Encyclopedia do not encompass the names of kings, scholars and peoples, which are the distinct object of Moreri's Dictionary, and which would have nearly doubled the size of our Encyclopedia] [25]. As we have seen above, however, this assertion of textual autonomy was called into question by the Jesuit Journalistes de Trévoux immediately upon publication of the first volumes of the Encyclopédie. Comparing many of the articles in the first volume with other reference works, the Jesuit Journalistes found that, contrary to Diderot's editorial assertions, whole articles and parts of articles were taken directly from the Moréri dictionary:
Nous disions plus haut, que nos Auteurs renoncent à bien des objets dont Moréri & ses Continuateurs se sont occupés; mais on ne doit pas croire pour cela que le grand Dictionnaire Historique ait été inutile à l'Encyclopédie; c'est tout le contraire,& dans la comparaison que nous avons faite de ces deux Livres, & qui n'est pas finie, le second, c'est-à dire le plus moderne, nous a présenté une multitude d'articles, les uns transcrits presque mot à mot, les autres fortement imités de Moréri.
[We said above that our authors [the Encyclopédistes] renounce many of the things with which Moréri and his heirs are occupied, but one should not believe that, because of this, Le grand dictionnaire historique was not useful to the Encyclopédie. Quite the contrary, and in the unfinished comparison of the two works that we carried out, the second, most recent work presented us with a multitude of articles that either were transcribed nearly word for word from or bear a strong resemblance to Moréri.] [26]
Our goal, then, was to compare, in much the same way the Jesuit critics did in the 18th century, the Moréri dictionary and the Encyclopédie, looking not only for completely borrowed articles [27], but in this case, for smaller passages or sentences common to both works.
Using the PAIR comparison approach outlined above, we compared each volume of the Moréri to each of the text volumes of the Encyclopédie. This rudimentary implementation identified regions of text that shared at least four tri-grams common to both documents and separated by as many as 25 characters. This process identified 582 possible matches that were then evaluated by hand. Of the 582 proposed matches, 72% (418) of the passages in the Encyclopédie were identified as shared with the Moréri. We have subsequently refined the approach to reduce the number of "false positives" [28]. It should be noted, however, that, as with any algorithm based on similarities, we are unable to provide a measure of how many additional borrowed passages this approach might have missed. From a user's perspective, the system takes two sets of files or databases, performs the alignment matching, and generates a report showing the matching passages and providing links to the full context in each file or database.
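The scale of the comparison and the reported success rate can be checked with a few lines of arithmetic. The sketch below assumes the 10 volumes of the 1759 Moréri and the seventeen text volumes of the Encyclopédie mentioned elsewhere in this paper. The variable names are illustrative.

```python
from itertools import product

MORERI_VOLUMES = range(1, 11)             # final 10-volume edition (1759)
ENCYCLOPEDIE_TEXT_VOLUMES = range(1, 18)  # seventeen volumes of text

# Every Moréri volume is compared with every Encyclopédie text volume.
volume_pairs = list(product(MORERI_VOLUMES, ENCYCLOPEDIE_TEXT_VOLUMES))

proposed_matches = 582
confirmed_matches = 418
precision = confirmed_matches / proposed_matches

print(len(volume_pairs), "volume-to-volume comparisons")
print(f"{precision:.1%} of proposed matches confirmed by hand")
```

The 72% figure is thus a measure of precision only. Recall, the share of all genuine borrowings that were found, cannot be computed without an independent list of borrowed passages.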
As with the three types of "similar" articles identified using the Vector Space approach, we can also distinguish three main categories of shared passages between the Encyclopédie and the Moréri. The first, and most rudimentary, type of shared passage is the formulaic expression, most often used in legal documentation, and the more general commonplace and cliché used by authors of this time period. In our first example we see a sequence of 10 n-grams that have been highlighted using the PAIR approach; the first occurring in the "Privilege du Roi," or King's permission to publish, of the Moréri; a legal phrase cited as an exemplum in the Encyclopédie article on Letters of Pareatis:
Moréri, PRIVILEGE DU ROI:
Voulons que la copie des présentes qui fera imprimée tout au long au commencement ou à la fin dudit Ouvrage, soit tenue pour duement Signifiée, & qu'aux copies colla-lionnées par l'un de nos amés, féaux Congeillers-Secrétaires , soi soit' ajoutée comme à l'original. Commandons au premier notre Huiflier ou Sergent sur ce requis, de faire pour l'exécution d'icelles,1 tous astes requis & nécessaires, sans demander autre permiflion, & nonobstant clameur de Haro, charte Normande , & Lettres à ce contraires. Car tel est notre plâifir.
Encyclopédie, PARÉATIS, p. 934:
La forme d' un paréatis est telle: « Louis par la grace de Dieu, &c. au premier notre huissier ou sergent sur ce requis: te mandons à la requête de N. mettre à dûe & entiere exécution en tout notre royaume, pays, terres & seigneuries de notre obéissance l' arrêt rendu en notre cour de.... le.... jour de.... ci attaché sous le contrescel de notre chancellerie contre tel y nommé, & faire pour raison de ce tous exploits & actes nécessaires, de ce faire te donnons pouvoir sans demander autre permission, nonobstant clameur de haro, charte normande, prise à partie, & autres lettres à ce contraires; car tel est notre plaisir », &c.
The second type of similar passage deals with shared citations from outside sources. For example, both the Moréri and the Encyclopédie reference a quotation from Brebeuf concerning the origins of writing:
Moréri, CADMUS, p. 16:
Penféeque Brebúuf, dans sa traduction de la Pharfale, aheureusement étendue dans ces quatre vers :
Cest de lui que nous vient cet art ingénieux £)e peindre la parole, €• de parler aux yeux ; Et par Us traits divers de figures tracées , Donner de la couleur & du corps aux pensées.
Encyclopédie, ECRITURE, p. 358:
ECRITURE, sub. f. (Hist. anc. Gramm. & Arts.) Nous la définirons avec Brebeuf:
Cet art ingénieux De peindre la parole & de parler aux yeux, Et par des traits divers de figures tracées, Donner de la couleur & du corps aux pensées.
It is important to note, in this case, that along with the different contexts in which the quotation occurs, there is also some "noisy data" introduced by the OCR of the Moréri text as well as variant spellings of the source's name, either Brebúuf or the more modern Brebeuf. We are nonetheless pleased that the PAIR approach was able to find matching sequences given these factors, all of which would have complicated a traditional exact match query.
Thirdly, we were able to identify shared passages between the Moréri and the Encyclopédie that do not represent whole articles, but rather occur in the context of larger articles that are otherwise distinct and normally lacking any form of citation. This third category of borrowing should prove to be the most intellectually interesting, as it attests to a sort of hidden intertextuality between the two opposing dictionaries. Here again, we are following the previous work of the Jesuits who, on a more limited scale, were able to identify plagiarized articles as well as those that were "fortement imités," or strong imitations, coming from the Moréri. One such case has to do with the articles "ALCORAN" which, for the most part, treat the same subject in different ways, yet still share many similar passages, such as the one below, identified using the PAIR system:
Moréri, ALCORAN, p. 327:
... qui tonfiste à croire que tout ce 'qui arrive est tellement déterminé dans les idées éternelles, que rien n'est capable d'en empêcher les effets. Le Second est, que cette religion doit être établie Sans miracles, Sans dispute, & reçue sans contradiction ; de Sorte que tous ceux qui y répugnent , doivent êtte mis à mort Sans autre forme de procès ; & que les musulmans qui tuent ces incrédules, méritent le paradis.
Encyclopédie: ALCORAN ou AL-CORAN, p. 251:
Les deux points fondamentaux de l' alcoran suffiroient pour en démontrer la fausseté: le premier est la prédestination, qui consiste à croire que tout ce qui arrive est tellement déterminé dans les idées éternelles, que rien n' est capable d' en empêcher les effets; & l' on sait à quel point les Musulmans sont infatués de cette opinion. Le second est que la Religion Mahométane doit être établie sans miracle, sans dispute, sans contradiction, de sorte que tous ceux qui y répugnent doivent être mis à mort; & que les Musulmans qui tuent ces incrédules, méritent le Paradis: aussi l' histoire fait - elle foi qu' elle s' est encore moins établie & répandue par la séduction, que par la violence & la force des armes.
In all, we were able to identify some 580 potential borrowings between the Moréri and the Encyclopédie, of which 72% (418 sequences) were found to indeed fall into one of the three categories of shared passages outlined above. It must be noted, however, that this preliminary implementation of PAIR was highly experimental and used primarily as a "proof-of-concept" which could then lead to further refinements of the approach and the elimination of many of the false-positive matches. Nonetheless, the results of this experiment are still very convincing and lead us to believe that the intertextual relationship between these two seemingly divergent works was much stronger than previously believed. For example, upon inspection of the proposed matches, we notice that more than sixty percent of the borrowings were made by the Abbé Mallet [29], a controversial encyclopédiste whose articles on theology and history all straddle the line between the traditional Roman Catholic world view and the critical historicism of the philosophes. Interestingly, and perhaps somewhat ironically, the editors' dismissal of the subject matter of the Moréri - "Au reste, on observera que les articles d'Histoire de notre Encyclopédie ne s'étendent pas aux noms de Rois, de Savans, & de Peuples, qui sont l'objet particulier du Dictionnaire de Moreri, & qui auroient presque doublé le nôtre" [As for the rest [of the articles], one can see that the History articles of our Encyclopedia do not encompass the names of kings, scholars and peoples, which are the distinct object of Moreri's Dictionary, and which would have nearly doubled the size of our Encyclopedia] - occurs in the context of the biographical notice for the Abbé Mallet in the Preliminary Discourse [30].
Thus, using the PAIR approach we were able to identify relationships between articles and passages that perhaps went unnoticed previously as well as some that were rather unexpected. Mallet's use of the Moréri, for example, given his religious background and the subject matter of the articles for which he was responsible, is not all that surprising. However, one would certainly not expect to find whole articles or passages taken from the ultra-conservative Moréri and inserted into the Enlightenment machine de guerre that the Encyclopédie was to become. Nevertheless, it is precisely this type of borrowing that we do find in many of the sequences identified using the PAIR approach: shared passages between the two works that suggest a much stronger level of "intertextuality" than previously anticipated. This finding leads us to question further the many possible relationships, both known and unknown, between the Encyclopédie and its 18th-century competitors.
Conclusion
It is well known that the encyclopédistes plundered, out of necessity, many of the contemporary reference works available to them, in both French and other languages. Indeed, the project of the Encyclopédie itself began as a translation of Ephraim Chambers' Cyclopaedia, or, An universal dictionary of arts and sciences, which is the source of numerous entries. Other important predecessors of the Encyclopédie include, but are not limited to, Pierre Bayle's Dictionnaire historique et critique, Louis Moréri's Grand dictionnaire historique, the Dictionnaire de Trévoux, and the Dictionnaire de l'Académie française, not to mention the more specialized works, including the Dictionnaire du Commerce, Dictionnaire des Arts, Dictionnaire des drogues, etc. and Diderot's own translation of Robert James' Dictionnaire universel de médecine. Furthermore, it is also clear that contributors to the Encyclopédie made use of numerous more general studies and scholarship, such as Montesquieu's Esprit des Lois and Buffon's Histoire Naturelle. Some of the authors in the Encyclopédie also appear to have favored particular works of reference or scholarship, as is the case with the Abbé Mallet, many of whose articles appear to borrow extensively from Moréri's historical dictionary. Given the extent of this "intertextuality" between contemporary works and the Encyclopédie, it is our assertion that experiments using machine learning approaches such as those outlined above (VSM and PAIR) can help identify shared articles and passages, shedding new light on the compilatory nature of encyclopedic writing as well as the editorial choices involved in constructing a truly "reasoned dictionary."
Future Work
While there are admittedly many kinds of relationships between relatively small segments of textual information, we have begun the above experiments in order to find that closest of all intertextual relationships, the borrowed or plagiarized passage. Detection of the sources of the Encyclopédie is a useful arena for experimentation, whose results, we believe, will be salient in designing more flexible approaches to identify other kinds of relationships between texts. With the ever-increasing size of databases, we will require numerous technologies to assist in directing our reading and identifying useful, interesting, and, above all, unexpected relationships between texts. As massive collections of electronic texts become more readily available, the promising possibilities of tracing webs of influence that would cut across many documents are evident. For example, the following three passages from the articles "Client" in all three of the reference works treated above suggest the viability of tracing lines of influence - borrowings, allusions, paraphrastic citations, etc. - across many texts.
Trévoux: C'étoit chez les Romains celui qui se mettoit sous la protection d'un puissant Citoyen, lequel s'appelloit par cette relation patronus, patron, & de son côté devoit à ses cliens sa protection & son secours. Ce patron assistoit le client dans ses besoins, & le client donnoit son suffrage au patron, quand il briguoit quelque Magistrature. Patronus.
Moréri: C'étoit chez les Romains un citoyen qui Se inettoit sous la protection d'un homme nuiltant, qui s'appclloit son patron. Ce patron assifloit le Client de sa protection, de son crédit & de ses biens ; et le Client donnoit son Suffrage au patron » quand il briguoit quelque magistrature pour lui ou pour ses amis
Encyclopédie: CLIENT, s. m. (Hist. anc.) parmi les Romains c' étoit un citoyen qui se mettoit sous la protection de quelqu' autre citoyen de marque, lequel par cette relation s'appelloit son patron, patronus. Voyez Patron. Le patron assistoit le client dans ses besoins, & le client donnoit son suffrage au patron, quand il briguoit quelque magistrature ou pour lui - même, ou pour ses amis.
These connections may well be cited passages or other kinds of similarities. Even in the case of annotated citations, we believe that building direct connections between documents, rather than attempting to follow "citation links," will offer a more effective approach to reconstructing intertextuality.
To support this type of research, we are working on two relevant development efforts. The first is to examine the effectiveness of topic-based segmentation in order to divide texts into parts based on changing topics that will then be processed using the VSM to identify passages on similar subjects. The second is a more robust and flexible implementation of PAIR, based on the shortcomings identified in this initial experiment. We plan to extend the matching to optionally include lemmas (root word forms) as well as different lengths of n-grams ranging from 2 to 5. Finally, we will be applying these approaches to a wider range of document types and time periods in order to assess their effectiveness [31].
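The planned extension to variable n-gram lengths and optional lemmatization could be sketched as follows. The function name `multi_length_ngrams` and the dictionary-based lemma lookup are assumptions for illustration and do not describe the released implementation.

```python
def multi_length_ngrams(tokens, min_n=2, max_n=5, lemmas=None):
    """Build n-grams of every length from min_n to max_n.

    If a lemma lookup table is supplied, each token is first replaced by its
    lemma; tokens missing from the table are kept unchanged.
    """
    if lemmas:
        tokens = [lemmas.get(token, token) for token in tokens]
    return {
        n: ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(min_n, max_n + 1)
    }

# Hypothetical lemma table covering three content words.
LEMMAS = {"noms": "nom", "distinguent": "distinguer", "familles": "famille"}
grams = multi_length_ngrams(["noms", "distinguent", "familles"], lemmas=LEMMAS)
print(grams[2])
print(grams[3])
```

Shorter n-grams raise the number of candidate matches at the cost of more false positives, while lemmas let inflected variants of the same word align.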
NOTES
1A. The ARTFL Encyclopédie is a digitized version of the first Paris edition of the Encyclopédie, with full text linked to digital page images. For more information, see http://encyclopedie.uchicago.edu.
1. See Macary (1973) and Kafker (1981a).
2. (1743). We extracted data for the Dictionnaire universel françois et latin de Trévoux from Le Grand Atelier historique de la langue française CD-ROM published by Redon and Dictionnaires Le Robert/SEJER, 2004.
3. It is curious to note that the Dictionnaire de Trévoux is not mentioned as a "predecessor" of the Encyclopédie in Kafker (1981b), an omission rectified in Leca-Tsiomis (1999).
4. See [http://www.lib.uchicago.edu/efts/ARTFL/projects/dicos/moreri/]
5. Responding to Diderot’s increasingly combative defensive tactics refuting claims of plagiarism, Berthier mentions this unsought encyclopedic war: “Car il nous importe de faire connoître que nous n'avons point désiré la guerre de l'Encyclopédie; que nous ne voulons rien faire pour la perpétuer, & qu'il nous est beaucoup plus agréable de nous asseoit dans le Temple de la paix, pour contempler de-là le succès de ce grand Ouvrage.” Journal de Trévoux, Nov. 1753. See [http://www.lib.uchicago.edu/efts/ARTFL/projects/trevoux/]
6. Journal de Trévoux 1752-1754. Also see Eick (2004).
7. Encyclopédie, “Avertissement des Editeurs,” Tome 3, 1753.
8. Macary (1973).
9. Berthier, November 1753.
10. See Miller (1994).
11. See Leca-Tsiomis (1999), Miller (1994) and Morin (1989).
12. This is based on the May 2007 release of the ARTFL Project's Encyclopédie. Please consult the project WWW site [www.lib.uchicago.edu/efts/ARTFL/projects/encyc/] for further details.
12A. PhiloMine is an open source set of extensions to the PhiloLogic system developed by the ARTFL Project. See [http://philologic.uchicago.edu/philomine/] for further information. PhiloMine includes VSM code outlined in Ceglowski (2003).
13. While the Salton (1975) paper is widely regarded as the first statement of the VSM, Dubin (2004) suggests that the idea was developed over the course of several years.
14. Such as Deerwester (1990), Hofmann (1999), and Turney (2005).
15. KNN (K Nearest Neighbor) and the Vector Space model are described in Chapter 14 of Manning (2008). Hearst (1997) describes VSM in topic-based document segmentation.
16. Singhal (1995).
17. Lemmas are the root or canonical forms of inflected words, e.g., 'go' represents "going, went, gone," etc.
18. Result tables can be found here: [http://tinyurl.com/4gksso].
19. A useful general introduction to "Levenshtein distance" is available on Wikipedia: [http://en.wikipedia.org/wiki/Levenshtein_distance]. Olsen (1988) examines this class of algorithms and suggests application in historical and textual research.
20. Links to result tables (spreadsheets) for both experiments are available in full at: [http://tinyurl.com/4gksso]. We would encourage other researchers to examine these results and perhaps be inspired to draw more specific conclusions from this raw data.
21. See the description of BLAST (Basic Local Alignment and Search Tool) [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html]. The Wikipedia entry on BLAST may be a useful introduction to the approach for non-specialists [http://en.wikipedia.org/wiki/BLAST]. Other approaches to the general alignment and document similarity problem include Forman (2005) and Eshghi (2005), who describe models of content "chunk" level similar document matching; He et al. (2007), who use "bursty" feature selection to identify changing subjects in temporal document sequences; and Bourdaillet (2007), who examines alignment in "noisy" datasets. Sequence alignment is also related to the more general "longest common substring problem," long considered a "classic" problem in computer science.
22. The region spanner in revised implementations catches the final n-grams matched, missed in this and the following examples.
23. Stein et al. (2006) discuss the impact of uncorrected optical character recognition in machine learning tasks, concluding that the impact is limited. With the rise of mass digitization projects, such as Google Books, we believe systems must be tolerant to significant levels of "noisy" or uncorrected data.
24. This implementation features a similarity searching mechanism that matches on many lexical variants rather than exact matching as one way to compensate for uncorrected OCR. We have also used this approach for a similar implementation of Ephraim Chambers' Cyclopaedia (2 vols. published in 1728, with 2 supplement vols. in 1753) available at [http://www.lib.uchicago.edu/efts/ARTFL/projects/dicos/chambers/].
25. Discours préliminaire des Editeurs, p. xlj.
26. Journal de Trévoux, Article CXIX, novembre, 1751, p. 2428-29.
27. Some of these would presumably be brought to light through a Vector Space similarity approach, but as we shall see, many would have been missed.
28. The table showing the borrowed passages as well as false positives, displayed in red, is available at [http://tinyurl.com/4gksso]. Identification of the false positives was an important element of this effort, since it allowed us to evaluate our proof-of-concept implementation.
29. The Abbot Edme-François Mallet died in 1755, limiting his contributions to the first 5 volumes of the Encyclopédie.
30. Discours préliminaire des Editeurs, p. xlj.
31. Since this article was submitted, the development team has completed and released open source implementations of PAIR and the PhiloLogic-specific version called PhiloLine. This effort took into account the limitations of our PAIR prototype described in this paper, and added a number of important extensions and functions. Please consult our PAIR/PhiloLine release site (http://code.google.com/p/text-pair/) for source code, documentation and a number of examples and demonstrations.
WORKS CITED
Dictionnaire de Trévoux [Internal database]
Encyclopédie [http://encyclopedie.uchicago.edu]
Journal de Trévoux [http://www.lib.uchicago.edu/efts/ARTFL/projects/trevoux/]
Grand Dictionnaire historique de Moréri [http://www.lib.uchicago.edu/efts/ARTFL/projects/dicos/moreri/]
Bourdaillet, Julien and Ganascia, Jean-Gabriel. 2007: "Alignment of noisy unstructured data", IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India - January 8, 2007.
Ceglowski, Maciej. 2003: "Building a Vector Space Search Engine in Perl", Perl.com [http://www.perl.com/pub/a/2003/02/19/engine.html].
D'Alembert, Jean. Discours préliminaire des Editeurs de 1751 et articles de l'Encyclopédie introduits par la querelle avec le Journal de Trévoux. Ed. Martine Groult. Paris: Honoré Champion Editeur, 1999.
Deerwester, S., Susan Dumais, G. W. Furnas, T. K. Landauer, R. Harshman. 1990: "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41/6: 391-407.
Dubin, David. 2004: "The most influential paper Gerard Salton never wrote", Library Trends 52/4: 748-764.
Eick, David Michael. 2004: "Defining the Old Regime: Dictionary Wars in Pre-Revolutionary France", Ph.D. Thesis, University of Iowa.
Eshghi K, Tang, HK. 2005: "A Framework for Analyzing and Improving Content-Based Chunking Algorithms," Hewlett-Packard Laboratories, Palo Alto, Technical Report, HPL-2005-30(R.1). [http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf]
Forman G, Eshghi K, Chiocchetti S. 2005: "Finding similar files in large document repositories" KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM Press, New York, NY, USA, 394-400.
He, Qi, Chang, Kuiyu, Lim, Ee-Peng and Zhang, Jun. 2007: "Bursty Feature Representation for Clustering Text Streams" in Proceedings of the 2007 SIAM International Conference on Data Mining.
Hearst, M. A. 1997: "Texttiling: segmenting text into multi-paragraph subtopic passages." Computational Linguistics 23/1: 33-64.
Hofmann, T. 1999: "Probabilistic Latent Semantic Analysis" in Proc. of Uncertainty in Artificial Intelligence, UAI'99.
Kafker, Frank A. 1981a: "The Encyclopédie and its predecessors." Studies on Voltaire and the Eighteenth Century 194: 223-237.
_____. 1981b: Notable Encyclopaedias of the Seventeenth and Eighteenth Centuries: Nine predecessors of the Encyclopédie. Oxford: Voltaire Foundation.
Leca-Tsiomis, Marie. 1999: "Ecrire l'Encyclopédie: Diderot: de l'usage des dictionnaires à la grammaire philosophique." Studies on Voltaire and the Eighteenth Century 375: 133+.
Macary, Jean. 1973: "Les dictionnaires universels de Furetière et de Trévoux, et l'esprit encyclopédique moderne avant l'Encyclopédie." Diderot Studies XVI: 152.
Manning, C. D., P. Raghavan, and H. Schütze (2008, forthcoming): Introduction to Information Retrieval. Cambridge University Press. Consulted electronic preprint at [http://informationretrieval.org/].
Miller, Arnold. 1994: "The last edition of the Dictionnaire de Trévoux." Studies on Voltaire and the Eighteenth Century 315: 5-49.
Morin, Robert. 1989: "Diderot, l'Encyclopédie, et le Dictionnaire de Trévoux." Recherches sur Diderot et sur l'Encyclopédie 7: 71-118.
Olsen, M. 1988: "Theory and Applications of Inexact Pattern Matching: A Discussion of the PF474 String Co-processor" in Computers and the Humanities 22: 203-15.
Salton, G., A. Wong, and C. S. Yang. 1975: "A Vector Space Model for Automatic Indexing," Communications of the ACM 18/11: 613-620.
Singhal, A. and Salton, G. 1995: "Automatic Text Browsing Using Vector Space Model" in Proceedings of the Dual-Use Technologies and Applications Conference 318-324.
Stein, Sterling Stuart, Argamon, Shlomo, Frieder, Ophir. 2006: "The Effect of OCR Errors on Stylistic Text Classification." SIGIR '06, Seattle, Washington.
Turney, Peter D. 2005: "Measuring Semantic Similarity by Latent Relational Analysis" Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05) Edinburgh, Scotland: 1136-1141.
Appendix: Main predecessors of the Encyclopédie
1674 – Louis Moréri publishes his historical/genealogical dictionary, Le Grand Dictionnaire historique, ou mélange curieux de l'histoire sacrée et profane...
1690 – Antoine Furetière’s Dictionnaire universel contenant généralement tous les mots françois, tant vieux que modernes, et les termes de toutes les sciences et des arts. Undertaken as a more inclusive alternative to the dictionary of the French Academy, which would appear four years later.
1694 – Le Dictionnaire de l’Académie française. First edition of the French Academy’s dictionary of the French language.
1697 – Pierre Bayle publishes his monumental work, Le Dictionnaire historique et critique, purportedly to correct the countless inaccuracies found in the Moréri dictionary.
1702 – The Protestant Henri Basnage de Beauval publishes a re-edited and augmented version of the Furetière dictionary: Le Dictionnaire universel ... d’Antoine Furetière.
1704 – Concerned by what they felt was a Protestant reworking of Furetière’s work, the Jesuits of Trévoux publish their own dictionary: Le Dictionnaire universel françois et latin, vulgairement appelé Dictionnaire de Trévoux.
1704 – Publication in London of John Harris's Lexicon technicum or an Universal Dictionary of the arts and sciences.
1718 – Second edition of the Dictionnaire de l'Académie française.
1728 – Ephraim Chambers publishes the Cyclopaedia: or, A Universal Dictionary of Arts and Sciences in England.
1740 – Third edition of the Dictionnaire de l'Académie française.
1740 – Sixth edition of Moréri’s Grand dictionnaire historique.
1743 – Fifth edition of the Dictionnaire de Trévoux.
1744 – Johann Jakob Brucker finishes his Historia critica philosophiae.
1747 – Diderot and d’Alembert undertake a French translation of Chambers’s Cyclopaedia, a project that will eventually become the Encyclopédie, ou dictionnaire raisonné des sciences, des arts, et des métiers, whose first volume will be published in 1751.