
    13.2 The RePEc document dataset

    Origin and motivation of RePEc

    A scholarly communication system brings together producers and consumers of documents. For the majority of the documents, the producers do not receive a monetary reward. Their effort is compensated through a wide circulation of the document and peer approval of it. Dissemination and peer approval are the key functions of scholarly communication.

    Scholarly communication in Economics has largely been journal-based. Peer review plays a crucial role. Thorough peer review is expensive in time. According to Trivedi (1993), a paper commonly takes over three years from submission to publication in an academic journal, not counting rejections. From informal evidence, slowly rising publication delays have been curbed in the past few years as journal editors have fought hard to cut down on what have been perceived to be intolerable delays.

    Researchers at the cutting edge cannot rely solely on journals to keep abreast of the frontiers of research. Prepublication through discussion papers or conference proceedings is now commonplace. Access to this informally-disseminated research is often limited to a small number of readers. It relies on the good will of active researchers to disseminate their work. Since good will is in short supply, insider circles are common.

    This time gap between informal distribution and formal publication can only fundamentally be resolved by reforming the quality control process. The inconvenience resulting from the delay can, however, be reduced by improving the efficiency of the informal communication system. This is the initial motivation behind the RePEc project. Its traditional emphasis has been on documents that have not gone through peer review channels. Thus RePEc is essentially a scholarly dissemination system, independent of the quality review process, on the Internet.

    Towards an Internet-based scholarly dissemination system

    The Internet is a cost-effective means for scholarly dissemination. Many economics researchers and their institutions have established web sites. However, they are not alone in offering pages on the Web. The Web has grown to an extent that the standard Internet search engines only cover a fraction of the Web, and that fraction is decreasing over time (Lawrence and Giles, 1999). Since much of economics research uses common terms such as "growth", "investment" or "money", a subject search on the entire Web is likely to yield an enormous number of hits. There is no practical way to find which pages contain economics research. Due to this low signal-to-noise ratio, the Web per se does not provide an efficient mechanism for scholarly dissemination. An additional classifying scheme is required to segregate references to materials of interest to the economics profession.

    The most important type of material relevant to scholarly dissemination is the research paper. One way to organize this type of material has been demonstrated by the arXiv.org preprint archive, founded in 1991 by Paul Ginsparg of the Los Alamos National Laboratory, with an initial subject area in high energy physics. Authors use that archive to upload papers that are stored there. ArXiv.org has now assembled over 150,000 papers, covering a broad subject range of mathematics, physics and computer science, but concentrating on the original subject area. An attempt has been made to emulate the arXiv.org system in economics with the "Economics Working Paper Archive" (EconWPA) based at Washington University in St. Louis, but success has been limited. There are a number of potential reasons:

    • Economists do not issue preprints as individuals; rather, economics departments and research organizations issue working papers.

    • Economists use a wider variety of document formatting tools than physicists. This reduces the functionality of online archiving and makes it more difficult to construct a good archive.

    • Generally, economists are not known for sophisticated practices in computer literacy and are more likely to encounter significant problems with uploading procedures.

    • There is considerable confusion as to the implications of networked pre-publication on a centralized, high-visibility system for the publication in journals.

    • Economics research is not confined to university departments and research institutes. There are a number of government bodies—central banks, statistical institutes, and others—which contribute a significant amount of research in the field. These bodies, by virtue of their size, have more rigid organizational structures. This makes the coordination required for the central dissemination of research more difficult.

    An ideal system should combine the decentralized nature of the Web, the centralized nature of the arXiv.org archive, and a zero price to end users. I discuss these three requirements in turn.

    The system must have decentralized storage of documents. To illustrate, let us consider the alternative scenario. This would be one where all documents within a certain scope, say within a discipline, would be held on one centralized system. Such a system would not be ideal for two reasons. First, those authors who are rejected by that system would have no alternative publication venue. Since Economics is a contested discipline, this is not ideal. Second, the storage and description of documents are costly. The centralized system may levy a charge on contributors to cover its cost. However, since it enjoys a monopoly, it is likely to use this position to extract rent from authors. This would not be ideal.

    On the other hand, we need access points to the documents, both for the usage of the documents by end users and for the monitoring of this usage. These activities are best conducted when centralized document storage is available, such as that which arXiv.org affords. Otherwise the economics papers become lost in the complete contents of the Web, and their usage is recorded in the web logs of many servers. Such usage logs are private to the management of the web servers. They cannot be used to monitor usage.

    To explain why the end-user access to the dissemination system should be free, it is useful to refer to Harnad's distinction between trade authors and esoteric authors (1995a). Authors of academic documents are esoteric authors rather than trade authors. They do not expect payments for the written work; instead, they are chiefly interested in reaching an audience of other esoteric authors, and to a lesser extent, the public at large. Therefore the authors are interested in wide dissemination. If a tollgate to the dissemination system is established, then the system will fall short of ideal.

    Having established the three criteria for an ideal system, let me turn to the problem of implementing it. The first and third objectives could be accomplished if departments and research centers allow public access to their documents on the Internet. But for the second, we need a library to hold an organized catalog. The library would collect what is known as "metadata": data about documents that are available using Internet protocols. There is no incentive for any single institution to bear the cost of establishing a comprehensive metadata collection, without external subsidy. However, since every institution will benefit from participation in such an effort, we may solve this incentive problem by creating a virtual collection via a network of linked metadata archives. This network is open in the sense that persons and organizations can join by contributing data about their work. It is also open in the sense that user services can be created from it. This double openness promotes a positive feedback effect. The larger the collection's usage, the more effective it is as a dissemination tool, thus encouraging more authors and their institutions to join, as participation is open. The larger the collection, the more useful it becomes for researchers, which leads to even more usage.

    Bringing a system to such a scale is a difficult challenge. Change in the area of scholarly communication has been slow, because academic careers are directly dependent on its results. Change is most likely to be driven from within. Therefore, a scholarly dissemination system on the Internet is more likely to succeed if it enhances current practice without threatening to replace it. In the past, the distribution of informal research papers has been based on institutions issuing working papers. These are circulated through exchange arrangements. RePEc is a way to organize this process on the Internet.

    The architecture of RePEc

    RePEc can be understood as a decentralized academic publishing system for the economics discipline. RePEc allows researchers' departments and research institutes to participate in a decentralized archival scheme which makes information about the documents that they publish accessible via the Internet. Individual researchers may also openly contribute, but they are encouraged to use EconWPA.

    Each contributor needs to maintain a separate collection of data using a set of standardized templates. Such a collection of templates is called an "archive". An archive operates on an anonymous ftp server or a Web server controlled by the archive provider. Each archive provider has total control over the contents of its archive. There is no need to transmit documents elsewhere. The archive provider retains the liberty to post revisions or to withdraw a document.

    An example archive. Let us look at an example. The archive of the OECD is at http://web.archive.org/web/20010829193045/http://www.oecd.org/eco/RePEc/oed/. In that directory we find two files. The first is oedarch.rdf:

    Template-Type: ReDIF-Archive 1.0
    Handle: RePEc:oed
    Name: OECD Economics Department
    Maintainer-Email: eco.contact@oecd.org
    URL: http://www.oecd.org/eco/RePEc/oed

    This file gives basic characteristics about the archive. It associates a handle with it, gives an email address for the maintainer, and most importantly, provides the URL where the archive is located. This archive file gives no indication about the contents of the archive. The contents list is in a second file, oedseri.rdf:

    Template-type: ReDIF-Series 1.0
    Name: OECD Economics Department working papers
    Type: ReDIF-Paper
    Provider-Name: OECD Economics Department
    Provider-Homepage: http://www.oecd.org/eco/eco/
    Maintainer-Email: eco.contact@oecd.org
    Handle: RePEc:oed:oecdec

    This file lists the content as a series of papers. It associates some provider and maintainer data with the series, and it associates a handle with the series. The format that both files follow is called ReDIF. It is a purpose-built metadata format. Appendix B discusses technical aspects of the ReDIF metadata format that is used by RePEc. See Krichel (2000) for the complete documentation of ReDIF.
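    The two listings above can be read mechanically. The following is a minimal Python sketch, not part of RePEc or of the ReDIF documentation, that splits a template into field/value pairs. It assumes that every field fits on a single "Name: value" line and that field names may be compared without regard to case; the function name parse_redif is hypothetical.

```python
def parse_redif(text):
    """Split one ReDIF template into a list of (field, value) pairs.

    Simplifying assumptions (not taken from the ReDIF specification):
    each field occupies exactly one "Name: value" line, and field names
    are lower-cased so that "Template-type" and "Template-Type" match.
    """
    fields = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first colon only, so values such as URLs and
        # handles keep their own colons.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a field line: {raw!r}")
        fields.append((name.strip().lower(), value.strip()))
    return fields


# The archive template quoted in the text.
OEDARCH = """\
Template-Type: ReDIF-Archive 1.0
Handle: RePEc:oed
Name: OECD Economics Department
Maintainer-Email: eco.contact@oecd.org
URL: http://www.oecd.org/eco/RePEc/oed
"""

archive = dict(parse_redif(OEDARCH))
print(archive["handle"], archive["url"])
```

    A list of pairs is returned rather than a dictionary because a field may in principle occur more than once in a template; converting to a dictionary is safe here only because the archive template has no repeated fields.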

    The documents themselves are also described in ReDIF. The location of the paper description is found through appending the handle to the URL of the archive, i.e. at http://web.archive.org/web/20010627025821/www.oecd.org/eco/RePEc/oed/oecdec/. This directory contains ReDIF descriptions of documents. It may also contain the full text of documents. It is up to the archive to decide whether to store the full text of documents inside or outside the archive. If the document is available online (inside or outside the archive), a link may be provided to the place where the paper may be downloaded. Note that the document need not be the full text of an academic paper; it may also be an ancillary file, e.g. a dataset or a computer program.
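    The rule for locating the paper descriptions can be sketched as follows. Judging from the OECD example, it is the last component of the series handle (here "oecdec") that is appended to the archive URL; that reading, and the function name series_directory, are assumptions made for this illustration.

```python
def series_directory(archive_url, series_handle):
    """Return the directory URL holding a series' ReDIF descriptions.

    Assumes a series handle of the form "RePEc:<archive>:<series>" and
    that the <series> component names a subdirectory of the archive URL,
    as in the OECD example.
    """
    parts = series_handle.split(":")
    if len(parts) != 3 or parts[0] != "RePEc":
        raise ValueError(f"unexpected series handle: {series_handle!r}")
    series_code = parts[2]
    # Normalize the trailing slash so the result has exactly one separator.
    return archive_url.rstrip("/") + "/" + series_code + "/"


directory = series_directory("http://www.oecd.org/eco/RePEc/oed", "RePEc:oed:oecdec")
print(directory)
```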

    Participation does not imply that the documents are freely available. Thus, a number of journals have also permitted their contents to be listed in RePEc. If a user's institution has made the requisite arrangements with publishers (e.g. JSTOR for back issues of Econometrica or Journal of Applied Econometrics), RePEc will contain links to directly access the documents.

    Using the data on archives. One way to make use of the data would be to have a web page that lists all the available archives, and allow users to navigate the archives searching for documents of interest. However, that would be a primitive way to access the data. First, the data as shown in the ReDIF form is not itself hyperlinked. Second, there is no search facility nor filtering of contents.

    Providing services that allow for convenient access is not a concern for the archives, but for user services. User services render the RePEc data in a form that makes it convenient for a user. User services are operated by members of the RePEc community, libraries, research projects, etc. Each service has its own name. There is no "official" RePEc user service. A list of services operating at the time of writing may be found in Appendix A.

    User services are free to use RePEc data in whatever way they see fit, as long as they observe the copyright statement for RePEc. This statement places some constraints on the usage of RePEc data:

    You are free to do whatever you want with this data collected on the archives that are described here, provided that you
    (a) Don't charge for it or include it in a service or product that is not free of charge.
    (b) When displaying the contents of a template (or part of a template) the following fields must be shown if they are present in the template: Title, Author-Name, File-Restriction and Copyright (if present).
    (c) You must contribute to RePEc by maintaining an archive that actively contributes material to RePEc.
    (d) You do not contravene any copyright statement found in any of the participating archives.

    Within the constraints of that copyright statement, user services are free to provide all or any portion of the RePEc data. Individual user services may place further constraints on the data, such as quality or availability filters.
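    Condition (b) of the statement translates naturally into a display rule. The Python sketch below is one possible reading of it, not code used by any RePEc service: whatever subset of fields a service chooses to show, the four named fields are always added when the template contains them. The function name fields_to_display and the sample template, including its handle and author names, are hypothetical.

```python
# Fields that condition (b) requires to be shown when present.
MANDATORY = ("title", "author-name", "file-restriction", "copyright")


def fields_to_display(template, wanted=()):
    """Select the (field, value) pairs a user service would render.

    `template` is a list of (field, value) pairs; `wanted` names extra
    fields the service chooses to show. Mandatory fields are always kept
    if the template contains them. Field names are compared without
    regard to case (an assumption of this sketch).
    """
    keep = set(MANDATORY) | {name.lower() for name in wanted}
    return [(name, value) for name, value in template if name.lower() in keep]


# Hypothetical paper template, invented for illustration only.
paper = [
    ("Template-Type", "ReDIF-Paper 1.0"),
    ("Author-Name", "A. Author"),
    ("Author-Name", "B. Author"),
    ("Title", "An Example Paper"),
    ("Handle", "RePEc:xxx:yyyyyy:0001"),
]

shown = fields_to_display(paper)
print(shown)
```

    A service applying its own quality or availability filter would do so before this step, by deciding which templates to render at all.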

    Because all RePEc services must be free, user services compete through quality rather than price. All RePEc archives benefit from simultaneous inclusion in all services. This leads to an efficient dissemination that a proprietary system cannot afford.

    Building user services. The provision of a user service usually starts with putting frequently updated copies of RePEc archives on a single computer system. This maintenance of a frequently updated copy of archives is called "mirroring". Everything contained in an archive may be mirrored. For example, if a document is in the archive, it may be mirrored. If the archive management does not wish the document to be mirrored, it can store it outside the archive. The advantage of this remote storage is that the archive maintainer will get a complete set of access logs to the file. The disadvantage is that every request for the file will have to be served from the local archive rather than from the RePEc site that the user is accessing.
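    The distinction between files stored inside and outside the archive can be expressed as a simple test on URLs. The sketch below assumes that "inside the archive" means "located under the archive URL on the same host"; the function name inside_archive and both file URLs are hypothetical, and actual mirroring software may decide differently.

```python
from urllib.parse import urlsplit


def inside_archive(archive_url, file_url):
    """Return True if file_url lies under archive_url (same scheme and host).

    Under the assumption stated above, such files would be copied by
    mirrors, while files elsewhere would be served from their own host.
    """
    a, f = urlsplit(archive_url), urlsplit(file_url)
    if (a.scheme, a.netloc.lower()) != (f.scheme, f.netloc.lower()):
        return False
    # Compare on a path prefix that ends in "/" so that ".../oed" does not
    # accidentally match a sibling directory such as ".../oedx".
    prefix = a.path.rstrip("/") + "/"
    return f.path.startswith(prefix)


ARCHIVE = "http://www.oecd.org/eco/RePEc/oed"
# Hypothetical file locations, for illustration only.
mirrored = inside_archive(ARCHIVE, "http://www.oecd.org/eco/RePEc/oed/oecdec/paper.pdf")
remote = inside_archive(ARCHIVE, "http://www.oecd.org/eco/papers/paper.pdf")
print(mirrored, remote)
```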

    An obvious way to organize the mirroring process overall would be to mirror the data of all archives to a central location. This central location would in turn be mirrored to the other RePEc sites. The founders of RePEc did not adopt that solution because it would be quite vulnerable to mistakes at the central site. Instead, each site installs the mirroring software and mirrors its own data. Not all sites adopt the same frequency of updating. Some may update daily, while some may only update weekly. A disadvantage of this system is that it is not known how long it takes for a new item to be propagated through the system.

    The documents available through RePEc

    Over 160 archives, some of them representing several institutions, in 25 countries currently participate in RePEc. Over 100 universities contribute their working papers, including U.S. institutions such as Berkeley, Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State, UCLA, and Virginia. The RePEc collection also contains information on all NBER Working Papers, the CEPR Discussion Papers, the contents of the Fed in Print database of the US Federal Reserve, and complete paper series from the IMF, World Bank and OECD, as well as the contributions of many other research centers worldwide. RePEc also includes the holdings of EconWPA. In total, at the time of writing in March 2001, over 37,000 items are downloadable.

    The bibliographic templates describing each item currently provide for papers, articles, and software components. The article templates are used to fully describe published articles. They are currently in use by the Canadian Journal of Economics, Econometrica, the Federal Reserve Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and the RAND Journal of Economics. These are only a few of the participating journals.

    The RePEc collection of metadata also contains links to several hundred "software components"—functions, procedures, or code fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and RATS languages, as well as code in FORTRAN, C and Perl. The ability to catalog and describe software components affords users of these languages the ability to search for code applicable to their problem—even if it is written in a different language. Software archives that are restricted to one language, such as those maintained by individual software vendors or volunteers, do not share that breadth. Since many programs in high-level languages may be readily translated from, say, GAUSS to MATLAB, this breadth may be very welcome to the user.