ERPA software, a set of CGI scripts written in Perl, fulfills two functions:

  • Its "spider" visits the Web servers of the participating institutions and updates the central database on the Vienna ERPA server as necessary.
  • The search engine gives site visitors access through the Web interface.

Our technical guidelines specify how the participating institutions must store their papers and how they should provide the necessary metadata.

Our main goal was to keep the system as simple as possible to minimize the effort put into editing the papers. That was particularly important because the participating institutions' series already had a great number of papers on line when we started the Archive.

The "Spider"

A "spider" is a Web tool that can be programmed to perform automated functions. Once set up, it works without human intervention. ERPA's spider is currently launched once a week. It visits specific directories on the Web servers of the participating institutions, searching for HTML files that contain meta-information such as author and title. An important piece of information in those files is the location of the text of the paper (which may be distributed over several other files in different directories). The spider stores the results at the central server in Vienna. For each paper we have one file that contains the metadata and one that contains the full text (minus HTML code and stopwords, that is, words too common to be searched). On the institutions' own servers, however, the metadata and the full text are not necessarily kept in separate files.
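The reduction of a paper to "pure" full text can be illustrated with a short Python sketch. It is not the ERPA code, which is written in Perl and not reproduced in this article; the function name `purify` and the stopword list are hypothetical, and splitting on word characters is an assumed simplification.

```python
import re
from html.parser import HTMLParser

# Hypothetical stopword list; ERPA's actual list is not published here.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


class _TextCollector(HTMLParser):
    """Collects character data and discards all tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def purify(html: str) -> str:
    """Return the text of `html` without markup and without stopwords."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    words = re.findall(r"\w+", " ".join(collector.chunks))
    return " ".join(w for w in words if w.lower() not in STOPWORDS)
```

A call such as `purify("<H1>Introduction</H1><P>The enlargement of the European Union</P>")` would leave only the words that are worth indexing.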

Tagging

ERPA software recognizes both meta tags in the head element of the main file and ERPA-specific comment tags in the body of the HTML text. A single document can even carry metadata in both ways, and the software will accommodate it. For editors, tagging in the text reduces additional editing and redundancy, especially in files that were not created for the repository. However, a separate metadata file makes it easy to provide the necessary information at one central spot.

The ERPA system uses the following fields:

ERPA field/tag   Description/remark
author           the name of the author(s) of the paper
title            the title of the paper
date             the date of publication of the paper, given in the format (D).(M).YYYY
URL              the URL the search engine should point to as the result of a search (which may be different from the URL of the "main" file)
keywords         the list of keywords attributed to the paper
text             the main text of the paper
include          a list of one or more files that also include (parts of) the main text
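The date format listed for the "date" field can be read as day and month with one or two digits, followed by a four-digit year. A minimal Python sketch under that reading follows; the function name `parse_erpa_date` is hypothetical and the ERPA scripts may validate dates differently.

```python
import re
from datetime import date

# Assumed reading of (D).(M).YYYY: one or two digits for day and month.
DATE_PATTERN = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")


def parse_erpa_date(value: str) -> date:
    """Convert a string such as '10.4.1997' into a date object."""
    match = DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not in (D).(M).YYYY format: {value!r}")
    day, month, year = (int(part) for part in match.groups())
    # date() itself rejects impossible dates such as 31.2.1997.
    return date(year, month, day)
```

Turning the string into a real date early makes later steps, such as searching a range of publication dates, a matter of simple comparison.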

For keywords, ERPA maintains a common thesaurus to be used by all participating groups. The management board has set up an easy and informal process of amending the list of keywords if necessary.
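An editor could check a paper's keywords against the common thesaurus before publishing. The Python sketch below is only an illustration: the article does not say whether the ERPA software performs such a check, the function name `check_keywords` is hypothetical, and the comma-separated input and case-insensitive comparison are assumptions.

```python
def check_keywords(raw: str, thesaurus: set) -> tuple:
    """Split a comma-separated keyword string and report unknown entries.

    Returns (keywords, unknown), where `unknown` lists the keywords that
    do not occur in `thesaurus` (compared case-insensitively).
    """
    keywords = [k.strip() for k in raw.split(",") if k.strip()]
    known = {term.lower() for term in thesaurus}
    unknown = [k for k in keywords if k.lower() not in known]
    return keywords, unknown
```

Keywords reported as unknown would then be candidates either for correction or for the informal amendment process.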

The "text" tag does not have to be inserted, but is recommended in order to make the full text as pure as possible, e.g. by excluding the "references" part of a paper. So far this feature has been used only by European Integration online Papers.

The HTML code of a typical ERPA "main file" might look like this:

<HEAD>
<TITLE>EIoP: Text 1997-001: Abstract</TITLE>
<META NAME="include" CONTENT="1997-001.htm">
<META NAME="URL" CONTENT="http://eiop.or.at/eiop/texte/1997-001a.htm">
</HEAD>
..
<P>
<!--BEGIN title-->
Old 'foundations' and new 'rules' - For an enlarged European Union
<!--END title--></P>
<P>
<!--BEGIN author-->
Philippe C. Schmitter and Jose I. Torreblanca
<!--END author--></P>
<P> Date of publication in the EIoP:
<!--BEGIN date-->
10.4.1997
<!--END date--></P>
<P>Keywords:
<!--BEGIN keywords-->
institutions, enlargement, majority voting, Council of Ministers, European Parliament
<!--END keywords--></P>
..
<!--BEGIN text-->
<H1>Introduction</H1>
..
<!--END text-->
..
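How a main file of this kind could be read is sketched below in Python. The ERPA scripts themselves are written in Perl and are not shown in this article, so the function name `extract_fields`, the lower-casing of field names, and the rule that a comment tag overrides a meta tag of the same name are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collects NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


# Matches <!--BEGIN field--> ... <!--END field--> pairs in the body.
COMMENT_FIELD = re.compile(
    r"<!--\s*BEGIN\s+(\w+)\s*-->(.*?)<!--\s*END\s+\1\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def extract_fields(html: str) -> dict:
    """Return metadata found in META tags and in BEGIN/END comment tags."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    fields = dict(collector.meta)
    for name, value in COMMENT_FIELD.findall(html):
        # Assumption: comment tags win over META tags of the same name.
        fields[name.lower()] = " ".join(value.split())
    return fields
```

Applied to the example above, the result would hold "include" and "url" from the head and "title", "author", "date", "keywords" and "text" from the comment tags.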

The Search Engine

The second main element of the ERPA software is the search engine, which is built on the central database that holds the two files for each paper in the Archive. Readers enter searches on one of two Web forms. The short-search form lets users search in the author and title fields only, or choose the "Quick Update" function that indicates recent entries to the database. The advanced-search form gives access to a full range of search options:

  • selection of a particular working-paper series (the default search is performed in the entire archive);
  • case sensitivity (the default is not case-sensitive);
  • searching in one or any combination of the following fields: author; title; range of dates of publication; keywords; and full text;
  • allowing AND NOT searches (the default is an AND — "must contain") for each field;
  • facilitating the keyword search with two list boxes containing the whole list of available keywords;
  • allowing the operators AND, OR, NOT as well as nesting and truncation in the full-text search.
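The per-field options in this list (a default AND, an optional AND NOT, and optional case sensitivity) can be illustrated with a small Python sketch. It is not the ERPA search engine: the names `Criterion` and `matches` are hypothetical, simple substring matching is an assumption, and the nested full-text operators and truncation are left out.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    field: str            # e.g. "author", "title", "keywords"
    term: str             # the text the user typed into the form
    negate: bool = False  # False = AND ("must contain"), True = AND NOT


def matches(record: dict, criteria: list, case_sensitive: bool = False) -> bool:
    """Return True if `record` satisfies every criterion."""
    for criterion in criteria:
        haystack = record.get(criterion.field, "")
        needle = criterion.term
        if not case_sensitive:
            haystack, needle = haystack.lower(), needle.lower()
        found = needle in haystack
        # An AND criterion fails when the term is missing;
        # an AND NOT criterion fails when the term is present.
        if found == criterion.negate:
            return False
    return True
```

A search for papers by Schmitter that do not mention voting in the title would combine `Criterion("author", "schmitter")` with `Criterion("title", "voting", negate=True)`.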

The search result is presented in a list sorted by the date of publication, with newest papers first. It gives the authors' names, the publication date, the title of the retrieved paper, and a hyperlink to the original URL of the paper.
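The ordering of the result list can be sketched in Python as follows. The record layout, the function names and the exact layout of an output line are hypothetical; only the sort order (newest first) and the four items shown per paper come from the description above.

```python
from datetime import date
from operator import itemgetter


def sort_newest_first(records: list) -> list:
    """Return the records sorted by publication date, newest first."""
    return sorted(records, key=itemgetter("date"), reverse=True)


def format_result(record: dict) -> str:
    """Render one result line: authors, date, title and link."""
    published = record["date"]
    stamp = f"{published.day}.{published.month}.{published.year}"
    return f'{record["author"]} ({stamp}): {record["title"]} <{record["url"]}>'
```

Because Python's sort is stable, papers published on the same day keep the order in which the search found them.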

Technical Prospects

ERPA certainly plans to include more series. In addition, we are looking into service improvements that include:

  • Incorporating PDF files into the Archive automatically. Today the spider is able to extract metadata only from HTML files, so PDF files have to be converted to HTML to allow the spider to work. Our plan is to implement a new version of the spider, which will be able to extract the meta-information and the full text from PDF files.

  • Improving the reporting. Almost since the beginning of the Archive we have been collecting simple statistical information, i.e., the number of times the search-engine was used. In order to know more about the way ERPA is used, we would like to implement a more sophisticated tool to give us better user statistics.

  • Coordinating the tagging with the Dublin Core (DC) Metadata for Resource Discovery specifications, which seem to have become the international standard for metadata. Today we are only partially compliant. Full compliance will allow knowbots (automated information brokers) and other search tools that use DC-standard tags to read our metadata. The differences today are that we use "author" instead of "creator" and "keywords" instead of "subject," and that we still allow tagging inside the body of the text. The elimination of comment tags, however, would be a major change.

  • Establishing a mirror site in the U.S. Given the worldwide Europeanist community, and the very large Europeanist community in North America, we are already negotiating with a U.S. university to host a mirror site. That will improve access to the Archive for researchers on the other side of the Atlantic.
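The renaming needed for Dublin Core compliance, mentioned in the list above, could be handled by a simple mapping. The Python sketch below is an illustration, not a planned ERPA component: the function name `to_dc_meta` is hypothetical, the "DC." prefix follows the common convention for embedding Dublin Core in HTML meta tags, and leaving out the ERPA-specific fields (URL, text, include) is an assumption.

```python
from html import escape

# ERPA field name -> Dublin Core element name. Only "author" and
# "keywords" differ; "title" and "date" keep their names.
ERPA_TO_DC = {
    "author": "creator",
    "title": "title",
    "date": "date",
    "keywords": "subject",
}


def to_dc_meta(fields: dict) -> list:
    """Render ERPA metadata as Dublin Core style META tags."""
    tags = []
    for name, value in fields.items():
        dc_name = ERPA_TO_DC.get(name)
        if dc_name is None:
            continue  # no Dublin Core counterpart assumed for this field
        content = escape(value, quote=True)
        tags.append(f'<META NAME="DC.{dc_name}" CONTENT="{content}">')
    return tags
```

The values are passed through unchanged, so any conversion of the date format would have to be added separately.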

Additional Members

The ERPA management board received applications for membership from other groups with working-paper series soon after the official launch. The Board chose not to accept groups on a case-by-case basis, but to formulate a general policy for applicants. After lengthy and controversial discussions among the editors of the founding series, a policy paper was issued in June 1999 setting out the criteria to be met by any applicant.



Dr. Michael Nentwich is a senior researcher at the Institute of Technology Assessment of the Austrian Academy of Sciences in Vienna, where he is mainly involved in projects in the area of information and communication technologies. He is also involved in a number of WWW projects and edits the European Integration online Papers (EIoP). Previously, he was a lecturer at the interdisciplinary Research Institute for European Affairs at the University of Economics in Vienna and an HCM fellow at the Universities of Warwick and Essex in the U.K. He is currently a guest researcher at the Max Planck Institute for the Study of Societies in Cologne, Germany. He studied law, economics and political science in Vienna and Bruges, Belgium. His publications include books and articles on European economic law, European constitutional issues, democratic theory and technology assessment.

Dr. Nentwich's home page is at: http://eiop.or.at/mn/

Michael Nentwich may be reached by e-mail at mnent@oeaw.ac.at.


Links from this article:

ERPA policy paper, http://eiop.or.at/erpa/policy.htm

Dublin Core (DC) Metadata for Resource Discovery specifications, http://purl.oclc.org/dc/about/element_set.htm