Add to bookbag
Author: Karine Pellier
Title: Contributions of data Management to Cliometrics
Publication Info: Ann Arbor, MI: MPublishing, University of Michigan Library
April 2006
Availability:

This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact mpub-help@umich.edu for more information.

Source: Contributions of data Management to Cliometrics
Karine Pellier


vol. 9, no. 1, April 2006
Article Type: Article
URL: http://hdl.handle.net/2027/spo.3310410.0009.102

Contributions of Data Management to Cliometrics

Karine Pellier

Faculté des Sciences Economiques
LAMETA - UMR 5474
Avenue de la mer - Site de Richter
C.S. 79606
34960 - MONTPELLIER CEDEX 2-FRANCE
pellier@lameta.univ-montp1.fr

.01 Abstract

Often in historical research the empirical validation of theoretical assumptions requiresthe handling of an important volume of data. Once collected, these data should be structured in order to facilitate their conservation and/or use. This paper describes the organization of data used as support for Cliometric or econometric analysis. To this end, the paper details the stages of the creation of a database the main objective of which is to store, organize and structure a unit of a statistical series resulting from an account of the Spanish education system. Our framework, based on an example, can be generalized to other data management tasks in economic history.

.04 Introduction

With the New Economic History, economists have been able to improve upon the weaknesses of more traditional approaches (Kuznets, 1941; Mitchell, 1927; Schumpeter, 1954) in order to combine economic analysis and historical data. First advanced by Conrad and Meyer’s (1957, 1958) works, the New Economic History was popularized by Fogel and North, two economist/historians honored with the Nobel Prize for Economics in 1993. The New Economic History is defined as a discipline of synthesis that judiciously combines economic theory and quantitative methods in the investigation of history and economics. This Cliometric renews the problems of economic history and tries to provide, primarily on the basis of original statistical series, a reinterpretation of outstanding historical phases. The first works particularly concerned United States growth in the 19th century, the conditions of emergence of the industrial revolution, or the great crises of modern history. Thereafter, with renewed interest in the subject, theories of growth were re-examined in order to propose answers to the great historical questions: what are the determining factors of durable growth? What is the role of technical progress, or the role of human capital?  [2]

Understanding the development or the decline of economies requires resorting to techniques of modeling and to the subjection of assumptions to the test of history. So the work of the researcher is multi-dimensional, and s/he must have a sufficiently complete “toolbox.” The instruments of analysis, borrowed from statistics, must be powerful and suitable if the results are to validate or refute hypotheses. Quality data are essential for these quantitative methods. One of the innovative aspects of Cliometrics is the method of collecting and organizing statistical data with which it has been associated. The choice of statistical variables is difficult because it must as much as possible reflect economic reality. In fact, economic models dictate this choice, but the data are not always directly or completely available. Often long-term analyses require calculated estimates or the rebuilding of a time series.

.05 The Role of Databases

Databases constitute a valuable tool in preparing data for econometric treatment. Various arguments justify their implementation. They facilitate the management and logical organization of significant volumes of data. Rules to check the validity and consistency of data can be applied as they are being entered into a database. It is also possible to share in a network the content of databases, while protecting their content by offering read-only access.

This paper is particularly relevant to this phase of preparation of data and proposes to detail the organization of a database with two purposes. The first purpose was to collate and organize economic and demographic data relating to nineteenth- and twentieth-century Spain in a way that will facilitate detection of possible causal relationships between the education system and the economic system of this country. On the other hand, this database will also complete the education field of a more significant database, CAROLUS, conceived in 2001 and updated by N. Daures. This latter already contains data gathered in fields (e.g., education, demography, employment) and is made available to a team of researchers via a network. Our framework, even if it is based on an example, could be applied more widely to other historic data management projects.

.06 Principles of Modelling and Implementation of the Database

A coherent organization of the database requires the construction of a conceptual data model or “entity-relationship” model (ER). This model should aim to identify, describe and analyze in a formal way the unit of information to be stored. It defines the structure and relationships of information, and consequently defines the architecture of the database. To create the conceptual data model, we follow a frequently used approach,: an ascending type, also called “bottom-up” approach. This one requires a number of steps. The first consists of drawing up a list of the data (information) that will figure in a “dictionary of the data”. Then, this information is organized in groups called “entities.” Finally, the relationships between these are specified.

.07 Spain from 1850 to 1965

This database was designed to store economic and demographic data relating to Spain over the period 1850-1965. The data, which appear in the form of time series, concern the total population of Spain, the total of the public expenditure of the country, a price index worked out by L. Prados (1993), and the national income (measured according to the cost of factors of production). Also featured is the expenditure on education of the county, in current prices, distributed according to seven categories: expenditure on administration for teaching, expenditure on teaching primary education, secondary education, technical education, university education, and professional education, and the education of physically or mentally handicapped children. The data were compiled by C Diebolt (1999, 2000), and in the main came from the public finance directories published by the Spanish Ministry of finance. The series were neither modified nor corrected. However, they were lacking data for the period approximately corresponding to the Spanish Civil War (1936-1940). Additional fields were added to complete these series. These contained information concerning the status of the data,  [3] the measuring unit, references to publications, and any other comments. The dictionary of the data fields was thus as follows:

  • SerieName
  • AbbreviatedName
  • MeasuringUnit
  • Source
  • Comment
  • Author
  • Year
  • Value
  • Status

The simplicity of this organization of the database was guaranteed by there being a restricted number of entities. Each entity possesses attributes (i.e., properties that describe each piece of data). A particular instance (occurrence) of an attribute is a “value.” For example, “the total of the public spending” is one value of an attribute appearing in the field “SerieName.” The sources of the time series being varied, it seems relevant to create three entities (Figure 1). The first one, named “Series,” gathers only detail about each series of data including: title of series (full and abbreviated title), measuring unit, source, comments and finally the author of the series, (i.e., the person who collected the data). As for the second entity, “Data,” it contains attributes exclusively referring to the annual values of the series: year, value and status. The last entity relates to the geographical area of the data includes the attribute country. It should be noted that this database contains only data relating to Spain. However, the broader database CAROLUS already contained data relating to other countries. The addition of this last entity was thus intended to facilitate the integration of our database into CAROLUS. All the entities so defined comply with appropriate uniqueness and identification rules such as each attribute appears only once and in only one entity, and each entity has an identifier. An identifier, more commonly called a “key,” uniquely identifies each occurrence of an entity. Thus, in the entity “Series,” the identifier “Serie_ID” identifies in a distinctive way each series (e.g., population, price index).

[figure]
Figure 1: The entities with their attributes

It remains to establish associations, (i.e., the relationships), between these three entities (Figure 2). These associations have properties (such as degree and cardinality) that have to be defined. The degree of an association is the number of entities associated in the relationship. In our model, each association is of the binary type because it relates to only two entities. Cardinality quantifies the relationships between entities by measuring how many instances (maximum and minimum) of one entity are related to a single instance of another. Thus, there should be no data that does not also belong to a series or to a geographical area. Each piece of data should relate to only one series and only one geographical zone. It appears here as a constraint resulting from a functional dependence between the entities. Indeed, it is not possible within the database to create an occurrence of the entity “Data” without also indicating an associated occurrence of the entities “Series“ and “Geographical Area.” On the other hand, each such series or zone can contain zero data and to the maximum “N” data. The ER model thus obtained is standardized: there is no redundant information, each attribute of each entity depends directly on the identifier for each entity, and there is no attribute that is in only partial dependence.

[figure]
Figure 2: Entity-Relationship model

The conceptual model is then used to obtain a logical data model. Among the various logical data models, Codd's (1970) relational model was used. This model prepares for the compilation of the data in a database management system (DBMS). A DBMS-R is software based on the relational model, and should be durable, reliable, shareable between several users, and capable of managing large quantities of information. The structure of our database being not very complex, we chose to implement it using Access software.  [4]

The relational model (RM) is based on tables and relationships between tables. The relationships between data identified earlier in the ER model are here applied to the formation of tables. Tables are analogous to the entities of the ER model and are the fundamental building blocks of the database. They allow us to create the framework for storing information. The structure of a table is similar to that of a worksheet in a spreadsheet. The columns correspond to fields (i.e., elements that contain information in the same standard way and of the same nature). The data are stored in the form of records. A record is a line that contains an entry in several fields. The attributes of these entities from the ER model correspond to fields, and identifiers to primary keys in each record. It is possible to record information of any kind in the fields, but for each field it is advisable to specify the type of data (numeric data or text), field size, and the indexing. Even if databases are particularly adapted to the management of bulky data it is preferable to minimize the volume of the database. A database of minimal size is simpler to handle and facilitates faster calculations and faster transfers via Internet. It is important at this point to establish with certainly the data type to be stored in any field because any later modification of the type or the size of the fields is likely to generate problems, especially at the level of entering the data. According to these recommendations, we linked to each field of the tables “Series,” “Data” and “Geographical Areas” a relevant data type as well as the corresponding length (Figure 3). The default value of a text type is 50 characters but it may be adjusted according to the data. Moreover, the data type long integer is used for numbers without decimals or automatic numbers and the data type simple integer is used for decimal numbers.

[figure]
Figure 3: Properties of the tables

Each table contains all the data of the same type. The table “Series” contains all information relating to the series. “Geographical Areas” contains all information relating to the geographical specificity of the data. And “Data” stores all of the statistical data. The primary keys “Serie_ID”, “Data_ID” and “Country_ID” uniquely identify each record in the tables. By convention, the primary keys have neither null values nor repeating values. In the table “Data” the field “Value” refers to the statistical data itself and the field “Status” gives us an indication about the nature of the data (0: missing, 1: valid, 2: estimated, 3: calculated).

The relationships between tables are similar to associations. However, associations of type max 1, max N are not represented in the relational model (RM). Relationships between tables are represented by a direct line and the cardinality of the relation is obtained by reversing the maximum values of association (Figure 4). Moreover, a relationship is made between two relational tables by matching the values of a foreign key of one table with the values of the primary key in another. A foreign key is a field whose values are the same as the primary key of another table. Primary and foreign keys are fundamental because they are the means by which entries in different tables in the database are connected with each other. The field “Serie_ID” of the table “Data” is a foreign key which links in the table “Series” to the primary identifier of the statistical series to which the data belong. In a similar manner, the field “Country_ID” indicates the country of reference and links to the table “Geographical Areas.” A precaution must be taken during the connection of the tables. The dependent fields containing the foreign keys must be in the same type and of the same length as the primary keys. But if the primary key is an automatic number then the external key will necessarily also be a numeric.

[figure]
Figure 4: Relational diagram of the database

The relationship between the tables is of the type one-to-many (1-N). This means that the field of the table containing the primary key may be in relation to one or more records of the table containing the external key. In relation to the connection between the tables “Series” and “Data,” this implies that a series is likely to relate to many pieces of data. Such an arrangement is logically appropriate when one considers that the data are annual and cover a range of years from 1850 to 1965, so that any oneseries is likely to contain a large number of pieces of the same type of data recurring year after year.

The analysis of relationships can facilitate access and querying of the database. However, the relationships must respect rules and in particular the referential integrity of the data. The role of this latter is to ensure the validity of all entries, and checking against typographical errors and prohibiting any deletion or inopportune modification of primary keys or records. In particular, in our database, it will be possible to complete the table “Data” only with records corresponding to a record in the tables “Series” and “Geographical Areas.” The referential integrity rule makes sure that every foreign value matches a primary key value in an associated table.

.08 Data Transfer and Consultation

After the structure of the database and tables has been finalized and implemented, information content must be entered or imported. Where the data were stored in traditional worksheets, they could be imported into the table “Data.” The use of spreadsheets for the manual entry of data is convenient because functions can be used to detect anomalies such as missing values, and inconsistencies. Data in the tables “Series” and “Geographical Areas” are compiled manually either directly, or through the intermediary of a form. The forms, like the tables, the queries and the reports, are the major components of Access that facilitate the handling of data. A form is an interface through which one can consult, modify or add data resulting from one table or query. It makes the handling of the database easier and consequently limits the errors in entry by the user.

For this database, it was helpful to create two forms. The first is based mainly on the table “Series” but is also linked to the table “Data.” This form allows us to consult in the same window and at the same time all the information relating to the selected series and the associated statistical data (Figure 5). For safety measures, modifications such as the deletion of data are forbidden. A form may also be used easily to run certain otherwise complex actions. Such an action, which may be started by a click of the mouse on a button in such a form, is a macro, i.e., a procedure written in a specific language, the VBA (Visual Basic Application). When a form is created, it is thus easy to insert buttons that will be used to open another form, table, query, or report or close the active form.

[figure]
Figure 5: Form for entry/consultation of information

The period of complete study extends from 1850 to 1965, but it happens that an economist using this database might wish to carry out a more restricted analysis. In this case, we propose a second form “Serie's Choice,” allowing him to select the series he wishes to see as well as the desired period (Figure 6). Default values define the limits of the interval. The form is on any one table but is associated with a query that is based on the tables. The fields of this form will make it possible to specify the parameters of a query and thus to limit the search for data.

[figure]
Figure 6: Form for the filtering of data

.09 Use of the Database

The principal function of a database containing an important volume of data is to retrieve particular information in order to facilitate analysis. The queries are the most powerful tool to fulfill this function. They filter and view the researched records in a window that is similar to a worksheet. The types of queries include select queries to search data with conditions, parameter queries and cross tab queries.

There is a dynamic interaction between tables and queries. When one modifies or adds data in a table, the query based on the fields of this table is updated automatically. In order to avoid congestion, only the structure of the query is displayed at any point and not the underlying tables. Nevertheless, every modification carried out is reflected in the data of the corresponding table. The procedure to create a query initially consists in defining the objective of the query, i.e. the data that are sought. It is then advisable to identify the tables containing these data. Finally, it is necessary to define the criteria and the operations applied to the data. A query can combine data from multiple tables or queries, if the relationships between these objects are defined at the start.

We created for this database four queries with particular goals. The first one is a query that can be run using a form. The form “Serie's Choice” is a custom form where a query's parameters may be entered. Once established, the structure of the query will not be modified. Indeed, in order to ensure the correct working of the database, the user simply enters the parameters of the form and should not have need in any case to make modifications to the query. From this form, the user chooses the period and the series s/he wishes to view. In the query, the criteria applied to the field “Year” define precisely the limits of the period searched (Figure 7). For each one of these limits, it is necessary to indicate the name of the form and the name of the controlwith which it is associated. In the example provided, “B” and “E” form such limits. These limits can contain only existing values in the field “Year” of the table “Data.” Control “S” is a combo box in which the user can select the title of the series which interests him. After the query is run, the user can view the filtered data on a datasheet.

[figure]
Figure 7: Criteria of a Parameter Query

The second query is designed to extract the data relating to education spending at current prices and also calculates the annual sum of this expenditure. Given that it is a question of connecting several fields of the table “Data,” we developed a “Cross tab Query” to fulfill this function. In order to display in a convenient tabular form the annual value of each type of expenditure, the field “Years” is put on the line header and the field “AbbreviatedName” in the columns header (Figure 8). The logical operator OR is used to specify the criteria applied to the field “AbbreviatedName” which are used to restrict the retrieved series only to those which interest us. The structure of this query is finished by the addition of a calculated field “Total” based on “Value.” This field includes totals of the different types of spending for each year.

[figure]
Figure 8: Criteria of a Cross tab Query

There should be no need to add new data that could be calculated from data already included in a database if the database is effectively designed. If an economist user wished to know the proportion of the public expenditure devoted to education over the entire period, it should not be necessary to add a new series of data to the database since one already has the total of public and educational spending, obtainable via the cross tab query.  [5] Thus, a new query can be designedbased on the fields “Year” and “Total” of the cross tab query and on the fields “Year,” “Serie_ID,” and “Value” of the table “Data” (Figure 9). To ensure the correct working of the query, the match between the two identical fields “Year” (taken one from the cross tab query and one from the table “Data”) is identified in the query’s construction and the fields connected by an equijoin. A calculated field, the field “Ratio,” is added, in which the ratio between the “Total” field (education spending) and the “Value” field (public spending) is calculated. The datasheet retrieved by this query displays only two fields, the year and the value of the calculated ratio for each year.

[figure]
Figure 9: Criteria of a Select Query

The purpose of the last query that will be examined here is to retrieve the series of data relating to the Spanish national income and to display these data in a graph for the initial period. This last query is based on the fields “Year,” “Value,” and “Serie_ID” of the table “Data.” As all the data required are from the same series, all that is required in building the query is to enter the relevant series ID number (17 for the series of data of Spain's national income) in the criteria column for the field “Serie_ID.” This query generates a report. Reports possess the functionality to represent data retrieved from queries or field in calculations, or in this case, in a graph (Figure 10). Thanks to the graphic report related to this query, the user can visualize one or more series simultaneously.

[figure]
Figure 10: Graphic report based on a query

Queries are useful tools with which to interrogate a database. However, when one wishes to carry out more complex calculations, it is better to export the data into specifically adapted software such as spreadsheets. Within this database, several buttons have been created which initialize macros and make the export of data into Excel spreadsheets easy. Such data could be imported thereafter into other specialistsoftware, such as the E-Views software, in order to subject it to statistical processing. Spreadsheets are useful tools complementary to the DBMS to use in the processing and export of data.

This discussion finishes with an overview of a general menu (Figure 11). The initial menu of the database, which is displayed when it is opened, is provided in Figure 11. This is designed to describe the functionality of the database and provide help to the user. It gathers in one place buttons giving access to the various objects and functions that have been created for the exploitation of the database. Possibility of data loss or other accidents is limited owing to the fact that these various objects open on a read-only mode.

[figure]
Figure 11: General Menu

.010 Conclusion

Once created, this database answered the starting requirements. First, thanks to its simple and concise structure, this database made possible the efficient organization of data used by a bachelor's degree student. The statistical processing that the student employed revealed causal relationships between the education system and the economic system of Spain during the nineteenth- and twentieth-centuries. Future uses of the database are always possible: the general menu (Figure 11) makes the use of it really easy and even users with a minimal level of technical competences will be able to create new objects and queries answering their needs.

Additionally, this database has completed another database, called CAROLUS, which is more important in size. The transfer of the tables to CAROLUS was carried out without any difficulty because these two databases have fully compatible structures. CAROLUS contains data in various fields (demography, education, patents…) available to a team of researchers that undertake research on quantitative history, structured by economic theory and using statistical and econometric methods. (For an example of work based on the database, see Diebolt, Jaoul & San Martino 2005.)

Finally, databases are efficient tools available to students or researchers who have to manage data in a logical way and with the aim of use them. They are an essential step to prepare data that have different origins or that come from files of different types. Through this paper we show that our methodology could be generalized to a variety of applications for historical cata analysis.

.011 Notes

1. The author would like to thank Nicolas Daures and Claude Diebolt for useful suggestions and comments.

2. For a more complete description of the nature of Cliometrics,interested readers can visit the Website of the French Cliometric Association: http://www.cliometrie.org

3. The status field informs us about the data’s quality, i.e, whether the data are obtained directly through the source are an estimate, are missing, and so on.

4. The DBMS Oracle or SQL-Server software is more powerful and is able to manage databases of very complex structures, but on the other hand requires more technical data-processing competences and knowledge.

5. In the same way, education spending at constant price couldbe obtained using the price index data series.

.012 References

Bouzeghoub M. Le Modèle relationnel : Algèbre, langages, applications. (Collection Les bases de données en question, Hermès, 1998).

Codd, E. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 13, no. 6 (1970): 377-387.

Conrad,A. “Economic Theory, Statistical Inference and Economic History.” Journal of Economic History 17, no. 4 (1957): 524-544.

Conrad A. & Meyer J. “The Economics of Slavery in the Ante Bellum South.” Journal of Political Economy 66, no. 2 (1958): 95-130.

Daures N.“La base de données CAROLUS. Données économiques et démographiques sur l'éducation en France aux XIXème et XXème siècle.”, 2004,First Workshop on Cliometrics & Econometric History, AFC, CEROM de l'Ecole Supérieure de Commerce de Montpellier & UMR LAMETA, 10 septembre 2004.

Diebolt C. “Gouvernment Expenditure on Education and Economic Cycles in the Nineteenth and Twentieth Centuries. The case of Spain with special Reference to France and Germany.” Historical Social Research 24, no. 1 (1999): 3-31.

Diebolt C. Dépenses d’éducation et cycles économiques en Espagne aux XIXème et XXème siècles. (L’Harmattan, Paris, 2000).

Diebolt C. “La cliométrie se rebiffe!”, 2004, First Workshop on Cliometrics & Econometric History, AFC, CEROM de l'Ecole Supérieure de Commerce de Montpellier & UMR LAMETA, 10 septembre 2004.

Diebolt C., Jaoul-Grammare M. & San Martino G. Le mythe de Ferry : une analyse cliométrique, in : Revue d'Economie Politique (Editions Dalloz), 115 no. 4 (2005): 471-497

Fogel, R. Railroads and American Economic Growth: Essays in Econometric History. (The Johns Hopkins University Press, Baltimore, 1964).

Fogel, R. “The Reunification of Economic History with Economic Theory.” American Economic Review 55, no. 2 (1965): 92-98.

Fogel, R. “Economic Growth, Population Theory, and Physiology: The Bearing of Long-Term Processes on the Making of Economic Policy.” American Economic Review 84, no. 3 (1994): 369-395.

Kuznets S. “Statistics and Economic History.” Journal of Economic History, no. 1 (1941): 26-41.

McCloskey, D. “Does the Past Have Useful Economics?”Journal of Economic Literature 14, no. 2 (1976): 434-461.

McCloskey, D.Econometric History. (Macmillan, London, 1987).

Ministerio de Hacienda (Ed.).Datos basicos para la historia financiera de España (1850-1975). (2 vols, Fabrica Nacional de Moneda y Timbre, Madrid, 1976).

Mitchell W.C.Business Cycles: The Problem and its Setting. (National Bureau of Economic Research, New York, 1927).

North D. “Economic Performance Through Time.” American Economic Review 84 (1994): 359-368.

Prados de la Escosura, L. “Spain's Grows Domestic Product, 1850-1990: A New Series.” Working Paper, Universidad Carlos III de Madrid D-93002 (1993).

Rollinat R.La nouvelle histoire économique. (Editions Liris, Collection Perspectives Contemporaines, Paris, 1997).

Schumpeter J.Histoire de l’analyse économique. (Traduction française (1983), Gallimard, Paris, 1954).

Website of The French Cliometric Association (AFC): http://www.cliometrie.org

Viescas J. Microsoft Access version 2002 au quotidien. (Microsoft Press, 2001).