Author: Deborah Lines Andersen
Title: Controlling Digital Data
Publication Info: Ann Arbor, MI: MPublishing, University of Michigan Library
This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact firstname.lastname@example.org for more information.
vol. 6, no. 1, April 2003
Benchmark: a standard by which something can be measured or judged. 
.01 A Fable
While spending the spring 2003 semester at the Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow I heard a story about digital data.  I have not found a way to confirm or deny it, so I relate it here as a fable. It may very well be true and if anyone reading this can document the case then I would appreciate an email. The reason it is a fable is that it has a moral, and, I believe, serious policy and research issues attached to it that historians—digital or otherwise—would do well to ponder.
During the Vietnam War the United States Department of Defense kept a database of all soldiers fighting in the war.  One would expect that the database contained such data elements as name, rank, serial number, place of residence, dates of enlistment, dates of service termination and perhaps such items as pay and place(s) of active service. The important data element for this story was the citizenship of each service person. One needs to keep in mind that someone could fight in the Vietnam War, in the United States Armed Forces, without being a citizen of this country.
When the war was over the US government passed a resolution that all individuals who had fought for this country would be declared American citizens.  This is where the story gets interesting. An appropriate official notified the National Archives and Records Administration (NARA), where the database resided, that it should go to the citizenship field of the database and change every nationality to "American." This was not a question of adding a new field, but of changing the information that had previously been recorded.
Any historian reading this should immediately be outraged. Changing the citizenship data field would cause a permanent loss of information that might have relevance for researchers now or in the future. The story went on to say that the savvy archivist who was given the job of changing the data field had been well trained and did not delete the original information, preserving, at least for this round, the citizenship of each individual who had fought in the war.
.02 Preserving Digital Information
The first moral that this fable suggests has to do with the integrity of information in historical databases: thou shalt not change historical records, even if someone in the government tells you to. This is a hard lesson because it requires that schools of archival studies teach their students to respect digital data, which is far more fragile than paper-based data where we can see erasures and additions. Furthermore, this moral requires that digital archives impress upon their employees that information should not be changed, no matter how easy the change or what the reason, even if told to do so by a government official.
This credo to never change digital information does surface the issue of inaccurate information. It appears to be human nature to "neaten up" items that have gotten messy—be they gardens or library shelves. Neatening up data is another matter. I can tell the last time I made changes in this document by looking at the time stamp that my word processing program provides. Similarly, date-time stamping of historically significant records is a way to check the last time information was added to or deleted from a record. If this time stamping were accompanied by notations of what was done, why, and under whose authority, historians years from now might have a chance of using historical electronic data that are accurate and well documented. My word processing program only retains the last time I worked on the document. For historic records it would make sense to catalog every time a document has been changed. This is something for policy makers and technology experts to work out. At the same time, it is an issue worth taking into account every time an historian uses electronic data as the justification for a particular historical argument.
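The kind of change log described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names, and the `amend` method are hypothetical, the sample data are invented, and nothing here reflects how NARA or any other archive actually stores its records. The point of the sketch is that an amendment appends a dated, attributed entry instead of overwriting the original value.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Change:
    """One amendment: what was done, when, why, and under whose authority."""
    field_name: str
    old_value: str
    new_value: str
    reason: str
    authority: str
    timestamp: datetime


@dataclass
class Record:
    """A record whose original values are never overwritten (hypothetical layout)."""
    original: dict
    history: list = field(default_factory=list)

    def amend(self, field_name, new_value, reason, authority, when=None):
        # Append a log entry; the `original` dictionary is left untouched.
        change = Change(
            field_name=field_name,
            old_value=self.current()[field_name],
            new_value=new_value,
            reason=reason,
            authority=authority,
            timestamp=when or datetime.now(timezone.utc),
        )
        self.history.append(change)

    def current(self):
        # Replay every amendment, in order, over a copy of the original.
        values = dict(self.original)
        for change in self.history:
            values[change.field_name] = change.new_value
        return values


# Invented example data, echoing the fable.
record = Record(original={"name": "J. Doe", "citizenship": "Canadian"})
record.amend(
    "citizenship",
    "American",
    reason="post-war resolution",
    authority="hypothetical official",
)
```

After the amendment, `record.current()` reports the new citizenship while `record.original` and `record.history` still show what was first recorded, who asked for the change, and why. A historian can therefore read the record either as it stands today or as it stood when it was created.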
.03 One Step Farther: Controlling Digital Metadata
"Metadata" are data about data.  In the case above, the metadata would be at the beginning of a file, or permanently attached to a collection of files, and give information about what data were collected when, by whom, for what purpose, and, if necessary, when the data were amended or changed in any way. In the case of electronic data—digital files—it would be important to know the software and hardware that supported a particular collection of records, and if the data had been migrated to more current software and hardware in the archival processes over some period of time. Someone could read the metadata for a particular data set, electronic or otherwise, and decide whether or not the materials were pertinent to her research without having to actually go to the documents. These are the kinds of materials that historians like to examine on the web. It makes it possible to check the relevance of historical materials without having to actually go to the archives or depository, saving both time and travel dollars in the process.
Thus metadata can save time and frustration for researchers. All historical research is fraught with the peril of having to use data—historical evidence—that were created by someone else, who is usually no longer around, and that are missing items the historian wishes had been included, or are organized in such a way that the researcher really wishes the creator had thought a little more about who might eventually find this information useful. This is the secondary use of data that were created for a primary, usually very different purpose. Martha Ballard kept a diary of her midwifery in order to account for the babies she had birthed.  Could we suppose that she ever thought that her diary would come to light years later, and be used for a book about midwifery and then a web site that allows individuals to learn about midwifery through interactive web pages?  Perhaps if she had had this thought, she would have been more methodical, or more complete in some of her entries. Perhaps she would have left out items that she thought too personal or mundane. When anthropologists find a body frozen at the top of one of the Andes Mountains, complete with hair style, clothing and tools, it is such a significant find because no one in the age of that unfortunate individual thought to preserve these artifacts for humanity. It was a fortuitous accident—or at least fortuitous for today's researchers.  Today it is probable that US presidents are aware that their writings will be used by historians of the future and that they act in specific ways in order to ensure that the record of their presidency will be available to researchers of the future. This is a wonderful example of thinking about the future and controlling data and metadata in such a way that important historical information will be preserved. 
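A metadata record of the kind discussed in this section might look something like the following Python sketch. It is offered only as an illustration: the field names, the `Migration` and `DatasetMetadata` classes, and the sample values are all assumptions made for this example, not a description of any archive's actual scheme or of an established metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Migration:
    """One move of the data to newer software or hardware."""
    date: str
    from_system: str
    to_system: str


@dataclass
class DatasetMetadata:
    """Data about a data set: who collected what, when, why, and on what system."""
    title: str
    collected_by: str
    collected_when: str
    purpose: str
    software: str
    hardware: str
    migrations: list = field(default_factory=list)

    def mentions(self, term):
        # Lets a researcher judge relevance without opening the data itself.
        text = f"{self.title} {self.purpose}".lower()
        return term.lower() in text


# Invented sample values for illustration only.
meta = DatasetMetadata(
    title="Service personnel records (invented example)",
    collected_by="hypothetical records office",
    collected_when="1965-1973",
    purpose="administration of pay and service history",
    software="hypothetical flat-file system",
    hardware="hypothetical mainframe",
)
meta.migrations.append(
    Migration("1995-06-01", "mainframe tape", "relational database")
)

# A header that could sit at the beginning of a file or travel with a collection.
header = json.dumps(asdict(meta), indent=2)
```

Because the header is plain text, it could be published on the web separately from the records themselves, which is what allows a researcher to check relevance before traveling to an archive.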
It seems apparent that the fable that started this column illustrates a case in which someone was not thinking about possible future uses of data. Census data are another plausible example of this phenomenon. There are US census data files that start in 1790 and continue every decade thereafter. The US Bureau of the Census keeps track of a wealth of information about residents in the United States. 
Every ten years there are nonetheless complaints from states and municipalities that there has been undercounting of residents. Illegal aliens tend not to register for fear of being deported. Homeless individuals are hard to track down. Common law couples with children will give inaccurate information so that unwed mothers can continue to receive federal and state aid. The power to conduct the census is set forth in the US Constitution.  The census is a method for determining how many representatives should be apportioned to each state in the union, based upon its population. Technically the US Census Bureau could just take a count of individuals and meet this constitutional requirement. Instead, over the years, census data have been collected on more and more items.  Additionally, there are long and short forms of the census, with randomly selected households receiving the long form. 
Overarching questions for the reader of this column are these: Who decides what information will be collected in the United States census every ten years? How are determinations made as to what to include and exclude? Are there historians or archivists who are involved in this process? Are there economists, political scientists or ethnographers on the team? How well are these individuals trained to do what they do? One might do well to think about the restraints of imagination, time, ability to record materials, or law that might influence which materials are collected for the census.
.04 A Challenge: Research into Creating Metadata
There are a variety of research projects that could come out of this discussion. All of these might lead to better data collection methods for any record keeping agency, and might lead to better history resources in the future. In keeping with the fable that started this column, these projects could also help individuals who deal with digital data to make sound, future-looking choices about how they collect and preserve information.
- One research project would propose to uncover who, at the federal and state levels, makes decisions about what information will be collected for each census (or any other materials that are systematically collected today). Who are these people? What are their backgrounds and training? How are they selected? How do they make these decisions? This would probably be a questionnaire, perhaps with follow up interviews to get a better sense of the decision-making process used by these individuals.
- A second project would make use of multiple historians of United States history. Again, I would expect that this would be a questionnaire with follow-up interviews. The main research question might be, "If you were in charge of census data collection (or any other systematically collected materials that the researcher can think of), what information do you believe we should collect today in order to inform researchers of the future?" This research question gets at the heart of our culture, politics, economics, and a wealth of other factors. Rather than waiting to see what information might be available for future research, this project would suggest that historians might control the kinds of information to which their colleagues of the future have access.
- Yet another project would require that a group of historians, archivists, and records administrators physically get together to discuss the issues of data and metadata. If a researcher were to posit to this group that a new, digital data collection effort were under way, say to create a national database of the heritage of each American family, what would this focus group of individuals say about the data, and metadata, for this project? What are the informational issues for doing this? The computing issues? The policy issues in terms of data integrity and accuracy? There are obvious privacy and access issues involved in this discussion. Could a group of experts create policy that would be better because they were all working toward a common goal of creating such a national database? Would researchers of the future be better off, and be able to create better historical analysis because of this "up front" work?
One wonders how many times historians have looked at data and asked themselves why the individual, group or organization maintained records in this format. Why could they not see the historical value of these materials, and why did they not pay more attention to it? These are critical questions for the present, if one hopes to allow the digital scholar of the future to make use of data that have not been altered or irrevocably lost due to lack of policy, or poor policy, on the part of those who control the retention of records. When all Vietnam War records become available for public scrutiny at least one historian or ethnographer will thank the nameless digital archivist who could look into the future and see the value of maintaining the records as they were originally produced.
Note: all URLs active as of 10 March 2003.
1. "Benchmark," American Heritage Dictionary, 4th ed., 2000.
2. Thanks to Seamus Ross, of HATII, for this story. I do not doubt its veracity, but, per the discussion of this column, would like some documentation to back it up before calling it the Truth.
3. See http://www.archives.gov/search/index.html for the search engine for the National Archives and Records Administration's (NARA) files. Upon entering "Vietnam War" on March 7, 2003, 10,280 results emerged. There is a lot of information on the Vietnam War.
4. Again, if any reader knows the citation for such a resolution, this writer would appreciate an email.
5. Looking at the NARA site (see note 3 above), it is apparent that although indexed through the site, the actual locations of the digital records are disparate. According to the story driving this column, the NARA did have the ability to access and change the database in question.
6. "Data about data. In data processing, meta-data is definitional data that provides information about or documentation of other data managed within an application or environment. For example, meta-data would document data about data elements or attributes, (name, size, data type, etc) and data about records or data structures (length, fields, columns, etc) and data about data (where it is located, how it is associated, ownership, etc.). Meta-data may include descriptive information about the context, quality and condition, or characteristics of the data." This definition and explanation are from the Free On-line Dictionary of Computing at http://wombat.doc.ic.ac.uk/foldoc/ .
7. Laurel Thatcher Ulrich, A Midwife's Tale: The Life of Martha Ballard, Based on Her Diary, 1785-1812, New York: Alfred A. Knopf, 1991.
8. See http://www.dohistory.org for the interactive site that contains Martha Ballard's diaries so that the user can explore historical interpretations of these texts.
9. National Geographic's online news site at http://news.nationalgeographic.com publishes these sorts of findings about ancient civilizations and interpretation of artifacts they left behind. The March 6, 2003 issue contained "Did Neanderthals Lack Smarts to Survive?" and specifically looked at fossils and artifacts to draw its conclusions.
10. One can look at a variety of examples of presidential libraries through their online faces. The Clinton library is at http://Clinton.archives.gov . The senior George Bush's library is at http://bushlibrary.tamu.edu , while Gerald Ford's library is at http://www.ford.utexas.edu/ . It is worth noting that while the National Archives and Records Administration has links to these websites, they have a variety of URL endings, indicating the location of the libraries in multiple organizational venues, governmental and educational among them.
11. See the US Census Bureau's website at http://www.census.gov . Their mission statement is "to be the preeminent collector and provider of timely, relevant, and quality data about the people and economy of the United States."
12. See http://www.census.gov/acsd/www/history.html for a history of the US census, its roots, and the changes that have taken place in its administration over the years, and http://www.census.gov/mso/www/centennial/works.htm for a more detailed explanation called "How the Census Works."
13. The 1790 census asked name, occupation, and date of birth. Each 10 years thereafter the data fields became more numerous.
14. A percentage of households is randomly selected by the US Census Bureau to participate in the long form, adding economic and demographic information for analysis.