/ Using Formal Reference to Enhance Authority and Integrity in Online Mathematical Texts

This paper was refereed by the Journal of Electronic Publishing's peer reviewers.

The amount and variety of digital information readily available to the public has become one of the defining features of the intellectual and scientific landscape. This information does not stand in isolation, but instead resides in a community of artifacts, and often the authority and integrity of one article is dependent on that of its ancestors.

Mathematical texts are prime examples of the importance of that dependency. Few mathematical texts are entirely self-contained; they "quote" other sources via the definitions and expressions that they utilize. The very meaning of a single mathematical formula can change depending on the definitions of its components, which may vary from source to source. When considering mathematical proofs, the hierarchy of dependencies can in fact be quite large, and difficult to follow.

Unlike prose quotes, mistaken quotes in mathematics are often hard to identify. A paper that quotes Hamlet as saying "To sing or not to sing," would be immediately dismissed as flawed, if not mendacious. Instead, a proof citing a substitution rule on the wrong branch in the proof, or even a formula referring to variable i instead of j, results not only in errors that easily go unnoticed, but also allows the formula to take on a new, and basically, wrong meaning.

Other sources of difficulty or doubt stem from the abundance of freely available information for which the quality and authority is unknown or difficult to assess. It would be inspiring if in certain collections all of the facts were correct, all of the citations proper, all of the quotes exact, the definitions right, and the arguments logically correct. It would be inspiring if these collections' services helped readers better understand evidence for the validity of such collections.

In the interest of addressing these challenges, the authors present new work with two main goals: (1) to minimize, or eliminate the errors introduced when referencing or quoting other sources and (2) to simplify the tracking of these sources so that a reader can trace the foundations or basis for a given claim. To accomplish these goals, we have interfaced an authoring tool with a digital library of formally verified references such as definitions and proofs to which articles can point and from which mathematical text can be automatically extracted. In all cases, links to the formal sources are maintained. That way, each quote would have its own verification of accuracy because it is created from the authoritative source automatically. No scribe with quill pen or copyist with a keyboard can introduce errors, and authority is ensured by virtue of the articles' creation, rather than post-filtering.

Indeed, this approach can provide additional assurances. The large digital library which acts as our database has been verified with the highest standards of correctness known, using computer-checked formal proofs. This formal library provides not only a path through the hierarchy of mathematical dependencies, but also an ability to expose the proof or mathematical reasoning at a given step, making both error tracking and in-depth exploration for the purpose of learning accessible from the articles.

In what follows, we introduce new concepts that are central to our methodology, and describe our publication system using mathematical examples. While our implementation works with a mathematics database, we believe our methods generalize well beyond this domain. In general, they apply to any data for which the origin and meaning of quotes or claims made are of importance. In this article the examples act as a proof of concept, and are not included for mathematical understanding, but rather to show the capabilities of our authoring tool.

Formal Reference

We say that a particular piece of text within a document, such as a mathematical formula, formally references those objects and data from which that piece of the document was mechanically derived. Hence, there are computer-verified facts relating documents to the contents of objects they formally reference. In contrast, manual transcription of objects into a document, as is traditionally done with article citations, would not produce a formal reference.

Perhaps the simplest example of formal reference would be quoting from a referenced text, i.e., simply excerpting without modification. An automated quotation mechanism could allow the author to stipulate what part of the referenced text to quote, and the computer would copy the quotation, thus preventing error. Not only might this save the author some trouble — which in the case of "typesetting" mathematical formulas can be quite tedious and error prone — but the reader who knows that the quotation was computer-built does not have to establish its accuracy anew by comparing the quote and the source. Formal reference is a computer-verified bond between the contents of document and source.

Semantically Anchored Texts

If part of a document that formally references some data also provides a computationally usable path to that data, we say it is semantically anchored. Anchoring provides various benefits beyond unanchored formal reference. First, the anchors may lead to a world of data that is highly elaborated independently of the documents that reference it. Second, those formally referenced structures can be analyzed by computer and provide the basis for a computer checkable criterion for common reference, e.g., two documents quoting the same base sources in their claims; this gives a definite sense to "being about the same thing." Third, new documents can be created that intentionally refer to the same objects and concepts used in already existing documents by following the anchors to the formal referents (Allen & Constable, 2005). Fourth, the computer-verified facts relating documents to the contents of objects they formally reference can be scrutinized and independently reverified.

In short, those places in a document that formally reference other objects or texts are semantically anchored to those referents when there is a computationally effective path from those places in the document to the sources from which they were derived.

Creating Semantically Anchored Documents

The production of anchored documents depends on the medium intended for them, the editing tools and source-text formats the author prefers to work with, and the kind of access the author has to the underlying repository. The important processes are

  1. the repository or database of verified claims that can be referenced,
  2. a Web-publication service for the repository, especially the generation of formally referencing subtexts, and
  3. tool-assisted editing by authors while interacting with the repository, the Web, and a conventional text editor.

Repositories of Formal Content

In our prototype, the domain of our repository is formal mathematics. Mathematical content formalized for computer processes such as verification, represents content or internal abstract structure completely and unambiguously. These repositories are accessible for mechanical elaboration and one can find dependencies between meaningful formal texts such as proofs, definitions and programs can be identified. Our configuration uses the Cornell Formal Digital Library (FDL) (Allen et. al., 2002, Allen & Constable, 2005). We used this repository because it has plenty of content, we believe it's a good example of a repository design for formal mathematics, and we have experience using it. We expect the increasing widespread use and availability of other such repositories for formally developed content. HELM, C-CoRN, and Mizar are others, for example.

Web Publication Services

Our main purpose for this effort has been to find ways to make it practical for authors to create documents integrated formally with texts they reference and to make their creations widely accessible to readers. We have assumed the Web will be the principal medium for readers to access authored documents and the formal materials they refer to.

Our prototype methods include a Web-publication[1] service that creates Web pages representing repository objects, including authored documents referring to the repository. Our prototype uses LaTeX and PDF as the basic data formats because LaTeX is the current standard for documents with special mathematical notations and from LaTeX one can produce PDF pages that link to other Web pages. The authored documents can also be put in standard data formats suitable for publication in paper journals. Other kinds of texts in the FDL repository that expository documents can reference, such as mathematical definitions and theorems or programs, are independently projected onto the Web using methods considered appropriate for them by the publication's service providers.

A further element of the Web-publication service for expository documents is the publication of auxiliary data derived from the authored document such as a digest of what materials it formally references. A structured representation of the mathematical expressions used in the document would facilitate search for the document based upon structure and content of the mathematical expressions as opposed to their appearance as character configurations.

When a new document is projected onto the Web, authors must decide which Web addresses to link to. The Web-publication basis is a "consistent" collection of data referring to all the repository sources and defining how to render final document texts from those resources, and how to locate them on the World Wide Web. We assume that initially the author will want to link to the same pages he or she is reading during document preparation. When the document is eventually regenerated as part of a new edition, a new document will be created by the publication service that has a newer Web-publication basis.

Editing Example

In our methodology, authors do not cut and paste math into documents. Instead, they indicate sources for math text to be inserted by a computer. Our prototype configuration supports LaTeX as the source language for the author to edit. Our tools for creating documents are most fully integrated for users of the Emacs editor, although the various components could be used with some author overhead with any string-text editor. The core element for generating text for formal references in a document is a process for converting abstract mathematical structures in the repository to LaTeX code; we call it the Dynamic Math Formatter (DMF).

The LaTeX commands embedded in author sources can be used to refer to repository objects and to parts of the content of such objects, and can describe how to arrange those contents in the document when formatting them. For example, an author could specify a whole definition, the left hand side of a definition, a whole theorem statement, or a component of a complex statement. Normally, the DMF command will be considerably simpler than the LaTeX code inserted in the intermediate file by preprocessing. It prevents transcription errors, and so provides a useful editing service to the author. But the more profound aspect of this functionality is that it is possible for readers to ascertain that such transcriptions from referenced texts were performed automatically, and neither error nor deception by the author has occurred at these points in the document. Variations of the document can be automatically generated by the publication service, making very plain where these certified transcriptions have occurred.

Here we present examples of a few of those commands and the formulas automatically laid out by them. These are from the document of which Figure 1 is an excerpt. It is not important to understand the math, just to observe the flexibility of the commands: not only are objects inserted automatically, but new objects can be constructed from parts of existing objects, and layouts can be controlled directly.

  • The command \DMFtyping{eval_factorization_wf}{f} stipulates that the typing assumed for variable f in the named theorem
    should be formatted as:
  • Use of \DMFob[Def]{prime_factorization_of} in the author source indicates that the content of the object named prime factorization of should be laid out as follows and juxtaposed to a link (in PDF ) labeled "Def " to the object itself.
  • The author could insert the left-hand side of the a definition named eval_factorization into the document with the command, \DMFlhs{eval_factorization}.
  • An author can assemble new expressions (which are not necessarily in the repository) from elements of other DMF commands to describe a new expression. For example,

    \DMFitermul{i}{\DMFvar{e}{\DMFvar{i}}}{\DMFvar{a}}{\DMFvar{b}} is

    formatted to produce

The DMF renders pure structured data of the FDL as printable text; it is principally applied to mathematical expressions, but is actually more general than that, and is also used for formatting program text and sometimes for ordinary English text. The DMF is designed to be used in an environment where abstractly structured objects are stored independently of how they are to be rendered for the eye. One then supplies a collection of display templates that specify how various types of structured data are to be displayed. The DMF then applies these displays to the abstractly structured data to produce characters or symbols or instructions in another rendering format such as LaTeX or HTML. This is analogous to converting a pure content mathml expression to presentation MathML.

Given our design choices for the repository, Web, and DMF, here is the scenario we implemented. The author writes a LaTeX document using an Emacs text editor that has been enhanced to access a repository, the FDL. The author edits in tandem with a Web browser that is used to view the repository contents as well as the output document in PDF as it is developed. In the LaTeX source text being prepared, at those places where the author wants the PDF text to formally reference a mathematical object, the author can insert a link to the object and, if desired, further indications as to how the referencing text is to be generated from the referenced data. As often as desired, the author may have a fresh copy of the PFG output text generated for viewing. The PDF output text will contain the automatically generated text for formally referenced objects and links to the Web addresses representing the formally referenced objects.

Figure 1 is an excerpt from a document about the Fundamental Theorem of Arithmetic, produced using our authoring tool, exemplifying formal reference. Each formula throughout the sample text was derived automatically from a repository object, and so formally references that object. None of these formulas anywhere in the sample text was typeset explicitly by the author. Instead, the code for formulas was generated automatically based on references to contents of a repository of computer checked mathematics. The figure is an image of PDF text that includes links to sources for the formulas.

Figure 1: Sample of Semantically Anchored Text
Figure 1: Sample of Semantically Anchored Text

Note, four of the formulas automatically formatted in Figure 1, are semantically anchored with links variously labeled "Def", "Thm" or "FTA". If we had submitted the actual document instead of an image snapshot of it for the purposes of our present publication, and you were reading this article online, you could follow the links to Web presentations of the formal mathematics these anchored formulas were derived from, where you would find more detail and various derived information. In this case, the four links, Def, Thm, Def and FTA would lead you to the appropriate math.

Related Work

Our editing tool is unique because it generates correct-by-construction documents, at least with respect to quotes or, in our case, referenced mathematics. The importance of tools that facilitate expression of and access to mathematics, however, is witnessed by several significant accomplishments in the mathematical editing and publishing domains that we wish to note. Design Science and Integre, for example, provide editing capabilities for mathematical content and presentation on the Web using MathML. We have experimented with Integre's software, and believe that making MathML as an additional option along with LaTeX , would be a valuable endeavor. In the domain of formal mathematics, OMDoc for Open Mathematical Documents, (Kohlhase, 2000) specifies a format specifically for formal mathematics, or mathematical proof. OMDoc extends XML to provide structured content-markup for formal mathematical objects, including allowance for informal text. While OMDoc may provide an additional representation format for users of our tools in the future, we wished to work in language already familiar to many mathematicians and scientific researchers who refer to math. It should be noted that use of the LaTeX format does not require that authors use standard text editors; TeXmacs, for example, is a WYSIWYG editor based upon TeX. Furthermore, the methodology we have developed for the FDL would be immediately useful, once adapted, to any of the large collections of formalized mathematics developed by internationally established groups of specialists (Kaufman & Moore, 1997; Bertot et. al., 1997; Paulson, 1994; Gordan & Melham, 1993; Harrison, 1996; Allen et. al. 2005; Rudnicki, 1992; Hickey et. al., 2003; Benzmuller et. al., 1997; Sorge, 2001; Shankar, et. al. 1999; Schurmann, et. al., n.d.; Wolfram, 1988; Geddes, Czapor, & Labahn, 1992) increasingly being made available on the Web.


We have presented one new method for ensuring correctness in quotes or claims in a document at the time of document preparation. The semantically anchored texts that we produce contain formally-grounded explanations; by this we mean an explanation that ultimately can be reduced to a readable machine-checked proof in a formal logic. They serve as exceptionally thorough models of fundamental elements of education, namely, explanation and evidence. Whether we are talking about an English essay, a chemistry experiment, a legal argument, or a mathematical demonstration, students are taught to answer the question "How do you know?" They are taught to give evidence and to say what statements follow from others. This ability to provide evidence and evaluate arguments is critical to a liberal arts education or an engineering one. Hence, the interface between the document and the verified repository not only ensures correctness and eliminates error by construction, but also gives depth to the article, from the inserted math to its very foundations.


    1. We use the term "publication" here with some reservation; we use it to mean making texts available to the public, without necessarily alluding to the other obligations that may be associated with "publication."return to text


    The authors thank Richard Eaton, lead programmer of the FDL for his valued work and input. This work was supported by the National Science Foundation, NSDL Grant # 0333526 entitled "Enabling Large-Scale Coherency among Mathematical Texts in the NSDL." We also are grateful for Judith Axler Turner's suggestions and improvements to the article.


    Allen, S., & Constable, R. (2005). Enabling large scale coherency among mathematical texts. Cornell University Tech Report. TR2006-2014, http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2006-2014.

    Allen, S., Bickford, M., Constable, R., Eaton, R., Kreitz, C., & Lorigo, L. (2002). FDL: A prototype formal digital library. PostScript document on website, May 2002. Available online at http://www.nuprl.org/html/FDLProject/02cucs-fdl.html.

    Benzmuller, C., Cheikhrouhou, L., Fehrer, D., Fiedler, A., Huang, X., Kerber, M., Kohlhase, M., Konrad, K., Melis, E., Meier, A., Schaarschmidt, W., Siekmann, J., & Sorge, V. (1997). Ωmega: Towards a mathematical assistant. In William McCune, editor, Proc. of the 14th International Conference on Automated Deduction, volume 1249 of Lecture Notes in Artificial Intelligence. Springer, July 1997.

    Bertot, J., Bertot, Y., Coscoy, Y., Goguen, H., & Montagnac, F. (1997). User Guide to the CTCOQ Proof Environment. INRIA, Sophia Antipolis, February 1997. System Revision 1.22; Documentation Revision 1.31.

    Geddes, K. O., Czapor, S. R., & Labahn, G. (1992) Algorithms for Computer Algebra. Kluwer, Boston, 1992.

    Harrison, J. (1996). HOLLight: A tutorial introduction. In Formal Methods in Computer Aided Design (FMCAD'96), volume 1166 of Lecture Notes in Computer Science, pages 265-269. Springer, 1996.

    Hickey, J., Nogin, A., Constable, R. L., Aydemir, B. E., Barzilay, E., Bryukhov, Y., Eaton, R., Granicz, A., Kopylov, A., Kreitz, C., Krupski, V. N., Lorigo, L., Schmitt, S., Witty, C., & Yu, X. (2003). MetaPRL — A modular logical environment. In David Basin and Burkhart Wolff, editors, Proc. of the 16th International Conference on Theorem Proving in Higher Order Logics (TPHOLs 2003), volume 2758 of Lecture Notes in Computer Science, pages 287-303. Springer-Verlag, 2003.

    Kaufmann, M., & Moore, J. (1997). An industrial strength theorem prover for a logic based on Common Lisp. IEEE Transactions on Software Engineering, 23(4):203-213, April 1997. [doi: 10.1109/32.588534]

    Kohlhase, M. (2000). OMDoc: Towards an Internet Standard for the Administration, Distribution and Teaching of mathematical Knowledge. In Proceedings of Artificial Intelligence and Symbolic Computation, Springer, Lecture Notes in Artificial Intelligence, 2000.

    Kohlhase, M. (2000). OMDoc: An Infrastructure for OpenMath Content Dictionary Information. Bulletin of the ACM Special Interest Group for Algorithmic Mathematics, SIGSAM, 2000.

    Rudnicki, P. (1992). An Overview of the Mizar Project, Proceedings of the 1992 Workshop on Types for Proofs and Programs, Chalmers University of Technology, Bastad, 1992.

    Schurmann, Pfenning, & Shankar. Logosphere. A Formal Digital Library. Logosphere homepage: http://www.logosphere.org.

    Shankar, N., Owre, S., Rushby, J. M., & Stringer-Calvert, D. W. J. (1999). PVS Prover Guide. SRI International, Menlo Park, CA, September 1999.

    Sorge, V. (2001). Ω-ANTS: A Blackboard Architecture for the Integration of Reasoning Techniques into Proof Planning. PhD thesis, Saarland University, Saarbruken, Germany, November 2001.

    Trybulec, A. (1984). On a system of computer-aided instruction of logic. Bulletin of the Section of Logic PAS, 12, Nr.4, 1984.

    Wolfram, S. (1988). Mathematica: A System for Doing Mathematics by Computer. Addison Wesley, 1988.