I first asked myself that question three years ago, when the University of California Press began planning to publish books on the World Wide Web. I've been using ISO 12083 since early 1996, when the first UC Press book destined for both print and Web publication was launched into production. I'm still asking myself the question.

What It Is

ISO 12083 is a standard published jointly by the International Organization for Standardization (ISO), the National Information Standards Organization (NISO), and the American National Standards Institute (ANSI). It presents a subset of Standard Generalized Markup Language (SGML) that can be used for electronic markup of books, articles, serials, and mathematical texts. Only the markup for books is discussed in this article, but the same considerations apply to the others.

The 12083 standard for books is a Document Type Definition (DTD), that is, a set of tags and rules for marking up a book so that its structure and content can be read by a computer. The visual cues put in by the author (such as an indented first line to indicate a new paragraph, or an indented block to indicate a block quotation) are replaced by tags (such as <p> or <bq>). Those tags are then a permanent computer-readable addition to the text. When the text is displayed on a computer screen or printed on paper, the tags can be changed back to visual cues for the human reader. So the same text can be read by computers and, with appropriate software, by humans.

Of course, all word processors do something like that. But SGML, unlike a word processor, is not proprietary. An SGML-tagged document can be read by any computer using any text-processing software, not just by the software that was used to create the document. So the question certainly isn't, Should we be using SGML? For electronic archiving and publication of texts, we should; no question about it.

But what should we be using SGML for? It's excellent for creating computer-readable archives within an organization using a single in-house DTD. This is one of the ways UC Press uses SGML — to keep electronic copies of our books for future uses, such as revised editions, reprints of individual chapters, or CD-ROM publication.

However, we also want to publish our books electronically by mounting them on a Web server that is accessible to the general public. For that purpose, SGML has a serious disadvantage: it allows an unlimited number of DTDs, that is, sets of rules for constructing documents. So the software that can read the SGML documents has to be complex, and therefore expensive. The advantage of using a single, standard set of rules for all documents on the Web is obvious. ISO 12083 might have been a candidate, but it was beaten to the marketplace by an even simpler standard, HTML (HyperText Markup Language).

HTML is a form of SGML that allows only one DTD. It can be read by software such as the Netscape and Microsoft Web browsers, which are free. But HTML's set of rules is too simple to be of any use for fully marking up books. It just doesn't have enough tags for all the elements that a book can contain. Whereas an electronic book on the reader's screen needs a simple, widely accessible tagging system such as HTML, an electronic book in an archive or on a Web server needs an extensible, sophisticated tagging system such as SGML. XML is an attempt to meet these conflicting needs, but it doesn't really resolve the conflict. It's simpler than SGML, but it still allows a limitless number of DTDs, so it still needs complex software to read it.

Separate Tag Sets

A better solution might be just to use SGML for archiving and HTML for publication. The conversion from SGML to HTML needs complex software, but it can be done by the publisher. The SGML files can be translated into an HTML version that's then mounted on a server; or the SGML files can be mounted along with software such as DynaWeb, which converts SGML into HTML on the fly. Either way, the general reader needs only a Web browser to read the resulting HTML.

OK, so we use HTML for Web publication and SGML for in-house archiving and for mounting on a Web server. But how are the files going to be used? Books mounted on a server won't be just read or downloaded by readers. They need to be searchable, which means they need to be tagged in some standard way, and all or most servers need to use the same standard. For example, if all book titles are tagged <title>, it's easy to do a wide-ranging search for book titles that contain a given word. If they're not, it's not easy, probably not even possible. That's where ISO 12083 comes in. It offers a standard set of tags and a standard DTD for books. But it has drawbacks.
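
To make the search example concrete, here is a minimal sketch in Python. The sample files and the function name find_titles are hypothetical, and a regular expression stands in for a real SGML parser; the point is only that a shared <title> tag lets one query run across every book.

```python
import re

# Hypothetical sample files; both use the same <title> tag for book titles.
BOOKS = {
    "book1.sgm": "<book><title>Markup in Practice</title><p>Some text.</p></book>",
    "book2.sgm": "<book><title>A History of Printing</title><p>Markup is mentioned here.</p></book>",
}

# Matches the content of every <title> element (a regex, not a real SGML parser).
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def find_titles(books, word):
    """Return (filename, title) pairs whose <title> content contains word."""
    hits = []
    for name, text in sorted(books.items()):
        for title in TITLE_RE.findall(text):
            if word.lower() in title.lower():
                hits.append((name, title))
    return hits

# Only book1 matches: book2 mentions "Markup" in a paragraph, not in its title.
print(find_titles(BOOKS, "markup"))
```

If one of the servers tagged its titles with some other element name, the same query would silently miss those books.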

ISO 12083 in its present form is a Procrustean bed that few books fit into without distortion or mutilation. It is possible to modify the DTD, and the ISO gives rules for modifying it without departing from the standard. UC Press has modified the DTD considerably while staying within the ISO's guidelines (see the appendix to this article); but extensive modification, even though it's permitted, means that some of the value of having a standard is lost. Fortunately, it would be fairly easy to revise the 12083 DTD to fit a wider range of books, and a revision is in fact under way.

For publishers whose books appear in print as well as electronically, ISO 12083 has another drawback. The 12083 DTD defines a document in terms of nested elements: A book contains chapters, which contain sections and subsections, which contain paragraphs, which may contain subelements such as block quotations. So an element is defined not only by its immediate tag but also by its context. For example, a <p> may need to be formatted differently depending on whether it's a general text paragraph or a paragraph in a block quotation. Because of this nesting of elements, only high-end typesetting software can read 12083-tagged files, and that software requires a highly skilled operator. We need a set of tags that can be handled by desktop typesetting software such as Quark XPress.

More Tag Sets

So there's a conflict between the needs of composition tags and of archive tags. Once again, separate tag sets seem like a good solution: ISO 12083 for archiving, and another set for composition. With that in mind, UC Press is developing a set of composition tags that are not nested at all. Every paragraph has a tag that gives all the information needed for formatting that paragraph, without any need to look at previous tags. For example, a general text paragraph is tagged <p>, and a block quotation paragraph is tagged <bqp>. The typesetting software can then read the tagged paragraphs one after the other in a linear sequence, formatting each paragraph as it goes. The trick is to base the linear composition tags on the nested ISO 12083 tags, so that one set can be converted to the other automatically.
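
A rough sketch of that nested-to-linear conversion, in Python, might look like the following. It is not UC Press's converter: the tags <p>, <bq>, and <bqp> come from the text above, but the mapping table, the function name flatten, and the sample are assumptions, and the input is assumed to be well formed, with explicit end tags and no inline elements inside paragraphs.

```python
import re

# Assumed mapping from the enclosing element to a linear paragraph tag.
# <p>, <bq>, and <bqp> come from the article; the scheme itself is hypothetical.
LINEAR_TAG = {"bq": "bqp"}

# A token is a start tag, an end tag, or a run of text.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")

def flatten(nested):
    """Convert nested paragraph markup into a list of (tag, text) pairs."""
    stack = []        # names of the currently open elements
    paragraphs = []
    for match in TOKEN_RE.finditer(nested):
        closing, name, text = match.groups()
        if text is not None:
            # Text inside a <p>: the tag depends on what encloses the <p>.
            if stack and stack[-1] == "p" and text.strip():
                parent = stack[-2] if len(stack) > 1 else None
                paragraphs.append((LINEAR_TAG.get(parent, "p"), text.strip()))
        elif closing:
            stack.pop()
        else:
            stack.append(name)
    return paragraphs

SAMPLE = (
    "<chapter><p>First paragraph.</p>"
    "<bq><p>Quoted paragraph.</p></bq>"
    "<p>Last paragraph.</p></chapter>"
)

for tag, text in flatten(SAMPLE):
    print(f"<{tag}>{text}")
```

Each output line carries its own tag, so typesetting software can format it without knowing what came before.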

What tags to use during editing presents another problem. SGML-editor software is available, but it's . . . yes, complex and expensive. UC Press's copy editing is done by freelancers, and we can't afford to buy expensive editing software for all of them. On the other hand, we don't want to give them the files with the heavy tagging that we send to the compositor. Nor do we want to give them completely untagged files, because we want the editors to check the tags after we've put them in. So we use still another set of tags for editing. The editing tags are very similar to the typographic codes that copy editors have always written in the margins of paper manuscripts: <cn> for a chapter number, <ct> for a chapter title, and so on.

The Process

Let me summarize the whole electronic editing-production-publication process that's being developed at UC Press. We get electronic manuscripts from our authors in various word-processing programs. The files come to us formatted but not tagged in a way that we can use.

Editing tags. We put in UC Press's editing tags in-house, before the manuscript goes to the copy editor. We have developed tagging software that maps our tags to the author's formatting, then inserts the tags automatically. The copy editors' word-processing software doesn't do anything with the tags; they are just there in the files, like the editor's markup in a paper manuscript. One of the copy editor's tasks is to make sure all the elements are correctly tagged, typing in any corrections that may be needed. The editing-tag set has to be comfortable for copy editors, and it has to be automatically convertible into the composition-tag set. The publishing industry's traditional markup codes, with a little modification, meet those requirements very well.
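
The mapping step can be sketched in a few lines of Python. This is an illustration, not the Press's tagging software: <cn> and <ct> are the editing tags mentioned earlier, but the word-processor style names, the other tags, the "??" placeholder, and the function name insert_editing_tags are all assumptions.

```python
# Hypothetical mapping from an author's word-processor style names to
# editing tags. <cn> and <ct> appear in the article; the style names and
# the other tags are illustrative assumptions.
STYLE_TO_TAG = {
    "Chapter Number": "cn",
    "Chapter Title": "ct",
    "Body Text": "p",
    "Block Quote": "bq",
}
UNKNOWN_TAG = "??"  # assumed placeholder for the copy editor to resolve by hand

def insert_editing_tags(paragraphs):
    """Prefix each (style, text) paragraph with its editing tag."""
    tagged = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, UNKNOWN_TAG)
        tagged.append(f"<{tag}>{text}")
    return tagged

MANUSCRIPT = [
    ("Chapter Number", "1"),
    ("Chapter Title", "Beginnings"),
    ("Body Text", "It was a quiet year."),
    ("Sidebar", "An aside."),  # a style with no mapping
]

for line in insert_editing_tags(MANUSCRIPT):
    print(line)
```

Anything the mapping cannot place is flagged rather than guessed, which leaves the decision to the copy editor.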

Composition tags. After copyediting, we replace the editing tags with composition tags. We also take out all formatting, such as italic, and replace it with tags. Those composition tags serve as Quark XPress-style tags; that is, the formats specified by the book designer are mapped to the tags, and the Quark software then applies those formats to the tagged paragraphs. The composition tags are hidden during typesetting. When we get the Quark files back from the compositor, we can make the tags visible again, and we can then convert them into our archiving tags. The composition-tag set has to be comprehensive, it has to be linear rather than nested, and it has to be automatically convertible into the archiving-tag set. I don't know of any existing set of tags that meets these requirements; UC Press is developing one.
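
Going from linear composition tags to nested archive tags could be sketched like this in Python. The wrapper table and the function name nest are assumptions. The sketch also assumes that consecutive <bqp> paragraphs belong to one block quotation, which would wrongly merge two quotations that happen to be adjacent.

```python
from itertools import groupby

# Assumed reverse mapping: linear tag -> enclosing archive element
# (None means the paragraph needs no wrapper).
WRAPPER = {"p": None, "bqp": "bq"}

def nest(paragraphs):
    """Rebuild nested markup from a linear list of (tag, text) pairs."""
    parts = []
    # Group runs of paragraphs that share the same linear tag.
    for tag, group in groupby(paragraphs, key=lambda pair: pair[0]):
        body = "".join(f"<p>{text}</p>" for _, text in group)
        wrapper = WRAPPER[tag]
        parts.append(f"<{wrapper}>{body}</{wrapper}>" if wrapper else body)
    return "".join(parts)

LINEAR = [
    ("p", "First."),
    ("bqp", "Quoted one."),
    ("bqp", "Quoted two."),
    ("p", "Last."),
]

print(nest(LINEAR))
```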

Archive tags. After composition, we replace the composition tags with tags for our in-house archives and for our Web server archives. The books on the Web server need to be divisible into parts, chapters, sections, and other subdivisions, because people won't necessarily want to access the whole book; and the books need to be searchable, because people will want to check whether the book contains the information they need. This means that the archive-tag set has to be fully structured (that is, nested), so that the books can easily be broken up into useful chunks; and it has to use a standard, widely recognizable set of element names for its tags, so that a search can be limited to selected parts of those books. ISO 12083 meets those two requirements. The tag set also has to be comprehensive, so that every element of every book can be properly tagged. At present, ISO 12083 doesn't meet that third requirement without a lot of modification.
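
Here is a minimal Python sketch of why nesting helps with both chunking and restricted searching. The sample book, the element names, and the function names are assumptions, and Python's XML parser stands in for an SGML parser, which works only because the sample has explicit end tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified book; not the full ISO 12083 structure.
SAMPLE = (
    "<book>"
    "<chapter><title>Origins</title><p>Early presses.</p></chapter>"
    "<chapter><title>Markup</title><p>Tags and rules.</p></chapter>"
    "</book>"
)

def split_chapters(source):
    """Return a list of (chapter title, serialized chapter) pairs."""
    root = ET.fromstring(source)
    chunks = []
    for chapter in root.iter("chapter"):
        title = chapter.findtext("title", default="")
        chunks.append((title, ET.tostring(chapter, encoding="unicode")))
    return chunks

def chapters_containing(source, word):
    """Return the titles of chapters whose paragraphs mention word."""
    root = ET.fromstring(source)
    return [
        chapter.findtext("title", default="")
        for chapter in root.iter("chapter")
        if any(word.lower() in (p.text or "").lower() for p in chapter.iter("p"))
    ]

for title, chunk in split_chapters(SAMPLE):
    print(title, len(chunk))
print(chapters_containing(SAMPLE, "tags"))
```

Because each chapter is a self-contained element, it can be delivered on its own or searched without touching the rest of the book.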

Publication tags. When books are published on the Web, archiving and publication overlap. The files are stored on a server (a kind of archive), from where they're downloaded to users' computers (a form of publication). The files that the user receives need to have a set of publication tags that are easily converted into formats on the user's screen. (The files that are mounted on the server can have either a publication set of tags or an archiving set that is converted into the publication set as needed.) The publication-tag set has to be simple and has to use standard element names, so that it can be read by any Web browser. HTML meets those requirements.
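
A conversion from archive tags to publication tags might be sketched as follows in Python. It is not how DynaWeb or any other product works: the mapping table, the fallback to span for unknown elements, and the function name to_html are assumptions, and the XML parser again stands in for an SGML parser.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from archive element names to HTML element names.
HTML_TAG = {"chapter": "div", "title": "h1", "p": "p", "bq": "blockquote"}

def to_html(element):
    """Recursively render an archive element as an HTML string."""
    tag = HTML_TAG.get(element.tag, "span")  # assumed fallback for unmapped elements
    inner = escape(element.text or "")
    for child in element:
        # Render the child, then any text that follows it inside this element.
        inner += to_html(child) + escape(child.tail or "")
    return f"<{tag}>{inner}</{tag}>"

ARCHIVE = (
    "<chapter><title>Markup</title>"
    "<p>Tags &amp; rules.</p>"
    "<bq><p>A quotation.</p></bq></chapter>"
)

print(to_html(ET.fromstring(ARCHIVE)))
```

The same function could run once, to produce static HTML files, or on each request, which are the two mounting options described above.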

An Answer

I think I can now offer an answer to my question. Should we be using ISO 12083? For archiving and for Web publication, Yes, but it needs a lot of modifying to be really useful. For the other stages in the publishing process, No. What electronic publishers need, in fact, is not one set of tags but four sets that fit together. If ISO 12083 offered that, we'd all be using it.



Tony Hicks has worked in publishing for 14 years. He's been with the University of California Press since 1992, first as a project editor and more recently as a programmer/analyst. He's currently responsible for developing an integrated electronic editing-production-publication system for UC Press.


Appendix: Modifications to the ISO 12083 DTD

Since early 1996, the University of California Press has been marking up some of its books in SGML, using the ISO 12083 DTD. It has been found necessary to modify the DTD extensively. The modifications were done within the constraints specified by the ISO. Some of them might usefully be incorporated in the next version of the DTD.

What follows is a summary of the main areas in which the original DTD has been modified (see the UC DTD [formerly http://www.ucpress.edu/scan/epub/ucp_dtd.html]). For more information, send an e-mail message to tony.hicks@ucop.edu.

Front Matter

Considerable modifications were needed in the preliminary pages, particularly for the copyright page (PUBFRONT).

It is impossible to foresee all the front-matter sections that may need to be included (e.g., Note on Transliteration). Instead, it was found useful to define a general front-matter section element, FMSEC, which could then be identified by an ID attribute (e.g., <FMSEC ID="NTRANS">).

It was also found useful to define a general front-matter list element, FMLIST, for lists of tables and illustrations.

Chapters

Additional elements are needed in the chapter opening display (e.g., subtitle).

In the printed book, the notes may need to be a section or subsection within the chapter.

The element POEM needed considerable modification, and a new element DIALOG needed to be defined.

Divisions Other than Chapters

Books in the humanities may have divisions other than chapters (e.g., acts and scenes in a play). It would be desirable to define a general element, DIV, that could then be identified with an ID attribute (e.g., <DIV ID="I.ii"> for act I, scene ii of a play).

Model Group "m.pseq"

The model group "m.pseq" defines the content of several quite different types of elements: BQ; ITEM and DD; FOOTNOTE and NOTE; and TSTUB and CELL. It would be preferable to define each type in a separate model group, because each type has a different structure.

A poetry quotation should preferably not require a P as its first subelement. A poetry quotation should not be required to contain the element POEM, because generally it is only poemlines that are being quoted, not an entire poem.

A list should preferably not require a P as its first subelement.

A table cell should preferably not require a P as its first subelement. A table cell should be able to include untagged data (e.g., a number).

Emphasis

For emphasis types, it's desirable to specify the reason for the formatting rather than the formatting itself. The following emphasis types have been found useful:

Italic: frnlang = foreign language; name = name (e.g., of a ship); wdaswd = word as word

Small caps: datetime = date (such as B.C.) or time (such as P.M.)
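
A short Python sketch shows why the reason is worth recording. The emphasis types and their formats are the ones listed above; the function names format_for and types_for are hypothetical.

```python
# Emphasis types and their print formatting, as listed above.
EMPHASIS_FORMAT = {
    "frnlang": "italic",       # foreign language
    "name": "italic",          # name (e.g., of a ship)
    "wdaswd": "italic",        # word as word
    "datetime": "small caps",  # date (such as B.C.) or time (such as P.M.)
}

def format_for(emphasis_type):
    """Look up the formatting for a given reason; always unambiguous."""
    return EMPHASIS_FORMAT[emphasis_type]

def types_for(fmt):
    """List every reason that produces a given format; may be ambiguous."""
    return sorted(t for t, f in EMPHASIS_FORMAT.items() if f == fmt)

print(format_for("wdaswd"))
print(types_for("italic"))
```

Formatting can always be derived from the reason, but three different reasons all produce italic, so a file tagged only as italic cannot be converted back.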

It was found useful to tag cited titles as T (title), with an attribute specifying the type of title, e.g., <T TYPE="book">, <T TYPE="article">, etc.

Back Matter

Some books have back-matter sections other than those specified in the ISO DTD. Examples: acknowledgments, afterword, list of contributors. It's desirable to have a general "bmsec" element that can be identified with an ID attribute as needed.

In the printed version of the book, the notes usually constitute a separate back-matter section.

Bibliography

Many modifications were found necessary in the tagging of citations. In particular, it is essential to be able to list citations with the author's name first rather than the title.

Illustrations

It is desirable to treat an illustration and its caption as a unit that can be inserted at an appropriate point in the text. For this, a new element ILLGRP (illustration group) is needed, with an ID attribute so that it can be referenced from the text. It needs to contain the empty elements PLATE and MAP in addition to FIG. It also needs to contain the elements CAPTION (a text element) and ILLUSTR (an image element, e.g., EPS or TIFF).

Universal Attributes

It would be desirable to give all elements the attributes TYPE and ID. The TYPE attribute would give flexibility by permitting similar elements to be distinguished (e.g., <TITLE TYPE="book">, <TITLE TYPE="article">, etc.) without having to create completely new element names. The ID attribute would allow specific instances of elements to be identified for typesetting, formatting, or cross-reference purposes.

Character Entities

The ISO lists of character entities referenced in the 12083 DTD have many gaps, and it has been found necessary to add a supplementary list. The Unicode character set might be a better alternative.

SGML Declaration

It was found necessary to increase the maximum value of LITLEN from 240 (in the reference concrete syntax) to 480. This change makes it possible to increase the size of some model groups so as to add the necessary number of new elements.