Abstract

HTML represents the worst of two worlds. We could have taken a formatting language and added hypertext anchors so that users had beautifully designed documents on their desktops. We could have developed a powerful document structure language so that browsers could automatically do intelligent things with Web documents. What we have got with HTML is ugly documents without formatting or structural information. I show that a standard modern novel cannot be rendered readable even in HTML level 3. I propose a document- and author-centered way of determining the simplest enhancements to HTML sufficient to capture the intent of the authors. I review Tom Malone's mid-1980s work on semistructured messages, which shows us how to add structure without sacrificing flexibility and generality. I describe how to add structure tags without breaking current Web browsers and HTTP servers. Finally, I discuss useful ideas that we can take from the KQML agent-communication language.

Introduction

"Owing to the neglect of our defences and the mishandling of the German problem in the last five years, we seem to be very near the bleak choice between War and Shame. My feeling is that we shall choose Shame, and then have War thrown in a little later, on even more adverse terms than at present."

Winston Churchill in a letter to Lord Moyne, 1938 [Gilbert 1991]

Surfing the Web, you find an announcement for a conference. You'd like to click the mouse and have three entries made automatically in your electronic calendar: the abstract deadline, the paper deadline, and the conference itself. Unless your computer understands natural language, this will never happen because there is no way for the author of the conference announcement to encode sufficient structural information in HTML.

You are in a library reading the Bible, a Harold Robbins novel, Hamlet, The Shipping News, TIME magazine, and Encyclopaedia Britannica. They all look different. They all look good thanks to the $millions invested in graphic design on the part of the publishers of the respective items. You are surfing the Web reading Travels with Samantha, the Bible, Werdna's Humor Archive [formerly http://www.ugcs.caltech.edu/~werdna/humor.html], and The Temptation of Saint Anthony [formerly http://www.cis.upenn.edu/~mjd/tsa/tsa.html]. They all look the same. Of course, you can change their appearance to some extent by editing resource files on your UNIX machine or visiting dialog boxes on the Macintosh. However, do we really believe it is more efficient for each of 20 million people to spend five minutes designing a document badly or for one professional to spend a few days designing a document well?

HTML's impoverished formatting capabilities frustrate the would-be designer of beautiful documents. HTML's lack of structural tags frustrates the would-be provider of more advanced browsers.

Fixing the Formatting Problem

Make a graph with time on the x-axis and formatting capabilities on the y-axis. Draw a line from HTML level 1 to HTML level 3 and extend it. You'll see that HTML reaches LaTeX's level of formatting capability around the year 2000. If we are going to get there eventually, we might as well get there soon.

[Note: a group of people at Los Alamos National Labs has already developed practical extensions of TeX and TeX viewers that allow documents with full TeX formatting and full hypertext linking to arbitrary URLs.]

Currently, though, our methodology for extending HTML is backwards. Based on our experiences with other formatting languages, we sit down and figure out what are the most typically needed commands and argue for their inclusion in HTML. People with a stake in keeping the language simple, either for aesthetic reasons or because they don't want to further snarl the pile of C code that serves them as a Web client, fight to keep these commands out.

Let's choose a set of 100 documents in advance and decide that we are going to put enough richness into HTML to capture the design intent in at least 98 of them. For example, crack open a copy of The English Patient [Ondaatje 1992]. Although its narrative style is about as unconventional as you'd expect for a Booker Prize winner, it is formatted very typically for a modern novel. Sections are introduced with a substantial amount of whitespace (3 cm), a large capital letter about twice the height of the normal font, and the first few words in small caps. Paragraphs are not typically separated by vertical whitespace as in Mosaic but by their first line being indented about three characters. (This makes dialog much easier to read than in Mosaic, by the way, where whitespace cuts huge gaps between short sentences and breaks the flow of dialog.) Chronological or thematic breaks are denoted by vertical whitespace between paragraphs, anywhere from one line's worth to a couple of centimeters. If the thematic break has been large, it gets a lot of whitespace and the first line of the next paragraph is not indented. If the thematic break is small, it gets only a line of whitespace and the first line of the next paragraph is indented.

The English Patient is not an easy book to read in paperback. It would become, however, a virtually impossible book to read in Mosaic because neither the author's nor the book designer's intents are expressible in HTML. As the author of Travels with Samantha, I have exactly the same problem. It looks great in PostScript and is formatted very similarly to my paperback copy of The English Patient. Our fileserver handles as many as 100,000 requests per day for pieces of Travels with Samantha and none of those pieces give my readers the quality of experience they'd get reading a version hastily printed out from the simplest word processor. We should demand better from two $30,000 workstations talking to each other over 45 Mbit/second T3 lines.

People argue that HTML isn't a formatting language. It is somehow supposed to be a structural representation of a document. Yet there is no tag for a thematic break, large or small. There is no way to indicate a section break. In short, even the simple requirements of fiction are utterly beyond HTML Level 3's capabilities, never mind what we'd need to automate the processing of conference announcements.

Once we have enough new tags in HTML to represent the author's intent in 98 of our 100 previously selected documents, then it is time to ponder the best way to capture the book designer's intent. If we are determined not to clutter up HTML with formatting directives, then surely we can add a STYLE-SHEET tag in the HEAD so that people who are willing to spend a day or two designing a book nicely can save the other 20 million people on Internet the trouble of doing it themselves.

In the long run, people are not going to accept an expensive system that is inferior in many ways to a $5 paperback book. Eventually Web documents are going to contain formatting information. We might as well sit down with our 100 documents plus manuals for LaTeX, Adobe Acrobat, Frame's internal format, etc. and specify a rich system for capturing author and designer intent. Six months of torture for Web client programmers will ensue, but that is better than the Web documents and clients being out of sync six times in the next decade.

Fixing the Structure Problem

Can the same approach solve the structure problem? What if we locked a bunch of librarians and a handful of programmers in a room together and made them think up every possible slot that any Web document could ever want to fill. They'd come out with a list of thousands of fields, each one appropriate to at least a small class of documents.

This wouldn't work because the committee could never think of all the useful fields. Five years from now, people are going to want to do new, different, and unenvisioned things with the Web and Web clients. Thus, a decentralized revision and extension mechanism is essential for a structure system to be useful.

There is a deeper reason why this wouldn't work. Nobody would be able to write parsers and user interfaces for it. If a user is developing a Web document, does he want to see a flat list of 10,000 fields and go through each one to decide which is relevant? If you are programming a parser to do something interesting with Web documents, do you want to deal with arbitrary combinations of 10,000 fields?

Malone's Work on Semistructured Messages

Back in the mid-1980s, Tom Malone and his collaborators at MIT developed the Information Lens, a system for sharing information within an organization. He demonstrated how classifying messages into a kind-of hierarchy facilitated the development of user interfaces. Figure 1 shows one of Malone's example hierarchies [**** insert Figure 6 from Malone's paper][sic]. For each message type, there is an associated list of fields, some of which are inherited from superclasses. Consider the class MEETING ANNOUNCEMENT. Fields such as TO, FROM, CC, and SUBJECT are inherited from the base class MESSAGE. Fields such as MEETING PLACE are associated with the class MEETING ANNOUNCEMENT itself.

Each message type also has an associated list of suggested types for a reply message. For example, the suggested reply type for MEETING ANNOUNCEMENT is REQUEST FOR INFORMATION. Most importantly, the decomposition of message types into a kind-of hierarchy allows the automatic generation of helpful user interfaces. For example, once the system knows that the user is writing a LENS MEETING ANNOUNCEMENT, that determines which fields are offered for filling and what defaults are presented. Fields having to do with software bugs or New York Times articles are not presented and fields such as PLACE and TIME may be filled in with the usual room and time.
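The field inheritance just described can be sketched in a few lines of Python. This is only an illustration, not Malone's implementation: the class MessageType and the function blank_form are hypothetical names, and making MESSAGE the direct parent of MEETING ANNOUNCEMENT is a simplifying assumption, since the real hierarchy may have intermediate types.

```python
# Hypothetical sketch of a kind-of hierarchy with inherited fields.
# MessageType and blank_form are illustrative names, not from Malone's system.

class MessageType:
    def __init__(self, name, own_fields, parent=None, suggested_replies=()):
        self.name = name
        self.own_fields = list(own_fields)
        self.parent = parent
        self.suggested_replies = list(suggested_replies)

    def all_fields(self):
        """Fields inherited from superclasses first, then this type's own."""
        inherited = self.parent.all_fields() if self.parent else []
        return inherited + self.own_fields


MESSAGE = MessageType("MESSAGE", ["TO", "FROM", "CC", "SUBJECT"])

# Assumption: MESSAGE is treated as the direct parent to keep the sketch short.
MEETING_ANNOUNCEMENT = MessageType(
    "MEETING ANNOUNCEMENT",
    ["MEETING PLACE"],
    parent=MESSAGE,
    suggested_replies=["REQUEST FOR INFORMATION"],
)


def blank_form(message_type, defaults=None):
    """Return the fields an editor would offer, pre-filled with any defaults."""
    defaults = defaults or {}
    return {field: defaults.get(field, "") for field in message_type.all_fields()}


form = blank_form(MEETING_ANNOUNCEMENT, {"MEETING PLACE": "the usual room"})
print(list(form))
```

Because the form is computed from the type, fields belonging to unrelated types never reach the user.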

What did Malone's team learn from this?

  • That a much wider range of messages could be processed automatically. It was convenient for users to fill in lots of fields so messages typically had enough structure to enable fairly sophisticated automatic processing.
  • That by not forcing users to fill out every field and by allowing users to insert arbitrary text in some fields, unusual situations could be handled gracefully. In those situations the formal rules were less useful and users had to do more work by hand.
  • That making message types explicit facilitated the development of rules for automated processing. For example, a few lines of code sufficed to delete every New York Times article whose ARTICLE DATE was prior to TODAY.
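To give a feel for how short such a rule can be, here is a hedged Python sketch of the New York Times example. The dictionary representation of a message and the function name expire_old_articles are assumptions made for illustration; the Information Lens had its own rule language.

```python
from datetime import date

# Hypothetical representation: each message is a dict with a "type" key
# plus whatever fields that type defines.

def expire_old_articles(messages, today):
    """Keep every message except NYT articles whose ARTICLE DATE is before today."""
    return [
        m for m in messages
        if not (m["type"] == "NEW YORK TIMES ARTICLE" and m["ARTICLE DATE"] < today)
    ]


inbox = [
    {"type": "NEW YORK TIMES ARTICLE", "ARTICLE DATE": date(1994, 8, 1)},
    {"type": "NEW YORK TIMES ARTICLE", "ARTICLE DATE": date(1994, 8, 15)},
    {"type": "MEETING ANNOUNCEMENT", "MEETING PLACE": "the usual room"},
]
kept = expire_old_articles(inbox, today=date(1994, 8, 10))
print(len(kept))
```

The rule is only this short because the message type and the ARTICLE DATE field are explicit; nothing has to be guessed from free text.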

Adapting Malone's Work to the Web

Where do we put the fields?

First of all, if we are not to break current clients, we need a place to put fields in an HTML document such that they won't be user-visible. Fortunately, the HTML level 2 specification provides just such a place in the form of the META element, which goes in the head of an HTML document and includes information about the document as a whole. For example,

<meta name="type" content="conference-announcement">
<meta name="conference-name" content="Second Int'l WWW '94">
<meta name="conference-location-brief" content="Chicago">
<meta name="conference-location-full" content="Ramada-Congress Hotel, 520 South Michigan Avenue, Chicago, Illinois, USA">
<meta name="conference-date-start" content="17 October 1994">
<meta name="conference-date-end" content="20 October 1994">
<meta name="conference-abstracts-deadline" content="10 August 1994">
<meta name="conference-papers-deadline" content="15 September 1994">

would be part of the description for our conference and provides enough information for entries to be made automatically in a user's calendar.

It might not be pretty. It might not be compact. But it will work without causing any HTML level 2 client to choke.
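As a rough illustration of the client side, the following Python sketch reads META elements like those above and produces the three calendar entries. It uses the standard library's html.parser; the names MetaCollector, parse_day, and calendar_entries, and the wording of the entry labels, are assumptions of this sketch rather than part of any proposal.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from META elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.fields[attributes["name"].lower()] = attributes["content"]


def parse_day(text):
    # Assumes dates written like "17 October 1994" with English month names.
    return datetime.strptime(text, "%d %B %Y").date()


def calendar_entries(fields):
    """Return (date, label) pairs for a conference announcement, else nothing."""
    if fields.get("type") != "conference-announcement":
        return []
    name = fields["conference-name"]
    return [
        (parse_day(fields["conference-abstracts-deadline"]), name + ": abstracts due"),
        (parse_day(fields["conference-papers-deadline"]), name + ": papers due"),
        (parse_day(fields["conference-date-start"]), name + " begins"),
    ]


page = """<html><head>
<meta name="type" content="conference-announcement">
<meta name="conference-name" content="Second Int'l WWW '94">
<meta name="conference-date-start" content="17 October 1994">
<meta name="conference-abstracts-deadline" content="10 August 1994">
<meta name="conference-papers-deadline" content="15 September 1994">
</head><body>...</body></html>"""

collector = MetaCollector()
collector.feed(page)
entries = calendar_entries(collector.fields)
for day, label in entries:
    print(day.isoformat(), label)
```

A client that does not know the type simply ignores the META elements, which is what lets older browsers keep working.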

There are a few obvious objections to this mechanism. The most serious objection is that duplicate information must be maintained consistently in two places. For example, if the conference organizers decide to change the abstracts deadline from 10 August to 15 August, they'll have to make that change both in the META element in the HEAD and in some human-readable area of the BODY.

An obvious solution is to expose the field names and contents to the reader directly, as is typically done with electronic mail and as is done in [Malone 1987]. When Malone added semiformal structure to hypertext [Malone 1989], he opted to continue exposing field names directly to users. However, that is not in the spirit of the Web. Stylistically, the best Web documents are supposed to read like ordinary text.

A better long-term solution is a smart editor for authors that presents a form full of the relevant fields for the document type and from those fields generates human-readable text in the BODY of the document. When the author changes a field, the text in the BODY changes automatically. Thus, no human is ordinarily relied upon to maintain duplicate data.
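A minimal sketch of the generation step such an editor might perform, in Python. The function render_announcement and the wording of the generated sentence are hypothetical; the point is only that the META elements and the readable text are both derived from one record, so a change to a field shows up in both places.

```python
from html import escape

# Hypothetical sketch: one record of fields is the single source of truth.

def render_announcement(fields):
    """Generate the META elements and the readable BODY text from one record."""
    head = "\n".join(
        '<meta name="{}" content="{}">'.format(escape(name), escape(value))
        for name, value in fields.items()
    )
    body = "<p>{} will be held in {}. Abstracts are due {}.</p>".format(
        escape(fields["conference-name"]),
        escape(fields["conference-location-brief"]),
        escape(fields["conference-abstracts-deadline"]),
    )
    return "<html><head>\n{}\n</head><body>\n{}\n</body></html>".format(head, body)


record = {
    "type": "conference-announcement",
    "conference-name": "Second Int'l WWW '94",
    "conference-location-brief": "Chicago",
    "conference-abstracts-deadline": "10 August 1994",
}

# The organizers move the deadline; the author edits one field only.
record["conference-abstracts-deadline"] = "15 August 1994"
page = render_announcement(record)
print(page)
```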

How do we maintain the document type hierarchy?

Malone unfortunately cannot give us any guidance for maintaining a type hierarchy over a wide area network, as he envisioned a system restricted to one organization. He can give us some inspiration, however. Malone reports that a small amount of user-level programming sufficed to turn his structure-augmented hypertext system into a rather nice argument maintenance tool, complete with user-interface for both display and input [Malone 1989].

Whatever mechanism we propose, therefore, had better allow for an organization to develop further specialized types that facilitate clever processing and presentation. At the same time, should one of these hyperspecialized documents be let loose on the wider Internet, it should carry some type information understandable to unsuspecting clients. One mechanism for doing this is the inclusion of an extra type specification:

<meta name="type" content="lanl-acl-conference-announcement">

<meta name="most-specific-public-type" content="conference-announcement">

In this case, the Los Alamos National Laboratory's Advanced Computing Laboratory has concocted a highly specialized type of conference announcement that permits extensive automated processing by Web clients throughout Los Alamos. However, should someone at MIT be looking at the conference announcement, his Web client would fail to recognize the type LANL-ACL-CONFERENCE-ANNOUNCEMENT and look at the MOST-SPECIFIC-PUBLIC-TYPE field. As CONFERENCE-ANNOUNCEMENT is a superclass of LANL-ACL-CONFERENCE-ANNOUNCEMENT, all the things that the MIT user's client is accustomed to doing with conference announcements should work with this one.
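The fallback just described might look like the following Python sketch. The function effective_type and the idea of a per-client set of known types are assumptions made for illustration.

```python
# Hypothetical sketch of a client choosing which type to process a document as.

def effective_type(fields, known_types):
    """Prefer the declared type; fall back to the most specific public type."""
    declared = fields.get("type")
    if declared in known_types:
        return declared
    public = fields.get("most-specific-public-type")
    if public in known_types:
        return public
    return None  # unknown document: treat it as ordinary HTML


document = {
    "type": "lanl-acl-conference-announcement",
    "most-specific-public-type": "conference-announcement",
}

lanl_client = {"conference-announcement", "lanl-acl-conference-announcement"}
mit_client = {"conference-announcement"}

print(effective_type(document, lanl_client))
print(effective_type(document, mit_client))
```

The Los Alamos client gets the specialized behavior, the MIT client gets the ordinary conference-announcement behavior, and a client that knows neither type falls through to plain display.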

Nonhierarchical inheritance (also known as "multiple inheritance") is also important so that duplicate type hierarchies are not spawned. For example, the fact that a document is restricted to a group or company might possibly apply to any type of document. Should there be two identical trees, one rooted at BASIC-DOCUMENT and the other at BASIC-INTERNAL-DOCUMENT? Then we might imagine documents for which there is an access charge. Now we just need four identical trees, rooted at BASIC-FREE-DOCUMENT, BASIC-METERED-DOCUMENT, BASIC-INTERNAL-FREE-DOCUMENT, BASIC-INTERNAL-METERED-DOCUMENT. There is a better way and it was demonstrated in the MIT Lisp Machine Flavors system (a Smalltalk-inspired object system grafted onto Lisp around 1978): mixins. Mixins are orthogonal classes that can be combined in any order and with any of the classes in the standard kind-of hierarchy. Here are some example mixin classes:

  • DRAFT-MIXIN. Contributes fields DRAFT-EXPECTED-COMPLETION-DATE, DRAFT-ADDRESS-FOR-COMMENTS, DRAFT-VERSION-NUMBER, DRAFT-PREVIOUS-VERSION-URL. When a document type inherits from this class, the client can display "****DRAFT****" in a few prominent places for the reader, can offer to look up the previous version and show change bars on top of the current version, and can offer to send a structured comment on the draft back to the author(s).
  • RESTRICTED-MIXIN. Contributes fields RESTRICTED-CURRENT-RELEASE-TERMS, RESTRICTED-PERSON-AUTHORIZED-TO-RELEASE, RESTRICTED-EXPECTED-RELEASE-DATE, RESTRICTED-AUTHORIZED-ACCESS-SPECIFICATION (explains who can access, possibly a domain name or list of networks). A company's HTTP server could be programmed to watch for documents whose type inherits from this class and only deliver them to authorized users. Non-authorized users might be sent an explanation and the name of a person who could authorize release.

If there are N mixins recognized in the public type registry, we might have to have 2^N classes for every class in the old kind-of hierarchy. That's one for every possible subset of mixins, so we'd have classes like TRAVEL-MAGAZINE, TRAVEL-MAGAZINE-RESTRICTED, TRAVEL-MAGAZINE-DRAFT, TRAVEL-MAGAZINE-DRAFT-RESTRICTED, etc. This doesn't seem like a great improvement on the 2^N identical trees situation.
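The growth is easy to check by enumeration. This short Python sketch, with the hypothetical helper combined_class_names, lists one precombined class per subset of mixins; the naming scheme simply follows the TRAVEL-MAGAZINE examples above.

```python
from itertools import combinations

# Hypothetical helper: one precombined class name per subset of mixins.

def combined_class_names(base, mixins):
    names = []
    for size in range(len(mixins) + 1):
        for subset in combinations(mixins, size):
            names.append("-".join((base,) + subset))
    return names


names = combined_class_names("TRAVEL-MAGAZINE", ["DRAFT", "RESTRICTED"])
print(names)

# Ten registered mixins would already mean 2**10 names for this one base class.
print(len(combined_class_names("TRAVEL-MAGAZINE", [str(i) for i in range(10)])))
```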

However, if we allow documents to specify their fundamental type and mixins separately

<meta name="type" content="travel-magazine">
<meta name="mixin-types" content="restricted-mixin">

and build the final type at runtime in both the HTTP server and the Web client, then we need only have one hierarchy plus a collection of independent orthogonal mixins. This presents no problem for programmers using modern computer languages such as Smalltalk and Common Lisp that allow type definition at run-time, but programs implemented in primitive languages (e.g., C++) that have purely static types are going to essentially need their own dynamic type system.
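As a sketch of what building the final type at runtime could mean, here is a Python version using the built-in type() constructor. Python stands in for the dynamic languages named above. The registries BASE_TYPES and MIXINS, the functions build_type and all_fields, the TRAVEL-MAGAZINE field names, and the convention that several mixins are separated by whitespace in the mixin-types field are all assumptions of this sketch.

```python
# Hypothetical sketch: combine one base type with any mixins at runtime.

class TravelMagazine:
    fields = ["TITLE", "ISSUE-DATE"]  # illustrative fields only


class RestrictedMixin:
    fields = [
        "RESTRICTED-CURRENT-RELEASE-TERMS",
        "RESTRICTED-PERSON-AUTHORIZED-TO-RELEASE",
        "RESTRICTED-EXPECTED-RELEASE-DATE",
        "RESTRICTED-AUTHORIZED-ACCESS-SPECIFICATION",
    ]


class DraftMixin:
    fields = [
        "DRAFT-EXPECTED-COMPLETION-DATE",
        "DRAFT-ADDRESS-FOR-COMMENTS",
        "DRAFT-VERSION-NUMBER",
        "DRAFT-PREVIOUS-VERSION-URL",
    ]


BASE_TYPES = {"travel-magazine": TravelMagazine}
MIXINS = {"restricted-mixin": RestrictedMixin, "draft-mixin": DraftMixin}


def build_type(meta):
    """Create a class from the document's type and mixin-types fields."""
    base = BASE_TYPES[meta["type"]]
    # Assumption: multiple mixins are separated by whitespace.
    mixin_names = meta.get("mixin-types", "").split()
    mixins = tuple(MIXINS[name] for name in mixin_names)
    class_name = "+".join([meta["type"]] + mixin_names)
    return type(class_name, mixins + (base,), {})


def all_fields(cls):
    """Gather fields from the base type first, then from each mixin."""
    seen = []
    for klass in reversed(cls.__mro__):
        for field in vars(klass).get("fields", []):
            if field not in seen:
                seen.append(field)
    return seen


doc_type = build_type({"type": "travel-magazine", "mixin-types": "restricted-mixin"})
print(doc_type.__name__, all_fields(doc_type))
```

An HTTP server could then test issubclass(doc_type, RestrictedMixin) before delivering a document, without anyone having registered a TRAVEL-MAGAZINE-RESTRICTED class in advance.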

We know then that we need multiple inheritance and distributed extensibility. A standard Internet approach to distributed maintenance of a hierarchy is found in the Domain Name System (DNS), where authority for a zone is parcelled out and that authority includes the ability to parcel out subzones [Stevens 1994; Mockapetris 1987a [formerly http://info.internet.isi.edu:80/0/in-notes/rfc/files/rfc1034.txt]; Mockapetris 1987b [formerly http://info.internet.isi.edu:80/0/in-notes/rfc/files/rfc1035.txt]]. A DNS-style system might seem like overkill initially and would result in some delays for pioneer users of document types because, without a substantial local cache, document type queries would have to be sent across the Internet for practically every Web document viewed.
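To make the delegation and caching idea concrete, here is a toy Python sketch. It is not the DNS protocol and not a proposed design: TypeRegistry, CachingClient, the prefix-based delegation rule, and the superclass name "announcement" are all invented for illustration, and real lookups would of course cross the network.

```python
# Hypothetical sketch of DNS-style delegation for document types.

class TypeRegistry:
    """One zone of the registry; it may hand parts of the name space to others."""

    def __init__(self, zone):
        self.zone = zone
        self.definitions = {}  # type name -> superclass name
        self.delegations = {}  # name prefix -> registry responsible for it
        self.queries = 0       # stands in for network round trips

    def delegate(self, prefix, registry):
        self.delegations[prefix] = registry

    def lookup(self, type_name):
        self.queries += 1
        if type_name in self.definitions:
            return self.definitions[type_name]
        for prefix, registry in self.delegations.items():
            if type_name.startswith(prefix):
                return registry.lookup(type_name)
        raise KeyError(type_name)


class CachingClient:
    """A Web client that remembers answers so repeat views cost nothing."""

    def __init__(self, root):
        self.root = root
        self.cache = {}

    def superclass_of(self, type_name):
        if type_name not in self.cache:
            self.cache[type_name] = self.root.lookup(type_name)
        return self.cache[type_name]


root = TypeRegistry("public")
root.definitions["conference-announcement"] = "announcement"

lanl = TypeRegistry("lanl")
lanl.definitions["lanl-acl-conference-announcement"] = "conference-announcement"
root.delegate("lanl-", lanl)

client = CachingClient(root)
first = client.superclass_of("lanl-acl-conference-announcement")
second = client.superclass_of("lanl-acl-conference-announcement")
print(first, root.queries, lanl.queries)
```

The second lookup is answered from the local cache, which is the property that would keep a DNS-style registry tolerable for documents viewed repeatedly.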

Regardless of how the hierarchy is maintained, developing the initial core taxonomy is a daunting task. Fortunately, librarians have been at it for hundreds of years and have done most of the work for us. The USMARC, ****, [**** add references][sic] are carefully thought out systems and should serve as the basis for our tree.

Conclusions

HTML is inadequate. It lacks sufficient structural and formatting tags to render certain kinds of fiction even comprehensible, much less aesthetic. HTML needs style sheets or improved formatting capabilities so that document designers can spare 20 million Internet users from adjusting everything themselves. The META tag in HTML level 2 can be exploited to implement a document typing system. We need to develop a hierarchy of document types to facilitate implementation of programs that automatically process Web documents. This type system must support multiple inheritance.


Author Information

Philip Greenspun

Laboratory for Computer Science and Artificial Intelligence Laboratory at Massachusetts Institute of Technology

Advanced Computing Lab [formerly http://www.acl.lanl.gov/] at Los Alamos National Laboratory


References

Gilbert, Martin 1991. Churchill: A Life. Henry Holt & Company, New York, page 595

Malone, Thomas W., Grant, Kenneth R., Lai, Kum-Yew, Rao, Ramana, and Rosenblitt, David 1987. "Semistructured Messages are Surprisingly Useful for Computer-Supported Coordination." ACM Transactions on Office Information Systems, 5, 2, pp. 115-131. [doi: 10.1145/27636.27637]

Malone, Thomas W., Yu, Keh-Chiang, and Lee, Jintae 1989. "What Good are Semistructured Objects? Adding Semiformal Structure to Hypertext." Center for Coordination Science Technical Report #102. M.I.T. Sloan School of Management, Cambridge, MA

Mockapetris, P.V. 1987a. "Domain Names: Concepts and Facilities," RFC 1034 [formerly http://info.internet.isi.edu:80/0/in-notes/rfc/files/rfc1034.txt]

Mockapetris, P.V. 1987b. "Domain Names: Implementation and Specification," RFC 1035 [formerly http://info.internet.isi.edu:80/0/in-notes/rfc/files/rfc1035.txt]

Ondaatje, Michael 1992. The English Patient. Vintage International, New York

Stevens, W. Richard 1994. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, Massachusetts


Notes

1. E. Annie Proulx's The Shipping News won the 1994 Pulitzer Prize for Fiction.