SGML and PDF--Why We Need Both

Kasdorf, Bill

doi:https://doi.org/10.3998/3336451.0003.406


Journal of Electronic Publishing

Volume 3, Issue 4: Moving from Print to Electronic Publishing, June, 1998


Permissions: This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact [email protected] for more information.


Book and journal publishers need to approach the two key technologies used in making print and electronic products — SGML, the Standard Generalized Markup Language, and PDF, Adobe Acrobat's Portable Document Format — with an open mind. Each of those technologies is very powerful in its own way; but ultimately, many publishers will use both in order to achieve the most streamlined workflow and make the best use of their intellectual property.

Introduction

One of the most common misconceptions among publishers expanding from print into electronic publishing is that they must choose between two competing technologies — SGML, the Standard Generalized Markup Language (or one of its close relatives, HTML and XML), and PDF, Adobe Acrobat's Portable Document Format.

In fact, those technologies complement each other. Publishers who understand how to use them appropriately often wind up using both. Publishers who take a more partisan view, placing their bets exclusively with one or the other, miss opportunities to make the most of their intellectual property, to deliver the most useful products to their readers, and to make the hybrid print/electronic world we live in a practical and economical reality.

Below, we'll examine SGML and PDF in more detail, describing the essential features of each, comparing their strengths and weaknesses, and discussing how they can be used most effectively. But first let's look at the path print publishers often take into electronic publishing and how that path makes adopting both SGML and PDF attractive — or, in some cases, inevitable.

An Often-Accidental Combination

When print publishers take their first steps toward electronic publishing, they often start by posting hand-coded Tables of Contents and some sample text in HTML on a Web site. At the same time, they're most likely providing PostScript files to their printers. Without even realizing it, they've already got feet in both camps: HTML is a child of SGML (albeit an immature one), and PostScript is the parent of PDF.

Soon, though, their readers begin to demand more electronic access. PDF offers an appealing solution, because it's a convenient and inexpensive by-product of the PostScript already produced for print. And printers are moving from PostScript to PDF as well. So it's easy at this stage to say "Why bother with SGML? PDF is all I need."

The Need for Structure and Adaptability

As the number or complexity of their electronic publications grows, publishers need a way to organize and manage them. Because of its powerful and flexible structuring capabilities, as well as its capacity to capture and organize information about the publications ("metadata"), SGML is an effective solution. Ironically, a publisher's first use of SGML is sometimes simply to help manage access to collections of PDF publications. Readers, too, need more sophisticated ways to find what they're looking for in those rapidly growing collections, based on the structure and content of the information. Even if readers want PDF as an end product, they need SGML (or HTML or XML) to find what they want.

Publishers also discover that they need some way to protect their electronic archives from becoming obsolete as technology evolves. Yesterday's HTML begins to look inadequate as new options become available; XML will soon become integrated into popular browsers, forcing many publishers to go back and re-code or restructure electronic files that seem up-to-date today. SGML provides a way to minimize the cost and disruption of keeping up with technological change.

Even as electronic products increasingly displace their print counterparts, print will still be part of the picture. Whether from an offset press, a DocuTech, or a laser printer, print is what many readers ultimately want. And text formatted for the screen is rarely appropriate for print. Fortunately, some publishers may find that SGML can lower the cost of producing their print products as well as their electronic ones, because it can make typesetting much more automatic. However, even in an SGML-based production environment, PDF will play a key role in delivering those print pages — whether to a printer or directly to a reader.


It's becoming clearer every day that the ideal workflow for book and journal publishers starts with SGML, which can then be used to generate HTML (or, soon, XML) for Web publishing and PDF for electronic versions of pages to be printed (whether they're printed in bulk or on demand). Even when it's not practical to make an immediate leap to this ideal workflow, it's possible to make gradual progress toward it, and get incremental benefits along the way. The better that publishers understand SGML and PDF, the sooner they'll be able to take advantage of their power to provide both an ideal complement of electronic files and a dynamic, economical workflow that will adapt to changing technology as well as to readers' changing expectations.

SGML Is for Structure, PDF Is for Pages

The essential difference between SGML and PDF — and the distinction at the heart of why they're complementary, not competing, technologies — can be summed up in a single sentence:

SGML is all about structure and meaning, and has little or nothing to do with appearance;

PDF is all about appearance, and has little or nothing to do with structure and meaning.

Understanding that is the key to using those technologies effectively.

At Impressions, where we edit and produce books, journals, and electronic products for a wide range of publishers, we have found those technologies to be invaluable tools — even when the publisher doesn't require them. Sometimes our customers even hold us back, because they don't understand what SGML and PDF have to offer. We've coined two interpretations of those acronyms to help people keep the essential distinctions in mind. To replace the old joke that SGML stood for "Sounds Good, Maybe Later," we suggest: "Structure Gives Me Leverage." And our phrase "Print the Damn File" has become a popular way to stress PDF's essential nature as an electronic description of the appearance of a specific page.

Although SGML can be used to describe appearance — in fact, the main purpose of its offshoot HTML is to tell a browser how to display tagged text — and although Acrobat does, in fact, offer some structuring and navigational functionality, those broader dimensions are departures from the essence of each technology. SGML is mainly for describing what a document is; PDF is mainly for describing what it looks like on a page.

SGML: "Structure Gives Me Leverage"

SGML, the Standard Generalized Markup Language, traces its roots back to a time when it was virtually impossible to read an electronic document from another system because each system, each platform, each software package described its contents using proprietary codes that could not be interpreted on other systems. As computing systems proliferated, it became more and more important to interchange information between unrelated systems.

There are two dimensions to that interchange: horizontal (passing a document from person to person) and vertical (many people collaborating on shaping, changing, and reusing documents and parts of documents).

Horizontal interchange is the most common, because in almost all contexts it becomes useful to pass a document intact from one system to another, adapting appropriately to each. Yesterday text written on a word processor (say, a Wang) had to be used by a typesetter (using, perhaps, a Penta) to create print pages, but also by an electronic publisher (who might use Folio) to create an electronic version. Because each of those systems had its own tagging and organizing structure, and because they were often used inconsistently by the people doing the work, it was difficult or impossible to reuse the data. All too often — even today — work prepared one place requires extensive and expensive conversion or rekeying to be used in another.


The other kind of interchange — vertical interchange — allows various people to collaborate on shaping a given body of text, and for parts of that text to be reshuffled and reused in different ways. The writer of a repair manual for a Buick, for example, can probably use big portions of text from the corresponding Chevy manual; and some, but not all, of what appears in the manual for the mechanic might also appear in the owner's manual. Such manuals are written, reviewed, edited, and augmented by whole teams of people who rarely all use the same systems. Even in a simpler context, where a lone editor needs to revise a lone author's work, that kind of interchange is a big problem unless they're using identical tools.

SGML Separates Structure from Presentation

The architects of SGML addressed both of those needs by creating a language for delineating the elements of text documents without describing their appearance. They did that by separating structure from presentation. SGML concerns itself with the structural features of a document, but ideally, it leaves it to the ultimate presentation system to determine how those features appear. That way, when documents move from system to system, or portions of one document are used in another, they don't need to be recoded.

The language those architects created, the Standard Generalized Markup Language, which was accepted as an international standard (ISO 8879) in 1986, uses ordinary text characters for both the text and the markup that describes that text. It has no proprietary codes; instead, each user (or group of users) may create whatever codes are necessary and meaningful for what is being published.

That is an extremely powerful and liberating concept. Now, replacing the assortment of proprietary codes the Wang system used to format a heading (and which could only be interpreted properly by another Wang, or someone expert in those codes), and replacing the corresponding but different assortment of codes in the typesetter's Penta system (which might want that heading to be 14-point Palatino italic, flush left, rather than the centered and underscored Courier the Wang might have used), and replacing the still different codes needed in the Folio system (where that heading might be 12-point Helvetica Medium, and red), SGML allowed that heading to be coded as <H1>, or even just <heading> or <TITLE>.

When the resulting SGML file is used, the system that uses it can convert it to its own coding system and determine what it looks like. That heading can be made to look different on every word processor it's used on; it can be formatted one way for paper and another way for the screen; it can be typeset in whatever design a print publisher chooses, and can be converted by whatever system they happen to use; it can be displayed differently on the Web or on a CD-ROM. In all those contexts the code attached to that title doesn't have to change. The time and cost once necessary to convert the code from system to system has been dramatically reduced.
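As a rough illustration of that separation, the sketch below tags a heading once and lets each system attach its own formatting. The style values echo the Wang, Penta, and Folio examples above; the dictionary layout and the function name `present` are assumptions made for this example, not part of any real system.

```python
# Hypothetical per-system style sheets: each system decides how a heading looks.
STYLES = {
    "wang": {"heading": {"font": "Courier", "align": "center", "underline": True}},
    "penta": {"heading": {"font": "Palatino Italic", "size_pt": 14, "align": "left"}},
    "folio": {"heading": {"font": "Helvetica Medium", "size_pt": 12, "color": "red"}},
}


def present(tag, text, system):
    """Attach one system's formatting to an element without changing its tag."""
    return {"text": text, **STYLES[system][tag]}


# The source is tagged once and never recoded.
source = ("heading", "SGML and PDF")

for system in STYLES:
    print(system, present(*source, system))
```

The tag on the source never changes; only the lookup table consulted at presentation time does.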

SGML Is a Language, not a Set of Codes

The key to interpreting the codes in an SGML document is the DTD — the Document Type Definition. In fact, an SGML document is not complete without three distinct parts: the Declaration (which gives the user fundamental information like what language and code set are being used — for example, English and ASCII), the DTD (which details all the codes and the rules restricting their use), and, finally, the Instance (the text being published, marked up with the codes described in that DTD).

Many people mistakenly assume that SGML is a "standard" set of codes. In fact, it's just the opposite. The good news is, you can make up your own codes. The bad news is, you generally have to make up your own codes.

SGML is a language. In that sense, it's like a programming language — it can be used to create applications that are then usable by someone who has no knowledge of SGML; but only somebody expert in SGML (in effect, the "programmer") can create that application in the first place. In book and journal publishing, the applications (code sets, or DTDs) can be specific to a single book or journal, or can span a group of related books or journals.

SGML Is Often Complicated

SGML doesn't have to be complicated, but it often is. Generally speaking, the wider variety of material a given DTD is designed to accommodate, the more complex and abstract it tends to become. It's relatively easy to write a DTD for a simple reference book that might have only a handful of structural features used repeatedly in thousands of pages. However, that DTD will be useless in describing a textbook or a scholarly journal. Writing a DTD for a complicated textbook full of sidebars and exercises of all sorts is hard enough; writing one to accommodate a publisher's line of textbooks (both existing and future ones) is much harder; and writing a DTD that a number of different publishers can agree on is virtually impossible.


Journals are ideal candidates for SGML because they are highly structured and highly repetitive. But, as any journal publisher who has done so can attest, developing a DTD that properly and powerfully describes all the features needed for both print and electronic versions of a journal is not a trivial matter. When the names of the authors are tagged, should the lead author be distinguished from the others? How should the surname be tagged, so that you don't end up with a lot of authors whose last name seems to be "Jr." in the electronic version? In the references, is it sufficient just to tag each reference, or should there be sub-tags for the place and date of publication, the publisher, and the page ranges?
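The surname problem can be sketched in a few lines. The element names (`given`, `surname`, `suffix`) and the sample name are invented for illustration, and XML syntax parsed with Python's standard library stands in here for SGML; a real DTD would define its own tags.

```python
import xml.etree.ElementTree as ET


def naive_surname(name):
    """Guess the surname as the last word of an untagged name."""
    return name.split()[-1]


# Hypothetical tagging of the same author with explicit sub-elements.
TAGGED = (
    "<author><given>Pat</given>"
    "<surname>Smith</surname><suffix>Jr.</suffix></author>"
)


def tagged_surname(markup):
    """Read the surname from explicit markup instead of guessing."""
    return ET.fromstring(markup).findtext("surname")


print(naive_surname("Pat Smith Jr."))
print(tagged_surname(TAGGED))
```

Without the finer tagging, software has only position to go on, which is how "Jr." ends up in the surname index.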

Richer than Composition Tagging

Many SGML tags are extremely valuable in an electronic context, but not at all necessary when typesetting the print pages. Composition files may have a tag resembling <AU> for the line where the authors' names are listed, but they almost never have any finer level of detail than that. The same goes for the references. As for figures and tables, the composition files typically have a code only where each is first referred to (because each figure and table appears only once on the printed page), whereas in an electronic context it's desirable to have a link every time a figure or table is mentioned.

That means SGML files often have much richer coding than composition-only files. Of course, they don't have to be so complicated: it's possible to set up the SGML in such a simple way that the codes are virtually equivalent to typesetting codes. But that sacrifices much of the benefit and power SGML brings in the electronic world. Publishers are often faced with a choice between keeping things simple (saving money and time) and making them richer (but more complicated and expensive). Publishers who build SGML basically for archival purposes, and need only print products today, tend to opt for simplicity. Publishers who need to bring out products in both print and electronic formats soon discover that well-thought-out, richly tagged SGML files have a very rapid payback.

The Power of Metadata

Another significant benefit of SGML is its ability to capture and organize information about the material being published, even if that information doesn't actually appear on the printed pages, or appears in a different way. A biographical reference book might want to systematically keep track of the occupations, birth and death dates, burial places, nationalities, educational histories, and other pertinent facts about each person, whether or not that information appears in the article itself. Journals typically provide abstracts, keywords, publication history, and other such information on the print pages, but that information often needs to be structured differently in an electronic context than it is on the printed page.

SGML is an excellent technology for managing metadata in an extremely flexible and dynamic way. In fact, many SGML repositories are really properly thought of as "text databases." They enable a publisher to organize the published information in different ways for different contexts. Although the biographies in the example above might be presented alphabetically in the printed book, an electronic product might make it possible to present them chronologically, or geographically, or grouped by occupation.
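A minimal sketch of the "text database" idea, assuming the metadata has already been extracted from the tagged files into records. The records and field names below are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Biography:
    name: str
    born: int
    occupation: str


# Invented sample records, standing in for metadata captured in the markup.
ENTRIES = [
    Biography("Adams, Q.", 1850, "engineer"),
    Biography("Baker, R.", 1790, "printer"),
    Biography("Clark, S.", 1820, "engineer"),
]


def alphabetical(entries):
    """Order used in the printed book."""
    return sorted(entries, key=lambda b: b.name)


def chronological(entries):
    """Alternative order an electronic product might offer."""
    return sorted(entries, key=lambda b: b.born)


def by_occupation(entries):
    """Group names under each occupation."""
    groups = {}
    for b in entries:
        groups.setdefault(b.occupation, []).append(b.name)
    return groups


print([b.name for b in chronological(ENTRIES)])
print(by_occupation(ENTRIES))
```

The same records feed every view; none of the orderings requires retagging the source.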

Because the DTD can include not only descriptions of the component elements but definitions — rules that determine how each element can be used — it is possible to include elements that appear in one context but not another. For example, the DTD for a journal might distinguish between the elements a publisher wants to make freely available on the Web (title, author, affiliations, abstracts, keywords, publication history) and the elements available only for a fee or to subscribers (the full text, figures, tables, and references).

SGML Documents Can Be "Parsed," or Validated

Another important consequence of the fact that an SGML document incorporates the key to its own codes (in the Declaration and the DTD) is that software can check to make sure everything is done correctly. That is a very powerful function; just ask any typesetter who has had to scroll through hundreds or thousands of pages of word-processing files to find the codes used in them, and who then converted those files, set up the formats, and typeset the pages, only to find in the page proofs that in Chapter 12 there's an unsuspected fourth level of subhead and that in Chapter 15 there are some unidentified symbols that have defaulted to gibberish.

An SGML file can be "parsed" — a process by which the document instance is checked against the Declaration and the DTD to make sure all the codes in the file are legal and used properly. If special characters occur (which SGML usually expresses in the form of "entities") that haven't been specified, the parser will find them. If the author forgot a required end code (for example, by properly coding a superior figure as <SUP> but mistyping its end code as </SUB> instead of </SUP>) or used a code improperly (the DTD might specify that an <H2> can't be used before an <H1> or that a list must have at least three items), the parser will call attention to the errors.


Just as spell checking does not replace proofreading (it will miss a "there" that's meant to be a "their"), parsing does not ensure that everything is perfect in the file. It only checks to make sure that the codes have been used legally. And it's highly dependent on how permissive the DTD is. Most DTDs allow a chapter title to be used only once in a chapter, so an extra <CT> would be found. However, if the author legally codes <H2> that should be an <H1>, the parser will say it's okay. Computers are not going to put editors and proofreaders out of business.
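The kinds of checks described above can be sketched with a toy checker. This is not a real SGML parser: the allowed tag set and the single ordering rule are hard-coded assumptions standing in for a DTD, and tag minimization and entities are ignored.

```python
import re

# Hypothetical tag set and rule standing in for a DTD.
ALLOWED = {"H1", "H2", "P", "SUP", "SUB"}
TAG = re.compile(r"<(/?)([A-Za-z0-9]+)>")


def check(instance):
    """Return a list of markup errors found in a tagged text."""
    errors, open_tags, seen_h1 = [], [], False
    for match in TAG.finditer(instance):
        closing = match.group(1) == "/"
        name = match.group(2).upper()
        if name not in ALLOWED:
            errors.append(f"<{name}> is not declared")
        elif not closing:
            if name == "H2" and not seen_h1:
                errors.append("<H2> used before any <H1>")
            seen_h1 = seen_h1 or name == "H1"
            open_tags.append(name)
        elif not open_tags:
            errors.append(f"</{name}> has no matching start tag")
        else:
            # Treat a mismatched end tag as a mistyped close of the open element.
            expected = open_tags.pop()
            if expected != name:
                errors.append(f"</{name}> found where </{expected}> was expected")
    errors.extend(f"<{name}> is never closed" for name in open_tags)
    return errors


print(check("<H1>Results</H1><P>x<SUP>2</SUB></P>"))
print(check("<H2>Early</H2><H1>Late</H1>"))
print(check("<H1>Intro</H1><H2>Really a level-one head</H2>"))
```

The last call returns an empty list: the markup is legal, so the checker has no way of knowing the second head was coded at the wrong level.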

Power at a Price

The bottom line is that there is no better way to describe the structure of a document than SGML. It's an international standard, and it's completely non-proprietary, so it liberates documents from the cumbersome and costly process of conversion from system to system. It requires no special hardware or software; although there are an increasing number of SGML-based systems on the market, it's possible to create a valid SGML file in any word processor or text editor. Thus SGML preserves the document and its coding from obsolescence as well: You could write an SGML file on a grocery bag with a crayon and bury it in a box in your back yard, and somebody could dig it up and interpret it centuries hence, as long as they "spoke" SGML. And finally, it frees the document from the constraints of any particular method of presentation, adapting to any mode of display — print or electronic, existing or not yet invented.

But that durability and flexibility come at a price. It can be hard work to analyze a set of documents thoroughly enough to accommodate their print and electronic needs, to write a well-constructed and well-documented DTD, to set up a well-designed SGML-based workflow. When publishers needed only to produce print documents, few found SGML to be worth the trouble. Now, faced with the need to bring out electronic versions of their publications, many publishers are finding that SGML is a very wise investment indeed.

Why not Use Just HTML?

Many publishers, on learning the complexity and effort required to implement SGML properly, are tempted to save everything in HTML. That is a very dangerous and short-sighted approach.

HTML is an application of SGML designed to tell a browser how to format documents for the Web. It has been wildly successful, and, in fact, it is largely responsible for the resurgence of interest in SGML in the past few years. The reason it's so popular is that it's extremely simple. Unlike SGML, it is a specific set of codes; that makes it very easy to learn and use, and easy to build tools for.

But in its very simplicity is its limitation for books and journals. Publishers quickly find that HTML doesn't offer enough codes to describe more than the bare bones of their publications. There is no code for Abstract, for example, so journal publishers typically use the <BLOCKQUOTE> tag to represent an abstract in HTML, because a browser will typically format it as an indented block (looking just like an extract in text). But now something essential has been lost: You can no longer distinguish the abstracts from the extracts.

HTML offers six levels of heads, which seems adequate until you realize that the <H1>s that you used as level-one subheads look huge; publishers wind up instead using <H3>s in HTML, which are usually about the proper size for a level-one subhead (thus leaving only two subordinate levels to choose from). Also, special characters are limited in HTML: You can't display something as basic as an em dash, not to mention a host of Greek and math characters. And compounding the situation is that HTML is a moving target: As browser makers jockey for position, they invent new features that make earlier versions of HTML obsolete.

That is not to say that HTML is useless. In fact, it's extremely useful, because it's the coding needed for the Web. The key is to think of it as an output, not an archival format. Since SGML is richer and more flexible than HTML, SGML archives can always be "dumbed down" to HTML, but the reverse is not the case. And even if you need to display your abstracts as block quotes on the Web, your SGML archive still knows they're abstracts, not extracts. So when you want to use that same file for print, or for a CD, or when you want to deliver the abstracts differently in one context than another, you still can.
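Why the conversion works in only one direction can be shown with a small mapping. The archive tag names and the mapping table below are hypothetical; only the `<BLOCKQUOTE>` treatment of abstracts and extracts comes from the discussion above.

```python
# Hypothetical mapping from a publisher's archive tags to HTML output tags.
TO_HTML = {
    "title": "h1",
    "abstract": "blockquote",
    "extract": "blockquote",
    "para": "p",
}


def to_html(elements):
    """'Dumb down' archive elements to HTML tags; always possible."""
    return [(TO_HTML[tag], text) for tag, text in elements]


def candidates(html_tag):
    """Archive tags that could have produced a given HTML tag."""
    return sorted(src for src, out in TO_HTML.items() if out == html_tag)


archive = [("abstract", "We compare two formats."), ("extract", "A quoted passage.")]

print(to_html(archive))
print(candidates("blockquote"))
```

Going forward is a simple lookup. Going backward from `blockquote` yields two candidates, so the distinction between abstract and extract cannot be recovered from the HTML alone.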

XML — SGML for the Web

Recognizing the limitations of HTML and the complexities of SGML, the Web community developed yet another standard: XML, the Extensible Markup Language. Unlike HTML, XML allows for the invention of new codes; unlike SGML, it does not require a DTD. In most respects, however, XML is, in fact, consistent with SGML. One of the requirements of its design was that XML files be compatible with SGML. (XML also simplifies SGML in other ways. For example, while SGML allows "tag minimization" in some contexts, enabling the omission of end tags, XML always requires explicit end tags. That makes it a lot easier to write tools and browsers, which don't need to have the sophistication to be able to tell where an end tag is implied or where it is just plain missing by mistake.)


XML introduces the concept of a "well-formed" document, one in which the tags used are nested correctly and the proper XML syntax is followed. (In addition, like SGML, XML allows for "valid" documents too, which go a step beyond "well formed" status by using an explicit structure defined in a DTD.) "Well-formedness" is a very appealing feature of XML, because it allows publishers to tag what they are publishing in whatever way is meaningful, without being confined to a specific set of tags (as with HTML) or needing to write a DTD (as with SGML).
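Well-formedness is easy to test mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` module, which checks only well-formedness and does not validate against a DTD; the element names are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<article><abs>Summary</abs></article>"))  # properly nested
print(is_well_formed("<article><abs>Summary</article></abs>"))  # overlapping tags
print(is_well_formed("<article><abs>Summary</article>"))        # missing end tag
```

Note that no tag list was supplied anywhere: any tag names are accepted as long as the nesting and syntax are correct.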

Why, then, would a publisher want to use a DTD if XML lets you get away without one? Although it's tempting to skip the work of writing one, a DTD provides an invaluable template or blueprint that enforces an important consistency among classes of documents. Marketing literature may not need that rigor, but books and journals typically do. Think what chaos there would be if one editor chose to tag abstracts as <ABS>, while another used <ABSTR>, and yet another used <AB>. All of those variations could be perfectly legal in well-formed XML documents, but the resulting collection of journals would be far less usable.
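The cost of that inconsistency shows up as soon as software tries to use the collection. In this sketch, three hypothetical articles are each well formed, but a search for abstracts by one tag name finds only one of them. The article snippets and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Three well-formed documents whose editors chose different abstract tags.
ARTICLES = [
    "<article><ABS>First summary</ABS></article>",
    "<article><ABSTR>Second summary</ABSTR></article>",
    "<article><AB>Third summary</AB></article>",
]


def find_abstracts(docs, tag="ABS"):
    """Collect the text of every element with the given tag name."""
    found = []
    for doc in docs:
        found.extend(el.text for el in ET.fromstring(doc).iter(tag))
    return found


print(find_abstracts(ARTICLES))
```

A shared DTD that declares a single abstract element would have rejected two of the three variants before they ever reached the archive.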

XML promises to revolutionize the way information is coded and structured for the Web. Its two related technologies, XLL (Extensible Linking Language) and XSL (Extensible Style Language), will similarly revolutionize hyperlinking and display. But the key is to remember that SGML is really at its core. Publishers who develop good SGML archives will be able to convert them easily to XML when necessary, further demonstrating the power and adaptability of SGML.

PDF: Extending the Power of Print

Ironically, in the middle of SGML's evolution as the most powerful interchange technology for information, Adobe Systems developed PDF for interchange too — but interchange of an entirely different sort. In fact, at first PDF was just an underlying technology to the main product it was created for: Adobe Acrobat.

Acrobat was created, like SGML, in response to the nightmarish state of document interchange created by the proliferation of systems and software triggered by the growing dependence on computers in every phase of commerce and communication. Whereas SGML arose mainly out of the mainframe world and the very large corporations and organizations (like the Department of Defense and the Securities and Exchange Commission) that used them, Acrobat was created to address the need of any user of a personal computer to exchange a document with customers, suppliers, or colleagues. (Keep in mind that that was before the Internet had really entered the public consciousness.)

Sharing Pages, not Just Information

Adobe created Acrobat so that a person creating a document on a Mac using Word, for example, could send the resulting pages electronically to a colleague down the hall or in a distant city working in WordPerfect on a PC. Unlike SGML, which presumed that it was best to let that recipient process and format the information in whatever way made sense in that local environment, Adobe presumed that the sender wanted the recipient to see the document in exactly the form in which it had been created (page-for-page, line-for-line, letter-for-letter, and including every graphic, every table, every equation, every white space, and even every color): The document would be visually identical.

To do that, Adobe developed PDF, Acrobat's Portable Document Format, as a by-product of PostScript, Adobe's powerful page-description language that had become the standard way to describe pages electronically in the graphics world. There is an essential difference between PostScript and PDF, however: PostScript is a programming language; PDF is a page-description format.

As a programming language, PostScript is powerful and dynamic, allowing a tremendous range of interpretation in various applications and providing the ability to produce visually identical pages by any number of means. However, a PostScript file must be read from beginning to end, because some programming expression in page one (declaring what fonts are being used, for example) can affect page 51. You can't just extract page 51 and expect it to stand alone.

PDF, on the other hand, as a page description format, is page independent: Every page has all the information needed to display that page. Because it doesn't allow all the complex computational alternatives of PostScript, PDF code is much more consistent and predictable. Whereas the PostScript description of a page will actually be dramatically different if that page is created in Quark vs. Penta vs. PageMaker vs. XyVision (because each of those applications builds PostScript files in a different way), the PDF that results from each of them will be very similar.
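The difference between a program that must be run from the start and a set of self-contained pages can be modeled in miniature. The sketch below is a toy: the operation names are loosely borrowed from PostScript, but this is neither real PostScript nor real PDF syntax, and the `run` function is an assumption made for illustration.

```python
# Toy page stream: state set while building page one is still in force on page two.
PROGRAM = [
    ("setfont", "Palatino"),
    ("show", "Page one text"),
    ("showpage",),
    ("show", "Page two text"),  # relies on the font chosen earlier
    ("showpage",),
]


def run(program):
    """Interpret a stream in order and return self-contained page descriptions."""
    font, page, pages = None, [], []
    for op, *args in program:
        if op == "setfont":
            font = args[0]
        elif op == "show":
            # Each item records its own font, so a finished page stands alone.
            page.append({"font": font, "text": args[0]})
        elif op == "showpage":
            pages.append(page)
            page = []
    return pages


pages = run(PROGRAM)      # whole stream interpreted once
tail = run(PROGRAM[3:])   # trying to start at page two instead

print(pages[1])
print(tail[0])
```

Page two taken from `pages` carries its font with it, while the same page produced from the tail of the stream has lost it, which is the dependency the text describes.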


The reason for that consistency is that PDF is created by a PostScript interpreter called the Distiller, which converts the PostScript produced by an application program into a very standard PDF description of the resulting page. (It's also possible to create PDF with the PDF Writer that is part of some applications like word processors and spreadsheets, but that is not as effective and is rarely done in the context of book and journal publishing.)

Expressing the Look Electronically

The resulting PDF retains all the information about what that page is supposed to look like. If the fonts are available on the recipient's computer, it will use them; if they aren't, it will emulate the fonts using two Multiple Master fonts that are integral to Acrobat. Although that emulation is surprisingly good in some cases — always preserving the sizes and line breaks so paragraphs and pages don't reflow — it emulates the appearance of some fonts better than others, so it is best to embed the fonts used to create the pages in the resulting PDF.

In addition to fonts, PDF represents all the other visual aspects of the page: line breaks, layout, white space, graphics, colors — every visual feature of the page. And because it is a vector technology rather than the bit-map associated with scanned images, the resulting files are very compact and yet the fonts and graphics adapt to the resolution of the output device. That way, when they're viewed on the screen, they may be seen at 72 dots per inch, but when they're printed on a laser printer, they're 300 or 600 dpi, and when they're output on a high-resolution imagesetter or platesetter they can be 2540 dpi — or even higher if higher resolution is available.
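The arithmetic behind that resolution independence is simple. Sizes in a PDF page are expressed in points (1/72 inch), and the output device converts them to its own dots. The 10-point size below is an arbitrary example, and the function name is an assumption made for this sketch.

```python
POINTS_PER_INCH = 72  # one point is 1/72 inch


def device_dots(size_pt, dpi):
    """Number of device dots spanned by a length given in points."""
    return size_pt * dpi / POINTS_PER_INCH


# The resolutions mentioned above: screen, laser printers, imagesetter.
for dpi in (72, 300, 600, 2540):
    print(f"{dpi:>5} dpi: 10 pt spans {device_dots(10, dpi):.1f} dots")
```

The page description stays the same in every case; only the device's conversion factor changes, which is why the same file serves both screen and imagesetter.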

Communicating Structure Visually

The type and graphics in the resulting PDF pages are displayed with stunning fidelity. That is very appealing to publishers who have put a tremendous amount of work into how their pages look. The esthetics matter more to some publishers than to others; but, more importantly, the layout and typography of well-produced print pages are the primary way they communicate structure. A reader can tell an extract from normal text because it's indented; a level-one subhead is more prominent than a level-two subhead; footnotes are at the bottoms of pages, "linked" via a superscript in the page; a sidebar is clearly supplementary, and pertinent to a specific passage in text because of its position on the page; figures and tables are placed as soon after they're first mentioned as possible.

Those formatting clues only seem obvious because they are a highly evolved visual language by which information has been structured for centuries. Publishers have well-evolved typographic and page makeup styles; typesetters spend a tremendous amount of time implementing them properly; and readers know at a glance how to interpret them.

In contrast to Acrobat, one of the things that makes SGML so complex is having to translate all of those conventions into the logical structure of a computer language. Anyone who has done a document analysis in preparation for writing an SGML DTD can attest to how difficult it can be to express in words the distinctions and relationships that are intuitively obvious on the page — and also to reveal those that are ambiguous, and resolve the ambiguity. Is this indented block the same as that indented block? No, this one's an extract and that one's an abstract. How can you tell? Because the abstract comes before the text starts, and it's in a different font. And what about this extract where the lines are all different lengths? Oh, that's a poetry extract; the justified ones are prose extracts. And by the way, the little flush-right line at the end of each one is an attribution. Etc., etc. You get the idea.

Analyzing your documents that thoroughly is in fact a very good thing to do, particularly when you're going to create book after book or journal after journal with the same structure. It actually makes everybody's work a lot easier — the designer's, the editor's, the typesetter's, and ultimately the reader's — to have all that spelled out explicitly. But it can be a lot of work to figure it out properly in the first place.

Acrobat Is Much Less Work than SGML

One of the most appealing aspects of Acrobat and PDF is that you don't have to go through all that. The structural information is conveyed the way it has always been — visually. (Of course somebody still had to figure out that structure in the first place; but that "somebody" is usually taken for granted in most publishing workflows.)

You can tell the level-two heads from the level-one heads just by looking at them. You know immediately that a particular extract is a poetry extract, and that the name of the poet is at the end of it. When you see a superscript, you look at the bottom of the page for the footnote. The first time a figure is mentioned, it is followed soon by that figure itself. All with no codes! (That's an illusion, of course. Codes were needed to arrange everything on the page in just that way in the first place. But that was done earlier, in a page-makeup program. Little or no work was needed to create the PDF from the print-page files, and no codes are apparent to a reader.)

That's not all. PDF presents tables in all their typographic glory, with their carefully structured relationships and alignments intact — without any extra work at all. The same goes for special characters — Greek characters, accented characters, special typographic dingbats: No translation necessary. If they appeared on the page, they can appear in the PDF. Best of all, equations — the bane of many electronic publishers' existence — require, literally, no work at all in Acrobat. A typeset equation is rendered on the screen just as it is on the page. If it's in the PostScript, it's in the PDF.

All the graphics are there, too. Vector graphics — such as those created by Adobe Illustrator — are incorporated into the PDF files automatically. Best of all, if there are small characters that are hard to read on the screen, like labels in artwork or superscripts in equations, the viewer can zoom in on them, and they become as clear as can be. Bitmapped graphics such as scanned halftones are incorporated too. Acrobat gives you the option of embedding the full high-resolution image used for the print page, or downsampling the graphics (systematically discarding bits of the graphic's data while retaining the overall effect) to achieve smaller file sizes and optimize electronic delivery. PDF even allows you to differentiate between halftones and line art, retaining the crispness of the latter through higher resolution while downsampling the former more drastically, where the smaller file size still results in an acceptable image.
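
To illustrate what downsampling means, here is a minimal sketch in Python. It averages each square block of pixels into a single pixel, which is one simple way to discard data while keeping the overall effect. It is an assumption for illustration only: the function name `downsample` and the tiny sample image are invented, and the method Acrobat actually uses may differ.

```python
def downsample(pixels, factor):
    """Average each factor-by-factor block of a grayscale image into one pixel.

    `pixels` is a list of rows of integer gray values (0-255). Rows or
    columns left over at the edges that do not fill a whole block are dropped.
    """
    height, width = len(pixels), len(pixels[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [
                pixels[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) // len(block))  # integer mean of the block
        result.append(row)
    return result


# A made-up 4 x 4 grayscale image.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 200, 50, 50],
    [100, 200, 50, 50],
]

small = downsample(image, 2)
print(small)  # [[0, 255], [150, 50]]
```

Halving the resolution in each direction leaves a quarter of the pixels, which is why downsampled halftones produce much smaller files. Line art loses its crisp edges under the same treatment, which is the reason for keeping it at a higher resolution.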

A Standard Format for Print

Acrobat brings many of those same advantages to the print world. In fact, printers are increasingly turning to PDF in preference to PostScript. PDF files have fewer problems when they get to the printer because the distilling process provides a kind of preflighting (a process printers often go through to smoke out problems in PostScript files), revealing potential snags such as missing fonts or graphics. PDF is also much more compact, often between a quarter and a tenth the size of the corresponding PostScript file, depending on the graphic complexity of the pages. And most importantly, PDF is page-independent, enabling a much simpler correction process and, best of all, parallel processing: Adobe's Extreme technology splits a PDF file into multiple streams and processes them through multiple RIPs at the same time. (RIPs are the Raster Image Processors that change the PostScript or PDF code into the black and white spots to be printed.) That dramatically reduces processing time. The latest generation of PostScript, PostScript 3, enables PDF to be processed directly in the RIP, without having to be converted back to PostScript first, as earlier generations required.
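
Page independence is what makes that kind of parallelism possible: if no page depends on the one before it, pages can be handed to whichever processor is free. The Python sketch below illustrates only that general idea. The functions `rip_page` and `rip_document` are hypothetical stand-ins, and nothing here reflects how Adobe's Extreme technology is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def rip_page(page_number):
    """Stand-in for a RIP: pretend to rasterize one independent page."""
    return f"page {page_number} rasterized"


def rip_document(page_count, rip_count):
    """Send pages 1..page_count through `rip_count` workers at once.

    `map` returns results in page order even when pages finish out of order.
    """
    with ThreadPoolExecutor(max_workers=rip_count) as pool:
        return list(pool.map(rip_page, range(1, page_count + 1)))


for line in rip_document(page_count=4, rip_count=2):
    print(line)
```

The same property explains the simpler correction process: a corrected page can be processed again on its own, without reprocessing the pages around it.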

For all those reasons, publishers are finding that PDF is the most valuable format in which to preserve their typeset print pages for output on laser printers, imagesetters, platesetters, or even digital presses like the DocuTech. Particularly when those pages might be printed on a web press for the first edition, reprinted on a sheetfed press, kept in print in small quantities on a DocuTech, and delivered on demand via the Web (with many occasions to proof on laser printers during the lifetime of the publication), the PDF archive is a valuable asset, streamlining the flow and reducing costs as the pages move from technology to technology.

So Why Bother with SGML?

If PDF is such an easy solution for electronic publishing, and is probably created anyway for print pages, why bother with SGML? The most obvious reason is that print pages are rarely formatted to work well on the screen. They tend to be vertical; the type tends to be small to conserve pages; and their structural elements and the way the reader's attention is intended to flow are communicated with visual cues that often require seeing a page as a whole (which may not be possible or convenient on a screen). It's possible to reformat pages to optimize them for the screen, of course, but that is rarely done because it requires going back to the page-layout application and redoing the pages, thus canceling out a lot of the convenience.

More important is the fact that PDF carries little structural information. Acrobat does offer some navigational features: thumbnail views are created automatically, bookmarks are created easily, and the full text can be "indexed" with Acrobat's Catalog software, enabling Boolean searching of collections of PDF files (like papers in a journal or chapters in a book). You can even add hyperlinks within and between Acrobat documents, but they are usually added by hand, which gets expensive if there are many of them. And PDF files do offer a limited amount of metadata, incorporating key words, revision dates, author names, etc.; but since those are typically not created in an automated way, they often aren't supplied at all.

Keep in mind the essential distinction: SGML is about structure, not appearance; PDF is about appearance, not structure. Although you can see that the indented block of text at the beginning of a journal paper is obviously an abstract, there is nothing in the PDF code labeling it so. PDF knows that the abstract is in Times Bold and is indented a pica on the left and right, but it has no idea what the text means. So it's impossible to write a mechanism that will display or search just the abstracts without a lot of extra hand coding. If you're only searching a few chapters in a book or the papers in a given issue of a journal, that is not a big obstacle; but if you want to search through a whole collection of books or the past five years of ten different journals, it's a huge obstacle.
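
By contrast, once the abstract is labeled as an abstract, searching only the abstracts becomes a few lines of code. The sketch below uses Python's standard `xml.etree.ElementTree` module on a small XML sample, with XML standing in for SGML because the standard library has no SGML parser. The element names (`issue`, `article`, `title`, `abstract`, `body`) and the sample text are invented for illustration and do not come from any real DTD.

```python
import xml.etree.ElementTree as ET

# A made-up issue with two papers; the tag names are hypothetical.
JOURNAL = """
<issue>
  <article>
    <title>First Paper</title>
    <abstract>Summary of the first paper.</abstract>
    <body><p>Full text of the first paper.</p></body>
  </article>
  <article>
    <title>Second Paper</title>
    <abstract>Summary of the second paper.</abstract>
    <body><p>Full text that happens to mention the first paper.</p></body>
  </article>
</issue>
"""


def abstracts_matching(xml_text, term):
    """Return (title, abstract) pairs whose abstract contains `term`.

    Only text tagged as an abstract is searched; body text is ignored.
    """
    root = ET.fromstring(xml_text.strip())
    hits = []
    for article in root.iter("article"):
        abstract = article.findtext("abstract", default="")
        if term.lower() in abstract.lower():
            hits.append((article.findtext("title"), abstract))
    return hits


print(abstracts_matching(JOURNAL, "first"))
# [('First Paper', 'Summary of the first paper.')]
```

The second paper mentions "first" in its body, but it is not returned, because the search is confined to the element that the markup identifies as the abstract. A PDF of the same pages offers no such label to search on.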

PDF Is Proprietary

There's one other key difference between SGML and PDF: SGML is a true, independent, international standard; PDF is becoming a de facto standard, but it is proprietary. It is owned and continues to be developed by Adobe Systems, Inc. To view a PDF page, you need Adobe's Acrobat Reader software (which, thankfully, is free and is available for Macintosh, Windows, and Unix). To create PDF from PostScript, you need to buy the Acrobat Distiller, either as a stand-alone program or as part of another application. To create bookmarks and put in hyperlinks, you need Acrobat Exchange. To index collections of PDF files, you need Acrobat Catalog. All of them are reasonably priced and easy to learn to use, but they are nevertheless proprietary. PDF is ultimately dependent on Adobe, whereas SGML is independent of all vendors, all software, and all platforms.

Structure Is Necessary

Ironically, SGML (or its progeny HTML and XML) is often used to manage the delivery of PDF pages. Journal publishers who may be in the process of moving to full-text SGML delivery often find PDF to be the most convenient way to supply full-text pages to their subscribers electronically, but those pages are almost always accompanied by SGML-based headers that let users navigate to the particular papers they want to retrieve. Even when real SGML headers aren't used, HTML (or, soon, XML) is necessary, if only to list the available content and say "click here to download PDF." The best way to create that HTML or XML is from an SGML archive.
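
As a rough illustration of how structured headers can drive PDF delivery, the Python sketch below turns a list of header records into an HTML list of download links. Everything in it is assumed for the example: the function name `build_index`, the record fields (`author`, `title`, `pdf`), and the placeholder file names. In practice the records would be extracted from the SGML headers rather than typed in by hand.

```python
from html import escape


def build_index(headers):
    """Build an HTML list with one 'download PDF' link per header record."""
    items = []
    for record in headers:
        items.append(
            f'<li>{escape(record["author"])}, '
            f'<cite>{escape(record["title"])}</cite> '
            f'(<a href="{escape(record["pdf"])}">download PDF</a>)</li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Placeholder records standing in for data drawn from SGML headers.
headers = [
    {"author": "Example Author", "title": "Structure & Pages", "pdf": "paper1.pdf"},
    {"author": "Another Author", "title": "A Second Paper", "pdf": "paper2.pdf"},
]

page = build_index(headers)
print(page)
```

Because the listing is generated from the headers, adding a paper to the archive updates the index without anyone editing HTML by hand.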

Knowing When to Use SGML and PDF

Is the choice between SGML and PDF a difficult one? Not if you understand how they work and what they're good for.

If you publish groups of products (books or journals) that share a clear, repetitive structure, you will almost surely find SGML to be beneficial. Although setting it up requires some effort and expense, a well-constructed SGML-based workflow can be the most economical one over time. It can reduce the cost of composition by providing fully tagged, error-free files for the typesetter. When publishers need to publish in more than one medium — typically print, CD-ROM, and Web — SGML almost always dramatically reduces the cost of producing subsequent versions, whether they are simply converted or modified and augmented to take advantage of the power of the various media.

If you need to produce typeset pages, you will almost surely find PDF to be beneficial. It will be the best file to furnish to the printer of your books and journals; it's a convenient way to deliver proofs electronically; it's the best way to deliver those typeset pages over the Internet for users to view or print out locally; and it's even a surprisingly effective and economical way to produce a simple CD-ROM. It requires so little extra work and cost (if you're already typesetting pages for print products) that it's almost a no-brainer.

A great many publishers are in both categories. We're lucky to be working in a time when such clear standards have emerged. When it was a choice between highly customized, proprietary systems, it was a mess. Now, it's clear that PDF is the best way to preserve an electronic description of the visual appearance of a page, and SGML (or XML) is the best way to describe the structure and meaning of its content. Most book and journal publishers need both.

For Further Information

The Web offers a wealth of opportunities for further exploration of SGML and PDF. As the developer of the Acrobat technology, Adobe (www.adobe.com) provides the most complete and up-to-date information on PDF. Likewise, Robin Cover's SGML/XML Web Page (xml.coverpages.org/) is a key source of SGML-related information; and XML.com (www.xml.com) is the best resource for the latest on XML.

For publishers just beginning to look into SGML and PDF, one good place to start is the Web site for my company, Impressions, where we provide a number of resources based on our experience implementing both SGML and PDF for book and journal publishers. See:

Electronic Publishing from Print Books and Journals ([formerly www.impressions.com/resources_pgs/elpub_pgs/bookjourn.html]): a white paper providing an overview of the options publishers often start with, from HTML to Acrobat, the Web to CD-ROM.
An Introduction to SGML and XML for Book and Journal Publishers ([formerly www.impressions.com/resources_pgs/SGML_pgs/SGMLintr.html]): a white paper that goes into some depth about those important technologies.
Making the Ideal Book or Journal for the Digital Era ([formerly www.impressions.com/resources_pgs/elpub_pgs/ideal_book.html]): our popular seminar, which surveys electronic editing, SGML, HTML, XML, and Acrobat.
SGML/HTML/XML Resources ([formerly www.impressions.com/resources_pgs/SGML_pgs/SGML_HTML_XML.html]): an updated listing of pointers to a variety of useful publications and Web sites relating to those topics.

Bill Kasdorf is president and owner of Impressions Book and Journal Services, a composition- and publishing-services firm that designs, edits, and produces books and journals in print and electronic forms. Serving a wide range of publishers — trade, professional, scholarly, college, technical, medical, and legal — Impressions has developed a national reputation for the effective and practical application of technology to the production of both books and journals.

Bill is also majority owner and president of Madison House Publishers, a small scholarly press, and minority owner and Vice President of IoStar, Inc., a software-development firm. From 1990-95, he was vice president of Edwards Brothers, a book and journal manufacturer. (He sold Impressions to EB in 1990 and bought it back in 1995.) In 1994-95 he served on the 15-member Xerox Book Publishing Advisory Council.

Bill is a graduate of the University of Wisconsin, where he was elected to Phi Beta Kappa in his junior year. A member of the Board of Directors of the Society for Scholarly Publishing, he is a frequent and effective speaker and seminar leader for SSP, as well as for other publishing-industry organizations such as the Association of American University Presses, Bookbuilders West, and the Chicago and Philadelphia Book Clinics. He lives in Ann Arbor, Michigan with his wife and two daughters.
