Q. A.: HTML, PDF and TXT: The Format Wars

For as long as technology has been part of our lives, format wars have raged. 33-rpm vs. 78-rpm records. Cassettes vs. eight-track tapes. Beta vs. VHS. PCs vs. Macs.

The Web was supposed to put an end to all that. After all, the HTML standard was designed to allow users of any computer to see a document the same way it would appear on any other computer. And to some extent, it has lived up to that promise.

But if there isn't a format war raging, there is at least some confusion over the best way to present information on the Web. Forget for a moment about specialized cases, such as offering a downloadable data set or audio clips. Even when the information is nothing more than text and graphics, it's not always an easy call to decide the best format for it.

The State of the Web

It takes only a few minutes of surfing the Web to realize that HTML documents are the backbone of the Web, for many good reasons:

  • HTML is fairly easy to learn, and WYSIWYG (what you see is what you get) editors are making it easier with each new release to build HTML pages.

  • HTML is competent at presenting text and graphics in a reasonably decent layout. With the introduction of Cascading Style Sheets (CSS), it has become very good.

  • Web browsers readily accommodate a multitude of plug-ins that allow the inclusion of audio, video, 3-D, and other specialized files. Any of these can be included as a link in a standard HTML page; clicking the link loads the plug-in to view or play the file.

  • <META> tags allow HTML pages to be found easily by search engines, and allow the creators of HTML pages to specify the descriptions of those pages that the search engines should provide.

  • HTML files are tiny. Even complex pages can be delivered to a viewer's screen almost instantaneously, provided graphics are kept to reasonable sizes. [1]

  • Through the use of interactive forms, HTML pages allow the collection of information online.
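
To make the point about <META> tags concrete, here is a minimal sketch of how an indexing program might read a page's description and keywords. It uses Python's standard html.parser module. The sample page and the names MetaTagCollector and read_meta_tags are invented for illustration, and real search engines are free to weigh or ignore these tags as they see fit.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def read_meta_tags(html_text):
    """Return a dict such as {"description": "...", "keywords": "..."}."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """<html>
<head>
<title>The Format Wars</title>
<META NAME="description" CONTENT="Choosing among HTML, PDF and plain text.">
<META NAME="keywords" CONTENT="HTML, PDF, plain text">
</head>
<body><p>Article text goes here.</p></body>
</html>"""

if __name__ == "__main__":
    for name, content in read_meta_tags(SAMPLE_PAGE).items():
        print(f"{name}: {content}")
```

The description is the text a page's creator is asking the search engine to display; whether a given engine honors it is outside the page author's control.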

With all those advantages, HTML has much to offer online publishers. Unfortunately, it also has some important shortcomings:

  • HTML does not support scientific notation, so documents that rely on notation cannot be translated effectively to HTML. This could change, however, once the World Wide Web Consortium approves MathML, an XML application for describing mathematical notation and capturing both its structure and content. [2]

  • Even though Cascading Style Sheets have been around since early 1997 (and have supposedly been supported by some browsers almost as long), not even the latest browsers render pages created with CSS in a predictable manner. And for at least the next few years, a large number of Web visitors will still be using 3.x browsers that don't support CSS at all: Web expert Jakob Nielsen recently wrote that "sites will have to support users with version 3 browsers until early 2001." [3] So for the short term, good-looking HTML pages can be built only by rendering text as graphics (which slows page loading time) and tucking elements into table cells (which also slows down loading, and makes the pages virtually unusable for those with visual disabilities or text-only browsers). [4]

  • HTML documents are designed for reading on the screen. Because screen resolution is so low (only 72 or 96 dots per inch, as opposed to the 300 dpi that even a $99 printer offers), reading lengthy documents on screen is hard on the eyes and takes more time. Consequently, many users print Web pages to read offline, only to find that a page that looks good on the screen can be a mess when printed. [5]

  • With the use of the comment tag, HTML documents offer a limited ability for writers and editors to share notes, questions, etc. For instance, after I submit a column to JEP, I receive a version with my editor's comments sprinkled throughout. That's useful, but in order to view those comments, I have to open the document in an editor and hunt down the comments amid hundreds of lines of source code. In longer documents, it can be easy to overlook comments. And long, detailed comments can require separate e-mail messages that can be hard to keep track of.
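
The hunt for editors' comments described in the last item can be partly automated. The sketch below, again using Python's standard html.parser module, lists every HTML comment in a file along with the line on which it starts. The sample draft, the "ED:" prefix and the names CommentFinder and find_comments are hypothetical; this is not how JEP's editors actually work, only one way a writer might pull the notes out of the source.

```python
from html.parser import HTMLParser


class CommentFinder(HTMLParser):
    """Record each HTML comment with the line number where it begins."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # getpos() reports the parser's current (line, column) position,
        # which is the start of the comment when this handler runs.
        line_number, _column = self.getpos()
        self.comments.append((line_number, data.strip()))


def find_comments(html_text):
    """Return a list of (line_number, comment_text) tuples."""
    finder = CommentFinder()
    finder.feed(html_text)
    finder.close()
    return finder.comments


# Hypothetical draft with two editorial notes, invented for this example.
SAMPLE_DRAFT = """<html>
<body>
<p>Format wars have raged for decades.</p>
<!-- ED: Can we add a more recent example here? -->
<p>The Web was supposed to end all that.</p>
<!-- ED: Cite a source for this claim. -->
</body>
</html>"""

if __name__ == "__main__":
    for line_number, text in find_comments(SAMPLE_DRAFT):
        print(f"line {line_number}: {text}")
```

A listing like this makes comments harder to overlook in a long document, but it does nothing for the other half of the problem: lengthy notes still end up in separate e-mail messages.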

PDF to the Rescue

Chances are you've come across another format in your travels on the Web: Adobe's Portable Document Format. PDF files cannot be posted as Web pages per se, but they can be linked to from HTML pages. They're used on a wide range of sites, to deliver everything from technical documents to tax forms. PDF files can also be used to provide updates between regular print editions, a tactic employed by Digital Queue, among others.

In creating a PDF file, Acrobat essentially takes a snapshot of an electronic file. The beauty of Acrobat is that it doesn't matter what program created the file, and the viewer does not need to have the original program or fonts on his or her computer. All the viewer needs is the free Acrobat Reader plug-in, available from Adobe.

What the viewer sees is pretty much a photo of the original document, with fonts, graphics, advanced layout, scientific notation, and any other features that were in the original. As Acrobat Product Manager Joel Geraci told Desktop Publishers magazine, "If you have to deliver a final-form document where layout is critical, PDF is really the only option." [6]

In addition to its fidelity to the original document, PDF in its latest incarnation offers several other benefits for online publishers. Chief among them is a powerful review feature, which allows editors to add their comments in several ways:

  • by inserting comments using electronic notes;

  • by highlighting, underlining, striking through and circling text; and

  • by marking files as "Approved" or "Confidential" with clip-art stamps.

When an author receives the marked-up PDF file, he or she can generate a summary of reviewers' comments organized by reviewer, as well as use a "Document Compare" feature to view original and reviewed versions of the document.

Some key limitations of earlier versions of Acrobat also have been addressed in the latest version. Previously, a PDF file might as well have been set in stone: Once it was created, there was no way to make even minor modifications to it without creating a new file. Nor was there any way to capture information from a PDF file and import it into other documents or applications. The latest version of Acrobat addresses those problems by allowing authors and publishers to make last-minute text and image changes on PDF files, and by allowing the reuse of text, graphics and table data from PDF files.

Finally, the latest version of Acrobat allows PDFs to collect information online with interactive forms, something only HTML files could do before.

One Size Does Not Fit All

So why doesn't everyone just switch from HTML to PDF? Because PDF is not perfect, of course. Still, files created with Acrobat's current iteration suffer from only a few problems:

  • Encoding files in the PDF format requires purchasing and learning another program. Neither of these is a big deal; the street price of Acrobat 4.0 is about $150, and many users will be able to turn out PDF files within five minutes of installing the program.

  • Readers have to download the free plug-in to view a PDF file. But with an estimated 50 million downloads to date, that doesn't seem insurmountable.

  • PDF files can be huge. While a short document optimized for screen can download almost as quickly as an HTML file, longer PDF files that are optimized for printing can take ages to download over a 28K or 56K modem. If most visitors to a site come from the corporate or educational worlds where high-speed lines are more typical, even that might not be a big objection.
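
A little arithmetic shows how quickly file size turns into waiting time. The sketch below computes best-case download times; the three file sizes are hypothetical figures chosen for illustration, and I am assuming the "28K" modem runs at its nominal 28,800 bits per second. Real connections are slower because of protocol overhead and line noise, so treat the results as lower bounds.

```python
def download_seconds(size_bytes, modem_bits_per_second):
    """Best-case transfer time in seconds (ignores overhead and line noise)."""
    return size_bytes * 8 / modem_bits_per_second


# Hypothetical file sizes, in bytes, invented for this example.
FILES = {
    "HTML page (30 KB)": 30 * 1024,
    "screen-optimized PDF (100 KB)": 100 * 1024,
    "print-optimized PDF (2 MB)": 2 * 1024 * 1024,
}

# Nominal modem speeds in bits per second (assumed values).
MODEMS = {"28.8K": 28_800, "56K": 56_000}

if __name__ == "__main__":
    for label, size in FILES.items():
        for modem, speed in MODEMS.items():
            seconds = download_seconds(size, speed)
            print(f"{label:32s} {modem:>6s}: {seconds:7.1f} seconds")
```

Under these assumptions the 2 MB file needs just under ten minutes at 28.8K and about five minutes at 56K, while the 30 KB HTML page arrives in less than ten seconds on either modem.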

Of course, if download speed is the most important criterion for a publisher, there is one other viable format: plain text. Putting plain-text files on the Web is not a great idea: By definition, they cannot include any graphic or typographic devices that help make long spans of gray type readable on the printed page. (See, for example, Andrew Odlyzko, The Visible Problems of the Invisible Computer.) But plain text files can be handy for distributing newsletters, columns, updates and headlines, and the like to subscribers' e-mail boxes.

So what format is the best? The ideal for many online publications would be a combination of all three: a plain-text e-mail alert, an HTML version for fast loading and online reading, and a downloadable PDF version for offline reading. That may seem like a lot of work, but many word-processing and desktop-publishing programs can now produce HTML and PDF versions with the click of a mouse. Short of offering every format, though, the best bet is to make sure you know who your readers are and how they use your publication. That knowledge is the best guide to making sure you give them what they'll use the most.
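
For publishers whose software does not generate every version automatically, the plain-text alert can be derived from the HTML version with a short script. The following is a minimal sketch, assuming the article's text sits in ordinary paragraph, heading and list-item tags; the names ParagraphExtractor and html_to_plain_text and the 72-character line width are my own choices, not a standard.

```python
import textwrap
from html.parser import HTMLParser

# Tags whose contents are treated as one paragraph each (an assumption).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}


class ParagraphExtractor(HTMLParser):
    """Gather the text inside block-level tags, dropping inline markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text pieces collected while inside a block tag

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._current is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None


def html_to_plain_text(html_text, width=72):
    """Return the page's paragraphs as wrapped plain text for e-mail."""
    extractor = ParagraphExtractor()
    extractor.feed(html_text)
    extractor.close()
    wrapped = [textwrap.fill(p, width=width) for p in extractor.paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    page = "<h1>Format Wars</h1><p>HTML is <b>fast</b>; PDF is faithful.</p>"
    print(html_to_plain_text(page))
```

A script like this discards graphics and typography by design, which is exactly the trade-off plain text makes in exchange for speed.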

The author thanks Suzanne Bourdess of Towson University for editorial review.


JEP Contributing Editor

Thom Lieb is an associate professor of journalism and new media at Towson University in Baltimore. Among his courses is Writing for New Media. He is the author of Editing for Clear Communication and has written and edited for magazines, newspapers, newsletters and online publications. He holds a Ph.D. in Public Communication from the University of Maryland at College Park and a master of science in Magazine Journalism from Syracuse University. You may contact him by e-mail at lieb@towson.edu.


1. Thom Lieb, "Caution: Speed Zone," Journal of Electronic Publishing, December 1997.

2. World Wide Web Consortium, Mathematical Markup Language (MathML™) 1.01 Specification. http://www.w3.org/1999/07/REC-MathML-19990707/

3. Jakob Nielsen, "Stuck With Old Browsers Until 2003," Alertbox, April 18, 1999. http://www.useit.com/alertbox/990418.html

4. Lieb, "Access Code," Journal of Electronic Publishing, June 1998.

5. Nielsen, "In Defense of Print," Alertbox, February 1996. http://www.useit.com/alertbox/9602.html

6. Scott Bury, "Blending the Elements," Desktop Publishers, June 1999. [Editor's note: link removed August 2001 because the page no longer exists.]

Links from this article:

Adobe Acrobat Reader, http://www.adobe.com/prodindex/acrobat/readstep.html

Content Exchange newsletter, [formerly http://www.content-exchange.com/cx/html/newsletter/emailnews.htm]

Digital Queue, http://digitalout.com/

Internal Revenue Service forms, http://www.irs.ustreas.gov/prod/forms_pubs/index.html

Odlyzko, Andrew, The Visible Problems of the Invisible Computer, http://www.dtc.umn.edu/~odlyzko/doc/visible.problems.txt

Western Digital Technology Library, http://www.westerndigital.com/acrobat/welcome.html