Abstract

In creating PubMed Central (PMC) [1], the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) needed a common format, with a single Document Type Definition (DTD), for all content in PMC. The first version of the NLM DTD was made available to the public in early 2003, and it quickly became the de facto standard for tagging journal articles in XML even outside the NLM. As usage grew, users and potential users started asking about formalizing the article models as a standard with the National Information Standards Organization (NISO).

Work on the NISO standard began in late 2009, and the Journal Article Tag Suite was released as a Draft Standard for Trial Use as NISO Z39.96 in March 2011.

A Short History of the NLM DTD Project

PubMed Central and the pmc-1.dtd

PMC is the NLM’s digital library of full-text life sciences journal literature. Currently it holds over 2 million articles from more than 250 publishers. Although PMC is also used to store articles based on research funded with NIH grants as part of the NIH Public Access project [2], the original intent of the project was to take full-text article submissions from publishers and make them available through the database. The only technical requirement at the time was that the publisher had to supply the articles in some SGML or XML format and include all images so that the articles could be displayed at PMC.

In early PMC (see Figure 1), the SGML or XML content was loaded into a database and then it was rendered into HTML from the publisher’s original SGML or XML when the article was requested by a user.

Figure 1: Early PMC Workflow

Shortly after the project began in 2000, it became obvious that this was not a scalable workflow. Both the database-loading software and the rendering step (between the “PMC Database” and “PMC Website” in Figure 1) had to be written to accommodate the different models of articles that would be in PMC. As new journals joined the project with different article models, all of the software would have to be modified.

The PMC workflow was changed so that content arriving in different publishers’ article formats was converted into a common format (Figure 2). The PubMed Central DTD was written based on the two article models that were being submitted to PMC at the time: the keton SGML DTD and the BioMed Central XML DTD. The pmc-1.dtd is still available at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/dtd/pmc-1.dtd, although it has not been used in PMC or for any articles submitted to PMC in years.

Figure 2. Updated PMC Workflow

Because this article model was built from a small sample set, the pmc-1.dtd started to grow as publishers submitted new formats for inclusion in PMC, so that it could handle article structures that were not needed for keton or BMC articles. It became obvious that the PubMed Central DTD should be reviewed and modified. NCBI contacted Mulberry Technologies in Rockville, Maryland, to perform an independent review of the pmc-1.dtd and to work on a replacement model.

Inera’s DTD Review

In 2001, the Harvard University Library E-Journal Archiving Project (using funds from the Mellon Foundation) commissioned a study into the feasibility of having one DTD that could be used to archive all electronic journals [3]. The report prepared by Inera, Inc., Belmont, Massachusetts, was a survey of the journal article DTDs from the following publishers.

  1. American Institute of Physics
  2. BioOne
  3. Blackwell Science
  4. Elsevier Science
  5. Highwire Press
  6. Institute of Electrical and Electronics Engineers
  7. Nature Publishing Group
  8. PubMed Central
  9. University of Chicago Press
  10. John Wiley & Sons

The report concluded at the end of 2001 that it seemed possible that there could be a single DTD that could accommodate any electronic journal article, but none of the DTDs in the study met all of the requirements.

Pmc-2.dtd

At this point, the modified version of the pmc-1.dtd was well under way. Many of the suggestions from the E-Journal Archive DTD Feasibility Study [3] were incorporated into the modified PubMed Central article model. When the modified model was shared with Bruce Rosenblum from Inera, he determined that the pmc-2.dtd was almost the one model that they had been looking for during the feasibility study.

A meeting was held in the spring of 2002 at the NLM that included representatives of NCBI/NLM, the Harvard Library, the Mellon Foundation, Mulberry Technologies, and Inera to try to work out the details of adopting the new pmc-2.dtd to general use for archiving any electronic journal article.

At this meeting it was determined:

  1. That the project would be a set of “standard” XML elements and attributes that could be used to build article models.
  2. That work should continue on the new models to expand them to handle any journal article content including a survey of articles across many disciplines to ensure that all article objects could be accommodated in the new model.
  3. That there should be two article models built initially: one that was a broad target for conversion of any article content into it for use by archives, and one that was more prescriptive that gave explicit rules for tagging content that could be used by publishers for creating content.

    The first model became the Archiving and Interchange Tag Set (http://jats.nlm.nih.gov/archiving/), and the second became the Journal Publishing Tag Set (http://jats.nlm.nih.gov/publishing/).

  4. That the new models should be easily extensible. One requirement was that the OASIS CALS table model should be available for use and easy to add to one of the models.

Version 1.0

Version 1 of the NLM Archiving and Interchange Tag Suite was released in early 2003. It included two article models: the Archiving and Interchange DTD (http://dtd.nlm.nih.gov/archiving/1.0/) and the Journal Publishing DTD (http://dtd.nlm.nih.gov/publishing/1.0/).

There was much interest in the release [4], and the DTDs started to be used for submission of content to PMC and for journals that were not PMC participants. To ensure that the article models kept up with usage and were updated sensibly, NLM established a feedback form through the Mulberry Technologies website where comments, complaints, and suggestions could be accumulated.

The NLM Archiving and Interchange Tag Suite Working Group

Also, the NLM established the NLM Archiving and Interchange Tag Suite Working Group, which would advise NLM on changes that needed to be made to the models and the Tag Suite based on feedback and the members’ own use.

The working group responded to the comments submitted through the feedback form and made suggestions for new versions. It was determined that whole number version number changes would indicate a major update and decimal version number changes would indicate a minor update.
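The numbering convention can be illustrated with a minimal Python sketch. The function name `classify_update` is hypothetical and is not part of any NLM or NISO tooling; it only assumes two-part version strings such as "1.1" or "2.0".

```python
def classify_update(old: str, new: str) -> str:
    """Classify a release as 'major' or 'minor' from two version strings.

    Hypothetical helper: a change in the whole number is a major update,
    a change only in the decimal part is a minor update.
    """
    old_major, old_minor = (int(part) for part in old.split("."))
    new_major, new_minor = (int(part) for part in new.split("."))
    if (new_major, new_minor) <= (old_major, old_minor):
        raise ValueError(f"{new} is not later than {old}")
    return "major" if new_major > old_major else "minor"


print(classify_update("1.0", "1.1"))  # minor
print(classify_update("1.1", "2.0"))  # major
```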

Version 1.1 of the models was released in November 2003 and was only a minor update.

Version 2.0

Version 2.0 was released in September 2004. It was considered a major update because of some structural changes in how the parameter entities in the DTDs were built. Because of this syntactic change, new version 2.0 DTD files could not reuse or refer to existing files from previous versions of the DTD even if the content models described in those files had not changed.

Version 2.0 was a completely backward-compatible change, however, from the content point of view. That is, any files that were valid against version 1.0 or 1.1 would still be valid if they were checked against version 2.0.

Different Views on “Archiving”

Two major users of the DTDs at this time were PubMed Central, which had migrated all of its content to the Archiving and Interchange DTD shortly after it was released, and Portico (http://www.portico.org). Portico was the E-Journal archive based on the Harvard University Library’s E-Journal Archiving Project.

Both PMC and Portico were active members of the NLM Working Group, and it soon became clear that PMC and Portico had different philosophies on archiving.

It was PMC’s mission to archive the content of the journal article without regard to article presentation except where it may have an effect on understanding the science in the article [5].

It was Portico’s mission to archive everything that they received in the submitted file [6].

In the Content Model Spectrum shown in Figure 3, models that fall to the left side are optimized for content to be converted into them—these are good archiving models. Models that fall on the right side are optimized for content creation. They give content producers more guidance.

Figure 3. Where the Archiving and Publishing Models fall on the content model spectrum in versions 1 and 2.

After version 2.0 was released, requests to make the Archiving model more optimized for conversion became so frequent that it was obvious some changes needed to be made.

Version 2.1 — The Authoring Model

It also became clear based on users’ suggestions that the publishing model was not really filling the role of a content-creation model. So, with version 2.1, both the Archiving model and the Publishing model made a shift to the “Optimized for Conversion” side of the spectrum, and a new model, the Article Authoring model, was introduced (Figure 4).

Figure 4. Where the Models fall on the content model spectrum with the release of version 2.1.

With increased usage, more comments came in, and versions 2.2 and 2.3 were released.

In 2006, the British Library and the U.S. Library of Congress announced that they supported the NLM Article models for archiving electronic journal articles [7]. This came with an understanding that the NLM would formalize the standard by registering it with NISO.

JATS-Con

In November 2010, NCBI held the first user group conference for the NLM DTDs, called “JATS-Con.” Proceedings from the meeting including the video of the presentations are available [8], and information on the 2011 meeting is available at http://jats.nlm.nih.gov/jats-con.

“JATS-Con” is the first use of the abbreviated form of “Journal Article Tag Suite.” The new name has become very popular since the meeting and has come to symbolize the next phase in the evolution of the Tag Suite.

Involvement of NISO

Version 3.0 — The First Backward-incompatible Version

When the discussion started about formalizing the Article Tag Suite with NISO, the original plan was to submit the latest version of the Tag Suite and the article models and have them registered. But, over the years, the Working Group had been putting off a number of fixes because they would not be backward compatible.

One of these, for example, is the id attribute on <list-item>. In version 1, it was mistakenly listed as a CDATA attribute, meaning it could take any character data content, rather than as an ID attribute. This meant that <list-item> could not be referenced by an <xref> element (specifically that the <list-item> could not be referenced by the rid attribute on <xref>, which is of type IDREF).

This is a minor error, but changing the id attribute on <list-item> in version 2.0 to ID instead of CDATA would mean that any content created under version 1.0 that had a value of a list-item id attribute that was not a valid XML ID would not be valid under the new Version 2.0 DTD. This would have been a backward-incompatible change and one that the Working Group chose not to make.
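To make the incompatibility concrete, here is a minimal Python sketch that lists the id values that a CDATA declaration accepts but an ID declaration would reject. It is only an approximation under stated assumptions: the pattern covers the ASCII subset of the XML Name production (the full production also allows many non-ASCII characters), the function name `invalid_as_xml_id` is hypothetical, and the sample values are invented for illustration rather than taken from any real article.

```python
import re

# ASCII-only approximation of the XML "Name" production (an assumption made
# for brevity): a letter, underscore, or colon, followed by letters, digits,
# periods, hyphens, underscores, or colons.
_ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def invalid_as_xml_id(values):
    """Return the values that would fail if the attribute were of type ID.

    An ID value must be a Name and must be unique within the document;
    a CDATA attribute imposes neither constraint.
    """
    seen = set()
    rejected = []
    for value in values:
        if not _ASCII_NAME.fullmatch(value) or value in seen:
            rejected.append(value)
        seen.add(value)
    return rejected


# Hypothetical id values found on <list-item> elements in one document.
sample_ids = ["li1", "2", "step one", "li1", "_a.b-c"]
print(invalid_as_xml_id(sample_ids))  # ['2', 'step one', 'li1']
```

Every value in `sample_ids` is acceptable as CDATA, so a version 1.0 document containing them is valid. Redeclaring the attribute as ID would invalidate it because of the leading digit, the embedded space, and the repeated value.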

At this point, the Working Group decided to make all pending changes, and submit the cleanest version possible to NISO. Version 3.0 of the Tag Suite and the three article models were released in November 2008 [9][10][11].

The NISO Working Group

NCBI proposed a work item to NISO for the Standardized Markup for Journal Articles Based on NLM’s Journal Archive and Interchange Tag Suite. The work item was approved by NISO, and by the end of 2009, the NLM Working Group was disbanded and reformed as a NISO Working Group. Fortunately, for continuity of the project, many of the original members agreed to sign on to the new NISO group. The first meeting was held in December 2009.

A year had passed since version 3.0 was released, and a number of new suggestions had accumulated. The new Working Group decided that it should address these requests in a timely manner, create an update to version 3.0 of the NLM models, and submit that to NISO rather than the year-old version 3.0.

Draft Standard for Trial Use

On March 30, 2011, after approval of the NISO Standardized Markup for Journal Articles Working Group and the NISO Content and Collection Management Topic Committee, NISO released NISO Z39.96x, JATS: Journal Article Tag Suite, as a Draft Standard for Trial Use [12].

The standard describes a set of XML elements and attributes that can be used to build journal article models and three article models or Tag Sets: the Archiving and Interchange Tag Set, the Journal Publishing Tag Set, and the Article Authoring Tag Set. The draft standard is available for public comment until September 30, 2011.

The standard does not contain any schemas or usage documentation. All of the non-normative supporting documentation is available on the NLM JATS website at http://jats.nlm.nih.gov.

A New Name and a New Number

The working group decided that it did not make sense to start the NISO project with version 3.1, so the schemas are being released in the non-normative documentation as “JATS version 0.4.” It is anticipated that once all of the comments have been addressed, the next version, which should be the one approved as a standard (NISO Z39.96), will be “JATS version 1.0.”

Note that the models described as JATS version 0.4 are essentially a minor backward-compatible upgrade to the NLM version 3.0 models.

The Role of the NLM in NISO Z39.96

NLM has created and maintains the non-normative supporting documents for Z39.96, including complete Tag Libraries and DTD, XSD, and RELAX NG schemas for the three article models described in the standard. All of this documentation is available at http://jats.nlm.nih.gov.

Additionally, all of the supporting documentation and schemas for the earlier NLM article models (versions 1.0 through 3.0) are still available at http://dtd.nlm.nih.gov.

Adoption of JATS

PubMed Central has not moved to JATS version 0.4 and will most likely wait for JATS version 1.0 before it starts converting content over to the new model. However, PMC will accept content in any of the article models from NLM 1.0 through JATS 0.4, and NCBI is actively testing the new model.

Once the standard is released as JATS 1.0, it will move into continuing maintenance, and the NISO Standardized Markup for Journal Articles Working Group will be reformed to support the ongoing needs of the journal article XML community.



Jeff Beck is a Technical Information Specialist at the National Center for Biotechnology Information at the U.S. National Library of Medicine. He has been involved in the PubMed Central project since it began in 2000. He has been working in print and then electronic journal publishing since the early 1990s. Currently, he is co-chair of the NISO Z39-96, JATS: Journal Article Tag Suite Working Group. He is a BELS-certified Editor in the Life Sciences.

References

    1. National Center for Biotechnology Information. 2000. PubMed Central. http://www.ncbi.nlm.nih.gov/pmc (accessed April 28, 2011).

    2. NIH. “National Institutes of Health Public Access.” http://publicaccess.nih.gov/ (accessed April 24, 2011).

    3. Inera Inc. 2001. “E-Journal Archive DTD Feasibility Study.” December 5, 2001. http://www.diglib.org/preserve/hadtdfs.pdf (accessed April 27, 2011).

    4. Robin Cover. 2003. “NLM Releases XML Tagset and DTDs for Journal Publishing, Archiving, and Interchange.” Cover Pages. May 30, 2003. http://xml.coverpages.org/ni2003-05-30-a.html (accessed May 2, 2011).

    5. Jeff Beck. 2010. “Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal Articles.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, 6. doi:10.4242/BalisageVol6.Beck01.

    6. Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, and Umadevi Thanneeru. 2010. “Portico: A Case Study in the Use of XML for the Long-Term Preservation of Digital Artifacts.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, 6. doi:10.4242/BalisageVol6.Morrissey01.

    7. Library of Congress. 2006. “Library of Congress, British Library to Support Common Archiving Standard for Electronic Journals.” News from the Library of Congress. April 19, 2006. (accessed May 2, 2011).

    8. JATS-Con: Proceedings of the Journal Article Tag Suite Conference 2010. Bethesda, MD, November 1–2, 2010. (accessed May 2, 2011).

    9. NCBI. Archiving and Interchange Tag Set, Version 3.0. (accessed May 2, 2011).

    10. NCBI. Journal Publishing Tag Set, Version 3.0. http://dtd.nlm.nih.gov/publishing/3.0/ (accessed May 2, 2011).

    11. NCBI. Article Authoring Tag Set, Version 3.0. (accessed May 2, 2011).

    12. NISO Standardized Markup for Journal Articles Working Group. Z39.96 JATS: Journal Article Tag Suite. http://www.niso.org/standards/z39-96/ (accessed May 2, 2011).