The Virtual Observatory at Johns Hopkins University is a quintessential cyberinfrastructure project. Its objective is to support new science by greatly enhancing access to data and computing resources, providing a “virtual sky” and making it possible for astronomical researchers to find, retrieve, and analyze astronomical data from ground- and space-based telescopes worldwide.

According to the Virtual Observatory Web site (http://us-vo.org):

The VO will enable a new way of doing astronomy, moving from an era of observations of small, carefully selected samples of objects in one or a few wavelength bands, to the use of multi-wavelength data for millions, if not billions of objects. Such datasets will allow researchers to discover subtle but significant patterns in statistically rich and unbiased databases, and to understand complex astrophysical systems through the comparison of data to numerical simulations. The VO will provide simultaneous access to multi-wavelength archives and advanced visualization and statistical analysis tools.

The VO is intended to make it easy to locate, retrieve, and analyze data from archives and catalogs worldwide, and it assumes that astronomical data is distributed rather than centralized. Thus, the Virtual Observatory is concerned with data discovery, data access, and data integration, the hallmarks of cyberinfrastructure projects.

The Virtual Observatory’s capabilities are enabled through the use of standard protocols for registering the existence and location of data and for requesting data that satisfies the user’s interests. These standards are developed on a national basis through the US National Virtual Observatory,[1] and on an international basis through the International Virtual Observatory Alliance.[2] The goal is to establish one set of standards that is accepted worldwide.

The essence of the Virtual Observatory is interoperability. Data discovery, data access, and database queries are enabled by metadata standards. A primary goal of the Virtual Observatory is to provide integrated access to archival data and derived data products: catalogs, tables, and highly processed images, spectra, and time series. The initial focus of VO development has been on providing access to the archival datasets that are already available via public interfaces on the Web, but often with unique and incompatible interfaces. Derived data products are the purview either of dedicated large projects or of individual researchers. Large projects have thus far worked to provide access to these high-level products. The valuable data from individuals or small collaborations sometimes appear in the electronic journals and sometimes on personal Web sites, but most often these data are not available at all in any standard form or via any standard interface.

While the VO has made significant impact in astronomy, thus far its scope has deliberately not included long-term data curation, focusing instead on data location and data access standards and protocols. Based on extensive, ongoing dialogue, the VO project team has concluded that academic research libraries represent the ideal home for long-term curation of large-scale datasets to support scholarly communication, given their expertise and long-term, sustainable support from universities. Libraries play an important role in curation, a role that becomes more complicated with the increasing use of electronic means of content delivery. With funding from the Institute of Museum and Library Services and Microsoft, the library at Johns Hopkins and the VO have been developing a prototype data-curation system, and have learned valuable lessons.

Getting to Know You

Today the representatives of the VO and the library view each other as equals in a cohesive partnership. This did not happen without a good deal of time getting to know each other. Even with good intentions, mutual respect, and common interests, there was still a need to gain familiarity, learn jargon, understand needs, and appreciate expertise. As I reflect on the development of this partnership, I wonder occasionally what the VO researchers initially expected of us at the library beyond providing a sustainable, organizational home for supporting data curation. Undoubtedly, the VO astronomers at Johns Hopkins believed that libraries, archives, and museums could build upon the principles that have served well for many years to support data curation. However, the astronomers may have been surprised by the deep technical expertise in the library’s Digital Knowledge Center (now re-launched as the Digital Research and Curation Center or DRCC). The DRCC comprises a unique set of information technology professionals in a library setting who have conducted research and development for over 10 years. Thus the DRCC was able to participate in a rich, unique dialogue that has illuminated our path as we discover new forms of scholarly communication.

In the process of getting to know each other, the librarians and astronomers moved the conversation further back in the data-curation process, from system design to reconsidering requirements. As we in the DRCC learned more about astronomy, we refined our suggestions about access mechanisms; as the astronomers of the VO learned more about digital archiving, they refined their suggestions about characteristics of curation. Essentially, we have started to make decisions about requirements, system design, implementation, sustainability, and governance together. There is no doubt that each group possesses specific expertise, and the growing understanding of each other’s domains has resulted in unforeseen insights and a richer, collaborative decision-making process.

While there are many accounts of cyberinfrastructure-enabled scholarly communication from the scientists’ perspective or from the digital librarians’ perspective, there are few, if any, that account for both perspectives together. Below is my account of some of what we have learned in the ongoing dialogue between the astronomers and digital librarians at Johns Hopkins.

New Forms of Scholarly Communication

Data are publications

Astronomy data flow through various levels of processing. For example, the telescope for the Sloan Digital Sky Survey captures the raw pixel data. These data are sent to Fermilab for processing. Subsequently, a Beowulf cluster (a set of personal computers linked together for high-performance processing) produces a catalog based on these data. Ultimately, data are loaded into a SQL database and released to the community. Once data are processed to this degree, astronomers consider them “level 3” data, which are on the order of terabytes. Further processing and analyses of these level 3 data result in even further refined “level 4” data, which may be cited in traditional journal publications. Much of the existing discussion regarding the connection between data and publications focuses on these level 4 data (or their equivalent in other disciplines).

The current data curation prototype development effort at Johns Hopkins also focuses on level 4 data. However, there is a great deal of interest surrounding the aforementioned level 3 data releases. In the past, scientists acquired data, analyzed them, and then published results. Given the size of the data being generated today, it has become necessary to produce a refined “publication” of data before the community can perform its research. That is, rather than “acquire, analyze, then publish” the chain has shifted to “acquire, publish, then analyze.” The VO scientists say that the level 3 data releases should be treated as a new form of publication. Much of the discussion regarding new forms of scholarly communication considers data and articles as a new form of compound publication. Our experience leads us to submit that data releases, even without accompanying articles, might be considered a new form of publication.

Database queries are a form of scholarly communication

Astronomers query databases in three main forms: by observational parameters such as type of instrument or accuracy of data; by parameters of the desired object such as name, position, or color; and through a description of the science (e.g., astrophysical concepts such as “globular cluster metallicity” or work by a particular author). These queries reveal a great deal regarding a particular astronomer’s insights, hypotheses, and research ideas. As my colleague in the DRCC, Tim DiLauro, learned more about these queries, he realized that they represent a form of scholarly communication or intellectual expression. Our astronomy colleagues have also mentioned that these queries could be useful for assisting fellow astronomers (e.g., by comparing against associated results) and for training students or amateur astronomers. As such, in addition to preserving data, it may be important to preserve the queries.
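To make the three query forms concrete, here is a minimal sketch in Python using an in-memory SQLite table. The table layout, column names, object names, and values are all hypothetical and chosen only for illustration; they do not reflect the schema or contents of any actual VO archive.

```python
import sqlite3


def build_demo_catalog():
    """Create an in-memory catalog with made-up rows (not real survey data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "name TEXT, ra_deg REAL, dec_deg REAL, color REAL, "
        "instrument TEXT, mag_error REAL, concept TEXT)"
    )
    rows = [
        ("OBJ-1", 250.4, 36.5, 0.68, "imager_a", 0.02, "globular cluster metallicity"),
        ("OBJ-2", 10.7, 41.3, 0.92, "imager_b", 0.05, "galaxy morphology"),
        ("OBJ-3", 259.3, 43.1, 0.63, "imager_a", 0.08, "globular cluster metallicity"),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def by_observation(conn, instrument, max_error):
    """Form 1: select by observational parameters (instrument, accuracy)."""
    sql = ("SELECT name FROM objects "
           "WHERE instrument = ? AND mag_error <= ? ORDER BY name")
    return [row[0] for row in conn.execute(sql, (instrument, max_error))]


def by_object(conn, ra_min, ra_max, dec_min, dec_max):
    """Form 2: select by properties of the object, here a position box."""
    sql = ("SELECT name FROM objects "
           "WHERE ra_deg BETWEEN ? AND ? AND dec_deg BETWEEN ? AND ? "
           "ORDER BY name")
    return [row[0] for row in conn.execute(sql, (ra_min, ra_max, dec_min, dec_max))]


def by_science(conn, concept):
    """Form 3: select by a description of the science."""
    sql = "SELECT name FROM objects WHERE concept = ? ORDER BY name"
    return [row[0] for row in conn.execute(sql, (concept,))]


conn = build_demo_catalog()
print(by_observation(conn, "imager_a", 0.05))            # ['OBJ-1']
print(by_object(conn, 255.0, 265.0, 40.0, 45.0))         # ['OBJ-3']
print(by_science(conn, "globular cluster metallicity"))  # ['OBJ-1', 'OBJ-3']
```

Even in this toy form, each function call records a choice of instrument, accuracy threshold, sky region, or concept, which is the kind of intellectual expression that may be worth preserving alongside the data.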

New roles for the library

As the library moves into the center of VO activities, new roles are being defined. Regarding the aforementioned queries, there is an issue associated with astronomers sharing these queries with other astronomers. As a trusted, objective organization, the library can play a role as a verifying “authority” by time-stamping queries or adding provenance information to validate or track the chain of scholarly communication without compromising the confidentiality or integrity of such queries. In this case, a lack of depth of knowledge in a specific domain turns out to be useful. However, a breadth of knowledge across disciplines might be even more useful. The Sheridan Libraries at Johns Hopkins will develop a common layer of infrastructure that will support scholarly communication across disciplines. As it is at the crossroads of multi-disciplinary research, the library will adopt a more active role in bringing together teams of researchers and educators.
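One way to picture the time-stamping role described above is a registry that stores only a cryptographic digest of each query together with a timestamp, so the library can later confirm when a query existed without ever holding its text. The following Python sketch is an assumption about how such a service might work, not a description of the prototype at Johns Hopkins; the function names and record fields are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def register_query(query_text, author, now=None):
    """Return a provenance record that holds a digest, not the query itself."""
    when = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    return {
        "author": author,                  # hypothetical field names
        "sha256": digest,
        "registered_at": when.isoformat(),
    }


def verify_query(query_text, record):
    """Check that a disclosed query matches a previously registered digest."""
    digest = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    return digest == record["sha256"]


query = "SELECT name FROM objects WHERE concept = 'globular cluster metallicity'"
record = register_query(query, "astronomer_a")
print(verify_query(query, record))              # True
print(verify_query(query + " LIMIT 10", record))  # False
```

A production service would need more than this sketch provides: for example, the library would sign each record so that the timestamp itself can be trusted, and short or guessable queries would need a secret random value mixed into the digest to keep them confidential.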

Finally, all libraries need to consider the implications of these new forms of scholarly communication from a collection development standpoint. If the purpose of building collections is to capture and preserve the intellectual record of scholarship, then libraries need to consider what that means in the context of data-driven science. These points illuminate only the first set of potential implications for libraries as they consider these new forms of scholarly communication such as data and associated queries.

The results will be important for scholarship. My colleague, Tim Stinson, the Council on Library and Information Resources Post-Doctoral Fellow at Johns Hopkins, and I recently drew a connection between scientific data and medieval manuscripts[3] that underlined that importance:

Perhaps it is not a set of inherent characteristics within specific disciplines that defines their mode of scholarship or communication, but rather the relative ease or difficulty with which practitioners of those disciplines can generate, acquire or process data.

Realizations

We are “inventing” rather than “reinventing” scholarly communication today. Reinventing seems to imply wholesale change, but in relation to an existing set of practices. While there are several reasons to consider existing reference points, it may be worth considering entirely new approaches toward publications and library collection development policies, service models, and infrastructure development. Our scientists are already engaged in such a transformation with their scholarly practices. Should not libraries be prepared to do the same?

We often hear that technology is the easy part. As a technologist, I have often felt irritated by this statement because it feels as if the accomplishments of my colleagues are being trivialized. Nonetheless, it is true that the human and social dimensions of cyberinfrastructure may present greater challenges than the technological dimensions. Whether this is because technology is easy or people are hard is open to debate. However, it seems clear that human interoperability is more complex than machine interoperability. Our experience at Johns Hopkins has demonstrated that human interoperability is essential to define requirements properly and to develop road maps appropriately. Without such engagement between those who produce new knowledge and those who serve them by acting as the stewards of knowledge, no amount of technology will help us reach our greatest potential. The relationship we have between the Sheridan Libraries and the VO at Johns Hopkins works well, but replicating it across other domains and libraries may be difficult: we have been fortunate to have the time to create the trust and understanding we have developed. Eventually, it will be necessary to identify, cultivate, and reward individuals as “data scientists” who can act as human interfaces between libraries and domain experts.

After all, the astronomers have shown us that not even the sky is the limit anymore.


Acknowledgments

The work described in this article is supported through an Institute of Museum and Library Services National Leadership Grant LG0606018206 and a grant from Microsoft Corporation. In addition, the VO, Scholarly Publishing and Academic Resources Coalition (SPARC), and TeraGrid have committed funding. While several individuals have participated in this work, I am deeply indebted to Alex Szalay, Bob Hanisch, and Tim DiLauro for the specific ideas noted in this article.


NOTES

    1. US National Virtual Observatory. http://us-vo.org (accessed Jan. 18, 2008).

    2. International Virtual Observatory Alliance. http://ivoa.net (accessed Jan. 18, 2008).

    3. Choudhury, Sayeed, and Timothy L. Stinson. “The Virtual Observatory and the Roman de la Rose: Unexpected Relationships and the Collaborative Imperative.” Academic Commons, submitted December 16, 2007. http://www.academiccommons.org/commons/essay/VO-and-roman-de-la-rose-collaborative-imperative (accessed Jan. 18, 2008).

    References

    Fink, J. Lynn, and Philip E. Bourne. “Reinventing Scholarly Communication for the Electronic Age.” CTWatch Quarterly 3, no. 3 (August 2007). http://www.ctwatch.org/quarterly/articles/2007/08/reinventing-scholarly-communication-for-the-electronic-age/ (accessed Jan. 18, 2008).

    National Science Foundation Office of Cyberinfrastructure. Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. January 2003. http://www.nsf.gov/od/oci/reports/toc.jsp (accessed Jan. 18, 2008).

    National Science Foundation Cyberinfrastructure Council. Cyberinfrastructure Vision for 21st Century Discovery. March 2007. http://www.nsf.gov/od/oci/CI_Vision_March07.pdf (accessed Jan. 18, 2008).

    Sloan Digital Sky Survey. http://www.sdss.org/ (accessed Jan. 18, 2008).

    Institute for Astronomy, University of Hawaii. Pan-STARRS (Panoramic Survey Telescope & Rapid Response System). http://pan-starrs.ifa.hawaii.edu/public/ (accessed Jan. 18, 2008).

    Lynch, Clifford. “The Shape of the Scientific Article in the Developing Cyberinfrastructure.” CTWatch Quarterly 3, no. 3 (August 2007). http://www.ctwatch.org/quarterly/articles/2007/08/the-shape-of-the-scientific-article-in-the-developing-cyberinfrastructure/ (accessed Jan. 18, 2008).

    “Open Data.” SPARC/ACRL Forum at the American Library Association Annual Conference, New Orleans, June 24, 2006. http://www.arl.org/sparc/meetings/ala06/ (accessed Jan. 18, 2008).

    Choudhury et al. “Digital Data Preservation for Scholarly Publications in Astronomy.” International Journal of Digital Curation 2, no. 2 (2007): 20–30. http://www.ijdc.net/ijdc/issue/view/5 (accessed Jan. 18, 2008).