Herbert Van de Sompel
The Open Archives initiative (OAi) promotes and encourages the development of author self-archiving solutions (also commonly called e-print systems) through the development of technical mechanisms and organizational structures to support interoperability of e-print archives. Such interoperability can stimulate the transition of e-print systems into genuine building blocks of a transformed scholarly communication model. This paper describes the Santa Fe Convention of the OAi. This is a set of relatively simple but potentially quite powerful interoperability agreements that facilitate the creation of mediator services. These services combine and process information from individual archives and offer increased functionality to support discovery, presentation and analysis of data originating from compliant archives.
In July 1999, Paul Ginsparg, Rick Luce and Herbert Van de Sompel sent out a Call for Participation (Ginsparg, Luce, and Van de Sompel 1999a) to a meeting exploring cooperation among scholarly e-print archives. The meeting, held in October 1999 in Santa Fe, and originally called the Universal Preprint Service meeting, led to the establishment of the Open Archives initiative (OAi) (Ginsparg, Luce, and Van de Sompel 1999b). The goal of the OAi is to contribute in a concrete manner to the transformation of scholarly communication. The proposed vehicle for this transformation is the definition of technical and supporting organizational aspects of an open scholarly publication framework on which both free and commercial layers can be established.
This paper describes the origins of the OAi and work heretofore in defining this framework: the Santa Fe Convention. This convention is a combination of organizational principles and technical specifications to facilitate a minimal but potentially highly functional level of interoperability among scholarly e-print archives. The convention gives data providers -- individual archives -- relatively easy-to-implement mechanisms for making information in their archives externally available. This external availability then makes it possible for service providers to build higher levels of functionality, mediator services, using the information made available from scholarly archives that adopt the convention.
The growth of e-print archives
The origins of the Open Archives initiative lie in the growing number of electronic preprint (e-print) archives. While several of these began as informal vehicles for the dissemination of preliminary results and non-peer-reviewed "gray literature", a number of them have evolved into an essential medium for sharing research results among colleagues in a field.
These archives demonstrate a shift in the traditional scholarly communication model, which has relied on formally published scholarly journals. There is a growing consensus that the scholarly journal system is facing significant challenges:
The e-print archives exemplify a more equitable and efficient model for disseminating research results. An important challenge is to increase the impact of the e-print archives by layering on top of them services -- such as peer review -- deemed essential to scholarly communication. This is the focus of the Open Archives initiative.
An exhaustive review of existing e-print archives is beyond the scope of this paper. A list of initiatives is available at the Office of Scientific and Technical Information. A brief review of some notable efforts illustrates the scope of these initiatives:
There are indications that a growing number of disciplines and organizations are inspired by this pioneering work and are investigating alternative models for scholarly communication:
From individual archives to an interoperable fabric
The archive initiatives described above aim to create a more effective scholarly communication mechanism that addresses problems in the established system. The approaches taken by individual archives differ in a number of ways. Some initiatives build on a centralized model, others on a distributed departmental or, by extension, institutional model. Some deal with gray (non-peer-reviewed) literature only; others incorporate metadata of peer-reviewed papers or try to establish some form of peer review outside of the established system. Some deal with metadata only, others with both metadata and full content. Yet all share the attribute of offering scholars a vehicle to conveniently and immediately disseminate research results to peers.
The reason for launching the Open Archives initiative is the belief that interoperability among archives is key to increasing their impact and establishing them as viable alternatives to the existing scholarly communication model. This conviction is expressed in the official mission statement of the initiative:
The Open Archives initiative has been set up to create a forum to discuss and solve matters of interoperability between author self-archiving solutions (also commonly referred to as e-print systems), as a way to promote their global acceptance.
Interoperability is a broad term, touching many diverse aspects of archive initiatives, including their metadata formats, their underlying architecture, their openness to the creation of third-party digital library services, their integration with the established mechanism of scholarly communication, their usability in a cross-disciplinary context, their ability to contribute to a collective metrics system for usage and citation, etc.
Interoperability among archives offers substantial benefits to the scholars who use them. An important attribute of the traditional research library as an information provider is its role as a common entry point for a variety of information resources, not necessarily divided along disciplinary or institutional boundaries. The move from physical to digital sources should not be accompanied by the breakup of this entry point into a collection of fragmented archives. An increasing number of scholars move fluidly in their research across domain boundaries; the technology for delivering digital information should facilitate rather than hinder such fluidity. Mechanisms for interoperability offer the potential for discovery tools and virtual collections (Lagoze 1998) that extend across the contents of multiple archives. Authors also benefit from such archive-spanning tools, since their works will be accessible to a wider audience.
Interoperability is also beneficial to the archive and service provider. Rather than having to provide an entire suite of services for its users, individual archives can instead establish a well-defined interface on which external providers can build enhanced services. A variety of such services can be envisioned, including those that facilitate discovery, linking, and reviewing. An intriguing and essential set of services would be those that provide metrics to assist in the evaluation of the impact of certain scholarship and aid in tenure review and promotion decisions.
The Santa Fe Convention of the OAi represents a pragmatic, incremental, and collaborative approach towards interoperability. The initiators of the Open Archives initiative hope that this practical approach will be a catalyst for significant changes in the mechanisms for scholarly communication. The need for such change has been the subject of numerous papers, workshops, and Internet discussion groups. Yet the existing system has proven somewhat resistant to change, no doubt due to the complex socio-political and economic forces that support it. For example, the current system of academic promotion and tenure is closely linked to the traditional journal system (Wilson 1942). This acts as an important factor sustaining the existing communication model (Schauder 1994). Understandably, scholars are hesitant to support alternative models that are not yet linked to their evaluation and promotion. While such issues will continue to support the current system, the development of practical technical and organizational solutions, such as the Santa Fe Convention, builds a framework for changes that will inevitably occur and may encourage the implementation of those changes.
Agreeing on interoperability: the Santa Fe meeting of the Open Archives initiative
A successful first meeting of the initiative was held on October 21-22, 1999, in Santa Fe, New Mexico. The meeting was sponsored by the Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), the Scholarly Publishing & Academic Resources Coalition (SPARC), the Association of Research Libraries (ARL) and the Los Alamos National Laboratory (LANL). The participants were computer scientists and digital librarians. There were also representatives of existing and emerging e-print systems, of scholarly publishers and of the sponsors. All but one of the invited institutions sent a representative. This was considered to be a firm indication of the perceived importance of the initiative.
The central theme of the first meeting was the establishment of recommendations and mechanisms to facilitate cross-archive value-added services. Such services could combine information derived from cooperating archives, process that information to produce some value-added information, and make that enhanced information available to users, agents, or other services. Examples of such services include cross-archive search engines, current awareness services, linking systems, and peer-review services.
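The idea of a cross-archive service can be illustrated with a minimal sketch of a search engine that combines titles gathered from several cooperating archives into one index. This is an illustration only, not part of the Santa Fe recommendations: the archive names, the record tuples and the functions build_index and search are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical shape of a harvested item: (record identifier, title).
HarvestedRecord = Tuple[str, str]


def build_index(archives: Dict[str, Iterable[HarvestedRecord]]) -> Dict[str, Set[str]]:
    """Combine records from several archives into one keyword index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for archive_name, records in archives.items():
        for record_id, title in records:
            for word in title.lower().split():
                # Qualify the identifier with its archive so origins stay distinct.
                index[word].add(f"{archive_name}:{record_id}")
    return index


def search(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Return the qualified identifiers of all records whose title contains term."""
    return sorted(index.get(term.lower(), set()))


# Two made-up archives from different disciplines.
archives = {
    "physics": [("0001", "Quantum gravity notes")],
    "economics": [("17", "Gravity models of trade")],
}
index = build_index(archives)
hits = search(index, "gravity")
```

Here a single query returns records from both archives, which is the kind of value-added behavior a service built over cooperating archives could offer.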
Achieving progress on this goal required agreement among the participants on the issue of interoperability. Although interoperability has been a watchword for a variety of efforts in digital libraries and networked information (Paepcke, Chang, et al. 1998), the actual meaning of it and the implementation thereof has often proven elusive. Like many meetings intended to reach agreement on standards, attendees at the Santa Fe meeting arrived with a variety of pre-conceived notions on what was required to reach interoperability. It is instructive to review how these differing notions converged into a well-defined agreement that provides the foundation for cross-archive exchange of information.
The meeting began with a rather expansive example of interoperability, illustrated through the UPS Prototype project coordinated by Herbert Van de Sompel, Thomas Krichel, and Michael Nelson. This project and its results are described at length in the companion paper (Van de Sompel, Krichel, Nelson, et al. 2000). Briefly summarized, the prototype demonstrated the integrated operation of a variety of services operating over data originating from a set of archives. Each of those services provided a reasonably rich level of functionality (implemented through a set of protocols).
There was general agreement among the participants at the meeting that the Prototype was an extremely useful demonstration of potential. There was also agreement, however, that trying to reach consensus on the full functionality of the Prototype was "aiming too high" and that a more modest first step was in order. The Prototype team, based on their insights gained during implementation of the UPS prototype, also reached a similar conclusion. This is described more fully in "Recommendations made to the Open Archive group" of (Van de Sompel, Krichel, Nelson, et al. 2000).
The remainder of the meeting was devoted to determining the proper degree of modesty, balancing the need for adequate functionality against the requirement that the cost of entry for participating archives be sufficiently low. This is a question that has bedeviled other efforts at interoperability; for example, buy-in to the highly functional Z39.50 protocol has largely been limited to libraries, due to the costs of complexity (Stubley 1999). An important step towards establishing the cost/functionality balance was reached by the beginning of the second day with agreement among the participants on a tiered model of interoperability. This model is illustrated in Figure 1, showing the following layers:
Framing the problem of interoperability with this model quickly led to the decision to restrict the Santa Fe recommendations to interoperability at the level of metadata harvesting. The mechanisms for establishing this interoperability, described in full detail in the Santa Fe Convention and summarized in the remainder of this paper, are three-fold:
This agreement treats documents as black-boxes; archives can have idiosyncratic document representations with the Santa Fe Convention only specifying a URL entry point to the archives' individual document models. The question and functionality of common mediator services are left open to implementers who wish to exploit the Santa Fe Convention and build mechanisms based on it.
The Santa Fe Convention
The Santa Fe Convention presents a technical and organizational framework designed to facilitate the discovery of content stored in distributed e-print archives. It makes easy-to-implement technical recommendations for archives that -- when implemented -- will allow data from e-print archives to become widely available via its inclusion in a variety of end-user services such as search engines, recommendation services and systems for interlinking documents. In addition, the convention introduces an organizational framework for making information available about archives that adhere to the technical recommendations of the convention -- the data providers -- and about trusted parties that build end-user services for data originating from such archives -- the service providers. As such it provides a communication mechanism between providers of data and providers of services and creates a community of open archives.
Definitions and Concepts
The Santa Fe Convention builds on a number of definitions and concepts that are essential for its understanding.
Open and managed e-print archives
The Convention considers the following to be crucial components of an e-print archive:
The last item is crucial for enabling third parties to create services that support the discovery, presentation and analysis of data in the archive. Most e-print archives will also provide native end-user services. However, facilitating the broad dissemination of archive data through third party services is a crucial feature of an e-print archive. Therefore, the open interface is a key part of the Santa Fe Convention.
Data providers and service providers
Consistent with the objective of the Santa Fe Convention and the identification of the crucial functions of an e-print archive, there is a distinction between two participants in the convention:
Data in an e-print archive
The convention uses the notion of a record in an archive. Some archives may store metadata that describes full content without storing the full content itself. In this case, the metadata is a record. Other archives may also store full content. However, the convention assumes that if full content is stored, there will always be associated metadata stored in the archive as well as a mechanism to tie metadata and content together. In this case the combination of metadata and full content is a record.
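The two cases described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not a structure defined by the convention: the class name Record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    """A record: metadata alone, or metadata tied to full content."""
    identifier: str                     # hypothetical: a unique name for the record
    metadata: Dict[str, str]            # descriptive metadata, always present
    content_url: Optional[str] = None   # hypothetical: the tie to full content, if stored

    def has_full_content(self) -> bool:
        """True when the archive stores full content for this record."""
        return self.content_url is not None


# Case 1: the archive stores metadata only, so the metadata is the record.
metadata_only = Record("archive-a/0001", {"title": "A preprint"})

# Case 2: the archive stores full content, tied to its metadata.
with_content = Record(
    "archive-b/0042",
    {"title": "Another preprint"},
    "http://example.org/archive-b/0042",
)
```

Making the metadata mandatory and the content link optional mirrors the assumption that full content never appears in an archive without associated metadata.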
Technical Components of the Santa Fe Convention
The complete details of the technical components of the Santa Fe Convention and instructions for participating are available via the core document. Organizations considering participation should refer to that document. This section summarizes the information for the purpose of an overview.
Open Archives Metadata Set
The Open Archives Metadata Set (OAMS) is a collection of nine metadata elements intended to facilitate coarse granularity resource discovery among the records in distributed and dissimilar archives. The semantics of this set have purposely been kept simple in the interest of easy creation and widest applicability. There is no provision for qualification or extension of the nine elements. The expectation is that individual archives will maintain metadata with more expressive semantics and the Open Archives Dienst Subset provides the mechanism for retrieval of this richer metadata.
Open Archives Dienst Subset
The Open Archives Dienst Subset is a set of protocol requests that are delivered via HTTP. This protocol is a subset of the full Dienst protocol. The protocol requests in the subset provide the following functionality:
All responses to these requests are formatted in XML.
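A service provider's side of such an exchange, a request sent over HTTP and an XML response parsed, can be sketched as follows. This is a minimal sketch under loud assumptions: the URL layout, the request name list-records, the since parameter and the XML element and attribute names are all hypothetical and are not taken from the Open Archives Dienst Subset, whose actual request syntax is given in the core document. No network call is made; the response is a hard-coded sample.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode


def build_request_url(base_url: str, request_name: str, **params: str) -> str:
    """Compose an HTTP GET URL for a protocol request (layout is assumed)."""
    query = urlencode(sorted(params.items()))
    url = f"{base_url.rstrip('/')}/{request_name}"
    return f"{url}?{query}" if query else url


def parse_record_ids(xml_text: str) -> List[str]:
    """Extract record identifiers from an XML response (structure is assumed)."""
    root = ET.fromstring(xml_text)
    return [el.attrib["id"] for el in root.iter("record") if "id" in el.attrib]


# Hypothetical response body, standing in for what an archive might return.
SAMPLE_RESPONSE = """<records>
  <record id="archive-a/0001"/>
  <record id="archive-a/0002"/>
</records>"""

url = build_request_url(
    "http://archive.example.org/protocol", "list-records", since="2000-01-01"
)
record_ids = parse_record_ids(SAMPLE_RESPONSE)
```

Because every response is XML, a harvester needs only a generic HTTP client and an XML parser, which keeps the cost of building a service low.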
Organizational aspects of the Santa Fe Convention
The convention also introduces an organizational framework to facilitate its implementation and to establish a communication mechanism between data providers and service providers. An understanding of this framework can be obtained from an exploration of the core document of the Santa Fe Convention that gives a step by step approach for making an e-print archive or a service comply with the Santa Fe Convention.
For the data providers, some of these steps are directly related to the implementation of the technical recommendations of the convention, as summarized in the previous section. In addition, the core document introduces the following important organizational elements:
Conclusions and future plans
The technical results of the Santa Fe meeting may be perceived as quite modest, and indeed they are. However, the technical moderation should be viewed in a broader context. First, it played an important role in bringing the Santa Fe meeting to a successful conclusion, with agreement among diverse parties. This agreement amongst a core group is an important step towards the development of a broader e-print community with a strong focus on cooperation and interoperability. The organizational framework provided by the Santa Fe Convention is intended to actively contribute to the creation and extension of such a community. Second, the limited nature of the technological requirements lowers the cost of entry for new participants, and hopefully builds momentum for the development of scholarly publishing alternatives. This momentum will provide a basis for future agreements that may extend and enhance the current Santa Fe Convention.
If successful, the Convention will attract early adoption by existing archives and encourage the establishment of new scholarly archives that will support the mechanisms defined by the Convention. The former seems to be occurring: participants at the meeting representing arXiv, the California Digital Library, clinmed, CogPrints, RePEc and NCSTRL have stated their intention to comply with the Santa Fe Convention in the near future. The CogPrints team at Southampton is also working on free software for e-print archives that will comply with the Santa Fe Convention (Harnad 1999). Based on the number of inquiries received since the Santa Fe meeting, there are reasons to be optimistic regarding the establishment and adoption by other existing and planned archives. Positive feedback has been received from representatives of German mathematical and physical e-print archives. In addition, several commercial and non-commercial parties have expressed interest in creating mediator services once archives have implemented the convention.
The current challenge for the Open Archives initiative is to maintain a focus on the successful dissemination and implementation of the Santa Fe Convention. Before considering whether it is necessary or appropriate to expand the nature of the interoperability agreements, it is essential that the mechanisms described in the current convention be widely implemented and tested in practice. Without such proof of concept, the initiative may find itself increasing the complexity (and cost of implementation) of the interoperability mechanisms without discovering whether the level of interoperability defined by the existing Santa Fe Convention is in fact sufficient and practical. Any future work to expand the scope of the OAi should recognize that the success of any interoperability standard must be measured relative to both its functionality and its cost of adoption (Arms 2000).
The near-term plans for the Open Archives initiative include public dissemination of the Santa Fe Convention, scheduled for February 15, 2000, and meetings to review progress and chart future activities. This paper represents the initial public dissemination, and the Open Archives web site will serve as a persistent and official record of the convention. The next meeting will take place at ACM Digital Libraries 2000 in San Antonio, Texas, in June 2000. The exact dates and place of this meeting will be posted on the Open Archives web site nearer to the June date. A European meeting is tentatively planned in conjunction with ECDL 2000 in Lisbon, Portugal, in September 2000.
Anonymous. PubMed Central: An NIH-Operated Site for Electronic Distribution of Life Sciences Research Reports. August 1999. [http://www.nih.gov/welcome/director/pubmedcentral/pubmedcentral.htm].
Arms, William Y. 2000. Digital Libraries. MIT Press.
Bowman, C.M., et al. 1995. The Harvest Information Discovery and Access System. Computer Networks and ISDN Systems. 28 no. 1 & 2: pp. 119-125.
Buck, Anne M., Richard C. Flagan, and Betsy Coles. Scholars' Forum: A New Model For Scholarly Communication. March 1999. [http://library.caltech.edu/publications/scholarsforum/].
Delamothe, Tony and others. 1999. Netprints: the next phase in the evolution of biomedical publishing. British Medical Journal 319: 1515-6.
Ginsparg, Paul, Rick Luce, and Herbert Van de Sompel. Call for participation in the UPS initiative aimed at the further promotion of author self-archived solutions. July 1999a. [http://www.openarchives.org/ups-invitation-ori.htm].
Ginsparg, Paul, Rick Luce, and Herbert Van de Sompel. The Open Archives initiative. July 1999b. [http://www.openarchives.org/].
Harnad, Stevan. 1999. Free at Last: The Future of Peer-Reviewed Journals. D-Lib Magazine 5, no. 12. [http://www.dlib.org/dlib/december99/12harnad.html]
Judson, Horace Freeland. 1994. Structural Transformations of the sciences and the end of peer review. Journal of the American Medical Association 272, no. 2: 92-4.
Lagoze, Carl. 1998. Defining Collections in Distributed Digital Libraries. D-Lib Magazine 4, no. 11. [http://www.dlib.org/dlib/november98/lagoze/11lagoze.html]
Leiner, B.M. 1998. The NCSTRL Approach to Open Architecture for the Confederated Digital Library. D-Lib Magazine 4, no. 12. [http://www.dlib.org/dlib/december98/leiner/12leiner.html]
Lucier, Richard and John Ober. Scholar-led Innovation in Scholarly Communication University ePub: An initiative in Electronic Scholarship. October 1999. [http://www.cdlib.org/eschol/summary.html].
Paepcke, Andreas, Chen-Chuan Chang, Hector Garcia-Molina, and Terry Winograd. 1998. Interoperability for Digital Libraries Worldwide. Communications of the ACM 41, no 4.
Schauder, Don. 1994. Electronic publishing of professional articles: attitudes of academics and implications for the scholarly communication industry. Journal of the American Society for Information Science 45, no. 2: 73-100.
Stubley, Peter. 1999. Clumps as Catalogues. Ariadne no. 22. [http://www.ariadne.ac.uk/issue22/distributed/distukcat2.html].
Van de Sompel, Herbert, Thomas Krichel, Michael L. Nelson and others. 2000. The UPS Prototype: An Experimental End-User Service across E-Print Archives. D-Lib Magazine 6, no. 2. [http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html].
Varmus, Harold. E-BIOMED: A Proposal for Electronic Publications in the Biomedical Sciences. May 1999. [http://www.nih.gov/welcome/director/pubmedcentral/ebiomedarch.htm].
Wilson, L. 1942. The academic man: a study in the sociology of a profession. New York: Oxford University Press.
The authors wish to thank:
Herbert Van de Sompel wishes to thank the Belgian Science Foundation for a special Ph.D. grant.
Work on Dienst and the Open Archives Dienst Subset is supported by the National Science Foundation Grant No. IIS-9817416 and Defense Advanced Research Projects Agency Grant No. N66001-98-1-8908, with the Corporation for National Research Initiatives.
Copyright © 2000 Herbert Van de Sompel and Carl Lagoze