Issues in Science and Technology Librarianship
Fall 2002
DOI:10.5062/F47P8WCW

URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Lots of Copies Keep Stuff Safe As A Cooperative Archiving Solution for E-Journals

Victoria A. Reich
Director, LOCKSS Program
Stanford University
vreich@stanford.edu

Abstract

The LOCKSS model, based on analysis of the history of cultural continuity epitomized by "Lots of Copies Keeps Stuff Safe," creates low-cost, persistent digital "caches" of e-journal content housed locally at institutions that have authorized access to that content and actively choose to preserve it. Accuracy and completeness of LOCKSS caches are assured through a peer-to-peer polling system (operated through LCAP, LOCKSS' communication protocol), which is both robust and secure. The creation of such caches, given the requirement that the caching library already have the right through subscription to obtain that content, has met with a high degree of publisher and library engagement and commitment. Through its technical development and beta-1 testing (1999-2002) (Permanent publishing n.d.), the LOCKSS project has demonstrated that its model and protocol are technically viable. We are working to build production software and to establish the LOCKSS model as an ongoing, operating archival solution. Support for LOCKSS has been generously provided by the National Science Foundation, the Andrew W. Mellon Foundation, Sun Microsystems, and others*.

Introduction

The LOCKSS Program mission is to build tools and provide support to research libraries so they can easily and affordably create, preserve, and archive local electronic collections.
We believe it is in society's best interest for libraries to own rather than lease electronic information and thus to retain their traditional custodial role for scholarly information. The LOCKSS Program will build tools and provide support to publishers so they can, without risk to their business model or to their publishing platforms, distribute electronic journals to libraries and relinquish responsibility for providing perpetual access.

In the paper library system, libraries acquire materials to serve their local communities. They keep the paper materials on the shelf and provide access to local readers. Libraries cooperate through a variety of mechanisms to ensure access. Any particular reader can easily find a copy; however, it is extremely difficult for anyone to systematically find and destroy all copies. Libraries ensure that content persists simply by supporting their local communities. One can think of the print library system as a cooperative, affordable, decentralized archive system with LOTS OF COPIES.

There are many parallels in the LOCKSS "Library System." Libraries will acquire and own digital materials to serve their local communities. They will keep the titles in LOCKSS caches and provide access to local readers. The libraries cooperate to detect and repair the content when damaged. Any particular reader will be able to easily find a copy; however, it will be extremely difficult for anyone to systematically find and destroy all copies of a title. Libraries ensure that content persists simply by supporting their local communities. LOCKSS will provide a cooperative, affordable, decentralized archive system with LOTS OF COPIES.

Social Aspects

The LOCKSS model appeals to both librarians and publishers, the two communities it is intended to serve, for two main reasons:

- It conforms to the needs, desires and characteristic behaviors of each community, as described below.
- The technology operates (by a peer-to-peer protocol) and is developed (as open source) through social cooperation.

Appeal to Library Community

Academic librarians face a dilemma regarding electronic journals. Their communities clearly want e-journals, but they are also chartered to preserve access to the scholarly record for future generations. The emerging primacy of e-journals in the absence of assured permanent access changes the dynamics of acquisition: content that libraries have owned in paper is now rented in the electronic version. A unilateral change of policy by the publisher can remove their electronic access to past material with no recourse. Failure to renew a journal subscription (due to a limited budget) may have similar consequences. To date there has been no mechanism to implement the traditional purchase-and-own library model for electronic materials. Librarians have lost the option to build and maintain local collections. Librarians are therefore reluctant to discontinue paper subscriptions, even though preserving paper no longer preserves the complete record of scholarship.

The LOCKSS system is designed to allow librarians to take custody of and preserve access to the e-journals to which they subscribe. In effect, this restores the purchase model with which librarians are familiar. Using their own computers and network connections, they obtain, preserve and provide access to a copy of the run of an e-journal for which they purchased access. In the same way, using their own buildings, shelves, and staff, they obtain, preserve, and provide access to paper journals. To be widely accepted, it is essential that this technology be affordable even to libraries with a limited budget, and thus the LOCKSS project has emphasized affordable hardware and "appliance-like" software.

The LOCKSS model restores the notion of building local collections of electronic journals.
Material stored in a local LOCKSS cache remains available to that local community's reader even when the publisher "goes away" (merger, bankruptcy, subscription cancellation, network traffic, etc.). The content is never "dark;" it is always available to the local community. Installing and populating LOCKSS caches are actions librarians can take for themselves to serve their local communities. The benefits of restoring the practice of library material ownership and the choice of long-term access outweigh the costs of maintaining the system and related equipment. We believe the costs of collection management will also prove to be affordable. The LOCKSS model is inherently based on interlibrary cooperation (largely automatic and "invisible"). There are clear incentives for participating libraries to cooperate. One library by itself is not expected to have a critical number of caches for any given title. Although an institution could run multiple caches of the same material, more likely it will make arrangements with colleagues and peers to bring up caches until a critical mass is reached. The low cost and democratic structure of the LOCKSS system -- each copy is as valuable as any other -- empowers smaller institutions to take part in the process of digital preservation. In normal operation, a LOCKSS cache will only act as a proxy for, and thus supply content to, the host institution's own readers. Local pride, as well as a healthy distrust of remote systems, will drive a library to recruit friends to ensure there is a critical mass of copies. Each library can do a little bit to preserve materials they care about and collaborate with others they trust who care about the same content. I believe a contributing factor in the LOCKSS system's viability is the fact that it meshes and merges with current collection development practices. Librarians choose materials to serve their local communities. They band together in buying consortia to obtain economies of scale. 
They divide up collection responsibilities for materials not in high demand. In short, librarians cooperate to accomplish tasks that benefit their local communities. The LOCKSS technology obeys this same dynamic: it requires local action and limited cooperation to provide benefit to the library's local community. When librarians become confident that they can themselves take action to preserve material, we believe they will be more likely to switch from paper to electronic subscriptions. It may seem excessive for every participating library to build and retain a complete copy of the material to which they subscribe. But the goal of replicating the purchase model makes this essential. Fortunately, e-journals are not very large compared to current, affordable mass-market disks. Librarians who administer subscriptions decide whether or not they want to ensure online access to their individual communities if the publisher site for a given e-journal fails for any reason. In each case in which librarians choose to ensure access to an e-journal, they use the LOCKSS system to install a local cache while they have authorized access to the material. They cannot retroactively cache materials once access is terminated. This is a technical requirement that was driven by the publishers. Libraries will generally make the archive decision at point of purchase, and materials are cached as they are published. Librarians can always choose to "weed" their LOCKSS caches if local collection priorities change. To get the benefits of access, a library must participate. There is no "freeloading." Those who contribute to the preservation of the journal are rewarded with continued access; those who do not contribute to the journal's preservation are not. This "point of purchase" requirement makes it very important that it remain cheap and easy to run a LOCKSS cache. 
Once the LOCKSS model has been implemented and embraced by a core group of libraries, issues may arise regarding building a sufficient community of caches for specific journal titles. The guiding principle is that libraries choose what to cache, just as they choose subscriptions and retention policies. Prominent, widely subscribed journals will naturally attract a relatively large number of caches, thus fulfilling the "Lots of Copies" principle. Less ubiquitous journal titles will attract a proportionately smaller number of caches. Those libraries that do cache a relatively obscure journal will have an interest in promoting caches at peer subscribing institutions in order to ensure that there is a sufficient number of caches for that journal. Failing that, they may simply decide -- either individually or in concert -- to start more than one cache of their own. As the absolute minimum number of caches to achieve some protection is small -- in single digits -- and the minimum number sufficient to attain robust protection not much larger, such an expedient is practical.

Appeal to Publisher Community

Although publishers are a heterogeneous group with different missions and scales of operation, they share certain dynamics. They understand archiving is an important social need and wish to be responsive. They are fearful of loss of control of their content. They are concerned about their continued economic vitality (whether on a profit or cost-recovery basis).

Publishers want librarians to purchase their e-journals. A growing number of them would like to discontinue hard copy printing because the costs of a dual production system are high. While librarians increasingly understand that paper is no longer the version of record, they are retaining subscriptions to the paper versions. Library acquisitions funding is insufficient to purchase both formats, and in the absence of a sustainable digital archiving solution, librarians choose the more tangible and persistent format.
As a condition of moving to electronic versions, librarians are requiring (especially through consortia purchases) that publishers guarantee long-term access to content sold. These guarantees are problematic at best. Only the largest publishers have sufficient resources to implement (or negotiate with third parties to provide) archives for content they publish. The smaller publishers have no easy way to meet this requirement.

Publishers are motivated to endorse and support the LOCKSS system because of this impasse, and because the cost to them of doing so is low. (A list of {supporting publishers} is available.) In particular, scholarly society publishers want material available for future society members and other subscribers. The LOCKSS system suggests that they will be able to respond to library market pressure for an "archiving solution" at negligible cost.

Bmj.com ({http://bmj.bmjjournals.com/}), published by the British Medical Association, contains the full text of articles published in the weekly British Medical Journal since January 1994, as well as an increasing amount of material that is unique to the site. The site serves 75,000 distinct users each week. "We want to help doctors everywhere practice better medicine and to be at the forefront of international debates on health," said Tony Delamothe, editor of bmj.com. "Maintaining long-term security and integrity of our materials is crucial to those objectives. The LOCKSS project is unique in being able to provide this for us, and at a price we can afford" (Sun Microsystems 2001).

Even larger and commercial publishers seem to be well disposed towards the LOCKSS model. Most publishers fear their journal content will be illegally replicated or leaked on a massive scale once copies are in the custody of others. This was initially a major point of concern regarding the LOCKSS software. Publishers want their access-control methods enforced.
When their sites are available, they want to retain access to reader usage data and have access to the record of the reader's interactions with their sites. Because a LOCKSS cache provides content to other caches only to repair damage in content they held previously, no new leakage paths are introduced. Because the reader is supplied preferentially from the publisher, with the cache only as a fallback, the publisher sees the same interactions they would have seen without LOCKSS caches.

Publishers could run LOCKSS caches for their own journals and, by doing so, over time could audit the other caches of their journals. A non-subscriber cache would eventually reveal itself by taking part in the damage detection and repair protocol. The mere possibility of detection should deter non-subscribers from running LOCKSS caches. Just as the publisher cannot be sure he has found all the caches, the caches cannot be sure none of the other caches belongs to the publisher.

The LOCKSS design has other advantages from the publisher's perspective:

- It returns the responsibility for long-term preservation, and the corresponding costs, to the librarians. Although publishers have an interest in long-term preservation, many cannot afford to do a credible job of it themselves. In fact, publisher failures or publisher changes in policy are among those events librarians are most interested in addressing.
- Publishers want to maintain journal brand and image. They spend considerable effort to design how their content is presented. Many publishers want to retain control over presentation. Because LOCKSS caches preserve the presentation form of the content, they do not preempt or mask the journal's brand or image.

Two of the critical LOCKSS design criteria - the need to remain affordable and to allow the archiving decision to be made at the point of purchase - also apply to the means by which publishers give libraries permission to use LOCKSS caches to preserve their content.
Our suggested legal language** is "blanket permission" from publishers to libraries. It is not institution specific. It is not negotiated. The publishers that we expect to embrace a decentralized archiving solution would be disinclined to dedicate the staff necessary to negotiate individual license agreements with librarians, either for subscription access or for archiving.

Technology Overview

The LOCKSS technology implements a peer-to-peer network of persistent web caches. The caches proactively crawl the web to collect relevant new content as it is published. Unlike normal caches they are never flushed. The caches cooperate to detect and repair any damage automatically, without human intervention. The cached content is perpetually audited; the archive is never "dark."

Content can be in any format delivered via HTTP, provided that the content is relatively immutable. HTML pages from e-journals typically contain dynamic content, such as ads. References to these materials are preserved, but the targets of these references are not. This means that the HTML collected by different caches can be different, so the HTML is filtered before being compared between caches. The readers get what the publisher published, but the comparisons are based only on the words the authors wrote.

A community in which each of a set of like-minded organizations runs a LOCKSS cache targeting the same content can be mutually assured that they will each continue to be able to access this content using the same URLs at which it was originally published even after it has ceased to be accessible from the original publisher. The more organizations that run caches, the higher the level of assurance each obtains.

The design of the LOCKSS technology is based on a few key ideas:

- Preserving the presentation form of web content rather than the source databases used to generate it shows promise of being one pragmatic, practical approach to long-term preservation and archiving. It avoids the need to integrate the preservation system with the publishing system. It leverages the publisher's existing access-control mechanisms. It preserves a form of the publisher's content that they already place at risk, smoothing the path to the necessary publisher agreements. (Collecting, preserving, and archiving the presentation files do not preserve access to services offered by the e-journal web site such as intra-journal and intra-publisher searches for text and bibliographic data, or to server-side active content provided by authors such as databases or Java servlets.)
- Preserved content remains accessible at the original publisher's URL. Links and bookmarks, searches through indexing and abstracting databases, etc. resolve either to the publisher's site or to the locally cached content. The techniques used to access content at the publisher also work to find the preserved content.
- Connecting many less reliable repositories together in a manner that allows for cooperative damage detection and repair has many advantages. Consumer-grade hardware is dramatically cheaper than the industrial-strength equipment needed for an archival quality repository. Although the total amount of equipment needed for adequate preservation of given content using LOCKSS caches is greater, the cost is lower. LOCKSS caches are designed as "appliances" with dramatically lower system administration requirements, both in terms of time and skill level, than industrial-strength equipment. Again, despite the larger amount of equipment, the total cost of administering it should be lower. And the costs are spread among the interested parties, with each choosing to spend an amount they deem appropriate. The global cost of the system never appears on anyone's budget.

The LOCKSS cache machines are off-the-shelf PCs. The LOCKSS software performs four functions:

- It collects newly published content from the target e-journals using an off-the-shelf Open Source web crawler.
- It continually compares the content it has collected with other caches holding the same content, and repairs any damage.
- It acts as a web proxy, delivering the publisher's or preserved versions as appropriate to its clients.
- It provides a web-based administrative interface, by means of which the local administrator can target new journals for preservation and monitor the state of journals being preserved.

History and Status

The LOCKSS technology has been undergoing increasingly severe testing since 1999. The alpha test ran through 2000. Test participants included Stanford University, Harvard University, Columbia University, University of California Berkeley, University of Tennessee, and Los Alamos National Labs. The test content was Science Online.

The {first beta} was successfully deployed to 50 libraries worldwide between 2000 and 2002. Test content was from the Proceedings of the National Academy of Sciences and the British Medical Journal. At these sites it has run with little operator intervention for nearly a year. The average site has spent about an hour a month dealing with their cache. Almost all this time has been on per-cache problems rather than per-journal problems. Most of this time is spent because a cache was unplugged from the network, or from power, needed an IP address change, or the packet filters between it and the Internet were changed.

With funding from the Andrew W. Mellon Foundation, the LOCKSS project is now building production software. Testing of the next release of software is scheduled to begin late 2002. The key redesign of the production software is the introduction of a publisher plug-in module. The publisher plug-in module will tailor the processes of collecting, preserving and providing access to a particular e-journal, allowing the LOCKSS software to be more flexible and efficient.
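
As a rough illustration of what a publisher plug-in might encapsulate, the Python sketch below pairs a URL pattern (to decide which pages belong to a journal) with an HTML filter (to strip variable content before caches compare their copies). Everything in it is hypothetical: the `JournalPlugin` class, the convention that ads sit in `<div class="ad">` blocks, and the example URLs are assumptions made for illustration, not the LOCKSS plug-in interface, which this article does not specify.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalPlugin:
    """Hypothetical per-platform plug-in; not the real LOCKSS API."""
    name: str
    url_pattern: str      # regex for URLs that belong to the journal
    variable_block: str   # regex for markup that differs between fetches

    def wants(self, url: str) -> bool:
        """Return True if the crawler should collect this URL."""
        return re.fullmatch(self.url_pattern, url) is not None

    def filter_html(self, html: str) -> str:
        """Drop variable blocks and collapse whitespace before comparison."""
        stripped = re.sub(self.variable_block, "", html, flags=re.DOTALL)
        return " ".join(stripped.split())

    def digest(self, html: str) -> str:
        """Hash of the filtered page, for comparing copies between caches."""
        filtered = self.filter_html(html)
        return hashlib.sha256(filtered.encode("utf-8")).hexdigest()


# Assumed example platform: ads are wrapped in <div class="ad">...</div>.
example_plugin = JournalPlugin(
    name="example-platform",
    url_pattern=r"https?://journal\.example\.org/content/\d+/\d+/.*",
    variable_block=r'<div class="ad">.*?</div>',
)

# Two caches fetched the same article but were served different ads.
page_at_cache_a = '<p>Results were significant.</p><div class="ad">Buy X</div>'
page_at_cache_b = '<p>Results were significant.</p><div class="ad">Buy Y</div>'

same = example_plugin.digest(page_at_cache_a) == example_plugin.digest(page_at_cache_b)
```

In this sketch the two pages hash to the same value once the ad blocks are removed, which is the sense in which comparisons rest only on the words the authors wrote. A regular expression is a deliberately crude filter; a real plug-in would presumably need knowledge of each platform's markup.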
Each online publishing platform will require a separate plug-in module (for example, one for HighWire Press titles, one for Blackwell Synergy titles, etc.). The plug-in software will use whatever journal-specific information is available to make the searches for new content and for damage to preserved content more efficient by:

- Exploiting knowledge of the e-journal's URL structure to target the search for newly published content
- Exploiting knowledge of the e-journal's URL structure to drive the checking process and target the search for damage
- Using knowledge of the e-journal's HTML formatting to assist the comparison process by filtering out variable content such as advertisements
- Mapping between bibliographic, URL, and file names for content

The platform or system on which e-journals are mounted is critical. The addition of an e-journal to a library collection is dependent on there being an appropriate plug-in for the technology supporting that e-journal. While it is more efficient to develop plug-ins for widely used platforms than for idiosyncratic, one-off or few-title platforms, there is a parallel need to acquire competence and to embrace smaller and/or less sophisticated titles. We anticipate that some platforms will be inherently more difficult than others. We also anticipate that over time plug-in development will become faster and easier with experience. It is critical that competence for building this software resides in the communities of both librarians and publishers.

Cooperation and Community

The LOCKSS software is Open Source; the current version of the protocol is available on {SourceForge}. As with any open-source software system, no one can prevent the emergence of variants. But the LOCKSS software implements a network protocol. Variants that are not interoperable will not be functional; they will not have "lots of copies" to talk to, and will thus be unlikely to succeed. Variants that are interoperable will be welcome.
Diversity of this kind offers Darwinian evolution to better fit a changing environment, and avoids monoculture vulnerabilities. Each version of the software will have different strengths and flaws. There will be economic and social pressure to keep each new version of the software interoperable with the system as a whole. The bigger the base of users, the greater that pressure will grow. As the size of the network grows, there will be a base from which to build open standards, as actual software rather than paper documents.

Libraries and publishers can collaborate on applications and further developments neither could achieve alone. Publishers may choose to put the "plug-ins" for their content on the journal web site. Librarians may choose to deposit "plug-ins" in a repository. Our intention is to design the generic daemon and the plug-in API so that someone with reasonable programming skills (comparable to a computer science graduate student) can tailor a plug-in for any particular journal.

The LOCKSS system will clearly not be the unique and ultimate solution to all e-archiving, or even all e-journal archiving, requirements. It is important that this NOT be the case. We are emphatic in our distaste for monolithic structures! We will have been successful if we provide over a period of years the assurance to libraries that their investment in paid access to e-journals is adequately safeguarded in those cases that warrant a small commitment of resources in computer storage and staff effort.

Community Participation

Please join us! The LOCKSS Program is soliciting library and publisher participation. Contact the author for additional information.

Publisher Actions

For LOCKSS to work in production, publishers will be required to provide:

- written permission to the librarians to cache and archive their content, and
- "machine readable permission" to the LOCKSS caches themselves to collect and preserve their content.

It is highly recommended that publishers grant librarians permission by adding language to subscription agreements. LOCKSS caches as currently designed collect content as it is published, so this permission must be granted at the point of subscription. The intent is to avoid the need for negotiations between principals. This language grants blanket permission for that library to:

- Hold copies of subscribed materials
- Use material consistent with original terms
- Provide access to local community
- Provide copies for audit and repair to other caches only if they have had a copy in the past

Publishers must also give the LOCKSS crawler permission to slowly crawl, collect, and cache content, preferably through a web page that lists at least the top level URLs in whatever the "archival unit" is for that title. This web page is to be known as the "publisher manifest." The LOCKSS system will work more efficiently as the "publisher manifest" becomes more detailed (article level file description for each issue/volume). We urge publishers to include in this manifest the front matter not usually included in the electronic journal, for example: editorial board, author instructions, etc.

While it is not a requirement, it is also recommended that publishers install a LOCKSS cache for their own content. During the testing phase, it is important for publishers to gain the same understanding and experiences as our international library partners. The LOCKSS system will be stronger with both communities contributing critical comments as the system evolves. In production, publishers who run LOCKSS caches will be able to audit their content on the LOCKSS cache system. Over time, all participants can discover who has content by "listening" to inter-cache communications. The communication protocol for damage detection and repair requires caches to "publicly" announce holdings; LOCKSS caches can't be secretive.
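
The audit-and-repair rule described above (a cache supplies a copy to another cache only if that cache has held the content before, and caches announce their holdings to one another) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the LCAP protocol: the `Cache` class, the rule that the most common digest wins, and the peer names are all hypothetical, and a real system would also need to authenticate peers and resist malicious voters.

```python
from __future__ import annotations

import hashlib
from collections import Counter
from typing import Optional


def digest(content: bytes) -> str:
    """Fingerprint of a cache's copy of the content."""
    return hashlib.sha256(content).hexdigest()


class Cache:
    """Hypothetical cache holding one archival unit (not the LOCKSS daemon)."""

    def __init__(self, name: str, content: bytes):
        self.name = name
        self.content = content
        # Peers that agreed with us in an earlier poll, i.e. peers that
        # have shown they held a good copy in the past.
        self.known_holders: set[str] = set()

    def vote(self) -> str:
        """Announce a digest of what this cache currently holds."""
        return digest(self.content)

    def supply_repair(self, requester: Cache) -> Optional[bytes]:
        """Hand out a copy only to a peer known to have held the content."""
        if requester.name in self.known_holders:
            return self.content
        return None


def run_poll(caches: list[Cache]) -> str:
    """Tally votes, record agreement, and repair caches that disagree.

    Assumption: the digest held by the most caches is treated as correct.
    """
    votes = {c.name: c.vote() for c in caches}
    winner, _ = Counter(votes.values()).most_common(1)[0]
    agreeing = [c for c in caches if votes[c.name] == winner]
    damaged = [c for c in caches if votes[c.name] != winner]
    for c in agreeing:
        c.known_holders.update(p.name for p in agreeing if p is not c)
    for c in damaged:
        for peer in agreeing:
            copy = peer.supply_repair(c)
            if copy is not None:
                c.content = copy
                break
    return winner


article = b"The words the authors wrote."
a, b, c = Cache("a", article), Cache("b", article), Cache("c", article)
run_poll([a, b, c])            # all agree; each records the others as holders
c.content = b"bit rot"         # simulate damage at cache c
outsider = Cache("x", b"")     # a cache that never held the content
run_poll([a, b, c, outsider])  # c is repaired; the outsider gets nothing
```

In this sketch, cache `c` recovers its copy because `a` and `b` saw it agree in the earlier poll, while the outsider, which never demonstrated a holding, is refused. That is one way to read the claim that repair introduces no new leakage paths.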
Librarian Actions

Librarians at a minimum must make and implement collection development decisions and then make the locally cached content available to their local community of readers. There are myriad unanswered questions surrounding collection development, collection management, and collection access. With funding from the Andrew W. Mellon Foundation, the Stanford University LOCKSS team is working closely with staff from Emory University, Indiana University, and the New York Public Library to implement LOCKSS in these three environments and demonstrate what issues arise and how they may be addressed at a local level.

A major cost of implementing LOCKSS at an institution will be the processes of building rather than leasing electronic journals. The LOCKSS Program collection development process for any particular title is roughly:

1. Choose an electronic title for your collection, preservation, and archiving
2. Confirm and/or obtain publishers' permission to cache the titles
3. Get the specific platform "plug-in" (from the publisher, from a repository, or modified by the library)
4. Confirm or recruit critical mass of caches

Steps two through four will be facilitated by the LOCKSS Alliance as described below.

To provide local access to the materials in their LOCKSS caches, librarians will need to integrate their LOCKSS caches within their institutional network, with emphasis on integration of the LOCKSS cache with local proxy service. This will be necessary if materials stored in LOCKSS caches are to be delivered to authorized and authenticated users whenever these materials are not available from the publisher's online service, in a manner transparent to the user.

As libraries choose ever-larger numbers of e-journals for preservation, the community will need collection management tools and the ability for the LOCKSS administrative interface to interoperate with collection management programs.
For this phase of LOCKSS development, Indiana University is constructing a process so the community can drive functional specifications, design a general set of query/response interactions so the functional specifications can be implemented within most library technical environments, and, where possible, prototype one implementation of collection management software. This is work we expect the LOCKSS Alliance to facilitate.

The LOCKSS Alliance

The LOCKSS software is free and available for download from {SourceForge}. No fees are required from any party to use the LOCKSS software to archive content. In theory, the LOCKSS system is decentralized and does not require coordination. In practice, for a sustainable distributed e-journal archival system, some coordinating program infrastructure is indicated both for software development and support and for coordination of collection management initiatives. This coordinated program infrastructure will be provided by the LOCKSS Alliance.

The Alliance is planned as a for-fee service organization for both librarians and publishers. Fees will provide tangible coordination services and community driven development for:

Technology
- Support installation and use
- Develop software, fix bugs, version control
- Train local technical expertise
- Grow open source community

Collections
- Ensure sufficient caches/title
- Share plug-in applications
- Disclose publisher agreements
- Expand to other genres
- Metadata standards and implementation

*Support for LOCKSS:

- The National Science Foundation: this material is based upon work supported by the National Science Foundation under Grant No. 9907296; however, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- The Andrew W. Mellon Foundation
- Private donations
- Stanford University Libraries

** License

"Publisher acknowledges that Licensee participates in the LOCKSS system for archiving digitized publications. Licensee may perpetually use the LOCKSS system to archive and restore the Licensed Materials, so long as Licensee's use is otherwise consistent with this Agreement. Licensee may also provide its digital copies of Licensed Materials to other LOCKSS systems in support of the overall preservation and restoration purposes of LOCKSS, so long as any other LOCKSS system demonstrates it has the rights to the Licensed Materials necessary to access and copy them."

References

Permanent publishing: local control of content delivered via the web. n.d. [Online.] Available: {http://lockss.stanford.edu/projectstatus.htm}. [November 6, 2002].

Sun Microsystems, Inc. 2001. Press release, Palo Alto, CA, October 10, 2001.