Issues in Science and Technology Librarianship
Spring 2002
DOI:10.5062/F46D5QZ7

URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Database Reviews and Reports

PROLA: Database Review

Ian D. Gordon
Head, Reference Information Services
James A. Gibson Library
Brock University
St. Catharines, Ontario, Canada
igordon@brocku.ca

The American Physical Society's (APS) 2001 article Keeping the Promise: Phys Rev Completes Online Archive proudly stated that "PROLA is now complete: every paper in every journal that APS has published is mounted online in a friendly, powerful, fully searchable system. The project took just under ten years from earliest conception to reality."

Since its inception, PROLA (Physical Review Online Archive) has remained at the forefront of technical, archival, design, display format, and search engine developments. PROLA is a state-of-the-art database that archives individual articles published by the suite of Physical Review journals. This project continues to mirror the Society's mission to "advance and diffuse the knowledge of physics." APS is one of the first society publishers to have created a complete archive of its publications back to its beginning in 1893. This gives researchers using the Internet desktop access to the PROLA database search engine and to individual journal articles.

A short history of {PROLA} and a {What's New in PROLA} service help newcomers orient themselves to this archival database. {PROLA Subscription Information} provides information on APS titles, current subscription costs and linking relationships between current titles and the PROLA archival dataset. PROLA is an essential research tool for many theoretical and applied physicists.
Librarians and physicists comment that PROLA is: relatively cheap for individuals, institutions and consortiums, intuitive and easy to navigate, allowing stemming, left and right-handed truncation, phrase, Boolean, and sounds-like searching capabilities, flexible as a link manager to retrieve alternate papers from APS and other publishers, progressive in the number and variety of document and image display formats, quick with the use of the Cornell mirrored site, and fascinating as it moves towards a truly archival XML format for online delivery.

To learn more about PROLA developments I asked Mark Doyle, APS's Product Development Manager and a frequent contributor to SLAPAM-L, a series of questions.

Ian: Tell us a bit about yourself and how you came to design search engines, archives, information resources and databases.

Mark: My training is as a physicist -- I received a Ph.D. in high energy physics string theory from Princeton University in 1992. Starting in 1994, I worked at Los Alamos with Paul Ginsparg for two years on administering and developing arXiv.org. In July 1996 I came to work for the American Physical Society editorial office doing all-purpose Research and Development for APS electronic offerings. PROLA (which started in 1993 and was run as a CRADA project involving APS, Los Alamos (a group unrelated to the e-print archives), and the Naval Research Lab (their Torpedo project)) was made my responsibility in the late summer of 1997. The development was brought in house and the prototype server developed at Los Alamos was completely overhauled and re-organized and launched as a beta product in July 1998. I sit on the editorial board of ALPSP's Learned Publishing and I am an active participant in CrossRef's Technical Working Group. I have been an invited speaker at various SSP Top Management Roundtables and at last year's SLA conference as well as at other conferences.
My main interest is in exploiting electronic publishing and related technologies to enhance and transform scholarly communication, particularly for physicists. Lately I have been focusing on how an established publisher like APS might transition away from the subscription model. This is a difficult challenge and such a goal can only be accomplished with the participation of authors, their institutions, and the libraries. Libraries will have a key role in developing partnerships with scholarly societies (and perhaps other publishers) for the stewardship of the archive of scholarly communications.

Ian: Give me an idea as to who subscribes to PROLA?

Mark: PROLA has a little over 1,500 institutional subscriptions (many of which may be part of an APS bundled subscription such as APS-ALL and PR-ALL) and just over 500 individual subscriptions (these are only available to members of the APS).

Ian: How can scientists and information professionals test PROLA?

Mark: Anyone can pretty much get to any page. Non-subscribers can't download PDFs or view the Page Images and they can't execute a search. They can view all Tables of Contents and they can view all abstracts. But they won't see the Reference section and the links (instead there is a link that says: [Show References] Requires Subscription just above the [Show Articles Citing This One] link). Both of these require a subscription to view.

Ian: How is the PROLA pricing calculated?

Mark: There is no deep rationale behind the pricing of PROLA -- mostly the original institutional subscription price was set to be quite affordable for any and all institutions. The price for PROLA isn't so much calculated, but set to encourage wide usage and at the same time help us recover the money we invested in building it. Pricing for members is set so as to be a decent value, but not so low that it will encourage institutions not to subscribe.
The member price was set to $100/year to encourage institutional rather than member subscriptions. A PROLA subscription is included with either the PR-ALL or APS-ALL combination packages. The current (2002) pricing for PROLA is: $355 Tier 1, $390 Tier 2, $425 Tier 3 for an institution that subscribes to at least one APS journal and $750 for an institution that doesn't subscribe to any APS journal. The tiers in the US are tied to the Carnegie classification scheme for higher educational institutions. Outside the US, a mix of different criteria is used to place institutions in the tiers.

Development on PROLA has cost a total of about $2 million so far. We are working to mirror it around the world for convenient and easy access. Any money from PROLA subscriptions is used to offset these costs and to lessen the pressure on our journal subscription prices. It will be at least five years before we recover our costs. Our entire journal operation is budgeted to break even and over the last five years we have more or less achieved this.

Ian: What is the nature of the Library of Congress repository agreement?

Mark: Basically the agreement calls for APS to deposit a full copy of our electronic archive with the Library of Congress as well as maintain a front end server (PROLA interface) for LOC patrons to use. It also calls for cooperative investigation of the technical requirements for ensuring an accurate copy as well as whether this fulfills our obligation to provide a copy of our material to the LOC under current copyright law. The agreement's length is indefinite. Either party can terminate the agreement with 60 days written notice. Anything already transferred remains at the LOC in case of termination. Currently we are working out how to actually transfer the 500 GB of data we have. The prototype server is set up. Recently we have done some final cleanups and will be exporting the data to LOC by the end of May, 2002.
Ian: Has APS considered discontinuing publishing journal print equivalents at some time in the future?

Mark: Well, PROLA is really for material older than three years plus the current year. Subscribers to our current contents journals (hosted on AIP's OJPS platform now) have the option of not receiving print -- this comes at a 15% discount. This is the first year that option has been available. About 10% of our subscribers (2002) took the discount and discontinued print. Future pricing will almost certainly introduce greater differentials between online and print subscriptions as the number of subscribers receiving print continues to decrease.

Ian: Has APS thought to include a current awareness service?

Mark: Yes, we have thought about this a great deal. The main sticking point is whether this should be available to everyone, just subscribers, or just APS members. AIP is working on building the functionality into OJPS (and may even have released it for their journals as well as those of other societies hosted on OJPS). We have also taken (initiated with AIP, but now also with other journals) another approach to current awareness by introducing virtual journals which highlight the best content in particular subfields to raise awareness of articles and to reduce the need for (usually expensive) niche journals.

Ian: Has APS worked with other database producers and vendors to manage PROLA links?

Mark: Yes, anyone is free to link to us via CrossRef (we are a member) or via our link manager. No one needs an agreement to link to us. I believe ISI, SilverPlatter, and various INSPEC customers link to us directly. As for linking out from PROLA, we link to INSPEC and {SPIN} through an agreement with AIP. We intend to link to ISI, but we are patiently waiting for them to develop a more pragmatic linking strategy in which we could query them with citation information and get back an ISI identifier that can then be used in forming a link.
Right now their system is rather complicated and would require too much work on our part to implement and maintain. PROLA also links to other publishers via CrossRef and to the SPIRES high energy physics database. We also link to arXiv.org, and we are working with ADS to develop linking to their content.

Ian: What are some of the new PROLA developments?

Mark: Developments in progress or near completion include: an all-APS search engine; extending citing article links to all APS and more external sources; setting up more mirrors (Europe, possibly Far East, and others); using the DjVu format to greatly reduce the size of files delivered in PROLA; use of the Open Archives Initiative to make metadata available to others (full access won't be free, but will be on a sliding fee scale depending on the richness of the metadata -- there will probably be a minimal free component though); and depositing all APS PROLA content in CrossRef.

Ian: How are you keeping up to date with an increasing number of display formats within an archival environment?

Mark: Our future strategy is going to be based on further developing our SGML/XML archive so that we can create multiple formats for printing and viewing (single column, two column, e-book, etc.). TeX is a possible backend for this (and REVTeX 4 was developed with this use in mind). We are working with our vendors to create a truly archival XML format for online delivery. We are working with other publishers on the STIX project which seeks to add many new math glyphs to Unicode (mostly achieved) and then to create a freely available set of high-quality fonts. Our hope is to provide the necessary prerequisites for displaying our content in a wide range of settings (e-books, web browsers, print on demand, etc.).

Ian: Physicists heavily depend upon searching under authors' names. In the absence of an authority file how can searchers improve their chances of finding the appropriate papers?
Mark: The author names are printed as they are supplied to us. Some authors publish differently depending on the paper (for example, M. Doyle and Mark Doyle). The PROLA search engine allows "M. Doyle" in the author field to match "Mark Doyle" though (but not vice versa). Other variations such as Mueller and Müller are harder to deal with when searching (soundex works pretty well here). Searching PROLA using authors' names in association with additional affiliation and abstract keywords is also recommended.

Ian: Chemical names are difficult to search. Any hints on how to search for strange and complex names/formulas with sub and super scripts, etc.?

Mark: This is admittedly difficult and cryptic. Right now, you have to use TeX notation to do this. Superscripts get translated to ^{...} and subscripts get translated to _{...}. For instance, to search for water, one does H_{2}O or to search for beryllium 9, one does ^{9}Be. More complicated formulas become more difficult, but wild cards can help here (try searching for La_{*).

Ian: How has the search engine evolved to improve indexing and retrieval?

Mark: Searching is based on XML files for the front and backmatter (titles, authors, affiliations, abstracts, references, etc.). The tagging is quite rich and includes tagging all math. Material before 1985 was rekeyed; material from 1985-1996 was derived partly from AIP's SPIN database, partly rekeyed, and partly extracted from the legacy TeX and troff files that we used to typeset the journals; and material from 1997 to the present is extracted from the SGML that our vendors deliver to us. This XML collection is further processed to regularize hyphenation, make author and cited author names robustly indexed, and to make the math somewhat searchable (sub- and superscripts, Greek letters, math operators). A "best ASCII" attempt is made for math.
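As an illustration of the TeX-style notation described above, the short Python sketch below builds query strings of the form H_{2}O and ^{9}Be from plain input. The function names `subscript_query` and `isotope_query` are hypothetical helpers invented for this example; they are not part of PROLA, and the sketch assumes only that the search field accepts the ^{...} and _{...} forms quoted in the interview.

```python
import re


def subscript_query(formula: str) -> str:
    """Wrap each run of digits that follows a letter or ')' in _{...}.

    Hypothetical helper: turns a plain formula such as "H2O" into the
    TeX-style search string "H_{2}O".
    """
    # The lookbehind requires a letter or closing parenthesis immediately
    # before the digits, so a leading number is left untouched.
    return re.sub(r"(?<=[A-Za-z)])(\d+)", r"_{\1}", formula)


def isotope_query(mass_number: int, symbol: str) -> str:
    """Build a leading-superscript query such as "^{9}Be" (hypothetical helper)."""
    return f"^{{{mass_number}}}{symbol}"


if __name__ == "__main__":
    print(subscript_query("H2O"))      # H_{2}O
    print(subscript_query("La2CuO4"))  # La_{2}CuO_{4}
    print(isotope_query(9, "Be"))      # ^{9}Be
```

For longer formulas, where the exact markup in the index is hard to predict, the wildcard approach mentioned in the interview is likely to be more practical than building the full string.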
Authors with complicated surnames can be searched easily because many different variations on names are indexed (for instance, 'van', 'de', 'das', etc. can be omitted by the searcher). Super- and subscripts can be matched using TeX-style notation: ^{9}Be will match beryllium 9, H_{2}O will match water, etc. The backend search engine is completely commoditized and in fact we are about to complete our third substitution of a backend. The rest of the interface is all home grown. Full-text searching will be based upon OCR and will thus not be as high quality as the above stuff. Full text for 1997-present will be based on SGML though. We expect to be able to deploy this by June.

Ian: How does PROLA keep pace with never-ending demands for better quality archival formats for printing and viewing as prerequisites for displaying content in an ever-widening range of settings?

Mark: One needs to draw a distinction between PROLA the product (the web application that delivers a range of content to end users) and our internal electronic archive which includes information in formats not directly delivered to end users. I'll focus on the latter. Although our archive is somewhat heterogeneous, it can be divided into several large chunks that are more or less uniform. For 1996 and earlier, the deliverable content is the TIFF images of the scanned paper journals along with the XML bibliographic files. Grayscale and color images were separately captured as JPEG images. The TIFF format is quite generic (using Group 4 fax compression) and so will be easy to migrate to any future bitmap standard that supplants TIFF. (I am a firm believer that there will always be a robust conversion from one de facto standard to the next because you can't become a new de facto standard without such a migration path. The main example of this is the migration from PostScript to PDF.
Of course not all features of the new standard will be available to legacy conversions (e.g., hyperlinking in PDF), but there isn't a loss of information.) The rest of our archive consists of high resolution PostScript files that were used to actually print the journal, the PDF files we created from them for online use, and full-text SGML files created during production. The SGML comes from two different vendors and while it is very useful for creating XML bibliographic records, I don't consider it archival at this point. That said, we are just finishing off the development of a new XML DTD (and the tagging rules for applying it) that is much more suited for long-term archiving and reuse in an online environment (even as a direct deliverable to an end user).

Archiving for a single publisher the size of the APS isn't that difficult because we do have uniformity over large periods of time. The real conundrum is when you have an entity like a library trying to integrate the archival material from a diverse set of sources. Partnerships between archivers and producers can help ameliorate this problem.

There are still some difficult areas for us, however. We are starting to publish videos that are integral to the understanding of papers. Video formats are far more complex than other formats and may not be easily migratable to future formats without a lot of (expensive) labor. We are still working on the best strategy for dealing with this. The starting point is to have at least one version that is in an open standard that is completely non-proprietary (MPEG for now). Fortunately, we don't yet have a large influx of such material and it is still manageable.

Ian: Talk a bit about how PROLA has worked with archival standards?

Mark: For the scanning project, I talked with various third parties such as JSTOR and archivist librarians. I wouldn't say we are following standards, but just making common sense tradeoffs between long term considerations and initial cost.
We chose open, widely implemented and supported standards (G4 TIFF and JPEG) for image formats. The scanning was done at a sufficient DPI to give very good results on printers without needlessly increasing the storage and bandwidth costs (we settled on 600 dpi for the TIFF images). For the XML, we used our own internally developed DTD that is relatively simple for a third party to understand and use while at the same time being able to map the math content into MathML if that is needed.

For the new XML development we are doing, we chose to adhere to standards where we could. So all math markup will be in MathML and all special characters will be from Unicode 3.2 (the APS is one of the publishers participating in the STIX project which resulted in about 1,000 new math and other special glyphs from STM publishing being incorporated into Unicode 3.2). Any new special characters will be put into the private use area of Unicode and documented within the XML instance itself (a verbal description of the glyph, what if any standard glyphs were transformed or composed to create the glyph, what font if any it can be found in, and a rough bitmap of what the glyph looks like). It is hoped that we will be able to work with others to get some of these ideas out in the mainstream and thereby make XML archiving more uniform across publishers. Eventually we will also start looking at emerging standards for archives such as OAIS and seeing if we can benefit from implementing these standards.

Ian: How has PROLA worked with CrossRef and other database producers/vendors to provide cross linking to other resources and to enable large scale querying?

Mark: Our strategy for PROLA is to keep it more or less feature for feature compatible with the deployment of our current content journals in AIP's OJPS platform. Thus we try to implement the same links wherever we can.
We migrate one year of content each year into PROLA (really we just reveal it in PROLA -- all of our content exists within the archive and PROLA can function as a full failover for our content in OJPS if need be). Thus we want to maintain the same functionality for this content. SPIN and INSPEC links are the result of an agreement with AIP as part of this strategy. SPIRES linking has been completely implemented now. We would like to link to ISI, but we have been dissatisfied with the process that is needed to create such links. OJPS had such links enabled for a while, but because of production issues, these links have been disabled for now. We hope ISI will soon have a more workable solution in place.

The newest cooperative effort for us is with ADS. We are working on providing extensive linking to the astronomy and astrophysics literature available there and expect to have that implemented by the early summer.

APS (through my position on the Technical Working Group) has been quite active in CrossRef. On some issues we are a minority view (e.g., should DOIs be calculable from traditional citation metadata (journal, volume, page)), but overall I think CrossRef has been beneficial for creating links between publishers. We are in the process of depositing all of our PROLA records into CrossRef now (there were some final QC issues that I wanted to resolve first).

Ian: The creation of a truly archival XML format for online delivery must be a work in progress. What have been the surprises, problems and triumphs along the way?

Mark: As I said earlier, we are revamping our archival marked up material. The main issues we have with our current SGML are its relationship to the final published article and the extent to which it is reusable outside of the system in which it was produced.
One of our vendors derives the SGML from their typesetting system after the fact and this transformation could potentially be a source of errors; thus I can't consider the SGML from this source a guaranteed copy of what we published. Furthermore, no path exists for taking this SGML and reprocessing it back into the typesetting system to get, say, a PDF file out of it. Of course, we can still make good use of this SGML, particularly for creating the XML front and backmatter files and for searching.

Our other vendor does directly use the SGML to typeset the journals using ArborText which has a TeX backend. The main problem here is that often it is necessary to feed directions to the TeX backend via SGML processing instructions. Sometimes these even contain actual article content (one example is for producing extended radical signs) and so one has to actually reverse engineer the typesetting process in order to recover the content. Another problem is that both vendors introduce SGML entities as needed for new characters and these are more or less completely undocumented. One can only look at the printed page (or PDF) to determine what they are. All of this makes the SGML very suspect from an archival point of view.

Ian: What is next for PROLA now that the archive is complete?

Mark: Of course the archive continues to evolve with time. I don't expect that we will need to migrate content any time soon, but we are always thinking about what we will need to do. Improving access and reliability for our end users is the current priority. We plan to develop innovative uses for searching and mining the archive. Now that we have this wonderfully rich archive in electronic form, one can envision doing all sorts of projects to highlight various relationships between groups of papers. For instance, it would be interesting to locate papers "similar" to a given paper or to see the relationships between groups of authors as a field evolves.
As more and more content becomes available in our XML format, we plan to exploit this heavily. Searching by formula, dynamically reformatting the article, linking to parts of papers, etc. will all become possible.

Ian: What is different about the APS electronic archive and something like arXiv.org which has an increasing proportion of the physics literature?

Mark: First, the APS archive is much more uniform than the TeX material on arXiv.org making it easier for us to migrate in the future. But the main difference from my point of view is the existence of the SGML and XML. Even if the current SGML we have isn't truly archival, it is still a rich source of information for searching and linking articles together. Thus PROLA can offer many more services (better linking for instance) than arXiv.org. This of course comes at a price -- it is still expensive to produce the marked up file.

My hope is that over the next five to ten years, authoring tools will develop that will enable authors to directly create XML of the same archival quality as the XML I described above. This will enable arXiv.org and the APS to benefit immensely. For both of us, there will be very little additional cost for creating the rich services offered in a modern online journal application. However, during this transition, I still view what the APS does in production as an essential component (beyond peer review of course) that must be continued and, thus, must be paid for. Paying for this and peer review outside of the subscription model remains a challenge.

Finally, librarians need to familiarize themselves with these issues and develop the skills to deal with marked up content. If libraries are to maintain a responsibility for archiving scholarly literature, then they must learn what publishers do and participate in its evolution. Collections of PDF files will not suffice for a long term archive.
APS is quite interested in forming partnerships with research institutions and libraries so that we can continue to serve the broad interests of scholarly communication.

Ian: Thanks for taking the time out of your busy schedule to talk about APS and PROLA.

Mark: You're welcome. Cheers.

Further Internet Links and Contacts

{Frequently Asked Questions about the APS link manager}
{What's new in PROLA}
{PROLA Archive Cornell Mirror}
PROLA Home Page
APS Pricing and Subscription Information
{Keeping the Promise: Phys Rev Completes Online Archive} (APS News Online August/September 2001)
{PROLA: More Than Just a Pretty Acronym} (APS News Online August/September 1999)
Physical Review Online Archives (PROLA) D-Lib Magazine June 1998

Doyle, Mark. Dilaton Contact Terms in the Bosonic and Heterotic Strings. 1992.
________. The Operator Formalism and Contact Terms in String Theory. Ph.D. Princeton University, 1992.
________. World-Sheet Supersymmetry Without Contact Terms. 1992.

Mark Doyle
Manager, Product Development
The American Physical Society
doyle@aps.org

Barbara Hicks
Associate Publisher
The American Physical Society
assocpub@aps.org

Ian Gordon
Head, Reference Information Services
Science Librarian
James A. Gibson Library
Brock University
igordon@brocku.ca