PROLA: Database Review

	Issues in Science and Technology   Librarianship
	Spring 2002

	DOI:10.5062/F46D5QZ7


     	 URLs in this  document have been updated.  Links enclosed in {curly  brackets} have been changed.  If a replacement link was located,  the new URL was added and the link is active; if a new site could not be  identified, the broken link was removed.


Database Reviews and Reports
    PROLA: Database Review
        Ian D. Gordon
  Head, Reference Information Services
  James A. Gibson Library
  Brock University
  St. Catharines, Ontario, Canada
igordon@brocku.ca        


    The American Physical Society's (APS) 2001 article Keeping the Promise: Phys Rev Completes Online Archive proudly stated that "PROLA is now complete: every paper in every journal that APS has published is mounted online in a friendly, powerful, fully searchable system. The project took just under ten years from earliest conception to reality." Since its inception, PROLA (Physical Review Online Archive) has been at the forefront of technical, archival, design, display format, and search engine developments. PROLA is a state-of-the-art database that archives individual articles published by the suite of Physical Review journals. The project continues to mirror the Society's mission to "advance and diffuse the knowledge of physics." APS is one of the first society publishers to have created a complete archive of its publications back to its beginning in 1893. This gives researchers desktop access over the Internet to the PROLA search engine and to individual journal articles.

    A short history of {PROLA} and a {What's New in PROLA} service help newcomers get oriented to this archival database. {PROLA Subscription Information} provides information on APS titles, current subscription costs, and linking relationships between current titles and the PROLA archival dataset. PROLA is an essential research tool for many theoretical and applied physicists. Librarians and physicists comment that PROLA is:

    
	relatively cheap for individuals, institutions and consortiums,  
	intuitive and easy to navigate, allowing stemming, left and right-handed truncation, phrase, Boolean, and sounds-like searching capabilities,  
	flexible as a link manager to retrieve alternate papers from APS and other publishers,  
	progressive in the number and variety of document and image display formats,  
	quick with the use of the Cornell mirrored site, and   
	fascinating as it moves towards a truly archival XML format for online delivery.   


    To learn more about PROLA developments, I asked Mark Doyle, APS's Product Development Manager and a frequent contributor to SLAPAM-L, a series of questions.

    Ian: Tell us a bit about yourself and how you came to design search engines, archives, information resources and databases.

    
Mark: My training is as a physicist -- I received a Ph.D. in high energy physics string theory from Princeton University in 1992. Starting in 1994, I worked at Los Alamos with Paul Ginsparg for two years on administering and developing arXiv.org. In July 1996 I came to work for the American Physical Society editorial office doing all-purpose research and development for APS electronic offerings. PROLA (which started in 1993 and was run as a CRADA project involving APS, Los Alamos (a group unrelated to the e-print archives), and the Naval Research Lab (their Torpedo project)) was made my responsibility in the late summer of 1997. The development was brought in-house, and the prototype server developed at Los Alamos was completely overhauled, reorganized, and launched as a beta product in July 1998.

    I sit on the editorial board of ALPSP's Learned Publishing and I am an active participant in CrossRef's Technical Working Group. I have been an invited speaker at various SSP Top Management Roundtables and at last year's SLA conference as well as at other conferences.

    My main interest is in exploiting electronic publishing and related technologies to enhance and transform scholarly communication, particularly for physicists.  Lately I have been focusing on how an established publisher like APS might transition away from the subscription model. This is a difficult challenge and such a goal can only be accomplished with the participation of authors, their institutions, and the libraries. Libraries will have a key role in developing partnerships with scholarly societies (and perhaps other publishers) for the stewardship of the archive of scholarly communications. 


    Ian: Give me an idea as to who subscribes to PROLA?     

 Mark: PROLA has a little over 1,500 institutional subscriptions (many of which may be part of an APS bundled subscription such as APS-ALL and PR-ALL) and just over 500 individual subscriptions (these are only available to members of the APS). 

    Ian: How can scientists and information professionals test PROLA?

    
 Mark: Anyone can pretty much get to any page. Non-subscribers can't download PDFs or view the Page Images and they can't execute a search. They can view all Tables of Contents and they can view all abstracts. But they won't see the Reference section and the links (instead there is a link that says: [Show References] Requires Subscription just above the [Show Articles Citing This One] link). Both of these require a subscription to view.

    Ian: How is the PROLA pricing calculated?

    
Mark: There is no deep rationale behind the pricing of PROLA -- mostly the original institutional subscription price was set to be quite affordable for any and all institutions. The price for PROLA isn't so much calculated as set to encourage wide usage and at the same time help us recover the money we invested in building it. Pricing for members is set so as to be a decent value, but not so low that it will encourage institutions not to subscribe. The member price was set at $100/year to encourage institutional rather than member subscriptions. A PROLA subscription is included with either the PR-ALL or APS-ALL combination packages. The current (2002) pricing for PROLA is: $355 Tier 1, $390 Tier 2, $425 Tier 3 for an institution that subscribes to at least one APS journal and $750 for an institution that doesn't subscribe to any APS journal. The tiers in the US are tied to the Carnegie classification scheme for higher educational institutions. Outside the US, a mix of different criteria is used to place institutions in the tiers.

    Development on PROLA has cost a total of about $2 million so far. We are working to mirror it around the world for convenient and easy access. Any money from PROLA subscriptions is used to offset these costs and to lessen the pressure on our journal subscription prices. It will be at least five years before we recover our costs. Our entire journal operation is budgeted to break even and over the last five years we have more or less achieved this.


    Ian: What is the nature of the Library of Congress repository agreement?     

 Mark: Basically the agreement calls for APS to deposit a full copy of our electronic archive with the Library of Congress as well as maintain a front end server (PROLA interface) for LOC patrons to use. It also calls for cooperative investigation of the technical requirements for ensuring an accurate copy as well as whether this fulfills our obligation to provide a copy of our material to the LOC under current copyright law. The agreement's length is indefinite. Either party can terminate the agreement with 60 days' written notice. Anything already transferred remains at the LOC in case of termination. Currently we are working out how to actually transfer the 500 GB of data we have. The prototype server is set up. Recently we have done some final cleanups and will be exporting the data to LOC by the end of May 2002.

    Ian: Has APS considered discontinuing publishing journal print equivalents at some time in the future?

    
 Mark: Well, PROLA is really for material older than three years plus the current year.  Subscribers to our current contents journals (hosted on AIP's OJPS platform now) have the option of not receiving print -- this comes at a 15% discount. This has been the first year that option is available. About 10% of our subscribers (2002) took the discount and discontinued print. Future pricing will almost certainly introduce greater differentials between online and print subscriptions as the number of subscribers receiving print continues to decrease. 


    Ian: Has APS thought to include a current awareness service?    


Mark: Yes, we have thought about this a great deal. The main sticking point is whether this should be available to everyone, just subscribers, or just APS members. AIP is working on building the functionality into OJPS (and may even have released it for their journals as well as those of other societies hosted on OJPS). We have also taken (initiated with AIP, but now also with other journals) another approach to current awareness by introducing virtual journals which highlight the best content in particular subfields to raise awareness of articles and to reduce the need for (usually expensive) niche journals. 


     Ian: Has APS worked with other database producers and vendors to  manage PROLA links?    


Mark: Yes, anyone is free to link to us via CrossRef (we are a member) or via our link manager. No one needs an agreement to link to us. I believe ISI, SilverPlatter, and various INSPEC customers link to us directly. As for linking out from PROLA, we link to INSPEC and {SPIN} through an agreement with AIP. We intend to link to ISI, but we are patiently waiting for them to develop a more pragmatic linking strategy in which we could query them with citation information and get back an ISI identifier that can then be used in forming a link. Right now their system is rather complicated and would require too much work on our part to implement and maintain. PROLA also links to other publishers via CrossRef and to the SPIRES high energy physics database. We also link to arXiv.org, and we are working with ADS at the moment to develop linking to their content.

    Ian:   What are some of the new PROLA developments?

    
Mark: Developments in progress or near completion include investigating:
	an all-APS search engine,
	extending citing article links to all APS and more external sources,
	setting up more mirrors (Europe, possibly Far East, and others),
	using the DjVu format to greatly reduce the size of files delivered in PROLA,
	using the Open Archives Initiative to make metadata available to others (full access won't be free, but will be on a sliding fee scale depending on the richness of the metadata -- there will probably be a minimal free component though), and
	depositing all APS PROLA content in CrossRef.
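
    The interview gives no technical details of the planned Open Archives Initiative service. As a rough sketch only, the following shows how a harvester might form a request under the OAI protocol for metadata harvesting, which defines a ListRecords verb and a required oai_dc (Dublin Core) metadata format. The base URL is a placeholder, and the helper name list_records_url is hypothetical; nothing here describes an actual APS endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is whatever address the repository publishes; the one used
    in the example below is a placeholder, not a real APS service.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting: records changed since this date.
        params["from"] = from_date
    if set_spec:
        # Optional selective harvesting: restrict to one named set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, for illustration only.
url = list_records_url("https://example.org/oai", from_date="2002-01-01")
```

    A repository answers such a request with an XML document of metadata records, which is what would let third parties index richer or poorer metadata depending on the access tier.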


    Ian:   How are you keeping up to date with an increasing number of   display formats within an archival environment?    

 Mark: Our future strategy is going to be based on further developing our SGML/XML archive so that we can create multiple formats for printing and viewing (single column, two column, e-book, etc.). TeX is a possible backend for this (and REVTeX 4 was developed with this use in mind). We are working with our vendors to create a truly archival XML format for online delivery. We are working with other publishers on the STIX project which seeks to add many new math glyphs to Unicode (mostly achieved) and then to create a freely available set of high-quality fonts. Our hope is to provide the necessary prerequisites for displaying our content in a wide range of settings (e-books, web browsers, print on demand, etc.).


    Ian: Physicists depend heavily upon searching by authors' names. In the absence of an authority file, how can searchers improve their chances of finding the appropriate papers?


Mark: The author names are printed as they are supplied to us. Some authors publish differently depending on the paper (for example, M. Doyle and Mark Doyle). The PROLA search engine allows "M. Doyle" in the author field to match "Mark Doyle" though (but not vice versa). Other variations such as Mueller and Müller are harder to deal with when searching (soundex works pretty well here). Searching PROLA using authors' names in association with additional affiliation and abstract keywords is also recommended.
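
    The two behaviours described in this answer can be sketched roughly as follows. This is an illustration only, not PROLA's code: query_matches is a hypothetical helper that lets an initial in the query match a full forename in the index but not the reverse, and soundex is a standard American Soundex routine in which non-coded letters such as "ü" are treated like vowels (an assumption).

```python
import re

# Soundex digit for each coded consonant; all other letters are uncoded.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def name_tokens(name):
    """Split a name into lowercase tokens, dropping periods and commas."""
    return [t for t in re.split(r"[\s.,]+", name.lower()) if t]


def query_matches(query, indexed):
    """True if the query name matches the indexed author name.

    A one-letter query token (an initial) matches any indexed token that
    starts with that letter; a longer query token must match exactly.
    """
    q, a = name_tokens(query), name_tokens(indexed)
    if len(q) != len(a):
        return False
    for qt, at in zip(q, a):
        if len(qt) == 1:
            if not at.startswith(qt):
                return False
        elif qt != at:
            return False
    return True


def soundex(name):
    """Four-character American Soundex code for a surname."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]
```

    With these definitions, query_matches("M. Doyle", "Mark Doyle") is true while query_matches("Mark Doyle", "M. Doyle") is false, and "Mueller" and "Müller" receive the same Soundex code, which is why a sounds-like search can bridge the two spellings.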

    Ian: Chemical names are difficult to search. Any hints on how to  search for strange and complex names/formulas with sub and super scripts,  etc.?

    
Mark: This is admittedly difficult and cryptic. Right now, you have to use TeX notation to do this. Superscripts get translated to ^{...} and subscripts get translated to _{...}. For instance, to search for water, one does H_{2}O or to search for beryllium 9, one does ^{9}Be. More complicated formulas become more difficult, but wild cards can help here (try searching for La_{*).
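
    As a small illustration of this notation, the sketch below builds query strings from element symbols and counts, and tests a wildcard query against some sample indexed terms. The helper names and the sample terms are hypothetical, and shell-style wildcard matching (Python's fnmatch) is assumed here only to show the idea; PROLA's actual wildcard semantics are not specified in the interview.

```python
import fnmatch


def isotope(mass_number, symbol):
    """TeX-style search term for an isotope, e.g. beryllium 9."""
    return "^{%d}%s" % (mass_number, symbol)


def formula(parts):
    """TeX-style search term for a formula.

    parts is a list of (symbol, count) pairs; a count of 1 is left implicit.
    """
    out = []
    for symbol, count in parts:
        out.append(symbol if count == 1 else "%s_{%d}" % (symbol, count))
    return "".join(out)


water = formula([("H", 2), ("O", 1)])   # "H_{2}O"
be9 = isotope(9, "Be")                  # "^{9}Be"

# Hypothetical indexed terms; the wildcard query is the one suggested above.
terms = ["La_{2}CuO_{4}", "LaAlO_{3}", "H_{2}O"]
hits = [t for t in terms if fnmatch.fnmatchcase(t, "La_{*")]
```

    Here the query La_{* picks up only the term in which La is immediately followed by a subscript, which is the kind of narrowing a wildcard gives when the full formula is awkward to type.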

    Ian:   How has the search engine evolved to improve indexing and   retrieval?

    
 Mark: Searching is based on XML files for the front and backmatter (titles, authors, affiliations, abstracts, references, etc.). The tagging is quite rich and includes tagging all math. Material before 1985 was rekeyed; material from 1985-1996 was derived partly from AIP's SPIN database, partly rekeyed, and partly extracted from our legacy TeX and troff files that we used to typeset the journals; and material from 1997-present is extracted from the SGML that our vendors deliver to us. This XML collection is further processed to regularize hyphenation, make author and cited author names robustly indexed, and to make the math somewhat searchable (sub- and superscripts, Greek letters, math operators). A "best ASCII" attempt is made for math. Authors with complicated surnames can be searched easily because many different variations on names are indexed (for instance, 'van', 'de', 'das', etc. can be omitted by the searcher). Super- and subscripts can be matched using a TeX-style notation: ^{9}Be will match beryllium 9, H_{2}O will match water, etc. The backend search engine is completely commoditized and in fact we are about to complete our third substitution of a backend. The rest of the interface is all home grown. Full-text searching will be based upon OCR and will thus not be as high quality as the above stuff. Full text for 1997-present will be based on SGML though. We expect to be able to deploy this by June.
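
    One way to picture the indexing of surname variations is the sketch below, which indexes each surname under every form obtained by dropping leading particles. It is an assumption about how such an index could work, not a description of the PROLA backend; the particle list contains only the three examples named above, and the paper identifiers are made up.

```python
from collections import defaultdict

# Only the particles named in the interview; a real list would be longer.
PARTICLES = {"van", "de", "das"}


def surname_variants(surname):
    """Return the surname plus each form with leading particles dropped."""
    tokens = surname.lower().split()
    variants = []
    for i in range(len(tokens)):
        variants.append(" ".join(tokens[i:]))
        if tokens[i] not in PARTICLES:
            break  # stop once the core surname has been reached
    return variants


def build_index(authors):
    """authors is an iterable of (surname, paper_id) pairs."""
    index = defaultdict(set)
    for surname, paper_id in authors:
        for variant in surname_variants(surname):
            index[variant].add(paper_id)
    return index


def lookup(index, query):
    """Look up a surname as typed by the searcher."""
    return sorted(index.get(" ".join(query.lower().split()), set()))


# Hypothetical records, for illustration only.
index = build_index([("de Gennes", "paper-1"),
                     ("van de Graaff", "paper-2"),
                     ("Doyle", "paper-3")])
```

    With this index, a searcher who types "Gennes" or "de Gennes" reaches the same paper, and "Graaff", "de Graaff" and "van de Graaff" all reach the other.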

    Ian: How does PROLA keep pace with never-ending demands for better  quality archival formats for printing and viewing as prerequisites for displaying  content in an ever-increasing wide range of settings? 

    
Mark: One needs to draw a distinction between PROLA the product (the web application that delivers a range of content to end users) and our internal electronic archive which includes information in formats not directly delivered to end users. I'll focus on the latter. Although our archive is somewhat heterogeneous, it can be divided into several large chunks that are more or less uniform. For 1996 and earlier, the deliverable content is the TIFF images of the scanned paper journals along with the XML bibliographic files. Grayscale and color images were separately captured as JPEG images. The TIFF format is quite generic (using Group 4 fax compression) and so will be easy to migrate to any future bitmap standard that supplants TIFF. (I am a firm believer that there will always be a robust conversion from one de facto standard to the next because you can't become a new de facto standard without such a migration path. The main example of this is the migration from PostScript to PDF. Of course not all features of the new standard will be available to legacy conversions (e.g., hyperlinking in PDF), but there isn't a loss of information.)

    The rest of our archive consists of high resolution PostScript files that were used to actually print the journal, the PDF files we created from them for online use, and full-text SGML files created during production. The SGML comes from two different vendors and while it is very useful for creating XML bibliographic records, I don't consider it archival at this point. That said, we are just finishing off the development of a new XML DTD (and the tagging rules for applying it) that is much more suited for long-term archiving and reuse in an online environment (even as a direct deliverable to an end user).

    Archiving for a single publisher the size of the APS isn't that difficult because  we do have uniformity over large periods of time. The real conundrum is when you  have an entity like a library trying to integrate the archival material from a  diverse set of sources. Partnerships between archivers and producers can help  ameliorate this problem.

    There are still some difficult areas for us however. We are starting to publish videos that are integral to the understanding of papers. Video formats are far more complex than other formats and may not be easily migratable to future formats without a lot of (expensive) labor. We are still working on the best strategy for dealing with this. The starting point is to have at least one version that is in an open standard that is completely non-proprietary (MPEG for now). Fortunately, we don't yet have a large influx of such material and it is still manageable.


    Ian: Talk a bit about how PROLA has worked with archival  standards?     


Mark: For the scanning project, I talked with various third parties such as JSTOR and archivist librarians. I wouldn't say we are following standards, but just making common sense tradeoffs between long term considerations and initial cost. We chose open, widely implemented and supported standards (G4 TIFF and JPEG) for image formats. The scanning was done at a sufficient DPI to give very good results on printers without needlessly increasing the storage and bandwidth costs (we settled on 600 dpi for the TIFF images). For the XML, we used our own internally developed DTD that is relatively simple for a third party to understand and use while at the same time being able to map the math content into MathML if that is needed.

    For the new XML development we are doing, we chose to adhere to standards where we could. So all math markup will be in MathML and all special characters will be from Unicode 3.2 (the APS is one of the publishers participating in the STIX project which resulted in about 1,000 new math and other special glyphs from STM publishing being incorporated into Unicode 3.2). Any new special characters will be put into the private use area of Unicode and documented within the XML instance itself (a verbal description of the glyph, what if any standard glyphs were transformed or composed to create the glyph, what font if any it can be found in, and a rough bitmap of what the glyph looks like). It is hoped that we will be able to work with others to get some of these ideas out in the mainstream and thereby make XML archiving more uniform across publishers.

    Eventually we will also start looking at emerging standards for archives such as  OAIS and seeing if we can benefit from implementing these standards. 
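
    The private use area convention described in this answer lends itself to a simple consistency check: every private use character in an article should have an accompanying description. The sketch below is one possible form of such a check, under assumptions. Python's unicodedata module reports the general category "Co" for private use code points; the record fields in the sample glyph documentation are hypothetical names for the four items listed above, not the APS DTD.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct private use characters in text, sorted."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def undocumented(text, glyph_docs):
    """List private use code points in text that have no description."""
    return ["U+%04X" % ord(ch)
            for ch in private_use_chars(text)
            if ch not in glyph_docs]


# Hypothetical documentation record; field names are illustrative only.
glyph_docs = {
    "\ue000": {
        "description": "extended radical sign",
        "composed_from": ["U+221A"],
        "font": None,
        "bitmap": None,
    },
}

# U+E000 is documented, U+E001 is not, U+221A is a standard character.
sample = "x = \ue000 y + \ue001 z \u221a"
missing = undocumented(sample, glyph_docs)
```

    A check of this kind flags U+E001 in the sample, since it appears in the text without a description, while standard Unicode characters such as the square root sign are ignored.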


    Ian: How has PROLA worked with CrossRef and other database  producers/vendors to provide cross linking to other resources and to enable large  scale querying?    


Mark: Our strategy for PROLA is to keep it more or less feature for feature compatible with the deployment of our current content journals in AIP's OJPS platform. Thus we try to implement the same links wherever we can. We migrate one year of content each year into PROLA (really we just reveal it in PROLA -- all of our content exists within the archive and PROLA can function as a full failover for our content in OJPS if need be). Thus we want to maintain the same functionality for this content. SPIN and INSPEC links are the result of an agreement with AIP as part of this strategy. SPIRES linking has been completely implemented now.

    We would like to link to ISI, but we have been dissatisfied with the process that is needed to create such links. OJPS had such links enabled for a while, but because of production issues, these links have been disabled for now. We hope ISI will soon have a more workable solution in place.

    The newest cooperative effort for us is with ADS. We are working on providing  extensive linking to the astronomy and astrophysics literature available there and  expect to have that implemented by the early summer. 

    APS (through my position on the Technical Working Group) has been quite active in CrossRef. On some issues we hold a minority view (e.g., whether DOIs should be calculable from traditional citation metadata (journal, volume, page)), but overall I think CrossRef has been beneficial for creating links between publishers. We are in the process of depositing all of our PROLA records into CrossRef now (there were some final QC issues that I wanted to resolve first).
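
    To make the idea of a "calculable" DOI concrete, the sketch below composes an identifier directly from journal, volume and page, and recovers the citation from it again. The pattern, the prefix 10.5555 and the journal key JournalA are placeholders chosen for illustration; the interview does not state the scheme APS proposed or uses.

```python
def citation_doi(prefix, journal_key, volume, first_page):
    """Compose a DOI of the assumed form prefix/journal.volume.page."""
    return "%s/%s.%s.%s" % (prefix, journal_key, volume, first_page)


def parse_citation_doi(doi):
    """Recover (prefix, journal_key, volume, first_page) from such a DOI."""
    prefix, suffix = doi.split("/", 1)
    # Split from the right so a journal key containing dots stays intact.
    journal_key, volume, first_page = suffix.rsplit(".", 2)
    return prefix, journal_key, volume, first_page


# Placeholder values, not real identifiers.
doi = citation_doi("10.5555", "JournalA", 65, 1234)
```

    The appeal of such a scheme is that a linking party holding only a traditional citation can form the identifier without first querying a lookup service; the opposing view favours opaque identifiers that carry no metadata.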


    Ian: The creation of a truly archival XML format for online  delivery must be a work in progress. What have been the surprises, problems and  triumphs along the way?     


Mark: As I said earlier, we are revamping our archival marked up material. The main issues we have with our current SGML are its relationship to the final published article and the extent to which it is reusable outside of the system in which it was produced.

    One of our vendors derives the SGML from their typesetting system after the fact and this transformation could potentially be a source of errors -- thus I can't consider the SGML from this source a guaranteed copy of what we published. Furthermore, no path exists for taking this SGML and reprocessing it back into the typesetting system to get, say, a PDF file out of it. Of course, we can still make good use of this SGML, particularly for creating the XML front and backmatter files and for searching. Our other vendor does directly use the SGML to typeset the journals using ArborText which has a TeX backend. The main problem here is that often it is necessary to feed directions to the TeX backend via SGML processing instructions. Sometimes these even contain actual article content (one example is for producing extended radical signs) and so one has to actually reverse engineer the typesetting process in order to recover the content. Another problem is that both vendors introduce SGML entities as needed for new characters and these are more or less completely undocumented. One can only look at the printed page (or PDF) to determine what they are. All of this makes the SGML very suspect from an archival point of view.


    Ian: What is next for PROLA now that the archive is  complete?    


Mark: Of course the archive continues to evolve with time. I don't expect that we will need to migrate content any time soon, but we are always thinking about what we will need to do. Improving access and reliability for our end users is the current priority. We plan to develop innovative uses for searching and mining the archive. Now that we have this wonderfully rich archive in electronic form, one can envision doing all sorts of projects to highlight various relationships between groups of papers. For instance, it would be interesting to locate papers "similar" to a given paper or to see the relationships between groups of authors as a field evolves. As more and more content becomes available in our XML format, we plan to exploit this heavily. Searching by formula, dynamically reformatting the article, linking to parts of papers, etc. will all become possible.


    Ian: What is different about the APS electronic archive and  something like arXiv.org which has an increasing proportion of the physics  literature?    


Mark: First, the APS archive is much more uniform than the TeX material on arXiv.org making it easier for us to migrate in the future. But the main difference from my point of view is the existence of the SGML and XML. Even if the current SGML we have isn't truly archival, it is still a rich source of information for searching and linking articles together. Thus PROLA can offer many more services (better linking for instance) than arXiv.org. This of course comes at a price -- it is still expensive to produce the marked up file. My hope is that over the next five to ten years, authoring tools will develop that will enable authors to directly create XML of the same archival quality as the XML I described above. This will enable arXiv.org and the APS to benefit immensely. For both of us, there will be very little additional cost for creating the rich services offered in a modern online journal application. However, during this transition, I still view what the APS does in production as an essential component (beyond peer review of course) that must be continued and, thus, must be paid for. Paying for this and peer review outside of the subscription model remains a challenge.

    Finally, librarians need to familiarize themselves with these issues and develop the skills to deal with marked up content. If libraries are to maintain a responsibility for archiving scholarly literature, then they must learn what publishers do and participate in its evolution. Collections of PDF files will not suffice for a long term archive. APS is quite interested in forming partnerships with research institutions and libraries so that we can continue to serve the broad interests of scholarly communication.


    Ian: Thanks for taking the time out of your busy schedule to talk  about APS and PROLA.    


Mark: You're welcome. Cheers.  

    
Further Internet Links and Contacts
    {Frequently Asked Questions about the APS link manager}

    {What's new in PROLA}

    {PROLA Archive Cornell Mirror}

     PROLA Home Page

    APS Pricing and Subscription   Information

    {Keeping the Promise: Phys Rev Completes Online Archive} (APS News Online August/September  2001)

    {PROLA: More Than Just a  Pretty Acronym} (APS News Online August/September 1999)

    Physical Review Online  Archives (PROLA) D-Lib Magazine June 1998

    Doyle, Mark. Dilaton Contact   Terms in the Bosonic and Heterotic Strings. 1992. 

    ________. The Operator Formalism and Contact Terms in String Theory.  Ph.D. Princeton University, 1992.

    ________. World-Sheet   Supersymmetry Without Contact Terms. 1992. 

    Mark Doyle 
  Manager, Product Development
  The American Physical Society
doyle@aps.org

    Barbara Hicks
  Associate Publisher
  The American Physical Society
assocpub@aps.org

    Ian Gordon
  Head, Reference Information Services
  Science Librarian
  James A. Gibson Library
  Brock University
igordon@brocku.ca  
