College and Research Libraries, July 1988

Automated Collection Analysis Using the OCLC and RLG Bibliographic Databases

Nancy P. Sanders, Edward T. O'Neill, and Stuart L. Weibel

This study examined the feasibility of automating the labor-intensive process of collection analysis. Collections in botany and mathematical analysis from institutions holding membership in the Committee on Institutional Cooperation (the Big Ten universities plus the University of Chicago) served as the study population. The databases of the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG) were used as the initial sources of holdings information. The study found that the methodology provided a promising alternative means of analyzing and comparing library collections. However, because cataloging practices varied among the participating libraries, accurate results could not be obtained without local verification of the holdings data.

The growing trend toward research library participation in cooperative collection development agreements has prompted collection managers to seek consistent means to evaluate and compare their collections. Unfortunately, most methods currently available are labor-intensive. The purpose of this study was to test the feasibility of using the databases of the bibliographic networks for computerized collection analysis to reduce the labor required.

The project was formally initiated in summer 1985 when the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG) were invited by the Committee on Institutional Cooperation (CIC) to participate in a meeting of its science bibliographers and collection development officers. The meeting explored the potential for cooperative collection development among CIC institutions. Discussions with participants and program planners suggested that science and technology collections would be good subject areas for study.
The CIC meeting was held at the University of Chicago on September 12 and 13, 1985. During that meeting, some preliminary analyses for the OCLC member libraries were presented. Following the discussion, it was decided to expand the study to include all CIC member universities: Chicago, Illinois, Iowa, Michigan, Minnesota, Wisconsin, Indiana, Michigan State, Northwestern, Ohio State, and Purdue.

Nancy P. Sanders is Head, Home Economics Libraries, at Ohio State University Libraries, Columbus, Ohio 43210-1285. Edward T. O'Neill is Senior Research Scientist, and Stuart L. Weibel is Associate Research Scientist at OCLC Online Computer Library Center, Dublin, Ohio 43017-0702. The authors thank the Research Libraries Group, particularly Leslie Hume, for assistance and support. We also thank all CIC libraries for their help. It would have been impossible to complete this study without the detailed local verification provided by the staffs of the individual libraries.

A literature review revealed substantial work in the area of collection analysis, particularly in collection overlap. This previous work is summarized in William Gray Potter's review of relevant research.¹ Much of the work completed in the 1960s and 1970s investigated the feasibility of establishing processing centers, union catalogs, or cooperative collection development agreements.

Several overlap studies based on methodologies different from that planned for this study were examined for relevant findings. Many earlier studies were based on random sampling from card catalogs or shelflists.
For example, William Nugent's study² of six New England state universities; Ellen Altman's investigation of the optimum composition of a secondary school interlibrary loan system;³ William Cooper, Donald Thompson, and Kenneth Weeks' study of overlap in the University of California system;⁴ and Edward O'Neill and Mary Lynn Seanor's analysis of the library collections in western New York State⁵ take this approach. Later studies, such as those by Thomas Nisonger of the libraries in north Texas;⁶ Barbara Moore, Tamara Miller, and Don Tolliver at the University of Wisconsin;⁷ and Glyn Evans, Roger Gifford, and Donald Franz in New York State,⁸ employed OCLC archive tapes in collection analysis. Potter used the LCS library computer network in Illinois academic institutions.⁹ While these studies, based on comparisons of random samples rather than recommended lists, were of interest, the methodologies and populations were sufficiently dissimilar to render comparisons difficult. The potential problems common to overlap studies in general were well documented by Michael Buckland, Anthony Hindle, and Gregory Walker.¹⁰

SAMPLING METHODOLOGY

Random samples of 500 monographic records from each of the two subject areas, botany and mathematical analysis (which includes calculus, functional analysis, functions, and differential equations), were extracted from the OCLC Online Union Catalog. These two subject areas were selected because their bibliographic character provides a useful contrast, each was collected by all of the institutions, and both were readily identifiable subjects in the Library of Congress and the Dewey Decimal classifications. The samples were intended to be representative of recently published monographs in the subject areas, and thus of the pool of potential library acquisitions. As such, they should not be viewed as a checklist of desirable books.
Only records with a copyright or publication date between 1978 and 1983 were included. This eliminated differential rates of retrospective conversion among the libraries as a factor in the comparison of holdings and minimized the effects of delayed acquisitions or cataloging backlogs. A book was categorized as mathematical analysis if it had a Library of Congress classification number in the range QA300-433. Books without a Library of Congress class were included if they had been classified as 515 in the Dewey Decimal classification. For botany, sample selection was based on the Library of Congress classification QK and the corresponding Dewey Decimal classification 581. At the time the samples were extracted, the OCLC Online Union Catalog contained 2,301 mathematical analysis titles and 5,044 botany titles published during the six-year period included in the study.

The sample records were then compared to related records in the OCLC Online Union Catalog to determine whether any represented a publication with substantially the same content, or "text," as defined by Patrick Wilson to differentiate between the content of a work and its physical form.¹¹ For example, under this definition, a dissertation in photocopy, microform, or typescript is considered to be a single text.

If records for a duplicate text were located, an experienced searcher determined whether the sample record was the first added to the Online Union Catalog, as indicated by its position in the OCLC number sequence. Only the lowest-numbered record for any text was included in the sample. If the sample record was the first in the Online Union Catalog and others were added later, all library holdings symbols attached to the subsequently added records were added to the holdings of the original sample record.
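The retention rule just described can be sketched in Python. This is a minimal illustration rather than the study's actual procedure, which relied on an experienced searcher: the record layout (`oclc_number`, `text_id`, `holdings`), the function name `collapse_to_texts`, and the holdings symbols are all assumptions introduced here, and `text_id` stands in for the searcher's manual judgment that two records represent the same text.

```python
from collections import defaultdict


def collapse_to_texts(sampled_numbers, records):
    """Keep a sampled record only if it is the lowest-numbered record for its text.

    records: iterable of dicts with hypothetical keys
        'oclc_number' (int), 'text_id' (str), 'holdings' (set of library symbols).
    sampled_numbers: OCLC numbers drawn in the random sample.
    Returns {oclc_number: merged holdings} for the retained sample records.
    """
    by_text = defaultdict(list)
    for record in records:
        by_text[record["text_id"]].append(record)

    sampled = set(sampled_numbers)
    retained = {}
    for group in by_text.values():
        first = min(group, key=lambda r: r["oclc_number"])
        if first["oclc_number"] not in sampled:
            # Either a later duplicate was drawn or no record for this text was.
            continue
        merged = set()
        for record in group:
            # Holdings on later duplicates are credited to the first record.
            merged |= record["holdings"]
        retained[first["oclc_number"]] = merged
    return retained


# Made-up records: text A has two records, text B has two records.
records = [
    {"oclc_number": 100, "text_id": "A", "holdings": {"AAA"}},
    {"oclc_number": 250, "text_id": "A", "holdings": {"AAA", "BBB"}},
    {"oclc_number": 300, "text_id": "B", "holdings": {"CCC"}},
    {"oclc_number": 120, "text_id": "B", "holdings": set()},
]
result = collapse_to_texts([100, 300], records)
print(result)
```

In this made-up example, record 100 is retained with the holdings of both records for text A, while record 300 is dropped because a lower-numbered record (120) exists for text B.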
This procedure maintained the statistical validity of the sample, ensuring that each text had an equal chance of being included in the sample regardless of the number of records in the database representing that text.

Different editions were considered different texts, with the exception of "editions" from Latin America and non-English-speaking Europe, where so-called editions are most often "printings." Therefore, different "editions" from these countries were considered to be the same text, and the records were collapsed or eliminated based on their OCLC number unless there was evidence of revision. Translations were considered to be distinct texts. Obvious serial (not monographic series) articles that had been cataloged separately and entered as monographs were also eliminated from the sample.

In most cases, determining whether two records represented the same item was not simple. Some decisions were later found to be erroneous when bibliographers examined their local records or an item in hand. These errors simply point out the problem long recognized by those who catalog in an online environment: determining whether an existing record represents a work in hand is often difficult, if not impossible, given the idiosyncrasies and lack of standardization in the publishing industry and the impossibility of adequately describing an item to distinguish it from different, though similar, works using current cataloging criteria.

Following the manual search of the database and the elimination of records not representing unique texts, 392 records remained in the botany sample and 454 in the mathematical analysis sample.

As the sample was searched, all relevant OCLC holdings data were appended to the selected bibliographic records. However, because only six of the eleven CIC institutions (Illinois, Indiana, Michigan State, Ohio State, Purdue, and Wisconsin) are OCLC members, not all CIC member holdings were represented.
Four of the institutions (Iowa, Michigan, Minnesota, and Northwestern) are RLG members. Chicago is not associated with either bibliographic network. To obtain holdings data for the RLG members, a listing of the bibliographic records in the samples was sent to RLG, where it was checked against that database.

To test the completeness of the networks' holdings data, the OCLC holdings information was compared with local records at Ohio State. It was obvious that the networks' holdings data were incomplete, largely due to local cataloging practice. Frequently, Ohio State had cataloged an item by attaching its holdings symbol to a series or serial record, rather than adding it to the record for the individual item or "subunit." This finding highlights the problems that arise when databases designed for one purpose, in this case cataloging, are used for a different purpose, such as collection analysis.

To determine the magnitude of the "subunit" problem, the searcher located the record for each serial or series that was cited within the subunit monographic record. Each was examined to determine whether it would be reasonable to assume that a given library held the monographic item if the library's symbol was attached to the serial holding record but not to the monographic record. In the majority of cases it was decided that this could not be assumed, because many series contained several hundred to more than one thousand associated monographs.

As a result of this early finding and the number of records belonging to this subunit category, the libraries were asked to verify their holdings. At the same time, Chicago was asked to identify the materials held. All of the institutions agreed to check the sample against their catalogs. However, due to local difficulties, the botany sample could not be verified at Michigan and Northwestern.
The Michigan data provided by RLG were used without validation, recognizing that the botany holdings for Michigan are underestimated. Because Northwestern had only recently begun entering records into the RLG database, its unverified holdings were known to be seriously underrepresented. Therefore, its botany data were excluded from the analysis.

The results of the local verification, shown in figures 1 and 2, confirm the earlier suspicions that holdings indicated by the bibliographic networks may not accurately reflect a library's collection; therefore, the records for the bibliographic networks should be only one of several sources used to measure collection strengths. The reasons for the discrepancies vary. For example, the holdings discrepancy figures show all of Chicago's holdings as "added by the library" because records from bibliographic networks were not available at the time the sample was taken. Also, data for Minnesota underrepresented their botany holdings because the wrong holdings symbol was used during data extraction. Local cataloging practices may account for other variations, such as the "subunit" problem noted above, but further examination and explanation await future research.

[Figure 1. Botany Holdings]

[Figure 2]

HOLDING PATTERNS

Of the analyses developed from the various holdings data, three focus on individual libraries' holdings. Five examine collections of the CIC institutions as a whole and provide an overview of the potential for co- [...]

[Figure 3. Title Duplication Patterns. Horizontal axis: number of libraries holding a given title. Series: botany, mathematical analysis.]

[...] is relatively flat. The number of items held by multiple libraries reflects the similarity among the collections. In botany, 25% of the sample was held by 5 or more of the 10 participating libraries. For mathematical analysis, 41% of the sample was held in 5 or more collections, and 36% was held in 6 or more libraries, indicating greater similarity among the mathematical analysis collections. The average number of libraries holding each title also indicates a greater duplication of the mathematical analysis material. Mathematical analysis items were held by an average of 4.2 CIC libraries, while the botany items were held by an average of only 2.3 libraries. Even when Northwestern's mathematical analysis holdings are excluded, to be consistent with botany, the average mathematical analysis book was still held by 4 libraries.

The pool of available materials was quite different for mathematical analysis and botany. During the period of study, approximately 350 books were published annually in mathematical analysis and 660 in botany. However, a greater proportion of the mathematical analysis materials was acquired. The CIC libraries each acquired an average of 134 books annually in mathematical analysis and 152 books in botany. The higher acquisition rate from a relatively small pool of available materials could potentially explain the higher duplication rate for mathematical analysis.

An analysis of titles not held by any CIC institution was undertaken in response to numerous comments from CIC participants that the sample was not representative of research collections because it included many popular books, texts, and other nonresearch materials more suitable for public or school libraries.
While the sample had been intended as a selection of all material published in the subject areas, the investigators questioned whether the material not held by the CIC institutions would be generally considered to be "research material." To address that question, the types of libraries holding the sample items not held by a CIC institution were analyzed. The findings are shown in figures 4 and 5.

[Figure 4. Libraries Holding Botany Titles Not Held by CIC Institutions. Series: number of titles held; number of titles held exclusively.]

[Figure 5. Libraries Holding Mathematical Analysis Titles Not Held by CIC Institutions. Series: number of titles held; number of titles held exclusively.]

For this analysis, a research library was defined as a member of the ARL, and academic libraries were defined as all other college and university libraries. The public libraries group also includes processing centers, school libraries, and state libraries. Only North American library holdings were included in the analysis. The examination showed that 61% of the 101 mathematical analysis titles and 60% of the 176 botany titles not held by CIC institutions were held by at least one other research library. Also notable is the number of items not held by a CIC institution that were held only by another research library: 45% in the mathematical analysis sample and 32% in the botany sample. In all cases, the sample items were more often held by research libraries than by any other type of library. Other academic libraries held the second largest portion of the titles not held by a CIC institution, followed by public and special libraries. For mathematical analysis, academic libraries held 52%, exclusively 28%; public libraries held 17%, exclusively 2%; and special libraries held 10%, exclusively 2%.
For botany, academic libraries held 44% of the titles, 21% of them exclusively; public libraries held 10% of the titles, 2% of them exclusively; and special libraries held 36%, exclusively 10%. Thus, almost all of the materials not held by academic and research libraries were held by special libraries, especially in botany. Therefore, it appears likely that the materials were not acquired by any CIC library for reasons other than their lack of scholarly focus.

COLLECTION OVERLAP

Collection overlap was analyzed to determine the extent of duplication among CIC libraries. The results of the analysis are shown in tables 1 and 2. Overlap was determined by measuring the number and proportion of titles held in common by pairs of CIC libraries, i.e., by comparing each CIC library in turn with every other CIC library. The number held in common is shown in tables 1 and 2 below the diagonal, while the percentage appears above it.

Percentages were calculated by first adding the numbers of sample titles held by the two paired institutions (e.g., 89 + 109, or 198, in the case of the Ohio State and Wisconsin botany collections). The number of duplicated items was then subtracted (198 - 67 = 131 in the example), leaving the number of distinct titles held by the two libraries together. The number of titles held in common was divided by this number, yielding the percentage of titles held in common by the two libraries (67 / 131 = 0.511, or 51.1%).

A related research project by Charles Davis and Debora Shaw¹² suggests that overlap is predictable by collection size. In the present study a significant positive correlation (r = 0.58) was found between overlap and number of volumes held by both institutions for botany. In mathematical analysis, however, there was no significant correlation (r = -0.01). The botany finding does not support the Davis and Shaw study.
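The pairwise overlap percentage defined in this section can be sketched as follows. This is a minimal sketch, not the authors' program: the function names `overlap_percent` and `overlap_percent_from_sets` are assumptions introduced here. The ratio of titles held in common to distinct titles held by either library is the measure often called the Jaccard index.

```python
def overlap_percent(held_a, held_b, common):
    """Titles held in common as a percentage of the distinct titles held by the pair."""
    distinct = held_a + held_b - common  # count the duplicated titles only once
    return 100.0 * common / distinct


def overlap_percent_from_sets(titles_a, titles_b):
    """Same measure, computed directly from two sets of title identifiers."""
    return 100.0 * len(titles_a & titles_b) / len(titles_a | titles_b)


# Worked example from the text: Ohio State holds 89 botany sample titles,
# Wisconsin holds 109, and 67 are held by both.
osu_wisconsin = overlap_percent(89, 109, 67)
print(f"{osu_wisconsin:.1f}%")

# The set-based form gives the same answer for made-up identifiers.
a = {1, 2, 3, 4}
b = {3, 4, 5}
print(overlap_percent_from_sets(a, b), overlap_percent(len(a), len(b), len(a & b)))
```

Because the denominator is the pair's combined distinct holdings rather than either library's own holdings, the measure is symmetric: each pair of libraries has a single overlap percentage.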
However, the method of computing the overlap was different and could account for the inconsistency. Further research is required to understand the relationship between collection size and overlap.

The overlap percentages were, on average, higher for mathematical analysis than for botany. The differences are likely due, at least in part, to factors noted earlier: the smaller body of mathematical analysis material published and the geographical specificity of some botany material. It is highly likely that institutional collection policies also affected the overlap patterns, though this was not explicitly examined in the study.

TABLE 1
COMMON HOLDINGS IN BOTANY
No. of Common Titles/% of Common Titles

               No. of
               Titles                                                    Michigan                Ohio
Institution      Held   Chicago  Illinois   Indiana      Iowa  Michigan     State Minnesota     State    Purdue Wisconsin
Chicago            65         -      37.8      37.0      39.6      33.7      35.8      34.4      35.1      31.9      42.6
Illinois          132        54         -      38.8      47.5      36.6      52.8      47.3      51.4      32.6      60.7
Indiana            61        34        54         -      42.7      39.6      42.4      32.7      45.6      36.5      46.6
Iowa               76        40        67        41         -      42.0      53.2      43.4      48.6      29.7      50.4
Michigan           66        33        53        36        42         -      43.0      38.7      44.9      30.1      36.7
Michigan State    117        48        86        53        67        55         -      52.4      54.9      31.3      60.3
Minnesota         142        53        88        50        66        58        89         -      45.3      31.3      47.6
Ohio State         89        40        75        47        54        48        73        72         -      37.1      51.1
Purdue             55        29        46        31        30        28        41        47        39         -      36.7
Wisconsin         109        52        91        54        62        47        85        81        67        44         -

TABLE 2
COMMON HOLDINGS IN MATHEMATICAL ANALYSIS
No. of Common Titles/% of Common Titles

               No. of
               Titles                                                    Michigan              North-      Ohio
Institution      Held   Chicago  Illinois   Indiana      Iowa  Michigan     State Minnesota   western     State    Purdue Wisconsin
Chicago           124         -      37.5      58.1      52.9      44.1      51.4      55.1      46.8      47.0      52.2      50.5
Illinois          301       116         -      47.6      42.4      70.7      44.0      46.4      32.7      58.9      42.7      48.4
Indiana           170       108       152         -      57.6      55.7      55.5      59.3      45.5      51.4      56.2      57.1
Iowa              142        92       132       114         -      49.8      54.6      57.2      48.8      50.4      59.7      56.4
Michigan          252       115       229       151       131         -      48.9      54.9      38.4      59.0      51.1      56.6
Michigan State    141        90       135       111       100       129         -      57.5      46.4      52.7      54.0      58.3
Minnesota         163       102       147       124       111       147       111         -      48.9      53.0      58.2      61.2
Northwestern      105        73       100        86        81        99        78        88         -      40.3      50.9      45.3
Ohio State        198       103       185       125       114       167       117       125        87         -      51.3      49.4
Purdue            147        93       134       114       108       135       101       114        85       117         -      56.6
Wisconsin         174       100       155       125       114       154       116       128        87       123       116         -

COMPOSITION OF THE COLLECTIONS

Language of publication was found to be a useful attribute for characterizing the literature of a given subject field and for distinguishing the collecting policies of research libraries. Figure 6 shows the proportion of foreign-language material held by each institution. As might be expected, the majority of each library's collection was in English. The larger collections contain a higher proportion of non-English material. This generalization proves stronger for the mathematical analysis sample, in which the total collection size and the proportion of the foreign-language collection are closely correlated. In the botany sample, Chicago, Michigan, and Ohio State have higher percentages of non-English material than would be predicted by their comparative collection sizes. Michigan's figure is probably explained by underrepresentation of its collection; Chicago's by its heavy research emphasis; and Ohio State's by its Herbarium staff's interest in Latin America and resulting purchases in Spanish and Portuguese, and the emphasis on systematics, for which the Biological Sciences Library purchases in many foreign languages.

[Figure 6. Foreign Language Holdings. Series: botany, mathematical analysis.]

The foreign-language composition of the sample, shown in figure 7, provides yet another means of illustrating the differences in the character of the two samples. The non-English portions of the mathematical analysis sample were primarily German (46%), Russian (30%), and French (16%).
The botany foreign-language material was published more frequently in French (35%), German (24%), Russian (13%), and Spanish (10%).

[Figure 7. Common Languages. Categories: German, French, Russian, Spanish, other. Series: botany, mathematical analysis.]

IMPLICATIONS

The concept of analyzing library collections by comparing their current acquisition patterns to the pool of available monographs was found to be a viable approach to collection evaluation. Although the resulting data could be used either to compare the relative strengths of different subject areas within a single library or to compare relative strengths in a given subject among different libraries, the investigators believe that the absolute numbers are far less significant than the relative numbers. Knowing that a library acquires 25% of all available material tells little about the strength of the collection. It is only when the acquisition rate is compared to that of peer institutions that the assessments become meaningful. For example, by comparing all CIC collections, it became clear that a library acquiring 30% of the available botany material is likely building a strong collection. However, in mathematical analysis, the acquisition of 30% of the available material would produce only an average collection. Further research would be required to build a basis of comparative data for other subject areas.

The acquisition patterns for both botany and mathematical analysis materials indicate a considerable potential for cooperative collection development among the CIC institutions. Since only approximately 5% of the acquisitions are unique, a relatively small shift in acquisition patterns could result in a significant reduction in the amount of material not acquired by any CIC institution.
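The peer comparison argued for in this section can be made concrete with the counts in the "No. of Titles Held" columns of tables 1 and 2. The sketch below is illustrative only: it treats a library's share of the deduplicated sample (392 botany texts, 454 mathematical analysis texts) as a stand-in for its acquisition rate, which is an assumption made here, and the names `acquisition_rates` and `points_from_peer_mean` are likewise introduced for illustration and are not part of the study.

```python
from statistics import mean

# Sample titles held by each library ("No. of Titles Held", tables 1 and 2).
BOTANY_HELD = {
    "Chicago": 65, "Illinois": 132, "Indiana": 61, "Iowa": 76, "Michigan": 66,
    "Michigan State": 117, "Minnesota": 142, "Ohio State": 89, "Purdue": 55,
    "Wisconsin": 109,
}
MATH_HELD = {
    "Chicago": 124, "Illinois": 301, "Indiana": 170, "Iowa": 142, "Michigan": 252,
    "Michigan State": 141, "Minnesota": 163, "Northwestern": 105,
    "Ohio State": 198, "Purdue": 147, "Wisconsin": 174,
}
BOTANY_SAMPLE = 392  # unique texts remaining in the botany sample
MATH_SAMPLE = 454    # unique texts remaining in the mathematical analysis sample


def acquisition_rates(held, sample_size):
    """Share of the sample held by each library, in percent (assumed proxy)."""
    return {library: 100.0 * count / sample_size for library, count in held.items()}


def points_from_peer_mean(rate, held, sample_size):
    """Percentage points by which `rate` exceeds (+) or trails (-) the peer mean."""
    return rate - mean(acquisition_rates(held, sample_size).values())


botany_mean = mean(acquisition_rates(BOTANY_HELD, BOTANY_SAMPLE).values())
math_mean = mean(acquisition_rates(MATH_HELD, MATH_SAMPLE).values())
print(f"botany peer mean: {botany_mean:.1f}%")
print(f"mathematical analysis peer mean: {math_mean:.1f}%")
print(f"30% vs botany peers: {points_from_peer_mean(30.0, BOTANY_HELD, BOTANY_SAMPLE):+.1f}")
print(f"30% vs math peers: {points_from_peer_mean(30.0, MATH_HELD, MATH_SAMPLE):+.1f}")
```

On these assumptions, a 30% rate sits above the botany peer mean and below the mathematical analysis peer mean, which is broadly the contrast drawn in the example above.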
The result of such changes in collection development policies would be that library users would experience a small decrease in the proportion of their needs met locally, but a higher proportion would be met within the consortium. Whether the overall results of such changes would be desirable would depend on usage patterns, local expectations, and political conditions, none of which was examined in this study.

The relation between collection size and overlap bears further investigation. If such analysis could substantiate a strong positive correlation between size and overlap, then libraries contemplating cooperative agreements might rely with some confidence on the more easily obtainable collection size statistics for a particular subject classification rather than computing common holdings. Of equal importance in such a study would be a careful analysis of the collections that do not conform to the model to derive an explanation of their uniqueness.

From the analysis of the holdings, it is clear that local library cataloging practices and bibliographic networks' policies affect the utility of the online databases for collection analysis. The responses from the CIC institutions indicate a pattern of cataloging practices that requires local validation to achieve reliability. Cataloging policies that resulted in partial cataloging of monographic series and no cataloging for some reserve, technical report, and theses collections became apparent in this study.

Potential uses for the results of comparative collection data include accreditation reports, collection strength analysis for proposed new programs, cooperative project viability, and Conspectus or NCIP worksheet validation. However, unless a method can be found to compensate for the unreported holdings, local validation of the holdings data is necessary to obtain consistent and reliable results.
The expense of that process obviously limits its application to selected subject areas.

REFERENCES

1. William Gray Potter, "Studies of Collection Overlap: A Literature Review," Library Research 4:3-21 (Spring 1982).
2. William R. Nugent, "Statistics of Collection Overlap at the Libraries of the Six New England State Universities," Library Resources & Technical Services 12:31-36 (Winter 1968).
3. Ellen Altman, "Implications of Title Diversity and Collection Overlap for Interlibrary Loan among Secondary Schools," Library Quarterly 42:177-94 (Apr. 1972).
4. William S. Cooper, Donald D. Thompson, and Kenneth R. Weeks, "The Duplication of Monograph Holdings in the University of California Library System," Library Quarterly 45:253-74 (July 1975).
5. Edward T. O'Neill and Mary Lynn Seanor, A Survey of Library Resources in Western New York (Buffalo, N.Y.: Western New York Library Resources Council, 1971).
6. Thomas E. Nisonger, "Editing the RLG Conspectus to Analyze the OCLC Archival Tapes of Seventeen Texas Libraries," Library Resources & Technical Services 29:309-27 (Oct./Dec. 1985).
7. Barbara Moore, Tamara J. Miller, and Don L. Tolliver, "Title Overlap: A Study of Duplication in the University of Wisconsin System Libraries," College & Research Libraries 43:14-21 (Jan. 1982).
8. Glyn T. Evans, Roger Gifford, and Donald R. Franz, Collection Development Using OCLC Archival Tapes (Washington, D.C.: Office of Education, Office of Libraries and Learning Resources, ED 152 299, 1977).
9. William Gray Potter, "Collection Overlap in the LCS Network in Illinois," Library Quarterly 56:119-41 (Apr. 1986).
10. Michael K. Buckland, Anthony Hindle, and Gregory P. M. Walker, "Methodological Problems in Assessing the Overlap between Bibliographical Files and Library Holdings," Information Processing and Management 11:89-105 (Aug. 1975).
11. Patrick Wilson, Two Kinds of Power: An Essay on Bibliographical Control (Berkeley: Univ. of California Pr., 1968).
12. Charles H. Davis and Debora Shaw, "Collection Overlap as a Function of Library Size: A Comparison of American and Canadian Public Libraries," Journal of the American Society for Information Science 30:19-24 (Jan. 1979).