Identifying Potential Solutions to Increase Discoverability and Reuse of Analog Datasets in Various Campus Locations

Issues in Science and Technology Librarianship
Winter 2018
DOI:10.5062/F4PC30NR

Shannon L. Farrell
Natural Resources Librarian
Natural Resources Library
University of Minnesota
St. Paul, Minnesota
sfarrell@umn.edu

Julia Ann Kelly
Science Librarian
Magrath Library
University of Minnesota
St. Paul, Minnesota
jkelly@umn.edu

Abstract

Describing, preserving, and providing access to data is now the purview of many science librarians, although the emphasis has been on data in electronic format. Data in paper or analog format might be found in many places around our campuses. At the University of Minnesota we conducted a preliminary investigation of analog data through discussions with faculty, staff, and the University Archives. We identified data in numerous locations, including the University Archives, personal collections, departmental holdings, museums, and off-campus research stations. We discovered data in many formats and carried out a few initial projects including creating a detailed inventory of one research center's analog data and digitizing and depositing one individual's dissertation data in our institutional repository. We also examined University Archives and discovered substantial amounts of analog data along with problems such as incomplete description or context. Overall we have identified several challenges and directions that we could take to make analog data more findable and available for reuse, but there is no clear single path forward.

Introduction

Academic librarians are increasingly charged with tasks related to data management, especially in the sciences, with a focus on born-digital machine-readable data.
However, many campus entities still hold scientific research data in paper or analog format. These data have received little or no attention in the new era of librarians working with research data. However, in 2015 when science librarians were informally asked via the STS-L and USAIN-L mailing lists about how they are working with researchers and their paper data, several responders noted this as a problem and expressed interest in finding solutions. We propose that academic librarians utilize their skills in data management to address the preservation, discoverability, and possible reuse of this body of analog data.

Analog data can exist in many places across a university campus. Through informal discussions with faculty in our liaison departments, we discovered that a number of groups on our campus store analog data, sometimes spanning decades. These are often centers, museums, or research outposts that conduct long-term research projects. Scientists in charge have concerns about preservation and an interest in making the data as useful as possible for their own groups and possibly outsiders. In all of our interactions, even if they did not know that librarians now worked in data management, the researchers did recognize that librarians have skills in organization, preservation, and access, and they welcomed our interest and expertise.

In some instances when individual researchers retire, their lab books and field notes find their way to the university's archives, whose traditional users are historians, not scientists. While it is not current practice to digitize all of the data, they are certainly safe from harm and available for perusal by those visiting the archive.

As we investigated the analog data assets on our campus, both in the University Archives and in labs and other research units, we recognized the potential value of the material for reuse by other scientists.
We also recognized the challenge of preservation, especially for records going back 100 years or more in non-controlled environments. In nearly all cases, only a small number of people knew of the data's existence. Without descriptive metadata and platforms to make that metadata visible, the chance for reuse of these data sets is not very high.

Literature Review

As the value of raw research data and its potential for reuse has come to the attention of policymakers and funding agencies (Holdren 2013; NSF n.d.), academic librarians have developed new skills and services to support their long-time mission to collect and organize information and to make it accessible. Services now offered by research libraries include assisting with data management plans, advising researchers about options for preserving data, curating data, and hosting data repositories (Johnston 2017; Kellam & Thompson 2016; Tenopir et al. 2017).

While working with research data may be new to librarians, university archives have collected and preserved scientific research data for many decades. The purpose of preserving this data may not have been for its later reuse by other scientific researchers, since the primary audience for archives is historians; thus the archival finding aids might not highlight specific features of the data that would lead scientific researchers to think it would be useful to their work. However, there are anecdotal examples of scientists reusing data.

Archivists have documented various aspects of how scientific data fit into their collections (Haas et al. 1985). Past archival practices called for being selective about what data to accept (Haas et al. 1985; Janzen 1980). There are currently discussions but no consensus on accepting historical data (Laver 2003; Noonan & Chute 2014). To inform the discussion on these and other issues, some archivists suggest more interaction with scientists (Akmon et al. 2011).
Others suggest taking a careful look at metadata practices in regard to research data (Lauriault et al. 2007).

A few organizations, including the Smithsonian, the Biodiversity Heritage Library (Biodiversity Heritage Library n.d. a; b), and the library at Texas A&M, have undertaken projects to digitize older field notebooks and make them freely available (Texas A&M University Libraries n.d.). The Texas A&M effort includes identification numbers for preserved physical specimens when available.

Researchers have been reusing older analog data for decades, but in the examples we found, authors were mainly using data of a type that was commonly known to exist and probably relatively easy to locate. Examples include weather data (Brázdil et al. 2016; Kelso & Vogel 2007) and other data produced by government entities such as records of herbicide applications (Chauvel et al. 2012); land use (Munro & van der Horst 2016); agricultural production (Nikulin 2015); and plant surveys (Van der Veken et al. 2004). The occasional paper, such as the historical look at the Hubbard Brook Experimental Forest in New Hampshire, depends on data sets particular to a location or project (Fahey et al. 2015). Although we specifically searched several scientific databases with terms that would indicate use of historical data and scanned the full text of likely candidates, we were unable to verify any examples of scientific researchers using data that was retrieved from university archives.

Analog Data on the University of Minnesota Campus

Our interest in analog science data on our campus started with a request from the head of the Horticulture department. They had unearthed potato breeding records covering over 100 years and wanted assistance from the library in deciding what to do with them. The department head recognized their potential value and was concerned about the department's ability to house them safely over the long run.
We provided storage, created an inventory of more than 100 volumes, and continue to work on a plan to best describe the data and make their availability known.

Exploring the University Archives

Looking for a permanent home for the potato records led us into numerous discussions with staff members at the University of Minnesota Archives. As we learned more about their collections and the decisions they make about what to retain, we realized that materials in the Archives undoubtedly contained scientific data that might be of interest to current and future students and researchers. Our University Archivist was able to cite examples of material that had been reused by scientists. As we did initial investigation using the Archives' finding aids, we noted that although records might contain research data, the descriptions did not always make it easy to identify. A deeper look into the finding aids and the records themselves was our next step.

We conducted a preliminary scan of research data in University Archives by entering the term "data" in the search box on the University Archives web site, which covers the electronic finding aids. We restricted the search to collections within the agricultural, biological and environmental sciences. After reviewing results to confirm that they likely included some kind of research data, we identified 33 collections to explore further.

We examined each of the 33 collections' finding aids and noted where potential datasets might be located and what subjects were covered. We also noted what terms were used to describe potential datasets, such as "research," "studies," "lab notebooks," "field notebooks," "survey," "tally sheets," and so on (approximately 75 terms in total). We then identified four collections of papers to examine directly: three personal collections and one research center collection.

We discovered that there is no consistency or controlled vocabulary in how research data is described in Archives.
Some items that we thought would contain data did not, such as a folder entitled "Field notes." Others that we had not identified via our search as being potential data sources did turn out to have research data, such as a folder called "Mirror Lake." This is probably due to archives' policy of retaining the original creator's folder names.

Although the point of our exercise was to discover how many datasets were in the Archives, and not how valuable or reusable the data were, we noted that much of the data was lacking metadata and/or context, and may not be useful to future scientific researchers. For example, some datasets did not describe where and when the data were collected, or what variables they were measuring (e.g., acronyms and column or row headings were not explained). There were problems with readability of handwritten data, sheets being torn or missing, and data that were crossed out or written over. Many of these issues likely arose because of archives' policy to take items as they are and to use terminology and description that was provided by the researcher. If a collection comes to the archives after a researcher is deceased (as is likely), there is no dialogue between archivist and researcher, and thus no extra description in a record. Our challenges in finding reusable data led us to think about what new procedures might help avoid these situations and make this valuable material more accessible to researchers.

We learned from conversations with the University Archivist that users of archival collections often approach archives armed with the name of a particular researcher whose work they would like to investigate. They have gathered background information and have an idea of what they might find. Others depend on Google to help them identify collections of potential interest, and the finding aids developed by Archives staff are essential to this process.
To determine if searching Google could be a successful strategy for identifying data in our University Archives, we selected 15 collections that we knew from our earlier investigations contained data. Each of the finding aids for these collections included the term "data" at least once. To mimic how a scientific researcher might search for existing data using Google, our Google search strategy included:

- Researcher last name
- "Data"
- "Minnesota"
- A general word or phrase about the subject area of the data (e.g., wildlife, insect, green revolution)

For just over half of the collections (8 of 15), the current finding aid for the researcher did not appear in the first 30 items in the results list. We also did not find any other documentation in those first 30 hits, such as older finding aids, that would lead us to believe that our archives contained data generated by that researcher. For the remaining seven we did locate a finding aid that mentioned data. Despite common wisdom that people locate data in archives via Google, in just over half the cases we were unable to do so using the search strategy outlined above.

Other Analog Data Held Around Campus

In addition to the potato breeding data, we suspected that there were pockets of analog data in various labs, centers, and departments on campus. A few examples were similar to the potato materials, in which researchers approached us, but we also decided to reach out to a small number of other groups.

A former graduate student who studied mountain plovers, a western bird, came to us via a department head. The species is threatened and he had videos, photos, slides, vocal recordings, drawings, and field notes that he wanted to make freely available to other researchers. Through an internal grant program we were able to digitize the materials. We treated the project as a pilot to see how we might best deal with a mix of analog data types.
All scanned items were first placed in our media repository, which provided good viewing for the visuals but lacked the ability to create hierarchies or other organizational groupings. We eventually created spreadsheets that included groupings, metadata, and links to the individual visuals and deposited them as a single collection in our institutional repository along with the field notes and the resulting dissertation (Graul 1973). This experience taught us a great deal about the complexity of dealing with analog data.

Another researcher contacted us after hearing about our interest in analog data. He had formerly been involved in a project to discover agricultural research data from Africa and convert it to machine-readable format. He was now working on a public-private partnership to do the same for data on certain crops and investigate automated systems to ingest the data. We are currently exploring how to help him in this endeavor.

Everyone we approached did indeed have analog data and they each had someone else that they recommended we talk to. In the past year, we have also met with researchers from:

- Bell Museum of Natural History, where we discussed solutions for preserving specimen labels and collecting records.
- University of Minnesota Insect Collection, where we identified a large analog dataset that is not replicated anywhere digitally.
- Cloquet Forestry Center, where several file cabinets full of historical forest data are not in an archivable format and need preservation.

While working with the potato breeding data we learned that there were other historical records held by horticultural researchers at the University of Minnesota covering fruit, chrysanthemums, and turf grass. We first pursued material held by researchers at the Horticultural Research Center at the Minnesota Landscape Arboretum. The Center contains a vault with over 100 years of fruit breeding data. The data was somewhat organized, but not cataloged.
Even those working with it on a day-to-day basis were not entirely clear what it contained. We worked in conjunction with the researchers to create a detailed inventory that included a data dictionary and controlled vocabulary. The researchers identified terms/keywords that they would want to use to search the data, such as "field observations", "field maps" or "breeding material." This work took several all-day site visits to complete.

Following these discussions and work on the Arboretum project, it became clear that if we were to put out a call for analog data on campus, we would be inundated and would need dedicated time to work closely with researchers.

Conclusions

In our experience, we found that scientific analog data exist all over campus. Every researcher we talked to knew of other examples where these data were being held. We do not know how much analog data is being held on our campus; the amount we find will dictate how we proceed. If we uncover a lot of analog data, we will have to determine the costs and benefits of working with particular data sets. Through pilot projects we plan to assess the feasibility of surfacing analog data and facilitating its reuse. We will consider working with faculty to create enhanced description, determine potential for reuse, and find solutions for long-term preservation.

We believe this will be a finite problem. We know that nearly all scientific researchers are now collecting their data digitally (or are at least converting most current data to machine-readable formats), and therefore analog data are not being generated at the same rates as in the past. However, our findings may aid those researchers in parts of the world where analog data are still commonly generated.

There will not be a one-size-fits-all approach to handling analog datasets. We know from preliminary investigation that some researchers will decline to deposit their data in the Archives, and that the Archives cannot accept everything offered.
Some of the data may currently be in use and thus need to reside with the originating research facility or museum. In some cases, the data could be digitized and preserved electronically. However, the consensus among librarians, archivists, and researchers is that not everything can or should be digitized. We encountered both situations in our study. We chose to digitize the plover data because it was limited in size and it concerned a threatened animal, but we opted to leave the 86 notebooks spanning a century of potato data in analog format.

Digitization does not necessarily facilitate reuse or accessibility of the data. Although digitizing images of lab notebooks means that someone can in theory access them online, it does not mean that the actual data will be machine-readable. A researcher who wants to use the dataset will likely still have to do the tedious work of converting it to a different format. Therefore, if the data are not going to be transformed into a machine-readable format, we need to carefully consider what advantage there is in using resources to digitize and store textual data when there is no immediate need for the particular dataset. Important factors to weigh may include: relationship to urgent fields of inquiry (such as climate change or endangered species), timespan of the data, size of the data set, prominence of the researcher, influence of the research, significance to the local institution, or past requests for that type of data.

While it may be true that the majority of historical analog data was shared in summarized format within published reports or articles, raw data are more useful for replicating experiments, making direct comparisons, completing meta-analyses, or conducting longitudinal studies. The scientific community most likely does not know about or cannot access raw data that are locked away in file cabinets and labs.
While no one answer to this problem currently exists or is on the horizon, several solutions could be developed or adapted to increase access to and findability of analog data. This question will likely be answered by a cross-disciplinary and multi-institutional investigation.

One option might be an online registry to house the metadata describing analog datasets, so they can be organized and searched without being converted to machine-readable format. Another option could be having these records integrated into some other existing resource or set of resources, such as digital data repositories. At this point, many of the existing repositories do not accept standalone metadata records. One exception is Ag Data Commons [https://data.nal.usda.gov/], a combined data repository and metadata registry. However, it accepts metadata only for data in digital format, not for analog data. Research Data Australia [https://researchdata.ands.org.au/page/about] is an example of a registry that does provide access to analog data in paper format. Registries for physical specimens exist in the museum community and provide ways to share collections and may serve as a model.

Describing analog data and making them available via a registry is a relatively low-barrier strategy, compared to scanning or converting to machine-readable format. Creating metadata records would require fewer resources and would still make the data discoverable. The development of standards for these analog metadata records would need to take scientists' disciplinary practices into consideration in order to facilitate reuse.

Since analog materials have long been in the purview of archives, we will be working with archivists to make use of their expertise and resources to figure out how to best preserve, organize, and increase the visibility of scientific analog data to scientists. The work that we have done up until this point indicates the potential importance of analog data materials.
Going forward, we plan to more thoroughly investigate the extent and reusability of analog data in our institutional archive and ascertain the extent and condition of analog data held within individual labs or centers across campus. With this work, we hope to uncover potential solutions and recommendations for how to feasibly address reuse of historic analog data as a profession.

References

Akmon, D., Zimmerman, A., Daniels, M. & Hedstrom, M. 2011. The application of archival concepts to a data-intensive environment: working with scientists to understand data management and preservation needs. Archival Science 11(3):329-348. DOI: 10.1007/s10502-011-9151-4

Biodiversity Heritage Library. A. (n.d.). Smithsonian field book collection. [Internet]. Biodiversity Heritage Library; [cited 2017 June 14]. Available from http://www.biodiversitylibrary.org/collection/SIFieldbooks

Biodiversity Heritage Library. B. (n.d.). The field book project. [Internet]. Biodiversity Heritage Library; [cited 2017 June 14]. Available from https://biodivlib.wikispaces.com/The+Field+Book+Project

Brázdil, R., Chromá, K., Valásek, H., Dolák, L. & Reznícková, L. 2016. Damaging hailstorms in South Moravia, Czech Republic, in the seventeenth to twentieth centuries as derived from taxation records. Theoretical and Applied Climatology 123:185. DOI: 10.1007/s00704-014-1338-1

Chauvel, B., Guillemin, J-P., Gasquez, J. & Gauvrit, C. 2012. History of chemical weeding from 1944 to 2011 in France: changes and evolution of herbicide molecules. Crop Protection 42:320-326. DOI: 10.1016/j.cropro.2012.07.011

Fahey, T.J., Templer, P.H., Anderson, B.T., Battles, J.J., Campbell, J.L., Driscoll, C.T., Fusco, A.R., Green, M.B., Kassam, K.A.S., Rodenhouse, N.L., Rustad, L., Schaberg, P.G. & Vadeboncoeur, M.A. 2015. The promise and peril of intensive-site-based ecological research: insights from the Hubbard Brook ecosystem study. Ecology 96(4):885-901. DOI: 10.1890/14-1043.1

Graul, W.D. 1973. Mountain plover research. [Internet]. Minneapolis (MN): University of Minnesota Digital Conservancy; [cited 2017 June 14]. Available from https://conservancy.umn.edu/handle/11299/169740

Haas, J.K., Samuels, H.W. & Simmons, B.T. 1985. Appraising the Records of Modern Science and Technology: A Guide. Cambridge (MA): Massachusetts Institute of Technology.

Holdren, J.P. 2013. Increasing access to the results of federally funded scientific research: memorandum for the heads of executive departments and agencies. [Internet]. Washington, D.C.: Executive Office of the President Office of Science and Technology Policy; [cited 2017 June 14]. Available from https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

Janzen, M.E. 1980. Scientific records in a "general" repository. The Midwestern Archivist 5(1):29-37. Available from http://www.jstor.org/stable/41101499

Johnston, L. 2017. Curating Research Data. Chicago: Association of College and Research Libraries, a division of the American Library Association.

Kellam, L.M. & Thompson, K. 2016. Databrarianship: the Academic Data Librarian in Theory and Practice. Chicago: Association of College and Research Libraries, a division of the American Library Association.

Kelso, C. & Vogel, C. 2007. The climate of Namaqualand in the nineteenth century. Climatic Change 83(3):357-380. DOI: 10.1007/s10584-007-9264-1

Lauriault, T.P., Craig, B.L., Taylor, D.R.F. & Pulsifer, P.L. 2007. Today's data are part of tomorrow's research: archival issues in the sciences. Archivaria 64:123-179. Available from https://archivaria.ca/archivar/index.php/archivaria/article/viewFile/13156/14404

Laver, T.Z. 2003. In a class by themselves: faculty papers at research university archives and manuscript repositories. The American Archivist 66:159-196. DOI: 10.17723/aarc.66.1.b713206u71162k50

Munro, P.G. & van der Horst, G. 2016. Contesting African landscapes: a critical reappraisal of Sierra Leone's competing forest cover histories. Society and Space 34(4):706-724. DOI: 10.1177/0263775815622210

Nikulin, P.F. 2015. On the factors of economic modernization in the development of agricultural production in Siberia in the early 20th century. Vestnik Tomskogo Gosudarstvennogo Universiteta-Istoriya 1:10-14.

Noonan, D. & Chute, T. 2014. Data curation and the university archives. The American Archivist 77(1):201-240. DOI: 10.17723/aarc.77.1.m49r46526847g587

NSF. (n.d.). Dissemination and sharing of research results. [Internet]. Arlington (VA): NSF - National Science Foundation; [cited 2017 June 14]. Available from https://www.nsf.gov/bfa/dias/policy/dmp.jsp

Tenopir, C., Talja, S., Horstmann, W., Late, E., Hughes, D., Pollock, D., Schmidt, B., Baird, L., Sandusky, R.J., & Allard, S. 2017. Research data services in European academic research libraries. Liber Quarterly: The Journal of European Research Libraries 27(1):23-44. DOI: 10.18352/lq.10180

Texas A&M University Libraries. (n.d.). William B. Davis collection. College Station (TX): Texas A&M University; [cited 2017 June 14]. Available from http://oaktrust.library.tamu.edu/handle/1969.1/129120

Van der Veken, S., Verheyen, K., & Hermy, M. 2004. Plant species loss in an urban area (Turnhout, Belgium) from 1880 to 1999 and its environmental determinants. Flora - Morphology, Distribution, Functional Ecology of Plants 199(6):516-523. DOI: 10.1078/0367-2530-00180

This work is licensed under a Creative Commons Attribution 4.0 International License.