A Framework for Measuring Relevancy in Discovery Environments

ARTICLE

Blake L. Galbreath, Alex Merrill, and Corey M. Johnson

INFORMATION TECHNOLOGY AND LIBRARIES | JUNE 2021
https://doi.org/10.6017/ital.v40i2.12835

Blake L. Galbreath (blake.galbreath@wsu.edu) is Core Services Librarian, Washington State University. Alex Merrill (merrilla@wsu.edu) is Head of Library Systems and Technical Operations, Washington State University. Corey M. Johnson (coreyj@wsu.edu) is Instruction & Assessment Librarian, Washington State University. © 2021.

ABSTRACT

Discovery environments are ubiquitous in academic libraries, but studying their effectiveness and use in an academic environment has mostly centered on user satisfaction, experience, and task analysis. This study aims to create a quantitative, reproducible framework to test the relevancy of results and the overall success of Washington State University's discovery environment (Primo by Ex Libris). Within this framework, the authors use bibliographic citations from student research papers submitted as part of a required university class as the proxy for relevancy. In the context of this study, the researchers created a testing model that includes: (1) a process to produce machine-generated keywords from a corpus of research papers to compare against a set of human-created keywords, (2) a machine process to query a discovery environment to produce search result lists to compare against citation lists, and (3) four metrics to measure the comparative success of different search strategies and the relevancy of the results. This framework is used to move beyond a sentiment or task-based analysis to measure whether materials cited in student papers appear in the results list of a production discovery environment. While this initial test of the framework produced fewer matches between researcher-generated search results and student bibliography sources than expected, the authors note that faceted searches represent a greater success rate when compared to open-ended searches. Future work will include comparative (A/B) testing of commonly deployed discovery layer configurations and limiters to measure the impact of local decisions on discovery layer efficacy, as well as noting where in the results list a citation match occurs.

INTRODUCTION

Discovery environments are ubiquitous in academic libraries: all but two libraries in the Association of Research Libraries (ARL) report using a discovery environment, and they continue to gain traction in other library settings.1 The one-stop shopping model of discovery environments is one of their most alluring features, as it closely resembles searching the open web. This familiarity allows users who are accustomed to searching the web to feel comfortable searching the library catalog without fear of encountering a "failed" search (zero result set). Discovery environments seldom fail to return results, as even the most rudimentary or naïve search strategy will return something for a user. This idea of "returning something" has been anecdotally noted as a positive, as it ensures the user does not give up and allows novices to be successful with limited search sophistication or prior instruction from information professionals.

One of the potential negatives to this approach, however, is the sheer volume of material that is returned per search query. Library discovery environments often present thousands, if not millions, of search results from an initial search query. This emulation of Google essentially renders the time-honored study of relevancy (precision/recall) moot.
How can one determine the number of relevant documents in a search query if the number of documents returned is becoming limitless? This study aims to create a quantitative, reproducible framework to test the relevancy of results returned from, and the overall efficacy of, a library discovery environment, in this case, Ex Libris Primo. Within this framework, the authors compare the results returned in model Primo search queries against the bibliographic citations used in students' research papers.

BACKGROUND

The University Common Requirements (UCORE) curriculum, implemented in fall 2012, was a major redesign of the Washington State University (WSU) undergraduate general education program. UCORE comprises required categories of classes designed to build student proficiency in the Seven Undergraduate Learning Goals.2 Roots of Contemporary Issues (RCI) is the sole mandated undergraduate course under the UCORE system.3 During the 2018–2019 academic year, over 4,500 students were enrolled in RCI at WSU, the vast majority being first-year students.

This paper utilizes data from the RCI Library Research Project, a term-length research experience with four central assignments designed to familiarize students with the fundamentals of quality research and a cumulative research paper where they utilize the skills learned. The research project components are spaced evenly throughout the term; students are guided along the research process from general topic formation, to research question generation, to thesis statement defense in the final paper. Students are tasked with finding sources of particular resource types (e.g., journal articles), describing the value of these sources for their research, and citing them properly in Chicago Style.

WSU Libraries uses the discovery environment Primo, an Ex Libris product, to provide resources to its patrons.4 Specifically, WSU Libraries uses the New User Interface version of Primo, which incorporates search results from the Primo Central Index (PCI) in its default search. Primo, like all discovery environments, provides results with a wide variety of resource types, so RCI students can use it at all stages of the term research project. Students use it in the pursuit of contemporary newspaper articles, history monographs, history journal articles, and primary sources. In this article, the authors focus on the versatility of Primo, using RCI student paper bibliographies as the central data source for the project.

LITERATURE REVIEW

The need for assessment of library resources and services in higher education has been well-documented. Libraries are increasingly asked to provide tangible evidence that they aid student information literacy skill development and thus advance achievement of institutional learning outcomes.
Accrediting bodies acknowledge "the importance of information literacy skills, and most accreditation standards have strengthened their emphasis on the teaching roles of libraries."5 Oakleaf and Kaske also stress the importance of librarians choosing assessments that can contribute to university-wide assessment efforts, noting they are preferable to assessments that only benefit libraries.6 The Washington State University Libraries is committed to assessment of its resources and services, with Primo as a central target resource, and with large, lower-undergraduate courses as a primary area of focus.

There are numerous papers which document usability testing of Primo. Prommann and Zhang (2015) analyzed the efficiency of Primo through hierarchical task analysis (HTA). They counted the number of physical and cognitive steps necessary to get to records or full text of known items and concluded that Primo is "a flexible discovery layer as it helps achieve many goals with minimum amount [sic] of steps."7 Although many of these studies articulate avenues of success in terms of user interaction with the discovery environment, there are also reports of difficulties in a variety of categories. Students have problems with source retrieval, for example, understanding availability status terminology and labels, and using link resolvers and interlibrary loan.8 Dalal et al. (2015) demonstrated that retrieving the full text of an article in a discovery environment is sometimes unintuitive for students and involves navigating multiple interfaces.9 Users also have issues using facets to find particular resource types or distinguishing between them.10 While the study addressed in this paper does not directly address user difficulties with Primo functionality, issues with source retrieval point to a plausible explanation for the few matches between the model search results and student paper bibliographies. It is possible students saw many of the same sources from the model searches in their results but ultimately did not secure those sources because of the difficulties outlined above. In other words, some source selection choices are based mostly on availability, not as much on relevance.

Source relevancy is an active area of research for web-based discovery services, in terms of comparative studies to disciplinary subject databases. Evelhoch analyzed two years of usage data from both Primo and a selection of subject databases, concluding that users either have difficulty finding relevant sources in Primo or those sources are not available.11 Based on users' judgments, Lee and Chung determined that EBSCO Discovery Service was less effective than a set of education and library subject databases in terms of source relevance.12 Another study illustrated that while students preferred discovery environments, the articles they selected from the subject (indexing and abstracting) databases were more authoritative.13 Finally, librarians are posited to believe that subject databases are superior to discovery environments in terms of the relevancy of search results and disciplinary coverage.14 Conclusions about source relevancy are complicated by the fact that students infrequently look beyond the first page of results lists.15

Researchers have also explored the idea of Primo user satisfaction through the presence of relevant results.
In one instance, using online questionnaires and in-person focus groups, researchers found users had a high level of satisfaction with their institution's discovery environment, largely attributed to the quality of search results over ease of use.16 Hamlett and Georgas (2019) conducted a mixed-methods user experience study to understand student perceptions of relevancy in Primo. This study found that participants believed Primo to return relevant results (with an average score of 8.3 out of 10). However, some of the qualitative responses indicated that the keywords used did not actually yield relevant results.17

Many other methods and measures have been executed in determining the value and usefulness of Primo. Huurdeman, Aamodt, and Heggo analyzed a dataset of 50 popular queries in Primo. They deemed a query successful if the first 10 results included the (likely) targeted resource and found that 58% of the queries from the popular searches dataset had been successful, while 20% were unsuccessful, and 22% could not be determined. Their approach assumed there is one intended document per query and that the authors can surmise what it is.18

The research presented in the remainder of this article is unique in that the authors explore user judgment of source relevance (satisfaction) as a function of whether sources in the model Primo searches for their topics existed in the students' papers' bibliographies.

METHODS

Research Questions

The impetus for this study was to understand the factors that play a role in establishing a framework to test the relevancy of results returned from Primo. The authors attempted to answer the following questions:

• How effective is Primo at returning relevant results?
• To what extent does faceting improve search results?
• Which search strategies are the most effective within the given framework?
• How can the researchers refine the framework for future investigations into relevancy?
• What are the implications of this study for end users?

Data Collection

The authors began with a sample of 100 randomly selected and anonymized research papers that were submitted to the Roots of Contemporary Issues (RCI) courses in the fall 2018 and spring 2019 semesters. The study used a two-pronged approach to generate keywords for model Primo search queries. For one approach, keywords were machine-generated via a word-vector generation process. For the other, keywords were human-generated by a student research assistant to approximate natural language queries.

Keywords and Queries, Machine

A RapidMiner (https://rapidminer.com/) word-vector generation process with term-frequency schema converted the research papers into keywords, which the authors then used to generate search queries. Within the main routine's Process Documents from Files operator, RapidMiner transformed the texts into lower case and tokenized the final papers according to non-letters. RapidMiner then filtered the data by those tokens representing nouns and adjectives, removed English stop words, and filtered tokens by length, with a minimum of one character and a maximum of 50 characters. The researchers then applied a Snowball stemmer for English words and generated 20 n-grams per paper, each with a maximum length of four. Table 1 illustrates the product of the word-vector generation process.
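The RapidMiner routine described above can be approximated in general-purpose code. The following is a minimal, illustrative sketch rather than the authors' actual process: it assumes the Python nltk package for Snowball stemming, uses a small placeholder stop word list, and omits the part-of-speech filtering step.

```python
# Illustrative analogue of the RapidMiner word-vector process: lowercasing,
# tokenizing on non-letters, stop word removal, length filtering, Snowball
# stemming, and n-gram frequency counts. Part-of-speech filtering to nouns and
# adjectives is omitted for brevity. Assumes the nltk package is installed.
import re
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "for", "is", "was", "it"}
stemmer = SnowballStemmer("english")

def paper_ngrams(text: str, max_n: int = 4, top_k: int = 20) -> list[tuple[str, int]]:
    # Lowercase and tokenize on runs of non-letter characters.
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    # Remove stop words and enforce the 1-50 character length filter.
    tokens = [t for t in tokens if t not in STOP_WORDS and 1 <= len(t) <= 50]
    # Apply the Snowball stemmer (e.g., "people" -> "peopl").
    stems = [stemmer.stem(t) for t in tokens]
    # Count contiguous n-grams of length 1 through max_n, joined with "_".
    counts: Counter[str] = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(stems) - n + 1):
            counts["_".join(stems[i:i + n])] += 1
    return counts.most_common(top_k)

# Example: paper_ngrams(open("paper_001.txt").read()) might yield pairs such as
# ("trade", 40), ("slave", 34), ("slave_trade", 26), similar to table 1.
```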
Throughout this example research paper, "trade" occurred 40 times, "slave" occurred 34 times, "slave" and "trade" occurred together 26 times, "africa" occurred 18 times, "impact" occurred 16 times, "african" occurred 11 times, and "peopl" occurred 10 times.

Table 1. Example N-grams and frequency as retrieved from RapidMiner

    N-gram        Number of occurrences
    trade         40
    slave         34
    slave_trade   26
    africa        18
    impact        16
    african       11
    peopl         10
    ...           ...

Number of N-grams

After compiling the data in RapidMiner, the authors created a process to select those n-grams to use in the model Primo search queries. Huurdeman, Aamodt, and Heggo (2018) found that users included an average of 2.6 terms per query in their Popular Searches dataset.19 In a report by Ex Libris, Stohn indicates that most topic-search queries contain five or fewer words.20 In order to investigate both ends of this spectrum, this study constructed short-length queries, consisting of two n-grams, and full-length queries, consisting of four n-grams, using the following rubric to help systematize the construction.

Rubric to Select N-grams for Short- and Full-Length Queries

Pick terms that satisfy the following criteria:

1. N-grams that occur more frequently in a paper are preferred to those that occur less frequently.
2. If two n-grams appear to be structural derivatives of the same word (e.g., korea and korean), select the shortest n-gram and truncate it.
3. If one or more of the top terms appear in a later 2-gram, use the 2-gram as a phrase search.
4. Ignore n-grams with repeating terms (e.g., south_africa_africa).
5. Truncate all terms (using asterisk or question mark), except the first term of a phrase search, unless the first term is not a complete word (e.g., "busi* meeting*").
6. For terms or phrases that end in truncated "i", use the truncated version of the term and its truncated "y" counterpart, and combine both with an OR operator (e.g., countri* OR country*).
7. Ignore all 3- and 4-grams as they have a propensity to create nonsensical phrase searches (e.g., racism_polic_brutal).
8. If abbreviations are encountered, expand them for searching purposes (e.g., US is "united states"), except in cases where they are more commonly known by their abbreviation (e.g., ddt).
9. Ignore results of contractions (e.g., 't).

In case of a tie in the selection of an n-gram, sequence the following rules for selection:

1. Preference proper nouns over other nouns and adjectives. If there are multiple proper nouns, preference place-name proper nouns over other proper nouns.
2. Preference the n-gram that occurs in the greatest number of two or more n-grams later in the list.
3. Preference longer words over shorter words.
4. Group all the tied n-grams with a series of OR statements. Note: this may result in the selection of more than four total n-grams.
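The mechanical parts of this rubric could, in principle, be automated. The sketch below is hypothetical and covers only rules 2, 3, 5, and 7 of the selection criteria; the judgment-based rules (proper-noun preference, abbreviation expansion, the "i"/"y" variants) are left out, and all helper names are invented for illustration.

```python
# Hypothetical partial automation of the n-gram selection rubric: rank by
# frequency, merge structural derivatives (rule 2), prefer 2-gram phrases when a
# top term reappears in one (rule 3), skip 3-/4-grams (rule 7), and truncate
# terms (rule 5). Judgment-based rules are not modeled here.
def merge_derivatives(ngrams: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Collapse pairs like ("africa", 18) and ("african", 11) into the shorter form."""
    merged: dict[str, int] = {}
    for term, count in sorted(ngrams, key=lambda p: len(p[0])):
        base = next((k for k in merged if "_" not in k and "_" not in term
                     and (term.startswith(k) or k.startswith(term))), None)
        if base is not None:
            merged[base] += count
        else:
            merged[term] = count
    return sorted(merged.items(), key=lambda p: p[1], reverse=True)

def select_ngrams(ngrams: list[tuple[str, int]], length: int) -> list[str]:
    ranked = merge_derivatives(ngrams)
    chosen: list[str] = []
    used: set[str] = set()
    for term, _ in ranked:
        if len(chosen) >= length:
            break
        parts = term.split("_")
        if len(parts) > 2 or any(p in used for p in parts):
            continue  # rule 7, or a component is already covered by a chosen phrase
        if len(parts) == 1:
            # rule 3: if the term also appears in a later 2-gram, use the 2-gram
            phrase = next((t for t, _ in ranked
                           if t.count("_") == 1 and term in t.split("_")), None)
            if phrase is not None:
                term, parts = phrase, phrase.split("_")
        chosen.append(term)
        used.update(parts)
    return chosen

def to_query(selected: list[str]) -> str:
    """Format selected n-grams: truncate terms, quote 2-grams as phrase searches."""
    formatted = []
    for term in selected:
        if "_" in term:
            first, rest = term.split("_", 1)
            formatted.append(f'"{first} {rest}*"')
        else:
            formatted.append(f"{term}*")
    return " AND ".join(formatted)

ngrams = [("trade", 40), ("slave", 34), ("slave_trade", 26), ("africa", 18),
          ("impact", 16), ("african", 11), ("peopl", 10)]
print(to_query(select_ngrams(ngrams, 2)))  # "slave trade*" AND africa*
print(to_query(select_ngrams(ngrams, 4)))  # "slave trade*" AND africa* AND impact* AND peopl*
```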
Referring to the example n-grams from table 1, an illustration of this method is shown in the following steps:

1. Arrange terms from highest to lowest frequency.
2. Select slave_trade as the first n-gram, since "trade" and "slave" both occur in a later n-gram. Truncate to "slave trade*".
3. Select africa since it has the next greatest number of occurrences. Combine africa with african since they are structural derivatives of one another. Truncate to africa*.

At this point, the first two selected n-grams, slave_trade and africa, become the keywords of the short-length query "slave trade*" AND africa*.

4. Select impact since it has the next greatest number of occurrences. Truncate to impact*.
5. Select peopl since it has the next greatest number of occurrences. Truncate to peopl*.

Finally, the first four selected n-grams (slave_trade, africa, impact, and peopl) become the keywords of the full-length query "slave trade*" AND africa* AND impact* AND peopl*. On average, after stop words and Booleans were removed, the full-length queries in this study were 5.69 keywords long, while the short-length queries were 3.11 keywords long.

Keywords and Queries, Natural Language

In addition to the machine-oriented keyword process, the authors employed a student research assistant to create human-generated phrases, consisting of 3–10 words, which served as synopses for each of the 100 papers. This study then used these phrases as proxies for creating natural language search queries. For the same example research paper cited in table 1 above, this student created the summary phrase history and effects of the slave trade. This phrase in its entirety became the natural language query. On average, after stop words and Booleans were removed, the natural language queries used in this study were 3.95 keywords long.

Search Results

Using the three keyword-generation strategies outlined above, the authors constructed search queries and ran them against the Ex Libris Primo Search API endpoint. Table 2 summarizes example result sets from the above short-length query, full-length query, and natural language query.

For each of the keyword-generation strategies, the authors constructed search queries along four parameters: queries that used no faceting (open-ended), queries that faceted to articles only (articles), queries that faceted to books and ebooks only (books), and queries that faceted to newspaper articles only (newspapers). In all, there were 12 search-query constructions (three query types by four faceting modes) for fall 2018 and 12 for spring 2019.

To construct a baseline for the search comparisons, the researchers designed the initial search to be open-ended. That is, the study assumed that patrons most often use the default, basic search functionality, with no facets selected. A segment of the RCI instruction specifically encourages students to incorporate materials with resource types articles, books, and newspaper articles into their research papers. The authors therefore assumed that these students would most likely utilize facets corresponding to these resource types in their more specific queries and mirrored this behavior in the comparative searches. Each Primo Search API call returned titles for the top 50 results, moving beyond users' usual search behavior in an effort to provide more flexibility to the initial steps of the relevancy framework.
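A query of this kind can be issued in a few lines of code. The sketch below is illustrative only, not the authors' production script: it assumes the Python requests package and the publicly documented hosted Primo REST search API, and the endpoint host, view id (vid), tab, scope, facet values, and API key are all placeholders.

```python
# Illustrative sketch of retrieving the top 50 result titles from the hosted
# Primo REST search API. Endpoint, vid, tab, scope, facet values, and apikey
# are placeholders, not the study's actual configuration.
import requests

API_BASE = "https://api-na.hosted.exlibrisgroup.com/primo/v1/search"  # placeholder region host

def primo_titles(query: str, facet_rtype: str | None = None, limit: int = 50) -> list[str]:
    params = {
        "vid": "WSU_VID",          # placeholder view id
        "tab": "default_tab",      # placeholder tab
        "scope": "default_scope",  # placeholder search scope
        "q": f"any,contains,{query}",
        "limit": limit,
        "apikey": "YOUR_API_KEY",  # placeholder key
    }
    if facet_rtype:  # e.g., a resource-type facet such as "books" or "articles"
        params["qInclude"] = f"facet_rtype,exact,{facet_rtype}"
    response = requests.get(API_BASE, params=params, timeout=30)
    response.raise_for_status()
    docs = response.json().get("docs", [])
    # Pull each record's display title for the later title-matching step.
    return [d.get("pnx", {}).get("display", {}).get("title", [""])[0] for d in docs]

# Example: an open-ended short-length query, then the books-only facet mode.
# titles = primo_titles('"slave trade*" AND africa*')
# book_titles = primo_titles('"slave trade*" AND africa*', facet_rtype="books")
```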
Table 2. First-occurring result titles for query types: Short-length, full-length, and natural language queries

Short-length query: "slave trade*" AND africa*
First-occurring result titles:
- The Atlantic slave trade
- The Atlantic slave trade : a census
- The Atlantic slave trade
- Legacy of the trans-Atlantic slave trade : hearing before the Subcommittee on the Constitution, Civil Rights, and Civil Liberties of the Committee on the Judiciary, House of Representatives, One Hundred Tenth Congress, first session, December 18, 2007.
- ...

Full-length query: "slave trade*" AND africa* AND impact* AND peopl*
First-occurring result titles:
- The Atlantic slave trade
- The Atlantic slave trade : effects on economies, societies, and peoples in Africa, the Americas, and Europe
- Slave trades, 1500–1800 : globalization of forced labour
- African voices of the Atlantic slave trade : beyond the silence and the shame
- ...

Natural language query: history and effects of the slave trade
First-occurring result titles:
- Urban History, the Slave Trade, and the Atlantic World 1500–1900
- The Atlantic slave trade and British abolition, 1760–1810
- The Decolonization of African Education and History
- The United States and the transatlantic slave trade to the Americas, 1776–1867
- ...

A student research assistant harvested all the citations used across the 100 example papers to create an inventory of 730 bibliographic citations. Using the Excel Fuzzy Lookup Add-In, the authors then compared this bibliographic inventory against the 60,000 titles that were returned via the Primo Search API. This add-in fuzzy matches rows between two different tables and assigns a similarity score for each match. The study focused attention on rows with matching scores of .80 and above to further investigate potential matches. Using the fuzzy matches as a starting point, the authors confirmed or denied matches by hand, using title and resource type as the main criteria.

Table 3. Sample comparison of citations used in research papers against results returned from Primo search API

Fuzzy score: 1.0000
  Citation: A Short History of Biological Warfare (Article)
  Result: A short history of biological warfare (Article)
  Confirmed match: Yes

Fuzzy score: 0.9933
  Citation: THE FEMALE MADLADY Women, Madness, and English Culture, 1830–1980 (Print book)
  Result: The female malady : women, madness, and English culture, 1830–1980 (Print book)
  Confirmed match: Yes

Fuzzy score: 0.9778
  Citation: Industrial Revolution (Web resource)
  Result: The industrial revolution (e-book)
  Confirmed match: No

Fuzzy score: 0.9037
  Citation: Drug Use & Abuse (Print book)
  Result: Drug use and abuse : a comprehensive introduction (Print book)
  Confirmed match: No
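The fuzzy matching in this study was performed with the Excel Fuzzy Lookup Add-In; the comparison-and-threshold logic can be sketched with Python's standard-library difflib, shown below purely as an illustration. The difflib ratio is a stand-in similarity score, so the 0.80 cutoff here is analogous to, not identical with, the add-in's scoring.

```python
# Illustration of the fuzzy-match-and-threshold step using difflib's ratio as a
# stand-in similarity score. Candidates at or above the cutoff would then be
# confirmed or denied by hand using title and resource type.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between a citation title and a result title."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(citations: list[str], results: list[str], cutoff: float = 0.80):
    """Yield (citation, best-matching result, score) pairs at or above the cutoff."""
    for citation in citations:
        best = max(results, key=lambda title: similarity(citation, title))
        score = similarity(citation, best)
        if score >= cutoff:
            yield citation, best, score

citations = ["A Short History of Biological Warfare", "Drug Use & Abuse"]
results = ["A short history of biological warfare",
           "Drug use and abuse : a comprehensive introduction"]
for citation, result, score in candidate_matches(citations, results):
    print(f"{score:.4f}  {citation}  ->  {result}")
```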
RESULTS

Source Citation Data Description

This study compared citations gathered from a random sample of 100 research papers from the two semesters of all sections of History 105/305 taught at Washington State University (WSU) from fall 2018 to spring 2019. Table 4 below gives a descriptive breakdown of the citations by resource type. The student research assistant identified and categorized the source citation list.

Table 4. Total source citations

    Resource type                  Fall 2018 (% of total)   Spring 2019 (% of total)
    Book chapter                   7 (1.94%)                4 (1.08%)
    Books (e-books/print)          107 (29.72%)             96 (25.95%)
    Newspaper article              63 (17.50%)              60 (16.22%)
    Journal article                84 (23.33%)              99 (26.76%)
    Reference entry                6 (1.67%)                6 (1.62%)
    Other/Cannot determine         10 (2.78%)               15 (4.05%)
    Web document                   81 (22.50%)              90 (24.32%)
    Magazine article               1 (.28%)                 N/A
    Newspaper/Magazine article     1 (.28%)                 N/A
    Semester citation count        360 (100%)               370 (100%)
    Total citation count           730

Target Citations List Data

The citations collected from the papers were then compared against 60,000 citations retrieved from the WSU Primo Search API endpoint on July 24, 2020, as described previously in the methods section. To better account for the differing numbers of citations among resource types in the source data and to normalize reporting across query types and semesters, most results are presented as a percentage and referred to as the matching success rate. For example, the natural language query had six matches out of a possible 360 citations in the open-ended search for citations from the fall of 2018. The matching success rate of the open-ended search in the fall of 2018 is therefore calculated at 1.67% (see table 5). Table 6 below shows the percentage results for short queries, and table 7 for full queries. For information about the raw source numbers and target data, please see the Open Science Framework project site.21

When all query types and faceting modes are considered, the matching success rate almost uniformly increased from fall 2018 to spring 2019. The largest difference in matching success rate was observed in the full-query articles only search at 8.91%, as shown in table 7. The open-ended search exhibited the smallest positive movement and the one anomaly of a diminishing success rate. Across the natural language and full-query types, the open-ended search showed the least positive difference in success rate, at 1.04% and 0.26% respectively, and the short-query open-ended search had a small negative change in success rate at −0.36%.

Table 5. Natural language query results success rate

                         Fall 2018   Spring 2019   % Difference
    Open-ended search    1.67%       2.70%         1.04%
    Articles only        4.76%       9.09%         4.33%
    Books only           3.74%       11.46%        7.72%
    Newspapers only      0.00%       1.67%         1.67%

Table 6. Short-query results success rate

                         Fall 2018   Spring 2019   % Difference
    Open-ended search    3.33%       2.97%         −0.36%
    Articles only        3.57%       5.05%         1.48%
    Books only           9.35%       10.42%        1.07%
    Newspapers only      0.00%       3.33%         3.33%

Table 7. Full-query results success rate

                         Fall 2018   Spring 2019   % Difference
    Open-ended search    0.56%       0.81%         0.26%
    Articles only        1.19%       10.10%        8.91%
    Books only           0.93%       5.21%         4.27%
    Newspapers only      0.00%       5.00%         5.00%

Total Unique Matches

Across all three search strategies and their four iterations, the researchers also note a raw count of matches, which helps to determine how an overall search strategy performs at finding matching citations. This metric counts each matching citation once across all four iterations of a search strategy. That is, even if a source citation appears in both the open-ended search and the books only search, that source citation is only counted once for the purpose of this metric.
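In other words, the total unique matches metric is the union of matched citations across the four faceting modes of a strategy. A minimal sketch of that bookkeeping is shown below; the citation identifiers are hypothetical, but the counts mirror the fall 2018 natural language example described next.

```python
# Minimal sketch of the "total unique matches" bookkeeping: a citation matched in
# several faceting modes of the same strategy is counted only once. Citation
# identifiers are hypothetical; counts mirror the fall 2018 natural language case.
matches_by_mode = {
    "open_ended": {"c101", "c102", "c103", "c104", "c105", "c106"},  # 6 matches
    "articles":   {"c101", "c102", "c103", "c104"},  # all redundant with open-ended
    "books":      {"c105", "c106", "c107", "c108"},  # c107 and c108 are new matches
    "newspapers": set(),
}

total_unique = set().union(*matches_by_mode.values())
success_rate = len(total_unique) / 360  # fall 2018 had 360 source citations

# Matches added by faceting: unique matches absent from the open-ended search.
added_by_faceting = total_unique - matches_by_mode["open_ended"]

print(len(total_unique), f"{success_rate:.2%}", len(added_by_faceting))  # 8 2.22% 2
```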
For example, in the natural language query in fall 2018, six citations were matched in the open-ended search. Four of the citations were articles and two were books. Some of the matches in the articles and books searches were redundant to the open-ended search. Considering only unique matches in the articles, books, and newspaper searches, the authors calculated the total number of unique matches. When the target searches were compared, the researchers matched two additional citations in the books only citations list. Adding these two additional matches yields a total of eight unique citation matches across all iterations of the natural language search (open-ended search, books only, articles only, newspapers only). The total unique matches and the corresponding success rate of the total unique matches for each search strategy are shown in table 8.

Table 8. Total unique matches

                              Fall 2018    Spring 2019   % Difference
    Natural language query    8 (2.22%)    22 (5.95%)    3.72%
    Short query               14 (3.89%)   16 (4.32%)    0.44%
    Full query                3 (0.83%)    18 (4.86%)    4.03%

Matches Added by Faceting

Another metric used to measure the overall effectiveness of faceted searching is the percentage of matching citations that are new to the results list when limited to a certain resource type: matches added by faceting. That is, it counts matching citations that were not present in the open-ended search results but are then matched when the results list is reduced to only a single resource type. In table 9, the percentage of matches that are new and only to be found in a targeted search result varies greatly. Between both semesters and among all search iterations, the smallest percentage of matches added by faceting is 14.29% and the largest is 83.33%.

Table 9. Matches added by faceting

                              Fall 2018    Spring 2019   % Difference
    Natural language query    2 (25.00%)   12 (54.55%)   29.55%
    Short query               2 (14.29%)   5 (31.25%)    16.96%
    Full query                1 (33.33%)   15 (83.33%)   50.00%

Comparing Search Strategies

The matching success rate across search strategies (natural, short, full) and iterations is a mixed result and does not allow for very useful comparison beyond the descriptions of difference outlined in the comparison tables (tables 5–7). To better compare the search strategies as a whole, as opposed to how a particular iterative search performed relative to another open or targeted search, the researchers used a weighted success rate of the total unique matches from both semesters as the proxy for overall performance and the point of comparison among the three search strategies. The comparison of this weighted success rate shows no difference in overall success rate between the natural language query (4.11%) and the short query (4.11%). The search strategy that was demonstrably different in weighted success rate is the full query at a lagging 2.88%. See table 10 for comparison and calculation details.

Table 10. Weighted success rate of total unique matches

    Natural language query    ((2.22% * 360) + (5.95% * 370)) / 730    4.11% (0.04109589)
    Short query               ((3.89% * 360) + (4.32% * 370)) / 730    4.11% (0.04109589)
    Full query                ((0.83% * 360) + (4.86% * 370)) / 730    2.88% (0.02876712)
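Because each semester's percentage multiplied by its citation count recovers the raw match count, the weighted success rate amounts to pooling both semesters' unique matches over the combined 730 citations. The table 10 arithmetic can be reproduced as follows.

```python
# Reproducing the table 10 arithmetic: each strategy's unique matches from both
# semesters are pooled over the 730 total source citations.
FALL_CITATIONS, SPRING_CITATIONS = 360, 370
unique_matches = {          # (fall 2018, spring 2019) counts from table 8
    "natural language": (8, 22),
    "short": (14, 16),
    "full": (3, 18),
}

for strategy, (fall, spring) in unique_matches.items():
    weighted = (fall + spring) / (FALL_CITATIONS + SPRING_CITATIONS)
    print(f"{strategy}: {weighted:.2%}")
# natural language: 4.11%, short: 4.11%, full: 2.88%
```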
DISCUSSION

How Effective is Primo at Returning Relevant Results?

According to the preliminary findings, Primo is relatively ineffective at providing search results that match the citations used by the student researchers. The matching success rates of the open-ended searches range from 0.56% to 3.33%. The possible reasons for these low numbers are numerous and varied: everything from students perhaps intending to use sources from the researchers' auto-generated results lists but being unable to locate the full text, to the prevalence of finding open internet sources outside the discovery layer, to open-ended searches being flooded with rarely cited reference materials and very contemporary newspaper articles (see more about these ideas below). Future research aims to understand more clearly which potential factors are present and to what degree they impact the matching success rates.

To What Extent Does Faceting Improve Search Results?

Faceting within Primo leads to better results, although the matching success rates remain low overall. The faceted searches contain the only matching success rates above ten percent: 10.10% (full query, articles only), 10.42% (short query, books only), and 11.46% (natural language query, books only). The data shows that the majority of unique matches found by the 2019 full-length and natural language search strategies occurs within the faceted searches (83.33% and 54.55%, respectively). It is interesting to note that these represent the two longer query strings, on average. Future testing will reveal whether there is a relationship between query length and the percentage of matches added by faceting.

Which Search Strategies Are the Most Effective within the Given Framework?

Looking at the search strategies holistically, the researchers note that the total unique matches increased from fall 2018 to spring 2019 across all three query types. This increase was expected behavior, partially due to the fact that Primo relevancy ranking algorithms assume that patrons prefer newer materials.22 The weighted success rate is an attempt to understand each search strategy's performance over the 2018–2019 academic year, as opposed to comparing one semester to the other. By this metric, the consistent performance of the short-length query is as effective as the more dynamic performance of the natural language query. The researchers look forward to adding more data to this metric to understand in which direction the average might move.

How To Refine the Framework for Future Investigations into Relevancy

The most popular resource types used in the source citations were books, journal articles, web documents, and newspaper articles. Together, these categories comprised approximately 93% of all resource types in both fall 2018 and spring 2019. However, not all areas were equally accessible within Washington State University's discovery layer configuration. The heavy reliance on web documents in the source citations was somewhat problematic, given the fact that web documents did not constitute a faceted resource type in WSU Libraries' Primo prior to this study. Therefore, the authors will need to better account for web documents in future testing. The assessment of newspaper articles also proved to be problematic, given their proclivity to inundate Primo search results with numerous and recent documents. The sheer number of newspaper articles published and indexed every year in Primo for general and introductory topics can greatly dilute the pool of possible target citations.
For example, a scan of the matching newspaper articles reveals that 67% (4/6) were published in 2018. In future studies, the researchers will limit publication dates for target citations to the appropriate time period (e.g., an upper limit of May 2019 would be placed on publication dates for papers written in spring 2019) or collect data closer to the submission of research papers. In 11 out of 12 cases, matching success rates were better in spring 2019 than fall 2018, most likely due to recency. It is common for discovery environments, and true for the environment used in this study, to present content sorted by relevance and then publication date. Therefore, the researchers expected to and did find an increased matching success rate closer to the date of testing, with the one exception of the short-length, open-ended search query.

This anomaly led the researchers to dig more deeply into the target citations to see if a cause could be determined. The researchers found a larger than expected number of citations for resource types that are underrepresented in the source citations. For example, the reference entry resource type surfaced prominently in the open-ended search for several of the queries, diluting the pool of target citations with entries that had little chance of appearing in the source citation lists. In one standout example, there were four separate reference entries titled simply "Taiping Rebellion." The discovery environment gave preference to these four separate reference entries over other, more substantive works that are more likely to be cited in an academic paper. The researchers surmise this is partly a function of the relevancy ranking algorithm that gives greater weight to matches in the title, author, and subject fields.23 Depending on the search and the configuration of the discovery environment, it is possible that reference entries would push other results from the books, articles, and newspaper resource types farther down the results list, making them less and less visible in an open-ended search for a given topic. This dilution of the target citations with resource types that are not emphasized or widely used in source citations is another area the researchers aim to isolate and examine in further rounds of testing.

In addition to the source recency and particular source type issues explained above, the authors did not take into account source availability, nor where sources were found by students, which remains a confounding factor on matching success rate. Subsequent studies will capture whether sources are present in the local deployment of Primo during the time frame the students were conducting research. This issue will be further addressed and mitigated by analyzing URLs provided within student source citations.

Implications of This Study for End Users

The matching success rate in the open-ended search, when compared to the type-limited searches, leads to a discussion of how to define and present the default search of the discovery environment to best serve an academic population. More pointedly, it opens the discussion of what resource types to include within that default search to return the most relevant and useful results and not just the most results.
In this case, the argument could be made that excluding several resource types (e.g., reference entries) would surface resources that are more likely to be cited in a researcher's scholarship. Based on the number of matches that were introduced by performing a faceted search, it is evident that researchers still need to utilize a search strategy which includes using search filters and limiters (prior to or following the initial search) and other search tactics in a discovery environment to return relevant results. The notion that an open-ended "one and done" search, for even the most introductory of topics, will be successful in retrieving many usable and citable resources in the first page or two of results is not supported by the results of this study.

CONCLUSIONS AND NEXT STEPS

As the common adage goes, "it's not what you say, it's what you do." In this study, the saying applies as the researchers move beyond what sources students think are relevant to the sources students ultimately use in their papers. The current slate of discovery environment research projects focuses largely on users' affective connections to discovery environments, often compared to other kinds of academic databases, and places users in temporary, hypothetical research scenarios in order to judge source relevance.24 In juxtaposition, the RCI research project is a term-length (10–14 weeks) venture; students have a significant amount of time and the aid of a scaffolded set of assignments to bolster their source relevance assessment skills and authority. Methodologies which closely mirror the authentic experiences and curriculum of the students are those which arguably will provide a more accurate picture of the value of the discovery environment in an academic setting.

The authors of this study took the first steps in building a relevancy rating system for discovery environments. To standardize their preliminary results, they generated four metrics: matching success rate, total unique matches, matches added by faceted search, and weighted success rate. While the results of this study do not allow the researchers to draw statistical conclusions regarding the dominance of one search strategy over another in returning relevant results, the frequencies showed a better match (success) rate with faceted than non-faceted searching. Discovery environments are commonly advertised as providing an easy-to-use, one-stop location for academic research needs, but the reality is more complex. Students need to engage these systems with multiple search refinements to find valuable materials.

This investigation was also the initial attempt to create a machine-generated framework to test the relevancy of a web-based discovery environment's results. As the authors look to build upon this preliminary study, there are several avenues to pursue that will enhance the methodology of the framework. One avenue is a refinement of the boundaries of the testing framework. This boundary refinement includes a re-examination of the criteria for inclusion in both the source citations and the search results list. In the current study, all student citations were deemed viable regardless of whether the source citation was able to be verified and accessed. This led to the inclusion of citations of lecture notes and other such materials that are not generally expected to appear in a discovery environment.
The authors will also re-examine the inclusion of newspapers and reference works in open-ended searching. These two resource types are large in number, are not indexed very well, and often do not have descriptive titles. A portion of the next round of research will be dedicated to comparative testing (A/B) of generally deployed discovery environment configurations. Another avenue of exploration is determining where in the results list a citation appears, not just the binary positive or negative, and measuring any impact based on behavior of the search (i.e., search construction) or behavior and configuration of the discovery environment. Refining the methodology of the current framework will result in fewer potentially confounding factors and allow librarians to regain an understanding of relevancy when it comes to teaching discovery layers to student researchers. These next steps will contribute to the overall picture concerning the value and efficacy of web-based discovery environments that is steadily taking shape.

ENDNOTES

1 Marshall Breeding, "Library Technology Guides: Academic Members of the Association of Research Libraries: Index-Based Discovery Services," Library Technology Guides, https://librarytechnology.org/libraries/arl/discovery.pl.

2 "Student Learning Goals," Washington State University Common Requirements, 2018, https://ucore.wsu.edu/about/learning-goals.

3 "Welcome to the Roots of Contemporary Issues," Washington State University Department of History, 2017, https://ucore.wsu.edu/faculty/curriculum/root/.

4 "Search It," Washington State University Libraries, 2020, https://searchit.libraries.wsu.edu/.

5 Megan Oakleaf and Neal Kaske, "Guiding Questions for Assessing Information Literacy in Higher Education," portal: Libraries and the Academy 9, no. 2 (2009): 277, https://doi.org/10.1353/pla.0.0046.

6 Oakleaf and Kaske, "Guiding Questions."

7 Marlen Prommann and Tao Zhang, "Applying Hierarchical Task Analysis Method to Discovery Layer Evaluation," Information Technology and Libraries 34, no. 1 (2015): 97, https://doi.org/10.6017/ital.v34i1.5600.

8 Rice Majors, "Comparative User Experiences of Next-Generation Catalogue Interfaces," Library Trends 61, no. 1 (2012): 186–207, https://doi.org/10.1353/lib.2012.0029; David Comeaux, "Usability Testing of a Web-Scale Discovery System at an Academic Library," College & Undergraduate Libraries 19, no. 2–4 (2012): 199, https://doi.org/10.1080/10691316.2012.695671; Greta Kliewer et al., "Using Primo for Undergraduate Research: A Usability Study," Library Hi Tech 34, no. 4 (2016): 566–84, http://doi.org/10.1108/LHT-05-2016-0052; Blake Galbreath, Corey M. Johnson, and Erin Hvizdak, "Primo New User Interface," Information Technology and Libraries 37, no. 2 (2018): 10–33, https://doi.org/10.6017/ital.v37i2.10191.

9 Heather Dalal, Amy Kimura, and Melissa Hofmann, "Searching in the Wild: Observing Information-Seeking Behavior in a Discovery Tool" (Association of College & Research Libraries 2015 Conference Proceedings, March 25–28, 2015): 668–75, http://www.ala.org/acrl/sites/ala.org.acrl/files/content/conferences/confsandpreconfs/2015/Dalal_Kimura_Hofmann.pdf.

10 Comeaux, "Usability Testing";
Xi Niu, Tao Zhang, and Hsin-liang Chen, "Study of User Search Activities with Two Discovery Tools at an Academic Library," International Journal of Human-Computer Interaction 30, no. 5 (2014): 422–33, https://doi.org/10.1080/10447318.2013.873281; Kevin Patrick Seeber, "Teaching 'Format as a Process' in an Era of Web-Scale Discovery," Reference Services Review 43, no. 1 (2015): 19–30, https://doi.org/10.1108/RSR-07-2014-0023; Kylie Jarret, "Findit@Flinders: User Experiences of the Primo Discovery Search Solution," Australian Academic & Research Libraries 43, no. 4 (2012): 278–99, https://doi.org/10.1080/00048623.2012.10722288; Aaron Nichols et al., "Kicking the Tires: A Usability Study of the Primo Discovery Tool," Journal of Web Librarianship 8, no. 2 (2014): 172–95, https://doi.org/10.1080/19322909.2014.903133; Kelsey Renee Brett, Ashley Lierman, and Cherie Turner, "Lessons Learned: A Primo Usability Study," Information Technology and Libraries 35, no. 1 (2016): 7–25, https://doi.org/10.6017/ital.v35i1.8965; Galbreath, Johnson, and Hvizdak, "Primo New User Interface."

11 Zebulin Evelhoch, "Where Users Find the Answer: Discovery Layers Versus Database," Journal of Electronic Resources Librarianship 30, no. 4 (2018): 205–15, https://doi.org/10.1080/1941126X.2018.1521092.

12 Boram Lee and EunKyung Chung, "An Analysis of Web-scale Discovery Services from the Perspective of User's Relevance Judgement," Journal of Academic Librarianship 42 (2016): 529–34, https://doi.org/10.1016/j.acalib.2016.06.016.

13 Sarah P. C. Dahlen and Kathlene Hanson, "Preference vs. Authority: A Comparison of Student Searching in a Subject-Specific Indexing and Abstracting Database and a Customized Discovery Layer," College & Research Libraries 78, no. 7 (2017): 878–97, https://doi.org/10.5860/crl.78.7.878.

14 Stefanie Buck and Christina Steffy, "Promising Practices in Instruction of Discovery Tools," Communications in Information Literacy 7, no. 1 (2013): 66–80, https://doi.org/10.15760/comminfolit.2013.7.1.135; Anita K. Foster, "Determining Librarian Research Preferences: A Comparison Survey of Web-Scale Discovery Systems and Subject Databases," Journal of Academic Librarianship 44 (2018): 330–36, https://doi.org/10.1016/j.acalib.2018.04.001.

15 Diane Cmor and Xin Li, "Beyond Boolean, Towards Thinking: Discovery Systems and Information Literacy," 2012 IATUL Proceedings, paper 7, https://docs.lib.purdue.edu/iatul/2012/papers/7/; Kliewer et al., "Using Primo"; Alexandra Hamlett and Helen Georgas, "In the Wake of Discovery: Student Perceptions, Integration, and Instructional Design," Journal of Web Librarianship 13, no. 3 (2019): 230–45, https://doi.org/10.1080/19322909.2019.1598919.
16 Courtney Lundrigan, Kevin Manuel, and May Yan, "'Pretty Rad': Explorations in User Satisfaction with a Discovery Layer at Ryerson University," College & Research Libraries 76, no. 1 (2015): 43–62, https://doi.org/10.5860/crl.76.1.43.

17 Hamlett and Georgas, "In the Wake of Discovery."

18 Hugo C. Huurdeman, Mikaela Aamodt, and Dan Michael Heggo, "'More Than Meets the Eye'—Analyzing the Success of User Queries in Oria," Nordic Journal of Information Literacy in Higher Education 10, no. 1 (2018): 18–36, https://doi.org/10.15845/noril.v10i1.270.

19 Huurdeman, Aamodt, and Heggo, "More Than Meets the Eye."

20 Christina Stohn, "How Do Users Search and Discover?: Findings from Ex Libris User Research," Ex Libris, 2015, https://www.exlibrisgroup.com/blog/ex-libris-user-studies-how-do-users-search-and-discover/.

21 Alex Merrill and Blake L. Galbreath, "A Framework for Measuring Relevancy in Discovery Environments," 2020, https://osf.io/ve3kp/.

22 "Primo Search Discovery: Search, Ranking, and Beyond," Ex Libris, 2015, https://www.exlibrisgroup.com/products/primo-discovery-service/relevance-ranking/.

23 "Primo Search Discovery," 3.

24 Lee and Chung, "An Analysis of Web-Scale Discovery Services"; Dahlen and Hanson, "Preference vs. Authority"; Lundrigan, Manuel, and Yan, "Pretty Rad"; Hamlett and Georgas, "In the Wake of Discovery."