Building a Collection of HathiTrust Items

While collections are rarely finished, I have finished creating and curating the collection of HathiTrust files. To cut to the chase, I collected and curated approximately 345,000 items.

The process required just about every aspect of librarianship:

  • Collections - I needed to articulate and implement a collection management policy. Of the 800,000 items available to me, I wanted only the items that were written in English, described as books, and deduplicated. Deduplication was the most difficult aspect of the problem. In the end, I reduced duplication from 20%-30% to about 2%; that is, about 2% of the items collected from the Trust are still duplicates. I identified 345,000 items to collect.
  • Acquisitions - Given the 345,000 identifiers, the acquisitions process locally cached the items from the Trust's computers. This was easy, but took about 24 hours to complete.
  • Cataloging - Given the 345,000 identifiers, the cataloging process harvested MARC records describing each item and modified them to meet my local cataloging practice. More specifically, pre-coordinated subject headings were converted into simpler FAST headings, two 856 fields were added denoting original/canonical locations and local/cached locations, and local notes were added denoting data format (etext) and collection (HathiTrust). The resulting records were then poured into an open source integrated library system called Koha.
  • Stacks maintenance - Given the 345,000 identifiers, the set of plain text files -- OCRed versions of the originals -- was saved on a local Web server. Thus, every item has a URL in the "stacks". See: https://distantreader.org/stacks/trust/
  • Public service - The Koha application supports a very simple cataloging interface and a more sophisticated index. The former is easier to use. The latter is more expressive and more fully featured. More importantly, search results point directly to found items. No landing pages. No splash pages. No authentication. Moreover, there are zero links to maintain. Most importantly, the index allows one to create, curate, and use data sets (I call them "study carrels") from search results. Do a search. Download the results. Curate the results to suit your particular research question. Create a data set. Analyse and read the result.
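
The selection policy described under "Collections" can be sketched in a few lines of Python. This is only an illustration, not the process actually used: the record layout (id, language, type, title, author, year), the sample identifiers, and the matching rule are all assumptions made up for the example. Real deduplication was the hardest part of the work, and a simple normalized title/author/year key like this one would miss many of the cases that made it hard.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def select_items(records):
    """Return ids of English-language books, skipping likely duplicates.

    The field names and the duplicate key are hypothetical.
    """
    seen = set()
    selected = []
    for record in records:
        # Policy: English only, books only.
        if record["language"] != "eng" or record["type"] != "book":
            continue
        # Naive duplicate key: normalized title + author + year.
        key = (normalize(record["title"]), normalize(record["author"]), record["year"])
        if key in seen:
            continue
        seen.add(key)
        selected.append(record["id"])
    return selected


# Made-up sample records, for illustration only.
records = [
    {"id": "item-1", "language": "eng", "type": "book",
     "title": "Walden; or, Life in the Woods", "author": "Thoreau, Henry David", "year": "1854"},
    {"id": "item-2", "language": "eng", "type": "book",
     "title": "Walden, or Life in the woods.", "author": "Thoreau, Henry David", "year": "1854"},
    {"id": "item-3", "language": "fre", "type": "book",
     "title": "Les Misérables", "author": "Hugo, Victor", "year": "1862"},
    {"id": "item-4", "language": "eng", "type": "serial",
     "title": "The Atlantic Monthly", "author": "", "year": "1857"},
]

print(select_items(records))
```

In this sketch the second Walden record is dropped because its normalized key matches the first, while the French book and the serial are excluded by the policy itself.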

What is the use case for this whole thing? What is the problem that I'm trying to address? The answer is simple. I'm addressing information overload. Using the index, the student, scholar, or researcher is able to:

  1. create large sets of relevant content, such as: the complete works of any given author, a comprehensive set of things published as broadsides, a set of dozens if not hundreds of scholarly articles on a given topic, et cetera
  2. create a study carrel of the results
  3. employ both computer technology and traditional reading techniques to use and understand the content of the carrel

Using these features, I am easily able to:

  • compare and contrast the works of Plato and Aristotle
  • list dozens of definitions of "social justice"
  • observe the ebb and flow of ideas across just about any book
  • observe the ebb and flow of ideas across a collection of books
  • reduce a set of thousands of articles on a given topic to a couple dozen most relevant items

Fun fact: It took me about one month to do this work. Thus, I did the whole of library processing at an average rate of 17,000 items/day or about 35 items/minute.
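
A quick check of the arithmetic, as a sketch: the rates line up if "one month" is counted as roughly 20 working days of eight hours each. Both of those numbers are assumptions made for illustration; only the item count comes from the figures above.

```python
def rates(items, working_days, hours_per_day):
    """Return (items per day, items per minute) for the given schedule."""
    per_day = items / working_days
    per_minute = per_day / (hours_per_day * 60)
    return per_day, per_minute


ITEMS = 345_000       # items collected and curated
WORKING_DAYS = 20     # assumption: one month of working days
HOURS_PER_DAY = 8     # assumption: eight-hour days

per_day, per_minute = rates(ITEMS, WORKING_DAYS, HOURS_PER_DAY)
print(f"{per_day:,.0f} items/day")        # 17,250 items/day
print(f"{per_minute:.1f} items/minute")   # 35.9 items/minute
```

Counted over 30 calendar days instead, the same total works out to 11,500 items/day.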

Another fun fact: The library catalog application (Koha) runs on a 2-core computer with 4 GB of RAM and 60 GB of disk space. This is about the size of your desktop computer, if not smaller. It costs me $25/month to keep the catalog up and running. The Distant Reader application -- the tool used to create study carrels -- is much bigger: 60 cores, 200 GB of RAM, and 5 TB of disk storage. The Center for Research Computing hosts the Reader application.

Final fun fact: The whole of the Reader's library holdings is now about 0.7 million items.


Creator: Eric Lease Morgan
Source: This message was first posted to the Code4Lib mailing list on October 16, 2024.
Date created: 2024-10-26
Date updated: 2024-10-26
Subject(s): library collections; HathiTrust;
URL: https://distantreader.org/blog/collection-of-trust-items/