On Large-Language Models

The heart of this blog posting lies in responses to a few comments made on a couple of Slack channels, and it alludes to my initial thoughts regarding large-language models in libraries.

LLMs as additional tools

I'm not saying LLMs and generative-AI are the greatest things since sliced bread, but I am saying they can be an additional tool in our toolbox. Creating our own models, fine-tuning existing models, and applying RAG are paths forward; it behooves us -- the library profession -- to actively explore how LLMs and generative-AI can be effectively used in Library Land.

We can't deny the usefulness of the venerable card catalog when compared to the acquisitions lists that preceded it. We cannot deny the added benefits of machine-readable cataloging when compared to the hand-written catalog cards of yore. We cannot deny the benefits of free text searching and relevance ranking of results when compared to structured queries with Boolean logic, field operations, and controlled vocabularies. Online access to bibliographic indexes was a boon when compared to the CD-ROMs that immediately preceded it. In each of these cases, the developments offered improvements, but not without some costs; the predecessors had benefits the improvements did not.

I believe we are experiencing the same thing when it comes to LLMs and generative-AI. They offer benefits at certain costs. Some of those costs are financial; access to GPUs is very expensive. Some of those costs can be measured in the skills required to create, modify, and maintain them; one needs a great deal of experience in modeling data to exploit LLM technology. Some of those costs are related to social justice -- bias. There are many ways to mitigate all of these costs; the pooling of resources, active investigations, and information literacy activities are some examples.

I do not think we must embrace LLMs or generative-AI. But I do think we ought to learn how to exploit them and how to communicate their benefits and drawbacks.

'Am a bit threatened

For the first time in my career, I feel a bit threatened by the technology. LLMs and generative AI go beyond returning lists of pointers (citations, call numbers, URLs, etc.); they return answers. Moreover, these answers take the form of narrative text or even structured data (JSON, MARC, CSV, SQL, etc.).

Our profession overflows with the introduction and adoption of new technologies. They range from those as benign as the use of the typewriter over hand-written catalog cards, to free text searching with relevancy-ranked results over Boolean fielded queries and controlled vocabularies, to the amalgamation of MARC and other metadata to implement discovery systems, to the prevalence of global, non-bibliographic indexes.

In my humble opinion, we -- the library profession -- ought to learn how to use and exploit LLMs and generative AI systems. There are three types of learning to do: 1) creating our own models (very expensive and technically difficult), 2) fine-tuning (augmenting existing models to include our own content, like our MARC records), and 3) retrieval-augmented generation (very similar to indexing local content and then searching the index).
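
To make the third approach more concrete, here is a minimal RAG sketch in Python. Everything in it is illustrative: the toy documents stand in for local content such as MARC records, retrieval is done with TF-IDF from scikit-learn, and the final call to a model is replaced by printing the augmented prompt.

    # A minimal retrieval-augmented generation (RAG) sketch; requires scikit-learn.
    # The documents and query are hypothetical stand-ins for local library content.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Melville, Herman. Moby-Dick; or, The Whale. New York: Harper, 1851.",
        "Austen, Jane. Pride and Prejudice. London: T. Egerton, 1813.",
        "Thoreau, Henry David. Walden; or, Life in the Woods. Boston: Ticknor and Fields, 1854.",
    ]
    query = "Who wrote a book about a whale?"

    # Step 1: index the local content as TF-IDF vectors
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(documents)

    # Step 2: search the index for the document most similar to the query
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    context = documents[scores.argmax()]

    # Step 3: augment the prompt with the retrieved context; a real system
    # would now send this prompt to an LLM instead of printing it
    prompt = (
        "Using only the following context, answer the question.\n\n"
        f"Context: {context}\n\nQuestion: {query}"
    )
    print(prompt)

In practice the index would be built from our bibliographic data and the prompt sent to a model, but the point stands: RAG is essentially indexing plus searching plus prompting, activities our profession already understands.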

Only after we actively and thoroughly investigate this technology will we be able -- and justified -- to articulate when and how it can be used. Such will be an additional form of information literacy.

On our mark, get set, go?


Creator: Eric Lease Morgan <[email protected]>
Source: This posting was originally articulated in the past month in response to a couple of messages on Slack.
Date created: 2024-04-30
Date updated: 2024-04-30
Subject(s): libraries and librarianship; large-language models (LLMs)
URL: https://distantreader.org/blog/on-large-language-models/