RAG (retrieval-augmented generation)

RAG (retrieval-augmented generation) -- as one way to implement generative AI -- is something easy for us in libraries to get our heads around because the process is very much like the implementation of our discovery systems:

  1. create content
  2. index content
  3. query content
  4. return response

While such is the basic RAG recipe, here I will outline a way I have implemented it. You can cut to the chase by perusing a temporary README file as well as an advanced chat session about cataloging.

First create a set of content to be indexed. In this case I simply created a directory filled with plain text files -- chapters from Jane Austen's Emma.

Next, index the content using OpenAI's API. (See index.py) The script reads each file in the configured directory and sends it off to OpenAI. OpenAI indexes ("vectorizes") the content, returns the index, and it is cached for future use.
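The indexing step can be sketched in a few lines. This is not index.py itself, only a minimal illustration under stated assumptions: the `vectorize` function is a toy bag-of-words stand-in for the call to OpenAI, and the function names and the JSON cache format are hypothetical.

```python
import json
import math
import re
from collections import Counter
from pathlib import Path


def vectorize(text):
    """Toy stand-in for a remote embedding call: a unit-length bag of words."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    norm = math.sqrt(sum(n * n for n in counts.values())) or 1.0
    return {word: n / norm for word, n in counts.items()}


def build_index(directory, cache):
    """Vectorize each .txt file in a directory; reuse the JSON cache if present."""
    cache = Path(cache)
    if cache.exists():
        # The index was built earlier, so skip the (normally costly) vectorizing.
        return json.loads(cache.read_text(encoding="utf-8"))
    index = {}
    for path in sorted(Path(directory).glob("*.txt")):
        index[path.name] = vectorize(path.read_text(encoding="utf-8"))
    cache.write_text(json.dumps(index), encoding="utf-8")
    return index
```

Calling something like `build_index("./emma", "./index.json")` (both paths are made up here) would vectorize every chapter once and read the cache on later runs. In a real implementation, `vectorize` would be replaced by a request to the embedding service.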

Third, you can query the index. (See chat.py) One can enter just about anything desired, and the results ought to be rooted in the indexed content -- nothing else. Here is a tiny chat:

  $ ./chat.py 
  
  Ask anything: Who is Emma?
  
  Emma is a young woman who is described as handsome, clever, and
  rich. She has lived a comfortable life with very little distress
  or vexation. Emma is the youngest daughter in her family, with an
  affectionate and indulgent father. She has been the mistress of
  her father's house since her sister's marriage. Emma had a
  governess who was like a mother to her, and she had a close
  relationship with her. Emma is portrayed as having the power to
  have her own way and a tendency to think highly of herself,
  though these traits are not seen as significant issues by her at
  the moment.
  
  
  Ask anything: In a sentence, what does Emma like to do?
  
  Emma likes to spend time with Frank Churchill, assist him with
  tasks, and engage in conversations with him.
  
  
  Ask anything: How old is Frank Chirchill?
  
  Frank Churchill is around three or four-and-twenty years old.
  
  
  Ask anything: Where does the story take place?
  
  The story takes place in a room where Mrs. Martin and two girls
  are present, engaging in conversation and reminiscing about past events.
  
  
  Ask anything: What is a synopisis of the story?
  
  The story follows Emma as she navigates social interactions and
  relationships in her community. She becomes involved in a
  situation where her friend Harriet is misled by Frank Churchill's
  behavior. Emma reflects on her role in encouraging Harriet's
  feelings and realizes she should have intervened to prevent any
  misunderstandings. The narrative also delves into Mr. Knightley's
  suspicions regarding Frank Churchill's intentions towards Emma
  and Jane Fairfax. Themes of friendship, social dynamics, and
  self-awareness are explored among the characters.
  
  Ask anything: ^D
  Okay, bye bye, and thank you.

The results are not perfect, but what indexing/searching process is?
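The query step behind a chat like the one above can be sketched as two small functions: find the passages most similar to the question, then build a prompt that tells the model to answer from those passages only. This is not chat.py; it is a minimal sketch that assumes a toy bag-of-words `vectorize` in place of OpenAI's embeddings. The function names and the prompt wording are hypothetical, and the two sample passages are paraphrased from the answers in the chat.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Toy stand-in for a remote embedding call: a unit-length bag of words."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    norm = math.sqrt(sum(n * n for n in counts.values())) or 1.0
    return {word: n / norm for word, n in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = vectorize(question)
    ranked = sorted(passages, key=lambda p: cosine(q, vectorize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Ground the model: it may answer only from the retrieved passages."""
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Illustrative passages only, paraphrased from the chat answers.
PASSAGES = [
    "Emma Woodhouse is handsome, clever, and rich.",
    "Frank Churchill is three or four-and-twenty.",
]

if __name__ == "__main__":
    question = "Who is Emma?"
    print(build_prompt(question, retrieve(question, PASSAGES, k=1)))
```

The prompt would then be sent to a language model, a step the sketch leaves out. Because only a handful of passages reach the model, a broad question such as "Where does the story take place?" can get an answer drawn from a single scene, which is one reason the results are not perfect.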

Getting results such as the ones above is nice, but they raise the question, "From where did the answers originate? Show me citations." This problem is easily addressed through the use of metadata. More specifically, when indexing ("vectorizing") one can associate things like authors, titles, dates, genres, types, etc. with indexed items. Then, when results are returned, it is possible to learn what documents were used to formulate responses.

For example, I collected about 136 journal articles on the topic of cataloging. I then indexed them using OpenAI, and created a query interface. Below are snippets from an example chat; notice how each is associated with one or more author/title combinations -- pointers to the articles from which the answers were generated:

  What is the role of cataloging in libraries?

  Cataloging in libraries involves creating bibliographic records for
  various resources, including books and online materials, to facilitate
  the discovery and access of these resources by library users. Catalogers
  are responsible for organizing and describing library materials,
  ensuring that they are accurately represented in the library's catalog.
  The role of cataloging has evolved with the introduction of automation
  and technological advancements, leading to changes in the tools and
  resources used by catalogers. Some libraries also outsource certain
  cataloging functions, such as copy cataloging and cataloging of
  foreign-language materials. Overall, cataloging plays a crucial role in
  ensuring that library users can easily find and access the resources
  they need.

  References: 

    1. Liaising the Catalog: Collaborating Across Library Departments to
       Promote Successful Discoverability through Enhanced Cataloging / Busch,
       Tammie (2023.0) (5)
    2. Catalogers in Academic Libraries: Their Evolving and Expanding Roles
       / Buttlar, Lois (1998.0) (3)

  
  How did OCLC effect the practice of library cataloing?

  OCLC had a significant impact on the practice of library cataloging. It
  provided a valuable resource for cataloging data, allowing libraries to
  find cataloging copy for various types of materials. Many libraries
  relied on OCLC as their primary source of cataloging copy, which helped
  improve the efficiency and effectiveness of their cataloging processes.
  OCLC's data base also played a role in interlibrary loan,
  preacquisitions verification, and cataloging data. Overall, OCLC's
  services had a positive effect on library cataloging practices.

  References: 

    1. Liaising the Catalog: Collaborating Across Library Departments to
       Promote Successful Discoverability through Enhanced Cataloging / Busch,
       Tammie (2023.0) (3)
    2. A Survey on the Outsourcing of Cataloging in Academic Libraries /
       Libby, Katherine A. (1997.0) (2)
    3. The Availability of Cataloging Copy in the OCLC Data Base / Metz,
       Paul (1980.0) (1)
    4. An Overview of the Current State of Linked and Open Data in
       Cataloging / Ullah, Irfan (2018.0) (1)
    5. Bade, David. The Creation and Persistence of Misinformation in Shared
       Library Catalogs: Language and Subject Knowledge in a Technological
       Era. Champaign-Urbana, Ill.: Graduate School of Library and Information
       Science, Univ. of Illinois (Occasional Papers, no. 211), 2002. 33p. $8
       (ISBN 087845120X). / Bland, Robert (2002.0) (1)
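Reference lists like the ones above fall out of the metadata almost for free. The following is a minimal sketch, not the code behind the chat: the `Chunk` class and `references` function are hypothetical, and it assumes the trailing number in each reference counts how many retrieved passages came from that article, which the original does not state.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage plus the metadata attached at indexing time."""
    text: str
    title: str
    author: str
    year: int


def references(chunks):
    """Tally retrieved chunks by source and format a numbered reference list."""
    tally = Counter((c.title, c.author, c.year) for c in chunks)
    return [
        f"{n}. {title} / {author} ({year}) ({count})"
        for n, ((title, author, year), count) in enumerate(tally.most_common(), 1)
    ]


if __name__ == "__main__":
    # The passage texts are placeholders; the citations come from the chat above.
    retrieved = [
        Chunk("(passage one)", "The Availability of Cataloging Copy in the OCLC Data Base", "Metz, Paul", 1980),
        Chunk("(passage two)", "A Survey on the Outsourcing of Cataloging in Academic Libraries", "Libby, Katherine A.", 1997),
        Chunk("(passage three)", "A Survey on the Outsourcing of Cataloging in Academic Libraries", "Libby, Katherine A.", 1997),
    ]
    for line in references(retrieved):
        print(line)
```

Sources are listed from most-used to least-used, so a reader can see at a glance which articles contributed most to an answer.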

Finally, processes such as the ones outlined above could be applied to many different types of library content: MARC records, PNX files, the output of OAI-PMH harvests, LibGuides, special collections exhibits, etc. I'm not saying the results are better, but I am saying the ways to query the content are MUCH easier, and the results are MUCH more readable.

Finally, finally, you can temporarily download the whole of these sketches as a single zip file.


Creator: Eric Lease Morgan
Source: I believe I posted this to the Code4Lib mailing list.
Date created: 2024-04-30
Date updated: 2024-04-30
Subject(s): Retrieval-augmented generation;
URL: https://distantreader.org/blog/rag/