ACRL News Issue (B) of College & Research Libraries
C&RL News ■ April 2000 / 293

Dressing up SGML for the Web

A look at UNL’s project to create electronic books

by DeeAnn Allison and Jon Keene

Technology has made it easier for libraries to provide access to information. Online catalogs and electronic reference databases are commonplace in modern libraries. The latest technology makes it easier and more cost effective for libraries to move into the field of digitization. Many libraries have electronic text conversion projects in various stages of implementation.

The project at the University of Nebraska-Lincoln (UNL) is a cooperative venture that includes the university libraries, the university press, and faculty representing several humanities areas (see Note 1). A major goal of the project is to produce electronic versions of books that can be used for classroom and research applications. The end product must be a faithful reproduction of the original text that is easy to read. It is important that the electronic version not lose the familiar organization of the book (including page numbers and a table of contents) and that it allow the reader to move through the text in a natural fashion.

Determining formats for electronic text conversion

Since it is important to preserve as much of the original print presentation as possible, the selection of electronic format is very important. For example, will documents be digitized as images or as text? Images have the advantage of appearing exactly the same as they did in the book, while text can be manipulated in a variety of ways to add value to the end product. If documents are to be digitized as text, the selection of a markup language is crucial. The markup language not only determines how a document displays but also has profound implications for preservation. The UNL Libraries decided to use Standard Generalized Markup Language (SGML).
SGML is an international standard (ISO 8879) that is used when converting books into a format that can be stored on a computer. It is a recognized standard that is supported by many applications and is widely used by publishers.

What is SGML?

SGML provides a standard way of managing a document’s format and structure. It defines the markup language to use for the document, the character sets used in the document, the document structure, and the elements of the document. From a technical standpoint, this has the advantage of handling groups of documents the same way rather than treating each document as a unique item. By defining document types, SGML provides a blueprint for how each item is to be presented to readers.

About the authors: DeeAnn Allison is coordinator for automated systems at the University of Nebraska-Lincoln, e-mail: deeanna@unlib.unl.edu; Jon Keene is network specialist at the University of Nebraska-Lincoln, e-mail: jonk@unlib.unl.edu

The major difference between SGML and HyperText Markup Language (html) is that SGML describes the document, not the formatting. It uses descriptive markup for the document structure like “author,” “title,” “language,” “chapter,” and “paragraph,” which frees authors from the tedious activity of coding. Like html, it is hardware independent, with the software package doing the interpretation. Any software package that understands SGML can read and manipulate the data, which makes it useful for a variety of applications and for maintaining an archival copy.

The problem of providing Web access for SGML-coded documents

One of the limitations of SGML is the difficulty of delivering it over the Web. Because it does not use standard html coding, Web browsers cannot display it unless a viewer is installed.
Commercial viewers, which a user can purchase and install on his or her computer to read SGML through a Web browser, are available from vendors like Interleaf. However, one of the goals of the UNL project is to make the digitized texts available to a wide audience. As a consequence, any solution that requires end users to purchase software to use the digitized texts is not acceptable.

Likewise, Internet users have a variety of computers and operating systems, so it is also important for the solution to be platform independent. Finally, it is important to control staffing and disk space costs by eliminating the tedious task of creating and maintaining copies of each document in two formats, both of which use disk space.

For these reasons, UNL is experimenting with an alternative in which computer programs written in the Perl language are executed by the Web server when a link for an SGML document is clicked. The program converts the SGML to html “on-the-fly” so the document is displayed in html format for the browser. This leaves the document in SGML format for archival purposes. The conversion process is transparent to the user, who views the document through the Web browser on his or her local computer.

Since the html file is dynamically produced, there is no second, permanent file that uses disk space or requires maintenance when editing changes are made to the SGML file. Also, because SGML uses standard document types, a single Perl script can be used for more than one document.

Preserving the book's look and feel

Many digitized books lose much of the original design of the book. Books are converted into very long files that are difficult to navigate in any other way than by sequentially moving from beginning to end. UNL is experimenting with a new way to display converted SGML coding.
The approach is to convert all SGML into html and display the document in one frame with the navigation bar in a second frame. This preserves the look of the book, which is displayed as text with graphics in one frame, while adding a navigation bar built from the elements provided by the SGML coding. The element coding of SGML creates a table of contents to add value to an otherwise long and difficult-to-navigate html file.

Two programs were developed: tei2html, which converts the SGML document into the html frame, and navbar, which creates the navigational frame from coding in the SGML document.

The tei2html program

The tei2html program converts TEILITE Document Type Definition (DTD) coding into html. The program starts by preprocessing the SGML file using a program called NSGMLS, a widely used program written by James Clark, which parses and validates an SGML document based on its DTD.

The output of the NSGMLS program is the SGML file content with its structure information in a format that greatly simplifies the next step in the conversion process. The tei2html program takes the output of the NSGMLS program as it is processed by the server. As each SGML identifier is encountered, it is converted to the appropriate html code.

Since there is no one-to-one correspondence between SGML and html, the program must keep track of the tag’s location, which determines its context and the corresponding html code. For example, the html encoding of the SGML tag <TITLE> will be different depending on whether the tag occurs in the title statement or the bibliographic section or elsewhere.

(continued on page 304)

Continuing relationships

Outreach to your teaching colleagues need not stop here. The success of our article in The Teaching Professor led to the authorship of a second piece, this one on choosing appropriate search tools on the Web.
We have been invited to address other topics for this newsletter. It has become clear to us that we have technology-related knowledge that is not common among teaching faculty overall, and that this knowledge is eagerly sought.

We urge you to share what you know with those who teach in the classroom. Librarians have done an excellent job of sharing our ideas with each other. We have been on the cutting edge in devising ways to make intelligent use of the Web. It is time to take what we know and share it.

Notes

1. There is a wide array of criteria checklists available. A handy compendium can be found at Susan Beck’s Web site: http://lib.nmsu.edu/staff/susabeck/checs98.html#method. Esther Grassian’s checklist, Thinking Critically about World Wide Web Resources, is one of the earliest (http://www.library.ucla.edu/libraries/college/instruct/web/critical.htm).

2. Marsha Tate and Jan Alexander, “Teaching Critical Evaluation Skills for World Wide Web Resources,” Computers in Libraries 16, no. 10 (1996): 49-54.

3. Trudi E. Jacobson and Laura B. Cohen, “Teaching Students to Evaluate Internet Sites,” The Teaching Professor 11, no. 7 (August/September 1997): 4. This article is also available at: http://www.albany.edu/library/internet/teaching.html.

4. Keith Gresham, “Surfing with a Purpose: Process and Strategy Put to the Test on the Internet,” EDUCOM Review 33, no. 5 (September/October 1998): 22-29.

5. Susan A. Gardner, Hiltraut H. Benham, and Bridget M. Newell, “Oh, What a Tangled Web We’ve Woven! Helping Students Evaluate Sources,” English Journal 84, no. 1 (1999): 39-44, and Karen Hartman and Ernest Ackermann, “Finding Quality Information on the Internet: Tips and Guidelines,” Syllabus 13, no. 1 (August 1999): 52-54.
(“Dressing up SGML …” continued from page 294)

Once the processing is complete, the SGML file is displayed as a temporary html file that the computer automatically removes when the user disconnects. Since the html version is created dynamically when the user selects a hypertext link, and is deleted when the user navigates to another Web page, there is only one permanent file for any document. This means that only one file must be edited whenever corrections or revisions are needed.

An additional feature of the program is the creation of navigational footnotes from the SGML. All the notes are collected at the end of the html document with hypertext links so that the reader may jump to a note and back to the text.

Creating navigational links for the converted document

An auxiliary program called navbar was written to produce the html code for the hypertext links that display in a frame to the left of the html document. These links function as a hypertext table of contents, giving the user the ability to jump from one section to another. This imitates the printed version of the text by creating a method for jumping to specific sections and scanning the e-text for specific parts. This was critical for the poetry books completed early in the project. The navigation bar enables users to scroll through the volume’s contents and jump to specific poems.

Although transferring printed material to the Web poses many challenges, it also provides this generation of librarians with the opportunity to improve on the design of the book. Innovative projects like this one undertaken at UNL unite the process of information preservation with information redesign, giving us the opportunity to enhance the end product.

Note

1. More information on the UNL e-text project and examples of digitized texts can be found at http://libr.unl.edu:2000.
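
The context tracking described for tei2html and the table of contents built by navbar can be illustrated with a short sketch. It is written in Python rather than the Perl used at UNL, and it is not the project's code: the function names, the choice of html tags, the anchor names, and the frame name "text" are all assumptions made for illustration. The input is assumed to be the line-oriented output of NSGMLS, in which a line starting with "(" opens an element, ")" closes one, and "-" carries character data.

```python
import re
from html import escape


def unescape_esis(data):
    """Decode the backslash escapes used in NSGMLS data lines."""
    def repl(match):
        token = match.group(1)
        if token == "n":
            return "\n"            # record end
        if token == "\\":
            return "\\"            # literal backslash
        if token == "|":
            return ""              # SDATA bracket, dropped in this sketch
        return chr(int(token, 8))  # \nnn octal character code
    return re.sub(r"\\(n|\\|\||[0-7]{3})", repl, data)


def esis_to_html(esis_lines):
    """Return (html, toc) for a sequence of NSGMLS output lines.

    The tag choices below are hypothetical; the article does not list
    the rules tei2html actually applies.
    """
    stack = []    # currently open SGML elements, outermost first
    closers = []  # html closing tag owed for each open SGML element
    out = []      # html fragments in document order
    toc = []      # [anchor, heading text] pairs for the navigation frame
    for line in esis_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(":
            parent = stack[-1] if stack else None
            stack.append(rest)
            # The same SGML tag maps to different html depending on context.
            if rest == "TITLE" and parent == "TITLESTMT":
                open_tag, close_tag = "<h1>", "</h1>"
            elif rest == "TITLE":
                open_tag, close_tag = "<cite>", "</cite>"
            elif rest == "HEAD":
                anchor = f"sec{len(toc) + 1}"
                toc.append([anchor, ""])
                open_tag, close_tag = f'<h2 id="{anchor}">', "</h2>"
            elif rest == "P":
                open_tag, close_tag = "<p>", "</p>"
            else:
                open_tag = close_tag = ""  # structural element, no html
            out.append(open_tag)
            closers.append(close_tag)
        elif kind == ")":
            stack.pop()
            out.append(closers.pop())
        elif kind == "-":
            raw = unescape_esis(rest)
            out.append(escape(raw))
            if "HEAD" in stack:
                toc[-1][1] += raw  # collect heading text for the navbar
        # Other line types (attributes, "C", and so on) are ignored here.
    return "".join(out), [(anchor, title) for anchor, title in toc]


def navbar(toc, doc_url, frame="text"):
    """Build the link list for the navigation frame from the toc pairs."""
    links = [
        f'<li><a href="{doc_url}#{anchor}" target="{frame}">'
        f"{escape(title)}</a></li>"
        for anchor, title in toc
    ]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"


if __name__ == "__main__":
    # Made-up sample input, not taken from a UNL text.
    sample = ["(TITLESTMT", "(TITLE", "-Sample Title", ")TITLE", ")TITLESTMT",
              "(DIV1", "(HEAD", "-First Poem", ")HEAD", ")DIV1"]
    body, contents = esis_to_html(sample)
    print(body)
    print(navbar(contents, "book.html"))
```

Keeping a stack of open elements is what lets one rule table serve every document of the same type: the decision for <TITLE> looks only at the enclosing element, not at the particular book being converted.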