UNITED STATES DEPARTMENT OF COMMERCE • John T. Connor, Secretary
NATIONAL BUREAU OF STANDARDS • A. V. Astin, Director

Statistical Association Methods for Mechanized Documentation
Symposium Proceedings, Washington, 1964

Edited by Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Heilprin

National Bureau of Standards Miscellaneous Publication 269
Issued December 15, 1965

For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C., 20402. Price $2.75

Abstract

A Symposium on Statistical Association Methods for Mechanized Documentation was held in Washington, D.C., in March 1964. The Symposium was jointly sponsored by the Research Information Center and Advisory Service on Information Processing, Institute for Applied Technology, National Bureau of Standards, and by the American Documentation Institute. Topics covered include the historical foundations, background, and principles of statistical association techniques as applied to problems of documentation; models and methods of applying such techniques; applications to citation indexing; and tests, evaluation methodology, and criticism. This volume contains 22 of the papers included in the program, the abstracts of 4 additional papers that were presented, and the text of the talk given by R. M. Hayes at the banquet.

Library of Congress Catalog Card No. 65-60077

Foreword

The Research Information Center and Advisory Service on Information Processing was established at the National Bureau of Standards in 1959 under the joint sponsorship of the National Science Foundation and the Bureau, with the assistance of the Council on Library Resources. The Center is engaged in a continuing program to collect information and maintain current awareness of research and development activities in the field of information processing and retrieval, and to encourage cooperation among workers in the field.
On March 17, 18, and 19, 1964, the Center, in cooperation with the American Documentation Institute, sponsored a Symposium on Statistical Association Methods for Mechanized Documentation. The Symposium was held in Washington, D.C., and was attended by approximately 250 subject-matter specialists. This volume contains the texts or abstracts of the papers presented. Primary responsibility for their technical content must rest, of course, with the individual authors.

A. V. Astin, Director

Introduction

The Symposium on Statistical Association Methods for Mechanized Documentation was convened March 17, 1964, at the Smithsonian Institution Auditorium, Washington, D.C. An introduction by Dr. Donald A. Schon, Director, Institute for Applied Technology, National Bureau of Standards, emphasized the different but interdependent interests of the user of scientific and technical information, the machine technologist, and the information specialist. The keynote address was given by the late Hans Peter Luhn, pioneer in the practical application of statistical techniques to mechanized documentation operations, on the subject "Physical prototypes of meaning and their manipulation." During the three-day sessions, 26 technical papers were presented, and provocative panel discussions were given by Pauline C. Atherton, Cyril W. Cleverdon, Calvin N. Mooers, and Alan M. Rees on problems of evaluation, and by Phyllis B. Baxendale, Edward C. Bryant, John O'Connor, Herbert C. Ohlman, H. Edward Stiles, John W. Tukey, and the members of the Program Committee on problems, progress, and prospects.

In recent years there has been a growing interest in the use of computers and machine aids for the processing of documents. Systems for machine-aided document classification, for automatic indexing and abstracting, and for both document and "fact" retrieval have been the subject of research investigations and/or pilot operations.
The growing interest in the use of statistical association methods for such applications appears to be justified for two compelling reasons.

First, our present understanding of digital computers and computing techniques is such that these machines are best suited for the high-speed repetitive execution of simple arithmetic and logical operations. The statistical association techniques are based on the counting of simple observable entities such as words in text, index terms, term co-occurrences, document citations, etc. They also involve the computation of simple arithmetic decision functions based upon such counts. Digital computers are particularly suited to such tasks. In contrast, the handling of complex logical, syntactic, or semantic structures by machine requires comparatively arduous and intricate techniques, and the application of these methodologies for purposes of documentation remains the subject of long-range research. The application of statistical procedures to mechanized documentation thus capitalizes on and matches a significant attribute of existing data-processing machinery — its numerical capability.

Secondly, the techniques appear to be based upon excellent theoretical foundations drawn from the fields of statistics and mathematical psychology. Analogous or identical techniques have previously been applied to a number of closely related problems in other fields besides documentation. As a consequence, considerable experience has been gained with the details of the methodology itself, and the effectiveness of the techniques has been established in analogous areas of application. Because of this, the study of statistical association techniques for mechanized documentation offers the real potential of creating powerful tools for solution of the problems at hand.
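The division of labor described above, counting simple observable entities and then computing arithmetic decision functions over the counts, can be sketched in a few lines of modern code. The toy document collection and the scoring function below are invented for illustration; they are not drawn from any system described in these proceedings.

```python
from collections import Counter
from itertools import combinations

# Toy "document collection": each document is a set of index terms.
documents = [
    {"retrieval", "index", "statistics"},
    {"retrieval", "index"},
    {"statistics", "association", "index"},
    {"association", "retrieval"},
]

# Count the simple observable entities: term occurrences and
# within-document term co-occurrences.
term_counts = Counter()
pair_counts = Counter()
for doc in documents:
    term_counts.update(doc)
    pair_counts.update(frozenset(p) for p in combinations(sorted(doc), 2))

# A simple arithmetic decision function over the counts: the proportion
# of documents containing term b among those containing term a.
def cond_freq(a, b):
    return pair_counts[frozenset((a, b))] / term_counts[a]

print(cond_freq("retrieval", "index"))  # 2/3 of "retrieval" documents also contain "index"
```

Everything here is counting and elementary arithmetic, which is exactly the kind of repetitive work the text identifies as well matched to the machines of the period.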
The resulting effect has been to enable concentration of most of the research effort on the real problems at hand without the need to divert attention to study the methods.

The major purposes of the Symposium were to bring together in one place a representative group of individuals working in a common area, to explore the interrelationships among the different techniques being researched, and to explore further the foundations and methods common to all of them. To further this objective, the papers in this volume have been grouped, for convenience, into sections treating Background and Principles; Models and Methods; Applications to Citation Indexing; and finally Tests, Evaluation Methodology, and Criticisms. The area is still young and is now passing into a more vigorous stage of research. Much remains to be done, for many important topics can be treated only in a preliminary and tentative fashion at the present state of knowledge and understanding. It can be hoped that the communication provided by the Symposium will contribute towards the identification of areas requiring intensive investigation. More significantly, it can be expected that more purposeful research on and testing of the basic premises will emerge from the discussions and deliberations that were held.

We, the members of the Symposium Committee, wish to express our appreciation to those who contributed to this conference, the authors and the discussants.

Mary Elizabeth Stevens, Chairman, National Bureau of Standards
Vincent E. Giuliano, Arthur D. Little, Inc.
Laurence B. Heilprin, Council on Library Resources

Contents

Foreword
Introduction

1. Background and principles

Historical foundations of research on statistical association techniques for mechanized documentation
    Paul E. Jones, Arthur D. Little, Inc., Cambridge, Mass. 02140
Mechanized documentation: The logic behind a probabilistic interpretation
    M. E. Maron, The RAND Corporation, Santa Monica, Calif.
Some compromises between word grouping and document grouping
    Lauren B. Doyle, System Development Corporation, Santa Monica, Calif. 90406
The interpretation of word associations
    Vincent E. Giuliano, Arthur D. Little, Inc., Cambridge, Mass. 02140
The continuum of coefficients of association
    J. L. Kuhns, The Bunker-Ramo Corporation, Canoga Park, Calif. 91304
A correlation coefficient for attributes or events
    H. P. Edmundson, The Bunker-Ramo Corporation, Canoga Park, Calif. 91304

2. Models and methods

A modified statistical association procedure for automatic document content analysis and retrieval
    Joseph Spiegel and Edward M. Bennett, The Mitre Corporation, Bedford, Mass. 01730
The construction of a thesaurus automatically from a sample of text
    Sally F. Dennis, International Business Machines Corporation, Chicago, Ill. 60620
Latent class analysis as an association model for information retrieval
    Frank B. Baker, University of Wisconsin, Madison, Wis. 53706
Problems of scale in automatic classification (abstract only)
    Roger M. Needham, University of Cambridge, Cambridge, England
A nonlinear variety of iterative association coefficient (abstract only)
    Robert F. Barnes, Jr.
The measurement of information from a file
    Robert M. Hayes, University of California, Los Angeles, Calif. 90014
Vector images in document retrieval
    Paul Switzer, Harvard University, Cambridge, Mass. 02138
Threaded term association files
    Mark Seidel, Datatrol Corporation, Silver Spring, Md. 20910
Statistical vocabulary construction and vocabulary control with optical coincidence
    Basil Doudnikoff and Arthur N. Conner, Jr., Jonker Business Machines, Inc., Washington, D.C. 20760
A computer-processed information-recording and association system
    G. N. Arnovick, Planning Research Corporation, Los Angeles, Calif.

3. Applications to citation indexing

Statistical studies of networks of scientific papers (abstract only)
    Derek J. de Solla Price, Yale University, New Haven, Conn.
Can citation indexing be automated?
    Eugene Garfield, Institute for Scientific Information, Philadelphia, Pa. 19106
Some statistical properties of citations in the literature of physics
    M. M. Kessler, Massachusetts Institute of Technology, Cambridge, Mass.

4. Tests, evaluation methodology, and criticisms

An evaluation program for associative indexing
    Gerard Salton, Harvard University, Cambridge, Mass. 02138
The unevaluation of automatic indexing and classification (abstract only)
    T. R. Savage, Documentation, Inc., Bethesda, Md. 20014
Evaluation of automatic indexing using cited titles
    Mary Elizabeth Stevens and Genevie H. Urban, National Bureau of Standards, Washington, D.C. 20234
Results of classifying documents with multiple discriminant functions
    J. H. Williams, International Business Machines Corporation, Bethesda, Md. 20014
Rank order patterns of common words as discriminators of subject content in scientific and technical prose
    Everett M. Wallace, System Development Corporation, Santa Monica, Calif. 90406
Clumping techniques and associative retrieval
    A. G. Dale and N. Dale, University of Texas, Austin, Tex.
Statistical association methods for simultaneous searching of multiple document collections
    William Hammond, Datatrol Corporation, Silver Spring, Md. 20910
Studies on the reliability and validity of factor-analytically derived classification categories
    Harold Borko, System Development Corporation, Santa Monica, Calif. 90406
Postscript: A personal reaction to reading the conference manuscripts
    Vincent E. Giuliano, Arthur D. Little, Inc., Cambridge, Mass. 02140

1. Background and Principles

Historical Foundations of Research on Statistical Association Techniques for Mechanized Documentation*

Paul E. Jones
Arthur D. Little, Inc., Cambridge, Mass. 02140

Ultimately, in statistical association research of the type which is discussed in this Symposium, the data under analysis are taken from a symbol system generated by man.
The symbol system may comprise, for example, a text prepared for communication in natural language, or it may be a pattern of terms assigned to a document collection where the purpose of the indexing relates to the retrieval of the documents. Ordinarily the purpose of the system can be well defined, but the mechanism for producing the symbols (uttering words, indexing documents) is poorly understood. As attempts are made to unravel the statistical properties of these symbol systems, the unknown processes which underlie formation of the data are in fact under scrutiny. Thus in examining the effects of the unknown symbol-producing mechanism, problems continue to be studied which have caught the attention of the greatest intellects of Western culture.

1. Introduction

"Historical Foundations" may seem a surprising title to people who consider our subject to be brand new. After all, "information retrieval" is terminology no more than 20 years old, "mechanized documentation" is perhaps younger, and computers are so new that "historical" seems a curious term to apply to so short a period. The work is obviously derived from the pioneering work of Luhn [1],¹ Maron and Kuhns [2], and Stiles [3], all of whom are clearly identified with the use of computers for mechanized documentation. Where does one derive a historical view when developments have been so recent?

Actually, the statistical association approach draws its point of view, its objectives, and its ideas from at least five major areas of study. Enumerating them is almost a commonplace: psychology, philosophy, technology, linguistics, mathematics. Many of the problems now under investigation have been looked at before with a different perspective, and all five disciplines are involved in the current work. Psychology enters because the data subjected to statistical study were generated, and ultimately are interpreted, by man for his own purposes and objectives.
Technology, especially digital computer technology, has had enormous influence: The approach would be an empty theoretical conjecture were it not for the vast data-processing capabilities now at our disposal. Linguistics has its influence, since the data being analyzed fall into its province. The work is obviously mathematical, not only because of the prominent role of statistics but also because of the structure of the approach. Finally, philosophy contributes the epistemological basis for our work in ways to be touched upon in later paragraphs. Investigations along related lines, and important developments, are to be found in each of these areas, much of it independent of computers and the advent of serious thought about mechanized documentation.

*Support for the preparation of this review was provided, in part, by the Decision Sciences Laboratory, ESD, U.S. Air Force, under Contract AF 19(628)-3311, ESD-TDR-64-528.
¹ Figures in brackets indicate the literature references at the end of this paper.

1.1. A Linguistic Perspective

Most workers in the area of statistical association techniques have applied their techniques to data consisting of the term assignments in a mechanized retrieval system. In general, the problem of document retrieval has served as a useful medium within which to formulate the purpose of the approach; it has also served as a source of guidelines to identify potentially fruitful lines of research. Similarly, the environment of a retrieval system has served in practice as the practical situation within which the "improvement" that might be provided by a statistical association technique can be observed.

In studying an information retrieval system or, more generally, a system for mechanized documentation, we may consider that we are studying a language made up of the marks and symbols used in indexing.
These marks and symbols, which were assigned to documents by indexers when the documents were entered into the documentation system, are used in the system for such tasks as finding documents that are relevant to a user's request. The tags assigned to a document serve as a representation — admittedly incomplete — of what the indexer tried to say the document was about. In a mechanized system, where some degree of orderliness and regularity is to be expected, these marks and symbols are observable representations of what the indexer was trying to convey. As such, the symbol system functions as a language in the intuitive sense: It serves as a vehicle for conveying information about some universe, where the universe is of course the content of the set of documents being described. There ordinarily is an effort on the part of the indexer to choose index tags which describe the document's content, just as in a natural language there ordinarily exists a willful relationship between the words an author writes down and that aspect of reality he is trying to convey. In each case, the symbols of the language are all that is observable, whereas it is the "content" of the message that has the major interest and potential utility. In each case the symbols are purposefully related to the universe under discussion.

Thus either a term system or a natural language text may, at least in concept, serve as the data operated upon by the statistical techniques which are devised for mechanized documentation purposes. Although some of the statistical association techniques have been applied directly to words occurring in text, others cannot be applied directly since they explicitly assume that the data will exhibit certain properties peculiar to a mechanized documentation system.
Nevertheless, if only because automatic indexing of texts could be performed to obtain data to which the statistical association techniques apply [3], both types of data — text and index tags — appear at the present time to be analyzable by the same approach. A large body of work relevant to the topic of this Symposium is thus to be found among analogous aspects of the study of natural language, especially those studies in which extra-linguistic inferences are drawn from a given body of textual data [4-8].

2. A Dualistic Historical Base

2.1. Explaining Statistical Word Associations

Ultimately, when a set of events is subjected to statistical study, one is inevitably making assertions about, and thus dealing with, the process which brought the given data into being. Yet what is the underlying process which is being dealt with when we perform statistical analysis of term co-occurrences in an information retrieval system, or analyze word co-occurrences in text? Are the associations to be explained in terms of a phenomenon involving the representation, with symbols, of entities that "really do" co-occur in the "real world"? This hypothesis regards the data as strongly constrained by the external world of physical nature. Or, on the other hand, are the associations a manifestation of the "association of ideas" on the part of the author or the indexer? This hypothesis regards the data as strongly constrained by the internal world of mental phenomena. No complete explanation has been given for the apparent success of the statistical association techniques in discovering the provocative regularities in the data which have been reported. And since our work is interdisciplinary, it is probable that both the above explanatory mechanisms have been employed simultaneously as working hypotheses. (This fact alone is worth underscoring, for the view that allows both explanatory mechanisms to be seen as equivalent is relatively recent.²) Yet historically speaking, they may be considered poles apart, with roots in two distinct schools of thought.

There are two conflicting frameworks within which the studies being discussed at this Symposium may be embedded. On the one hand, since we cannot ignore the user's mental processes, we are quite content to consider ideas, concepts, and meanings as perfectly respectable entities which are observable by introspection. We are capable of talking quite rationally about relationships among them, their degree of similarity, and the like, without quibbling about their reality. As scientists, on the other hand, we are under strong influences to exclude man's mental processes from any system under objective study. As Bridgman put it, in the introduction to a philosophical discussion of modern physics [10],

    It is of course the merest truism that all our experimental knowledge and our understanding of nature is impossible and non-existent apart from our own mental processes, so that strictly speaking no aspect of psychology or epistemology is without pertinence. Fortunately we shall be able to get along with a more or less naive attitude toward many of these matters. We shall accept as significant our common sense judgment that there is a world external to us, and shall limit as far as possible our inquiry to the behavior and interpretation of this "external" world.

Bluntly, the physicist says, "You can't observe an idea." Yet because of the nature of our work we also cannot define ideas out of the universe of discourse. To circumvent this extreme dualism, introduced by Descartes, in which mind and physical nature are completely separate, we may employ the epistemological framework developed by the British empiricists between 1750 and 1900.

² See, for example, the discussion of language in [9].
³ For a detailed discussion of the development of the model, see [12].
Beginning with Locke and Hobbes, the mind at birth was treated as a tabula rasa upon which experience about the external world was recorded, henceforth, in a form and pattern that led ultimately to knowledge. Berkeley and Hume, among others, completed the epistemological framework and hypothesized the associational mechanism to account for and explain the higher mental processes.³ Some scholars claim that Aristotle had a crude formulation of the association of ideas by "similarity" and by "contiguity." But Hume [11] wrote of the associational mechanism:

    And even in our wildest and most wandering reveries, nay in our very dreams, we shall find, if we reflect, that the imagination ran not altogether at adventures, but that there was still a connection upheld among the different ideas, which succeeded each other. Were the loosest and freest conversation to be transcribed, there would immediately be observed something which connected it in all its transitions. Or where this is wanting, the person who broke the thread of discourse might still inform you, that there had secretly revolved in his mind a succession of thought which had gradually led him from the subject of conversation. Though it be too obvious to escape observation, that different ideas are connected together; I do not find that any philosopher has attempted to enumerate or class all the principles of association; a subject, however, that seems worthy of curiosity. To me, there appear to be only three principles of connection among ideas, namely, resemblance, contiguity in time or place, and cause or effect.

As an epistemological framework, the work of the British empiricists has served as the principal route for transfer between the external world and the reality known by introspection.

2.2. The Psycholinguistic Route

But the associationists' model was also interpretable as a psychological doctrine [13], and as such it was severely attacked in the early twentieth century. The model failed, for example, to provide for quantifiable observations; the inadequacy of introspection as a workable observational tool prevented the use of the associational model as the basis for a scientific theory. Though the associationists' ideas were generally encompassed by the newer psychological theories, the mainstream of activity diverted from the epistemological interest explored by the British empiricists. It goes without saying that psychologists retained their interest in studying the laws that govern the mind, yet a sharp trend away from a dualistic philosophy accompanied the rise of objective psychology and behavioristics. Clearly this trend involved a movement away from the intuitive reality of ideas and towards the study of external, observable manifestations.

Many of the developments in psychology most closely related to our present interests are derived from the resulting experimental activity, especially the efforts to analyze and quantify psychological data. Naturally, modern psychologists have always been interested in issues of scaling [14], computation, and statistical analysis of observed behavior, but their objectives have involved interest in studying individual psychological parameters. Workers in statistical association techniques for mechanized documentation have not shared this objective. But though our motivation is somewhat different, there is much to be learned from the tools and approaches the psychologists developed in the early decades of the twentieth century. For example, it was this school, with its interest in drawing inferences about psychological variables from the outcome of behavioral experiments, which developed and applied the techniques of factor analysis [15, 16] with its accompanying methodology.⁴

In addition, psychologists became increasingly interested in the analysis of linguistic behavior. An important body of experimental work on human word associations was performed [17]. This attention led slowly to the notion that language data could be analyzed for content by studying word frequencies and interpreting the pattern that emerged [18, 19]. For example, one vigorous line of development in the 1940's was directed at the analysis of mass communications to ascertain the objectives behind the propaganda being transmitted or published.

    . . . Content analysis was initially developed some years before World War II, as a tool for the scientific study of political communication. Those who pioneered with Harold D. Lasswell in its development were interested in acquiring scientific knowledge about political communication. Accordingly, content analysis was originally defined and developed in order to list and measure the frequency of occurrence of certain characteristics of the political communication under study and to classify them under general terms, or content categories, which were suggested by a tentative theory of political communication. The objective of the research in this original content-analysis approach was to make general inferences, or scientific generalizations, in the form of one-to-one regularities or correlations between some content indicator (or class of indicators) and some state or characteristic of the communicator or his environment [20].

This activity employed various techniques which are now familiar to us, but the methodology suffered from being excessively laborious. And although simple frequencies of occurrence were taken as clues, frequencies of co-occurrence were not.

⁴ These techniques figure importantly in the work of Borko and his followers.
Work on this faded at the end of World War II, but then in 1955 a remarkable conference was held at Allerton House at the University of Illinois. The proceedings [21] were not published until much later (1959), but the deliberations reflect a great deal of thought about problems very similar to those we are discussing this week.

The conferees were psycholinguists, interested in drawing inferences from analysis of language data. They counted co-occurrences. They discussed a number of association formulas. They used factor analysis. They talked about word-association profiles and meaning measures, and employed a vector space representation. They performed cluster analysis. For instance, in the introduction, Pool writes:

    It was . . . somewhat of a discovery for a group of scholars assembled in the mid-1950's, when content analysis seemed to be in a decline, to find that other scholars also had seen unexplored potentials in content analysis if certain new tacks were taken to meet the unsolved problems of the previous decade. The conferees, each starting from different directions and generally unaware of each other's work, did not of course see eye to eye on all issues. The discussions were vigorous . . . . But the striking fact was the degree of convergence. It is not for this introduction to attempt to state what the convergences of viewpoint were . . . . Suffice it here to note that they centered above all on two points:
    1. a sophisticated concern with the problems of inference from verbal material to its antecedent conditions, and
    2. a focus on counting internal contingencies between symbols instead of the simple frequencies of symbols.
    Both these points arose out of the concern of the analysts to make their elaborate quantitative method produce something beyond what could be produced without its paraphernalia — to produce something that would go beyond the reaffirmation of the obvious.

In the same volume (pp. 54-55) Osgood points out:

    An inference about the "association structure" of a source — what leads to what in his thinking — may be made from the contingencies (or co-occurrences of symbols) in the content of a message. One of the earliest published examples of this type of content analysis is to be found in a paper by Baldwin [22] in which the contingencies among content categories in the letters of a woman were analyzed and interpreted. For some reason this lead does not seem to have been followed up, at least in the published reports of people working on content analysis problems. On the other hand, it soon became evident in this conference that all of the participants had been thinking about the contingency method in one form or other as being potentially useful in their work. If there is any content analysis technique which has a defensible psychological rationale it is the contingency method. It is anchored to the principles of association which were noted by Aristotle, elaborated by the British Empiricists, and made an integral part of most modern learning theories. On such grounds it seems reasonable to assume that greater-than-chance contingencies of items in messages would be indicative of associations in the thinking of the source. If, in the past experience of the source, events A and B (e.g., references to FOOD SUPPLY and to OCCUPIED COUNTRIES in the experience of Joseph Goebbels) have often occurred together, the subsequent occurrence of one of them should be a condition facilitating the occurrence of the other: the writing or speaking of one should tend to call forth thinking about and hence producing the other.

In other words, out of a discipline with close involvement in understanding certain mental parameters (like anxiety), these psycholinguists did some early work on statistical measures of association with emphasis upon the psychological consequences.
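The "greater-than-chance contingency" idea these analysts converged on can be given a minimal sketch: count how often two symbols occur jointly in a set of messages, and compare that with the joint count expected if the symbols occurred independently. The message data below are invented (echoing Osgood's FOOD SUPPLY / OCCUPIED COUNTRIES example), and the bare expected-count comparison stands in for whatever formal significance test an analyst would actually apply.

```python
# Minimal sketch of the contingency method: two symbols co-occur
# "greater than chance" when their observed joint frequency exceeds
# the joint frequency expected under independence, n_a * n_b / N.
messages = [
    {"food", "occupied"},
    {"food", "occupied", "supply"},
    {"food"},
    {"supply"},
    {"occupied", "food"},
    {"supply", "morale"},
]

N = len(messages)
n_a = sum("food" in m for m in messages)                  # messages mentioning A
n_b = sum("occupied" in m for m in messages)              # messages mentioning B
n_ab = sum({"food", "occupied"} <= m for m in messages)   # messages mentioning both

expected = n_a * n_b / N  # joint count expected if A and B were independent
print(n_ab, expected)     # observed exceeding expected suggests an association
```

With these invented data the pair co-occurs in 3 messages against an expected 2, the direction of excess that Osgood's rationale reads as an association in the thinking of the source.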
Their work differed from ours in that they were prepared to introduce a priori encoding of the data under study. Thus they were prepared to exercise human judgment in coalescing "ref- erences to factories, industry, machines, production, and the like" into the single content category FACTORIES. Less defensible from our view, they were prepared to encode, by means of human judgment, the attitude expressed toward such a content category in a given context. This posi- tion reflects, of course, a principal difference in motivation and objectives. (See also [23].) But although their motivation was different, their procedure was very closely related to that we are now discussing in the context of mechanized doc- umentation. It is of interest that their work has had no significant influence upon the foundations upon which the present work rests. 2.3. Natural Science The mainstream of the statistical association ap- proach discussed at this conference comes rather from the natural sciences and developments pro- vided by workers quite remote from psychology. The trend of this work has been in the opposite direction — away from exclusive attention to the external world and towards increased incorpora- tion of selected human intellectual activities within the province of a totally objective science. The advent of the twentieth century was accom- panied by an enormous increase in the use of statistical methods in all of science. Indeed the use of statistical methods was sufficiently broad that workers in a multiplicity of areas invented data-interpretation formulas appropriate to the task at hand. Goodman and Kruskal [24], in a survey of measures of association, critically examine a large number of closely related formulations. There were, for instance, developments in drawing inferences from medical data which were contrib- uted by experimenters in that field. 
A method of analysis was developed by an ecologist who was interested in the association between species and the character (e.g., marshland) of the environment in which they were discovered. A technique for evaluating the efficacy of a forecast of a tornado was developed by a meteorologist. In each case, the original report serves as a source for the philosophy and logic of the measure that was used and the important rationale for its interpretation. These efforts were, of course, subjected to criticism and debate. Yule, K. Pearson, Fisher, Kendall, and others continued to probe the rationale underlying statistical analysis of observations. While applied work increased in scope, they focused attention on fundamental issues, delimiting the range of applicability of the approaches, clarifying the inherent assumptions, and creating new concepts of data analysis. But a more brutal objectivity was needed by Dirac, Einstein, and other physicists working early in this century [10, 25]. To make quantum mechanical and relativistic concepts comprehensible and consistent — in the face of experimental evidence that defied intuitive explanations — they found it necessary to develop and use a strict epistemological formalism [25] which stated explicitly what could be observed and the limitations on the inferences one might draw. Generally speaking, they limited the universe of discourse to the observable physical reality of the "external world," defining man entirely out of the picture. But in a highly mathematical formalism, they gave impetus to what has now become an increasingly symbolic point of view toward making, and drawing inferences from, observations of the real world.
The important objectivity employed in the statistical association approach derives, to a considerable extent, from a corresponding insistence that the data from a system (e.g., an information retrieval system) are to be processed according to procedures which are spelled out in advance. No human interpretation of the data is allowed until all processing is completed. One would hardly expect that such a cold, scientific methodology could reveal semantic regularities when applied to uninterpreted language data. After all, language is meant to be interpreted. Yet it was approximately contemporary with the Atherton House conference that Luhn began evangelizing the use of word frequencies in text as a key to content [1]. There was no a priori encoding of words into content categories, and he and his followers had to overcome significant skepticism. Yet his experiments and demonstrations were persuasive. Luhn drew attention to information retrieval and indexing as potentially tractable tasks, and combined the objectivity of frequency analysis with the pragmatic objectives of doing something useful. More significant, he gave great impetus to a movement away from the use of manually assigned classifications in indexing and retrieval. The next big step was made by Maron and Kuhns [2], who provided an overview of the act and method of information retrieval. In synthesizing a new model of the process, they broke far from previous restraints, especially in introducing "arithmetic (as opposed to logic alone) into the problem of indexing." They also argued for the statistical analysis of the co-occurrences of index tags, a significant departure which has had great influence. Their emphasis was on the retrieval of relevant documents, rather than on interpretation of the association measures obtained among the terms.
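The kind of co-occurrence analysis that Maron and Kuhns advocated can be illustrated in miniature. The toy document collection and the particular measure below (observed co-occurrences minus the count expected under independence) are illustrative assumptions of this sketch, not the authors' own formulation:

```python
# Illustrative only: a tiny "library" of documents, each represented
# by its set of index tags. Tags and documents are invented.
docs = [
    {"logic", "switching"},
    {"logic", "switching", "circuits"},
    {"logic", "probability"},
    {"probability", "indexing"},
]

def association(tag_a, tag_b, docs):
    """Observed co-occurrences minus those expected under independence."""
    n = len(docs)
    a = sum(tag_a in d for d in docs)           # docs containing tag_a
    b = sum(tag_b in d for d in docs)           # docs containing tag_b
    both = sum(tag_a in d and tag_b in d for d in docs)
    return both - a * b / n

print(association("logic", "switching", docs))  # positive: associated
print(association("logic", "indexing", docs))   # negative: dissociated
```

A positive score signals that two tags co-occur more often than chance would predict, which is exactly the kind of "statistical relationship" caused, in Maron and Kuhns's phrase, by the nature of the facts described by the documents.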
Thus they were quite careful in their discussion of index space to point out that:

The distinction between semantical and statistical relationships may be clarified as follows: Whereas the semantical relationships are based solely on the meanings of the terms and hence independent of the "facts" described by those words, the statistical relationships between terms are based solely on the relative frequency with which they appear and hence are based on the nature of the facts described by the documents. Thus, although there is nothing about the meaning of the term "logic" which implies "switching theory," the nature of the facts (viz., that truth-functional logic is widely used for the analysis and synthesis of switching circuits) "causes" a statistical relationship. (Another example might concern the terms "information theory" and "Shannon" — assuming, of course, that proper names are used as index terms.) 5

This comment, indeed their whole discussion, is quite free of a hypothesis regarding the "association of ideas" — rather they point to the external world as the explanatory mechanism for the statistical relationships discovered. It remained for Stiles [3] to synthesize his uncompromisingly operational view of the problem. First, he made the entire process automatic in his proposal to begin directly with the text of documents, index them automatically, perform co-occurrence analysis of the words so selected from the text, and thus obtain association measures defined from text. Though in practice he employed data from a coordinate index, he specifically included the possibility of text analysis by the same approach. Second, he dispensed with heuristics, and with this step Stiles went beyond his predecessors. He introduced the important idea of using term profiles to obtain second-generation terms. 6 Finally he observed of this step that "It projects us beyond the purely statistical relationships and into the realm of meaningful associations. . . .
Among these second-generation terms we find words closely related in meaning to the request terms." Stiles thus formulated a process which has enormous implications. Starting with text, a completely formal process leads to relationships which admit plausible interpretations in the domain of meaning. The computer, one need hardly state, does not interpret the data — they are uninterpreted symbols. At one blow it puts "The Measurement of Meaning" in an entirely new light.

3. Conclusion

Despite differences in motivation, emphasis, and perspective, the two main avenues that have been sketched very briefly in this paper have led, quite independently, to very similar constructs for the determination of meaningful measures of word association. Despite their similarity of technique, different explanatory mechanisms are suggested in each of the two traditions. On the one hand, the association of ideas is regarded as a defensible rationale for the method, while on the other, it is the "nature of the facts" in the external world which provides the "cause" of the statistical relationships observed. The roots of each tradition are found in the epistemological framework erected by the British empiricists. A historian would thus be expected to regard the significance of the present effort not in terms of its mechanized documentation objectives but in terms of the larger movements of which it is a part. For while the two traditions from which statistical association techniques have emerged have tended to split over the value of using "ideas" as explanatory constructs, steps have been taken in both to replace the introspective method by a more quantifiable and objective technique. The discovery that the indexing language of a retrieval system is peculiarly susceptible to scientific analysis is an important step.

5 P. 225.
6 Cf. Harris [6].
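Stiles's term-profile idea admits a simple sketch: treat each term's row of association scores as its "profile," and take as second-generation terms those whose profiles resemble a request term's profile, even when the two terms never co-occur directly. The table and names below are hypothetical, and cosine similarity here stands in for whatever comparison Stiles actually employed:

```python
import math

# Hypothetical first-generation association scores; each row is a
# term's profile over other vocabulary items.
profiles = {
    "logic":   {"switching": 4.0, "circuits": 3.0, "boolean": 2.5},
    "algebra": {"switching": 3.5, "circuits": 2.5, "boolean": 3.0},
    "poetry":  {"meter": 4.0, "rhyme": 3.5},
}

def cosine(p, q):
    """Cosine similarity between two sparse profiles."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def second_generation(term, k=1):
    """Terms whose association profiles most resemble `term`'s profile."""
    ranked = sorted(
        ((other, cosine(profiles[term], prof))
         for other, prof in profiles.items() if other != term),
        key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:k]]

print(second_generation("logic"))  # ['algebra']: similar profile
```

The point of the sketch is the one Stiles made: the computation is entirely formal, yet the terms it surfaces tend to be "closely related in meaning to the request terms."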
But perhaps more significant is the degree to which the two traditions, in treating substantially the same data with substantially the same techniques, are finding a common experimental ground after a long historical separation.

4. References

[1] Luhn, H. P., A statistical approach to mechanized encoding and searching of literary information, IBM J. Res. and Devel. 1, 309-317 (1957).
[2] Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, 216-244 (1960).
[3] Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271-279 (1961).
[4] Morris, C. W., Signs, Language and Behavior (Prentice-Hall, New York, N.Y., 1946).
[5] Harris, Z. S., Discourse analysis, Language 28, 1-30 (1952).
[6] Harris, Z. S., Distributional structure, Word 10, 146-162 (1954).
[7] Saporta, S., Psycholinguistics (Holt, Rinehart and Winston, New York, N.Y., 1961).
[8] Osgood, C. E., The nature and measurement of meaning, Psych. Bull. 49, 197-237 (1952).
[9] Langer, S., Philosophy in a New Key (Harvard Univ. Press, Cambridge, Mass., 1951).
[10] Bridgman, P., Logic of Modern Physics (Macmillan, New York, N.Y., 1946).
[11] Hume, D., An enquiry concerning human understanding, pp. 596-597, in E. A. Burtt, ed., The English Philosophers From Bacon to Mill (Random House, Inc., New York, N.Y., 1939).
[12] Warren, H. C., A History of the Association Psychology (C. Scribner's Sons, New York, N.Y., 1921).
[13] Hartley, D., Observations on Man (1749).
[14] Mosier, C. L., A psychometric study of meaning, J. Soc. Psych. 13, 123-140 (1941).
[15] Thurstone, L. L., Multiple Factor Analysis (Univ. of Chicago Press, Chicago, Ill., 1947).
[16] Harman, H. H., Modern Factor Analysis (Univ. of Chicago Press, Chicago, Ill., 1960).
[17] Kent, G. H., and A. J. Rosanoff, A study of association in insanity, Am. J. Insanity 67, 37-96 (1910).
[18] Lasswell, H. D., N.
Leites, et al., Language of Politics: Studies in Quantitative Semantics (G. W. Stewart, New York, N.Y., 1949).
[19] Berelson, B., Content Analysis in Communications Research (Free Press, Glencoe, Ill., 1952).
[20] George, A. L., Propaganda Analysis (Row, Peterson & Co., White Plains, N.Y., 1959), pp. 29-30.
[21] Pool, I. S., ed., Trends in Content Analysis (Univ. of Illinois Press, Urbana, Ill., 1959).
[22] Baldwin, A. L., Personal structure analysis: a statistical method for investigating the single personality, J. Abnormal and Social Psych. 37, 163-183 (1942).
[23] Osgood, C. E., G. Suci, and P. H. Tannenbaum, The Measurement of Meaning (Univ. of Illinois Press, Urbana, Ill., 1957).
[24] Goodman, L., and W. Kruskal, Measures of association for cross-classification, J. Am. Stat. Assn. 49, 732-764; further discussion and references, Mar. 1959, 124-163.
[25] Dirac, P. A. M., The Principles of Quantum Mechanics, 3d ed. (Clarendon Press, Oxford, 1947).

Mechanized Documentation: The Logic Behind a Probabilistic Interpretation

M. E. Maron*
The RAND Corporation, Santa Monica, Calif.

The purpose of this paper is to look at the problem of document identification and retrieval from a logical point of view and to show why the problem must be interpreted by means of probability concepts. We show why one must interpret the transition between a user's request for information and the library's response as an inverse statistical inference. Furthermore, we show how a mechanized library system can elaborate automatically upon and improve a given request, and why this requires association techniques based on statistical as well as semantical relationships. The paper concludes with some remarks indicating how these notions may be extended to put the problem of mechanized documentation on an even firmer base.

1.
Introductory Remarks

Mechanized documentation a few years ago occupied a relatively small sector of the computing field; however, it may well overshadow and perhaps even dominate conventional numerical uses of computers. This prediction may appear extravagant in view of the fact that we have had larger, faster, more reliable, and more flexible computing machines each year since the publication of Vannevar Bush's classic discussion in 1945 [1],1 and yet the problems of mechanized documentation are still largely unresolved. This suggests, of course, that the problems of mechanized documentation do not relate primarily to hardware — if they did, they would doubtless be more tractable. They are intellectual problems, and they have remained unsolved because the proper framework within which to view them has not been firmly constructed. Perhaps one reason for this is that the technology was ready — and as a result we had an information storage and searching machine (the Rapid Selector) — before we were clear about the logic and the strategy to be used in mechanized searching. But a more basic reason that solutions to our problems have eluded us thus far is that our subject is very difficult because some of its key aspects are basically epistemological, having to do with the activity of knowing.

2. Communication, Information, and Language

2.1. Knowing and the Notion of an Internal Model

In order to get at fundamentals, we must be clear about the function of a library; we have to be clear about the circumstances under which someone would want to use a library. The simple answer, of course, is that someone comes to the library because he doesn't know something and wants to find out about it by reading the appropriate books. So first of all we have to ask: What does it mean to say that someone knows something?
For present purposes, we will equate one aspect of knowing with having an internal model (sometimes called a "cognitive map") of the world, which, in a sense, is consulted and which determines the appropriate behavior in terms of knowing what to do and what to expect under various circumstances [10]. We receive information when our internal model of the world is updated or changed. In fact, we might say that information is that which changes what we know; i.e., it modifies our internal model [3, 4]. The amount of semantic information in a message could, in principle, be measured in terms of the amount by which it changes the internal model of the receiver [6]. It is important to recognize from these remarks that information is not a stuff contained in books as marbles might be contained in a bag — even though we sometimes speak of it in that way. It is, rather, a relationship. The impact of a given message on an individual is relative to what he already knows, and, of course, the same message could convey different amounts of information to different receivers, depending on each one's internal model or map.

*Any views expressed in this paper are those of the author. They should not be interpreted as reflecting the views of The RAND Corporation or the official opinion or policy of any of its governmental or private research sponsors.
1 Figures in brackets indicate the literature references on p. 13.

2.2. The Notion of a Question

When an individual, A, wants some part of his internal map updated, he may ask a question of another individual, B. Notice that there are different aspects of the map that may stand in need of updating — scope, depth, detail, etc. But the point is that A characterizes the gap in his map in the form of a question. B receives the question and responds after consulting his own map. Hopefully, he responds by describing those facts requested by A.
An important feature of this type of information exchange is that unless A and B are already familiar with the background, education, and experiences of each other, the process of communication between them may require several cycles of iteration before B is quite sure of what A "really" wants, relative to depth and detail, and therefore how the answer must be framed. This requires that B incorporate within his model of the world some representation of A's model of the world [5].

2.3. Interrogating a Library Computer

Suppose the individual consults a library computer instead of another person to obtain information. Since current computers cannot comprehend [2], they must be instructed (programmed) as to how to manipulate incoming requests on the basis of a description of the form of the input request and stored data. That is, in order to compensate for the fact that computers only manipulate symbols on the basis of stored instructions, appropriate procedures must be initiated in order to have a computer automate certain library tasks. In conventional library systems the procedures are as follows: A human indexer reads the library documents and assigns the appropriate tags (this step could be mechanized and executed by the computer [8]). Conventionally, the indexer assigns index tags according to his notion of where each document would fit, relative to the maps of the library users who will interrogate the system. To what extent, however, can he anticipate the needs of future users who might find the document relevant? The second step in the operation of conventional systems is that the information needs of the users are described in the form of a library request — usually framed in the vocabulary of the library indexing language and the grammar of truth-functional logical connectives. Given a request, the machinery begins to grind: the computer searches its store, trying to match the description of the need with descriptions of documents.
A document is considered relevant to a user's information need if there is an exact logical match or if the document description implies the request formulation.

3. The Fallacy of Conventional Indexing

We have argued elsewhere [9] that the conventional search strategy described above is based on an invalid inference scheme, and that once the logical fallacy behind such systems is unmasked, we will recognize why retrieval effectiveness is poor. The fallacy can be pointed out as follows: An indexer in the process of deciding whether or not to assign index tag Ij to document D considers the following sentence S: If document D satisfies the information need of a library user, then he will describe that need in terms of index tag Ij. S is a conditional sentence of the form "If X, then Y", where X = document D satisfies the information need, and Y = index tag Ij describes the user's information need. So we can schematize the transition from a user's request to the library response as follows:

If X, then Y
Y
Therefore, X

(The inference consists of two premises, one of which is sentence S, the truth of which is not now in question.) To say that an inference is invalid is to say that it is possible for its premises to be true and its conclusion false. The above inference is clearly fallacious. We cannot even assert that the premises confer a degree of partial truth on the conclusion. It is not surprising that retrieval effectiveness suffers when based on an invalid search strategy.

4. The Need for a Probabilistic Interpretation

What is the probability that a document indexed by a given description will satisfy the information need of a user who has described his need in an identical way? The probability may be high or rather low depending, among other things, on the richness and flexibility of the library indexing language.
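The invalidity of this scheme (the classical fallacy of affirming the consequent) can be checked mechanically: enumerate the four truth assignments for X and Y and look for a case in which both premises hold but the conclusion fails. A short sketch:

```python
from itertools import product

def implies(p, q):
    """Truth-functional conditional: 'if p then q'."""
    return (not p) or q

# A counterexample is an assignment where the premises
# ("if X then Y" and Y) are true but the conclusion X is false.
counterexamples = [
    (x, y) for x, y in product([True, False], repeat=2)
    if implies(x, y) and y and not x
]
print(counterexamples)  # [(False, True)]: the scheme is invalid
```

The single counterexample (X false, Y true) is exactly the troublesome retrieval case: a user describes his need with tag Ij even though document D would not satisfy it.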
However, in a communication situation of the type described above, where information needs are to be related to documents in terms of the impact of their "contents" on the cognitive map of the receiver, one must use the language of probability to represent properly the relationship between need and description and also to schematize properly the logic of the transition from input request to output documents. A document can be understood properly, for index purposes, only in terms of its impact on a person with an information need. That is, documents and their users stand in a relationship to each other; this relational aspect of the situation must be recognized and made explicit when designing a search strategy. Therefore, it can be argued that index descriptions should not be viewed as properties of documents: they function to relate documents and users. The corollary is that the relationship between a document and a user admits of degrees and must be interpreted probabilistically. Given an understanding of the logic of this situation — namely, that an index tag can "characterize" a given document only to some degree — one is in a position to recognize the rationale behind weighted index tags [7]. The weight of an index tag, Ij, relative to a given document, can be interpreted as an estimate of the probability that if a user were to read the document in question and find it to satisfy his information need, then he would have described his need in terms of Ij. This is what an intelligent individual does intuitively in deciding how to index a document for the purpose of information retrieval. (And in conventional systems he converts his intuitive estimate of this probability to either 1 or 0, depending on which extreme is closer to his intuitive estimate.)
If we want to construct a valid inference of the type required by the transition from a given information request, R (consisting of some function of index tags), to the library response — which is, we suggest, an inverse probability inference — then the inference must be schematized in terms of the theorem of Bayes. We would argue as follows: that the logic behind valid mechanized documentation implies the relational aspect of index tags, that the weights associated with index tags can be interpreted in terms of probabilities,2 and that the transition between a user's request and a library response must be viewed as an inverse probability inference. Given this understanding of the logic of the situation, one can explicate a comparative concept of relevance as a relationship between probabilities of the following kind: the probability that if a user describes his need in terms of a request R, then he will find that document Di satisfies that need. From an operational point of view, if, for a given request, one document would more probably satisfy a user's need than another document, then the former document is more relevant to his need, relative to that request. The interpretation of weighted index tags and this explication of relevance provide the logical and mathematical tools needed to compute what have been called relevance numbers [7] in order to rank the output documents resulting from a request. And this ranking (ordering) provides an optimal strategy for going through the class of retrieved documents.

5. Statistical Association Techniques

The fallacious logic on which conventional search strategies have been based gives rise to two typical symptoms of the logical illness: too many documents are retrieved, many of which are of very low relevance; and some of the really relevant documents are completely missed in the search.
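The Bayesian reading of weighted tags can be made concrete in a small sketch. All weights, priors, and document names below are hypothetical, and the tag-independence simplification is this sketch's own assumption, not a claim of the paper:

```python
# Tag weights w[d][t] estimate P(user would describe his need with
# tag t | document d satisfies that need); prior[d] estimates
# P(document d satisfies a randomly arriving need). Numbers invented.
weights = {
    "doc1": {"logic": 0.9, "switching": 0.7},
    "doc2": {"logic": 0.4, "probability": 0.8},
}
prior = {"doc1": 0.5, "doc2": 0.5}

def relevance_numbers(request_tags, floor=0.05):
    """P(doc | request) via Bayes, assuming tags independent given the doc."""
    scores = {}
    for d, w in weights.items():
        p = prior[d]
        for t in request_tags:
            p *= w.get(t, floor)   # small floor for tags not listed
        scores[d] = p
    total = sum(scores.values())   # normalize into a posterior
    return {d: s / total for d, s in scores.items()}

ranked = sorted(relevance_numbers(["logic"]).items(), key=lambda kv: -kv[1])
print(ranked)  # doc1 ranks first for the request tag "logic"
```

Ranking output documents by these posterior scores is the "optimal strategy" the paper alludes to, and the low-scoring tail is what gets trimmed automatically.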
The first problem is handled once we cast the search in its logically correct form; i.e., probabilistically, as described above. When we do that, low-relevance documents are ranked accordingly and hence can be trimmed automatically from the output list. The second and more serious problem grows out of the fact that the document descriptions or the requests are inadequate because they contain insufficient redundancy. But we know that redundancy can be added automatically by the use of statistical association techniques. How can one increase the probability of retrieving a class of documents that includes relevant material not otherwise selected? One obvious method suggests itself: namely, to enlarge upon the initial request by using additional index terms which have a similar or related meaning to those of the given request. An intelligent librarian can always help an individual enlarge upon his request, but a central concern of this Conference relates to the process of mechanizing this procedure. To do this one would need to program a computing machine to make a statistical analysis of index terms so that the machine will "know" which terms are most closely associated with one another and can indicate the most probable direction in which a given request should be enlarged. In 1960 [7], three techniques were analyzed for elaborating in so-called "request space" and a technique for elaborating in so-called "document space." The rationale behind these techniques was to avoid the problem of missing relevant documents in the search process by enlarging upon a request in the most probable direction; i.e., by adding the proper kind of redundancy. This can be done using statistical association techniques. The library computer not only collects the relevant statistics, but is also programmed to reformulate the input requests to increase the probability of selecting relevant documents, as described above.

2 For mathematical details, see [7].
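A minimal sketch of "request space" elaboration, assuming a precomputed term-association table; the terms and scores are invented for illustration, and the 1960 paper's actual techniques differ in detail:

```python
# Hypothetical association scores, as if precomputed from
# co-occurrence statistics over the whole collection.
assoc = {
    "logic":    {"switching": 5.1, "boolean": 4.2, "grammar": 0.3},
    "indexing": {"retrieval": 4.8, "catalog": 3.9, "logic": 1.0},
}

def expand_request(terms, per_term=1, threshold=1.5):
    """Add each request term's strongest associates above a threshold."""
    expanded = set(terms)
    for t in terms:
        ranked = sorted(assoc.get(t, {}).items(), key=lambda kv: -kv[1])
        expanded.update(w for w, s in ranked[:per_term] if s >= threshold)
    return sorted(expanded)

print(expand_request(["logic", "indexing"]))
# ['indexing', 'logic', 'retrieval', 'switching']
```

The enlarged request casts a wider net in exactly the "most probable direction"; the resulting larger retrieved class is then ranked and trimmed by relevance numbers, as the text goes on to note.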
Even though a redundant request implies a larger class of retrieved documents and threatens further to aggravate the problem of retrieving too many documents of low relevance, probabilistic indexing techniques provide relevance numbers so that the enlarged class may be ranked and trimmed. To enlarge upon a request in the most probable direction presupposes that we can justify our elaboration techniques, in the sense that we can show how the use of statistical association techniques does in fact increase the probability of selecting relevant documents. Thus, it would be useful to strengthen the theories (which presently are not always clear) behind some of the current techniques in order to provide logical justification for preferring them (over alternatives); i.e., to have some measures of the goodness of alternative association techniques.

6. Toward a More General Theory of Association Procedures

The relational nature of indexing suggests that statistical association techniques might be extended and refined so as to deal more adequately with a library whose users have heterogeneous backgrounds. For such a library, the relationship of being statistically associated with, which ordinarily holds between pairs of index terms, could be enlarged and interpreted as a three-place relationship. If a library user, U1 (who might have a background in psychology), uses the same request index tag, say Ij, as another user, U2 (who might have a background in physics), then this background information should not be missed. Given a request using tag Ij by a user of type 1, we find the Ik(U1) which has the highest coefficient of association (by some measure) relative to a user of type 1. And, for the physicist (as opposed to the psychologist) who also uses Ij, we find the Ik(U2) which is most highly correlated with Ij, relative to user class of type 2.
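The proposed three-place relationship can be sketched by keying the association table on (user class, tag) pairs rather than on tags alone; the classes, tags, and scores here are hypothetical:

```python
# Association scores conditioned on user class: the same tag
# "stress" pulls in different associates for different users.
assoc_by_class = {
    ("psychology", "stress"): {"anxiety": 4.7, "strain": 1.2},
    ("physics",    "stress"): {"strain": 5.3, "anxiety": 0.4},
}

def best_associate(user_class, tag):
    """Tag most associated with `tag`, relative to this user class."""
    table = assoc_by_class.get((user_class, tag), {})
    return max(table, key=table.get) if table else None

print(best_associate("psychology", "stress"))  # anxiety
print(best_associate("physics", "stress"))     # strain
```

Estimating such class-conditioned tables requires partitioning the co-occurrence statistics by user class, which is why the text argues that three-place relationships demand different estimation methods.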
The suggestion that statistical associations between index tags become three-place instead of two-place relationships implies that we look upon a request as composed of two parts:

(1) Request data proper; i.e., the description of the user's information need — of the gap in his map.
(2) Background data; i.e., the description of the background of the user — the "texture" and terrain of his map.

Given these data, a computer could keep records and learn that a user who describes himself in one particular way most probably belongs to user class 1, whereas another individual who describes his background differently would probably belong to user class 2, etc. Just as a computer can be programmed to index a document and decide the subject category to which it most probably belongs, so also a machine could decide automatically to which class a user most probably belongs. Then there would be separate and distinct correlation relationships for each distinct class of users. This is not merely to suggest that by keeping a "profile" of library users one could program a computer to disseminate automatically, but rather that in order to respond more effectively — either for direct on-line requests or for automatic dissemination — we need to recognize that at least some of the statistical association relationships that we are trying to evaluate by various techniques are not two-place but three-place relationships and, therefore, that they require different methods for their estimation.

7. Concluding Remarks

Although in principle there is no reason that argues against the possibility of building an intelligent artifact which can truly comprehend language, a solution to the library problem does not hinge on such systems. If we make full use of human intelligence we can design an effective library computer. A clear comprehension of the logic of the problem can go a long way toward preventing false starts, trivial experiments, and naive discussion.
The concepts of probability are required to frame properly the logic of the problem because, basically, the transition from a user's request to the resulting retrieved documents must be schematized as an inverse probability inference. Statistical association techniques are required because, like a good detective, the library computer must be designed to use all the clues and inference techniques that are available. If we can think clearly about the logical problems of mechanized documentation, the opportunities offered by a fabulous computer technology can be exploited to our great advantage.

8. References

[1] Bush, V., As we may think, The Atlantic Monthly 176, 101-108 (1945).
[2] Kochen, M., D. M. MacKay, M. Maron, M. Scriven, and L. Uhr, Computers and Comprehension (The RAND Corporation, RM-4065-PR, Apr. 1964).
[3] MacKay, D. M., Operational aspects of some fundamental concepts of human communication, Synthese 9, 182-198 (1954).
[4] MacKay, D. M., The place of "meaning" in the theory of information, pp. 215-255 in E. C. Cherry, ed., Information Theory (Butterworths, London, 1956).
[5] MacKay, D. M., The informational analysis of questions and commands, pp. 469-476 in E. C. Cherry, ed., Information Theory (Butterworth, London, 1961).
[6] MacKay, D. M., Communication and meaning — a functional approach, in Helen Livingston, ed., Cross-Cultural Understanding: Epistemology in Anthropology (Harper and Row, New York, N.Y., 1964).
[7] Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, 216-244 (1960).
[8] Maron, M. E., Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8, 404-417 (1961).
[9] Maron, M. E., Probability and the library problem, Behavioral Sci. 8, 250-256 (1963).
[10] Maron, M. E., On Cybernetics, Information Processing, and Thinking (The RAND Corporation, P-2879, Mar. 1964).

Some Compromises Between Word Grouping and Document Grouping

Lauren B.
Doyle
System Development Corporation, Santa Monica, Calif. 90406

Statistical analysis of the text of document collections has yielded for information retrieval purposes two broad classes of output: word grouping and document grouping. Associative indexing comes under the general heading of word grouping; automatic classification is a kind of document grouping. Document grouping and word grouping, however, can be combined to give a scheme of classification with more attractive features than could be achieved with either document grouping or word grouping alone. A hierarchical grouping program written by Joe H. Ward of Lackland Air Force Base for use in classifying personnel by skill and aptitude turns out to be nearly ideal as a basis for a mixed document-and-word grouping approach. The program will derive four- or five-level hierarchies from key-word lists drawn from 100 documents, will position document numbers or other numbers in the smallest subcategories, and is capable with additional routines of extracting appropriate labels from the key-word lists to describe the categories at all levels of the hierarchy. Additionally, homograph separation occurs as a natural outcome of the program's operation.

1. Introduction

Information retrieval technology in the 1950's was based largely on principles of logic,1 an emphasis which was perhaps a "logical" result of the emphasis on use of computers in information retrieval. Computers are (above all) logical. Then a well-known logician [1]2 said that logic was at least being grossly misapplied or at worst nearly useless in the information retrieval field. Judging by the trend of interest in statistical approaches in general and associative indexing in particular, the 1960's will see information retrieval based more and more on principles of redundancy. This is more appropriate because, as we are often so painfully aware, the literature is quite redundant and not very logical.
Redundancy has the adverse connotations of undue length and repetition. Yet it is these very characteristics that make a statistical approach to text analysis and retrieval both feasible and desirable. Undue length favors a statistical approach because it increases the sample size, and needless to say the world's technical literature is unduly sizable as a sample. Repetition, of course, gives us something to count, without which we would have no statistics; but more important than that, selective repetition by authors can be a highly reliable clue to topic, as recognized by H. P. Luhn [2].

2. Document Grouping

There seem to be two broad uses of redundancy among those who try to employ it as a means of automatically generating an organized structure by which we may have access to the literature; these are document grouping and word grouping. Document grouping was the basis of library classification long before computers, and it is expectable that those of a statistical orientation would try to duplicate by automatic means what the librarian can do intellectually, because similarity of word content in a group of documents implies similarity of topic. Of course, documents or references thereto (titles, etc.) can be grouped 3 in ways other than by word content similarity; as examples, permuted title indexing groups them alphabetically, and citation indexing groups according to author-implanted cues. These approaches to document grouping currently outrun the statistical approach in popularity because, among other things, they are cheaper; neither method requires the entire text of an article to be processed, nor any additional intellectual work beyond that done by the author himself.

1 Mainly the principles of Boolean algebra.
2 Figures in brackets indicate the literature references on p. 24.
3 "Grouped" in the loosest sense, which might mean "ordered" or even "interconnected."
But we value the statistical approach in spite of its current expense, not only because costs are rapidly declining and will inevitably make digital storage of entire documents feasible, but also because it is a whole technology, whose applications to text analysis go beyond what we talk about herein. As one example, statistics can be shown to be a strong right arm for syntactic analysis [3], and perhaps, eventually, for machine translation. This is so because the redundancy in text can manifest itself through the grouping of words, as well as through the grouping of documents.

3. Word Grouping

In my own work I have been preoccupied with word grouping [4]. Others, such as H. E. Stiles [5], have in effect used index-term grouping, which is equivalent to word grouping, as a basis for improving the performance of literature-searching systems. Words or terms can be grouped statistically as a result of their high co-occurrence in the same documents as tags or key words; when co-occurrence is high, as measured by some statistic, we speak of the co-occurring words as being strongly "associated." Both word grouping and document grouping can be seen to spring from the tendencies of many words to co-occur strongly.

Developments in statistical word association are proceeding along two paths. The majority approach is that of Stiles, a modified coordinate indexing in which users formulate search requests and in which the machine acts on those requests in such a way that the retrieved documents contain not only the words specified by the request, but also words which are associated statistically with those in the request. The second approach, still that of a rather small minority, is one in which the computer is used to generate an "association map" as a printout or cathode-ray-tube display.
The best way to visualize the difference between these two approaches is in analogy to the difference between straight machine searching of text and automatic indexing. In machine searching one makes a request, which is fed into the machine as a criterion that the machine can use in searching for relevant references. In automatic indexing, the machine is used not as a searching instrument but as an arranger of references which can be scanned in printout form by the human eye. In associative indexing, by analogy, the first approach involves user specification of what the machine should look for, and the second approach generates a printout or display by which the user himself can search.

[Figure 1. Association map.]

[Figure 2. Hierarchical association map.]

The analogy here might even extend to developmental history. Recent years have seen a shift away from machine searching toward automatic indexing, especially permuted title indexing. We might well be on the point of seeing a shift from machine associational searching to machine associative indexing. I am assuming so, and for this reason have habitually placed my eggs in that basket.

Association maps can take on a bewildering variety of forms. The forms with which I have become most familiar are shown in figures 1 and 2. Figure 1 is a map hand-drawn from computer-generated statistical co-occurrence data, and figure 2 is a "hierarchical map" generated from the same text.
Both of these forms were first discussed in 1961 [6] and both are capable of completely automatic generation from text. The map of figure 1 could be called a "raw association map," in that it faithfully reflects the most strongly co-occurring word pairs in the corpus; the hierarchical map of figure 2 sacrifices strong co-occurrences between words of roughly equal frequency for the sake of better organization. The hierarchical effect is achieved by discriminating against the relating (i.e., linking) of words of more or less equal frequency and by relating words of high frequency 4 to words of lower frequencies in a cascade of categories and subcategories, as shown; since the words of high frequency apply to a larger number of documents, it follows that these would be used to label the larger categories. One can construct ad hoc statistical functions by which to bring about the desired discrimination against co-occurrences between equally frequent words. The most effective one I have found so far is:

F = (2c/b - 1) / [(b/a - 0.35)^2 + 0.03]

where a = the value of the higher frequency, b = the value of the lower frequency, and c = the frequency of co-occurrence of words a and b. The numerator's purpose is to maximize F as documents with tokens of word b as tags or key words approach 100 percent inclusion in the larger set of documents having word a. The denominator maximizes F as the ratio of the two frequencies, b/a, approaches 0.35; such a function would thereupon favor hierarchies having on the average three subcategories per category. The presence of the constant 0.03 in the denominator is to prevent the function from approaching infinity.

4. Disadvantages of Pure Word or Document Grouping

The reason I now search for compromises between word grouping and document grouping is that I have become aware of certain disadvantages of either approach used in a pure way.
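The hierarchical-map linking statistic F of section 3 can be sketched numerically; a minimal illustration, with hypothetical frequency values of my own choosing:

```python
def linking_score(a, b, c):
    """F = (2c/b - 1) / ((b/a - 0.35)**2 + 0.03), where a and b are the
    higher and lower document frequencies of two words and c is their
    frequency of co-occurrence."""
    numerator = 2 * c / b - 1                 # approaches 1 as the b-documents fall wholly inside the a-documents
    denominator = (b / a - 0.35) ** 2 + 0.03  # smallest when b/a is near 0.35
    return numerator / denominator

# Full inclusion (c = b) at the favored frequency ratio b/a = 0.35:
favored = linking_score(100, 35, 35)
# Two equally frequent words are discriminated against, even with full inclusion:
peers = linking_score(100, 100, 100)
print(favored, peers)
```

As the example shows, a fully included word at roughly one-third the frequency of its partner scores far higher than an equally frequent partner, which is exactly the discrimination the hierarchical map requires.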
Pure document grouping, for example, suffers from two weaknesses:

(1) There is no obvious clear-cut way to represent the groups of documents for perusal by literature searchers. Grouping of titles in correspondence to the document groups is not entirely adequate, because the similarities leading to group formation may not be evident, and because a flock of titles may contain too much information to characterize whole groups, leading to cognitive strain for searchers who would like to inspect numerous groups.

(2) The organization of the groups themselves, though potentially achievable automatically, may not be representable in a scheme which can be followed by a searcher.

These faults would not seem important to those who take the viewpoint of Maron [7] and others, which pictures "heuristics in document space" as a means of machine retrieval of closely related documents. These workers would not be inclined to emphasize representation for search by the human eye.

4 By "frequency," here, we mean "number of documents having this word or tag" rather than "number of words." The author, in a previous article [4], has defined this kind of frequency as "prevalence."

Word grouping (association maps, hierarchical maps) has three weaknesses as a pure approach:

(1) Since the basic idea of an index based on word groups is to find word clusters of interest or pertinence, and to proceed from such a cluster to references containing more information about the documents whose co-occurring words caused the cluster, it is important that word maps have document numbers (or other indicators) positioned properly on them. This proves difficult to do reliably by automatic means.

(2) Homographs are a problem in word-grouping techniques. Though statistical separation of homographs has been shown feasible by Stiles [8], it ordinarily would require an additional statistical technique to be used along with whatever is used for the word grouping.
We would like to find a statistical technique from which both word grouping and homograph separation come as a natural consequence.

(3) Though word grouping (particularly the "hierarchical map") suggests organization of something, the literature searcher is given no sense of what it is that has been organized. A map, in order for one to accept it as a meaningful entity, ought to be a map "of something." An organized set of document clusters, if it can be represented in a maplike way, would have much more reality to a searcher because it would be perceived as a map of the document collection.

5. A Procedure Permitting Both Document and Word Grouping

I could not have expected that these grim doubts about either document grouping or word grouping could be cleared up by a single computer program which was used in a field quite remote from document retrieval. However, early in 1963 an article by Ward and Hook [9] came to my attention which described a hierarchical grouping procedure used by the U.S. Air Force in grouping aptitude profiles for personnel assignment. I was fortunate enough to obtain the corresponding Fortran II computer program, which was implemented and run on our Philco 2000. I used this program, in effect, as a document-grouping program.

As a natural outgrowth, perhaps, of my preferred orientation toward word grouping, I found that one can superimpose a highly organized word pattern on the document-grouping pattern which the program generates, and that this superimposed word pattern not only describes the document groups, but also overcomes the three weaknesses of a "pure word-grouping" approach. I do not wish to discuss herein the mathematical principles of the grouping program, which are described well enough in the Ward paper [9].
Adherents of the statistical approach spend much time arguing among themselves as to whether this or that statistical technique is more appropriate, but those who have a chance to compare them [10] often find that the difference in output between one technique and another is not appreciable. Indeed, even if one technique led to substantially different output from that of another, it would be hard to say that one result was right and the other wrong. I have usually found that selection of technique on purely mathematical grounds is appropriate only when there is full and complete understanding of what the technique is supposed to do; otherwise the only sensible thing to do is to base selection of technique on an after-the-fact appraisal of the utility and quality of output. When there is no underlying theory of what it means that a word occurs in text once, twice, thrice, or n times, it is only the naive who would apply "sophisticated" statistical formulae. Insight, on the other hand, might well lead to the choice of a completely ad hoc statistic with no foundation in mathematical theory, as in the case of the hierarchical map shown in figure 2.

Several runs of the Ward program were made, each having 100 12-word lists as input. Each 12-word list can be regarded as a list of index tags or most-frequent content words of one document. The output, then, can be viewed as the organization by similarity of a 100-document library. Three runs will be described herein: one on 100 lists corresponding to reports on German affairs, one on 100 lists corresponding to information retrieval papers, and one on 100 lists comprising 50 each from the German affairs and physics collections.

6. Principle of Operation of Ward's Grouping Procedure

Before presenting the results of these computer runs, it is desirable to give a nonmathematical description of how the program operates. Its objective is to form groups whose members have maximal similarity to each other.
In the runs described above, it begins with 100 ungrouped lists, or, it would be better to say, 100 groups having one member each. Each program "pass" forms one group of two members or more according to any of the following three rules:

(1) Combine one list and another list to form a group of two lists.

(2) Add one list to a group of two or more lists.

(3) Merge two groups of two or more lists.

Note that never more than two entities (lists or groups of lists) are combined on a given pass; therefore any one pass diminishes the total number of groups (remembering that we've designated ungrouped lists as "groups with one member only") by one; and also, therefore, the total number of passes must be n - 1 for a collection of n lists. In other words, the program accepts n lists as input, forms a new group (in accordance with the rules just given) in each of n - 1 passes, and on the (n - 1)th pass forms one large group consisting of all n lists.

There are of course a large number of paths which the program could follow to reach the all-inclusive group at the (n - 1)th pass. For example, for a collection of four lists two possible paths exist if we think of the lists as indistinguishable: (1) form two groups of two each, and merge these to form a group of four; or (2) form a group of two, add a third, and add a fourth. When we introduce combinations, however, i.e., regard the lists as distinguishable and count all possible ways of combining them, we find that the program has 18 possible paths by which to achieve the final group of four. On the first pass it can form any of six possible groups of two. On the second pass it can, for each of the six possible pairs, do three things: (1) group the two ungrouped lists, (2) add one of the ungrouped lists to form a group of three, or (3) add the other of the ungrouped lists to form a group of three. On the third pass all roads lead to Rome, i.e., the final group of four.
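The path counting above generalizes directly: each pass combines one of the k(k - 1)/2 possible pairs of the k groups existing at that moment, so the number of distinguishable paths is the product of these pair counts over all passes. A short sketch:

```python
# Each pass merges one of the C(k, 2) = k(k - 1)/2 possible pairs of the
# k current groups (ungrouped lists counting as one-member groups), so
# the number of distinguishable paths to the final all-inclusive group
# is the product of k(k - 1)/2 over k = n, n - 1, ..., 2.
def grouping_paths(n):
    paths = 1
    for k in range(n, 1, -1):
        paths *= k * (k - 1) // 2
    return paths

print(grouping_paths(4))
# 18: six possible first-pass pairs times three second-pass choices
```

The same product reproduces the counts quoted by Ward for larger collections.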
As the number of items to be grouped increases, the number of possible paths the program is allowed to take increases enormously. According to an earlier report of Ward's [11], for a group of five there are 180 possible paths; for six, 2700; for seven, 56,700; and for eight, 1,587,600. The essence of Ward's grouping procedure is that out of the n!(n - 1)!/2^(n-1) possible paths for n items, it selects some one pathway which brings together the items of greatest similarity the soonest. This selection is not as difficult as it may sound at first hearing. Each of the n - 1 iterations is involved in selecting the total pathway, for on each program pass a group is formed such that the following function is maximized:

F = A0(n - 1) - A1(n1 - 1) - A2(n2 - 1) - C.

In this function, n stands for the size of the group which is a candidate for formation on a given pass. On the first pass n must equal 2. On later passes the upper limit of n is the number of the pass plus one; the lower limit, however, is always 2 except on the final pass, where n must equal the total number of lists. The n1 and n2 are the sizes of the groups to be merged on a given pass, and their values are restricted by the relation n = n1 + n2, with a lower limit of 1 for either or both. A0, A1, and A2 are the corresponding average similarities for the groups, which we define as

A = (sum of x) / [n(n - 1)/2],

n in this case being the group size and x being some measure of the similarity of two of the items (in the case of the word lists used in this study, x was simply the number of words which two lists have in common). The summation of x is over all combinations of the n items taken two at a time. C is an arbitrary constant usually set at the maximum possible A value.

The above function F acts in effect as a threshold, being set at its highest achievable value at the beginning of the first pass; "highest achievable value" means here that only items which are identical in all respects could be formed into groups.
If all n items of a collection were identical to each other, the threshold F need never be lowered. But in that case, of course, there would be no point in forming groups. In a typical collection of complex items, no two of which are identical, the program lowers the value of threshold F until two items are found similar enough to each other to constitute the "most similar pair in the collection." After the first pair is formed, the role of F becomes more complicated, and correspondingly more difficult to describe. For a comprehensive mathematical explanation, one should consult the Ward article [9]. I have described the function to the extent I have only for the benefit of those who might want to construct their own grouping algorithm without having to decipher what in some cases might prove to be unfamiliar mathematical notation.

It will suffice for the purposes of this paper to state that F's role is to select at any given pass that group which has the most satisfying blend of similarity and homogeneity. The Ward program contains an alternative mode in which groups are formed based solely on maximum average similarity; however, my experience with this mode has convinced me that better classification is achieved (for my material, at least) in the mode which maximizes F, rather than the average similarity of the next-to-be-formed group (i.e., A0). Close scrutiny of F will show the reader that a candidate for group formation is penalized to the extent that the average similarity of the new group differs from the average similarities of the component groups. This has the result that on many passes groups are formed whose A0 values are substantially less than the maximum possible on those passes. To put the action of this mode of the program in sociological terms, it tends to "group the nonconformists" rather than to parcel them as individuals into the tightly knit groups of high average similarity.
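The maximum-F mode can be sketched as a greedy agglomeration. This is not Ward's Fortran II program, only a minimal reconstruction from the description above, with x taken as the number of words two lists share and with hypothetical four-word lists as data:

```python
from itertools import combinations

def similarity(a, b):
    # x: the number of words two lists have in common
    return len(set(a) & set(b))

def avg_similarity(group, lists):
    # A = (sum of pairwise x) / [n(n - 1)/2]; taken as 0 for one-member groups
    if len(group) < 2:
        return 0.0
    pairs = list(combinations(group, 2))
    return sum(similarity(lists[i], lists[j]) for i, j in pairs) / len(pairs)

def ward_like_grouping(lists, C):
    # On each pass, merge the two groups (one-member groups included)
    # that maximize F = A0(n - 1) - A1(n1 - 1) - A2(n2 - 1) - C.
    groups = [(i,) for i in range(len(lists))]
    history = []
    while len(groups) > 1:
        def f(pair):
            g1, g2 = pair
            merged = g1 + g2
            return (avg_similarity(merged, lists) * (len(merged) - 1)
                    - avg_similarity(g1, lists) * (len(g1) - 1)
                    - avg_similarity(g2, lists) * (len(g2) - 1)
                    - C)
        g1, g2 = max(combinations(groups, 2), key=f)
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        history.append(sorted(g1 + g2))
    return history

# Hypothetical data: the first two lists are near-duplicates; the last
# two share nothing with anything.
demo = [
    ["tree", "soil", "species", "nut"],
    ["tree", "soil", "species", "fruit"],
    ["plant", "pot", "color", "clay"],
    ["seed", "spice", "vinegar", "blend"],
]
history = ward_like_grouping(demo, C=3.0)
print(history)
# [[0, 1], [2, 3], [0, 1, 2, 3]]
```

Note the "nonconformist" behavior described above: after the tight pair (0, 1) forms, the two dissimilar leftovers are grouped with each other on the second pass rather than being parceled into the tight pair, because the homogeneity penalty in F outweighs the gain in average similarity.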
The practical significance of the grouping procedure described can be better understood if we think about the problems involved in grouping common objects in terms of their attributes. Suppose, for example, that we apply the three rules given at the beginning of this section to forming groups from four objects: a plum, a walnut, a flower pot, and a jar of mustard. Without splitting too many hairs on the question of specifying their attributes, it might seem reasonable to group the walnut and the plum first because they are both small, edible, tree-grown objects; furthermore, even without a knowledge of biology, we suspect that they have many more things in common than we could perceive with the eye.

The next question is what to do on the second pass. There are three things which can be done. One (grouping the flower pot with the plum and the walnut) appears unreasonable, since the flower pot has practically nothing in common with either of the other two. The jar of mustard, however, can either be grouped with the flower pot (because it is a nonmetallic container which just happens to contain mustard), or it can be grouped with the walnut and the plum (because it has in common with them the quality of being partly edible, the edible part being likewise derived primarily from vegetable sources). Which of the above two choices we would want to make would depend on which attributes are of greatest interest to us. For example, if we were running a store we would unquestionably want to group the edibles, whereas if we were in the transportation business we would tend to group jars of mustard with flower pots, because they present fewer problems in handling than the perishable walnuts and plums. Coming now to the world of document retrieval, how would we want to group books about walnuts, plums, flower pots, and jars of mustard?
Of course, a lot depends here on the aspects of these four subjects which are being discussed; for example, plums can be discussed as crops or as plants (under biology or botany). It is to be noted, however, that since jars of mustard and flower pots are finished products, it is somewhat more difficult to think of any book which might treat them in a scientific (i.e., natural science) light, whereas any book "all about walnuts" or "all about plums" would of necessity have to begin with a biological discussion. From a librarian's viewpoint, then, it might be logical to group a book "all about jars of mustard" with similar books under the topic "manufacturing." A book all about flower pots would probably also be found under the "manufacturing" heading, though not specifically in the area of food processing.

Fortunately, in the area of statistical methods of classification, we do not (yet) have to worry about such hard intellectual choices as the above librarian might have to make; at this point we have nothing better than the simple and somewhat comfortable hypothesis that documents containing similar quantities of roughly the same words must be on roughly the same topic. This makes it quite easy for us to decide how we want things to be grouped. In particular, it was easy for me to decide by what criteria I wish to group the 12-word lists (described above): group lists according to the number of words held in common. Let us assume that, based on a word count of books about walnuts, plums, flower pots, and jars of mustard, I have derived 5 the following 12-word lists:

(1) Walnut: Tree, Nut, Hull, Species, Wood, Shell, Lumber, Kernel, Black, Crops, Soil, …
(2) Plum: Fruit, Species, Tree, Plant, Color, Grow, Blossom, Soil, Prune, Pit, Hybrid, …
(3) Clay: Plant, Pot, Mold, Fire, Pottery, Dry, Color, Heat, Horticulture, Home, Flower, …
(4) Mustard: Seed, Bottling, Blend, Spice, Vinegar, Process, Flour, Spread, Flavor, Sandwich, Sharpness, …

5 With some assistance from the Encyclopedia Britannica.

One now notes that lists (1) and (2) have three words in common ("tree," "soil," and "species"), and that lists (2) and (3) have two words in common ("plant" and "color"). List (4) has no words in common with any of the others. The outcome of our grouping procedure would be that the first program pass would group lists (1) and (2). The second pass has no choice but to put list (3) in with (1) and (2), since each of the other two grouping possibilities would involve list (4), which has nothing in common with any other list.

Note that grouping on the basis of "words in common" gives us a grouping which we have already decided (above) was unreasonable on intuitive grounds, namely, to group flower pot with plum and walnut. These sample word lists were fabricated deliberately, not just to illustrate the basic principle by which the lists are grouped, but also to illustrate the apparent weaknesses of the method. We enumerate and discuss these apparent weaknesses in terms of the above sample lists:

A. Word choices can accidentally relate documents on dissimilar topics. Let us suppose that word list (2) had the word "flower" rather than "blossom," and that (with somewhat greater emphasis on the production of prunes) the word "dry" appeared on the list. We would now have the situation in which lists (2) and (3) would have four words in common, leading to the most unlikely initial grouping of all: plum and flower pot. Can we permit such subtle shifts in vocabulary and emphasis to have such drastic effects on the outcome of the classification? As we shall eventually see, such inappropriate groupings become less and less likely as (1) the size of the document collection increases, (2) the topical spectrum narrows, and (3) the amount of information (about each document) which is used in grouping is enlarged, i.e., list length is increased.

B. Ties in number of words in common can lead to instability. Let us assume that lists (2) and (3) were to have three words in common.
Now there is a tie between lists (1) and (3) in how similar they are to list (2). In such a case, which group would be formed first, (2) and (3), or (1) and (2)? Since a computer program, unless suitable provision were made, would have no way to decide this issue except through comparison of similarity as we have defined it, a typical program would simply choose the first pair inspected. In other words, we can affect the program's classification simply by physically rearranging the order in which the lists are input. Such instabilities have actually been observed in the computer runs to be described in this paper, but it is not at all clear that this instability is related in any way to the quality or usefulness of the output. We are perhaps uncomfortable with the thought that such instability could lead to many alternative classifications, and that somehow there ought to be only one organization inherent in the document collection. It remains to be seen whether such a viewpoint is really necessary.

C. Raw lists of words omit semantic information which ought to affect the classification. Two important kinds of information omitted would be homography-resolving information and relationship indicators (showing which words on a list are related to each other and how). An example of both imagined deficiencies is found in the word "plant." On list (2) the word actually refers, in relation to plums, to the verb "to plant." On list (3) the word is a noun, describing what the flower pot is to contain, although as far as the information given on the list is concerned, it could be referring to a "plant which manufactures pottery." It could even have both usages in the text of the parent document.
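The words-in-common criterion and weakness A can both be checked directly against the recoverable words of the four sample lists. A small sketch (the dictionary keys and function name are mine):

```python
# The recoverable words of the four sample lists above.
lists = {
    "walnut": {"tree", "nut", "hull", "species", "wood", "shell",
               "lumber", "kernel", "black", "crops", "soil"},
    "plum": {"fruit", "species", "tree", "plant", "color", "grow",
             "blossom", "soil", "prune", "pit", "hybrid"},
    "clay": {"plant", "pot", "mold", "fire", "pottery", "dry",
             "color", "heat", "horticulture", "home", "flower"},
    "mustard": {"seed", "bottling", "blend", "spice", "vinegar", "process",
                "flour", "spread", "flavor", "sandwich", "sharpness"},
}

def words_in_common(i, j):
    return lists[i] & lists[j]

# Lists (1) and (2) share three words; (2) and (3) share two.
shared_12 = words_in_common("walnut", "plum")   # tree, soil, species
shared_23 = words_in_common("plum", "clay")     # plant, color

# Weakness A: "flower" in place of "blossom," plus "dry," gives the plum
# and clay lists four words in common.
modified_plum = (lists["plum"] - {"blossom"}) | {"flower", "dry"}
shared_after_shift = modified_plum & lists["clay"]
print(sorted(shared_12), sorted(shared_23), sorted(shared_after_shift))
```

A tie of the kind described in weakness B would arise here if the plum-clay overlap grew to three words: a program choosing "the first pair inspected" would then merge either walnut-plum or plum-clay first, depending solely on input order.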
The answers to these arguments (tentative answers, admittedly) are that statistical separation of homographs has been shown to occur [8, 12], and that relationship indicators, however useful they might be to a user consulting a classification scheme, do not contribute enough information to affect the outcome of the classification significantly. From an information theory viewpoint, the bulk of the informational bits are contributed by the choices of the words themselves.

7. Automatic Assignment of Labels to Groups

Four sample word lists have been used in showing the most elementary of the principles of the Ward grouping procedure, as well as the most apparent of its possible deficiencies as applied to grouping of word lists. Given that appropriate groups can be formed by such a program, what more can be done? One question is: if we can derive a classification through such statistical procedures, can we also derive labels for the various groups? The answer is that we can, and the mechanism is shown in figure 3. Six objects (nail, screw, safety pin, belt buckle, poker chip, and circular cam) are pictured along with their six corresponding attribute lists. The purpose of the diagram is to illustrate that words can be drawn automatically from the attribute lists to give adequate descriptions of the groups, i.e., to describe which common attributes have been most influential in leading to the formation of each group.

[Figure 3. Derivation of category labels: six objects and their attribute lists (attributes such as "long," "headed," "pointed," "cylindrical," "hinged," "flat," "circular," and "metallic"), with categories based on common attributes.]

The groups of figure 3 were derived via the same considerations of list similarity that we have already used.
The first program pass groups "nail" and "screw," whose lists have five common attributes. On subsequent passes we must lower threshold F to permit the formation of groups of more and more dissimilarity and heterogeneity. On the second pass "safety pin" and "belt buckle," having four attributes in common, are combined. On the third pass different things happen, depending on whether one uses the maximum-F or the maximum-A0 mode of Ward's program. Since I have chosen to use the maximum-F mode, I shall discuss it in those terms. "Poker chip" and "circular cam," having only two attributes in common, are paired, whereas the formation of the group of four consisting of "nail," "screw," "safety pin," and "belt buckle," with an average of 3.5 attributes in common, is delayed till the fourth pass; the penalty for reduction of homogeneity which formation of the group entails outweighs its lead in average similarity, as may be seen by calculating and comparing values of F for the possible groupings on the third pass. The fifth pass has only one choice, formation of the final group of six.

After the groups are formed, by what rules can we assign labels? Ideally, for any group we would like to select a label which describes all and only the members of that group. Our first-formed pair, "nail" and "screw," have the attributes "long" and "headed," which apply to them alone. Each of the other groups of two has at least one such attribute. (In deciding how to specify attributes, I arbitrarily distinguished between "cylindrical" and "circular," so that the former could be used to pertain to cross section of structural members and the latter to gross form.) The group of four has two attributes, "pointed" and "cylindrical," present on all four lists but not present elsewhere. As we ascend in the hierarchy, we find some tendency for the attributes to be used up as labels for the smallest categories.
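The labeling rule just stated (a "perfect descriptor" applies to all members of a group and to no outside member) can be sketched in a few lines. The attribute lists below are hypothetical stand-ins for those of figure 3, kept consistent with the shared attributes named in the text:

```python
# Hypothetical attribute lists for four of the six objects of figure 3.
attributes = {
    "nail": {"long", "headed", "pointed", "cylindrical", "metallic"},
    "screw": {"long", "headed", "pointed", "cylindrical", "metallic", "grooved"},
    "safety pin": {"hinged", "pointed", "cylindrical", "metallic"},
    "belt buckle": {"hinged", "flat", "metallic"},
}

def perfect_descriptors(group, universe):
    """Attributes carried by every member of the group and by no outsider."""
    members = [universe[name] for name in group]
    outsiders = [universe[name] for name in universe if name not in group]
    on_all_members = set.intersection(*members)
    found_elsewhere = set.union(*outsiders) if outsiders else set()
    return on_all_members - found_elsewhere

print(sorted(perfect_descriptors({"nail", "screw"}, attributes)))
# ['headed', 'long']
```

With these toy lists the rule recovers "long" and "headed" as the labels applying to nail and screw alone, matching the discussion above.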
There is no attribute, therefore, which perfectly describes the group as a whole. The closest we can come to perfection is "metallic," which describes five out of six of the objects. If the number of objects is increased to the point that five or six levels are generated in the hierarchy, we must either increase the number of attributes per object or else accept group descriptors which do not apply to every group member, or which apply to objects which are not part of the group.

Figure 4 shows a closeup view of the grouping pattern involving seven of the 100 12-word lists on German affairs, and even though each of the corresponding reports might be said to have "12 attributes," there are still not many satisfactory choices of labels. The only "perfect descriptor" in figure 4 is the word "toll," which describes the three members of its group and no outside member. The notation alongside each label specifies to what extent, if any, the label is not a perfect descriptor of the group. Thus, "allied" describes only 5 out of 8 of the lists in that group (one member of which is not shown), and also describes an additional list at some remote location in the hierarchy; the total number of "allied" tokens appears outside the parentheses, and the fraction of lists described by "allied" appears within the parentheses.

[Figure 4. Extent to which lists contain group labels.]

[Figure 5. Classification scheme for 100 articles based on application of Ward's grouping program.]

As can be seen in figure 5, which shows the hierarchy (footnote 6) for all 100 reports, there are in general four or five levels; this accounts for part of the difficulty. But there is also less similarity, on the average, between these lists, even with 12 attributes, than there is between the lists in figure 3. From a pragmatic viewpoint, a reasonable degree of imperfection of description may not be a serious deficiency. As is well understood in the document retrieval field, there are explicit index tags for a document and there are implicit tags: tags which might well have been chosen to describe the document but which were not.
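The labeling rule described earlier, namely choosing for each group a word that covers as many members as possible while straying least outside the group, and reporting imperfect coverage in the "allied (5/8)" notation of figure 4, can be sketched as follows. The document word lists here are hypothetical stand-ins, not the actual 12-word lists.

```python
# Hypothetical word lists (document id -> words on its list).
doc_words = {
    1: {"allied", "bonn", "toll"},
    2: {"allied", "berlin", "toll"},
    3: {"bonn", "regime", "toll"},
    4: {"allied", "berlin", "border"},
}

def label_group(group, doc_words):
    """Pick the word covering the most group members, penalizing words
    that also appear on lists outside the group; report coverage as k/n."""
    outside = [d for d in doc_words if d not in group]
    vocab = set().union(*(doc_words[d] for d in group))
    def score(word):
        inside = sum(word in doc_words[d] for d in group)
        strays = sum(word in doc_words[d] for d in outside)
        return (inside, -strays)            # favor coverage, penalize strays
    best = max(vocab, key=score)
    covered = sum(best in doc_words[d] for d in group)
    return best, f"{covered}/{len(group)}"

print(label_group([1, 2, 3], doc_words))    # ('toll', '3/3'): a perfect descriptor
```

With these toy lists, "toll" labels the first group perfectly, mirroring the one "perfect descriptor" of figure 4.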
Implicit descriptors, unfortunately, are one reason why relevant documents are missed in a search, and this is why people are so interested these days in associative indexing. Thus, though the word "allied" pertains to only five out of eight documents in its group, one can sense that for the documents not tagged by "allied," which, as is seen from figure 4, are about the various tensions involving East Germany and Berlin, it is reasonable to regard "allied" as one of the implicit tags for those documents. That we should retrieve documents which are relevant to the term "allied," but which do not actually bear the term as a tag, is the whole point of associative indexing. We must take care, of course, not to stretch the "implicit tag" viewpoint too far.

The other kind of labeling imperfection, that a given tag describes members outside of the group as well as in it, is even less serious, and in fact may be regarded as not an imperfection at all under conditions of adequate system design. In figure 5 some words, such as "Soviet," describe several categories and subcategories in different parts of the hierarchy; an alphabetical index of the hierarchy's labels can permit a thorough search of groups described by "Soviet," if such is desired, and could even reference individual documents. It is in this multiple usage of the same word as a label that we find the homograph-separation power of the Ward grouping procedure.

In the third of the three computer runs enumerated earlier, 50 lists in the field of physics and 50 in the field of German affairs were pooled as input to the program. In each field there was substantial usage of the words "satellite" and "force," which are homographs in the true sense of the word as we proceed from the one field to the other. For "satellite," all of the German affairs items used the word to mean "vassal state of the U.S.S.R."

6 The smallest categories shown generally contain two or three, and seldom more than four, lists.
All of the physics items used it to mean "manmade earth-circling object." The Ward program not only yielded a perfect separation of reports containing the variant meanings of both "satellite" and "force," but also began the 99th pass with two groups of 50 each: pure physics and pure nonphysics.

When one peruses the similarity matrix for all of the lists, however, the clean-cut separation of the two subjects hardly seems miraculous. That half of the matrix which describes similarities between individual physics documents and individual German affairs documents contains mostly zeroes. There is a small percentage of document pairs having a similarity of one; when these are looked up, they turn out to be tagged by either "force" or "satellite." So there is nothing mysterious about statistical separation of homographs. The reports containing the word "force" in the physical sense also just naturally have words in common like "nucleus," "electron," "magnetic," "field," and "charge," and are therefore just as naturally grouped together by the Ward procedure.

The results of the second run, on 100 lists corresponding to documents in information retrieval, were not so satisfactory as the results for the German reports or for the mixed library just described, chiefly because no words adequately described the largest categories (as in the case of the four major categories of figure 5). This result is expectable whenever the subject matter in a document collection is too diverse. Another reason for dissatisfaction is vocabulary. A typical structure from the information retrieval hierarchy is:

[Hierarchy diagram with labels such as "index," "word," "search," "entry," "document," "system," "language," "begin," "order," "retrieval," "abbreviate," "artificial," "symbol," "generation," "English," and "property."]
Alongside of hierarchies containing such crisp words as "Bundestag," "troops," "Khrushchev," "Hungary," and "rearmament," structures such as the above would not seem to shed much light on the organization of the literature in the information retrieval field. I have often contended that the greatest difficulty in retrieving information will be found in information retrieval's own documentation. Nevertheless, even in an area as semantically fuzzy as information retrieval, there is great reason for optimism if statistically processed material is touched up with an appropriate amount of postediting [13].

Earlier in this paper we listed five weaknesses of pure word grouping and pure document grouping. It may be evident from the subsequent discussion that the Ward grouping procedure is one approach which, with further development, offers great promise of overcoming these weaknesses. It permits:

(1) Terse and reasonably accurate labeling of groups of all sizes.
(2) Intricate and meaningful organization of groups in relation to each other.
(3) Optimum positioning of references to individual documents in a network of descriptive words.
(4) Homograph separation and aspect coordination (footnote 7) as natural outcomes of the grouping and labeling procedures.
(5) A scheme or map which is more easily comprehensible as a result of being analogous to something which is, or could be, a physical arrangement of objects.

8. References

[1] Bar-Hillel, Y., Some theoretical aspects of the mechanization of literature searching, U.S. Office of Naval Research Tech. Rept. 3 (Washington, D.C., Apr. 1960).
[2] Luhn, H. P., A statistical approach to mechanized encoding and searching of literary information, IBM J., 309-317 (1957).
[3] Doyle, L. B., The microstatics of text, Information Storage and Retrieval 1, 189-214 (Nov. 1963).
[4] Doyle, L. B., Indexing and abstracting by association, Am. Documentation 13, 378-390 (Oct. 1962).
[5] Stiles, H.
E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271-279 (Apr. 1961).
[6] Doyle, L. B., Semantic road maps for literature searchers, J. Assoc. Comp. Mach. 8, 553-578 (Oct. 1961).
[7] Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, 216-244 (July 1960).
[8] Stiles, H. E., Progress in the use of the association factor in information retrieval, unpublished memorandum, Nov. 15, 1962.
[9] Ward, J. H., Jr., and M. E. Hook, Application of a hierarchical grouping procedure to a problem of grouping profiles, Educ. and Psych. Measurement 23, 69-82 (Spring 1963).
[10] Borko, H., and M. D. Bernick, Automatic document classification: Part II. Additional experiments, J. Assoc. Comp. Mach. 11 (Apr. 1964).
[11] Ward, J. H., Jr., Hierarchical grouping to maximize payoff, WADD-TN-61-29 (Wright Air Development Division, Air Research and Development Command, USAF, Lackland Air Force Base, Texas, Mar. 1961).
[12] Doyle, L. B., Statistical semantics, Information Processing 1962 (Proc. IFIP Congr., Munich, 1962), pp. 335-336 (North Holland Publ. Co., Amsterdam, The Netherlands, 1963).
[13] Doyle, L. B., Expanding the editing function in language data processing (to be published).

7 An alphabetical list of words could easily be generated as part of any system involving the method discussed. Coordination of one term with another (i.e., the Boolean "and") can be incorporated into the system also, through the hierarchy itself or (more marginally) through document number postings in the alphabetical list.

The Interpretation of Word Associations*

Vincent E. Giuliano

Arthur D. Little, Inc.
Cambridge, Mass.
02140

It is argued that it is possible to measure at least two kinds of word associations: "synonymy" associations, which relate words according to likeness of meaning, and "contiguity" associations, which relate words according to probable relationships among their physical designata. Formulas which measure both types of association are developed for content analysis and automatic abstracting. This paper is concerned with possible linguistic interpretations of such word association measures.

1. Introduction

Several of the papers presented at this Conference describe experiments involving the application of machine-computed association measures to solutions of practical problems of documentation; such experiments have also been discussed in the previous literature [1, 2, 3, 4, 10, 11] (footnote 1). This paper is concerned with the interpretation of association measures which relate words to other words. In previous publications it has been mentioned that it may be possible to measure at least two kinds of semantic associations among words, "contiguity association" and "synonymy association" [3, 4, 5]. Procedures for measuring the two kinds of associations are discussed more thoroughly in the present paper.

Some investigators have dealt with words automatically selected out of unedited running text, others with index terms manually assigned to documents, and others yet with contexts which are abstracts, extracts, or other documents. However, despite the differences in the types of vocabulary or context, many of the techniques used for computing associations are basically similar [4, 12]. Almost all of the techniques deal with words and contexts as fundamental units. However, depending on the objectives and inclinations of individual researchers, a word may be of a particular kind, for example, a Uniterm, a descriptor, an index term, a key term, etc.
Likewise, depending on the application of interest, a context may be a document, the index set of a document, an abstract, a paragraph, a sentence, a phrase, a pair of contiguous words, etc. The discussion given here is meant to comprise all cases where the units being associated are drawn from the vocabulary of natural language. However, the discussion is specifically phrased in terms of perhaps the most difficult situation: that which exists when the given raw material is running text and when there are no well-defined criteria either for isolating a vocabulary subset or for selecting units of context.

In dealing with natural language text using a computing machine within the context of a documentation application, semantics is often of paramount importance; in short, it is desirable to have means for dealing by machine with the meanings of words. Basically, one has two choices of strategy available. On the one hand, one may proceed initially to think about and write down certain relationships among words which are felt to be present within natural language and of importance in relating meanings; on the other hand, one can look for such relationships directly within a large body of text at hand. Following the first kind of strategy, the a priori route, many investigators have attempted to model the manner in which words are related semantically by directly creating a thesaurus, simply by writing down relationships of word meaning which seem to be relevant. These association patterns can then be encoded for subsequent computer usage. Several of us at this meeting have taken the second viewpoint: that perhaps the most relevant relationships of meaning pertinent to the automatic processing of a text are inferable from the way the words are set down in the text itself.
This second kind of approach must necessarily be based on certain observations and assumptions about the nature of word relationships which can be measured statistically, and I would like to review a few of these assumptions here.

2. Some Observations and Assumptions

First of all, it may be observed that natural language is used to encode and transmit ideas with fairly high fidelity: a sufficiently large and comprehensive sample of natural language text can contain within it a useful representation of the most germane conceptual relationships employed within a given area of discourse. Naturally, the way in which conceptual relations are represented in text need not at all be in any simple correspondence to the way in which they are represented in human minds, let alone in correspondence with the way objects actually relate to each other in the real world (footnote 2). I wish merely to assert that to a proper decoding device (i.e., an educated human being) a body of text of proper size and composition can be decoded in such a manner as to reveal conceptual relationships unknown previously to the decoder. The text may in some cases offer a fairly complete representation of the concepts and conceptual relationships applicable within some areas of discourse.

A second observation of significance is that conceptual relationships are encoded at least in part by means of the word order and proximity relationships present in text. That is, conveyance of conceptual relationships depends not only on the words used, but also crucially on the order in which these words are set down in text.

*This work has been supported in part by the Decision Sciences Laboratory, ESD, U.S. Air Force Systems Command, under contract No. AF19(628)-3311, ESD-TDR-64-527.
1 Figures in brackets indicate the literature references on p. 32.
To justify the interpretation of statistically computed word association patterns as having semantic significance, it is necessary to go somewhat further and to assume that the word order and proximity relationships in text are often the primary vehicle by means of which conceptual relationships are encoded. The validity of this assumption is in part self-evident, but still it must be taken as a hypothesis whose range of validity is to be established by experiment.

There are in fact at least three ways to view a body of natural language text and, correspondingly, three ways to view association measures computed with respect to that body. The text can be viewed as a closed formal system which represents only itself. In this case computed association measures are descriptive rather than predictive statistics. The same formula applied twice yields the same results, and therefore one can argue about the importance of an association statistic, but hardly about its value. Secondly, one can view a body of text as representing a much larger corpus of text, in the sense of being a sample of that larger body of text. Thus, for example, the text of a Sunday's New York Times can be viewed as a sample of what might be expected in a whole year's worth of the Sunday issue of the same publication. Taking this viewpoint, certain of the statistics descriptive of the sample can be expected to have a predictive value; they can be used to infer patterns likely to be present in the larger population. In this case it becomes meaningful to ask questions relating to sampling, i.e., how well does the corpus represent the parent population? Thirdly, a text can be regarded as representing an encoding of concepts and of conceptual relationships which are of importance to some area of discourse.
Computed association measures are then viewed as being correlates of actual relationships which exist among the concepts which are the designata of language expressions; this is the viewpoint taken in this paper. Moreover, to the extent that practical applications of documentation require recognition of semantic relationships, the utility of computed word associations depends largely on this third kind of interpretation (footnote 3).

I would like to advance the hypothesis that it is possible to obtain at least two types of measurements from text which are under certain conditions interpretable as applying to relationships among the designata of words. The first type of association measure reflects what has long been called contiguity association by psychologists [13]. Roughly speaking, two words are considered to be contiguity-associated if the objects or properties denoted by them are contiguous (have to do with one another) in the real world (or, depending on one's philosophical viewpoint, in man's conceptualization of the real world). Thus "hammer" and "tack" are related in the contiguity sense; so are "hand" and "glove." The connection between "liquid oxygen" and "rocket fuel" is a contiguity one. Strictly speaking, liquid oxygen is not actually rocket fuel, but is commonly used along with the fuel to enable proper combustion. "Subway" and "station" are also contiguity-related, as are "syndicate" and "crime." Contiguity associations therefore need not be logical in any well-defined sense; they include part-whole relations, partial synonymy, cause-effect relations, etc. They frequently are indicative of what documentationalists call facets of words.

The second type of association to be discussed might be called synonymy association. Two words may be regarded to be synonymy-associated (i.e., synonymous) to the extent that they are commonly used to denote the same thing (concept, object, or property).
The position taken in this paper is that under certain conditions measurements which reflect these two specific relations of meaning, contiguity and synonymy, can be based upon counting procedures applied to words and word pairs found within text.

3. Contiguity Association

The basic hypothesis to be considered first is that contiguity association can, under appropriate circumstances (to be examined shortly), be measured in terms of the statistics of co-occurrences of words within contexts of text. For example, if "aircraft" and "pilot" co-occur with a frequency greater than is plausibly explainable on the basis of chance alone, it may be inferred that these co-occurrences are not due to chance, but due to the fact that the words are contiguity-related, i.e., that the concepts designated by "pilot" and "aircraft" in fact have to do with one another.

It should be recognized that there are in fact two interrelated assumptions involved here: the first is that it must be possible to identify contexts in which word co-occurrences reflect contiguity relationships, and the second is that an adequate statistical procedure can be found for combining observations made from many different contexts. Experimentally, these assumptions seem to be valid.

2 It must be recognized that such relationships can be viewed two ways, corresponding to two distinct philosophical viewpoints. On the one hand, one can hold that the relationships of interest appertain among actual physical objects. On the other hand, one can hold that the only meaningful relationships are among conceptual representations of objects. This point is treated further in the paper by Paul Jones presented at this Symposium [13].
3 Comments apropos to this topic may be found in the paper presented at this Conference by Maron [14].
In fact, one of the problems facing any researcher in the area is that there appear to be many different (at least superficially different) ways for selecting contexts and measuring contiguity association, and all of them seem more or less to work.

First of all, there is the question of what constitutes a proper context of co-occurrence. Ideally, such a context would be a natural unit readily isolable out of text which has the property that every word within it is contiguity-related to every other word within it. When running text is given, the situation offers considerable choice (footnote 4). The context "ships have decks" can certainly be said to contiguity-relate the two substantive words within it, while the co-occurrence of two words within the whole of the text of the Encyclopedia Britannica should surprise no one. Proximity in running text therefore seems generally to be required for a contiguity relationship to be asserted.

However, proximity does not guarantee the presence of a direct and meaningful contiguity relationship. Consider the sentence "The contract providing for the delivery of the concrete required to build the west sluice of the dam was signed in red ink yesterday." The sluice of the dam was not signed in red ink, but the contract was! Despite sentences like that just illustrated, and despite a large number of other readily constructable counterexamples (footnote 5), it is fair to assume that substantive words located together or in close proximity in text are in most cases contiguity-related by the context. It is not absurd, as a matter of fact, to hold that any sentence or other coherent passage asserts some contiguity relationship or other (perhaps a complicated or indirect one) among any pair of substantive words contained within it. That is, "red ink" in fact had something to do with the "dam," and the sentence is a statement of what that something was.
In some preliminary experiments performed by the writer and his colleagues and described elsewhere [5], the precise nature of the contexts used to generate association measures for purposes of retrieval of sentences was not found to be crucial. Two types of contexts were used in this work as a basis for determining machine-computed associations: co-occurrence within sentences as basic units of context, and co-occurrence within syntactic subtrees of sentences as units of context. A passage of text 7,000 words long was syntactically analyzed, and word association matrices were prepared on the basis of the two definitions of context. The association patterns obtained using the two definitions of context were somewhat different, and both sets of associations served to enhance recall of relevant sentences in retrieval experiments. Within the limitations of the discriminating power of our experiments, however, we found no basis for asserting that one set of associations was superior to the other.

My own current feeling is that, for running sequential text at least, a good unit of context is a "window" of fixed length, say seven words long, which is progressively moved from one position to the next throughout the text. Thus, if the window length is seven words, every word is regarded to be contextually related to six words on either side of it. This procedure makes all contexts the same length, which enables one to use a much simpler association formula than would be necessary if variable-length contexts were used (footnote 6). Also, for certain kinds of running text, sentence or punctuation boundaries can often best be ignored; the benefits to be gained in relating antecedents to consequents probably far outweigh the penalties of the false connections generated.

At first, the problem of picking an association formula for measurement of contiguity association appears to be even more vexing than that of selecting a unit of context.
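A moving-window context of the kind just described can be sketched as follows. This is a minimal illustration: the sample sentence is invented, and a narrower window than the seven-word one suggested in the text is used so the example stays small.

```python
from collections import Counter

def window_pairs(words, width=7):
    """Count unordered co-occurrence pairs: each word is paired with the
    width-1 words that follow it, which is equivalent to relating every
    word to its neighbors within a moving window of the given width."""
    pairs = Counter()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + width]:
            pairs[frozenset((w, v))] += 1
    return pairs

text = "the pilot flew the aircraft over the sea".split()
counts = window_pairs(text, width=4)
# "pilot" and "aircraft" fall within one four-word window exactly once.
print(counts[frozenset(("pilot", "aircraft"))])   # prints 1
```

Because every context has the same length, the pair counts feed directly into a fixed-denominator association formula, which is the simplification the text points out.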
Goodman and Kruskal have identified over 50 different formulas for measuring association [7]. Each such formula has its own advantages as well as its drawbacks, and, given our present incomplete understanding of the problem of semantic association, it would be premature to suggest any one as ideal (footnote 7). Yet, to be specific, I would like to devote a few paragraphs to the development of a simple measure of contiguity association, one which will turn out to be a version of the formula my colleagues and I have been using in our recent experimental work [5]. It is desirable to develop the explanation from an elementary point of view in order to detail the methodology implicit in using an association measure.

Suppose that one is dealing with a corpus of running text and, for the sake of simplicity, that the contexts to be considered are adjacent word pairs determined by a moving window which is two words in length. Thus, considering the sequence of words A B C D E F G, etc., the first context is the word pair AB, the second is the pair BC, then CD, etc. For an N-word corpus there are N-1 such contextual pairs, and for the moment we will consider the pairs to be ordered; that is, the context W_aW_b is regarded as distinct from W_bW_a.

It is worth reviewing the standard logic of significance testing. One picks a significance level α and computes the probability p(S) that an observed event S could occur under a null hypothesis of chance; if p(S) ≥ α, then the null hypothesis is accepted, i.e., it is decided that the observed event could have happened due to chance alone. If on the other hand p(S) < α, then the null hypothesis is rejected. That is, if p(S) < 0.0001, there is less than one chance in 10,000 that the observed event could happen due to chance alone, and the null hypothesis is therefore rejected. In most practical applications of statistical tests, an alternative hypothesis is accepted instead; for example, the hypothesis that a certain substance causes cancer.

As has been mentioned, the observations to be used for the measurement of contiguity association consist of word frequencies and of word-pair frequencies. An appropriate null hypothesis H0 is that the position of a word in text is determined by chance alone. That is, H0 states that a word W_a is sprinkled through the text f_a times, with the probabilities of word occurrences in adjacent text positions being statistically independent. The alternative hypothesis is the presence of contiguity association (footnote 8).

4 When the contexts are given beforehand and there is no order relationship present among the words within a given context, for example as within a given set of Uniterms assigned to a document, the situation is relatively simple. A reasonable course of action in this case is to assume that any Uniterm assigned to a given document is contiguity-related with each other Uniterm assigned to that document.
5 A pointed but humorous treatment of how one's view of language can be colored by concocted counter-examples is given by Doyle in reference [6], as is an excellent common-sense discussion of the role of statistics in dealing with natural language text.
6 It is shown in an appendix of reference [5] that, for use of the linear transformation method described in this paper, equal lengths of context are required if the Markov process corresponding to the word association transformation is to generate the same word frequency statistics as are present in the original text. A more complete formula which normalizes for context length is discussed in the paper presented by Spiegel and Bennett at this Conference [12].
7 In a previous paper, P. Jones and I pointed out that formulas of a certain class lend themselves to representation in such a way that word association and document retrieval can be described by matrix operations [3]. Moreover, under certain assumptions, these formulas can be computed instantaneously using analog electrical networks [3, 8].
8 A primary difficulty is that the measure C_ab possesses a large variance when one of the numbers f_a, f_b, or f_ab is very small. A good rule of thumb is that the measure is reliable only when each of these numbers is 3 or greater.
Having defined the measurements to be made and having formulated a null hypothesis, the next step is to find a statistical test to determine whether the null hypothesis is sufficient to explain the observed phenomena, these phenomena being the observed word-pair frequencies f_ab. The measure I suggest is a very simple contingency coefficient. If H0 is valid, the probability of the pair W_aW_b being located in any adjacent pair of text positions, say the first and second, is, by statistical independence, p_a·p_b, which equals (f_a·f_b)/N^2. There are N-1 text positions, so that the expected number of pairs W_aW_b on the basis of chance (H0) alone is (f_a·f_b/N^2)(N-1). For long texts, this becomes for all practical purposes:

expected number of pairs assuming H0 = (f_a·f_b)/N.    (1)

However, one also knows f_ab, the actual measured number of pairs W_aW_b, and therefore one can form a contingency coefficient,

C_ab = (observed number of pairs)/(expected number of pairs assuming H0) = (N·f_ab)/(f_a·f_b).    (2)

This coefficient is the proposed measure of contiguity association; it measures the degree of surprise connected with finding f_ab pairs W_aW_b when statistical independence and chance alone would dictate instead finding only (f_a·f_b)/N pairs. A very similar measure can readily be defined for the case when the context-size window is more than two words wide. This measure, incidentally, has its faults as well as advantages, and can be considered to be reliable only for certain ranges of values of f_a, f_b, and f_ab (footnote 8).

For f_a, f_b, and f_ab within the range that makes the measure reliable, there is associated with every value C_ab a probability p(C_ab) that C_ab or a greater value could be observed due to chance alone, i.e., that an observed value ≥ C_ab occurs when the null hypothesis is valid. This probability is extremely small, being in a typical case less than 10^-4 when C_ab = 50 (footnote 9). Say that one has picked a significance level α = 10^-4. Then if the value C_ab ≥ 50, the probability of the observed event assuming the null hypothesis is less than 0.0001, and it is necessary to reject H0 and accept an alternative hypothesis. When W_a and W_b are both substantive words, I propose that an appropriate alternative hypothesis is that one or both of the following events is present: (a) a significant contiguity relationship exists among the concepts denoted by the associated words and this relationship is asserted by the text, or (b) the associated words combine together to denote a new concept not already implied by one of the constituent words, as for example in the case of "hot dog." The distinction between these two kinds of events, incidentally, is often one of degree, and is being studied further (footnote 10).

In practice, it is not necessary to bother with computing probabilities, for they vary monotonically with the value of the statistic: the larger the value of C_ab, the smaller the probability of observing it assuming H0. Instead, one regards the statistic itself to be a measure of "association strength," and one lists word pairs according to decreasing value of this statistic.

Different workers on statistical association methods use different formulas and often give their measures different interpretations. What is important in every case, however, is the existence of an underlying statistical procedure such as that described above. To every value V of an association statistic, be this statistic C_ab or some other, there corresponds a probability of that statistic having a value V′ ≥ V under the H0 assumption of randomness. Generally, the larger the measure V, the smaller this probability and the greater the confidence that the observed event could not be due to chance alone.

9 These values are roughly correct for the sampling distribution of a text of 45,000 running words with which we are currently experimenting.
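Equations (1) and (2), together with the rule of thumb that each frequency be at least 3, combine into a small routine. The frequencies below are invented for illustration; they are not measurements from the 45,000-word corpus mentioned in the text.

```python
def contiguity(f_a, f_b, f_ab, N):
    """Contingency coefficient C_ab = observed / expected pair count,
    where the expected count under the chance hypothesis H0 is
    f_a * f_b / N (eq. 1), so C_ab = N * f_ab / (f_a * f_b) (eq. 2).
    Returns None when any frequency is below the rule-of-thumb
    reliability floor of 3 (the variance is then too large to trust)."""
    if min(f_a, f_b, f_ab) < 3:
        return None
    expected = f_a * f_b / N
    return f_ab / expected

# Hypothetical counts: "aircraft" 40 times, "pilot" 30 times,
# co-occurring 24 times in a 45,000-word corpus; the chance
# expectation (eq. 1) is only about 0.027 pairs.
print(contiguity(40, 30, 24, 45_000))
```

A value this far above 1 would fall well past the α = 10^-4 cutoff discussed above, so the chance hypothesis would be rejected for this pair.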
In fact, if words associated with respect to a given word W are ranked in order of decreasing value of a well-behaved association measure within the framework of a well-defined statistical procedure, these words will actually be ranked in order of increasing probability of the observed co-occurrences being due to chance alone.

4. Synonymy Association

Although universally accepted, synonymy is unfortunately an ill-understood concept. It is nearly impossible to find two words which are precisely identical in meaning. In general, a given object may be named by a number of words or phrases. Not only will some of these names be specific and others more generic, but an object may be named by a term which describes part of it, by another term which describes a whole of which it is a part, or by another term which describes the object in terms of one or more of its properties. For example, in various contexts the same object may be denoted by the following expressions: "the aircraft," "the airplane," "the 707 astrojet," "the jet," "the equipment for this flight," "the common carrier vehicle," "The Sylvia Jane II," "she," and the like.

Questions of what constitutes synonymy and inquiries into the meaning of meaning can very rapidly lead to an endless philosophical quagmire. For the achievement of practical objectives, however, it is necessary to have an operational criterion for synonymy which allows measurements to be made. Interchangeability of usage seems to provide as good a criterion of this type as any I know of. Clearly, two words are perfect synonyms if and only if either one can always be used in place of the other; likewise, partial synonyms can sometimes be used interchangeably.
The basic hypothesis advanced here (and which has been advanced previously by my colleagues and others [3, 11]) is that, in a sufficiently large corpus, many synonymous words are used interchangeably, and that in proper circumstances the extent to which two words are synonymous can therefore be measured by noting the extent to which these two words are used interchangeably in various contexts.

¹⁰ If one or both of the words W_a, W_b are function words, a third possibility exists: the observed association may be due to the presence of a syntactic unit or of a standard syntactic construction.

Ideally, it would be useful to measure interchangeability of usage considering a wide variety of contexts, not only linguistic contexts but also extralinguistic ones involving patterned situations of human behavior. In practice, however, the relationship between behavioral situations and verbal responses is poorly understood and difficult to measure, although it is under continued study by psycholinguists [9]. Most of us present at this Conference have confined ourselves to contexts of written text. But even here the best way to proceed is as yet not understood. At one extreme, interchangeability could be defined rigidly in terms of requiring identical usage in relatively long contexts. For example, suppose that the sentence is selected as the unit of context, and that two words W_a and W_b are regarded as being interchangeable and therefore synonymous when and only when two large sets of sentences exist which are pairwise identical except that the sentences in one set employ W_a whereas the sentences in the other set employ W_b. This definition of interchangeability would lead to uninteresting results, simply because long contexts such as sentences cannot be expected to be repeated so systematically, even in a very large corpus. That is, most sentences are not simple variants of other sentences.
At the other extreme, by regarding two words W_a and W_b to be interchangeable and therefore synonymous if there is some sentence containing W_a which has a word in common with another sentence containing W_b, this definition would make almost any pair of words appear to be synonyms.

As in the case of measuring contiguity association, then, there are fundamental questions as to what are appropriate contexts for comparison of interchangeability and as to what is a correct procedure for measurement of interchangeability. A simple approach, but by no means a unique one, is described in the following paragraphs; this approach closely parallels that described previously for contiguity association.

As in the previous discussion, suppose that one considers contexts to be ordered sequential word pairs as would be measured by a sliding window two words in length.¹¹ Then, to the first order at least, it is possible to hold that interchangeability in these pairwise contexts provides an approximate measure of interchangeability with longer contexts. This thought is developed in the following paragraphs and a measure of synonymy is derived. This measure will then be shown to be closely related to the contiguity measure described earlier.

Let the null hypothesis H_0 be the same as before, that words are sprinkled in text according to their frequencies of occurrence but without regard to position, so that word occurrence probabilities in adjacent text positions are statistically independent. The alternative hypothesis is the presence of synonymy association, and the statistic proposed is different from that discussed previously. Suppose that W_a and W_b are specific words, and let W_i denote an arbitrary word-type found in the text. As before, there are N contexts (pairs) in the text. The statistic to be developed will assign a measure to any two words W_a and W_b depending on the number of contexts in which W_a and W_b are interchangeable.
It would be possible to design a statistic which measures interchangeability in terms of the number of interchangeable contexts shared by W_a and W_b, or in terms of the number of types of such contexts, or in terms of both. The proposed statistic for measuring interchangeability in fact depends on both of these quantities.

To develop the statistic, note that f_ai f_bi / f_i is the ratio of the observed number of ways W_a and W_b can be interchanged in contexts with W_i to the total number of contexts containing W_i. This quantity is therefore an observed interchangeability measure for W_a and W_b with respect to W_i; it reflects frequency of usage of W_i. To obtain an overall observed interchangeability measure, the sum can be formed:

    Observed interchangeability:  R_ab = Σ_i f_ai f_bi / f_i.    (3)

The value of the same interchangeability measure expected under the null hypothesis is obtained by substituting the expected co-occurrence frequencies f_a f_i / N and f_b f_i / N for the observed ones f_ai, f_bi. One then obtains instead of R_ab the sum:

    Expected interchangeability given H_0 = R^0_ab = Σ_i (f_a f_i / N)(f_b f_i / N) / f_i = (f_a f_b / N²) Σ_i f_i = f_a f_b / N,    (4)

since Σ_i f_i = N. Analogous to what was done previously for contiguity association, one can now obtain a contingency measure for synonymy association:

    S_ab = Observed interchangeability / Expected interchangeability given H_0 = R_ab / R^0_ab = N (Σ_i f_ai f_bi / f_i) / (f_a f_b).    (5)

The process of interpreting this measure is similar to that described previously for interpreting the contiguity measure. A high value of this measure corresponds to a low probability of the observed interchanges occurring given the null hypothesis, and leads to rejection of H_0 and acceptance of the alternative hypothesis, the presence of synonymy.

Example: It is instructive to go through a highly simplified example, one that is concocted to show how the above measures work. Consider the corpus consisting of the sentence: The U.S. Army launches rocket missiles while the U.S.
Navy launches jet missiles; however, although the Navy flies jet planes, strangely it is not the case that the Army flies rocket planes.

In this corpus N = 32. It can readily be verified by computing formulas (2) and (5), using the two-word sliding-window procedure with asymmetric contexts described above, that the contiguity matrix C_ab = N f_ab / (f_a f_b) (deleting portions of the matrix of minor interest) is:

              launches  rocket  missiles  jet  flies  planes
    Army          8        .        .      .     8      .
    launches      .        8        .      8     .      .
    rocket        .        .        8      .     .      8
    Navy          8        .        .      .     8      .
    jet           .        .        8      .     .      8
    flies         .        8        .      8     .      .

¹¹ As in the case of contiguity association, the extension of the discussion given here to longer contexts or to symmetric contexts is straightforward.

The corresponding synonymy matrix S_ab = N (Σ_i f_ai f_bi / f_i) / (f_a f_b) is:

              Army  launches  rocket  Navy  jet  flies
    Army        8       .        .      8    .     .
    launches    .       8        .      .    .     8
    rocket      .       .        8      .    8     .
    Navy        8       .        .      8    .     .
    jet         .       .        8      .    8     .
    flies       .       8        .      .    8     .

The pairs of words thus related by the synonymy measure S are (Army, Navy), (launches, flies), (rocket, jet), together with the self-associations (Army, Army), (Navy, Navy), (launches, launches), etc.

5. Matrix Representation

I would like to comment briefly on the relationship between the two proposed statistics, C_ij for contiguity association and S_ij for synonymy association. The relationship can most readily be seen by writing the formulas in matrix notation. Let A be a diagonal matrix with A_ii = 1/f_i, and let F = {f_ij}, C = {C_ij}, and S = {S_ij}. Then formula (2) can be written

    C = N AFA    (6)

and formula (5) can be written¹²

    S = N (AF)² A = N AFAFA.    (7)

AF is a stochastic matrix which can be thought of as corresponding to a Markov process which describes a conditional contiguity transformation in (6). The synonymy measure (7) employs the square of this matrix instead. In other words, the synonymy measure (7) in essence matches the profiles of contiguity strength of different words.
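The toy corpus can be used to check formula (5) mechanically. The sketch below (my own code, not the author's program) takes the context of each word to be its immediate right-hand neighbour, as in the asymmetric two-word window above, and reproduces the value S_ab = 8 shown in the synonymy matrix:

```python
from collections import Counter

def synonymy(tokens, a, b):
    """Synonymy coefficient, formula (5): S_ab = N * (sum_i f_ai f_bi / f_i) / (f_a f_b),
    where f_ai counts occurrences of word a immediately followed by word i."""
    n = len(tokens)
    f = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs[(a, i)] * pairs[(b, i)] / f[i] for i in f)
    return n * total / (f[a] * f[b])

tokens = ("the u.s. army launches rocket missiles while the u.s. navy launches "
          "jet missiles however although the navy flies jet planes strangely "
          "it is not the case that the army flies rocket planes").split()

assert len(tokens) == 32                      # N = 32, as in the text
print(synonymy(tokens, "army", "navy"))       # 8.0
```

The same value 8.0 comes out for (launches, flies), (rocket, jet), and the self-associations, in agreement with the matrix above.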
The argument pursued in the previous section is therefore equivalent to asserting that measuring the interchangeability of words in pairwise contexts (using the measure S) is equivalent to matching their conditional contiguity profiles; a necessary and sufficient condition for a pair of words W_a and W_b to have a high synonymy coefficient S_ab is that words a and b have like profiles of contiguity association with the other words in the corpus.

¹² This expression is valid only when the F matrix is symmetric, i.e., when each context ab is thought of as generating two pairs: ab and ba. Otherwise,

¹³ Current experimental research on statistical association techniques at Arthur D. Little, Inc., includes investigation of the association patterns within a corpus of about 45,000 running words of transcribed speech, within a 10,000-document subcollection of an operational mechanized retrieval system, and within a collection of 45,000 abstracts containing about a million and a half running words of text.

A final comment with respect to retrieval is that higher-order association matrices (AF)ⁿA can also be interpreted as contingency coefficients, and that these matrices can be combined together to obtain association matrices which represent combined contiguity and synonymy measures [3]. In experimental work on retrieval [5], we have used the matrices

    I + AFA + (AF)²A + (AF)³A

as well as

    I + AFA + (AF)²A.

Examples of association profiles computed using the above C_ab and S_ab formulas (or using linear combinations of them) applied to various data collections involving vocabulary sizes of up to 1,000 words have been exhibited and discussed elsewhere [3, 4, 5].¹³ Although a large proportion of the association profiles which have been generated appears to be remarkably good (in the sense of being intuitively plausible), others are equally difficult to interpret.
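The matrix identities (6) and (7) are easy to verify numerically. In the sketch below (my own construction, not the authors' data) a small symmetric pair-frequency matrix F is generated, the frequencies f_i are taken as its row sums, and the matrix form of S is checked against the element-wise formula (5); F is symmetrized as footnote 12 requires:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.integers(1, 5, size=(4, 4)).astype(float)
F = F + F.T                       # footnote 12: the matrix form assumes symmetric F
f = F.sum(axis=1)                 # frequencies f_i, taken as row sums of F
N = f.sum()
A = np.diag(1.0 / f)              # diagonal matrix with A_ii = 1/f_i

C = N * A @ F @ A                 # formula (6): C = N AFA
S = N * A @ F @ A @ F @ A         # formula (7): S = N (AF)^2 A

# AF is row-stochastic: each row of AF sums to one
assert np.allclose((A @ F).sum(axis=1), 1.0)

# check formula (7) against the element-wise definition (5)
a, b = 0, 1
S_direct = N * sum(F[a, i] * F[b, i] / f[i] for i in range(4)) / (f[a] * f[b])
assert np.isclose(S[a, b], S_direct)
```

With symmetric F and diagonal A, both C and S come out symmetric as well, which is what the interpretation of S as a profile-matching measure requires.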
There is little point in exhibiting further examples until carefully controlled experiments to determine the validity of the hypotheses mentioned in this paper are completed. Such experiments are now in progress, and will be reported separately.

6. References

[1] Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, 216-244 (1960).
[2] Doyle, L. B., Indexing and abstracting by association, Am. Documentation 13, 378-390 (1962).
[3] Giuliano, V. E., and P. E. Jones, Linear associative information retrieval, in P. Howerton and D. Weeks, eds., Vistas in Information Handling 1, ch. 2 (Spartan Books, Washington, D.C., 1963).
[4] Giuliano, V. E., Automatic message retrieval by associative techniques, Proc. 1st Congr. Information System Sciences (The Mitre Corporation, 1962).
[5] Arthur D. Little, Inc., Automatic Message Retrieval, Studies for the Design of an English Command and Control Language System, Rept. CACL-3 (ESD-TDR-63-673) (Nov. 1963).
[6] Doyle, L. B., The Microsyntax of Text, Rept. SP-1083 (System Development Corp., Feb. 1963).
[7] Goodman, L. A., and W. H. Kruskal, Measures of association for cross classifications II: Further discussion and references, Am. Statist. Assoc. J. 54, 123-163 (Mar. 1959).
[8] Giuliano, V. E., Analog networks for word association, IEEE Trans. Mil. Elec. MIL-7, 221-234 (Apr.-July 1963).
[9] Saporta, S., and J. R. Bastian, Psycholinguistics (Holt, Rinehart and Winston, New York, N.Y., 1961).
[10] Salton, G., Associative document retrieval techniques using bibliographic information, J. Assoc. Comp. Mach. 10, 440-457 (Oct. 1963).
[11] Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271-279 (1961).
[12] Spiegel, J., and E. M. Bennett, A modified statistical association procedure for automatic document content analysis and retrieval, this volume, p. 47-60.
[13] Jones, P.
E., Historical foundations of research on statistical association techniques for mechanized documentation, this volume, p. 3-8.
[14] Maron, M. E., Mechanized documentation: The logic behind a probabilistic interpretation, this volume, p. 9-13.

The Continuum of Coefficients of Association

J. L. Kuhns*

The Bunker-Ramo Corporation, Canoga Park, Calif. 91304

This paper discusses the classification of various coefficients of association between properties characterizing a collection of items. It is shown that it is useful to define a generalized coefficient of association as the product of a parameter and the deviation of the observed data from expectation assuming the properties are independent. The values of this parameter are given for twelve coefficients of association. The ordering of magnitudes of these coefficients is also given. Among the coefficients discussed are "closeness" measures obtained from the Euclidean distance and rectangular distance formulas, the cosine of the angle between the vector representations of the data, the coefficient of linear correlation, Yule's coefficient of colligation, and the index of independence.

1. Introduction

This paper describes a classification of a certain broad class of coefficients of association among properties which characterize a collection of items. The results are useful for three purposes:

(1) The classification has an intrinsic interest in that it unifies the theory of coefficients of association and illustrates the several points of view from which they arise;
(2) the classification admits of a generalization, thus allowing the invention of new coefficients;
(3) in application, the classification simplifies the problem of selecting a suitable coefficient for a particular purpose.

2. What Is a Coefficient of Association?

Let us consider the association of two properties. What do we mean by this?
We observe the phenomenon of association by noting how the properties apply jointly and separately to a collection of individuals. Before going further let us show the pertinence of this to the field of documentation.

Example 1. Given a collection of documents (the individuals), then the classification of a document under a particular index term can be considered to be a property of the document. Thus we may want to study the association between the properties "classification under the subject term 'Aerodynamics'" and "classification under the subject term 'Biology'." Such an association can then be used to induce an association between the index terms themselves and consequently be used as a tool for associative retrieval. A part of this process is, of course, the answering of such questions as: Is "Biology" more strongly associated with "Aerodynamics" than "Computers" with "Aerodynamics"? Such applications are discussed in detail in references [1]¹ and [2].

Example 2. Given a collection of index terms (the individuals), then the classification of a particular document under an index term can be considered to be a property of the index term. Thus we may want to study the association between the properties "applicability to document 1" and "applicability to document 2." Such an association can then be used to induce an association between the documents themselves and, as in example 1, be used for associative retrieval.

Other areas of application such as storage of documents, redesign of index systems, and organization of index files stem from these two examples.

* Present address: The RAND Corp., Santa Monica, Calif., 90406.
¹ Figures in brackets indicate the literature references on p. 39.
² This is not recommended as an evaluation procedure except under highly special conditions. The reason is, of course, that the procedure does not take into account the value of the information to the user. See [4].

Example 3.
The sentences of a document can be considered to be a collection of individuals. An automatic abstracting (extracting) procedure can then be interpreted as defining a property of sentences by the fact of its selection or nonselection of a sentence. Reference [3] describes how the association of two such properties (selection procedures) can be used as an evaluation of automatic abstracting techniques.

Example 4. Given a collection of documents (the individuals), then the association between the properties of being retrieved in response to a given request and of being relevant to the information need that produced the request can be used to give a comparative evaluation of the effectiveness of two retrieval systems under certain normative conditions. An example of an evaluation of this kind is given in reference [1].²

We now introduce some terminology to discuss the common features of these examples. Let the collection of individuals be N in number and designated by a_1, a_2, . . ., a_N. Let 'A' and 'B' denote the two properties. The four combinations of properties (A and B, A and not-B, B and not-A, not-A and not-B), having numbers of individuals x, u, v, y, respectively, uniquely categorize the individuals. We use n_1 to indicate the number of A's and n_2 to indicate the number of B's.

There are four well-known methods to represent such data.

Method 1. Tabular Form (fig. 1):

                B                not-B
    A           x                u = n_1 − x              n_1
    not-A       v = n_2 − x      y = N − n_1 − n_2 + x    N − n_1
                n_2              N − n_2                  N

This shows the number in each classification together with the adjoined row and column sums. The "cell" numbers in terms of x, n_1, n_2, N are also shown.

Method 2. N-dimensional vectors, or points in N-dimensional space. Each property is represented by a vector of N components: the kth component is unity if a_k has the property and is zero otherwise; for example,

            a_1   a_2   . . .
    A        1     0
    B        1     1

Method 3. Venn Diagram (fig. 2). Each individual is represented by a point in the rectangle.
The properties are represented by (possibly overlapping) regions and therefore display the four categories.

Method 4. Mass Distribution in the Plane (fig. 3). The four categories are represented by the vertices of the unit square: (0, 0) is not-A and not-B, (1, 0) is A and not-B, (1, 1) is A and B, and (0, 1) is not-A and B. The points are assigned masses y, u, x, v, respectively.

The problem is now to create from these data a measure of association between A and B. The rules of the game are to use only the numbers x, y, u, v, and not the meanings of the predicates 'A' and 'B'.

Now, before saying what the coefficient of association between A and B is, it is necessary to define what we mean by saying A and B are unassociated, i.e., independent. This is the logically prior consideration. The meaning of independence can be expressed in terms of the (logically) more primitive notion of probability. Suppose that we wish to bet that an individual of the collection has the property A given that it has the property B, and that we have knowledge of the numbers x, y, u, v (or the equivalent x, n_1, n_2, N). The betting quotient we offer (ratio of amount offered to the total stake) we will designate by P(A|B). If we omit the condition that the individual has the property B, the quotient is designated by P(A). Now, if the information that the individual has the property B is quite irrelevant for our choice of betting quotient, i.e.,

    P(A|B) = P(A),    (1)

then we say B is independent of A. It can be shown that for the betting quotient to be fair³ we must have

    P(A) = n_1/N    (2)

and

    P(A|B) = x/n_2.    (3)

The relation (1) is thus the case if and only if

    x = n_1 n_2 / N.    (4)

This is called the independence value of x. The excess of x over its independence value is what will interest us, namely,

    δ(A, B) = x − n_1 n_2 / N.    (5)

It can be seen from this that δ may have positive and negative values.
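The deviation δ of (5) amounts to one line of arithmetic. The sketch below (illustrative numbers of my own choosing) also checks the sign identities that follow from the cell formulas of figure 1:

```python
def delta(x, n1, n2, N):
    """Deviation from independence, eq. (5): delta(A, B) = x - n1*n2/N."""
    return x - n1 * n2 / N

# 100 documents; 20 indexed under A, 30 under B, 10 under both
x, n1, n2, N = 10, 20, 30, 100
d = delta(x, n1, n2, N)                       # 10 - 6 = 4: a positive association

# the same excess appears, with alternating sign, in the other cells
assert delta(n1 - x, n1, N - n2, N) == -d     # delta(A, not-B) = -delta(A, B)
assert delta(n2 - x, N - n1, n2, N) == -d     # delta(not-A, B) = -delta(A, B)
```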
If N, n_1, n_2 are fixed, then the largest and smallest values of δ are attained at the largest and smallest values of x. The following inequality gives these values:⁴

    min (n_1, n_2) ≥ x ≥ max (0, n_1 + n_2 − N).    (6)

³ The notion of probability used here is that of a theory of degree of confirmation, and in particular the theory of a direct inductive inference as described in reference [5], sec. 94.
⁴ We use min (a, b) to indicate the smaller of the numbers a and b, max (a, b) to indicate the larger.

We note that in the four examples discussed and, indeed, in most applications in documentation, the situation n_1 + n_2 ≤ N will be the case; thus the smallest possible value of x will be zero.

Yule [6] has pointed out the importance of δ(A, B) for the theory of coefficients of association. He has shown that this quantity measures the excess over independence in all four categories in the sense that if we did the similar calculations for the negations of the properties we would get

    δ(A, B) = δ(not-A, not-B) = −δ(A, not-B) = −δ(not-A, B).    (7)

Also, δ is symmetric, i.e.,

    δ(A, B) = δ(B, A).    (8)

Following Yule, we say that A and B are associated more or less according to the size of δ(A, B), and consequently the measure of association should vary as δ(A, B). This paper will show, through an examination of various coefficients of association, that the coefficients are comprised in the general form

    C_α(A, B) = δ(A, B) / α    (9)

and hence specified by the value of a parameter α. The values of α will be given for each coefficient and ordered according to magnitude. The result is a "spectrum" of coefficients of association. Apparently intermediate values could be used as well, hence the title "continuum" of coefficients. For example, we will show that possible values of α are min (n_1, n_2), max (n_1, n_2), and intermediate values given by the arithmetic and geometric means of n_1 and n_2.
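The "spectrum" of eq. (9) can be sketched by evaluating δ/α for the values of α just listed (a hedged illustration with numbers of my own choosing; the paper's full inventory assigns a specific α to each named coefficient):

```python
import math

def coefficient(x, n1, n2, N, alpha):
    """Generalized coefficient of association, eq. (9): C_alpha = delta / alpha."""
    return (x - n1 * n2 / N) / alpha

x, n1, n2, N = 10, 20, 30, 100
alphas = {
    "min":        min(n1, n2),
    "geometric":  math.sqrt(n1 * n2),
    "arithmetic": (n1 + n2) / 2,
    "max":        max(n1, n2),
}
spectrum = {name: coefficient(x, n1, n2, N, a) for name, a in alphas.items()}

# since min <= geometric mean <= arithmetic mean <= max, the coefficients
# for a positive delta come out in the opposite order
assert spectrum["min"] >= spectrum["geometric"] >= spectrum["arithmetic"] >= spectrum["max"]
```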
We will also show that if n_1 + n_2 ≤ N/2, then the range

    N/2 ≥ α ≥ n_1 n_2 / N    (10)

absorbs all the coefficients examined.

3. The Coefficients

In this section we will make an inventory of some coefficients of association that all have the property of vanishing when δ(A, B) is zero. These coefficients will also have the property of symmetry with respect to A and B.

3.1. Separation

In the Venn diagram (fig. 2) it can be seen that the area of the region given by A and not-B plus B and not-A measures in some way the separation between A and B. This area relative to N is given by (u + v)/N = (n_1 + n_2 − 2x)/N.

1. Correlation of Attributes

The problem of measuring the degree of association or correlation between attributes is an old one and has been discussed by several investigators (Yule [1],¹ Steffenson [2], Goodman and Kruskal [3], [4]). Yule [1] lists several basic properties that any "legitimate" coefficient of association between attributes should be expected to have. For example, he recommends that it should (1) vanish when attributes A and B are (statistically) independent; (2) be a maximum when A implies, is implied by, or is equivalent to B; (3) be a minimum when A implies, is implied by, or is equivalent to not-B; and (4) have a simple range of values, say from −1 to 1.

For reasons of conceptual and notational simplicity, the development of the results of this paper will be in terms of events rather than of attributes or properties of things. This is theoretically justifiable since attributes and events are in one-to-one correspondence. First, because in logic sets are defined intensionally as a collection of all things with a particular property; and second, because in probability theory events are defined as subsets of a probability space. For example, the event "x is green" corresponds to the set "all green things" which, in turn, corresponds to the property "greenness."
As will be shown, the desiderata of Yule are generally met by the correlation coefficient for events discussed here. Hence, the event correlation coefficient can be regarded as "legitimate" in the sense of Yule.

2. Classical Correlation Coefficient for Random Variables

Let X and Y be random variables with expectations E(X) and E(Y), standard deviations D(X) and D(Y), covariance C(X, Y), and correlation R(X, Y). Then, by the classical definition,

    R(X, Y) = C(X, Y) / [D(X) D(Y)] = [E(XY) − E(X)E(Y)] / {[E(X²) − E²(X)]^(1/2) [E(Y²) − E²(Y)]^(1/2)}.    (2.1)

The random variables X and Y are said to be uncorrelated provided R(X, Y) = 0, and to be independent provided P(X ∈ A and Y ∈ B) = P(X ∈ A)P(Y ∈ B) for all sets A and B. From correlation theory, the following properties are well known (see Parzen [5]):

    If X and Y are independent, then R(X, Y) = 0.    (2.2)
    If Y = X, then R(X, Y) = 1.    (2.3)
    If Y = −X, then R(X, Y) = −1.    (2.4)
    |R(X, Y)| ≤ 1.    (2.5)

3. Correlation Coefficient for Events

Let A and B be sets (corresponding to events) with complements Ā and B̄, union A∪B, intersection A∩B, and probabilities P(A) and P(B).

* Present address: System Development Corp., Santa Monica, Calif., 90406.
¹ Figures in brackets indicate the literature references on p. 44.

It is desired to define a correlation coefficient R(A, B) for events A and B that will be analogous to the classical correlation coefficient R(X, Y) for random variables X and Y. Heuristically, this is suggested by formally mapping the algebra of random variables onto the algebra of events by means of the transformation:

    Replace X by A
    Replace XY by A∩B
    Replace E(·) by P(·).

Then by strict formalism, since X² maps into A∩A = A, it would follow from the definitions of variance, standard deviation, and covariance that

    V(A) = P(A) − P²(A) = P(A)[1 − P(A)] = P(A)P(Ā)
    D(A) = [P(A) − P²(A)]^(1/2)
    C(A, B) = P(A∩B) − P(A)P(B).
With the appropriate substitutions, (2.1) becomes the symmetric function

    R(A, B) = C(A, B) / [D(A) D(B)] = [P(A∩B) − P(A)P(B)] / {[P(A) − P²(A)]^(1/2) [P(B) − P²(B)]^(1/2)},    (3.1)

which could be the heuristic definition of the correlation coefficient between events A and B. The appropriateness of the above formal mapping is supported by the fact that the well-known Cauchy-Schwarz inequality from probability theory,

    E²(XY) ≤ E(X²)E(Y²),

becomes

    P²(A∩B) ≤ P(A)P(B),

which is a valid theorem, since A∩B ⊆ A and A∩B ⊆ B imply P(A∩B) ≤ P(A) and P(A∩B) ≤ P(B).

If R(A, B) is to be a measure of the correlation of two events, then, like R(X, Y), it should satisfy:

    Property 1: If A and B are independent, then R(A, B) = 0.
    Property 2: If B = A, then R(A, B) = 1.
    Property 3: If B = Ā, then R(A, B) = −1.
    Property 4: |R(A, B)| ≤ 1.

The validity of these formally constructed propositions will now be examined.

4. Properties of the Event Correlation Coefficient

We shall now prove properties 1, 2, and 3 of the event correlation coefficient. It will be helpful to interpret R(A, B) in terms of the set-theoretic relations of A and B, for example B ⊆ A, B = A, B ⊆ Ā, B = ∅ (null set), and B = S (event space). To do this we shall express R(A, B) as a function of the odds on A and the odds on B rather than as a function of the probabilities P(A) = a, P(B) = b, and P(A∩B) = c. Denote the odds on A by

    O(A) = P(A)/P(Ā) = P(A)/[1 − P(A)] = a/(1 − a).

Note that O(Ā) = O⁻¹(A).

First, when B is a subset of A we get

    if B ⊆ A, then R(A, B) = [O(Ā)O(B)]^(1/2),    (4.1)

since here P(A∩B) = b, so that

    R(A, B) = (b − ab) / [a(1 − a)b(1 − b)]^(1/2) = [(1 − a)/a]^(1/2) [b/(1 − b)]^(1/2) = [O(Ā)O(B)]^(1/2).

As a corollary, when B equals A we get property 2:

    if B = A, then R(A, B) = 1.    (4.2)

Second, when B is a subset of Ā (i.e., A and B are disjoint) we get

    if B ⊆ Ā, then R(A, B) = −[O(A)O(B)]^(1/2),    (4.3)

since here P(A∩B) = 0, so that

    R(A, B) = −ab / [a(1 − a)b(1 − b)]^(1/2) = −[a/(1 − a)]^(1/2) [b/(1 − b)]^(1/2) = −[O(A)O(B)]^(1/2).

As a corollary, when B equals Ā we get property 3:

    if B = Ā, then R(A, B) = −1.
    (4.4)

Next, what are the values of R(A, B) when B = ∅ and when B = S? Direct substitution in (3.1) yields an indeterminate form in each case. Instead, we shall use the facts that O(∅) = 0 and O(S) = [O(∅)]⁻¹ = ∞.

First, if B is the null set, then ∅ ⊆ A. So we get

    if B = ∅, then R(A, B) = 0,    (4.5)

since from (4.1)

    R(A, ∅) = [O(Ā)O(∅)]^(1/2) = [O(Ā) · 0]^(1/2) = 0.

Second, if B is the universal set, then A ⊆ S. So we get

    if B = S, then R(A, B) = 0,    (4.6)

since again from (4.1)

    R(A, S) = R(S, A) = [O(S̄)O(A)]^(1/2) = [O(∅)O(A)]^(1/2) = [0 · O(A)]^(1/2) = 0.

Finally, if A and B are independent, then c = P(A∩B) = P(A)P(B) = ab. Hence, we get property 1:

    if A and B are independent, then R(A, B) = 0,    (4.7)

since from (3.1)

    R(A, B) = (ab − ab) / [a(1 − a)b(1 − b)]^(1/2) = 0.

It is interesting to observe that we can also get the purely set-theoretic properties (4.5) and (4.6) as corollaries to the non-set-theoretic property (4.7); for A and ∅ are independent because P(A∩∅) = P(A)P(∅), and A and S are independent because P(A∩S) = P(A)P(S).

The proof of property 4 can be given algebraically also, but it is indirect and lengthy. From the fact that the proofs of properties 1, 2, and 3 are so easy, it should be suspected that something basic is involved and that some fundamental relation exists which will yield properties 1 through 4 directly and immediately. In section 5, we shall show this to be the case.

5. Fundamental Relation Between the Two Correlation Coefficients

We will use indicator functions to expose the fundamental relation between the classical correlation coefficient R(X, Y) for random variables and the coefficient R(A, B) for events. The indicator function of a set A that is in the range of a random variable Z is defined as the random variable

    I_Z(A) = 1 if Z ∈ A, and I_Z(A) = 0 if Z ∉ A,    (5.1)

which can be seen to have the following properties (see Parzen [5]):

    I_Z(A∩B) = I_Z(A) I_Z(B)    (5.2)
    I_Z(Ā) = 1 − I_Z(A)    (5.3)

The justification of the heuristic mapping that led to the correlation coefficient between events A and B will now be given.
Let X = I_Z(A) and Y = I_Z(B). Then

    E(X) = P(A) and E(Y) = P(B),    (5.4)

since from (5.1)

    E(X) = E[I_Z(A)] = 1 · P[I_Z(A) = 1] + 0 · P[I_Z(A) = 0] = P(Z ∈ A) = P(A).

Also,

    E(XY) = P(A∩B),    (5.5)

since from (5.2)

    E(XY) = E[I_Z(A) I_Z(B)] = E[I_Z(A∩B)] = P(Z ∈ A∩B) = P(A∩B).

From (5.5) we get as corollaries

    E(X²) = P(A) and E(Y²) = P(B).    (5.6)

Hence, we will define the correlation coefficient R(A, B) between events A and B to be

    R(A, B) = R[I_Z(A), I_Z(B)],    (5.7)

which is a special case of (2.1). Thus, substituting (5.4), (5.5), and (5.6) in (2.1), we get

    R(A, B) = [P(A∩B) − P(A)P(B)] / {[P(A) − P²(A)]^(1/2) [P(B) − P²(B)]^(1/2)},    (5.8)

which justifies the heuristic definition (3.1).

From (5.2) it can be seen that I_Z(A) and I_Z(B) are independent if, and only if, A and B are independent; so that independence and uncorrelatedness are equivalent for indicator functions. Hence we get property 1:

    if A and B are independent, then R(A, B) = 0.

Similarly, property 2,

    if B = A, then R(A, B) = 1,

follows immediately from (2.3); and property 3,

    if B = Ā, then R(A, B) = −1,

follows immediately from (2.4). Also, it follows immediately from (2.5) and (5.7) that property 4 holds:

    |R(A, B)| ≤ 1.

Therefore, properties 1 through 4 are satisfied by the event correlation coefficient R(A, B). Finally, it is fitting that the probabilistic interpretation P²(A∩B) ≤ P(A)P(B) of the Cauchy-Schwarz inequality E²(XY) ≤ E(X²)E(Y²) follows directly from the use of (5.4) and (5.5).

6. Pearson Mean Square Contingency

Of course, it is possible to show that R(A, B) is a special case of R(X, Y) without making use of the interesting properties of indicator functions. By direct calculation, when both X and Y assume two discrete values, corresponding to A and Ā for X and to B and B̄ for Y, R(X, Y) reduces (see Cramér [6], p. 279) to

    R(X, Y) = (p_11 p_22 − p_12 p_21) / (p_1. p_2. p_.1 p_.2)^(1/2),    (6.1)

whose right side can be rewritten in our notation as

    [P(A∩B) − P(A)P(B)] / {[P(A) − P²(A)]^(1/2) [P(B) − P²(B)]^(1/2)} = R(A, B).
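Relation (5.7) is easy to confirm numerically: the ordinary correlation of the 0/1 indicator vectors of two subsets equals the event coefficient (5.8). A small sketch over an equiprobable eight-point sample space (the events are my own choices for illustration):

```python
import numpy as np

# sample space S = {0, ..., 7}, equiprobable; events as subsets of S
S = np.arange(8)
IA = np.isin(S, [0, 1, 2, 3]).astype(float)   # indicator vector I_Z(A)
IB = np.isin(S, [1, 2, 3, 4]).astype(float)   # indicator vector I_Z(B)

# classical correlation of the indicator random variables, R[I_Z(A), I_Z(B)]
r_classical = np.corrcoef(IA, IB)[0, 1]

# event correlation (5.8), computed from the probabilities
pA, pB, pAB = IA.mean(), IB.mean(), (IA * IB).mean()
r_event = (pAB - pA * pB) / np.sqrt((pA - pA**2) * (pB - pB**2))

assert np.isclose(r_classical, r_event)       # relation (5.7)
```

Here P(A) = P(B) = 1/2 and P(A∩B) = 3/8, so both sides equal 0.5; replacing IB with 1 − IA gives −1, as property 3 requires.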
Moreover, it follows that R(A, B) is related to Pearson's mean square contingency

φ² = Σᵢ Σₖ (pᵢₖ − pᵢ.p.ₖ)² / (pᵢ.p.ₖ),

where the pᵢₖ are given by the contingency table for m = n = 2:

        B      B̄
A       p₁₁    p₁₂    p₁.
Ā       p₂₁    p₂₂    p₂.
        p.₁    p.₂

since (see Cramér [6], p. 282)

φ² = (p₁₁p₂₂ − p₁₂p₂₁)² / (p₁.p₂.p.₁p.₂),    (6.2)

and hence φ² = R²(A, B).

7. Estimation of Event Correlation Coefficient

The estimation of the event correlation coefficient R(A, B) for two events A and B hinges on estimating three probabilities: P(A), P(B), and P(A ∩ B). One approach to the estimation of these probabilities is through their corresponding relative frequencies fᵢ(A), fⱼ(B), and f_k(A ∩ B), where i, j, and k are the respective sample sizes. It is to be noted that i, j, and k are not necessarily equal since, in general, there will be differences in the sampling procedures for the three events A, B, and A ∩ B. The sample event correlation coefficient will be defined by

r(A, B) = [f_k(A ∩ B) − fᵢ(A)fⱼ(B)] / {[fᵢ(A) − fᵢ²(A)]^(1/2) [fⱼ(B) − fⱼ²(B)]^(1/2)},

which can be computed readily once the estimates fᵢ(A), fⱼ(B), and f_k(A ∩ B) are obtained from physical observation. The accuracy of the sample value r(A, B) as an estimate of the unknown parameter R(A, B) can be determined by the application of standard statistical techniques from the theory of estimation of parameters. Finally, it should be noted that if f(A), f(B), and f(A ∪ B) are known, then the unobserved f(A ∩ B) can be computed from

f(A ∩ B) = f(A) + f(B) − f(A ∪ B).

8. References

[1] Yule, G. U., On measuring association between attributes, J. Roy. Statist. Soc. 75, 579-642 (1912).
[2] Steffensen, J. F., Deux problèmes du calcul des probabilités, Ann. Inst. Henri Poincaré 3, 319-344 (1932-3).
[3] Goodman, L., and W. Kruskal, Measures of association for cross classifications, J. Am. Statist. Assoc. 49, 732-764 (1954).
[4] Goodman, L., and W. Kruskal, Measures of association for cross classifications. II: Further discussion and references, J. Am. Statist. Assoc. 54, 123-163 (1959).
[5] Parzen, E., Modern Probability Theory and Its Applications (John Wiley & Sons, New York, N.Y., 1960).
[6] Cramér, H., Mathematical Methods of Statistics (Princeton Univ. Press, Princeton, N.J., 1946).

2. Models and Methods

A Modified Statistical Association Procedure for Automatic Document Content Analysis and Retrieval

Joseph Spiegel and Edward Bennett

The Mitre Corporation
Bedford, Mass. 01730

The very large number of documents, reports, and the like being sponsored and produced tends to overwhelm our indexing resources. The result is relatively poor retrieval, since retrievals from a library of poorly indexed items are, at best, haphazard. Bearing this problem in mind, we have been designing our system to operate without the necessity for indexed documents, although it is capable of operating with them if such are available. The system is to be fully automatic, i.e., able to accept the full textual form of the document (in machine-readable form) and to retrieve from its store those items statistically associated with the query. Let us make it clear that this has not been achieved. However, we have completed some promising steps, enough to indicate those paths that might lead to a successful system. The path we have started investigating uses a statistical association technique whereby word/word matrix cell weights are modified by means of a redundancy measure derived from statistical information theory. The result of this modification is to change the cell weights of all terms in accordance with their corpus-bounded redundancy. Thus, some terms are elevated in association strength while others are downgraded. In addition to reporting on the influence of redundancy on word associations, the retrieval program will be described. The precise flow of operations within the computer system will be given, together with the rationale for such flow.
In addition, we will describe some of the validating work on machine versus manual retrieval capability currently in progress.

1. Introduction

Much has been said about developing an automated library where, if one is to believe the visionaries, a simple verbal statement of a query, introduced into some machine (usually specified as a computer), will result at best in a direct and correct answer, or at least in a small list of references all highly relevant to the query. Although we are unboundedly enthusiastic about the need for such a system, we believe there are some theoretical and engineering problems to be overcome before its realization. In view of both the need and the problems, we have tried to design an automatic retrieval system that involves only a minimal number of constraints, these constraints largely introduced by the engineering limitations of the machinery involved rather than by any preset theoretical position concerning the nature of language or documentation. In essence, we sought a system that could accept as an input any type of material as long as it was in a form compatible with machine requirements. To be more specific, the method or system should be able to accept and analyze large amounts of natural message content relating to a wide range of topics. In responding to retrieval search demands, the technique should be able to draw upon its total resource of stored information, not only to select an appropriate response but, more important, to improve its program for interpreting such demands and responding to them. The technique should be able to improve with experience. The system should be able to code the content from messages in a fully mechanical manner. It also should be able to relate new content to other relevant content already in memory.
From its reservoir of information, it should be able to elicit the necessary clues as to which documents are relevant to each other, especially in response to a message that is also a query. For such a system to be reasonably adaptable, it also should be able to perform these functions without an index, grammar book, dictionary, thesaurus, or other formal constraint. What this suggested was a system for automatically content-coding various statistical properties of documents and then using these codes for automatic retrieval or, for that matter, document routing. The statistical approach applies the most elementary and primitive relation among message units, that of co-occurrence probability patterns. The basic strategy is to proceed as far as possible using these patterns, with a minimum of assumptions about the linguistic or semantic organization of the information within the message structure. This strategy implies a rather mechanistic approach to language processing, and that is indeed the case. We assume that the information contained in a message is carried by the words that make it up and by the manner in which they are strung together. Further, we assume a person generating a message or document chooses words in a nonrandom fashion and combines them according to semantic and syntactic rules that are regular and, at least in our culture, to some extent predictable. That is, both the selection of elements and their co-occurrence with other elements are subject to restrictions by the contexts in which they occur. We intend to exploit the regularities of these associations among words, ignoring the specific nature of the rules which produce such regularity and thereby restricting ourselves to the resulting statistical features alone. If one examines this approach carefully, it can be seen that we are defining an approach similar in many ways to the way humans appear to retrieve information from their own memories.
Typically, humans seem to start with the query words and then to associate these with other words until the information they seek is brought to their conscious attention. This process of association of elements is so basic and obvious that Aristotle reasoned that to learn was to associate. However, although association theory has been known for many years, little use has been made of it as a methodology for information processing. In fact, the literature on the use of statistical associations for information processing is quite limited, although at least three significant contributions of a methodological nature appear to be of direct relevance. All are concerned with the use of index terms, from a specified library of index terms, to retrieve documents from a specified library of documents. All involve obtaining descriptive statistics to indicate the extent to which specific index terms occur together in tagging the various documents of the library. Such descriptive statistics then are used to expand from one or more index terms used in a query to a set of associated terms, based upon evidence of the co-occurrence tendencies of the various terms.

2. Historical Background

Probably the most important early work in statistical association techniques comes from H. P. Luhn, who in 1958 [1]¹ suggested that the clerical ability of the computing machine be harnessed to develop statistical frequency counts of text. These counts would then be used to determine "significant" terms. Almost as an addendum he suggested that one could take these "significant" terms and determine their mutual co-occurrences, thus yielding a series of connected terms. This suggestion was not followed through, as far as we can determine, until 1960, when Maron and Kuhns [2] published their investigations on statistical associations as part of a more general methodological attack on the problems of document retrieval.
Starting with a catalog of index terms and a library of documents, they develop a statistical matrix of association frequencies:

              T_k                  T̄_k
T_j     x = N(T_j, T_k)     u = N(T_j, T̄_k)     N(T_j)
T̄_j     v = N(T̄_j, T_k)     y = N(T̄_j, T̄_k)     N(T̄_j)
        N(T_k)              N(T̄_k)              n

where

T_j is a tag in the original request;
T_k is a tag not in the original request;
N(T_j, T_k) = the number of documents in the library tagged jointly with both T_j and T_k;
N(T_j, T̄_k) = the number of documents tagged with T_j and not with T_k;
N(T_j) = the total number of documents tagged with T_j;
N(T̄_j) = the total number of documents not tagged with T_j;
n = the total number of documents.

From these descriptive statistics, Maron and Kuhns develop three different measures of closeness of association for index terms. One is the conditional probability that if a term in the original request, T_j, is assigned to a document, then the additional term T_k also will be assigned:

P(T_k | T_j) = N(T_j, T_k) / N(T_j).    (1)

The second measure is the inverse conditional probability; that is, the probability that if the additional term T_k is assigned to a document, then the original request term T_j also would be:

P(T_j | T_k) = N(T_j, T_k) / N(T_k).    (2)

Finally, they use the contingency estimate, or estimate of the frequency of co-occurrence, independent of the individual and separate influences of the two terms which form the co-occurrence in question. They remove the magnitude to be expected on the basis of chance from the actual cell magnitude, taking into account the number of times the individual tags are used:

δ(T_j, T_k) = N(T_j, T_k) − N(T_j)N(T_k)/n.    (3)

Maron and Kuhns then introduce an arbitrary coefficient of association, based upon δ(T_j, T_k), which ranges conveniently from −1 to +1, with a magnitude of zero for the condition

δ(T_j, T_k) = 0.

¹ Figures in brackets indicate the literature references on p. 60.
(4)

This coefficient is of the form:

Q(T_j, T_k) = nδ / (xy + uv).    (5)

This work was followed by Doyle [3], who developed a measure drawn from a contingency table to indicate strength of association:

N(T_j, T_k) n / [N(T_j)N(T_k)].    (6)

Doyle [4] has subsequently repudiated this formula and has instead substituted

N(T_j, T_k) / [N(T_j) + N(T_k) − N(T_j, T_k)].    (7)

Following close on, Stiles [5] also started with a contingency table of the form given above. However, he introduced a different coefficient of association:

log₁₀ { n(|nδ| − n/2)² / [N(T_j)N(T_k)N(T̄_j)N(T̄_k)] }.    (8)

In each of the three approaches cited, the investigators tend to adopt the same basic data structure from which to develop their analyses. They pass over the question of how many terms are used to index any particular document and start with the total population of indexed documents as a base. They divide this population of documents into those that exhibit the common property of having been indexed by T_j, with and without T_k, and those not indexed by T_j, with and without T_k. Using various normalizing procedures, they adjust the sizes of these various groups, especially the group (T_j, T_k), to remove any effect that might result from the tendencies of T_j and T_k, separately, to occur frequently in general. Some kind of normalization is required, because the more frequently an index word occurs, the more likely it will co-occur with some other term, simply on the basis of chance. The techniques used by Maron and Kuhns, Stiles, and Doyle, however, do not treat the fact that the more lengthy the string of index words used to index a document, the more likely that co-occurrences involving the terms in the string are due to chance. For a library retrieval problem this might be little more than a minor omission if, for example, the number of terms used to index all documents is a constant.
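The historical measures (1) through (8) are all simple functions of the four cell counts, so they can be compared side by side on a toy contingency table. A minimal sketch (the count values and variable names are invented for illustration; the Stiles formula follows eq (8) above):

```python
import math

# Toy tag-pair counts for terms Tj, Tk over n documents (invented values).
x, u, v, y = 30, 10, 20, 40          # N(Tj,Tk), N(Tj,~Tk), N(~Tj,Tk), N(~Tj,~Tk)
n = x + u + v + y
N_j, N_k = x + u, x + v              # N(Tj), N(Tk)

p_k_given_j = x / N_j                              # (1) conditional probability
p_j_given_k = x / N_k                              # (2) inverse conditional
delta = x - N_j * N_k / n                          # (3) contingency estimate
q = n * delta / (x * y + u * v)                    # (5) Maron-Kuhns coefficient
doyle = x * n / (N_j * N_k)                        # (6) Doyle's first measure
doyle_revised = x / (N_j + N_k - x)                # (7) Doyle's substitute
stiles = math.log10(n * (abs(n * delta) - n / 2) ** 2
                    / (N_j * N_k * (n - N_j) * (n - N_k)))   # (8)

# The identity behind (5): n * delta equals xy - uv, so Q is Yule's Q
# computed on the chance-corrected cell.
assert abs(n * delta - (x * y - u * v)) < 1e-9
```

Because nδ = xy − uv, the Maron-Kuhns coefficient (5) has the form (xy − uv)/(xy + uv), which is why it ranges from −1 to +1.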
However, if data on statistical co-occurrence are drawn from the actual strings of words in natural language that comprise the body of a document or message, then such factors as string length, word position in the string, and vocabulary size might significantly influence the tendency of words to co-occur. Accordingly, we would like to argue that a statistical association technique should take such factors into account and, further, that it should not be dependent upon the particular level of message aggregation being considered.

3. Theoretical Development

Before discussing a method for accounting for these effects, it would be useful to define our terms and examine their implications. As previously stated, a message is a carrier of information or content. The smallest message carrier of content is probably the alphabetical letter, number, or arbitrary punctuation mark. This is a message of minimum size. A continuous string of such marks, commonly a word, may be thought of as a somewhat larger message. At a still larger level of aggregation, a string of words, perhaps a sentence or a paragraph, is also a message. Similarly, documents, books, clusters of books, and so forth, are messages of increasing levels of aggregation. Analytical techniques for determining message or document content do not necessarily have to change radically because of the magnitude of message aggregation being considered. The procedures one uses to examine the subject matter index of a library card file may be similar to the procedures for understanding and searching the individual book cards, which in turn may parallel the procedures used with a book's table of chapter contents, its page index, or the paragraphs and sentences of an individual page itself.
Therefore, to maintain stress upon the common denominator, we will consider all of the strings that constitute messages as a class, becoming specific, when necessary, by indicating the size or level of aggregation for any string. Alphabetical, numerical, or punctuation mark messages are one level of aggregation smaller than those considered in detail at this point. The units of immediate concern are words, strings consisting of a few words, and strings of such strings, including those larger strings that range from sentences or titles, to paragraphs or abstracts, to articles, and so forth. We establish the following working definition: a word type is the smallest unit of analysis and always has the identical configuration of alphabetical, numerical, and conventional marks. Thus, the word type man is different from men or man's; similarly is, are, and am are different types. Types may vary in size from one symbol to many. The only requirement is that the symbol arrangement remain the same for the same type. The ability of a person to react differently to the string of letters man in contrast to the string men, man, or manx reflects the influence of differing structural arrangements of identifiable elements. The string man is a unique system that might be represented by a simple flowgraph, in which the numbers give the distance between the elements of the string, or by the somewhat more redundant association list:

m →(1) a,  m →(2) n,  a →(1) n.

The arrangement or association of words can be represented in the same way to identify a sentence, or the association of sentences can identify a paragraph. This also applies to messages of larger aggregation. For example, the string Mary would like John has an identity characterized by the co-occurrence of the four words, the specific sequence of the words, and the distance among them. In association list form the string would have the representation:

Mary →(1) would
Mary →(2) like
Mary →(3) John
would →(1) like
would →(2) John
like →(1) John

In this way a message at any level of aggregation can be represented structurally by its co-occurring units at the next lower level by merely specifying the directions and distances among them. As further illustration consider the following title, descriptors, and abstract² as one message:

(title) Psychophysical relations in the visual perception of length, area, and volume.

(descriptors) Visual perception, Perception, Stimulation, Tests, Measurement.

(abstract) Subjective length, area and volume as functions of the corresponding stimulus variables were studied in three experiments. The exponents of the psychophysical power functions scattered around 1 for perception of real space. For perspective drawings of cubes and spheres, however, the exponents were about 0.75. It was tentatively concluded that perspective is an insufficient cue to visual volume. The results are discussed with special reference to certain cartographic symbols representing population magnitude.

Just for this example, we will establish the following convention. A word type consists of any unique sequence of exclusively alphabetical symbols with one or more blank spaces preceding and following it, but without blank spaces in the sequence itself. Capital and lower case letters are to be considered identical, and all numbers and punctuation are ignored in identifying types. A primary string is specified as terminating with the presence of a punctuation mark directly followed by two or more spaces. This specification results in choosing as primary strings those sequences of words that correspond to what we ordinarily identify as sentences.

² Taken from the Defense Documentation Center's Technical Abstract Bulletin, dated 30 August 1961, No. AD-262 148.
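The directed pair-and-distance representation above is purely mechanical, so it is easy to generate. A minimal sketch (the function name is ours):

```python
def association_list(string):
    """All ordered (first, second, distance) triples for a string of word types."""
    words = string.lower().split()
    return [(words[i], words[j], j - i)
            for i in range(len(words))
            for j in range(i + 1, len(words))]

pairs = association_list("Mary would like John")
assert ("mary", "john", 3) in pairs
assert ("would", "like", 1) in pairs
assert len(pairs) == 6   # 4 word types yield 4*3/2 forward-directed pairs
```

A string of L types yields L(L−1)/2 directed pairs, which is the quantitative basis for the string-length effect the authors correct for in section 4.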
Accepting these conventions we can represent the message as a secondary string composed of sentence-length primary strings:

Psychophysical relations in the visual perception of length area and volume. Visual perception, perception stimulation, tests measurement. Subjective length area and volume as functions of the corresponding stimulus variables were studied in three experiments. The exponents of the psychophysical power functions scattered around for perception of real space. For perspective drawings of cubes and spheres however the exponents were about. It was tentatively concluded that perspective is an insufficient cue to visual volume. The results are discussed with special reference to certain cartographic symbols representing population magnitude.

This message, or any part of it, also can be represented by an association matrix, where the columns represent the first word in a pair, the rows represent the second word, and the cell entries indicate the frequency for each of the co-occurrences. This matrix is, in effect, a simple coded representation of part of the structural content of this one message. With the addition of other messages from the same corpus, the matrix could gradually grow to reflect the co-occurrences of types in all the messages of the corpus in question. This matrix would reflect the statistical structure of the corpus, showing which types were associated and to what extent. It is this matrix that we use to develop our association factor.

4. Statistical Development

The actual frequency of occurrence of any pair of word types is partially a function of the relevant tendency for the two word types to co-occur because they are associated in some meaningful manner. However, it is also a function of the separate tendencies, irrelevant for this purpose, of either of the word types to occur with all other word types in general.
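The conventions above (alphabetic types, case folded, numbers and punctuation ignored, sentence-length primary strings) and the association matrix they feed can be sketched as follows; the helper names are ours:

```python
import re
from collections import Counter

def word_types(primary_string):
    """Case-folded alphabetic word types; numbers and punctuation are ignored."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", primary_string)]

def association_matrix(primary_strings):
    """Directed co-occurrence counts: (first word, second word) -> frequency."""
    matrix = Counter()
    for s in primary_strings:
        types = word_types(s)
        for i in range(len(types)):
            for j in range(i + 1, len(types)):
                matrix[(types[i], types[j])] += 1
    return matrix

m = association_matrix([
    "The exponents were about 0.75.",
    "It was tentatively concluded that perspective is an insufficient cue.",
])
assert m[("the", "exponents")] == 1
assert ("exponents", "the") not in m   # direction matters in the full matrix
```

Note that "0.75" contributes no type, mirroring the processed secondary string shown above, where the numerals have disappeared.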
For example, a specific word type will be the first type in as many pairs as there are other types following it in a string. Similarly, it will be the second type in as many pairs as there are other types preceding it in a string. A word type will also form pairs as a function of how frequently it occurs as a type in the set of strings under consideration. It is desirable to normalize to eliminate these extraneous influences: frequency of word occurrence, relative word position, and string length. This can be accomplished by subtracting from the actual frequency of pair occurrence an estimate of the frequency expected on the basis of chance due to frequency and position of occurrences, as well as sentence length, for each of the two words that comprise the pair in question, as follows. We start with a matrix of frequencies of co-occurrences:

                                    FIRST POSITION
                        x_j                x_k                (x̄_j, x̄_k)
SECOND    y_j           N(x_j, y_j)        N(x_k, y_j)        N((x̄_j, x̄_k), y_j)        N(y_j)
POSITION  y_k           N(x_j, y_k)        N(x_k, y_k)        N((x̄_j, x̄_k), y_k)        N(y_k)
          (ȳ_j, ȳ_k)    N(x_j, (ȳ_j, ȳ_k)) N(x_k, (ȳ_j, ȳ_k)) N((x̄_j, x̄_k), (ȳ_j, ȳ_k)) N(ȳ_j, ȳ_k)
                        N(x_j)             N(x_k)             N(x̄_j, x̄_k)               N₀

where

N(x_j, y_j) = the frequency of co-occurrences with word type j preceding word type j;
N(x_j, (ȳ_j, ȳ_k)) = the frequency of co-occurrences with word type j preceding tokens which are not of word type j and not of word type k;
N(x_j) = the sum of the frequencies of all co-occurrences with word type j in the first position;
N(y_j) = the sum of the frequencies of all co-occurrences with word type j in the second position;
N₀ = the grand total frequency of co-occurrences.

The total frequency of pairs that includes the word type j in the first position, N(x_j), is equal to the portion of the length of the string that follows the type j, summed over the total number of occurrences of the type.
Similarly, the total frequency of pairs that includes the type k in the second position, N(y_k), is equal to the length of the string that precedes the type k, summed over the total number of occurrences of the type. The row and column totals N(x_j), N(x_k), N(y_j), N(y_k), and so forth, supply a statistical estimate of the cell magnitude that could be expected because of the extraneous factors of frequency, position, and string length. Subtracting the customary contingency table correction³ from the actual cell magnitudes, this estimate of cell magnitude can serve as a first-level normalization. Even with this correction, the cell frequencies are still a function of the actual magnitude of the total corpus of pairs and the total number of word types included in the entire matrix. Thus the greater the total number of pairs, the greater the number to be expected in any cell. Similarly, the fewer the number of word types, the fewer the number of matrix cells and, therefore, the greater the number of pairs to be expected in any one cell. Consequently, correction of cell frequencies proportional to the total frequency of pairs and inversely proportional to the number of matrix cells results in a set of weights which is normalized for extraneous factors. The resultant cell weights, Z's, serve as one estimate of the influence of association forces independent of individual frequencies, sentence lengths, number of different types, and total number of pairs within the corpus under consideration:

Z(x_j, y_k) = (n²/N₀) [ N(x_j, y_k) − N(x_j)N(y_k)/N₀ ],    (9)

where

N(x_j, y_k) = the frequency of co-occurrences with word type j preceding word type k;
N(x_j) = the total frequency of co-occurrences with type j as first type;
N(y_k) = the total frequency of co-occurrences with type k as second type;
N₀ = the total frequency of co-occurrence of all types;
n = the number of different types.

When the direction of co-occurrence is not considered, the matrix can be collapsed into a triangular form which reflects joint occurrence, where pairs with the words reversed in direction are combined. Each matrix cell of such a triangular matrix, except the cell where j equals k, is, in effect, the sum of two cells, N(x_j, y_k) + N(x_k, y_j). In this case, the correction for extraneous factors would be:

Z'(x_j, y_k) = [n(n+1)/2] [ (N(x_j, y_k) + N(x_k, y_j))/N₀ − N(x_j + y_j)N(x_k + y_k)/(2N₀²) ],    (10)

where N(x_j + y_j) = the total frequency of pairs containing type j in either position; therefore, N(x_j, y_j) is counted twice.

If the matter of distance of displacement of the words in the pairs is ignored for the moment, a matrix of co-occurrences based upon the statistic Z'(x_j, y_k) would appear to reflect one statistical tendency of pairs of types to associate. The matrix is adaptive in that it starts with no cell weights if there has been no input of strings. Then as the inputs begin and continue, the matrix continues to grow and change as it digests ever-increasing quantities of pairs. Each normalized cell weight, Z', rises and falls with time as each specific association increases or decreases in relative frequency. In this way, the matrix memory changes with time, maintaining a cumulative pattern of associations reflecting one statistical characteristic of messages fed into it in the past. In addition to this adaptive characteristic of changing memory with time and with changes in inputs, the matrix is also readily subject to what might be called "formal education."

³ Note that this initial correction is identical to the contingency table correction made by Maron and Kuhns, and by Stiles, on their matrix tabular data, although these investigators use row and column totals based upon frequency of type occurrence, ignoring the variable of how many types are used to identify a document (our notion of string length).
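An undirected, normalized cell weight in the spirit of eq (10) can be sketched as below. The exact constant factors here are an assumption (the printed formula is only partially legible), and the function and variable names are ours:

```python
from collections import Counter

def z_prime(pair_counts, n_types, j, k):
    """Normalized undirected weight for types j, k, in the spirit of eq (10).

    pair_counts: Counter of directed pairs (first, second) -> frequency.
    The constant factors are an assumption of this sketch.
    """
    N0 = sum(pair_counts.values())
    joint = pair_counts[(j, k)] + pair_counts[(k, j)]
    # N(x_j + y_j): pairs containing j in either position (self-pairs twice).
    margin_j = sum(c for (a, _), c in pair_counts.items() if a == j) \
             + sum(c for (_, b), c in pair_counts.items() if b == j)
    margin_k = sum(c for (a, _), c in pair_counts.items() if a == k) \
             + sum(c for (_, b), c in pair_counts.items() if b == k)
    cells = n_types * (n_types + 1) / 2      # cells in the triangular matrix
    return cells * (joint / N0 - margin_j * margin_k / (2 * N0 ** 2))

counts = Counter({("a", "b"): 4, ("b", "a"): 2, ("a", "c"): 1, ("c", "b"): 1})
w = z_prime(counts, n_types=3, j="a", k="b")
assert w > 0   # "a" and "b" co-occur more than chance would predict
```

The weight is positive when a pair co-occurs more often than its marginals predict and negative otherwise, which is what lets the adaptive matrix both strengthen and fade associations over time.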
Any specific cell weight can be strengthened by repeatedly reading into the matrix memory the specific strings that contain the desired association. For example, by introducing the strings is am, is are, am is, am are, are is, and are am, we can increase the statistical tendency of the tokens is, am, and are to be associated. More complex learning can be accomplished by the introduction of strings such as man men, men man, singular plural, plural singular, man singular, men plural. In a similar way, we can build chains, lists, trees, and circles of associations. A chain would be formed through the repetitive input of strings of types such as a b, b c, c d, and so forth. A list would involve input strings of the form a b, a c, a d, a e, a f, where the word a is the list heading and the other words are subordinate entries in the list. A tree would involve introducing the strings a b, b c, b d, c e, c f, d g, d h. Circular associations of the form a b, b c, c d, d a could also be formed. In fact, any particular configuration of links is possible through the development of an appropriate set of input strings. The retrieval algorithm that seems almost to arise as a result of such matrices is one that takes a set of given terms (the query) and expands the set by finding other, highly associated, terms. Doing this, however, allows one to chain or proceed down paths that have little or no relevance to the original query. For example, one could start with a term such as "neurosis" and trace a path of strong associations through "psychological" to "resistance." Here there are two equal bonds, one leading off into the electronics field through the term "capacitance" and the other continuing in the psychological area through the term "psychotherapy." Clearly, it is this latter link we wish to use.
This can be accomplished by providing a feedback loop to the original query terms, by requiring each candidate term for expansion to have co-occurred at least once with the full set of query terms. To state our retrieval algorithm more precisely: Given a set of query types, the matrix is searched to locate all types which have been associated with each and every one of the query types in the set. From this group of words, those (equal in number to the number of query types) that have the highest sum of normalized matrix weights (when summed over all of the query types) are selected to form a set of first-order types. Having obtained this set of first-order associates, we form a new set combining these first-order types with the original query types. With this larger set of joint first-order and query types, the matrix again is searched to locate all types that have been associated with each and every one of the types in this expanded set. From this newly located group of types, those (equal in number to the number of joint first-order and query types) that have the highest sum of normalized matrix weights (when summed over all of the first-order plus query types) now are selected to form a set of second-order types. The procedure for determining first-order associates can be presented in symbolic form as follows: Let

α_jk = the Z' for T_j with respect to q_k,

where

q ∈ Q, Q = {query terms};
T_j = any term in the normalized matrix, but T_j ∉ Q;
j = any row of the normalized matrix;
k = any column of the normalized matrix;

then

T_j ∈ A ⇔ (∀k) α_jk ≠ 0 and s_j is among the n_q highest sums,

where

A = {first-order associates};
s_j = Σ_{k=1}^{n_q} α_jk;
n_q = the number of terms in the class Q.
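The first-order step can be sketched as follows, with an in-memory weight table standing in for the normalized matrix (the function name and the toy weights are ours):

```python
def first_order(weights, query):
    """Terms co-occurring with every query term, top n_q by summed Z' weight.

    weights: dict mapping (term, query_term) -> normalized weight Z';
    an absent key means the pair never co-occurred.
    """
    n_q = len(query)
    candidates = {t for (t, q) in weights if q in query and t not in query}
    # Feedback loop: a candidate must have co-occurred with the FULL query set.
    eligible = [t for t in candidates
                if all((t, q) in weights for q in query)]
    eligible.sort(key=lambda t: sum(weights[(t, q)] for q in query),
                  reverse=True)
    return eligible[:n_q]

toy = {("liquid", "convection"): 2.0, ("liquid", "thermal"): 1.5,
       ("report", "convection"): 0.4, ("flow", "convection"): 1.0,
       ("flow", "thermal"): 0.3}
assert first_order(toy, ["convection", "thermal"]) == ["liquid", "flow"]
```

Here "report" is excluded despite a nonzero weight because it never co-occurred with "thermal"; this is exactly the feedback requirement that blocks the "neurosis" to "capacitance" drift described above. The second-order pass repeats the same selection over the enlarged set Q ∪ A.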
The second-order associates are derived in a similar fashion, as follows: Let

β_jk = the Z' for T_j with respect to a_k,

where

a ∈ A;
T_j = any term in the normalized matrix, but T_j ∉ Q ∪ A;

then

T_j ∈ B ⇔ (∀k) α_jk, β_jk ≠ 0 and s_j is among the 2n_q highest sums,

where

B = {second-order associates};
s_j = Σ_{k=1}^{n_q} α_jk + Σ_{k=1}^{n_a} β_jk;
n_a = the number of terms in the class A.

From the above it follows that Q, A, and B are mutually exclusive. Having derived the first- and second-order association terms, we can then note for each document the occurrence of each query term, each first-order term, and each second-order term. The documents then are ordered according to the following rules and definitions: Let

n_b = the number of terms in the class B (second-order associates);
n_q = n_a = n_b/2;
j = n_q + n_a + n_b;
k = 100n_q + 10n_a + n_b;
D_{j,k} = a message or document with j and k indices as defined above.

D₁ r> D₂ means that D₁ is more relevant than D₂. The ordering of messages or documents on the basis of relevance is then: D_j r> D_{j−1}, and within the j set of messages, D_{j,k} r> D_{j,k−1}. In such an ordering each cut "j" is further subdivided by "k." This procedure, of course, presumes that messages containing the query types are more relevant than those that do not, that those containing first-order associates are more relevant than those that do not, and so forth.

5. Natural Text Retrieval

Once the system was programmed and checked out,⁴ a search was undertaken to locate suitable natural language corpora already in a computer-compatible form. Certain criteria of adequacy were: (1) representative of a heterogeneous message or document file; (2) pre-indexed, so that criteria of retrieval success could be simply developed; (3) relatively recent; and (4) in a form convenient for input.
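The ordering rule can be sketched as below. We read j and k as per-document counts of the query, first-order, and second-order terms actually present (an interpretation on our part, since the text defines n_q, n_a, n_b over the classes Q, A, B), and the names are ours:

```python
def relevance_key(doc_terms, query, first_order, second_order):
    """Sort key implementing D_j r> D_{j-1}, then D_{j,k} r> D_{j,k-1}."""
    terms = set(doc_terms)
    nq = len(terms & set(query))         # query terms present
    na = len(terms & set(first_order))   # first-order associates present
    nb = len(terms & set(second_order))  # second-order associates present
    j = nq + na + nb
    k = 100 * nq + 10 * na + nb          # k subdivides each cut j
    return (j, k)

docs = {"d1": ["convection", "liquid"],   # one query + one first-order term
        "d2": ["liquid", "progress"],     # first- and second-order only
        "d3": ["convection", "thermal"]}  # two query terms
ranked = sorted(docs,
                key=lambda d: relevance_key(docs[d],
                                            ["convection", "thermal"],
                                            ["liquid"], ["progress"]),
                reverse=True)
assert ranked == ["d3", "d1", "d2"]
```

All three toy documents tie on j = 2, so the ranking is decided entirely by k, illustrating how the 100/10/1 weighting enforces the presumption that query types outrank first-order associates, which outrank second-order ones.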
We found that the Defense Documentation Center's Technical Abstract Bulletin met these criteria, since the TAB's provide many different types of system inputs: author names, titles, descriptors, as well as an abstract. In addition, the TAB's were already being printed from punched paper tapes. Arrangements were made to borrow the punched paper tapes for two TAB issues, 15 March and 1 April 1962. With the use of a paper tape reader, the TAB's were transferred directly onto magnetic tape in a form compatible with the particular computer we had available. Initial retrievals were carried out using the descriptors as the input corpus. However, the intent of the project was to develop procedures to retrieve unindexed materials. To this end, we then tried the technique using the natural text found in the abstracts as the input corpus for association. Table 1 shows the query terms and their expansions for some representative efforts using such natural text for association. As can be seen, the weighting technique we were using was unable to downgrade association to the "function" or "little" words, words that are extremely frequent and that seem to add little or nothing for retrieval.

Table 1. Examples of original expansions

Query number   Query terms                    First-order associates   Second-order associates
1              Analog digital computer        a for on                 Not requested
2              Camera data record             on and to                Not requested
3              Atomic bomb                    the to a                 in explosions was of
4              Convection radiation thermal   of in liquid             progress made report a this two

There are two brute-force ways to downgrade these words. One is to establish an a priori list of these "function" words and then delete them from consideration. Another is to arbitrarily cut off the most frequently occurring terms.

⁴ See Appendix A for an informal discussion of the program details.
Both of these solutions, we feel, are unsatisfactory: the first because such a list must be prepared anew for each new corpus, and the second because high-frequency terms may be deleted which quite reasonably should remain, because they are central to the area of concern. For example, in the abstracts corpus, which approximates natural language, out of 5,803 unique words the terms temperature, data, results, design, effects, and others were among the 30 most frequently occurring. Clearly, some terms like these should not be purged. Ideally, the approach we were looking for was one that would downgrade only those terms that did not materially aid in the association technique.

The terms we wish to suppress are those whose occurrence in the text is not significantly conditioned by their associations; that is, these terms occur more or less independently of their associated context of other words. More precisely stated, the occurrence of such a term can be predicted equally well whether one knows or does not know the terms with which it co-occurs. A desirable term, on the other hand, is one whose occurrence can be predicted with greater certainty knowing its associates in comparison with not knowing them. This line of reasoning led us to an investigation of some of the ideas developed in information theory, particularly those dealing with the prediction of the occurrence of a term when one is given its paired associate. Along this line, three related measures were found to be of use. The first,

H(y) = -log2 P(y),   (11)

gives the extent to which the occurrence of a term y is generally uncertain without having any information concerning its associations. The second,

I(y, X) = Σ_x P(x|y) log2 [P(x, y)/(P(x) P(y))],   (12)

gives the average extent to which the uncertainty of the term y is reduced when knowledge of any of its associates is given.
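These measures, together with the conditional uncertainty and the ratio that the discussion turns to next, can be checked with a small numerical sketch. Everything here (the function name and the toy co-occurrence counts) is our own illustration, not the paper's program:

```python
from math import log2

def term_measures(counts, y):
    """H(y), I(y, X), H(y|X), and the ratio I/H for a term y, computed
    from a toy table counts[(x, y)] of co-occurrence counts (a stand-in
    for the paper's normalized matrix)."""
    total = sum(counts.values())
    p_y = sum(c for (x, b), c in counts.items() if b == y) / total
    H_y = -log2(p_y)                                   # formula (11)
    I = 0.0
    H_cond = 0.0
    for (x, b), c in counts.items():
        if b != y or c == 0:
            continue
        p_xy = c / total
        p_x = sum(cc for (a, bb), cc in counts.items() if a == x) / total
        p_x_given_y = p_xy / p_y
        I += p_x_given_y * log2(p_xy / (p_x * p_y))    # formula (12)
        H_cond += -p_x_given_y * log2(p_xy / p_x)      # formula (13)
    return H_y, I, H_cond, I / H_y                     # ratio is formula (14)

counts = {("a", "y"): 1, ("b", "y"): 3, ("a", "z"): 3, ("b", "z"): 1}
H, I, Hc, ratio = term_measures(counts, "y")
```

With these definitions the decomposition H(y) = I(y, X) + H(y|X) holds exactly, which is the internal consistency check on the three formulas.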
The third,

H(y|X) = -Σ_x P(x|y) log2 [P(x, y)/P(x)],   (13)

gives the average uncertainty that is left remaining even after knowledge of any of the term's associates is given. In light of these, we were able to argue that in an association scheme the terms to be suppressed or downgraded are those whose uncertainty of occurrence remains great in spite of knowledge of their associations. Using these aforementioned measures, we identified such terms by taking the ratio

I(y, X)/H(y),   (14)

or the amount of reduction in uncertainty knowing the term's associates divided by the term's total uncertainty. All of the former association weights were now multiplied by this additional correction factor.

The system was then tried using the new matrices. Some representative queries and their new expansions are shown in table 2.

Table 2. Examples of original and revised expansions

Query number 1. Query terms: Analog digital computer.
  First-order: original, a for on; revised, computation equations system.
  Second-order: not requested.
Query number 2. Query terms: Camera data record.
  First-order: original, on and to; revised, present contained unit.
  Second-order: not requested.
Query number 3. Query terms: Convection radiation thermal.
  First-order: original, of in liquid; revised, liquid report progress.
  Second-order: original, progress made report a this two; revised, in between made of this two.

The modification of the previous association technique by the use of this additional measure seems to have added to the value of the technique. This can be noted by comparing the ranks, in order of association magnitude, of associated terms from the normalized matrix before and after modification. Table 3 shows some of these comparisons.

Table 3.
Rank orders of associated terms to selected terms before and after matrix modification

TERM: DIAMINES
  Old rank: of, amines, radicals, by, ethylene, examples, formation, given, monovalent, oxidation, reaction, substituted, tbutoxy, tertiary, are, with, the
  New rank: amines, radicals, monovalent, tbutoxy, tertiary, substituted, ethylene, examples, formation, reaction, oxidation, by, given, of, are, with, the

TERM: HORIZON
  Old rank: an, at, of, achieving, airspeed, coverage, fa, feet, horizon, knots, operating, optimized, photographic, terrain, above, area, been, minimum, while, has, for, to, a, the
  New rank: horizon, airspeed, knots, fa, photographic, optimized, achieving, coverage, terrain, feet, area, operating, while, above, at, minimum, been, an, has, for, of, to, a, the

TERM: DUCTS
  Old rank: in, to, bile, addition, after, approximately, changes, common, comparable, duct, hepatectomy, hours, known, liver, obstruction, partial, rat, regeneration, result, seen, well, cells, found, number, of, that, was, the
  New rank: bile, rat, duct, obstruction, liver, regeneration, hepatectomy, seen, common, addition, comparable, hours, partial, after, cells, known, result, approximately, changes, well, of, found, in, number, to, was, that, the

TERM: FLORYS
  Old rank: for, of, a, certain, consisting, deriving, energy, formula, free, lattice, molecule, monomer, polymer, review, solution, theories, presented, is, and, the
  New rank: lattice, theories, molecules, monomer, deriving, polymer, review, formula, consisting, solution, free, certain, energy, presented, for, a, is, of, and, the

TERM: ENGINES
  Old rank: to, a, cargo, centrally, controlled, coupled, highway, into, offroad, operate, program, selfpropelled, trackless, train, under, units, can, conditions, control, presented, results, study, systems, test, that, and, are, of, the
  New rank: centrally, trackless, train, cargo, highway, offroad, selfpropelled, controlled, operate, coupled, units, into, under, conditions, control, program, can, systems, test, presented, study, results, that, to, a, are, and, of, the

6.
Summary

We have reported upon a statistical association technique and program which can accept any natural-language input, as long as it is in a computer-compatible form, and, from this input, derive a term-term association matrix whose cell values provide a measure of the tendency of the two defining terms to co-occur through other than chance factors. This matrix appears to have a number of potential uses; among them are automatic message retrieval, content-analysis studies, message routing, and so forth.

7. Appendix A. System Program

System Overview

The overall system flow chart is shown in figure 1. This system was written for the IBM 7090 computer. The system can be divided into two parts: data preparation and query.

FIGURE 1. Overall system flow chart. [The chart, not reproduced here, connects: A, text tape → scan program → B, tape for concordance, and C, pairs tape(s) → IBM 9SORT → D, sorted tape for concordance, and F, sorted pairs tape(s) → concordance program and frequency matrix program → E, concordance tape, and G, frequency matrix tape; H, row tape → normalize program → I, normalized matrix tape. The query program takes I, the normalized matrix tape; J, the query deck; E, the concordance tape; and K, the document tapes. Its query expansion phase produces L, the expanded query-word list, and its concordance search and retrieval phase produces M, the on-line or off-line print of retrieved documents in order of relevance, and of the expanded query-word list.]

Data preparation starts with the text and builds from it a concordance and a list of pairs. Both of these are sorted. The list of pairs is used to build a frequency matrix of word-word co-occurrences, where the j-k entry tells how many times word j and word k occurred together within a sentence, summed over all of the sentences of the corpus. The frequency matrix is "normalized" in accordance with formula (10) given above. This normalized matrix is used in the query part of the system to produce an expanded query-word list; i.e., the original query words plus those additional words highly associated with them.
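The frequency-matrix step can be sketched in a few lines of modern Python (our own illustration, not the 7090 program; the normalization of formula (10) is omitted here):

```python
from collections import Counter

def frequency_matrix(corpus_sentences):
    """Accumulate word-word co-occurrence counts within sentences: the
    (j, k) entry counts how often word j and word k occur together in
    the same sentence, summed over all sentences of the corpus."""
    freq = Counter()
    for sent in corpus_sentences:
        tokens = sent.split()
        for a in tokens:
            for b in tokens:
                if a != b:
                    freq[(a, b)] += 1
    return freq

# Two toy "sentences" standing in for a corpus:
corpus = ["analog computer design", "digital computer design"]
F = frequency_matrix(corpus)
```

Because "computer" and "design" co-occur in both sentences, their entry is 2, while words that never share a sentence (such as "analog" and "digital") have a zero entry.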
The query part of the system has two phases: the query-expansion phase and the concordance search and retrieval phase. In the query-expansion phase, the program first finds those terms (called first-order associates) strongly associated with the original query words, using as input the original query words. It then iterates this process by finding those words (the second-order associates) strongly associated with the first-order associates and the query terms, and so on. The concordance search and retrieval phase then takes the expanded query-word list and, using the concordance, finds all of the messages or documents which contain one or more of the words from the expanded query-word list. Each document gets a score, based on the number of words from the expanded query-word list which refer to it. The documents are then retrieved and printed in order of score (highest score first).

Description of Subroutines

The following sections informally describe the subroutines and the tape formats found at each stage of the system.5

In general, in the machine formation and computation stages, a word is represented by a string of 18 characters. If the word does not take up the whole string, it is padded on the right with blanks; if it is longer than 18 letters, it is truncated after the first 18 characters. This word size is an arbitrary parameter. One can choose to truncate at 12 or even 6 letters or, for that matter, at 24 or 30 letters. Whatever length one chooses, it must be a multiple of 6, since one 7090 register can contain 6 characters. However, word length does have a material effect upon the total number of words that can be handled at one time within core. The shorter the word, the more words that can be manipulated. Table 4 shows the effects of varying word lengths, holding the vocabulary size constant, on the data preparation time and on the retrieval time. Table 4.
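The fixed-width word representation described above is straightforward; a sketch in Python rather than 7090 code (the function name is ours):

```python
def fixed_width(word, length=18):
    """Represent a word as a fixed-length string: pad on the right with
    blanks, or truncate after `length` characters. The length must be a
    multiple of 6, since one 7090 register held 6 characters (here that
    constraint is simply checked)."""
    if length % 6 != 0:
        raise ValueError("word length must be a multiple of 6")
    return word[:length].ljust(length)
```

For example, an 11-letter word padded to 18 characters carries 7 trailing blanks, while an 18-letter word truncated at 12 keeps only its first 12 characters.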
Timing and size relations

Word length        Word    Word      Data prepa-    Retrieval*   Matrix**    Com-         Pairs
(truncation        types   tokens    ration time    time (min)   density     pression**   produced**
point)                               (min)                       (percent)                (millions)
18                 7,500   110,000   487            15           1.5         3.5          3
12                 7,500   110,000   330            13           1.5         3.5          3
 6                 7,500   110,000   165             7           1.5         3.5          3

5 More precise, technical descriptions of each subroutine and tape can be obtained from the authors.

*This is the time required to retrieve the first 100 documents, and includes the time necessary to search the matrix, the concordance, and the text. Rather than merely printing out document numbers and allowing the user to find them, we retrieve the actual documents and print out the document number, the title, the list of descriptors attached to the document, and an abstract of the document. If the user wishes to retrieve more than 100 documents, these additional documents, which merely involve another pass at the Text Tape, can be retrieved at the rate of 1 minute per 100 documents.

**We assume that the matrix density (the relation between actual entries and total possible entries) remains constant, while compression (the relation between pairs and nonzero entries) increases. The pairs figure is then implied by the vocabulary size. It seems reasonable to assume that as the corpus gets larger the same word patterns tend to be repeated; i.e., the old patterns are repeated much more frequently than new ones appear. The assumption about density, however, is simply made for convenience. We do not know what happens when new words are introduced because of an expanding corpus. Do the new words appear in sentences mainly with the old words, or do they tend to form a subgroup of their own? Much more experience with large samples of English text is needed before we can give an informed answer to this question.
For the present program, the relation of corpus size to data preparation and running time is linear; i.e., assuming that the mean sentence length stays the same, doubling the corpus will double the preparation time. The overriding consideration in terms of the data preparation time is the number of pairs produced. The number of pairs produced is critical because the single largest expenditure of time is incurred by the sorting program. The main variable in relating size of text to number of pairs produced is the mean string length. A string of length n will produce n(n-1) pairs. Thus, five 20-word sentences produce 1,900 pairs, whereas one 100-word sentence produces 9,900 pairs. The number of pairs which a corpus will produce can be estimated by the relation:

N = T(S-1)   (15)

where:
N = total number of pairs
T = total number of tokens
S = mean string length (in tokens).

If S remains constant, a linear relation exists between pairs produced and corpus size. Since the relation between sorting time and the number of pairs to be sorted is more or less linear, a linear relation exists between corpus size and preparation time. The main size limitation for the present program is the necessity for having all the row names and row sums in core at once. There seems to be no simple relation between the size of the corpus and the size of the vocabulary, but after a certain point vocabulary size increases very slowly.

Text Preparation (TAPE A)
Concordance Preparation (TAPE B)
Pairs Preparation (TAPE C)

These three subroutines and their resultant output tapes represent the first step in the data preparation phase. The text (tape A) must contain all of the input data necessary to build the matrices. The words on the text tape are processed in two ways: associated with numbers to form the concordance, and paired to form the basic information for the association matrix. The only restrictions on the text tape are:

1. Input may not exceed one tape for any given run.
2.
The records on the tape need not be of uniform length. However, no record may exceed 2000 registers (computer words) in length.
3. The end of information on the tape must be indicated by an end-of-file record. The scan program will cease accepting input upon its first encounter with an end-of-file mark.

The program scans the input data by bytes, each register (or word) of data contributing 6 bytes, or characters. In turn, these strings of characters are extracted to form English words. The words are then used to generate the two output tapes, tape B (tape for concordance) and tape C (pairs tape). The input data are treated as having a certain simple structure: groups of words form sentences, a sentence ending when a period followed by two spaces is encountered, and groups of sentences form messages, a message ending when either a special code or 10 or more blanks is encountered. A period, blank, and comma are all treated as word separators.

The pairing procedure has a large range of options. These are shown in table 5.

TABLE 5. Parameters for scan program

1. Unit of pairing: words within the same message are paired, or only words within the same sentence are paired.
2. Common word list: words on the restricted list either go or do not go into the concordance.
3. Restricted word list pairing: words on the restricted list either are or are not paired.
4. Repeated occurrence pairing: a word either will or will not be paired if it has appeared previously in the same pairing unit.
5. Word distance D: if two words W1 and W2 within the same pairing unit are separated by n intervening words, they are paired only when n + 1 ≤ D.
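The segmentation and pairing rules above, including the n(n-1) pair count and the word-distance parameter D, can be sketched compactly in Python (function names and the splitting details are our own reading of the description):

```python
import re

def sentences(message_text):
    """Split a message into sentences: a sentence ends at a period
    followed by two spaces."""
    return [s for s in re.split(r"\.  ", message_text) if s.strip()]

def words(sentence):
    """Period, blank, and comma all act as word separators."""
    return [w for w in re.split(r"[ .,]+", sentence) if w]

def pairs(tokens, D=None):
    """Ordered co-occurrence pairs within one pairing unit; a string of
    n tokens yields n(n - 1) pairs. With the word-distance parameter D,
    two words separated by n intervening words are paired only when
    n + 1 <= D."""
    out = []
    for i, wi in enumerate(tokens):
        for j, wj in enumerate(tokens):
            if i == j:
                continue
            if D is not None and abs(i - j) > D:
                continue
            out.append((wi, wj))
    return out
```

A 20-token sentence thus yields 20 x 19 = 380 pairs, matching the arithmetic behind formula (15); setting D trims the pair list to near neighbors only.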
54 0.0771 1.87 2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 5 SUQJEC 2855 6.70 6.81 45.48 0.0784 2.72 13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 1 CAN 2822 6.93 6.94 49. 15 0.0739 1.61 RECEIV 2801 6.52 6.57 39.10 0.0764 6.76 7 CONDIT 2779 6.46 6.47 35.52 0.0760 3.52 GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 SINCE 2756 6.89 6.93 48. 6 5 0.0753 1.76 3 DISMIS 2755 5.96 6.48 35.90 0.0790 5.16 WHILE 2749 6.82 6.85 46.31 0.0751 5.29 1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 9 ACCORD 2721 6.87 6.96 49.64 0.0745 2.12 5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 8 ASSIGN 2654 6.00 6.12 29.82 0.0715 6.48 1 USED 2650 6.45 6.58 38.16 0.0734 5.62 AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 THEREO 2640 6.69 6.75 41.60 0.0697 2.61 2 INCLUD 2632 6.71 6.76 43.41 0.0716 3.86 4 GROUND 2629 6.68 6.77 44.16 0.0728 3.25 OVER 2622 6.72 6.71 40.99 0.0701 2.40 2 YEARS 2601 6.53 6.56 37.10 0.0687 3.24 3 SUSTAI 2600 6.65 6.89 46.24 0.0753 3.40 HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 RESPEC 2579 6.80 6.82 44.43 0.0678 1.99 SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 4 CIRCUM 2543 6.75 6.75 41.94 0.0679 2.08 MAKE 2535 6.76 6.84 43.94 0.0681 2.35 2 RELATI 2530 6.54 6.53 37.10 0.0662 3.61 THOSE 2527 6.73 6.77 42.43 0.0642 3.12 3 SUBSTA 2527 6.62 6.71 41.60 0.0693 3.48 8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 TAKEN 2518 6.67 6.76 43.07 0.0697 3.27 4 SUFFIC 2484 6.72 6.ei 42.92 0.0708 2.35 CANNOT 2467 6.74 6.92 46.54 0.0694 >.06 1 THREE 2437 6.70 6.73 41.18 0.0677 J. 
19 SECOND 2415 6.53 6.61 38.50 0.0656 3.97 NOW 2384 6.60 6.80 43.29 0.0629 2.79 4 CONTIN 2382 6.37 6.40 34.35 0.0634 5.85 2 PARTIC 2381 6.48 6.76 42.12 0.0625 3.17 4 PRIOR 2379 6.69 6.74 40.88 0.0654 2.87 UNTIL 2347 6.65 6.70 39.22 0.0 62 8 2.31 7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 STATES 2343 6.38 6.33 33.37 u.0582 6.26 1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 WELL 2259 6.77 6.83 43.14 0.0592 2.87 DURING 2216 6.58 6.62 36.50 C.0609 2.73 5 DAY 2189 6.41 6.46 34.16 0.0607 3.92 11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 1 ENTITL 2141 6.53 6.69 38.42' 0.0591 2.60 6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 Tabl e IV. Sorted by NOCC EK GL EKL 0.56 2.99 0.40 0.42 4.02 0.34 0.34 5.03 0.25 0.51 3.76 0.32 0.12 11.25 0.08 0.17 6.36 0.17 0.59 2.81 0.39 0.48 3.60 0.43 0.46 3.64 0.33 0.14 6.77 0.12 0.67 2.68 0.44 0.27 5.74 0.21 0.26 3.88 0.21 0.50 3.10 0.35 0.62 2.78 0.43 0.16 5.01 0.20 0.43 4.31 0.35 0.20 5.32 0.16 0.62 2.92 0.45 0.15 5.60 0.15 0.12 7.19 0.11 0.24 4.18 0.23 4.32 2.55 31.32 0.20 5.91 0.16 0.42 3.06 0.33 0.39 3.68 0.31 0.38 i.73 0.29 0.43 3.50 0.29 0.31 4.19 0.23 0.40 2.63 0.41 0.36 5.86 0.25 0.54 3.71 0.34 0.23 4.77 0.15 0.15 7.77 0.12 0.49 2.94 0.33 0.54 3.17 0.37 0.30 5.77 0.20 0.46 3.52 0.33 0.36 4.62 0.27 0.21 6.14 0.15 0.37 4.04 0.31 0.45 3.24 0.36 0.57 2.46 0.45 0.40 3.87 0.30 0.31 5.63 0.23 0.46 3.10 0.34 0.21 10.10 0.14 0.41 3.48 0.32 0.41 3.12 0.32 0.42 3.46 0.30 0.15 7.80 0.13 0.22 8.54 0.13 0.23 4.69 0.16 0.73 2.51 0.86 0.51 3.49 0.36 0.36 4.42 0.26 0.26 9.83 0.17 0.24 7.85 0.16 0.38 3.68 0.30 0.20 4.76 0.17 79 VOTES 4 9 2 3 1 1 L 3 4 1 3 WORD VIEW ATTEMP CITED SITUAT CORREC ENTIRE THEREA APPARE INTEND REFERR THOUGH REFUSE NOTHIN APPLIE SUBSEQ MANNER FAVOR OCCURR STAT SIMILA SEVERA NATURE ORDERE BELIEV BECOME CLEARL TRUE SOUGHT MANY SHOWN OTHERW SAY KNOWN TOOK DONE SHOWS THEREI MAKING MOST RAISED LONG PREVIO PURSUA THINK DISCUS HOLD RECOGN EXISTE THERET POSSIB EMPHAS HOLDIN DISTIN ITSELF NEVER PREVEN FILE 
CONS IS MERELY NEITHE NOCC 1406 1404 1401 1358 1358 1350 1342 1334 1333 1309 1301 1286 1275 1264 1263 1259 1249 1248 1245 1243 1243 1185 1180 1176 1158 1145 1140 1132 1117 1106 1095 1088 1083 1080 1079 1078 1068 1060 1051 1050 1047 1040 1039 1035 1034 1033 1033 1029 1022 1018 1012 1008 997 993 976 956 943 941 936 930 E 6.35 6.05 6.41 6.42 6.14 6.30 6.40 6.43 6.29 6.24 6.43 6.14 6.24 6.25 6.25 6.30 6.22 6.05 5.90 6.38 6.32 6.16 6.14 6.22 6.07 6.31 6.23 6.11 6.27 6.15 6.14 6.26 6.12 6.15 6.09 6.16 6.13 6.19 6.25 6.00 6.23 6.16 6.08 6.18 6.22 6.15 6.10 6.06 6.05 6.18 5.96 6.05 6.14 6.25 6.01 6.00 5.49 6.19 6.21 6.16 EL 6.48 6.42 6.54 6.49 6.38 6.41 6.55 6.53 6.39 6.43 6.54 6.22 6.55 6.40 6.37 6.37 6.37 6.11 5.9 3 6.46 6.36 6.31 6.33 6.34 6. 30 6.45 6.36 6.33 6.38 6.36 6.42 6.34 6.17 6.28 6.28 6.35 6.38 6.33 6.31 6.28 6.32 6.31 6.24 6.28 6.31 6.35 6.25 6.17 6.35 6.23 6.00 6.20 6.22 6.33 6.15 6.16 5.87 6.31 6.32 6.38 Table IV. Sorted by PZD 30.95 29.18 30.95 29.40 28.57 28.53 31.03 30.84 27.63 28.65 30.46 24.49 30.65 27.63 26.99 27.29 26.87 21.78 19.10 28.61 27.25 25.48 26.23 25.67 25.36 27.67 26.23 25.44 25.82 25.74 27.18 25.44 22.19 24.46 24.57 25.25 25.70 25.14 24.95 23.93 24.80 24.57 23.17 23.63 24.34 24.61 23.51 22.08 24.95 22.98 19.59 22.76 22.68 24.38 21.32 21.44 17.06 23.66 23.78 24.87 NOCC AVG 0.0375 0.0376 0.0390 0.0368 0.0370 0.0369 0.0389 0.0364 0.0361 0.0341 0.0340 0.0351 0.0345 0.0351 0.0363 0.0329 0.0364 0.0347 0.0383 0.0339 0.0331 0.0313 0.0324 0.0322 0.0320 0.0304 0.0309 0.0316 0.0286 0.0303 0.0307 0.0294 0.0285 0.0302 0.0282 0.0297 0.0279 0.0282 0.0273 0.0290 0.0280 0.0277 0.0271 0.0298 0.0267 0.0270 0.0261 0.0286 0.0278 0.0272 0.0246 0.0265 0.0265 0.0260 0.0254 0.0265 0.0265 0.0260 0.0248 0.0252 G 4.33 4.42 2.52 2.40 4.35 5.20 2.78 3.26 3.14 8.37 2.57 4.26 2.76 2.95 3.67 3.46 3.45 3.73 3.51 2.91 3.47 3.80 3.50 3.33 3.89 2.81 3.33 3.80 2.52 3.38 4.16 2.94 3.59 3.21 3.94 3.55 2.72 4.11 2.65 3.56 3.39 3.93 2.92 3.00 2.85 2.49 3.33- 5.05 3.03 3.04 3.16 
3.62 2.77 2.40 4.03 3.86 5.51 2.47 2.46 2.65 EK 0.29 0.25 0.33 0.33 0.21 0.25 0.32 0.30 0.25 0.24 0.34 0.19 0.33 0.27 0.24 0.27 0.23 0.18 0.15 0.30 0.26 0.22 0.23 0.24 0.23 0.30 0.26 0.21 0.29 0.24 0.25 0.26 0.21 0.24 0.21 0.23 0.27 0.22 0.28 0.21 0.23 0.22 0.22 0.23 0.25 0.26 0.23 0.19 0.25 0.23 0.19 0.21 0.24 0.27 0.19 0.19 0.10 0.26 0.26 0.27 GL 7.01 7.93 3.08 3.07 4.34 6.76 2.92 3.32 4.27 5.55 2.82 4.13 2.84 3.46 3.97 6.32 4.09 4.81 6.23 3.18 7.53 4.10 6.13 3.34 3.96 3.28 4.42 4.23 2.73 3.23 3.79 3.71 4.34 4.38 4.53 3.06 3.38 3.75 6.00 3.95 3.84 3.68 3.93 3.20 3.19 3.24 3.94 4.18 3.31 3.70 5.19 4.43 4.15 3.32 4.18 3.57 4.17 3.02 2.82 2.44 EKL 0.20 0.19 0.27 0.25 0.20 0.20 0.28 0.26 0.21 0.21 0.28 0.17 0.29 0.22 0.21 0.19 0.21 0.15 0.11 0.24 0.18 0.19 0.18 0.21 0.19 0.24 0.20 0.20 0.23 0.22 0.23 0.21 0.16 0.19 0.18 0.22 0.23 0.21 0.18 0.19 0.20 0.20 0.18 0.20 0.21 0.22 0.18 0.16 0.22 0.18 0.13 0.17 0.18 0.22 0.16 0.17 0.12 0.21 0.22 0.25 80 ES WORD NOCC E EL PZD AVG G EK GL EKL 4 VIEW 1406 6.35 6. 48 30.95 0.0 3 75 <».33 0.29 7.01 0.20 9 ATTEMP 1404 6.05 6.42 21. 18 0.0376 4.42 0.25 7.93 0.19 CITED 1401 6.41 6.54 30.95 0.0390 2.52 0.33 3.08 0.27 2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25 3 CORREC 1358 6. 
14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20 1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20 THEREA 1342 6.40 6.55 31.03 0.0389 2.78 0.32 2.92 0.28 1 APPARE 1334 6.43 6.53 30.84 0.0364 3.26 0.30 3.32 0.26 1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21 3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21 THOUGH 1301 6.43 6.54 30.46 0.0340 2.57 0.34 2.82 0.28 2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17 NOTHIN 1275 6.24 6.55 30.65 0.0345 2.76 0.33 2.84 0.29 APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22 2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21 MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19 1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21 4 OCCURR 1248 6.05 6.11 21.78 0.0 347 3.73 0.18 4.81 0.15 1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11 2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24 5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18 1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.10 0.19 ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18 BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21 BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19 CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24 1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20 SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20 MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23 SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22 OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23 SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21 1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16 TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19 DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18 SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22 THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.2 7 3.38 0.23 MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21 MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18 RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19 LONG 1047 6.23 6.32 
24.80 0.0280 3.39 0.23 3.84 0.20 PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20 4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18 1 THINK 1035 6.18 6.28 23.63 0.0298 3.00 0.23 3.2C 0.20 DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21 HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22 5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18 2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16 THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22 POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18 4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13 1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17 3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18 ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22 NEVER 976 6.01 6.15 21.32 0.0254 4.03 0.19 4.18 0.16 5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17 2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12 CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21 MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22 NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25 Table IV. Sorted by NOCC 81 DTE :s WORD NOCG E EL PZD AVG G EK GL EKL FAR 923 6. 11 6.2 4 22.61 0.0247 4.89 0.20 4.79 0.18 LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17 1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18 CLAIME 921 5.97 6.1 / 21.44 0.0261 4.84 0.17 3.94 0.17 RATHER 9L7 6. 15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18 HEARD 933 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14 VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19 7 JUSTIF 885 5.90 6.C7 19.85 0.0235 3.52 0.18 4.41 0.15 HIMSEL 864 5.95 6. 
10 19.85 0.0241 5.07 0.17 3.60 0.16 TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17 RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16 LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17 WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17 SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18 3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16 APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15 SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16 PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15 READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15 7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12 LEAST 766 6.00 6. 11 19.40 0.0206 2.98 0.20 3.43 0.17 AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17 BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14 2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16 4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11 OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14 HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15 LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16 BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17 PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13 THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17 NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14 6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17 1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12 MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15 MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17 COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15 1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17 WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16 MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14 2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15 2 WHOLE 651 5.74 5.7d 14.87 0.0169 3.54 0.14 5.73 0.10 SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15 OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18 FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14 DOING 625 5.71 5.89 16.04 
0.0167 3.56 0.15 5.74 0.12 FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14 2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12 ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12 AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14 DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13 MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15 WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14 1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4.15 0.11 ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12 REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13 ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13 1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12 NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12 HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.05 Table IV. Sorted by NOCG 82 VOTES WORD NOCG E EL PZD AVG G EK GL EKL MOVED 492 5.61 5.7b 1 3.40 0.0149 3.94 0.13 4.21 0.11 RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12 1 CONCED 48 5 5.58 5. 8 '3 14.00 0.0140 3.43 0.14 3.42 0.13 EVER 48L 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10 3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 J. 
13 3.84 0.12 HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11 ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11 SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12 FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11 2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09 1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10 LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10 ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12 STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11 EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10 ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10 SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10 NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12 INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09 INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09 2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09 1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09 NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09 QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09 1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09 1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09 2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08 SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07 SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07 DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07 1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07 1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07 FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07 FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07 Table IV. 
Sorted by NOCC 83 ES WORD NOGC E EL PZD AVG G EK GL EKL THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93 AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 4.32 2.55 31.32 AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57 THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54 FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59 NOT 35835 7.75 7.6C 96.97 0.9798 0.55 6.95 1.90 l.lb THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41 WAS 56044 7.69 7.55 95.73 1.5630 0.52 3.68 1.78 1.33 WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38 LO COURT 33021 7.45 7.41 93.i-8 0.9097 1.64 1.26 3.97 0.76 FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19 WITH 21624 7.64 7.51 92. 03 0.5840 1.15 3.46 2.15 1.16 HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97 SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74 3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80 ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86 THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91 BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95 ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83 UPON 11816 7.46 7.40 82.93 0.3232 1.37 1.76 1.83 0.95 HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69 HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83 UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69 WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70 BUT 9174 7.48 7.37 78.89 0.2485 0.84 2.21 2.06 0.89 HIS 19529 7.32 7.2 2 78.63 0.5396 1.55 1.03 2.83 0.60 9 APPEAL 9096 6 B 80 /.06 77.61 0.2637 4.94 0.30 5.35 0.33 2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62 MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72 1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75 OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76 ITS 11061 7.31 7.20 75.34 0.2888 1.71 1.13 3.49 0.54 1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64 MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76 2 LAW 9658 7.23 7.20 74.29 0.2554 2.34 0.88 3.39 0.54 9 JUDGME 10581 7.06 7.17 73.19 0.3119 
3.01 0.54 4.08 0.49 WOULD 9678 7.34 7.23 73.12 0.2580 1.43 1.34 2.49 0.64 2 REASON 6845 7.17 7.25 72.48 0.1850 2.15 1.11 2.86 0.64 1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 1.38 1.88 0.82 5 DEFEND 25773 7.20 7.12 71.19 0.7468 1.34 0.79 2.43 0.53 3 TIME 8254 7.17 7.20 70.40 0.2237 2.55 0.92 2.17 0.62 WHEN 6875 7.28 7.24 69.87 0.1866 1.54 1.20 2.24 0.69 1 FOLLOW 6076 7.28 7.24 69.38 0.1661 1.30 1.18 2.44 0.69 SAID 10747 7.07 6.93 69.15 0.2803 4.45 0.50 6.83 0.27 BEFORE 5814 7.19 7.23 68.55 0.1612 2.12 0.95 2.63 0.66 AFTER 6340 7.24 7.21 68.47 0.1745 1.62 1.06 2.27 0.65 3 PRESEN 5653 7.18 7.20 68.25 0.1558 2.26 0.88 3.49 0.58 ALSO 5230 7.29 7.23 67.15 0.1410 1.08 1.33 1.95 0.71 DID 6224 7.24 7.17 66.70 0.1665 1.55 1.03 2.52 0.59 MUST 5208 7.18 7.22 66.70 0.1412 1.83 1.08 2.79 0.64 SHOULD 5689 7.20 7.20 66.59 0.1511 1.89 1.02 2.45 0.63 WHETHE 5173 7.22 7.19 66.13 0.1408 1.69 1.04 2.57 0.61 9 EVIDEN 12726 7.10 7.02 65.64 0.3461 1.64 0.71 3.09 0.43 WHERE 5794 7.19 7.16 65.26 0.1562 1.64 1.03 2.43 0.58 6 ACTION 8248 6.94 6.92 64.55 0.2329 3.64 0.39 4.77 0.31 THEY 7042 7.14 7.08 64.47 0.1897 2.45 0.77 3.52 0.45 2 REQUIR 6103 7.06 7.10 63.98 0.1665 2.34 0.74 4.53 0.47 4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86 8 CONS ID 5288 7.15 7.14 63.72 0.1379 2.06 0.93 2.68 0.56 WITHOU 4652 7.10 7.17 63.57 0.1274 2.02 0.91 2.39 0.62 Table y. Sorted by PZD TES WORD NOCC E EL PZD AVG G EK GL EKL 4 AFFIRM 3897 6.89 7.23 63.53 0. 1109 2.26 0.78 2.61 0.70 DOES 4264 7.09 7.2 63.30 0.1175 1.80 0.96 2.11 0.67 7 TRIAL 9898 6.97 6.98 62.35 0.2884 2.75 0.45 2.96 0.41 7 WILL 7140 6.84 6.74 62.55 0.1944 5.49 0.26 12.86 0.15 THEREF 3871 7.01 7.18 62.21 0.1050 1.43 0.90 2.25 0.65 2 STATE 9231 6.85 6.80 62.06 0.2417 3.06 0.39 4.64 0.25 FURTHE 4546 7.11 7.13 61.94 0.1230 1.92 0.91 3.44 0.53 AGAINS 5725 7.04 7.06 61.83 0.1605 2.56 0.63 3.13 0.46 COULD 5096 7. 
16 7.11 61.79 0.1383 1.59 0.95 2.98 0.54 THEIR 6514 7.08 7.02 61.75 0.1756 2.19 0.70 3.29 0.42 4 PERSON 6980 7.01 6.94 60.81 0.1897 2.61 0.57 5.09 0.33 SAME 4992 7.05 7.07 60.73 0.1299 2.47 0.76 3.32 0.48 1 PART 4746 7.12 7.09 60.62 0.1287 2.57 0.78 2.85 0.52 1 TWO 5130 7.11 7.11 60.51 0.1408 1.59 0.85 2.47 0.55 6 RECORD 6093 6.91 6.98 60.51 0.1675 5.25 0.41 4.95 0.35 4 FACT 4658 7.06 7.10 60.28 D.1249 2.10 0.80 2.40 0.54 1 PROVID 5792 7.03 7.02 60.02 0.1599 2.56 0.64 3.62 0.42 THESE 4753 7.11 7.07 59.79 0.1275 1.97 0.83 3.27 0.48 WHO 5241 7.11 7.03 59.64 0.1416 1.89 0.79 3.51 0.44 4 DETERM 5030 7.02 7.01 59.45 0.1314 3.04 0.64 3.95 0.40 THAN 4378 7.11 7.13 59.38 G.1198 2.23 0.81 2.63 0.54 THEN 4583 7.12 7.07 59.19 0.1242 2.04 0.82 2.60 0.51 3 OPINIO 4764 7.02 6.98 '58.85 C.1218 2.05 0.71 4.63 0.37 5 DIRECT 5706 6.95 6.92 58.62 0.1575 5.12 0.44 6.63 0.29 3 ORDER 6773 6.78 6.77 58.32 0.1918 3.68 0.31 11.48 0.19 2 PLAINT 20986 7.02 6.94 57.71 0.6097 1.25 0.64 2.24 0.43 5 APPEAR 3855 6.95 7.00 57.68 0.1045 3.97 0.56 9.43 0.32 2 BEING 3858 7.04 7.08 57.41 0.1040 2.13 0.75 2.89 0.52 BECAUS 3553 7.00 7.11 57.19 0.0999 2.04 0.75 2.28 0.58 3 FIRST 4165 7.01 7.04 57.15 0.1116 2.30 0.71 3.27 0.46 CONTEN 3888 7.02 7.09 57.11 0.1094 2.14 0.71 2.24 0.56 OUT 4389 7.00 6.99 57.04 0.1164 3.00 0.65 6.13 0.37 HOWEVE 3333 7.09 7.11 55.90 0.0923 1.47 0.90 1.76 0.62 1 FACTS 4095 7.00 7.01 55.79 0.1137 3.05 0.60 2.90 0.46 2 SECTIO 10226 6.83 6.76 55.75 0.2858 2.91 0.38 4.29 0.27 WITHIN 4561 6.85 6.97 55.56 0.1294 2.63 0.5Q 3.59 0.41 HELD 3978 7.04 7.02 55.34 0. 1058 1.92 0.75 2.83 0.47 FILED 5362 6.67 6.91 55.26 0.1589 4.09 0.33 3.46 0.36 MATTER 4313 6.91 6.96 55.19 0. 1166 3.11 0.53 4.12 0.38 1 PROCEE 5021 6.79 6.84 55.19 0.1373 3.56 0.40 6.15 0.26 SEE 4704 6.93 6.88 55.00 0.1297 2.95 0.47 3.89 0.33 2 STATED 3698 6.99 6.99 54.77 0.0975 2.37 0.68 3.69 0.42 5 CAUSE 4463 6.77 6.90 54.28 0.1255 2.98 0.43 4. 
08 0.34 6 RIGHT 5447 6.76 6.86 54.24 0.1464 2.91 0.47 3.87 0.32 HIM 5613 6.91 6.85 54.24 0.1531 2.49 0.52 6.64 0.29 7 CONCLU 3665 6.95 7.02 53.90 0.1010 2.50 0.64 2.52 0.49 8 MOTION 6621 6.71 6.84 53.90 0.1942 3.78 0.30 3.36 0.33 4 FOUND 3608 6.91 6.98 53.68 0.1017 2.73 0.53 3.16 0.43 5 STATUT 7283 6.89 6.80 53.15 0. 1985 2.26 0.48 4.39 0.29 7 CONTRA 8033 6.56 6.49 52.96 0.2158 3.98 0.23 7.29 0.15 4 GENERA 5262 6.87 6.82 52.92 0.1338 3.11 0.47 5.01 0.28 HERE 3448 6.93 6.97 52.69 0.0938 1.92 0.66 3.12 0.43 10 COUNTY 6245 6.62 6.52 52.43 0.1787 5.00 0.23 8.51 0.14 5 EFFECT 3759 6.91 6.92 52.39 0.1C18 2.86 0.56 7.29 0.34 6 AUTHOR 4898 6.78 6.81 52.32 0.1319 4.35 0.37 4.61 0.28 9 NECESS 3477 6.93 6.93 52.20 0.0937 3.31 0.52 4.91 0.35 END 6422 6.81 6.71 51.86 0.1570 3.07 0.44 6.84 0.22 CASES 3896 6.86 6.90 51.41 0.1062 2.58 0.54 3.22 0.38 INTO 3583 6.93 6.92 51.00 0.0952 2.51 0.57 3.14 0.39 SOME 3394 6.97 6.93 50.88 0.0897 1.97 0.67 4.84 0.39 Table V. Sorted by p; 2D 85 VOTES WORD NOOC E EL PZD AVG G EK GL EKL 2 CERTAI 3069 6.87 6.96 50.62 0.0830 2.20 0.65 3.90 0.42 3 APPELL 14543 6.53 6.4't 50.16 C.3877 3.05 0.23 5.26 0.16 6 EXCEPT 3589 6.58 6.82 49.79 0.1046 5.95 0.26 4.72 0.30 9 ACCORD 2721 6.37 6.96 49.64 0.0745 2.12 0.62 2.92 0.45 1 SEC 6808 6.65 6.62 49.60 0.1929 3.75 0.27 4.50 0.21 MORE 3050 6.94 6.95 49.49 0.0822 1.98 0.66 2.76 0.45 THEM 3505 6.92 6.89 49.37 0.0943 2.56 0.56 4.37 0.36 2 PURPOS 4138 6.76 6.76 49.30 0.1096 3.99 0.41 6.33 0.25 SHALL 6240 6.81 6.73 49.18 0.1705 2.77 0.43 4.34 0.27 1 CAN 2822 6.93 6.94 49.15 0.0739 1.61 0.67 2.68 SINCE 2756 6.89 6.93 48.65 0.0753 1.76 0.62 2.78 ENTERE 2920 6.78 6.87 48.58 0.0873 3.29 0.42 4.02 1 RESULT 3328 6.85 6.86 48.50 0.0911 3.50 0.49 3.97 1 NEW 4744 6.68 6.72 48.09 0.1295 3.77 0.31 4.33 OUR 3179 6.80 6.83 47.98 0.0833 2.15 0.55 4.84 INVOLV 2933 6.56 6.90 47.86 0.0789 2.29 0.56 2.99 2 ALLEGE 3766 6.72 6.81 47.86 0.1091 3.04 0.40 3.37 BETWEE 3231 6.84 6.87 47.45 0.0879 2.33 0.55 2.83 3 APPLIC 4168 6.58 6.60 47.37 
0.1134 4.97 0.25 8.13 2 PROVIS 4479 6.80 6.77 47.18 0.1251 2.55 0.45 3.69 6 RULE 4090 6.56 6.70 47.18 0.1055 4.23 0.31 12.48 2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 0.48 3.60 5 JUDGE 4000 6*52 6.64 46.84 0.1181 10.30 0.19 6.80 2 DECISI 3988 6.52 6.69 46.58 0.1070 4.00 0.30 5.57 CANNOT 2467 6.74 6.92 46.54 0.0694 2.06 0.57 2.46 1 BOTH 2868 6.85 6.88 46.54 0.0771 1.87 0.59 2.81 SET 2964 6.71 6.84 46.54 0.0798 3.36 0.45 3.72 3 SUPPOR 3151 6.65 6.67 46.35 0.0855 7.06 0.24 9.79 WHILE 2749 6.82 6.85 46.31 0.0751 5.29 0.43 4.31 3 SUSTAI 2600 6.65 6.89 46.24 0.0753 3.40 0.40 2.63 1 ACT 5147 6.65 6.59 45.56 0.1370 3.30 0.32 6.21 5 SUBJEC 2855 6.70 6.81 45.48 0.0784 2.72 0.46 3.64 ITAL 11360 6.67 6.57 45.18 0.2755 3.12 0.37 7.32 FOL 5682 6.67 6.57 45.18 0.1378 3.12 0.37 7.39 GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 0.50 3.10 1 APP 4769 6.74 6.72 44.92 0.1292 2.51 0.41 3.31 WHAT 2883 6.76 6.79 44.80 0.0725 2.52 0.51 3.76 6 ERROR 3841 6.56 6.66 44.80 0.1051 3.69 0.29 4.33 1 ESTABL 2947 6.74 6.72 44.46 0.0788 3.00 0.45 17.95 RESPEC 2579 6.80 6.82 44.43 0.0678 1.99 0.54 3.71 4 GROUND 2629 6.68 6.77 44.16 0.0728 3.25 0.38 5.73 MAKE 2535 6.76 6.84 43.94 0.0681 2.35 0.54 3.17 1 EACH 3332 6.68 6.69 43. 9C 0.0859 4.53 0.36 5.12 2 INCLUD 2632 6.71 6.76 43.41 0.0716 3.86 0.39 3.68 NOW 2384 6.60 6.80 43. 2S 0.0629 2.79 0.46 3.10 NOR 2099 6.70 6.86 43. 
14 0.0581 1.94 0.53 2.78 WELL 2259 6.77 6.83 43.14 0.0592 2.87 0.51 3.49 TAKEN 2518 6.67 6.76 43.07 0.0697 3.27 0.37 4.04 6 CONSTI 4132 6.41 6.49 42.99 0.1058 3.48 0.28 7.53 4 SUFFIC 2484 6.72 6.81 42.92 0.0708 2.35 0.45 3.24 5 ISSUE 3113 6.61 6.66 42.88 0.0831 3.76 0.32 4.98 3 COMMON 4042 6.46 6.48 42.58 0.1171 5.85 0.19 7.01 THOSE 2527 6.73 6.77 42.43 0.0642 3.12 0.46 3.52 7 SPECIF 2900 6.65 6.68 42.28 0.0790 3.75 0.34 5.03 2 PARTIC 2381 6.48 6.76 42.12 0.0625 3.17 0.41 3.48 HAVING 2006 6.67 6.86 42.09 0.0548 2.18 0.51 2.07 4 CIRCUM 2543 6.75 6.75 41.94 0.0679 2.08 0.49 2.94 HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 0.36 5.86 5 PARTIE 3496 6.55 6.59 41.71 0.0960 3.86 0.29 4.47 THEREO 2640 6.69 6.75 41.60 0.0697 2.61 0.42 3.06 Table V. Sorted by PZD 86 .44 .43 .34 .34 .26 .31 .40 .33 .38 .16 .30 .20 .43 .20 .23 .45 .39 .35 0. .18 .35 .41 0, .20 0, .33 .19 .19 0. ,35 0. ,29 0. .32 0. ,24 0, 18 0. 34 0. 29 0. 37 0. 25 0. 31 0. 34 0. 40 0. 36 0. 31 0. 15 0. 36 0. 23 0. 16 0. 33 0. 25 0. 32 0. 43 0. 33 0. 25 0. 22 0. 33 VOTES 3 3 1 3 7 3 9 5 7 8 1 2 1 7 11 2 10 1 5 7 WORD SUBSTA FINDIN THREE A80UT OVER PRIOR CHARGE CONSTR DENIED PETITI EITHER CONTRO PERMIT OPERAT ANSWER UNTIL RECEIV EVEN ALTHOU SECOND ENTITL USED CONTAI CITY FIND INDICA AMOUNT COMPLA YEARS RELATI PROPER DURING ANOTHE USE EXPRES D ISM IS PUBLIC EXAMIN CONOIT INTERE ABOVE OWN INSTAN THUS CONCER TESTIM THROUG PRINCI OHIO CONTIN MIGHT JURY STATEM DAY OFFICE SHOW UNLESS BROUGH PAGE CLEAR Table NOCC 252 7 3437 2437 3228 2622 2379 4622 3805 2053 7623 2033 2941 2869 4207 3398 2347 2801 1964 1762 2415 2141 2650 2096 5969 1954 1901 3110 3971 2601 2530 5913 2216 1881 3852 2022 2755 4658 3117 2779 3637 1812 1857 1867 1622 1797 3650 1954 2158 8519 2382 1734 5530 2732 2189 4060 1649 1520 1534 3218 1537 V. 
E 6.6 2 6.56 6.7C 6.65 6.72 6.69 6.48 6.53 6.30 6.19 6.71 6.48 6.35 6.52 6.42 6.65 6.52 6.64 6.67 6.53 6.53 6.45 6.55 6.24 6.51 6.64 6.49 6.40 6.53 6.54 6.40 6.58 6.57 6.29 6.51 5.96 6.33 6.19 6.46 6.36 6.40 6.53 6.54 6.58 6.57 6.42 6.52 6.46 6.49 6.37 6.57 6.41 6.32 6.41 6.26 6.36 6.54 6.50 6.47 6.52 Sorted EL ?ZD 6.7 1 4 1.60 ,59 41.56 .73 41.13 65 41. 10 ,71 40.99 74 40.88 47 40.69 55 40, 50 6 6. 6, 6, 6, 6, 6. 6.77 40.39 6.44 40.39 6.78 40.20 6.55 39.93 6.49 39.63 6.45 39.56 6.41 39.33 6.70 39.22 6.57 39.10 6.75 38.80 6.77 38.65 6.61 38.50 6.69 38.42 6.5 8 38.16 6.65 38.12 6.23 38.05 6.66 37.75 6.70 37.67 6.52 37.56 6.45 37.44 6.56 37.10 6.53 37.10 6.34 36.91 6.62 36.50 6.65 36.35 6.27 36.12 6.61 36.01 6.48 35.90 6.30 35.78 6.23 35.56 6.47 35.52 6.32 35.33 6.63 35.18 6.60 34.99 6.60 34. £18 6.65 34.80 6.59 34.76 6.41 34.65 6.56 34.61 6.43 34.61 6.35 34.39 6.40 34.35 6.63 34.27 6.31 34.27 6.36 34.16 6.46 34.16 6.12 33.93 6.59 33.89 6.63 33.82 6.59 33.74 6.45 33.71 6.57 33.48 by PZD AVC 0.0693 0.0995 0.0677 0.0 8 82 0.0701 0654 1234 1054 0580 0.2198 0.0532 ,0 849 ,0320 ,1145 ,0913 0.0628 0.0764 0.0509 0.0487 0.0656 0.0591 Q.0734 0.0578 0.1706 0.0519 0.0499 0.0869 0.1136 0.0687 0.0662 0.1591 0.0609 0.0500 0.1059 0.0546 0.0790 0.1226 0.0831 0.0760 0.0944 0.0483 0.0502 0.0494 0.0427 0.0468 0.1010 0.0531 0.0564 0.2212 0.0634 0.0465 0.1470 0.0720 0.0607 0.1032 0.0470 0.0418 0.0460 0.0815 0.0425 G 3.48 4.00 3.19 2.68 2.40 2.87 3.96 3.38 2.91 3.73 1.96 5.05 6.17 3.54 5.64 2.31 6.76 2.09 1.78 3.97 2.60 5.62 3.35 3.90 3.11 2.45 3.85 4.27 3.24 3.61 3.62 2.73 2.97 4.86 3.21 5.16 4.86 7.01 3.52 5.26 2.94 2.91 2.58 2.08 4.40 3.30 3.87 6.01 2.35 5.85 2.40 3.35 4.77 3.92 4.82 3.26 2.32 4.00 2.83 3.35 EK 0.36 0.26 0.40 0.39 0.43 0.41 0.24 0.30 0.37 0.19 0.50 0.23 0.17 0.27 0.22 0.42 0.27 0.49 0.50 0.31 0.38 0.24 0.35 0.18 0.35 0.42 0.27 0.22 0.31 0.30 0.23 0.36 0.37 0.18 0.34 0.16 0.20 0.15 0.26 0.20 0.35 0.35 0.36 0.42 0.34 0.25 0.30 0.24 0.28 0.21 0.39 0.24 0.20 0.26 0.17 0.32 
0.39 0.29 0.31 0.33 GL 4.62 3.90 3.87 3.45 3.50 3.12 4.95 4.65 2.72 5.82 3.10 5.00 6.36 4.52 9.44 3.46 5.74 3.06 2.66 5.63 3.68 4.18 5.43 5.82 3.70 3.59 3.75 4.90 4.19 5.77 5.71 4.42 3.17 7.72 4.18 5.01 5.07 8.63 3.88 5.71 3.03 3.93 3.01 2.88 3.67 3.88 4.00 7.85 5.51 10.10 2.78 4.31 5.32 9.83 18.75 3.21 2.95 3.64 5.57 5.39 EKL 0.27 0.23 0.30 27 29 32 18 21 35 18 0.35 0.20 0.17 18 13 30 21 0.35 0.37 0.23 0.30 0.23 0.25 0.13 0.28 0.31 0.22 0.19 0.23 0.20 0.15 0.26 0.29 0.12 0.26 0.20 0.15 0.11 0.21 0.15 0.29 0.27 0.28 0.31 0.26 0.20 0.24 0.16 0.17 0.14 0.30 0.17 0.16 0.17 0.07 0.28 0.30 0.27 0.19 0.24 87 VOTES WORD NOCC E EL PZD AVG G EK GL EKL STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 8.54 0.13 DIFFER 1714 6.46 6.55 33.14 0.0466 3.96 0.29 3.56 0.25 WAY 1771 6.21 6.45 32.91 0.0472 6.65 0.22 10.08 0.16 4 ILL 8605 6.49 6.46 32.88 0.2551 1.95 0.34 3.00 0.24 2 BASED 1605 6.38 6.56 32.84 0.0431 2.60 0.35 3.70 0.26 CALLED 1618 6.40 6.57 32.76 0.0444 4.43 0.31 3.42 0.27 7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13 5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09 9 COUNSE 3030 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14 5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 0.15 5.60 0.15 5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08 3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19 9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 0.15 7.77 0.12 2 ADDITI 1708 6.39 6.49 32.12 0.0453 5.06 0.25 4.68 0.22 3 DUE 1937 6.40 6.47 32.08 0.0542 4.13 0.25 3.79 0.22 5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18 9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16 HER 7548 6.30 6.20 31.89 0.2095 4.05 0.20 4.75 0.14 1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19 1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19 8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15 2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14 4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20 2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19 6 
6 COURTS 2033 6.28 6.36 31.21 0.0553 9.19 0.16 5.77 0.17
THEREA 1342 6.40 6.55 31.03 0.0389 2.78 0.32 2.92 0.28
CITED 1401 6.41 6.54 30.95 0.0390 2.52 0.33 3.08 0.27
4 VIEW 1406 6.35 6.48 30.95 0.0375 4.33 0.29 7.01 0.20
1 APPARE 1334 6.43 6.53 30.84 0.0364 3.26 0.30 3.32 0.26
REGARD 1466 6.39 6.52 30.80 0.0380 3.05 0.32 3.05 0.26
5 BASIS 1500 6.41 6.47 30.76 0.0412 5.82 0.26 5.60 0.21
13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12
NOTHIN 1275 6.24 6.55 30.65 0.0345 2.76 0.33 2.84 0.29
2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21
THOUGH 1301 6.43 6.54 30.46 0.0340 2.57 0.34 2.82 0.28
OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20
6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16
6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17
TAKE 1484 6.38 6.47 30.35 0.0407 3.85 0.27 3.52 0.23
FAILED 1442 6.29 6.48 30.31 0.0414 3.32 0.29 3.79 0.23
5 FAILUR 1630 6.16 6.43 30.16 0.0459 3.81 0.24 4.43 0.21
DECIDE 1409 6.41 6.50 29.89 0.0381 2.48 0.31 3.99 0.25
8 ASSIGN 2654 6.00 6.12 29.82 0.0715 6.48 0.12 7.19 0.11
2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23
13 JURISD 3056 6.00 6.10 29.67 0.0812 4.48 0.14 6.50 0.11
8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10
4 CODE 4152 6.21 6.18 29.55 0.1146 4.17 0.17 5.98 0.13
LATER 1426 6.43 6.47 29.48 0.0387 2.75 0.31 3.52 0.24
1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21
5 REQUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15
2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25
SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 0.23 4.77 0.15
8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08
9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19
5 ADMITT 1667 6.32 6.32 28.87 0.0436 3.82 0.23 5.59 0.17
FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20
3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19
3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21
2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24
3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20
1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14
1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20
2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21
6 DUTY 1873 6.25 6.30 28.35 0.0506 3.82 0.21 5.09 0.17
3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70
1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16
CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24
APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22
1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21
1 SUPREM 1904 6.16 6.24 27.44 0.0474 3.73 0.21 6.65 0.14
OBTAIN 1498 6.18 6.30 27.40 0.0397 3.28 0.23 5.62 0.17
MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19
5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18
OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23
2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21
1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21
1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20
ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18
MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23
2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16
SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22
THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23
BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21
1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.10 0.19
SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21
SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20
BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19
SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22
MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21
1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17
THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22
MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18
NEITHE 930 6.16 6.36 24.87 0.0252 2.65 0.27 2.44 0.25
LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20
HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22
PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20
DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18
2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17
TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19
ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22
DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21
RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19
MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22
CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21
1 THINK 1035 6.18 6.26 23.63 0.0298 3.00 0.23 3.20 0.20
5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18
4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18
POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18
1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17
1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12
3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18
FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18
1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18
1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16
2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16
RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18
VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19
4 OCCURR 1248 6.05 6.11 21.78 0.0347 3.73 0.18 4.81 0.15
LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17
5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17
CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17
NEVER 976 6.01 6.15 21.32 0.0254 4.03 0.19 4.18 0.16
TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17
SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18
LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17
WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17
RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16
3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16
HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14
HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16
7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15
APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15
4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13
LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17
AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17
2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16
MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17
1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11
THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17
6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17
PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15
LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16
SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16
BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17
OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14
READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15
OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18
1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17
NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14
HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15
MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15
BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14
WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16
COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15
PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13
2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12
7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12
MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14
2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10
SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15
2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15
FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14
MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15
DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12
FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14
AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14
1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12
WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14
2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12
DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13
REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13
2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10
ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13
NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12
ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12
1 CONCED 485 5.58 5.83 14.00 0.0140 3.43 0.14 3.42 0.13
ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12
RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12
1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12
1 OPPORT 545 5 -V , 5.75 13.70 0.0146 5.13 0.11 4.15 0.11
3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12
4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11
MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11
SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12
HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07
HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11
ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11
EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10
ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12
FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11
NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12
STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11
1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10
LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10
SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10
2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09
ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10
EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10
INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09
INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09
1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09
1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09
2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09
QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09
NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09
1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09
2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08
SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07
DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07
1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07
SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07
1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07
FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07
FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07
Table V.
Sorted by PZD

Table VI. Sorted by E
VOTES WORD NOCC E EL PZD AVG G EK GL EKL
THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93
AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57
THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54
NOT 35835 7.75 7.60 96.97 0.9798 0.55 6.95 1.90 1.16
FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59
WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38
WAS 56044 7.69 7.55 95.73 1.5630 0.52 3.68 1.78 1.33
THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41
WITH 21624 7.64 7.51 92.03 0.5840 1.15 3.46 2.15 1.16
FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19
HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97
BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95
SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74
BUT 9174 7.48 7.37 78.89 0.2485 0.84 2.21 2.06 0.89
THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91
ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83
UPON 11816 7.46 7.40 82.93 0.3232 1.37 1.76 1.83 0.95
ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86
10 COURT 33021 7.45 7.41 93.58 0.9097 1.64 1.26 3.97 0.76
3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80
OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76
WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70
HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69
UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69
1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75
MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72
HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83
1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64
WOULD 9678 7.34 7.23 73.12 0.2580 1.43 1.34 2.49 0.64
1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 1.38 1.88 0.82
HIS 19529 7.32 7.22 78.63 0.5396 1.55 1.03 2.83 0.60
MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76
ITS 11061 7.31 7.20 75.34 0.2888 1.71 1.13 3.49 0.54
ALSO 5230 7.29 7.23 67.15 0.1410 1.08 1.33 1.95 0.71
1 FOLLOW 6076 7.28 7.24 69.38 0.1661 1.30 1.18 2.44 0.69
WHEN 6875 7.28 7.24 69.87 0.1866 1.54 1.20 2.24 0.69
2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62
DID 6224 7.24 7.17 66.70 0.1665 1.55 1.03 2.52 0.59
AFTER 6340 7.24 7.21 68.47 0.1745 1.62 1.06 2.27 0.65
2 LAW 9658 7.23 7.20 74.29 0.2554 2.34 0.88 3.39 0.54
WHETHE 5173 7.22 7.19 66.13 0.1408 1.69 1.04 2.57 0.61
5 DEFEND 25773 7.20 7.12 71.19 0.7468 1.34 0.79 2.43 0.53
SHOULD 5689 7.20 7.20 66.59 0.1511 1.89 1.02 2.45 0.63
WHERE 5794 7.19 7.16 65.26 0.1562 1.64 1.03 2.43 0.58
BEFORE 5814 7.19 7.23 68.55 0.1612 2.12 0.95 2.63 0.66
MUST 5208 7.18 7.22 66.70 0.1412 1.83 1.08 2.79 0.64
3 PRESEN 5653 7.18 7.20 68.25 0.1558 2.26 0.88 3.49 0.58
2 REASON 6845 7.17 7.25 72.48 0.1850 2.15 1.11 2.86 0.64
3 TIME 8254 7.17 7.20 70.40 0.2237 2.55 0.92 2.17 0.62
COULD 5096 7.16 7.11 61.79 0.1383 1.59 0.95 2.98 0.54
8 CONSID 5288 7.15 7.14 63.72 0.1379 2.06 0.93 2.68 0.56
THEY 7042 7.14 7.08 64.47 0.1897 2.45 0.77 3.52 0.45
THEN 4583 7.12 7.07 59.19 0.1242 2.04 0.82 2.60 0.51
1 PART 4746 7.12 7.09 60.62 0.1287 2.57 0.78 2.85 0.52
1 TWO 5130 7.11 7.11 60.51 0.1408 1.59 0.85 2.47 0.55
WHO 5241 7.11 7.03 59.64 0.1416 1.89 0.79 3.51 0.44
FURTHE 4546 7.11 7.13 61.94 0.1230 1.92 0.91 3.44 0.53
THESE 4753 7.11 7.07 59.79 0.1275 1.97 0.83 3.27 0.48
THAN 4378 7.11 7.13 59.38 0.1198 2.23 0.81 2.63 0.54
9 EVIDEN 12726 7.10 7.02 65.64 0.3461 1.64 0.71 3.09 0.43
7 CONCLU 3665 6.95 7.02 53.90 0.1010 2.50 0.64 2.52 0.49
5 APPEAR 3855 6.95 7.00 57.68 0.1045 3.97 0.56 9.43 0.32
5 DIRECT 5706 6.95 6.92 58.62 0.1575 5.12 0.44 6.63 0.29
MORE 3050 6.94 6.95 49.49 0.0822 1.98 0.66 2.76 0.45
6 ACTION 8248 6.94 6.92 64.55 0.2329 3.64 0.39 4.77 0.31
1 CAN 2822 6.93 6.94 49.15 0.0739 1.61 0.67 2.68 0.44
HERE 3448 6.93 6.97 52.69 0.0938 1.92 0.66 3.12 0.43
INTO 3583 6.93 6.92 51.00 0.0952 2.51 0.57 3.14 0.39
SEE 4704 6.93 6.88 55.00 0.1297 2.95 0.47 3.89 0.33
9 NECESS 3477 6.93 6.93 52.20 0.0937 3.31 0.52 4.91 0.35
THEM 3505 6.92 6.89 49.37 0.0943 2.56 0.56 4.37 0.36
HIM 5613 6.91 6.85 54.24 0.1531 2.49 0.52 6.64 0.29
4 FOUND 3608 6.91 6.98 53.68 0.1017 2.73 0.53 3.16 0.43
5 EFFECT 3759 6.91 6.92 52.39 0.1018 2.86 0.56 7.29 0.34
MATTER 4313 6.91 6.96 55.19 0.1166 3.11 0.53 4.12 0.38
6 RECORD 6093 6.91 6.98 60.51 0.1675 5.25 0.41 4.95 0.35
SINCE 2756 6.89 6.93 48.65 0.0753 1.76 0.62 2.78 0.43
4 AFFIRM 3897 6.89 7.23 63.53 0.1109 2.26 0.78 2.61 0.70
5 STATUT 7283 6.89 6.80 53.15 0.1985 2.26 0.48 4.39 0.29
9 ACCORD 2721 6.87 6.96 49.64 0.0745 2.12 0.62 2.92 0.45
2 CERTAI 3069 6.87 6.96 50.62 0.0830 2.20 0.65 3.90 0.42
4 GENERA 5262 6.87 6.82 52.92 0.1338 3.11 0.47 5.01 0.28
CASES 3896 6.86 6.90 51.41 0.1062 2.58 0.54 3.22 0.38
1 BOTH 2868 6.85 6.88 46.54 0.0771 1.87 0.59 2.81 0.39
WITHIN 4561 6.85 6.97 55.56 0.1294 2.63 0.50 3.59 0.41
2 STATE 9231 6.85 6.80 62.06 0.2417 3.06 0.39 4.64 0.25
1 RESULT 3328 6.85 6.86 48.50 0.0911 3.50 0.49 3.97 0.34
BETWEE 3231 6.84 6.87 47.45 0.0879 2.33 0.55 2.83 0.38
7 WILL 7140 6.84 6.74 62.55 0.1944 5.49 0.26 12.86 0.15
2 SECTIO 10226 6.83 6.76 55.75 0.2858 2.91 0.38 4.29 0.27
WHILE 2749 6.82 6.85 46.31 0.0751 5.29 0.43 4.31 0.35
SHALL 6240 6.81 6.73 49.18 0.1705 2.77 0.43 4.34 0.27
END 6422 6.81 6.71 51.86 0.1570 3.07 0.44 6.84 0.22
RESPEC 2579 6.80 6.82 44.43 0.0678 1.99 0.54 3.71 0.34
OUR 3179 6.80 6.83 47.98 0.0833 2.15 0.55 4.84 0.31
GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 0.50 3.10 0.35
2 PROVIS 4479 6.80 6.77 47.18 0.1251 2.55 0.45 3.69 0.30
9 APPEAL 9096 6.80 7.06 77.61 0.2637 4.94 0.30 5.35 0.33
1 PROCEE 5021 6.79 6.84 55.19 0.1373 3.56 0.40 6.15 0.26
ENTERE 2920 6.78 6.87 48.58 0.0873 3.29 0.42 4.02 0.34
3 ORDER 6773 6.78 6.77 58.32 0.1918 3.68 0.31 11.48 0.19
6 AUTHOR 4898 6.78 6.81 52.32 0.1319 4.35 0.37 4.61 0.28
WELL 2259 6.77 6.83 43.14 0.0592 2.87 0.51 3.49 0.36
5 CAUSE 4463 6.77 6.90 54.28 0.1255 2.98 0.43 4.08 0.34
MAKE 2535 6.76 6.84 43.94 0.0681 2.35 0.54 3.17 0.37
WHAT 2883 6.76 6.79 44.80 0.0725 2.52 0.51 3.76 0.32
6 RIGHT 5447 6.76 6.86 54.24 0.1464 2.91 0.47 3.87 0.32
2 PURPOS 4138 6.76 6.76 49.30 0.1096 3.99 0.41 6.33 0.25
4 CIRCUM 2543 6.75 6.75 41.94 0.0679 2.08 0.49 2.94 0.33
CANNOT 2467 6.74 6.92 46.54 0.0694 2.06 0.57 2.46 0.45
1 APP 4769 6.74 6.72 44.92 0.1292 2.51 0.41 3.31 0.29
1 ESTABL 2947 6.74 6.72 44.46 0.0788 3.00 0.45 17.95 0.18
THOSE 2527 6.73 6.77 42.43 0.0642 3.12 0.46 3.52 0.33
4 SUFFIC 2484 6.72 6.81 42.92 0.0708 2.35 0.45 3.24 0.36
OVER 2622 6.72 6.71 40.99 0.0701 2.40 0.43 3.50 0.29
2 ALLEGE 3766 6.72 6.81 47.86 0.1091 3.04 0.40 3.37 0.33
EITHER 2033 6.71 6.78 40.20 0.0532 1.96 0.50 3.10 0.35
SET 2964 6.71 6.84 46.54 0.0798 3.36 0.45 3.72 0.35
8 MOTION 6621 6.71 6.84 53.90 0.1942 3.78 0.30 3.36 0.33
2 INCLUD 2632 6.71 6.76 43.41 0.0716 3.86 0.39 3.68 0.31
NOR 2099 6.70 6.86 43.14 0.0581 1.94 0.53 2.78 0.40
5 SUBJEC 2855 6.70 6.81 45.48 0.0784 2.72 0.46 3.64 0.33
1 THREE 2437 6.70 6.73 41.13 0.0677 3.19 0.40 3.87 0.30
THEREO 2640 6.69 6.75 41.60 0.0697 2.61 0.42 3.06 0.33
4 PRIOR 2379 6.69 6.74 40.88 0.0654 2.87 0.41 3.12 0.32
4 GROUND 2629 6.68 6.77 44.16 0.0728 3.25 0.38 5.73 0.29
1 NEW 4744 6.68 6.72 48.09 0.1295 3.77 0.31 4.33 0.26
1 EACH 3332 6.68 6.69 43.90 0.0859 4.53 0.36 5.12 0.25
ALTHOU 1762 6.67 6.77 38.65 0.0487 1.78 0.50 2.66 0.37
HAVING 2006 6.67 6.86 42.09 0.0548 2.18 0.51 2.07 0.43
ITAL 11360 6.67 6.57 45.18 0.2755 3.12 0.37 7.32 0.19
FOL 5682 6.67 6.57 45.18 0.1378 3.12 0.37 7.39 0.19
TAKEN 2518 6.67 6.76 43.07 0.0697 3.27 0.37 4.04 0.31
FILED 5362 6.67 6.91 55.26 0.1589 4.09 0.33 3.46 0.36
2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 0.48 3.60 0.43
UNTIL 2347 6.65 6.70 39.22 0.0628 2.31 0.42 3.46 0.30
4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86
ABOUT 3228 6.65 6.65 41.10 0.0882 2.68 0.39 3.45 0.27
1 ACT 5147 6.65 6.59 45.56 0.1370 3.30 0.32 6.21 0.20
3 SUSTAI 2600 6.65 6.89 46.24 0.0753 3.40 0.40 2.63 0.41
1 SEC 6808 6.65 6.62 49.60 0.1929 3.75 0.27 4.50 0.21
7 SPECIF 2900 6.65 6.68 42.28 0.0790 3.75 0.34 5.03 0.25
3 SUPPOR 3151 6.65 6.67 46.35 0.0855 7.06 0.24 9.79 0.18
EVEN 1964 6.64 6.75 38.80 0.0509 2.09 0.49 3.06 0.35
3 INDICA 1901 6.64 6.70 37.67 0.0499 2.45 0.42 3.59 0.31
3 SUBSTA 2527 6.62 6.71 41.60 0.0693 3.48 0.36 4.62 0.27
10 COUNTY 6245 6.62 6.52 52.43 0.1787 5.00 0.23 8.51 0.14
5 ISSUE 3113 6.61 6.66 42.88 0.0831 3.76 0.32 4.98 0.23
NOW 2384 6.60 6.80 43.29 0.0629 2.79 0.46 3.10 0.34
THUS 1622 6.58 6.65 34.80 0.0427 2.08 0.42 2.88 0.31
DURING 2216 6.58 6.62 36.50 0.0609 2.73 0.36 4.42 0.26
4 CONSTR 3805 6.58 6.55 40.50 0.1054 3.38 0.30 4.65 0.21
3 APPLIC 4168 6.58 6.60 47.37 0.1134 4.97 0.25 8.13 0.16
Table VI.
Table VI. Sorted by E

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
6 EXCEPT 3589 6.53 6.82 49.79 0.1046 5.95 0.26 4.72 0.30
MIGHT 1734 6.57 6.63 34.27 0.0465 2.40 0.39 2.78 0.30
ANOTHE 1881 6.57 6.65 36.35 0.0500 2.97 0.37 3.17 0.29
1 CONCER 1797 6.57 6.59 34.76 0.0468 4.40 0.34 3.67 0.26
INVOLV 2933 6.56 6.90 47.86 0.0789 2.29 0.56 2.99 0.40
6 ERROR 3841 6.56 6.66 44.80 0.1051 3.69 0.29 4.33 0.24
7 CONTRA 8033 6.56 6.49 52.96 0.2158 3.98 0.23 7.29 0.15
3 FINDIN 3437 6.56 6.59 41.56 0.0995 4.00 0.26 3.90 0.23
6 RULE 4090 6.56 6.70 47.18 0.1055 4.23 0.31 12.48 0.20
1 CONTAI 2096 6.55 6.65 38.12 0.0578 3.35 0.35 5.43 0.25
5 PARTIE 3496 6.55 6.59 41.71 0.0960 3.86 0.29 4.47 0.22
UNLESS 1520 6.54 6.63 33.82 0.0418 2.32 0.39 2.95 0.30
2 INSTAN 1867 6.54 6.60 34.88 0.0494 2.58 0.36 3.01 0.28
2 RELATI 2530 6.54 6.53 37.10 0.0662 3.61 0.30 5.77 0.20
1 ENTITL 2141 6.53 6.69 38.42 0.0591 2.60 0.38 3.68 0.30
1 OWN 1857 6.53 6.60 34.99 0.0502 2.91 0.35 3.93 0.27
3 APPELL 14543 6.53 6.44 50.16 0.3877 3.05 0.23 5.26 0.16
2 YEARS 2601 6.53 6.56 37.10 0.0687 3.24 0.31 4.19 0.23
SECOND 2415 6.53 6.61 38.50 0.0656 3.97 0.31 5.63 0.23
4 CLEAR 1537 6.52 6.57 33.48 0.0425 3.35 0.33 5.39 0.24
3 OPERAT 4207 6.52 6.45 39.56 0.1145 3.54 0.27 4.52 0.18
THROUG 1954 6.52 6.56 34.61 0.0531 3.87 0.30 4.00 0.24
2 DECISI 3988 6.52 6.69 46.58 0.1070 4.00 0.30 5.57 0.23
RECEIV 2801 6.52 6.57 39.10 0.0764 6.76 0.27 5.74 0.21
5 JUDGE 4000 6.52 6.64 46.84 0.1181 10.30 0.19 6.80 0.20
3 FIND 1954 6.51 6.66 37.75 0.0519 3.11 0.35 3.70 0.28
7 EXPRES 2022 6.51 6.61 36.01 0.0546 3.21 0.34 4.18 0.26
BROUGH 1534 6.50 6.59 33.74 0.0460 4.00 0.29 3.64 0.27
4 ILL 8605 6.49 6.46 32.88 0.2551 1.95 0.34 3.00 0.24
2 OHIO 8519 6.49 6.35 34.39 0.2212 2.35 0.28 5.51 0.17
4 AMOUNT 3110 6.49 6.52 37.56 0.0869 3.85 0.27 3.75 0.22
2 PARTIC 2381 6.48 6.76 42.12 0.0625 3.17 0.41 3.48 0.32
8 CHARGE 4622 6.48 6.47 40.69 0.1234 3.96 0.24 4.95 0.18
2 CONTRO 2941 6.48 6.55 39.93 0.0849 5.05 0.23 5.00 0.20
PAGE 3218 6.47 6.45 33.71 0.0815 2.83 0.31 5.57 0.19
7 CONDIT 2779 6.46 6.47 35.52 0.0760 3.52 0.26 3.88 0.21
DIFFER 1714 6.46 6.55 33.14 0.0466 3.96 0.29 3.56 0.25
3 COMMON 4042 6.46 6.48 42.58 0.1171 5.85 0.19 7.01 0.16
11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 0.24 7.85 0.16
1 USED 2650 6.45 6.58 38.16 0.0734 5.62 0.24 4.18 0.23
THOUGH 1301 6.43 6.54 30.46 0.0340 2.57 0.34 2.82 0.23
LATER 1426 6.43 6.47 29.48 0.0387 2.75 0.31 3.52 0.24
1 APPARE 1334 6.43 6.53 30.84 0.0364 3.26 0.30 3.32 0.26
2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25
7 TESTIM 3650 6.42 6.41 34.65 0.1010 3.30 0.25 3.88 0.20
5 ANSWER 3398 6.42 6.41 39.33 0.0913 5.64 0.22 9.44 0.13
DECIDE 1409 6.41 6.50 29.89 0.0381 2.48 0.31 3.99 0.25
CITED 1401 6.41 6.54 30.95 0.0390 2.52 0.33 3.08 0.27
10 JURY 5530 6.41 6.31 34.27 0.1470 3.35 0.24 4.31 0.17
6 CONSTI 4132 6.41 6.49 42.99 0.1058 3.48 0.28 7.53 0.15
5 DAY 2189 6.41 6.46 34.16 0.0607 3.92 0.26 9.83 0.17
5 BASIS 1500 6.41 6.47 30.76 0.0412 5.82 0.26 5.60 0.21
THEREA 1342 6.40 6.55 31.03 0.0389 2.78 0.32 2.92 0.28
ABOVE 1812 6.40 6.63 35.18 0.0483 2.94 0.35 3.03 0.29
3 PROPER 5913 6.40 6.34 36.91 0.1591 3.62 0.23 5.71 0.15
3 DUE 1937 6.40 6.47 32.08 0.0542 4.13 0.25 3.79 0.22
3 COMPLA 3971 6.40 6.45 37.44 0.1136 4.27 0.22 4.90 0.19
CALLED 1618 6.40 6.57 32.76 0.0444 4.43 0.31 3.42 0.27
REGARD 1466 6.39 6.52 30.80 0.0380 3.05 0.32 3.05 0.26
2 ADDITI 1708 6.39 6.49 32.12 0.0453 5.06 0.25 4.68 0.22
Table VI.
Table VI. Sorted by E

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
2 BASED 1605 6.38 6.56 32.84 0.0431 2.60 0.35 3.70 0.26
2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24
TAKE 1484 6.38 6.47 30.35 0.0407 3.85 0.27 3.52 0.23
STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 8.54 0.13
4 CONTIN 2382 6.37 6.40 34.35 0.0634 5.85 0.21 10.10 0.14
SHOW 1649 6.36 6.59 33.89 0.0470 3.26 0.32 3.21 0.28
8 INTERE 3637 6.36 6.32 35.33 0.0944 5.26 0.20 5.71 0.15
3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19
1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19
4 VIEW 1406 6.35 6.48 30.95 0.0375 4.33 0.29 7.01 0.20
1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21
6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16
5 PERMIT 2869 6.35 6.49 39.63 0.0820 6.17 0.17 6.36 0.17
2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21
9 PUBLIC 4658 6.33 6.30 35.78 0.1226 4.86 0.20 5.07 0.15
2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23
5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18
5 ADMITT 1667 6.32 6.32 28.87 0.0436 3.82 0.23 5.59 0.17
1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 0.20 5.32 0.16
CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24
2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19
1 DENIED 2053 6.30 6.77 40.39 0.0580 2.91 0.37 2.72 0.35
MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19
1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19
HER 7548 6.30 6.20 31.89 0.2095 4.05 0.20 4.75 0.14
4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20
1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20
6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17
1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21
FAILED 1442 6.29 6.48 30.31 0.0414 3.32 0.29 3.79 0.23
SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 0.23 4.77 0.15
3 USE 3852 6.29 6.27 36.12 0.1059 4.86 0.18 7.72 0.12
8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15
6 COURTS 2033 6.28 6.36 31.21 0.0553 9.19 0.16 5.77 0.17
MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23
5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 0.15 5.60 0.15
SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21
9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16
7 OFFICE 4060 6.26 6.12 33.93 0.1032 4.82 0.17 18.75 0.07
3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19
ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22
MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18
APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22
1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16
2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21
FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20
6 DUTY 1873 6.25 6.30 28.35 0.0506 3.82 0.21 5.09 0.17
3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70 0.17
1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14
NOTHIN 1275 6.24 6.55 30.65 0.0345 2.76 0.33 2.84 0.29
5 CITY 5969 6.24 6.23 38.05 0.1706 3.90 0.18 5.82 0.13
9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 0.15 7.77 0.12
3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21
2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14
HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 0.36 5.86 0.25
1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20
LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20
5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18
OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20
DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21
Table VI.
Table VI. Sorted by E

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21
1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21
2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16
9 COUNSE 3030 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14
2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21
MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22
4 CODE 4152 6.21 6.18 29.55 0.1146 4.17 0.17 5.98 0.13
WAY 1771 6.21 6.45 32.91 0.0472 6.65 0.22 10.08 0.16
CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21
5 PETITI 7623 6.19 6.44 40.39 0.2198 3.73 0.19 5.82 0.18
MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21
5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09
5 EXAMIN 3117 6.19 6.23 35.56 0.0831 7.01 0.15 8.63 0.11
1 THINK 1035 6.18 6.28 23.63 0.0298 3.00 0.23 3.20 0.20
POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18
OBTAIN 1498 6.18 6.30 27.40 0.0397 3.28 0.23 5.62 0.17
NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25
SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22
1 SUPREM 1904 6.16 6.24 27.44 0.0474 3.73 0.21 6.65 0.14
1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.10 0.19
5 FAILUR 1630 6.16 6.43 30.16 0.0459 3.81 0.24 4.43 0.21
PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20
HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22
VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19
RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18
TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19
SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22
3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18
ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18
OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23
2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17
3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20
THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23
1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16
1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18
SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20
FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18
5 REQUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15
5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18
DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18
4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18
LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17
1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12
BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19
2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16
THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22
1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17
4 OCCURR 1248 6.05 6.11 21.78 0.0347 3.73 0.18 4.81 0.15
9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19
1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17
TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17
LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17
13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12
8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10
7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13
NEVER 976 6.01 6.15 21.32 0.0254 4.03 0.19 4.18 0.16
LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17
APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15
WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17
RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19
5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17
13 JURISD 3056 6.00 6.10 21.67 0.0812 4.48 0.14 6.50 0.11
AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17
8 ASSIGN 2654 6.00 6.12 29.32 0.0715 6.48 0.12 7.19 0.11
THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17
MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17
3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16
5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08
HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14
CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17
4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13
3 DISMIS 2755 5.96 6.48 35.90 0.0790 5.16 0.16 5.01 0.20
2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16
OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14
HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16
SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16
8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08
HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15
LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16
RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16
6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17
MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15
COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15
1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11
7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15
WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16
READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15
PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13
NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14
PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15
SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15
BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14
OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18
1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17
AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14
7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12
2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15
MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14
BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17
1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12
SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18
2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10
2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10
FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14
MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15
FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14
ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13
DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13
DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12
ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12
REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13
ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12
RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12
MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11
2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12
WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14
1 CONCED 485 5.58 5.83 14.00 0.0140 3.43 0.14 3.42 0.13
NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12
1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4.15 0.11
LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10
ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11
NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12
SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12
2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12
4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11
ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12
EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10
HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11
STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11
3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12
2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09
HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.09
1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12
EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10
1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10
INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09
ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10
QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09
INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09
1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09
NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09
2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09
FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11
1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09
SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10
SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07
SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07
DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07
1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07
FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07
FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07
1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09
2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08
1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07
Table VI.
Table VII. Sorted by EL

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 4.32 2.55 31.32
THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93
AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57
FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59
THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54
NOT 35835 7.75 7.60 96.97 0.9798 0.55 6.95 1.90 1.56
THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41
WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38
WAS 56044 7.69 7.55 95.73 1.5630 0.52 3.68 1.78 1.33
WITH 21624 7.64 7.51 92.03 0.5840 1.15 3.46 2.15 1.16
FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19
HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97
BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95
COURT 33021 7.45 7.41 93.58 0.9097 1.64 1.26 3.97 0.76
THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91
UPON 11816 7.46 7.40 82.93 0.3232 1.37 1.76 1.83 0.95
ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86
BUT 9174 7.48 7.37 78.89 0.2485 0.84 2.21 2.06 0.89
ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83
HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83
3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80
SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74
OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76
WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70
UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69
1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75
1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 1.38 1.88 0.82
HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69
MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72
4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86
MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76
2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62
1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64
3 OPERAT 4207 6.52 6.45 39.56 0.1145 3.54 0.27 4.52 0.18
PAGE 3218 6.47 6.45 33.71 0.0815 2.83 0.31 5.57 0.19
3 COMPLA 3971 6.40 6.45 37.44 0.1136 4.27 0.22 4.90 0.19
3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19
2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23
CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24
4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20
1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19
2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21
WAY 1771 6.21 6.45 32.91 0.0472 6.65 0.22 10.08 0.16
3 APPELL 14543 6.53 6.44 50.16 0.3877 3.05 0.23 5.26 0.16
5 PETITI 7623 6.19 6.44 40.39 0.2198 3.73 0.19 5.82 0.18
11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 0.24 7.85 0.16
3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21
5 FAILUR 1630 6.16 6.43 30.16 0.0459 3.81 0.24 4.43 0.21
1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21
OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20
OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23
9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19
7 TESTIM 3650 6.42 6.41 34.65 0.1010 3.30 0.25 3.88 0.20
5 ANSWER 3398 6.42 6.41 39.33 0.0913 5.64 0.22 9.44 0.13
2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19
1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20
4 CONTIN 2382 6.37 6.40 34.35 0.0634 5.85 0.21 10.10 0.14
APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22
FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20
2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21
1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21
5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18
6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16
MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23
NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25
3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20
THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23
MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19
3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19
2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21
1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21
1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 0.20 5.32 0.16
5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18
6 COURTS 2033 6.28 6.36 31.21 0.0553 9.19 0.16 5.77 0.17
1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20
SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22
2 OHIO 8519 6.49 6.35 34.39 0.2212 2.35 0.28 5.51 0.17
1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19
SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22
HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22
THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22
3 PROPER 5913 6.40 6.34 36.91 0.1591 3.62 0.23 5.71 0.15
SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21
3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70 0.17
BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21
STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 8.54 0.13
6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17
9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16
ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22
MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21
ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18
SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20
Table VII.
Table VII. Sorted by EL

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
8 INTERE 3637 6.36 6.32 35.33 0.0944 5.26 0.20 5.71 0.15
5 ADMITT 1667 6.32 6.32 28.87 0.0436 3.82 0.23 5.59 0.17
2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14
LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20
MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22
10 JURY 5530 6.41 6.31 34.27 0.1470 3.35 0.24 4.31 0.17
8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15
5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 0.15 5.60 0.15
MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18
DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21
CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21
PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20
1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.10 0.19
9 PUBLIC 4658 6.33 6.30 35.78 0.1226 4.86 0.20 5.07 0.15
6 DUTY 1873 6.25 6.30 28.35 0.0506 3.82 0.21 5.09 0.17
1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14
OBTAIN 1498 6.18 6.30 27.40 0.0397 3.28 0.23 5.62 0.17
BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19
7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13
5 REQUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15
1 THINK 1035 6.18 6.28 23.63 0.0298 3.00 0.23 3.20 0.20
TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19
DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18
RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19
3 USE 3852 6.29 6.27 36.12 0.1059 4.86 0.18 7.72 0.12
9 COUNSE 3030 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14
SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 0.23 4.77 0.15
1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16
5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18
9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 0.15 7.77 0.12
1 SUPREM 1904 6.16 6.24 27.44 0.0474 3.73 0.21 6.65 0.14
FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18
4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18
5 CITY 5969 6.24 6.23 38.05 0.1706 3.90 0.18 5.82 0.13
2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16
5 EXAMIN 3117 6.19 6.23 35.56 0.0831 7.01 0.15 8.63 0.11
POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18
VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19
2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17
3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18
1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18
1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17
RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18
HER 7548 6.30 6.20 31.89 0.2095 4.05 0.20 4.75 0.14
1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17
4 CODE 4152 6.21 6.18 29.55 0.1146 4.17 0.17 5.98 0.13
13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12
1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16
LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17
2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16
CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17
TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17
5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17
SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18
NEVER 976 6.01 6.15 21.32 0.0254 4.03 0.19 4.18 0.16
LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17
WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17
7 OFFICE 4060 6.26 6.12 33.93 0.1032 4.82 0.17 18.75 0.07
8 ASSIGN 2654 6.00 6.12 29.82 0.0715 6.48 0.12 7.19 0.11
3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16
Table VII.
Table VII. Sorted by EL

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16
4 OCCURR 1248 6.05 6.11 21.78 0.0347 3.73 0.18 4.81 0.15
AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17
LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17
THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17
MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17
13 JURISD 3056 6.00 6.10 29.67 0.0812 4.48 0.14 6.50 0.11
HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16
6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17
2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16
OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18
1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12
APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15
LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16
BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17
HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14
7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15
1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17
SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16
5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09
8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10
PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15
WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16
OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14
READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15
MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15
NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14
HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15
4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13
8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08
COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15
BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14
MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14
SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15
2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15
MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15
PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13
FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14
1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11
AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14
FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14
7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12
WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14
5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08
DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12
1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12
ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13
DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13
2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12
REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13
2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12
1 CONCED 485 5.58 5.83 14.00 0.0140 3.43 0.14 3.42 0.13
NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12
ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12
RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12
3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12
2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10
1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12
ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12
MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11
1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4.15 0.11
SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12
2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10
4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11
NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12
ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12
HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11
FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11
ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11
STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11
EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10
LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10
HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.09
1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10
SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10
ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10
EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10
2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09
INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09
INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09
1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09
1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09
QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09
2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09
NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09
1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09
2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08
SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07
DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07
1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07
SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07
1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07
FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07
FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07
Table VII.

Table VIII. Sorted by EK

VOTES WORD NOCC E EL PZD AVG G EK GL EKL
THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93
AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57
THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54
NOT 35835 7.75 7.60 96.97 0.9798 0.55 6.95 1.90 1.56
FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59
WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38
AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 4.32 2.55 31.32
THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41
WAS 56044 7.69 7.55 95.73 1.5630 0.52 3.68 1.78 1.33
WITH 21624 7.64 7.51 92.03 0.5840 1.15 3.46 2.15 1.16
FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19
HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97
BUT 9174 7.48 7.37 78.89 0.2485 0.84 2.21 2.06 0.89
BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95
THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91
ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83
ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86
OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76
SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74
UPON 11816 7.46 7.40 82.93 0.3232 1.37 1.76 1.83 0.95
WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70
HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83
1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75
1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64
3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80
1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 1.38 1.88 0.82
HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69
MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72
WOULD 9678 7.34 7.23 73.12 0.2580 1.43 1.34 2.49 0.64
ALSO 5230 7.29 7.23 67.15 0.1410 1.08 1.33 1.95 0.71
UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69
10 COURT 33021 7.45 7.41 93.58 0.9097 1.64 1.26 3.97 0.76
MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76
WHEN 6875 7.28 7.24 69.87 0.1866 1.54 1.20 2.24 0.69
1 FOLLOW 6076 7.28 7.24 69.38 0.1661 1.30 1.18 2.44 0.69
ITS 11061 7.31 7.20 75.34 0.2888 1.71 1.13 3.49 0.54
2 REASON 6845 7.17 7.25 72.48 0.1850 2.15 1.11 2.86 0.64
MUST 5208 7.18 7.22 66.70 0.1412 1.83 1.08 2.79 0.64
AFTER 6340 7.24 7.21 68.47 0.1745 1.62 1.06 2.27 0.65
WHETHE 5173 7.22 7.19 66.13 0.1408 1.69 1.04 2.57 0.61
2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62
HIS 19529 7.32 7.22 78.63 0.5396 1.55 1.03 2.83 0.60
DID 6224 7.24 7.17 66.70 0.1665 1.55 1.03 2.52 0.59
WHERE 5794 7.19 7.16 65.26 0.1562 1.64 1.03 2.43 0.58
SHOULD 5689 7.20 7.20 66.59 0.1511 1.89 1.02 2.45 0.63
DOES 4264 7.09 7.20 63.30 0.1175 1.80 0.96 2.11 0.67
BEFORE 5814 7.19 7.23 68.55 0.1612 2.12 0.95 2.63 0.66
COULD 5096 7.16 7.11 61.79 0.1383 1.59 0.95 2.58 0.54
8 CONSID 5288 7.15 7.14 63.72 0.1379 2.06 0.93 2.68 0.56
3 TIME 8254 7.17 7.20 70.40 0.2237 2.55 0.92 2.17 0.62
WITHOU 4652 7.10 7.17 63.57 0.1274 2.02 0.91 2.39 0.62
FURTHE 4546 7.11 7.13 61.94 0.1230 1.92 0.91 3.44 0.53
THEREF 3871 7.01 7.18 62.21 0.1050 1.43 0.90 2.25 0.65
HOWEVE 3333 7.09 7.11 55.90 0.0923 1.47 0.90 1.76 0.62
2 LAW 9658 7.23 7.20 74.29 0.2554 2.34 0.88 3.39 0.54
3 PRESEN 5653 7.18 7.20 68.25 0.1558 2.26 0.88 3.49 0.58
1 TWO 5130 7.11 7.11 60.51 0.1408 1.59 0.85 2.47 0.55
THESE 4753 7.11 7.07 59.79 0.1275 1.97 0.83 3.27 0.48
THEN 4583 7.12 7.07 59.19 0.1242 2.04 0.82 2.60 0.51
THAN 4378 7.11 7.10 59.38 0.1198 2.23 0.81 2.63 0.54
4 FACT 4658 7.06 7.10 60.28 0.1249 2.10 0.80 2.40 0.54
5 DEFEND 25773 7.20 7.12 71.19 0.7468 1.34 0.79 2.43 0.53
WHO 5241 7.11 7.03 59.64 0.1416 1.89 0.79 3.51 0.44
4 AFFIRM 3897 6.89 7.23 63.53 0.1109 2.26 0.78 2.61 0.70
1 PART 4746 7.12 7.09 60.62 0.1287 2.57 0.78 2.85 0.52
THEY 7042 7.14 7.08 64.47 0.1897 2.45 0.77 3.52 0.45
SAME 4992 7.05 7.07 60.73 0.1299 2.47 0.76 3.32 0.48
BECAUS 3553 7.00 7.11 57.19 0.0999 2.04 0.75 2.28 0.58
2 BEING 3858 7.04 7.08 57.41 0.1040 2.13 0.75 2.89 0.52
HELD 3978 7.04 7.02 55.34 0.1058 1.92 0.75 2.83 0.47
2 REQUIR 6103 7.06 7.10 63.98 0.1665 2.34 0.74 4.53 0.47
4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86
CONTEN 3888 7.02 7.09 57.11 0.1094 2.14 0.71 2.24 0.56
3 FIRST 4165 7.01 7.04 57.15 0.1116 2.30 0.71 3.27 0.46
9 EVIDEN 12726 7.10 7.02 65.64 0.3461 1.64 0.71 3.09 0.43
3 OPINIO 4764 7.02 6.98 58.85 0.1218 2.05 0.71 4.63 0.37
THEIR 6514 7.08 7.02 61.75 0.1756 2.19 0.70 3.29 0.42
2 STATED 3698 6.99 6.99 54.77 0.0975 2.37 0.68 3.69 0.42
1 CAN 2822 6.93 6.94 49.15 0.0739 1.61 0.67 2.68 0.44
SOME 3394 6.97 6.93 50.88 0.0897 1.97 0.67 4.84 0.39
HERE 3448 6.93 6.97 52.69 0.0938 1.92 0.66 3.12 0.43
MORE 3050 6.94 6.95 49.49 0.0822 1.98 0.66 2.76 0.45
OUT 4389 7.00 6.99 57.04 0.1164 3.00 0.65 6.13 0.37
2 CERTAI 3069 6.87 6.96 50.62 0.0830 2.20 0.65 3.90 0.42
1 PROVID 5792 7.03 7.02 60.02 0.1599 2.56 0.64 3.62 0.42
7 CONCLU 3665 6.95 7.02 53.90 0.1010 2.50 0.64 2.52 0.49
4 DETERM 5030 7.02 7.01 59.45 0.1314 3.04 0.64 3.95 0.40
2 PLAINT 20986 7.02 6.94 57.71 0.6097 1.25 0.64 2.24 0.43
AGAINS 5725 7.04 7.06 61.83 0.1605 2.56 0.63 3.13 0.46
9 ACCORD 2721 6.87 6.96 49.64 0.0745 2.12 0.62 2.92 0.45
SINCE 2756 6.89 6.93 48.65 0.0753 1.76 0.62 2.78 0.43
1 FACTS 4095 7.00 7.01 55.79 0.1137 3.05 0.60 2.90 0.46
1 BOTH 2868 6.85 6.88 46.54 0.0771 1.87 0.59 2.81 0.39
4 PERSON 6980 7.01 6.94 60.81 0.1897 2.61 0.57 5.09 0.33
INTO 3583 6.93 6.92 51.00 0.0952 2.51 0.57 3.14 0.39
CANNOT 2467 6.74 6.92 46.54 0.0694 2.06 0.57 2.46 0.45
5 APPEAR 3855 6.95 7.00 57.68 0.1045 3.97 0.56 9.43 0.32
5 EFFECT 3759 6.91 6.92 52.39 0.1018 2.86 0.56 7.29 0.34
INVOLV 2933 6.56 6.90 47.86 0.0789 2.29 0.56 2.99 0.40
THEM 3505 6.92 6.89 49.37 0.0943 2.56 0.56 4.37 0.36
BETWEE 3231 6.84 6.87 47.45 0.0879 2.33 0.55 2.83 0.38
OUR 3179 6.80 6.83 47.98 0.0833 2.15 0.55 4.84 0.31
9 JUDGME 10581 7.06 7.17 73.19 0.3119 3.01 0.54 4.08 0.49
CASES 3896 6.86 6.90 51.41 0.1062 2.58 0.54 3.22 0.38
MAKE 2535 6.76 6.84 43.94 0.0681 2.35 0.54 3.17 0.37
RESPEC 2579 6.80 6.82 44.43 0.0678 1.99 0.54 3.71 0.34
4 FOUND 3608 6.91 6.98 53.68 0.1017 2.73 0.53 3.16 0.43
MATTER 4313 6.91 6.96 55.19 0.1166 3.11 0.53 4.12 0.38
NOR 2099 6.70 6.86 43.14 0.0581 1.94 0.53 2.78 0.40
9 NECESS 3477 6.93 6.93 52.20 0.0937 3.31 0.52 4.91 0.35
HIM 5613 6.91 6.85 54.24 0.1531 2.49 0.52 6.64 0.29
HAVING 2006 6.67 6.86 42.09 0.0548 2.18 0.51 2.07 0.43
WELL 2259 6.77 6.83 43.14 0.0592 2.87 0.51 3.49 0.36
WHAT 2883 6.76 6.79 44.80 0.0725 2.52 0.51 3.76 0.32
WITHIN 4561 6.85 6.97 55.56 0.1294 2.63 0.50 3.59 0.41
SAID 10747 7.07 6.93 69.15 0.2803 4.45 0.50 6.83 0.27
GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 0.50 3.10 0.35
EITHER 2033 6.71 6.78 40.20 0.0532 1.96 0.50 3.10 0.35
ALTHOU 1762 6.67 6.77 38.65 0.0487 1.78 0.50 2.66 0.37
1 RESULT 3328 6.85 6.86 48.50 0.0911 3.50 0.49 3.97 0.34
Table VIII.
VOTES WORD NOCC E EL PZD AVG G EK GL EKL
4 CIRCUM 2543 6.75 6.7■>4 1.9 4 0.0679 2.08 0.49 2.94 0.33
EVEN 1964 6.64 6.75 38.83 0.0509 2.09 0.49 3.06 0.35
2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 0.48 3.60 0.43
5 STATUT 7283 6.89 6.80 53.15 0.1985 2.26 0.48 4.39 0.29
SEE 4704 6.93 6.88 55.00 0.1297 2.95 0.47 3.89 0.33
6 RIGHT 5447 6.76 6.86 54.24 0.1464 2.91 0.47 3.87 0.32
4 GENERA 5262 6.87 6.82 52.92 0.1338 3.11 0.47 5.01 0.28
5 SUBJEC 2855 6.70 6.81 45.48 0.0784 2.72 0.46 3.64 0.33
NOW 2384 6.60 6.80 43.29 0.0629 2.79 0.46 3.10 0.34
THOSE 2527 6.73 6.77 42.43 0.0642 3.12 0.46 3.52 0.33
7 TRIAL 9898 6.97 6.98 62.85 0.2884 2.75 0.45 2.96 0.41
SET 2964 6.71 6.84 46.54 0.0798 3.36 0.45 3.72 0.35
4 SUFFIC 2484 6.72 6.81 42.92 0.0708 2.35 0.45 3.24 0.36
2 PROVIS 4479 6.80 6.77 47.18 0.1251 2.55 0.45 3.69 0.30
1 ESTABL 2947 6.74 6.72 44.46 0.0788 3.00 0.45 17.95 0.18
5 DIRECT 5706 6.95 6.92 58.62 0.1575 5.12 0.44 6.63 0.29
END 6422 6.81 6.71 51.86 0.1570 3.07 0.44 6.84 0.22
5 CAUSE 4463 6.77 6.90 54.28 0.1255 2.98 0.43 4.08 0.34
WHILE 2749 6.82 6.85 46.31 0.0751 5.29 0.43 4.31 0.35
SHALL 6240 6.81 6.73 49.18 0.1705 2.77 0.43 4.34 0.27
OVER 2622 6.72 6.71 40.99 0.0701 2.40 0.43 3.50 0.29
ENTERE 2920 6.78 6.87 48.58 0.0873 3.29 0.42 4.02 0.34
THEREO 2640 6.69 6.75 41.60 0.0697 2.61 0.42 3.06 0.33
UNTIL 2347 6.65 6.70 39.22 0.0628 2.31 0.42 3.46 0.30
3 INDICA 1901 6.64 6.70 37.67 0.0499 2.45 0.42 3.59 0.31
THUS 1622 6.58 6.65 34.80 0.0427 2.08 0.42 2.88 0.31
6 RECORD 6093 6.91 6.98 60.51 0.1675 5.25 0.41 4.95 0.35
2 PURPOS 4138 6.76 6.76 49.30 0.1096 3.99 0.41 6.33 0.25
2 PARTIC 2381 6.48 6.76 42.12 0.0625 3.17 0.41 3.48 0.32
4 PRIOR 2379 6.69 6.74 40.88 0.0654 2.87 0.41 3.12 0.32
1 APP 4769 6.74 6.72 44.92 0.1292 2.51 0.41 3.31 0.29
3 SUSTAI 2600 6.65 6.89 46.24 0.0753 3.40 0.40 2.63 0.41
1 PROCEE 5021 6.79 6.84 55.19 0.1373 3.56 0.40 6.15 0.26
2 ALLEGE 3766 6.72 6.81 47.86 0.1091 3.04 0.40 3.37 0.33
1 THREE 2437 6.70 6.73 41.18 0.0677 3.19 0.40 3.87 0.30
6 ACTION 8248 6.94 6.92 64.55 0.2329 3.64 0.39 4.77 0.31
2 STATE 9231 6.85 6.80 62.06 0.2417 3.06 0.39 4.64 0.25
2 INCLUD 2632 6.71 6.76 43.41 0.0716 3.86 0.39 3.68 0.31
ABOUT 3228 6.65 6.65 41.10 0.0882 2.68 0.39 3.45 0.27
MIGHT 1734 6.57 6.63 34.27 0.0465 2.40 0.39 2.78 0.30
UNLESS 1520 6.54 6.63 33.82 0.0418 2.32 0.39 2.95 0.30
4 GROUND 2629 6.68 6.77 44.16 0.0728 3.25 0.38 5.73 0.29
2 SECTIO 10226 6.83 6.76 55.75 0.2858 2.91 0.38 4.29 0.27
1 ENTITL 2141 6.53 6.69 38.42 0.0591 2.60 0.38 3.68 0.30
6 AUTHOR 4898 6.78 6.81 52.32 0.1319 4.35 0.37 4.61 0.28
1 DENIED 2053 6.30 6.77 40.39 0.0580 2.91 0.37 2.72 0.35
TAKEN 2518 6.67 6.76 43.07 0.0697 3.27 0.37 4.04 0.31
ANOTHE 1881 6.57 6.65 36.35 0.0500 2.97 0.37 3.17 0.29
ITAL 11360 6.67 6.57 45.18 0.2755 3.12 0.37 7.32 0.19
FOL 5682 6.67 6.57 45.18 0.1378 3.12 0.37 7.39 0.19
3 SUBSTA 2527 6.62 6.71 41.60 0.0693 3.48 0.36 4.62 0.27
HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 0.36 5.86 0.25
1 EACH 3332 6.68 6.69 43.90 0.0859 4.53 0.36 5.12 0.25
DURING 2216 6.58 6.62 36.50 0.0609 2.73 0.36 4.42 0.26
2 INSTAN 1867 6.54 6.60 34.88 0.0494 2.58 0.36 3.01 0.28
3 FIND 1954 6.51 6.66 37.75 0.0519 3.11 0.35 3.70 0.28
1 CONTAI 2096 6.55 6.65 38.12 0.0578 3.35 0.35 5.43 0.25
ABOVE 1812 6.40 6.63 35.18 0.0483 2.94 0.35 3.03 0.29
1 OWN 1857 6.53 6.60 34.99 0.0502 2.91 0.35 3.93 0.27
2 BASED 1605 6.38 6.56 32.84 0.0431 2.60 0.35 3.70 0.26
7 SPECIF 2900 6.65 6.68 42.28 0.0790 3.75 0.34 5.03 0.25
7 EXPRES 2022 6.51 6.61 36.01 0.0546 3.21 0.34 4.18 0.26
1 CONCER 1797 6.57 6.59 34.76 0.0468 4.40 0.34 3.67 0.26
THOUGH 1301 6.43 6.54 30.46 0.0340 2.57 0.34 2.82 0.28
4 ILL 8605 6.49 6.46 32.08 0.2551 1.95 0.34 3.00 0.24
FILED 5362 6.67 6.91 55.26 0.1589 4.09 0.33 3.46 0.36
4 CLEAR 1537 6.52 6.57 33.48 0.0425 3.35 0.33 5.39 0.24
NOTHIN 1275 6.24 6.55 30.65 0.0345 2.76 0.33 2.84 0.29
CITED 1401 6.41 6.54 30.95 0.0390 2.52 0.33 3.08 0.27
2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25
5 ISSUE 3113 6.61 6.66 42.88 0.0831 3.76 0.32 4.98 0.23
1 ACT 5147 6.65 6.59 45.56 0.1370 3.30 0.32 6.21 0.20
SHOW 1649 6.36 6.59 33.89 0.0470 3.26 0.32 3.21 0.28
THEREA 1342 6.40 6.55 31.03 0.0389 2.78 0.32 2.92 0.28
REGARD 1466 6.39 6.52 30.80 0.0380 3.05 0.32 3.05 0.26
3 ORDER 6773 6.78 6.77 58.32 0.1918 3.68 0.31 11.48 0.19
1 NEW 4744 6.68 6.72 48.09 0.1295 3.77 0.31 4.33 0.26
6 RULE 4090 6.56 6.70 47.18 0.1055 4.23 0.31 12.48 0.20
SECOND 2415 6.53 6.61 38.50 0.0656 3.97 0.31 5.63 0.23
CALLED 1618 6.40 6.57 32.76 0.0444 4.43 0.31 3.42 0.27
2 YEARS 2601 6.53 6.56 37.10 0.0687 3.24 0.31 4.19 0.23
DECIDE 1409 6.41 6.50 29.89 0.0381 2.48 0.31 3.99 0.25
LATER 1426 6.43 6.47 29.48 0.0387 2.75 0.31 3.52 0.24
PAGE 3218 6.47 6.45 33.71 0.0815 2.83 0.31 5.57 0.19
9 APPEAL 9096 6.80 7.06 77.61 0.2637 4.94 0.30 5.35 0.33
8 MOTION 6621 6.71 6.84 53.90 0.1942 3.78 0.30 3.36 0.33
2 DECISI 3988 6.52 6.69 46.58 0.1070 4.00 0.30 5.57 0.23
THROUG 1954 6.52 6.56 34.61 0.0531 3.87 0.30 4.00 0.24
4 CONSTR 3805 6.58 6.55 40.50 0.1054 3.38 0.30 4.65 0.21
2 RELATI 2530 6.54 6.53 37.10 0.0662 3.61 0.30 5.77 0.20
1 APPARE 1334 6.43 6.53 30.84 0.0364 3.26 0.30 3.32 0.26
2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24
CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24
6 ERROR 3841 6.56 6.66 44.80 0.1051 3.69 0.29 4.33 0.24
5 PARTIE 3496 6.55 6.59 41.71 0.0960 3.86 0.29 4.47 0.22
BROUGH 1534 6.50 6.59 33.74 0.0460 4.00 0.29 3.64 0.27
DIFFER 1714 6.46 6.55 33.14 0.0466 3.96 0.29 3.56 0.25
4 VIEW 1406 6.35 6.48 30.95 0.0375 4.33 0.29 7.01 0.20
FAILED 1442 6.29 6.48 30.31 0.0414 3.32 0.29 3.79 0.23
2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23
MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23
6 CONSTI 4132 6.41 6.49 42.99 0.1058 3.48 0.28 7.53 0.15
2 OHIO 8519 6.49 6.35 34.39 0.2212 2.35 0.28 5.51 0.17
MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18
1 SEC 6808 6.65 6.62 49.60 0.1929 3.75 0.27 4.50 0.21
RECEIV 2801 6.52 6.57 39.10 0.0764 6.76 0.27 5.74 0.21
4 AMOUNT 3110 6.49 6.52 37.56 0.0869 3.85 0.27 3.75 0.22
TAKE 1484 6.38 6.47 30.35 0.0407 3.85 0.27 3.52 0.23
3 OPERAT 4207 6.52 6.45 39.56 0.1145 3.54 0.27 4.52 0.18
APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22
NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25
THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23
MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19
ITSELF 993 6.25 6.33 24.38
0.0260 2.40 0.27 3.32 0.22
6 EXCEPT 3589 6.58 6.82 49.79 0.1046 5.95 0.26 4.72 0.30
7 WILL 7140 6.84 6.74 62.55 0.1944 5.49 0.26 12.86 0.15
3 FINDIN 3437 6.56 6.59 41.56 0.0995 4.00 0.26 3.90 0.23
7 CONDIT 2779 6.46 6.47 35.52 0.0760 3.52 0.26 3.88 0.21
5 BASIS 1500 6.41 6.47 30.76 0.0412 5.82 0.26 5.60 0.21
5 DAY 2189 6.41 6.46 34.16 0.0607 3.92 0.26 9.83 0.17
5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18
1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20
HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22
SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21
MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22
CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21
3 APPLIC 4168 6.58 6.60 47.37 0.1134 4.97 0.25 8.13 0.16
2 ADDITI 1708 6.39 6.49 32.12 0.0453 5.06 0.25 4.68 0.22
3 DUE 1937 6.40 6.47 32.08 0.0542 4.13 0.25 3.79 0.22
1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21
OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23
9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19
7 TESTIM 3650 6.42 6.41 34.65 0.1010 3.30 0.25 3.88 0.20
1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20
FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20
2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21
1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21
THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22
DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21
3 SUPPOR 3151 6.65 6.67 46.35 0.0855 7.06 0.24 9.79 0.18
1 USED 2650 6.45 6.58 38.16 0.0734 5.62 0.24 4.18 0.23
8 CHARGE 4622 6.48 6.47 40.69 0.1234 3.96 0.24 4.95 0.18
4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20
11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 0.24 7.85 0.16
3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21
5 FAILUR 1630 6.16 6.43 30.16 0.0459 3.81 0.24 4.43 0.21
2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21
SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22
1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19
BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21
10 JURY 5530 6.41 6.31 34.27 0.1470 3.35 0.24 4.31 0.17
TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19
VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19
3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18
RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18
2 CONTRO 2941 6.48 6.55 39.93 0.0849 5.05 0.23 5.00 0.20
10 COUNTY 6245 6.62 6.52 52.43 0.1787 5.00 0.23 8.51 0.14
7 CONTRA 8033 6.56 6.49 52.96 0.2158 3.98 0.23 7.29 0.15
1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19
3 APPELL 14543 6.53 6.44 50.16 0.3877 3.05 0.23 5.26 0.16
2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19
6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16
1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21
SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22
3 PROPER 5913 6.40 6.34 36.91 0.1591 3.62 0.23 5.71 0.15
ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18
5 ADMITT 1667 6.32 6.32 28.87 0.0436 3.82 0.23 5.59 0.17
LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20
OBTAIN 1498 6.18 6.30 27.40 0.0397 3.28 0.23 5.62 0.17
BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19
1 THINK 1035 6.18 6.28 23.53 0.0298 3.00 0.23 3.20 0.20
SUPRA 2573 6.29 6.25 29.71 0.0636 3.34 0.23 4.77 0.15
1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16
5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18
POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18
3 COMPLA 3971 6.40 6.45 37.44 0.1136 4.27 0.22 4.90 0.19
WAY 1771 6.21
6.45 32.91 0.0472 6.65 0.22 10.08 0.16
5 ANSWER 3398 6.42 6.41 39.33 0.0913 5.64 0.22 9.44 0.13
STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 8.54 0.13
MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21
PREVIO 1043 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20
1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.10 0.19
4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18
1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18
3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19
2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21
4 CONTIN 2382 6.37 6.40 34.35 0.0634 5.85 0.21 10.10 0.14
5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18
3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20
SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20
8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15
6 DUTY 1873 6.25 6.30 28.35 0.0506 3.82 0.21 5.09 0.17
DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18
RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19
1 SUPREM 1904 6.16 6.24 27.44 0.0474 3.73 0.21 6.65 0.14
2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16
1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17
1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16
LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17
TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17
3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19
1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 0.20 5.32 0.16
3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70 0.17
6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17
9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16
8 INTERE 3637 6.36 6.32 35.33 0.0944 5.26 0.20 5.71 0.15
9 PUBLIC 4658 6.33 6.30 35.78 0.1226 4.86 0.20 5.07 0.15
FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18
HER 7548 6.30 6.20 31.89 0.2095 4.05 0.20 4.75 0.14
3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16
RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16
LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17
5 JUDGE 4000 6.52 6.64 46.84 0.1181 10.30 0.19 6.80 0.20
3 COMMON 4042 6.46 6.48 42.58 0.1171 5.85 0.19 7.01 0.16
5 PETITI 7623 6.19 6.44 40.39 0.2198 3.73 0.19 5.82 0.18
OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20
1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14
2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17
2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16
5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17
SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18
NEVER 976 6.01 6.15 21.32 0.0254 4.03 0.19 4.18 0.16
LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17
WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17
THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17
MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17
6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17
2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16
APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15
HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15
4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13
3 USE 3852 6.29 6.27 36.12 0.1059 4.86 0.18 7.72 0.12
5 CITY 5969 6.24 6.23 38.05 0.1706 3.90 0.18 5.82 0.13
4 OCCURR 1248 6.05 6.11 21.78
0.0347 3.73 0.18 4.81 0.15
AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17
OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18
1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12
BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17
HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14
7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15
1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17
SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16
WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16
OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14
COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15
5 PERMIT 2869 6.35 6.49 39.63 0.0820 6.17 0.17 6.36 0.17
4 CODE 4152 6.21 6.18 29.55 0.1146 4.17 0.17 5.98 0.13
CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17
7 OFFICE 4060 6.26 6.12 33.93 0.1032 4.82 0.17 18.75 0.07
HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16
LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16
5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09
NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14
BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14
MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14
MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15
PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13
AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14
3 DISMIS 2755 5.96 6.48 35.90 0.0790 5.16 0.16 5.01 0.20
6 COURTS 2033 6.28 6.36 31.21 0.0553 9.19 0.16 5.77 0.17
PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15
READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15
MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15
SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15
2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15
FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14
7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12
2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14
5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 0.15 5.60 0.15
7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13
5 REQUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15
9 COUNSE 3030 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14
9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 0.15 7.77 0.12
5 EXAMIN 3117 6.19 6.23 35.56 0.0831 7.01 0.15 8.63 0.11
1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11
DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12
1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12
1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17
13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12
13 JURISD 3056 6.00 6.10 29.67 0.0812 4.48 0.14 6.50 0.11
FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14
ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13
DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13
REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13
2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12
1 CONCED 485 5.58 5.83 14.00 0.0140 3.43 0.14 3.42 0.13
NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12
ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12
2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10
8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10
WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14
3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12
ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12
MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11
NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12
8 ASSIGN 2654 6.00 6.12 29.82 0.0715 6.48 0.12 7.19 0.11
8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08
RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12
1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12
SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12
2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10
4 DISSEN 751 5.46 5.71 13.43 0.0191 3.84 0.12 3.90 0.11
ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12
LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10
HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.09
5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08
1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4.15 0.11
HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11
ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11
STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11
EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10
1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10
ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10
EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10
2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12
FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11
SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10
INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09
INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09
2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09
1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09
1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09
QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09
2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09
NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09
SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07
DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07
1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07
SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07
FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07
FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07
1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09
2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08
1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07
Table IX. Sorted by EKL
VOTES WORD NOCC E EL PZD AVG G EK GL EKL
AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 4.32 2.55 31.32
THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93
FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59
AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57
NOT 35835 7.75 7.60 96.97 0.9798 0.55 6.95 1.90 1.56
THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54
THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41
WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38
WAS 56044 7.69 7.55 95.73 1.5630 0.52 3.68 1.78 1.33
FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19
WITH 21624 7.64 7.51 92.03 0.5840 1.15 3.46 2.15 1.16
HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97
BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95
UPON 11816 7.46 7.40 82.93 0.3232 1.37 1.76 1.83 0.95
THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91
BUT 9174 7.48 7.37 78.89 0.2485 0.84 2.21 2.06 0.89
ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86
4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86
ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83
HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83
1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 1.38 1.88 0.82
3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80
OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76
10 COURT 33021 7.45 7.41 93.58 0.9097 1.64 1.26 3.97 0.76
MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76
1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75
SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74
MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72
ALSO 5230 7.29 7.23 67.15 0.1410 1.08 1.33 1.95 0.71
WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70
4 AFFIRM 3897 6.89 7.23 63.53 0.1109 2.26 0.78 2.61 0.70
HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69
UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69
WHEN 6875 7.28 7.24 69.87 0.1866 1.54 1.20 2.24 0.69
1 FOLLOW 6076 7.28 7.24 69.38 0.1661 1.30 1.18 2.44 0.69
DOES 4264 7.09 7.20 63.30 0.1175 1.80 0.96 2.11 0.67
BEFORE 5814 7.19 7.23 68.55 0.1612 2.12 0.95 2.63 0.66
AFTER 6340 7.24 7.21 68.47 0.1745 1.62 1.06 2.27 0.65
THEREF 3871 7.01 7.18 62.21 0.1050 1.43 0.90 2.25 0.65
1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64
WOULD 9678 7.34 7.23 73.12 0.2580 1.43 1.34 2.49 0.64
2 REASON 6845 7.17 7.25 72.48 0.1850 2.15 1.11 2.86 0.64
MUST 5208 7.18 7.22 66.70 0.1412 1.83 1.08 2.79 0.64
SHOULD 5689 7.20 7.20 66.59 0.1511 1.89 1.02 2.45 0.63
2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62
3 TIME 8254 7.17 7.20 70.40 0.2237 2.55 0.92 2.17 0.62
WITHOU 4652 7.10 7.17 63.57 0.1274 2.02 0.91 2.39 0.62
HOWEVE 3333 7.09 7.11 55.90 0.0923 1.47 0.90 1.76 0.62
WHETHE 5173 7.22 7.19 66.13 0.1408 1.69 1.04 2.57 0.61
HIS 19529 7.32 7.22 78.63 0.5396 1.55 1.03 2.83 0.60
DID 6224 7.24 7.17 66.70 0.1665 1.55 1.03 2.52 0.59
WHERE 5794 7.19 7.16 65.26 0.1562 1.64 1.03 2.43 0.58
3 PRESEN 5653 7.18 7.20 68.25 0.1558 2.26 0.88 3.49 0.58
BECAUS 3553 7.00 7.11 57.19 0.0999 2.04 0.75 2.28 0.58
8 CONSID 5288 7.15 7.14 63.72 0.1379 2.06 0.93 2.68 0.56
CONTEN 3888 7.02 7.09 57.11 0.1094 2.14 0.71 2.24 0.56
1 TWO 5130 7.11 7.11 60.51 0.1408 1.59 0.85 2.47 0.55
ITS 11061 7.31 7.20 75.34 0.2888 1.71 1.13 3.49 0.54
COULD 5096 7.16 7.11 61.79 0.1383 1.59 0.95 2.58 0.54
2 LAW 9658 7.23 7.20 74.29 0.2554 2.34 0.88 3.39 0.54
THAN 4378 7.11 7.10 59.38 0.1198 2.23 0.81 2.63 0.54
4 FACT 4658 7.06 7.10 60.28 0.1249 2.10 0.80 2.40 0.54
FURTHE 4546 7.11 7.13 61.94 0.1230 1.92 0.91 3.44 0.53
5 DEFEND 25773 7.20 7.12 71.19 0.7468 1.34 0.79 2.43 0.53
1 PART 4746 7.12 7.09 60.62 0.1287 2.57 0.78 2.85 0.52
2 BEING 3858 7.04 7.08 57.41 0.1040 2.13 0.75 2.89 0.52
THEN 4583 7.12 7.07 59.19 0.1242 2.04 0.82 2.60 0.51
7 CONCLU 3665 6.95 7.02 53.90 0.1010 2.50 0.64 2.52 0.49
9 JUDGME 10581 7.06 7.17 73.19 0.3119 3.01 0.54 4.08 0.49
THESE 4753 7.11 7.07 59.79 0.1275 1.97 0.83 3.27 0.48
SAME 4992 7.05 7.07 60.73 0.1299 2.47 0.76 3.32 0.48
HELD 3978 7.04 7.02 55.34 0.1058 1.92 0.75 2.83 0.47
2 REQUIR 6103 7.06 7.10 63.98 0.1665 2.34 0.74 4.53 0.47
3 FIRST 4165 7.01 7.04 57.15 0.1116 2.30 0.71 3.27 0.46
AGAINS 5725 7.04 7.06 61.83 0.1605 2.56 0.63 3.13 0.46
1 FACTS 4095 7.00 7.01 55.79 0.1137 3.05 0.60 2.90 0.46
THEY 7042 7.14 7.08 64.47 0.1897 2.45 0.77 3.52 0.45
MORE 3050 6.94 6.95 49.49 0.0822 1.98 0.66 2.76 0.45
9 ACCORD 2721 6.87 6.96 49.64 0.0745 2.12 0.62 2.92 0.45
CANNOT 2467 6.74 6.92 46.54 0.0694 2.06 0.57 2.46 0.45
WHO 5241 7.11 7.03 59.64 0.1416 1.89 0.79 3.51 0.44
1 CAN 2822 6.93 6.94 49.15 0.0739 1.61 0.67 2.68 0.44
9 EVIDEN 12726 7.10 7.02 65.64 0.3461 1.64 0.71 3.09 0.43
HERE 3448 6.93 6.97 52.69 0.0938 1.92 0.66 3.12 0.43
2 PLAINT 20986 7.02 6.94 57.71 0.6097 1.25 0.64 2.24 0.43
SINCE 2756 6.89 6.93 48.65 0.0753 1.76 0.62 2.78 0.43
4 FOUND 3608 6.91 6.98 53.68 0.1017 2.73 0.53 3.16 0.43
HAVING 2006 6.67 6.86 42.09 0.0548 2.18 0.51 2.07 0.43
2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 0.48 3.60 0.43
THEIR 6514 7.08 7.02 61.75 0.1756 2.19 0.70 3.29 0.42
2 STATED 3698 6.99 6.99 54.77 0.0975 2.37 0.68 3.69 0.42
2 CERTAI 3069 6.87 6.96 50.62 0.0830 2.20 0.65 3.90 0.42
1 PROVID 5792 7.03 7.02 60.02 0.1599 2.56 0.64 3.62 0.42
WITHIN 4561 6.85 6.97 55.56 0.1294 2.63 0.50 3.59 0.41
7 TRIAL 9898 6.97 6.98 62.85 0.2884 2.75 0.45 2.96 0.41
3 SUSTAI 2600 6.65 6.89 46.24 0.0753 3.40 0.40 2.63 0.41
4 DETERM 5030 7.02 7.01 59.45 0.1314 3.04 0.64 3.95 0.40
INVOLV 2933 6.56 6.90 47.86 0.0789 2.29 0.56 2.99 0.40
NOR 2099 6.70
6.86 43.14 0.0581 1.94 0.53 2.78 0.40
SOME 3394 6.97 6.93 50.88 0.0897 1.97 0.67 4.84 0.39
1 BOTH 2868 6.85 6.88 46.54 0.0771 1.87 0.59 2.81 0.39
INTO 3583 6.93 6.92 51.00 0.0952 2.51 0.57 3.14 0.39
BETWEE 3231 6.84 6.87 47.45 0.0879 2.33 0.55 2.83 0.38
CASES 3896 6.86 6.90 51.41 0.1062 2.58 0.54 3.22 0.38
MATTER 4313 6.91 6.96 55.19 0.1166 3.11 0.53 4.12 0.38
3 OPINIO 4764 7.02 6.98 58.85 0.1218 2.05 0.71 4.63 0.37
OUT 4389 7.00 6.99 57.04 0.1164 3.00 0.65 6.13 0.37
MAKE 2535 6.76 6.84 43.94 0.0681 2.35 0.54 3.17 0.37
ALTHOU 1762 6.67 6.77 38.65 0.0487 1.78 0.50 2.66 0.37
THEM 3505 6.92 6.89 49.37 0.0943 2.56 0.56 4.37 0.36
WELL 2259 6.77 6.83 43.14 0.0592 2.87 0.51 3.49 0.36
4 SUFFIC 2484 6.72 6.81 42.92 0.0708 2.35 0.45 3.24 0.36
FILED 5362 6.67 6.91 55.26 0.1589 4.09 0.33 3.46 0.36
9 NECESS 3477 6.93 6.93 52.20 0.0937 3.31 0.52 4.91 0.35
GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 0.50 3.10 0.35
EITHER 2033 6.71 6.78 40.20 0.0532 1.96 0.50 3.10 0.35
EVEN 1964 6.64 6.75 38.80 0.0509 2.09 0.49 3.06 0.35
SET 2964 6.71 6.84 46.54 0.0798 3.36 0.45 3.72 0.35
WHILE 2749 6.82 6.85 46.31 0.0751 5.29 0.43 4.31 0.35
6 RECORD 6093 6.91 6.98 60.51 0.1675 5.25 0.41 4.95 0.35
DURING 2216 6.58 6.62 36.50 0.0609 2.73 0.36 4.42 0.26
2 BASED 1605 6.38 6.56 32.34 0.0431 2.60 0.35 3.70 0.26
7 EXPRES 2022 6.51 6.61 36.01 0.0546 3.21 0.34 4.18 0.26
1 CONCER 1797 6.57 6.59 34.76 0.0468 4.40 0.34 3.67 0.26
REGARD 1466 6.39 6.52 30.80 0.0380 3.05 0.32 3.05 0.26
1 NEW 4744 6.68 6.72 48.09 0.1295 3.77 0.31 4.33 0.26
1 APPARE 1334 6.43 6.53 30.84 0.0364 3.26 0.30 3.32 0.26
2 PURPOS 4138 6.76 6.76 49.30 0.1096 3.99 0.41 6.33 0.25
2 STATE 9231 6.85 6.80 62.06 0.2417 3.06 0.39 4.64 0.25
HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 0.36 5.86 0.25
1 EACH 3332 6.68 6.69 43.90 0.0859 4.53 0.36 5.12 0.25
1 CONTAI 2096 6.55 6.65 38.12 0.0578 3.35 0.35 5.43 0.25
7 SPECIF 2900 6.65 6.68 42.28 0.0790 3.75 0.34 5.03 0.25
2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25
DECIDE 1409 6.41 6.50 29.89 0.0381 2.48 0.31 3.99 0.25
DIFFER 1714 6.46 6.55 33.14 0.0466 3.96 0.29 3.56 0.25
NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25
4 ILL 8605 6.49 6.46 32.88 0.2551 1.95 0.34 3.00 0.24
4 CLEAR 1537 6.52 6.57 33.48 0.0425 3.35 0.33 5.39 0.24
LATER 1426 6.43 6.47 29.48 0.0387 2.75 0.31 3.52 0.24
THROUG 1954 6.52 6.56 34.61 0.0531 3.87 0.30 4.00 0.24
2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24
CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24
6 ERROR 3841 6.56 6.66 44.80 0.1051 3.69 0.29 4.33 0.24
5 ISSUE 3113 6.61 6.66 42.88 0.0831 3.76 0.32 4.98 0.23
SECOND 2415 6.53 6.61 38.50 0.0656 3.97 0.31 5.63 0.23
2 YEARS 2601 6.53 6.56 37.10 0.0687 3.24 0.31 4.19 0.23
2 DECISI 3988 6.52 6.69 46.58 0.1070 4.00 0.30 5.57 0.23
FAILED 1442 6.29 6.48 30.31 0.0414 3.32 0.29 3.79 0.23
2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23
MANY 1117 6.27 6.38 25.82 0.0286 2.52 0.29 2.73 0.23
TAKE 1484 6.38 6.47 30.35 0.0407 3.85 0.27 3.52 0.23
THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23
3 FINDIN 3437 6.56 6.59 41.56 0.0995 4.00 0.26 3.90 0.23
OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23
1 USED 2650 6.45 6.58 38.16 0.0734 5.62 0.24 4.18 0.23
END 6422 6.81 6.71 51.86 0.1570 3.07 0.44 6.84 0.22
5 PARTIE 3496 6.55 6.59 41.71 0.0960 3.86 0.29 4.47 0.22
4 AMOUNT 3110 6.49 6.52 37.56 0.0869 3.85 0.27 3.75 0.22
APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22
ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22
HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22
MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22
2 ADDITI 1708 6.39 6.49 32.12 0.0453 5.06 0.25 4.68 0.22
3 DUE 1937 6.40 6.47 32.08 0.0542 4.13 0.25 3.79 0.22
THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22
SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22
SHOWS 1078 6.16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22
4 CONSTR 3805 6.58 6.55 40.50 0.1054 3.38 0.30 4.65 0.21
1 SEC 6808 6.65 6.62 49.60 0.1929 3.75 0.27 4.50 0.21
RECEIV 2801 6.52 6.57 39.10 0.0764 6.76 0.27 5.74 0.21
7 CONDIT 2779 6.46 6.47 35.52 0.0760 3.52 0.26 3.88 0.21
5 BASIS 1500 6.41 6.47 30.76 0.0412 5.82 0.26 5.60 0.21
SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21
CONSIS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21
1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21
2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21
1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21
DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21
3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21
5 FAILUR 1630 6.16 6.43 30.16 0.0459 3.81 0.24 4.43 0.21
2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21
BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21
1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21
MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21
2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21
1 ACT 5147 6.65 6.59 45.56 0.1370 3.30 0.32 6.21 0.20
6 RULE 4090 6.56 6.70 47.18 0.1055 4.23 0.31 12.48 0.20
2 RELATI 2530 6.54 6.53 37.10 0.0662 3.61 0.30 5.77 0.20
4 VIEW 1406 6.35 6.48 30.95 0.0375 4.33 0.29 7.01 0.20
1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20
7 TESTIM 3650 6.42 6.41 34.65 0.1010 3.30 0.25 3.88 0.20
1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20
FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20
4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20
2 CONTRO 2941 6.48 6.55 39.93 0.0849 5.05 0.23 5.00 0.20
LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20
1 THINK 1035 6.18 6.28 23.63 0.0298 3.00 0.23 3.20 0.20
PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20
3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20
SOUGHT 1132 6.11 6.33 25.44 0.0316 3.80 0.21 4.23 0.20
5 JUDGE 4000 6.52 6.64 46.84 0.1181 10.30 0.19 6.80 0.20
OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20
3 DISMIS 2755 5.96 6.48 35.90 0.0790 5.16 0.16 5.01 0.20
ITAL 11360 6.67 6.57 45.18 0.2755 3.12 0.37 7.32 0.19
FOL 5682 6.67 6.57 45.18 0.1378 3.12 0.37 7.39 0.19
3 ORDER 6773 6.78 6.77 58.32 0.1918 3.68 0.31 11.48 0.19
PAGE 3218 6.47 6.45 33.71 0.0815 2.83 0.31 5.57 0.19
MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19
9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19
1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19
TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19
VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19
1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19
2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19
BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19
3 COMPLA 3971 6.40 6.45 37.44 0.1136 4.27 0.22 4.90 0.19
1 NATURE 1185 6.16 6.31 25.43 0.0313 3.80 0.22 4.10 0.19
3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19
RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19
3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19
1 ESTABL 2947 6.74 6.72 44.46 0.0788 3.00 0.45 17.95 0.18
MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18
3 OPERAT 4207 6.52 6.45 39.56 0.1145 3.54 0.27 4.52 0.18
5 SEVERA 1243 6.32 6.36 27.25
0.0331 3.47 0.26 7.53 0.18 3 SUPPOR 3151 6.65 6.67 46.35 0.0855 7.06 0.24 9.79 0.18 8 CHARGE 4622 6.48 6.47 40.69 0.1234 3.96 0.24 4.95 0.18 3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18 RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18 ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18 5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18 POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18 4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18 1 EVERY 922 6.11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18 5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18 DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18 FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18 5 PETITI 7623 6.19 6.44 40.39 0.2198 3.73 0.19 5.82 0.18 SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18 OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18 Table IX. Sorted by EKL 119 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 2 OHIO 8519 6.49 6.35 34.39 0.2212 2.35 0.28 5.51 0.17 5 DAY 2189 6.41 6.46 34.16 0.0607 3.92 0.26 9.83 0.17 10 JURY 5530 6.41 6.31 34.27 0.1470 3.35 0.24 4.31 0.17 5 ADMITT 1667 6.32 6.32 28.87 0.0436 3.82 0.23 5.59 0.17 OBTAIN 1498 6.
18 6.30 27.40 0.0397 3.2 8 0.23 5.62 0.17 6 DUTY 1873 6.25 6.30 28.35 C.0506 3.82 0.21 5.09 0.17 1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17 LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17 TQGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17 3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70 0.17 6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17 LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17 2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17 5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17 LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17 WHOM 832 6.00 6.13 20.,08 0.0228 3.43 0.19 3.68 0.17 THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17 MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17 6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17 AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17 BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17 1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17 5 PERMIT 2869 6.35 6.49 39.63 0.0820 6.17 0.17 6.36 0.17 CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17 6 COURTS 2033 6.28 6.36 31 .21 0.0553 9.19 0.16 5.77 0.17 1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17 3 APPLIC 4168 6.58 6.60 47.37 0.1134 4.97 0.25 8.13 0.16 11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 0.24 7.85 0.16 3 APPELL 14543 6.53 6.44 50.16 0.3877 3.05 0.23 5.26 0.16 6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16 1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16 WAY 1771 6.21 6.45 32.91 0.0472 6.65 0.22 10.08 0.16 2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16 1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16 1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 0.20 5.32 0.16 9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16 3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16 RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16 3 COMMON 4042 6.46 6.48 42.58 0.1171 5.85 0.19 7.01 0.16 2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16 NEVER 976 6.01 
6.15 21.32 0.0254 4.03 0.19 4.18 0.16 2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16 SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16 WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16 HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16 LIKE 738 5.93 6.08 18.87 0.0198 4.09 0.17 3.62 0.16 6 CONSTI 4132 6.41 6.49 42.99 0.1058 3.48 0.28 7.53 0.15 7 WILL 7140 6.84 6.74 62.55 0.1944 5.49 0.26 12.86 0.15 7 CONTRA 8033 6.56 6.49 52.96 0.2158 3.98 0.23 7.29 0.15 3 PROPER 5913 6.40 6.34 36.91 0.1591 3.62 0.23 5.71 0.15 SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 0.23 4.77 0.15 8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15 8 INTERE 3637 6.36 6.32 35.33 0.0944 5.26 0.20 5.71 0.15 9 PUBLIC 4658 6.33 6.30 35.78 0.1226 4.86 0.20 5.07 0.15 APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15 HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15 4 OCCURR 1248 6.05 6.11 21.78 0.0347 3.73 0.18 4.81 0.15 7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15 COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15 MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15 Table IX. Sorted by EKL 120 VOTES WORD NOCC E EL PZD AVG G EK GL EKL PLACED 781 5.89 6.05 18.91 0.0208 4.15 0.16 4.20 0.15 READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15 MENTIO 694 5.91 6.02 17.39 0.0191 4.96 0.16 4.13 0.15 SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15 2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15 5 OBJECT 2703 6.27 6.31 32.50 0.0742 8.66 0.15 5.60 0.15 5 REQUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15 10 COUNTY 6245 6.62 6.52 52.43 0.1787 5.00 0.23 8.51 0.14 4 CONTIN 2382 6.37 6.40 34.35 0.0634 5.85 0.21 10.10 0.14 1 SUPREM 1904 6.
16 6.24 2 7.44 0.0474 3.73 0.21 6.65 0.14 HER 7548 6.30 6.20 31 .89 0.2095 4.05 0.20 4.75 0.14 1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14 HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14 OCCASI 742 5.95 6.03 10.38 0.0206 3.38 0.18 5.02 0.14 NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14 BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14 MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14 AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14 FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14 2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14 9 COUNSE 30 30 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14 FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14 WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14 5 ANSWER 3398 6.42 6.41 39.33 0.0913 5.64 0.22 9.44 0.13 STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 3.54 0.13 4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13 5 CITY 5969 6.24 6.23 38.05 0.1706 3.90 0.18 5.82 0.13 4 CODE 4152 6.21 6.18 29.5 5 0.1146 4.17 0.17 5.98 0.13 PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13 7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13 ALONE 536 5.73 5.87 14.79 0.0152 4.20 0. 14 3.50 0.13 DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13 REACHE 539 5.63 5.86 14.91 0.0139 4.C7 0.14 4.15 0.13 1 CONCED 485 5.58 5.83 14.00 0.0140 3.43 0.14 3.42 0.13 3 USE 3852 6.29 6.27 36. 
12 0.1059 4.86 0.18 7.72 0.12 1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12 7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12 9 CLAIM 2565 6.24 6.2 4 32.27 0.0735 5.91 0.15 7.77 0.12 DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12 1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12 13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12 2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12 NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12 ALREAD 542 5.68 5.80 14.08 0.0141 3.49 0.14 4.07 0.12 3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12 ADDED 587 5.62 5.7 7 13.96 0.0144 4.33 0.13 3.95 0.12 NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12 RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12 1 DESIRE 507 5.38 5.7 8 13.74 0.0143 4.09 0.12 3.97 0.12 SOLELY 441 5.50 5.74 12.07 0.0118 4.03 0.12 4.06 0.12 ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12 2 FILE 943 5.49 5.8 7 17.06 0.0265 5.51 0.10 4.17 0.12 5 EXAMIN 3117 6.19 6.23 35.56 0.0831 7.01 0.15 8.63 0.11 1 STAT 1245 5.90 5.9 3 19.10 0.0383 3.51 0.15 6.23 0.11 13 JURISD 3056 6.00 6.10 29.67 0.0812 4.48 0.14 6.50 0.11 MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11 8 ASSIGN 2654 6.00 6.12 29.32 0.0715 6.48 0.12 7.19 0.11 4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11 1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4. 15 0.11 HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11 Table IX. 
Sorted by EKL 121 VOTES WORD NOCC E EL PZD AVG G EK GL EKL ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11 STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11 FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11 2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10 8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10 2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10 LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10 EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10 1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10 ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10 EXISTS 376 5.38 5.59 10.94 0.0104 4.09 0.11 3.84 0.10 SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10 5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09 HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.09 INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09 INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09 2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09 1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09 1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09 QUITE 307 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09 2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09 NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09 1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09 8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08 5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08 2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08 7 OFFICE 4060 6.26 6.12 33.93 0.1032 4.82 0.17 18.75 0.07 SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07 DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07 1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07 SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07 FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07 FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07 1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07 Table IX.
Sorted by EKL 122 VOTES WORD NOCC E EL PZD AVG G EK GL EKL THE 442506 7.87 7.65 99.99 12.1192 -0.19 41.17 1.87 1.93 AAAAAA 2649 7.07 7.87 99.99 0.0783 0.42 4.32 2.55 31.32 WAS 56044 7.69 7.55 95. 73 1.5630 0.52 3.68 1.78 1.33 AND 128355 7.83 7.61 99.73 3.4562 0.53 15.25 2.14 1.57 NOT 35835 7.75 7.60 96.97 0.9798 0.55 6.95 1.90 1.56 WHICH 25522 7.70 7.56 94.41 0.6984 0.64 4.89 1.79 1.38 THAT 89026 7.80 7.60 98.15 2.4343 0.70 9.48 1.92 1.54 BUT 9174 7.48 7.3/ 78.89 0.2485 0.84 2.21 2.06 0.89 FOR 45223 7.73 7.61 98.07 1.2529 1.03 5.00 1.87 1.59 ALSO 5230 7.29 7.23 67.15 0.1410 1.08 1.33 1.95 0.71 THIS 29490 7.66 7.59 96.67 0.8106 1.15 4.02 2.45 1.41 WITH 21624 7.64 7.51 92.03 0.5840 1.15 3.46 2.15 1.16 HAVE 13825 7.53 7.44 85.99 0.3761 1.17 2.52 2.53 0.97 OTHER 8966 7.43 7.31 76.17 0.2397 1.18 1.79 2.45 0.76 FROM 19879 7.62 7.51 92.18 0.5456 1.25 3.01 1.83 1.19 2 PLAINT 20986 7.02 6.94 57.71 0.6097 1.25 0.64 2.24 0.43 ANY 13855 7.47 7.37 83.12 0.3703 1.29 1.87 2.37 0.83 THERE 12925 7.48 7.40 84.25 0.3545 1.30 1.87 2.17 0.91 1 FOLLOW 6076 7.28 7.24 69.38 0.1661 1.30 1.18 2.44 0.69 HAS 10530 7.36 7.37 81.76 0.2838 1.34 1.51 2.41 0.83 5 DEFEND 25773 7.20 7.12 71.19 0.7468 1.34 0.79 2.43 0.53 UPON 11816 7.46 7.40 82.93 0.32 32 1.37 1.76 1.83 0.95 BEEN 12072 7.50 7.41 83.76 0.3306 1.41 1.96 2.07 0.95 WERE 12911 7.43 7.31 79.91 0.3486 1.43 1.55 2.67 0.70 WOULD 9678 7.34 7.23 73.12 0.2580 1.43 1.34 2.49 0.64 THEREF 3871 7.01 7.18 62.21 0.1050 1.43 0.90 2.25 0.65 MAY 9510 7.37 7.30 76.70 0.2605 1.45 1.38 2.50 0.72 1 ALL 9021 7.36 7.26 74.78 0.2361 1.45 1.46 3.34 0.64 HOWEVE 3333 7.09 7.11 55.90 0.0923 1.47 0.90 1.76 0.62 SUCH 18195 7.50 7.35 85.80 0.4817 1.49 1.78 2.91 0.74 HAD 15451 7.43 7.30 82.44 0.4205 1.49 1.38 2.68 0.69 WHEN 6875 7.28 7.24 69.87 0.1866 1.54 1.20 2.24 0.69 HIS 19529 7.32 7.22 78.63 0.5396 1.55 1.03 2.83 0.60 DID 6224 7.24 7.17 66.70 0.1665 1.55 1.03 2.52 0.59 ARE 13721 7.46 7.39 84.37 0.3766 1.56 1.85 2.55 0.86 1 ONLY 6218 7.33 7.31 72.14 0.1693 1.57 
1.38 1.88 0.82 COULD 5096 7.16 7.11 61.79 0.1383 1.59 0.95 2.58 0.54 i TWO 5130 7.11 7.11 60.51 0.1408 1.59 0.85 2.47 0.55 MADE 7999 7.32 7.29 74.51 0.2213 1.60 1.25 1.97 0.76 1 ONE 9388 7.39 7.31 76.40 0.2540 1.61 1.48 2.40 0.75 1 CAN 2822 6.93 6.94 49.15 0.0739 1.61 0.67 2.68 0.44 AFTER 6340 7.24 7.21 68.47 0.1745 1.62 1.06 2.27 0.65 LO COURT 33021 7.45 7.4L 93.58 0.9097 1.64 1.26 3.97 0.76 3 CASE 15261 7.45 7.36 84.74 0.4182 1.64 1.43 2.38 0.80 9 EVIDEN 12726 7. 10 7.02 65.64 0.3461 1.64 0.71 3.09 0.43 WHERE 5794 7.19 7.16 65.26 0.1562 1.64 1.03 2.43 0.58 WHETHE 5173 7.22 7.19 66.13 0.1408 1.69 1.04 2.57 0.61 ITS 11061 7.31 7.20 75.34 0.2888 1.71 1.13 3.49 0.54 SINCE 2756 6.89 6.93 48.65 0.0753 1.76 0.62 2.78 0.43 ALTHOU 1762 6.67 6.77 38.65 0.0487 1.78 0.50 2.66 0.37 DOES 4264 7.09 7.20 63.30 0.1175 1.80 0.96 2.11 0.67 UNDER 10893 7.40 7.31 80.44 0.2937 1.82 1.31 2.98 0.69 MUST 5208 7.18 7.22 66.70 0.1412 1.83 1.08 2.79 0.64 1 BOTH 2868 6.85 6.88 46.54 0.0771 1.87 0.59 2.81 0.39 SHOULD 5689 7.20 7.20 66.59 0.1511 1.89 1.02 2.45 0.63 WHO 5241 7.11 7.0 3 59.64 0.1416 1.89 0.79 3.51 0.44 FURTHE 4546 7.11 7.13 61.94 0.12 30 1.92 0.91 3.44 0.53 HELD 3978 7.04 7.02 5 5.34 0.1058 1.92 0.75 2.83 0.47 HERE 3448 6.93 6.97 52.69 0.0938 1.92 0.66 3.12 0.43 NOR 2099 6.70 6.86 43. 14 0.0581 1.94 0.53 2.78 0.40 Tabl e X. 
Sorted by G 123 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 4 ILL 0605 6.49 6.46 32.88 0.2551 1.95 0.34 3.00 0.24 EITHER 2033 6.71 6.70 40.20 0.0532 1.96 0.50 3.10 0.35 THESE 4753 7.11 7.07 59.79 0.1275 1.97 0.83 3.27 0.48 SOME 3304 6.97 6.93 50.88 0.0897 1.97 0.67 4.84 0.39 MORE 3050 6.94 6.95 49.49 0.0822 1.98 0.66 2.76 0.45 RESPEC 2579 6.80 6.82 44.43 0.0678 1.99 0.54 3.71 0.34 WITHOU 4652 7.10 7.17 63.57 0.1274 2.02 0.91 2.39 0.62 THEN 4583 7.12 7.07 59.19 0.1242 2.04 0.82 2.60 0.51 BECAUS 3553 7.00 7.11 57.19 0.0999 2.04 0.75 2.28 0.58 3 OPINIO 4764 7.02 6.98 58.85 0.1218 2.05 0.71 4.63 0.37 8 CONSID 5288 7.15 7.14 63.72 0.1379 2.06 0.93 2.68 0.56 CANNOT 2467 6.74 6.92 46.54 0.0694 2.06 0.57 2.46 0.45 4 CIRCUM 2543 6.75 6.75 41.94 0.0679 2.08 0.49 2.94 0.33 THUS 1622 6.58 6.65 34.80 0.0427 2.08 0.42 2.88 0.31 EVEN 1964 6.64 6.75 38.80 0.0509 2.09 0.49 3.06 0.35 4 FACT 4658 7.06 7.10 60.28 0.1249 2.10 0.80 2.40 0.54 BEFORE 5814 7.19 7.23 68.55 0.1612 2.12 0.95 2.63 0.66 9 ACCORD 2721 6.87 6.96 49.64 0.0745 2.12 0.62 2.92 0.45 2 BEING 3858 7.04 7.08 57.41 0.1040 2.13 0.75 2.89 0.52 CONTEN 3888 7.02 7.09 57.11 0.1094 2.14 0.71 2.24 0.56 2 REASON 6845 7.17 7.25 72.48 0.1850 2.15 1.11 2.86 0.64 OUR 3179 6.80 6.83 47.98 0.0833 2.15 0.55 4.84 0.31 2 QUESTI 8776 7.25 7.28 77.08 0.2395 2.17 1.03 4.30 0.62 HAVING 2006 6.67 6.86 42.09 0.0548 2.18 0.51 2.07 0.43 THEIR 6514 7.08 7.02 61.75 0.1756 2.19 0.70 3.29 0.42 2 CERTAI 3069 6.87 6.96 50.62 0.0830 2.20 0.65 3.90 0.42 THAN 4378 7.11 7.10 59.38 0.1198 2.23 0.81 2.63 0.54 3 PRESEN 5653 7.
18 7.20 68.25 0.1558 2.26 0.88 3.49 0.58 4 AFFIRM 3897 6.89 7.23 63.53 0.1109 2.26 0.78 2.61 0.70 5 STATUT 7283 6.89 6.80 53.15 0.1985 2.26 0.48 4.39 0.29 GIVEN 2766 6.80 6.82 45.07 0.0744 2.27 0.50 3.10 0.35 INVOLV 2933 6.56 6.90 47.86 0.0789 2.29 0.56 2.99 0.40 3 FIRST 4165 7.01 7.04 57.15 0.1116 2.30 0.71 3.27 0.46 UNTIL 2347 6.65 6.70 39.22 0.0628 2.31 0.42 3.46 0.30 UNLESS 1520 6.54 6.63 33.82 0.0418 2.32 0.39 2.95 0.30 BETWEE 3231 6.84 6.87 47.45 0.0879 2.33 0.55 2.83 0.38 2 LAW 9658 7.23 7.20 74.29 0.2554 2.34 0.88 3.39 0.54 2 REQUIR 6103 7.06 7.10 63.98 0.1665 2.34 0.74 4.53 0.47 MAKE 2535 6.76 6.84 4 3.94 0.0681 2.35 0.54 3.17 0.37 4 SUFFIC 2484 6.72 6.81 42.92 0.0708 2.35 0.45 3.24 0.36 2 OHIO 8519 6.49 6.35 34.39 0.2212 2.35 0.28 5.51 0.17 2 STATED 3698 6.99 6.99 54.77 0.0975 2.37 0.68 3.69 0.42 OVER 2622 6.72 6.71 40.99 0.0701 2.40 0.43 3.50 0.29 MIGHT 1734 6.57 6.63 34.27 0.0465 2.40 0.39 2.78 0.30 2 SITUAT 1358 6.42 6.49 29.40 0.0368 2.40 0.33 3.07 0.25 ITSELF 993 6.25 6.33 24.38 0.0260 2.40 0.27 3.32 0.22 THEY 7042 7.14 7.08 64.47 0.1897 2.45 0.77 3.52 0.45 4 CONCUR 2290 6.65 7.30 63.91 0.0643 2.45 0.73 2.51 0.86 3 INDICA 1901 6.64 6.70 37.67 0.0499 2.45 0.42 3.59 0.31 MERELY 936 6.21 6.32 23.78 0.0248 2.46 0.26 2.82 0.22 SAME 4992 7.05 7.07 60.73 0.1299 2.47 0.76 3.32 0.48 CONS IS 941 6.19 6.31 23.66 0.0260 2.47 0.26 3.02 0.21 DECIDE 1409 6.41 6.50 29.89 0.0381 2.48 0.31 3.99 0.25 HIM 5613 6.91 6.85 54.24 0.1531 2.49 0.52 6.64 0.29 HOLD 1033 6.15 6.35 24.61 0.0270 2.49 0.26 3.24 0.22 7 CONCLU 3665 6.95 7.02 53.90 0.1010 2.50 0.64 2.52 0.49 INTO 3583 6.93 6.92 51.00 0.0952 2.51 0.57 3.14 0.39 1 APP 4769 6.74 6.72 44.92 0.1292 2.51 0.41 3.31 0.29 WHAT 2883 6.76 6.79 44.80 0.0725 2.52 0.51 3.76 0.32 CITED 1401 6.41 6.54 30.7 5 0.0390 2.52 0.33 3.08 0.27 Table X. Sorted 1 sy G 124 VOTES WORD MANY NOCC 1117 E 6.27 EL 6.38 PZD 25.82 AVG 0.0286 G 2.52 EK 0.29 GL 2.73 EKL 0.23 3 TIME 8254 7.17 7. 20 70.40 0.2237 2.55 0.92 2. 
17 0.62 2 PROVIS 4479 6.80 6.77 47.18 0.1251 2.55 0.45 3.69 0.30 AGAINS 5725 7.04 7.06 61.83 0.1605 2.56 0.63 3.13 0.46 1 PROVID 5792 7.03 7.02 60.02 0.1599 2.56 0.64 3.62 0.42 THEM 3505 6.92 6.89 40.37 0.0943 2.56 0.56 4.37 0.36 1 PART 4746 7.12 7.09 60.62 0.1287 2.57 0.78 2.85 0.52 THOUGH 1301 6.43 6.54 30.46 0.0340 2.57 0.34 2.82 0.28 2 CASES 3896 6.86 6.90 51.41 0.1062 2.58 0.54 3.22 0.38 INSTAN 1867 6.54 6.60 34.88 0.0494 2.58 0.36 3.01 0.28 1 ENTITL 2141 6.53 6.69 38.42 0.0591 2.60 0.38 3.68 0.30 2 BASED 1605 6.38 6.56 32.84 0.0431 2.60 0.35 3.70 0.26 4 PERSON 6980 7.01 6.94 60.81 0.1897 2.61 0.57 5.09 0.33 THEREO 2640 6.69 6.75 41.60 0.0697 2.61 0.42 3.06 0.33 WITHIN 4561 6.85 6.97 55.56 0.1294 2.63 0.50 3.59 0.41 2 REVERS 2857 6.66 6.93 46.96 0.0842 2.65 0.48 3.60 0.43 MOST 1051 6.25 6.31 24.95 0.0273 2.65 0.28 6.00 0.18 NEITHE 930 6.16 6.38 24.87 0.0252 2.65 0.27 2.44 0.25 ABOUT 3228 6.65 6.65 41.10 0.0882 2.68 0.39 3.45 0.27 5 SUBJEC 2855 6.70 6.81 45.48 0.0784 2.72 0.46 3.64 0.33 THEREI 1068 6.13 6.38 25.70 0.0279 2.72 0.27 3.38 0.23 4 FOUND 3608 6.91 6.98 53.68 0.1017 2.73 0.53 3.16 0.43 DURING 2216 6.58 6.62 36.50 0.0609 2.73 0.36 4.42 0.26 7 TRIAL 9898 6.97 6.98 62.85 0.2884 2.75 0.45 2.96 0.41 LATER 1426 6.43 6.47 29.48 0.0387 2.75 0.31 3.52 0.24 NOTHIN 1275 6.24 6.55 30.65 0.0345 2.76 0.33 2.84 0.29 SHALL 6240 6.81 6.73 49.18 0.1705 2.77 0.43 4.34 0.27 3 DISTIN 997 6.14 6.22 22.68 0.0265 2.77 0.24 4.15 0.18 THEREA 1342 6.40 6.55 31.03 0.0389 2.78 0.32 2.92 0.28 NOW 2384 6.60 6.80 43.29 0.0629 2.79 0.46 3.10 0.34 VERY 888 6.15 6.22 21.93 0.0230 2.80 0.24 3.45 0.19 CLEARL 1145 6.31 6.45 27.67 0.0304 2.81 0.30 3.28 0.24 PAGE 3218 6.47 6.45 33.71 0.0815 2.83 0.31 5.57 0.19 DISCUS 1034 6.22 6.31 24.34 0.0267 2.85 0.25 3.19 0.21 5 EFFECT 3759 6.91 6.92 52.39 0.1018 2.86 0.56 7.29 0.34 WELL 2259 6.77 6.83 43.14 0.0592 2.87 0.51 3.49 0.36 4 PRIOR 2379 6.69 6.74 40.88 0.0654 2.87 0.41 3.12 0.32 2 SECTIO 10226 6.83 6.76 55.75 0.2858 2.91 0.38 4.29 0.27 6 R
IGHT 5447 6.76 6.86 54.24 0.1464 2.91 0.47 3.87 0.32 1 DENIED 2053 6.30 6.77 40.39 0.0580 2.91 0.37 2.72 0.35 1 OWN 1857 6.53 6.60 34.99 0.0502 2.91 0.35 3.93 0.27 2 SIMILA 1243 6.38 6.46 28.61 0.0339 2.91 0.30 3.18 0.24 4 PURSUA 1039 6.08 6.24 23.17 0.0271 2.92 0.22 3.93 0.18 ABOVE 1812 6.40 6.63 35.18 0.0483 2.94 0.35 3.03 0.29 SAY 1088 6.26 6.34 25.44 0.0294 2.94 0.26 3.71 0.21 SEE 4704 6.93 6.88 55.00 0.1297 2.95 0.47 3.89 0.33 APPLIE 1264 6.25 6.40 27.63 0.0351 2.95 0.27 3.46 0.22 ANOTHE 1881 6.57 6.65 36.35 0.0500 2.97 0.37 3.17 0.29 5 CAUSE 4463 6.77 6.90 54.28 0.1255 2.98 0.43 4.08 0.34 LEAST 766 6.00 6.11 19.40 0.0206 2.98 0.20 3.43 0.17 OUT 4389 7.00 6.99 57.04 0.1164 3.00 0.65 6.13 0.37 1 ESTABL 2947 6.74 6.72 44.46 0.0788 3.00 0.45 17.95 0.18 1 THINK 1035 6.18 6.28 23.63 0.0298 3.00 0.23 3.20 0.20 RATHER 917 6.15 6.21 22.00 0.0246 3.00 0.24 3.67 0.18 9 JUDGME 10581 7.06 7.17 73.19 0.3119 3.01 0.54 4.08 0.49 THERET 1022 6.05 6.35 24.95 0.0278 3.03 0.25 3.31 0.22 4 DETERM 5030 7.02 7.01 59.45 0.1314 3.04 0.64 3.95 0.40 2 ALLEGE 3766 6.72 6.81 47.86 0.1091 3.04 0.40 3.37 0.33 POSSIB 1018 6.18 6.23 22.98 0.0272 3.04 0.23 3.70 0.18 1 FACTS 4095 7.00 7.01 55.79 0.1137 3.05 0.60 2.90 0.46 Table X. Sorted by G 125 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 3 APPELL 14543 6.53 6.44 50.16 0.3877 3.05 0.23 5.26 0.16 REGARD 1466 6.39 6.52 30.80 0.0380 3.05 0.32 3.05 0.26 1 EVERY 922 6.
11 6.22 22.31 0.0244 3.05 0.22 3.79 0.18 AMONG 579 5.83 5.93 15.81 0.0152 3.05 0.17 3.70 0.14 2 STATE 9231 6.85 6.80 62.06 0.2417 3.06 0.39 4.64 0.25 2 GIVE 1490 6.32 6.45 29.78 0.0399 3.06 0.29 3.67 0.23 END 6422 6.81 6.71 51.86 0.1570 3.07 0.44 6.84 0.22 RELATE 839 5.92 6.12 20.04 0.0233 3.10 0.20 4.01 0.16 MATT6R 4313 6.91 6.96 55.19 0.1166 3.11 0.53 4.12 0.38 4 GENERA 5262 6.87 6.82 52.92 0.1338 3.11 0.47 5.01 0.28 3 FIND 1954 6.51 6.66 37.75 0.0519 3.11 0.35 3.70 0.28 ITAL 11360 6.67 6.57 45.18 0.2755 3.12 0.37 7.32 0.19 FOL 5682 6.67 6.57 45.18 0.1378 3.12 0.37 7.39 0.19 THOSE 2527 6.73 6.77 42.43 0.0642 3.12 0.46 3.52 0.33 1 INTEND 1333 6.29 6.39 27.63 0.0361 3.14 0.25 4.27 0.21 APPLY 806 6.00 6.08 19.63 0.0212 3.14 0.19 4.78 0.15 4 EMPHAS 1012 5.96 6.00 19.59 0.0246 3.16 0.19 5.19 0.13 2 PARTIC 2381 6.48 6.76 42.12 0.0625 3.17 0.41 3.48 0.32 HEREIN 2599 6.23 6.70 41.75 0.0670 3.17 0.36 5.86 0.25 2 TIMES 751 5.95 6.09 19.21 0.0201 3.18 0.19 3.80 0.16 1 THREE 2437 6.70 6.73 41.18 0.0677 3.19 0.40 3.87 0.30 NEVERT 370 5.50 5.71 11.92 0.0096 3.19 0.13 3.20 0.12 7 EXPRES 2022 6.51 6.61 36.01 0.0546 3.21 0.34 4.18 0.26 1 PAID 2316 6.25 6.25 28.16 0.0616 3.21 0.23 4.69 0.16 TOOK 1080 6.15 6.28 24.46 0.0302 3.21 0.24 4.38 0.19 THEREB 712 5.99 6.11 19.02 0.0192 3.22 0.19 3.25 0.17 HOW 739 5.93 6.01 17.89 0.0191 3.23 0.19 3.80 0.15 2 YEARS 2601 6.53 6.56 37.10 0.0687 3.24 0.31 4.19 0.23 COME 663 5.90 6.00 17.40 0.0173 3.24 0.18 3.88 0.15 4 GROUND 2629 6.68 6.77 44.16 0.0728 3.25 0.38 5.73 0.29 SHOW 1649 6.36 6.59 33.89 0.0470 3.26 0.32 3.21 0.28 1 APPARE 1334 6.43 6 . 
5 3 30.84 0.0364 3.26 0.30 3.32 0.26 LIKEWI 404 5.52 5.64 11.70 0.0106 3.26 0.12 4.45 0.10 TAKEN 2518 6.67 6.76 43.07 0.0697 3.27 0.37 4.04 0.31 OBTAIN 1498 6.18 6.30 27.40 0.0397 3.28 0.23 5.62 0.17 MAKES 565 5.73 5.98 16.27 0.0151 3.28 0.17 3.07 0.15 ENTERS 2920 6.78 6.87 48.58 0.0873 3.29 0.42 4.02 0.34 L ACT 5147 6.65 6.59 45.56 0.1370 3.30 0.32 6.21 0.20 7 TESTIM 3650 6.42 6.41 34.65 0.1010 3.30 0.25 3.88 0.20 9 NECESS 347 7 6.93 6.93 52.20 0.0937 3.31 0.52 4.91 0.35 TOGETH 861 6.04 6.16 20.91 0.0222 3.31 0.21 3.86 0.17 FAILED 1442 6.29 6.48 30.31 0.0414 3.32 0.29 3.79 0.23 1 TRUE 1140 6.23 6.36 26.23 0.0309 3.33 0.26 4.42 0.20 BELIEV 1176 6.22 6.34 25.67 0.0322 3.33 0.24 3.34 0.21 5 RECOGN 1033 6.10 6.25 23.51 0.0261 3.33 0.23 3.94 0.18 SUPRA 2573 6.29 6.25 29.21 0.0636 3.34 0.23 4.77 0.15 WHOSE 655 5.89 6.04 17.70 0.0179 3.34 0.18 3.38 0.16 1 CONTAI 2096 6.55 6.65 38.12 0.0578 3.35 0.35 5.43 0.25 10 JURY 5530 6.41 6.31 34.27 0.1470 3.35 0.24 4.31 0.17 4 CLEAR 1537 6.52 6.57 33.48 0.0425 3.35 0.33 5.39 0.24 HEARD 903 5.97 6.07 19.93 0.0241 3.35 0.18 5.06 0.14 BEYOND 754 5.87 5.99 17.74 0.0209 3.35 0.17 3.90 0.14 SET 2964 6.71 6.84 46.54 0.0798 3.36 0.45 3.72 0.35 OBVIOU 645 5.87 6.09 18.23 0.0187 3.36 0.18 2.92 0.18 MERE 654 5.82 5.99 17.02 0.0170 3.36 0.17 3.95 0.14 SHOWIN 829 5.78 6.16 20.53 0.0227 3.37 0.19 3.12 0.18 4 CONSTR 3805 6.58 6.55 40.50 0.1054 3.38 0.30 4.65 0.21 SHOWN 1106 6.15 6.36 25.74 0.0303 3.38 0.24 3.23 0.22 OCCASI 742 5.95 6.03 18.38 0.0206 3.38 0.18 5.02 0.14 LONG 1047 6.23 6.32 24.80 0.0280 3.39 0.23 3.84 0.20 Table X. 
Sorted by G 126 VOTES WORD NOGG E EL PZD AVG G EK GL EKL 3 SUSTAI 2600 6.65 6.89 46.^4 0.0 753 3.40 0.40 2.63 0.41 PUT 719 5.88 5.96 17.40 0.0197 3.40 0.17 5.70 0.13 2 MASS 4687 5.77 5.73 16.98 0.1483 3.41 0.12 4.36 0.10 2 TERMS 1583 6.33 6.39 28.46 0.0424 3.43 0.25 3.35 0.21 WHOM 832 6.00 6.13 20.08 0.0228 3.43 0.19 3.68 0.17 1 CONCEO 485 5.58 5.8 3 14.00 0.0140 3.43 0.14 3.42 0.13 LESS 923 6.08 6.17 21.63 0.0250 3.44 0.21 3.99 0.17 1 FAVOR 1249 6.22 6.37 26.87 0.0364 3.45 0.23 4.09 0.21 MANNER 1259 6.30 6.37 27.29 0.0329 3.46 0.27 6.32 0.19 SUGGES 782 5.94 6.06 18.68 0.0208 3.46 0.18 3.55 0.16 5 SEVERA 1243 6.32 6.36 27.25 0.0331 3.47 0.26 7.53 0.18 LATTER 833 6.04 6.14 20.23 0.0235 3.47 0.19 3.63 0.17 1 STILL 660 5.86 6.07 18.08 0.0176 3.47 0.18 2.94 0.17 NOTED 710 5.88 6.02 18.04 0.0182 3.47 0.17 4.48 0.14 6 CONST I 4132 6.41 6.49 42.99 0.1058 3.48 0.28 7.53 0.15 3 SUBSTA 2527 6.62 6.71 41.60 0.0693 3.48 0.36 4.62 0.27 ALREAO 542 5.68 5.8 14.08 0.0141 3.49 0.14 4.07 0.12 1 RESULT 3328 6.85 6.86 48.50 0.0911 3.50 0.49 3.97 0.34 ORDERE 1180 6.14 6.33 26.23 0.0324 3.50 0.23 6.13 0.18 6 AGREE 707 5.91 6.10 18.98 0.0187 3.50 0.19 3.35 0.17 1 STAT 1245 5.90 5.93 19.10 0.0383 3.51 0.15 6.23 0.11 7 CONDIT 2779 6.46 6.47 35.52 C.0760 3.52 0.26 3.88 0.21 7 JUSTIF 885 5.90 6.07 19.85 0.0235 3.52 0.18 4.41 0.15 1 TESTIF 3484 6.35 6.35 31.74 0.0969 3.53 0.24 3.72 0.19 3 OPERAT 4207 6.52 6.45 39.56 0.1145 3.54 0.27 4.52 0.18 2 WHOLE 651 5.74 5.78 14.87 0.0169 3.54 0.14 5.73 0.10 SHOWS 1073 6. 
16 6.35 25.25 0.0297 3.55 0.23 3.06 0.22 1 REV 1484 6.07 6.08 22.72 0.0446 3.55 0.18 9.27 0.12 FOREGO 626 5.73 5.96 16.64 0.0163 3.55 0.16 3.70 0.14 1 PROCEE 5021 6.79 6.84 55.19 0.1373 3.56 0.40 6.15 0.26 RAISED 1050 6.00 6.28 23.93 0.0290 3.56 0.21 3.95 0.19 READS 769 5.89 6.03 18.30 0.0220 3.56 0.16 3.85 0.15 DOING 625 5.71 5.89 16.04 0.0167 3.56 0.15 5.74 0.12 7 VALID 768 5.83 5.92 17.06 0.0207 3.58 0.16 4.77 0.12 1 KNOWN 1083 6.12 6.17 22.19 0.0285 3.59 0.21 4.34 0.16 2 RELATI 2530 6.54 6.53 37.10 0.0662 3.61 0.30 5.77 0.20 BECAME 734 5.81 6.08 18.61 0.0196 3.61 0.18 3.09 0.17 3 PROPER 5913 6.40 6.34 36.91 0.1591 3.62 0.23 5.71 0.15 1 HOLDIN 1008 6.05 6.20 22.76 0.0265 3.62 0.21 4.43 0.17 6 ACTION 8248 6.94 6.92 64.55 0.2329 3.64 0.39 4.77 0.31 2 LANGUA 1492 6.22 6.23 25.78 0.0411 3.66 0.21 5.17 0.16 2 SUBSEQ 1263 6.25 6.37 26.99 0.0363 3.67 0.24 3.97 0.21 2 ESSENT 651 5.83 5.98 16.76 0.0173 3.67 0.16 3.52 0.15 3 ORDER 6773 6.78 6.77 58.32 0.1918 3.68 0.31 11.48 0.19 FORTH 1458 6.25 6.40 28.80 0.0391 3.68 0.25 4.54 0.20 INSIST 368 5.36 5.51 10.41 0.0096 3.68 0.10 4.72 0.09 6 ERROR 3841 6.56 6.66 44.80 0.1051 3.69 0.29 4.33 0.24 NONE 506 5.58 5.82 14.23 0.0136 3.70 0.14 4.14 0.12 HERETO 498 5.41 5.64 12.60 0.0121 3.70 0.12 6.07 0.09 ONCE 375 5.32 5.60 11.02 0.0094 3.70 0.11 3.77 0.10 5 PETITI 7623 6.19 6.44 40.39 0.2198 3.73 0.19 5.82 0.18 1 SUPREM 1904 6.16 6.24 27.44 0.0474 3.73 0.21 6.65 0.14 4 OCCURR 1248 6.05 6.11 21.78 0.0347 3.73 0.18 4.81 0.15 1 SEC 6808 6.65 6.62 49.60 0.1929 3.75 0.27 4.50 0.21 7 SPECIF 2900 6.65 6.68 42.28 0.0790 3.75 0.34 5.03 0.25 5 ISSUE 3113 6.61 6.66 42.88 0.0831 3.76 0.32 4.98 0.23 1 NEW 4744 6.68 6.72 48.09 0.1295 3.77 0.31 4.33 0.26 1 APPROX 704 5.79 5.87 15.77 0.0179 3.77 0.15 4.01 0.12 8 MOTION 6621 6.71 6.84 53.90 0.1942 3.78 0.30 3.36 0.33 3 CAREFU 453 5.42 5.79 13.51 0.0118 3.79 0.13 3.84 0.12 Table X. Sorted by G 127 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 1 NATURE 1185 6.16 6.31 25.48 0.0313 3.80 0.22 4.
10 0.19 SOUGHT 1 L 32 6. 11 6.33 2 5 . 4 4 0.0316 3.80 0.21 4.23 0.20 5 FAILUR 1630 6. 16 6.43 30. 1.6 0.0459 3.81 0.24 4.43 0.21 5 ADMITT L667 6.32 6.32 2 8 . i\ 7 0.0436 3.82 0.23 5.59 0.17 6 DUTY 1873 6.25 6.30 28.35 0.0506 3.82 0.21 5.09 0.17 4 DISSEN 751 5.48 5.73 13.43 0.0191 3.84 0.12 3.90 0.11 4 AMOUNT 3110 6.49 6.52 37.56 0.0869 3.85 0.27 3.75 0.22 TAKE 1484 6.38 6.47 30.3 5 0.0407 3.85 0.27 3.52 0.23 MUCH 693 5.99 6.11 19.13 0.0187 3.85 0.19 3.99 0.17 2 INCLUD 2632 6.71 6.76 43.41 0.0716 3.86 0.39 3.68 0.31 5 PARTIE 3496 6.55 6.59 41.71 0.0960 3.86 0.29 4.47 0.22 5 PREVEN 956 6.00 6.16 21.44 0.0265 3.86 0.19 3.57 0.17 THROUG 1954 6.52 6.56 34.61 0.0531 3.87 0.30 4.00 0.24 2 QUOTED 591 5.60 5.85 15.13 0.0149 3.88 0.14 4.09 0.12 ARGUED 396 5.47 5.71 12.15 0.0117 3.88 0.12 3.34 0.12 BECOME 1158 6.07 6.30 25.36 0.0320 3.89 0.23 3.96 0.19 5 CITY 5969 6.24 6.23 38.05 0.1706 3.90 0.18 5.82 0.13 5 DAY 2189 6.41 6.46 34.16 0.0607 3.92 0.26 9.83 0.17 PREVIO 1040 6.16 6.31 24.57 0.0277 3.93 0.22 3.68 0.20 1 RENDER 1657 6.30 6.45 31.74 0.0464 3.94 0.23 6.39 0.19 DONE 1079 6.09 6.28 24.57 0.0282 3.94 0.21 4.53 0.18 MOVED 492 5.61 5.75 13.40 0.0149 3.94 0.13 4.21 0.11 8 CHARGE 4622 6.48 6.47 40.69 0.1234 3.96 0.24 4.95 0.18 DIFFER 1714 6.46 6.55 33.14 0.0466 3.96 0.29 3.56 0.25 5 APPEAR 3855 6.95 7.00 57.68 0. 
1045 3.97 0.56 9.43 0.32 SECOND 2415 6.53 6.61 38.50 0.0656 3.97 0.31 5.63 0.23 2 DATE 1983 6.31 6.41 31.37 0.0555 3.97 0.23 4.85 0.19 7 CONTRA 8033 6.56 6.49 52.96 0.2158 3.98 0.23 7.29 0.15 DIFFIC 578 5.72 5.87 15.06 0.0155 3.98 0.14 3.51 0.13 2 PURPOS 4138 6.76 6.76 49.30 0.1096 3.99 0.41 6.33 0.25 2 DECISI 3988 6.52 6.69 46.58 0.1070 4.00 0.30 5.57 0.23 3 FIND IN 3437 6.56 6.59 41.56 0.0995 4.00 0.26 3.90 0.23 BROUGH 1534 6.50 6.59 33.74 0.0460 4.00 0.29 3.64 0.27 3 VARIOU 815 5.99 6.12 19.96 0.0214 4.01 0.20 3.74 0.16 8 HEARIN 2525 6.28 6.31 31.59 0.0716 4.03 0.21 6.14 0.15 NEVER 976 6.01 6.15 21.32 3.0254 4.03 0.19 4.18 0.16 SOLELY 441 5.50 5.74 12.87 0.0118 4.03 0.12 4.06 0.12 HER 7548 6.30 6.20 31.89 0.2095 4.05 0.20 4.75 0.14 REACHE 539 5.63 5.86 14.91 0.0139 4.07 0.14 4.15 0.13 FILED 5362 6.67 6.91 55.26 0. 1589 4.09 0.33 3.46 0.36 LIKE 738 5.93 6.08 18.87 CO 198 4.09 0.17 3.62 0.16 1 DESIRE 507 5.38 5.78 13.74 0.0143 4.09 0.12 3.97 0.12 EXISTS 376 5.38 5.59 10.94 C.0104 4.09 0.11 3.84 0.10 MAKING 1060 6.19 6.33 25.14 0.0282 4.11 0.22 3.75 0.21 QUITE 30 7 5.32 5.46 9.39 0.0083 4.11 0.09 3.74 0.09 3 DUE 1937 6.40 6.47 32.08 0.0542 4.13 0.25 3.79 0.22 PLACED 781 5.88 6.05 18.91 0.0208 4.15 0.16 4.20 0.15 OTHERW 1095 6.14 6.42 27.18 0.0307 4.16 0.25 3.79 0.23 1 RELIES 301 5.28 5.48 9.62 0.0090 4.16 0.09 3.91 0.09 4 CODE 4152 6.21 6.18 29.55 0.1146 4.17 0.17 5.98 0.13 SEEMS 647 5.88 5.98 16.87 0.0179 4.19 0.16 3.41 0.15 ALONE 536 5.73 5.87 14.79 0.0152 4.20 0.14 3.50 0.13 6 RULE 4090 6.56 6.70 47.18 0.1055 4.23 0.31 12.48 0.20 INSTEA 328 5.29 5.52 10.07 0.0088 4.25 0.10 3.97 0.09 2 REFUSE 1286 6.14 6.22 24.49 0.0351 4.26 0.19 4.13 0.17 3 COMPLA 3971 6.40 6.45 37.44 0.1136 4.27 0.22 4.90 0.19 5 COMPAN 4677 6.19 6.05 32.65 0.1180 4.27 0.17 10.01 0.09 9 PARTY 2643 6.26 6.33 31.93 0.0726 4.28 0.20 5.91 0.16 FULLY 591 5.74 5.93 16.00 0.0159 4.28 0.14 3.71 0.14 1 ALLEGI 320 5.18 5.47 9.66 0.0088 4.31 0.09 4.05 0.09 Table X. 
Sorted by G 128 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 1 VOORHI 209 4.80 5.23 7.32 0.0059 4.32 0.07 3.98 0.07 4 VIEW 1406 6.35 6.48 30.95 0.0375 4.33 0.29 7.01 0.20 ADDED 587 5.62 5.77 13.96 0.0144 4.33 0.13 3.95 0.12 6 AUTHOR 4898 6.78 6.81 52.32 0.1319 4.35 0.37 4.61 0.28 3 CORREC 1358 6.14 6.38 28.57 0.0370 4.35 0.21 4.34 0.20 5 ORIGIN 2053 6.23 6.39 32.01 0.0558 4.38 0.21 5.63 0.18 HENCE 447 5.43 5.68 12.26 0.0118 4.38 0.11 3.85 0.11 1 CONCER 1797 6.57 6.59 34.76 0.0468 4.40 0.34 3.67 0.26 9 ATTEMP 1404 6.05 6.42 29.18 0.0376 4.42 0.25 7.93 0.19 CALLED 1618 6.40 6.57 32.76 0.0444 4.43 0.31 3.42 0.27 1 POINT 1487 6.35 6.42 29.48 0.0407 4.43 0.25 4.24 0.21 RELIED 487 5.62 5.80 13.89 0.0134 4.43 0.12 4.02 0.12 STATIN 385 5.43 5.67 11.77 0.0112 4.44 0.11 3.86 0.11 SAID 10747 7.07 6.93 69.15 0.2803 4.45 0.50 6.83 0.27 EVER 481 5.47 5.65 12.23 0.0127 4.47 0.11 4.27 0.10 13 JURISD 3056 6.00 6.10 29.67 0.0812 4.48 0.14 6.50 0.11 1 EACH 3332 6.68 6.69 43.90 0.0859 4.53 0.36 5.12 0.25 2 VIRTUE 322 5.21 5.46 9.55 0.0091 4.56 0.09 3.99 0.09 FULD 208 4.73 5.20 7.09 0.0057 4.57 0.06 4.05 0.07 DESMON 230 4.86 5.24 7.47 0.0065 4.60 0.07 4.06 0.07 WHEREI 560 5.60 5.92 15.66 0.0155 4.62 0.13 3.89 0.14 AGAIN 766 6.00 6.11 19.32 0.0209 4.64 0.18 3.29 0.17 1 ABLE 416 5.37 5.64 11.77 0.0107 4.69 0.11 4.20 0.10 NAMELY 316 5.27 5.44 9.36 0.0080 4.71 0.09 4.09 0.09 ARGUES 443 5.52 5.67 12.23 0.0136 4.75 0.11 3.96 0.11 4 COMPLE 1709 6.30 6.45 31.40 0.0455 4.76 0.24 5.48 0.20 1 STATEM 2732 6.32 6.36 34.16 0.0720 4.77 0.20 5.32 0.16 OVERRU 1644 6.23 6.42 30.46 0.0456 4.78 0.19 4.35 0.20 7 OFFICE 4060 6.26 6.12 33.93 0.1032 4.82 0.17 18.75 0.07 CLAIME 921 5.97 6.17 21.44 0.0261 4.84 0.17 3.94 0.17 FAILS 426 5.21 5.68 12.15 0.0125 4.84 0.10 3.68 0.11 3 USE 3852 6.29 6.27 36.12 0.1059 4.86 0.18 7.72 0.12 9 PUBLIC 4658 6.33 6.30 35.78 0.1226 4.86 0.20 5.07 0.15 SOMEWH 236 5.13 5.27 7.73 0.0070 4.87 0.07 4.12 0.07 FAR 923 6.11 6.24 22.61 0.0247 4.89 0.20 4.79 0.18 9 APPEAL 9096 6.80
7.06 77.61 0.2637 4.94 0.30 5.35 0.33 SEEKS 374 5.15 5.62 11.32 0.0117 4.95 0.10 3.75 0.10 MENTIO 694 5.91 6.02 17.89 0.0191 4.96 0.16 4.13 0.15 2 COMPAR 418 5.42 5.57 11.09 0.0121 4.96 0.09 4.20 0.09 FROESS 209 4.78 5.18 6.98 0.0062 4.96 0.06 3.98 0.07 3 APPLIC 4168 6.58 6.60 47.37 0.1134 4.97 0.25 8.13 0.16 3 GRANTE 1574 6.25 6.34 28.35 0.0425 4.97 0.20 5.70 0.17 6 REMAIN 1592 6.35 6.38 30.46 0.0428 4.99 0.23 7.12 0.16 10 COUNTY 6245 6.62 6.52 52.43 0.1787 5.00 0.23 8.51 0.14 3 ARGUME 1528 6.26 6.37 28.69 0.0429 5.01 0.20 4.22 0.19 2 CONTRO 2941 6.48 6.55 39.93 0.0849 5.05 0.23 5.00 0.20 2 EXISTE 1029 6.06 6.17 22.08 0.0286 5.05 0.19 4.18 0.16 2 ADDITI 1708 6.39 6.49 32.12 0.0453 5.06 0.25 4.68 0.22 HIMSEL 864 5.95 6.10 19.85 0.0241 5.07 0.17 3.60 0.16 5 DIRECT 5706 6.95 6.92 58.62 0.1575 5.12 0.44 6.63 0.29 1 OPPORT 545 5.53 5.75 13.70 0.0146 5.13 0.11 4.15 0.11 SOMETI 237 5.05 5.22 7.39 0.0068 5.15 0.07 4.18 0.07 3 DISMIS 2755 5.96 6.48 35.90 0.0790 5.16 0.16 5.01 0.20 1 ENTIRE 1350 6.30 6.41 28.53 0.0369 5.20 0.25 6.76 0.20 6 RECORD 6093 6.91 6.98 60.51 0.1675 5.25 0.41 4.95 0.35 8 INTERE 3637 6.36 6.32 35.33 0.0944 5.26 0.20 5.71 0.15 WHILE 2749 6.82 6.85 46.31 0.0751 5.29 0.43 4.31 0.35 7 REVIEW 2347 6.02 6.30 32.72 0.0676 5.34 0.15 7.80 0.13 5 EMPLOY 6062 5.98 5.89 32.50 0.1653 5.38 0.11 7.48 0.08 7 WILL 7140 6.84 6.74 62.55 0.1944 5.49 0.26 12.86 0.15 Table X. Sorted by G 129 VOTES WORD NOCC E EL PZD AVG G EK GL EKL 2 FILE 943 5.49 5.87 17.06 0.0265 5.51 0.10 4.17 0.12 6 RIGHTS 2108 6.30 6.33 30.38 0.0581 5.59 0.20 4.76 0.17 1 USED 2650 6.45 6.58 38.16 0.0734 5.62 0.24 4.18 0.23 5 ANSWER 3398 6.42 6.41 39.33 0.0913 5.64 0.22 9.44 0.13 13 NOTICE 2855 6.04 6.18 30.76 0.0853 5.70 0.14 6.77 0.12 5 BASIS 1500 6.41 6.47 30.76 0.0412 5.82 0.26 5.60 0.21 8 SERVIC 3855 6.04 6.05 29.63 0.1114 5.82 0.13 7.29 0.10 3 COMMON 4042 6.46 6.48 42.58 0.1171 5.85 0.19 7.01 0.16 4 CONTIN 2382 6.37 6.40 34.
35 0.0634 5.85 0.21 10.10 0.14 9 CLAIM 2565 6.24 6.24 32.27 0.0735 5.91 0.15 7.77 0.12 6 EXCEPT 3589 6.58 6i82 49.79 0.1046 5.95 0.26 4.72 0.30 11 PRINCI 2158 6.46 6.43 34.61 0.0564 6.01 0.24 7.85 0.16 1 DAYS 1500 6.05 6.22 24.99 0.0447 6.03 0.14 3.91 0.17 9 COUNSE 3030 6.22 6.27 32.54 0.0868 6.05 0.15 5.28 0.14 1 WEYGAN 251 4.57 5.40 8.79 0.0050 6.09 0.05 3.57 0.09 5 PERMIT 2869 6.35 6.49 39.63 0.0820 6.17 0.17 6.36 0.17 8 RESPON 2872 5.94 6.00 29.21 0.0772 6.24 0.12 11.25 0.08 STATES 2343 6.38 6.33 33.37 0.0582 6.26 0.22 8.54 0.13 2 MATTHI 249 4.57 5.37 8.64 0.0049 6.34 0.05 4.17 0.08 3 PLACE 1881 6.36 6.45 32.27 0.0528 6.46 0.21 5.21 0.19 8 ASSIGN 2654 6.00 6.12 29. d2 0.0715 6.48 0.12 7.19 0.11 WAY 1771 6.21 6.45 32.91 0.0472 6.65 0.22 10.08 0.16 RECEIV 2801 6.52 6.57 39.10 0.0764 6.76 0.27 5.74 0.21 2 COURSE 1500 6.22 6.45 30.53 0.0421 6.86 0.21 4.36 0.21 5 EXAMIN 3117 6.19 6.23 35.56 0.0831 7.01 0.15 8.63 0.11 3 SUPPOR 3151 6.65 6.67 46.3 5 0.0855 7.06 0.24 9.79 0.18 1 PECK 216 4.34 5.22 7.43 0.0043 7.17 0.04 4.22 0.07 1 LEGAL 1650 6.25 6.30 28.57 0.0423 7.41 0.19 9.77 0.14 5 REOUES 1941 6.11 6.29 29.44 0.0545 7.47 0.15 5.99 0.15 3 REFERR 1309 6.24 6.43 28.65 0.0341 8.37 0.24 5.55 0.21 5 OBJECT 2703 6.27 6.31 32. 5C 0.0742 8.66 0.15 5.60 0.15 2 RETURN 2074 6.24 6.32 31.48 0.0589 8.81 0.15 9.23 0.14 6 COURTS 2033 6.28 6.36 31.21 0.0553 9.19 0.16 5.77 0.17 5 JUOGE 4000 6.52 6.64 46.84 0.1181 10.30 0.19 6.80 0.20 Table X. 
Table XI. Sorted by GL

VOTES  WORD      NOCC    E     EL    PZD     AVG      G     EK    GL    EKL
       HOWEVE    3333   7.09  7.11  55.90  0.0923   1.47  0.90  1.76  0.62
       WAS      56044   7.69  7.55  95.73  1.5630   0.52  3.68  1.78  1.33
       WHICH    25522   7.70  7.56  94.41  0.6984   0.64  4.89  1.79  1.38
       FROM     19879   7.62  7.51  92.18  0.5456   1.25  3.01  1.83  1.19
       UPON     11816   7.46  7.40  82.93  0.3232   1.37  1.76  1.83  0.95
       FOR      45223   7.73  7.61  98.07  1.2529   1.03  5.00  1.87  1.59
       THE     442506   7.87  7.65  99.99 12.1192   0.19 41.17  1.87  1.93

[Excerpt only; the remaining rows of Table XI (pages 130-137) are omitted.]
Table XI. Sorted by GL

TABLE XII. PROGRAMMING STEPS TO ACCOMPLISH PHASE II

Step 1
  Input:     1) concordance; 2) list of 16,000 word types with statistics
  Processor: FORTRAN: deletes words having EK >= .30 and/or GL >= 4 (about 350 word types)
  Output:    1) purged alpha concordance (1,225,000 words); 2) thesaurus word list (15,780 types)

Step 2
  Input:     purged concordance
  Processor: SORT: orders by document number, paragraph number, and alpha word
  Output:    concordance by doc-par-word (3 reels)

Step 3
  Input:     concordance by doc-par-word
  Processor: FORTRAN: generates word pairs within paragraphs, sampling via random number
             generator for words appearing in more than 253 paragraphs
  Output:    word-pair list (18 reels)

Step 4
  Input:     word-pair list
  Processor: SORT: orders alphabetically by word-pair
  Output:    alpha word-pair list

Step 5
  Input:     alpha word-pair list
  Processor: FORTRAN: counts cooccurrences, writes out insignificant cooccurrences with
             applicable statistics on a second file
  Output:    1) summary of insignificant cooccurrences (6 reels); 2) summary of potentially
             significant cooccurrences (3 reels)

Step 6
  Input:     potentially significant cooccurrences
  Processor: FORTRAN: edits and eliminates to produce readable report
  Output:    "significantly" cooccurring words

TABLE XIII. THESAURUS SETS

The pages following are a sample extracted from the computer printout of the thesaurus sets. The full printout contains about 7000 sets. The word at the far left is the head-word; it is followed immediately by the number of paragraphs in which the head-word appears. The words grouped to the right are the associated words, arranged in descending order of standard deviation units. For example, the word abutti appears in 94 paragraphs and is associated with 23 other words, the first of which is egress. The number of standard deviation units measured for the abutti-egress association is 34, and egress appears in 82 paragraphs in the total file.
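The steps above can be sketched in miniature. The following is a modern Python approximation of the pair-generation and cooccurrence-counting stages, not the original FORTRAN/SORT implementation, and it rests on stated assumptions: the association score (observed - expected) / sqrt(expected) is a common way to express a cooccurrence count in standard deviation units, but the exact statistic behind the printed figures is not defined in this excerpt; the purge of high-EK/GL word types and the random sampling of very frequent words are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def thesaurus_sets(paragraphs, min_sd_units=15.0):
    """Build thesaurus sets from a list of paragraphs, each given as a
    set of (truncated) word types, in the spirit of Table XII, steps 3-6."""
    n_pars = len(paragraphs)
    # Paragraph frequency of each word type (Table XIII reports paragraph
    # counts, not the token counts of the NOCC column).
    par_freq = Counter(w for par in paragraphs for w in par)
    # Within-paragraph word pairs and their cooccurrence counts.
    cooc = Counter()
    for par in paragraphs:
        for a, b in combinations(sorted(par), 2):
            cooc[(a, b)] += 1
    sets = {}
    for (a, b), obs in cooc.items():
        # Expected cooccurrences if a and b appeared independently.
        exp = par_freq[a] * par_freq[b] / n_pars
        # Assumed association measure in "standard deviation units".
        sd_units = (obs - exp) / math.sqrt(exp)
        if sd_units >= min_sd_units:
            sets.setdefault(a, []).append((b, sd_units, par_freq[b]))
            sets.setdefault(b, []).append((a, sd_units, par_freq[a]))
    # Table XIII lists associated words in descending order of SD units.
    return {head: sorted(ws, key=lambda t: -t[1]) for head, ws in sets.items()}
```

The default cutoff of 15 reflects the printed sample, in which no association below 15 standard deviation units appears; on a toy corpus the threshold must be lowered before any set survives.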
[Sample thesaurus sets, beginning on page 140, for the head-words ABUSES, ABUTS, ABUTTI, ACADEM, ACCELE, ACCEPT, ACCESS, ACCIDE, ACCOMM, ACCOMP, and ACCOUN. Each entry gives the head-word and its paragraph count, then its associated words in descending order of association strength, each followed by the number of standard deviation units and that word's paragraph count (e.g., ABUTTI 94: EGRESS 34 82, ...). The sample breaks off partway through the ACCOUN set.]
OONATI 22 21 INVENT 22 172 PROFI T 22 311 ESCROW 21 62 SETTLI 21 42 TRANSA 21 640 VAULT 21 24 LAST 20 819 MONEY 20 1274 PERCEN 20 816 RENTS 20 96 SHE 20 2250 SLIPS 20 25 WRIT IN 20 432 APPROV 19 1201 CARRIE 19 740 CLOSIN 19 135 CONSIG 19 29 LOAN 19 3B9 MAKER 19 49 HEFERR 19 1197 SHOWED 19 379 SIGNAT 19 298 SPOKE 19 77 SUMS 19 146 AUOITO 18 359 AVAILA 18 673 BUSINE 18 1909 CASH 18 345 CONTIN 18 1912 COST 18 4 86 INVOIC 18 52 PAROL 18 56 REINSU 18 30 RETAIN 16 493 SEC 18 4065 SHARE 18 355 SUPPLE 16 365 TURNED IB 279 UNIMPO IB 30 ABSENC 1 1 911 BONOS 17 316 BOOK 17 168 BOOKS 17 213 BOXES 17 36 COURTS 17 1655 DATEO 17 445 DAY 17 1712 DECREE 17 1700 DEPREC 17 122 DONE 17 944 ENVELO 17 36 EXPLAI 17 312 FORM 17 921 INURE 17 15 ITALIA 17 15 JULY 17 955 KNOWLE 17 96 9 LEASES 17 89 MARCH 17 981 NAME 17 988 ORVILL 17 15 PRINCI 17 1753 REPRIN 17 15 SHARES 1 7 298 WORTH 17 131 ALLOCA 16 65 CARD 16 68 COMPLE 16 1453 DAUGHT 16 286 DRAWER 16 38 INCOME 16 525 INOEBT 16 190 KEEPIN 16 135 LOANS 16 96 MERCHA 16 260 MONIES 16 63 NOVEMB 16 814 OCTOBE 16 808 OFFSET 16 18 OVERDR 16 17 PERHAP 16 193 REAL 16 I52B RECOMP 16 17 TOLD 16 796 TROOP 16 18 TRUST 16 1546 BASIS 15 1334 BATH 15 20 COMMIT 15 1317 COMPLA 15 2823 DEDUCE 15 19 EXPENS 15 764 FATHER 15 49 1 GROCER 15 70 HUTCHI 15 19 MEAN 15 336 PARTNE 15 259 REDUCE 15 338 REGULA 15 1333 SHEETS 15 44 WAGE 15 45 ACCREO 22 COLLEG 36 217 CURRIC 29 31 TEACHI 29 54 YEAR 28 1185 GRAOUA 26 65 ELECTI 21 633 SCHOOL 21 1161 STUDEN 21 101 APPOI N 20 1097 FUNO i e 471 PROGRA in 131 TRAIN! 
17 83 PERFOR 15 1392 ACCRUA 29 ACCRUE 39 205 OEATH 22 1667 DATE 21 1611 REPORT 20 1436 COUNTY 19 4129 EXCEPT 19 2937 OECREA 17 64 MAP 17 61 PROOUC 17 1040 DEEMEO 15 436 ACCRUE 205 ACCRUA ACCUMU 39 15 29 125 ACCRUI EXCEPT 32 15 56 2937 PAYMEN UNPAID 20 15 1724 140 EXECUT 19 1970 OATE 17 1611 COMPEN 16 1069 ACCRUI 58 ACCRUE 32 205 VOUCHE 31 18 KELLY 21 85 COST 19 486 EXCEPT 19 2937 CERTIF 16 1358 FARE 1 7 34 RESPON 16 1964 BENEFI 15 1616 IMPROV 15 566 PAYMEN 15 1724 ACCUMU 125 INCOME 31 525 SNOW 28 103 ICE 25 134 EARNIN 20 231 FUND 20 471 DIVIOE 19 347 SURPLU 18 115 CAPI TA 17 221 EMPLOY 17 3101 INTACT 17 43 SL IPPE 17 103 ARTIFI 16 67 FROZEN 16 16 MENIOR 16 17 STAIRW 16 64 ACCRUE 15 205 CESSAJ 15 19 INTERE 15 2557 LIFE 15 930 ACCURA 198 STATEM 19 2014 DE5CRI 16 1067 CORREC 15 1221 ACCUSA 62 MALFEA 33 15 SA2AMA 33 15 EOUIVO 32 25 ACCUSE 24 38 2 INFORM 20 1107 DURHAM 19 44 INCR IM 18 50 METC IB 26 EVASI V 17 30 INOICT 17 648 POL ICE 16 1241 JEOPAR 15 105 141 ACCUSE 182 PEOPLE 63 1850 CRIME 60 B50 GUILT 38 345 UrctNS ia 69 8 CUNVIL Al ttVI ■ NOICI 27 848 ACCUSA 26 62 FUGITI 24 31 JURY 24 3810 OUASHE 21 34 AOHISS 22 832 PROOF 21 1220 CHIMIN 20 1063 GUILTY 20 1427 COMMIT 19 1317 CONFES 19 267 REPRES 19 1337 OEFENS IB 1139 INTENT 18 1616 EXCUSE IT 170 EXTRAO IT 60 IMPART 16 130 SUSPIC 16 68 BRAND IS 24 CREOIB IS 269 DUTY 15 1485 SPEEDY IS 69 ACCUST IS ITENIZ LIVING 36 IT 29 29$ WASN CARE 24 16 64 1060 BUILOI SUPPOS 20 IS 1593 163 MAY 19 1510 PREMIS 18 1327 REMAIN 18 UTO ACHIEV 66 ORASTI IT 29 BEITER 16 225 PROSSE 16 57 GOVERN IS US4 ACHOR 160 LANOIS 103 126 BOSS 1 T 93 137 ARTERB 90 122 ILLNES 85 126 JACK SO 15 279 ACKNOH 239 NOTARY 65 59 BURTON 37 76 OEAN 29 97 NAME 2B 988 RECEIP 25 194 AFF1XE 24 16 SIGNAT 23 298 SIGNEO 23 752 OEEO 22 705 VIRGIL 22 32 PEAK 21 17 SEAL 21 65 GRANTO 20 133 DORR 19 26 ATTEST 18 71 F1CT1T 18 19 DEEDS IT 238 BEAITI 16 16 PAIO 16 IT67 SMITH 16 565 UNTO 16 33 ACME SO GOOORI 86 36 RICE 59 39 BALOWI 54 46 HEATER 52 49 LINCOL 34 90 POULT R 3)4 16 
GRAIN 31 15 BEAM 30 21 PRICE 26 586 CHICKE 24 11 MERGER 24 11 CORONA 21 14 BRAKE 21 61 BUFFAL 21 62 INC 20 1666 SEC 20 406 5 CORPOR IS 191* OISTRI IS 194S STATEN 15 2016 ACOUAI 6* SOUNON OUASHE 19 IS 21 36 OAV WITNES 18 IS 1712 2175 WATCH IT 4* ASKEO 16 • 71 01SCRE IS 1190 POL ICC IS 1241 ACQUIS 101 CONTRA 26 6T81 KNOWLE 23 969 EMPLOY 19 1101 PARTIE 10 2761 LACHES IT 70 CONSEN IS SSI ARTIFI »T WATFR 10 616 CHANNF 28 77 NATIJRA 27 4 4 6 WAITRC 26 611 SIIFP4F. 22 1 7 1 A GOON 21 19 NCGL IG 21 2129 THROW in 46 MASS 1 7 1404 ACCIIMU 16 125 r.UARO 15 99 LOANS 15 96 WAFERS IS 103 ARTS 26 MtFNC 97 69 CRAFTS 51 15 SABBAT 11 21 REVERE 10 4 J ETHICS 28 27 IFACHI 27 54 tEARNI 26 56 TAUGHT 26 11 PROGRA 25 111 ml IGI 25 11/ AllPFN 24 IB GRADUA 24 65 con 21 60 ACAOEM 22 41 BRANCH 21 181 PUBLIC 20 1129 SUHGER 20 55 STUDLN 19 101 YOUNG IB 176 MAINIA 17 979 AOMISS 16 B12 COLLEG 16 21 / LEGISL 16 IH45 MEOICI 15 H9 assert 42T CONTRA 22 47RI CONTIN 21 1912 LFGISL 21 IB45 TESTA! 21 54 7 CONSTR 17 275B INUUIR 17 4 74 PART IE 17 2761 STEKN IT 21 INTf RE 16 2557 NAPERV 16 23 USED 15 2130 WlTNES 15 2175 ASCRTB 33 LIBELO 29 16 MEANIN 26 984 CONSTR 17 275R OPERA! IT 2956 USED 15 2110 ASHLAN 19 KENTUC 26 55 ACCOMP 25 615 HIGHWA 16 106 7 HALL 15 154 ashky 11 CORRIC 75 29 WESLEY 51 19 TABLF 11 75 DOUGLA 32 82 WASHIN 30 179 KNIFE 24 50 SAW 22 628 EX AM IN 20 2219 REMARK 17 173 IMPART 15 130 SOMEON 15 126 »s roe 6 TO SE T T 1 N 51 366 VACATE 38 611 RENDER 35 1456 FRAUD 29 571 EXCUSA 27 21 NAUGH! 
27 22 PETITl 27 6506 VOIO 27 587 ATTORN 26 1891 CIRCUI 26 1189 WEIGHT 23 7B9 COMMIS 22 1056 excuse 22 170 DEFAUL 21 309 BRUSHE 19 15 CONFIR 18 279 LINDRO 18 17 N01ICF 18 1938 ACTUAL IT 1165 ALLEGA 17 1171 DECREE 17 1700 FILING 16 1001 IND 16 1615 MAN1FE 16 S42 PR I0E tb 21 RETURN 16 1641 STAT 16 1042 WALGRE 16 21 ADMITT 15 1419 INURE 15 2557 MOVED llS 651 PARIIT 15 125 PREMIS 15 1327 REMAND 15 915 SETTLE IS 90 3 ASK 186 ATTORN ri 1R93 YOUR 17 551 ROOM 16 457 ANSWER 15 2431 YOU 15 1471 ASKEO me REPLIE 56 161 SHE 50 2250 FXAMIN 49 2219 WANTED 46 216 KNOW 33 649 TOLO 33 796 WITNFS 32 2175 COMPAN 31 2887 ANYTHI JO 466 OON 29 1T5 POL ICE 29 1241 WHY 28 40B INFORM 21 1107 CALL 26 609 JUDGE 25 2939 PURVIE 25 81 CAME 24 700 DAY 24 1712 MEMORY 26 68 OFFICE 24 259R RECALL 24 156 TALKED 24 132 ANSWER 23 2411 GET 23 485 GOT 23 2 79 HOME 23 1091 SERVIC 23 2498 WON 23 37 BRYSON 22 27 HAND 22 715 LEAVE 22 717 TALK 22 76 WENT 22 713 WASN 21 64 DION 20 222 HANDED 20 92 JUST 20 862 AUNT 19 20 BEER 19 130 GAVE 19 768 GOING 19 545 TELL 19 205 APPEAL 18 6176 RUILOI IR 1591 KNOWLE IB 969 PRE1TY 18 22 REMARK 18 173 SHOES 18 39 TELEPH IB 691 WATCHI 18 21 YES 18 365 REOROO 17 43 CONVER 17 592 INOUIR 17 4 74 PUT 17 661 REPRES 17 1 J37 RULING 17 881 SAW 17 628 WANT 17 4 39 ACOUAI 16 69 BACK 16 793 COUNSE 16 2111 FRIEND 16 320 JUROR 16 104 JURORS 16 22B KNIFE 16 50 MRS 16 739 PARKED 16 1R1 PRE JUD 16 1174 STAND 16 392 SURPRI 16 45 WALKED 16 105 CORRIO 15 29 COURTR 15 53 CROSS 15 1074 DOUGLA 15 82 DRINK 15 51 FOREMA 15 81 HAROLD 15 55 HEAR IN 15 1935 HELPED 15 53 LAWYER 15 24 5 MAIN 15 330 NEXT 15 674 OBJECT 15 1963 PARTY 15 1888 PLACEO 15 713 PROBAB 15 470 RFCONV 15 52 WHEREU 15 117 ASKING 171 PET IT 1 17 6506 ANSWER 16 2431 DECLAR 15 1381 ASKS 92 APPELL 23 7099 ASLEEP 23 SLEEP 52 25 EXCUSA 34 21 FALLIN 32 94 BU1CK 29 52 OR INK 1 27 93 FELL 26 244 WHEEL 26 63 REOUES 23 1534 BLACK 22 141 OROVE 20 238 PARKED 19 181 VEROIC IB 2067 MEN 16 623 AUTOMO 15 1513 COLLIS 15 6B1 KNEW 15 619 MINUTE 15 375 
PLACE 15 1583 ASPECT 202 FAVORA PROPER 31 18 361 3911 LINORO CRIMIN 25 16 17 1063 MOST RELEVA 24 16 974 417 SEC 23 406 5 WALGRE 23 21 MASS 19 1504 ASPHAL 23 PAVING PLAY 68 17 30 R6 LIMEST CONCRE 60 16 15 15B OUANTU SLIPPE 29 15 50 103 AREA 17 842 DRIVEW 17 141 PLANT 17 297 ASS 201 BLDG 38 70 6MERIC 23 635 INC 22 1466 MEMORI 22 59 COMMER 21 527 CEMETE 20 74 CIR 20 176 COM 19 162 CIVIC IB 22 POLISH 18 15 SAVING 18 392 RANKER 17 25 LOAN 17 389 MASS 17 1504 APPLIC 16 3117 OIV 16 34 1 LOWELL 16 58 SUPRA 16 2012 ASSOC 1 IS 671 CHICAG 15 1176 MAYWOO 15 20 ASSAIL 45 STRUGG 30 25 WORE 27 10 GRARBE 26 18 HAT 26 18 RUBBED 25 34 MEN 24 423 DARK 21 75 WITNES 19 2175 PEOPLE 18 1850 NECK 17 4 3 ROBBER 17 302 ASSAUL 16 191 ALLEY 15 169 DRINK 15 51 ASSAUl 191 BATTER 110 69 PROVOC 40 25 HUROER 32 195 HOMICI 29 62 INFL IC 27 97 INTENT 27 1614 KE02IE 27 16 BOOILY 25 109 PALMER 24 45 HAMMER 23 47 SODOMY 23 15 VIOLEN 23 132 WEAPON 23 72 CANOY 22 42 SEXUAL 21 82 MALIC1 20 147 NEGRO 20 19 AGGRES 19 30 BEAT 19 30 COMMIT 19 1317 STA8BI 19 21 BEATEN 18 16 KILL 18 49 CHARGE 17 3341 FUROR 17 18 GUILTY 17 1427 OUICKL 17 3H ASSAIL 16 46 ACTUAT 15 23 CRIME 13 850 INDICT 15 868 MANSLA 15 89 ASSEMB 626 SENATE 33 66 ARTICL 30 973 APPOIN 27 1097 ENACT 27 78 SESSIO 24 170 ROADS 23 138 LAYING 22 66 LEGISl 21 1845 POWER 21 1853 HOLD IN 17 924 OHIO 17 5161 UNITS 17 64 DELEGA 16 163 ENACTE 16 381 HOUSE 16 829 LAWS 16 1072 UN I FOR 16 264 AMENDE 15 1484 CORONE 15 2B MUNICI 15 1382 ASSENT 59 PARKS PR INC 1 21 15 36 1753 PURCHA 19 1487 ADVANT 16 198 COMPE! 
16 671 WEBB 16 34 LITEM 15 71 ASSERT 1112 APPEAL 52 6176 PERMIT 37 2261 PEOPLE 34 1B50 SEC 33 406 5 SUPPOR 33 2618 HER 31 3428 CONST 1 30 3193 NOTICE 29 1938 DAMAGE 24 2098 DECLAR 24 1J8 1 CLAIM 23 1938 POSITI 23 1276 RELIEF 23 902 RELATE 22 785 COURTS 21 1655 RECOVE 21 1577 SUE 21 133 ESTOPP 20 222 EXERCI 20 1541 PARTIE 20 2743 QUI TTI 20 23 WAIVED 20 472 INSTRU 19 2295 JURISD 18 2138 VALUE 18 1229 APPOIN 17 1097 BAR 17 1061 BIODIN 17 31 COUNT 17 495 ISSUED 17 1086 SEVERA 17 1113 WRONG 17 275 ADVANC 16 447 ELECT1 16 633 EXISTE 16 919 PROCEO 16 993 PROPER 16 3911 OUASHE 16 34 ROOT 16 15 SLOCHO 16 16 THEORY ■16 597 URGE 16 123 ADOITI 15 1490 AGREEO 15 805 ARGUE 15 244 BOARO 15 354 3 CRIM IS 36 GLENN 15 36 INC 15 1466 INOEED 15 276 LEGAL 15 1431 PETROL 15 39 PRACTI 15 1577 REAL 15 1528 SUBROG 15 103 TENOER 15 366 UNCERT 15 189 UNSUPP 15 63 ASSESS 823 TAKES 50 498 PROPER 49 3911 VALUAT 48 173 TAX 45 1334 REASSE 35 31 EQUAL 1 32 47 REAL 31 152B ROLL 31 65 DANVll 30 21 LEVIED 30 136 SEC 29 4065 EXCISE 27 65 HAPS 25 20 TRUSTE 25 1033 ABATEM 24 126 TRAILW 24 22 VALUE 24 1229 TAXABL 23 136 TAXAT I 23 230 COST 22 486 LEVY 22 200 REVISE 22 2493 REMISS 21 15 COLLEC 20 670 INHABI 20 139 LEVYIN 20 31 TAXING 20 88 IMPROV 19 566 SALES 19 572 APPRA1 IB 247 BOSTON 18 409 C0MM1S 18 3056 EXCEED 18 405 LIBRAR 18 102 APPEAL 17 6176 OISPRO 17 41 ESTATE 17 2561 HENORI 17 40 INC 17 1466 LANO 17 1389 SENATE 17 64 VENDOR 17 148 ACREAG 16 26 PERCEN 16 816 TANGIB 16 94 DAMAGE 15 2098 IMPOSE 15 777 PAVING 15 30 SKETCh 15 30 WILL 15 4823 ASSET 67 MCLEAN HER 29 15 38 3428 ENOUIR L I ST IN 27 15 46 51 ASSETS 26 426 CONTRA IB 4781 BONIS 17 39 OECEDE 17 966 ASSETS 626 ESTATE 65 2561 ACCOUN 40 1267 COLBY 38 18 CORPOR 35 1916 INVENT 32 172 PARTNE 30 259 EXECUT 29 1970 TRUST 29 1546 TRUSTE 29 1033 PAY 28 1485 SALE 27 1338 ASSET 26 47 BONIS 26 39 SHARES 26 298 LIFE 25 9 30 LIOUIO 25 10 3 STOCK 25 510 SMYTH 24 25 TRANSF 23 837 DECEOE 22 966 INCOME 22 525 REAL 22 1528 APPOIN 21 1097 BENEFI 21 1816 BUSINE 
21 1909 OELIVE 21 854 SHARE 21 355 DINER 20 16 INTERE 20 2557 PAYMEN 20 1724 PR 1 NCI 20 1753 SHAREH 20 115 VALUE 20 1229 DEBTS 19 147 APPL IC IB 3117 CENTRA 17 373 142 ENOUIR 17 66 EXISTE 17 919 NON 17 389 PAID 17 1747 PARTIE 17 2743 PROBAT 1196 HILL 17 6B23 AUDIT 16 56 POSSES 16 1163 STOCKH 16 171 ADMITT 15 1419 CASH 34S COUNSE IS 2111 OISSOL IS 185 01 VI SI IS 976 PURCHA 15 1487 ROCKY 15 16 WEBBER 2* ASSIGN 2010 ERROR 79 309 3 ERRORS 72 761 AGREEM 38 1853 PROPER 33 3911 CHICAG 30 1176 CODE 314* FULL 29 1120 CLAIMS 28 1015 NAME 28 98B REFUSA 27 470 ARGUME 2S 1299 DIMINU 22 PERMIT 25 2261 DENIAL 26 565 PREJUD 24 1176 TITLE 24 1183 HEAR IN 23 1935 MONETA 30 SEATS 23 30 DAY 22 1712 REview 22 1798 TOILET 22 32 UNTO 22 33 BRIEF 861 COURTS 21 1655 OISPOS 21 881 DRILLE 21 16 ITALIA 21 15 REF1LE 21 17 RIGHTS 1681 FILE 20 792 MAN-IFE 20 562 OVERRU 20 1460 TOO 20 50 7 BASIS 19 1334 GIVING 731 POSTPO 19 75 PURPOR 19 686 ROYAL 19 63 SALE 19 1338 WISHES 19 74 ACCEPT 1150 BANK 18 1129 INSURA 18 1188 NUMBER IB 1167 WISH IS 78 ERREO 17 1033 ISSUES 1101 LEASEH 17 S3 ACREAG 16 26 CAUSES 16 341 NARRAT 16 28 RAPE 16 94 APPROP 7*0 LEASE IS 679 PARTY IS 1B8B PASS IS 49 7 RELATE IS 785 ROUT IN 15 30 TRANSC ««0 ASSIST 520 SERVIC 31 249B ATTORN 27 1H93 AID 25 267 COUNSE 25 2111 RECIPI 26 53 NITNES 2ITS VERDIC 20 2067 OBTAIN 19 1296 OAY IS 1712 OFFICE IS 259U COUNTY 17 4129 RECEIV 2270 CLAIM 16 1938 INSPEC 16 611 RESPON 16 1964 WOBURN 16 16 CHIEF IS 443 COMMIS 3054 FORTUN IS 20 PEOPLE IS 1850 PUBLIC IS 3129 REMOVE IS 631 TOLO 15 796 ASSOC I 671 SAVING 71 392 LOAN 68 389 CORPOR 36 1916 MEMBER 33 1268 UN1NC0 33 38 GRIEVA 97 SUPER I 30 876 CEMETE 29 76 RECEIV 29 2270 AUDITO 2S J59 MONEY 2S 1274 REINSU 30 ACCOUN 22 1267 COMPAN 22 28B7 CONTIN 22 1912 MORTGA 22 574 PAVABL 22 339 SALE 1331 PRINC1 21 1753 SECURI 21 516 8UILDI 20 1593 PROVOC 20 25 CONOUC 19 1406 ENJOIN 421 EX AM IN 19 2219 ILL INO 19 1363 PRESS 19 R3 PROPER 19 3911 TAXATI 19 230 VOLUNT 422 WITHOR 19 531 ATHLET 18 65 BOARD 18 3563 
FRATER 18 66 PRESID 18 49B SOCIET 2SS BORROW 17 130 CHAP 17 761 CHARGE 17 3361 CH1CAG 17 1176 CONNEC 17 1142 OISPUT 741 HERS 17 20 INSURA 17 1188 JOINT 17 585 LAWN 17 35 NORWOO 17 19 ORGAN! 441 PUBLIC 17 3129 ARIICL 16 973 BAR 16 1061 INTERN 16 24 1 LAWS 16 1072 PARTNE 234 USE 16 2696 VIOLAT 16 1561 ASS IS 201 AVELLO 15 24 BASIS IS 1334 CLARA 23 DORR IS 26 FIRM 15 156 LAND 15 1389 LEGAL 15 1431 LUCY IS 24 MUNIC1 1382 PURCHA IS 1687 REGULA 15 1333 SALLE 15 42 SPELL 15 24 SUPERV 15 335 TRANSA 640 UNIVER IS 229 Assure 894 ASSUMP 35 212 NEGL IG 32 2129 COMPAN 31 2887 RISK 30 249 PREMIS 26 1327 EXCEPT 2937 DANGER 26 506 FLAGRA 26 23 EXERCI 23 1541 RISKS 23 57 ADO IT 1 22 1490 CASTS IS BASIS 21 1336 OBTAIN 21 1296 POINT 21 1301 CAREY 20 19 HOLY 20 18 EXCUSA 21 PAYHEN 19 1726 QUIT 19 56 DUTY 18 14H5 PARTIE IB 2743 PERIOD IB 1516 PLACE 1583 REAL 18 1528 SUPPOR 18 2618 CONTRO 17 2374 OEFIAN 17 24 LIABIL 17 1336 LICHT BIS CONTRA 16 6781 OECIOI 16 170 OISCRE 16 1190 INTEND 16 1175 UNOERT 16 287 ATTORN 1893 DESCRI 15 1067 INTROO 15 681 LET 15 241 NOTED 15 675 NULL IF IS S7 PHYSIC 702 POwERL 15 32 ASSUttl 101 UNFIT* 17 IT OECIDl 15 170 ASSUMP 212 RISK 36 269 ASSUME 35 196 PROPER IS 3911 ASSURA • 3 OONE 19 966 SOLVE 17 21 BARNET 16 26 ASSURE 211 DISTAN SEWER 65 16 365 156 EXCUSE ORAINA 31 15 170 109 AHEAD PLACE 29 147 SMILEY 26 15 15 141* OISCER WATFR 22 15 39 636 FOG 19 20 IS 1583 REPORT CATEGO MB FALL 20 360 FALLS 20 170 EXCEPT 15 2937 CATHER 31 AGNES 31 19 LEO 29 22 ALICE 26 48 LARKIN 26 21 HER 22 3428 FOLEY 21 42 MURRAY 19 50 HOME IT 1093 APPEAL 16 61 It, BRIFN 15 75 NFVCR 15 852 CATHCl 36 ROMAN 56 15 LEBLON 62 16 MERIOI 38 19 CHURCH 37 349 ARCHBI 33 40 ATHENA 26 40 PLAYGR 21 36 TEACHE 21 192 OHIO 19 5161 ZONING 18 78 i HILLS IT 53 INSTIT 17 756 SCHOOL 16 1161 UNIVER 16 229 C 1 NC 1 N 15 34 1 GRAOUA 15 65 CATTLE 33 LIVEST 52 35 SANITA 20 167 PROGRA 19 131 PISE AS 18 145 CREEK 1 1 57 ANIMAL 16 66 CAUGHT 52 CORN 27 27 GEAR 25 IT GLOVE 25 17 BURNEO 22 63 PICKER 21 22 CURBIN 21 
24 HANO 21 715 THROWN 21 68 APPFLL 19 7099 MIRROR 19 29 SHAPE 19 29 HEARD 18 175 ROOMS 18 55 FOOT 17 257 HOLE 17 66 SIDE 17 722 SIOEWA 17 185 WASN 17 64 INSIDE 16 106 WEAHIN 16 62 COAT 15 44 OEBRIS 15 4 7 LOOSE 15 48 NOISE 15 43 PIECES 15 67 PLACE 15 1583 CAUSA IS MORTIS 206 17 DONOR 51 RO GIFT 50 241 DONEE 46 50 VIVOS 32 36 GIFTS 21 77 VESTIN 29 66 INTER 26 151 Tl TLC 26 1183 DAMAGE 20 2090 ACCIOE 18 1415 CONVIN 16 253 CAUSAL 125 CONNEC 67 1162 INJURY 60 1298 Dl SARI 36 247 HVPOTH 32 10B CORONA 30 34 PROXIM 29 515 CAUSAT 26 30 EXFRTI 26 21 fllSEAS 22 145 MEDIC A 22 546 PROVOC 22 25 SYMPTO 22 37 SHOCK 20 62 TRAUMA 20 29 COMPf N 19 1069 EXPERT 19 2B6 ACCIOE 18 1415 DOCTOR in 333 HEART 17 106 MESSER IT 25 STRAIN 17 62 THROHR 16 1 1 HYPFRT 15 20 MANIFE 15 542 MEDFOR IS 19 NATHAN 15 32 OCCLUS 15 19 CAUSAT 30 CAUSED 27 936 CAUSAL 26 125 CORONA 23 34 SUFFER 21 59 9 AGGRAV 20 81 ACCIOE 19 1415 INJURY 19 1298 CHAIN IB 55 VIEW 18 1285 CLAIHA 16 55 3 EXAMIN 16 2219 MFDICA 16 546 ABSENC 15 911 CAUSED 936 LIABIL 60 1336 PROVOC 40 25 DAMAGE 38 2098 ACCIOE 36 1415 INJURY 35 1298 NEGLIG 33 2129 RECOVE 33 1577 CAUSAT 27 30 EMPLOY 27 3101 ERROR 27 309 3 LOSS 27 v 652 TRAUMA 27 29 VERDIC 27 2067 WORK 26 1 )19 LIABLE 25 661 USE 25 2696 PROXIM 24 515 ELECTR 23 472 MASS 23 1504 PERIOO 23 1516 SUFFER 73 599 BODILY 22 109 FALL 22 340 INJURE 22 679 PROPER 22 3911 SCINTI 22 29 TENTHS 22 16 CAUSIN 21 269 CITY 21 3529 EXPLOS 21 117 INJUR 1 20 1125 OUASHE 20 34 AGGRAV 19 81 CARE 19 1060 COMPAN 19 2887 COMPLA 19 2823 COURTS 19 1655 FALLEN 19 38 UNSKIL 19 22 PAY 18 1485 SHOCK ie 42 SUPPOR 18 261B HIND IS 23 BLASTI 17 27 HEMORR 17 46 INSTRU 17 2295 MEOICA 17 546 ORSERV 17 810 RECEIV 17 2270 SERVAN 17 161 WALLS 17 71 ABATE 16 48 ALLEGA 16 1171 CENTRA 16 373 EXPOSU 16 30 FRACTU 16 SO HAZARD 16 233 MAINE 16 49 SEVERA 16 1113 WRONGF 16 386 CANAL 15 36 COLL IS 15 681 COMPEN 15 1069 DETERI 15 33 KNOWLE 15 969 OWENS 15 34 PAVEME 15 89 PUT 15 661 CAUSES 361 GROUPE COMPLA 28 15 15 2823 PETITI CONTRA 19 
15 6506 6781 ASSIGN SUPPOR 16 15 2010 2618 BASIS 16 1334 EXCUSE 16 170 VEROIC 16 2067 CAUSIN 269 PROVOC 35 25 NEGLIG 30 2129 INJUR! 26 1125 INJURY 23 1290 PLACE 23 1583 CAUSED 21 936 DEATH 21 1667 COLL 10 20 161 FALL 19 340 LEFT 19 1083 PURVIE 19 81 THROWN 19 68 ONTO 17 162 PATH 17 82 SIOEWA 17 185 WALKIN 17 81 APPELL 16 T099 COLLAP 16 35 FRONT 16 662 LOSE 16 62 SEVERE 16 147 TORTFE 16 24 USE 16 2696 ORIVER 15 72B FELL 15 266 HEAD 15 227 KNOCKS 15 41 RIDING 15 222 CAUTIO 126 CARE 33 1060 SUSPIC 30 6fl WI TNES 27 2175 EXERCI 26 1541 GUILTY 25 1427 SAFETY 20 569 JURY 19 3810 UTMOST 19 22 NFGLIG IB 2129 GREAT IT 528 INFERE 17 648 MINDS 17 164 UNCORR 16 29 ORAM 15 133 CAVANA IT REPROO DEPREC 29 16 38 122 ELLIOT ERRED 26 15 58 1033 RECOVE 18 1577 COMMIT IT 1317 COMPUT 17 192 ILL INO 17 1363 CEASE 66 EXCUSE 21 170 REMARR 20 81 DEATH 19 1667 REVERT 19 40 CEASEO 103 DATE EXECUT 26 15 1611 1970 JANUAR HER 22 15 116T 3428 EXIST 19 300 CONTRA 17 47S1 DECEMB IT 1042 AGREEM 15 1853 CEASES 62 OERIVA 25 21 CEILIN 28 FALLEN 66 38 MALLS 65 Tl HOTELS 43 31 PAINT 39 24 WIRING 37 15 SLEEPI 34 30 SPRINK 32 20 COMHUS 29 23 Fl XTUR 28 69 FLOOR 28 361 FLOORS 28 26 UNTENA 20 46 FIVE 29 1195 LODGIN 26 33 DOORS 22 70 RESIST 22 72 BUILOI 21 1593 PARTIT 21 123 STAIRS 21 65 SIX 20 745 ROOF 17 6B DOOR 16 30 5 WINDOW 16 137 LIGHT 15 815 CELLAR 29 LEAK 69 27 WATER 37 636 HOUSE 36 829 CEMENT 34 65 ELLA 34 29 WALLS 33 71 EXPLOD 31 35 PUMP 31 35 GAS 28 387 PENETR 27 26 SHUT 24 57 BURNEO 21 63 FLENIN 22 38 SWITCH 22 68 RASEME 21 114 FLOOR 20 36 1 STORED 20 46 BUILD 19 94 EXCEPT 19 2937 STEWAR 19 92 WALL 19 138 ORY 18 60 INSIDE IB 106 OFF 18 637 CURB 17 112 DRAIN IT 67 EXPLOS 1 7 117 CARELE 16 122 ADV 15 85 PURVIE IS 81 143 CEMENT 69 CONCRE 47 158 PORTLA 47 16 CELLAR 34 29 SAND 32 t ) POURED 27 21 BUCKET 23 16 BUILOI 23 1593 SIDEWA 22 185 LAWN 21 35 MARBLE 21 20 PATIO 21 19 HOARD 20 3543 MIX 20 21 COMMER 19 527 WALLS 18 71 WATER 18 636 WINDOW IB 137 FELL 17 244 BLOCK 16 178 INGREO 16 31 STONE 16 124 
GARAGE 15 187 PLACED 15 713 PLEASA 15 38 PUHP 15 35 SLIPPE 15 103 SOIL 15 38 CEMETE 74 BURIAL 77 52 MONUME 60 24 GROVE 39 67 MARKER 34 18 LOTS 33 299 ASSOCI 29 671 LANO 29 1389 TRAC1S 27 56 OUI TTI 24 23 EVILS 23 26 PEOPLE 23 1850 EXTEND 21 722 OAK 21 67 SALES 21 572 ASS 20 201 OEOUCT 20 200 GRAVES 20 19 NONPRO 20 32 QUASHE 20 34 REESE 20 19 HALF 19 612 PERCEN 19 816 PRIVIL 19 455 PROPER 19 3911 C0NT1N 18 1912 CORPOU 18 1916 EXEMPT 18 406 PLOT IB 23 ACRES 17 217 OPERAT 17 2956 PLATTE 17 26 REL 17 1028 ACOUIR 16 610 FUNDS 16 59 5 INTERN 16 80 MAINTE 15 388 SALE 15 1338 CENSUR 25 EXACTI 35 18 DISC IP 31 91 PROFES 23 290 CLIENT 22 188 ENTIRE 20 1234 LAWYER 19 245 OISMIS IB 2222 FEES 16 463 LEGISL 16 1B45 RES PON 16 1964 RTCOMM 15 372 UNCONS 15 400 CENSUS IT POPULA 55 122 MANUAL 31 34 CRI 1ER 25 53 PROPOS 23 1053 ENUMER 20 146 FEDERA 18 678 CITY 16 3529 COUNT 1 16 131 YEAR 16 1185 INHABI 15 139 RES IDE 15 1079 UNITED 15 1080 CENT T2 YORK 41 904 AOMX 21 17 BAIRD 21 31 COAST 18 24 CORP 18 542 POWER 18 1B53 SCHNEI 16 50 CHESAP 15 32 CHIC AG 15 1176 PIERCE 15 33 CENTER 331 SHOPPI B4 53 LANE 46 238 ROAD 42 722 LINE 38 69 5 SIDE 34 722 SOUTH 34 635 WEST 34 642 ACROSS 33 325 FEET 30 933 LEFT 30 108 J EAST 27 653 INTERS 25 643 TRUCK 2A 565 HIGHWA 23 1067 HOUR 23 406 PER 23 861 COLL 10 22 141 MAINTA 22 979 NORTH 22 694 PROPER 22 3911 EGRESS 21 82 FRONT 21 442 IMPACT 21 116 INGRES 21 69 BUILOI 20 1593 CAR 20 1222 COLLI S 20 681 LANES 20 60 MILES 20 373 SLOWED 20 38 SOUTHE 20 258 TRAVEL 20 462 AIRWAY 19 22 OUTER 19 32 TRAFF I 19 580 ABUTTI 18 94 APPELL 18 7099 APPKOA 18 548 CORNER 18 234 CURB 18 112 ORIVIN 16 526 FENDER 18 16 NORTHS IB 35 SKIO 18 48 ANGLF 17 50 VEHICL IT 1180 APPROX 16 613 BOUNDE 16 41 CCMMER 16 527 CONCOR 16 18 EASTRO 16 42 PARK IN 16 213 REZONI 16 30 STREET 16 1420 WIDTH 16 12B BLOCK 15 178 ONCOMI 15 22 PAINTE 15 34 PARKWA 15 60 POINT 15 1301 QUANTU 15 50 ROUTE 15 232 SIORM 15 35 STRUCK 15 372 TURN 15 429 USE 15 2696 VILLAC 15 730 WIDE 15 209 CENTRA 373 YORK 36 
904 KENNET 35 5H PROPER 28 3911 ALBERT 27 109 BROTHE 25 389 SHARES 25 298 CLEVEL 24 521 WORCES 23 95 SI STER 21 272 UNION 21 566 COMPLA 20 2823 STOCK 20 510 RAILRO 19 627 SUB SCR 19 130 MASTER 18 604 QUOTES 18 51 RUSSIA 18 50 STOCKH 18 171 ASSETS 17 426 REINSU 17 30 CAUSED 16 936 MAINE 16 4'i NORTH 16 694 REVOLU 16 22 OIOCES 15 25 INFERE 15 64B SUPPOR 15 2618 CENTS 94 OOLLAR 29 441 HUNDRE 28 415 CUBIC 27 23 EIGHTY 25 65 FIFTY 23 236 SAV 22 32 RATE 20 525 TAXABL 19 136 PAY 18 1485 SIXTY 18 159 GI080N 17 20 PER 17 061 TEN 17 610 1H0USA 17 218 BROOKS 16 39 FIVE 16 1195 SALES 16 572 THIRTY 15 399 CENTUR 78 STANIS 104 17 SPEARS 43 16 AGO 33 86 INOEMN 29 21 3 RITCHE 27 18 MIKE 20 18 DICT10 19 73 LOAOIN 16 111 LIABIL 15 1336 ORTHOO 15 32 CEREBR 38 HEMORR 121 46 HYPERT 101 20 BRAIN 75 43 SYMPTO 54 3 1 PRESSU 53 84 SHOCK 50 42 THROMB 49 17 HEART 47 106 HEADAC 38 18 BLOOD 36 182 ARTERI 35 21 LACERA 35 22 OECEOE 34 966 HYPOTH 31 108 01 SEAS 30 145 WEAKNE 30 29 PATHOL 29 32 LARGER 27 80 DRESS 25 23 HER 25 342 8 IIOSPI T 25 731 SKULL 25 23 VESSEL 25 40 OIAGNO 24 72 01E0 24 498 DEPRES 23 50 SUFFFR 23 599 TRAUMA 22 29 EMOTIO 21 32 COLLAP 20 35 DOCTOR 20 333 PAT IEN 20 147 BRFATH 19 38 DAY 19 1712 MEOICA 18 546 RECOVE 18 1577 WOUND 18 42 ACCIOE 17 1415 SMALL 1 7 324 PROGRE 16 157 C IRCUL 15 117 MCCART 15 62 STRAIN 15 62 CEREMO 19 ALFREO 52 44 COHABI 39 20 GLOUCE 39 20 MYRTLE 39 35 MARRIA 36 321 EL l/AO 34 69 AL 1VE 33 47 MARGAR 29 100 BIRTH 23 53 WIFE 22 1141 MARRIE 19 208 LAWSUI 18 89 WENT 18 733 TRANSA 17 640 DEAD 16 116 FAITH 16 33H KNEW 16 619 PERFOR 16 1392 OIEO 15 498 LIVE 15 no CERTIF 1398 COUNTY 47 4129 CLERK 34 847 COPY 34 456 ALLOWA 32 39 7 COPIES 31 173 ESTATE 29 2561 OFFICE 29 2598 PLUMBE 27 59 OOARD 26 3543 TITLE 25 1183 AUTHEN 24 53 LEVY 24 200 SCHOOL 24 1161 APPEAL 23 6176 AMENDE 22 1484 NEGATI 22 184 ANSWFR 21 2431 JOURNE 21 44 TUESDA 21 26 VEHICL 20 1180 TRANSC 19 480 ACCRUI 18 58 DATFO 18 445 PAYMEN 18 1724 REGIS! 
18 434 REVIEW 18 1798 sec 18 4065 SURREN 18 89 ADD IT 1 17 1490 BIRTMA 17 17 JUVEN1 17 184 PLUHBI 17 92 OUICKL \7 38 RETURN 17 1643 AUGUST 16 B04 CODE 16 3149 ENTRIE 16 107 L ICENS 16 782 MUNICI 16 1382 PAR 16 796 RfCOUN 16 41 RCISSU 16 18 VOUCHE 16 18 WORK 16 1319 ZONING 16 783 AUCT 10 15 21 BRIT IS 15 22 HAHILT 15 32 3 JANUAH 15 1167 NUMBER 15 1167 PREPAI 15 21 SUPPOR 15 261 b YEAR 15 1185 CERTIO 202 WR IT 44 1096 APPEAL 40 6176 C IR )9 374 2 ON INC 36 783 PETITI 34 4506 REVIEW 24 1798 MARION 22 301 UUASHE 21 34 SALVAG 21 33 VARI AN 21 210 BOARD 17 3543 JURISD 17 2138 SUPREM 17 1622 ILLINO 16 1363 UNI TED 16 1080 CESSAT 19 STOPPA 40 19 STRIKE 20 498 OPERAT 19 2956 BASIS 16 1334 REGULA 16 1333 ACCUMU 15 125 CESTUI 25 gUASHE 121 34 ENRICH 44 32 FIDUCI 39 196 BARR 36 1 1 RCSCIS 35 50 SETTLO 30 Tl AGREEM 25 1853 UNJUST 23 117 LFGAL 22 1431 TRUST 22 1546 RFPU01 21 51 CONSTR 20 2758 RATES 20 308 PROPER 19 3911 RFCE1 V IB 2270 THEREU IB 539 FREIGH 17 135 RENEFI 16 1816 PR 1 NCI 16 1753 RESCIN 16 84 RFCOVE 15 1577 CHMN 55 PROVOC 27 25 CMEMIS 26 27 SPEC1H 23 35 HOCKER 21 23 PIT 19 47 CAUSAT 18 30 LUCY 2A SOLOMO BENEFI 31 16 24 1816 CUSH ASSOC I 28 15 30 671 GIBSON AUTOMO 22 15 49 1513 DAUGHT 17 286 REMARR 17 81 SURVIV 17 4 66 LUMBER 96 COTTON 57 39 MA SONR 25 17 MILL 25 63 OWNER 23 1163 PROPER 23 3911 gUASHE 22 34 OUICKL 20 38 YARD 19 137 INC 18 1466 STORED 15 46 LUMMUS 20 DOWO 30 31 INO 27 1615 PR I NCI 26 1753 LEAD 22 155 LIENS 17 97 MASS 16 1504 LUMP 32 SUM 39 1032 PERIOD 26 1516 SETTLE 26 903 LIEU 25 75 JURY 20 3810 0ISA8I 19 247 PATERN 19 4 7 RATE 19 5?5 REMARR 19 81 COMPEN 16 1069 OIVEST 16 69 CLOTHI 15 74 USED 15 2130 LUNCH 41 GLANOE 47 25 DINNER 25 21 MASS 23 1504 PERIOD 23 1516 TELLER 23 26 RESTAU 20 186 CONT IN 19 1912 DAY 18 1712 BOARD 17 3543 CODE 16 3149 COMPAN 15 2887 LUTHER 30 CLARA 28 23 CLINE 27 26 ESTATE 26 2561 BLDG 22 70 RUSSEL 21 115 VIGO 20 44 ERROR 19 3093 AMERIC 17 635 CHURCH 16 349 HEIR 15 79 LYING 151 SAW 33 628 IMMEOI 24 850 SOUTH 23 635 TENEPE 
21 15 WEST 21 642 TRACT 20 295 CAME 18 700 LID 18 20 ROAD 18 722 SIDE 18 722 AVENUE 17 SIB NORTHW 17 127 ALLEY 16 149 CORNER 16 234 FEET 16 933 HELPED 16 53 HIGHLA 16 38 LANDS 16 316 FENDER 15 16 LEGS 15 41 LOGAN 15 44 LOOKED 15 210 OUTSIO 15 478 WALKED 15 105 WHEEL 15 63 LYNCH 32 INC 26 1466 SARAH 22 37 GRANT 17 646 GREEN 17 152 LYNN 38 BOSTON 16 409 LOWELL 16 58 LYONS 31 8UCHAN 54 25 DAVIS 31 255 CADILL 28 40 OUICKL 22 38 COUNTY 20 4129 CAR 18 1222 MABEL 22 EVERET 92 34 BOWMAN 90 43 SALARI 46 65 SHAREH 45 115 SHOE 41 43 ALLEN 40 211 DONALC 34 120 GRDSSL 31 47 GRATUI 29 54 JOHN 23 483 OFFICE 22 2598 INTERE 20 2557 MCCART 20 62 MEETIN 19 331 SHARES IS 298 TEMPOR 18 46 5 VOTING 18 129 PERCEN 17 816 RESIGN 17 83 VERIFI 17 221 COMPOS 16 93 DAY 16 1712 HOWARD 16 101 INJUNC 16 684 WILL 15 4823 MACHIN 38B PINBAL 52 17 REflUll 49 35 SEWING 47 21 COIN 46 35 GAMES 45 47 SLOT 43 20 DEVICE 42 221 PIN 41 27 PLAYER 41 37 GAMBLI 39 78 VEN01N 39 16 SWITCH 35 68 CORN 30 27 TRiMON 30 26 PICKER 29 22 BALL 28 80 DECEPT 28 48 FAMOUS 28 17 VACUUM 28 29 ADVERT 26 25 7 SAZAMA 25 15 USED 25 2130 OUANTU 24 50 COFFMA 23 18 GEAR 23 17 VOTING 23 129 CLEANE 22 38 SCORE 22 19 SKILL 22 74 EOUIPH 20 366 MANUFA 20 415 nu(.KET 19 16 GAMING 19 16 LOCK 19 60 OUASHE 19 34 SELL 19 514 TOOL 19 37 WON 19 37 AUTOMA 18 176 CITY 18 3529 COINS 18 17 SALES 18 572 KNOWLE 17 969 OPIRAT 17 2956 PRECIN 17 58 SALE 17 1338 OISPAT 16 34 POOL 16 85 ROLLER 16 21 SELLIN 16 255 TOOLS 16 B6 USE 16 2696 BLUE 15 56 COLUHN 15 37 OESCRI 15 1067 Dl '.PAR 15 26 INC 15 1466 MERCHA 15 260 OFF 15 637 PARTS 15 378 TELEVI 15 57 VOTED 15 91 144 maoiso us MAE 1") MAGAZI 4? MAGNIT MAHONI MAILS MAIN MAINLY MAINS MAJOR MAJOR! 
MALONE MALPRA IT 42 IS 3 30 NA1NTA 979 97 588 MAKERS 26 BALDEN 2S MALE 32 MALFEA IS MALICE 120 15 42 OATH 131 OATHS 24 OBEOIE 21 OBEY 24 OHEYEO 19 OOITER 15 SHOE 32 A3 REVOLV 31 75 BAG 24 45 GUN 24 124 NARCOT 22 201 POLICE 22 1241 MAN 21 641 RENTED 20 65 KEY 19 68 PART IT 19 125 WITNES 19 2175 OISTIN 18 898 DRUGS 17 ei PEOPLE IT 1850 RAY 1 7 69 HOWARO 16 101 PAPER 15 200 PURCHA 15 1487 DECATU AT 35 LOIIB 4T 16 CIRCUI 35 1389 REPLEV 29 65 HAYES 21 73 CARROL 20 84 rAVERN 20 18A BUIIGLA IT 1 77 JURI SO 16 2138 CORPOR 15 1916 IMMED1 15 R50 NIGHT 15 282 LEAGUE 36 30 DINER 23 16 FLORES 16 1H JURISO 15 2138 BERTHA 76 21 RARNET 68 26 HARRY 22 110 FORMER 16 785 LEE 15 128 DELL 58 16 MODERN 40 134 unci 36 IB NEWSPA 29 171 PUBl IS 29 246 STORIE 25 21 MARK 24 92 PUBLIC 24 3129 CONFES 22 267 TITLES 22 48 TRADE 18 218 UPPER 18 71 PUB 17 T9 RETURN 17 1643 CHANGE 16 1383 UNFAIR 16 130 HOOKS 15 213 COMMON 15 2956 COYER 15 20B HAND 15 715 NEWS 15 56 WORD 15 659 POLICE 39 1241 PEOPLE 28 1850 SESSIO 26 170 PROC 22 28 MISOEM 20 124 RECEIV 20 2270 PEACE 19 175 CI IY 18 3529 CRIHIN 17 1063 SIT 17 46 ARREST 16 614 JUSTIC 16 8 76 BOND 15 528 CLERK 15 R47 USE 19 2696 COMPLE 17 1453 VALLEY 49 63 SMITH 23 167 YOUNGS 19 97 COUNTY 16 4129 JURY 17 3810 WATER 15 636 REGIST 54 434 MAIL IN 44 39 POSTAL 37 16 NOTICE 31 1938 LETTER 30 601 OELIVE 28 854 DRAWEE 28 36 MAILS 27 15 ADORES 26 347 COPY 26 456 SENT 25 245 PROPER 23 3911 PIEPER 22 15 MAKER 21 49 NOTIFI 21 303 DISHON 20 40 NOT IFY 18 107 PREPAI 18 21 ENVELO IT 36 FILING 17 1003 RECEIP 17 394 SENO 17 53 DEFERR 15 47 POSTIN IS 44 SUPPOR 15 261B MAILIN 35 39 BEALS 34 16 COPY 22 456 NOTICE 22 1930 LETTER 20 601 WRITTE 20 1019 FNVELO 18 36 CHECK 17 404 JANUAR 17 1167 NOTATI 17 65 DATE 16 1611 AOORES 15 34T COMPAN 15 2887 REAL 15 1528 MAIL 44 139 MAILEO 35 82 NOTICE 34 1938 AOORES 32 347 BEALS 30 16 POSTAL 30 16 LETTER 23 601 ENVELO 20 36 OISHON 19 40 OELIVE 16 854 PROPER 18 3911 PUBLIC 16 3129 REGIST 16 434 SENO 16 53 COMPAN 15 2887 POSTAL 49 16 
[A machine-printed key word association listing occupies the preceding pages: an alphabetical run of word stems (OBJECT, OBLIGA, OBSCEN, OBSERV, OBSTRU, OBTAIN, OCCURR, POLICE, POLICY, POOL, . . .), each followed by its associated stems together with an association value and a corpus frequency. The listing is too poorly reproduced in this copy to be restored.]

. . . can be met.

Let $V^\alpha$ be the probability of a document being randomly drawn from the $\alpha$th subpopulation (class), $\alpha = 1, 2, \ldots, m$. Let $\lambda_j^\alpha$ be the probability of a document from the $\alpha$th subpopulation possessing the $j$th word, where $\bar{\lambda}_j^\alpha = 1 - \lambda_j^\alpha$ denotes the probability of not possessing the word. The probability that a document drawn from the $\alpha$th class will possess both words $i$ and $j$ is given by $\lambda_{ij}^\alpha$. It should be noted, however, that the model assumes independence of key words within a class; i.e., $\lambda_{ij}^\alpha = \lambda_i^\alpha \lambda_j^\alpha$.

The probability of obtaining a given key word pattern for a document is the sum over classes of the product of the probability of belonging to a latent class, $V^\alpha$, and the probabilities of possessing the words, the $\lambda^\alpha$'s; thus the response patterns, represented by the $\Pi$'s, are functions of the $V$'s and $\lambda$'s. The relationships existing among the $\Pi$'s, $V$'s, and $\lambda$'s are expressed in a system of equations known as the accounting equations, several of which are given below for illustrative purposes:

$$\Pi_j = \sum_{\alpha=1}^{m} V^\alpha \lambda_j^\alpha, \qquad \Pi_{ij} = \sum_{\alpha=1}^{m} V^\alpha \lambda_i^\alpha \lambda_j^\alpha \quad (i \neq j). \tag{1}$$

If one denotes those key words which a document possesses by the subscript $z$, where $z$ is a subset of the integers $1, 2, \ldots, N$, the accounting equations can be summarized as $\Pi_z = \sum_{\alpha=1}^{m} V^\alpha \lambda_z^\alpha$. Latent class analysis is fundamentally the problem of solving the accounting equations for estimates of the $V$'s and $\lambda$'s, using approximations for the $\Pi$'s.
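As a concrete illustration, the accounting equations can be evaluated directly for a toy latent structure; the sketch below uses invented values (two classes, three key words) that do not come from the paper.

```python
# Illustrative sketch (invented parameters): evaluating the accounting
# equations for a latent structure with m = 2 classes and N = 3 key words.

V = [0.6, 0.4]                # class probabilities V^alpha
lam = [[0.9, 0.2, 0.5],       # lambda_j^1: word probabilities in class 1
       [0.1, 0.8, 0.5]]       # lambda_j^2: word probabilities in class 2

def pi_single(j):
    """Pi_j = sum over classes of V^a * lambda_j^a."""
    return sum(V[a] * lam[a][j] for a in range(len(V)))

def pi_pair(i, j):
    """Pi_ij = sum over classes of V^a * lambda_i^a * lambda_j^a
    (key words are independent within a class)."""
    return sum(V[a] * lam[a][i] * lam[a][j] for a in range(len(V)))

print(round(pi_single(0), 6))   # 0.58
print(round(pi_pair(0, 1), 6))  # 0.14
```

Solving the accounting equations runs in the opposite direction: the Π's are approximated by observed proportions, and the V's and λ's are estimated from them.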
Because the $\Pi$'s are unavailable population values, they must be replaced by the corresponding observed $P$'s. The original mathematical computations given by Lazarsfeld [7] were extremely laborious and difficult to implement; hence more tractable methods based upon matrix algebra were soon developed (Anderson [9, 10]; Gibson [11, 12]; Green [13]; Madansky [14]). At the present time we are writing a FORTRAN program for Green's method of solving the accounting equations. The solution of the matrix equations yields an $m \times n$ matrix of $\lambda$'s, illustrated in table 1, which express the probability of key word $j$ being possessed by documents belonging to latent class $\alpha$, and a vector of $V$'s which specify the proportion of the total group of documents belonging to each of the $m$ classes.

The relation of documents to the mathematically derived storage categories (latent classes) is determined by computing ordering ratios, which are composed of the products of the probabilities of the key words present and absent in a particular pattern of key words:

$$P^\alpha = \frac{V^\alpha \prod_{j=1}^{N} X_j^\alpha}{\sum_{\beta=1}^{m} V^\beta \prod_{j=1}^{N} X_j^\beta}$$

where $X_j^\alpha = \lambda_j^\alpha$ when the document possesses key word $j$, and $X_j^\alpha = 1 - \lambda_j^\alpha$ when the document does not possess key word $j$.

Table 1. Estimated latent structure

  Latent class   Class probability   Probability of possessing key word 1, 2, . . ., n
       1              $V^1$          $\lambda_1^1,\ \lambda_2^1,\ \ldots,\ \lambda_n^1$
       2              $V^2$          $\lambda_1^2,\ \lambda_2^2,\ \ldots,\ \lambda_n^2$
       .                .                              .
       m              $V^m$          $\lambda_1^m,\ \lambda_2^m,\ \ldots,\ \lambda_n^m$

The ordering ratio is associated with a particular pattern of key words and can be interpreted as the probability that that pattern of key words would be possessed by the documents in a particular latent class. The inverse interpretation is used to associate documents with a latent class: the key word pattern of the document is used to compute $m$ ordering ratios, and the latent class which has the highest probability of generating such a pattern is the one to which the document is assigned.
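The ordering-ratio calculation can be sketched in a few lines; the parameter values below are invented for illustration and continue the toy two-class structure rather than any corpus from the paper.

```python
# Hedged sketch of the ordering-ratio calculation: for a document's key
# word pattern x (1 = present, 0 = absent), X_j^a is lambda_j^a when
# x_j = 1 and 1 - lambda_j^a when x_j = 0; the ratio for class a is the
# normalized product V^a * prod_j X_j^a.  All parameter values invented.

V = [0.6, 0.4]
lam = [[0.9, 0.2, 0.5],
       [0.1, 0.8, 0.5]]

def ordering_ratios(x):
    """Return the m ordering ratios P^a for key word pattern x."""
    scores = []
    for a in range(len(V)):
        prod = V[a]
        for j, present in enumerate(x):
            prod *= lam[a][j] if present else 1.0 - lam[a][j]
        scores.append(prod)
    total = sum(scores)
    return [s / total for s in scores]

ratios = ordering_ratios([1, 0, 1])   # a document possessing words 1 and 3
assigned = max(range(len(ratios)), key=ratios.__getitem__)
```

The document is assigned to the class with the largest ratio; note that the absent word contributes its own factor, $1 - \lambda_2^\alpha$, rather than being ignored.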
The possibility exists of key word patterns yielding identical ordering ratios for several classes, but the mutually-exclusive assumption indicates that the document should be assigned to only one class. From a practitioner's point of view, I doubt that after-the-fact violation of the assumption and multiple assignment of doubtful documents would degrade the system. An important feature of latent class analysis is that the ordering ratio is a function of the pattern of key words and involves terms corresponding to both the presence and absence of a key word in a document. Several previous statistical association methods utilize only the fact that a key word is present (Maron [1]; Borko [5]), and Maron's [1] automatic indexing scheme broke down when a key word for a category was absent, necessitating the use of 0.001 in place of zero in the index calculations. To summarize briefly, latent class analysis provides a method for mathematically deriving storage categories based upon the information contained in the vector of key words representing documents. The model utilizes the data provided by 2-tuples, 3-tuples, up to n-tuples of key words, rather than being restricted to 2-tuples as are other models (Borko [5]; Stiles [15]; Salton [16]). The key word patterns are associated with underlying storage categories on a probabilistic rather than an absolute basis. Key words in latent class analysis are designated by a 1 if they are present in the document, which is equivalent to giving them a relevance of 1 in Maron and Kuhns' [6] system. Maron and Kuhns [6] found that 70 percent more answer documents were retrieved when they switched from 1's to relevance numbers for representing key words. Hence, the use of relevance numbers in latent class analysis might also effect a significant improvement in deriving appropriate classes, etc.

3.
Comparison of Latent Class Analysis and Factor Analysis

The statistical association model which bears the closest resemblance to latent class analysis is the factor analytic scheme due to Borko [5]. Factor analysis likewise operates on the key word data: it reduces the n-dimensional index space of the key word dictionary to a smaller number of dimensions. In Borko's application, the orthogonal axes of the reduced index space correspond to storage categories. Thus, to assign a document to a storage category, one computes its location in this reduced space and assigns the document to the closest axis. The assignment is accomplished by computing a vector of factor scores; the largest factor score determines the category to which the document is assigned. Latent class analysis has a somewhat similar system, except that the calculation of the ordering ratio includes terms for both the presence and absence of key words, and it yields a probability value rather than a correlational value.

Borko and Bernick [17, 18] reported approximately 50 percent success in assigning documents to categories in an experiment which was a replication of Maron's [2] earlier work except for the classification technique employed. Borko and Bernick [17, 18] used the key word vectors from Maron's 247 computer documents to derive 21 storage categories factor-analytically. A second sample, also obtained from Maron [1], was then classified by means of the key word factor loadings derived from the first sample. There are several points in the procedure which should be elucidated. First, such a two-sample procedure is contrary to the rationale underlying both factor analysis and latent class analysis. With a scheme such as factor analysis, one should not attempt to derive a replacement for the Dewey Decimal System which will then be used to categorize all subsequent documents entering the library.
Rather, what one does is to derive a classificatory system which is optimal for the documents already in the library. That this is the case is shown by Borko's data, where 63 percent of the first-sample documents were classified properly but only 50 percent of the second sample were classified correctly. The two-sample procedure leads to some horrendous sampling problems which could never be adequately resolved, and samples of size 247 and 85 do not provide a very good basis for resolving them. Both latent class analysis and factor analysis yield derived storage categories which are valid only for the documents upon which they were calculated. If one wishes to add documents to the library, their key words must be assigned from the same dictionary, and the addition of any sizable number of new documents requires a rederivation of the storage categories and possibly an expansion of the key word dictionary.

Second, factor analysis depends upon 2-tuples of key words; hence the 90 × 90 matrix consists of all possible correlations of 90 words taken two at a time. Maron's data [1] showed that as the number of key words used conjunctively to identify a document increases, the probability of correct classification increases. To take the conjunction of, say, n key words and fractionate it into all possible 2-tuples seems a backward step. Human indexers employ the total combination of key words (or at least a large part of it) to assign a document. For example, given "computer teaching devices," one would not break it up into computer teaching, computer devices, teaching devices, teaching computers, devices teaching, and devices computer and then use the six pairs to assign the document. If human indexers were restricted to independent knowledge of six 2-tuples rather than the whole pattern, I would suspect that they would do a poor job of classification.
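The fractionation objected to here is easy to make concrete. The sketch below is purely illustrative: it enumerates the pairs produced from the three-word conjunction (the passage counts ordered pairs, while a correlation matrix needs only the unordered ones) and counts the distinct word-pair correlations a 90-word dictionary entails.

```python
# Hypothetical sketch: fractionating a key word conjunction into 2-tuples,
# as a correlation-matrix (2-tuple) method requires.
from itertools import combinations, permutations

words = ("computer", "teaching", "devices")

ordered = list(permutations(words, 2))     # the six ordered pairs named in the text
unordered = list(combinations(words, 2))   # only three distinct correlations survive

# A 90-word dictionary, as in the 90 x 90 matrix above, yields
# 90 * 89 / 2 = 4005 distinct word-pair correlations.
n_pairs = 90 * 89 // 2
```

The joint pattern itself, which words co-occur as a group, is no longer represented directly once only the pairs are kept.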
The rationale underlying the 2-tuple approach is that words which often appear in company will form clusters showing high intracorrelation and low correlation with words not in the cluster; hence the original key word conjunction will reappear. In this respect I feel that latent class analysis offers a significant advantage over factor analysis, in that the mathematical model of the former involves all possible tuples of key words, not just the 2-tuples of the latter. Green's [13] method for solving the accounting equations consists essentially of factor analyzing the matrix of 2-tuples and rotating the structure until it fits the 3-tuple data, whereas factor analysis merely rotates the structure until it fits the 2-tuple matrix. Hence, latent class analysis should reflect the 3-tuples, whereas factor analysis cannot do so. Other solutions of the accounting equations take into account the higher-order n-tuples, but I would not want to try to write the computer programs to implement them. An investigation needs to be performed to determine the relative frequency of the possible tuples of key words in a corpus, as there is probably a value of n beyond which n-tuples are too rare to be of any value.

Third, how one compares the effectiveness of two different statistical association models is a very sticky problem. Maron [1], Borko [5], and Borko and Bernick [17, 18] have attempted to evaluate their procedures by comparing the derived document assignments against existing classifications of the same documents. I would suspect such a comparison is foredoomed, owing to the sample not being a miniature of the population and to peculiarities of the existing system. I would rather evaluate the systems in terms of their ability to yield documents relevant to a request. When I send an assistant to the library to search for books related to a topic, I couldn't care less how the librarian has categorized them.
My interest is in the relevance of the books the assistant brings back to the original request, and in this regard I would not anticipate that the categories derived by latent class analysis or factor analysis would correspond closely to any existing scheme.

Despite their differences, latent class analysis and factor analysis share two common problems: communalities and the number of classes to be derived. The communality problem arises out of the necessity to express the relationship of a key word with itself, i.e., what the diagonal terms in the correlation matrix should be. This perplexing problem has essentially been solved for factor analysis by means of Guttman's [19] image analysis (Harris [20]; Kaiser [21]), and in our latent class analysis computer program we will incorporate the image analysis approach to resolve the communality problem. How many storage categories to derive remains a rule-of-thumb procedure in both latent class analysis and factor analysis, and no really good solution is in sight. The lack of a definitive rule for determining the number of storage categories is rather embarrassing in the context of information retrieval, as the effectiveness of the system is highly dependent upon the number of categories employed. Borko [5] does not state what rule was used to ascertain that 21, rather than 20 or 30, categories should be employed. In the case of latent class analysis, McHugh [22] has provided a chi-square goodness-of-fit test which enables one to compare how well the corpus has been partitioned for different numbers of classes. One must, however, reanalyze the corpus for each different number of classes to obtain the data necessary for the test, and such an iterative approach is extremely expensive.

If one derives m underlying storage categories by means of latent class analysis or factor analysis, documents can be assigned to these classes on the basis of their ordering ratios or factor scores.
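The kind of comparison such a goodness-of-fit test makes can be sketched as follows; this is an illustrative chi-square comparison of observed pattern counts against model-predicted frequencies, not McHugh's actual statistic, and all parameter values are invented.

```python
# Illustrative sketch: chi-square comparison of observed key word pattern
# counts against the frequencies a fitted m-class latent structure predicts.

def chi_square(observed, V, lam, n_docs):
    """observed maps a key word pattern (tuple of 0/1) to its count."""
    stat = 0.0
    for pattern, count in observed.items():
        # Model probability of this pattern: sum over classes of V^a
        # times the product of presence/absence terms.
        p = 0.0
        for a in range(len(V)):
            prod = V[a]
            for j, present in enumerate(pattern):
                prod *= lam[a][j] if present else 1.0 - lam[a][j]
            p += prod
        expected = n_docs * p
        stat += (count - expected) ** 2 / expected
    return stat

# A model that reproduces the observed counts exactly fits with a
# statistic of zero; poorer partitions give larger values.
observed = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
stat = chi_square(observed, V=[1.0], lam=[[0.5, 0.5]], n_docs=4)
```

The expense noted above comes from having to refit the V's and λ's from scratch for each candidate number of classes before the statistic can be computed.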
Within these derived classes the documents are stored in descending order of these weighting numbers. Retrieval in such a system is performed by reading a key word vector as a request and computing the vector of factor scores or ordering ratios; the largest value determines the appropriate class. Once the storage category is found, those documents having a high probability of belonging to the storage category (or a high factor score) are retrieved, and now we are in a trap. Such a procedure means that there are only m possible sets of documents retrieved. The length of these m lists varies with the cutoff number set by the request, but they are nonetheless the same m lists. This is useless, of course, but Baker [4], at least, did not appear to have been aware of this trap: one should not employ the same scheme both to categorize the documents and to retrieve them. In the case of latent class analysis we are looking at the possibility of retrieving not those documents which have a high probability of belonging to the category, but those which have a probability of belonging similar to that of the request. Such a system would at least yield different sets of documents for different requests, but it would need to be checked out carefully, as it is only a guess at present. The trap described above was not realized until I reread Stiles' [15] description of his method for searching the corpus for key word profiles, which in essence generates storage classes unique to each request. These storage classes are then investigated in more detail for the desired documents. Definition of sets of documents peculiar to the words in the request leads to a large amount of magnetic tape spinning, which can be avoided by a structured library; hence the latter is to be preferred.
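The trap is easy to exhibit in a few lines (a sketch with made-up factor scores, not any published system): when every request is routed to the single best-matching class, only m distinct answer lists can ever come back, however many requests are posed.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                                 # number of derived storage categories
docs = rng.random((20, m))            # hypothetical factor scores, 20 documents

# Each category's documents, pre-sorted by descending score as in the stored library.
class_lists = [np.argsort(-docs[:, k]) for k in range(m)]

def retrieve(request_scores, cutoff=5):
    k = int(np.argmax(request_scores))     # the request falls into one of m classes
    return tuple(class_lists[k][:cutoff])  # ...so only m possible answer lists exist

# 100 distinct random requests still yield at most m distinct retrieved sets.
answers = {retrieve(rng.random(m)) for _ in range(100)}
print(len(answers) <= m)
```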
4. Problems Involving Matrices in Statistical Association Models

Statistical association methods such as latent class analysis are essentially problems in matrix algebra; factor analysis and latent class analysis involve taking the eigenvalues and eigenvectors of the n × n index space and manipulating matrices of order m × n. With present computer capabilities (7090, 1604), matrices of order 200 are about the maximum for which reasonable running times can be maintained. A more serious problem is that of computational accuracy in the matrix algebra calculations (Freund [23]). It is well known that inverses of matrices of size 50 or greater are highly suspect unless matrix improvement schemes are employed. The single precision floating point arithmetic of 7090 FORTRAN yields 27-bit mantissas, and I doubt that this is sufficient accuracy for matrices of order 200. The double precision floating point of the Control Data 3600 has an 84-bit mantissa, which should improve accuracy considerably, but whether it is sufficient for matrices of order 1,000 is a moot point. In addition to the storage requirements and accuracy problems, the sheer mechanics of manipulating matrices large enough to accommodate the key word dictionary of a reasonably sized library is a problem, and I do not believe that conventional techniques will prove adequate. If one can demonstrate that the index space is sparse when large dictionaries of key words are used, where sparse means that a large number of cells are empty, then some newer techniques are available. A method for inverting a large sparse matrix has been presented by Steward [24], and the eigenvalues of a large matrix can be obtained by the graph theoretic technique due to Harary [25]. At the present time we are rapidly approaching the upper limit of our capability for manipulating matrices and yet are dealing with unrealistically small dictionaries of key words.
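The precision worry can be made concrete with a small modern-notation experiment, not tied to any machine named above: solving an ill-conditioned linear system in single versus double precision shows how quickly short mantissas give way.

```python
import numpy as np

# Hilbert matrices are a classical example of ill-conditioning; by order 8
# they already strain single precision, echoing the suspicion cast on
# inverses of matrices of order 50 and up.
n = 8
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)                     # exact solution is the all-ones vector

x64 = np.linalg.solve(H.astype(np.float64), b.astype(np.float64))
x32 = np.linalg.solve(H.astype(np.float32), b.astype(np.float32))

err64 = np.max(np.abs(x64 - 1.0))
err32 = np.max(np.abs(x32 - 1.0))
print(err32 > err64)                   # single precision loses far more digits
```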
One needs to look at restricted matrix size in its proper context; I do not believe any of the authors of statistical models involving matrices advocate attempting to implement such models as operational systems. Rather, they intend to implement them in order to study the structure of a corpus of documents and to explore various other avenues of research.

5. Information Retrieval and Correlational Indices

Inspection of the published statistical association methods reveals that many of them are based entirely upon the product moment correlation coefficient or variants thereof (Borko [5]; Maron and Kuhns [6]; Stiles [15]; Salton [16]). The product moment correlation coefficient is a very peculiar descriptive statistic, and improperly used it leads one into a number of unusual activities. Parker-Rhodes [26], for instance, states that the product moment correlation coefficient is a predictive statistic, which is a new twist for one of the classical descriptive statistics. The recent paper by Salton [16], which presents a statistical association technique, is a prime example of the types of situations into which the product moment correlation coefficient leads. He established a number of correlation matrices of terms (it was only after 8 pages of text that he admitted his cosine index of association was in fact the product moment correlation coefficient) and then proceeded to compare these matrices by computing correlation coefficients, using the correlation coefficients of these matrices as the data. What meaning can be attached to the correlation of correlation coefficients is not easily elicited. The intent was to compare matrices to determine whether they were significantly different. A number of legitimate statistical techniques exist for this purpose (Anderson [9, 10]; Federer [27]), but to correlate correlation coefficients and then test the supercorrelation for significance is not one of them.
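The identity behind the parenthetical remark is worth a quick numerical check (arbitrary made-up data): Pearson's product moment correlation is exactly the cosine of the two mean-centered vectors, which is why a cosine index of association on centered data is the correlation coefficient under another name.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.random(50), rng.random(50)

def cosine(u, v):
    # Cosine index of association between two vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

r = np.corrcoef(a, b)[0, 1]            # product moment correlation
c = cosine(a - a.mean(), b - b.mean()) # cosine of the centered vectors
print(np.isclose(r, c))
```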
In the behavioral sciences we have already been through the major portion of our correlational period, and the educational and psychological literature is resplendent with similar inappropriate applications of the correlation coefficient. It seems as if each developing science is compelled to discover the correlation coefficient, and this is most unfortunate. The excursion into the blind alley of the correlation coefficient set educational psychology back 50 years; let's profit from that example and not do the same for information retrieval.

6. Summary

The lack of mathematical models for information retrieval has resulted in borrowing from other disciplines models and techniques which appear to have promise in the information retrieval context. The introduction of such borrowed models does not imply that they will resolve existing problems; rather, it is hoped that they might provide stepping-stones to mathematical models unique to information retrieval. In order to proceed in the development of mathematical models, one must of practical necessity introduce certain assumptions which are at variance with the real world, such as independence of key words and mutually exclusive sets of documents. The implications of such assumptions cannot be ignored, yet one usually cannot proceed smoothly without them. The latent class model embodies features of a number of existing techniques in one compact package, which makes it an attractive model to study in the information retrieval context. It satisfies Maron's desire for an approach which yields an indication of the relationship of a document to a storage category, and does so on a probabilistic basis. It should be noted that the probability actually involved is that of the documents in a given latent class possessing a specific pattern of key words.
The calculation of these probabilities, i.e., ordering ratios, employs terms corresponding to both the presence and absence of key words, whereas previous models have been concerned only with terms representing the presence of a key word. The mathematical model of latent class analysis, however, involves all of the possible n-tuples of key words in its accounting equations, rather than dealing only with 2-tuples as in factor analysis. The particular solution of the accounting equations presently being developed into a computer program (that due to Green [13]) involves only 2-tuples and 3-tuples. The solution of the accounting equations involves matrix algebra, with its accompanying problems of numerical accuracy, matrix size, and utility. Although the requirement for such matrix calculations is a disadvantage, I feel it can be overcome. If experiments with a corpus of documents indicated that latent class analysis performs well in the information retrieval context, it would be a relatively straightforward task for mathematicians to derive approximation techniques for realistically large key word dictionaries. The lack of a really good corpus of, say, 10,000 documents key worded from a dictionary of 1,000 words is severely hampering research. A common corpus such as this would be of incalculable benefit to research workers, as would some objective criterion for comparing various techniques for manipulating such a corpus. As a final comment I would like to reiterate my distaste for the product moment correlation coefficient and its variants. This descriptive statistic can lead one far from the goal and should be studiously avoided.

7. References

[1] Maron, M. E., Automatic indexing: An experimental inquiry, J. Assoc. Comp. Mach. 8, 404-417 (1961).
[2] Doyle, L., The microstatics of text, Information Storage Retrieval 1, 189-214 (1963).
[3] Doyle, L., Semantic road maps for literature searchers, J. Assoc. Comp. Mach. 8, 553-578 (1961).
[4] Baker, F. B., Information retrieval based upon latent class analysis, J. Assoc. Comp. Mach. 9, 512-521 (1962).
[5] Borko, H., The construction of an empirically based mathematically derived classification system, Proc. 1962 Spring Joint Computer Conf., Palo Alto, Calif., 279-285 (National Press, 1962).
[6] Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, 216-244 (1960).
[7] Lazarsfeld, P. F., Latent structure analysis, ch. 10 and 11 of The American Soldier, Vol. 4, Measurement and Prediction, ed. S. A. Stouffer, Princeton Univ. Press, Princeton, N.J. (1950).
[8] Torgerson, W. S., Theory and Methods of Scaling (John Wiley & Sons, New York, 1958, 460 pages).
[9] Anderson, T. W., On estimation of parameters in latent structure analysis, Psychometrika 19, 1-10 (1954).
[10] Anderson, T. W., An Introduction to Multivariate Statistical Analysis, ch. 1 (John Wiley & Sons, New York, 1958).
[11] Gibson, W. A., Extending latent class solutions to other variables, Psychometrika 27, 73-81 (1962).
[12] Gibson, W. A., An extension of Anderson's solution for the latent structure equations, Psychometrika 20, 60-73 (1955).
[13] Green, B. F., A general solution for the latent class model of latent structure analysis, Psychometrika 16, 151-166 (1951).
[14] Madansky, A., Determinantal methods in latent class analysis, Psychometrika 25, 183-198 (1960).
[15] Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271-279 (1961).
[16] Salton, G., Associative document retrieval techniques using bibliographic information, J. Assoc. Comp. Mach. 10, 440-457 (1963).
[17] Borko, H., and Myrna Bernick, Automatic document classification, J. Assoc. Comp. Mach. 10, 151-162 (1963).
[18] Borko, H., and Myrna Bernick, Automatic document classification: Part II. Additional experiments, System Development Corp. TM-771/001/000, 33 pages (Oct. 1963).
[19] Guttman, L., Image theory for the structure of quantitative variates, Psychometrika 18, 277-296 (1953).
[20] Harris, C. W., Some Rao-Guttman relationships, Psychometrika 27, 247-263 (1962).
[21] Kaiser, H. F., Image analysis, Proc. Social Sci. Res. Council Conf. on Measuring Change, ed. C. W. Harris (Univ. of Wisconsin Press, 1963).
[22] McHugh, R. B., Efficient estimation and local identification in latent class analysis, Psychometrika 21, 331-347 (1956).
[23] Freund, R. J., A warning of roundoff errors in regression, Am. Statistician 17, 13-15 (Dec. 1963).
[24] Steward, D. V., On an approach to techniques for the analysis of the structure of large systems of equations, SIAM Rev. 4, 321-342 (1962).
[25] Harary, F., A graph theoretic method for the complete reduction of a matrix with a view toward finding its eigenvalues, J. Math. Phys. 38, 104-111 (1959/60).
[26] Parker-Rhodes, A. F., Contributions to the theory of clumps: the usefulness and feasibility of the theory, ML-138, Cambridge Language Research Unit (Mar. 1961).
[27] Federer, W. T., Testing proportionality of covariance matrices, Ann. Math. Statist. 22, 102-196 (1951).

Problems of Scale in Automatic Classification

Roger M. Needham
University of Cambridge
Cambridge, England

One of the problems of automatic classification for information retrieval is the number of terms which need to be handled. It is not difficult to construct and use association matrices between, say, two or three thousand terms. However, even "controlled" vocabularies are often larger than this, and part of the object of automatic classification is to lessen the need for careful vocabulary control. The paper will discuss some approaches to the problem of scale, specifically involving:
1. Techniques for constructing partial matrices, or sample matrices.
2. Some techniques at present under experiment which implicitly make use of associations, but avoid constructing a matrix at all.
It is hoped that some preliminary results will be available. The paper will conclude with some arguments in favor of using a classification technique rather than using a matrix of associations directly for reference purposes, even if the latter were technologically convenient.

A Nonlinear Variety of Iterative Association Coefficients

Robert F. Barnes, Jr.
University of California
Berkeley, Calif.

There are in existence a number of different systems of association coefficients, which may be characterized and compared in several different ways. A framework that seems especially fruitful treats each set of coefficients as elements of a linear vector space of dimension N² (where N is the size of the object population at hand). Then any given set of coefficients can be viewed as the image under some vector-space transformation of a certain canonical set of coefficients. From this point of view, many of the properties of the resulting coefficients can be related to corresponding properties of the generating transformation. For one type of association coefficient, which we term an iterative association coefficient, the generating transformation is best viewed as the limit of the set of iterations of a second transformation. Such iterative coefficients can take into account higher-order relationships of co-occurrence, which are generally neglected by simple coefficients but which may be of considerable significance. Where the iterated transformation is nonlinear, the theory of such coefficients becomes quite complicated; however, analytic and empirical studies of one such variety of coefficient have revealed certain properties of some interest and have indicated certain kinds of retrieval situations in which these coefficients might prove useful.

The Measurement of Information from a File

Robert M. Hayes
University of California at Los Angeeles
Los Angeles, Calif.
90014

Many of the problems of measuring the responsiveness of a file can be approached by appropriate extension of communication theory: (1) by introducing the parameter of relevancy into the entropy function; (2) by allowing the output of multiple signals as a method of handling error; and (3) by combining these with the methods of sequential decoding for analyzing file indexing procedures.

At the risk of being boring and perhaps obvious, I am going to present a technical approach to the statistical view of information storage and retrieval, one which is somewhat different from that with which we are concerned this week and yet one which very clearly relates to it. I hope that another dose of mathematics will not be too indigestible, but I offer you this opportunity to steal quietly away. To introduce this approach, I would like to raise three questions, two of which I won't pursue much further and the third of which is the concern of my talk this evening.

The first question involves the relation between the value of an information system and the response time from it. I propose that this relationship is characterized by a logistic decay function based on a single parameter, its half-life, and I suggest that virtually all of the characteristics of an information system are a function of that single parameter. I therefore raise the question, "Can we define the appropriate relation between time and value and determine that parameter?"

The second question involves the relation between the value of an information system and the cost of it. I suggest that the obvious criterion is the economist's dictum, "cost equals value," but that is apparently not valid. All too many systems have been designed with virtually no concern for their cost. I therefore raise the question, "Can we define the appropriate relation between cost and value?"

The third question involves the relation between the value of an information system and the information derived from it.
I propose that this relationship is characterized by a logistic growth curve as a function of the amount of information provided. This obviously raises the question, "Is this the relationship?", but more fundamentally it raises the question, "How do we measure the information from a file system?" I raise these three questions for two reasons: first, I believe that the efficiency of an information system is expressible as a function of the three parameters, T, C, and N, with which these questions are concerned; and second, I wish to suggest some approaches to the study of the third, the measurement of information from a file.

The obvious approach, so obvious in fact that one might wonder why the question is raised at all, is to apply information theory. So let's try it. Picture a file system as though it were a communication channel with an associated decoder. As input we have requests, and as output we have the file records for relevant documents, perhaps including selected content from the document itself. Can we characterize the information characteristics of such a channel?

Consider a file of F bits consisting of items, x, each of N bits. Suppose a request y is matched against each item in the file over a specified n bits of the N, and the item which matches most closely is output. I am concerned with measuring the information from the file, in response to y, as a function of F, N, and n. I want to consider it in four parts:
1. Assuming that the search process is noiseless.
2. Assuming that the significance is dependent upon the relevancy of the information.
3. Assuming that the search process is noisy due to error in the request, the items, or the match process.
4. Assuming that the search process is noisy due to the imposed indexing structure.

Consider the 2^N possible x's. Assume that they are equally likely and consider any one of them, say x.
If we measure the relevancy, or degree of match, between x and y by the number of bits of the n over which x and y agree, we can formulate the total number of files from which x might be the response and, therefore, the probability of x given y. The measure of information provided by such a communication channel with this probability distribution is traditionally given by the entropy function

H(x|y) = − Σ p(x|y) log p(x|y).

We can bound this and derive the not unexpected result that the information is approximately

H(x|y) ≈ N − log (F/N).

Thus, given the file as a communication channel to which requests are input, the output consists of sets of N bits, of which log (F/N) are in some sense already "known" and the remainder are essentially new information. However, in some very important senses, this seems counterintuitive. For instance, one feels that the "information" from a file should increase as the size of the file increases, but the standard measure of information states the opposite. Secondly, and perhaps more importantly, this measure completely ignores the extent to which the output is actually responsive to the request. In this respect, a file is not simply a communication channel, and disparity between input and output is not solely a result of noise. Thus, as we increase F/N, the number of file items, we increase the likelihood of finding a good match, but we decrease the traditional measure of information in communication theory.

Communication theory normally confines itself to models that are statistically defined, so that the only significant feature of the communication is its predictability. I wish to extend this to include, as an equally significant feature, the relevancy of the information received, determined, for example, by its degree of similarity to a request input to the file. I therefore define the concept of "significance" as a function of both the probability of x, p(x), and the relevancy of x, r(x).
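The approximation can be exercised with illustrative numbers; the file parameters below are hypothetical, chosen only to make the arithmetic visible.

```python
from math import log2

# Sketch of the approximation H(x|y) ≈ N - log2(F/N):
# a file of F bits holding F/N items of N bits each.
N = 64                      # bits per file item
F = 2**20                   # total file size in bits
items = F / N               # number of items, F/N

new_info = N - log2(items)  # bits not already "known" from the file's size
print(items, new_info)      # 16384 items; 64 - 14 = 50 bits of new information
```

Note the counterintuitive behavior the text points out: doubling the number of items lowers this measure by one bit.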
Under the most straightforward assumptions of additivity with respect to both parameters, we can define the significance of a selection x as the product − r(x) log p(x) and the average significance as

S(X) = − Σ p(x) r(x) log p(x),

the sum being taken over X. In the special case of a noiseless communication channel, r(x) = 1 and we have the usual entropy function. Returning now to the importance of finding a good match: if the relevancy of x is measured, for example, by the number of bits of agreement between x and y, the average significance from a file is a convex function of the size of the file. Intuitively, it has the properties which I think such a measure should have, and I suggest that it be considered not only in the context of a file, but in other situations where value to the receiver is significant.

The nature of the characteristics of a file as a communication channel is particularly felt in the effects of error. Again, in normal communication theory, where one expects to get out of the channel what one puts into it, the effects of a probability of error in a single bit can be counteracted simply by increasing the number of bits of match. In fact, the probability of erroneously decoding the output is an exponentially decreasing function of the length of the identifier, n. Unfortunately, this is just not true of a file operation, since we are dealing at potentially correct points in the coding lattice near which the number of possible alternatives is enormously greater. In fact, there is a size of identifier beyond which the probability of error must increase. How then can we combat the effects of error, if increasing the length of the identifier is at best a stopgap? The answer is obvious, once it is recognized: we must output not just one response but a set of potential responses, to reduce the probability of erroneously missing the correct one. Then the probability of error becomes an exponentially decreasing function of the number of items output.
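The benefit of emitting a set of responses can be illustrated with a small simulation (a sketch under assumed parameters, not the author's construction): corrupt a request with bit errors, rank the file items by bits of agreement, and compare the chance that the true item appears among the top 1 versus the top 5 candidates.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_items, trials = 32, 64, 300
items = rng.integers(0, 2, size=(n_items, N), dtype=np.uint8)  # hypothetical file

def target_ranks(flip_p=0.2):
    """Rank of the true item when a noisy request is matched against the file."""
    ranks = []
    for _ in range(trials):
        target = rng.integers(0, n_items)
        noise = (rng.random(N) < flip_p).astype(np.uint8)
        request = items[target] ^ noise               # request with bit errors
        scores = (items == request).sum(axis=1)       # bits of agreement
        order = np.argsort(-scores)                   # best match first
        ranks.append(int(np.where(order == target)[0][0]))
    return np.array(ranks)

ranks = target_ranks()
hit_at = lambda k: float(np.mean(ranks < k))
# Outputting a set of k candidates can only raise the chance of
# including the correct item.
print(hit_at(1), hit_at(5))
```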
However, error in file operation as we have defined it will not be due solely to the type of noise resulting from an error in single bits of the request, or the file items, or the comparison process. A highly significant source of error arises from the failure even to consider the file item which matches the request over the maximum number of bits; such an error can arise whenever an indexing structure is imposed upon the file. In fact, the type of process I have just described, the output of several items in response to a request, represents the character of such an indexing structure. For example, an index might be constructed by establishing a "sequence of significance" on the identifying bits and using successive groups of bits as index criteria; a match on some smaller number of identifying bits then requires selecting not only the closest index term but a set of them. This problem can now be analyzed by an approach similar to that of Wozencraft in his sequential decoding procedure, but including the additional complexities which I have discussed.

In summary, I suggest that many of the problems in measuring the responsiveness of a file can be approached by appropriate extension of communication theory: first, by introducing the parameter of relevancy into the entropy function; second, by allowing the output of multiple signals as a method of handling error; and third, by combining these with the methods of sequential decoding for analyzing file indexing procedures.

Vector Images in Document Retrieval

Paul Switzer
Arthur D. Little, Inc.
Cambridge, Mass. 02138

The paper describes a model for generating a term-term association matrix. The model, based on co-occurrence frequencies, is a consequence of probability theoretic considerations. Using this association matrix, a method is then suggested for selecting a small subset of the index terms as axes for a low-dimensional index term vector space.
The method is intended to approximate a canonical factor analysis, but is much quicker to apply and easier to interpret. The position of an arbitrary term may be located in this reduced "image" space by reading off appropriate entries in the already-computed association matrix. The approximate method may, of course, be used in conjunction with association matrices derived in ways other than that described in this paper. A procedure is then outlined for locating documents in this same image space. Basically, this involves obtaining a description of the document consisting of a list of index terms with appropriate weights or frequencies. This may be done by referring to the title, table of contents, selected portions of the text, or what have you. Authors' names and cited authors and titles may also be incorporated in deriving the position of a document in the image space. Simple linear calculations characterize all the operations. Then the procedure for locating an enquiry in the image space is presented. The form of the enquiry is extremely flexible, permitting the use of any number of index terms or authors' names, with differential weighting. A quick method for retrieving "relevant" documents is proposed. The method is basically a search for document images contained in a hypercube with the enquiry image at its center. The proposed method of filing means that "relevant" documents may be identified immediately without any spurious scanning.

1. Introduction

1.1. Statement of the Problem

The elements of the problem are a collection of documents, e.g., a library, and an enquiry. The solution to the problem is a system which selects (retrieves) that subset of the document collection which contains the answer to the enquiry. Some of the difficulties which present themselves immediately are as follows:

(1) Any verbalized enquiry is usually no more than a good approximation to what one really wants to know.
Furthermore, the same verbalized enquiry may have any number of connotations. Hence, we will make this simplifying assumption: an enquiry has a unique connotation, i.e., each enquiry has only one correct answer.

(2) The obvious and trivial solution to the retrieval problem is to scan the document collection completely, selecting the subset which contains the one correct answer. Presumably, this could only be done by a human being who "knew" the content of the entire collection; in general, such a system is unavailable. We shall, therefore, assume that the solution, the retrieval of the correct documents, can be achieved by a mechanical, objective, and operational system.

(3) Inevitably, any system which is mechanical can communicate only in a prescribed and proscribed form. Thus, we further assume that every enquiry can be translated to a form which can be communicated to the system. However, the system to be proposed in these pages will be sufficiently flexible that this assumption will not prove to be very restrictive.

(4) Even though we have assumed that an enquiry is unambiguous and thus can have only one correct answer, it is usually the case that the answer is complex, with varying degrees of generality. Therefore, the subset of documents which contains the complete answer may be very large, and some documents may contribute very little to the answer. Thus, we assume that all the documents can be differentiated with respect to their relevance to a particular enquiry and, further, that this relevance can be measured.

These are the four major difficulties and the four corresponding basic simplifying assumptions we are employing. Each assumption may introduce into the system noise which may be difficult or impossible to assess. Though these assumptions are almost always incorporated in a retrieval system, they are rarely articulated.
The worth of any mechanical retrieval system hinges crucially on the degree of validity of the foregoing assumptions.

1.2. Scope

This paper will concern itself primarily with a model for the mechanical selection of documents most relevant to an enquiry. It is based chiefly on the construction of a low-dimensional document space and the development of a meaningful method of locating a document in this space. The vector representation of a document in this space will be called the document image. Retrieval is achieved by (1) translating the enquiry into an enquiry image, (2) entering this enquiry image in the space of all document images, and (3) selecting those document images which are nearest to the enquiry image according to the defined criterion.

1.3. Elements of the Document Image

Since, by assumption, the idea of human scanning of documents has been abandoned, it becomes necessary to devise some means of mechanically describing the information contained in a document. This might be achieved by a more or less complex statistical and/or syntactical analysis of the entire document. Alternatively (because it is much easier to do and may not result in too much loss), we will use only (1) the document title, (2) the author's name or authors' names, and (3) the titles and authors' names of any documents cited by the given document or which cite the given document. In a certain sense this model will therefore be a combination of Salton's model for use of citations [1] and Baxendale's model for title analysis [2]. However, we will not be attempting any of Miss Baxendale's semisyntactic analysis of titles. In addition, authors' names are included in the description of the document. Implicitly assumed, then, is that (1), (2), and (3) together in some way represent the information content of a document. This basic assumption is not totally unreasonable, and it effects the economy of not having to look at the contents of the documents.
For specialized collections, e.g., journal articles in a single field, the assumption may be especially well justified. For those who feel somewhat uneasy about ignoring the body of the document, there is a straightforward extension of the model which provides for a scanning of the body material in whole or in part; this extension appears in Appendix B to this paper.

As a convenient and flexible way of summarizing and combining the information contained in titles and authors' names, we will be constructing an "image space" of m dimensions. Every index term, document title, and author, whether actual, cited, or citing, will be transformable to a vector of m elements called an "image." All the images relating to a particular document will then be brought together to form a composite vector, the document image. How these images will be used for retrieval will be outlined later. In general, the transformation of index terms, etc., to vectors will be achieved by scoring them on each of m characteristics, the characteristics being chosen in a way to provide maximum discrimination among different documents in the collection. These scores will be the elements of the image vector. As a preview of what follows, it will turn out that once the images of index terms are defined, the images of titles, authors, and documents can be derived in a simple manner from these index term images. Thus, a good part of this paper is devoted to a meaningful construction of the vector images of the basic index terms.

2. Term Images

2.1. Preliminary Definition

The argument now hinges on the ability to find m characteristics by which index terms could be described as m-dimensional vectors. Ideally, if m = t = number of distinct index terms, then a given term t_i could have the unique representation

t_i = (0, 0, . . ., 0, 1, 0, . . ., 0),

where t_i is a vector of t elements whose ith element is a 1.
Then we find ourselves working with a t-dimensional space, where t is impracticably (and often spuriously) large. In practice, we will want m, the dimension of the image space, to be much smaller than t, the number of distinct index terms. As soon as m < t, the problem of an m-dimensional vector representation for an index term becomes nontrivial.

This problem can be approached in the following manner. Suppose there was some way of finding those m index terms (out of the t terms available) which in some way were the m most "characteristic." Denote these m terms by t_α, t_β, . . ., t_μ. Then, for a suitably defined distance measure, Δ, on the space of all index terms, we could define the m-vector representation, t_j, of an arbitrary index term, t_j, as

    t_j = (Δ_jα, Δ_jβ, . . . , Δ_jμ),

where Δ_jα, etc., represent the distances of the term t_j from each of the specially chosen "characteristic" terms. In this way we compress the total index space to an m-dimensional index image space. While this method for compression does seem reasonable, the argument for the method will be strengthened by the detailed development which follows. We have in this way shifted the problem to (1) finding a suitable distance measure, Δ, on the total index space, and (2) finding some way of selecting the m most characteristic index terms by using these suitably defined distances.

¹ Figures in brackets indicate the literature references at the end of the paper.

2.2. Δ, the Term-Term Distance Measure

A number of term-term distance measures have been proposed. Most of these are based on the number of co-occurrences, N_ab, of a pair of index terms, t_a and t_b, i.e., the number of documents in which the two terms co-occur. All these proposed measures tacitly assume that frequency of co-occurrence in some way reflects the degree to which t_a and t_b are related. We, too, shall incorporate this assumption, though in a somewhat different form.
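The compression just described can be sketched in a few lines. The distance function and the tiny vocabulary below are invented placeholders; the paper defines the actual Δ and the choice of "characteristic" terms in the sections that follow.

```python
# Sketch of the index-term image of sec. 2.1: a term's image is the vector
# of its distances to the m chosen "characteristic" terms.

def term_image(term, separators, delta):
    """Image of `term`: its distance to each of the m characteristic terms."""
    return [delta(term, s) for s in separators]

# Toy symmetric distance table, standing in for the real term-term measure.
toy_delta_table = {
    ("cat", "feline"): 0.05, ("cat", "house"): 0.80,
    ("dog", "feline"): 0.60, ("dog", "house"): 0.70,
}

def toy_delta(a, b):
    return toy_delta_table.get((a, b), toy_delta_table.get((b, a), 1.0))

separators = ["feline", "house"]          # m = 2 "characteristic" terms
print(term_image("cat", separators, toy_delta))   # [0.05, 0.8]
```

Closely related terms ("cat", "feline") get a small coordinate, unrelated ones a large coordinate, which is exactly the compression from a t-dimensional one-hot representation to an m-dimensional one.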
The proposed distance measure between two index terms is as simple as it is meaningful. Suppose we observe N_ab co-occurrences of the terms t_a and t_b. We might ask the following question: Given the occurrence frequencies N_a and N_b, what is the probability of observing as many as N_ab co-occurrences, assuming there is no association between t_a and t_b? That is, what is the significance probability of the event "N_ab co-occurrences"? It is this significance probability which will be taken to measure the distance between t_a and t_b. In general, the larger the value of N_ab, the smaller will be its probability of occurring purely by chance, i.e., its significance probability; and the smaller its significance probability, the more likely it is that t_a and t_b are not unassociated. Therefore, the significance probability of N_ab does provide us with a meaningful measure of the closeness of the terms t_a and t_b.

To get this probability we need to know the theoretical distribution of N_ab, conditional on N_a, N_b, and d (the total number of documents in the collection). It may be checked that this distribution is in fact the hypergeometric distribution with parameters N_a, N_b, and d. So the distance between t_a and t_b, say Δ_ab, is just the significance probability, and is given by

    Δ_ab = Σ from x = N_ab to min(N_a, N_b) of [ C(N_a, x) C(d − N_a, N_b − x) / C(d, N_b) ],

where C(n, k) denotes the binomial coefficient. Fortunately, this rather horrendous-looking animal is tabulated [3]. Thus, we may get the t × t Δ-matrix by substituting the quantities Δ_ab for the quantities N_ab in the co-occurrence matrix. Since the distances are probabilities, we have that Δ_ab is in the interval (0, 1).

2.3. The m Separators — Axes for the Space of Images

The primary purpose of calculating the term-term distances was to construct the m-dimensional index-term images. It was suggested that this might be done by selecting m index terms out of the t available index terms in such a way as to be most "characteristic."
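The significance probability of sec. 2.2 is a hypergeometric tail sum and is easy to compute directly (the paper relied on published tables). A minimal sketch:

```python
import math

def significance_probability(n_ab, n_a, n_b, d):
    """Delta_ab: probability of observing N_ab or more co-occurrences of two
    terms with frequencies N_a and N_b in d documents, under the
    hypergeometric null hypothesis of no association (sec. 2.2)."""
    total = math.comb(d, n_b)
    tail = sum(
        math.comb(n_a, x) * math.comb(d - n_a, n_b - x)
        for x in range(n_ab, min(n_a, n_b) + 1)
    )
    return tail / total

# Two terms each occurring in 5 of 20 documents, co-occurring in 3:
print(significance_probability(3, 5, 5, 20))  # ~0.0726, a small Delta, i.e. close terms
```

A small value means the observed co-occurrence is unlikely under independence, so the two terms sit close together in the index space.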
What was implied was a choice of those m terms which give rise to the most variation in the matrix of distances. These specially chosen terms will from now on be called "separators," and they will be denoted by t_α, t_β, . . ., t_μ. The image of an arbitrary index term, t_v, will then be the vector whose elements are the distances of t_v from t_α, t_β, . . ., t_μ, respectively, denoted by

    t_v = (Δ_vα, Δ_vβ, . . . , Δ_vμ).

If the m separators are well chosen, terms which are closely related will have similar images, while terms which are essentially unrelated will be "pulled apart" and will have widely different images.

The usual approach to a problem of this kind would be to perform a factor analysis of the matrix of term-term distances. We could then pick those m factors which have the largest variances and use these as separators. However, the factors would no longer be single terms but would, in general, be linear combinations of all t terms of the vocabulary. The inherent difficulties of calculation and interpretation have led me not to consider factor analysis for this problem. Instead, consider the following. Denote

    Δ̄_a = Σ_b Δ_ab / (t − 1).

Then Δ̄_a is the average distance of the terms of the vocabulary from the term t_a. If the individual distances, Δ_ab, differ considerably from their average value, Δ̄_a, then it is reasonable to say that t_a is a good discriminator (or that t_a carries a lot of variation); that is, if

    τ_a = Σ_b |Δ_ab − Δ̄_a|

is large, then t_a is a good discriminator. Thus compute the quantity τ_a for each index term t_a in the vocabulary. The m separators, t_α, t_β, . . ., t_μ, will be those m index terms whose τ-value is greatest. The nature and amount of calculation involved in this process of selection are outlined in appendix A to this paper; it is certainly superior to factor analysis in this respect. However, this method of selecting separator variables is not, to my knowledge, discussed in the statistical literature.
Therefore, I am not able to discuss its statistical properties, but they should be investigated more fully. Nevertheless, the process does have the strong intuitive argument of the preceding paragraphs.

Nothing has been said so far about how one goes about choosing m, the number of separators (the dimension of the space of images). Unfortunately, there does not seem to be any "internal" objective way of doing this. The best that can be said now is to choose m to be conveniently small. Clearly, the smaller m becomes, the simpler and less sensitive the retrieval system becomes; experience in this regard would certainly help. We can, however, formulate the following rule for getting m: The set of separators consists of those m index terms which have a τ-value greater than a threshold value, τ_0. Thus, the problem of choosing m is in this way shifted to the problem of choosing τ_0, which could perhaps be more objectively chosen from a consideration of the distribution of the τ's.

2.4. Recapitulation

Having found our set of separator index terms, the question now is: what shall we do with them? A purpose of this study was to create an image for each document, constructed in such a way that similar documents would have similar images. (What use would be made of these images will be taken up in greater detail further on.) The image was to consist of a document's score on each of m characteristics, i.e., the image is a point (vector) in an m-dimensional space. These m characteristics were then taken to be a special subset of the vocabulary of index terms. These m index terms were called separators. Any term, t_a, in the vocabulary could then be represented by an m-vector, t_a, whose components were the distances of t_a from each of the separator index terms, according to the metric, Δ. The separators were chosen in a way that gave them maximum discriminating power according to a defined criterion.
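The separator-selection rule of sec. 2.3 can be sketched as follows: rank every term t_a by τ_a, the total absolute deviation of its distances from their mean, and keep the m terms with the largest values. The 3-term distance matrix below is invented for illustration.

```python
# Sketch of separator selection (sec. 2.3).

def tau(row, a):
    """τ for term a, given row a of the t x t distance matrix (diagonal ignored)."""
    others = [d for b, d in enumerate(row) if b != a]
    mean = sum(others) / len(others)          # average distance over the t - 1 other terms
    return sum(abs(d - mean) for d in others)

def pick_separators(delta_matrix, m):
    """Indices of the m terms whose τ-value is greatest."""
    taus = [tau(row, a) for a, row in enumerate(delta_matrix)]
    return sorted(range(len(taus)), key=lambda a: -taus[a])[:m]

delta = [[0.0, 0.1, 0.9],
         [0.1, 0.0, 0.5],
         [0.9, 0.5, 0.0]]
print(pick_separators(delta, 1))   # [0] -- term 0 has the most spread (τ = 0.8)
```

Term 0's distances (0.1 and 0.9) straddle their mean of 0.5, so it discriminates well; the other two terms carry less variation (τ = 0.4 each).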
The metric, Δ, was also carefully chosen so that it would have a natural probabilistic interpretation.

Now we are at the stage where we can construct the images of each of the index terms in the vocabulary. This involves no further calculation — merely the picking out of the appropriate entries from the term-term distance matrix. It was remarked earlier that title images, author images, and document images would be a direct consequence of the index-term images (which we have just calculated). The next part of this paper shows how this is accomplished. The fourth part of this paper will treat of applications.

3. Scoring the Document

3.1. The Title Image

The scoring of any document on the m separators may be conveniently divided into two parts: (1) finding the title images, and (2) finding the author images.

It turns out to be rather straightforward to create the title image. First select all the index terms in the title — this means all words except those which, by themselves, do not convey any substantive meaning, e.g., most quantifiers, prepositions, conjunctions, etc. This operation is performed quite easily by human beings but could be mechanically performed by storing a vocabulary of the nonsubstantive words. (Here, this operation is assumed to have already been performed when the original term-occurrence counts were made.) Suppose the title, T, contains the y index terms t_1, t_2, . . ., t_y, whose corresponding m-dimensional image vectors are t_1, t_2, . . ., t_y. Then define the title image of T as the m-vector, T, which is the weighted average of the images of all the index terms which appear in the title T, i.e.,

    T = Σ_j λ_j t_j,

where Σ_j λ_j = 1 and y = the number of terms in the title. The weight λ_j is chosen to correspond to the importance of term t_j relative to the other index terms in the title.
There appear to be two ways of choosing λ_j in an objective and mechanical manner:

(1) λ_j = 1/y for all j; that is, each term of the title is given equal weight in the construction of the title image T. In this case we have simply

    T = (1/y) Σ_j t_j.

(2) λ_j = (1/N_j) / Σ_{i=1}^{y} (1/N_i), where N_j = the total frequency of term t_j in the titles of the collection. Thus, the more rarely a term occurs, the greater is its weight in the construction of T. In this case,

    T = ( Σ_j t_j / N_j ) / ( Σ_j 1/N_j ).

The second method for assigning λ_j seems to have the stronger appeal, since rarely occurring terms are given greater weight than commonly occurring terms in the construction of the title image, while the first method is a "no-information" type of weighting. In collections which are so small that the quantities N_j are not especially reliable estimates of the relative frequencies of occurrence of the different index terms, it may be just as well to use the simpler first method of weighting.

3.2. The Author Image

The construction of the author image is carried out in the same straightforward manner as the construction of the title image. The author image is built up by considering all the index terms the author used in the titles of all his documents which are in the collection. In fact, it is natural to regard the author image as some composite of all the title images of his titles. The obvious and simplest composite is just the average; i.e., if an author, W, has p titles in the collection, T_1, T_2, . . ., T_p, whose corresponding m-dimensional title images are T_1, T_2, . . ., T_p, then the author image, W, is defined as the m-vector

    W = (1/p) Σ_{i=1}^{p} T_i.

Thus, we have for each document a title image, T, and an author image, W. (It is worth noting that the elements of T and W are still contained in the interval (0, 1).) The problem now is: how can T and W be combined to produce a single image for the document?
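Before combining them, the title and author images of secs. 3.1 and 3.2 can be sketched directly from the definitions. Vectors are plain Python lists of length m; the data below are invented.

```python
# Sketch of secs. 3.1-3.2: a title image is a weighted average of its
# terms' images; an author image is the plain average of his title images.

def title_image(term_images, title_freqs=None):
    """Weighted average of the term images. If collection-wide title
    frequencies N_j are supplied, use the inverse-frequency weights
    of method (2); otherwise the equal weights 1/y of method (1)."""
    y = len(term_images)
    if title_freqs is None:
        weights = [1.0 / y] * y
    else:
        inv = [1.0 / n for n in title_freqs]
        weights = [w / sum(inv) for w in inv]
    m = len(term_images[0])
    return [sum(w * t[k] for w, t in zip(weights, term_images)) for k in range(m)]

def author_image(title_images):
    """Average of the author's p title images."""
    p, m = len(title_images), len(title_images[0])
    return [sum(t[k] for t in title_images) / p for k in range(m)]

t1, t2 = [1.0, 0.0], [0.0, 1.0]
print(title_image([t1, t2]))              # equal weights: [0.5, 0.5]
print(title_image([t1, t2], [1, 3]))      # rare term t1 dominates: ~[0.75, 0.25]
```

With frequencies N = (1, 3), the rare first term gets weight 3/4 and the common second term 1/4, illustrating why the paper prefers method (2) for collections large enough to estimate the N_j reliably.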
Again, we resort to an average, but we must first decide on the relative importance of W, the author image, with respect to T, the title image. Therefore, consider: If the author of the given document, say D, has p documents in the collection, then the given document represents 1/pth of the author image, on the hypothesis that all his documents contribute equally to his author image. Hence, a natural weighted average, D^tw, of W and T is

    D^tw = (1/p) W + (1 − 1/p) T,

which will be called the title-and-author image. Loosely speaking, the more documents an author has in the collection, the more varied will their content be, and the less important is the author's name for the purposes of describing a particular document; this fact is incorporated in the expression for D^tw. (Note that if p = 1, i.e., if the given document is the only one that the author has in the collection, then T = W = D^tw, as one would hope.) The use of authors for retrieval is definitely no more than a conjecture, and this, in itself, might justify the light weighting. But note that

    D^tw = (1/p) [ (1/p) Σ_{i=1}^{p} T_i ] + (1 − 1/p) T = (1/p²) Σ_{i≠D} T_i + (1 − 1/p + 1/p²) T.

Thus the title gets a weight of 1 − 1/p + 1/p², and all the other p − 1 titles by the same author get a combined weight of 1/p − 1/p² (1/p² each). Thus if p = 3, the title gets weight 7/9 while all other titles by the same author get a combined weight of 2/9. If, in practice, it turns out that the author "deserves" more weight, then it might be worth another look.

3.3. Citations

The vector D^tw is not quite the final document image, for it has not taken into account the document's citations. To complete the picture, the first step is to list all the titles and authors of the documents which (1) are cited by the given document, and (2) cite the given document. The "cited" list is easy to compile and usually consists only of scanning the bibliography of the document. The "citing" list is impossible to compile unless the collection is "closed."
However, since collections are rarely if ever actually closed, it is preferable not to incorporate this assumption. Thus, the citing list is restricted to those documents which cite the given document and which are in the collection. Except for a brief note in the appendix, we will not be distinguishing between cited and citing, so the two lists may be combined for each document.

How do we use this list of citations? Suppose that for the document D we have the set of q cited documents and r citing documents, denoted D_1, D_2, . . ., D_{q+r}. For each of these q + r documents, compute the corresponding m-dimensional title-and-author image, D^tw (as defined above). The average of these q + r title-and-author images will be called the citation image, D^c, for the document D, i.e.,

    D^c = Σ_{i=1}^{q+r} D_i^tw / (q + r).

The next step is to combine this citation image, D^c, with the given document's own title-and-author image, D^tw. Now, it is often unfortunately true that citations are not very closely related to the contents of the document. In fact, it seems that the more citations we have, the less closely are they, on the average, related to the document in question. This last observation is now incorporated as an assumption: the weight of the citation image will be taken to be inversely proportional to the number of citations. Thus, we finally have that the document image D for a document D is the vector found by taking the weighted average of D^tw and D^c, where D^c, the citation image, is weighted inversely to the number of citations, i.e.,

    D = [1/(1 + q + r)] D^c + [1 − 1/(1 + q + r)] D^tw.

(Note that if there are no citations, i.e., q + r = 0, there is no citation image, and we simply take D = D^tw.) At long last we have arrived at an expression for the document image. Figure 1 provides a summary of the process used to derive this expression. And now we are in a position to construct an image for each document in the collection.
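The final assembly of sec. 3.3 can be sketched as a few lines of vector arithmetic. Vectors are plain lists; the numbers are invented, and the q + r = 0 convention follows the note above.

```python
# Sketch of sec. 3.3: build the final document image from the document's
# title-and-author image D^tw and its citation image D^c.

def citation_image(cited_tw_images):
    """D^c: average of the title-and-author images of the q + r citations."""
    n, m = len(cited_tw_images), len(cited_tw_images[0])
    return [sum(v[k] for v in cited_tw_images) / n for k in range(m)]

def document_image(d_tw, cited_tw_images):
    """D = [1/(1+q+r)] D^c + [1 - 1/(1+q+r)] D^tw; D = D^tw if no citations."""
    if not cited_tw_images:
        return d_tw
    w = 1.0 / (1 + len(cited_tw_images))
    d_c = citation_image(cited_tw_images)
    return [w * c + (1 - w) * t for c, t in zip(d_c, d_tw)]

d_tw = [0.3, 0.7]
print(document_image(d_tw, []))            # no citations: [0.3, 0.7]
print(document_image(d_tw, [[0.6, 0.2]]))  # one citation: equal halves of D^c and D^tw
```

Note how the citation weight 1/(1 + q + r) shrinks as citations accumulate, implementing the assumption that large citation lists are, on the average, less closely related to the document.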
En passant, we have also defined these other images:

    t, the basic index-term image;
    T, the title image;
    W, the author image;
    D^tw, the title-and-author image;
    D^c, the citation image.

In certain applications, these intermediate images will be useful and interesting in themselves. It might be noted that the document image D is a linear function of the index-term image vectors t, where the elements of t are the Δ-distances of the index term from each of the separator index terms t_α, t_β, . . ., t_μ. In fact, D may be written entirely in terms of N_1, N_2, . . ., N_t and N_12, N_13, . . ., N_{t−1,t}, the frequencies of occurrence and co-occurrence of all the t index terms in the vocabulary — this is, of course, in accord with the basic assumption made at the beginning of this paper.

FIGURE 1. From documents to document images. (Flow chart: term occurrences and co-occurrences yield term-term distances and separators, and thence term images; the titles, authors, and citations of the documents then yield title images T, author images W, title-and-author images D^tw, citation images D^c, and finally document images D.)

4. Application

4.1. The Enquiry Image

We now examine how the document images and other images thus generated can be useful for retrieval. Any retrieval operation starts with an enquiry. In most systems the enquiry must be in a closely specified form. One of the great advantages of this proposed system is its extreme flexibility with regard to the form of the enquiry, as will now be shown.

The enquirer is given a preliminary form which is divided into two sections.

Author-names section: In this section the enquirer may write the names of any authors who he believes have some relevance to his problem. He may assign differential weights, stressing certain authors, if he wishes. He is not limited in the number of names he may write down, and he may, if he wishes, leave this section blank.
The only restriction is that he should use only names of authors who are represented in the collection. A list of these authors would be available to the enquirer.

Text section: In this section the enquirer may scribble down any "textual material" which he feels may help in retrieving relevant documents. By "textual material" we mean any titles, sentences, phrases, single words, or what have you. The restriction is that he should not use words which are neither among the system's original index terms nor in the system's glossary of nonsubstantive words (typically prepositions, quantifiers, conjunctions, etc.). Actually, this restriction, and the similar one for the author-names section, may be relaxed if it is assumed that ineligible names and terms can be edited out of the enquiry. The enquirer may assign differential weights to any of the substantive words (index terms) he has written down, either as individuals or in groups. He may leave this section blank if he has not left the other section blank.

This preliminary enquiry form in two sections then goes to the interpreter (possibly mechanical), who has before him the following:

(1) an alphabetic list of the authors represented in the collection. Next to each name is a string of m numbers (all between 0 and 1) representing the author image;

(2) an alphabetic list of all the index terms represented in the titles of all the documents in the collection. Next to each term in the list is a string of m numbers (all between 0 and 1) representing the index-term image;

(3) a form ENQ, which is reproduced in figure 2.

The interpreter then looks up each of the authors cited on the preliminary enquiry. He notes whether the enquirer has assigned weights to the authors' names. If so, he multiplies the m numbers by the stated weight and records them on the form ENQ, repeating this for each author specified by the enquirer.
    Index Term or Author   Wt      1     2     3  . . .  m-1     m
    Smith                   2   1.36  1.44  1.98        1.00  1.64
    Jones                   1    .71   .81  1.00         .50   .71
    cat                     5   3.60  3.95  4.75        2.00  3.15
    house                   2   1.44  1.50  2.00        1.20  1.66
    TOTALS                 10   7.11  7.70  9.73        4.70  7.16
    Enquiry Image                .71   .77   .97         .47   .72

FIGURE 2. Form ENQ (with hypothetical numbers).

If no weights are indicated, then all weights are taken to be 1. The interpreter then goes to the text section of the preliminary enquiry and crosses out any words which are not on his list of index terms. For the words which remain, he enters the m corresponding numbers, duly multiplied by any weighting factor, on the form ENQ. Having done this, he then totals each of the m columns on the form and also totals the weights. Each column total is then divided by the weight total. The resulting m numbers represent the weighted average of the various image vectors, the weights having been chosen subjectively by the enquirer. The enquiry image is the vector represented by the numbers on the last line of the form ENQ.

4.2. Measuring Resemblance

To effect retrieval, it is now necessary to compare the enquiry image with the image of each document in the collection. According to the hypotheses and assumptions made at the very outset and elsewhere throughout this paper, those document images which most resemble the enquiry image are most likely to represent the documents which contain the information relevant to the enquiry.

There are a number of ways of defining the resemblance between two images. One way is to compute the correlation between them, this being the usual method. The higher the correlation, the greater we assume the resemblance to be. Thus, one could compute the correlation between the enquiry image and each of the d document images. These correlations can then be ranked.
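The Form ENQ arithmetic is a weighted average and can be reproduced with the hypothetical numbers of figure 2 (m = 5 columns shown; each row already holds the item's image multiplied by its weight):

```python
# Sketch of the form ENQ computation of sec. 4.1, with figure 2's numbers.

def enquiry_image(rows):
    """rows: list of (weight, weighted m-vector) as entered on form ENQ.
    Column totals divided by the weight total give the enquiry image."""
    total_w = sum(w for w, _ in rows)
    m = len(rows[0][1])
    return [sum(vec[k] for _, vec in rows) / total_w for k in range(m)]

rows = [
    (2, [1.36, 1.44, 1.98, 1.00, 1.64]),   # Smith, weight 2
    (1, [0.71, 0.81, 1.00, 0.50, 0.71]),   # Jones, weight 1
    (5, [3.60, 3.95, 4.75, 2.00, 3.15]),   # cat, weight 5
    (2, [1.44, 1.50, 2.00, 1.20, 1.66]),   # house, weight 2
]
q = enquiry_image(rows)
print([round(v, 2) for v in q])   # [0.71, 0.77, 0.97, 0.47, 0.72], as in figure 2
```

The column totals (7.11, 7.70, 9.73, 4.70, 7.16) divided by the weight total of 10 reproduce the "Enquiry Image" line of the form.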
Then the z documents giving the highest correlations with the enquiry image would be picked as the solution to the retrieval problem; alternatively, all documents having a correlation greater than ρ_0 with the enquiry would be picked. The value of z or ρ_0 would be selected to yield the right blend of precision and accuracy as defined by Giuliano et al. [4]. However, d, the number of documents in the collection, is usually large, and calculating d correlations for every enquiry could be undesirable. Therefore, consider the following alternative method for picking out resemblances. Suppose the enquiry image vector is denoted by (v_1, v_2, . . ., v_m) and a document vector by (u_1, u_2, . . ., u_m). Then, for a preselected ε_0, retrieve only those documents such that |v_j − u_j| < ε_0 for every j.

(Figure: a sample DOCUMENT LISTING, 1140 1362 1363 1589 1627 1809 2431 3976 4885, with the caption "The term-card scanner produces two tools for vocabulary analyses: statistical data . . .")

These will yield term clusters, pairings, relationships, nonrelationships, and correlations. Access time is on a demand basis. In addition, generic relationships are easily established and counts can be made at all levels. How deeply was the input made? What is the percentage at different levels? These are the questions asked by the analyst. His intuitive logic is relied upon to determine which card is to be compared with which card.

Size of the card and ease of handling would seem to impose certain constraints on the report population. In the system under study, the collection was well within the 10,000-item limitation. But many systems may not be so limited. We admit that such a collection would be difficult to handle in its entirety. But in the event that the collection is large, say 50,000, we believe a statistically sound random sampling could be made with a high confidence limit to enable this technique to be used.
Does the other parameter, the number of candidate key words for the vocabulary, impose prohibitive limitations on optical-coincidence use and counting? Of course, large numbers of candidate terms are a problem in any system. Wall [2] puts this into what we consider its proper perspective: "One is inclined to wonder whether all the hundreds of thousands of words in the English language must be included, and if so, one is appalled by the multitude of the task. But, in fact, the vocabulary of science is quite limited. Numerous investigators have pointed out that the vocabulary of any one field of technology is limited to approximately 5,000 terms, that the vocabulary of all technologies is limited to approximately 20,000 terms, and that the whole of human knowledge could be expressed in less than 40,000 terms."

Some of the possible manipulations here are the pairing of key word to scientific field, of key word to responsible laboratory or organization, of key word to key word synonym, and of key word to key word antonym.

FIGURE 3. RADIATION_A = A; RADIATION_B = B.

FIGURE 4. Q_A, Q_B: amount of overlap between Lab A and Lab B.

It should be remembered that this hole-count scanner is not intended to be used to count all of the documents indexed by each and every key word on a card-by-card basis. Initial key word counts will be done more easily and simply by the tabulating machine while running the initial listing. Kurt Lewin [3] never hesitated to advise the student: "Only ask the question in your research that you can answer with the techniques you can use. If you can't learn to ignore questions that you are not prepared to answer definitely, you will never answer any." Indeed, only a small proportion of the key words is subjected to in-depth scrutiny and statistical comparison by the analyst.

Most current thinking in documentation is oriented to the static document and its retrieval.
Another school of thought is being applied to statistically "managing" current work. For example, how many projects are being worked on in a given area? What is the relative funding? Is there any overlap of effort? How much, and specifically where, is it? With the rapid advance of the state of the art of information retrieval in recent years, it is not only possible, but mandatory, to resolve some of these problems. Since the scanner makes it possible to analyze vocabulary development, it is equally simple to interweave into this studies of considerable depth of the work done in different laboratories and research groups within the organization or its contractors, plus the various subgroups within them.

This new approach to statistical vocabulary development provides the analyst, or the decision-making group, with a rather simple tool which, when used in conjunction with his knowledge and imagination, lays the foundation of the information system. Creative simplicity is one approach that we feel should not be overlooked in this age of complexity.

References

[1] Luhn, H. P., A statistical approach to mechanized encoding and searching of literary information, IBM J. Res. and Develop. 1, 313 (Oct. 1957).
[2] Wall, Eugene, A practical system for documenting building research, in Documentation of Building Science Literature, proceedings of a program conducted as part of the 1959 Fall Conferences of the Building Research Institute, Division of Engineering and Industrial Research, NAS-NRC Publ. 791 (1960).
[3] Lewin, Kurt, Field Theory in Social Science: Selected Theoretical Papers, ed. D. Cartwright, p. 29 (Harper, 1951).

A Computer-Processed Information-Recording and Association System

G. N. Arnovick

Planning Research Corporation, Los Angeles, Calif.
As a result of previous research studies analyzing the problems of automatic data association in a man-machine information environment, a set of conditions is defined which represents a system logic concept for automatically processing input data for information content and relevance. The system technique which is presented is the result of several separate research investigations and is defined as a system concept which indicates a possible breakthrough in automatic information association. Automatic syntactical analysis and automatic reference to vocabulary lists may be used to construct a formal operating statement given in equation form, by utilizing current methodologies of machine language translation. Various levels of statistical association can be determined which represent a logically manipulatable information unit. The association system logic which is presented can be conceived as a new and more efficient approach for a computer-processed information-recording and association system.

1. Introduction

In the course of designing an information-processing system, a major problem becomes apparent, namely that of selectively identifying specific information as it is related to information meaning or coherence. The problem is further complicated when one considers the parameters of information control that must process, correlate, or extrapolate data elements in a rational manner. The tasks involved in the information handling of syntactic and semantic variables, and in identifying and relating them to a multiplex of stored items for comparison and correlation purposes, are extremely difficult for a human analyst to process. The analysis and processing of information as described above increase in magnitude when constraints such as effective real-time inputs are part of the system, and data buffering for prolonged off-line operations cannot be tolerated because of loss of information message content over a time continuum.
The information-processing logic and techniques described in this paper are considered and defined as an overall system concept in which system subtasks for automatic information association are computer processed. Significant research and systems development in information association for (1) analysis and (2) machine organization have been reported by G. Salton, V. Giuliano, R. Barnes, H. P. Edmundson, L. B. Doyle, H. E. Stiles, and others. (See references at end of paper.) These findings and the technical methods suggested for information association are taken into account, with the expectation that they can be effectively utilized within an information-processing environment such as conceptually presented in this paper, and that the method or combination of methods to be selected would depend on the application requirements.

To develop an optimum system configuration it is necessary to specify a man-machine information-processing environment in which information recording and association are defined as the major system task. Accordingly, a subsystem task framework is provided for automatic information recording and association based on the utilization of machine language translation (MLT) methods for analyzing recorded information statements. The methods for utilizing MLT employ functional developments which are optimally suited to the system solution.
The technical approach and system design rationale for recording and associating information by the utilization of MLT are dependent on suitable functional solutions and special-purpose processing equipment, and will be dependent on application variations as they relate to (1) real versus non-real-time data handling; (2) file format and organization; (3) semiautomatic or manual processing; (4) cost/system tradeoffs for optimal utilization; (5) memory size and type needed and available; (6) utilization of serial or parallel file processors; (7) random order of data arrival; (8) priority interrupt; (9) on- or off-line operation with a computer; (10) queuing and information distribution.

In summary, a method is described for data analysis which considers information sets as an operating group of formal statements forming part of an input message. The basic approach for a functional system design is based on the use of MLT for analyzing recorded information statements. The system utility is not expressly designed for library/document system solutions as they are related to current automated library requirements. However, the man-machine concepts utilized by the system definition may be practical, with further design constraints, for automatic document content analysis and on-line document browsing for the library of the future, incorporating a man/console/computer system suggested by Dr. D. Swanson at the Airlie Conference on Libraries and Automation, Warrenton, Va., 1963. The proposed system concept is more applicable and suited to the technical and decision-making requirements of information control systems as applied to (1) management information systems; (2) control center management; (3) mission analysis and information processing; (4) simulation.
2. System Concept

A basic requirement for an information storage system is the ability to draw together all the relevant pieces and bits of information in answer to interrogations which may be made at any hierarchical level of relationships. Systems which have, in the past as well as currently, employed simple descriptors and low-level association between those factors allow for the use of relatively simple computer processing. The result of such relatively simple and limited operational capability is a heavy burden on the analyst, who has to determine the relevancy of the retrieved information, much of which is redundant, thereby reducing the relevancy of the retrieved data as well as allowing for nonpertinent information flow.

Previous work on very large automatic information-processing systems utilizing tree-structure techniques expressed as multiplets provided a logically manipulatable information unit. However, the system planning and design for such systems, which theoretically provided complete automatic information handling, were not able to process data automatically as planned. This was due to the inability of the subsystem to maintain automatic logical consistency checks for input-message completeness as verified by a stored item file for data correlation. The item-compare subsystems, expressed as a function of word association pertinent to incoming statements, failed to provide message reasonableness as defined by logical rules for semantic reliability. Thus the system design goals were not satisfactorily met; this appears to limit the possibilities of utilizing automatic input processing. Empirically, there is no doubt that fully automatic systems are inherently limited and must require human analysts as an integral part of the system performance functions. This man-machine interface is mainly centered on the need for human analysts to be in complete control of input-message encoding.
3. Technical Approach

The system design rationale proposed for a computer-processed information-association and recording model is specifically concerned with several major system variables, which are as follows:

1. The system is semiautomatic by definition.
2. Humans (the analysts) are linked to the system.
3. The control element is a man-machine function.
4. The computer's role is defined as a servosystem for rapid processing slaved to the analyst.
5. The inferential technique for information analysis (association and recording) utilizes machine language translation as the major interface between data control and computer processing.
6. The system is relatively dualistic (dependent and nondependent on machine translation methods relative to the time-domain frequency for computer-processed data); e.g., information content may be processed in raw form independent of translation requirements and at select time sequences, and information processing is a control function dependent on the logical algorithms of machine language translation procedures.

The techniques of machine language translation offer a means for automatically analyzing the syntactical structure of sentences. The semantic content of a sentence is dependent both upon the words used and upon their relative order of use. In this instance, automatic syntactical analysis and automatic reference to vocabulary lists (formally equated to hierarchical code lists), governed by a formal set of rules, will be used to construct an operating set of formal statements, expressed in the form shown in eq (1):

{I[t(S·O·A_n) → P]}_1, . . ., {I[t(S·O·A_n) → P]}_n → R    (1)

where:

{ } = operating-level formal statements
R = total stored intelligence item (gives location of storage and acts as a link between statements included in a particular item)
I = field of interest (e.g., strategy, tactics, intelligence, economics, etc.)
t = time of statement or origination of subject or object (A_n may modify)
S = subject taking the action
O = object acted upon or co-subject of intransitive actions
A = action
P = product or result
→ = leads to.

A symbol before a bracket may modify the hierarchical structure of the code elements within that bracket; e.g., I modifies S, O, A_n, and P. t may modify S, O, and A_n but is unlikely to modify P, as this should be chosen to include time-stable terminology. For example, S and O might contain names of countries or cities whose names may be subject to change with time. t itself may express either relative time, as dates, or absolute time relationships such as elapsed time, velocity, rate, etc.

An information set is defined as that group of operating-level formal statements derived from one message input to the system. This can be represented by eq (2):

(X_1, X_2, . . ., X_j)_1, . . ., (X_1, X_2, . . ., X_j)_n → R    (2)

where the X's inside the parentheses stand for some of the symbols defined above, the items inside the parentheses are numerically coded representations of the original statement information, n is the total number of operating-level formal statements in one information set, and R is the set identifier.

This information is compared with the I-file. A nonduplicate statement is stored as a hierarchical structure in the I-file. In the case of a duplicate statement, only the set identifier, R, is stored. The number of R's stored serves to reinforce the validity of the corresponding statement. The hierarchical structure is now modified by interchanging S and I; the above process is then repeated, using the S-file. This process continues until all six combinations of I, S, O, A_n, P, and t have been exhausted. The last combination of items will be sorted in the t-file. These six files will enable rapid retrieval of information based on any one of the six categories.
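In modern terms, the six-file storage cycle just described can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the field values, the set identifiers, and the use of an in-memory dictionary per file are all assumptions made for the sketch.

```python
from collections import defaultdict

CATEGORIES = ["I", "S", "O", "A", "P", "t"]  # the six item categories of eq (1)

class StatementStore:
    """One file per leading category; duplicates accrue extra R identifiers."""
    def __init__(self):
        self.files = {c: defaultdict(list) for c in CATEGORIES}

    def store(self, statement, r):
        """statement: dict with keys I, S, O, A, P, t; r: set identifier R."""
        for lead in CATEGORIES:
            # interchange items so `lead` comes first, then file hierarchically
            order = [lead] + [c for c in CATEGORIES if c != lead]
            key = tuple(statement[c] for c in order)
            # a duplicate statement stores only its R; the count of stored
            # R's reinforces the validity of the statement
            self.files[lead][key].append(r)

    def retrieve(self, category, value):
        """Rapid retrieval based on any one of the six categories."""
        return {key: rs for key, rs in self.files[category].items()
                if key[0] == value}

store = StatementStore()
stmt = {"I": "economics", "t": "1963", "S": "nation-A",
        "O": "nation-B", "A": "negotiates", "P": "treaty"}
store.store(stmt, r="R1")
store.store(stmt, r="R2")  # same statement arriving in a second input message
```

Retrieval by any one of the six categories then reduces to a lookup in the file keyed by that category, as the text claims.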
The system concept, expressed as a subtask of file identification and flow of data sequence for input analysis and operations, is shown in figure 1.

FIGURE 1. Input information-processing flow.

Various index files of the formal statements will be derived from the combination of input data and logical decisions applicable to those data by the system or by the human analyst. A file of logical statements will be created to serve as a check upon the reasonableness of incoming statements. For example, an input statement regarding the movement of the troops of one nation through the territory of another nation cannot be considered as reasonable unless (a) these two nations have some treaty or agreement regarding such movements; (b) these two nations are at war with each other; or (c) one of these nations is in a critical geographical location with respect to some aggressor nation. Such statements may themselves be derived from verified input data. Any contradiction of the stored logical rules of reasonableness, any lack of completeness, or other inherent defects of the statement would be sensed automatically and cause the statement to be transmitted to the analyst for further investigation.

4. Application and System Extension

At this time, an analysis of the proposed association and recording system concept suggests several areas of possible applications.
Some of the more immediate applications suggested by the system are (1) mathematical simulation of syntactical variables for weighting functions expressed as probabilistic association events; (2) the utilization of the proposed model in screening data redundancy for management information systems; (3) the extrapolation of select associative terms related to message identification; (4) the use of MLT models for programming conversion suitable to input data format; (5) utilization of MLT techniques as a man-machine system for information concept building; (6) generation of a compiler for common message translation which is computer-independent as to type of equipment; (7) automatic thesaurus generation and development. These applications are logically possible and represent a potential breakthrough for current problems in information handling and manipulation. The technical problems associated with such projected functions are not easy, as the solutions required do not deal with simple data, but rather with complex sets of data, expressed as information for human understanding.

Further study for the development and implementation of a computer-processed information-association and recording system is needed at this time. It is recommended that a study program be initiated which would allow for the systematic development of functional tasks that are logically related to each other as chronological steps for each subevent in the total analysis effort. The major analysis criteria are as follows:

— Analyze various kinds of information to be used for the system.
— Determine various relevancy requirements and techniques for total information match.
— Study and analyze various methods for recording hierarchical relationships of data.
— Analyze various methods of syntactical analysis appropriate to the system.
— Determine methods for establishing the equivalence of statements on the basis of syntactical analysis and hierarchical relationships.
— Ascertain the appropriate man-machine interface requirements.
— Design an information system model.
— Describe the logic and computer program to simulate and test an information-recording and association model.
— Recommend methods of implementing the system.

5. References

1. Arnovick, G. N., A linear programming model for system design, EDP Division, Radio Corporation of America, Cherry Hill Laboratories, Cherry Hill, N.J., AATM-1 (Apr. 1960).
2. Arnovick, G. N., Advanced techniques for information search and retrieval systems, Data Systems Division, Radio Corporation of America, Van Nuys, Calif., EM-61-583-6 (June 1962).
3. Arnovick, G. N., Information processing systems: system and subsystem characteristics, Space and Information Systems Division, North American Aviation, Inc., Downey, Calif., SID-63-1362 (Jan. 1963).
4. Arnovick, G. N., New machine development for information processing system, presented at NATO Advanced Study Institute on Automatic Document Analysis, Venice, Italy (July 1963).
5. Arnovick, G. N., J. A. Liles, and J. S. Wood, Information storage and retrieval: analysis of the state-of-the-art, presented at 1964 Spring Joint Computer Conf., Washington, D.C. (Apr. 21-23, 1964).
6. Bernier, C. L., Correlative indexes I. Alphabetical correlative indexes, Am. Documentation 7, 283-288 (1956).
7. Borko, Harold, The construction of an empirically based mathematically derived classification system, System Development Corporation, SP-585 and FN-6164 (Oct. 26, 1961).
8. Doyle, L. B., Indexing and abstracting by association, Am. Documentation 13, No. 4 (Oct. 1962).
9. Giuliano, V. E., Automatic message retrieval by associative techniques, Preprints of 1st Congr. on Information System Sciences, The MITRE Corporation (Sept. 1962).
10. Giuliano, V. E., Analog networks for word association, BIONICS Conf., March 1963.
To appear in IEEE Trans. Mil. Elec.
11. Giuliano, V. E., and P. E. Jones, Linear associative information retrieval, ch. 2 of Howerton and Weeks, Vistas in Information Handling 1 (Spartan Books, Washington, D.C., 1963).
12. Maron, M. E., Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8, No. 3 (July 1961).
13. Maron, M. E., and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. Assoc. Comp. Mach. 7, No. 3 (July 1960).
14. Quillian, Ross, A revised design for an understanding machine, Mechanical Translation (MT) 7, No. 1 (July 1962).
15. Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, No. 2 (Apr. 1961).
16. Vazsonyi, Andrew, Automated information systems in planning, control and command, presented at 10th Intern. Meeting of Inst. of Management Sci. (TIMS), Tokyo, Japan (Aug. 21-23, 1963).
17. Yngve, Victor H., A model and an hypothesis for language structure, Proc. Am. Phil. Soc. 104, No. 5 (Oct. 1960).

3. Applications to Citation Indexing

Statistical Studies of Networks of Scientific Papers

Derek J. de Solla Price
Yale University
New Haven, Conn.

A statistical analysis is made of the way in which papers are linked together by the citation of one paper by another. The distributions of numbers of references and of numbers of citations per paper are estimated, and from this a general structure of the network is derived. Every paper, once published, is cited on the average about once per year. The linking of papers is such, however, that an Immediacy Effect tends to join new papers to relatively recent ones rather than to the entire available body of literature. Perhaps half the literature is of the immediate type and the other half "immortal record." The nature of the research front is shown to correspond to a fabric of knitted strips, the width of each strip being such that it corresponds to the work of a few hundred men at any one time.
These form natural parcels of subject matter.

Can Citation Indexing be Automated?

Eugene Garfield
Institute for Scientific Information
Philadelphia, Pa. 19106

The main characteristics of conventional language-oriented indexing systems are itemized and compared to the characteristics of citation indexes. The advantages and disadvantages are discussed in relation to the capability of the computer automatically to simulate human critical processes reflected in the act of citation. It is shown that a considerable standardization of document presentations will be necessary, and probably not achievable for many years, if we are to achieve automatic referencing. On the other hand, many citations, now fortuitously or otherwise omitted, might be supplied by computer analyses of text.

This paper considers whether, by man or machine, we can simulate the process of "documenting," the process by which authors provide reference citations to pertinent and usually earlier documents. My paper does not concern the manipulative or mechanical problems of automatically compiling or printing citation indexes. The existence of the Science Citation Index is adequate testimony to the ability of the computer rapidly to sort, edit, and print large-scale citation indexes [1].¹

My paper also does not consider the problem of automatically recognizing (reading) and/or extracting explicit citations appearing in published documents by use of character-recognition devices. Programming such a device will require the resolution of fantastic syntactic problems even if the machine has a universal multifont reading capability. For example, in the citation "J. Chem. Soc. 1964, 1963," which number is the year and which the page number? These are not trivial problems. To handle the vagaries of bibliographic syntax we "pre-edit" all documents before key-punching the citation data needed for the Science Citation Index. We also "post-edit" by both computer and human editing procedures.
Do not confuse the "automatic" or "routine" nature of citation indexing with a syntactically intelligent automaton. Our citation indexers do not require subject-matter competence, but they do require considerable bibliographic training. The diverse and unstandardized citation practices in the world's literature make this necessary. In addition, there are linguistic variations in names and publication titles which must be handled. Our citation indexers essentially must be trained in descriptive cataloging.

My paper does concern the ability of an artificially intelligent machine to deal with, among other things, the implicit reference citation as distinguished from the explicit reference citation. Such might be the case in a paper where the author, for one reason or another, has neglected to provide a pertinent bibliography. The editor of a scientific journal would ask such an automaton to supply all "pertinent" references, if for no other reason than to make certain the research was original. Citations are generally used to provide "documentation" or support for specific statements. However, reference citations are also provided in papers for numerous reasons including, among others:

1. Paying homage to pioneers
2. Giving credit for related work (homage to peers)
3. Identifying methodology, equipment, etc.
4. Providing background reading
5. Correcting one's own work
6. Correcting the work of others
7. Criticizing previous work
8. Substantiating claims
9. Alerting to forthcoming work
10. Providing leads to poorly disseminated, poorly indexed, or uncited work
11. Authenticating data and classes of fact (physical constants, etc.)
12. Identifying original publications in which an idea or concept was discussed
13. Identifying original publication or other work describing an eponymic concept or term, e.g., Hodgkin's disease, Pareto's Law, Friedel-Crafts Reaction, etc.

¹ Figures in brackets indicate the literature references at the end of the paper.
14. Disclaiming work or ideas of others (negative claims)
15. Disputing priority claims of others (negative homage)

The problem of identifying all "pertinent" references, to support implicit citations, is a special case of the general problem of automatic indexing. It has previously been reported that machines can index or abstract by use of key words in context taken from titles [2], by use of statistically significant sentences [3], kernels [4], etc. O'Connor has recently reviewed these methods [5], as has Artandi [6]. Associative methods have been widely discussed by Stiles [7], Maron [8], Giuliano [9], etc. All of these systems, however, are concerned with indexing by use of the text only. Bibliographic citations are regarded as metalinguistic elements.

Recently, however, Salton [10] has discussed the use of bibliographic citations as indicators of document content. Essentially he proposes to treat citations as descriptors, which may seem strange to those who think in terms of conventional indexing. Indexers do not ordinarily think of citations (addresses of cited documents) as descriptions of the citing document. However, that does not alter the fact that they are [11]. Citations (document addresses) are brief representations of the documents they identify. As one sacrifices compactness, such as is found in serial numbers for patents [12], and expands to full titles and then to abstracts, one sees the gradual enlargement of the document description toward the complete text. In this transition from "citation" to "document," redundancy is introduced as well as additional information content. Indeed, a document and a citation approach equality as the depth of indexing decreases (from the full text) and the length of the citation increases. This corresponds to my earlier definition of the document as the set of descriptors which describe it [13].
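Salton's proposal, as Garfield summarizes it, can be illustrated with a toy sketch. Everything here (the document names, descriptor labels, and overlap scoring) is invented for illustration; the point is only that reference citations can be matched exactly like assigned descriptors.

```python
# Each document is represented by its descriptor set; bibliographic
# citations ("cite:...") sit alongside conventional subject headings.
index = {
    "docA": {"subj:indexing", "cite:Luhn-1958", "cite:Maron-1961"},
    "docB": {"subj:retrieval", "cite:Maron-1961"},
    "docC": {"subj:chemistry"},
}

def search(query):
    """Rank documents by number of matching descriptors, citations included."""
    scores = {doc: len(desc & query) for doc, desc in index.items()}
    return [doc for doc, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s > 0]

# A query made of citations alone retrieves docA and docB but not docC.
```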
In an information retrieval system, information content can be measured only on the basis of indexed information that is supplied in the indexing process. By this definition a document is a unique combination of descriptors not assigned to any other document in the collection. In most thesaurus-based collections indexing is not sufficiently deep to achieve such uniqueness. However, the combination of conventional subject headings or descriptors with the bibliographic citations used as references increases our ability to describe documents uniquely and specifically. Indeed, those who have studied citation indexes and so-called bibliographic coupling are well aware that only a small number of reference citations are needed to isolate uniquely a particular document in the collection from all others [11]. That is why a search of a citation index generally produces a highly selective and useful search result.

In discussing citation indexing it is frequently stated that weaknesses of the method include under-citation (the deliberate or unwitting failure to cite pertinent literature) and over-citation (the excessive reference to presumably nonpertinent literature). Under-citation is illustrated by the patent literature, since there is an economic motivation to cloud rather than clarify the information disclosed in a patent. However, the patent examiner, otherwise motivated, attempts to clarify the prior art by providing a list of "references cited" [14]. Suppose, however, the patent examiner, or a journal editor, wishes to examine a document quite critically and asks that the "machine" provide all the pertinent documentation or prior art. This brings me once again to the main theme of my paper.

To answer the question "Can citation indexing be automated?" as we have seen, obviously entails a discussion of the entire range of question-answering problems encountered in designing any information retrieval system.
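The bibliographic-coupling observation above, that a handful of shared references suffices to isolate a document, can be sketched numerically. The collection and its reference lists below are invented; only the counting idea comes from the text.

```python
def coupling_strength(refs_a, refs_b):
    """Kessler-style coupling: number of references two papers share."""
    return len(set(refs_a) & set(refs_b))

collection = {
    "paper1": ["r1", "r2", "r3", "r4"],
    "paper2": ["r2", "r3", "r9"],
    "paper3": ["r9"],
}

def isolate(target_refs, collection, threshold=2):
    """Documents sharing at least `threshold` references with the target."""
    return [doc for doc, refs in collection.items()
            if coupling_strength(target_refs, refs) >= threshold]

# Just two shared references already single out paper1 and paper2
# from everything else in this (tiny) collection.
```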
Consideration of the automatic procedure for supplying reference citations, when they are missing, merely focuses attention on the complex indexing task performed by the author when he does give pertinent reference citations. Such considerations help us focus attention on the significant differences between a priori and a posteriori indexing [15]. Since each person may interpret the meaning or significance of words and documents differently, the problem we are dealing with inevitably involves the human ability to create novelty, to invent, to discover, and to be critical. Are machines, or machinelike people, capable of imitating or simulating the human process of being critical? What are the peculiarly "human" earmarks of certain sentences containing citations? When do such sentences contain implicit citations that could be supplied by an intelligent machine, and when would this appear to be difficult or impossible?

Consider the following example: "Mr. X, an impossible idiot, has recently published a paper on gobbledegook. The conclusions reported in his paper are wrong, as are the data on which the conclusions are based. The recommendations made by Mr. X, on the basis of his conclusions, will be a calamity for mankind." In polite circles, this is called the critical review.

Obviously, "intelligent" machines are not yet ready to generate such criticism. Or at least programmers are not yet able to program machines to prepare such critiques. If they were, then the paper by Mr. X would probably never have appeared, because the same artificial intelligence would have been available to tell him that his data were wrong before he published, and why! (If he persisted in publishing, we probably would have identified a quality common to humans but invariably attributed to machines: stupidity.) The first sentence in the example illustrates the case for an implicit citation that our machine ought to be able to provide.
What could be more simple than the kernel sentence "Mr. X has published," which one would hope could be the result of a transformational analysis [4] when such methods are perfected. Such an analysis, combined with a complete computer listing of the papers by Mr. X, is a good starting point. Since we know that this is not sufficiently specific, we must then expect of the linguistic analysis "Mr. X has published on gobbledegook," and then we have reduced the computer search to the "simple" task of narrowing the thousands of papers by men named X to those which concern gobbledegook. Alas, this simple task alone requires the resolution of all the linguistic and semantic problems associated with matching the word "gobbledegook" with the possibly different words in the title of the implicitly cited paper or book. Indeed, there is no reason at all to assume the same word has occurred either in the title or the text of the "cited" work. If these problems were not sufficient, keep in mind that the word "recently" is quite significant in the example chosen, because it stresses the possibility that Mr. X may have written extensively on gobbledegook and it is only one particular, or a few recent, papers that are the target for discussion.

Fortunately, authors usually do provide, explicitly, the citations needed to support such sentences. As a consequence the citation index, created by human indexers, does correlate the cited work with the critical statements which appear in the second and third sentences of the example paragraph. This feature of the citation index alone would have justified its creation. However, it is interesting to speculate whether transformational or any other automatic analysis of such a paragraph could produce a useful additional "marker" which would describe briefly the kind of relationship that exists between the citing and cited documents.
These "markers" would appear in the published citation index along with the usual citation data. In the case of the paragraph above, for example, "critique" or one of several other terse statements like "Mr. X is wrong," "data spurious," "conclusions wrong," "calamity for mankind," etc., might be appropriate. The "intelligent" machine would examine a new document and generate a critical statement such as "rather poor paper." As we have seen above, a less intelligent machine might analyze the paragraph and conclude that a bibliographic citation to the work of Mr. X is missing and needed. The machine might also conclude that the cited work was under "critical" discussion because of certain syntactic or vocabulary characteristics associated with criticism. Presumably these would be identified by transformational or other sophisticated analyses not yet available. This would be no mean accomplishment. Among other nontrivial problems is the fact that the information needed to assign the marker can be spread throughout the source paper, not confined to a single sentence. O'Connor's studies on the term "toxicity" are quite pertinent to this problem, because the problems have in common the need to discover methods for assigning descriptions of documents which are subject to considerable variation [16]. What is toxic to one man may be euphoric to another!

To examine a document from the "citation" point of view, to determine what reference citations could or should be provided which link the sentence, phrase, or word in question to man's prior recorded knowledge, is, to say the least, a formidable challenge. The task is an excellent exercise for new journal editors. To follow the "citation" method of appraising a paper is in essence to challenge rigorously each statement in that paper. If an author does not provide documentation for statements, it does not mean that they are false.
However, they should ideally be supported by a "reference" to some prior document, conversation, etc. It would appear that in the "ideally" documented paper almost every sentence or phrase could be interpreted to require reference to the past. While one can accept intuitively the notion that there are novel sentences that one can express in English, novel concepts appear to be comparatively rare. Most novel combinations of words, punctuation, etc., could be transformed into concepts that had appeared before. Indeed, patent examiners like to remind inventors of this when disclosing generic concepts, alone or in combination, which anticipate specific embodiments.

I recently did an experiment with a group of my students at the University of Pennsylvania in which I asked them to read a paper published in the Journal of Chemical Documentation [13] which contained no bibliographic citations. The reason this paper did not have a bibliography is simple, and many published papers lack bibliographies for similar reasons: the paper was originally presented at a meeting. The editor of the journal asked for a copy, but it was published without the bibliography, which obviously was not needed in the oral presentation. Each student was asked to supply the missing bibliography for this paper. Twelve students were involved in the experiment. One student assigned 12 references while another assigned 75. The average was about 40. This is not surprising, as a considerable amount of literature was reviewed in the paper. The bibliography could have been expanded to hundreds of items if the common German practice were adopted of giving a complete list of papers every time a topic is mentioned. Thus, in a discussion of information theory where I felt one citation was sufficient, someone else might have cited numerous related works.

The comments above are intended to give you a feeling for the problem we face in automating citation indexing.
It is a wide-open area of research, and it will take us into every fundamental area of textual analysis, something comparable to exegesis [17]. It is apparent that each author restricts his use of reference citations according to the importance he places on the statements involved. From our knowledge of quantitative citation data, a doubling or trebling of the number of citations in the average paper would not overload the system from the user's viewpoint. The average paper that was cited in 1961 was cited about 1.5 times [18]. To double the amount of citation would not even double this figure, because the papers cited would not be exactly the same set. However, even if we did significantly increase the average number of references to a particular work, we would then give consideration to a more specific approach to citations. This is well illustrated in the citations to books, where one finds the list of sources subdivided by the page cited. This only adds an additional dimension to the specificity of citation indexing. There is no reason why this same principle cannot be extended to the paragraph, sentence, or word. Indeed, this is exactly what happens in exegesis.

References

[1] Garfield, E., and I. H. Sher, Science Citation Index, 2672 pp. (Inst. for Scientific Information, Philadelphia, Pa., 1963).
[2] Luhn, H. P., Keyword-in-context index for technical literature (KWIC Index), ASDD Rept. RC-127 (IBM, Yorktown Heights, N.Y., Aug. 31, 1959).
[3] Luhn, H. P., The automatic creation of literature abstracts, IBM J. Res. and Devel. 2, 159-165 (1958).
[4] Harris, Z. S., Linguistic transformation for information retrieval, Proc. Intern. Conf. Sci. Inform. 1958, vol. 2, 937-950 (Natl. Acad. Sci., Washington, D.C., 1959).
[5] O'Connor, J., Mechanical indexing methods and their testing, AD 409276, J. Assoc. Comp. Mach. 11, 437-449 (1964).
[6] Artandi, S., A selective bibliographic survey of automatic indexing methods, Special Libraries 54, 630-634 (1963).
[7] Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271-279 (1961).
[8] Maron, M. E., Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8, 404-417 (1961).
[9] Giuliano, V. E., Analog networks for word association, IEEE Trans. Mil. Elec. MIL-7, 221-234 (1963).
[10] Salton, G., Associative document retrieval techniques using bibliographic information, J. Assoc. Comp. Mach. 10, 440-457 (1963).
[11] Garfield, E., The science citation index — new dimension in indexing, Science 144, 649-654 (1964).
[12] Garfield, E., Forms for literature citations, Science 120, 1030-1040 (1954).
[13] Garfield, E., Information theory and other quantitative factors in code design for document card systems, J. Chem. Doc. 1, 70-75 (1961).
[14] Garfield, E., Breaking the subject index barrier — a citation index for chemical patents, J. Patent Office Soc. 39, 583-595 (1957).
[15] Garfield, E., Citation indexes — new paths to scientific knowledge, Chem. Bull. (Chicago) 43, No. 4, 11-12 (1956).
[16] O'Connor, J., Mechanical indexing studies of MSD, toxicity (DDC No. not yet assigned; contact author for copies c/o Inst. for Scientific Information).
[17] Garfield, E., Citation indexes to the Old Testament, Am. Documentation Inst. (Nov. 1955).
[18] Garfield, E., Citation indexes in sociological and historical research, Am. Documentation 14, 289-291 (1963).

Some Statistical Properties of Citations in the Literature of Physics¹

M. M. Kessler
The Libraries, Massachusetts Institute of Technology
Cambridge, Mass.

The bibliographic sources in a number of physics journals are analyzed. The frequencies of inter-citation between the journals, expressed as percentages, are arranged in a matrix. It is postulated that the properties of this matrix may be used to define a functionally related family of journals.
The Technical Information Project of the M.I.T. Libraries is engaged in the design of a working model of a technical information system that will serve a local community of scientists on a test basis. The choice of an experimental body of literature became a crucial question in the design of the system. It was recognized that the literature must be large enough to provide a realistic search situation, and yet it should not be too large for model operation. The physics periodical literature was chosen as the experimental corpus for the model library. The choice of specific journals was based on the associative statistics of the various journals; the criterion of association was the frequency of inter-journal references. The design of the retrieval system, its components, and operations will be described in a forthcoming report. The present paper is concerned with the statistics and association measures that gave guidance in the choice of an experimental literature.

The statistics presented in this paper are based on a study of the citations in 36 volumes of the Physical Review (Vol. 77, 1950 to Vol. 112, 1958). These volumes contained 8521 articles that yielded 137,108 references to 805 sources. Spot studies were made on 18 other journals. Except for minor editing to eliminate misprints, duplications, and obvious errors, the given data are exactly as copied from the journals. Repetitions due to lack of standardization in notation or abbreviations were left unchanged. Such repetitions are common in references to the foreign literature, particularly the Russian.

These data must not be interpreted as a definitive list of periodicals but rather as a sample of the operational literature of a large number of research physicists who publish in the Physical Review and other journals. As such, it sheds light on the collective nature of the working literature of physics and provides significant guidance for the design of a science communication network.
It is from this point of view that the data were of most interest to the author. Table 1 is a summary of the statistical highlights of the references in the Physical Review. Table 2 lists the titles in order of decreasing frequency of citation. The first column in Table 2 (order number) locates the title along the frequency scale.

1 This work was sponsored by the National Science Foundation and in part by Project MAC, the experimental computer facility at M.I.T., which is sponsored by ARPA.

The second column (frequency) indicates the number of times the title was referred to in the 36 volumes of the Physical Review. The last column is the title of the source as it appeared in the literature. Table 2 does not list those titles that occur only four times or less. We draw three conclusions from the statistics of this list:

A. There exists a definitive journal (J0), in our case the Physical Review, that occupies a unique and dominant position as the most-referred-to source.

B. The definitive journal plus a relatively small number of additional titles account for the overwhelming majority of all the references. In our case the Physical Review plus 55 titles, out of a total list of 805 titles, account for 95 percent of the source material. The significant property that this class of journals shares with J0 is stability in time. The same list of 55 journals (plus J0) will account for the majority of references year after year.

C. The remaining 5 percent of the references go to a large and ever-growing list of rarely used sources. Unlike the titles in Groups A and B, this list has no stability in time; each new volume examined yields some 15 to 20 new titles. This phenomenon is illustrated in Table 3. The total number of references to the periodic literature in the 36 volumes was 113,997. The titles that appeared in Vol. 77, the first volume examined, account for 107,385 references.
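Conclusions B and C lend themselves to a small computational illustration. The Python sketch below uses invented counts, not Kessler's data: one function finds how many top titles cover a given fraction of all references, and the other tallies how many never-before-seen titles each newly scanned volume contributes.

```python
# Hedged sketch of conclusions B and C (all counts invented for illustration).
def titles_covering(freqs, fraction):
    """Number of most-cited titles needed to cover `fraction` of all
    references; `freqs` maps title -> citation count."""
    total = sum(freqs.values())
    covered = 0
    for i, count in enumerate(sorted(freqs.values(), reverse=True), start=1):
        covered += count
        if covered >= fraction * total:
            return i
    return len(freqs)

def incremental_new_titles(volumes):
    """volumes: list of (volume_id, set of cited titles); returns, for each
    volume, the number of titles not seen in any earlier volume."""
    seen, out = set(), []
    for vol, titles in volumes:
        out.append((vol, len(titles - seen)))
        seen |= titles
    return out

freqs = {"J0": 680, "J1": 120, "J2": 90, "J3": 60, "rare-1": 3, "rare-2": 2}
vols = [(77, {"A", "B", "C"}), (78, {"A", "D"}), (79, {"B", "D", "E"})]
print(titles_covering(freqs, 0.95))      # 4
print(incremental_new_titles(vols))      # [(77, 3), (78, 1), (79, 1)]
```

With these invented counts, four titles already cover 95 percent of the references, while later volumes contribute only one new title each, mirroring the Class B/Class C behavior described above.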
In other words, the titles that appear in the first volume examined are destined to carry 96 percent of the references in the subsequent 35 volumes. As we examine those subsequent volumes, 78-96, it is clear that although the list of new titles never ends, their contribution to the total reference literature is comparatively small. The investigation was extended to journals other than the Physical Review but related to it. Table 4 shows the distribution of citations between titles previously coded (i.e., those encountered in the Physical Review study) and new titles. These data are much like those in Table 3, indicating that these journals contribute to the list of titles of Class C but share the same Class B journals. An established, well-edited journal is not a static and isolated phenomenon. It is an active carrier of information within the community of scientific workers. Thus, a given journal relates to a family of journals by referring to them and in turn serving as a source for others. There is a two-way flow of information between any two journals which is a measure of their correlation. In our analysis, we shall use the following notation: Jm, m = 0, 1, 2, 3, ..., k, represents a list of journals. J0 is the definitive journal. Jmn is the percentage of references in Jm to Jn. We can construct a matrix that shows the flow of information between the individual journals in the list. Figure 1 is a schematic representation of such a matrix. A column such as J3n (m = 3, n variable) represents the distribution of references in J3 among the list of journals Jn. A row such as Jm3 (m variable, n = 3) represents the references of a list of journals, Jm, to the specific journal J3. Jmm, the diagonal of the matrix, represents in each case the references of a journal to itself. Thus, J00 refers to the percentage of references in the definitive journal to itself. FIGURE 1. Matrix representation of information flow between journals.
(See text for the meaning of Jmn.)

        J0    J1    J2    J3   ...   Jk
  J0   J00   J10   J20   J30  ...   Jk0
  J1   J01   J11   J21   J31  ...   Jk1
  J2   J02   J12   J22   J32  ...   Jk2
  J3   J03   J13   J23   J33  ...   Jk3
  ...
  Jk   J0k   J1k   J2k   J3k  ...   Jkk

We shall define a family of journals, and the position of each member relative to all others in the family, by means of a matrix such as that in figure 1, using percentages of references for the Jmn's. Figure 2 is an illustrative example of such a family. The numbers in figure 2 are relative percentages for illustration only and do not represent any particular case. Referring to figure 2, we generalize that a family matrix of journals may be generated by a definitive journal. A journal matrix constitutes a family if it has a strong upper left-hand corner (J00), a strong diagonal, a strong upper row, and if each column adds up to about 50 percent. Formally, we may characterize a family matrix by the following:

a. Jm0 = 15 percent
b. J00 = 30 percent
c. Sum of Jmn = 50 percent, n = 0, 1, 2, ..., k (m is any member of the family and n includes all the other members, ending at Jk).

We can define several classes of journals within the matrix (refer to fig. 2).

Class 1. J0, the definitive journal, as previously defined.
Class 2. J1, J2, J3: a group of journals that, in addition to being strongly coupled to J0, are also strongly mutually coupled among themselves. In this region Jmn = Jnm.
Class 3. J4, J5: a group of journals that refer strongly to J0 and to J1-J3 but are not strongly referred to by others. Jmm, however, is strong.
Class 4. All others, J6-J9. These journals do not satisfy the conditions for inclusion in this particular family. Within this last group we note three phenomena, depending on the magnitude of Jmm:

FIGURE 2. Illustrative example of a journal family matrix

        J0   J1   J2   J3   J4   J5   J6   J7   J8   J9
  J0    30   15   15   15   15   15
  J1     5   15    5    5    5    5
  J2     5    5   15    5    5    5
  J3     5    5    5   15    5    5
  J4     1    1    1    1   15    1
  J5     1    1    1    1    1   15
  J6                              15
  J7
  J8                                        30   15
  J9                                         5   15
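The Jmn bookkeeping and the family conditions a-c above can be sketched in a few lines of Python. Everything here, including journal names, raw counts, and the exact numerical thresholds, is an invented illustration of the paper's criteria, not Kessler's data.

```python
# Sketch of the Jmn matrix and the family test (all values illustrative).
def citation_matrix(raw_counts):
    """raw_counts[m][n] = number of references in journal m to journal n.
    Returns J[m][n] = percentage of m's references going to n."""
    J = {}
    for m, row in raw_counts.items():
        total = sum(row.values())
        J[m] = {n: 100.0 * c / total for n, c in row.items()}
    return J

def in_family(J, m, members, j0="J0"):
    """Conditions a-c: strong reference to the definitive journal, strong
    definitive self-reference, and a column sum of roughly 50 percent."""
    col_sum = sum(J[m].get(n, 0.0) for n in members)
    return (J[m].get(j0, 0.0) >= 15.0 and    # a. Jm0 >= ~15 percent
            J[j0].get(j0, 0.0) >= 30.0 and   # b. J00 >= ~30 percent
            col_sum >= 45.0)                 # c. sum of Jmn ~= 50 percent

raw = {
    "J0": {"J0": 300, "J1": 150, "J2": 150, "other": 400},
    "J1": {"J0": 150, "J1": 150, "J2": 150, "other": 550},
    "J7": {"J0": 20, "J7": 10, "other": 970},
}
J = citation_matrix(raw)
members = ["J0", "J1", "J2"]
print(in_family(J, "J1", members), in_family(J, "J7", members))  # True False
```

Here J1 passes all three tests while J7, which directs only 2 percent of its references to the definitive journal, is excluded, as in Class 4 above.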
a. J66 = 15 percent: although J6 does not fit into this family, it may well fit into some other family.
b. J77 = 0: the expectation is low that J7 will fit into any family matrix.
c. J88 = 30 percent: J8 is very likely to act as J0 for a new family, and indeed is showing signs of starting that family with J9.

Figure 3 is a family matrix of actual journals. The main difference between it and the illustration of figure 2 is that the boundaries between the classes are gradual transitions rather than sharp lines. This is of course to be expected where definitions depend on statistical properties. The regions are nevertheless recognizable and the family structure clear. Referring to figure 3, we note the strong diagonal, since we chose only journals of some character and standing in the field. The family matrix is generated by the Physical Review (J0); J00 = 47 percent. A strong Jm0 row extends from J0 to J15, where we have drawn the family line. J1 to J8 represent the Class 2 journals, namely, strong contributors and receptors of information within the family. J9 to J13 are strong receptors but negligible contributors. (Note, however, that Jmm is still strong.) Within the family each column sum of Jmn, n = 1, 2, ..., adds up to about 50 percent. Journals outside this family include J19, which shows signs of starting a new family extending up to J14. Two journals, J14 and J15, belong to both families. It is our hypothesis that the location of a journal in a family matrix is a quantitative measure of the probability that the journal will carry a specific type of information.

FIGURE 3. Reference matrix of a family of journals
The journals of the figure 3 matrix, in order: J0 Phys. Rev.; J1 Proc. Phys. Soc.; J2 Phys. Rev. Letters; J3 J. Appl. Phys.; J4 Sov. Phys.-JETP; J5 Physica; J6 Nuovo Cimento; J7 Zeit. Physik; J8 Progr. Theor. Phys.; J9 Sov. Phys.-Sol. State; J10 Can. J. Phys.; J11 Czech. J. Phys.; J12 Phys. Fluids; J13 J. Phys. Soc. Japan; J14 Proc. Roy. Soc.; J15 J. Chem. Phys.; J16 Can. J. Chem.; J17 J. Chem. Soc.; J18 J. Phys. Chem.; J19 J. Am. Chem. Soc. The diagonal self-reference entries Jmm range from 9.1 percent (Can. J. Phys.) to 47.2 percent (Phys. Rev.).

Table 1. Statistical summary of citation sources in the Physical Review

Material examined: Physical Review, Vol. 77, 1950 to Vol. 112, 1958, inclusive.
Total number of articles: 8521
Total number of journal titles referred to: 805
Total number of references: 137,108. Of these,
68,162 references were to the Physical Review;
11,695 were to private communications and unpublished works;
9,191 to books;
1,929 to reports and memoranda;
296 to theses;
4,252 to Reviews of Mod. Physics;
3,725 to Proc. Roy. Soc. (London);
7,072 to 3 titles each used 2000-2999 times;
12,957 to 9 titles each used 1000-1999 times;
12,377 to 43 titles each used 100-999 times;
1,642 to 25 titles each used 50-99 times;
1,107 to 32 titles each used 25-49 times;
1,304 to 79 titles each used 10-24 times;
595 to 88 titles each used 5-9 times;
523 to 519 titles each used 4 times or less.

Table 2. List of journal titles cited in Physical Review, Vol. 77-Vol. 112 (arranged in order of decreasing frequency)

Order  Frequency  Source Title
1   68,162  Physical Review
2   11,695  *Private Comm., Unpublished, To Be Published
3    9,191  *Books
4    4,252  Revs. Mod. Phys.
5    3,725  Proc. Roy. Soc. (London)
6    2,473  Z. Physik
7    2,459  Proc. Phys. Soc. A (London)
8    2,140  Phil. Mag.
9    1,929  *Reports, Technical Memos
10   1,831  Rev. Sci. Instr.
11   1,796  Physica
12   1,724  J. Chem. Phys.
13   1,662  Bull. Am. Phys. Soc.
14   1,473  Nature
15   1,330  Nuovo Cimento
16   1,096  Helv. Phys. Acta
17   1,023  Ann. Physik
18   1,022  Progr. of Theoret. Phys. (Japan)
19     867  J. App. Phys.
20     755  Compt. Rend.
21     741  Kgl. Danske Videnskab. Selskab. Mat-Fys. Medd.
22     586  Z. Naturforsch.
23     567  Can. J. Phys.
24     539  J. Phys. et Radium
25     518  Proc. Camb. Phil. Soc.
26     443  J. Phys. (USSR)
27     418  J. Exptl. Theoret. Phys. (USSR)
28     416  J. Am. Chem. Soc.
29     352  Nucleonics
30     336  Astrophys. J.
31     321  J. Opt. Soc. Am.
32     320  Physik. Z.
33     313  J. Phys. Soc. (Japan)
34     296  Arkiv Fysik
35     296  *Theses
36     249  Ann. Phys.
37     244  Nuclear Phys.
38     237  Proc. Nat. Acad. Sci. U.S.
39     223  Naturwiss.
40     222  Bell System Tech. J.
41     209  Acta Cryst.
42     208  Proc. Inst. Radio Engrs.
43     202  Arkiv. Mat. Astron. Fysik

*Nonperiodic literature.

TABLE 2. Continued (order numbers 44-93, with their citation frequencies; the corresponding source titles follow in the order given)
44 198  45 190  46 166  47 164  48 160  49 157  50 153  51 148  52 140  53 133  54 120  55 118  56 116  57 111  58 108  59 107 107  60 103  61 99  62 93  63 86  64 84  65 80  66 77 77  67 76  68 74  69 68 68  70 66  71 65  72 61  73 60  74 59  75 57 57  76 53 53  77 51 51 51  78 50 50  79 45  80 42  81 41  82 39 39  83 38  84 36 36  85 34 34 34  86 33  87 32  88 31 31 31 31  89 30 30  90 29 29  91 28 28 28 28 28  92 27  93 26 26 26
Trans. Roy. Soc. (London)  Can. J. Research  Soviet Phys-JETP  J. Research Nat. Bu. Stand.  Physik. Z. Sowjetunion  Repts. Prog. in Phys.  Science  Z.
Physik. Chem. Trans. Faraday Soc. Acta Metallurgica J. Phys. Chem. J. Phys. and Chem. Solids Am. J. Phys. Proc. Indian Acad. Sci. Proc. Phys. Math. Soc. Japan Proc. Am. Acad. Arts and Sci. Ann. Rev. Nuclear Sci. Leiden Comm. Philips Research Repts. Zhur. Eksptl. I Teoret. Fiz. Z. Anorg. U. Allgem. Chem. J. Electrochem. Soc. Terrestrial Magnetism and Atm. Elec. Ann. Math. J. Franklin Inst. Z. Krist. Advances in Phys. Proc. Acad. Sci. Amsterdam Discussions Faraday Soc. Proc. Roy. Irish. Acad. Trans. Am. Inst. Mining Met. Engrs. J. Geophys. Research Nachr. Akad. Wiss. Gottingen Math. Physik Kl. RCA Review J. Metals Sci. Repts. Tohuku Univ. Monthly Notices Roy. Astron. Soc. J. Inorg. Nuc. Chem. Z. Electrochem. Australian J. Phys. Compt. Rend. Acad. Sci. URSS Ricerca Sci. Indian J. Phys. J. Sci. Instr. Izvestia Akad. Nauk. SSSR Ser. Fiz. Sci. Papers Inst. Phys. Chem. Research (Tokyo) J. Tech. Phys. (U.S.S.R.) Z. Astrophys. J. Nuclear Energy J. Acoust. Soc. Am. Can. J. Math. J. Atmos. Terr. Phys. Anal. Chem. Proc. Roy. Acad. Sci. (Amsterdam) Australian J. Sci. Research Brit. J. Appl. Phys. Z. Tech. Phys. Nuclear Science Abstracts Ann. N.Y. Acad. Sci. Appl. Sci. Research J. Am. Ceram. Soc. Proc. Koninkl. Ned. Akad. Wetenschap Sci. Repts. Research Insts. Tohoku Univ. Geochim. et Coschim. Acta Prog. Nuclear Phys. Quart. Appl. Math. Acta. Phys. Polonica Ergev. Exact. Naturw. Wien. Ber. II A Rec. Trav. Chim. Proc. Am. Phil Soc. Am. Mineralogist J. Electronics 196 TABLE 2.— Continued Order Fre- Number quency Source Title TABLE 2. — Continued Order Fre- Number quency Source Title 94 25 25 95 24 24 24 96 23 23 23 97 22 22 22 22 98 21 21 21 21 21 99 20 20 20 20 20 20 20 20 100 19 19 19 19 101 18 102 17 17 17 17 17 17 17 103 16 16 16 16 16 16 104 15 15 15 15 15 15 105 14 14 14 14 106 13 13 13 13 107 12 12 12 12 12 12 12 12 12 108 11 109 10 10 10 10 10 J. Chem. Soc. Gen. Elec. Rev. J. Phys. Z. Metallkunde Trans. Electrochem. Soc. Ind. Eng. Chem. Zhur. Tekh. Fiz. Optik. Proc. Am. 
Acad. Sci. Acta Chem. Scand. Nachr. Ges. Wiss. Gottingen Kgl. Norske Videnskab. Selskabs. Skrifter Anais. Acad. Brasil. Cienc. Elec. Eng. J. Inst. Metals Acta Phys. Austriaca Communs. Phys. Lab. Univ. Leiden Verhandl. Deut. Physik. Ges. Acta Physicochim. U.R.S.S. Kgl. Fysiograf. Sallskap. Lund. Forh. Commun. Pure and Appl. Math. Trans. Am. Math. Soc. Arch. Sci. Phys. et Natur. Proc. London Math. Soc. Can. J. Chem. Arch. Elektrotech. Sitzber. Akad. Wiss. Wien. Math.-Naturw. Kl. J. Math. Phys. Math. Ann. Atti. Accad. Natl. Lincei Physics Phys. Today Acta Phys. Acad. Sci. Hung. Cahiers Phys. J. Chim. Phys. Proc. Inst. Elec. Engrs. Ill Acta Mat. Philips Tech. Rev. Proc. Roy. Soc. (Edinburgh) Proc. Natl. Inst. Sci. India Ann. Chim. Phys. Chem. Revs. Ann. Inst. Henri Poincare Busseiron Kenkyu Observatory Ann. Geophys. Wireless Engr. Sitzber. Preuss. Akad. Wiss., Physik-Math Kl. Phil. Trans. Roy. Soc. (London) Electronics Phil Mag. Suppl. I Am. J. Roentgenal Radium Therapy Communs. Kamerlingh Onnes Lab. Univ. Leiden Astrophys. Norv. Tellus Z. Angew. Phys. Nova Acta Reg. Soc. Sci. Ups. Pubis. Astron. Soc. Pacific Bull. Soc. Franc. Mineral Mem. Soc. Roy. Sci. Liege Rept. Ionus. Research Japan Quart. J. Math Nuovo Cimento Suppl. Current Sci. Bureau Standards .1. Research Duke Math. .1. Bull. Astron. Netherlands Trans. Am. Soc. Metals. Technol. Hepts. Osaka Univ. Ned. Tijdschr. Natumk. Ann. Kev. Phys. Chem. Rend. Reale Accad. Na/.l. Lincei 110 111 112 113 14 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 Comm. Leiden. Radiology Atti. Congr. Intern. Fis. Como Brit. J. App. Phys. Supplement Acta Phys. Hung. Preuss. Akad. Wiss. Berlin. Ber. Ann de Physique Atominaia Energya Soviet Physic Doklady Ann. Astrophys. Rept. Inst. Sci. Tech. Univ. Tokyo J. Aeronaut. Sci. Cent. Bras. Besq. Fis. (Notas de Fisica) Advances in Electronics Trans. Roy. Soc. Can. Ill Nuclear Instr. Trans. Am. Geophys. Union J. Math, and Phys. Kgl. Norske Videnskab. Selskav. Forh. Trans. Am. Inst. Elec. Engrs. J. 
Geomag. and Geoelec. Metal Progr. Am. J. Math. Verhandel. Koninkl. Akad. Wetenschap Amsterdam Afdeel Natuurk. Czechoslov. J. Phys. Brit. J. Radial. Appl. Spectroscopy J. Iron and Steel Inst. Sorysiron Kinkyu Phys. Chem. Solids Nuclear Sci. and Eng. Phys. Fluids Chem. Weekblad Arch. Math. Naturvidenskab. American Scientist J. Sci. Research Inst. (Tokyo) J. Sci. Hiroshima Univ. Bull. Inst. Nuclear Sci. Belgrade Ber. Deut. Chem. Ges. Skrifter Norse Videnskaps-Akad. Oslo I Mat- Natur. Kl. Trans. Am. Soc. Meqh. Engrs. Sylvania Technologist J. Washington Acad. Sci. Rev. Mex Trs Trans. Am. Inst. Mec. Engrs. Ann. Radioelec Compagn Gen de T.S.F. Bull. Akad. Sci. URSS 1 Actualities Sci. et Ind. Naturw. Anz. Ungar. Akad. Wiss. Zhur. Fiz. Khim. J. Phys. and Colloid Chem. Amer. Math. Mon. Proc. Leed Phil. Lit. Soc. Sci. Sect. Arkiv. Kemi. Mineral. Geol. Experientia Progr. Metal Phys. J. Proc. Roy. Soc. (N.S. Wales) Encykl. D. Math. Wiss. Am. J. Sci. Uspekhi Fiz. Nauk. Elec. Comm. Bull. Am. Math. Soc. J. Colloid Sci. Geofus Publ Soviet J. Atomic Energy IBM J. Research and Development Proc. Intern. Conf. Refrig. Bull. Soc. Chim. Z Hochfrequenz Akad. Wiss. Wien. Festschr. Akad. Wiss. Gottinger Math- Physik Kl Kolloid-Z. 197 TABLE 2. — Continued Order Fre- Number quency Source Title 5 Z. Angew. Math. U. Mech. 5 Abhandl. Braunschweig. Wiss. Gen. 5 J. Ind. Eng. Chem. 5 Akad. Nauk. S.S.S.R. 5 Ceram. Age. 5 Svensk. Kem. Tidskr. 5 Kgl. Svenska. Vetenskapsakad. Handl. 5 Ilium Engr. 5 Ann. Univ. Grenoble 5 Wiss. Veroffentl. Siemens-Werke 5 Bull. Soc. Roy. Sci. Liege 5 Ann. Math. Stat. 5 Carnegie Inst. Wash. Publ. 5 Physik Bl. 5 Radiation Research 5 Memoirs and Proceedings of the Manchester Literary and philosophical Soc. 5 Wied. Ann. J. 5 Chinese J. Phys. 5 Astron. J. 5 Phil. Trans. 5 Fortschr. Physik 5 J. Rational Mech. and Anal. 5 Rocqniki Chem. 5 Univ. I. Bergen Arbak. Naturvidenskap. Rekke 5 Soviet Phys-Tech. Phys. Table 4. 
Incremental growth of the list of cited journals as new journals are examined
(This table illustrates the stability of the most cited journals in the physics literature outside the Physical Review.)

Source journal          Total number of citations   Citations to new* titles
Phys. Rev.                   1120                       10
Phys. Rev. Letters           1004                        8
Proc. Phys. Soc.             1000                       27
Z. Physik                    1000                       23
Physica                       379                       19
JETP                         1011                       18
J. Phys. Soc. Japan          1250                       57
Can. J. Phys.                 996                       43
Prog. Theor. Phys.           1016                       10
Czech. J. Phys.               476                       16
Nuovo Cimento                 996                        8
Rev. Sci. Instr.              839                       32
J. Appl. Phys.               1002                       26
Phys. Fluids                  956                       32
Sov. Phys. Sol. State        1000                       44
Philosophical Mag.           1000                       34

*Citations of titles not encountered in Phys. Rev. Vol. 77-112.

Table 3. Incremental growth of the list of cited journals as new issues are examined
(This table shows that a relatively small number of sources account for most of the references found in the Physical Review.)

Phys. Rev. volume   New titles cited in this vol.   Times cited   Times cited in Vol. 77-112
77 (1950)    108   1517   107,385
78            40     57     1,025
79            29     42       605
80            27     35       249
81 (1951)     18     26       662
82            21     28       163
83            30     49       987
84            19     19       126
85 (1952)     12     13        81
86             9     12        47
87            12     18       150
88            28     38       340
89 (1953)     13     14        57
90            20     21        72
91            24     29       137
92            17     20       183
93 (1954)     18     23        57
94            21     28       138
95            14     15        32
96            10     15        50

4. Tests, Evaluation Methodology, and Criticisms

An Evaluation Program for Associative Indexing *

Gerard Salton
Harvard University, Cambridge, Mass. 02138

Statistical association techniques have been widely used in information retrieval to relate items of information such as documents or words occurring in documents. The desired relationships between the given items are normally determined by means of a variety of different criteria, including in particular the co-occurrence of words in documents, the similarity in bibliographic citations, and the identity of authorship.
Associative techniques are particularly useful as a means for adding to the index terms attached to a given document a number of new, related terms. Such associated terms then effectively broaden the scope of the original terms in such a way as to increase the number of relevant documents retrievable in response to a specific search request. Word associations can therefore be used in an adaptive retrieval system in which requests for information are successively altered until a satisfactory response is obtained. One of the difficulties which beset associative systems is the problem of evaluating the effectiveness of the procedure. Specifically, it is not clear whether an improvement in retrieval is actually obtained by using term and document associations, or whether equally effective results might not be generated with a small thesaurus, or synonym dictionary, used to normalize the vocabulary. An adaptive information retrieval system is presented which can be operated with or without a synonym dictionary, with or without term and document associations, and with or without a hierarchical subject arrangement. By processing the same search requests under a variety of different modes it is possible to compare the relative effectiveness of the various automatic methods without large-scale human effort. The retrieval system is described in detail, and test results obtained by processing a sample document collection on the 7090 computer are exhibited.

1. Introduction

Within the last few years the design of automatic information systems has become increasingly complex, and so have the techniques which are used to analyze and manipulate the information. As more and more different types of systems are proposed and generated, the evaluation of these systems becomes of increasing urgency.
Unfortunately, no real guidelines are available which could be used in the design of evaluation procedures, and most of the methods actually proposed are based on ad hoc rules which stress theoretically desirable features and do not concern themselves with practical questions. As a result, much of the proposed methodology cannot, in fact, be implemented reasonably in a test situation. In the present report, an evaluation program is outlined which is believed to be both useful and practical. No attempt is made to treat all aspects of a retrieval system; the program confines itself, instead, to the evaluation of retrieval techniques, including methods for analyzing document and information content, and methods for the comparison of stored information with search requests. Specifically excluded from the testing process are operational criteria such as cost, access time, response time, and so on, since these factors are not of immediate interest in experimental automatic information systems.

* This study was supported by the National Science Foundation under grant GN-82.

Furthermore, in order to circumvent the difficulties which arise from the dual, and probably incompatible, requirements of demanding, on the one hand, an absolute standard against which the performance of each retrieval system is to be compared, and of insisting, on the other, that the user himself be the ultimate judge in deciding what part of the retrieved information is relevant to any given request, the evaluation procedures described here are based on relative measures of system effectiveness. In particular, an attempt is made to rank the various retrieval procedures as a function of their excellence in performing certain desired tasks without, however, specifying how far removed each performance is from some optimum standard.
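The relative-measurement idea can be made concrete with a toy sketch. The procedure names and scores below are invented: the point is only that procedures are ranked against one another on the same requests, with no absolute standard of relevance ever being consulted.

```python
# Toy rendering of relative (rank-based) evaluation; all values invented.
def rank_procedures(scores):
    """scores: dict mapping procedure name -> list of per-request
    effectiveness values. Returns names ordered best-first by mean score."""
    mean = {p: sum(v) / len(v) for p, v in scores.items()}
    return sorted(mean, key=mean.get, reverse=True)

scores = {
    "key words only": [0.40, 0.50],
    "thesaurus": [0.60, 0.70],
    "term associations": [0.55, 0.65],
}
print(rank_procedures(scores))
# ['thesaurus', 'term associations', 'key words only']
```

The ranking says which of the available procedures performed best relative to the others, but deliberately says nothing about how close any of them comes to an ideal system.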
Such a relative evaluation process cannot then be used to design an ideal system, but it will make it possible to choose from among a set of available procedures the one which may be expected to render the best performance in a given situation. Moreover, the use of a relative standard of excellence makes it unnecessary to produce manually an index of relevance for each document with respect to each question, and permits instead a largely automatic testing procedure. This in turn implies that the tests can be performed on relatively larger collections of stored information than is possible in a purely manual operation, thus insuring a reasonable statistical base for the test results. In addition, since the cooperation of large numbers of persons over long periods of time is no longer needed, one of the basic weaknesses built into conventional testing systems, namely the variability of the environment, is now removed. The principal criteria used in the design of the testing procedure are outlined in the next section; the system itself is briefly described in section 3; and some of the many possible testing routines are listed in the concluding section. 2

2. Evaluation Criteria

A number of diverse systems for the identification of stored information have come into general use within the last several years. The first and most widely known is the key word system, in which certain terms, manually chosen or automatically extracted from the body of documents, are used for purposes of information identification. These terms are normally assumed to be independent, in the sense that they do not exhibit relations among each other, and they may be chosen from a controlled vocabulary or else may be completely free. In a key word system, the information relevant to a given search request is identified by comparing the term sets representing stored information with the term sets representing information requests.
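The key word comparison just described reduces to set arithmetic. The fragment below is an invented illustration, not part of any system discussed in the paper: stored items and the request are term sets, and an item is retrieved when the overlap reaches a threshold.

```python
# Minimal sketch of key word matching; documents and threshold invented.
def matches(doc_terms, request_terms, threshold=2):
    """Retrieve a document if it shares at least `threshold` terms
    with the request."""
    return len(doc_terms & request_terms) >= threshold

docs = {
    "d1": {"antenna", "radiation", "pattern"},
    "d2": {"semiconductor", "junction"},
}
request = {"antenna", "pattern", "gain"}
print([d for d, t in docs.items() if matches(t, request)])  # ['d1']
```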
In order to eliminate the variations resulting from an uncontrolled vocabulary, and to supply some of the more obvious inclusion and generic relations between terms, a synonym dictionary, or thesaurus, is often introduced. Key words, chosen as before, are then looked up in the dictionary and replaced by the corresponding thesaurus heads before being used as information identifiers. Within the thesaurus, the items may be hierarchically arranged in such a way that terms appearing "high up" in the hierarchy (near the roots of the corresponding abstract tree structure) are general terms which are generically related to the more specific terms listed under them on a lower level. Such an arrangement makes it possible to use the thesaurus for a variety of term expansion procedures, as will be seen. Additional relations between key words may also be taken into account by using for purposes of document identification clusters or phrases consisting of subsets of terms with specified relations between them (instead of individual key words alone). Such phrases may again be chosen manually or else may be generated automatically by a variety of statistical, syntactic, or semantic techniques. The relations which obtain between the individual terms within a cluster may be purely formal ones, such as co-occurrence of words within the sentences of a document, or within the documents of a collection, or

2 Some recent works dealing with the design of testing and evaluation systems for information retrieval are included in the reference list [1, 2, 3, 4, 5, 6, 7]. (Figures in brackets indicate the literature references on p. 210.)

3 The precision ratio of a search is that fraction of the retrieved documents which is in fact relevant to the user's request; the recall ratio, on the other hand, is that fraction of all the relevant documents in a collection which is in fact retrieved [7].
4 It is an unfortunate fact that recall and precision ratios cannot, in general, both be improved simultaneously, because as recall increases through retrieval of additional relevant material, more irrelevant matter will also be produced, thus decreasing precision; similarly, as precision improves through a decrease in the amount of irrelevant material, recall may deteriorate because some of the newly missing material may originally have been relevant [5, 7].

else they may be described in very specific terms, such as cause-effect or whole-part relations; in the latter case, extensive syntactic and contextual analyses may be needed to identify them. Relevant information in such a system is retrieved by more or less complicated phrase-matching procedures. In addition to information extracted from the text of documents, or supplied by auxiliary dictionaries and tables and by various analytical procedures, it is often convenient to use a number of related sources for purposes of information analysis. Thus it is possible, under certain circumstances, to utilize contextual criteria such as the date of a publication, the name of the author, the references cited in the bibliography of each document, and other related indicators. In a typical retrieval situation, the user is first given some indication of the parameters within which the system operates, and is then free to formulate any acceptable search request. In response to each request, the system then furnishes a certain set of items which is considered relevant to the respective requests.
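The precision and recall ratios defined in footnote 3 are simple set arithmetic, as the following sketch shows (the document identifiers are invented for illustration):

```python
# Precision and recall as defined in footnote 3; identifiers illustrative.
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document identifiers."""
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

retrieved = {1, 2, 3, 4}        # what the search returned
relevant = {2, 3, 5, 6, 7, 8}   # what the user would accept
p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2))  # 0.5 0.33
```

The tradeoff of footnote 4 is visible in the arithmetic: enlarging the retrieved set can only raise recall, while every irrelevant addition lowers precision.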
The user may now find himself in one of three situations:
(a) the information retrieved is in general satisfactory, and there is no need to rephrase the request;
(b) the information retrieved is not satisfactory because too much irrelevant material is included (the precision ratio 3 of the search is too low);
(c) the information retrieved is not satisfactory because too little relevant material is included (the recall ratio 3 of the search is too low).
In the last two situations the user will want to rephrase his search request in an attempt to obtain a more nearly satisfactory answer. Specifically, to improve the precision ratio it is necessary to narrow the scope of the terms used to specify the search request, and to tighten the criteria used to match the stored information with the requests for information. Contrariwise, to improve the recall ratio the search specifications must be broadened, and the matching criteria between the respective sets of terms relaxed. 4
In a practical, useful retrieval system, the following types of operations are then seen to be of primary concern:
(a) the construction of matching procedures which would make it possible to produce successively more and more relevant, or less and less irrelevant, material in answer to a given search request;
(b) the generation of term expansion and contraction methods which could alter the coverage of the original terms used to specify a search request
Adaptive matching techniques which can be used to compare items under more or less stringent conditions have also been generated [8]. The difficulty which arises in the actual imple- mentation of retrieval systems is that very little is known about the precise effect of each of the many possible steps which may be taken in a given situ- ation. For example, which of many possible cor- relation coefficients should be used to measure the similarity between sets of key words? Given a specific correlation coefficient, what cutoff point should be chosen to distinguish relevant from ir- relevant information? How much more (or less) information is retrieved by replacing each original key word by a more general (or a more specific) one? Is it better to use a synonym dictionary or a statistical association method for the expansion of index terms? And so on. In the next section, a retrieval system called SMART is described which is believed to be useful in answering questions of this type. The SMART system makes it possible to process data in dozens of different modes by calling into play different methods for the determination of information con- tent, different criteria for matching items of stored information, and different ways of specifying the information requests. This system may be used for the evaluation of retrieval techniques by proc- essing the same search requests and the same docu- ment collection several times and effecting each time a slight change in the processing conditions. To evaluate the effect of a certain processing tech- nique it is then sufficient to concentrate on the differences in output produced by two search opera- tions in which the given technique is used in one case but not in the other. This is further described in section 4 of this study. 3. The SMART Retrieval System [9] A simplified flowchart of the complete system is shown in figure 1. 
The system is seen to consist of a sequence of largely optional text-processing routines, including dictionary lookup processes, statistical correlations, and syntactic matching procedures.

FIGURE 1. Simplified SMART system. [Flowchart: an incoming text or search request passes through a compulsory dictionary lookup to obtain syntactic and semantic labels, followed by optional steps: expansion of semantic labels through search in the concept hierarchy; computation of sentence significance and automatic sentence extraction; syntactic analysis of significant sentences and structural matching with criterion phrases; expansion of semantic labels through statistical term correlations; and comparison of the search request with document identifications and possible document correlations.]

Documents consisting of English texts, as well as search requests, are submitted to the same process, and a complete run consists of a sequence of text manipulations, including input operations for new texts and matching operations between certain specified texts (the search requests) and all other texts.

The system is designed around a monitor called CHIEF, which can in turn call on many different subroutines. The monitor accepts input instructions to specify the type of operation to be performed, and control data to choose the subroutines which are to be called. At the present time, four basic input operations are available and about 35 different processing options. The processing options fall into seven basic categories: general processing methods, alphabetic dictionary procedures, operations using the semantic concept hierarchy, statistical correlation options using co-occurrence of terms within sentences, syntactic procedures using a phrase dictionary and structural matching methods, statistical term correlations using co-occurrences within documents, and document-matching procedures.
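The monitor's role of selecting largely optional routines from control data might be sketched as follows; the routine names and option keys are hypothetical, and this is not the actual CHIEF program:

```python
def run(text, options):
    """Apply the compulsory lookup, then whichever optional routines the
    control data enable; each 'routine' here is only a named placeholder."""
    pipeline = [
        ("dictionary_lookup", True),                        # compulsory step
        ("hierarchy_expansion", "hierarchy" in options),    # optional steps
        ("statistical_correlation", "correlate" in options),
        ("syntactic_matching", "syntax" in options),
    ]
    executed = [step for step, enabled in pipeline if enabled]
    return {"text": text, "log": executed}

r = run("sample text", {"hierarchy"})
# r["log"] == ["dictionary_lookup", "hierarchy_expansion"]
```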
Four basic dictionaries or tables are used by the system: an alphabetic-stem dictionary designed to supply each word stem with a number of syntactic and semantic codes, an alphabetic-suffix table to obtain syntactic codes for word suffixes, a numeric concept hierarchy to represent various relations between semantic categories, and a criterion-phrase dictionary to aid in the syntactic processing.

3.1. The Alphabetic Dictionary Programs

The input texts are first segmented by identifying the individual words of the texts and noting the sentence number and text code for each word. The individual words are then looked up in an alphabetical dictionary to supply each word found with both syntactic and semantic codes. The alphabetic dictionary actually consists of two parts, a stem dictionary and a suffix dictionary, and both parts are stored in list form. An attempt is made by a dual left-to-right and right-to-left letter-by-letter scanning procedure to find a match between each input word and the respective entries in the stem and suffix dictionaries. When a match is found, the semantic concept codes and the syntax codes included in the dictionary are used to replace the alphabetic characters which specify the input word.

The importance of the dictionary lookup procedure is threefold: first, it reduces the dependence of the various procedures on the vocabulary of the original texts by assigning the same concept numbers to a variety of synonymous expressions; second, it permits the remainder of the process to be carried out with standardized numeric codes instead of with variable alphabetic information; third, a replacement of the original words by concept codes tends to broaden the coverage of each term and therefore affects the retrieval action, as will be seen.

For purposes of comparison and evaluation, it may in some circumstances be desirable to operate with the original input words.
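The stem-plus-suffix lookup can be sketched as follows. The dictionary entries are tiny hypothetical samples, and the single longest-stem-first scan is a simplification of the dual left-to-right and right-to-left procedure described above:

```python
# Tiny hypothetical stem and suffix dictionaries; the real SMART
# dictionaries carried fuller code sets and were scanned in both directions.
STEMS = {"retriev": [17], "inform": [14], "analys": [23]}
SUFFIXES = {"": "STEM", "e": "VERB", "al": "NOUN", "ation": "NOUN", "is": "NOUN"}

def lookup(word):
    """Split the word into a known stem plus a known suffix (longest stem
    first); return the stem's concept codes and the suffix's syntax code."""
    for cut in range(len(word), 0, -1):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEMS and suffix in SUFFIXES:
            return STEMS[stem], SUFFIXES[suffix]
    return None, None              # word not found in the dictionaries

concepts, syntax = lookup("retrieval")     # stem "retriev" + suffix "al"
# concepts == [17], syntax == "NOUN"
```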
Provision is therefore made to substitute for the alphabetic stem dictionary a simulated vacuous dictionary. This dictionary includes no entries initially, but is constructed during the "lookup" operation by entering in the dictionary every occurrence of a new word found in the input text, together with a fictitious "concept" code. Each new word type is thus assigned a different concept code, so that a one-to-one correspondence exists in the simulated dictionary between dictionary entries and concept codes. When the simulated dictionary is used, the statistical correlation programs, while still technically operating on numeric concept numbers, are in fact then associating the original alphabetic text entries.

An excerpt of a text, including both real concept numbers and simulated dummy numbers, is shown in figure 2. It is seen that the actual concepts are assigned to a variety of different words, whereas the simulated numbers are repeated only if the corresponding word is repeated also. High-frequency function words are not assigned any concept numbers.

FIGURE 2. Excerpt of typical abstract. [The words of a sample abstract (V. H. Yngve, "A Framework for Syntactic Translation," Mechanical Translation 4, pp. 59-65, December 1957) are annotated with concept numbers in parentheses and with dummy concept numbers; high-frequency function words carry no concept numbers.]

3.2. Processing of the Concept Hierarchy

Whereas the lookup in the alphabetical dictionary, real or fictitious, is compulsory, since the numeric concept codes must be obtained in one way or another, all operations involving the concept hierarchy are entirely optional.
If no hierarchy is available, these operations can be skipped. The concept hierarchy is a treelike arrangement of numeric concept numbers as illustrated in the simplified excerpt of figure 3. Each node in figure 3 represents a concept number, and the horizontal dashes next to the nodes symbolize the text words which are replaced by the corresponding concept numbers during the dictionary lookup.

FIGURE 3. Hierarchical concept dictionary with cross references.

Associated with a given concept appearing in the hierarchy are more specific concepts which appear on a lower level in the hierarchy, more general concepts which appear on a higher level, and cross-referenced concepts which appear on the same level. Thus, when a concept number is obtained as a result of the lookup operation in the alphabetical dictionary, it is possible to enter the hierarchy in order to obtain a number of related concepts or, alternatively, more general or more specific ones. The hierarchy is stored in the computer as a multiply chained list, and list-processing operations are used to obtain the "parent" of a given node on the next higher level, the "brothers" on the same level, the "heirs" on the next lower level, and the cross references. Each concept may be said to "include" other concepts located on lower levels, or to "be included" in concepts situated on higher levels; no such inclusion relation is implied, however, for the cross references. In the SMART system, search requests as well as document specifications may be broadened by moving upward in the hierarchy or restricted by moving downward, and related concepts are picked up through the cross-reference lists.

3.3. Statistical Concept Associations

The text-segmentation and alphabetical-dictionary lookup programs furnish for each sentence a list of all the included concept numbers.
An inverse sort followed by a simple counting procedure can then be used to obtain for each concept a list of the corresponding sentences, as well as the frequency of occurrence in each sentence. This in turn permits the construction for each document of a concept-sentence incidence matrix in which the ijth element is set equal to n if sentence j contains concept i exactly n times. A typical concept-sentence incidence matrix is shown in figure 4.

In the same manner, it is possible to take the sets of concepts attached to each document within a complete document collection and to form a single concept-document matrix. The ijth element in such a matrix is set equal to 1 if and only if concept Ti is assigned to document Dj. A typical concept-document matrix is shown in figure 5.

FIGURE 4. Concept-sentence incidence matrix for a given document. [The element Cij = n indicates that sentence Sj contains term Ti exactly n times.]

FIGURE 5. Concept-document matrix for a given document collection. [The element Cij = 1 indicates that term Ti has been assigned to document Dj; otherwise Cij = 0.]

To obtain a measure of similarity between a pair of concepts, it is necessary to compute a correlation coefficient between the two corresponding rows of the concept-sentence incidence matrix or of the concept-document matrix. If correlation coefficients are computed for all concept pairs, a concept-concept correlation or similarity matrix is obtained in which the ijth element denotes the strength of association between concept i and concept j, based either upon the number of co-occurrences of two concepts within the sentences of a given document, or within the documents of a given collection.

Concept-correlation options are included in the SMART system for two principal reasons.
First, it may be desirable to replace a given set of old concepts by a new concept formed of a cluster of highly correlating original ones. Second, it may be useful to add to an original concept new ones which correlate significantly with the original. The clustering procedure is carried out by starting with a single term, and then adding a new term whose correlation coefficient with the old one is larger than a given threshold. To the pair thus formed, a third term is added whose correlation with both of the others is significantly high, and so on. Three types of output may be obtained to represent similarities between terms: the "term correlations" exhibit all correlation coefficients for a given term; the "term relations" include only those related terms which have significant correlation coefficients with a given term; finally, the "term clusters" include terms which have significant correlations with all other terms in the cluster.

It may be noted that the generation of new concepts formed from sets of old ones is similar in effect to the concept expansion obtained by means of the concept hierarchy. The two methods may then be compared by performing first the one and then the other and checking results. Options are available to skip the concept-correlation process if desired.

3.4. Syntactic Processing

A syntactic-analysis program may be a useful part of an information-retrieval system, since it permits a further refinement of the matching criteria between information requests and document identifications. Specifically, the document sentences and search requests may be analyzed syntactically, and individual concepts or terms may be clustered only if the syntactic relationships between concepts are identical. Similarly, a phrase or cluster included in a search request can then be made to match the corresponding phrases included in the document identifications only if the syntactic relations also match.
A syntactic-analysis program is included in the SMART system which can transform each sentence processed into dependency-tree form. Tree-matching procedures are then used to compare sentences and sentence parts [8, 9, 10]. Specifically, a dictionary of so-called "criterion phrases" or "criterion trees" is used. Each entry in this dictionary consists of a set of concept numbers corresponding to a phrase in ordinary written texts. Typical phrases might be "information retrieval," "computer design," "syntactic analysis of phrases," and so on. Also included in the criterion-phrase dictionary are the semantic concept numbers and the syntactic codes corresponding to the terms included in each phrase, as well as a specification of the syntactic connection pattern between the concepts. A typical criterion phrase is shown in figure 6, including also the syntactic indicators and semantic concept numbers attached to the nodes of the phrase.

If the "criterion tree" option is chosen, each of the previously syntactically analyzed sentences is compared against all entries in the criterion-phrase dictionary, and those phrases are identified which match a given part of a sentence. To match, not only must the semantic and syntactic labels compare properly, but the syntactic connection pattern must also be the same. Thus, a phrase such as "information retrieval," where the concept "information" is syntactically dependent on "retrieval," would not match the sentence, "Because the text contains secret information retrieval is vital," but would match the sentences, "The retrieval of information is necessary," or "He discusses information and document retrieval." A tree which matches the criterion phrase of figure 6 is shown in figure 7. A comparison of figures 6 and 7 shows that the nodes of figure 6 match the corresponding nodes of figure 7, and that the paths between the nodes are properly preserved.
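The criterion-phrase test described above (labels plus connection pattern) can be sketched by representing each analyzed sentence as a set of (head, dependent) concept pairs. The concept numbers follow the 014/017 thesaurus categories named in the text, but the encoding itself is a hypothetical simplification of the actual tree-matching procedure:

```python
# A criterion phrase as a (head concept, dependent concept) pair:
# "information retrieval," with concept 014 (information-type words)
# syntactically dependent on concept 017 (retrieval-type words).
CRITERION = ("017", "014")

def matches(dependencies):
    """dependencies: set of (head, dependent) concept pairs taken from
    the dependency tree of a syntactically analyzed sentence."""
    return CRITERION in dependencies

# "The retrieval of information": 'information' depends on 'retrieval'.
sent_a = {("017", "014")}
# "...the text contains secret information; retrieval is vital": both
# concepts occur, but the required dependency between them is absent.
sent_b = {("230", "014"), ("231", "017")}   # other heads, hypothetical codes

hit, miss = matches(sent_a), matches(sent_b)
```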
FIGURE 6. Typical criterion phrase, with syntactic labels and semantic concept numbers (014, 023; 017) attached to the nodes. [Thesaurus category 014: information, document(s), fact(s), datum, data, etc.; thesaurus category 017: retrieval, processing, organization, search, etc.]

FIGURE 7. Tree structure for "the retrieval of information," with syntactic and semantic labels, which matches the criterion phrase of figure 6.

At the end of the matching process, the criterion routine furnishes for each document a count of the number of matches obtained between each criterion phrase and the sentences included in that document. The concept numbers identifying the criterion phrases which match sufficiently often can then be added to the concept lists of the corresponding documents, thus resulting in an expansion of the concept vectors similar to the expansion previously obtained through the hierarchy and the statistical correlations. By using the option "no syntactic processing," the complete syntactic analysis and the criterion-phrase processing can be eliminated.

3.5. Document Associations and Request Processing

The programs described for the generation of concept correlations can be used unchanged to obtain document similarities by performing column instead of row correlations of the concept-document matrix. Specifically, one of the documents, newly introduced or previously included in the collection, may now take the place of a search request. This special request vector can of course be subjected to the same procedures as the other documents, including lookup in the alphabetic dictionary, expansion through the concept hierarchy, and so on. By correlating the request vector with all other documents in the collection, a "relevance coefficient" is obtained for each document, and documents with sufficiently high coefficients can be considered to answer the request.
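A minimal sketch of this request-correlation step, using cosine similarity as one possible correlation coefficient (the concept vectors and the cutoff value are hypothetical):

```python
from math import sqrt

def cosine(u, v):
    """One possible correlation coefficient between two concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept-document columns and a request vector over 4 concepts.
documents = {"D1": [1, 1, 0, 0], "D2": [0, 1, 1, 0], "D3": [0, 0, 0, 1]}
request = [1, 1, 1, 0]

# Relevance coefficient for each document; high coefficients answer the request.
relevance = {d: cosine(v, request) for d, v in documents.items()}
answers = sorted(d for d, r in relevance.items() if r >= 0.8)
# D1 and D2 exceed the cutoff; D3 shares no concepts with the request.
```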
Moreover, given a set of documents obtained in response to some request, new documents may be added by using the document-document similarity matrix, including the correlation coefficients between all pairs of documents, to form document clusters. The clustering techniques are the same as those used before for concept clusters, and these clusters can be used as an entity in the generation of answers to search requests. 5

4. Test Procedures

The system described in the preceding section can be used to generate document identifications by a variety of methods. In particular, starting with a simple term-document matrix of the type shown in figure 5, it is possible to generate an expanded matrix as shown in figure 8, including new terms derived by hierarchical expansion, syntactic processing, and statistical associations. The problem is then to find a way of constructing in each case the most effective possible matrix and the most useful matching procedure for the comparison of the matrix columns. The following general methods are available for this purpose: (a) a variety of correlation measures may be used to compare the similarity between the information identifications and search requests;

5 Procedures for the generation of term and document associations have been described in the literature and are not repeated here in detail [11, 12]. Extensions of the term association, to include bibliographic information, have also been proposed [13].
(b) a variety of coefficient thresholds may be chosen for each correlation coefficient, so as to increase or decrease the amount of retrieved information in each case; (c) the matching procedures may be altered (without change in the search specification) by using, for example, binary term-document matrices instead of numeric ones, or by disregarding various kinds of relations between terms; (d) the search specifications themselves may be modified, for example, by addition or deletion of terms, or by replacement of original terms by new ones.

It is seen that each of these four principal processing alterations can be brought into play independently of the other three. Not much can be said concerning the choice of a useful correlation measure; it is in fact conceivable that, for practical purposes, this step may be of little importance.

FIGURE 8. Expanded concept-document incidence matrix. [The original term-document matrix is augmented by additional rows of terms obtained through hierarchy expansion, syntactic phrases, statistical associations, and citation and author relations, and by an additional column for the search request.]

In any case, experimentation may indicate that some coefficients are more satisfactory than others; in particular, everything else being equal, it is most efficient to use that coefficient which minimizes the amount of computation to be performed. One of the simplest ways to increase or decrease the amount of information produced in response to a given search request is to alter the threshold of the coefficient of correlation used in the matching process. Clearly, the lower the threshold, the more information is produced. A change in the cutoff point will not, however, be effective if different kinds of responses are expected, but will affect mainly the number of answers.
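The cutoff effect can be illustrated directly; the relevance coefficients below are hypothetical:

```python
# Hypothetical relevance coefficients for four documents.
relevance = {"D1": 0.91, "D2": 0.74, "D3": 0.55, "D4": 0.20}

def retrieve(threshold):
    """Return the documents whose coefficient meets the cutoff."""
    return sorted(d for d, r in relevance.items() if r >= threshold)

high = retrieve(0.8)   # strict cutoff: fewer answers
low = retrieve(0.5)    # relaxed cutoff: the same answers plus more
assert set(high) <= set(low)   # lowering the cutoff only adds answers
```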
Alterations in the matching process itself are most useful in the dictionary lookup operations. For example, word endings could be disregarded in the alphabetic-dictionary lookup; alternatively, syntactic codes might be deleted as a matching criterion in the tree-matching process. In general, the fewer the number of restrictions affecting a lookup process, the larger the number of matches between arguments and stored information.

The most powerful process available for altering the kind (rather than merely the amount) of information produced in answer to a search request is to change the search specification itself. The many methods by which this can be done are summarized in figure 9. In general, addition of new terms to a given search specification may be expected to yield a more narrowly defined document set, thus increasing precision; on the other hand, deletion of terms may have the reverse effect, thus increasing recall. Replacement of old terms by new ones may have one or the other effect, depending on whether the new terms have a more restricted definition than the original, or a broader one. Thus the use of clusters of terms, or syntactic phrases, instead of individual terms alone should refine the definition, as indicated in figure 9.

Clearly, each of these possible devices may be expected to have a different effect upon the eventual outcome of a search, in the sense that recall and precision are affected in different ways. In order to be able to design a useful system, it is then necessary to obtain a measure of the effect of each individual processing step alone. This can be done by keeping the main system invariant and making one judicious processing change at a time.
If the differences in output are then evaluated, a measure should be obtainable of the usefulness of the given step in relation to the usefulness of the possible alternative steps. A continuing type of process can then be envisaged, as illustrated in figure 10, in which a sequence of processing alterations is executed until such time as the right kind and amount of information are produced.

FIGURE 9. Alterations of search specification or of document identifications. [The figure tabulates ten methods of alteration, grouped by type of process, together with the probable effect of each (improved recall or improved precision). Dictionary lookup: (1) each input word is replaced by one or more terms (or term numbers). Hierarchical processing: (2) each term is replaced by its "parent" on the next higher level in the hierarchy; (3) each term is replaced by its "sons" on the next lower level in the hierarchy; (4) to each term are added its "brothers" on the same level in the hierarchy, and its first-order cross references. Statistical correlation methods: (5) to each term are added all other terms from within the same significant term cluster; (6) each term is replaced by the term cluster of which it is a part. Syntactic matching: (7) each term is replaced by the criterion phrases in which it is contained; (8) to each list of terms are added the criterion phrases which match the original input. Simple addition and deletion: (9) to each list of terms are added a set of new terms; (10) from each list of terms are deleted a set of specified terms.]

The weakest link in this procedure is the manual evaluation of output differences produced by two given search procedures. This cannot, unfortunately, be done automatically, since it is necessary to determine to what extent the information added by a given processing modification is in fact relevant, and the information deleted is in fact marginal.
No method exists for eliminating this step entirely; by adjusting the system in such a way that only small amounts of output are produced (so that output differences are also small), the difficulty of this manual evaluation process can, however, be minimized.

It is hoped that tests now under way will lead to the construction of preferred sequences of processing steps. This in turn may lead to the determination of specific processing options which may be particularly useful for certain kinds of subject matter. Eventually, it may be possible to suggest to the user at each step a set of alternative moves to reach a given goal most efficiently.

FIGURE 10. [Flowchart of the continuing evaluation process: starting from the document collection and search requests, choose a reasonable correlation measure to compare document identifications with search requests; choose a reasonable threshold to distinguish relevant from nonrelevant information; and process search requests against documents using a standard mode.]

FIGURE 3. Distribution of document relevance values for category 91. [Horizontal axis: relevance value in standard units.]

observations on the effects of changes in these parameters on the overall performance of the system were made. In one of the experiments, the number of reference documents in each category was decreased from 100 to 80. The results shown in figure 1 indicate that a more detailed analysis of this parameter is required. The number of documents required may change from category to category. Figure 1 shows that category 95 achieved 98 percent success with only 80 reference documents, whereas category 93 achieved 68 percent success. The effects of a change in the number of reference documents cannot be analyzed independently of the other classification parameters. The effect of the number of documents on classification accuracy, as well as the intereffect of representativeness of documents, can also be observed from the same figure.
It cannot simply be assumed that if 20 more documents are added the classification accuracy will improve. A check on the representativeness of the documents being added is required. When documents that are not as representative are added, a decrease in accuracy can result, as shown by category 93.

Figures 2 and 3 show how the distribution of relevance values can be used to measure the representativeness of the documents. In figure 2, the solid line shows the distribution about the mean relevance value of documents known to belong to category 95. Ideally, the dashed lines should be low around the mean relevance of zero and become higher with increasing distance from the mean.

Lack of representativeness in a document can be caused in two ways. (1) A document may contain one and only one concept at that level, but it may be shorter or longer than the average. The word frequencies then would be atypical with respect to the category and thus cause an increase in the within-category variance. (2) A document could contain more than one concept at the same level. Such a document could contain words from several categories and would cause a decrease in the among-category variance.

A preliminary experiment in which 10 documents of each type were removed from each category has borne out this hypothesis. Figure 2B shows that after removal only 10 percent of the other-category documents were near the mean of category 95, and 50 percent were more than four standard deviations away. Figure 3 shows additional information concerning the degree of similarity of two categories. The dashed line close to the solid line is the distribution of documents belonging to category 93. When two distributions are close to each other, it can be interpreted that they belong to the same population rather than two distinct populations. Even after removal of the 20 less representative documents from each category, the lines are closer than expected.
Thus the categories probably represent a related subject.

As a result of these experiments, it became apparent that a more analytical technique would be required to classify documents, and also to analyze misclassifications. A metric that is not biased by the parameters of the data from which it was derived seems to be needed in measuring relevance and the effects of the parameters. Mahalanobis' D² is a metric that appears to satisfy these conditions. Therefore, the objective of our latest experiment was to test the effectiveness of multiple discriminant functions and Mahalanobis' D² for classifying documents. The steps in the classification procedure will be illustrated in section 3 by the detailed description of the latest experiment.

3. Classification Procedure

A user starts with a set of documents and decides on a group of subjects of interest to him. He then partitions this set into subsets of documents belonging to the various subject categories. These documents will be called reference documents and are used to compute mean frequencies and variances of each word type. In this experiment the solid-state categories as defined by the Cambridge Communications Corporation (CCC) were used. The reference set consisted of 320 documents. CCC had previously classified 80 of these documents into each of four categories, as shown in figure 4. In this experiment classification was performed only at one level. Topics included in each of the categories are shown in figure 4, to indicate the level of difficulty presented by this data base.

FIGURE 4. Experimental solid-state structure. [Topics include communications, conductive devices, computers, photo devices, power, magnetic devices, instrumentation, crystal structure, crystal physics, crystal growth, electrical properties, magnetic properties, optical properties, thermal properties, crystal surfaces, and environmental effects.]
Since CCC can be considered the user, the definition and structure of categories are determined by their outline of solid-state categories. In an operational situation, the method provides an opportunity for the user to improve the initial definition of the categories after a preliminary computer run. The degree of improvement is entirely under the direction of the user. He is given several control statistics which tell him the amount of dispersion in each category, the amount of overlap of each category with every other category, and the discriminating power of the variables. He can add, remove, or redefine categories to suit the specificity of his particular needs. These statistics are based on the sample of documents that he assigns to each category. Thus the user is not obligated to define each subject category with merely a word label. He is free to supply any documents which contain his concept of that subject. Various users of an identical set of documents can thus derive their own structure of subjects from their individual points of view.

At the next step, the reference documents are input to the word-counting program. The program computes, for each word type in a category, the mean frequency as well as the variance. The pooled within-category variance, the among-category variance, and an F ratio (described below) are computed. At this point there is an F value for every word type that occurred in a document. Previous experiments indicate that not all word types need to be retained for the classification equation. But what criterion can be used to select the words to be retained? This is a question which has frequently been underemphasized in the classification process. Ideally the criterion should be similar to the one used by indexers and classifiers. Therefore, we have used a statistical criterion which appears to quantify the intuitive criterion that has been used.
The intuitive criterion is one in which words that represent a category should occur in nearly every document of that category and should not occur in documents belonging to another category. If they do occur in documents of another category, near the same frequency, ambiguity exists, and the word will not be a good predictor. Two easily obtained statistics can represent this criterion. The consistency with which a word occurs in each document in a category can be measured by the pooled within-category variance, W. The deviation of the frequency of occurrence of a word in documents belonging to different categories can be measured by the among-category variance, A. The ideal predictor should occur regularly in all the documents of a category; therefore its W should be low. It should not occur with the same frequency in documents of the other categories; therefore its A should be high. It was noted that, by forming the ratio F = A/W, the value of F quantifies the qualitative criterion because it is high for excellent predictors and low for poor ones. This F ratio is similar to the multivariate maximizing condition of discriminant analysis. Figure 5 lists the 48 most discriminating words selected in this experiment relative to the above F ratio.

Only the frequencies of these 48 words are used in the actual computation of the discriminant function. The object of this computation is to find the optimum linear combination of weighting coefficients for these words. Each of the 48 words has a set of weighting coefficients which represents its discriminating ability with respect to each of the various categories. Since these coefficients are affected by the definition of the categories, words will have a different set of weights depending on the context. Classification can now be achieved by comparing the observed frequency of each of the 48 word types to their corresponding mean frequencies in each category.
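The F = A/W criterion above can be sketched for a single word type as follows. The per-document frequencies are hypothetical, and the particular variance estimators (with their degrees of freedom) are one reasonable reading of the description:

```python
def f_ratio(freqs_by_category):
    """freqs_by_category: one list per category giving a word's frequency
    in each reference document of that category.  Returns F = A / W, the
    among-category variance over the pooled within-category variance."""
    cat_means = [sum(f) / len(f) for f in freqs_by_category]
    grand = sum(cat_means) / len(cat_means)
    among = sum((m - grand) ** 2 for m in cat_means) / (len(cat_means) - 1)
    pooled_ss = sum((x - m) ** 2
                    for f, m in zip(freqs_by_category, cat_means) for x in f)
    n_docs = sum(len(f) for f in freqs_by_category)
    within = pooled_ss / (n_docs - len(freqs_by_category))
    return among / within if within else float("inf")

# A good predictor: steady within each category, different across them.
good = f_ratio([[2, 2, 3], [0, 0, 1]])
# A poor predictor: about the same mean frequency in both categories.
poor = f_ratio([[1, 2, 1], [2, 1, 1]])
```

With these data the good predictor scores F = 6 while the poor one scores F = 0, matching the qualitative criterion that F should be high for excellent predictors and low for poor ones.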
When the comparison is performed by the classification equations, each word type is weighted by its discriminant coefficient, its own variance, and its covariance with other word types. Thus frequency is not the sole criterion for classification. Compensation for its discriminating ability in context and for its dependence on other words is included. A relevance value is computed for each document with respect to each category. All relevance values can be retained for retrieval purposes, or an additional step of assignment to one or more categories can be made.

FIGURE 5. Mean frequencies of discriminating words.

    Word      91-Applic. of   93-SS     94-SS Metallurgy   95-SS
              SS Devices      Devices   and Chemistry      Physics
    CIRCUI        .94           .25           .00            .03
    COUNTE        .33           .01           .01            .01
    DESIGN        .25           .08           .05            .01
    DETECT        .50           .04           .05            .03
    NOISE         .22           .05           .00            .05
    OUTPUT        .73           .14           .00            .01
    POWER         .63           .21           .10            .19
    PULSE         .68           .06           .00            .05
    REGULA        .24           .01           .03            .00
    STABIL        .16           .05           .04            .00
    SWITCH        .60           .29           .00            .03
    TRANSI        .98           .90           .04            .28
    CONSTR        .08           .09           .03            .01
    CURREN        .78           .85           .06            .56
    DEVICE        .21           .45           .04            .00
    FERRIT        .06           .25           .06            .01
    FREQUE        .28           .83           .05            .54
    HIGH          .34           .53           .21            .31
    JUNCTI        .10           .79           .10            .03
    MADE          .23           .29           .14            .19
    MAGNET        .43           .76           .03            .61
    P             .06           .34           .13            .23
    TUNNEL        .00           .05           .00            .01
    VOLTAG        .56           .71           .00            .38
    CRUCIB        .00           .00           .25            .00
    CRYSTA        .14           .20          2.28            .63
    DISLOC        .00           .00           .68            .03
    FURNAC        .00           .00           .13            .00
    GROWTH        .00           .00          1.18            .04
    ION           .06           .00           .15            .15
    MICRON        .01           .00           .14            .09
    OXIDE         .00           .00           .16            .11
    SEED          .00           .00           .16            .00
    SINGLE        .16           .09           .59            .29
    TEMPER        .25           .24           .74           1.58
    VAPOR         .00           .00           .28            .00
    AND          3.06          3.28          3.34           4.54
    DEPEND        .05           .13           .20            .63
    EFFECT        .16           .25           .14            .98
    ELECTR        .34           .43           .31           1.96
    FERROE        .00           .00           .00            .24
    FIELD         .06           .69           .05           1.25
    IMPURI        .00           .09           .30            .24
    INTERA        .00           .03           .03            .24
    OXYGEN        .00           .00           .09            .20
    PHONON        .00           .01           .00            .20
    PIEZOR        .00           .00           .00            .16
    TRANSV        .00           .04           .01            .24

4. Linear Discriminant Functions

Suppose there are c categories with pj documents in the jth category (j = 1, 2, ..., c).
For each document find the n values representing measurements on the n variates x1, x2, . . ., xn. One problem of interest here is to classify a document into the appropriate categories on the basis of the set of n values when it is known that the document belongs to at least one of the categories. The first aspect is concerned with whether these n variates can distinguish among the c categories. If so, then the distance between separating pairs of categories and the assignment of an individual document to one or more of the c categories can be considered. The linear discriminant function is one of the tools available for this process. The linear discriminant function is a function of n variables measured on each category such that this linear combination provides the best discrimination between categories. Specifically, the best discrimination is effected by maximizing the ratio of the among-category sum-of-squares of this function to its within-category sum-of-squares. As will be noted later, appropriate generalizations of this discriminant criterion have been made in the case of several groups of categories. Since the concern is with discrimination among categories, one of the first tests of interest deals with the problem of separation. That is, are the category means (centroids) distinct? Under the assumption of equality of category variances, one test of the degree of confidence with which it can be assumed that the centroids are indeed distinct is given by the Wilks' statistic:

Λ = |W| / |W + A|     (1)

The symbol Λ is the ratio of two determinants, where W is an n by n matrix whose elements are the pooled within-category sums-of-squares and sums-of-cross-products, and A is an n by n matrix whose elements are the among-category sums-of-squares and sums-of-cross-products.
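Equation (1) can be computed directly from grouped data. The following is an illustrative sketch, not the original routine:

```python
import numpy as np

def wilks_lambda(X, labels):
    """Wilks' criterion Lambda = |W| / |W + A|, where W pools the
    within-category sums-of-squares and cross-products and A holds the
    among-category sums-of-squares and cross-products. Small values
    favour rejecting equality of the category centroids."""
    grand = X.mean(axis=0)
    n = X.shape[1]
    W = np.zeros((n, n))
    A = np.zeros((n, n))
    for c in np.unique(labels):
        block = X[labels == c]
        d_within = block - block.mean(axis=0)
        W += d_within.T @ d_within
        d_among = (block.mean(axis=0) - grand).reshape(-1, 1)
        A += len(block) * d_among @ d_among.T
    return np.linalg.det(W) / np.linalg.det(W + A)

# two tight, well-separated categories: Lambda should be near zero
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
lam = wilks_lambda(X, labels)
```

When the centroids coincide, A vanishes and Λ approaches one, so the statistic behaves as described in the text.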
Values of the A matrix which are correspondingly larger than values of the W matrix result in an increasingly smaller ratio, with increasing confidence in rejecting the hypothesis of equality of category means. Now, if the centroids are distinct as measured by the Λ criterion, the questions of distance between categories and assignment of individual documents may be analyzed next. For a single word type, a possible method of classification would involve comparing the measurement of that type in the new document against the corresponding category sample mean, and assigning the item to the category for which the mean is closest to the measurement. For the multivariate case (i.e., the case in which there are n ≥ 2 variates) one of the simplest transformations would be a linear combination of the n variates resulting in a single quantity. Consider, for example, the linear combination

X = c1x1 + c2x2 + . . . + cnxn

where X is the value resulting from the linear combination, x1, x2, . . ., xn are the measurements, and c1, c2, . . ., cn are a set of coefficients chosen in such a way that the best discrimination is effected. That is, the set of coefficients which should be chosen is of the type which satisfies the discriminant criterion stated above. It has been shown (see, e.g., Bryan [2]) that the condition for maximizing the ratio of the among-category sum-of-squares to the pooled within-category sum-of-squares is satisfied by solving the determinantal equation

|W⁻¹A − λI| = 0,     (2)

where I is the identity matrix, W and A are as defined previously, and λ is any one of the n eigenvalues to be determined. The eigenvector corresponding to λ provides the set of coefficients for a discriminant function which transforms the n individual measurements into a single value or discriminant score. This discriminant score is then the basis for assigning an incoming document to one of the categories.
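Equation (2) is an eigenproblem for W⁻¹A. A brief sketch of obtaining the discriminant coefficients that way (illustrative names, assuming W is nonsingular):

```python
import numpy as np

def discriminant_directions(W, A):
    """Solve |W^-1 A - lambda I| = 0: the eigenvectors of W^-1 A,
    ordered by decreasing eigenvalue, give the coefficient sets c that
    maximize the among- to within-category sum-of-squares ratio of the
    score X = c1*x1 + ... + cn*xn."""
    vals, vecs = np.linalg.eig(np.linalg.inv(W) @ A)
    order = np.argsort(vals.real)[::-1]   # largest eigenvalue first
    return vals.real[order], vecs.real[:, order]

# toy matrices with a known answer: eigenvalues 4 and 1,
# first direction along the first coordinate axis
W = np.eye(2)
A = np.diag([4.0, 1.0])
vals, vecs = discriminant_directions(W, A)
```

A document's discriminant score is then the dot product of its measurement vector with a returned eigenvector.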
In dealing with the problem of discriminating among several categories, more than one dimension is considered, since there is no reason to assume that the centroids are collinear. It follows that by taking only one linear combination, in effect a linear ordering of the categories is made. Further, a linear ordering cannot exhaust all the information in the data relevant to group separation. It has been shown (see, e.g., Bryan [2]) that the linear combinations corresponding to the previously discussed eigenvectors have the following property: the first linear combination, corresponding to the largest eigenvalue λ1, maximizes the discriminant criterion in the sense that one is discriminating between two categories; the second linear combination, corresponding to the second largest eigenvalue λ2, maximizes the ratio of the residual among-category sums-of-squares to the residual within-category sums-of-squares after the effect of the first has been removed, and so forth. Furthermore, the number of solutions of the determinantal equation such that λ ≠ 0 is at most equal to the smaller of the two numbers c − 1 and n. These solutions are the multiple discriminant functions (MDFs) and exhaust the total discriminative power of the variables relevant to category separation. The MDFs are a powerful tool in that they preserve the information given by the variables relevant to group separation and yet allow one to classify in an m-dimensional space, where m = min (c − 1, n). The eigenvectors of the MDFs can be used to form a transformation matrix V whose columns are the m eigenvectors:

V = (v1, v2, . . ., vm)     (3)

The vector of means for each category, the dispersion matrix for each category, and the vector of observations for an incoming document are each appropriately transformed to a reduced discriminant space having only m dimensions. The classification question is now posed in the reduced space: how far does an observation lie from the centroid of each category?
Mahalanobis' D² (see, e.g., [7]) can again be used to measure this distance, using values derived for the reduced space by the transformations indicated above. An incoming document will then be assigned to the category for which its Mahalanobis' D² value is smallest. The number of dimensions has thus been reduced considerably, and at the same time the MDFs have preserved, in this reduced space, the effect of the most discriminating variables. The D² value in the reduced space is also used to represent the relevance value of an individual document. For the distributional properties of Mahalanobis' D², see reference [7]. Upon making the necessary assumptions (which need, of course, to be tested further), most of the necessary computer programs for the procedure described above can be found in reference [3].

5. Interpretation of Discriminant Functions

The separability of the solid state categories can be observed in either the original 48-dimensional variable space or in the reduced three-dimensional space. Figure 6 shows the centroids of the four categories in the reduced three-dimensional discriminant space. Figure 7 shows that category 93 has a larger percentage of overlap than any other category. In addition to these visual checks, a statistical check can be made with Wilks' Λ test.

91 (−0.95, 0.54, −0.73)    93 (−0.19, 0.58, 0.47)
94 (0.76, −0.68, −0.39)    95 (1.05, 1.37, −0.47)

Figure 6. Category centroids in three-dimensional discriminant space.
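The reduced-space assignment can be sketched as follows, assuming for simplicity a single pooled dispersion matrix for all categories; the function and variable names are illustrative, not from the paper's programs:

```python
import numpy as np

def classify_reduced(x, V, cat_means, pooled_cov):
    """Project a document vector and the category centroids through the
    MDF matrix V (n x m), then assign the document to the category whose
    centroid is nearest in Mahalanobis D^2 within the reduced space."""
    y = V.T @ x
    S_inv = np.linalg.inv(V.T @ pooled_cov @ V)  # reduced-space dispersion
    d2 = {}
    for cat, mu in cat_means.items():
        d = y - V.T @ mu
        d2[cat] = float(d @ S_inv @ d)
    best = min(d2, key=d2.get)
    return best, d2   # d2 doubles as per-category relevance values

# toy example: identity projection, two categories
V = np.eye(2)
means = {91: np.array([0.0, 0.0]), 94: np.array([5.0, 5.0])}
best, d2 = classify_reduced(np.array([0.2, 0.1]), V, means, np.eye(2))
```

Retaining the full d2 dictionary corresponds to keeping all relevance values for retrieval, as described in section 3.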
Figure 8. Normalized coefficients of words in discriminant space (the 48 discriminating words on the three discriminant axes).

CATEGORY    91     93     94     95
   91      .987   .013   0.0    0.0
   93      .181   .738   .046   .035
   94      0.0    .002   .997   .001
   95      0.0    .003   .003   .994

Figure 7. Overlap of categories.

Analysis of the coefficients of the discriminant functions shown in figure 8 indicates how the separation of categories is achieved. The first 24 words generally have negative coefficients, and the last 24 generally have positive coefficients. This means that the first discriminant function divides the space into two parts. If discrimination between only the two pairs of categories 91 and 93 or 94 and 95 were desired, it could be achieved along this axis. In the second discriminant function the coefficients of words for categories 91, 93, and 95 are generally positive and for category 94 are negative; therefore it appears to provide a decision boundary between categories 94 and 95.
In the third discriminant function the coefficients of words for categories 91, 94, and 95 are generally negative and for category 93 are positive; therefore it appears to provide the decision boundary between categories 91 and 93. The relation of the decision boundaries to each category can also be observed from the coordinates of each centroid as shown in figure 6. A few examples will show how discriminant functions are used to transform a 48-dimensional space to a three-dimensional space. Since the coefficients in figure 8 are normalized, the square of these values is the percentage of discrimination contributed by each word. Thus, the word AND contributes less than one percent on each of the axes, whereas OXIDE accounts for four percent on the first axis. The direction of the effect of each word can be observed in the three-dimensional reduced space by letting its value in each discriminant function equal one and the value of all other words equal zero. Three different types of words will be discussed: (1) CRUCIB, which occurs in one and only one category; (2) OXIDE, which occurs in two categories; (3) AND, which occurs in all four categories. CRUCIB (0.0, 0.0, 0.25, 0.0) is a word which has a significant difference between its means and which occurs in only one category. Its discriminant coefficients (0.09, −0.18, −0.06) as shown in figure 8 lie near the centroid of category 94, as expected. OXIDE (0.0, 0.0, 0.16, 0.11) has a significant difference between pairs of means, but not within the pairs. In some techniques this word would not be retained as a predictor. However, in the discriminant technique, utilization of this information can be easily seen through its discriminant coefficients (0.21, 0.0, −0.09). The positive value on Axis I indicates that it is in either 94 or 95, whereas the zero value on Axis II indicates that OXIDE has little discriminating power between 94 and 95.
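The squared-coefficient arithmetic above can be checked directly (coefficient values as quoted from figure 8):

```python
# Squared normalized coefficients give each word's share of the
# discrimination along an axis.
oxide_axis1 = 0.21                 # OXIDE's coefficient on Axis I
and_coeffs = (0.02, 0.04, -0.02)   # AND's coefficients on the three axes

oxide_share = oxide_axis1 ** 2 * 100              # roughly 4 percent
and_share = max(c ** 2 * 100 for c in and_coeffs) # well under 1 percent
```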
AND (3.06, 3.28, 3.34, 4.54) has an insignificant difference between all its means and, as expected, its discriminant coefficients (0.02, 0.04, −0.02) are very low on all axes. Thus, analysis of the discriminant procedures indicates that the results do have a meaningful interpretation. Significant words will have high discriminant coefficients, whereas insignificant words occurring in the intersection of all categories will lie near the origin.

6. Results and Potential Use

The classification procedure outlined in section 3 was used to classify both the 320 reference documents and 474 independent test documents. The percentages of correct classifications shown in figure 9 are based on all documents input to the system.

INDEPENDENT TEST DOCUMENTS
Actual    Desired:  91      93      94      95     Total documents
  91               63.95   30.23    3.49    2.33         86
  93               18.75   64.58    6.25   10.42         48
  94                1.21   10.30   80.61    7.88        165
  95                4.67   21.70   30.21   43.42        175

REFERENCE DOCUMENTS
Actual    Desired:  91      93      94      95     Total documents
  91               87.50   11.25    1.25    0.0          80
  93               11.25   75.00   10.00    3.75         80
  94                0.0     5.00   90.00    5.00         80
  95                1.25    6.25    8.75   83.75         80

Figure 9. Percentage of correct classifications.

Even though some documents may not contain any of the discriminating words (i.e., the small set of only 48 types), their results are included in the percentages. Therefore these results were achieved by using only 48 of the 3155 total word types in all 320 reference documents. Only 80 documents were used to represent each category. In the selection of reference documents for category 95, the longest documents were intentionally placed in the reference set and the shortest in the test set. The results for the test set of category 95 indicate that compensation for variation in document length must be considered. These two are the most obvious parameters to change in order to increase classification accuracy. Another important parameter is the range of document length.
The procedure described in this paper was utilized in order to assist in content analysis, that is, in determining what subject or subjects are covered by a particular document. The unique feature of this statistical approach is that it provides for an analysis of a set of documents from many divergent points of view. For example, if three user groups, who are interested in the political, electronic, and military aspects of a situation, all receive the same set of documents, how can the documents be indexed or classified to serve the different needs of each user? The present technique permits a matching of incoming documents against statistically derived profiles which are specifically oriented toward the user's point of view. These profiles could be derived for each group and to any level of detail specified. They could be determined independently of the other users' needs, or combined at a higher and more general level. Since the technique is based on an analysis of variance of word-type frequencies, the definition of these word types can be changed to suit specific requirements. A word can be defined as a string of n characters, so that foreign-language documents can be processed as a separate group without translation. The technique is also general enough to handle various intervals of text. The textual interval to be classified could be a whole document, an abstract, a section, a paragraph, a sentence, or a set of key words. The system output is not limited to subject classification, because relevance values are computed and retained for each document with respect to every category. The output for each document could also include each of the discriminating words that actually occurred in the document, at each level of the structure.
Furthermore, the following retrieval aids could be made available: (a) association factors at every level (either for each subject separately or for all subjects within that group); (b) lists of the most discriminating words for every category, ranked in descending sequence of their discriminating ability. With these aids, retrieval could be accomplished by subject heading, descriptors, associated words, or a narrative query. For retrieval by subject heading, the user would request all documents in the desired category having a relevance value higher than some specified threshold. Retrieval by narrative query would be entirely analogous to the matching of an incoming document against all available categories. The output in this case would indicate which categories are most relevant to the request, and these categories could then be searched in descending sequence. It appears that the system would be capable of detecting changes in disciplines or relationships of subjects. Each group of categories should contain one which is "general" or "all other." Periodically the distribution of relevance values for all documents processed in the preceding period would be compared with the distributions previously established for each category. The fact that words from two disciplines are now being used interchangeably can be detected easily by noticing that the measured overlap between two categories is becoming greater. The arrival of new words and concepts can be detected either when the dispersion of a category increases or when a new word moves up on the ranked discriminating-word list. Consistent increases in rank can be detected very early, for example a change from rank 1000 to 900. When a change in the structure is required, documents can easily be reclassified, since the permanent machinable form of the document is condensed at one point to a single record of word-frequency pairs.
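The subject-heading retrieval step described above can be sketched as a threshold filter over stored relevance values. All names here are illustrative, and relevance is assumed to be scaled so that higher values mean greater relevance:

```python
def retrieve(doc_relevance, category, threshold):
    """Return documents whose relevance to `category` exceeds the
    threshold, ranked most relevant first.

    doc_relevance: {doc_id: {category: relevance_value}}"""
    hits = [(rel.get(category, 0.0), doc)
            for doc, rel in doc_relevance.items()]
    hits = [(r, doc) for r, doc in hits if r > threshold]
    return [doc for r, doc in sorted(hits, reverse=True)]

# toy store of retained relevance values
docs = {"d1": {"95": 0.9}, "d2": {"95": 0.4}, "d3": {"91": 0.8}}
top = retrieve(docs, "95", 0.5)
```

A narrative query would reuse the same machinery: score the query against every category, then search the most relevant categories in descending sequence.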
When a change occurs in a group, only the documents having a significant relevance value with respect to the categories of that group are reclassified and the appropriate files updated. Interpretation of textual subject matter may vary widely depending on a user's background, current interest, and other factors. For effective classification and retrieval, it is essential that some means be provided which will allow a variable "point of view" in information processing. It is believed that the discriminant procedures described here are not only responsive to this operational requirement, but also furnish valuable analytical tools for use in content analysis.

7. References

[1] Williams, J. H., A discriminant method for automatically classifying documents, Proc. Fall Joint Computer Conf. 24, 161-166 (1963).
[2] Bryan, J. G., The generalized discriminant function: mathematical foundations and computational routine, Harvard Ed. Rev. 21, 90-95 (1951).
[3] Rao, C. R., Advanced Statistical Methods in Biometric Research (John Wiley & Sons, New York, N.Y., 1952).
[4] Cooley, W. W., and P. R. Lohnes, Multivariate Procedures for the Behavioral Sciences (John Wiley & Sons, New York, N.Y., 1962).
[5] Borko, H., and M. Bernick, Automatic document classification, J. Assoc. Comp. Mach. 10, No. 2, 151-162 (1963).
[6] Edmundson, H. P., and R. E. Wyllys, Automatic abstracting and indexing — survey and recommendations, Commun. Assoc. Comp. Mach. 4, No. 5, 226-234 (1961).
[7] Maron, M. E., Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8, No. 3, 404-417 (1961).
[8] Posten, H. O., Bibliography on Classification, Discrimination, Generalized Distance and Related Topics, RC-743, IBM, Yorktown, N.Y. (1962).
[9] Tatsuoka, M. M., and D. V. Tiedeman, Discriminant analysis, Rev. Ed. Res. 24, No. 5, 402-416 (n.d.).
[10] Williams, J. H., Statistical Analysis and Classification of Documents, IRAD Task No. 0274, IBM, FSD, Rockville, Md. (1963).
Rank Order Patterns of Common Words as Discriminators of Subject Content in Scientific and Technical Prose

Everett M. Wallace
System Development Corporation
Santa Monica, Calif. 90406

There is a style of language characteristic of different subject areas which is particularly noticeable in scientific and technical writing. It is not only the unique vocabulary of a subject field which sets it apart from others, but also the different habits of writers in using the most common words. An experiment was devised to test whether these differences could be used for subject discrimination in addition to identification of unique vocabulary, and particularly to determine whether or not author variation in style is sufficiently great to override the variation from field to field. Fifty IRE abstracts in the field of electronic computers and fifty Psychological Abstracts were matched, one abstract at a time, one word type at a time, against two lists of words ranked in descending order of frequency as they occurred within two different sets of 300 psychological and computer abstracts. All fully inflected forms of all function and content words were included in the rankings. Using the first 50 ranks only of the two lists, 93 percent of the abstracts were successfully discriminated. For the first 75 and 100 ranks, the success rates were 96 percent and 97 percent, respectively.

1. Introduction

There is little reason to be satisfied with current information system designs for either dissemination or retrieval. The use of condensed representations in the form of class categories or index terms has limitations. Systems using such devices appear, inherently, to produce a great deal of "noise," as can be seen in the recent work on relevance/recall ratios. Whole-text or "natural language" processing approaches appear to offer the greatest promise of improvement in retrieval systems.
The designers of prose-processing schemes, however, have encountered serious difficulties in building systems which are both practical and economical. A major problem in working with natural language is the range of variation in linguistic behavior. The wide range of variation has been an obstacle to successful predictive generalization, whether applied to mechanical or human information storage and retrieval. One reason for the current difficulties is that we do not have a sufficiently precise knowledge of the stochastic parameters of language, particularly as it is used in different subjects and contexts. A second reason is that efforts directed at statistical techniques of linguistic analysis have concentrated upon the relatively infrequent verbal constructs. It has been a common practice in building language-processing programs to reduce the number of different entities which must be handled by excluding the most common articles, prepositions, conjunctions, and auxiliary verb forms, and by combining inflected forms of common roots. Such procedures do result in the loss of a certain amount of information. Through reading the reports of G. Yule [1],¹ G. Herdan [2], and F. Mosteller and D. Wallace [3] in establishing the authorship of disputed works, I was led to consider ways in which this lost information could be recovered and used to supplement established methods. G. K. Zipf [4] had already shown one way of using rank order distributions of words. Others have indicated that there is a considerable range of variation in the way individual authors use the most commonly occurring words in a language in different contexts. There is a style of language characteristic of different subject areas which is particularly noticeable in scientific and technical writing.

¹ Figures in brackets indicate the literature references on p. 228.
It is not only the unique vocabulary of a subject field which sets it apart from others, but also the different habits of writers in different fields in using common prepositions, nouns, and verbs. This is most clearly illustrated in mathematical writing, in which symbology is embedded in a highly stylized form of prose, sufficiently unlike ordinary language to be considered a distinct dialect. The growth of "dialects" in this sense is common to all subjects in varying degrees. The question is whether these behavioral differences are sufficiently distinctive to provide a basis for subject discrimination in addition to the identification of unique vocabulary. One of the first considerations in estimating whether a practical discriminator could be built was whether or not author variation in style is sufficiently great to override the variation from field to field. An experiment was devised to test this proposition and to gather evidence for identification of statistical parameters and techniques useful for subject discrimination.

2. The Experiment

An experimental corpus was selected consisting of 350 Psychological Abstracts and 350 IRE abstracts from the Transactions of the Professional Group on Electronic Computers (PGEC). The abstracts were available at System Development Corporation in machine-readable form.² This corpus was considered to provide an adequate reflection of author variation, in that the abstracts had largely been written by different persons, including authors of the papers abstracted. Three hundred psychological abstracts and 300 PGEC abstracts were taken from the corpus for establishment of population "profiles" of the two subject areas. The profiles consisted of two lists of the most frequent 100 words ranked in descending order of occurrence within the two sets of 300 abstracts. A System Development Corporation computer program called FEAT was used to provide the counts and listings.
The appendix presents a consolidated alphabetic list of the words in the two profiles, together with their rank numbers. Where occurrence frequencies of two or more words were equal, a word-length criterion was applied such that the shorter word was given the higher rank. This was based on the assumption that, in general, short words are more prevalent than long ones. When word length as well as frequency were equal, the words were ranked in alphabetic order. A version of the FEAT program was used to count and list the words in each of the 100 abstracts remaining in the experimental corpus of 700. Each abstract was matched, one word type at a time, against the two profiles of 100 rank-ordered words. The words in each abstract occurring in one or both of the two profiles were recorded, together with their rank numbers.

Figure 1. Psychological abstract No. 1 (54 word types): words in the abstract matched against the first 50, 75, and 100 ranks of the two profiles, with the number of words in common under each cutoff.

Figure 2. IRE PGEC abstract No. 1 (15 word types): the same matching, with rank-number sums recorded to break ties in the number of words in common.

The purpose of this procedure was to segregate the abstracts into two files, psychological and PGEC abstracts, respectively. After considering a number of decision rules, the following criteria were adopted: 1. An abstract belongs to psychology if the number of words in common with the psychology profile is greater than the number in common with the PGEC profile, and conversely. 2.
If the number of words in common in the abstract and the two profiles were equal, the sum of the rank numbers of those words on the two lists would be determined, and the abstract assigned to the profile with the smaller sum. If the sums were equal, no decision would be made.

Figures 1 and 2 illustrate the data recorded and the results of matching two abstracts against the first 50, 75, and the full 100 ranks of the two profiles. In both cases the number of words in the abstracts contained in the first 50 ranks of the two profiles is the same. Summing the rank numbers permits both abstracts to be correctly discriminated by the rule given. The following table summarizes the results of matching the psychological and PGEC abstracts against the first 50, 75, and 100 ranks of the profiles.

Number correctly discriminated
                               50 ranks   75 ranks   100 ranks
50 Psychological abstracts        43         46          47
50 IRE PGEC abstracts             50         50          50
Success ratio                    93%        96%         97%

All of the abstracts which were cast into the "wrong" category by this procedure were psychological abstracts. Examination of the abstracts contributing to the profiles suggests several reasons for this. The PGEC abstracts represent a more specialized subject matter than those from Psychological Abstracts. In general, the PGEC abstracts contain fewer word types used more frequently. Consequently the counts contributing to the PGEC profile are higher than those of psychology.

² The abstracts were drawn from the experimental sets used originally by Borko for automatic classification and by Maron for automatic indexing.
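The profile ranking and the two decision rules can be sketched as follows; function names and the toy profiles are illustrative, not from the FEAT program:

```python
def build_profile(freqs, depth=100):
    """Rank words by descending frequency; ties broken by shorter word,
    then alphabetic order. Returns {word: rank} for the first `depth`
    ranks."""
    ordered = sorted(freqs, key=lambda w: (-freqs[w], len(w), w))
    return {w: r for r, w in enumerate(ordered[:depth], start=1)}

def assign(abstract_words, psych_profile, pgec_profile):
    """Decision rules from the experiment: more words in common wins;
    on a tie, the smaller rank-number sum wins; equal sums, no
    decision (None)."""
    psych = [psych_profile[w] for w in abstract_words if w in psych_profile]
    pgec = [pgec_profile[w] for w in abstract_words if w in pgec_profile]
    if len(psych) != len(pgec):
        return "psych" if len(psych) > len(pgec) else "pgec"
    if sum(psych) != sum(pgec):
        return "psych" if sum(psych) < sum(pgec) else "pgec"
    return None

# tie-break demonstration: "ego" and "subject" share a frequency,
# so the shorter word gets the higher (smaller-numbered) rank
pp = build_profile({"the": 50, "of": 40, "subject": 10, "ego": 10})
```

Matching one word type at a time, as in the experiment, corresponds to passing the abstract's set of word types rather than its full token stream.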
Psych PGEC  Word         Psych PGEC  Word         Psych PGEC  Word
   6    3   a              56   69   have           43   70   other
  16   12   an              4    7   in             58   59   presented
   3    4   and            91   64   into           52   88   problems
   8    9   are             7    5   is             45   67   some
  11   20   as             23   24   it             71   48   such
  39   43   at             49   90   its            82   18   system
  17   13   be             50   91   may            29   49   than
  14   14   by             44   22   method         12   19   that
 100   23   can            72   63   methods         1    1   the
  80   37   data           30   66   more           33   44   these
  34   21   discussed      90   53   new            19   25   this
  84   54   each           94   50   number         46   62   time
  10    8   for             2    2   of              5    6   to
  96   68   function       13   16   on             36   61   two
  69   94   general        65   41   one            20   11   which
  64   58   has            21   27   or              9   17   with

Mean difference in rank = 17.4

Figure 3. Rank numbers of the 48 words in common in the first 100 ranks of psychological and IRE PGEC abstract profiles.

In examining the results it was found that, at the 100-rank level, 88 percent of the successfully discriminated abstracts were dependent on the 52 words that are unique to each profile, with 9 percent successfully decided through summing the rank numbers. It was considered useful to investigate the discrimination to be obtained by the rank-sum criterion alone, using only words common to the profiles. There are 48 words in common on the profiles in the first 100 ranks. Figure 3 lists the words in common and their ranks. The mean difference of rank for these words is 17.4, with the lower ranks tending to larger differences than the higher ranks. As can be seen from the figure, function words predominate. The following table shows the results of matching the 100 abstracts against the list of 48 words common to the profiles and applying the rank-sum criterion:

                             Correct   Incorrect
50 Psychological abstracts      36        14
50 IRE PGEC abstracts           42         8
Percentage                     78%       22%

3. Conclusions

The results of this experiment indicate that author variation in style imposes no serious obstacle to using patterns of common words as discriminators.
Considering the length of the profiles, the small size of the sample contributing to the profiles, and the limited number of word types contained in individual abstracts, the success ratios are surprisingly high. It is uncertain, however, to what degree the results are biased by editorial conventions and style. The results also tend to support the idea that there is much useful information to be found in the high-frequency area of word occurrence, and that frequency alone can provide a basis for subject discrimination of widely different fields, particularly when all word-type occurrences of fully inflected forms are taken into account. Further work is required to establish the precision which may be expected of such a technique, especially if applied to fields more closely related than psychology and computers.

4. Potential Applications

A system designed to make use of common-word patterns through a technique similar to that described in this paper would include a short table intended to combine the functions of an exclusion list with identification of broad subject areas. Such a quick initial segregation would reduce the search time required for matching against the particular vocabulary of those areas. Figure 4 illustrates the contrast between using a large dictionary, with the familiar features of exclusion lists, root stripping, and an extended search of a long table, and the approach suggested here. The initial segregation would lead directly to a relatively short specialized dictionary or to a mismatch monitor. The thesaurus devices necessary to a large dictionary could be simplified, and the range of ambiguity inherent in terms used in many different fields would be narrowed. It is quite feasible to use specialized tables now, provided the texts are segregated by subject prior to input.
This approach, however, looks forward to the application of optical readers for the transformation of printed text to machine-readable form in systems that do not require the intervention of a human mind for prior subject classification.

Figure 4. Schematic flow contrasting a conventional technique (text, exclusions and root stripping, large dictionary, processed text) with the suggested approach using common word patterns (text, exclusions and profile matching table, processed text).

5. References

[1] Yule, G. U., A Statistical Study of Vocabulary (Cambridge Univ. Press, 1944).
[2] Herdan, G., Type-Token Mathematics ('s-Gravenhage, Mouton & Co., 1960).
[3] Mosteller, F., and D. L. Wallace, Inference in an authorship problem, J. Am. Statist. Assoc. 58, No. 302 (June 1963).
[4] Zipf, G. K., Human Behavior and the Principle of Least Effort (Addison-Wesley, 1949).
Borko, H., The construction of an empirically based mathematically derived classification system, Proc. Spring Joint Computer Conf. 21, 279-289 (1962).

6. Appendix. The Profiles

The 300 Psychological Abstracts used to build the rank-ordered profiles for this experiment contained a total of 22,175 word occurrences of 4,587 word types. The 300 IRE PGEC abstracts contained 23,200 word occurrences of 3,678 word types. The mean number of word occurrences per abstract was 77.3 for PGEC versus 73.9 for psychology. When broken into subsets, both samples exhibited a broad internal range of variation in the expectation that a given word would appear at a given rank, with the broader range appearing in the Psychological Abstracts set. The following table presents a consolidated alphabetic list of words occurring in the first 100 ranks of the IRE PGEC and Psychological Abstracts profiles, together with their rank numbers. Dots (....) are used instead of a rank number to indicate that the word does not occur in the first 100 ranks of one or the other of the profiles.
[Table: consolidated alphabetic list of the word types in the first 100 ranks of the two profiles (a, all, an, analysis, and, ... through ... which, with), with their Psychological Abstracts and IRE PGEC rank numbers; the word-to-rank alignment is not recoverable from this copy.]

Clumping Techniques and Associative Retrieval¹

A. G. Dale and N. Dale
The University of Texas, Austin, Tex.

Experimental work applying clump theory to the problem of defining word associations useful for document retrieval is described. A clump-finding computer program developed by the authors has been successfully used to clump key words in a document-key word data set previously used by H. Borko of System Development Corporation and M. Maron of RAND Corporation for classification experiments described in the literature. The main features of the program, which permits several analytical options at execution time, are described.
An analysis is made of word associations implicit in a collection of GR-clumps found under a given term-term connection definition. Clump intersections define small subsets of terms that possess identical properties of contextual distribution, and the structure of the subsets forms an associative network useful for retrieval. An algorithm for associative retrieval is suggested. Information on the membership of key words in GR-clumps can be used to define the context of a retrieval request and to provide a rapid partitioning of the document set into relevant and nonrelevant subsets. Clump associations can then be used to order the prospectively relevant documents for output.

1. Introduction

This paper summarizes experimental work applying clump theory [1, 2, 3, 4]² to the problem of defining word associations in a context where documents are described by key terms, and of implementing a retrieval process within an associative network produced by key-term clumps. For reasons discussed by R. M. Needham [3], who has been responsible for much of the existing work on clump theory, experimentation has been largely confined to work with GR-clumps.³

2. Key-Term Clumping: Data and Software

The clumping experiments were made with a data set supplied by H. Borko of System Development Corporation [5]. The data characterize the use of 90 key terms in 260 documents in a classification array in which the elements are 1 or 0, depending on whether or not a key term is used in a given document.⁴ Several connection definitions have been used in experiments to date, two of which have proved most useful with these data.
Let l(n, m) be the number of 1's in the intersection of rows n and m in the classification array (i.e., the number of co-occurrences of the nth and mth terms in the set of documents), and l(n) be the number of 1's in row n (i.e., the total number of occurrences of the nth term in the set of documents):

Connection def. 1: l(n, m)

Connection def. 2: l(n, m) / √(l(n) · l(m))

FORTRAN programs have been written to compute the appropriate connection matrices and to implement an algorithm for finding GR-clumps in the connection space.

¹ Work described in this paper was supported in part by the National Science Foundation under Institutional Grant GU-483 at The University of Texas.
² Figures in brackets indicate the literature references at end of paper.
³ The definition of a GR-clump is as follows:
U: a finite set of elements, between pairs of which there is a symmetrical relation attaching a real number to each pair, called the connection of the pair.
c(x, s): the connection of a pair of elements x and s.
S: a subset of U (s₁, s₂, . . ., sₙ).
S*: U − S.
C(x, S): Σ c(x, s) ∀ s ∈ S.
C(x, S*): Σ c(x, s*) ∀ s* ∈ S*.
b(x, S): C(x, S) − C(x, S*).
Hence the bias b(x, S) of an element x to a subset S is the excess (positive or negative) of the total connections of x to the members of S over the total connections of x to the members of S*.
GR-clump S: {x | b(x, S) ≥ 0 ∀ x ∈ S and b(y, S) < 0 ∀ y ∈ S*}.
A subset S of U is a GR-clump if all members of S have a positive or zero bias to S and all members of S* have a negative bias to S, given the convention that c(x, x) = 0.
⁴ The documents are 260 abstracts published in the March and June issues of the 1959 IRE Transactions on Electronic Computers; the topics cover computing hardware and computer applications.
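Under these definitions, the connection values and the bias b(x, S) can be sketched as follows, assuming the 1/0 classification array is held as a list of rows, one per key term; the function names are illustrative.

```python
import math

def connection(rows, n, m, definition=2):
    """Connection between terms n and m: definition 1 is the raw
    co-occurrence count l(n, m); definition 2 normalizes it by the
    geometric mean of the individual term frequencies."""
    if n == m:
        return 0.0                                  # convention: c(x, x) = 0
    l_nm = sum(a & b for a, b in zip(rows[n], rows[m]))   # co-occurrences
    if definition == 1:
        return float(l_nm)
    l_n, l_m = sum(rows[n]), sum(rows[m])
    return l_nm / math.sqrt(l_n * l_m) if l_n and l_m else 0.0

def bias(rows, x, S):
    """b(x, S) = C(x, S) - C(x, S*): total connections of x into S
    minus its total connections into the complement S*."""
    inside = sum(connection(rows, x, s) for s in S)
    outside = sum(connection(rows, x, t) for t in range(len(rows)) if t not in S)
    return inside - outside
```

A subset S is then a GR-clump when bias(rows, x, S) is nonnegative for every x in S and negative for every y outside S.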
Since the clumping procedure works iteratively from an initial partitioning of the universe, and since a prohibitive number of possible initial partitions exists, the practicability of the procedure depends upon heuristics governing the selection of initial partitions. For clumping in the sparse matrices characteristic of the type of data used in the experiments, initial partitions defined by what we have termed the pivot variable method provide useful starting points. For each variable, a set S consisting of that variable and all other terms with which it has a nonzero connection is defined, so that in a system of n terms, n initial partitions are considered. The clumping algorithm is essentially as described by Needham [4], following the initial partitioning operation.

Since the size, z, of a GR-clump is typically large (n/3 < z < n in an n-element universe), several methods for defining smaller clumps within GR-clumps have been tried. For some purposes it may be desirable to work with clumps possessing strong internal connections, and many GR-clumps contain fringe elements with small positive bias. Two promising methods found to yield useful smaller clumps within GR-clumps are as follows:

Method 1:
1. Remove elements with minimum bias (min(b)).
2. Recompute the bias of each remaining element over U.
3. Repeat 1 and 2 until all remaining elements have a bias to the reduced set greater than min(b).
4. The reduced set is a clump with threshold min(b).
5. Repeat 1-4.
6. The process ends when the set collapses, i.e., when no set containing elements with bias greater than min(b) can be found.

Method 2: This uses the same procedures as method 1, except that biases are computed only over the set consisting of the elements of the previous clump found.

The two methods produce quite different minimal clumps. For example, consider an element x of a GR-clump, S, with a large number of connections over U.
Its bias to S is likely to be small despite its large number of connections, and it would be transferred from S early in the method 1 procedure. However, since the sum of its connections to S may be large, its bias to reduced clumps in method 2 would also be large, and it would therefore probably be retained in the reduction process.

The clump-finding program used in the experiments is executed under three major options permitting: (1) location of GR-clumps, (2) location of GR-clumps and method 1 reduction, and (3) location of GR-clumps and method 2 reduction. The program works in core (32K) with connection matrices of up to 100 variables, and with up to 100 pivot variable initial partitions on one run. Reprogramming to handle significantly larger connection matrices is planned. The programs are being implemented on a Control Data 1604 (FORTRAN compile-and-go system), with the following average execution times for finding and reducing one GR-clump (or reaching a dead end) in a 90 × 90 connection matrix:

Fixed point: Option 1, 10.8 sec; Option 2, 20.4 sec; Option 3, 30.0 sec.
Floating point: Option 1, 5.6 sec; Option 2, 56.3 sec; Option 3, 43.8 sec.

3. Key-Term Clumping: Results

Table 1 summarizes the output of the clump-finding procedures outlined above, showing the number and mean size (number of elements) of clumps found.

Table 1.
Connection definition 1: 19 GR-clumps found, mean size 52; method 1 reduction, 13 clumps, mean size 44; method 2 reduction, 8 clumps, mean size 19.
Connection definition 2: 8 GR-clumps found, mean size 49; method 1 reduction, 7 clumps, mean size 49; method 2 reduction, 3 clumps, mean size 58.

The network implicit in the GR-clump structure, using the second connection definition, is shown in figure 1. The relationships for the definition-2 clump structure are shown for illustrative purposes since the association structure is simpler than that

[Figure 1 (diagram not reproducible): circled term subsets arranged by the number of clumps containing them, from 7 down to 2, and partitioned into two general contexts, Applications and Hardware.]

Figure 1.
Strong term associations implicit in GR-clump structure, connection definition 2. (See table 2 for contents of numbered subsets.)

implicit in the definition-1 set of clumps and is more easily diagrammed. Preliminary investigation suggests, however, that the more complex association structure given by definition-1 clumps provides better retrieval outputs. The circled numbers identify subsets of terms appearing in identical clumps; the number of clumps in which the subset appears is indicated on the left of the diagram. The contents of the numbered subsets are identified in table 2. The connecting lines in the network indicate inclusion relations. For example, two of the three clumps in which subset No. 3 (mechanical, translation) appears form an intersection in which subset No. 1 (complexity, language, Uncol) uniquely appears. These connections specify the strongest association paths in the network. An interesting contextual partition of the entire set of index terms is evident; one subnetwork deals largely with hardware topics, the second with applications, with relatively weak connections between the two. The retrieval model described below uses the contextual distributional properties of terms as a basis for associative retrieval.

Table 2.
Key to numbered term subsets in figure 1

1. complexity, language, Uncol
2. arithmetic, expressions
3. mechanical, translation
4. bound, definition, parity
5. chess, mechanisms, process, program, programming, programs
6. pseudo-random, random
7. square
8. average, differential, division, equation, equations, multiplication, solution, traffic
9. character, delays, Monte Carlo, shuttle, stage, unit
10. numbers
11. abacus, boolean, functions, matrix
12. diffusion, error
13. characters, office
14. section
15. simulation
16. analog, control, function, generator, plane
17. code, conversion, elements
18. adder, carry, network, networks, scientific, synthesis
19. communications, register, decoder, shift, wire
20. circuit, circuits, counter, logic, pulse, transistor, transistors
21. storage
22. switching
23. fields
24. element
25. barium
26. file, information, library, magnetic, processing, tape
27. memory
28. transmission
29. printed, recording
30. side
31. coding, compressions, film, speech

4. Retrieval Model

The retrieval model will be described informally. Given a collection of m documents described by n index terms, and k clumps of terms, the initial data arrays are

1. A clump-key term binary matrix, T, with elements T_ij = 1 or 0 depending on whether or not the jth term is a member of the ith clump.
2. A document-key term binary matrix, C, with elements C_ij = 1 or 0 depending on whether or not the jth term is in the ith document.

A secondary data array D = CT' can be formed, such that D_ij = the number of terms in the ith document contained in the jth clump. Considering an input request as a binary vector q of dimension n, with q_i = 1 if the ith term is included in the request and 0 otherwise, a simple retrieval model would be

e = DTq     (1)

where e is an output vector of dimension m, and e_i is the relevancy weight of the ith document with respect to the input request. It is evident, however, that this model has several defects. In particular:

1.
It is desirable to partition the set of m documents so that only relevant documents are considered for output. A possible definition of relevancy is to require that an output document possess a clump list that encloses the clump list of the request (i.e., that the union set of clumps associated with the key terms of a document enclose the union set associated with the key terms of a request). This condition proves to be overrestrictive, since it can lead to the exclusion of documents that possess some key terms included in a request. Consequently, we define a relevant document to be one that either (a) contains key words included in the request, or (b) possesses a clump list that encloses the clump list of the request. Only such documents will be considered for output.

2. It is desirable to normalize the weights, e_i, in the output vector, since in the simple model these weights are directly proportional to the number of key terms in a document.

3. Other things being equal, a relevant document with an extensive clump list should have a lower relevancy weight than a document with a shorter clump list.

4. Other things being equal, a relevant document with a larger number of key terms matching key terms contained in the request should have a higher relevancy weight than one with a lower number of matches.

A model satisfying these conditions is:

e = [D'sGV] + m     (2)

where

D' is a submatrix of D of dimension r × k, and r = the number of documents satisfying the relevancy criteria.
s = Tq, defined above.
G is a diagonal matrix of dimension r × r, such that G_ii is the ratio of the number of relevant clumps attached to document i (i.e., the number of clumps that match the clump list of the request) to its total clump list.
V is a diagonal matrix of dimension r × r, with V_ii the reciprocal of the number of key terms in document i.
m is a row vector of length r, such that m_i = k·x_i^y, where k is a constant, x_i is the number of request terms contained in the key term list of the ith document, and y is the number of key terms contained in the input request.

Thus, [D's] is a row vector, the elements of which are the crude relevancy scores for r relevant documents; [D'sG] is a row vector in which the document scores have been modified to reflect what might be termed the "contextual dispersion" of the document key terms; and [D'sGV] is a row vector of normalized relevancy scores. The values in m give added weight to documents for key word matches. The exponential weighting scheme has the desirable property of increasing the relative weight of m_i in the model for larger values of y and x_i; this is intuitively satisfactory since requests containing a large number of key terms are likely to require more specific outputs than general requests using fewer (and probably broader) terms. The value of the parameter k may be modified to adjust the relative weight of m in the model.

5. Retrieval Experiments

An algorithm simulating retrieval model (2) described above has been programmed. Retrieval requests were executed in each of the associative networks implicit in the clump structures found under the two different definitions. The principal purposes of the retrieval experiments were

1. To examine the efficiency of the model in partitioning the document collection into relevant and nonrelevant subsets.
2. To compare retrieval output from the two associative networks.
3. To examine the validity of the relevance weighting scheme.

5.1. Partitioning Efficiency

In evaluating the suitability of the retrieval model for use with large document collections, its efficiency in initially partitioning the set of documents to identify a prospectively relevant subset is important. Efficient partitioning will reduce search time and computation time associated with the calculation of relevance weights.
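The model and its relevancy filter can be sketched compactly in Python. The matrices are held as plain lists; the bonus term is taken as m_i = k·x_i^y, which is one reading of the exponential weighting scheme described in the text, and all names are illustrative.

```python
def retrieve(T, C, q, k=1.0):
    """Score documents under model (2), e = [D'sGV] + m, restricted to
    documents passing the relevancy criteria in the text."""
    n_clumps, n_terms = len(T), len(T[0])
    s = [sum(T[i][j] * q[j] for j in range(n_terms)) for i in range(n_clumps)]
    req_clumps = {i for i in range(n_clumps) if s[i] > 0}
    y = sum(q)                                   # number of request terms
    scores = {}
    for d, doc in enumerate(C):
        doc_clumps = {i for i in range(n_clumps)
                      if any(T[i][j] and doc[j] for j in range(n_terms))}
        x = sum(1 for j in range(n_terms) if q[j] and doc[j])  # term matches
        # relevant iff the document shares a key term with the request,
        # or its clump list encloses the request's clump list
        if x == 0 and not req_clumps <= doc_clumps:
            continue
        D_row = [sum(doc[j] * T[i][j] for j in range(n_terms))
                 for i in range(n_clumps)]       # doc's terms in each clump
        crude = sum(D_row[i] * s[i] for i in range(n_clumps))
        G = len(doc_clumps & req_clumps) / len(doc_clumps) if doc_clumps else 0.0
        V = 1.0 / max(1, sum(doc))               # reciprocal of key-term count
        scores[d] = crude * G * V + k * x ** y   # normalized score plus bonus
    return scores
```

Documents failing both relevancy tests are simply omitted from the output, which provides the initial partitioning discussed below.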
In 19 test retrieval requests, the mean number of documents retrieved per request was approximately 84.5 from the clump structure of connection definition 1 (19 clumps), and 110.8 from the clump structure of connection definition 2 (8 clumps). The standard errors are approximately 5.8 and 7.1, respectively. These data suggest that the mean number of documents that would be retrieved per request using clump structure 1, over a large number of requests, would be in the range of about 73 to 96 documents, and using clump structure 2 in the range 97 to 125 documents (at the 95 percent confidence level). Thus, from clump structure 1, we would expect, on the average, an initial partitioning of the set of documents to be of the order of 28 percent to 37 percent of the collection; from clump structure 2 the retrieval algorithm would produce an average initial partition in the range 37 percent to 48 percent of the collection.

These figures illustrate that the partitioning efficiency of the model is directly related to the number of key word clumps available to it in a given collection of documents with a given set of key words. It can be shown that the expected number of clumps to be found in some set S' will probably be greater than in some set S, if S' ⊃ S, since the possible number of clumps is greater in S'. Thus, for a document collection of a given size, it is probable that partitioning efficiency would improve if the set of descriptive key terms were increased.

It should be recognized that the efficiency of the retrieval algorithm, as measured by the number of documents returned as a result of a search, is a function of a number of variables, including (a) the frequency of use of key terms in the documents and (b) the distributional characteristics of terms in the key term clump structure, in addition to the number of key term clumps. The properties of this function are being investigated.
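The interval arithmetic quoted above checks out: a 95 percent interval of the mean plus or minus 1.96 standard errors reproduces the document ranges given, and dividing by the 260-document collection gives the quoted percentages.

```python
def ci(mean, se, z=1.96):
    """Normal-approximation confidence interval for a mean."""
    return mean - z * se, mean + z * se

lo1, hi1 = ci(84.5, 5.8)     # structure 1: about 73 to 96 documents
lo2, hi2 = ci(110.8, 7.1)    # structure 2: about 97 to 125 documents
pct1 = (lo1 / 260 * 100, hi1 / 260 * 100)   # roughly 28 to 37 percent
```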
In general, however, if it is assumed that the initial partitioning ratio is improved by the use of larger key term sets (producing more key term clumps), then the model appears to be adaptable for retrieval in large collections, provided a suitably large set of key terms is used for clumping and a suitably large number of clumps are identified. Further experimentation is planned to permit estimates of initial partitioning ratios attainable in larger collections.

5.2. Comparison of Retrieval Outputs

As noted above, the output lists from clump structure 2 tend to be larger than from clump structure 1. Considering the set of retrieval requests as samples with n = 19 in each structure and testing for a significant difference between the mean length of output lists generated, the null hypothesis is rejected at the 0.01 significance level (t = 2.958, exceeding the critical value of approximately 2.72 with 36 deg of freedom). Thus, there is a significant difference between the mean lengths of the output lists, outputs from structure 1 being significantly shorter.

The relevancy ordering of documents within retrieval outputs was also compared. Output lists from three of the 19 retrieval requests were randomly selected, and relevancy weights computed for retrieved documents by the system were normalized. For each request, documents retrieved from structure 1 were located in the corresponding structure 2 outputs, to produce paired observations of normalized relevance weights. If a document from structure 1 did not appear on the corresponding structure 2 output list, the second member of the pair was assigned a zero value. This procedure provided 260 observations of paired relevance weights. Linear correlation of the variables yielded a correlation coefficient of 0.3448, a rather low value, but nevertheless significant at the 0.01 significance level.
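The significance claim for r = 0.3448 over 260 pairs can likewise be checked with the usual t statistic for a correlation coefficient, t = r√(n−2)/√(1−r²); the value, near 5.9, far exceeds the two-tailed 0.01 critical value of roughly 2.6 at 258 degrees of freedom, consistent with the conclusion in the text.

```python
import math

def t_for_r(r, n):
    """t statistic for testing a sample correlation coefficient against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t = t_for_r(0.3448, 260)   # approximately 5.9
```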
Two conclusions are permissible: (a) Significantly shorter output lists are generated from structure 1. (b) Significant correlations exist between the relevancy orderings generated by the retrieval algorithm in clump structures using different connection definitions defining term associations. The second point is of interest since it indicates that different nearness definitions can produce comparable relevancy orderings (or, alternatively, the association structure generated by one nearness definition will resemble, at least grossly, the associations produced by an alternative definition). A practical consequence of the two conclusions noted above is that it may be desirable to work with a connection definition that yields the most clumps, rather than making an a priori selection of a particular definition as a basis for clumping.

5.3. Validity of the Retrieval Model

We have not, at this stage, undertaken any rigorous validation of the retrieval model, or of the relevancy weighting scheme. However, informal validation of the following type has been undertaken:

(a) Four individuals with general familiarity with the subject fields covered by the set of 260 documents were given four randomly selected retrieval requests and asked to independently prepare lists of documents relevant to the requests by scanning the 260 abstracts and identifying documents on a three-valued relevancy scale ranging from most relevant (1) to possibly relevant (3).

(b) The manually prepared lists for a given request were consolidated and a sublist of documents most relevant to the request was prepared. This sublist comprised documents rated with a value of 1 by at least two of the four individuals, or rated with a value of 1 by one individual and rated 2 by at least two others.

(c) Comparisons of manual and automatic retrievals are given in table 3.

Request 1 asked for documents dealing with language translation.
Request 2 asked for documents dealing with circuitry in analog computers. Request 3 was for documents on simulation. Request 4 called for documents dealing with programming languages.

Table 3. Comparison of manual and automatic retrievals
(for each request: number of most relevant documents identified manually; union set of documents retrieved manually; total documents retrieved by structures 1 and 2; most relevant documents in the upper fourth of the output lists; most relevant in the remainder of the output lists; most relevant not retrieved)

Request 1: 10 identified; union 19; retrieved 104 and 105; upper fourth 7 and 9; remainder 2 and 1; not retrieved 1 and 0.
Request 2: 14 identified; union 63; retrieved 89 and 100; upper fourth 10 and 10; remainder 4 and 4; not retrieved 0 and 0.
Request 3: 15 identified; union 43; retrieved 119 and 94; upper fourth 8 and 11; remainder 6 and 1; not retrieved 1 and 3.
Request 4: 12 identified; union 32; retrieved 115 and 181; upper fourth 8 and 11; remainder 3 and 1; not retrieved 1 and 0.

Using the rule outlined in (b) above, 10 documents most relevant to the first request were identified from a union set of 19 documents retrieved by the four investigators. The retrieval algorithm produced ordered lists of 104 documents using structure 1, and 105 documents using structure 2. In the upper fourth of the output list from structure 1, 7 of the 10 most relevant documents were located, and in the upper fourth of the structure 2 output list, 9 of the 10 most relevant documents. The algorithm failed to retrieve one of the most relevant documents using structure 1, but retrieved all the relevant documents using structure 2.

The table indicates the generally satisfactory performance of the retrieval model and confirms the reasonableness of the definition of relevance used. It also again suggests that the choice of nearness definition as a basis for clumping may not be critical to retrieval performance. In some respects the output from the model is even better than the data suggest. For example, in executing the retrieval request for documents dealing with the use of computers for simulation, the algorithm produced towards the top of its output lists a number of documents covering Monte Carlo processes and the generation and use of random and pseudo-random numbers.
Reference to these documents in response to a general request for information on simulation is quite reasonable, and is an interesting indication of the associative capabilities of the system.

6. Summary

The experiments described above were designed to yield information on the utility of a document retrieval model working with term associations implicit in a system of key term clumps, and the potential performance of such a retrieval model in large collections. The results are suggestive rather than conclusive, but justify further empirical work with larger collections than the one used. The data in table 3 also suggest that efficient retrieval in large collections might utilize user feedback, based on scrutiny of initial system output. Thus, if it is the case that the system's denotations will generally coincide with those of a given user, one retrieval strategy would be to output the upper part of the response list generated in response to the initial request, and take the user's specifications of most relevant items in this subset as a basis for a reordering of the remaining documents.

7. References

[1] Parker-Rhodes, A. F., and Needham, R. M., The Theory of Clumps (Cambridge Language Research Unit, Cambridge, England, 1960).
[2] Needham, R. M., The Theory of Clumps II (Cambridge Language Research Unit, Cambridge, England, 1961).
[3] Research on Information Retrieval, Classification and Grouping, 1957-61 (Cambridge Language Research Unit, Cambridge, England, 1961).
[4] Needham, R. M., A method for using computers in information classification (presented at IFIPC, Munich, 1962, mimeo).
[5] Borko, H., and Bernick, M., Automatic document classification, J. Assoc. Computing Machinery 10, 151-162 (1963).

Statistical Association Methods for Simultaneous Searching of Multiple Document Collections

William Hammond*
Datatrol Corporation, Silver Spring, Md.
A technique is described for using statistical association methods for machine retrieval from a large collection of documents when individual elements of the collection have been indexed by different agencies employing different indexing vocabularies. The objective is to develop a mechanized approach for providing the kind of Government-wide clearinghouse information retrieval service described in the "Crawford Report" [1]¹; or, in the words of the report, "to undertake and coordinate, on demand, appropriate simultaneous searches and service multiple collections." The approach envisions superimposing a common subsumption scheme onto the indexing data of the different agencies; this would inject a significant degree of commonality, and would provide the base, or framework, for deriving equivalent retrieval terms by computer. In actual practice, each agency would tag each report it enters into its system with the common terminology of the scheme. The association profiles of these common terms would serve as points of departure for mechanized searching. Experimentation in this approach with NASA and DDC indexing data is discussed. Examples of term association profiles generated during the experimentation are included.

To condition myself for this program, I turned to my favorite reference work: How to Lie with Statistics [2]. (It is really how to catch a liar, rather than be one.) This book makes reference to the work of Sir Francis Galton, who once said of statistics: "I have a great subject to write upon, but feel keenly my literary incapacity to make it easily intelligible without sacrificing accuracy and thoroughness." Some of us recognize the same literary incapacity a century later. For this reason we welcome the opportunity to discuss our work at such a forum as this in advance of publication.
Our unique contribution in this field, if indeed our contribution is unique, is in the area of computer software, in our application of statistical associative techniques to operating systems, and in particular in our current experimentation with these techniques to achieve compatibility among the large Federal technical information systems. We are currently working with the NASA and DDC files. Our presentations to this Symposium, mine and that of Mark Seidel, are somewhat in the form of progress reports. My paper deals with our efforts to achieve compatibility among different information systems; that is, compatibility of the nature required for integrated announcement and retrieval of Government research reports. Seidel deals with some aspects of the computer software that we have developed for the manipulation of the files of large information systems in the course of our investigations.

In June of 1963, we were asked to undertake a study of the various approaches to the common vocabulary problem of the large Federal technical information agencies. This was one of the many problem areas that had to be resolved for the successful operation of an integrated clearinghouse service. To provide us with expert consultation on the objectives and operations of the various Government agencies involved, an Inter-Agency Vocabulary Study Group was formed under the Operating Committee of COSATI (Committee on Scientific and Technical Information, Federal Council for Science and Technology). This group of consultants was composed of senior personnel from the information facilities of the Department of Defense, Department of Commerce, Atomic Energy Commission, Department of Health, Education, and Welfare, Department of Agriculture, National Aeronautics and Space Administration, and the National Science Foundation.

¹ Figures in brackets indicate the literature references at end of paper.
* Now with ARLES Corporation, McLean, Va. 22101.
The study was accomplished under a National Science Foundation contract, and under the monitorship of the Head, Office of Science Information Service, of the Foundation [3].

We concluded that if the decentralized facilities retain their current mission orientation, a common indexing vocabulary would be essentially a composite of the working vocabularies that the operating agencies currently employ. Assuming such a composite vocabulary were in use, we still could not formulate reliable search patterns for multicollection retrieval solely on the basis of the prescriptive indexing data of any "common thesaurus" of this nature. It is true that where the interests of the different agencies coincide or overlap, their indexing of a common subject is recognizably similar, at least to those familiar with the subject matter. However, where the interests of the different agencies do not coincide, their indexing of common subject matter is dissimilar even if they have common indexing terms available.

Table 1 was compiled from current indexing of the two major information facilities. It is a sampling of the extreme variations in use of a common set of indexing terms by NASA and DDC to index an identical set of 966 research reports. From a review of the data in this table, you can readily appreciate the difficulty in selecting corresponding search terms for the two systems solely on the basis of identical terms appearing in a listing of their indexing vocabularies.

Table 1.
Sampling of variations in DDC-NASA usage of the common terms for indexing an identical set of reports

Term (DDC use, NASA use):
Ablation (10, 15); Absorption (30, 60); Acceleration (4, 20); Air (11, 45); Airborne (7, 19); Aluminum (13, 18); Automation (1, 7); Brightness (3, 6); Calibration (8, 18); Combustion (13, 19); Configuration (8, 22); Connection (5, 17); Cooling (7, 17); Copper (12, 7); Deceleration (4, 20); Deflection (4, 14); Deformation (43, 21); Density (50, 104); Diffraction (5, 17); Distribution (13, 70); Earth (13, 29); Elasticity (35, 26); Emissivity (1, 35); Energy (30, 99); Excitation (15, 30); Functions (22, 50); Glass (17, 8); Graphite (8, 16); Heat (11, 91); Heating (4, 29); Instrumentation (36, 25); Ionization (23, 49); Ions (26, 56); Learning (3, 7); Loading (13, 38); Maps (1, 19); Measurement (99, 67); Microscopes (1, 12); Navigation charts (1, 10); Numbers (2, 25); Optics (15, 36); Oscillation (12, 28); Oxidation (8, 14); Pilots (1, 7); Planets (1, 5); Pressure (45, 108); Propagation (33, 57); Protons (7, 16); Pumps (1, 12); Reliability (25, 17); Resonance (19, 37); Sapphires (1, 3); Skin (1, 14); Sky (2, 8); Spheres (7, 15); Spin (7, 14); Stability (31, 52); Steel (1, 44); Stresses (10, 60); Sun (8, 18); Table (27, 10); Telescopes (3, 22); Temperature (75, 190); Theory (104, 41); Tracking (3, 20); Turbulence (29, 12); Velocity (28, 87); Venus (1, 5); Vibration (30, 50); Viscosity (12, 18); Visibility (6, 8).

We seek to achieve a degree of compatibility that will permit a clearinghouse operation to accept the original abstracting and indexing of the different Federal agencies (at this point in time we are concerned with AEC, NASA, DDC, and OTS) and automatically integrate these different data into announcement publications to meet the varied interests of the national scientific community. The clearinghouse should also be capable of providing effective retrieval of report literature on the basis of original indexing.
One of the significant conclusions resulting from our study for COSATI was that a common subsumption scheme, superimposed on the indexing data of the different agencies by a human intermediary, would inject a significant degree of commonality for integrated announcement — and at the same time would provide a context or framework of "common generic denominators" for identifying equivalent access paths for searching the multiple collections. For the approach to compatibility that we are investigating, we have compiled a list of broad subject headings that subsume the entire subject coverage of the Federal scientific and technical report literature. These broad subject headings — or generic denominators — as we have developed them in our initial effort actually comprise a basic common vocabulary of some 225 terms. Although our experience to date is far from conclusive, the indications are that the current list may be too small. Perhaps our final list will be closer to 300 terms. Much will depend on the consistency — recognizably consistent patterns — that indexers can maintain with an acceptable degree of reliability. It is proposed that each participating agency require its indexers to assign one or more of these broad subject headings to each document processed into its system. In this manner, the subject indexer would be adding the set of common generic denominators that we just referred to, providing points of departure for generating context sets or term profiles of statistically associated terms. These term profiles of the generic denominators, as you will see later, suggest the equivalent access paths for retrieval. We have had many obstacles to overcome in establishing the validity of our concept. Not the least was to design a computer system that would permit economical manipulation of the data for experimentation.
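To make the notion of a term profile concrete, here is a minimal sketch under our own simplifying assumptions (it is not the authors' system): each indexing record is the set of terms assigned to one report, and a profile is simply the co-occurrence count of every term appearing with a given generic denominator. The records below are invented.

```python
# Minimal sketch: a term "profile" as raw co-occurrence counts with a
# generic denominator.  The real system also applies an association
# measure; records here are invented toy data.
from collections import Counter

def term_profile(records, denominator):
    """Co-occurrence counts of all terms appearing with `denominator`."""
    profile = Counter()
    for terms in records:
        if denominator in terms:
            profile.update(t for t in terms if t != denominator)
    return profile

records = [
    {"GUIDANCE", "Navigation", "Spacecraft"},
    {"GUIDANCE", "Navigation", "Gyroscope"},
    {"THERMODYNAMICS", "Enthalpy", "Entropy"},
]

profile = term_profile(records, "GUIDANCE")
print(profile.most_common())
```

Terms that never co-occur with the denominator (here, Enthalpy and Entropy) simply do not enter its profile, which is what confines each profile to the "state of the collection" for that subject.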
We can now generate the statistically associative data and produce the term profiles for either the NASA or DDC system in about two hours on an IBM 7090. We can update the system in a fraction of that time. For our present experimental corpus we have generated the individual term profiles for all 12,000 terms in the NASA machine vocabulary and the 7,000 terms in the DDC thesaurus. Although the NASA subject indexing vocabulary has not been structured into the subsumption scheme of a thesaurus, our generic denominators accommodate the NASA indexing patterns more readily than they do the DDC indexing patterns. We can organize the existing NASA indexing data into our own scheme with a modest computer effort. The existing DDC indexing data will require a good deal of human effort. We have printed out the corresponding term profiles in the DDC and NASA systems for several of our generic denominators. We are now investigating the use of these corresponding profiles from the two systems for selecting the initial search terms for each system. From this point on the search, including associative expansion to formulate the final list of search terms, continues independently in each system. Since the profiles of the generic denominators in fact reflect the "state of each collection" for the given subject, this approach appears to be most promising. We have been able to examine only those subject areas where the indexing data of both systems are already in consonance with our scheme of generic denominators. Some examples are shown in the appendix. Individual profiles in the two systems are shown for NAVIGATION, GUIDANCE, THERMODYNAMICS, and HEAT TRANSFER. We have also shown the terms listed in the DDC Thesaurus of Descriptors under the group THERMODYNAMICS and under the group NAVIGATION and GUIDANCE. At the present time we are using the Stiles Association Factor [4] as a threshold for selecting the associated terms in the profiles.
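For reference, the Stiles association factor can be sketched as follows. This is our paraphrase of the published measure (essentially a log10-scaled chi-square with Yates' continuity correction, computed from document counts); the counts in the example are invented, not taken from the NASA or DDC collections.

```python
# Sketch of the Stiles association factor [4], as we read the published
# formula.  All counts below are invented for illustration.
import math

def stiles(n_ab: int, n_a: int, n_b: int, n: int) -> float:
    """Association factor between terms A and B over n documents.

    n_a, n_b: number of documents indexed by A and by B;
    n_ab: number of documents indexed by both A and B.
    """
    num = n * (abs(n_ab * n - n_a * n_b) - n / 2) ** 2
    den = n_a * n_b * (n - n_a) * (n - n_b)
    return math.log10(num / den)

# A strongly associated pair versus a pair co-occurring at about chance level
strong = stiles(n_ab=40, n_a=200, n_b=150, n=10000)
chance = stiles(n_ab=3, n_a=200, n_b=150, n=10000)
print(round(strong, 2), round(chance, 2))
```

A fixed cutoff on this factor then decides which associated terms are retained in a profile, which is how it serves as the selection threshold mentioned above.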
We also plan to use the statistical associative concept as one of the elements in ordering the output of the computer search. We have currently suspended our experimental work on multisystem searching while we are implementing the full associative search capability for the NASA collection, which by now has grown to 60,000 reports and is increasing by almost 5,000 reports a month. This will provide an ample test bed for future experimentation. We feel that it is important to keep in mind that our discussions concern retrieval of report literature, not retrieval of data or generation of information from data stored in a machine system. Our current emphasis on retrieval of report literature is based on the belief that we are going to have to live for some time to come with the status quo in the indexing and abstracting of the large Federal technical information agencies. Additionally, our actions must be tempered by the vast "information in being" represented by several million reports in the various agency collections. Mechanized information retrieval — that is, retrieval of report literature as it is practiced today — is at best a "gray" affair. It involves the interplay of many models of human endeavor throughout the information transfer chain — from the recorder to the information handler to the ultimate user. The objective of retrieval under the current modus operandi is to satisfy the needs of the user without requiring him to review an undue amount of nonessential bibliographic data to select pertinent reports. In any given instance, it is unlikely that the information handler will know how well-informed the user may be, and what is nonessential. One realistic compromise that we are striving to attain through statistical associative techniques is to provide a high recall ratio and to list a probable order of relevance for the reports cited.
When we consider the human indexing model — as yet not clearly defined — together with information retrieval practices of the operating agencies, it is difficult to provide a firm measure of effectiveness of any approach to retrieval, particularly multicollection retrieval. There are many elements, however, that are measurable. We can evaluate parallel operations on the basis of time and cost factors, and the usefulness of the output. Another important factor is the optimum use of the human resources that are available to perform the intellectual tasks required to support the system. These factors, together with the vast "information in being" that we referred to earlier, were the basis for our initial experimentation with statistical associative properties of indexing data. Our current efforts are motivated by the positive results of our experimentation over the past two years.

References

[1] Scientific and Technological Communication in Government, AD 299 545 (Apr. 1962).
[2] Huff, D., How to Lie with Statistics (W. W. Norton and Co., New York, 1954).
[3] Hammond, W., and S. Rosenborg, Common approaches for Government scientific and technical information systems, Tech. Rept. IR-10 (AD-430,000) (Datatrol Corporation, Silver Spring, Md., Dec. 1963).
[4] Stiles, H. E., The association factor in information retrieval, J. Assoc. Comp. Mach. 8, 271 (1961).
Appendix

TERM PROFILES

Each profile entry lists three figures: total usage frequency of the associated term, total co-occurrence with the parent term, and association factor (× 100). The total usage frequency of the parent term is given with each profile heading.

NASA 442 GUIDANCE
13 7 529 Aboard
57 16 550 Abort
130 30 594 Apollo*Project
60 16 544 Autopilot
1101 70 513 Computer
2059 141 599 Control*/Noun/
824 71 560 Controls, *Control*Systems
126 21 518 Gyroscope
285 52 623 Inertia
410 47 556 Landing*/Noun/
489 54 565 Launch*/Noun/
119 25 564 Maneuver
61 14 513 Matching
49 33 720 Midcourse
640 55 533 Missile*/Noun/
340 40 542 Mission
356 132 798 Navigation
933 80 572 Orbit*/Noun/
22 8 502 Pershing*Missile
61 14 513 Platform
838 63 528 Propulsion*/Noun/
800 55 500 Reentry
163 35 602 Rendezvous
8 6 546 Sextant
52 21 619 Space*Navigation
979 107 635 Spacecraft*/Noun/
15 7 514 Spacecraft*Navigation
2681 139 552 System
302 34 520 Target
51 14 533 Telecommunications
86 22 573 Terminal
30 10 518 Tracker
545 50 532 Tracking*/Noun/
761 115 683 Trajectory*/Noun/

NASA 356 NAVIGATION
39 18 639 Aid
104 20 555 Air*Traffic
42 23 683 Airspace
130 30 618 Apollo*Project
21 7 500 Avoidance
15 7 536 Circumlunar
665 47 520 Communication
18 9 572 Compass
1101 73 557 Computer
16 8 559 Doppler*Navigation
959 58 519 Flight*/Noun/
442 132 798 Guidance*/Noun/
10 7 579 Gyrocompass
126 23 564 Gyroscope
285 34 554 Inertia
119 17 503 Maneuver
49 17 603 Midcourse
340 31 511 Mission
933 54 505 Orbit*/Noun/
61 12 503 Platform
83 57 800 Proportion
838 63 559 Propulsion*/Noun/
163 21 513 Rendezvous
8 5 527 Self-Contained
8 6 568 Sextant
52 27 694 Space*Navigation
979 64 541 Spacecraft*/Noun/
15 7 536 Spacecraft*Navigation
2681 112 530 System
30 11 562 Tracker
545 45 536 Tracking*/Noun/
761 48 505 Trajectory*/Noun/

DDC 403 GUIDANCE
250 28 628 Astronautics
136 21 632 Automatic Pilots
207 16 527 Booster Motors
14 9 689 Celestial Guidance
125 18 608 Command & Control Systems
762 34 541 Communication Systems
670 32 543 Control
1564 120 735 Control Systems
69 14 618 Doppler Navigation
1188 58 607 Errors
345 29 599 Flight Paths
32 11 647 Guided Missile Computers
285 31 635 Guided Missile Trajectories
2476 156 740 Guided Missiles
242 23 589 Gyroscopes
42 16 698 Homing Devices
115 39 779 Inertial Guidance
80 12 569 Inertial Navigation
44 7 515 Interception
242 25 607 Landings
81 11 549 Launching Sites
12 7 650 Light Homing
228 28 638 Lunar Probes
172 26 652 Manned
348 21 527 Moon
298 36 662 Navigation
79 15 618 Navigation Computers
861 85 728 Orbital Trajectories
137 17 586 Planets
277 23 574 Propulsion
12 6 616 Radar Homing
523 26 526 Reentry Vehicles
59 22 729 Rendezvous Spacecraft
18 5 534 Retro Rockets
1782 67 589 Satellites (Artificial)
800 74 707 Space Flight
170 56 813 Space Navigation
303 27 598 Space Probes
901 82 715 Spacecraft
99 10 506 Stabilization Systems
23 8 612 Star Trackers
1025 67 657 Surface to Surface
4 4 637 Terminal Guidance

DDC 298 NAVIGATION
203 18 588 Air Traffic Control Systems
1156 34 526 Airborne
218 19 592 Airplane Landings
57 11 618 All-Weather Aviation
136 17 619 Automatic Pilots
53 14 677 Beacon Lights
25 5 531 Bombing
50 10 611 Buoys
50 7 533 Celestial Navigation
39 12 676 Compasses
7 3 543 Course Indicators
204 13 517 Direction Finding
643 33 591 Display Systems
69 9 555 Doppler Navigation
179 17 590 Flight Instruments
345 19 541 Flight Paths
970 28 503 Flight Testing
36 7 568 Fog Signals
48 6 503 Glide Path Systems
57 17 710 Ground Controlled Approach Radar
9 3 517 Ground Position Indicators
403 36 662 Guidance
242 16 543 Gyroscopes
27 12 713 Hyperbolic Navigation
80 14 634 Inertial Navigation
44 8 576 Instrument Flight
81 19 697 Instrument Landings
164 55 844 Lighthouses
22 6 585 Loran
11 5 615 Loran Equipment
10 3 506 Low Altitude
8 3 529 Navigation Charts
79 20 710 Navigation Computers
69 14 649 Navigational Lights
247 21 600 Position Finding
161 15 574 Radar Beacons
9 3 517 Radar Bombing
795 32 559 Radar Equipment
116 31 762 Radar Navigation
58 8 547 Radar Reflectors
65 10 584 Radio Beacons
458 32 623 Radio Equipment
135 34 765 Radio Navigation
187 13 527 Shipborne
199 24 652 Ships
170 16 582 Space Navigation
788 63 707 Symposia
9 4 585 Terrain Avoidance
337 17 519 Transport Planes

DDC THESAURUS GROUP 106 NAVIGATION AND GUIDANCE
ALL-INERTIAL GUIDANCE, AUTOMATIC NAVIGATORS, AUTOMATIC PILOTS, AZIMUTH, CELESTIAL GUIDANCE, CELESTIAL NAVIGATION, CIRCULAR ERROR PROBABILITY, CONTROL SIMULATORS, DEPTH FINDING, DEPTH INDICATORS, DIRECTION FINDING, DIRECTION FINDING SIGNALS, DOPPLER NAVIGATION, GLIDE PATH SYSTEMS, GUIDANCE, HEAT HOMING, HOMING DEVICES, HYPERBOLIC NAVIGATION, IMPACT PREDICTORS, INERTIAL GUIDANCE, INERTIAL NAVIGATION, INJECTION GUIDANCE, LIGHT HOMING, LORAN, LORAN EQUIPMENT, MAGNETIC GUIDANCE, MAGNETIC NAVIGATION, NAVIGATION, PRESET GUIDANCE, PROPORTIONAL NAVIGATION, RADAR HOMING, RADAR NAVIGATION, RADIO HOMING, RADIO NAVIGATION, RENDEZVOUS GUIDANCE, SHORAN, SPACE NAVIGATION, STABILIZED PLATFORMS, STAR TRACKERS, STELLAR MAP MATCHING, TELEVISION GUIDANCE, TERMINAL GUIDANCE SYSTEMS, TERRAIN AVOIDANCE, VIDEO MAP MATCHING, WIRE GUIDANCE

DDC THERMODYNAMICS
525 49 507 Aerodynamic Heating
722 77 573 Air
145 24 511 Beryllium Compounds
49 15 534 Boiling
421 50 544 Boron Compounds
146 39 619 Calorimeters
120 44 667 Chemical Equilibrium
1598 145 615 Chemical Reactions
1007 127 647 Combustion
114 30 590 Combustion Chamber Gases
260 79 705 Dissociation
1046 92 563 Energy
196 111 808 Enthalpy
182 107 808 Entropy
131 44 657 Equations of State
71 21 566 Eutectics
343 39 512 Exhaust Gases
10 6 505 Film Boiling
329 40 524 Flames
476 49 522 Fluid Mechanics
1208 139 644 Gas Flow
618 58 526 Gas Ionization
1673 246 735 Gases
366 55 585 Heat
123 65 746 Heat of Formation
18 11 576 Heat of Fusion
48 23 629 Heat of Reaction
9 6 516 Heat of Solution
24 15 612 Heat of Sublimation
1935 281 747 Heat Transfer
1820 169 634 High Temperature Research
1037 100 586 Hydrogen
451 47 519 Hypersonic Characteristics
437 55 561 Hypersonic Flow
45 14 528 Hypersonic Nozzles
16 9 545 Irreversible Processes
169 34 571 Liquid Metals
514 59 556 Liquids
292 35 508 Lithium Compounds
276 33 502 Mass Spectroscopy
340 48 563 Mixtures
25 13 577 Nucleate Boiling
1964 128 549 Oxides
1158 89 539 Oxygen
688 83 598 Phase Studies
1839 117 535 Physical Properties
3041 152 515 Pressure
49 14 519 Propellant Properties
474 83 646 Reaction Kinetics
201 34 550 Recombination Reactions
874 74 535 Refractory Materials
155 34 582 Rocket Propellants
1242 87 521 Shock Waves
596 53 508 Solid Rocket Propellants
850 89 586 Solids
170 27 518 Solubility
249 106 773 Specific Heat
130 26 543 Specific Impulse
5195 235 540 Temperature
5051 217 521 Theory
450 86 660 Thermal Conductivity
134 29 563 Thermal Diffusion
259 116 787 Thermochemistry
322 67 645 Transport Properties
145 51 677 Vapor Pressure
236 51 621 Vaporization
379 55 580 Vapors
222 40 575 Zirconium Compounds

NASA 498 THERMODYNAMICS
86 18 515 Calorimetry
607 53 514 Combustion
268 43 574 Dissociation
24 12 569 Effusion
198 57 672 Enthalpy
52 21 606 Entrance
148 67 738 Entropy
81 22 566 Envelope
488 91 670 Equilibrium
41 22 641 Free*Energy
1811 132 583 Gas*/Noun/
1214 106 587 Heat*/Noun/
65 24 610 Heat*Capacity
18 9 537 Heat*Content
1036 80 538 High*Temperature
1015 93 580 Property
326 39 526 Specific
2834 155 552 Temperature*/Noun/
70 16 513 Vapor*Pressure, *Tension
119 27 567 Vaporization

DDC THESAURUS GROUP 157 THERMODYNAMICS
EQUATIONS OF STATE, ENTROPY, ENTHALPY, HEAT, HEAT OF ACTIVATION, HEAT OF FORMATION, HEAT OF REACTION, HEAT OF SOLUTION, HEAT OF SUBLIMATION, HEAT TRANSFER, JOULE-THOMSON EFFECT, SPECIFIC HEAT, THERMODYNAMICS

DDC 1935 HEAT TRANSFER
264 92 702 Ablation
1578 119 511 Aerodynamic Characteristics
519 57 502 Aerodynamic Configurations
525 226 817 Aerodynamic Heating
722 77 528 Air
541 103 640 Atmosphere Entry
291 79 658 Blunt Bodies
331 54 554 Bodies of Revolution
49 26 620 Boiling
783 209 754 Boundary Layer
1007 90 514 Combustion
51 576 Compressible Flow
60 568 Conical Bodies
215 119 780 Convection
21 13 563 Cook-off
128 64 706 Coolants
591 196 773 Cooling
900 115 597 Cylindrical Bodies
196 56 629 Enthalpy
10 9 563 Film Boiling
27 15 567 Film Cooling
20 10 511 Flat Plate Models
975 142 637 Fluid Flow
476 79 595 Fluid Mechanics
208 34 506 Fluids
444 66 562 Friction
1208 244 736 Gas Flow
1673 178 613 Gases
366 55 544 Heat
226 118 773 Heat Exchangers
502 66 544 Heating
62 17 500 Hemispherical Shells
1820 132 514 High Temperature Research
451 109 676 Hypersonic Characteristics
437 126 712 Hypersonic Flow
217 43 556 Hypersonic Wind Tunnels
300 68 620 Hypervelocity Vehicles
429 169 778 Laminar Boundary Layer
26 13 540 Liquid Cooled
169 47 607 Liquid Metals
514 65 537 Liquids
168 31 513 Mach Number
7383 369 534 Mathematical Analysis
167 39 567 Nose Cones
25 18 615 Nucleate Boiling
214 43 558 Pipes
3041 222 570 Pressure
31 15 552 Radiators
28 12 514 Reactor Coolants
523 85 600 Reentry Vehicles
179 38 552 Reynolds Number
414 61 552 Rocket Motor Nozzles
950 83 502 Rocket Motors
428 53 513 Shock Tubes

NASA 1100 HEAT TRANSFER
237 56 550 Ablation
175 58 598 Aerodynamic*Heating
109 53 635 Boiling
666 201 714 Boundary*Layer
261 52 518 Conduction
252 120 716 Convection
450 109 622 Cooling*/Noun/
198 53 561 Enthalpy
304 61 535 Flatness, *Flat
2537 350 659 Flow*/Noun/
615 86 513 Fluid*/Noun/
19 12 509 Free*Convection
1811 170 505 Gas*/Noun/
1214 334 754 Heat*/Noun/
98 41 591 Heat*Flux
127 116 786 Heat*Test
481 99 589 Heating, *Heated
691 138 619 Hypersonics
392 127 675 Laminar
626 85 507 Layer
86 42 612 Mass Transfer
799 98 503 Nozzle*/Noun/
22 17 570 Nucleate
23 17 565 Nusselt*Number
651 89 513 Plate
594 83 509 Point*/Noun/
64 34 600 Prandtl*Number
43 26 587 Radiative
267 67 576 Reynolds*Number
240 48 510 Skin
295 107 672 Stagnation
2834 261 547 Temperature*/Noun/
140 62 640 Temperature*Distribution
34 17 520 Temperature*Profile
1255 154 550 Thermal*/See*Also*Thermo-, Heat/
187 40 501 Thermocouple
677 309 808 Transfer/Noun/
362 82 583 Turbulent
448 70 511 Viscosity
419 82 562 Wall*/Noun/
50 34 628 Wall*Temperature

Studies on the Reliability and Validity of Factor-Analytically Derived Classification Categories¹

Harold Borko
System Development Corporation
Santa Monica, Calif. 90406

A series of experiments has been conducted in order to determine whether a factor-analytically derived classification system is reliable and valid. In a previous experiment, 10 classification categories were derived by factor analyzing 618 abstracts of psychological reports. Two new samples of psychological abstracts, numbering 659 and 338 respectively, were factor analyzed. The three independently derived classification schedules were compared and found to be quite similar. It was concluded that factor-analytically derived classification categories are reliable in that the factors remain essentially stable from sample to sample. The categories are also valid in that they are descriptive of the main divisions of the psychological literature.

1. Introduction and Purpose

One aspect of documentation research is concerned with deriving a mathematical theory of classification that will provide a basis for dividing a collection of documents into major subject categories. A number of mathematical techniques for deriving classification systems have been suggested. These include factor analysis [1],² clump theory [2, 3, 4], latent-structure analysis [5], and discrimination analysis [6]. At the System Development Corporation, with support from the National Science Foundation, we are continuing to investigate the application of factor analysis to the problems of document classification with the aim of determining whether a factor-analytically derived classification system is (a) reliable — in the sense that successive samples from a given data base will yield the same factors, and (b) valid — in the sense of being descriptive of the content of the documents. 2.
Determining Reliability

A classification schedule is said to be reliable if the categories, which were derived on the basis of one sample of documents, are equally descriptive of other samples taken from the same population. One of the claims made for mathematically derived classification systems is that the categories so derived are descriptive of the documents used in the analysis. However, if the categories prove to be so unique that they describe only the one document set and no other, they would be of little value. In order to determine the stability, or reliability, of factor-analytically derived classification categories, a series of experiments was conducted using three different samples of documents selected from the psychological literature.

3. Results of Previous Study

In the 1961 experiment by Borko [1], 618 abstracts of psychological reports were selected from the publication Psychological Abstracts, vol. 32, number 1, 1958. These abstracts were keypunched, analyzed by means of the FEAT program [7], and 90 high-frequency clue words, called "tag terms", were selected. The 90 words and the 618 abstracts were arranged in the form of a data matrix and correlation coefficients based upon the co-occurrence of the words were computed. The resultant 90 × 90 correlation matrix was factor analyzed [8], and the 10 factors extracted were interpreted as classification categories. A report of this study has been published previously.

¹ This document was produced in connection with a research project cosponsored by SDC's independent research program and a grant from the National Science Foundation.
² Figures in brackets indicate the literature references at end of paper.

4. Selection of Sample

To establish the proposition that a factor-analytically derived classification system is reliable and does not vary from sample to sample, it is necessary to repeat the factor analysis using a new collection of abstracts.
Approximately 1,000 abstracts of psychological reports were selected from Psychological Abstracts, vol. 35, number 1, 1961. Abstracts vary in length and in style. To insure that the sample would be relatively uniform and the selection unbiased, only abstracts between one and two inches in length were included in the study. This reduced the number from 1,430 abstracts contained in that issue to 997. Next, the collection was divided into two groups by selecting approximately every third abstract. The first group, consisting of 659 abstracts, was labeled the experiment group; the second, consisting of 338 abstracts, was called the validation group. An independent factor analysis was performed on each group, thus providing an additional check on the reliability of the resulting factors.

5. Selection of Tag Terms

All 997 abstracts were keypunched for computer processing by means of the FEAT program, which prepared a listing, by frequency of occurrence, of all words appearing in the text. Function words and other common words were excluded. One hundred and fifty tag terms were chosen by the investigators from this list of frequently occurring words. Appropriate words with the same root were combined manually. In the previous study, 90 tag terms were used, but since then the capacity of the factor-analysis program has been expanded, and it is now able to handle a larger matrix. The 150 tag terms are listed in table 1. The words marked by an asterisk are also on the list of 90 words used in the previous study. Only 16 words from this original list do not appear on the present list of 150 terms.

6. Data Matrix, Document-Term

Having selected the terms, it was necessary to determine which documents (i.e., abstracts) contained each of these words. This information was recorded in the form of a matrix; the columns show the 150 terms, and the rows indicate the documents.
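The preparation steps of sections 5 and 6 can be sketched in a few lines. This is our own reconstruction, not the actual FEAT program; the sample abstracts and the function-word list are invented for illustration.

```python
# Sketch (not the actual FEAT program) of two preparation steps:
# a frequency listing with function words excluded, and a
# document-term count matrix.
from collections import Counter

FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "was", "were", "is"}

def frequency_listing(abstracts):
    """All content words across the abstracts, ordered by frequency."""
    counts = Counter()
    for text in abstracts:
        counts.update(w for w in text.lower().split() if w not in FUNCTION_WORDS)
    return counts.most_common()

def document_term_matrix(abstracts, terms):
    """Rows: documents; columns: tag terms; cells: occurrence counts."""
    return [[text.lower().split().count(t) for t in terms] for text in abstracts]

# Toy abstracts, invented for illustration
abstracts = [
    "learning and response in the rat",
    "response to stimulus in learning experiments",
    "the behavior of the rat",
]
listing = frequency_listing(abstracts)
tags = [w for w, _ in listing[:4]]      # most frequent words become tag terms
matrix = document_term_matrix(abstracts, tags)
print(tags)
print(matrix)
```

In the study this matrix was punched onto two 80-column cards per document, with one card column reserved per term; the list structure above plays the same role.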
Each document is an abstract selected for analysis in this study.³ A small portion of this matrix is illustrated in table 2. Two such matrices were prepared, one for the 659 documents in the experimental group and the other for the 338 documents in the validation group. A computer program prepared the document-term matrix in a form suitable for input to the factor-analysis program. Since the data consisted of 150 terms, two 80-column cards were produced for each of the documents. Every term was assigned a unique column on the cards, and the number of times a word occurred in the document was punched in the proper column.

³ The writer prefers to use "tag term" rather than key words or index terms to describe the automatic assignment of labels to documents. The words assigned are tags by which a document can be identified and compared with other documents. The tag terms do not necessarily describe the basic contents of the document nor are they true index terms; they are, to repeat, simply tags.

TABLE 1. Tag terms.

*1. ability  2. academic  *3. achievement  4. action  *5. activity  6. adaptation  7. adjustment  8. administered  9. adults  *10. analysis  *11. animals  *12. anxiety  *13. attitude  14. auditory  15. average  *16. behavior  17. baby  *18. boys  *19. brain  *20. case  *21. child  *22. clinical  *23. college  24. color  25. communication  *26. community  *27. concept  28. conditioning  *29. correlation  30. cortex  *31. data  32. delinquency  33. dependent  *34. development  35. discrimination  36. dogs  *37. education  *38. emotion  39. employed  40. error  *41. experiment  42. eye  *43. factor  44. failure  *45. family  46. feeling  *47. field  48. fond  *49. frequency  50. frontal  *51. function  52. grade  *53. group  54. hand  55. health  56. hearing  57. hospital  58. hypnosis  59. hypothesis  60. image  61. independent  *62. information  *63. intelligence  64. intensity  65. interaction  66. interest  67. I.Q.  *68. knowledge  69. language  *70. learning  *71. level  *72. life  *73. light  74. male  *75. man  76. medical  *77. mental  78. monkeys  79. motivation  80. motor  *81. nature  82. negative  83. nervous  84. noise  *85. normal  *86. organization  *87. patient  88. people  *89. perception  *90. performance  *91. personal  *92. personality  *93. personnel  94. physical  95. population  96. probability  *97. problem  *98. procedure  *99. program  *100. psychiatric  *101. psychological  102. questionnaire  103. rat  104. rate  105. reaction  *106. reading  107. reflex  *108. reinforcement  *109. research  *110. response  111. retarded  *112. role  *113. scale  *114. school

Table 2. A portion of the data (document-term) matrix. (Rows: documents, labeled by Doc #; columns: tag terms.)

7. Correlation Matrix, Term-Term

Correlation coefficients between pairs of tag terms were computed from the document-term matrix by the product-moment formula

r = [NΣXY − (ΣX)(ΣY)] / (√[NΣX² − (ΣX)²] · √[NΣY² − (ΣY)²])

where N = number of documents and X, Y = terms being correlated.

Table 4. A portion of the correlation (term-term) matrix. (Columns are in the same order as the rows: 16. Behavior, 41. Experiment, 53. Group, 70. Learning, 110. Response, 125. Stimulus, 136. Test.)

16. Behavior     .2702   .0359   .0096   .0454   .0112  -.0353  -.0818
41. Experiment   .0359   .1930   .1190   .0424   .1594   .0432   .0297
53. Group        .0096   .1190   .2835   .0746   .0070  -.0161   .0981
70. Learning     .0454   .0424   .0746   .2958   .0549   .0714   .0524
110. Response    .0112   .1594   .0070   .0549   .2489   .2265  -.0220
125. Stimulus   -.0353   .0432  -.0161   .0714   .2265   .3334  -.0422
136. Test       -.0818   .0297   .0981   .0524  -.0220  -.0422   .3141

8. Factor Analysis

By means of factor analysis, the information contained in the 150 × 150 correlation matrix is compressed into a smaller matrix with fewer columns. Obviously, as a result of this compression, some information contained in the original matrix is lost. Information must always be lost as we go from the specific to the general — as we go from specific data about collies, terriers, and poodles to the single concept "dogs" — or more appropriately as we go from a series of papers dealing with the causes and treatment for hysteria and schizophrenia to the single classification category labeled "etiology and treatment of mental disorders."
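The numerical core of this procedure, product-moment correlation of term occurrence vectors followed by factor extraction, can be sketched as follows. Power iteration here is only a stand-in for the production factor-analysis program, which extracts several factors and rotates them orthogonally; all data are toy values.

```python
# Sketch of the correlation and factor-extraction steps with toy data.
import math

def pearson(x, y):
    """Product-moment correlation of two equal-length count vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))

# Occurrence counts of two hypothetical tag terms in six toy documents
behavior = [1, 0, 2, 0, 1, 0]
response = [1, 0, 1, 0, 2, 0]
r = pearson(behavior, response)

def power_iteration(m, iters=200):
    """Dominant eigenvalue and eigenvector of a small symmetric matrix."""
    v = [1.0] * len(m)
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in m]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(row[j] * v[j] for j in range(len(v))) for row in m]
    eigval = sum(a * b for a, b in zip(mv, v))
    return eigval, v

# Toy 3 x 3 term correlation matrix (not the study's 150 x 150 data)
R = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
val, vec = power_iteration(R)
print(round(r, 4), round(val, 3))
```

Terms with large loadings on the extracted vector would then be read off and interpreted as a classification category, just as described in the text.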
Factor analysis is a mathematical technique designed to reduce the matrix to a small number of eigenvectors accounting for a large proportion of the total variance. There is always some question as to when enough factors have been extracted. In this case, in order to maintain consistency with the previous study, 10 factors were extracted and rotated orthogonally before interpretation. One factor was bipolar and so was interpreted as representing two classification categories. Two factor analyses were computed — one using the 659 documents in the experimental group and the second using the 338 documents in the validation group. These, plus the 1961 study, provide three derived classification schedules for psychological literature.

9. Comparison

In interpreting the stability of the factor-derived classification categories, the three sets of factors will now be compared. All three are based upon different samples of documents as recorded in Psychological Abstracts, 1958 and 1961. Furthermore, in the earlier experiment only 90 tag terms were used, as compared with 150 in the current study. Nevertheless, it is hypothesized that the factors will be relatively stable from sample to sample and regardless of difference in the tag terms used for analysis. Is this the case? Let us examine in detail the factors from each study that are labeled "academic achievement." For convenience, the words with significant loadings on each of these factors are listed side by side in table 5. In the 1961 study, the words with the highest loadings on this factor are girls, and boys. While boys was used as a tag term in the present study, girls was not. However, the word with the highest loading for both the other groups is student. This carries substantially the same meaning as girls and boys. School and achievement appeared with high loadings on all three sets of factors.

Table 5. Words with significant loadings on academic achievement factor.
1961 study: girls, boys, school, achievement, reading
Current study, experimental group: student, achievement, test, school, grade, college, administered, independent, program, knowledge, correlation, medical, scale
Current study, validation group: student, achievement, college, ability, school, grade, test, average, academic, motivation, science

Reading was a legitimate word, but it did not appear in the current study; however, the other words on the two current lists are clearly related to "academic achievement." Based upon this analysis, we conclude that all three studies contain a factor which could be properly labeled "academic achievement." In other words, this factor is stable and reliable. As a second example, let us examine the factors dealing with "physiological psychology" (table 6). These are not nearly as similar as was "academic achievement," and the interpretation had to be stretched on a Procrustean bed to achieve some degree of commonality. The three lists in table 6 have very few words in common, and yet there is a unifying theme dealing with the structure and function of the central nervous system. The words cerebral, cortex, frontal, temporal are all related to the brain.

Table 6. Words with significant loadings on central nervous system factor.
(1961 study; current study: experimental group, validation group)
emotional development cerebral child (children) theory life nature factor(s) animals activity frontal cortex field behavior nervous perception color communication field structure analysis temporal conditioning

Research in this area has many facets. Some studies are concerned with the development of the cerebral cortex in children and its psychological concomitants. Extirpation experiments on animals are designed to study behavior as a means of determining localized brain activity. In the case of humans with structural brain damage, one is concerned with functional loss, such as perception and communication, and the possibilities of conditioning and retraining.
Consequently, in spite of the fact that the words are different, all three factors refer to a single broad category of research papers and so are given a common interpretation. Finally, let us examine the factor named "etiology and treatment of mental disorders" (table 7). Clearly the words in the two groups of the current study are quite similar. There is also considerable agreement with the 1961 study; however, the 1961 study had an additional factor called "therapy — case studies," which did not appear as a separate factor in the current analysis. A possible reason is that the older data contained significantly more reports of therapy cases than did the more recent sample of literature. At any rate, the net effect is that two factors under the general heading of "clinical and abnormal psychology" were compressed into one. Nevertheless, it is reasonable to conclude that this factor configuration is relatively stable. Let us now take a more global view of all three factor-analytic studies and compare them for similarity (table 8). Under the major heading of "educational psychology," we see a factor in each analysis labeled "academic achievement." Similarly each analysis has a factor dealing with "physiological psychology" and the slight differences among these factors were discussed. Next, under "clinical and abnormal psychology," we note that the two original factors on this topic were compressed into one. In "experimental psychology" the opposite situation occurred. The 1961 study was based upon a relatively limited literature in this area — an accident of sampling — and as a result only one factor emerged. In the present study — again as a vagary of sampling — there was a large amount of experiment literature and five separate and distinct factors were derived. This change reflects the heavier concentration of experimental papers in the more recent psychological literature.
At the same time, we lost the special category of "clinical case studies" and combined this group of documents with the more general class of "clinical and abnormal psychology." Two factors in the 1961 analysis did not appear at all in the present study. These are Factor 4, "studies of college students," which was known to be a poorly defined factor, and Factor 8, "general psychology." This latter factor probably deserves a place in the classification system. The documents which could reasonably be classified under "general psychology" were probably divided among the various experimental categories.

Table 7. Words with significant loadings on etiology and treatment of mental disorders factors.

Current study, experimental group: patient, hospital, therapy, treatment, medical, group, mental, psychiatric, program, community
Current study, validation group: patient, hospital, treatment, psychiatric, community, techniques, attitude, therapy, population, emotion, women
1961 study (clinical psychology and psychotherapy): treatment, psychiatric, clinical, psychotherapy, case(s), schizophrenia, therapy, group(s), psychoanalysis, counseling
1961 study (therapy: case studies): personal, case(s), therapy, level

The obtained results help reveal both the strengths and weaknesses of the factor-analysis technique for deriving classification categories. The factors which emerge from the analysis are closely related to the data used in the study. To the extent that the data base is an adequate sample of the total document collection, the factor-derived categories will represent the entire collection. To the extent that the sample is only partially representative, the factors will be only partially representative of the total collection, though still adequately representative of the sample on which they are based. The reasonableness, or validity, of the factor-analytically derived classification categories can be determined by comparing the derived classification schedule with the classification system used by the American Psychological Association (APA).
As is to be expected, the factor-analytically derived categories are fewer in number and more general in character. Many fine distinctions are lost, as, for example, the distinction between "human experimental psychology" and "animal psychology." Nevertheless, most of the major headings do appear, as do some of the important subdivisions.

Table 8. Comparison of factor names.

Factors derived from the current experiment (experimental group factor #; validation group factor #):

Experimental psychology
  Conditioning (2; 1)
  Learning and reinforcement (8A; 2)
  Feelings, emotion, and motivation (5; 10A)
  Vision and the special senses (9; 5)
  Speech and hearing (10; 8)
Physiological psychology
  Central nervous system (6; 9)
Social psychology
  Community resources (8B; 6)
Clinical and abnormal psychology
  Etiology and treatment of mental disorders (4; 4)
Educational psychology
  Academic achievement (1; 3)
  Interest and ability testing (3; 10B)
  Special problems (7; 7)

Factors derived from the 1961 experiment:
  1. Academic achievement
  2. Perception and learning
  3. Community organization
  4. Studies of college students
  5. School guidance and counseling
  6. Clinical psychology and therapy
  7. Educational measurement
  8. General psychology
  9. Developmental psychology
  10. Therapy: case studies

It is thus reasonable to conclude that the factor-analysis technique has uncovered the most important dimensions, or trends, in published psychological research literature. On the basis of the above analyses, it is concluded that factor-analytically derived classification categories, based upon representative samples of the total document collection, are reasonably reliable and descriptive. However, because of the difficulty of obtaining a truly representative sample of a document collection, more than one factor analysis should be made to attain a stable constellation of factors.
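The paper's conclusion, that factor-derived categories can drive automatic reclassification, can be illustrated with a small sketch. The paper does not give its factor-score prediction equation, so the sketch below assumes the simplest linear form: a document's term counts are projected onto each factor's loading vector, and the document is filed under the highest-scoring category. The loadings are a toy subset loosely drawn from Appendix I; the function names and sample document are my own.

```python
# Hedged sketch of factor-score classification (assumed linear form,
# not the paper's exact equation). Loadings are an illustrative subset.

factors = {
    "academic achievement": {"girls": .74, "boys": .73, "school": .30,
                             "achievement": .20, "reading": .18},
    "community organization": {"organization": .67, "community": .54,
                               "structure": .38, "workers": .22},
}

def factor_scores(term_counts):
    """Score a document (term -> count) against every factor."""
    return {name: sum(count * loadings.get(term, 0.0)
                      for term, count in term_counts.items())
            for name, loadings in factors.items()}

def classify(term_counts):
    """File the document under its highest-scoring category."""
    scores = factor_scores(term_counts)
    return max(scores, key=scores.get)

doc = {"school": 3, "achievement": 2, "community": 1}
print(classify(doc))   # -> academic achievement
```

In the scheme the paper describes, such scores would be computed for every document on each rerun of the analysis, and the sorted output would become the new set of file cards.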
By repeating the analysis every year or so and adding the new accumulations to the data base, changes in the character of the collection can be identified quickly and automatically, and a revised classification schedule created. Obviously, a change in classification categories without a concomitant reclassification of all the documents in the collection would be worse than useless. The documents will all have to be reclassified, and while this is normally a chore, it can be accomplished automatically by using a factor-score prediction equation. In actual practice, the physical documents will be stored by accession number, and the reclassification will consist of a new set of properly arranged file cards, which will be printed as an output of the computer processing routines. Used in this manner, factor-analytically derived classification categories provide the flexibility and responsiveness to change that are needed in scientific documentation and provide a basis for an automated document storage and retrieval system.

9. References

[1] Borko, H., The construction of an empirically based mathematically derived classification system, Proc. Spring Joint Computer Conf. 21, 279-289 (1962).
[2] Needham, R. M., The theory of clumps II, Cambridge Language Research Unit, M.L. 139 (Cambridge, England, 1961).
[3] Needham, R. M., Research on information retrieval, classification and grouping 1957-61, Cambridge Language Research Unit, M.L. 149 (Cambridge, England, 1961).
[4] Parker-Rhodes, A. F., Contributions to the theory of clumps, Cambridge Language Research Unit, M.L. 138 (Cambridge, England, 1961).
[5] Baker, F. B., Information retrieval based upon latent class analysis, J. Assoc. Comp. Mach. 9, No. 4, 512-521 (1962).
[6] Williams, J. H., Jr., A discrimination method for automatically classifying documents, Proc. Fall Joint Computer Conf. 24, 161-167 (1963).
[7] Olney, J.
C., FEAT, an inventory program for information retrieval, FN-4018 (System Development Corporation, Santa Monica, Calif., 1960).
[8] Harman, H. H., Modern Factor Analysis (Univ. of Chicago Press, Chicago, Ill., 1960).

10. Appendix I. Factors Derived in the 1961 Experiment

Tag-terms are listed with their loadings.

1. Academic Achievement
girls (.74), boys (.73), school (.30), achievement (.20), reading (.18)

2. Experimental Psychology: Perception and Learning
perception(ual) (.46), learning (.36), experimental (.29), theory (.25), evidence (.24), visual (.23), field (.21)

3. Social Psychology and Community Organization
organization (.67), community (.54), structure (.38), workers (.22), field (.15), analysis (.15), social (.11), role (.10), job (.10)

4. Studies of College Students
student(s) (.71), college (.70), group(s) (.17), mental (.16), factor(s) (.15), teacher (.14), intelligence (.11), personality (.10)

5. School Guidance and Counseling
program (.42), education(al) (.36), child(children) (.33), parents (.29), guidance (.29), teachers (.28), intelligence (.27), school(s) (.25), counseling (.20)

6. Clinical Psychology and Psychotherapy
treatment (.44), psychiatric (.35), clinical (.32), psychotherapy (.22), case(s) (.16), schizophrenia (.16), theory (.16), group(s) (.12), psychoanalysis (.12), counseling (.11)

7. Educational Measurement
achievement (.46), ability (.36), correlation (.35), scale (.32), group(s) (.22), reading (.30), intelligence (.20), test(s) (.20), school(s) (.19)

8. General Psychology: Psychology as a Science
social (.42), research (.32), science (.31), psychological (.25), status (.24)

9. Developmental Psychology
emotional (.32), development (.32), cerebral (.23), child(children) (.22), theory (.19), life (.18), nature (.18), factor(s) (.18)

10. Therapy: Case Studies
personal (.56), case(s) (.55), therapy (.42), level (.21)

11. Appendix II. Factors Derived in the 1964 Experiment for the Experimental and Validation Groups
Tag-terms are listed with their loadings.

Experimental Group

1. Educational Psychology: Academic Achievement
student (.57), achievement (.51), test (.48), school (.47), grade (.44), college (.34), administered (.32), independent (.32), program (.29), knowledge (.25), correlation (.24), medical (.24), scale (.23)

2. Experimental Psychology: Conditioning
conditioning (.77), reflex (.75), stimulus (.43), academic (.37), stimulation (.36), visual (.31), auditory (.28), action (.28), dogs (.26), motor (.26), sound (.25), reaction (.24), college (.23), threshold (.21), nervous (.20)

3. Educational Psychology: Interest and Ability Testing
physical (.69), women (.64), interest (.63), achievement (.58), teacher (.39), grade (.32), ability (.29), motor (.27)

4. Clinical and Abnormal Psychology: Etiology and Treatment of Mental Disorders
patient (.56), hospital (.43), therapy (.32), treatment (.32), medical (.30), group (.27), mental (.27), psychiatric (.25), program (.20), community (.20)

5. Experimental Psychology: Feelings, Emotion, and Motivation
emotion (.67), feeling (.67), science (.47), nature (.45), psychological (.32), motivation (.28), personality (.27)

6. Physiological Psychology: Central Nervous System
animals (.60), activity (.58), frontal (.58), cortex (.35), field (.33), behavior (.30), nervous (.29)

7. Educational Psychology: Special Problems
retarded (.52), mental (.50), child (.44), I.Q. (.41), academic (.30), achievement (.22), behavior (.22), boys (.21), normal (.20)

8A. Experimental Psychology: Learning and Reinforcement
learning (.34), response (.33), reinforcement (.27), performance (.26), rate (.24), verbal (.23), rat (.23), discrimination (.22), experiment (.22), stimulus (.21), task (.21), group (.20), function (.20)

8B. Social Psychology: Community Resources
health (.27), community (.25), mental (.24), social (.23), psychological (.22)

9. Experimental Psychology: Vision and the Special Senses
image (.60), baby (.43), negative (.32), field (.28), procedure (.26), visual (.23), light (.20), temporal (.20), test (.20)

10. Experimental Psychology: Speech and Hearing
words (.34), language (.31), hearing (.28), speech (.24), structure (.24), threshold (.21), tone (.21)

Validation Group

1. Experimental Psychology: Conditioning
nervous (.83), reflex (.73), ability (.63), conditioning (.63), dogs (.62), cortex (.36), system (.32), motor (.31), stimulus (.29), auditory (.25), failure (.21)

2. Experimental Psychology: Learning and Reinforcement
animals (.53), rate (.52), response (.47), group (.42), sensory (.41), rat (.35), trials (.35), light (.31), reinforcement (.30), conditioning (.29), experiment (.25), food (.23)

3. Educational Psychology: Academic Achievement
student (.63), achievement (.61), college (.56), ability (.40), school (.36), grade (.33), test (.31), average (.29), academic (.27), motivation (.26), science (.25)

4. Clinical and Abnormal Psychology: Etiology and Treatment of Mental Disorders
patient (.64), hospital (.50), treatment (.47), psychiatric (.45), community (.36), techniques (.35), attitude (.34), therapy (.37), population (.27), emotion (.26), women (.23)

5. Experimental Psychology: Vision and the Special Senses
light (.58), sensory (.54), stimulation (.51), function (.42), intensity (.39), visual (.38), rat (.36), baby (.35), auditory (.27), brain (.27), eye (.22), animals (.21), cortex (.21), frontal (.21), retarded (.20)

6. Social Psychology: Community Resources
health (.66), development (.54), child (.41), education (.37), physical (.36), community (.35), research (.32), social (.31), mental (.29), personality (.28), program (.25), concept (.23), emotion (.22), frontal (.21), psychological (.21)

7. Educational Psychology: Special Problems
normal (.57), I.Q. (.49), intelligence (.44), child (.39), dependent (.38), trials (.33), learning (.32), boys (.29), task (.28), negative (.24), verbal (.23), test (.22), motor (.20)

8. Experimental Psychology: Speech and Hearing
employed (.54), noise (.49), frequency (.41), stress (.41), population (.40), words (.39), speech (.37), emotion (.35), concept (.27), system (.24), response (.23)

9. Physiological Psychology: Central Nervous System
perception (.77), color (.65), communication (.42), field (.34), structure (.34), analysis (.25), temporal (.21), conditioning (.20)

10A. Experimental Psychology: Feelings, Emotion, and Motivation
frontal (.31), performance (.29), training (.27), concept (.27), emotion (.24), problem (.22), research (.20)

10B. Educational Psychology: Interest and Ability Testing
scale (.35), physical (.25), behavior (.25), intelligence (.23), child (.22), test (.20)

Postscript: A Personal Reaction to Reading the Conference Manuscripts

Vincent E. Giuliano

It was with great regret that I was unable to attend the conference because of sudden illness. Nonetheless, in my capacity as a member of the committee backing the Symposium, I have had an opportunity to read over the manuscripts carefully. In reading the manuscripts I felt an absence of remarks of an evaluative nature. I have been informed that there was a great deal of lively discussion during the conference, although it was unfortunately impossible to include this material in this volume. This postscript represents a personal comment based on the written record of the Symposium, since the absence of commentary might otherwise make it difficult for readers not familiar with the field to piece together a coherent perspective. The discussions in this book revolve around one central theme, but the theme is approached from a variety of viewpoints which are often conflicting in emphasis, objectives, and methodology.
The main questions which surround the theme are whether the work is of fundamental or transitory significance, whether the techniques will actually prove out in large-scale operational practice, and, in general, what the future for research in this area will hold. To repeat some remarks conveyed in the Introduction, my overall impression is that the work rests on quite solid fundamentals, but that it remains in a very preliminary stage of development, and further clarification of objectives is essential. There are excellent theoretical foundations drawn from the fields of statistics, mathematical psychology, and a tradition of empiricist philosophy. In many instances, the techniques and methodologies used have been previously applied to a number of closely related problems in other fields besides documentation, and are known to be effective. An ability to produce potentially useful results has been demonstrated in several problem areas, including document retrieval, automatic classification, and handling of citations. The methodologies are mostly based on use of very simple counting techniques, with relatively few major questions of workability yet to be resolved. In contrast with some of the other research approaches to problems of machine-aided documentation, such as those based on complex types of logical or grammatical analysis, many of those discussed in this volume seem to offer a real prospect of producing useful results in the foreseeable future. Passing now to what remains to be done, there are at least three areas in which more must be learned about the statistical association techniques: one has to do with what the techniques themselves consist of, another with their usefulness, and the third with the very goals and objectives of the work itself. First, it soon becomes evident to the reader that at least a dozen somewhat different procedures and formulas for association are suggested in the book.
One suspects that each has its own possible merits and disadvantages, but the line between the profound and the trivial often appears blurred. One thing which is badly needed is a better understanding of the boundary conditions under which the various techniques are applicable, and of the expected gains to be achieved through using one or the other of them. This advance would primarily be one in theory: not in abstract statistical theory, but in a problem-oriented branch of statistical theory. Secondly, it is clear that carefully controlled experiments to evaluate the efficacy and usefulness of the statistical association techniques have not yet been undertaken except in a few isolated instances. It is not surprising that this is so, for before one attempts a careful evaluation, one first of all wants to convince oneself that there is something worth evaluating. Nonetheless, it is my feeling that the time is now ripe to conduct carefully controlled experiments of an evaluative nature, for example, experiments designed to measure when and how much a statistical technique for document retrieval yields improvements over conventional coordinate-type retrieval systems. Similar experiments are required for the other applications. Such experimental work has, to some degree, been undertaken by several investigators using relatively small document collections. This work has been and continues to be useful, but extension of evaluation experiments to document collections of realistic size is an essential next step: many problems of system performance are known to be dependent on collection size. My third main point is to question the perspective implicitly adopted in much of the existing work in our area, namely that the techniques are to be mainly useful for completely automatic rather than merely machine-aided document retrieval, abstracting, etc.
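To make the variety of association formulas concrete, here is a small illustrative sketch of two generic measures of the kind surveyed in this volume: raw co-occurrence and a cosine-normalized variant. The toy data and function names are my own, not any one contributor's exact proposal.

```python
# Illustrative sketch of two simple term-association measures.
# "docs" is a toy incidence list: the index terms assigned to each document.
import math

docs = [
    {"computer", "retrieval", "index"},
    {"computer", "index"},
    {"retrieval", "index"},
    {"computer", "retrieval"},
]

def cooccurrence(t1, t2):
    """Number of documents indexed under both terms."""
    return sum(1 for d in docs if t1 in d and t2 in d)

def cosine(t1, t2):
    """Co-occurrence normalized by the terms' document frequencies."""
    n1 = sum(1 for d in docs if t1 in d)
    n2 = sum(1 for d in docs if t2 in d)
    return cooccurrence(t1, t2) / math.sqrt(n1 * n2)

print(cooccurrence("computer", "index"))       # raw count
print(round(cosine("computer", "index"), 3))   # normalized strength
```

Even on this toy scale, the two measures can rank term pairs differently, which is exactly the kind of boundary-condition question raised above.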
Personally, I am far from convinced that completely automatic document retrieval (i.e., without use either of an expert who knows the retrieval system or of external user-machine feedback) is ever going to be a really useful activity except perhaps in certain highly specialized subject areas. Most of the machine searching systems now in existence are man-machine systems; they are likely to remain man-machine systems even if the standards of machine performance can be improved. As yet, however, there has been only modest investigation of using the associative techniques within such a more general man-machine framework. Also, a wide variety of alternative techniques for scientific communication have been proposed and discussed in the literature, including document dissemination based on citations or on researcher interest profiles. It is my suspicion that the system configuration for the next generation of automated documentation systems will not be merely an extension of a term-indexed coordinate retrieval system, but will be something quite different; thus consideration of overall directions must precede the detailed planning of future research. Finally, I would also like to remark briefly on equipment limitations. In the paper by Baker, a discussion is given of the limitations of existing digital computers; the impression may be left that it is impossible to deal with collections of more than 300 index terms on existing machines. I do not feel that the limitation is this severe; there are numerous shortcut techniques for dealing with sparse matrices. Both Spiegel and Stiles have dealt with collections of more terms than this, and at Arthur D. Little, Inc., we are currently experimenting with association of over 1,500 index terms and over 100,000 documents using an IBM 7094 computer. Nonetheless, the economics of manipulating very large matrices of index terms leaves something to be desired.
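One of the sparse-matrix shortcuts alluded to can be sketched as follows: rather than allocating a dense term-by-term matrix, only the co-occurrence cells that are actually nonzero are accumulated, one document at a time. This is an illustrative sketch; the data and names are hypothetical.

```python
# Hedged sketch of a sparse co-occurrence accumulation: a term-by-term
# matrix for thousands of terms is mostly zeros, so only the cells that
# actually occur are stored, keyed by an ordered term pair.
from collections import Counter
from itertools import combinations

def sparse_cooccurrence(indexed_docs):
    """indexed_docs: iterable of term sets; returns {(t1, t2): count}."""
    cells = Counter()
    for terms in indexed_docs:
        # sorting gives each unordered pair a single canonical key
        for t1, t2 in combinations(sorted(terms), 2):
            cells[(t1, t2)] += 1   # only pairs that actually co-occur
    return cells

docs = [{"computer", "index"}, {"computer", "retrieval"},
        {"computer", "index", "retrieval"}]
cells = sparse_cooccurrence(docs)
print(cells[("computer", "index")])   # count for one stored cell
```

Storage then grows with the number of observed pairs rather than with the square of the vocabulary, which is what makes term sets well beyond 300 terms tractable on the machines of the day.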
This has proved to be one of the constraints upon evaluating the proposed procedures on a reasonably large scale, and may well be a bar to implementation of the statistical association methodology even if it is shown to provide improved performance. These considerations continue to suggest, in my opinion, that it would pay to look further into the area of large-capacity, inexpensive permanent memory devices which would handle associative processing in a special-purpose manner. For example, the fact that certain forms of associative processing can be carried out directly by means of simple passive analog network devices could radically change the economics of reducing the techniques to practice. The development of either software schemes or processing devices which affect the economics of associative processing by simplifying the handling of relatively large system matrices thus merits our continued interest and attention.

THE NATIONAL BUREAU OF STANDARDS

The National Bureau of Standards is a principal focal point in the Federal Government for assuring maximum application of the physical and engineering sciences to the advancement of technology in industry and commerce.
Its responsibilities include development and maintenance of the national standards of measurement, and the provision of means for making measurements consistent with those standards; determination of physical constants and properties of materials; development of methods for testing materials, mechanisms, and structures, and making such tests as may be necessary, particularly for government agencies; cooperation in the establishment of standard practices for incorporation in codes and specifications; advisory service to government agencies on scientific and technical problems; invention and development of devices to serve special needs of the Government; assistance to industry, business, and consumers in the development and acceptance of commercial standards and simplified trade practice recommendations; administration of programs in cooperation with United States business groups and standards organizations for the development of international standards of practice; and maintenance of a clearinghouse for the collection and dissemination of scientific, technical, and engineering information. The scope of the Bureau's activities is suggested in the following listing of its three Institutes and their organizational units.

Institute for Basic Standards. Applied Mathematics. Electricity. Metrology. Mechanics. Heat. Atomic Physics. Physical Chemistry. Laboratory Astrophysics.* Radiation Physics. Radio Standards Laboratory:* Radio Standards Physics and Radio Standards Engineering. Office of Standard Reference Data.

Institute for Materials Research. Analytical Chemistry. Polymers. Metallurgy. Inorganic Materials. Reactor Radiations. Cryogenics.* Materials Evaluation Laboratory. Office of Standard Reference Materials.

Institute for Applied Technology. Building Research. Information Technology. Performance Test Development. Electronic Instrumentation. Textile and Apparel Technology Center. Technical Analysis. Office of Weights and Measures.
Office of Engineering Standards. Office of Invention and Innovation. Office of Technical Resources. Clearinghouse for Federal Scientific and Technical Information.**

*Located at Boulder, Colorado, 80301.
**Located at 5285 Port Royal Road, Springfield, Virginia, 22171.

U.S. GOVERNMENT PRINTING OFFICE : 1966