NIDA Research Monograph Series

Synthetic Estimates for Small Areas: Statistical Workshop Papers and Discussion

Editor: Joseph Steinberg

NIDA Research Monograph 24
February 1979

U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE
Public Health Service
Alcohol, Drug Abuse, and Mental Health Administration

National Institute on Drug Abuse
Division of Research
5600 Fishers Lane
Rockville, Maryland 20857

For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C. 20402. Stock Number 017-024-00911-3

The NIDA Research Monograph series is prepared by the Division of Research of the National Institute on Drug Abuse. Its primary objective is to provide critical reviews of research problem areas and techniques, the content of state-of-the-art conferences, integrative research reviews, and significant original research. Its dual publication emphasis is rapid and targeted dissemination to the scientific and professional community.

Editorial Advisory Board

Avram Goldstein, M.D., Addiction Research Foundation, Palo Alto, California
Jerome Jaffe, M.D., College of Physicians and Surgeons, Columbia University, New York
Reese T. Jones, M.D., Langley Porter Neuropsychiatric Institute, University of California, San Francisco, California
William McGlothlin, Ph.D., Department of Psychology, UCLA, Los Angeles, California
Jack Mendelson, M.D., Alcohol and Drug Abuse Research Center, Harvard Medical School, McLean Hospital, Belmont, Massachusetts
Helen Nowlis, Ph.D., Office of Drug Education, DHEW, Washington, D.C.
Lee Robins, Ph.D., Washington University School of Medicine, St. Louis, Missouri

NIDA Research Monograph Series

William Pollin, M.D., Director, NIDA
Marvin Snyder, Ph.D., Acting Director, Division of Research, NIDA
Robert C. Petersen, Ph.D., Editor-in-Chief
Eleanor W. Waldrop, Managing Editor

Parklawn Building, 5600 Fishers Lane, Rockville, Maryland 20857

ACKNOWLEDGMENT

This monograph is based on papers presented at a workshop conducted by Response Analysis, Princeton, New Jersey, under NIDA Contract No. 271-77-3425. The workshop took place on April 13 and 14, 1978, in Princeton.

The National Institute on Drug Abuse has obtained permission from the Journal of Studies on Alcohol, Inc. to quote previously published material which appears on page 224. Further reproduction of this passage is prohibited without specific permission of the copyright holder. With this exception, the contents of this monograph are in the public domain and may be used and reprinted without special permission. Citation as to source is appreciated.

Library of Congress catalog card number 79-600067
DHEW publication number (ADM) 79-801
Printed 1979

NIDA Research Monographs are indexed in the Index Medicus. They are selectively included in the coverage of BioSciences Information Service, Chemical Abstracts, Psychological Abstracts, and Psychopharmacology Abstracts.

Foreword

The Workshop on Synthetic Estimates was cosponsored by the National Institute on Drug Abuse (NIDA) and the National Center for Health Statistics (NCHS). The collaboration came about as follows: In 1974, an inquiry was made of NCHS by NIDA about possible methods of "triangulating" national survey data and census data to produce estimates of incidence or prevalence of drug abuse in States and local areas. Indeed, according to NCHS, there were such methods, called "synthetic estimation," and they had been explored and discussed over a span of about ten years. A short report, Synthetic State Estimates of Disability, published by NCHS in 1968, was one of the few pieces available for the non-technician to consult.
A sparse literature in the statistical journals was available but not easy to collect or disseminate. The two agencies felt there was need for a "consumer report" on the methods. They knew that the methods have an immediate appeal to planners, demographers, program officials, and epidemiologists charged with the task of describing conditions or estimating need in small areas. Yet neither agency was ready to recommend the methods outright because little is known about the quality of synthetic estimates. They wanted to air the strengths and weaknesses of the methods in a group of statisticians and scientists who had thought about them carefully or applied them to real situations of need. Thus the idea of holding a workshop was born.

NCHS is the agency in the Federal Statistical System that has major responsibility for compiling, analyzing, and disseminating general purpose national health and vital statistics. In recent years, the demand for health statistics for small areas has greatly increased, and producing local area statistics has emerged as one of the Center's most difficult and pressing statistical problems.

NIDA has responsibility for providing national statistics on nonmedical drug use and its consequences. Its support of State programs in treatment and prevention has created the need for data reflecting conditions at that level.

Most of NCHS's data systems are incapable of producing local area statistics. The exceptions, those based on complete counts of the population, include the birth and death registration systems, and the data systems for producing health establishment and health manpower statistics. On the other hand, the capabilities of NCHS's sample data systems are limited to producing national estimates, and estimates for the geographic regions and divisions and the larger standard metropolitan statistical areas (SMSA's). Priority was not given to local area statistics when the sample data systems were originally designed.
In most instances, the cost effects would have been prohibitive. Similarly, NIDA has found it prohibitively expensive to require States to conduct their own surveys to establish need. The Client Oriented Data Acquisition Process (CODAP) produces information at the State and SMSA level on treatment admissions and discharges, but other systems provide only national estimates or data on a limited set of local areas.

Local area health data are increasingly needed to implement the programs legislated by Congress. However, changes in the appropriations for health statistics programs have not kept pace with the needs for new data and new data priorities. Therefore, agencies are looking for more cost-effective methods for producing them.

Neither NIDA nor NCHS is committed to synthetic estimation as the keystone of its policy for producing small area statistics. At present, NCHS is investigating two other strategies in addition to synthetic estimation. One of these is the Cooperative Health Statistics System. In this approach, State data systems serve as building blocks for national sample designs and methods for producing local area data. Currently NCHS is exploring the cost and error effects of network surveys, and of computerized telephone surveys based on random digit dialing.

It is our belief that we have assembled the outstanding workers in the field of synthetic estimation for this workshop. We feel that the papers, and the editing by Joseph Steinberg, have resulted in a landmark publication. We hope that future users or potential users of the methods will find this volume a solid foundation for their efforts.

Louise G. Richards, Ph.D.
National Institute on Drug Abuse

Monroe G. Sirken, Ph.D.
National Center for Health Statistics

Contents

Foreword
Louise G. Richards and Monroe G. Sirken

Introduction
Joseph Steinberg

PART I

Small Area Estimation--Synthetic and Other Procedures, 1968-1978
Paul S. Levy

Discussion
Walt R. Simmons
Gary G. Koch

Comments
Paul S. Levy

General Discussion

PART II

A Composite Estimator for Small Area Statistics
Wesley L. Schaible

Discussion
Barbara A. Bailar

Comments
Wesley L. Schaible

General Discussion

Prediction Models in Small Area Estimation
Richard M. Royall

Discussion
Harold Nisselson

General Discussion

A Modified Approach to Small Area Estimation
Steven B. Cohen

Discussion
Joseph Waksberg

General Discussion

PART III

Case Studies on the Use and Accuracy of Synthetic Estimates: Unemployment and Housing Applications
Maria Elena Gonzalez

Some Recent Census Bureau Applications of Regression Techniques to Estimation
Robert E. Fay

Discussion
Eugene P. Ericksen

General Discussion

PART IV

Drug Abuse Applications: Some Regression Explorations with National Survey Data
Reuben Cohen

Discussion
Monroe G. Sirken
Ira Cisin

General Discussion

Applications of Synthetic Estimates to Alcoholism and Problem Drinking
David M. Promisel

Discussion
Donna O. Farley

General Discussion

Synthetic Estimates as an Approach to Needs Assessment: Issues and Experience
Charles G. Froland

Discussion
Reuben Cohen

General Discussion

Expansion of Remarks
Walt R. Simmons

Afterword
Joseph Steinberg

Appendix A: Attendees at Workshop
Appendix B: Workshop Program

List of NIDA Research Monographs

Introduction

Joseph Steinberg

There are many and varied needs for small area data. Traditionally, this has led to consideration of large-scale data collection as the basis for satisfying the need. On occasion, a method has been tried that provided estimates for a number of individual areas on the basis of a direct collection of data for the desired characteristic for only a sample of areas and data on a related characteristic for each area. The Radio Listening Survey, discussed in Hansen, Hurwitz and Madow (1953), is an illustration of this approach used in the early 1940's. Similarly, Lillian Madow used a derived method for providing small area data in a report of the Advertising Research Foundation (1956).

There has been an increase in the use of a variety of procedures for small area estimation since the National Center for Health Statistics (1968) published derived "synthetic estimates."

""Synthetic estimate" is a label that has been given to the product of a class of devices that yield estimates of a target statistic for specific subnational areas, using descriptive data for the specific area in combination with average values of the target statistic for national or regional territory." This is the way Simmons (1977), who coined the term, described the technique which is the focus of this WORKSHOP ON SYNTHETIC ESTIMATES FOR SMALL AREAS.

Discussion of synthetic estimates evokes a great deal of enthusiasm by some and skepticism by others.
The Workshop provided a forum for sharing experiences of what is the current state of the art in methodology and in application. An additional purpose of the Workshop was to suggest refinements of estimating procedure beyond what is currently known.

Invited papers and remarks of invited discussants were the Workshop framework. Extensive informal discussion also helped to serve the purposes of the conference. The papers, invited discussion, and abstracts of the informal discussion constitute the body of this volume.

Papers and associated discussion have been grouped into four parts. A historical overview is the core of Part I. Part II consists of papers on methodological contributions. Groupings of applications constitute Parts III and IV.

Different types of strategies for providing local area estimates were discussed in Levy's paper, which presents a historical perspective of efforts in the past decade. The papers by Schaible and Royall deal with refinements in estimation procedures and use of models. The possibilities in the use of composite estimators were also indicated in some of the work presented by Fay and received consideration during the informal discussion of Froland's paper.

How to devise useful subsets of a population to permit the best application of synthetic estimates received attention. The degree of homogeneity within classes across areas was identified as a primary interest in producing synthetic estimates. Partitioning areas into subareas as one way to help decrease the within-area variance is a facet of Steven Cohen's paper. The use of AID for determining the demographic categories for synthetic estimation is a methodological aspect of Promisel's paper.

The need for the producer to supply information about the quality of the synthetic estimates came up a number of times during the conference. Some possible ways of accomplishing this are described in Fay's and Gonzalez's papers.
Several types of applications of synthetic estimates in the work of the Census Bureau are described in the papers by Gonzalez and Fay. Applications in the drug and alcohol abuse fields are discussed in the papers by Reuben Cohen, Froland and Promisel. Reuben Cohen's paper illustrates use of a multiple regression model.

Publication of the papers and discussion should permit a wider audience of users to understand the characteristics, strengths, and limitations of the current types of synthetic estimators. Producers of subnational data will be able to review the current state of the art as viewed by the Workshop participants.

The desirability of additional research was identified at a number of points in the Workshop. What is known to date is represented by the contributions in these proceedings. It is reasonable to expect that this compilation will help stimulate additional productive ideas and results.

REFERENCES

Advertising Research Foundation. U.S. Television Households, by Region, States, and County. New York, March 1956.

Hansen, M.H., Hurwitz, W.N., and Madow, W.G. Sample Survey Methods and Theory, Vol. I. New York: John Wiley and Sons, 1953.

National Center for Health Statistics. Synthetic State Estimates of Disability. Public Health Service. Washington: U.S. Government Printing Office, 1968.

Simmons, W.R. Subnational Statistics and Federal-State Cooperative Systems. Committee on National Statistics, Assembly of Behavioral and Social Sciences, National Research Council. Washington: National Academy of Sciences, 1977.

Part I

Small Area Estimation -- Synthetic and Other Procedures, 1968-1978
Paul S. Levy

Discussion
Walt R. Simmons
Gary G. Koch

Comments
Paul S. Levy

General Discussion

Small Area Estimation--Synthetic and Other Procedures, 1968-1978

Paul S. Levy

ABSTRACT

Methods for obtaining small area estimates which have emerged over the past decade are reviewed, with particular emphasis given to synthetic estimation, a procedure originally developed at the National Center for Health Statistics which has found wide acceptance because of its simplicity and intuitive appeal, and yet has provoked much controversy because of its lack of good demonstrable statistical properties and its equivocal results when subjected to empirical evaluation. The various methods of obtaining small area estimates are discussed in terms of their statistical properties, the feasibility of using them, and the potential scope of their application. Finally, some recommendations are made concerning possible avenues of future research in small area estimation, and some tentative guidelines are given for choosing between alternative existing methods.

INTRODUCTION

It has now been ten years since the National Center for Health Statistics (NCHS) published estimates for each State in the United States of restricted activity days, bed disability days and other selected variables from the Health Interview Survey (HIS) and, in so doing, introduced in published form the concept of synthetic estimation (National Center for Health Statistics, 1968). At the time, this represented a radical departure from NCHS policy of publishing only estimates known to be for all practical purposes unbiased and for which sampling errors could be estimated.

It was immediately recognized that the importance of this publication lay not in its HIS subject matter, but in its presentation, at a period of time in which local, State, and regional planning were emerging as important issues, of an easily usable, inexpensive and intuitively appealing method of obtaining exactly the kind of small area estimates that were so sorely needed.
At the same time, it was recognized that synthetic estimation is a crude method and that much further work was needed, especially in evaluation of this method. Although the publication listed no individual authors, the project was initiated and carried out under the leadership of Walt R. Simmons, who should be considered the "father" of synthetic estimation if not its inventor.

Since the introduction of synthetic estimation ten years ago, there has been a moderate amount of activity in development of further methodology for small area estimation, especially at the U.S. Bureau of the Census and at the National Center for Health Statistics. Some of this activity was a direct outgrowth of the early NCHS work on synthetic estimation, while other activity, particularly that of Ericksen (1975), had antecedents not in synthetic estimation but in demographic techniques of estimating population changes for small areas. Most of the activity in small area estimation, however, has centered around a relatively small group of statisticians (many of whom are at this conference) who represent, either as staff members or as contractors, the agencies responsible for producing such estimates. Although it is a potentially fertile field for research, it has not as yet attracted the interest of the statistical community at large.

In this paper, I will review the major work of the past decade in small area estimation and will comment on what I feel is needed in the way of future research.

2. METHODS OF PRODUCING ESTIMATES FOR SMALL AREAS

The various methods of producing estimates for small areas that have been given some attention over the past decade are discussed in order of decreasing dependency on actual direct measurement of individuals from the local area. The list is not intended to be exhaustive but represents the types of procedures that are currently being used. Undoubtedly, new procedures will emerge from the presentations at this conference.
2.1 Direct Estimation by Means of Sample Survey or Census

If one wants to estimate some parameter (e.g., mean, total, proportion) of the distribution of a variable, X, in a small area, the most direct method would be to take a sample survey or census of the individuals in the area and measure them with respect to the variable, X. If the sampling plan were that of a probability sample, if the survey were well planned and executed, and if a reasonable algorithm for estimation were used, unbiased estimates would be produced. The disadvantages of this approach are well known, namely the immense amount of resources needed in the way of time, money, and technical expertise for the successful completion of a sample survey that would produce estimates meeting reasonable specifications in the way of reliability or validity.

In spite of the expense involved, it should be recognized that estimates obtained from direct surveys of local areas have tremendous appeal to those individuals responsible for regional, State and local planning, and the consultant who proposes synthetic estimation or some other method of estimation in lieu of a survey is apt to meet some resistance. In order to be effective, the consulting statistician must be able to evaluate the level of accuracy of estimates that can be produced from a sample survey conducted in accordance with the client's limitations in resources, to compare this with the level of accuracy that can be produced by synthetic estimation or some other method of indirect estimation, and to communicate these findings to the client. It is especially important to avoid amateurish, poorly planned and executed surveys, which can only result in inaccurate estimates.

2.2 Methods Using a Combination of Direct Estimation and Imputation

It will generally not be feasible for an independent survey to be conducted in a particular local area for purposes of obtaining local estimates.
The only alternative then is to use data from other sources such as surveys that have been conducted in larger areas, and by some method to relate these estimates from other surveys to estimates for the small area of interest. In this section, we will discuss a method of producing small area estimates from larger area surveys which has the capacity of making extensive and direct use of whatever data are available from the survey specific to the small area.

This method, known as the nearly unbiased estimate, was discussed in the original NCHS publication on synthetic estimation (NCHS 1968). It is based on the fact that for many national surveys such as the Current Population Survey (CPS) and the Health Interview Survey, the United States is grouped into a large number of primary sampling units and the PSU's are grouped into strata on the basis of similar geographic, economic or demographic characteristics. The PSU's are generally one or more counties or SMSA's and each stratum contains one or more PSU's. From each stratum, one PSU is sampled and estimates from the PSU's are inflated to stratum levels and aggregated to produce national estimates. From sample surveys having such designs, nearly unbiased estimates can be obtained for small areas by use of these stratum estimates. In particular, the nearly unbiased estimator, \bar{X}'_a, of the mean level of a variable, X, for a small area, a, is given by:

    \bar{X}'_a = [ \sum_{j=1}^{J} (n_{aj} / n_{.j}) x'_j ] / n_a        (1)

where

    x'_j   = the survey unbiased estimate of the total or aggregate level of X in stratum j,
    n_{aj} = the number of persons in stratum j that belong to area a,
    n_{.j} = the total number of persons in stratum j,
    n_a    = the total number of persons in area a, and
    J      = the total number of strata in the survey.
To illustrate how this estimator is constructed, let us suppose that a population is grouped into three strata as illustrated below in Table 1:

TABLE 1
Number of Persons by Stratum, Estimated Total Level of X for the Total Population, and Number of Persons in Area a by Stratum

    Stratum | Total Population | Estimated Total Level of X | Population in Area a
    1       | 50,000           | 295                        | 10,000
    2       | 20,000           | 327                        | 20,000
    3       | 25,000           | 132                        | 0
    Total population in area a:                               30,000

The nearly unbiased estimate of the mean level of X in area a is given by:

    \bar{X}'_a = [(10,000/50,000)(295) + (20,000/20,000)(327) + (0/25,000)(132)] / 30,000
               = 386/30,000 = .0129

Conceptually, this method imputes the estimate for an entire stratum of the mean level of a characteristic to that part of the stratum that is in the small area of interest. The nearly unbiased estimate either uses local data directly or else imputes on the basis of data from similar small areas. For example, let us suppose that Stratum 1 consists of PSU's 1, 2, and 3, from which PSU 1 has been selected in the sample, and that Stratum 2 consists of PSU's 4, 5, 6 and 7, of which PSU 6 is the sample representative. Let small area a consist of PSU's 1 and 6, small area b consist of PSU's 1 and 5, and small area c consist of PSU's 3, 4 and 5. Then estimates for area a will be obtained completely from local data, estimates for area b partly from local data and partly by imputation, and estimates for area c entirely by imputation.

The bias, B(\bar{X}'_a), of the nearly unbiased estimator, \bar{X}'_a, is given by:

    B(\bar{X}'_a) = \sum_{j=1}^{J} (n_{aj} / n_a) (\bar{X}_j - \bar{X}_{aj})        (2)

where

    \bar{X}_j    = the average level of characteristic X in stratum j, and
    \bar{X}_{aj} = the average level of characteristic X in that part of stratum j that is in area a.

It follows from relation (2) that if there is little diversity within strata with respect to the characteristic being measured, the bias in the nearly unbiased estimate is likely to be small.
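As an arithmetic check on the worked example above, equation (1) can be sketched in a few lines of Python using the Table 1 figures (the function name is my own, not from the paper):

```python
def nearly_unbiased_estimate(strata, n_a):
    """Nearly unbiased estimate of the mean level of X in area a, equation (1).

    strata: one (x_total, n_aj, n_j) tuple per stratum, where x_total is the
    survey estimate x'_j of the stratum total of X, n_aj is the number of
    persons in the stratum belonging to area a, and n_j is the stratum total.
    n_a: total number of persons in area a.
    """
    return sum(x * n_aj / n_j for x, n_aj, n_j in strata) / n_a

# Table 1: three strata; area a contains 30,000 persons.
table_1 = [(295, 10_000, 50_000), (327, 20_000, 20_000), (132, 0, 25_000)]
print(round(nearly_unbiased_estimate(table_1, 30_000), 4))  # 0.0129
```

The third stratum contributes nothing, since no one in area a lives there; the second contributes its full total, since area a contains the entire stratum.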
An empirical study performed at the National Center for Health Statistics used HIS PSU's and stratification to construct nearly unbiased estimates for 42 States of 1960 deaths from all causes, major cardiovascular-renal diseases and deaths from motor vehicles (Levy and French, 1977). Since there was no sampling involved, differences between the nearly unbiased estimates and the true values are due entirely to bias, and the study showed that for each of the three variables, the biases were, in general, quite small.

The problem in the nearly unbiased estimator is likely to lie not in its bias but in its variance, \sigma^2_{\bar{X}'_a}, given by:

    \sigma^2_{\bar{X}'_a} = \sum_{j=1}^{J} [n_{aj} / (n_a n_{.j})]^2 \sigma^2_{x'_j}        (3)

where \sigma^2_{x'_j} is the variance of the survey estimate, x'_j, of the total level of X in stratum j. For most data systems, the \sigma^2_{x'_j} are likely to be quite large since the sample size in any one stratum is likely to be relatively small. In addition, the \sigma^2_{x'_j} might be difficult to estimate from the data if the x'_j are based on complex sample designs.

The approach taken in constructing the nearly unbiased estimate for a small area is to use directly as much actual data from the small area as can be taken from the larger survey, and it is likely that such an approach would yield estimates having small bias but possibly large variance. This same approach was taken by Woodruff (1966) in attempting to obtain small area estimates of retail trade, although his estimation procedure is quite different from that of the nearly unbiased estimator. Theoretical properties of the nearly unbiased estimator have been demonstrated by Levy and French (1977).
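Equation (3) translates directly into code. A minimal sketch, reusing the Table 1 layout with hypothetical stratum-total variances (the variances and the function name are assumptions for illustration, not values from the paper):

```python
def nue_variance(strata, n_a):
    """Variance of the nearly unbiased estimator, equation (3).

    strata: one (var_xj, n_aj, n_j) tuple per stratum, where var_xj is the
    variance of the survey estimate x'_j of the stratum total of X.
    n_a: total number of persons in area a.
    """
    return sum(v * (n_aj / (n_a * n_j)) ** 2 for v, n_aj, n_j in strata)

# Hypothetical stratum-total variances attached to the Table 1 layout.
strata = [(400.0, 10_000, 50_000), (900.0, 20_000, 20_000), (250.0, 0, 25_000)]
var_est = nue_variance(strata, 30_000)
```

Note how the weights [n_aj / (n_a n_.j)]^2 discount each stratum's contribution: a stratum with no population in area a contributes nothing to the variance, however noisy its total.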
2.3 Methods Based on Regression Relationships

A third class of procedures used to obtain small area estimates assumes a relationship between a dependent variable, X, and a set of independent variables, Z_1, . . . , Z_K. Estimates of X for small areas are obtained not from direct measurement of X in the small area as in a sample survey, nor from a combination of direct measurement of X in the small area and imputation based on direct measurement of X in an area similar to the small area as is done in constructing the nearly unbiased estimate, but from measurement of the independent variables Z_1, . . . , Z_K in the small area, and use of the relationship between X and Z_1, . . . , Z_K. The motivation for use of this type of methodology is that if the set of independent variables, {Z_k}, are easily obtainable for the small area and if the relationship between X and the Z_k is strong, then estimates of good quality might be produced at relatively low cost. The major disadvantage of this type of approach is that the resulting estimates are likely to be biased since they are not based on direct measurement of the variable of interest in the small area of interest. This class of methods includes synthetic estimation, which has thus far dominated the field of small area estimation, in addition to other methods that have recently emerged.

2.3.1 Synthetic Estimation

Let us suppose that estimates \bar{x}'_1, \bar{x}'_2, . . . , \bar{x}'_K are available from a survey conducted in a large area (e.g., nationwide) of the mean levels \bar{X}_1, \bar{X}_2, . . . , \bar{X}_K of a variable X in a set of K mutually exclusive and exhaustive classes (e.g., age, sex, race, family income, etc.). Let us suppose also that estimates Z_{a1}, . . . , Z_{aK} are available of the proportion of individuals in a small area, a, belonging to each of the K classes. Then the synthetic estimator, \bar{X}''_a, of the mean level of X in area a, is defined by the relation:

    \bar{X}''_a = \sum_{k=1}^{K} \bar{x}'_k Z_{ak}        (4)

We see from relation (4) that the synthetic estimator, \bar{X}''_a, is a regression estimator in which the \bar{x}'_k are the estimated regression coefficients and the Z_{ak} are the independent variables obtained from the small area. In other words, a synthetic estimate is an estimate obtained from a multiple regression equation in which the independent variables are the small area population proportions falling into mutually exclusive and exhaustive classes (obtained generally on the basis of demographic variables) and the estimated regression coefficients are estimates of the mean level of the dependent variables for the classes, based on a survey or census conducted nationwide or at least in an area much larger than that for which estimates are desired.

There are several reasons why synthetic estimation is very appealing. First and foremost is its intuitive appeal. It seems likely that the mean level of many variables in a population is highly related to the distribution of the population by such demographic variables as age, sex, race, income, residence, etc., which are the independent variables generally used in obtaining synthetic estimates. In addition to its intuitive appeal, synthetic estimates are generally easy and inexpensive to obtain since the independent variables, Z_{ak}, are easily available from census or other population data and the regression coefficients, \bar{x}'_k, are obtainable from national surveys. Some important instances in which synthetic estimates have been used over the past decade are listed in Table 2.
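Equation (4) is simple enough to state directly in code. The class means and proportions below are hypothetical, chosen only to illustrate the weighting:

```python
# Hypothetical inputs for K = 3 classes.
x_k  = [0.8, 2.1, 4.5]   # x'_k: class means estimated from a national survey
z_ak = [0.5, 0.3, 0.2]   # Z_ak: area a's population proportions (sum to 1)

# Equation (4): the synthetic estimate is a proportion-weighted average
# of the national class means.
x_synth = sum(x * z for x, z in zip(x_k, z_ak))
print(round(x_synth, 2))  # 1.93
```

Because the Z_ak sum to one, the synthetic estimate always lies between the smallest and largest national class mean; it can differ across areas only insofar as their demographic compositions differ.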
In addition to the six studies mentioned in Table 2 (plus others not mentioned), it should be noted that biostatisticians and epidemiologists have been using for many years a process very much akin to synthetic estimation in constructing rates and ratios by the indirect method of standardization. According to this method, class specific rates found in a "standard" population are combined in an equation similar to equation (4) with data from a population of interest relating to its proportionate distribution into these classes, to obtain the expected rate that would be obtained in the population of interest on the basis of the standard population's class specific rates. The expected rate is then compared with the observed rate in the population of interest, and the ratio of the observed to expected rates is called a standardized ratio.

Statistical properties of the synthetic estimator such as its variance, bias and mean square error have been developed in papers by Gonzalez and Waksberg (1978) and by Levy and French (1977), along with methods of estimating these parameters from the data. In particular, the variance, \sigma^2_{\bar{X}''_a}, and bias, B(\bar{X}''_a), of a synthetic estimator, \bar{X}''_a, are given by:

    \sigma^2_{\bar{X}''_a} = \sum_{k=1}^{K} Z^2_{ak} \sigma^2_{\bar{x}'_k}
                           + \sum_{k=1}^{K} \bar{x}'^2_k Z_{ak} (1 - Z_{ak}) / n_a
                           - \sum_{k=1}^{K} \sum_{k' \ne k} \bar{x}'_k \bar{x}'_{k'} Z_{ak} Z_{ak'} / n_a
                           + 2 \sum_{k=1}^{K} \sum_{k' > k} Z_{ak} Z_{ak'} cov(\bar{x}'_k, \bar{x}'_{k'})        (5)

and

    B(\bar{X}''_a) = \sum_{k=1}^{K} Z_{ak} (\bar{X}_k - \bar{X}_{ak})        (6)

where

    {Z_{ak}, k=1, . . . , K} = the true proportions of the population of area a falling into each class,
    \sigma^2_{\bar{x}'_k}    = the variance of \bar{x}'_k, k=1, . . . , K,
    n_a                      = the size of the sample upon which the Z_{ak} are based, and
    \bar{X}_{ak}             = the mean level of X in class k of area a.

TABLE 2
Recent Studies Using Synthetic Estimation

1. NCHS, 1968
   Small area: States
   Variables being estimated: 5 HIS variables relating to short and long term disability.
   Independent variables: Population proportions falling into 78 classes on the basis of age, sex, race, residence, family income, family size, industry of head of family.
   Regression coefficients: 1963-1964 HIS estimates of mean level of dependent variables for each class, based on national data.

2. U.S. Bureau of the Census (Gonzalez and Hoza, 1978)
   Small area: Counties, SMSA's
   Variables being estimated: Unemployment rates.
   Independent variables: Population proportions falling into classes on the basis of occupation, sex, race, or on the basis of age-sex-race-marital status.
   Regression coefficients: Current Population Survey (CPS) or census estimates of unemployment, based on the geographic division in which the small area is located.

3. Namekata, Levy and O'Rourke, 1975
   Small area: States
   Variables being estimated: Complete and partial work loss disability.
   Independent variables: Proportion of population falling into 60 age-race-sex-residence classes.
   Regression coefficients: 1970 census estimates of mean levels of complete and partial work loss disability for each of 60 classes for the U.S. as a whole.

4. NCHS, 1977
   Small area: States
   Variables being estimated: 15 HIS variables relating to long and short term disability and to utilization of health services.
   Independent variables: Proportion of population falling into 60 age-sex-race-family size-family income-industry of household head classes.
   Regression coefficients: 1969-1971 HIS estimates of mean level of dependent variables for each class, based on national data.

5. Schaible, Brock and Schanck, 1977
   Small area: Groups of counties, States
   Variables being estimated: Unemployment rates, percent of population having completed college.
   Independent variables: Proportion of population falling into 64 age-sex-race-family size-industry of household head classes.
   Regression coefficients: HIS estimates of mean level of dependent variables for each class, based on national data.

6. Levy, 1971
   Small area: States
   Variables being estimated: 1960 U.S. deaths from four different causes.
   Independent variables: Proportion of population falling into 40 age-sex-race classes.
   Regression coefficients: 1960 U.S. estimates of death rates for each class and for each cause.
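A sketch of equation (6), with hypothetical true values, shows how the bias of a synthetic estimate arises purely from differences between large-area and small-area class means (all numbers below are invented for illustration):

```python
# Hypothetical true quantities for K = 3 classes.
Z_ak   = [0.5, 0.3, 0.2]   # true class proportions in area a
Xbar_k = [0.8, 2.1, 4.5]   # expected class means in the large area
X_ak   = [1.0, 2.0, 4.0]   # true class means within area a

# Equation (6): bias is a proportion-weighted sum of class-mean differences.
bias = sum(z * (xk - xak) for z, xk, xak in zip(Z_ak, Xbar_k, X_ak))
print(round(bias, 2))  # 0.03
```

If the small area's class means matched the national ones exactly (Xbar_k == X_ak for every k), the bias would be zero; no amount of data in the small area can reveal this bias, which is why evaluation requires an external unbiased estimate.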
In most applications of synthetic estimation, both the estimated regression coefficients, X̄'_k, and the estimated population proportions, Z'_ak, are obtained from very large data systems and are likely to have very small sampling variances, so that one would anticipate that the sampling variances of synthetic estimates would be quite small. Estimates of the sampling variances of the 1969-1971 HIS synthetic estimates for States based on equation (5) seem to confirm this, since the coefficients of variation of almost all the synthetic estimates were estimated to be less than 5% (NCHS, 1977).

Examination of equation (6) shows that the bias in a synthetic estimate is a weighted average of the differences between the expected values, X̄_k, of the estimated regression coefficients and the true regression coefficients, X̄_ak, appropriate for the particular class and area. In other words, the bias in a synthetic estimate depends on differences between the class-specific mean levels, X̄_k, for the large area used in obtaining the estimated regression coefficients and the class-specific mean levels, X̄_ak, for the small area. Examination, a priori, of equation (6) cannot lead us to surmise, as we have done for the variance, that the bias of a synthetic estimate is likely to be small. It may in fact be large if the level of a variable X in an individual is less dependent on the individual's being in a particular class than on other factors, and if the distribution of these other factors differs among areas. This might be seen in the following simplified linear model:

    X_{ℓak} = μ + α_k + Σ_{j=1}^{J} β_j Y_{jℓak}    (7)

where

    μ = an overall mean;
    X_{ℓak} = the level of X for individual ℓ in class k of area a;
    α_k = the effect due to being in class k;
    {β_j, j = 1, ..., J} = the effects due to a set of J other variables, Y_1, ..., Y_J; and
    Y_{jℓak} = the level of variable Y_j for individual ℓ in class k of area a, j = 1, ..., J.
Under model (7), the mean level, X̄_ak, for class k of area a would be given by:

    X̄_ak = μ + α_k + Σ_{j=1}^{J} β_j Ȳ_{jak}    (8)

If the class mean levels, Ȳ_{jak}, of the variables Y_j do not differ appreciably among the areas, then the X̄_ak will be approximately the same among areas, which would imply that the bias in the synthetic estimate is likely to be small, even if the β_j are large. On the other hand, differences among areas with respect to those Ȳ_{jak} which are associated with sizeable β_j would indicate the possibility of a large bias in a synthetic estimate.

Evaluation of synthetic estimates has been difficult in situations where the true value of the characteristic being estimated is not known. The difficulty lies primarily in the fact that the bias of the synthetic estimator cannot be estimated from the data used to construct it. Gonzalez and Waksberg (1973) have used a method of evaluation of a set of synthetic estimates based on the fact that if an unbiased estimate, X''_a, exists of the mean level, X̄_a, of variable X in area a, and if X''_a is uncorrelated with the synthetic estimator, X'_a, then an unbiased estimator, MSE'_a, of the mean square error of X'_a is given by:

    MSE'_a = (X''_a − X'_a)² − σ̂²_{X''_a}    (9)

where σ̂²_{X''_a} is an unbiased estimate of the variance of X''_a. Since the X''_a are likely to have high variances (or else they would be competitive with synthetic estimates), it is likely that the estimated mean square errors given in equation (9) are unstable. Realizing this, Gonzalez and Waksberg concentrated on estimating the average mean square error (denoted AMSE) of a set of M synthetic estimates by the more stable estimator:

    AMSE' = (1/M) Σ_{a=1}^{M} (X''_a − X'_a)² − (1/M) Σ_{a=1}^{M} σ̂²_{X''_a}    (10)

Using this criterion, Gonzalez and Waksberg (1973) evaluated synthetic estimates of unemployment for SMSA's against competing unbiased estimates, and found that synthetic estimates were superior to unbiased estimates for monthly rates, but that the reverse was true for annual unemployment rates.
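The Gonzalez-Waksberg evaluation of equations (9) and (10) amounts to a short computation: each squared difference between the unbiased and synthetic estimates, minus the estimated variance of the unbiased estimate, estimates the MSE of the synthetic estimate, and averaging over areas gives the AMSE. The arrays below are invented for illustration.

```python
import numpy as np

def amse(synthetic, unbiased, var_unbiased):
    """Average estimated mean square error over a set of small areas."""
    per_area_mse = (unbiased - synthetic) ** 2 - var_unbiased  # equation (9)
    return per_area_mse.mean()                                  # equation (10)

synthetic    = np.array([4.2, 5.1, 3.8, 6.0])   # synthetic estimates X'_a
unbiased     = np.array([4.8, 4.7, 4.1, 6.9])   # direct unbiased estimates X''_a
var_unbiased = np.array([0.20, 0.25, 0.15, 0.30])

print(round(amse(synthetic, unbiased, var_unbiased), 4))  # -> 0.13
```

Note that individual per-area values from equation (9) can be negative (two are negative here), which is one symptom of the instability that motivates averaging over the M areas.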
Some studies have been designed to evaluate synthetic estimates by comparing them with known true values of the parameter being estimated. Such studies have been performed for variables such as death rates from selected causes (Levy 1971), complete and partial work disability (Namekata, Levy, and O'Rourke 1975), and unemployment rates and percent completing college (Schaible, Brock, and Schnack 1977). The overall conclusion emerging from these empirical evaluation studies concerning the accuracy of synthetic estimates is at best equivocal. For some variables, synthetic estimates were quite accurate, whereas for others they were not good at all.

Two interesting findings have emerged from these and other evaluation studies. It has been found in most instances that there is not much variability in the Z'_ak among small areas, and that as a result, there is generally not much variability among small areas with respect to the actual values of synthetic estimates. For this reason there is often low correlation, over a set of small areas, between synthetic estimates and true values of the parameter being estimated, and this is a serious deficiency if the synthetic estimates are being used to order a set of small areas on the basis of the variable being estimated. A second finding is that the large number of classes used to construct synthetic estimates is probably not needed, since the values of synthetic estimates based on relatively small numbers of classes correlate very well with values of synthetic estimates based on a much larger number of cells.

2.3.2 Other Methods Based on Regression Relationships

Perhaps the most successful use of small area estimation has been in the estimation of population changes for small areas.
In particular, Ericksen (1974 and 1975) has built a regression equation using as independent variables data on births, deaths and school enrollment for CPS PSU's, and as the dependent variable, data on population size for these PSU's as estimated from the CPS. This regression equation was then used to estimate population changes from 1960 to 1970 for 2,586 counties, and the agreement between the predicted values and the actual census values was, in general, quite good. Perhaps the main reason that a regression method worked so well in this application lies in the fact that the independent variables (births, deaths, and school enrollment) are known to be very highly correlated with population change.

Two methods have been developed in which synthetic estimates are constructed and then used essentially as independent variables in a regression equation which includes other variables characterizing the small area of interest. One such method, proposed by Levy (1971), assumes the following model:

    Y_a = β_0 + β_1 W_{a1} + ... + β_h W_{ah}    (11)

where

    Y_a = 100 (X'_a − X̄_a)/X̄_a;
    β_i, i = 0, ..., h, are a set of regression coefficients; and
    W_{ai}, i = 1, ..., h, are values for area a of a set of independent variables.

In other words, Y_a, the percentage difference between a synthetic estimate, X'_a, and the true mean level, X̄_a, of a variable X in a small area a, is assumed to be a linear function of a set of independent variables, W_{a1}, ..., W_{ah}. If enough larger areas are available for which X̄_a, X'_a, and the set of W's are known, then the regression coefficients, β_i, can be estimated, and by use of these estimated regression coefficients, β̂_i, an estimator, X̂_a, can be derived from equation (11) and used for small area estimation as an "improved" synthetic estimator. This estimator is given by:

    X̂_a = X'_a / [1 + .01 (β̂_0 + β̂_1 W_{a1} + ... + β̂_h W_{ah})]    (12)

This estimator, when evaluated on mortality data, showed a considerable improvement over the synthetic estimator (Levy 1971).
A similar approach was taken by Gonzalez and Hoza (1978), who used synthetic estimates of unemployment as an independent variable along with other independent variables and built a regression equation to produce small area estimates of unemployment.

The approach taken by these two regression procedures is based on the realization that some kind of regression estimator is likely to be an improvement over a direct estimate for a small area even when such an estimate is obtainable, and that the synthetic estimate, while useful, does not tell enough of the story to accurately estimate a population parameter.

2.4 Methods Based on a Combination of Regression Methods and Direct Estimation

Very recently, Schaible, Brock and Schnack (1977) have proposed an estimator based on a linear combination of a direct unbiased estimator and a synthetic estimator. The rationale for their estimator is that often the same data from which the regression coefficients, X̄'_k, are obtained for the synthetic estimator contain sample units from the small areas for which estimates are desired, and that often these sample data can be used by themselves to obtain direct estimates for the local areas. In particular, they speculated that the mean square error, denoted b', of a synthetic estimate, X'_a, is relatively independent of n_a, the number of units sampled in area a, whereas the mean square error of a direct estimate, X''_a, is dominated by its variance rather than its bias and is of the form b/n_a. Then the linear combination of X''_a and X'_a which has the minimum mean square error over all such linear combinations is given by:

    X̂_a = C X''_a + (1 − C) X'_a    (13)

where

    C = n_a / (n_a + (b/b'))    (14)

If X''_a and X'_a had equal mean square errors, then C = ½ and:

    b/b' = n_a    (15)

Thus, from relation (15), b/b' is equal to the sample size, n_a, at which synthetic and direct estimates have equal error.
From available data, Schaible, Brock and Schnack were able to estimate b/b', and hence C, for two HIS variables, and demonstrated that their composite estimator had considerably lower average MSE than either X''_a or X'_a used alone.

3. WHERE SMALL AREA ESTIMATION STANDS NOW AND WHERE IT SHOULD GO

When demographics tell most of the story concerning the expected level of a characteristic, the synthetic estimator is likely to be the estimator of choice. However, the empirical studies of the synthetic estimator have accumulated sufficient evidence to indicate that for most variables of interest, demographics do not tell most of the story. As a consequence, there is a general feeling of dissatisfaction with synthetic estimation. However, there seems to be no clarion call for allocating the huge amount of resources needed to obtain good small area estimates by direct estimation.

It seems that the most productive approach would be to develop an estimator based on demographics, on whatever direct information is available for the small area with respect to the dependent variable being estimated, and on independent variables other than demographics. The statistical properties of any such estimation procedure should be established, and by that I mean not only variance and bias, but such characteristics as optimality, cost efficiency and admissibility. To investigate these properties and gain some insight, it might be necessary to go beyond conventional finite population sampling and estimation theory.

Good local planning requires good local estimates. At present, we cannot deliver these for most variables. However, if we make this a high priority item for statistical research and build upon what has been developed over the past decade, it is likely that much progress will be made in the next decade.

REFERENCES

Ericksen, E.P. A regression method for estimating population changes of local areas. Journal of the American Statistical Association, 69: 867-875, 1974.
Ericksen, E.P. A method for combining sample survey data and symptomatic indicators to obtain population estimates for local areas. Demography, 10: 137-160, 1975.

Gonzalez, M.E., and Waksberg, J.E. Estimation of the error of synthetic estimates. Presented at the first meeting of the International Association of Survey Statisticians, Vienna, Austria, 1973.

Gonzalez, M.E., and Hoza, C. Small area estimation with applications to unemployment and housing estimates. Journal of the American Statistical Association, 73: 1978.

Levy, P.S. The use of mortality data in evaluating synthetic estimates. Proceedings of the American Statistical Association, Social Statistics Section: 328-331, 1971.

Levy, P.S., and French, D.K. Synthetic Estimation of State Health Characteristics Based on the Health Interview Survey. Vital and Health Statistics: Series 2, No. 75. DHEW Publication (PHS) 78-1349. Washington: U.S. Government Printing Office, 1977.

Namekata, T.; Levy, P.S.; and O'Rourke, T.W. Synthetic estimates of work loss disability for each State and the District of Columbia. Public Health Reports, 90: 532-538, 1975.

National Center for Health Statistics. Synthetic State Estimates of Disability. PHS Publication No. 1759. Public Health Service. Washington: U.S. Government Printing Office, 1968.

National Center for Health Statistics. State Estimates of Disability and Utilization of Medical Services, United States, 1969-1971. DHEW Publication No. (HRA) 77-1241. Health Resources Administration. Washington: U.S. Government Printing Office, Jan., 1977.

Schaible, W.L.; Brock, D.B.; and Schnack, G.A. An empirical comparison of the simple inflation, synthetic and composite estimators for small area statistics. Proceedings of the American Statistical Association, Social Statistics Section: 1017-1021, 1977.

Woodruff, R.A. Use of a regression technique to produce area breakdowns of the Monthly National Estimates of Retail Trade.
Journal of the American Statistical Association, 61: 497-504, 1966.

Discussion

Walt R. Simmons

INTRODUCTION

Let me say first that Paul Levy's paper is an excellent introduction to our workshop on synthetic estimates, and an opening review of efforts to produce useful estimates for subnational areas. I should like to offer my general perspective of these issues. You will discover that Paul already has touched on several facets that I consider particularly important, while a scanning of the agenda suggests that other elements of my position will be treated by other speakers.

A CENTRAL CONCEPT

I start with a central concept, or model, or proposition. Let us say that the primary objective is to estimate a parameter z for a defined universe. Consider a very general estimator

    z' = Σ_a W_a X_a

in which X_a is an estimator for the a-th component of the z-value,

[A page of tabular material from the source is illegible at this point.]
TABLE 2. Average Squared Errors of the Model 3, Approximate MMSE Composite Estimator for Various Values of the Ratio R and for Five Variables, Forty-Nine States, Health Interview Survey, 1969-1971

[Table entries, giving average squared errors for the variables Less than one, Married, Separated, Completing High School, and Completing College at values of R from 0 to 20,000, are largely illegible in the source.]

V. SUMMARY

The composite estimator (1), a weighted sum of two component estimators, has a mean square error that is smaller than the larger of the mean square errors of the two component estimators.
This statement is not as trivial as it may first seem when it is noted that little information is usually available concerning the magnitude of the mean square errors of the component estimators. The composite estimator has a mean square error which is smaller than that of either component estimator when an appropriate weighting scheme is used.

The estimation of the optimum weight for the composite estimator is a major problem which deserves further attention. However, the composite estimator is surprisingly insensitive to poor estimates of the optimum weight. This insensitivity depends on the relative sizes of the mean square errors of the component estimators. The composite estimator is most insensitive when the mean square errors of the two component estimators do not differ greatly. The percent reduction in mean square error of the composite estimator over those of the component estimators also depends on the relationship between the mean square errors of the component estimators.

Data were used to produce composite estimates and to calculate squared errors and correlation coefficients of estimates versus actual values. Only small differences were apparent in average squared errors or in correlation coefficients when an approximation rather than the minimum mean square error weight was used. This was true even when a fairly unrealistic model was used to produce estimates. In all cases the composite estimator produced an average squared error as small as, or smaller than, that of either component estimator. In some cases the percent reductions in average squared errors were large.

Although composite estimators have been used to produce small area estimates, there are two major problems which need additional attention. The first problem is to decide how to estimate the composite estimator weight.
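The insensitivity to the weight can be seen in a two-line computation. For two independent, unbiased component estimators with mean square errors m1 and m2 (invented values here), the composite C·Y1 + (1−C)·Y2 has MSE(C) = C²·m1 + (1−C)²·m2, minimized at C* = m2/(m1 + m2); perturbing the weight away from C* changes the MSE only in second order.

```python
def mse(c, m1, m2):
    """MSE of the composite of two independent unbiased estimators."""
    return c * c * m1 + (1 - c) * (1 - c) * m2

m1, m2 = 1.0, 1.5
c_star = m2 / (m1 + m2)            # optimum weight: 0.6
best = mse(c_star, m1, m2)         # 0.6
off = mse(c_star - 0.2, m1, m2)    # weight badly misestimated: 0.7
print(best, off)
```

A 20-point error in the weight costs only about 17 percent in MSE here, and the penalty is smallest when m1 and m2 are close, which matches the pattern described above.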
Under a simple model the weighting scheme for the James-Stein estimator can be viewed as one method of estimating the composite minimum mean square error weight, but other methods may be better. Under more realistic models the relationship between the James-Stein weighting scheme and the minimum mean square error weight is not so clear. An alternative approach, which has been used to produce weights for composite estimates in the report State Estimates of Disability and Utilization of Medical Services (NCHS, 1978), is to assume specific error functions for the component estimators and, for a given sample and set of small areas, to estimate the relative magnitude of the parameters for a selected group of variables. This approach, although not ideal, may be useful since the composite estimator is quite insensitive to bad estimates of minimum mean square error weights.

The second problem is to discover how to provide measures of error for a composite estimator for a given small area. This problem is common to all biased small area estimators and is likely to be a difficult one to solve. One way to provide information on the performance of biased small area estimators is to compute average measures of error using variables for which actual errors can be computed. Although this information is useful, it is more useful to have some measure of how well the estimator is likely to perform in a particular small area of interest.

APPENDIX I

SIMPLE DIRECT AND SYNTHETIC ESTIMATORS

Let Y_{dαi} denote the observation of interest for the ith sample unit (i = 1, 2, ..., n_{dα}) in the αth (α = 1, 2, ..., K) demographic class in the dth (d = 1, 2, ..., D) small area. The simple direct estimator for small area d is then

    Ȳ'_d = (1/n_d) Σ_{α=1}^{K} Σ_{i=1}^{n_{dα}} Y_{dαi},  where n_d = Σ_α n_{dα}.

The simple direct estimator is more widely used than the synthetic or composite estimators. Its simplicity is appealing, and with an appropriate sample design it is unbiased and its variance can be estimated.
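The simple direct estimator above, and the synthetic estimator developed in the remainder of this appendix, can be sketched with a toy cross-classified sample (two areas by two classes, all observations and population counts invented):

```python
import numpy as np

# y[(d, alpha)] holds the sample observations for area d, class alpha;
# N[(d, alpha)] holds the corresponding population counts N_{d,alpha}.
y = {
    (0, 0): [1.0, 2.0], (0, 1): [4.0, 6.0],
    (1, 0): [2.0, 2.0], (1, 1): [5.0, 7.0],
}
N = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 20}

def direct(d):
    """Simple direct estimator: mean of all sample units in area d."""
    obs = [v for (dd, a), vals in y.items() if dd == d for v in vals]
    return np.mean(obs)

def synthetic(d, n_classes=2):
    """Synthetic estimator: large-area class means weighted by the area's
    population shares N_{d,alpha} / N_d."""
    N_d = sum(N[d, a] for a in range(n_classes))
    est = 0.0
    for a in range(n_classes):
        class_mean = np.mean([v for (dd, aa), vals in y.items()
                              if aa == a for v in vals])
        est += (N[d, a] / N_d) * class_mean
    return est

print(direct(0), synthetic(0))  # -> 3.25 2.6875
```

The direct estimator uses only area 0's four observations; the synthetic estimator borrows every area's data through the class means and differs from the direct estimate exactly because area 0's class composition differs from its sample composition.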
However, when used to estimate for small areas from samples designed for large areas, the conventional sampling theory model yields little information about the properties of this estimator. For this reason alternative estimators have been proposed.

In addition to the above notation, let N_{dα} represent the number of units in the population in area d and class α. The sample mean of the αth demographic class for the large area is then

    Ȳ_α = (1/n_α) Σ_{d=1}^{D} Σ_{i=1}^{n_{dα}} Y_{dαi},  where n_α = Σ_d n_{dα},

and the synthetic estimator for small area d is

    Ȳ''_d = Σ_{α=1}^{K} (N_{dα}/N_d) Ȳ_α.

The α-cells for State synthetic estimates in this paper were defined to be the 64 cells created by cross-classifying the following variables:

1. Color: white; other
2. Sex: male; female
3. Age: under 17 years; 17-44 years; 45-64 years; 65 years and over
4. Family size: fewer than 5 members; 5 members or more
5. Industry of head of family: Standard Industrial Classifications: (1) forestry and fisheries, agriculture, construction, mining and manufacturing; (2) all other industries.

APPENDIX II

WEIGHTING SCHEMES

The expressions used to estimate composite estimator weights are specified below. The models and weighting schemes correspond to those in text table I.

The minimum mean square error (MMSE) weight under Model 1 was estimated by the sample analogue, summed over the 49 States, of

    C* = [E(Ȳ''_d − Ȳ_d)² − E(Ȳ'_d − Ȳ_d)(Ȳ''_d − Ȳ_d)] / [E(Ȳ'_d − Ȳ_d)² + E(Ȳ''_d − Ȳ_d)² − 2 E(Ȳ'_d − Ȳ_d)(Ȳ''_d − Ȳ_d)]

where Ȳ_d denotes the actual value for area d. The MMSE weights under Models 2 and 3 were estimated by the same expression with the error terms for the direct estimator replaced by the error function assumed under the model; in particular, under Model 3 the mean square error of the direct estimator was taken to be of the form b'/n_d,
where b' was estimated by fitting a curve of the form b'/n_d to the individual squared errors of the direct estimates.

The minimum mean square error weights restricted to the interval zero to one (MMSE [0,1]) were estimated for each model as specified above, except that they were restricted to the interval [0,1]. The approximate MMSE weights were estimated for each model as specified above, except that the crossproduct terms were omitted.

REFERENCES

Efron, Bradley, and Morris, Carl. Stein's estimation rule and its competitors: An empirical Bayes approach. Journal of the American Statistical Association, 68(341): 117-130, 1973.

Efron, Bradley, and Morris, Carl. Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350): 311-319, 1975.

Ericksen, Eugene P. Recent developments in estimation for local areas. Proceedings of the American Statistical Association, Social Statistics Section, 1973. pp. 37-41.

Gonzalez, Maria E. Use and evaluation of synthetic estimates. Proceedings of the American Statistical Association, Social Statistics Section, 1973. pp. 33-36.

Gonzalez, Maria E., and Waksberg, Joseph E. Estimation of the error of synthetic estimates. Presented at the first meeting of the International Association of Survey Statisticians, Vienna, Austria, 1973.

Gonzalez, Maria E., and Hoza, Christine. Small area estimation of unemployment. Proceedings of the American Statistical Association, Social Statistics Section, 1975. pp. 437-443.

James, W., and Stein, C. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, 1961. pp. 361-370.

Levy, Paul S. The use of mortality data in evaluating synthetic estimates. Proceedings of the American Statistical Association, Social Statistics Section, 1971. pp. 328-331.

Levy, P.S., and French, D.K. Synthetic estimation of State health characteristics based on the Health Interview Survey. Vital and Health Statistics, Series 2-75 (78-1349).
Public Health Service, National Center for Health Statistics, 1977.

Namekata, Tsukasa; Levy, Paul S.; and O'Rourke, Thomas W. Synthetic estimates of work loss disability for each State and the District of Columbia. Public Health Reports, 90: 532-538, 1975.

National Center for Health Statistics. Synthetic State Estimates of Disability. Public Health Service Pub. No. 1759. 1968.

National Center for Health Statistics. State Estimates of Disability and Utilization of Medical Services: United States, 1974-76. 1978 (in press).

Royall, Richard M. Discussion of two papers on recent developments in estimation of local areas. Proceedings of the American Statistical Association, Social Statistics Section, 1973. pp. 43-44.

Royall, Richard M. Statistical Theory of Small Area Estimates: Use of Prediction Models. Unpublished report prepared under contract from the National Center for Health Statistics, 1977.

Schaible, Wesley L. A Comparison of the Mean Square Errors of the Poststratified, Synthetic and Modified Synthetic Estimators. Unpublished report, Office of Statistical Research, National Center for Health Statistics, 1975.

Schaible, Wesley L.; Brock, Dwight B.; and Schnack, George A. An Empirical Comparison of Two Estimators for Small Areas. Presented at the Second Annual Data Use Conference of the National Center for Health Statistics, Dallas, Texas, 1977a.

Schaible, Wesley L.; Brock, Dwight B.; and Schnack, George A. An empirical comparison of the simple inflation, synthetic and composite estimators for small area statistics. Proceedings of the American Statistical Association, Social Statistics Section, 1977b. pp. 1017-1021.

ACKNOWLEDGMENT

The author would like to thank Barry Peyton of the Office of Statistical Research, NCHS, for the computation of estimates for this paper.

Discussion

Barbara A.
Bailar

If one defines a composite estimator as a weighted average of two or more estimators, one finds that they have been used for many years for many different kinds of characteristics because of their desirable properties. In the two applications I know best, the Current Population Survey for labor force estimates and the Retail Trade Survey for retail sales estimates, their variance reduction property is extremely important. It is interesting to see an application of this technique in the area of small area estimation.

In one of the earliest reports on the use of synthetic estimation by the National Center for Health Statistics, Synthetic State Estimates of Disability (1968), a composite estimate combining two different kinds of estimates was investigated. However, that composite estimator was not the estimator suggested by Schaible. Interestingly enough, at the 1973 meeting of the American Statistical Association, at which Gonzalez and Ericksen presented papers on estimators and evaluation of estimators for small areas, each of the discussants suggested composite estimators. Royall speculated that a combination of the direct estimator and the synthetic estimator would be better than either alone. Kaitz suggested a combination of the synthetic and the regression estimators to yield an estimator superior to either alone.

Let me now turn to specific comments on the Schaible paper. He introduces the composite estimator as:

    Ŷ_d = C_d Y'_d + (1 − C_d) Y''_d

and then proceeds to write the mean square error (MSE) of the estimator as:

    MSE(Ŷ_d) = C_d MSE(Y'_d) + (1 − C_d) MSE(Y''_d) − C_d (1 − C_d) E(Y'_d − Y''_d)².

This is a curious way of writing the MSE, though correct, considering that it wasn't used in this way throughout the rest of the paper. All of the results claimed seem much easier to derive if the estimator is written as a difference estimator. I will return to this later. The conditions that Schaible mentions that might help in estimating the weights for the optimum C*_d are unrealistic.
The first condition mentioned is when each component estimator is unbiased and the two are independent. Since one of the component estimators is the synthetic estimator, which is usually biased, this condition would rarely be met. The second condition that makes the estimation of C*_d more manageable is when E(Y'_d − Y_d)(Y''_d − Y_d) is small relative to MSE(Y''_d). This, again, would occur rarely. On the other hand, the empirical results show that even if E(Y'_d − Y_d)(Y''_d − Y_d) is not small in relation to MSE(Y''_d), it doesn't seem to matter, at least for the characteristics studied.

It is interesting to observe that the weight is not restricted to the interval (0,1). Most of the applications would seem to confine it to this interval, but the theory holds even when this is not the case. It was noted in the paper that Royall (1977) had shown that if the component estimators are unbiased, then the composite estimator has smaller variance than either component if the weight lies between 2C* − 1 and 2C*. If the composite estimator is written in the form of a difference estimator, one can see this is an old familiar problem. Suppose Y'' is the component with smaller variance, and write

    Ŷ = Y'' + C(Y' − Y'').

Then

    Var(Ŷ) = Var(Y'') + C² E(Y' − Y'')² + 2C E[Y''(Y' − Y'')],

with expectations taken about the means. Thus Var(Ŷ) ≤ Var(Y'') if

    C² E(Y' − Y'')² + 2C E[Y''(Y' − Y'')] ≤ 0,

or

    C² − 2C C* ≤ 0,

where

    C* = − E[Y''(Y' − Y'')] / E(Y' − Y'')².

Now C ∈ (0,1), so

    0 ≤ C ≤ 2C*.

Now reverse the roles of Y' and Y'', and replace C by (1 − C), to get

    1 − C ≤ 2(1 − C*), or C ≥ 2C* − 1.

In Schaible's paper, as presented at the Workshop, his statement about the percent reduction in the mean square error was proffered without identifying some unstated assumptions. In reviewing this aspect of his paper, we again write Ŷ as a difference estimator,

    Ŷ = Y'' + C(Y' − Y''),

and letting Y'' have the smaller MSE (the other argument is analogous and will be omitted),

    MSE(Ŷ) = E(Ŷ − Y)²,

where Y is the population value.
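Royall's interval can be checked with a small simulation (all distributions and parameters below are invented): with two unbiased, correlated components, the composite Y'' + C(Y' − Y'') has variance no larger than Var(Y'') exactly when C lies between 0 and 2C*.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
truth = 3.0
y2 = truth + rng.normal(scale=1.0, size=n)                       # Y'' (smaller variance)
y1 = truth + 0.5 * (y2 - truth) + rng.normal(scale=2.0, size=n)  # Y' (correlated with Y'')

d = y1 - y2
# C* = -E[Y''(Y' - Y'')] / E(Y' - Y'')^2, deviations taken about the means.
c_star = -np.mean((y2 - y2.mean()) * (d - d.mean())) / np.mean(d ** 2)

def var_composite(c):
    return np.var(y2 + c * (y1 - y2))

inside = var_composite(c_star)             # weight inside (0, 2C*)
outside = var_composite(2 * c_star + 0.1)  # weight beyond 2C*
```

With these parameters the composite at C* has visibly smaller variance than Y'' alone, while a weight pushed past 2C* does worse than Y'', as the inequality predicts.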
where y_I is the n-vector of y-values associated with sample units, and y_II is the vector of (N − n) non-sample y-values. We model y as a realization of a random variable

    Y = (Y_I, Y_II)

having mean vector Xβ and covariance matrix V. We partition X and V according to sample and non-sample units:

    X = (X_I over X_II),  V = ((V_I, V_{I,II}) over (V_{II,I}, V_II)).

If the vector β has dimension p, then X_I is n × p, X_II is (N − n) × p, V_I is the n × n variance-covariance matrix of Y_I, V_{I,II} is the n × (N − n) matrix of covariances between the n elements of Y_I and the (N − n) elements of Y_II, etc. We consider estimating a linear function, ℓ'y, and we partition ℓ as we did y, so that

    ℓ'y = ℓ'_I y_I + ℓ'_II y_II.

Theorem: Among linear estimators h'Y satisfying E(h'Y − ℓ'Y) = 0, the error variance Var(h'Y − ℓ'Y) is minimized by

    ĥ'Y = ℓ'_I Y_I + ℓ'_II [X_II β̂ + V_{II,I} V_I⁻¹ (Y_I − X_I β̂)]

where

    β̂ = (X'_I V_I⁻¹ X_I)⁻¹ X'_I V_I⁻¹ Y_I.

The error variance of this estimator is

    Var(ĥ'Y − ℓ'Y) = ℓ'_II (V_II − V_{II,I} V_I⁻¹ V_{I,II}) ℓ_II
                     + ℓ'_II (X_II − V_{II,I} V_I⁻¹ X_I)(X'_I V_I⁻¹ X_I)⁻¹ (X_II − V_{II,I} V_I⁻¹ X_I)' ℓ_II.

The optimal estimator consists of the sum of the known part of ℓ'y, namely ℓ'_I y_I, and the BLU predictor of ℓ'_II y_II,

    ℓ'_II [X_II β̂ + V_{II,I} V_I⁻¹ (Y_I − X_I β̂)].

If the sample and non-sample units are uncorrelated (V_{II,I} is the zero matrix), the predictor of ℓ'_II Y_II is simply ℓ'_II X_II β̂, the BLU estimator of E(ℓ'_II Y_II). For the present problem of estimating a domain total, the ℓ vector consists of ones in the positions corresponding to the domain-d units in Y, and zeros in all other positions.

4. ESTIMATION IN CROSS-CLASSIFIED POPULATIONS

Although the preceding theorem provides estimators for problems of rather general structure, we will study only some relatively simple cases where the population units are cross-classified: each unit falls in one of D domains and also belongs to one of C classes. Thus the population is partitioned into CD class-by-domain cells.
If unit i falls into class c and domain d then we say i belongs to cell (c, d). Let s(c, d) denote the sample from cell (c, d) and let \bar{s}(c, d) denote the set of non-sample units in this cell. The domain total T_d to be estimated can now be written

    T_d = \sum_c \sum_{i \in s(c,d)} y_i + \sum_c \sum_{i \in \bar{s}(c,d)} y_i .    (2)

We will denote by N_{cd} the number of units in cell (c, d) and by n_{cd} the number of units in the sample from this cell. Of course a sample s(c, d) can be the empty set, in which case n_{cd} = 0 and \bar{s}(c, d) is the set of all N_{cd} units in cell (c, d).

All of the models we will study here treat the Y_i's within a given class-by-domain cell as exchangeable. For our purposes this means that if different units i, j, k, and \ell belong to the same cell, then Y_i, Y_j, Y_k, and Y_\ell all have the same probability distribution, and the pair (Y_i, Y_j) has the same joint distribution as (Y_k, Y_\ell). Exchangeability implies that within a given cell all units have a common mean and variance, and all pairs of units have a common covariance. This implies that if \bar{Y}_{s(c,d)} is the average for sample units in cell (c, d), there are constants \mu_{cd}, \sigma^2_{cd}, and \rho_{cd} such that

    E(\bar{Y}_{s(c,d)}) = \mu_{cd}
    Var(\bar{Y}_{s(c,d)}) = \rho_{cd}\sigma^2_{cd} + (1 - \rho_{cd})\sigma^2_{cd}/n_{cd}
    Cov(\bar{Y}_{s(c,d)}, \bar{Y}_{\bar{s}(c,d)}) = \rho_{cd}\sigma^2_{cd} ,

where Cov(Y_i, Y_j) = \rho_{cd}\sigma^2_{cd} for every pair i \ne j in cell (c, d).

4.1 Cell Means Unrelated

With no further assumptions we can give an unbiased estimator of the domain total T_d, provided that all cells in domain d are sampled. This is the "post-stratified" estimator

    \hat{T}_d^{(A)} = \sum_c \hat{T}_{cd}^{(A)} ,    (3)

where \hat{T}_{cd}^{(A)} is the expansion estimator N_{cd}\bar{y}_{s(c,d)}. In fact the Theorem can be applied to show that if the Y_i's are exchangeable within cells, and if Y_i and Y_j are uncorrelated whenever units i and j belong to different cells, then \hat{T}_d^{(A)} is the optimal (BLU) estimator.
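The post-stratified estimator (3) is a one-line computation once cell counts and cell sample means are in hand. The cell sizes and sample values below are invented toy numbers.

```python
import numpy as np

# Post-stratified ("expansion") estimator of a domain total, equation (3):
# T^(A)_d = sum over classes c of N_cd * mean of the sample from cell (c, d).
N_cd = {1: 40, 2: 60, 3: 25}                                # units per cell in domain d
samples = {1: [2.0, 3.0], 2: [5.0, 4.0, 6.0], 3: [1.0]}     # y-values from s(c, d)

T_A = sum(N_cd[c] * np.mean(samples[c]) for c in N_cd)
print(T_A)   # 40*2.5 + 60*5.0 + 25*1.0 = 425.0
```

Note the estimator is undefined as soon as one cell of domain d has an empty sample, which is exactly the failure mode motivating the models of sections 4.2 and 4.3.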
That is, \hat{T}_d^{(A)} is optimal under

Model A: For every class c and domain d,

    E(Y_i) = \mu_{cd} ,   i in cell (c, d)

    Cov(Y_i, Y_j) = \sigma^2_{cd}             i = j, i in cell (c, d)
                  = \rho_{cd}\sigma^2_{cd}    i \ne j, i and j in cell (c, d)
                  = 0                         i, j in different cells.

Under Model A the error variance of \hat{T}_d^{(A)} is

    Var(\hat{T}_d^{(A)} - T_d) = \sum_c \frac{N^2_{cd}}{n_{cd}}\Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - \rho_{cd})\sigma^2_{cd} .    (4)

An unbiased estimate of the error variance is obtained when, for each class c, (1 - \rho_{cd})\sigma^2_{cd} is replaced in (4) by its unbiased estimate \sum_{i \in s(c,d)}(y_i - \bar{y}_{s(c,d)})^2/(n_{cd} - 1).

The post-stratified estimator \hat{T}_d^{(A)} is unbiased under the minimal assumption of exchangeability within cells, and is optimal when additional assumptions are made concerning the variance-covariance matrix. There are two main reasons why we do not stop here. The first reason is simply that in many applications \hat{T}_d^{(A)} is not available because not all cells in domain d are sampled. (In fact we might find that domain d is not represented at all in the sample.) Then we must look for alternatives to the post-stratified estimator. The second reason for considering other estimators is that if we can use a more restrictive model than Model A, then sample units from other domains might be used to construct an estimator which is significantly more efficient than \hat{T}_d^{(A)}.

If we rewrite (3) as

    \hat{T}_d^{(A)} = \sum_c \big[ n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\bar{y}_{s(c,d)} \big]

and compare this with expression (2) for T_d, we see that the estimation error is

    \hat{T}_d^{(A)} - T_d = \sum_c (N_{cd} - n_{cd})\big( \bar{y}_{s(c,d)} - \bar{y}_{\bar{s}(c,d)} \big) .

Clearly the total for non-sample units in cell (c, d), \sum_{\bar{s}(c,d)} y_i = (N_{cd} - n_{cd})\bar{y}_{\bar{s}(c,d)}, is being estimated (or predicted) by the quantity (N_{cd} - n_{cd})\bar{y}_{s(c,d)}. That is, the average value over non-sample units, \bar{y}_{\bar{s}(c,d)}, is estimated by the average over sample units from the same cell, \bar{y}_{s(c,d)}. The post-stratified estimator is unbiased under Model A because E(\bar{Y}_{s(c,d)} - \bar{Y}_{\bar{s}(c,d)}) = \mu_{cd} - \mu_{cd} = 0. No assumptions relating one cell mean \mu_{cd} to any other are required.
If we have no sample units from cell (c, d) then we cannot estimate \mu_{cd} unless this parameter is related somehow to the parameters in cells which are sampled. This is the unfortunate and unavoidable fact which makes small-area estimation difficult. We must either draw an adequate sample from cell (c, d) or we must rely on whatever assumptions are required for estimating T_{cd} from observations on other cells. To the extent that each cell is unique, we will be frustrated in all efforts to provide estimates for small groups of cells where only small samples are available. To the extent that there are similarities and regularities among the cells, we might use observations from some cells to make inferences about others, and thus produce useful small-area estimates. These "similarities and regularities" are just the relationships which we express through models like those which follow.

4.2 Cell Means Determined By Class But Uncorrelated

A simple model under which unbiased estimation of T_d is possible even when some classes are not represented in the sample from domain d is the following. It treats each class as a distinct population in which the class-by-domain cells represent clusters. The model is

Model B: For every class c and domain d,

    E(Y_i) = \mu_c ,   i in class c

    Cov(Y_i, Y_j) = \sigma^2_{cd}             i = j, i in cell (c, d)
                  = \rho_{cd}\sigma^2_{cd}    i \ne j, i and j in cell (c, d)
                  = 0                         otherwise.

Model B would apply if the population vector y were generated by a two-stage process in which the class-c cell means \mu_{cd} are themselves realized values of uncorrelated random variables having mean \mu_c and variance \tau^2_c, and if, given \mu_{cd}, the Y_i in cell (c, d) are exchangeable with mean \mu_{cd}, variance \tilde{\sigma}^2_{cd}, and covariance \tilde{\rho}_{cd}\tilde{\sigma}^2_{cd}. Then \sigma^2_{cd} = \tau^2_c + \tilde{\sigma}^2_{cd} and \rho_{cd}\sigma^2_{cd} = \tau^2_c + \tilde{\rho}_{cd}\tilde{\sigma}^2_{cd}. Model B says, in effect, that there is a common expected value for all units in a given class, regardless of their domain.
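The two-stage mechanism behind Model B is easy to simulate, and the simulation confirms the induced moments \sigma^2_{cd} = \tau^2_c + \tilde{\sigma}^2_{cd} and \rho_{cd}\sigma^2_{cd} = \tau^2_c + \tilde{\rho}_{cd}\tilde{\sigma}^2_{cd}. All parameter values below are invented; normality is an extra convenience assumption not required by the model.

```python
import numpy as np

# Simulate Model B's two-stage generation: cell mean mu_cd ~ (mu_c, tau_c^2), then
# exchangeable Y's within the cell (built here from a shared plus a unit-specific part).
rng = np.random.default_rng(1)
mu_c, tau2, sig2_t, rho_t = 10.0, 4.0, 9.0, 0.2
reps, m = 200_000, 2                 # a pair per cell suffices to check the covariance

mu_cd = mu_c + np.sqrt(tau2) * rng.normal(size=reps)
common = np.sqrt(rho_t * sig2_t) * rng.normal(size=reps)
Y = mu_cd[:, None] + common[:, None] \
    + np.sqrt((1 - rho_t) * sig2_t) * rng.normal(size=(reps, m))

var_marg = Y[:, 0].var()                   # should be near tau2 + sig2_t = 13.0
cov_marg = np.cov(Y[:, 0], Y[:, 1])[0, 1]  # should be near tau2 + rho_t*sig2_t = 5.8
print(var_marg, cov_marg)
```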
It recognizes, however, through \rho_{cd}, that units within the same class-by-domain cell (c, d) are more alike than class-c units which do not belong to the same domain. It is under this sort of model that the synthetic estimator looks reasonable:

    \hat{T}_d^{(SY)} = \sum_c N_{cd}\hat{\mu}_c ,    (5)

where \hat{\mu}_c is some weighted average, \sum_j \ell_{cj}\bar{y}_{s(c,j)}, of sample means from all class-c cells sampled. Since each of these sample means has expected value \mu_c, and the \ell_{cj} sum to one, the synthetic estimator is unbiased under Model B. Schaible (1977) has pointed out that when (5) is rewritten as

    \hat{T}_d^{(SY)} = \sum_c n_{cd}\hat{\mu}_c + \sum_c (N_{cd} - n_{cd})\hat{\mu}_c

it becomes clear that the known sample sum, \sum_c n_{cd}\bar{y}_{s(c,d)}, is being estimated, in effect, by \sum_c n_{cd}\hat{\mu}_c. Replacing this estimate by the known true value would appear to be an obvious way of improving the synthetic estimator. The resulting "modified synthetic estimator,"

    \sum_c \big[ n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\hat{\mu}_c \big] ,

is also unbiased under Model B. Some comparisons of this estimator's variance with the synthetic estimator's variance are shown in Appendix I. Of course, in many potential applications the effect of the modification will be slight.

Clearly the post-stratified estimator remains unbiased under Model B. We will look at the variances of synthetic and post-stratified estimators under this model after finding the BLU estimator and its variance. We assume that every class c = 1, ..., C is represented in the sample, although the sample from class c might not contain any observations from domain d. That is, although n_{cd} may be zero, n_c = \sum_j n_{cj} > 0 for all c. We denote the variance of a sample mean from cell (c, j) by

    v_{cj} = Var(\bar{Y}_{s(c,j)}) = \rho_{cj}\sigma^2_{cj} + (1 - \rho_{cj})\sigma^2_{cj}/n_{cj} .

Then under Model B the BLU estimator for the cell (c, d) total is (Royall 1976)

    \hat{T}_{cd}^{(B)} = n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\big[ w_{cd}\bar{y}_{s(c,d)} + (1 - w_{cd})\hat{\mu}_c \big] ,    (6)
where w_{cd} = n_{cd}\rho_{cd}/(1 - \rho_{cd} + n_{cd}\rho_{cd}), and \hat{\mu}_c = \sum_j u_{cj}\bar{y}_{s(c,j)}, with u_{cj} defined for all sampled cells (c, j) by

    u_{cj} = v_{cj}^{-1} \Big/ \sum_{\ell} v_{c\ell}^{-1} .

The sum of the estimators for cell totals in domain d gives the BLU estimator of the domain total:

    \hat{T}_d^{(B)} = \sum_c \hat{T}_{cd}^{(B)} .    (7)

Optimality of this estimator under Model B can be verified using the Theorem in section 3.

Before examining the error variance of \hat{T}_d^{(B)} we consider a variation on the problem: Suppose Model B applies and we have, in addition to the sample, a supplementary estimate \tilde{\mu}_c of the class mean \mu_c, for c = 1, ..., C. Now consider linear estimators of the form

    \hat{T}_d = \sum_c \alpha_{cd}\bar{y}_{s(c,d)} + \sum_c \beta_{cd}\tilde{\mu}_c .

If \hat{T}_d is unbiased under Model B we must have

    E(\hat{T}_d - T_d) = \sum_c (\alpha_{cd} + \beta_{cd} - N_{cd})\mu_c = 0 ,

which implies \beta_{cd} = N_{cd} - \alpha_{cd}, so that we can write

    \hat{T}_d = \sum_c \big[ \alpha_{cd}\bar{y}_{s(c,d)} + (N_{cd} - \alpha_{cd})\tilde{\mu}_c \big] .

Reparameterizing, we let \theta_{cd} = (\alpha_{cd} - n_{cd})/(N_{cd} - n_{cd}) and see that unbiasedness implies that the estimate must have the form

    \hat{T}_d = \sum_c \Big\{ n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\big[ \theta_{cd}\bar{y}_{s(c,d)} + (1 - \theta_{cd})\tilde{\mu}_c \big] \Big\}

for some constants \theta_{cd}, c = 1, ..., C. If \tilde{\mu}_c is uncorrelated with y-values in classes other than c, then the optimal \theta's are, for c = 1, 2, ..., C,

    \theta^*_{cd} = Cov\big( \bar{y}_{s(c,d)} - \tilde{\mu}_c ,\ \bar{y}_{\bar{s}(c,d)} - \tilde{\mu}_c \big) \Big/ Var\big( \bar{y}_{s(c,d)} - \tilde{\mu}_c \big) ,

and with these weights Var(\hat{T}_d - T_d) equals

    \sum_c (N_{cd} - n_{cd})^2\, Var\big( \bar{y}_{\bar{s}(c,d)} - \tilde{\mu}_c \big)\Big[ 1 - \rho^2\big( \bar{y}_{s(c,d)} - \tilde{\mu}_c ,\ \bar{y}_{\bar{s}(c,d)} - \tilde{\mu}_c \big) \Big] ,    (8)

where \rho(a, b) denotes the correlation coefficient of a and b. In case Var(\tilde{\mu}_c) is zero (\mu_c is known) the optimal weights \theta^*_{cd} are the same weights w_{cd} of (6) which are optimal when \mu_c is estimated from the sample by \hat{\mu}_c. In this case the error variance (8) becomes (after some reorganization)

    \sum_c \frac{N^2_{cd}}{n_{cd}}\Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - \rho_{cd})\sigma^2_{cd}\Big[ 1 - \Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - w_{cd}) \Big] .    (9)

We have written (9) as though n_{cd} > 0 for all c.
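The Model B quantities entering (6) and (7) can be sketched numerically for a single class. The sample sizes, cell means, cell count, and the constant \rho and \sigma^2 below are all invented; in practice these variance parameters would be unknown and supplied from historical or sample data.

```python
import numpy as np

# Sketch of the Model B BLU cell estimator, equation (6), for one class c (toy inputs).
rho, sig2 = 0.3, 4.0
n_cj = np.array([3, 5, 2])           # sample sizes in the class-c cells; domain d is j = 0
ybar = np.array([7.0, 9.0, 8.0])     # cell sample means  ybar_{s(c,j)}
N_cd, d = 50, 0

v = rho * sig2 + (1 - rho) * sig2 / n_cj        # v_cj = Var of each cell sample mean
u = (1 / v) / np.sum(1 / v)                     # weights u_cj (inverse-variance)
mu_hat = u @ ybar                               # class-mean estimate mu_hat_c
w = n_cj[d] * rho / (1 - rho + n_cj[d] * rho)   # w_cd
T_B_cd = n_cj[d] * ybar[d] + (N_cd - n_cj[d]) * (w * ybar[d] + (1 - w) * mu_hat)
print(w, mu_hat, T_B_cd)
```

Increasing either n_{cd} or \rho drives w_{cd} toward one, i.e., toward trusting the cell's own sample mean rather than the pooled class mean.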
If in fact n_{cd} = 0 then we take w_{cd}/n_{cd} = \rho_{cd}/(1 - \rho_{cd}), and the summand in (9) is

    N^2_{cd}\big[ \rho_{cd}\sigma^2_{cd} + (1 - \rho_{cd})\sigma^2_{cd}/N_{cd} \big] .

Now, in the absence of supplementary estimates of the \mu_c, \hat{T}_d^{(B)} given in (7) is the BLU estimator under Model B, and its error variance, Var(\hat{T}_d^{(B)} - T_d), can be written

    \sum_c \frac{N^2_{cd}}{n_{cd}}\Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - \rho_{cd})\sigma^2_{cd}\Big[ 1 - \Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - w_{cd})(1 - u_{cd}) \Big] .

The summand with the final bracket replaced by one is the variance (4) of the post-stratified estimator; setting u_{cd} = 0 in the bracket gives the variance (9) attainable if the \mu_c were known. For estimating the cell (c, d) total the relative efficiency of the post-stratified estimator to the BLU estimator \hat{T}_{cd}^{(B)} is

    1 - \Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - w_{cd})(1 - u_{cd}) ,

which is at least as large as the maximum of the three quantities n_{cd}/N_{cd}, w_{cd}, and u_{cd}. If the \sigma^2_{cd}'s are constant and the \rho's all equal \rho, this relative efficiency lies between 1 (the efficiency when \rho = 1) and

    \frac{n_{cd}}{N_{cd}} + \frac{n_{cd}}{n_c} - \frac{n_{cd}}{N_{cd}}\cdot\frac{n_{cd}}{n_c}   (the efficiency when \rho = 0).

The optimal estimator \hat{T}_d^{(B)} depends on the \rho's and the \sigma^2's, which are generally unknown. However, even when incorrect values of these parameters are used, the estimator is unbiased under Model B. This suggests that estimators of the form (7) obtained using simple variance structures might prove useful under a fairly wide range of conditions. For example, if all \rho's are zero and the \sigma^2's are constant, \hat{T}_d^{(B)} is simply \hat{T}_d^{(S)} = \sum_c \hat{T}_{cd}^{(S)}, where

    \hat{T}_{cd}^{(S)} = n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\sum_j n_{cj}\bar{y}_{s(c,j)}/n_c .

The estimator \hat{T}_d^{(S)} is the modified synthetic estimator studied by Schaible (1977). Its error variance under Model B is

    Var(\hat{T}_d^{(S)} - T_d) = \sum_c \frac{N^2_{cd}}{n_{cd}}\Big(1 - \frac{n_{cd}}{N_{cd}}\Big)(1 - \rho_{cd})\sigma^2_{cd}\Big[ 1 - \Big(1 - \frac{n_{cd}}{N_{cd}}\Big) \Big]
        + \sum_c (N_{cd} - n_{cd})^2\Big[ \rho_{cd}\sigma^2_{cd}\Big(1 - 2\frac{n_{cd}}{n_c}\Big) + \frac{1}{n_c^2}\sum_j n^2_{cj} v_{cj} \Big] .

More generally, if the \sigma^2's are constant the estimator \hat{T}_d^{(B)} does not depend on the value of that constant.
This is clear from (6), since the \sigma^2's enter that expression only through the weights u_{cj}, and when these variances are constant,

    u_{cj} = \frac{n_{cj}/[1 + \rho(n_{cj} - 1)]}{\sum_{\ell} n_{c\ell}/[1 + \rho(n_{c\ell} - 1)]} .

If the \rho's are also set equal to a constant \rho, a family of estimators is generated. The estimator \hat{T}_d^{(S)} is obtained when \rho = 0, \hat{T}_d^{(A)} is obtained when \rho = 1, and other members of this family, with the value of \rho estimated from historical or sample data, are potentially useful. We denote the estimator obtained for a given value of \rho by \hat{T}_d^{(\rho)}. The estimator \hat{T}_d^{(\rho)} with 0 < \rho < 1 represents a compromise between the modified synthetic estimator \hat{T}_d^{(S)} and the post-stratified estimator \hat{T}_d^{(A)}.

Another way of striking a compromise between these two is to take a weighted average for each cell (c, d) (cf. expression (6) for \hat{T}_{cd}^{(B)}):

    \hat{T}_{cd}^{(W)} = w_{cd}\hat{T}_{cd}^{(A)} + (1 - w_{cd})\hat{T}_{cd}^{(S)}

and to estimate T_d by the sum \sum_c \hat{T}_{cd}^{(W)}. Weighted averages of this sort are often referred to as composite estimators. (See, for example, Schaible, Brock and Schnack 1973.) In Appendix II we give some simple conditions under which a composite estimator has smaller error variance than either of its two components, and we show that these conditions are satisfied for a relatively wide range of weights. Under Model B with constant \sigma^2_{cj}'s and \rho's, say \rho_{cj} = \rho, the optimal weights are given by

    w^*_{cd} = (1 + r_{cd})^{-1} ,    (10)

where

    r_{cd} = \frac{1 - \rho}{\rho\,n_{cd}} \cdot \frac{1 - n_{cd}/n_c}{\sum_{j \ne d}(n_{cj}/n_c)^2 + (1 - n_{cd}/n_c)^2} .

For a given value of \rho, the composite estimator \hat{T}_d^{(W)} which uses the weights in (10) is closely related to \hat{T}_d^{(\rho)}. In both cases, increasing either n_{cd} or \rho gives relatively more weight to the cell sample mean \bar{y}_{s(c,d)} in estimating the total T_{cd}. When \rho = 0, \hat{T}_d^{(W)} = \hat{T}_d^{(\rho)} = \hat{T}_d^{(S)}, while when \rho = 1, \hat{T}_d^{(W)} = \hat{T}_d^{(\rho)} = \hat{T}_d^{(A)}.
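The behavior of the composite weight (10) is easy to inspect numerically. The sketch below (sample sizes invented) checks the two limits just stated: as \rho approaches 1 the weight approaches the post-stratified component, and the weight increases with \rho.

```python
import numpy as np

# Composite weight w*_cd = 1/(1 + r_cd) of equation (10), constant rho and sigma^2
# (toy sample sizes; d indexes the domain's cell within the class).
def w_star(rho, n_cj, d):
    n_c = n_cj.sum()
    f = n_cj / n_c
    r = ((1 - rho) / (rho * n_cj[d])) * (1 - f[d]) \
        / (np.sum(np.delete(f, d) ** 2) + (1 - f[d]) ** 2)
    return 1.0 / (1.0 + r)

n_cj = np.array([4, 6, 10])
print(round(w_star(0.999, n_cj, 0), 3),        # near 1: essentially post-stratified
      w_star(0.5, n_cj, 0) < w_star(0.9, n_cj, 0))   # weight increases with rho
```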
For intermediate values of \rho the main difference between \hat{T}_d^{(W)} and \hat{T}_d^{(\rho)} is in their respective estimates of \mu_c,

    \sum_j \frac{n_{cj}}{n_c}\,\bar{y}_{s(c,j)}

and

    \sum_j \big[ \rho + (1 - \rho)/n_{cj} \big]^{-1}\bar{y}_{s(c,j)} \Big/ \sum_{\ell} \big[ \rho + (1 - \rho)/n_{c\ell} \big]^{-1} .

The estimate of \mu_c used in \hat{T}_d^{(W)} gives sample mean \bar{y}_{s(c,j)} a weight proportional to n_{cj}, while the estimate used in \hat{T}_d^{(\rho)} gives the sample means more nearly equal weights. For this reason \hat{T}_d^{(\rho)} appears to provide better protection from domination by cells with unusually large sample sizes.

4.3 Cell Means Determined By Class And Correlated Within Domains

Model B allows, through the parameter \rho_{cd}, for possibly important differences between domains within one class c. That is, for i and j both belonging to class c, the expected value of (Y_i - Y_j)^2 might be much smaller when i and j belong to the same domain than when they belong to different ones. A weakness of this model is that it does not express the possibility that the differences between domains might be fairly consistent from class to class. For example, when T_{cd} exceeds its expected value N_{cd}\mu_c, the other cell totals in the same domain, T_{c'd}, might tend to exceed their expected values. There are various ways in which we can allow for this possibility. One is simply to modify Model A, setting \mu_{cd} = \mu_c + \mu_d, so that there is an additive "domain effect." Another is to treat the \mu_{cd} as realized values of random variables (so that Model A is a conditional model, given the \mu_{cd}'s); the joint distribution of the \mu_{cd}'s is such that \mu_{cd} and \mu_{c'd'} are positively correlated if either c = c' or d = d'. This leads to a model in which all the Y_i's have the same expected value (the a priori expected value of the \mu_{cd}'s) but in which Y_i and Y_j are positively correlated if i and j belong either to the same class or to the same domain. A third possible alternative generalizes Model B, treating \mu_{cd} as a random variable with expected value \mu_c.
However it allows \mu_{cd} and \mu_{c'd'} to be positively correlated whenever d = d'. This model specifies fixed effects for classes, but allows class means to be correlated within a domain. All of these models should be investigated, but for now we consider only the third:

Model C: For every class c and domain d,

    E(Y_i) = \mu_c ,   i belongs to class c

    Cov(Y_i, Y_j) = \sigma^2_{cd}             i = j, i in cell (c, d)
                  = \rho_{cd}\sigma^2_{cd}    i \ne j, i and j in cell (c, d)
                  = \tau_d                    i in cell (c, d), j in cell (c', d), c \ne c'
                  = 0                         i, j in different domains.

Under this model the cell averages satisfy

    Cov(\bar{Y}_{s(c,d)}, \bar{Y}_{s(c',d')}) = \rho_{cd}\sigma^2_{cd} + (1 - \rho_{cd})\sigma^2_{cd}/n_{cd}    c = c', d = d'
                                             = \tau_d                                                         c \ne c', d = d'
                                             = 0                                                              d \ne d'

    Cov(\bar{Y}_{s(c,d)}, \bar{Y}_{\bar{s}(c',d')}) = \rho_{cd}\sigma^2_{cd}    c = c', d = d'
                                                   = \tau_d                     c \ne c', d = d'
                                                   = 0                          d \ne d'.

Note that Var(\bar{Y}_{s(c,d)}), which we denote by v_{cd}, is the same as under Models A and B.

A thorough analysis of Model C cannot be undertaken here. We will content ourselves with examining (i) the effects of the correlations introduced in Model C on the estimators already considered and (ii) the optimal estimator \hat{T}_d^{(C)} obtained for a computationally simple special case of Model C.

Note that Models B and C differ only in their covariance structure. For this reason linear estimators such as \hat{T}_d^{(A)}, \hat{T}_d^{(B)}, and \hat{T}_d^{(W)}, which are unbiased under B, remain unbiased under C. We now consider the effect of the covariances \tau_d on the variances of these estimators. All three estimators have the general form

    \sum_c \Big[ n_{cd}\bar{y}_{s(c,d)} + (N_{cd} - n_{cd})\sum_j \ell_{cj}\bar{y}_{s(c,j)} \Big]

for some constants 0 \le \ell_{cj} \le 1 which sum to one, and for which \ell_{cj} = 0 if n_{cj} = 0. For any estimator of this form the error variance under Model C is

    Var(\hat{T}_d - T_d) = Var\Big[ \sum_c (N_{cd} - n_{cd})\Big( \sum_j \ell_{cj}\bar{y}_{s(c,j)} - \bar{y}_{\bar{s}(c,d)} \Big) \Big]

        = Var\Big[ \sum_c (N_{cd} - n_{cd})\big( \bar{y}_{s(c,d)} - \bar{y}_{\bar{s}(c,d)} \big) \Big]
          + \sum_c (N_{cd} - n_{cd})^2\Big[ \sum_j \ell^2_{cj}v_{cj} - v_{cd} + 2(1 - \ell_{cd})\rho_{cd}\sigma^2_{cd} \Big]
          + \sum_{c \ne c'} (N_{cd} - n_{cd})(N_{c'd} - n_{c'd})\Big[ \sum_j \ell_{cj}\ell_{c'j}\tau_j + (1 - \ell_{cd} - \ell_{c'd})\tau_d \Big] .    (11)
Now the summand in the third term is

    (N_{cd} - n_{cd})(N_{c'd} - n_{c'd})\Big[ \sum_{j \ne d} \ell_{cj}\ell_{c'j}\tau_j + (1 - \ell_{cd})(1 - \ell_{c'd})\tau_d \Big] ,

which is non-negative if the \tau's are non-negative and the \ell's do not exceed one. Thus the positive covariance terms \{\tau_d\} increase the variance of the estimators. An exception is the post-stratified estimator, which is obtained when, for every c, \ell_{cd} = 1 and \ell_{cj} = 0 for all j \ne d. For the post-stratified estimator the third term in (11) vanishes.

The BLU estimator \hat{T}_d^{(C)} under Model C depends on the \rho's, \sigma^2's, and \tau's, but as before, use of incorrect values in \hat{T}_d^{(C)} does not introduce a bias under the model. If the values used are approximately correct, the estimator will be approximately optimal. By setting these parameters equal to constants \rho, \sigma^2, and \tau we generate a family of estimators. This proves to be a two-parameter family in which the estimator depends only on \rho and the ratio \tau/\sigma^2. Using historical or sample data to estimate these two quantities, we can choose a member of this family which might compare favorably with the estimator \hat{T}_d^{(\rho)} obtained using the same value of \rho but taking \tau = 0.

Because of the exchangeability within cells, we can find \hat{T}_d^{(C)} by applying the Theorem in section 3 to the condensed problem in which Y_I is the vector of all cell sample means and Y_II is the vector of means of non-sample units in domain d cells. Even with this simplification, and restricting the \rho's, \sigma^2's, and \tau's to be constants, the formula for \hat{T}_d^{(C)} needs more than a casual inspection for its appreciation. We will not undertake the necessary analysis here but will look at the very special case in which all of the cell sample sizes n_{cj} equal the same constant, m. This can only suggest the direction in which use of Model C will carry us away from the estimators appropriate under Model B. We denote by \bar{y}_{c\cdot} the average of sample means from cells in class c, by \bar{y}_{\cdot d} the average of sample means from domain d,
    \bar{y}_{\cdot d} = \frac{1}{C}\sum_{c=1}^{C}\bar{y}_{s(c,d)} ,

and by \bar{y} the average of all the cell sample means. Then the BLU estimator under Model C, for constant n's, \rho's, \sigma^2's, and \tau's, is

    \hat{T}_d^{(C)} = \sum_c \Big\{ m\,\bar{y}_{s(c,d)} + (N_{cd} - m)\Big[ \bar{y}_{c\cdot} + \delta\big( \bar{y}_{s(c,d)} - \bar{y}_{c\cdot} \big) + (1 - \delta)\,\alpha\big( \bar{y}_{\cdot d} - \bar{y} \big) \Big] \Big\} ,

where

    \delta = \frac{\rho\sigma^2 - \tau}{\rho\sigma^2 - \tau + (1 - \rho)\sigma^2/m}

and

    \alpha = \frac{C\tau}{C\tau + \rho\sigma^2 - \tau + (1 - \rho)\sigma^2/m} .

The final term in this estimator can be interpreted as a correction for the "domain effect" estimated by \bar{y}_{\cdot d} - \bar{y}. This effect is a result of the correlation among the class-cell means within each domain, and the correction term vanishes as this correlation vanishes (\tau \to 0). The estimator can also be written

    \hat{T}_d^{(C)} = \sum_c \Big\{ N_{cd}\,\bar{y}_{s(c,d)} - (N_{cd} - m)(1 - \delta)\Big[ \big( \bar{y}_{s(c,d)} - \bar{y}_{c\cdot} \big) - \alpha\big( \bar{y}_{\cdot d} - \bar{y} \big) \Big] \Big\} .

As m becomes large the weight 1 - \delta approaches zero and the estimator is approximately the post-stratified estimator.

5. DISCUSSION

We have focused on simple cross-classification models as tools for studying the synthetic estimator and some alternatives. We have assumed that the numbers of units falling into the different classes within our domain of interest are known. Often much more is known, and as Gonzalez and Waksberg (1973) have suggested, this additional local-area information might be used to improve on synthetic estimates. Here again, prediction models can be used to express the relationships among all the variables, and to suggest and compare alternative estimators.

A very important use of prediction models, which we have not been able to treat here, is in suggesting and analyzing variance estimators (Royall and Cumberland 1977; Royall and Eberhardt 1975). The variance estimation theory based on prediction models, in contrast to the theory based on random choice of sample units, pertains to the actual sample used in estimation, not to the estimator's average performance over other samples which might have been selected, or to average properties over different domains. The calculations are all made conditionally, given the sample s which was actually observed.
A workshop of this sort, focused on a specific technique, can spur development, but it can also be dangerous. The danger is that, from hearing many people speak many words about synthetic estimation, we become comfortable with the technique. The idea and the jargon become familiar, and it is easy to accept that "since all these people are studying synthetic estimation, it must be okay." We must not allow familiarity to dull our healthy skepticism. There is reason for some optimism, but it must be guarded optimism.

One of the benefits of the prediction approach is that by holding s fixed, it forces us to examine carefully those relationships between variables which in fact enable us to use observations on some units to make inferences about others. When these relationships are weak and uncertain, then so are our inferences. There is no "repeated sampling" distribution to use to gloss over this fact. If most of our sample data from North Carolina come from one region, and if we do not know much about the relationships among the variables, then we cannot make reliable estimates for the state. This is true regardless of whether or not a repetition of our sampling plan might provide a larger sample, or a better-distributed one, from this state. Using data from South Carolina and Virginia in estimating the North Carolina total entails assumptions that certain relationships among variables are the same in North Carolina as in these other places; using the prediction approach forces us to make these assumptions explicit and in doing so to realize just how essentially difficult small area estimation problems are.

APPENDIX I: VARIANCES OF SYNTHETIC AND MODIFIED SYNTHETIC ESTIMATORS

For a given synthetic estimator (5) the corresponding modified synthetic estimator will have smaller variance under Model B if the differences \hat{\mu}_c - \bar{y}_{s(c,d)} and \hat{\mu}_c - \bar{y}_{\bar{s}(c,d)} are positively correlated for all classes c.
This is because

    Var(\hat{T}_d^{(SY)} - T_d) = Var\Big[ \sum_c (N_{cd} - n_{cd})\big( \hat{\mu}_c - \bar{y}_{\bar{s}(c,d)} \big) \Big] + Var\Big[ \sum_c n_{cd}\big( \hat{\mu}_c - \bar{y}_{s(c,d)} \big) \Big]
        + 2\sum_c n_{cd}(N_{cd} - n_{cd})\,Cov\big( \hat{\mu}_c - \bar{y}_{s(c,d)} ,\ \hat{\mu}_c - \bar{y}_{\bar{s}(c,d)} \big) ,

and the first term on the right-hand side is the error variance of the modified synthetic estimator. For the particular case in which \hat{\mu}_c = \sum_j n_{cj}\bar{y}_{s(c,j)}/n_c, the modified estimator \hat{T}_d^{(S)} has smaller error variance under a wide range of conditions. For example, if within class c the \sigma^2's and the \rho's are constants, say \sigma^2_c and \rho_c, then

    Cov\big( \hat{\mu}_c - \bar{y}_{s(c,d)} ,\ \hat{\mu}_c - \bar{y}_{\bar{s}(c,d)} \big) = \rho_c\sigma^2_c\Big[ \sum_j (n_{cj}/n_c)^2 - 2\,n_{cd}/n_c + 1 \Big] \ge 0

and the modified estimator's error variance is smaller.

APPENDIX II: COMPOSITE ESTIMATORS

Given two unbiased predictors, X and Y, of a random variable Z, we consider composite estimators (predictors) of the form \alpha X + (1 - \alpha)Y, where 0 \le \alpha \le 1, and ask: (i) What value of \alpha is optimal? (ii) For what range of values of \alpha is the composite estimator better than either X or Y? We assume only that both X and Y are unbiased predictors of Z: E(X - Z) = E(Y - Z) = 0. Let Var(X) = \sigma^2_X, Cov(X, Z) = \sigma_{XZ}, etc. Then the error variance of the composite estimator, Var(\alpha X + (1 - \alpha)Y - Z), is easily shown to be minimized when \alpha is

    \alpha^* = \frac{\sigma^2_Y + \sigma_{XZ} - \sigma_{YZ} - \sigma_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_{XY}} = \frac{Cov(X - Y,\ Z - Y)}{Var(X - Y)} .

In case X, Y, and Z are all uncorrelated, this is just the usual recipe: the weights for X and Y should be inversely proportional to their variances, \sigma^2_X and \sigma^2_Y.

To answer the second question we ask what values of \alpha satisfy the inequalities

    Var(\alpha X + (1 - \alpha)Y - Z) \le Var(Y - Z)   and   Var(\alpha X + (1 - \alpha)Y - Z) \le Var(X - Z) .

The first holds if and only if \alpha \le 2\alpha^*, and the second if and only if \alpha \ge 2\alpha^* - 1. Thus if the optimal weight \alpha^* is less than 1/2, the composite estimator is better than either X or Y alone if the weight assigned to X is less than twice the optimal weight \alpha^*. If \alpha^* > 1/2, then the composite estimator is better if the weight assigned to Y, (1 - \alpha), is less than twice the optimal weight, 1 - \alpha^*.
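The closed form \alpha^* = Cov(X - Y, Z - Y)/Var(X - Y) can be checked against a brute-force grid search on simulated data. The joint distribution below is invented for illustration; the two predictors are unbiased by construction.

```python
import numpy as np

# Numerical check of the Appendix II optimal weight (toy simulated predictors).
rng = np.random.default_rng(2)
Z = rng.normal(size=200_000)
X = Z + rng.normal(scale=1.5, size=Z.size)   # unbiased predictor 1
Y = Z + rng.normal(scale=1.0, size=Z.size)   # unbiased predictor 2

alpha_star = np.cov(X - Y, Z - Y)[0, 1] / np.var(X - Y)   # closed form
grid = np.linspace(0, 1, 401)
err = [np.var(a * X + (1 - a) * Y - Z) for a in grid]     # empirical error variance
alpha_grid = grid[int(np.argmin(err))]
print(abs(alpha_star - alpha_grid) < 0.01)   # True
```

Here the errors of X, Y, and Z's "prediction targets" are uncorrelated, so \alpha^* reduces to the inverse-variance recipe, 1/(1.5^2 + 1) of the combined error variances, about 0.31.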
The composite estimator is better so long as 2\alpha^* - 1 < \alpha < 2\alpha^*.

If the sample size is reasonably large (> 100), the bias is negligible and the expectation of \bar{y}_g is approximately equivalent to \bar{Y}_g:

    E(\bar{y}_g) \approx \bar{Y}_g ,   g = 1, 2, ..., K .    (2.3.5)

Returning to the \ell-th local area, we focus attention on the "target base unit" alignment in order to weight the stratum estimators \bar{y}_g appropriately by the proportion of elements in the base units so classified. Therefore, the estimator of the criterion variable for the \ell-th local area takes the following form:

    \bar{y}_\ell = \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,\bar{y}_g ,    (2.3.6)

such that

    E(\bar{y}_\ell) = \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,E(\bar{y}_g) \approx \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,\bar{Y}_g    (2.3.7)

when n is large. Often the sizes M_{\ell g} and M_\ell are only known approximately. When this occurs, the respective estimators of the stratum means are weighted by the ratio of available estimates \hat{M}_{\ell g} and \hat{M}_\ell, or by the cruder ratio N_{\ell g}/N_\ell. Due to the nature of its derivation, the local area estimator \bar{y}_\ell of \bar{Y}_\ell is biased. The observed value of the criterion variable mean per element is

    \bar{Y}_\ell = \frac{1}{M_\ell}\sum_i M_i\bar{Y}_i ,    (2.3.8)

summed across only those base units in the \ell-th local area. The bias,

    B = E(\bar{y}_\ell) - \bar{Y}_\ell ,    (2.3.9)

can be approximated by

    B \approx \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,\bar{Y}_g - \bar{Y}_\ell .    (2.3.10)

Similarly, to express the local area estimator in terms of a proportion, y_{ij} is redefined so that y_{ij} = 1 when the j-th element in the i-th base unit has the characteristic of interest, and y_{ij} = 0 otherwise, so that

    Y_i = \sum_j y_{ij}    (2.3.11)

is the total number of elements in the i-th base unit with the characteristic of interest.

2.4. An Expression for the Mean Squared Error of the Local Area Estimator

It has already been observed that the local area estimator \bar{y}_\ell is biased. Consequently, the mean squared error takes the form

    E(\bar{y}_\ell - \bar{Y}_\ell)^2 = E[\bar{y}_\ell - E(\bar{y}_\ell)]^2 + [E(\bar{y}_\ell) - \bar{Y}_\ell]^2 = Variance(\bar{y}_\ell) + (Bias)^2 .    (2.4.1)

Since

    E(\bar{y}_\ell) \approx \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,\bar{Y}_g ,
where \bar{y}_\ell is a linear combination of the ratio estimators \bar{y}_g, g = 1, 2, ..., K (each with negligible bias), the variance of \bar{y}_\ell can be approximated by

    Var(\bar{y}_\ell) \approx \sum_{g=1}^{K} \Big(\frac{M_{\ell g}}{M_\ell}\Big)^2 Var(\bar{y}_g) + \sum_{g \ne g'} \frac{M_{\ell g}}{M_\ell}\cdot\frac{M_{\ell g'}}{M_\ell}\,Cov(\bar{y}_g, \bar{y}_{g'}) .    (2.4.2)

If we also assume

    \bar{y}_g = \sum_{i=1}^{n} I_{gi}M_i\bar{y}_i \Big/ \sum_{i=1}^{n} I_{gi}M_i \approx \bar{Y}_g + \frac{1}{n_g\bar{M}_g}\sum_{i=1}^{n} I_{gi}M_i(\bar{y}_i - \bar{Y}_g) ,    (2.4.3)

then

    Var(\bar{y}_g) \approx \frac{1}{n\bar{M}_g^2}\Big[ \frac{N - n}{N}\cdot\frac{1}{N - 1}\sum_i I_{gi}M_i^2(\bar{Y}_i - \bar{Y}_g)^2 + \frac{1}{N}\sum_i I_{gi}\frac{M_i^2}{m_i}\Big(1 - \frac{m_i}{M_i}\Big)\frac{1}{M_i - 1}\sum_j (y_{ij} - \bar{Y}_i)^2 \Big] ,    (2.4.4)

where m_i is the subsample size in base unit i. This is the standard form of the approximate variance of a ratio estimator for a two-stage sampling design where the base units have equal probabilities of selection. Here, the first term represents the between-base-unit component of the variance, whereas the second denotes the within-base-unit contribution. Since our two-stage sampling design requires the independent selection of subsamples from different sample base units, and the respective strata estimators are defined in terms of the indicator variables I_{gi}, it can be shown that Cov(\bar{y}_g, \bar{y}_{g'}) \approx 0. Hence, the mean squared error of our small area estimator can be expressed as:

    MSE(\bar{y}_\ell) \approx \sum_{g=1}^{K} \Big(\frac{M_{\ell g}}{M_\ell}\Big)^2 Var(\bar{y}_g) + (Bias)^2 .    (2.4.5)

3. A REFORMULATION OF THE KALSBEEK MODEL; SOME ANALYTIC AND EMPIRICAL INVESTIGATIONS

3.1. Introduction

An analytical expression for the mean squared error of our local area estimator has been derived in the previous chapter. Yet the inherent bias of the model does not allow for tests of its precision unless another unbiased estimate, or the true value of the criterion variable, is obtained at the local level. In practice this is usually unavailable, and that is the reason alternative strategies must be considered. In order to determine the accuracy of the small area estimator and to allow for comparisons of precision with respect to other strategies, we attempt to express the relationship between criterion and symptomatic variables by means of a probabilistic model.
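The bias approximation (2.3.10) and the MSE decomposition (2.4.5) can be illustrated with a tiny numerical example. All the stratum means, shares, and variances below are invented, and the stratum variances are simply taken as given rather than computed from (2.4.4).

```python
import numpy as np

# Toy illustration of bias (2.3.10) and MSE (2.4.5) for one local area l.
M_lg = np.array([200.0, 300.0, 500.0])      # elements of area l falling in strata g
Ybar_g = np.array([0.10, 0.20, 0.40])       # true stratum means of the criterion variable
var_ybar_g = np.array([4e-4, 3e-4, 2e-4])   # Var(ybar_g), assumed supplied by (2.4.4)
Ybar_l = 0.30                               # true local-area mean

share = M_lg / M_lg.sum()                   # weights M_lg / M_l
bias = share @ Ybar_g - Ybar_l              # (2.3.10): synthetic-weighted mean minus truth
mse = share @ (share * var_ybar_g) + bias**2   # (2.4.5): sum of share^2 * Var plus bias^2
print(bias, mse)   # -0.02 and 0.000493
```

The example makes the familiar trade-off visible: here the squared bias (4e-4) dominates the variance contribution (9.3e-5), which is the typical situation for synthetic-type estimators in atypical local areas.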
The model enables one to determine the true value of the criterion variable for target areas of interest, to approximate the bias and mean squared error of the respective local area estimators, and to provide a framework for comparisons.

3.2 Determination of Stratum Boundaries

As noted, our small area estimator of the criterion variable for the \ell-th local area using the Kalsbeek model takes the form:

    \bar{y}_\ell = \sum_{g=1}^{K} \frac{M_{\ell g}}{M_\ell}\,\bar{y}_g .    (3.2.1)

To avoid unnecessary complications which would occur with the multistage sampling design, we consider the single-stage cluster sample design, adding the restriction that all target base units consist of the same number of elements. As described in the first chapter, strata (groups) are to be formed which are optimally homogeneous within, while simultaneously dissimilar between themselves. When the underlying relationship between the criterion and symptomatic variables is unknown, the strategy that has been entertained consists of forming groups by minimizing their within sum of squares while maximizing their between sum of squares using only the sample data. However, when a certain probabilistic model is entertained, one could determine those boundaries of the predictor variables which minimize the mean squared error of \bar{y}_\ell. Since each local area estimator usually consists of a different weighted linear combination of the respective stratum estimators, the boundaries which are optimal for small area \ell would not necessarily be so for small area \ell'. Consequently, another reasonable strategy would be to determine the optimal strata boundaries on the symptomatic variables which minimize the mean squared error of the criterion variable estimator for the over-all population. This estimator is actually the weighted average of all small area estimators, weighted by the respective proportion of elements belonging to the particular small area.
As before,

    \bar{y} = \sum_\ell \frac{M_\ell}{M}\,\bar{y}_\ell ,    (3.2.2)

where M_i = \bar{M} for i = 1, 2, ..., N and, because we are now considering a single-stage cluster design,

    \bar{y}_g = \sum_{i=1}^{n} I_{gi}\bar{Y}_i \big/ n_g .    (3.2.3)

Consequently,

    \bar{y} = \sum_\ell \sum_{g=1}^{K} \frac{M_{\ell g}}{M}\,\bar{y}_g = \sum_{g=1}^{K} \frac{M_{\cdot g}}{M}\,\bar{y}_g = \sum_{g=1}^{K} \frac{N_{\cdot g}}{N}\,\bar{y}_g ,    (3.2.4)

since M_{\cdot g} = N_{\cdot g}\bar{M} and M = N\bar{M}. We also note that this linear combination of local area estimators is an approximately unbiased estimator of the criterion variable for the overall population. Since the estimator is approximately unbiased, our mean squared error term is actually the variance of the overall population estimator. We must determine the boundaries on the symptomatic variables which will minimize Var(\bar{y}). Here we are faced with the additional problem of working with a linear combination of poststratified estimators. For any fixed sample size n out of N base units, the n_g, g = 1, 2, ..., K (K fixed), are random, subject only to the restriction \sum_{g=1}^{K} n_g = n. Because the variance of a poststratified estimator is most similar to that of a stratified estimator with proportional allocation, it would be reasonable to use those boundaries on the symptomatic variables which are optimal here. The strategy is most appropriate when n/K is reasonably large, since the poststratified estimator's variance approaches that of the stratified estimator's variance (considering proportional allocation) when this occurs.

Dalenius (1957) and Singh and Sukhatme (1972) have considered the case of minimum variance stratification when a single auxiliary variable was used as the stratification variable. They showed that for a particular allocation (i.e., Neyman, proportional) the boundaries on the auxiliary variable must satisfy a set of minimal equations.
Since these equations are ill adapted to practical computation, a quick approximate method, known as the cum √f rule, was developed by Dalenius and Hodges (1959) and has been shown to be quite efficient. Thomsen (1976) has found that by taking equal intervals using the cum ∛f rule, approximately optimum stratum boundaries are determined which compare favorably with those derived by the cum √f rule.

Often, the stratification scheme will depend on more than one variable. Here as well, several methods have been developed which consider the problem of determining those stratum boundaries which are optimal in the sense of minimum variance stratification. Anderson (1976) suggests a method which uses the cum √f rule (or cum ∛f rule) along each marginal stratifier such that the product of the numbers of strata for the individual variables equals the total number of strata: Π_{i=1}^{p} K_i = K. The method is not optimal, but is practical. It has been shown to yield estimators that are more precise than when only one strong stratifier is used. Another practical method, suggested by Kalsbeek (1973), allows for the determination of boundaries at successive stages of stratification. Approximately optimum boundaries are obtained for the most significant stratifier, then for the second, conditioned on the stratum means of the first, and so forth until all the stratification variables have been included. In the research that follows, both the methods advanced by Anderson and Kalsbeek are considered.

3.3 A Reformulation of the Kalsbeek Model

We wish to consider the case of sampling from populations with specified continuous multivariate distributions. To use such an approach requires rather strong underlying assumptions regarding the nature of relationships between the criterion and symptomatic variables. To be consistent in getting the finite population results to conform to the new scheme, we disregard the finite population correction factors.
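The Dalenius-Hodges cum √f rule mentioned above can be sketched computationally: tabulate the stratifier into bins, accumulate the square roots of the bin frequencies, and cut that cumulative scale into K equal pieces. The function name and binning choices here are ours, not the monograph's.

```python
import numpy as np

# Sketch of the Dalenius-Hodges cum sqrt(f) rule: bin the stratification
# variable, accumulate sqrt(frequency), and cut the cumulative scale into
# K equal intervals to obtain approximately optimal stratum boundaries.
def cum_sqrt_f_boundaries(x, K, n_bins=50):
    counts, edges = np.histogram(x, bins=n_bins)
    cum = np.cumsum(np.sqrt(counts))
    targets = cum[-1] * np.arange(1, K) / K      # K-1 interior cut points
    idx = np.searchsorted(cum, targets)
    return edges[idx + 1]                        # upper edges of the cut bins

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(cum_sqrt_f_boundaries(x, K=4))  # three roughly symmetric cuts around 0
```

For a symmetric distribution the middle cut falls near zero, and the outer cuts are roughly mirror images, which is what the approximately optimal boundaries tabulated later in this chapter look like.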
Since we have initially considered a single stage cluster sampling design with the restriction that all target base units consist of the same number of elements, our small area estimator is expressed as

    ŷ_ℓ = ȳ_ℓ    (3.3.1)

        = Σ_{g=1}^{K} (N_{ℓg}/N_{ℓ·}) ȳ_g ,    (3.3.2)

where

    N_{ℓg} = number of target base units falling in the gth stratum for the ℓth local area,
    N_{ℓ·} = total number of target base units in the ℓth local area,

and

    ȳ_g = Σ_{i=1}^{n_g} M ȳ_{gi} / (n_g M) = (1/n_g) Σ_{i=1}^{n_g} ȳ_{gi} ,    (3.3.3)

where n_g (the number of sample base units falling in the gth stratum) is random. Consequently,

    E(ŷ_ℓ) = Σ_{g=1}^{K} (N_{ℓg}/N_{ℓ·}) E(ȳ_g) ,    (3.3.4)

where

    E(ȳ_g) = E[E(ȳ_g | n_g fixed)]

and, if we assume n_g ≠ 0 for g = 1, 2, ..., K,

    E(ȳ_g) = E(ȳ_{gi} | gth stratum) .    (3.3.5)

Similarly, we have shown

    Var(ŷ_ℓ) = Σ_{g=1}^{K} (N_{ℓg}/N_{ℓ·})² Var(ȳ_g) ,    (3.3.6)

where

    Var(ȳ_g) = [1/(n W_g) + (1 − W_g)/(n² W_g²)] Var(ȳ_{gi} | gth stratum, n fixed) ,    (3.3.7)

with W_g the respective stratum weights, and again assuming n_g ≠ 0 for g = 1, 2, ..., K. Therefore,

    Var(ŷ_ℓ) = Σ_{g=1}^{K} (N_{ℓg}/N_{ℓ·})² [1/(n W_g) + (1 − W_g)/(n² W_g²)] Var(ȳ_{gi} | gth stratum, n fixed) .    (3.3.8)

3.4 The Theoretical Framework

Assume a simple random sample of size n is drawn from an infinite (p+1)-dimensional multivariate population (with continuous distribution) whose observations take the form of the ((p+1)×1) random vector (y, x_1, x_2, ..., x_p)′. Here, the y element conforms to the ȳ_i cluster mean, while the (x_1, ..., x_p)′ are symptomatic indicators which conform to those for each target base unit. The joint density of the multivariate superpopulation is f(y, x_1, ..., x_p), with marginal probability density functions f_1(y), f_2(x_1), ..., f_{p+1}(x_p). Here

    E[(y, x′)′] = (μ_y, μ_x′)′

is the ((p+1)×1) vector of the respective means of the criterion and symptomatic variables, while

    Var[(y, x′)′] = Σ = [ σ_y²   Σ_yx ]
                        [ Σ_xy   Σ_xx ]

is the respective ((p+1)×(p+1)) variance-covariance matrix, assumed to be positive definite.
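The variance approximation (3.3.7)-(3.3.8) above is easy to evaluate numerically. A hedged sketch follows; the function name and the illustrative inputs are ours, not the monograph's.

```python
import numpy as np

# Numerical sketch of (3.3.7)-(3.3.8): given the area shares N_lg / N_l.,
# population stratum weights W_g, within-stratum variances S2_g, and
# overall sample size n,
#   Var(yhat_l) ~= sum_g share_g^2 * [1/(n W_g) + (1 - W_g)/(n^2 W_g^2)] * S2_g.
def poststrat_variance(share_lg, W_g, S2_g, n):
    share_lg, W_g, S2_g = (np.asarray(a, dtype=float) for a in (share_lg, W_g, S2_g))
    var_ybar_g = (1.0 / (n * W_g) + (1.0 - W_g) / (n**2 * W_g**2)) * S2_g
    return float(np.sum(share_lg**2 * var_ybar_g))

# Illustrative numbers only (three strata, n = 100):
print(poststrat_variance([0.3, 0.5, 0.2], [0.25, 0.5, 0.25], [1.0, 1.0, 1.0], n=100))
```

The second term inside the brackets is the poststratification penalty; as n W_g grows it vanishes, and the expression approaches the variance of a proportionally allocated stratified estimator, as the text notes.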
Once the underlying multivariate distribution has been specified, we are able to construct target areas of interest for fixed values of N_ℓ. Here, the respective target base units are represented by N_ℓ (1×p) vectors of symptomatic information taking the form (x_{i1}, x_{i2}, ..., x_{ip}). These are determined by taking the values of equally spaced percentiles on the respective marginal distributions of the symptomatic variables over different ranges of interest such that their product is equal to N_ℓ. To be more explicit, consider the bivariate case with N_ℓ = 9 and the 20th to 80th percentile as the range of interest on each marginal stratifier. The values of the equally spaced, cross classified percentiles observed in the following diagram determine the target area's symptomatic information configuration.

    x_2
     |   (.2, .8)   (.5, .8)   (.8, .8)
     |   (.2, .5)   (.5, .5)   (.8, .5)
     |   (.2, .2)   (.5, .2)   (.8, .2)
     +----------------------------------  x_1

With the number of strata (K) fixed, the multivariate stratum boundaries are of the form (a_{1g} < x_1 ≤ b_{1g}, ..., a_{pg} < x_p ≤ b_{pg}), and the regression comparison estimator for the ℓth target area takes the form

    ŷ_ℓ = (1/N_ℓ) Σ_{i=1}^{N_ℓ} x(i) β̂ = β̂_0 + Σ_{s=1}^{p} β̂_s x̄_{ℓs} ,

where

    x(i) = (1, x_{i1}, x_{i2}, ..., x_{ip}) is a (1×(p+1)) vector of symptomatic information from the ith base unit in the ℓth target area;
    β̂ = (β̂_0, β̂_1, ..., β̂_p)′ is the ((p+1)×1) vector of the least squares regression coefficients determined by the criterion and symptomatic variable information for the n sample base units;
    N_ℓ is the number of base units in the ℓth target area; and
    x̄_{ℓs}, s = 1, 2, ..., p, is the sth symptomatic variable's mean for the ℓth target area.

3.6 Distribution Specific Results

To give our findings a degree of validity beyond the scope of the theoretical framework, the relationship between criterion and symptomatic variables must be characterized by those distributions most relevant to the practical setting.
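The least squares comparison estimator just described (an intercept plus fitted coefficients applied to the area's symptomatic means) can be sketched as follows; the data-generating values are illustrative, not the monograph's.

```python
import numpy as np

# Sketch of a regression (Ericksen-type) small area estimate: fit least
# squares on the n sample base units, then predict the area mean from
# the area's symptomatic-variable means. Illustrative data only.
def regression_area_estimate(X_sample, y_sample, xbar_area):
    X = np.column_stack([np.ones(len(y_sample)), X_sample])   # add intercept
    beta, *_ = np.linalg.lstsq(X, y_sample, rcond=None)
    return float(beta[0] + np.dot(beta[1:], xbar_area))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 60 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(regression_area_estimate(X, y, xbar_area=[0.5, -0.5]))  # close to 60.5
```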
Since the vector (y, x_1, x_2, ..., x_p)′ of criterion and symptomatic variables has been defined to represent a vector of cluster means, their distributions approach the normal when the underlying distributions are not markedly skewed. Consequently, the first distribution we have chosen to consider is the multivariate normal. To facilitate the presentation, we examine the trivariate case where the random vector v′ = (y, x_1, x_2) has a three dimensional multivariate normal distribution with joint density function

    f(v) = (2π)^{−3/2} |Σ_v|^{−1/2} exp[−(1/2)(v − μ_v)′ Σ_v^{−1} (v − μ_v)] .    (3.6.1)

[Display: Schemes I and IA, stratum boundaries on the standardized marginal stratifiers x_1 and x_2 with the corresponding stratum means, for b = 2, 3, and 4 strata per marginal; for b = 2 the cut is at 0.0 (stratum means −.798 and .798), for b = 3 the cuts are at −.61 and .61 (stratum means −1.223, 0.0, and 1.223), and for b = 4 the cuts are at −.99, 0.0, and .99 (stratum means −1.517, −.456, .456, and 1.517).]

Examination of the tables that follow reveals at least one stratification scheme for n=120, and at least two for n=480, which demonstrate the poststratified estimators' superiority using the mean squared error as the measure of precision. Had a better scheme for the determination of strata boundaries been available, further increases in the precision of our poststratified estimator could have been observed.

Generally, when stratification is the strategy used to yield an estimator of a criterion variable for a particular target population, an increase in the number of strata, K, is followed by an increase in the estimator's precision (as measured by a decrease in the variance) for relatively small values of K. Subsequent increases in K coincide with diminishing returns with respect to further proportional reductions in the estimator's variance.
Since each target area estimator under consideration consists of a different weighted linear combination of stratum estimators, and the sampled population does not completely coincide with the target population, we do not expect to find strong evidence of a consistent relationship between the proposed method's precision and the number of strata to be specified (see tables 3.1-3.9).

4. SUMMARY

To summarize, reliable estimates of parameters at the local level are difficult, if not impossible, to obtain directly from sample surveys, primarily due to the constraints of sample size and design. Yet, the very nature of the problem has served as the motivating force in the development of several alternative procedures. When underlying assumptions are too strict or unrealistic, the need for a more flexible approach is obvious. The method considered in our research is particularly attractive in that no functional model between criterion and symptomatic variables must be specified. Here, the most limiting assumption is the availability of symptomatic information. Estimates for the respective "base units" of "target areas" are available as a byproduct of the technique. Finally, the method performs reasonably well even for the linear setting, though here it would be better to choose Ericksen's approach.

TABLE 3.1  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .95, Range (.2, .8)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value
(n=120)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.081     0.000   0.081    60.000
  Modified Kalsbeek Model,  K=4     58.625     0.292    -1.375   2.181    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.228     0.000   0.228    60.000
  Modified Kalsbeek Model, K=16     59.288     0.263    -0.712   0.770    60.000
Hierarchical
  Ericksen                          60.000     0.081     0.000   0.081    60.000
  Modified Kalsbeek Model,  K=4     59.316     0.235    -0.684   0.753    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.211     0.000   0.211    60.000
  Modified Kalsbeek Model, K=16     59.773     0.255    -0.227   0.306    60.000
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.020     0.000   0.020    60.000
  Modified Kalsbeek Model,  K=4     58.625     0.071    -1.375   1.961    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.055     0.000   0.055    60.000
  Modified Kalsbeek Model, K=16     59.288     0.063    -0.712   0.570    60.000
Hierarchical
  Ericksen                          60.000     0.020     0.000   0.020    60.000
  Modified Kalsbeek Model,  K=4     59.316     0.067    -0.684   0.538    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.050     0.000   0.050    60.000
  Modified Kalsbeek Model, K=16     59.773     0.059    -0.227   0.111    60.000

TABLE 3.2  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .58, Range (.2, .8)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value
(n=120)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.556     0.000   0.556    60.000
  Modified Kalsbeek Model,  K=4     59.145     0.731    -0.855   1.462    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.925     0.000   0.925    60.000
  Modified Kalsbeek Model, K=16     59.554     1.292    -0.446   1.490    60.000
Hierarchical
  Ericksen                          60.000     0.556     0.000   0.556    60.000
  Modified Kalsbeek Model,  K=4     59.580     0.720    -0.420   0.907    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.813     0.000   0.813    60.000
  Modified Kalsbeek Model, K=16     59.862     1.162    -0.138   1.181    60.000
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.139     0.000   0.139    60.000
  Modified Kalsbeek Model,  K=4     59.145     0.178    -0.855   0.909    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.224     0.000   0.224    60.000
  Modified Kalsbeek Model, K=16     59.554     0.309    -0.446   0.507    60.000
Hierarchical
  Ericksen                          60.000     0.139     0.000   0.139    60.000
  Modified Kalsbeek Model,  K=4     59.580     0.180    -0.420   0.356    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.196     0.000   0.196    60.000
  Modified Kalsbeek Model, K=16     59.862     0.274    -0.138   0.293    60.000

TABLE 3.3  TARGET AREA ESTIMATION FOR TRIVARIATE LOGISTIC DISTRIBUTION (CORRESPONDING TO R = .58), Range (.2, .8)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value

(n=120)
Approximate Optimal Boundaries on Marginals Using Cum √f Rule
  Ericksen                          60.000     0.517    -1.296   2.195    61.296
  Modified Kalsbeek Model,  K=4     59.334     0.763    -1.962   4.612    61.296
  Modified Kalsbeek Model,  K=9     60.956     0.869    -0.340   0.984    61.296
  Modified Kalsbeek Model, K=16     61.108     1.276    -0.188   1.311    61.296
(n=480)
  Ericksen                          60.000     0.129    -1.296   1.808    61.296
  Modified Kalsbeek Model,  K=4     59.334     0.186    -1.962   4.036    61.296
  Modified Kalsbeek Model,  K=9     60.956     0.210    -0.340   0.325    61.296
  Modified Kalsbeek Model, K=16     61.108     0.304    -0.188   0.340    61.296

TABLE 3.4  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .95, Range (.05, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value
(n=120)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.081     0.000   0.081    60.000
  Modified Kalsbeek Model,  K=4     58.625     0.292    -1.375   2.181    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.311     0.000   0.311    60.000
  Modified Kalsbeek Model, K=16     59.278     0.476    -0.722   0.997    60.000
Hierarchical
  Ericksen                          60.000     0.081     0.000   0.081    60.000
  Modified Kalsbeek Model,  K=4     59.316     0.235    -0.684   0.753    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.215     0.000   0.215    60.000
  Modified Kalsbeek Model, K=16     59.588     0.218    -0.412   0.388    60.000
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.020     0.000   0.020    60.000
  Modified Kalsbeek Model,  K=4     58.625     0.071    -1.375   1.960    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.065     0.000   0.065    60.000
  Modified Kalsbeek Model, K=16     59.278     0.069    -0.722   0.590    60.000
Hierarchical
  Ericksen                          60.000     0.020     0.000   0.020    60.000
  Modified Kalsbeek Model,  K=4     59.316     0.070    -0.684   0.538    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.051     0.000   0.051    60.000
  Modified Kalsbeek Model, K=16     59.588     0.048    -0.412   0.218    60.000

TABLE 3.5  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .58, Range (.05, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value

(n=120)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.556     0.000   0.556    60.000
  Modified Kalsbeek Model,  K=4     59.145     0.731    -0.855   1.461    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.926     0.000   0.926    60.000
  Modified Kalsbeek Model, K=16     59.548     1.118    -0.452   1.322    60.000
Hierarchical
  Ericksen                          60.000     0.556     0.000   0.556    60.000
  Modified Kalsbeek Model,  K=4     59.580     0.731    -0.420   0.907    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.722     0.000   0.722    60.000
  Modified Kalsbeek Model, K=16     59.745     0.818    -0.255   0.883    60.000
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          60.000     0.139     0.000   0.139    60.000
  Modified Kalsbeek Model,  K=4     59.145     0.178    -0.855   0.909    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.205     0.000   0.205    60.000
  Modified Kalsbeek Model, K=16     59.548     0.215    -0.452   0.419    60.000
Hierarchical
  Ericksen                          60.000     0.139     0.000   0.139    60.000
  Modified Kalsbeek Model,  K=4     59.580     0.179    -0.420   0.356    60.000
  Modified Kalsbeek Model,  K=9     60.000     0.172     0.000   0.172    60.000
  Modified Kalsbeek Model, K=16     59.745     0.187    -0.255   0.252    60.000

TABLE 3.6  TARGET AREA ESTIMATION FOR TRIVARIATE LOGISTIC
DISTRIBUTION (CORRESPONDING TO R = .58), Range (.05, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value

(n=120)
Approximate Optimal Boundaries on Marginals Using Cum √f Rule
  Ericksen                          60.000     0.517    +0.835   1.214    59.165
  Modified Kalsbeek Model,  K=4     59.334     0.763    +0.169   0.791    59.165
  Modified Kalsbeek Model,  K=9     59.925     0.903    +0.760   1.481    59.165
  Modified Kalsbeek Model, K=16     59.796     0.923    +0.631   1.322    59.165
(n=480)
  Ericksen                          60.000     0.129    +0.835   0.827    59.165
  Modified Kalsbeek Model,  K=4     59.334     0.186    +0.169   0.215    59.165
  Modified Kalsbeek Model,  K=9     59.925     0.201    +0.760   0.779    59.165
  Modified Kalsbeek Model, K=16     59.796     0.191    +0.631   0.590    59.165

TABLE 3.7  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .95, Range (.35, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value

(n=120)
Optimal Boundaries on Marginals
  Ericksen                          65.094     0.104     0.000   0.104    65.094
  Modified Kalsbeek Model,  K=4     64.124     0.366    -0.970   1.306    65.094
  Modified Kalsbeek Model,  K=9     65.751     0.307     0.657   0.739    65.094
  Modified Kalsbeek Model, K=16     65.369     0.274     0.276   0.350    65.094
Hierarchical
  Ericksen                          65.094     0.104     0.000   0.104    65.094
  Modified Kalsbeek Model,  K=4     63.799     0.340    -1.294   2.015    65.094
  Modified Kalsbeek Model,  K=9     65.102     0.286     0.008   0.286    65.094
  Modified Kalsbeek Model, K=16     64.962     0.246    -0.132   0.264    65.094
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          65.094     0.026     0.000   0.026    65.094
  Modified Kalsbeek Model,  K=4     64.124     0.090    -0.970   1.030    65.094
  Modified Kalsbeek Model,  K=9     65.751     0.074     0.657   0.506    65.094
  Modified Kalsbeek Model, K=16     65.369     0.060     0.276   0.136    65.094
Hierarchical
  Ericksen                          65.094     0.026     0.000   0.026    65.094
  Modified Kalsbeek Model,  K=4     63.799     0.083    -1.294   1.759    65.094
  Modified Kalsbeek Model,  K=9     65.102     0.068     0.008   0.068    65.094
  Modified Kalsbeek Model, K=16     64.962     0.056    -0.132   0.073    65.094

TABLE 3.8  TARGET AREA ESTIMATION FOR TRIVARIATE NORMAL DISTRIBUTION WITH R = .58, Range (.35, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value
(n=120)
Optimal Boundaries on Marginals
  Ericksen                          63.196     0.726     0.000   0.726    63.196
  Modified Kalsbeek Model,  K=4     62.565     0.843    -0.631   1.242    63.196
  Modified Kalsbeek Model,  K=9     63.622     1.103     0.426   1.284    63.196
  Modified Kalsbeek Model, K=16     63.385     1.069     0.189   1.104    63.196
Hierarchical
  Ericksen                          63.196     0.726     0.000   0.726    63.196
  Modified Kalsbeek Model,  K=4     62.350     0.850    -0.846   1.566    63.196
  Modified Kalsbeek Model,  K=9     63.202     1.072     0.006   1.072    63.196
  Modified Kalsbeek Model, K=16     63.130     1.071    -0.066   1.075    63.196
(n=480)
Optimal Boundaries on Marginals
  Ericksen                          63.196     0.182     0.000   0.182    63.196
  Modified Kalsbeek Model,  K=4     62.565     0.207    -0.631   0.606    63.196
  Modified Kalsbeek Model,  K=9     63.622     0.265     0.426   0.446    63.196
  Modified Kalsbeek Model, K=16     63.385     0.241     0.189   0.277    63.196
Hierarchical
  Ericksen                          63.196     0.182     0.000   0.182    63.196
  Modified Kalsbeek Model,  K=4     62.350     0.209    -0.846   0.925    63.196
  Modified Kalsbeek Model,  K=9     63.202     0.257     0.006   0.257    63.196
  Modified Kalsbeek Model, K=16     63.130     0.247    -0.066   0.251    63.196

TABLE 3.9  TARGET AREA ESTIMATION FOR TRIVARIATE LOGISTIC DISTRIBUTION (CORRESPONDING TO R = .58), Range (.35, .95)

Scheme / Model                   Expectation  Variance    Bias   M.S.E.  True Value

(n=120)
Approximate Optimal Boundaries on Marginals Using Cum √f Rule
  Ericksen                          63.035     0.659    -0.680   1.122    63.714
  Modified Kalsbeek Model,  K=4     62.530     0.742    -1.184   2.144    63.714
  Modified Kalsbeek Model,  K=9     63.865     0.924     0.151   0.947    63.714
  Modified Kalsbeek Model, K=16     63.340     0.856    -0.314   0.955    63.714
(n=480)
  Ericksen                          63.034     0.165    -0.680   0.627    63.714
  Modified Kalsbeek Model,  K=4     62.530     0.182    -1.184   1.584    63.714
  Modified Kalsbeek Model,  K=9     63.865     0.223     0.151   0.246    63.714
  Modified Kalsbeek Model, K=16     63.340     0.198    -0.314   0.296    63.714

BIBLIOGRAPHY

Anderson, D.W. (1976). Gains from multivariate stratification. Unpublished doctoral dissertation. University of Michigan, Ann Arbor.

Anderson, D.W., Kish, L., and Cornell, R.G. (1976). Quantifying gains from stratification for optimum and approximately optimum strata using a bivariate normal model. Journal of the American Statistical Association 71, 887-892.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. Massachusetts: MIT Press.

Burnham, W.D.
and Sprague, J. (1970). Additive and multiplicative models of the voting universe: the case of Pennsylvania, 1960-1968. American Political Science Review 64, 471-490.

Causey, B.C. (1972). Sensitivity of raked contingency table totals to changes in problem conditions. Annals of Mathematical Statistics 43, 656-658.

Cochran, W.G. (1963). Sampling Techniques. New York: John Wiley and Sons.

Cohen, A.C. (1957). On the solution of estimating equations for truncated and censored samples from normal populations. Biometrika 44, 225-236.

Cramer, H. (1963). Mathematical Methods of Statistics. Princeton: Princeton University Press.

Crosetti, A.H. and Schmitt, R.C. (1956). A method of estimating the intercensal populations of counties. Journal of the American Statistical Association 51, 587-590.

Curnow, R.N. and Dunnett, C.W. (1962). The numerical evaluation of certain multivariate normal integrals. Annals of Mathematical Statistics 33, 571-579.

Dalenius, T. (1957). Sampling in Sweden. Contributions to the Methods and Theories of Sample Survey Practice. Stockholm: Almquist och Wiksell.

Dalenius, T. and Hodges, J.L. (1959). Minimum variance stratification. Journal of the American Statistical Association 54, 88-101.

Duncan, O.D. (1959). Residential Segregation and Social Differentiation. Vienna, Austria: International Population Conference.

Ericksen, E.P. (1973a). A method for combining sample survey data and symptomatic indicators to obtain population estimates for local areas. Demography 10, 137-160.

Ericksen, E.P. (1973b). Recent developments in estimation for local areas. American Statistical Association, Proceedings of the Social Statistics Section, 37-41.

Ericksen, E.P. (1974). A regression method for estimating population changes of local areas. Journal of the American Statistical Association 69, 867-875.

Gonzalez, M.E. and Waksberg, J. (1973). Estimation of the error of synthetic estimates.
Presented at the first meeting of the International Association of Survey Statisticians, Vienna, Austria.

Gonzalez, M.E. (1973). Use and evaluation of synthetic estimates. American Statistical Association, Proceedings of the Social Statistics Section, 33-36.

Gonzalez, M.E. and Hoza, C. (1975). Small area estimation of unemployment. American Statistical Association, Proceedings of the Social Statistics Section.

Gumbel, E.J. (1961). Bivariate logistic distributions. Journal of the American Statistical Association 56, 335-349.

Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953). Sample Survey Methods and Theory, Vols. I and II. New York: John Wiley and Sons.

Hinckley, B. (1970). Incumbency and the presidential vote in senate elections: defining parameters of subpresidential voting. American Political Science Review 64, 836-842.

Johnson, N.L. and Kotz, S. (1972). Continuous Multivariate Distributions. New York: John Wiley and Sons.

Kaitz, H. (1973). Comments on the papers by Gonzalez and Ericksen. American Statistical Association, Proceedings of the Social Statistics Section, 44-45.

Kalsbeek, W.D. (1973). A method for obtaining local postcensal estimates for several types of variables. Unpublished doctoral dissertation. University of Michigan, Ann Arbor.

Keyfitz, N. (1957). Estimates of sampling variances where two units are selected from each stratum. Journal of the American Statistical Association 52, 503-510.

Kim, J., Petrocik, J.R., and Enokson, S.N. (1975). Voter turnout among American States: systemic and individual components. American Political Science Review 69, 107-123.

Kish, L. (1967). Survey Sampling. New York: John Wiley and Sons.

Koch, G.G. (1973). An alternative approach to multivariate response error models for sample survey data with applications to estimators involving subclass means. Journal of the American Statistical Association 68, 906-913.

Koch, G.G., Freeman, D.H., and Tolley, H.D. (1975).
The asymptotic covariance structure of estimated parameters from contingency table log-linear models. University of North Carolina Institute of Statistics Mimeo Series No. 1046.

Koch, G.G. and Freeman, D.H. (1976). The asymptotic covariance structure of estimated parameters from marginal adjustment of contingency tables. Paper presented to the Washington Statistical Society Frederick F. Stephan Memorial Methodology Program.

Levy, P.S. (1971). The use of mortality data in evaluating synthetic estimates. American Statistical Association, Proceedings of the Social Statistics Section, 328-331.

Murthy, M.N. (1967). Sampling Theory and Methods. Calcutta: Statistical Publishing Society.

Namekata, T., Levy, P.S., and O'Rourke, T.W. (1975). Synthetic estimates of work loss disability for each state and the District of Columbia. Public Health Reports 90, 532-538.

Pool, I., Abelson, R.P., and Popkin, S.L. (1975). Candidates, Issues and Strategies: A Computer Simulation of the 1960 and 1964 Presidential Elections. Cambridge, Massachusetts: MIT Press.

Rosenbaum, S. (1961). Moments of a truncated bivariate normal distribution. Journal of the Royal Statistical Society, Series B 23, 405-408.

Royall, R. (1973). Discussion of papers by Gonzalez and Ericksen. American Statistical Association, Proceedings of the Social Statistics Section, 42-43.

Scheuren, F.J. and Oh, H.L. (1975). A data analysis approach to fitting square tables. Communications in Statistics 4, 595-615.

Schneider, A.L. (1973). Estimating aggregate opinion in small political units: the disaggregation of national survey data. Applied Policy Research Series, Vol. 1, Oregon Research Institute.

Seidman, D. (1975). Simulation of public opinion: a caveat. Public Opinion Quarterly 39, 43-54.

Sethi, V.K. (1963). A note on optimal stratification of populations for estimating the population means. Australian Journal of Statistics 5, 20-33.

Shryock, H.S. and Siegel, J.S. (1973).
The Methods and Materials of Demography. Washington, D.C.: U.S. Government Printing Office.

Singh, R. and Sukhatme, B.V. (1972). A note on optimum stratification. Journal of the Indian Society of Agricultural Statistics 24, 91-98.

Snow, E.C. (1911). The application of the method of multiple correlation to the estimation of postcensal population. Journal of the Royal Statistical Society 74, 575-620.

Soares, G. and Hamblin, R. (1967). Socioeconomic variables and voting for the radical left: Chile 1952. American Political Science Review 61, 1053-1065.

Stephan, F.F. (1945). The expected value and variance of the reciprocal and other negative powers of a positive Bernoullian variate. Annals of Mathematical Statistics 16, 50-61.

Thomsen, I. (1976). A comparison of approximately optimal stratification given proportional allocation with other methods of stratification and allocation. Metrika 23, 15-25.

U.S. Bureau of the Census. (1963). The Current Population Survey: A Report on Methodology. Technical Paper No. 7.

U.S. Bureau of the Census. (1973). Concepts and methods used in manpower statistics from the Current Population Survey. Current Population Reports, P-23, No. 22.

U.S. Bureau of the Census. (1973). County and City Data Book, 1972 (A Statistical Abstract Supplement). Washington, D.C.: U.S. Government Printing Office.

U.S. Bureau of the Census. (1973). Federal state cooperative program for local population estimates: test results, April 1, 1970. Current Population Reports, P-26, No. 21.

U.S. Bureau of the Census. (1975). Coverage of the population in the 1970 census and some implications for public programs. Current Population Reports, P-23, No. 56.

U.S. National Center for Health Statistics. (1968). Synthetic Estimates of Disability. PHS Publication No. 1759.

Weber, R.E. and Shaffer, W.R. (1972). Public opinion and American state policy making. Midwest Journal of Political Science 16, 683-699.

Weber, R.E., Hopkins, A.H., Mezey, M.L., and Munger, F.J.
(1973). Computer simulation of state electorates. Public Opinion Quarterly 36, 549-565.

Woodruff, R.S. (1966). Use of a regression technique to produce area breakdowns of the monthly estimates of retail trade. Journal of the American Statistical Association 61, 496-504.

Discussion

Joseph Waksberg

1. I'm not sure that I see where the Kalsbeek-Cohen model is really different from the synthetic estimator model that Simmons, Levy, and others at NCHS have described, or that Maria Gonzalez and I discussed in our 1973 paper. The synthetic estimate is defined as Σ_i p_i x̄_i, where i is an index for the classification of the population considered most useful for the statistic to be estimated. Most, or possibly all, of the examples discussed in the earlier papers have considered the commonly used classification variables such as sex, age, race, etc. However, there is no theoretical reason why some type of small area geographic classification cannot be used to define the classes, either solely or in combination with the more usual demographic items. If this is done, then the Kalsbeek-Cohen model merges with the earlier one. Some of the earlier papers do include geography as a component of the classification scheme, but use fairly large areas; for example, SMSA's versus non-SMSA's, or county size. These are areas that generally correspond to primary sampling units for most of the large-scale national surveys whose results have been used for synthetic estimates. They are easily manipulable since the data can be automatically coded. More important, they comprise classifications for which reasonably accurate data are likely to exist on the population proportions that act as weights for the local area estimates. This is, of course, essential for the theory to have any practical application. Cohen's paper departs from the large area geographic units and shows that smaller areas can also be used.
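The synthetic estimate Waksberg defines, Σ_i p_i x̄_i, can be sketched in a few lines; every number below is illustrative, not drawn from any survey.

```python
# Minimal sketch of the synthetic estimate sum_i p_i * xbar_i: national
# rates xbar_i for demographic classes i, weighted by the local area's
# class proportions p_i. All values are illustrative.
p = [0.4, 0.35, 0.25]        # local-area class proportions (sum to 1)
xbar = [0.02, 0.05, 0.09]    # national rates for the same classes
synthetic = sum(pi * xi for pi, xi in zip(p, xbar))
print(round(synthetic, 3))   # 0.048
```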
I have tried to develop criteria for the kinds of areas that could be efficiently utilized in real-life applications of Cohen's approach. It seems to me that there are three conditions that have to be satisfied in defining the areas:

a. The areas must be such that each sample element can be coded in its proper base unit, so that it is clear to which of the G classes it belongs;

b. Current population counts of the number of elements in each of the G classes are necessary so that the appropriate weights can be used in the estimators;

c. The areas must be fairly small so that the population within each area is relatively homogeneous. This is necessary for the poststratification in the estimator to be effective in reducing the mean square error.

Concentrating first on the third condition, I suspect it is necessary to get down to the tract or enumeration district (ED) level to achieve sufficient homogeneity. Many private national surveys using area sampling techniques do use the ED as a stage of sample selection, permitting the base unit coding. However, the Census Bureau currently does not have this capability easily available for about 15 percent of its sample, the part used to represent new construction. Extra efforts would be required to carry out Cohen's procedure.

The real problem, however, stems from the second criterion. Once one departs from the census dates, the estimates of the population of the G strata in each local area become very uncertain. For example, I suppose that the proportion of population in various minority classes would be a fairly obvious stratification variable. There have been dramatic and significant changes in the population of such areas in many cities of the United States, and also in minority proportions in these areas. I doubt that most cities have accurate information on the changes that have occurred.
We recently contacted a number of local officials and agencies in Maryland in an attempt to update 1970 census data on the percent of black population per tract, and the information was simply not available. The application of the Cohen-Kalsbeek method thus seems to me mostly restricted to a period of possibly two or three years after the census. Of course, with the start of mid-decade censuses in the 1980's, this will not be as much of a restriction as it is at present.

There is one study area where the same time restrictions may not apply: studies of political behavior. Election precincts have some of the characteristics of ED's. The geographic sizes and average populations are not too different. However, unlike ED's, election precinct information is brought up to date every two years, and in some areas more often. It would be possible to apply the method described in Cohen's paper to studies which use election precincts as stages in sampling.

2. Let me move to another issue, tests of the accuracy of the various procedures that are being developed. Cohen developed several potential population distributions and studied the bias for each distribution. Many of the other papers have proceeded empirically, using information available for local areas from censuses or other sources, and simulating synthetic estimators. Both of these procedures are valuable in giving insight on the conditions under which one method or another is preferable. However, neither procedure is sufficient for most real-life studies that would call for practical applications of synthetic estimates. It is necessary for a technique to have some means of estimating accuracy from survey results without making assumptions about the nature of the underlying distributions. Ultimately, the accuracy depends on the size of the between local area variance. I didn't see any discussion of between-area estimation methods in Cohen's paper.
Possibly, it's sufficient to assume that usual methods of estimating components of variance exist.

3. I'd like to add one general remark about potential uses of synthetic estimates. In Maria Gonzalez and Christine Hoza's article, Small Area Estimation with Applications to Unemployment and Housing Estimates, in the March 1978 issue of the Journal of the American Statistical Association, average mean square errors are shown for estimates of unemployment in 1970 crossclassified by unemployment rate in the area. The errors of the estimates are sort of u-shaped, low for areas with average unemployment rates and much higher for areas at both ends of the distribution, in particular for those at the higher end. This is not too surprising. Synthetic estimates tend to squeeze estimates toward the mean. One of the main purposes of using symptomatic data in a regression model is to compensate for this tendency. Ericksen's work on eliminating outliers is another attempt to reduce the same effect.

I think it is unlikely that these devices will be completely successful. This raises a real dilemma when one attempts to make local area estimates for purposes of administrative action at the local level. For example, if we wish to allocate funds for drug abuse treatment or education on the basis of the size of the problem, then it is precisely the areas that need the funds most whose estimates will be most seriously understated. I am not very optimistic about the possibility of finding the right symptomatic variables to significantly reduce this effect. There are several courses of action that can be taken. One, of course, is simply to live with the problem. A second is to view synthetic estimates as screening devices, designed to identify the areas where it is reasonably safe to assume that only a small problem exists, and do more intensive work to get a better handle on the problem in areas where the synthetic estimator is above a specific cut-off.
The third is to use synthetic estimates not to produce statistics for individual areas, but to produce distributions of the areas, for example, the number of areas with drug abuse rates at various levels. If the latter is done, some moderate size should be used to establish the upper end of the class intervals. When it is important to have good estimates for areas at the upper end of the distribution, synthetic estimates are likely to be inadequate unless very effective symptomatic variables exist.

4. There has been occasional reference during the meeting to the elimination of outliers in order to get a better fit to models. I am somewhat uneasy about mechanical rules to eliminate or reduce the effect of outliers. My inclination is to view outliers from a quality control point of view, that is, to reexamine them to make sure there are no errors in the data, or for that matter as a clue to the use of other, nonlinear models, rather than to follow mechanical rules of rejection. Some time ago I saw a dramatic illustration of the dangers of automatic rules on outliers. In the 1966 election, one of the TV networks was making early evening projections of state votes on the basis of data from a sample of precincts. As part of quality control, the percentage Democratic vote in each precinct was compared to past performance in that precinct. Wild fluctuations were removed as being either data errors or some sort of unrepresentative freaks. In that year, there was an unusual election for governor of Maryland. The Democrats nominated an extremely right-wing, prosegregation candidate. The Republicans nominated someone who was largely unknown, and kept quiet on most controversial issues. As a result, precincts in predominantly black and liberal areas, that had been solidly Democratic in previous elections, suddenly voted solidly Republican.
The analysts in New York, apparently completely unfamiliar with the Maryland situation, proceeded to throw out the results of the sample precincts in such areas. These, of course, were the precincts that most clearly illuminated what was going on in Maryland. The network probably made the worst projection in history on that election. I might say that I was not involved in these projections. The experience, however, is indicative of the dangers in too much "fooling around" with the data.

General Discussion

There is one point which has just been made by Joe Waksberg that may be worth emphasizing. The point is slightly different from one made earlier. That is, perhaps synthetic estimators could be used for distinguishing outliers which should be given special treatment the next time around in a sample survey, so that one could supplement the sample in those areas in particular. Thus, instead of spreading effort over, say, 39,000 units, if you could find some small subset of areas in which a rather different cultural, social, or economic phenomenon exists, then this would be useful for designing the second effort. Thus, there may be a number of uses of synthetic estimators as screeners. The one which has just been suggested should be kept in mind.

(Contributing to the general discussion during this period were: Reuben Cohen, Joseph Steinberg and Joseph Waksberg.)

Part III

Case Studies on the Use and Accuracy of Synthetic Estimates: Unemployment and Housing Applications
    Maria Elena Gonzalez
Some Recent Census Bureau Applications of Regression Techniques to Estimation
    Robert E. Fay
Discussion
    Eugene P. Ericksen
General Discussion

Case Studies on the Use and Accuracy of Synthetic Estimates: Unemployment and Housing Applications

Maria Elena Gonzalez

ABSTRACT

A description is given of unemployment synthetic estimates for counties, based on the 1970 Census of Population.
The distribution of the method error of these estimates is given, as well as the relative accuracy of these estimates. Implications for intercensal estimates based on regression models are considered. Vacancy rates from the 1970 Census of Housing are discussed. 1970 estimates of dilapidated housing units with all plumbing facilities, and their accuracy, are analyzed.

INTRODUCTION

Small area estimates are required for the planning and evaluation of programs for individual areas, as well as for the distribution of Federal funds to State and local areas. This great demand has created a need to analyze the different methodologies available to obtain small area data and evaluate the accuracy of the data produced. One such methodology, called synthetic estimation, has been used to obtain estimates for small areas and as a method of imputation for missing data. In the simplest case a synthetic estimate would use a valid estimate for a large area (e.g., a State), and apply it to all the subareas (e.g., counties) within the State; for the subareas (counties in this case) this estimate would in general be biased. The bias for the subareas is due to the difference which usually exists between the estimate for the large area and the various estimates for the subareas. In most of the examples to be discussed in this paper, synthetic estimates are derived by partitioning the universe into a series of mutually exclusive and exhaustive cells and deriving the estimate as a sum of products. In the case of unemployment, the weights correspond to the distribution in the small area of the labor force by age, for example, and the estimated unemployment by age corresponds to the estimate for the larger area. A formula expressing the synthetic estimate is:

    u_i* = SUM(j=1 to G) p_ij u_j        (1)

where p_ij is the labor force proportion for county (or subarea) i and characteristic j, SUM(j=1 to G) p_ij = 1, and u_j is the unemployment rate for characteristic j in the State (or larger area).
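Formula (1) can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual computation; the age groups, State rates, and county shares below are hypothetical.

```python
# Sketch of formula (1): a county synthetic estimate as a weighted sum
# of State-level rates. All numbers are hypothetical.

def synthetic_estimate(p, u):
    """u_i* = SUM_j p_ij * u_j, with the shares p_ij summing to 1."""
    assert abs(sum(p) - 1.0) < 1e-9, "labor-force shares must sum to 1"
    return sum(p_j * u_j for p_j, u_j in zip(p, u))

# State unemployment rates by age group (hypothetical), in percent:
state_rates = [8.0, 4.5, 3.0]
# County i's labor-force shares for the same age groups:
county_shares = [0.20, 0.50, 0.30]

u_star = synthetic_estimate(county_shares, state_rates)
# u_star = 0.2*8.0 + 0.5*4.5 + 0.3*3.0 = 4.75 percent
```

The county's estimate is pulled toward the State rates; only the demographic mix, not any county-specific unemployment information, differentiates counties.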
In this paper, we review synthetic estimates of unemployment derived for counties at the time of the 1970 Census of Population; these estimates are compared with the Census 20-percent sample estimates of unemployment to obtain and analyze the distribution of the method error of the synthetic estimates. In addition, some regression estimates of unemployment which might be used intercensally (in the years between decennial censuses) are presented. In the area of housing, we present data on vacancy rates. In the 1970 Census of Population and Housing, it was found that about 11.4 percent of the housing units initially reported by enumerators as vacant were occupied. An adjustment for these misclassified vacant units was included in the processing, and some effects will be described (see Gonzalez and Waksberg 1973). The pretests for the 1980 census shed some further light on these results. In addition, the possibility of estimating vacancy rates intercensally is explored. In the 1970 Census of Housing, estimates of housing units dilapidated with all plumbing facilities (DWAPF) were obtained by synthetic methods. The relative accuracy of these estimates is discussed.

UNEMPLOYMENT STATISTICS

The 1970 Census of Population data on unemployment, collected from a 20-percent sample, were used to calculate various alternative synthetic estimates of unemployment for counties in the United States. This allows us to compare the Census and synthetic estimates. The unemployment estimates for geographic divisions were used as the basis for the u_j's, for a number of different characteristics j. The characteristics used to compute synthetic estimates included sex, race (black vs. all other races), and alternative classifications of the population by: occupation, age-marital status, industry, and occupation-income (see Gonzalez and Hoza 1978).
The definition of the cells (mutually exclusive and exhaustive) used to compute the alternative synthetic estimates was determined empirically, trying to minimize the number of cells for which many counties had zero persons in the labor force. It is possible that by means of a more systematic approach, such as the use of cluster analysis for defining the cells, improved results could be obtained. The synthetic estimates based on the race - sex - occupation classification provided the highest weighted correlation, 0.682, with the county estimates for the 1970 Census. Within each of the nine geographic divisions, the number of cells used to compute the synthetic estimate based on race - sex - occupation was 31: 12 cells for nonblack males, 9 cells for nonblack females, and 5 cells each for black males and black females. The synthetic estimate based on race - sex - age - marital status resulted in a weighted correlation for all counties of 0.569. This synthetic estimate used 50 cells within each of the nine geographic divisions. The increase in the number of cells did not, in this case, result in a higher correlation with the Census estimate. Computing the county synthetic estimates based on the unemployment rates for the geographic divisions where they are located might not lead to the most efficient results. It is possible that a more homogeneous grouping of counties would give better results. In this analysis, however, other groupings of counties were not tried. Table 1 shows the number of counties classified by the 1970 Census estimate of unemployment, as well as the root mean square error and the relative root mean square error for the synthetic estimates based on race - sex - occupation and those based on race - sex - age - marital status classifications. The root mean square error was estimated as:

    [ (1/M) SUM(i=1 to M) (u_i* - u_i)^2 ]^(1/2)        (2)

where u_i
is the 1970 Census unemployment estimate for county i, and M is the number of counties with a specified unemployment rate in the 1970 Census (e.g., counties with unemployment rate from 4.0 percent to 4.9 percent). The root mean square error is smaller for the synthetic estimates based on occupation than for those based on age - marital status categories: 1.9 versus 2.2 percent. The smallest relative root mean square error corresponds to unemployment between 4.0 percent and 4.9 percent, which is also the category where the overall U.S. unemployment rate falls (4.4 percent). For counties with unemployment rates below 3 percent and those above 11 percent, for synthetic estimates based on occupation, the relative root mean square error was above 0.5. This results in a U-shaped distribution. Because of the smoothing characteristic of the synthetic estimates, the estimates corresponding to 1970 Census unemployment estimates further away from the average tend to be less accurate than those for counties with 1970 Census estimates closer to the average unemployment rate.

TABLE 1
Distribution of the Root Mean Square Error of Synthetic Estimates by Counties by Size of 1970 Census Unemployment Rate

1970 Census                      Root Mean Square        Relative Root Mean
Unemployment                     Error (%)               Square Error(b)
Rate            Counties(a)   Occupation  Age-Marital  Occupation  Age-Marital
                                          Status                   Status
Less than 1.0%        21         2.8         2.8          5.52        5.56
1.0% - 1.9%          171         2.0         1.5          1.36        0.99
2.0% - 2.9%          493         1.4         1.2          0.57        0.50
3.0% - 3.9%          679         0.9         0.7          0.24        0.21
4.0% - 4.9%          580         0.6         0.8          0.14        0.18
5.0% - 5.9%          363         1.2         1.6          0.22        0.28
6.0% - 6.9%          232         1.8         2.3          0.28        0.36
7.0% - 7.9%          137         2.5         3.0          0.33        0.40
8.0% - 8.9%           88         3.4         4.1          0.40        0.48
9.0% - 9.9%           51         4.3         4.9          0.46        0.52
10.0% - 10.9%         30         4.8         5.5          0.46        0.52
11.0% - 11.9%         22         6.5         7.1          0.56        0.62
12.0% - 12.9%         23         7.2         7.9          0.58        0.63
13.0% - 13.9%         10         8.1         8.9          0.60        0.66
14.0% - 14.9%          2         8.4         9.1          0.58        0.62
15.0% - 16.9%          6        10.4        11.3          0.66        0.71
Average 4.4%        2908         1.9         2.2          0.43        0.50

a See footnote 2.
b The relative root mean square error was calculated by dividing the root mean square error by the mid-point of the unemployment interval.

The results for synthetic estimates of unemployment based on age-marital status are similar to those based on occupation. Although the variance was not separately estimated, if it is relatively small, then the bias is not negligible. Figure A plots the distribution of the relative method error for synthetic estimates based on occupation and those based on age - marital status. The relative method error for the unemployment rate is calculated as the difference between the synthetic estimate and the Census estimate, divided by the Census estimate. For synthetic estimates based on occupation, 48.3 percent of the counties had a negative relative method error, and for synthetic estimates based on age - marital status the corresponding percentage is 54.3. If we disregard the sign, a relative method error of 0.2 or less is obtained by 43.0 percent of the synthetic estimates based on occupation and by 38.3 percent of those based on age - marital status. Similarly, a relative method error of 0.5 or less is obtained by 79.9 percent of the occupation synthetic estimates and by 79.3 percent of the age - marital status synthetic estimates. About 95 percent of the counties for both distributions have a relative error of 1.0 or less. Approximately 1.1 percent of the 2908 counties tabulated had a relative method error over 2.0. The charts show quite similar distributions of the relative method error for both synthetic estimates; this result is expected since there is a very high correlation, 0.916, between the occupation and age - marital status synthetic estimates. For intercensal estimates of unemployment, we will consider regression estimates for 122 Current Population Survey (CPS) primary sampling units (PSUs) (see Ericksen 1974). The CPS is a monthly survey which collects data on employment and unemployment.
The data of the survey can be tabulated for individual PSUs, although the data are subject to a very high variance. The regressions use as dependent variables two summarizations of the CPS PSU data: (1) a one-month summary based on the April 1970 data, Z, and (2) a summary of five months of CPS data centered on April 1970 and spaced at quarterly intervals, Y. The independent variables include 1970 Census estimates, U, alternative estimates based on the unemployment insurance data, X_1, and synthetic estimates based on sex - race - occupation classifications, X_2. The following regressions are obtained:

    Y' = .016 + .884 U - .080 X_1 - .023 X_2        (3)

    R^2 = .540
    Residual mean square = .881 x 10^-4
    Standard error of estimate = .938 x 10^-2

where X_1 is the insured unemployment as a percent of total unemployment.

[Figure A. Distribution of Relative Method Error for Alternative Synthetic Estimates of the Unemployment Rate for Counties in the United States, 1970. The figure charts the percent of counties against the relative method error, from -0.8 to 2.0, for the occupation and the age-marital status synthetic estimates; 1.1% of the 2908 counties tabulated had a relative method error over 2.0.]

    Z' = .026 + 1.016 U - .107 X_1 - .396 X_2        (4)

    R^2 = .263
    Residual mean square = 2.119 x 10^-4
    Standard error of estimate = 1.456 x 10^-2

Because of the higher variance of Z, based on one month of CPS data, regression (4) shows a lower correlation than regression (3), which uses a dependent variable based on an accumulation of five months of CPS data. Additional regressions follow:

    Y' = .010 + .450 U + .089 X_2 + .326 X_3        (5)

    R^2 = .563
    Residual mean square = .835 x 10^-4
    Standard error of estimate = .914 x 10^-2

where X_3 is the new final annual "70-step" estimate (see footnote 3) of unemployment before benchmarking the estimates by CPS data.
    Z' = .019 + .442 U - .247 X_2 + .430 X_3        (6)

    R^2 = .201
    Residual mean square = 2.040 x 10^-4
    Standard error of estimate = 1.428 x 10^-2

The results show a slight improvement of the correlation in the regressions which use X_3 rather than X_1 as an independent variable. However, in selecting the independent variables, the availability and timeliness of the variables must be taken into account. For the sample areas, further improvements in the estimates could be achieved by combining the CPS PSU sample data with the regression estimates (Fay and Herriot 1978). Nevertheless, the regression methodology provides a feasible way of obtaining intercensal small area estimates of unemployment.

HOUSING STATISTICS: VACANCY RATES

After the initial completion of the enumeration for the 1970 Census of Population, a National Vacancy Check (NVC) sample survey was carried out (U.S. Bureau of the Census 1974b). Reinterviews were conducted for a sample of housing units initially reported as vacant by the enumerators to check whether they might have been occupied at the time of the census. The results of this survey showed that an estimated 11.4 percent of these vacant housing units were actually occupied at the time of the census. This project was intended originally as an evaluation project of the 1970 Census, but when the extent of the problem became apparent, the project was converted into an operational census procedure. One possible reason why occupied housing units might have been erroneously classified as vacant was that the enumerator could not find anybody to report whether or not the unit could have been occupied at the time of the census.
Based on the results of the NVC and the size of household found in the misclassified units, twelve conversion rates (4 regions x 3 types of census procedures) were used during the processing of the census to convert vacant housing units into occupied ones and to assign to the vacant units the number and characteristics of the persons in a neighboring unit. This is a type of synthetic estimate, and an analysis of the effects of this procedure on the population estimates for areas of different sizes is given in the paper by Gonzalez and Waksberg (1973). As a result of this procedure, 1,069,000 persons were added to the 1970 Census. The main intent of this coverage improvement procedure was to adjust for population undercoverage. The percentage of housing units initially reported as vacant but actually occupied (11.4 percent) was adjusted downward in determining the conversion rates (8.5 percent overall), because the average size of household for misclassified units was smaller than the average size of household reported in the 1970 Census. Therefore, fewer vacant housing units were converted into occupied ones than the estimate given by the NVC survey. In fact, the procedure used underimputed population, because vacant housing units were neighbors of smaller than average households in the census. The vacancy rate, computed as the percent vacant of the total nonseasonal housing units, was affected by the imputation procedure used; the imputation procedure improved the initially reported vacancy rate, but additional housing units would have needed to be converted into occupied ones to improve further the estimates of 1970 vacancy rates. Two main variables were measured in the NVC: misclassified vacant housing units, and persons living in these units.
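The logic of the downward adjustment can be sketched in a stylized way. The household sizes below are hypothetical (the paper reports only the 11.4 and 8.5 percent rates, not the underlying sizes); the sketch only shows why copying a larger neighboring household size forces a conversion rate below the NVC's misclassification rate.

```python
# Stylized sketch (hypothetical household sizes) of the downward
# adjustment of the NVC misclassification rate: converted units receive
# the household size of a neighboring census unit, which on average is
# larger than the misclassified units' own household size.

misclass_rate = 0.114        # share of "vacant" units actually occupied (NVC)
hh_size_misclassified = 2.1  # hypothetical avg persons in misclassified units
hh_size_census = 2.8         # hypothetical avg persons in imputed neighbors

# Pick a conversion rate so that (units converted) x (imputed household
# size) reproduces the person total implied by the NVC:
conversion_rate = misclass_rate * hh_size_misclassified / hh_size_census
# With these hypothetical sizes, about 0.0855: fewer units are converted
# than the NVC's 11.4 percent, as the paper describes.
```

If the imputed neighbors are instead smaller than average, the same arithmetic shows why the procedure can underimpute population even after the adjustment.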
In specifying an improved imputation procedure, it would be necessary to control both variables: the number of housing units converted from vacant to occupied, as well as the total number of persons (and their distribution by household size) to be imputed. For example, Figure B illustrates the controls needed to achieve specified housing unit and population control totals.

FIGURE B

                                             Size of household
Region 1                          Total    1    2    3    4   ...
Type of enumeration 1
  Housing units to be converted
  from vacant to occupied           X     X_1  X_2  X_3  X_4  ...
Type of enumeration 2
  Housing units to be converted
  from vacant to occupied           Y     Y_1  Y_2  Y_3  Y_4  ...

The plans for the 1980 Census of Population and Housing include an independent reinterview of all housing units initially reported as vacant or deleted from the original list of addresses, in order to be able to produce a more correct count of persons and of occupied and vacant housing units (U.S. Bureau of the Census 1978). The possibility of estimating vacancy rates intercensally for small areas requires the use of the Annual Housing Survey (national and SMSA) sample data and the Quarterly Vacancy Survey as dependent variables, and the use of regression techniques similar to those illustrated for estimating unemployment rates. Such a project needs to determine the availability of local area data which might be used as independent variables, such as building permits issued or turnover in households.

HOUSING STATISTICS: DILAPIDATED HOUSING WITH ALL PLUMBING FACILITIES

Synthetic estimates were used in the 1970 Census of Housing (Vol. VI) to provide estimates of the component of substandard housing units which were dilapidated with all plumbing facilities (DWAPF). The 1970 census procedures did not provide for individual rating of structural condition, such as sound, deteriorating, and dilapidated, as was used in the 1960 Census of Housing.
In 1970, census data on housing units with all plumbing facilities for specified areas and cells were multiplied by estimated proportions of dilapidated housing units among those with all plumbing facilities, as derived from a post-census survey, the Components of Inventory Change (CINCH), to obtain the synthetic estimates of DWAPF (Gonzalez 1973). The estimate of accuracy used to evaluate the estimates of DWAPF was the root mean square error, computed as:

    [ SUM(j=1 to G) h_ij^2 var(q_j) + (1/M) SUM(i=1 to M) (D_i* - D_i)^2 ]^(1/2)

where

    h_ij is the number of housing units with all plumbing facilities in area i for characteristic j (j=1,...,G), based on the 1970 Census of Housing;

    var(q_j) is an estimate of the variance of the proportion of dilapidated housing with all plumbing facilities for characteristic j, from CINCH;

    D_i* is the synthetic estimate of DWAPF for area i based on the 1960 Census of Housing;

    D_i is the 1960 Census of Housing 25-percent sample estimate of DWAPF for area i;

    M is the number of areas being averaged.

The first term estimates the variance; for the second, the average of the squares of the 1960 biases for a group of areas was used as an approximation of the squared bias for the 1970 DWAPF estimates for that same group (U.S. Bureau of the Census 1974a). The relative size of the estimated root mean square error depends on the size of the area being estimated. Tables 2, 3, and 4 give the relative root mean square error for geographic divisions, States, and counties, by size of estimate. These estimates provide only rough indications of the accuracy of the data, but in general for larger areas the relative root mean square error is smaller.
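The two-part structure of this error measure, a variance term from the CINCH proportions plus an average squared 1960 bias, can be sketched numerically. All values below are hypothetical and only illustrate the arithmetic, not the published estimates.

```python
# Sketch of the DWAPF root mean square error: a variance term from the
# CINCH proportions plus an average squared 1960 bias. All numbers are
# hypothetical; h[j] are housing-unit counts for one area i, by cell j.

h = [800.0, 400.0]           # units with all plumbing facilities, by cell
var_q = [1e-6, 4e-6]         # CINCH variances of the dilapidated proportions
d_synthetic_60 = [55.0, 32.0, 70.0]   # 1960-based synthetic DWAPF, 3 areas
d_census_60 = [50.0, 40.0, 60.0]      # 1960 Census 25-percent sample DWAPF

variance = sum(hj ** 2 * vj for hj, vj in zip(h, var_q))
m = len(d_census_60)
sq_bias = sum((s - d) ** 2 for s, d in zip(d_synthetic_60, d_census_60)) / m
rmse = (variance + sq_bias) ** 0.5
```

With these toy numbers the squared-bias term dominates, which mirrors the paper's caution that the estimates provide only rough indications of accuracy.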
TABLE 2
Approximate Relative Root Mean Square Error of 1970 Estimates of Dilapidated Housing Units with All Plumbing Facilities for Divisions, by Inside and Outside SMSA's

                      Relative root mean square error for division(a)
Size of estimate       Total    Inside SMSA    Outside SMSA
20,000 - 49,999          -         0.35            0.48
50,000 - 99,999        0.30        0.20            0.24
100,000 - 199,999      0.15        0.16            0.17
200,000 & over         0.15        0.12             -

a The relative root mean square error was calculated by dividing the root mean square error by the lower limit of the size class as given in Table H of U.S. Bureau of the Census (1974a).

TABLE 3
Approximate Relative Root Mean Square Error of 1970 Estimates of Dilapidated Housing Units with All Plumbing Facilities for States

                      Relative root mean
Size of estimate      square error for States(a)
1,000 - 4,999               1.00
5,000 - 9,999                .42
10,000 - 19,999              .36
20,000 - 29,999              .26
30,000 - 49,999              .23
50,000 - 99,999              .20
100,000 and over             .18

a The relative root mean square error was calculated by dividing the root mean square error by the lower limit of the size class as given in Table I of U.S. Bureau of the Census (1974a).

TABLE 4
Approximate Relative Root Mean Square Error of 1970 Estimates of Dilapidated Housing Units with All Plumbing Facilities for Counties within SMSA's, by Region

                   Relative root mean square error for county(a)
Size of estimate     Northeast   North Central   South    West
100 - 249              1.00          1.00         1.00    1.00
250 - 499              1.20           .80          .80     .80
500 - 999               .80           .60          .60     .60
1,000 - 4,999           .70           .90          .60     .80
5,000 and over          .34           .34          .32     .12

a The relative root mean square error was calculated by dividing the root mean square error by the lower limit of the size class as given in Table J of U.S. Bureau of the Census (1974a).

IMPLICATIONS FOR OTHER VARIABLES

The results presented here illustrate the uses and limitations of synthetic and regression estimates in the case of unemployment rates, housing vacancy rates, and housing units dilapidated with all plumbing facilities.
However, the methods used could be applied to other subject-matter fields; the accuracy of the resultant data would probably depend on the specific data set used. In whatever context these methodologies are applied, data relevant to the specific field are needed. For example, in the data shown on unemployment rates, the basic sources used were the 1970 Census of Population unemployment rates, as well as Current Population Survey data. In the future, synthetic estimates will be used often. We need to recognize that at present synthetic estimates are sometimes used without being recognized as such; producers of data may not always be aware of the implications for the accuracy of the data of using synthetic estimates.

FOOTNOTES

1. The terminology was first used by the U.S. National Center for Health Statistics (U.S. Department of Health, Education and Welfare).

2. 2908 "counties" were analyzed (counties with population of less than 5,000 in the 1970 census were merged with a neighboring county). SMSA counties were never merged with non-SMSA counties; counties in the 1960 or 1970 CPS design were merged only with counties in the same PSU.

3. The Bureau of Employment Security (now the Employment and Training Administration) of the Department of Labor published in 1960 a "Handbook on Estimating Unemployment" which describes the 70-step method. This Handbook specifies a series of computational steps (about 70) designed to produce unemployment estimates. These estimates are the sum of three components:

   a. Unemployed persons who were employed in an industry and were covered by unemployment insurance immediately prior to their unemployment spell.

   b. Unemployed persons who were employed in an industry and were not covered by unemployment insurance immediately prior to their unemployment spell.

   c. Unemployed persons who were new entrants and reentrants into the labor force.

   The basic building block of these estimates of unemployment is the count of insured unemployed.
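The three-component structure of the 70-step total in footnote 3 amounts to a simple sum. The counts below are hypothetical; the sketch only records the accounting identity, not the roughly 70 computational steps of the Handbook.

```python
# Sketch of the "70-step" total as the sum of its three components
# (hypothetical counts): insured unemployed, formerly employed persons
# not covered by unemployment insurance, and new entrants/reentrants.

insured = 40_000       # component (a): built from unemployment insurance counts
not_covered = 15_000   # component (b): employed before, but not covered
entrants = 10_000      # component (c): new entrants and reentrants

total_unemployment = insured + not_covered + entrants
```

Because components (b) and (c) are themselves estimated from the insured count, errors in that basic building block propagate through the whole total.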
REFERENCES

Ericksen, E.P., "A Regression Method for Estimating Population Changes of Local Areas," Journal of the American Statistical Association, Volume 69, 1974, pp. 867-875.

Fay, R.E., and Herriot, R., "Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data." Unpublished, 1978.

Gonzalez, M.E., "Use and Evaluation of Synthetic Estimates," Proceedings of the Social Statistics Section of the American Statistical Association, 1973, pp. 33-36.

Gonzalez, M.E., and Hoza, C., "Small Area Estimation with Applications to Unemployment and Housing Estimates," Journal of the American Statistical Association, Volume 73, 1978, pp. 7-15.

Gonzalez, M.E., and Waksberg, J., "Estimation of the Error of Synthetic Estimates," unpublished paper presented at the first meeting of the International Association of Survey Statisticians, Vienna, Austria, 1973, pp. 1-17.

U.S. Bureau of the Census, Census of Housing: 1970, Volume VI, Plumbing Facilities and Estimates of Dilapidated Housing, Addendum: Accuracy of Estimates, Washington, D.C.: U.S. Government Printing Office, 1974a, pp. 1-7.

U.S. Bureau of the Census, 1970 Census of Population and Housing, Effect of Special Procedures to Improve Coverage in the 1970 Census, Washington, D.C.: U.S. Government Printing Office, 1974b, pp. 11-14.

U.S. Bureau of the Census, Proposals for Coverage Evaluation of the 1980 Census, presented at the March 2, 1978, meeting of the Census Advisory Committee of the American Statistical Association. Unpublished, 1978.

U.S. Department of Health, Education and Welfare, Synthetic State Estimates of Disability, PHS Publication No. 1759, Washington, D.C.: U.S. Government Printing Office, 1968.

Some Recent Census Bureau Applications of Regression Techniques to Estimation

Robert E. Fay

INTRODUCTION

Adaptations and extensions of the classical theory of regression and linear models constitute one of the possible approaches to estimation for small areas.
This paper will describe three recent applications of this theory to problems at the Census Bureau and indicate possible future directions. Much of what is presented here must be classified as simply exploratory research; yet each of the three investigations has had tangible effects upon aspects of Bureau policy. Furthermore, with preliminary plans for evaluation of the 1980 census calling for use of regression and/or synthetic techniques to produce subnational estimates of undercount at particular levels of geography, the interest of the Bureau in these techniques may be expected to increase. Because synthetic estimates are the principal topic of this workshop, the relation between regression and synthetic estimation serves as a natural point of departure. The two are linked by their common basis in linear models. For purposes of discussion here, we shall consider a linear model over any set of geographic units i to be a representation

    c_i = SUM(j=1 to M) X_ij B_j + u_i        (1)

of a characteristic c_i in terms of a linear transformation of the predictor variables X_ij plus a residual term u_i. The common vector representation for equation (1),

    c = XB + u        (2)

will also be employed in this paper. Synthetic estimates may be expressed in the form of (1). In this instance, the X_ij's become relative or absolute frequencies of population subgroups j in units i, while the B_j's become the rates of incidence of the characteristic in subgroup j over the entire set of units. On the other hand, linear regression, or more specifically, weighted least squares, determines the vector B through

    B = (X'WX)^-1 X'Wc        (3)

where W is a diagonal matrix of weights w_i. (In some applications, not included among those presented here, W may be other than diagonal.) This second approach, unlike synthetic estimation, does not impose structural restrictions upon X.
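The weighted least squares computation in (3) can be sketched directly. The data below are hypothetical (a constant term plus one predictor for four units), and the sketch solves the normal equations rather than forming the inverse explicitly.

```python
# Sketch of equation (3), B = (X'WX)^-1 X'Wc, on hypothetical data for
# four geographic units, with a diagonal weight matrix W.
import numpy as np

X = np.array([[1.0, 0.2],    # first column: constant term (all 1's)
              [1.0, 0.5],    # second column: one predictor variable
              [1.0, 0.9],
              [1.0, 1.4]])
c = np.array([1.1, 1.6, 2.4, 3.3])   # characteristic c_i for each unit
W = np.diag([1.0, 2.0, 2.0, 1.0])    # diagonal weights w_i

# Solve (X'WX) B = X'Wc instead of inverting X'WX:
B = np.linalg.solve(X.T @ W @ X, X.T @ W @ c)
fitted = X @ B
```

The defining property of the solution is that the weighted residuals are orthogonal to the columns of X, i.e. X'W(c - XB) = 0, which is a convenient check on any implementation.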
In a sense, a synthetic estimate models relationships in the population at a micro-level, while a regression estimate models only at a macro-level. The preceding description of the linear model departs somewhat from the usual. Here, equation (2) stands by itself as a mathematical relation between the terms. The practice in most linear theory directly links this equation to a stochastic model for u, and occasionally for X or B as well. In so doing, the statistical issues in linear theory are typically grounded in the properties of infinite populations. The conceptual standard for the evaluation of small area estimates, on the other hand, is generally the complete census (whether this census is actual or hypothetical), and this standard casts the problem in the context of the finite population. Equations (2) and (3) will, therefore, represent definitions of finite population parameters, although we shall at points consider implications of stochastic assumptions.

POST-CENSAL ESTIMATION OF POPULATION

The Census Bureau currently employs (2) and (3) in one of its methods of post-censal estimation, the ratio-correlation method, at the levels of both States and counties. (In what follows, simplifications will represent the nature of the statistical problem without fully detailing the implementation. A complete description is given in U.S. Bureau of the Census (1976).) The X_ij's are taken to represent the ratio of change in indicator variable j in unit i to the change at the national level (or, in the case of counties, the State level) in this manner:

            Value of j at current year, unit i
            Value of j at census year, unit i
    X_ij = --------------------------------------        (4)
            Value of j at current year, total
            Value of j at census year, total

Examples of indicator variables are data on school enrollment, automobile registration, tax returns, and labor force size. The c_i's are the corresponding rates of change in population and are defined analogously to (4).
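The indicator ratio in (4) is a ratio of two growth factors. A minimal sketch, with hypothetical enrollment counts (the helper name is illustrative, not Bureau code):

```python
# Sketch of the indicator ratio in (4): the unit's growth factor for
# indicator j, divided by the national (or State) growth factor.
# The enrollment counts below are hypothetical.

def indicator_ratio(unit_current, unit_census, total_current, total_census):
    """(unit change) / (total change), as in equation (4)."""
    return (unit_current / unit_census) / (total_current / total_census)

# Index enrollment at 100 in the census year; the unit grows 8 percent
# while the national total grows 2 percent:
x_ij = indicator_ratio(108.0, 100.0, 102.0, 100.0)   # 1.08 / 1.02
```

A value above 1 indicates the unit grew faster than the nation on that indicator, which is exactly the relative-change reading the text gives these variables.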
For example, if school enrollment decreases by 5 percent nationally but increases by 14 percent in a particular State, the value for the corresponding X_ij would be 1.20 (= 1.14/.95). If the same State's population grew by 32 percent during a period in which the national growth was 10 percent, the value of c_i would be 1.20 also (= 1.32/1.10). In a sense, therefore, each of the indicator variables is expressed in a form to indicate directly the relative change in population compared to the national rate of change. The B_j's act as weights to combine the various changes implied by the indicator variables. The current practice is not to force the weights to sum to unity but to include a constant term in the model as well, equivalent to setting X_ij = 1 for all i and some j. Current estimates of population are computed as XB, where the X_ij are defined according to (4) for the current year relative to 1970. The Census Bureau derives B as B(60,70), the application of (3) to the 1960-1970 decade (that is, with X_ij and c_i defined as in (4) with 1970 as the current year and 1960 as the census year). W has been taken to be the identity matrix, thus giving equal weight to the geographic units. Ericksen (1973, 1974) first outlined and investigated a technique, the regression-sample method, to estimate the current coefficients, B_c, that would result from (3) if a census were taken to determine the true values of c_i. He proposed the use of Y_i, sample estimates of the relative growth since 1970 in each sampled primary sampling unit (PSU, a county or group of counties) in the Current Population Survey (CPS). Using the X_ij's for the current year relative to 1970,

    B^ = (X'WX)^-1 X'WY        (5)

estimates B_c. Because of considerations of sampling variance in Y, he employed weights w_i approximately inversely proportional to the estimated sampling variance of Y_i. Ericksen delineated three sources of error in the estimates:

1. The random error not explained by the indicators.

2.
The error due to structural changes in regression.
  3. The sampling errors in the CPS estimates.

He noted that the ratio-correlation method and regression-sample method are equally subject to the first source of error, whereas ratio-correlation is affected by the second and regression-sample by the third.

Another fundamental idea appears in these papers by Ericksen, namely, that the sample data may provide an estimate of an average mean square error for the current estimates. In this computation, the average square of bias is defined as

  σ_b² = u'Wu / 1'W1   (6)

where u was defined by (2), and 1 is a column vector of 1's. With w_i taken as the sampling variance of Y_i,

  E((Y − Xβ̂)'W(Y − Xβ̂)) = (n − p) + σ_b² 1'W1   (7)

where n is the number of Y_i's and p is the rank of X. (The notation and some constants here have been altered from Ericksen's original paper in order to set the problem in the finite population context, although neither this nor his paper fully attacks the exact constants required to represent the effects of the first-stage selection in CPS. The practical consequences are trivial, however.) In this manner, the sample data may be used to measure the magnitude of error from the changes not explained by the indicators; classical regression theory gives the error due to sampling error in Y. Consequently, both components of the error may be estimated.

William Madow first noted (in a seminar given at the Census Bureau) that a judiciously selected weighted combination of Xβ̂ and Y would produce estimates with smaller average error than Xβ̂. For example, the combination

  ĉ_i = (Xβ̂)_i + (1 − w_i / (w_i + σ_b²)) (Y_i − (Xβ̂)_i)   (8)

where w_i is again the sampling variance of Y_i, is related to the original James-Stein estimator. The application of (8) or similar combinations has insignificant effects in this instance because of the large sampling error in Y, but similar formulas play a central role in a third example to be discussed here.
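The combination in (8) is an inverse-variance shrinkage; a minimal sketch, with all numeric values hypothetical:

```python
def combine(reg, sample, w_i, sigma_b2):
    """Equation (8): weighted combination of the regression estimate (Xb)_i
    and the sample estimate Y_i, where w_i is the sampling variance of Y_i
    and sigma_b2 the average squared bias of the regression."""
    shrink = 1.0 - w_i / (w_i + sigma_b2)    # weight given to the sample data
    return reg + shrink * (sample - reg)

# Hypothetical values: a precise sample estimate dominates the combination;
# a noisy one is pulled back toward the regression estimate.
precise = combine(reg=1.00, sample=1.20, w_i=0.001, sigma_b2=0.01)
noisy   = combine(reg=1.00, sample=1.20, w_i=0.10,  sigma_b2=0.01)
assert abs(precise - 1.00) > abs(noisy - 1.00)
```

When w_i is large relative to σ_b², as with the CPS growth estimates, the weight on the sample term is near zero, which is why (8) has little effect in this application.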
If the finite population is the standard for evaluation, three other possible sources of error in the regression estimates deserve addition to Ericksen's list:

  4. The error due to differences between the population regression equations for sampling units (PSU's) and for the units of analysis (States or counties).
  5. The error arising from bias in the sample data.
  6. The consequences of redistributing error among units by altering the weights in the regression.

All three factors are at issue in this application: the use of the PSU in substitution for direct analysis of States or counties, deficiencies and lags in the CPS sampling frame whose effects may be distributed unevenly across the country, and a possibly undue emphasis in the weighting on estimating the most populous units (efficient in terms of sampling error but possibly undesirable as a population parameter). Several questions thus remain unanswered as to the practical merit of Ericksen's suggestion in this case, although his idea may have significant effects elsewhere.

A separate section of this paper describes alternative statistical procedures that may be used to provide evidence on how the current indicators should be weighted to estimate population change. Ericksen had formulated the problem as a dichotomy between use of past relationships applied without evidence of their currency and sample-regression methods that make an effort to be current at the cost of substantial sampling error. Relationships between the indicator variables themselves may be examined. Since this approach is unrelated to the methods in the other two applications to be discussed here, this topic is deferred to the end of the paper.

CHILDREN IN POVERTY

The second example to be discussed here is a direct application of Ericksen's regression-sample method to the problem of estimating the proportion of school-age children living in poverty families by State.
Congress has employed census counts of these children by county in apportioning approximately $2 billion annually under Title I of the Elementary and Secondary Education Act of 1965. Recognizing the potential for change since 1970 in the relative distribution of poor children among States, Congress included in the Educational Amendments of 1974 a directive to the Secretaries of Commerce and of Health, Education, and Welfare to conduct a survey to produce sample estimates of children in poverty families by State. In compliance with this legislation, the Census Bureau carried out the Survey of Income and Education (SIE) in the Spring of 1976.

In 1975, prior to the SIE, research at the Census Bureau explored other techniques to estimate the proportion of children in poverty families by State. After initial investigations of regression models of the 1970 proportions of children in poverty using other 1970 data, it became apparent that these equations were unlikely to carry forward in time adequately. This problem with a fixed regression model based upon the preceding census is, of course, the second source of error listed earlier that had been identified by Ericksen, namely, "the error due to structural changes in regression." Consequently, an adaptation of the sample-regression method was attempted, again using the CPS to provide current sample estimates of the dependent variable, Y_i, this time the proportion of children 5 to 17 years old in poverty families in each State. Unlike Ericksen's experiments with predicting changes in population, the sample data were employed at the State, rather than PSU, level.

Experimental regressions, modeling 1970 poverty rates for families by State based upon 1960 census and other data available independently of the 1970 census, pointed to the fundamental importance of total income. Estimates of Per Capita Personal Income (PCI) published annually by the Bureau of Economic Analysis (BEA) are employed in the model.
Other variables associated with poverty, including female headship, racial composition, unemployment, and region, did not appreciably add to the explanation afforded by the model.

The final model proposed for years after 1970 consists of five independent variables plus a constant term. The poverty rate for children from the 1970 census is the first, while two variables are formed from BEA PCI for the census year (income year 1969) by first finding the median of the 51 State (and D.C.) PCI figures, PCI_m, and computing

  X_i2 = ln(PCI_i / PCI_m)   if PCI_i > PCI_m   (9)
       = 0                   otherwise

  X_i3 = 0                   if PCI_i > PCI_m   (10)
       = ln(PCI_i / PCI_m)   otherwise .

The variables X_i4 and X_i5 are formed similarly from BEA PCI for the current year (the year immediately preceding the survey date), and, finally, X_i6 is taken to be identically 1, so that β_6 is the constant term.

The assessment of this technique was originally based upon its performance in relation to the 1970 census. A parallel model was developed for the proportion of families in poverty, with 1960 as the base year and 1970 as the current year. The 1970 census values for the proportion of families in poverty were used in place of sample estimates as the dependent variable. Thus, the lack of fit in this case is the bias of the model. When this research was conducted in 1975, an effort was made to characterize the distribution of these biases. The principal determinant seemed to be size: when States were grouped into four strata by population, the largest States had errors averaging only four percent, while the second group averaged about six, and the smaller groups, ten. Other experiments suggested that the relative error for children in poverty was likely to be approximately the same as for families in poverty, so these relative errors were interpreted as rough indications of the level of error for children. (The lack of counts from the 1960 census of children in poverty by State necessitated this indirect evaluation.)
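The pair of income variables in (9) and (10) splits log relative income into an above-median piece and a below-median piece; a small sketch, with hypothetical PCI figures around a median of $4,000:

```python
import math

def pci_variables(pci_i, pci_median):
    """Equations (9)-(10): return (X_i2, X_i3), the above-median and
    below-median pieces of log relative per capita income."""
    if pci_i > pci_median:
        return math.log(pci_i / pci_median), 0.0
    return 0.0, math.log(pci_i / pci_median)

# Hypothetical States: one 10 percent above the median, one 10 percent below.
x2, x3 = pci_variables(4400.0, 4000.0)
assert x3 == 0.0 and x2 > 0.0        # above the median: only X_i2 is nonzero
x2, x3 = pci_variables(3600.0, 4000.0)
assert x2 == 0.0 and x3 < 0.0        # below the median: only X_i3 is nonzero
```

This construction lets the fitted regression assign different slopes to income above and below the median, rather than forcing a single linear income effect.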
The sampling errors for CPS State estimates of the proportion of children in poverty are simply too large to support the estimation by (7) of the average error as suggested by Ericksen. It is possible, however, to compute the sampling variance of the regression estimate for each State and to add an allowance for bias based upon the 1960-1970 test regression for families in poverty. With these estimates of the components of error, it is also possible to weight the sample and regression estimates together, as in (8). In only two States, however, New York and California, does the weight on the sample estimate exceed .2 in this computation.

As mentioned earlier, the legislative directive was for a survey sufficient to produce State estimates. The 1976 SIE was of adequate size and design for this purpose, and in fact the sampling variances for States were generally lower than the preceding research suggested could be obtained as mean square errors for the regression estimates from CPS. From the perspective of 1975, therefore, the SIE seemed to afford an opportunity for a definitive evaluation of the regression estimates. In particular, the computation (7) of the mean square error for the regression estimates could be performed with the expectation of interpretable results, unlike the situation with CPS. In point of fact, however, the relationship between the regression and SIE estimates turned out to be more complex. In two important respects to be described here, the regression results served the purposes of the survey, once in the design and later in the evaluation, whereas a precise assessment of the bias of the regression model itself could not be obtained.

Under an agreement with the respective legislative committees, a specification for a coefficient of variation of 10 percent on the SIE estimate of the number of poor children in each State was chosen.
This specification created some difficulty, since an efficient and practicable survey design required prior estimates of the current poverty rates for children in each State. If a prior estimate in a given State was too high, an insufficient sample size would have resulted, and the specifications would not have been met. In order to provide some protection against this occurrence, both the 1970 census poverty rates and the regression estimates based upon the March 1975 CPS were considered, and the smaller of each pair was used for purposes of design. Thus, the regression estimates helped to target additional sample to States in which the poverty rate had decreased since the 1970 census.

The regression estimates proved even more valuable in evaluating the SIE. The whole question of evaluation was critical in the case of this survey: for the first time Congress specifically legislated that an evaluation be performed, by requiring a report on the outcome of the survey, "including analysis of its accuracy and the potential utility of the data derived therefrom ..." In response to this directive, the Census Bureau conducted an extensive evaluation of the SIE results. The principal basis for the evaluation was a reinterview of an approximately five-percent sample of SIE and of CPS households by more intensive interviewing techniques. (This reinterview survey is described in Fay (1978) and in the U.S. Bureau of the Census report (1978), "Assessment of the Accuracy of the Survey of Income and Education.") The SIE yielded results that appeared to require explanation; in particular, the SIE national estimate of children in poverty was 12 percent below the corresponding value obtained by the CPS, a result that could not be ascribed to sampling error alone.
On this point the reinterview data supported the SIE: there was no significant change in the national estimate in the SIE reinterview, whereas the reinterview result for CPS lowered the CPS estimate by about 20 percent. The CPS reinterview estimate consequently stood within sampling error of the original SIE result but not within sampling error of the CPS result. The SIE reinterview also detected no statistically significant bias by region or division.

Other questions could not be answered by the reinterview alone. The significance of the difference between the SIE and CPS national estimates is compounded by the fact that the 1970 CPS produced an estimate for children in poverty about 10 percent lower than the 1970 census. By combining these differences, it could be argued that had a national census been taken in 1976, the result for children in poverty might have exceeded the SIE by over 20 percent. Others suggested that, because of this potentially large difference in level, the SIE results for the distribution of poverty among States would be essentially incompatible with the census measurement of poverty. (See, for example, Ginsberg and Grob (1977).) The CPS regression estimates provided the most direct evidence on this question, since they linked 1970 to 1976 by an annual series obtained from a consistent methodology. Figures 1 to 4 show the trends in the series by division over this period, expressing the estimates in terms of the percent of the total number of poor children residing in each region. In essentially every case, the direction of change in the proportion of the total number of children in poverty agrees with the conclusions obtained in comparing the census and SIE; the Northeast, East North Central, and Pacific States have increased their share of the total, while a substantial decline has occurred throughout the South.
This evidence implies that the SIE and census procedures would measure essentially the same distribution of poverty among States even though their national levels may differ markedly.

When the regression equation is fitted to the SIE data, there is a relatively strong agreement between the regression and sample estimates for the proportion of children in poverty by State. Table 1 shows these results. The average difference between the two sets is 14 percent (root mean square), whereas the average difference between the SIE and 1970 census values is 23 percent. Since the sampling error in the SIE estimates was approximately 10 percent, (7) gives an average bias in the regression of about 10 percent (14² ≈ 10² + 10²).

The most remarkable outcome, however, comes from the comparison of the regression and reinterview. When each is classified by the direction of difference from the SIE, Table 2a results. Thus, there is an apparent statistical agreement between the two. A covariance adjustment to the SIE estimates, which did not change the reinterview measures of shift, produces Table 2b, which shows a highly significant relation. (The nature of the covariance adjustment and other specifics of the analysis are described in the report.) Consequently, the reinterview, which had not otherwise been noted to demonstrate any consistent pattern of shift, actually does measure a component of non-sampling error in the SIE State estimates. Analysis indicated that the magnitude of the non-sampling error was roughly 7 percent, although this result is measured to limited precision because of the large sampling error in the reinterview estimates. Since the non-sampling error in the SIE is included in the preceding estimate from (7) of a 10 percent average bias for the regression, it is difficult to establish precisely the actual level of bias for the regression if the non-sampling error in the SIE were excluded, except to say that it is less than 10 percent, perhaps 7 percent.
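The average-bias computation from (7) amounts to subtracting the SIE sampling error from the total root-mean-square difference in quadrature; a quick numeric check:

```python
import math

# Total (root mean square) difference between regression and SIE estimates,
# and the SIE sampling error, both in percent, as quoted in the text.
total_rms = 14.0
sampling  = 10.0

# Implied average bias of the regression: subtract in quadrature.
bias = math.sqrt(total_rms**2 - sampling**2)
assert 9.5 < bias < 10.0    # roughly 9.8 percent, i.e., "about 10 percent"
```

The quadrature subtraction is valid only to the extent that the model bias and the SIE sampling error are uncorrelated across States.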
The last finding represents possibly the first application of a technique to measure non-sampling error. Whether other applications are possible will depend upon the availability of both a successful model and independent estimates of net survey error that are obtained by a more controlled process than the original survey.

FIGURE 1. Model estimates based on CPS of the percent of total poor children in the Northeast Region, by income year (1967-1976) and division (New England; Middle Atlantic). 1970 census and 1976 SIE estimates shown circled.

FIGURE 2. Model estimates based on CPS of the percent of total poor children in the North Central Region, by income year (1967-1976) and division (East North Central; West North Central). 1970 census and 1976 SIE estimates shown circled.

FIGURE 3. Model estimates based on CPS of the percent of total poor children in the South Region, by income year (1967-1976) and division (South Atlantic; East South Central; West South Central). 1970 census and 1976 SIE estimates shown circled.

FIGURE 4. Model estimates based on CPS of the percent of total poor children in the West Region, by income year (1967-1976) and division (Mountain; Pacific). 1970 census and 1976 SIE estimates shown circled.

TABLE 1. Percent of Children Age 5-17 in Poverty Families According to 1970 Census, SIE, and Regression Model

  Divisions, Regions,       1969 Estimate         1975 Estimates
  and States                1970 Census        SIE     Regression Model

  UNITED STATES, TOTAL

  NORTHEAST
  New England
    Maine                      14.2            15.3        14.2
    New Hampshire               7.7            10.3        10.5
    Vermont                    11.4            17.8        11.9
    Massachusetts               8.4             9.3        10.6
    Rhode Island               11.0            10.5        11.8
    Connecticut
                                --              8.4         9.6
  Middle Atlantic
    New York                   12.2            13.1        13.8
    New Jersey                  8.7            11.6        10.2
    Pennsylvania               10.6            12.6        10.9

  NORTH CENTRAL
  East North Central
    Ohio                        9.8            11.6        11.8
    Indiana                     9.0             9.6        10.8
    Illinois                   10.7            15.1        10.8
    Michigan                    9.1            11.3        11.2
    Wisconsin                   8.7             9.4         9.6
  West North Central
    Minnesota                   9.5             9.1         9.7
    Iowa                        9.8             7.9         8.2
    Missouri                   14.8            14.7        14.8
    North Dakota               15.7            11.5        10.4
    South Dakota               18.3            13.1        15.3
    Nebraska                   12.0            10.1        10.3
    Kansas                     11.5             8.6        10.2

  SOUTH
  South Atlantic
    Delaware                   12.0            10.4        12.3
    Maryland                   11.5            10.7        11.2
    District of Columbia       23.2            15.7        17.8
    Virginia                   18.2            13.7        13.9
    West Virginia              24.3            18.9        18.2
    North Carolina             24.0            17.8        20.2
    South Carolina             29.1            23.9        23.4
    Georgia                    24.4            21.3        20.9
    Florida                    18.9            21.6        16.6
  East South Central
    Kentucky                   25.1            21.4        20.2
    Tennessee                  24.8            20.5        20.2
    Alabama                    29.5            15.9        23.1
    Mississippi                41.5            32.6        32.2
  West South Central
    Arkansas                   31.6            21.4        23.8
    Louisiana                  30.1            22.9        23.8
    Oklahoma                   19.5            14.6        16.2
    Texas                      21.5            20.5        17.7

  WEST
  Mountain
    Montana                    12.9            12.5        10.8
    Idaho                      12.0            11.0        10.5
    Wyoming                    11.2             8.6         8.2
    Colorado                   12.3            10.7        10.7
    New Mexico                 26.3            26.0        21.2
    Arizona                    17.5            16.8        16.1
    Utah                       10.0             8.0         9.4
    Nevada                      8.8            11.0         9.8
  Pacific
    Washington                  9.3            10.0        10.2
    Oregon                     10.3             8.4        10.2
    California                 12.1            13.8        12.5
    Alaska                     14.6             6.4         6.9
    Hawaii                      9.7             9.6         9.8

TABLE 2a. Comparison of Reinterview, Model, and SIE Estimates of Children 5-17 Years Old in Poverty Families by State (see text for explanation)

                                      Comparison of Model to SIE
  Comparison of                 States with Model      States with Model
  Reinterview to SIE            Estimate Less          Estimate Greater
                                than SIE               than SIE

  States with reinterview
  less than SIE                       12                    10
  States with reinterview
  greater than SIE                    10                    18

  NOTE: One State is omitted because of an estimate of no change in reinterview.
TABLE 2b. Comparison of Reinterview, Model, and Adjusted SIE Estimates of Children 5-17 Years Old in Poverty Families by State (see text for explanation)

                                      Comparison of Model to Adjusted SIE
  Comparison of                 States with Model      States with Model
  Reinterview to SIE            Estimate Less than     Estimate Greater than
                                Adjusted SIE           Adjusted SIE

  States with reinterview
  less than SIE                       15                    --
  States with reinterview
  greater than SIE                    --                    19

  NOTE: Two States are omitted: one with an estimate of no change in reinterview, and the other with an estimate of no difference (within 0.5 percent) between the model and SIE.

ESTIMATES OF INCOME FOR SMALL PLACES

The Census Bureau provides the Department of the Treasury with current estimates of per capita income and population for approximately 39,500 units of local government participating in the Revenue Sharing Program. In general, these estimates represent an updating of census values by factors derived from administrative data. A significant exception occurred for the roughly 15,000 places of size under 500 persons, where the 1970 census values for county PCI were substituted as base figures for these places in preparing the first sets of estimates for income, because of the magnitude of sampling error in the 1970 census 20-percent sample estimates; for example, the coefficient of variation for PCI in the 1970 census was about 30 percent for places with a population of 100 persons.

This situation falls rather easily into the framework constructed by Ericksen: sample estimates (from the census) are available for the variable of interest, and there is a presumed relationship to a predictor variable, the county PCI. Two other variables could also be added to the analysis: the value of owner-occupied housing obtained in the 1970 census (a 100-percent housing item) and the adjusted gross income per exemption from Internal Revenue Service data for 1969, although usable data were available for only a subset of the places in each case.
The other notion incorporated into the estimation, that of combining the sample and regression estimates, appeared in the two preceding examples, but in either instance the CPS data were unable to reduce appreciably the error of the estimates. In the case at hand, however, the contribution of the sample data was potentially significant. For example, a cursory examination of sample estimates for these places compared to the county values of PCI revealed a considerable number outside the usual range of sampling error, some by large multiples of the standard error. In consideration of this, the James-Stein estimator was adapted to this problem to provide a means to combine the sample and regression estimates.

Efron and Morris (for example, (1972), (1973), and (1975)) have argued and illustrated the potential utility of the James-Stein estimator in diverse problems of multivariate estimation. The estimator can be motivated by the observation that for k sample estimates Y_i with equal variances D and means θ_i, and for any set of fixed constants P_i, the estimator Z_a of θ,

  Z_a = P + a(Y − P)   (11)

for fixed a has its expected square error

  R(θ, Z_a) = E_θ((θ − Z_a)'(θ − Z_a))   (12)

minimized by the choice

  a = A / (A + D)   (13)

for

  A = (θ − P)'(θ − P) / k .  (14)

With this a, the value of (12) is kaD, less than the value of (12), kD, for Y itself. The James-Stein estimator, for k ≥ 3, is simply (11) with a estimated from the data as

  â = 1 − (k − 2) D / S   (15)

for

  S = (Y − P)'(Y − P) .  (16)

Thus, differences between the sample estimates Y and prior estimates P are assessed to determine how much weight the sample data should receive: if P fits poorly, the sample estimates receive more weight than when differences are small relative to sampling error. Efron and Morris have extended and refined the estimator. One suggestion of theirs, critically important in this application, effects a compromise between overall error, as in (12), and the error of individual components.
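Equations (11) and (15)-(16) can be sketched and checked on synthetic data; the prior of zero and the simulated means below are illustrative only, and numpy is assumed.

```python
import numpy as np

def james_stein(Y, P, D):
    """Equations (11), (15), (16): shrink the k sample estimates Y toward
    the prior estimates P, for common sampling variance D and k >= 3."""
    Y, P = np.asarray(Y, float), np.asarray(P, float)
    k = Y.size
    S = float((Y - P) @ (Y - P))             # equation (16)
    a = 1.0 - (k - 2) * D / S                # equation (15)
    return P + a * (Y - P)                   # equation (11)

# Synthetic check: with many components, the shrunken estimates should have
# smaller total squared error than the raw sample estimates Y.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 200)            # hypothetical true means
Y = theta + rng.normal(0.0, 1.0, 200)        # sample estimates, D = 1
Z = james_stein(Y, P=np.zeros(200), D=1.0)
assert np.sum((Z - theta) ** 2) < np.sum((Y - theta) ** 2)
```

Here A ≈ D, so the estimated weight a lands near one half, and the total squared error is cut roughly in half, in line with the kaD risk in (12)-(14).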
The modification is to use the sample data to limit the reliance upon the prior estimates by constraining the final estimates to lie within some specified distance, usually a fixed multiple of the standard error, of the sample estimate for each component of θ. Thus, the estimator shrinks the data toward the prior estimates and maintains most of the resulting overall advantage, while guarding against unacceptably large risk to any individual component.

The program of estimation in this application may be outlined as follows:

  1. Fitting a regression equation to the census sample estimates.
  2. Measuring the goodness of fit between the regression equation and the sample data, taking into account the contribution of sampling error to the observed differences.
  3. Forming a weighted estimate of the sample and regression estimates, letting the weights reflect the relative fit of the regression and the sampling error of the sample estimate.
  4. Constraining each weighted combination to lie within one standard error of the sample estimate.

For purposes of estimation, Y_i was expressed as the logarithm of the sample estimate. (Since the sample estimates have approximately a constant coefficient of variation for a given sample size, the logarithm of the sample estimate has approximately a constant variance for a given sample size.) In turn, all independent variables were similarly converted into logarithmic form. Separate regressions were fitted for each State and for each of the two groups of places, those under 500 population and those of 500-999; reduced equations were employed for places lacking housing or IRS data. The strategy was to estimate A as in (13) and to reflect this value both in combining the regression and sample data and in weighting the regression.
The regression estimates were

  Ŷ = X(X'WX)⁻¹ X'WY   (17)

with

  W_ii = 1 / (A + D_i) ,

where D_i is the sampling variance of Y_i, and A > 0 was determined iteratively as the unique solution to

  (Y − Ŷ)'W(Y − Ŷ) = n − p   (18)

for p, the rank of X, and n, the number of Y_i's. (If no positive solution existed, A was set to 0.) Each value was then estimated as

  θ̂_i = Y_i + (D_i)^(1/2)   if δ_i > Y_i + (D_i)^(1/2)   (19)

  θ̂_i = δ_i                 if |δ_i − Y_i| ≤ (D_i)^(1/2)   (20)

  θ̂_i = Y_i − (D_i)^(1/2)   if δ_i < Y_i − (D_i)^(1/2)   (21)

where

  δ_i = Ŷ_i + (A / (A + D_i)) (Y_i − Ŷ_i) .  (22)

The A obtained through the solution of these equations measures an average lack of fit between the regression and true values. Table 3 gives values of A from the estimation for places of population under 500 in States with the largest number of such places, and, similarly, Table 4 shows results for places of population 500-999. Roughly, A is in units

TABLE 3. Estimated A for Places with 20-Percent Sample Estimates of Population Less than 500

                                     Regression Equation
  STATES                 County    County     County       County, Tax,
                                   and Tax    and Housing  and Housing

  a. States with More than 500 Places in Class
  Illinois                .036      .032       .019         .017
  Iowa                    .029      .011       .017         .000
  Kansas                  .064      .048       .016         .020
  Minnesota               .063      .055       .014         .019
  Missouri                .061      .033       .034         .017
  Nebraska                .065      .041       .019         .000
  North Dakota            .072      .081       .020         .004
  South Dakota            .138      .138       .014          --
  Wisconsin               .042      .025       .025         .004

  b. States with 200-500 Places in Class
  Arkansas                .074      .036       .039         .018
  Georgia                 .056      .081       .067         .114
  Indiana                 .040      .012       .003         .000
  Maine                   .052      .015        --           --
  Michigan                .040      .032       .028         .023
  Ohio                    .034      .015       .004         .004
  Oklahoma                .063      .027       .049         .036
  Pennsylvania            .020      .018       .016         .011
  Texas                   .092      .048       .056         .040

  NOTE: A dash (--) indicates that the regression was not fitted because of too few observations.

TABLE 4. Estimated A for Places with 20-Percent Sample Estimates of Population 500-999

                                     Regression Equation
  STATES                 County    County     County       County, Tax,
                                   and Tax    and Housing  and Housing

  a.
States with More than 250 Places in Class
  Illinois                .032      .023       .012         .008
  Indiana                 .017      .014       .007         .009
  Michigan                .019      .014       .005         .008
  Minnesota               .056      .040       .021         .007
  New York                .052      .015       .028         .006
  Ohio                    .024      .010       .005         .000
  Pennsylvania            .035      .025       .015         .026
  Wisconsin               .039      .030       .014          --

  b. States with 100-250 Places in Class
  Iowa                    .017      .005       .016         .004
  Kansas                  .025      .010       .014         .008
  Maine                   .022      .021        --           --
  Missouri                .042      .019       .011         .013
  Nebraska                .027      .007       .008         .008
  Texas                   .050      .017       .013         .012

  NOTE: A dash (--) indicates that the regression was not fitted because of too few observations.

equivalent to squared relative error, so that .040 corresponds to about a 20-percent average error. A place of 225 persons has a c.v. of about 20 percent also; thus, Table 3 indicates that, for places of this size, (22) weights the sample data more heavily than the regression estimate in the majority of cases for the county-only equation. When other variables were available for inclusion, the values of A were generally considerably lower, indicating a substantially better fit.

Two further investigations of the performance of the James-Stein estimator were made in this application. In 1973, the Bureau of the Census conducted special censuses of a random sample of places, some of which had 1970 populations under 1000. These censuses collected 1972 income on a 100-percent, rather than sample, basis. Table 5 displays the comparison between the special census results for places falling into this category and alternative estimates based upon updating county or place sample estimates from the 1970 census or the James-Stein estimates. Thus, the table offers only an indirect assessment of the relative merits of the three base figures, as the resulting estimates for 1972 were equally affected by error in the common updating factor.
Of the three, the set based upon the James-Stein estimates shows smaller average error (measured as absolute percent difference) and appears considerably better than the county values. (The tendency for the 1972 special census estimates to appear lower than the other estimates also occurs for the remaining special censuses for larger places and probably reflects principally the consequences of not imputing income for non-response in the processing of the special census returns.)

A second investigation served to demonstrate that the true values for places of this size differed in general from their respective county values, and that the James-Stein estimator was a useful mechanism to achieve a reduction in sampling error while preserving much of the actual variation. A sample of places with usable IRS estimates was sorted by adjusted gross income per exemption and then aggregated in order into groups of ten. The census sample estimate for per capita income of the groups as a whole was thus considerably more accurate than for the individual components and could be taken as an accurate estimate for the group. Table 6 displays comparisons of the sample estimates for these groups with aggregated estimates using the James-Stein or the county estimates. According to each measure of spread considered in the table, the aggregated values of the James-Stein estimates more closely matched the sample estimates than did the county values, by a substantial margin, in fact.

The Census Bureau has incorporated the James-Stein estimates as base figures into its computation of per capita income for 1974 and subsequent years. This represents perhaps one of the largest, if not the largest, formal applications of this estimator in a Federal statistical series.
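The constrained combination in (19)-(22) can be sketched compactly; the log-PCI values, variances, and value of A below are hypothetical, chosen so that the one-standard-error constraint binds for one place. numpy is assumed.

```python
import numpy as np

def constrained_estimates(Y, Yhat, D, A):
    """Equations (19)-(22): shrink each sample estimate Y_i toward its
    regression estimate Yhat_i with weight A/(A + D_i) on the sample data,
    then constrain the result to within one standard error of Y_i."""
    delta = Yhat + (A / (A + D)) * (Y - Yhat)    # equation (22)
    se = np.sqrt(D)
    return np.clip(delta, Y - se, Y + se)        # equations (19)-(21)

# Hypothetical log-PCI values for three places; the middle place has a
# regression estimate far from its sample estimate, so the constraint binds.
Y    = np.array([7.90, 7.60, 8.10])    # census sample estimates (logs)
Yhat = np.array([7.95, 8.40, 8.12])    # regression estimates
D    = np.array([0.04, 0.01, 0.04])    # sampling variances of the Y_i
est = constrained_estimates(Y, Yhat, D, A=0.04)
assert np.all(np.abs(est - Y) <= np.sqrt(D) + 1e-12)
```

For the middle place the unconstrained value (22) would be 7.76, but the constraint caps the move at one standard error, giving 7.70.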
TABLE 5. Comparison of Selected 1972 PCI Estimates to 1972 Special Census PCI Values

                                      1972 PCI Estimates and Percent Difference
                                      from Special Census PCI
                            1972
                            Special   Census Sample    James-Stein      County or
  AREAS                     Census    Base             Base             MCD Base
                            PCI       Est.    Diff.    Est.    Diff.    Est.    Diff.

  a. 1970 Census Weighted Sample Population Less than 500
  Newington, GA             2,019     2,225   10.2     2,302   14.0     2,279   12.9
  Foosland Village, IL      2,899     2,771    4.4     3,199   10.3     3,796   30.9
  Bonaparte, IA             2,331     3,126   34.1     2,942   26.2     2,542    9.1
  McNary, IA                2,333     2,303    1.3     2,527    8.3     2,908   24.6
  Freeborn Village, MN      2,741     3,693   34.7     3,338   21.8     2,922    6.6
  Spruce Valley Twp., MN    2,430     1,894   22.1     1,949   16.8     2,076   14.6
  Jacksonville, MO          2,723     2,338   14.1     2,611    4.1     3,233   18.7
  Thayer, NE                2,742     2,245   18.1     2,870    4.7     3,452   25.9
  Benton Town, NH           1,788     2,874   60.7     3,284   78.7     3,570   99.7
  Nora Township, ND         1,780     2,629   47.7     2,754   54.7     3,476   95.3
  Riga Township, ND         1,454     2,749   89.1     2,411   65.8     2,711   86.5
  Deer Creek, OK            2,451     2,493    1.7     2,673    9.1     2,762   12.7
  Dudley Borough, PA        2,446     2,168   11.4     2,411    1.4     2,608    6.6
  Brookings Township, SD    3,132     3,400    8.6     3,309    5.7     2,395   23.5
  Valley Township, SD       1,574     1,946   23.6     1,972   25.3     2,114   34.3
  Bryant Township, SD       2,412     1,120   53.6     2,158   10.5     2,695   11.7
  Parrish Town, WI          3,567     5,399   51.4     4,079   14.4     2,722   23.7
  Average, all areas          --        --    28.6       --    22.0       --    31.6

  b. 1970 Census Weighted Sample Population Between 500 and 999
  Caswell Plantation, ME    1,946     2,656   36.5     2,490   28.0     2,646   36.0
  Sugar Creek Township, MD  2,224     2,035    8.5     2,315    4.1     2,018    9.3
  Jeromesville, OH          3,329     3,081    7.4     3,418    2.7     3,072    7.7
  Rush Township, OH         2,241     2,545   13.6     2,619   16.9     2,546   13.6
  Dennison Township, PA     3,521     4,411   25.3     4,095   16.3     4,430   25.8
  Manor, TX                 2,062     2,746   33.2     2,765   34.1     2,740   32.9
  Derby Center, VT          2,968     2,694    9.2     2,754    7.2     2,675    9.9
  Average, all areas          --        --    19.1       --    15.6       --    19.3

NOTE: "Diff." is the absolute percent difference from the special census PCI.
"Average, all areas," is average of absolute percent differences. TABLE 6 Relation of 1969 Revised Estimates and 1969 County Averages to 1970 Census Sample Estimates for Groups of Ten (for places with the ratio of 1969 IRS exemptions to 1970 census population between .8 and 1.1) Relation to 1969 Sample Estimates 1969 Revised Estimates Number Percent 1969 County Averages Number Percent Total Groups Within 10% of Sample PCI Outside 10% of Sample PCI Within One Standard Error Between 1 and 2 Standard Errors Outside 2 Standard Errors Closer to Sample PCI 212 100. 81. 18. 70. 13, 16. 72. 0 1 9 UID Ww 212 100.0 111 52.4 101 47.6 61 28.8 60 28.3 91 42.9 58 27.4 177 THE PROBLEM OF TWO REGRESSIONS The regression paradox or the problem of two regressions appears in most texts on linear regression. If we restrict the problem temporarily to univariate regression, including a constant term, the least squares esti- mate of the regression of Y on Z is based on the coefficient LZ -D 0-0 1 by -— (23) L (24 - 7) i for Z =z Zi/n (24) i Y= g Y;/n (25) 1 whereas the regression of Z on Y gives the coefficient . : (2; - 7) (Y; © Y) b, = 7 2 (26) r(Y; -Y) 1 when there is a perfect linear relationship between Y and Z, byb, =), as logic might seem to dictate. In all other situations, however, the product b,b, is less than 1, which is the root of the so-called ""regres- sion paradox." In the presence of residual error, (23) and (26) determine two distinct regression lines intersecting at the joint means, and their different interpretation requires care. To illustrate the implications of this problem to small area estimation, consider the case where Z is a sample estimate of X, and Y is an indi- cator for X. One approach to determine X on the basis of Y is to follow Ericksen's suggestion to form the regression of Z on Y, computing a co- efficient for Y using (26). Our attitude toward this procedure might change, however, if we were to learn that Y was in fact a sample estimate for X. 
We would find generally that the coefficients estimated from (26) would not tend toward the value 1, as the principle of unbiased estimation would require, but in fact to a lesser value. (We would obtain an expected value of 1 if we could substitute the actual X for Z in (23).) To see what this lesser value is, suppose for the sake of argument that we let the sampling error of Z go to zero. We would find a convergence of (26) to approximately the value of "a" given earlier in (13) as the optimal weight to combine sample and prior information (in this case, the mean) to minimize mean square error. (In formula (13), A assumes the role of the true variability of X and D the sampling error of Y.) Thus, the regression approach leads to a shrinkage of the sample estimates Y toward the mean, very much in the spirit of the James-Stein estimator, although by an entirely different route.

As an illustration of this phenomenon of shrinkage, let us return to the first example of population estimation. For the values of c, the growth of State population relative to the national as in (4) for the decade 1960-1970, the regression coefficients for the 51 States and District of Columbia are .324, .374, and .177 for school enrollment, labor force size, and number of tax returns, respectively. This set of coefficients is employed in the most recent revision of the ratio-correlation method (to a greater precision than shown here, however). Their sum, .875, is less than unity. Consider the consequences of reweighting the regression: using the square root of 1960 population as a weight, the coefficients become .334, .435, and .124; weighting proportional to population (included in Ericksen's proposal) gives .371, .483, and .058. The sum of the second set is .893; that of the third, .912. Thus, the shrinkage effect, the summation of the coefficients to a value less than one, is reduced somewhat as larger States receive increased weight. An interpretation of this effect is that the better fit of the regression to the larger States supports less shrinkage than for the smaller ones.

This last example was chosen only to suggest that linear regression includes a shrinkage effect that works to reduce mean square error and runs counter to the notion of unbiasedness. Furthermore, if some specific subsets of units favor less shrinkage than others, the regression equation will express a compromise between the different degrees of shrinkage. In these cases, the question of weighting must be considered carefully. The possibility exists, moreover, for estimators that would explicitly accomplish varying degrees of shrinkage for different groups.

POST-CENSAL ESTIMATION OF POPULATION (REVISITED)

As described earlier, Ericksen proffered the regression-sample method as a means to counter possible obsolescence of past relationships applied to measure the present. This section will illustrate that multivariate methods in some applications may enable the study of the structure of the same past relationships and permit inferences about the approximate degree of their persistence. (The following discussion addresses the actual merit of Ericksen's proposal only obliquely, however, since the models will be analyzed on the level of States rather than PSU's. Furthermore, the computations carried out here are for the purposes of exploration only and are insufficient to constitute a complete methodology.)

Subsequent to Ericksen's original work on population, circumstances have limited the field of possible indicators of population change to statistics on school enrollment, labor force size, and number of tax returns. Recent instability due to changes in abortion laws has virtually eliminated the utility of births as an indicator of general population change, although this variable had been demonstrably effective in predicting change during the 1960-1970 decade.
Similarly, fluctuations in the data on automobile registrations, never a strong predictor, have also resulted in their exclusion from current estimates. The Census Bureau has altered the methodology in another important respect: Medicare data are now used directly to estimate the component of the population age 65 and over, and consequently the ratio-correlation method is now used only to predict the population under 65 years old.

School enrollment, labor force size, and number of tax returns correlate almost identically with population change (for the component of the population under 65) for the 1960-1970 decade, with values .955, .952, and .954, respectively. Rather than weighting the three equally, however, the regression coefficients for the decade are .324 for school enrollment, .374 for labor force, and .177 for tax returns. As mentioned in the preceding section, weighting the regression by the square root of population or by population further reduces the coefficient on tax returns. A general explanation for unusual coefficients is near-collinearity among the variables, which can lead to instability in the estimated coefficients. In this case, however, collinearity has a relatively mild effect upon the stability of the coefficients computed from the census data, and the differences between the resulting coefficients and an equal weighting cannot be ascribed to this factor alone. The analysis that follows suggests why the coefficients take this form.

The linear regression of population change on the three variables constitutes one measure of their interrelationship. Other multivariate techniques, in particular principal component analysis, can be useful for exploring the structure of the independent variables apart from their relationship to the dependent variable. The three-dimensional space determined by the three independent variables may have its points specified by the values of the individual variables.
Equivalently, the points of this space may be measured in relation to other component dimensions arising as linear combinations of the original variables. One such representation, the principal components, establishes dimensions that are uncorrelated according to the sample covariance matrix. In addition, these dimensions may be specified to represent progressively the largest remaining component of variation subject to the constraint of zero correlation with the preceding principal components. Hence, in a three-dimensional space, the first principal component represents the direction of maximum variation, and the third corresponds to the least variation. Algebraically, the principal components are the eigenvectors of the sample covariance matrix, and the corresponding eigenvalues measure the variance of the original variables along the dimensions determined by the respective eigenvectors.

The top half of Table 7 gives the principal components for the 1960-1970 decade for the three predictor variables. The first component represents effectively an average of the three variables, suggesting its origin in their common relation to population change. The second, with an eigenvalue only a twenty-fifth of the first, contrasts labor force and school enrollment, with tax returns playing a minor part. The third component, the dimension of least variation, has an eigenvalue only about half that of the second and measures the tax return variable against the average of the other two.

This description of the variables, together with the tendency of the regression to favor the combination of labor force and school enrollment over tax returns, suggests the following interpretation: the second component reflects a possible demographic phenomenon, that the labor force and school enrollment variables are indicators of two separate elements of the population, and their combination is able to represent the entire population efficaciously.
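The algebra just described (principal components as eigenvectors of the sample covariance matrix, eigenvalues as the variances along those dimensions) can be sketched in a few lines of Python. The data here are invented, standing in for the three indicator series.

```python
import numpy as np

rng = np.random.default_rng(2)
# invented stand-in for the three indicator series over 51 "States"
common = rng.normal(0.10, 0.15, 51)            # shared growth component
X = np.column_stack([
    common + rng.normal(0.0, 0.02, 51),        # school enrollment
    common + rng.normal(0.0, 0.03, 51),        # labor force
    common + rng.normal(0.0, 0.02, 51),        # tax returns
])

S = np.cov(X, rowvar=False)                    # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]              # largest variation first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# projections onto the components are uncorrelated, with variances
# equal to the eigenvalues
scores = (X - X.mean(axis=0)) @ eigvecs
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))   # True
```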
The small eigenvalue of the third component indicates that the tax variable represents generally an average of the other two, although the regression clearly favors the combination of school enrollment and labor force as a prediction of population change.

TABLE 7

Principal Components of Indicators

                             Principal Components
Indicators                   1st      2nd      3rd

1960-1970
  School enrollment          .62     -.62     -.48
  Labor force                .53      .78     -.32
  Tax returns                .58     -.06      .81
  Eigenvalue                 .0541    .0021    .0011

1970-1976
  School enrollment          .33     -.81     -.48
  Labor force                .72      .54     -.43
  Tax returns                .61     -.20      .77
  Eigenvalue                 .0221    .0018    .0006

Table 8 presents evidence in support of this interpretation. In the upper section of the table, the two-variable regressions of population change on school enrollment and labor force size indicate that school enrollment dominates the prediction of age 5-17 and contributes equally with labor force for 0-4, while being less effective for 18-44 and entirely negligible, once labor force is considered, for 45-64. The three-variable regressions in the lower part of the table show the potential of tax returns as a general indicator but also its inability to dominate both labor force and school enrollment for any age group. (The shrinkage effect described in the preceding section is apparent in these separate regressions, but to the least extent for the age group 5-17. The small shrinkage applied for this group may be attributed to the excellent fit of the regression here.) (The computations for age groups are only illustrative and are based simply upon published census counts without the necessary adjustments for the institutional population, etc., in the ratio-correlation method.)

To address the issue of possible change in the regression relationships since the 1970 census, the lower half of Table 7 gives the principal components of the 1970-1976 variables.
The reduced coefficient on school enrollment in the first principal component is a direct consequence of the smaller variation among States for this indicator. (During the 1960-1970 decade, the average variation among States ranged from 13 percent for labor force to 15 percent for school enrollment. For the period 1970-1976, however, the average variation in school enrollment is only 6 percent, whereas tax returns vary by 9 percent and labor force by 11 percent.) We find substantially the same alignment of components as for the 1960-1970 decade. The second principal component still may be understood to represent the difference in relative growth between the school-age population and the labor force. The second eigenvalue is now larger relative to the first than previously; it is now almost a tenth of the first. The third eigenvector, which still contrasts tax returns with the average of the other two, has remained relatively small, with an eigenvalue only 1/40th of the first, close to the ratio between these two eigenvalues for 1960-1970.

These last observations provide a limited assurance that the relationships established during the 1960-1970 decade have largely continued to hold. If either labor force or school enrollment had deteriorated substantially in its ability to predict its respective component of population change, this would be reflected in a larger third eigenvalue. Hence, the tax data as a general indicator suggest that the demographic relations observed earlier have persisted. (Some adjustment to the weights might be argued, however, in terms of the declining proportion of the total population under age 17.) Should the tax variable, which serves to confirm the relationship between school enrollment and labor force, receive increased weight? The analysis based upon principal components does not fully resolve this question.
Unfortunately, a linear regression incorporating CPS data would also be quite unsuccessful in answering this question, since the extremely small variation in the third component, which represents the dimension at issue, forces an extremely high variance on the estimated coefficient from sample data. At best, the sample-regression method represents a tool of possible future use for this question, but other techniques appear to be required as well.

TABLE 8

Regression Coefficients for Population Growth, 1960-1970, for States

                                Age
Indicators             Total     0-4     5-17    18-44    45-64

Two-Variable Regression
  School enrollment     .421    .390    .925     .251     .005
  Labor force           .449    .442   -.019     .508     .851

Three-Variable Regression
  School enrollment     .324    .374    .856     .231    -.241
  Labor force           .374    .429   -.071     .492     .663
  Tax returns           .177    .030    .124     .036     .446

NOTE: Computations for age groups are for illustration only and not consistent with current methodology.

REFERENCES

Efron, Bradley, and Morris, Carl (1972), "Limiting the Risk of Bayes and Empirical Bayes Estimators -- Part II: The Empirical Bayes Case," Journal of the American Statistical Association, 67, 130-9.

________, and Morris, Carl (1973), "Stein's Estimation Rule and Its Competitors -- An Empirical Bayes Approach," Journal of the American Statistical Association, 68, 117-30.

________, and Morris, Carl (1975), "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70, 311-9.

Ericksen, Eugene P. (1973), "A Method of Combining Sample Survey Data and Symptomatic Indicators to Obtain Population Estimates for Local Areas," Demography, 10, 137-60.

________ (1974), "A Regression Method for Estimating Population Change for Local Areas," Journal of the American Statistical Association, 69, 867-75.

Fay, Robert E. (1978), "Problems of Nonsampling Error in the Survey of Income and Education: Content Analysis," in Proceedings of the Social Statistics Section, 1977, Part I, American Statistical Association, Washington, DC.
Ginsberg, Alan, and Grob, George (1977), "Uses of Data from the Survey of Income and Education for Policy Analysis," in Proceedings of the Social Statistics Section, 1977, Part I, American Statistical Association, Washington, DC.

U.S. Bureau of the Census (1976), Current Population Reports, P-25, No. 640, "Estimates of the Population of States with Components of Change: 1970 to 1975," U.S. Government Printing Office, Washington, DC.

U.S. Bureau of the Census (1978), "Assessment of the Accuracy of the Survey of Income and Education," submitted to Congress on April 25, 1978, by the Secretaries of Commerce and of Health, Education, and Welfare.

Discussion

Eugene P. Ericksen

DEFINING CRITERIA FOR EVALUATING LOCAL ESTIMATES

The selection of criteria for evaluating local estimates is at once a statistical and a political issue. The statistician first of all wants a methodology for evaluating errors and then wants to verify that the selected set of estimates has a smaller average error than any competitive set and that there are no indications of systematic bias for particular subgroups of local areas. The policy-maker naturally wishes to have statistically satisfactory estimates, but must also value presentability, since s/he will need to defend the estimates before legislative groups, local critics, and the general public. Unfortunately, the best statistical estimates are sometimes difficult to present to a nonstatistical audience. More often, the policy-maker is forced by legislative demands or other requirements to produce and use "the best available estimate," which either does not meet accepted statistical standards or has not been subjected to statistical evaluation. The Federal estimates of population growth since 1970 which are used to allocate revenue sharing funds to local jurisdictions are an example of this.
Congress specified that estimates be computed for about 39,500 localities, and the Census Bureau had to produce the estimates even though it had not developed and tested a method for doing so.

The procedure of synthetic estimation provides a method of computing local estimates which would not otherwise be available. It has been used to give local estimates of dilapidated housing, unemployment, drug-taking behavior, and vacant housing. The alternative to these estimates was either nothing or a set of estimates already shown to be fallible. Unfortunately, the accuracy of synthetic estimates has not usually been assessed, and we don't have a systematic method which could tell us how inaccurate or biased the estimates might be. On the other hand, for the regression-sample data method there are already usable, though imperfect, methods of evaluating errors. Although these methods can usually tell us which of several sets of estimates are better, they cannot specify the level of error precisely. Moreover, the methods are complex and sometimes require assumptions which are statistically acceptable but difficult to sell politically. There seems to be a belief that a good local estimate incorporates information collected from that jurisdiction only and does not make use of information borrowed from other local areas, as is done in the regression-sample data estimates (Ericksen 1974). Nonetheless, it seems to me that synthetic estimates could be made more acceptable, and more complex estimates salable, if statisticians emphasized the assessment of errors as the most important criterion for evaluating the methodology of a set of local estimates. Top priority should be given to research strategies designed to improve the methodology of error estimation. Fortunately, Bob Fay has made steps toward that goal.

I feel that the synthetic procedure is of questionable validity.
The estimates have the unfortunate characteristic of "shrinking" estimates toward the mean of all areas. For a variable where characteristics of local areas are important, synthetic estimates might be very poor. Such a variable might be usage of a drug which is available in some areas but not others. This is because individual-level characteristics like age, race, and sex are typically used to compute synthetic estimates, and these characteristics are weakly related or unrelated to the volume of drugs on a local market. Moreover, if a synthetic estimate is to be used to identify extreme cases, like local areas with particularly high unemployment rates, the shrinking is a decisive liability. While there may be estimating situations where the synthetic procedure gives accurate results, there are usually also reasons to disbelieve their accuracy. Therefore the acceptability of a set of synthetic estimates should be based on an evaluation of errors. I suggest that this can often be done using the sample data on which the synthetic estimate is based.

Maria Gonzalez has presented an overview of some of the better known applications of synthetic estimates. Some of these applications have been important to users, such as the set of estimates correcting the numbers of housing units classified as vacant in the 1970 Census. Her paper indicates the versatility of synthetic estimation, and I think it is clear that the methodology will be used in important ways in the years to come. While she did not indicate a method by which the accuracy of estimates can be ascertained without resorting to census counts of the variable in question, she and I have worked on the problem. We did this for the set of unemployment estimates for 122 large metropolitan areas which she has reported here and given more extensive information about elsewhere (Gonzalez and Hoza 1978).
Many synthetic estimates, particularly those derived from Census or CPS data, are based on large sample calculations. For these, unbiased estimates of the characteristic in question can be computed from the survey data for the sample psu's. These estimates have large variances, but unless the number of psu's is small, the estimates can be used as a standard for accuracy. The series of synthetic and competitive estimates can be compared to the psu sample estimates. The set of estimates most highly correlated with the sample estimates is judged most accurate. This assumes that the sample estimates have only random errors.

In the unemployment application discussed by Gonzalez, the main competitor to the synthetic estimates was the set of "70-step" estimates computed by the Department of Labor. We correlated various sets of synthetic estimates and the 70-step estimates with the 122 sample estimates and found that the 70-step estimates were consistently more strongly related to the sample estimates of unemployment. We then used the occupation-race-sex synthetic estimate, thought to be the best synthetic estimate, and the 70-step estimate as independent variables in a regression with the sample estimates as the dependent variable, following the methodology of the regression-sample data technique. There we found the regression weights of the 70-step estimates to be considerably larger than those of the synthetic estimates. Fortunately, the synthetic estimates contained some independent information. The regression estimates computed with the 70-step and synthetic estimates as the two independent variables were more accurate than either the 70-step or synthetic estimates alone, particularly when outliers due to large sampling errors were removed (Ericksen 1975; Gonzalez and Hoza 1978).

With hindsight, we can see why the synthetic estimates of unemployment should be so poor.
The variance of the synthetic estimates was very small, considerably smaller than the variance of the 70-step estimates, of the sample estimates, or of the sample estimates after an estimate of the within-psu variance had been removed. This should have been an indicator of the shrinking problem. The synthetic procedure assumed that the unemployment rate was the same for all members of a given sex-race-occupational group in a region. For example, if the unemployment rate for steelworkers was high, this high rate was applied to all local areas. This unemployment rate was the result of economic problems in the steel industry which have led to the selective closing of plants. Bethlehem Steel, for example, is closing only some of its plants. A number of other steel plants have been closed in Youngstown, Ohio, but more are still working in Gary, Indiana. As a result, synthetic estimates computed for 1978 would give a misleading result, indicating the unemployment rates to be overly similar in Gary and Youngstown. Because the 70-step estimates were sensitive to local fluctuations, they would again prove superior.
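The evaluation strategy described above (correlate each candidate series with the unbiased psu sample estimates, then combine the candidates by regressing the sample estimates on them) can be sketched as follows. All of the data are invented; the real comparison used the 122 metropolitan-area unemployment estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 122                                      # areas with sample estimates
truth = rng.normal(6.0, 2.0, n)              # true local unemployment rates
sample = truth + rng.normal(0.0, 1.0, n)     # unbiased but noisy sample estimates
# an over-shrunk "synthetic" series and a locally sensitive "70-step" stand-in
synthetic = 6.0 + 0.2 * (truth - 6.0) + rng.normal(0.0, 0.3, n)
step70 = truth + rng.normal(0.0, 0.8, n)

# 1. rank the candidate series by correlation with the sample estimates
r_syn = np.corrcoef(synthetic, sample)[0, 1]
r_70 = np.corrcoef(step70, sample)[0, 1]
print(r_syn < r_70)   # the locally sensitive series tracks the samples better

# 2. combine both candidates by regressing the sample estimates on them
A = np.column_stack([np.ones(n), synthetic, step70])
coef, *_ = np.linalg.lstsq(A, sample, rcond=None)
combined = A @ coef   # regression-sample data estimate from both series
```

In this invented setting, as in the application discussed above, the combined regression estimate draws on whatever independent information each candidate carries.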
It can be seen from Fay's discussion of the SIE estimates of the number of children in poverty that the accurate estimation of the within-psu variance is a continuing problem. In this case, Fay was unable to compute a direct estimate of the errors of regression in 1975, although a complex and ingenious assessment of errors was eventually carried out. We faced a similar variance estimation problem in our work on 1960-70 population growth. We found that a few local units with extraordinarily large errors upset the stability of our within-psu variance estimates. These large errors appeared to be due to nonsampling errors, to the inclusion of special strata important nationally but found in only a few sample psu's, to poor estimates of the location of new construction, and in some cases, to pure chance. We found some improvement through an outlier-rejection routine (Ericksen 1975), but more research needs to be done on the estimation of within-psu error and its components.

Among the many issues usefully discussed in Fay's paper, there are two which deserve special attention. One is his delineation of sources of error in regression-sample data estimates, and the second is his application of the James-Stein technique. Both of these points suggest that the most fruitful applications of the regression-sample data technique will occur in estimating situations where sample estimates are available for all local units and explicit use can be made of the unbiased nature of the sample estimates.

It is recognized that errors in regression-sample data estimates arise due to structural errors in regression and to the presence of within-psu error. Fay correctly points out that errors also arise due to (1) differences between population regression equations for sampling units (psu's) and for the units of analysis (states or counties), (2) biases in the sample data, and (3) the weights used in the regression equation.
I would like to underscore his argument by giving an example of how the first and third sources contributed to error in one application. The job was to compute estimates of 1960-70 population growth for 2,586 counties in 42 states. Symptomatic information was available for all counties and for psu's in the CPS sample. We estimated a regression equation using 444 CPS psu estimates as the dependent variable. Because some of the self-representing psu's were very large, much larger than the typical nonself-representing stratum, they were given larger weights. These weights were directly proportional to population size and hence to the sample sizes in the psu's. In this way, the weights were proportional to the expected accuracy of the psu sample estimates, and we hoped to reduce the within-psu component of error by giving greater weight to the more reliable estimates. When the regression equation was applied to the 2,586 counties, we found the mean error to be 4.54 percent, and 221 of the errors were 10 percent or greater. We then, as an experiment, proposed to eliminate the within-psu source of error entirely by using 1960-70 Census figures for the 444 psu's as the dependent variable in the calculation of the regression equation. When we applied this regression equation to the 2,586 counties, we found to our surprise that the mean error was now 4.55 percent and that the number of errors of ten percent or greater had, in fact, risen to 234.

How was this possible? We compared errors by size of county. We found that where the county population was 25,000 or greater, the errors were consistently and substantially reduced by the second equation.
As a result, our weighted equation based on psu's, when improved, in- creased the accuracy for psu's but decreased the strengths of the inferences to counties. We had made better estimates with less in- formation. A second point to be made is that we cannot directly assess the errors for local areas not included in the sample. More importantly, the presence of the sample survey information, as Fay has shown, can lead to further reductions in error. By applying the Stein-James methodology, he was able to compute optimal weights for regression and sample estimates and to reduce the errors below those obtained from either method. There are two quibbles I would like to make. The first concerns the assumption that the sample observations are drawn from a population with equal means and variances. Since our objective is to estimate the differences among local units, how do we sustain this assumption? Is it necessary to subdivide local areas into categories with similar means, and just how robust is the assump- tion? The second quibble concerns the constraint that final estimates must lie within a specified distance, perhaps one standard deviation, of the sample estimates. If we assume that the within-psu errors are totally random, then we would expect the errors to have mean zero and to be normally distributed. As a result, there would always be a small subset of local areas which would have particularly bad sample estimates due to chance alone. As a result, the constraint would be particularly bad in these areas. If a constraint is necessary it is probably better practice to use the regression estimates rather than the sample estimates as the standard and to remove bad sample esti- mates from the equation. In the three applications I have worked on, estimating population growth, unemployment, and income, the regres- sion equations have been considerably more accurate on average than the sample estimates. This leads to a final point about within-psu errors. 
As Hogg (1974) has pointed out, outliers can have drastic effects on the calculation of a regression estimate. For regression-sample data estimates, outliers due to measurement error can be particularly damaging, even when their number is small. We have found a suitable way to identify these outliers and thus remove them from the equation (Ericksen 1975). We first computed a regression equation based on all cases and then compared the regression and sample estimates. Those sample observations at a specified distance from the regression estimate, usually two standard deviations, were identified and removed from the sample. A second regression equation was then computed from the remainder, and this equation was used to calculate the final estimates. Sizable reductions in the mean squared error were obtained by this technique, which does not seem incompatible with the general idea of the James-Stein methodology. Moreover, if outliers due to large within-psu errors were excluded, a more nearly optimal set of weights between the sample data and regression estimates could perhaps be computed.

To summarize, both the synthetic and regression-sample data methodologies promise good, though uneven, results. If the synthetic estimate has a reasonable competitor, it is likely that a better result could be obtained by using both the synthetic and competitive estimates in a regression format, with the sample estimates as the dependent variable. The most important point, though, is that we need a systematic way of evaluating and comparing errors. One way to do this is to make explicit use of the sample data on which the synthetic and regression-sample data estimates were computed. Given the difficulty of evaluating estimates for areas where there is no sample information, the most useful applications of the regression-sample data method are likely to occur in estimating situations where sample data are available for all local units.
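The two-pass procedure described above can be sketched directly: fit, flag observations more than two residual standard deviations from the fitted values, and refit on the remainder. The code below is a simplified illustration with invented data, not the published routine.

```python
import numpy as np

def two_pass_regression(x, y, k=2.0):
    """Fit, drop observations more than k residual standard deviations
    from the first-pass fit, then refit on the remainder (a sketch of
    the outlier-rejection idea, not the published routine)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)               # first pass
    resid = y - X @ beta
    keep = np.abs(resid) <= k * resid.std()                    # flag gross outliers
    beta2, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # second pass
    return beta2, keep

# invented data: a clean linear relation plus three gross measurement errors
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, 100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, 100)
y[:3] += 8.0                                  # contaminated observations

beta2, keep = two_pass_regression(x, y)
print(int((~keep).sum()), beta2)
```

With the contaminated points removed, the second-pass coefficients recover the underlying intercept and slope far more closely than a single fit over all cases would.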
Finally, let us hope that future research on synthetic estimation does not follow the path of research on ratio-correlation estimates. This latter method is a technique for estimating population change which has been used extensively at the State and national levels. The literature is full of variations on the basic method, each of which gave an improvement in some particular estimating situation. People have tried stratifying local units, using differences between ratios instead of ratios of ratios, dummy variables, and many other variations, and have shown that their particular variation worked for them. Unfortunately, none of these papers ever provided a methodology for determining which variation, or the basic method, was optimal in a new situation, and statisticians and demographers have been left to make the same ad hoc judgments as before.

REFERENCES

Ericksen, Eugene P. (1974), "A Regression Method for Estimating Population Changes of Local Areas," Journal of the American Statistical Association, 69, 867-875.

Ericksen, Eugene P. (1975), "Outliers in Regression Analysis When Measurement Error is Large," Proceedings of the Social Statistics Section of the American Statistical Association, 412-417.

Gonzalez, Maria E., and Christine Hoza (1978), "Small-Area Estimation with Application to Unemployment and Housing Estimates," Journal of the American Statistical Association, 73, 7-15.

Hogg, R. V. (1974), "Adaptive Robust Procedures: A Partial Review and Some Suggestions for Future Applications and Theory," Journal of the American Statistical Association, 69, 909-923.

General Discussion

* Some very challenging philosophical issues were raised at some of the sessions. It is important to continue to explore the questions concerning synthetic estimates: When is it and when isn't it safe? What are the conditions under which one could use the method? What are the criteria?
* One criterion for when a synthetic estimate, or any of the other types of estimates, should be used is whether one can evaluate the error and determine that the estimates are sufficiently accurate. If we are not able to assess the error in any way, that should be a strong indication that the estimate should not be used, unless it is politically dictated that it has to be. As statisticians working either for or with the government, we don't always have the freedom to choose not to use a synthetic estimate. Sometimes we have to do things that statistically we don't necessarily agree with.

* If we are going to talk about errors of estimates, the size of the error is important and so is its direction. Almost every error for a place with a high unemployment rate is negative. If the objective is to spot places with high unemployment rates, then a synthetic estimate is particularly bad for that purpose and should not be used. Competitors will arise if the agencies that have the responsibility to compute estimates don't give out estimates that seem plausible to the groups that might object.

* It is possible that you could use regression methods to get rid of some of the bad characteristics of the synthetic estimates. But regression does not get rid of these characteristics; all it does is dampen them. Places with high unemployment rates where the synthetic estimate is too low come out too low once again if the data are used for regression estimates.

* One of the things that you learn about in sociological statistics is ecological correlation. You learn not to use the characteristics of aggregates to make inferences to individuals. It seems equally invalid to use characteristics of individuals to make inferences to aggregates. That is where the synthetic type of estimate that uses regressions of aggregates could go wrong. It is likely to misorder the weights that would be applied to variables.
For example, variables that would predict drug usage on an individual level, such as age, would be the most important. Yet the age distribution of the population would not really do very well, compared to other factors, in estimating whether drug use is very high. If you have local area data, such as the number of drug treatment centers, the number of drug arrests, or the FBI's best guess as to the rate of drug traffic, these would turn out to be much better predictors, and that would be the data to use.

(Contributing to the general discussion during this period were: Eugene Ericksen and Joseph Steinberg.)

Part IV

Drug Abuse Applications: Some Regression Explorations with National Survey Data -- Reuben Cohen
  Discussion -- Monroe G. Sirken; Ira Cisin
  General Discussion
Applications of Synthetic Estimates to Alcoholism and Problem Drinking -- David M. Promisel
  Discussion -- Donna O. Farley
  General Discussion
Synthetic Estimates As An Approach to Needs Assessment: Issues and Experience -- Charles Froland
  Discussion -- Reuben Cohen
  General Discussion

Drug Abuse Applications: Some Regression Explorations with National Survey Data

Reuben Cohen

ABSTRACT

Personal interview surveys in recent years have provided national estimates of use of marihuana, heroin, and other substances. Over a number of national surveys, consistent relationships have been observed between drug abuse and demographic variables such as age, education, and sex. Where one lives has also been found to be significantly related to level of drug abuse. This is observed in survey data in relationships between experience with drugs and geographic region of residence and community size and type. Regression and other multivariate analyses have been used to help understand the prevalence of drug abuse among various segments of the general population and have provided a means to explore relationships between drug use and a number of additional factors related to location of residence.
Regression procedures have also been used in an exploratory way to provide drug abuse estimates for States.

NATIONAL SURVEY RESULTS

A number of sample surveys in recent years have provided national estimates of use of marihuana, heroin, and other substances. Data collection and analyses for five such surveys have been carried out by Response Analysis Corporation, starting in 1971 and 1972 for the National Commission on Marihuana and Drug Abuse, and continuing in later years in cooperation with the Social Research Group, George Washington University, under sponsorship of the National Institute on Drug Abuse.

This paper aims to provide a flavor of the findings and something of the methodology of these surveys, and invites the reader to think about the ways that results could be made more useful by appropriate use of small area estimating techniques. Typically, the surveys have been based on national probability samples in the range of 3,000 to 4,500 personal interviews. They have included special samples of youth age 12-17 and have oversampled young adults in the 18-25 age range. Something more about the methodology is described further on, but first a few findings from the 1977 survey are presented to suggest the range of content and types of data available for additional analysis (Abelson, Fishburne, and Cisin 1977).

All of the surveys included a variety of measures of use and frequency of use of a range of substances, including illicit drugs as well as nonmedical use of drugs legally obtainable only under a doctor's prescription. Table 1 shows the range of substances and figures on lifetime experience reported in the 1977 survey by youth, young adults, and older adults. As a quick summary, each group is more likely to have had experience with marihuana and/or hashish than with any of the other psychoactive drugs studied. Clearly also, marihuana use is strongly associated with age, and the highest prevalence rate is found among young adults age 18-25.
TABLE 1
NATIONAL SURVEY ESTIMATES FOR 1977: LIFETIME EXPERIENCE*

                          YOUTH   YOUNG ADULTS   OLDER ADULTS
                          12-17   18-25          26+
MARIHUANA AND/OR HASHISH   28.2    60.1           15.4
INHALANTS                   9.0    11.2            1.8
HALLUCINOGENS               4.6    19.8            2.6
COCAINE                     4.0    19.1            2.5
HEROIN                      1.1     3.6             .8
OTHER OPIATES               5.1    13.5            2.8
STIMULANTS (Rx)             5.2    21.2            4.7
SEDATIVES (Rx)              3.4    18.4            2.8
TRANQUILIZERS (Rx)          3.8    13.4            2.6

*PERCENT EVER USED

Lifetime experience (ever used) is considerably higher than current use (use in the month prior to interview). For youth and young adults, the figures on current use of marihuana and/or hashish are roughly half as large as those reported for lifetime experience. For other substances, reported levels of current use fall off much more sharply from the figures for lifetime experience.

The national surveys have also shown substantial differences in reported levels of drug use among population subgroups other than age, and these have been generally consistent across the five points in time. Table 2 shows lifetime experience with marihuana by sex, race, and educational level. Males are more likely than females to report experience with marihuana, and reported marihuana experience also increases with educational level. Differences by race are smaller and less consistent.

TABLE 2
LIFETIME EXPERIENCE WITH MARIHUANA AND/OR HASHISH,* 1977 SURVEY

                         YOUTH   YOUNG ADULTS   OLDER ADULTS
TOTAL                     28      60             15
SEX
  MALE                    33      66             21
  FEMALE                  23      55             10
EDUCATION
  NOT HIGH SCHOOL GRAD    --      52              6
  HIGH SCHOOL GRAD        --      60             16
  COLLEGE                 --      65             26
RACE
  WHITE                   29      61             15
  NONWHITE                26      54             20

*PERCENT EVER USED

Patterns of use by geographic region and community type (Table 3) are of more specific interest to the topic of this workshop. For each of the three age groups, highest levels of experience are reported in the Northeast and West, and lowest levels in the South.
For each age group also, more lifetime experience with marihuana is reported by residents of metropolitan areas than by residents of nonmetropolitan areas, with at least a suggestion of more experience in large metropolitan areas than in small metropolitan areas.

Lifetime experience with marihuana has increased significantly over the period covered by the five national surveys, as shown by figures for age groups in Table 4. With some allowance for sampling variability from one time period to the next, the figures also show a reasonably consistent pattern for sex and education (Table 5) and for geographic region and community type (Table 6).

TABLE 3
LIFETIME EXPERIENCE WITH MARIHUANA AND/OR HASHISH,* 1977 SURVEY

                       YOUTH   YOUNG ADULTS   OLDER ADULTS
GEOGRAPHIC REGION
  NORTHEAST             35      66             20
  NORTH CENTRAL         29      61             14
  SOUTH                 19      50              9
  WEST                  36      67             23
COMMUNITY TYPE
  LARGE METROPOLITAN    37      63             20
  SMALL METROPOLITAN    28      64             16
  NONMETROPOLITAN       18      48              9

*PERCENT EVER USED

TABLE 4
LIFETIME EXPERIENCE WITH MARIHUANA AND/OR HASHISH*

           1971   1972   1974   1976   1977
12 - 13      6      4      6      6      8
14 - 15     10     10     22     21     29
16 - 17     27     29     39     40     47
18 - 25     39     48     53     53     60
26 - 34     19     20     30     36     44
35+          7      3      4      6      7

*PERCENT EVER USED

TABLE 5
LIFETIME EXPERIENCE WITH MARIHUANA AND/OR HASHISH,* ALL ADULTS

                         1971   1972   1974   1976   1977
SEX
  MALE                    21     22     24     29     30
  FEMALE                  10     10     14     15     19
EDUCATION
  NOT HIGH SCHOOL GRAD     8      5      9     12      1
  HIGH SCHOOL GRAD        14     13     20     22     26
  COLLEGE                 23     32     28     30     25

*PERCENT EVER USED

TABLE 6
LIFETIME EXPERIENCE WITH MARIHUANA AND/OR HASHISH,* ALL ADULTS

                   1971   1972   1974   1976   1977
REGION
  NORTHEAST         20     14     22     24     29
  NORTH CENTRAL     19     15     17     19     24
  SOUTH              5      8     13     17     17
  WEST              21     33     29     29     32
POPULATION DENSITY
  LARGE METRO       20     21     24     26     30
  OTHER METRO       18     20     20     24     26
  NONMETRO           7      6     12     13     16

*PERCENT EVER USED

SURVEY METHODS

So much for the summary of national survey results.
The starting point is a multi-stage area probability sample of the coterminous United States, stratified by Census geographic divisions, metropolitan/nonmetropolitan place of residence, and other demographic factors. Primary sampling units were counties and groups of counties, with 103 such units selected for the Response Analysis national sample. Interviews for the series of studies described have typically been carried out in approximately 400 segments within the 103 PSU's.

Reasonably careful probability sampling and field interviewing procedures have been used at each step in the data collection process. Rough field counts are used to divide census enumeration districts and block groups into small segments, and field listings of specific housing units are completed in advance of interviewing. Letters are then written to households selected as part of the survey sample to announce the interviewer's visit and to urge cooperation with an important national survey.

In most cases, interviewers were trained on procedures for these surveys in regional meetings scheduled just before the start of field interviewing for each study.

The interviewer's first task at the sample household is to list residents of the household. Although the details of the procedure have varied somewhat over the period covered by the five surveys, the listings of residents have been divided into age groups for youth, young adults, and older adults, in order to provide for oversampling of the two younger groups. In effect, two independent sampling procedures have been carried out at each household -- one for the youth sample, one for the adult sample. In households which include one or more eligible youth age 12 to 17, one such person is always randomly selected for the youth sample, regardless of whether an adult is selected from that household.
The adult sampling procedure is somewhat more complex and depends on whether the household includes only young adults, only older adults, or both. No more than one adult is selected, and younger adults are favored by the probability selection procedures. Weights are used in processing survey results to compensate for the disproportionate nature of the sampling procedure.

Interviewers make repeated visits to sample households, as necessary, in an effort to complete interviews with each designated respondent -- sometimes up to ten visits or more. Additional efforts are made to solicit the cooperation of persons who initially refuse or who are reluctant to participate. Interview completion experience for the series of surveys has generally been in the range of 80 percent of designated respondents; in the most recent survey, interviews were completed with 82 percent of the youth sample and 81 percent of the adult sample.

As one might expect for a survey on a sensitive issue such as use of illicit drugs, special efforts are made to protect the privacy of the respondent and to insure the confidentiality of data. A combination of procedures is used in the interview. Part of the questionnaire is a standard interview instrument with answers recorded by the interviewer, and techniques to afford greater privacy for the respondent are used in other phases of the interview. In those sections of the interview on illicit drug use, the respondent marks his or her own answers to questions read aloud by the interviewer. This procedure permits respondents to conceal potentially sensitive answers, while allowing the interviewer to maintain control of the interview. The answer sheets were designed so that, whether or not the respondent had ever used illicit drugs, the same amount of time would be required to fill out the forms. Codes were used to identify completed questionnaires and answer sheets, but neither names nor addresses were used.
As each answer sheet was completed, the respondent was instructed to place it directly in a return envelope. At the conclusion of the interview, the main questionnaire was also placed in the envelope, and then, in the presence of the respondent, the envelope was sealed. The respondent, who had been told of these procedures in advance, was invited to accompany the interviewer to a mailbox. The interview materials did not contain the respondent's name or address anywhere on the questionnaires or envelope and were mailed directly to the central office. Interviewers were not permitted to review or to edit questionnaires.

REGRESSION ESTIMATES FOR STATES

Now that we have these kinds of data, how can we use them to assist in the development of estimates for States or smaller areas? First we might consider the possibility of extracting estimates by looking into the survey data for interviews conducted within specific States. But sample surveys of adequate size to provide reasonably stable estimates for the total U.S. population are rarely large enough to provide direct estimates for specific States. A survey intended to provide estimates for the State of New Jersey, for example, would require about as large a sample for that State as for the U.S. as a whole in order to yield estimates of similar accuracy. Within the national sample, the number of locations and the number of persons in the sample in any given State are too small to provide a useful estimate. Indeed, the national sample used for the series of surveys described in this paper does not include interviews in every State.

Synthetic estimates of a type which require dividing the total population into a large number of specific cells, based on a set of factors believed to be associated with drug abuse, were not seriously considered because of the relatively small size of our national samples. Much larger samples would be needed than those on which this series of studies is based.
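The cell-based synthetic estimator just mentioned is simple to state, even if our samples are too small to support many cells. A minimal sketch, with entirely hypothetical cell rates and county counts (the cells here are only the three broad age groups; a real application would use far finer cells): national survey rates per cell are applied to a local area's census counts for the same cells.

```python
# Cell-based synthetic estimate: national survey rates for demographic
# cells applied to a local area's census counts for the same cells.
# All figures below are hypothetical, for illustration only.

national_rate = {"youth": 0.28, "young_adult": 0.60, "older_adult": 0.15}

def synthetic_estimate(local_counts):
    """Number of lifetime users implied by the national cell rates."""
    return sum(national_rate[cell] * n for cell, n in local_counts.items())

county = {"youth": 10_000, "young_adult": 8_000, "older_adult": 50_000}
users = synthetic_estimate(county)
rate = users / sum(county.values())
print(round(users), round(rate, 3))  # → 15100 0.222
```

The estimate varies across areas only through the census counts; two counties with the same age composition receive the same rate, which is exactly the limitation discussed later in this workshop.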
The specific procedure chosen for the work that is discussed next is a dummy variable multiple regression analysis. One portion of the analysis was carried out in parallel form, using a multiple classification analysis, with almost identical results.

In each case, a number of independent variables, or predictors, are identified. Each of these techniques deals adequately with the general problem of intercorrelated predictors, provided that certain other assumptions are met. One assumption of the classic multiple regression approach is that the variables used in the analysis are continuous and normally distributed. However, the technique has been adapted to deal with classifications (e.g., geographic regions) by using dummy variables in the regression equation. The multiple classification analysis (MCA) technique was developed specifically for classification data and is generally equivalent to the dummy variable multiple regression used for the complete series of analyses (Andrews, Morgan, and Sonquist 1969).

An important assumption of both the regression and MCA techniques is that relationships between the predictors and the dependent variable are additive -- that is, that the effect of each class of each predictor does not depend on the values of any of the other predictors. In the case of the present analysis, multiple regression and MCA models would assume that a person's likelihood of having experience with a substance is composed of a series of additive coefficients, corresponding to the particular category or class in which he or she stands on each predictor. Thus, for example, separate effects could be calculated for age, sex, education, region of the country, and so on, and summed to obtain an estimated probability which takes all of those factors into account. While the assumption of additivity is often taken to be a good initial approximation to reality, it poses some obvious difficulties in the analysis of drug abuse.
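The dummy-variable device described above can be sketched with toy data (not the survey's): a classification such as region enters the equation as 0/1 indicator columns, one category is omitted as the base, and each fitted coefficient is the additive effect of belonging to that category.

```python
import numpy as np

# Dummy-variable regression on a classification: region enters as 0/1
# indicator columns and the coefficients are additive category effects.
# Toy yes/no "ever used" data, invented for illustration.

regions = np.array(["NE", "NC", "S", "W", "NE", "S", "W", "NC"])
used = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=float)  # ever used (yes/no)

levels = ["NC", "S", "W"]                    # "NE" is the omitted base level
X = np.column_stack(
    [np.ones(len(used))] + [(regions == lv).astype(float) for lv in levels]
)
beta, *_ = np.linalg.lstsq(X, used, rcond=None)
print(dict(zip(["intercept"] + levels, np.round(beta, 2))))
```

The intercept is the base-category (NE) rate, and each coefficient shifts that rate up or down for the other regions; summing the relevant coefficients across several such classifications gives the additive estimated probability the text describes.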
An alternative assumption which must be considered is that the predictors interact -- i.e., two or more predictors have an effect in combination which is different from the sum of their effects computed separately. Some parts of this general problem of interaction have been dealt with in the way that variables have been combined for the analysis. Additional work on the general problem of interaction would be a useful aspect of any further effort to develop a drug abuse index from survey research data.

Two sets of analyses have been done using these procedures. The first used data from the 1972 national survey; the second combined data from the 1974 and 1976 surveys. In the 1972 survey analysis we created dependent variables for each of eight substances, for both lifetime experience and current use. Each of these was coded as yes/no.

Before going on to discuss the predictor variables, Table 7 shows the proportion of variance we were able to explain in the analyses. The figures in the chart are the multiple R²'s for each analysis. At least a small proportion of variance in use is explained for each of the substances. The squared multiple correlation coefficients are highest for marihuana, and are higher for lifetime experience than for current use. This suggests, of course, that likelihood of use of marihuana is more predictable than that of other substances -- and lifetime experience more predictable than current use. The sizes of the coefficients are probably at least in part a function of the overall levels of reported use. For drugs with very low levels of reported use, errors of various types, including reporting errors, are larger relative to reported frequency of use and thus are likely to reduce the amount of variance that might otherwise be attributed to the predictor variables in the equation.
TABLE 7
MULTIPLE R², 1972 SURVEY ANALYSIS

               LIFETIME EXPERIENCE   CURRENT USE
MARIHUANA      .27                   .18
HEROIN         .05                   .03
COCAINE        .06                   .05
HALLUCINOGENS  .13                   .08
INHALANTS      .05                   .02
SEDATIVES      .05                   .05
TRANQUILIZERS  .02                   .02
STIMULANTS     .06                   .06

A number of different versions of the regression analyses were carried out with the 1972 survey data, using different numbers of predictor variables. The figures shown in Table 7 were based on an analysis using seven sets of dummy variables. With some differences in the group of dummy variables, the analysis was repeated for selected drugs with the combined 1974-76 survey data. Tables 8A through 8D compare results of analyses of the two sets of survey data for lifetime experience with marihuana. The youth and adult samples were combined in these analyses.

In Table 8A we note that the multiple correlation coefficients were identical in the two analyses. Table 8A also shows "index numbers" for a combined age/education set of dummy variables, and for sex. The index numbers, created for ease of interpretation, are simply multiple regression coefficients multiplied by 100 and rescaled with the lowest valued coefficient set equal to zero.

TABLE 8A
MULTIPLE REGRESSION INDEX, 1972 AND 1974-6
LIFETIME EXPERIENCE WITH MARIHUANA

                      1972   1974-6
Multiple R²            .27    .27
AGE/EDUCATION
  12 - 13               4      2
  14 - 15              12     18
  16 - 17              28     37
  18 - 20/COLLEGE      50     52
  18 - 20/NONCOLLEGE   37     52
  21 - 24/COLLEGE      43     52
  21 - 24/NONCOLLEGE   25     45
  25 - 34/COLLEGE      29     37
  25 - 34/NONCOLLEGE   14     26
  35 - 49/COLLEGE      14     10
  35 - 49/NONCOLLEGE    4      5
  50 AND OVER           0      0
SEX
  MALE                  8     10
  FEMALE                0      0

The same kinds of index numbers are shown in Table 8B for family income groups, used only in the 1972 survey analysis, and for race/ethnic group dummy variables. A question on family income has been included in interviews with adults but not in youth interviews.
In order to include income in the 1972 survey analysis, we used that part of the youth sample for which an adult had been interviewed in the same household, and assigned the income reported by the adult to the youth interview also. In the 1974-76 analysis we used the full youth sample and did not use the income variable. It is possible that inclusion of family income in the analysis for 1972 but not for 1974-76 has also affected the results for race/ethnic group for the two years, but we have not tried to unravel these effects.

TABLE 8B
MULTIPLE REGRESSION INDEX, 1972 AND 1974-6
LIFETIME EXPERIENCE WITH MARIHUANA

                      1972   1974-6
FAMILY INCOME*
  UNDER $5,000          9
  $5,000 - $9,999       4
  $10,000 - $14,999     0
  $15,000 AND OVER      4
RACE/ETHNIC GROUP
  WHITE                 4      6
  BLACK                 0      9
  HISPANIC              0      0

*Family income not included in 1974-76 analysis.

Table 8C shows results for the two principal sets of geographic variables we have used in the analyses. These show generally consistent results in terms of direction of differences between geographic groupings, but the differences are generally smaller in the 1974-76 analysis than in the 1972 analysis. There is a clear relationship between community type and reported lifetime experience with marihuana, and similarly between geographic region and marihuana use.

TABLE 8C
MULTIPLE REGRESSION INDEX, 1972 AND 1974-6
LIFETIME EXPERIENCE WITH MARIHUANA

                             1972   1974-6
COMMUNITY TYPE
  LARGE METRO/CENTRAL CITY    19     12
  LARGE METRO/SUBURBAN        14      7
  SMALL METRO/CENTRAL CITY    19      6
  SMALL METRO/SUBURBAN         2
  NONMETRO/URBAN               2
  NONMETRO/RURAL               0
REGION
  NORTHEAST                    6      4
  NORTH CENTRAL                2
  SOUTH                        0
  WEST                        14      9

Finally, in this series of findings, Table 8D shows results of one of our side excursions. In the analysis of the 1972 survey data, we coded a number of additional geographic variables based on county of residence of survey respondents.
For example, each county in the national sample was coded as high, middle, or low in terms of percent of population living in college dorms, and similarly in terms of percent of population enrolled in college. For the 1972 analysis, percent in college dorms was selected for inclusion based on an early informal inspection of regression and correlation data for a large number of variables. In the 1974-76 analysis, both sets of dummy variables were originally incorporated in the analysis, and stepwise regression procedures were permitted to select one set. The suggestion in both cases is that some proportion of experience with marihuana is explained by the presence of large numbers of college students in the community relative to total population.

TABLE 8D
MULTIPLE REGRESSION INDEX, 1972 AND 1974-6
LIFETIME EXPERIENCE WITH MARIHUANA

                                      1972   1974-6
% POPULATION IN COLLEGE DORMITORIES
  LOW
  MIDDLE
  HIGH                                 13
% POPULATION ENROLLED IN COLLEGE
  LOW                                          0
  MIDDLE                                       2
  HIGH                                         7

If for no more than their curiosity value, the complete list of additional variables coded for the 1972 survey analysis is shown in Table 9. They have not been very useful so far, but they may suggest additional possibilities to the reader.
TABLE 9
POPULATION CHARACTERISTICS USED IN REGRESSION ANALYSES OF 1972 SURVEY DATA
(CODED HIGH, MIDDLE, OR LOW FOR COUNTY OF RESIDENCE OF SURVEY RESPONDENTS)

POPULATION PER SQUARE MILE
PERCENT POPULATION CHANGE, 1960-1970
MEDIAN NUMBER OF PERSONS/HOUSEHOLD
PERCENT POPULATION IN ONE-PERSON HOUSEHOLDS
PERCENT FOREIGN BORN
PERCENT FOREIGN BORN AND NATIVE BORN OF MIXED OR FOREIGN PARENTAGE
PERCENT POPULATION IN GROUP QUARTERS
PERCENT POPULATION IN MILITARY BARRACKS
PERCENT POPULATION IN COLLEGE DORMITORIES
PERCENT OF CIVILIAN LABOR FORCE THAT IS UNEMPLOYED
PERCENT OF HOUSEHOLDS WITH INCOME LESS THAN POVERTY LEVEL
PERCENT BLACK POPULATION
LOCATION NEAR INTERSTATE HIGHWAY
LOCATION NEAR MAJOR POPULATION CENTER

To illustrate the possible application of regression estimates for specific States, indexes were computed from the 1974-76 analysis. Table 10 shows figures for the three highest and three lowest estimates.

TABLE 10
MARIHUANA INDEX, LIFETIME EXPERIENCE, 1974-76 SURVEYS

                         INDEX*
HIGHEST
  DISTRICT OF COLUMBIA    155
  CALIFORNIA              142
  COLORADO                137
LOWEST
  ALABAMA                  57
  KENTUCKY                 53
  MISSISSIPPI              51

*AVERAGE FOR ALL STATES = 100.

EXAMINATION OF REGRESSION RESIDUALS

The final step in the exploratory work that is included in this paper was an examination of regression residuals from the 1974-76 analysis. The research started with a hypothesis, but most statistical cautions were thrown aside in looking at residuals for areas in the national sample, figuratively plotted on a map of the United States. The implication of the regression coefficients shown earlier is that the United States consists of four large plateaus, at four different heights with respect to reported experience with marihuana, represented by the regression coefficients for the four census regions. The plateaus would be at the relative heights shown in Map 1. There would, of course, be sharp elevations wherever metropolitan concentrations occurred, with peaks represented by central cities.

MAP 1.
REGRESSION RELATIONSHIPS FOR CENSUS REGIONS

My own mental map of the United States suggests something quite different -- perhaps rolling hills and valleys corresponding to points of entry and avenues of diffusion of drug experience. With this in mind, I looked at residuals, which are in effect deviations from the plateaus, after taking into account metropolitan/nonmetropolitan community type and variations in demographic features such as age, sex, and education.

The number of PSU's in our national sample poses obvious limitations for this type of examination of residuals, but let me share with you the terrain features that emerged for me. Starting with the Northeast region (Map 2), there seems to be a difference between an area included within a broad arc drawn around New York City and the rest of the region. The arc extends into Connecticut and into Northern New Jersey. Residuals for sample locations within the arc average plus 3 percentage points.¹ In other words, even after taking community type and demographic features into account, New York City and the surrounding area average about three percentage points higher than the region as a whole, or about 5 percentage points higher than the rest of the region.

MAP 2. AVERAGE RESIDUALS IN NORTHEAST

For the North Central region (Map 3), the specific features don't exactly pop off the map, but there does seem to be something different about the metropolitan regions near the Great Lakes as compared with the rest of the region. The Great Lakes metropolitan group embraces the areas of Chicago-Milwaukee, Detroit-Ann Arbor, and Cleveland-Akron-Youngstown. This grouping averages plus 2 percentage points, and the rest of the region minus 1.

MAP 3. AVERAGE RESIDUALS IN NORTH CENTRAL

For the South (Map 4), the picture is different. There is a depression in the terrain that runs across the States of the deep South. The band extends from Georgia and the Carolinas across to Arkansas and Louisiana.
Residuals in these States average minus 2 percentage points, compared to about plus 3 percentage points in the rest of the region.

MAP 4. AVERAGE RESIDUALS IN SOUTH

The West is more complex (Map 5). The most noticeable features are highs in Northern California and lows in the Los Angeles area. The Northern California grouping of locations extends from the Bay Area to Sacramento; residuals average about plus 5 percentage points. For the Los Angeles area, including Orange County, residuals average minus 4 percentage points. Locations in the rest of the region average about the same as the entire region.

MAP 5. AVERAGE RESIDUALS IN WEST

Examination of the residuals has been an interesting exercise. I suspect that careful study will suggest new approaches to meaningful estimating procedures for small areas.

FOOTNOTE

1. For this analysis, residuals were averaged for four or more primary sampling units. For any grouping examined and reported separately, the minimum number of interviews is 280.

REFERENCES

Abelson, H.I.; Fishburne, P.M.; and Cisin, I. National Survey on Drug Abuse: 1977. Princeton, N.J.: Response Analysis Corp., 1977; and (Volume 1, Main Findings) National Institute on Drug Abuse. DHEW Pub. No. (ADM)78-618. Washington, D.C.: Superintendent of Documents, U.S. Govt. Printing Office, 1977.

Andrews, F.; Morgan, J.; and Sonquist, J. Multiple Classification Analysis. Ann Arbor, Michigan: Survey Research Center, University of Michigan, 1969.

Discussion

Monroe G. Sirken

Reuben Cohen proposes and illustrates a multiple regression model for producing State and local area synthetic estimates of drug use.
He suggests that the designs of the national surveys conducted by NIDA favor the regression estimator over a synthetic estimator, because the sample size of NIDA's national survey is too small to be divided into a large number of population subdomains. In other words, the sampling errors would be larger for the synthetic estimator. However, Reuben does not present any empirical or theoretical evidence to substantiate this view. Personally, I doubt that much is gained by dividing the population into a large number of subdomains. For instance, synthetic estimates of health service utilization are changed very little by increasing the number of subdomains beyond those defined by age and sex.

In his continued work with the drug use data, I suggest that Reuben undertake two types of studies -- one theoretical, the other empirical. First, it would be very helpful if he would indicate the relationship between the multiple regression estimator and the synthetic estimator. Second, it would be helpful if he would use the NIDA data to compare the State estimates of drug use, and their sampling errors, for the two estimators.

One of Reuben's observations deserves underscoring. He notes that although drug use varies greatly by demographic variables, like age and sex, these variables account for only a small fraction of the total variance in the population's use of drugs. He shows this to be particularly true for the rarer drugs. Does this imply that we should be wary of synthetic estimates of drug use, particularly for the rarer drugs?

Discussion

Ira Cisin

The scope of this workshop is considerably broader than I had expected; we are scheduled to discuss a wide variety of estimating procedures, both direct and indirect, and perhaps to discuss a hierarchy of utility within the indirect domain. As far as I can tell, our vocabulary in this field is not sufficiently differentiated, so that when a term like "synthetic estimates" is used, we are not all necessarily thinking about the same thing.
Even the term "synthetic" is a little unfortunate, since the connotation it evokes suggests "imitation" or "ersatz"--not quite the genuine article. My intent is to demonstrate that synthetic estimates are indeed genuine and potentially important; to make explicit some obvious conditions and assumptions under which synthetic estimates can be most useful; and to make a couple of modest proposals on how their utility can be increased.

Our procedures are "synthetic" in that they synthesize information from more than one data set. In the case of the drug use estimates, we have the results of a national sample survey; we search these results for an explanatory model--that is, we seek a set of "predictor" variables or "independent" variables which will maximally account for the variance in some particular "criterion" or "dependent" variable. Fundamentally, this is a regression procedure, whether the results are expressed in terms of regression coefficients or whether they are expressed in differential probabilities for defined subgroups. Then, armed with our survey results, we apply our model to a geographic segment of the population which is a part of the total population but which was not sampled intensively. Usually the geographic segment is a State or a city or a county. The result is a synthesis of the national sample survey data with available Census information about the geographic segment or segments of particular interest.

Three observations on the procedure are appropriate at this point:

1. Obviously, the procedure is not very useful if the explanatory power of the regression model derived from the national survey results is weak.
If the best we can do is a very small R² in explaining the variance in the criterion, and/or if that model is based on variables whose distribution does not differ much from State to State or county to county, then the exercise will inevitably lead to estimates which differ only minutely from estimates based simply on population size. On the other hand, if a powerful regression model can be generated and if it uses components which differ considerably from small unit to small unit, then the outcome will be quite different. Several speakers have mentioned that synthetic estimates for small areas usually do not differ much from area to area. The reason is obvious: practitioners have concentrated on predictors which maximally differentiated in terms of the criterion behavior; only as an afterthought have they remembered that such variables as sex and age do not differ very much from one small area to another. So the net outcome is disappointingly nondiscriminating.

2. In applying this procedure, we assume that the influential factors which apply to large aggregates apply equally well to small aggregates; that is, we assume that there are no significant interactive effects which are unique to the individual States or other entities for which estimates are generated.

3. We must also keep reminding ourselves that the search procedures we use in generating our explanatory model are themselves maximizing procedures. Regression statistics as applied in search procedures are descriptive statistics, fine-tuned so as to take every advantage of the idiosyncrasies of the particular sample in which they are calculated. In psychometrics, we know that cross-validation of a regression equation is expected to yield a lower R² on a new sample than it did on the sample from which it was derived. In exactly the same way, we are undoubtedly overestimating our explanatory power.
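The two-step synthetic procedure Cisin describes (fit an explanatory model on the national sample, then apply it to Census figures for an area that was not sampled intensively) can be sketched as follows. Everything here is invented for illustration: the six survey records, the two dummy predictors, and the State's demographic shares are hypothetical, and a real application would use survey weights and far more predictors.

```python
import numpy as np

# Step 1: fit a linear explanatory model on national survey data.
# Rows are respondents; columns are dummy predictors (e.g. male, under-25).
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0], dtype=float)  # 1 = reported drug use

# Add an intercept column and solve by ordinary least squares.
Xd = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Step 2: apply the model to a State's demographic composition
# (Census shares, not survey data) to get a synthetic prevalence estimate.
state_shares = np.array([1.0, 0.49, 0.31])  # intercept, share male, share under-25
synthetic_estimate = state_shares @ beta
print(round(float(synthetic_estimate), 3))
```

Note how the State contributes only its demographic composition; all behavioral information comes from the national model, which is exactly why areas with similar compositions receive similar estimates.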
Recognizing these limitations, I want to comment briefly on the importance of these procedures in various research applications, in addition to their applications to geographic estimates.

Exactly the same synthesizing procedures are widely used in generating regression estimates of missing data because of item nonresponse in surveys. Given that we have some information on the nonrespondents, we search the respondent data set for an explanatory model, seeking the correlates of the responses to an item that is missing among the nonrespondents, again seeking those correlates which differentiate among the responses within the respondent group and at the same time differentiate the respondent group from the nonrespondent group.

Similarly, but less widely recognized, we have applied this technique to the standardization of samples in natural quasi-experiments. Morris Rosenberg first suggested this tactic in his work on test-factor standardization. The paradigm is simple. Let's say we are studying the relationship between TV viewing and aggressive behavior; we do not have a controlled experiment; we have a survey, and we can compare the aggressive behavior of heavy viewers with that of light viewers; but heavy viewers and light viewers are self-selected, and the two groups differ markedly in various other ways. Obviously we should standardize the two viewing-level groups with respect to their other differing characteristics, using the Rosenberg procedures, but Rosenberg (1968) does not suggest systematic ways for choosing the variables on which to standardize. William Belson (1959), an English psychologist, gets credit for the first attempt, however crude, to select systematically the standardization variables which would in this instance differentiate between the viewing groups and, at the same time, differentiate on the criterion behavior--in other words, to select standardization variables which would do the most work.
Two constructive suggestions arise from consideration of these applications:

First, it seems obvious that the search procedures could be improved by use of an interactive tactic like AID rather than linear multiple regression. Certainly interactions can be built into linear multiple regression, but this has to be done artistically, as Reuben Cohen did it. The AID disadvantage of dichotomization of predictors is easily overcome, and the interactions among the predictors can be detected objectively.

Second, and most important, we should continue to explore techniques for systematic selection of predictor variables which provide maximum power; that is, predictor variables which contribute to explanatory power and at the same time differentiate among the small geographic units. The trick, of course, is to select standardization variables with optimum relationship to the two criteria. To start, we can follow Belson's lead: he developed a search technique which would make a stepwise selection among the candidate predictors this way: he invented a summary statistic to express the candidate variable's relationship with one of the criteria and separately its relationship with the second criterion. Then the basis for selection would be the product of the two summary statistics. Subsequent selections are accomplished stepwise in a manner that has become familiar in the AID adaptation. Although Belson's invented statistic is statistically questionable, we at the Social Research Group have been working with both correlation coefficients and analogs of chi-square to achieve the same objective in a statistically defensible manner.

The symbolic representation is simple: let variable 1 be a drug use criterion, and variable 2 be State of residence; then we are seeking a set of variables "3" that will maximize the absolute value of the product R13R23, not merely maximize the absolute value of R13.
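A minimal sketch of this dual-criteria selection, with invented data: three hypothetical candidate predictors are scored by the product |R13|·|R23|, where variable 1 is the drug-use criterion and variable 2 is, for simplicity, a binary area indicator rather than a 51-category State variable (the nominal-variable difficulty discussed below).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Binary "area" indicator (variable 2) and a drug-use criterion (variable 1).
area = rng.integers(0, 2, size=n).astype(float)
use = 0.5 * area + rng.normal(size=n)

# Three invented candidate predictors (variables "3"):
# x_a correlates with use but barely with area; x_b with area but barely
# with use; x_c with both, so it should win on the dual criteria.
x_a = use + rng.normal(size=n)
x_b = area + rng.normal(size=n)
x_c = use + area + rng.normal(size=n)
candidates = {"x_a": x_a, "x_b": x_b, "x_c": x_c}

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

# Score each candidate by |R13| * |R23| and select the maximum.
scores = {name: abs(corr(use, x)) * abs(corr(area, x))
          for name, x in candidates.items()}
best = max(scores, key=scores.get)
print(best)
```

The point of the product rule shows up directly: x_a explains the criterion well but cannot discriminate areas, x_b discriminates areas but says nothing about the criterion, and only x_c does both kinds of work.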
The correlation product is recognizable as the right-hand term of the numerator of the familiar formula for the partial correlation coefficient.

There are minor technical difficulties in our dual criteria technique. Since residence in the 51 States is a nominal variable, and we are using it as a criterion, we have some trouble with nominal variables as candidate predictors. Ideally, we could use correlation coefficients for some of our calculations and non-parametric chi-square analogs for others. But we have qualms about equivalence. In any case, we now have a solution for the dual criteria problem in simple cases like item nonresponse estimation, and we are confident that the approach can be generalized to more difficult practical problems.

REFERENCES

Belson, W.A. Matching and prediction on the principle of biological classification. Applied Statistics, 8:65-75, 1959.

Rosenberg, M. The Logic of Survey Analysis. New York: Basic Books, 1968.

General Discussion

* It is useful to note that there is a relationship between the regression-based estimates using dummy variables and the covering (nearly unbiased) estimates that Paul Levy discussed. When you use a regression procedure instead of using a cell mean in the covering estimate equation, you are using a predicted value of a cell mean from a linear combination of data. One advantage is that you can account for more variables because you are building up your degrees of freedom; you might be able to include six or eight variables (or however many you might want to use). Whereas, if you are using the covering estimator, then six or eight variables would involve a multiway cross-classification with 400 cells and would become awkward to use. Another advantage of the regression procedure is that by taking into account more variables you could probably get ones that are better (given that you have measured them and have them available).
The difficulty is that unless you carry out an assessment of the regression relationship you run the risk of leaving out variables. If you leave out variables, that causes estimates to have properties that may be misleading. If you did a statistical test that demonstrated that the interactions were unimportant, then the estimates based on the regression would be essentially the same as the estimates based upon the ordinary means, and they would probably have smaller standard errors. The dilemma is that the bigger you make the table, the poorer your ability to do the test. And then you have to start assuming that the model you are producing is useful on certain kinds of a priori considerations.

When you use these estimates you are adopting something called a "response error model" point of view. You are in essence saying: response errors dominate and sampling errors are less important. If it turns out that the assumption that there is no sampling error is an appropriate one, the regression estimates may be very satisfying. If it turns out that each particular unit in a population has unique characteristics, so that the sampling error is indeed important, then the prediction model may not work out very well. The dilemma is that most situations are a mixture of the two and we don't necessarily know how to deal with the mixture.

* One problem that exists is the multiple use of the same word: regression. When you take the regression approach in the sense of trying to find alternative indicators for geographic units you're interested in, one of the properties of that approach is that it allows you to make use of any information. One could use information which has nothing to do with the variables in the survey. For example, a practical suggestion would be to consider a regression estimate using the number of drug treatment centers in an area as a predictor variable.
Of course, sometimes after trying a predictor variable, it becomes necessary to throw it out as having poor predictive ability.

* Another way of considering the problem is to use available data for changing the strategy of the structure of the basic survey design. In this approach one would aim toward the use of the data not only for a national survey result but also for the basic needs of synthetic estimate purposes.

* The dilemma for the user is that while the technique discussed can be implemented, there seem to be problems of lack of variation among areas in the proportions of useful demographic variables, and a lack of explanatory power of predictor variables.

* (Joan Rittenhouse) I'd like to follow up on that point because I'm deeply involved in a data set, that is, the National Survey, which gives us very respectable estimates for drugs of wide prevalence, particularly marihuana. But our office gets calls constantly from States and localities, and they really need, not only for treatment purposes but also for public health purposes, good estimates for heroin. In the unidentified (i.e., nonclinical) population we have little, very little to help them. So when we got into the Levy discussion I began to feel like that bumper sticker which said "I found it," because it seemed like the answer to States and localities: synthetic estimates. We can give them this technology and they can put it to work to come up with the estimates they need. But a little later on in the Levy presentation, when he talked about the power to discriminate one area from another given equal distribution of powerful predictors such as age, I began to get a feeling more like the other bumper sticker which says "I lost it." All these small areas have people in these age groups; so there it goes. You get a very nondiscriminating estimate.
Reuben was suggesting a number of other non-age variables which contribute to the prediction of drug abuse less significantly than age, but which contribute something. They also discriminate one area from another: for example, race, and density of the population. Since these factors have been associated in the past with different rates of drug abuse, they would seem ideal for incorporation into the synthetic estimates procedure and for the generation of discriminating prevalence estimates by locality. So there may be a second chance to say "I found it." But the National Surveys have shown in the past two years or so that population density and race, to persist with these variables, are losing their meaning so far as drug abuse is concerned. The 1977 findings made the point even stronger; the differences are disappearing. So now I really feel that "I've lost it."

* The situation may not be that bleak, although you've dramatized the issues quite a bit. It may be useful to focus on the variance components--the between and the within components. The heart of the issue is how things vary not in the population as a whole, but area by area. One could suppose it is possible to get a moderately low R and find that most of it is accounted for by within-area variance. It would be necessary to investigate the between and within variance aspects to know whether the synthetic procedure would be useful.

* You want to look at two things. One is the R² for the national data; the other is the variability between areas in the composition of the population. Perhaps some statistical work could be done. It may be useful to determine and to define the combination of the two criteria under which it might be fruitful to try to use a synthetic estimator and the conditions under which it might not be.

* Another question: Is there a cutting point for R² before we should become serious about using the regression estimator?
It may be worth noting that sometimes the R² can be increased considerably by taking into account other variables (e.g., lifestyle variables in a drug use application). These may be considered soft types of variables, and some data collecting agencies may prefer not to collect them. However, these types of variables may be worth obtaining.

* It is necessary to consider whether there is a systematic way to get synthetic estimates which are as different as possible from simply applying the national data to the small areas. The answer may lie in selecting predictors--independent variables--such that the product R13R23 is maximized. This implies that you can determine a small set of predictors which maximally explain the criterion variables and are maximally different among the States or among the small areas. Setting up the dual criteria answers that question. If either one of the two relationships is zero, it doesn't matter how big the other one is--it is not going to make a difference. You might as well apply the national estimates. You can think of it as a continuum rather than a cutting point.

If you're going to predict a phenomenon temporally, you have to use demonstrably antecedent variables. However, if somebody else has done a survey in which questions concerning soft variables have been asked, there is no reason why the soft variables cannot be used in a synthetic estimate. The objective is not temporal prediction. The objective is estimation, and for estimation anything goes. They can be used for this purpose.

* The heart of the problem is not whether variables are soft or hard but what is the likelihood of being able to get reliable data at the local area level.

* One should consider using available data (e.g., the existence of treatment facilities for a specified disease) if the data are very reliable on a small area basis, and different from area to area, and demonstrably correlated with criteria.
There is nothing that restricts you to using only your own sample survey results.

* It would be useful if there existed an archive of national sample data that have been collected, giving the nature of the variables that have just been referred to, and if the information would be available so that you could assume certain relationships were preserved over time. But the point is worth recognizing that you are in a prediction mode. There may be something uncomfortable with the notion of maximizing variation between local areas, particularly at the State level, because a number of States are relatively homogeneous with respect to each other but very heterogeneous within. They are comprised of individual units which may be quite different county by county, or for the metropolitan area versus the rural area. If you are not careful with respect to R13R23, you could get into some difficulty; you may start out thinking of States but really want counties; and you probably should be pretty sure as to exactly why you are choosing a particular criterion. It is an interesting concept. However, it has to be used fairly carefully relative to where you want to produce the estimate.

* To summarize, if an analysis shows the demographic variables do not explain much of the variance of the dependent variables, then there may not be any point in going ahead and using a synthetic estimate for local areas with these variables. Even if there is a reasonable degree of explanation, if there is little variability in the distribution of the demographic variables among areas, the synthetic estimate approach may not be very useful. Political subdivisions are not necessarily going to be the areas that one wants to use for synthetic estimates. It may be better to produce estimates for classes of local areas that are likely to show better results and then recombine the results into the areas of interest for use.
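One suggestion in the discussion above was to examine how much of the criterion's variance lies between areas rather than within them. A minimal sketch with invented data: when area means barely differ, the between-area share of total variance comes out tiny, signaling that estimates built to discriminate among these areas would discriminate poorly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated criterion scores for 5 areas, 100 people each.
# Area means barely differ, so most variance is within areas.
area_means = np.array([0.10, 0.11, 0.12, 0.10, 0.11])
data = [m + 0.3 * rng.normal(size=100) for m in area_means]

all_vals = np.concatenate(data)
grand_mean = all_vals.mean()

# Between-area component: spread of area means around the grand mean.
between = np.mean([(d.mean() - grand_mean) ** 2 for d in data])
# Within-area component: average variance inside each area.
within = np.mean([d.var() for d in data])

share_between = between / (between + within)
print(round(share_between, 4))
```

In this contrived setup the between-area share is well under 10 percent, which is exactly the disappointing case the discussants describe; a powerful national R² would not rescue it.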
In our discussion the question arose whether the multiple regression synthetic estimator is better than the demographic synthetic estimator. It might be interesting to set up a test where sample size is varied to get some idea of how the variance and bias of the two types of synthetic estimates vary by sample size.

(Contributing to the general discussion during this period were: Ira Cisin, Reuben Cohen, Eugene Ericksen, Gary Koch, Fred Oeltjen, Louise Richards, Joan Rittenhouse, Monroe Sirken, Joseph Steinberg, and Joseph Waksberg.)

Applications of Synthetic Estimates to Alcoholism and Problem Drinking

David M. Promisel

ABSTRACT

This paper focuses on the application of synthetic estimation techniques to issues involving estimation of the prevalence of alcoholism and problem drinking. Demands for information led to the first use of synthetic estimation in this area. However, the experience of bringing that first application to fruition led to new uses where previously no attempt would have been made to develop information. Three examples are discussed briefly: estimating the relative prevalence among the States; identifying health manpower shortage areas; and calculating the need for service in a community.

BACKGROUND

The question "How many people are there with alcohol-related problems?" is a difficult one for two reasons: (1) defining what alcohol-related problems are; and (2) counting the number of people who have them. Alcohol is associated with a multitude of problems, ranging from alcohol addiction and behavioral difficulties associated with intoxication to diseases such as liver cirrhosis and various cancers resulting from excessive alcohol consumption. The causal nature of the association has been established in some cases and is only suspected in others.
Often, the individual's problem is the result of alcohol working in conjunction with other factors such as diet, genetic or familial conditions, psychological status, concomitant use of tobacco or other drugs, etc. And there is a reasonable degree of independence among all these factors, so that there is no small set of them that can be used as markers of the entire population with drinking problems. The World Health Organization has summarized this situation (Edwards et al. 1977) by defining two concepts. The "alcohol dependence syndrome" is "a state, psychic and usually also physical, resulting from taking alcohol, characterized by behavioral and other responses that always include a compulsion to take alcohol on a continuous or periodic basis in order to experience its psychic effects, and sometimes to avoid the discomfort of its absence; tolerance may or may not be present." In addition, an "alcohol related disability exists when there is an impairment in the physical, mental, or social functioning of such a nature that it may reasonably be inferred that alcohol is part of the causal nexus determining that disability."

Historically, two approaches have dominated attempts to estimate the prevalence of alcohol problems: surveys and indirect estimation. A useful review of this topic is provided by Keller:

In recent years numerous efforts have been made to identify by survey methods populations exhibiting drinking problems. For the most part these surveys have sought primarily to describe the drinkers and abstainers in general or particular populations, and secondarily to identify the kinds of motivations and problems associated with the drinking by some people, and the kinds of people who experience those problems. One important culmination of these efforts is the work of Cahalan and his associates.
Improving on prior methods they have developed a description of drinking that takes account of quantity, frequency and variability, and from the drinking thus delimited they have developed a classification of infrequent, light, moderate and heavy drinkers. Based further on reported reasons for drinking, they have extracted a class of "escape" drinkers. These are persons who reported two or more of the following motives: (a) helps them relax, (b) is needed when tense, (c) cheers up, (d) helps forget worries, (e) helps forget everything. (Keller 1975)

Reprinted by permission from Journal of Studies on Alcohol, Vol. 36, pp. 1442-1451, 1975. Copyright by Journal of Studies on Alcohol, Inc., New Brunswick, NJ 08903

Building on these techniques, the National Institute on Alcohol Abuse and Alcoholism, shortly after its founding in 1971, initiated a series of national surveys. Over a five-year period, seven surveys were conducted by Louis Harris and Associates (Harris and Associates, Inc. 1974) and Opinion Research Corporation (Rappeport, Labow and Williams 1975). It has proven quite difficult to merge all of the Cahalan and later surveys for analysis purposes. However, for illustration, table 1 shows the results of an analysis of data on problem drinking from several of the NIAAA-sponsored surveys. These results suggest that of adults who drink, about 10 percent can be classified as problem drinkers, with women having a substantially lower rate than men. An example of the combined use of Cahalan's and these later surveys applied to synthetic estimation is provided later in this paper. A national survey commissioned by NIAAA is currently being designed which, among other things, will specifically establish the linkages among the alcohol problem indicators used in these various surveys.
Some of the difficulties in using survey methods for estimating prevalence were described briefly by Cahalan:

However, survey methods have some inherent drawbacks, a few of which are worth noting here. They are relatively costly and time consuming. Area probability samples may miss people who are not in households--and these may be people who are particularly relevant to alcohol studies. Thus the Armor report suggests that the clinic populations are more extreme in alcohol use than survey data indicate. Surveys depend upon the cooperation of respondents and thus in large part they collect respondents' estimates and recollections, which may of course be inaccurate: not only in the playing down of unflattering materials, but also the reconstruction of the past in terms of what "everyone knows" about alcohol use and alcohol problems. (Cahalan 1976, p. 17)

Jellinek's formula is the famous instance of application of indirect techniques to prevalence estimation. Jellinek hypothesized (Keller 1975) that there was a relatively constant relationship between alcoholism and mortality from cirrhosis which would permit an estimate to be made of the number of "alcoholics with complications." This led to the development of the formula A = (PD/K)R. In this formula, the number of reported deaths from cirrhosis in a given year, D, is multiplied by P, the presumed constant percentage of such deaths attributable to alcoholism (different for men and women), and divided by K, another constant, representing the percentage of alcoholics with complications who die of cirrhosis. The result is then multiplied by R, the ratio of all alcoholics to alcoholics with complications in the given place and time. Over time, many including Jellinek expressed doubt about the reliability of this formula and the constancy of its parameters. One proposed solution was a modified version of the formula.
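The formula as just described translates directly into code. The parameter values in the example call below are illustrative placeholders only, not Jellinek's published constants.

```python
def jellinek_estimate(deaths, p, k, r):
    """Jellinek formula A = (P * D / K) * R.

    deaths: reported cirrhosis deaths in a given year (D)
    p: fraction of those deaths attributable to alcoholism (P)
    k: fraction of alcoholics-with-complications who die of cirrhosis (K)
    r: ratio of all alcoholics to alcoholics with complications (R)
    """
    return (p * deaths / k) * r

# Illustrative values only (not Jellinek's published constants):
# 1,000 cirrhosis deaths, P = 0.5, K = 0.05, R = 4.
print(jellinek_estimate(1000, 0.5, 0.05, 4))
```

Written out this way, the formula's fragility is easy to see: the estimate scales linearly with each of P, 1/K, and R, so any drift in those "constants" over time or across places propagates directly into the prevalence figure.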
Keller argued that there was no evidence that the basic rates associated with alcoholism in the U.S.A. had undergone any substantial change since the early 1940's. If then the average basic rate of the years 1940-1945, when the formula appeared to yield reliable results, were applied to the current population, an approximation of the prevalence of alcoholism could be derived. This has been the method used in the Efron, Keller, and Gurioli series, Statistics on Consumption of Alcohol and on Alcoholism, published by the Rutgers Center of Alcohol Studies. Even with these modifications, however, numerous questions remain regarding the adequacy of the formulation, estimation of parameter values, and the nature of the alcoholic population represented by this estimation procedure. Nevertheless, indirect techniques are believed to have large potential utility for prevalence estimation and are currently under active investigation by NIAAA.

As difficult as it may be to estimate, prevalence is central to innumerable program and policy decisions. These decisions range from the need to compare the numbers of people suffering from various health problems to the requirement for predicting the extent to which alcoholism treatment benefits will be utilized under national health insurance. The next section describes three examples of synthetic estimation techniques applied to alcoholism prevalence questions: estimating the relative prevalence among the States; identifying health manpower shortage areas; and calculating the need for service in a community.

CASE STUDIES

1. Relative Prevalence of Alcohol Problems Among the States

In the legislation establishing NIAAA in 1971, a requirement was stated that revenue sharing funds be allotted to the States "on the basis of the relative population, financial need, and need for more effective prevention, treatment and rehabilitation of alcohol abuse and alcoholism."
For several years, need for more effective prevention, treatment and rehabilitation was expressed by the relationship of the population of each State to the total population of all the States. However, in the report of the Committee on Labor and Public Welfare, U.S. Senate, in 1976, it is stated that the Committee was distressed to learn that this "need" provision in the law had been totally disregarded. As a result, the legislation that was passed that year to continue the existence of NIAAA required that within 180 days the Secretary of HEW, by regulation, establish a methodology to assess and determine the incidence and prevalence of alcohol abuse to be applied in determining this "need."

The NIAAA, with the help of the National Center for Health Statistics, undertook to respond to this congressional mandate. It was clear that the response needed to be quick and that it should be equitable to the States in that they should not be penalized for their reporting practices. It was decided that the best way to ensure equitability was to use national data sources such as national population surveys and data collected by the U.S. Census Bureau. In the time available the only mechanism for developing prevalence estimates was the use of synthetic estimation in conjunction with the data that were then available. It was not possible to initiate collection of new data. It should be noted that there was no necessity to estimate the actual number of alcoholic people in each State but only the relative numbers from State to State.

The problem became one of defining an index of alcohol problems and then establishing on a national basis the relationship of various demographic variables to this index. There was no single measure felt to be sufficiently indicative of all alcohol problems. Furthermore, there did not exist a single survey considered to be definitive for the purposes of establishing the necessary relationships.
Accordingly, two surveys were used, with a different index of problem drinking from each. These were selected strictly on a judgmental basis. The first survey was carried out by the Social Research Group (SRG), University of California at Berkeley (Cahalan 1970) in 1967. The other was the Harris Alcohol Survey of December 1971. The two indices of problem drinking are:

(1) Frequent Heavy Drinking (FHD) - the number of times per week that a respondent drinks 5+ drinks on one occasion (coded in 4 categories). Based on the Harris survey.

(2) Current Tangible Consequences (CTC) - an additive score concerning problems with spouse, relatives, friends, job, police, finances, and health (coded to 10 categories). Based on the SRG survey.

The first, FHD, was considered representative of chronic alcohol problems in need of treatment. The second, CTC, was associated more with intoxication and incipient alcoholism, where prevention programs would be appropriate. The eight individual characteristics used to "predict" problem drinking are: age, sex, residence (urban/rural), race, region of the U.S., marital status, education and income. The choice of these characteristics was based on their known relationship to alcohol problems and their availability on a State basis from the U.S. census.

The statistical technique used to establish the relationships is called the Automatic Interaction Detector (AID) (Sonquist, Baker and Morgan 1973; Sonquist and Morgan 1964). This approach is somewhat analogous to "stepwise regression," where the independent variables need not be quantitative nor even categorized into equal intervals or into ordinal categories. The results of the AID analyses are shown in figures 1 and 2, and an example of the use of this information is provided in table 2. It can be seen in figure 1 that the best single predictor with the FHD index is sex. The only other significant split for females was marital status.
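A one-step sketch of the AID idea with invented data: among candidate binary splits, choose the one that maximizes the between-group sum of squares of the criterion. The real AID program also forms the dichotomies itself and recurses on each resulting subgroup, which this sketch omits.

```python
import numpy as np

def best_split(y, predictors):
    """One AID-style step: among binary predictors, pick the split that
    maximizes the between-group sum of squares of the criterion y."""
    best_name, best_bss = None, -1.0
    for name, x in predictors.items():
        g1, g0 = y[x == 1], y[x == 0]
        if len(g1) == 0 or len(g0) == 0:
            continue  # a degenerate split explains nothing
        bss = (len(g1) * (g1.mean() - y.mean()) ** 2
               + len(g0) * (g0.mean() - y.mean()) ** 2)
        if bss > best_bss:
            best_name, best_bss = name, bss
    return best_name, best_bss

# Invented example: "sex" separates the drinking index far more than "urban",
# mirroring the finding that sex was the best first split for FHD.
sex = np.array([1, 1, 1, 1, 0, 0, 0, 0])
urban = np.array([1, 0, 1, 0, 1, 0, 1, 0])
fhd = np.array([3.0, 2.5, 3.5, 3.0, 1.0, 0.5, 1.0, 0.5])

name, bss = best_split(fhd, {"sex": sex, "urban": urban})
print(name)
```

Because each split is chosen greedily, AID shares the maximization caveat raised earlier in the workshop: the splits capitalize on sample idiosyncrasies and should be validated before use.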
The FHD factors for males included age, marital status, region of the country, education, and income. For the CTC index, in addition to sex, race, age, marital status, and geographic region were also significant.

The final "need" index, or index of relative prevalence, proposed in response to the congressional mandate was as follows: the total FHD and CTC scores for the State were divided by the national average scores to produce relative scores for the State; the mean of the resulting relative FHD and CTC scores was the relative measure of alcohol abuse and alcoholism in each State, or the "need for more effective prevention, treatment, and rehabilitation."

The index of relative prevalence is combined with population data and financial need in a formula which computes for each State its allotment from the Federal revenue sharing fund established for use with alcohol programs.

This formula was presented in a Notice of Proposed Rulemaking published in the Federal Register (Vol. 42, No. 21, pp. 6066-6069) in February of 1977. In that notice, comments on the formula were requested, and 46 letters were received by NIAAA. Summaries of these letters and the NIAAA responses to them were published in the Federal Register (Vol. 42, No. 227, pp. 60398-60403) in November of 1977.

The dominant theme of the responses was objection that some States would get reduced funds as a result of the formula. To resolve that issue, legislation was passed specifying essentially that no State shall receive an allotment less than it would have received using the formula in its prior version. Several comments pertained more specifically to the needs index derived from the synthetic estimates. Objections were made that the estimates were based on survey data gathered in 1967 and 1971 and were unreliable because of their age.
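The relative-prevalence computation described above reduces to a few lines; the State and national scores used here are hypothetical.

```python
# Sketch of the "need" index: each State score is expressed relative to
# the national average, and the two relative scores are averaged.
# All FHD/CTC values below are hypothetical.

def need_index(state_fhd, state_ctc, natl_fhd, natl_ctc):
    """Mean of the State's relative FHD and relative CTC scores
    (1.0 = exactly the national average)."""
    return (state_fhd / natl_fhd + state_ctc / natl_ctc) / 2

# A State 10% above the national FHD average and 10% below the
# national CTC average lands exactly at the national average overall:
print(need_index(0.165, 0.423, 0.150, 0.470))
```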
There were complaints that the indices used were unreliable, and proposals were made to replace them with others considered to be more suitable, such as per capita consumption of alcohol, deaths from cirrhosis of the liver, or alcohol-related fatalities. Others pointed out that the indices used did not reflect specific geographic factors, such as those that occur in rural areas or States with special problems, such as Florida; and some objected to the relative weight assigned to need compared to the other factors in the formula.

The general response by NIAAA to these concerns was to point out that NIAAA planned to undertake a new national survey to get current data; that the regulations did not require that the same indices be used each year, so that better indices could be implemented after they became available; that there were restrictions on the use of indices resulting from the need to be both comprehensive regarding alcohol problems and thoughtful of the reporting capabilities of each of the States; and that some valid issues could not be resolved with the knowledge available at the moment.

2. Identification of Health Manpower Shortage Areas

The Health Professions Educational Assistance Act of 1976 contains a number of provisions providing support for the education and training of individuals working in health services. Certain geographic areas with shortages of health services will be eligible to request National Health Service Corps personnel. They will also constitute areas of service for those receiving aid from Public Health Service scholarships and loan repayment programs. This concept of manpower shortage areas will also be used in connection with other Public Health Service programs.

In late 1976, the NIAAA was given the opportunity to recommend criteria for use in determining which geographic areas had a shortage of alcoholism treatment personnel. At that time, manpower in the alcoholism context referred solely to psychiatrists.
Conceptually, identification of manpower shortage areas is a function of estimates of the prevalence of problems in given areas, specifications of model staffing patterns and desirable staff-to-client ratios, and inventories of available manpower. None of this was available for use in identifying alcoholism manpower shortages. Nevertheless, it was considered important that the alcoholism factor play some role in connection with implementation of the various programs in the Educational Assistance Act. The work that was then going on in developing relative prevalence estimates among States offered a feasible approach to this problem.

Accordingly, it was argued that individuals with alcohol problems consume a substantial portion of total health care resources. For example, estimates were available indicating that 20 to 25 percent of all hospital beds are occupied by alcoholics and that 17 percent of the physician's practice involves alcoholics. In addition, alcohol admissions in one study represented 47 percent of all male additions to State and county mental health hospitals during a one-year period. Thus, treatment of alcohol-related problems pervades the service of all primary health care physicians and psychiatrists.

It was proposed that alcohol-related health manpower shortage areas be identified in terms of the added numbers of psychiatrists required to provide alcohol-related treatment in communities with a relative excess prevalence of alcoholism. This assumed that requirements for numbers of psychiatrists to treat the mean level of alcohol problems were included in the general manpower requirements enunciated by the Public Health Service.

This proposal was generally accepted. The "interim final" regulations for designation of areas having shortages of psychiatric manpower state that one criterion for eligibility is that an area has an unusually high need for mental health services.
One such unusually high need is stated as follows:

    A high prevalence of alcoholism in the population, as indicated by a relative prevalence of alcoholism problems which exceeds that in 75 percent of all catchment areas (or other complete set of areas for which the prevalence index is computed), using the index of relative alcoholism prevalence developed by the National Institute on Alcohol Abuse and Alcoholism for the purposes of allotting funds under 42 U.S.C. 4571. (Federal Register, Vol. 43, No. 6, Jan. 10, 1978, p. 1592)

The index of relative alcoholism prevalence had been developed on a State basis. However, these manpower shortage areas had to be defined for much smaller geographic units. Psychiatric manpower requirements were being calculated for Community Mental Health Center (CMHC) catchment areas, so the same units had to be used for alcoholism purposes. The National Institute of Mental Health maintains a Mental Health Demographic Profile System on a catchment area basis. These data were used for calculating the FHD and CTC indices. The same categories of the population were used as had been identified by the AID procedure for the States. However, no education or income information was available, so these categories were dropped from the calculation.

There are approximately 1,500 CMHC catchment areas, the top 25 percent of which are to be considered as representing alcoholism manpower shortages. Table 3 compares the States represented in this top 25 percent with the top 13 States identified in the State calculations. It can be seen that 10 States appear on both lists and that there is some degree of correspondence in their rank order (the catchment area list is based on the numbers of catchment areas in the top 25 percent by State, so that the State with the largest number of designated catchment areas, California, is first on the list).
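The top-quarter selection rule can be sketched in a few lines; the area names and index values below are hypothetical.

```python
# Sketch of the shortage-area criterion: an area qualifies when its
# relative prevalence index exceeds that of 75 percent of all areas,
# i.e., it falls in the top quarter.  Names and values are hypothetical.

def shortage_areas(index_by_area):
    """Return the set of areas in the top 25 percent by index."""
    ranked = sorted(index_by_area, key=index_by_area.get, reverse=True)
    k = len(ranked) // 4                  # top quarter of all areas
    return set(ranked[:k])

areas = {"A": 1.30, "B": 0.90, "C": 1.10, "D": 0.80,
         "E": 1.25, "F": 1.00, "G": 0.95, "H": 1.05}
print(shortage_areas(areas))   # the two highest-index areas (order may vary)
```

With roughly 1,500 catchment areas, this rule designates about 375 of them.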
Again, the regulations specify only the methodology to be used and not the specific data. The currently available list of shortage areas has not been subjected to thorough analysis for its reasonableness, nor are the comments made in response to publication of the proposed regulations yet available. However, as new data become available and as greater understanding is achieved of the relationships between demographic variables at the local level and indices of alcohol problems, new calculations will be made.

3. Estimating the Number of Persons Needing Alcoholism Treatment Services

One last example will be discussed briefly to illustrate the use of synthetic estimation of alcoholism prevalence in yet another area of application. Increasingly, at all levels of government, pressure is being brought to bear on service providers to estimate the number of people who might need and could use their services. Marden reviewed this situation at the request of NIAAA and proposed a solution based on the use of synthetic estimation.

A review was made of 385 proposals for grant funds to provide direct services. Forty-three percent included no estimate of the number of alcoholics in the service area; another 18 percent provided estimates with no indication of their origin. Table 4 describes the remainder. As can be seen, a diversity of techniques is used, many of them quite crude.

It was argued that any proposed remedy to this situation should take into account several considerations. Prescribed procedures for developing the estimate would have to be appropriate for use by service individuals lacking experience with research techniques. The procedures should be flexible and easily modified as additional pertinent information became available. And data requirements should reflect the availability of data in local areas.

A Population Matrix and a "Problem Drinker" Matrix were developed. The Population Matrix had dimensions of sex, age, and occupation.
The cells of this matrix were to contain the size of the population in that geographic area that corresponded to the designated demographic characteristics (e.g., the number of male sales workers aged 20-29 living in that area). The "Problem Drinker" Matrix had the same dimensions. However, its cells contained the proportions of the various subpopulation groups whose scores on the Cahalan problem scale exceeded a threshold value. Estimates of these proportions were obtained from the national household surveys conducted by Cahalan. To estimate the number of people in a given area with alcohol problems, one had only to get the local population breakdown, multiply it cell by cell by the "Problem Drinker" Matrix, and add up the cells.

This application of synthetic estimation is similar to the preceding two in that primarily the method, rather than the specific data, is being prescribed. It differs in producing an estimate of the actual prevalence of alcohol problems. The other applications provided only relative estimates, a somewhat easier task. Marden's approach has been widely used, but the results of this use have not been carefully studied.

CONCLUSION

The use of synthetic estimation techniques has permitted the NIAAA to respond to congressional mandates and take initiatives not otherwise possible. The methodology seems to have been accepted by government policy makers, the general public, and to some extent, at least, the technical community. It could be argued that synthetic estimation is an elegant stopgap measure, either to be used until more direct information can be obtained or to replace more expensive direct estimation whose added value is questionable.

TABLE 1

RATES OF PROBLEM DRINKING AMONG U.S. DRINKERS, BY DRINKING POPULATION, 1973-1975

Percentages for each survey:

Drinking Population       Mar. 1973   Jan. 1974   Jan. 1975   June 1975
All Drinkers
  No problems                 64          70          65          63
  Potential problems          26          24          24          26
  Problem drinkers            11           6          10          10
Males
  No problems                 57          66          62          57
  Potential problems          29          27          23          31
  Problem drinkers            14           8          15          13
Females
  No problems                 74          77          70          73
  Potential problems          21          19          27          21
  Problem drinkers             5           4           3           6

SOURCE: Paula Johnson, David Armor, Susan Polich and Harriet Stambul, U.S. adult drinking practices: time trends, social correlates, and sex roles. Draft report prepared for the National Institute on Alcohol Abuse and Alcoholism under Contract No. ADM 281-76-0020, July 1977.

NOTE: A problem drinker experienced four or more of sixteen problem drinking symptoms frequently, or eight or more symptoms sometimes; a potential problem drinker experienced two or three of sixteen problem drinking symptoms frequently, or four to seven symptoms sometimes.

TABLE 2

HYPOTHETICAL SYNTHETIC ESTIMATES FOR CTC

Subgroup                                        Mean CTC   Proportion of State
                                                Index      Population in Subgroup
1.  Black males 35+                               .602           .046
2.  Black males 21-35                            2.034           .012
3.  White males 65+                               .200           .048
4.  White males under 65 who are married          .450           .378
    or were never married
5.  White males under 65 who were                 .980           .010
    previously married
6.  Black females                                 .490           .063
7.  White females living in Pacific region        .423           0*
8.  White females 65+ living outside              .035           .069
    Pacific region
9.  Previously married white females under        .395           .369
    65 living outside Pacific region
10. Married or single white females under         .151           .005
    65 living outside Pacific region
                                                                1.000

Synthetic estimate: CTC = .602 x .046 + 2.034 x .012 + .200 x .048 + .450 x .378 + .980 x .010 + .490 x .063 + .423 x 0 + .035 x .069 + .395 x .369 + .151 x .005 = .421

* This value is zero since the hypothetical State is not in the Pacific region. If the State were in the Pacific region, this value would be the proportion of white females in the State's population, and the proportions in subgroups 8, 9, and 10 would all be zero.
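The synthetic estimate in Table 2 is simply a weighted mean, and can be reproduced directly:

```python
# Table 2's synthetic estimate: each subgroup's mean CTC index times the
# proportion of the State's population in that subgroup, summed over the
# ten subgroups.
mean_ctc   = [0.602, 2.034, 0.200, 0.450, 0.980,
              0.490, 0.423, 0.035, 0.395, 0.151]
proportion = [0.046, 0.012, 0.048, 0.378, 0.010,
              0.063, 0.000, 0.069, 0.369, 0.005]
ctc = sum(m * p for m, p in zip(mean_ctc, proportion))
print(round(ctc, 3))   # 0.421
```

Marden's Population Matrix and "Problem Drinker" Matrix work the same way: multiply cell by cell and sum.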
TABLE 3

LISTING OF STATES IN ORDER OF DECREASING RELATIVE PREVALENCE, DOWN TO THE 75TH PERCENTILE

Based on Manpower          Based on "Needs"
Shortage Calculations*     Estimate Calculations
California                 Alaska
New York                   District of Columbia
Washington                 Hawaii
Oregon                     New Jersey
Illinois                   California
Louisiana                  New York
Pennsylvania               Pennsylvania
Alabama                    Washington
New Jersey                 Louisiana
Texas                      Mississippi
Alaska                     Oregon
Mississippi                Alabama
Michigan                   Nevada

* Catchment areas in the top 25% of relative prevalence were tallied by State. States were then ranked in order of the number of catchment areas listed.

TABLE 4

METHODS OF ESTIMATING THE NUMBER OF ALCOHOLICS USED BY FUNDED PROPOSALS

                                         Number of    Percent of
                                         Proposals    Proposals
Jellinek Formula                             40          23.3
Agency or Other Records                      55          32.2
  Arrest Records                             29
  Unemployment Figures                       17
  State Mental Health Statistics              9
Percentage of Population                     61          35.7
  Percentage of Adult Population             10
  Percentage of Total Population             --
  Percentage of Low Income Population        --           6.5
Sample of Population                         15           8.8
Total                                       171         100.0

FIGURE 1. FREQUENT HEAVY DRINKING (Harris Survey) -- AID tree diagram. FHD = number of times per week a respondent drinks 5+ drinks on one occasion.

FIGURE 2. CURRENT TANGIBLE CONSEQUENCES (SRG Survey) -- AID tree diagram. CTC = additive score based on hierarchical scores concerning problems with spouse, relatives, friends, job, police, finances, and health.

REFERENCES

Cahalan, D. Problem Drinkers. San Francisco: Jossey-Bass, 1970.

Cahalan, D. "Some Background Considerations in Estimating Needs for States' Services Dealing with Alcohol-Related Problems." Paper prepared for presentation at the Conference on "Need" Methodology for Formula Grants, HEW, 1976.

Edwards, G.; Gross, M.M.; Keller, M.; Moser, J.; and Room, R., eds. Alcohol-Related Disabilities. Offset Publication No. 32. Geneva: World Health Organization, 1977.

Harris, L., and Associates, Inc. Public Awareness of the National Institute on Alcohol Abuse and Alcoholism Advertising Campaign and Public Attitudes Toward Drinking and Alcohol Abuse. Phase 0: Fall 1971, Study No. 2138; Phase One: Fall 1972, Study No. 2224; Phase Two: Spring 1973, Study No. 2318; Phase Three: Fall 1973, Study No. 2342; Phase Four: Winter 1974; and Overall Study, Study No. 2355. Reports prepared for the National Institute on Alcohol Abuse and Alcoholism.

Keller, Mark. "Problems of Epidemiology in Alcohol Problems." Journal of Studies on Alcohol, Vol. 36, No. 11, 1975.

Marden, Parker G. "A Procedure for Estimating the Potential Clientele of Alcoholism Service Programs." Prepared for the Division of Special Treatment and Rehabilitation Programs, National Institute on Alcohol Abuse and Alcoholism.

Rappeport, M.; Labow, P.; and Williams, J., for Opinion Research Corporation.
The Public Evaluates the National Institute on Alcohol Abuse and Alcoholism Public Education Campaign, Vols. I and II, July 1975.

Sonquist, J.A., and Morgan, J.N. The Detection of Interaction Effects. Monograph No. 35. Michigan: Survey Research Center, Institute for Social Research, 1964.

Sonquist, J.A.; Baker, E.L.; and Morgan, J.N. Searching for Structure. Rev. ed. Michigan: Survey Research Center, Institute for Social Research, 1973.

Discussion

Donna O. Farley

Following a full day of discussion of the statistical design, characteristics, and limitations of synthetic estimation, I am finding that some of my own questions and concerns about the method have been reinforced by the experts of the field. Therefore, my discussion will address somewhat philosophically several of my questions, while focusing on the need for an estimation method by many people in the health field, and on the growing tendency to use synthetic estimation regardless of its limitations.

I am trained in environmental health, and my perspective reflects that training. The work I have done with synthetic estimation, which I will summarize briefly, was for the purpose of developing an instrument that could be used as part of environmental health impact assessments. But first there are several points that have already been made by many of the speakers, which I would like to reiterate, with a slightly different viewpoint:

1. There is, as we know, a growing demand for small area estimates. That demand is coming from local sources as well as national, and the areas involved are often of very small geographical and population size. I can cite the health planning agency for which I am presently working as an example of that local initiative. There are at least three different demands for local estimates within the agency.
These include a) needs assessments for review of projects under Certificate of Need or for grant applications, b) the internal agency need for morbidity estimates, and c) estimates for use in problem identification as part of our planning process.

2. The appeal of synthetic estimation will probably tend to make it a preferred technique in the field. It can be easily conceptualized, is adaptable to many different data sets, and can be used readily by practitioners without extensive statistical training. The latter characteristic is one I feel should be emphasized here, because expertise such as that around this table is not always readily available to assure judicious local use of this uncertain method.

3. Research findings have demonstrated the wide variation in the validity of synthetic estimates. This variation indicates that the method should not be used casually, but with a conservative approach, recognizing that while synthetic estimation may estimate some variables well, for others it will be much less effective. The people in the field need to be kept aware of that fact.

The applications described in David Promisel's paper offer excellent examples of these three points. All three applications in his paper resulted from governmental mandates tied to the distribution of dollars. The very different approaches used demonstrate the flexibility in application of synthetic estimation; but because local direct estimates were not available for validation, the estimates themselves must be considered to be at the least uncertain. They filled a need, however, and quite possibly are the best estimates around at this time.

My own work with synthetic estimates also filled a need, although not one that involved financial allocations.
In order to estimate the potential health impacts of air pollution, one needs to know the number of people exposed to the pollution, the severity of the exposure (measured as dosage if possible), and a dose-response relationship which will convert the dosage to estimated health effect. People with certain chronic health conditions are at high risk from such exposures, and therefore should be included in a health impact assessment. Among the important high risk groups are those with chronic respiratory conditions; our work dealt with three of these: chronic bronchitis, emphysema, and asthma. Synthetic estimates of local death rates for these conditions were developed and compared with actual local death rates. The estimates were not precise, and the variables of age, race, and sex accounted for only a small portion of local differences in rates. However, when compared to estimates based on local application of State level crude death rates for the diseases, the synthetic estimates were the better estimates of local death rates.

The local estimates needed for our work, though, were prevalence rates rather than death rates. Yet the validity of synthetic estimates of prevalence could not be evaluated without local direct estimates of prevalence. Although we recognize that limitation, synthetic estimates of the local prevalence of the three conditions have been used in subsequent work, with the intuitive expectation that they are better estimates than those based on the local application of national level crude prevalence rate estimates.

Another phenomenon was observed during the validation work with the death rate estimates. The synthetic death rate estimates tended to cluster around the mean, not showing the same local variation as the actual local death rates. This characteristic has also been mentioned several times in this workshop.

In order to take advantage of the available local mortality data, another approach to prevalence estimation was developed. A synthetic estimate of the ratio of cases to deaths for a disease was calculated for an area, then applied to the actual local death rate, yielding what was called a "death rate conversion" estimate of the local prevalence rate.
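One plausible reading of the death rate conversion computation is sketched below; the paper gives no formulas, so the cell structure and all numbers here are hypothetical.

```python
# Sketch of a "death rate conversion" estimate (hypothetical inputs).
# A synthetic cases-per-death ratio -- cell-specific ratios weighted by
# the area's demographic mix -- is applied to the observed local death
# rate to yield a prevalence estimate.

def death_rate_conversion(ratio_by_cell, pop_share_by_cell, local_death_rate):
    """Synthetic cases-to-deaths ratio for the area, times the actual
    local death rate (deaths per person per year)."""
    ratio = sum(ratio_by_cell[c] * pop_share_by_cell[c]
                for c in pop_share_by_cell)
    return ratio * local_death_rate

# Hypothetical demographic cells for one chronic condition:
ratio_by_cell = {"under_65": 400.0, "65_plus": 120.0}  # cases per death
pop_share     = {"under_65": 0.85,  "65_plus": 0.15}
prevalence = death_rate_conversion(ratio_by_cell, pop_share, 5e-5)
print(prevalence)   # estimated cases per person, about 0.018
```

Because the observed local death rate enters the product directly, local variation not captured by the demographic cells still influences the estimate.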
The assumptions underlying this approach are 1) that the cell-specific case fatality experience among people with a disease is at least as consistent as the prevalence or death rates for the same cells, and 2) that building the estimates from the actual local death rates brings into the estimate the influence of local variables that are not reflected in the regular synthetic estimates of prevalence. It is an intuitively appealing approach, and I ask for your comments on it. This method has also been used, whenever mortality data were available, for chronic bronchitis, emphysema, and asthma. It consistently yields smaller prevalence estimates for these conditions than does synthetic estimation of prevalence.

In summary, I would like to address an underlying issue of the workshop, which already has been discussed at length. The studies described in David Promisel's paper show that approaches using synthetic estimates of either relative or absolute values can be and are being used quite freely with various demographic data bases. Similarly, his paper and my own efforts show that a variety of approaches can be designed for producing local synthetic estimates. If the user can expect that those estimates will be better than those from cruder methods, synthetic estimation will probably be used -- for better or for worse. The use of synthetic estimation will probably increase, with people of various levels of training in diverse disciplines applying it to their own specific problems. Recognizing this, we need an answer to a very practical question:

How freely can synthetic estimation be used for different variables and for different geographical areas; and perhaps more importantly, what modifications or adjustments can be made in its application to enhance the validity of the local estimates?

This is not a new question, nor by any means a simple one, but I ask it with the perspective of a user of the method who is aware it has limits.
There are growing numbers of users who need to be kept aware of its practical limits, its capabilities, and the ways in which its use can be optimized. Those of you here who are the collective "parents and guardians" of synthetic estimation are the ones who can help provide that guidance.

General Discussion

* Donna Farley's use of synthetic estimates raises an interesting point. Her problem was to try to devise synthetic estimates for the prevalence of chronic obstructive lung disease. She used death data to estimate the deaths and then had to convert that to an estimate of prevalence. One of the problems is: How good are NCHS data from two sources, HIS or HES estimates of the prevalence of, say, chronic obstructive lung disease? Can they be used with data on deaths that are also diagnosis-specific but from a different data system? I would like to pose the further question: Is this a useful area on which to put more emphasis for estimating case fatalities? This is an area of extreme importance to epidemiologists.

* We don't feel that we know as much as we should about the validity of classification of causes of death, particularly as it varies from one place to another. NCHS is, in fact, doing some studies now. Some work has been done in the past for selected diseases, but the thought is to have a more systematic attempt to evaluate the quality of the classification of causes of death. We are thinking of it primarily in terms of national statistics. Measuring validity for local areas will be even tougher than producing prevalence estimates for those local areas.

In the broader perspective, what we're talking about is: What kind of data do we have at the local area level in addition to complete count data from the decennial census? There are vital statistics on a complete count basis for local areas. The registration system provides the advantage that vital registration is a continuous system. The statistics are available on an annual basis.
It might be interesting to compile a listing of the kinds of statistics that are available for local areas with some consistency and therefore are potentially useful for synthetic estimates.

Measuring the prevalence of disease is one problem that interests NCHS very much. The primary instrument that has been used is the Health Interview Survey. Securing diagnostic information in a personal interview is subject to serious quality limitations. Now a completely different kind of survey approach is being explored--a survey of medical sources. The hope is by that means to get diagnosed cases of disease. Some fairly large studies are being done now trying to estimate disease prevalence by means of surveys of medical sources (including physicians and hospitals and other places that provide care). The area of collecting data on prevalence, whether you're talking of drug use, alcohol use, or chronic diseases, is a very difficult one. Over the next five or ten years, perhaps, survey methodology will be developed. For local area data, one part of the system is to develop hospital discharge statistics within each of the States and then build up to the national level. If that kind of approach is productive, eventually there will be much greater information at the subnational, State, and perhaps the local level.

* As we know, there are some administrative programs that have operating data in the same area, e.g., the Medicare program. There also are abstracted data from various hospital-based systems. There are now two reports by the Institute of Medicine (1977a, 1977b) that deal with the quality of coded diagnostic data. One is for several abstracting services collectively, and the other deals with the Medicare system.

* Yesterday there was some discussion about the desirability of statisticians providing comments to Congress regarding the feasibility of compiling certain types of data. I'd like to reinforce the need for such activity. It extends beyond Congress.
Congress imposes demands on the Executive Branch. Within the Executive Branch we impose demands on State and local governments. There are two kinds of problems. One is: Is the question reasonable? For example, we request estimates of relative prevalence, but that is not what the law asked for. The law asked for need, not prevalence. Someone arbitrarily equated the two (and probably had difficulty in defining the term, let alone estimating it). So perhaps that is not a reasonable question. Is the request to identify need a reasonable one? If it is reasonable, it has to be couched in very careful terms. For example, there may be quite a difference between estimating the number of people who have a certain ailment and estimating the need for a particular service as a result. Even if the question is reasonably posed, there is the question whether it can be answered. There is no bulwark against the flood of demands for information. Hopefully, it doesn't do too much harm; but are we sure?

* Perhaps the following idea is responsive to the concerns that have been raised. There are two basic issues that we have talked about from time to time. One is how to produce different kinds of local estimates given certain kinds of data sets. The other is how to provide some sort of advice to policymakers who would like us to help them make a decision. To some extent there are certain limits on the latter issue, depending on the data that are available.

Let us consider a design which a consulting statistician might suggest that possibly would assist a policymaker trying to make a decision. Consider splitting the available resources among three different kinds of research designs. In the first design, a national survey would be conducted to obtain data by personal interview, but of only moderate depth, say, a one-hour interview.
Second, consider the use of a selected set of observational studies (similar to the types of multiclinic clinical trial designs that are used in a lot of experimental situations), where you would pick selected sites in local areas of interest. You would try to do in-depth studies of a lot of variables, trying to produce for any given local area the best estimate for that area that significant expenditures would buy. You would try to identify variables which were good correlates of the variables that you were most interested in--variables that were easy to measure, or variables that you could perhaps obtain by some sort of a telephone interview survey. Third, you would follow that up with a survey that would encompass every local area, either a telephone survey or collection by any other approach that could be quick and easy. If you could spread the resources among these three things, that would possibly be better than putting all of your resources into any single one of them and accepting the limitations that would apply to any one of them, whether it be cost, ability to estimate, or feasibility. It appears to represent a statistically cost-effective way of trying to address a policymaking question. Knowing the overall budget, one can experiment with different design strategies.

* A subcommittee of the Federal Committee on Statistical Methodology has prepared a "Report on Statistics for the Allocation of Funds." This report, issued by the Office of Federal Statistical Policy and Standards (1978), looks at five specific case studies of distribution of Federal funds to local areas and then tries to generalize on the problem.

* The previous proposal for a three-activity research design is similar to some thinking that the NCHS has been doing. First, there is under consideration a telephone survey capability, using random digit dialing. This would eliminate listing and other expenses that go with selecting an area sample.
If you consider that the areas of interest are States, NCHS is thinking of a dual-frame survey using the existing HIS as the first frame. The other frame would be based on telephone random digit dialing. There is some work that has to be done to decide what the sample design of the telephone survey would have to be in each State to adequately supplement the existing interview survey. This will vary by State because the PSUs and sample sizes in HIS vary by State. For local area statistics the strategy is twofold. One, a telephone survey random digit dialing manual is being prepared that will be available to local areas. This manual should facilitate efforts of those who want to do such surveys on their own. In the field of health, there isn't any mandate for annually produced statistics for as many areas as for revenue sharing. Some areas seem to be much more advanced in their thinking than others. In addition, there is the possibility of NCHS having the capability of conducting local area surveys from Washington. Thus, if a particular area could not do its local area telephone survey, NCHS would have the capability of doing it. There are a number of problems that have to be worked out.

* What is the role of OMB in regard to our discussions in terms of the approaches, the level of interviewing that would be permitted, and so forth?

* OMB has a role whenever funding gets involved. In regard to design and issues of statistics needed, and how you're going to get them, there is some involvement with the responsibility of review and statistical clearance.

* Agencies are questioned about the extent of survey work. OMB needs to concur with the approach to obtain data by survey.

* A lot of the decisions are made by a Department clearance office and are reviewed at OMB.

* We should note another point concerning the recent work on random digit dialing.
If it turns out that random digit dialing is going to lower the costs of surveys quite a bit and if there are manuals available, will it be a problem of local area survey proliferation?

* We'll have to wait and see what the savings really are.

* It's likely to be, say, three to one.

* Are populations without telephones covered by the estimated reduction factor of three to one?

* It depends. You would want to cover nontelephone households at a lower sampling rate. Therefore, the reduction in overall costs depends on whether the rate of subsampling of nontelephone households is one in three or one in five. So it's hard to provide a unique answer. In terms of one recent experience with telephone surveys, you can probably figure on a one-third to one-half reduction, or something of that order of magnitude.

(Contributing to the general discussion during this period were: Maria Gonzalez, Gary Koch, Paul Levy, David Promisel, Monroe Sirken, Joseph Steinberg and Joseph Waksberg.)

REFERENCES

Institute of Medicine. Reliability of Hospital Discharge Abstracts. Washington, D.C.: National Academy of Sciences, February 1977a.

Institute of Medicine. Reliability of Medicare Hospital Discharge Records. Washington, D.C.: National Academy of Sciences, November 1977b.

Office of Federal Statistical Policy and Standards. Statistical Policy Working Paper 1, Report on Statistics for Allocation of Funds. Washington, D.C.: U.S. Department of Commerce, 1978.

Synthetic Estimates as an Approach to Needs Assessment: Issues and Experience

Charles G. Froland

ABSTRACT

An overview of a study which applied the synthetic estimates technique to derive rates, numbers, types and characteristics of potential clientele for substance abuse related programs in the State and counties of Oregon is presented. A brief description is given of the methods utilized to obtain estimates as well as the means for examining their validity.
Inasmuch as the objective of the study was to provide useful information to State and local program planners and administrators, the experience of utilizing the study's findings is presented. Several applications are highlighted to indicate the range of ways in which the study was utilized. The experience of applying the results in a program and policy context surfaced several issues concerning the requirements for validity and accuracy, specificity and, finally, the role of synthetic estimates in needs assessment. The experience suggests that the information derived by this technique will be most useful if integrated with a range of other types of information, both quantitative and subjective.

INTRODUCTION

The value of quantitative information about a community's substance abuse problems has been well recognized by planners, providers, and policymakers. While such information is not often available in many communities, this has generally not been for lack of interest or expertise. Barriers to obtaining estimates of a population at risk for substance abuse treatment have generally involved the prohibitive technical and resource demands associated with producing accurate and timely information about the nature and extent of substance abuse within a given community, issues of confidentiality, and fundamental disagreement regarding the definition of abuse, dependency, or addiction. As one consequence of these basic limitations, decisions about programs and policies are often made without the benefit of quantitative statements of the size or scope of a community's needs for substance abuse treatment resources. To be sure, information of a quantitative or "scientific" nature is clearly not the only input into the policymaking process if resulting plans are to be politically acceptable (Lindbloom 1973).
Values, political interests, community norms and other considerations perhaps form a set of more immediate policy determinants of which information must be seen as only one contender. On balance, the promise that information about substance abuse problems holds in this context is to provide a common and valid frame of reference for discussing values and interests. Without such information, policy is likely to be created solely in response to impressions, the status quo and/or political interest groups, making it difficult to determine whether the needs of the community are being addressed or met. In the abstract, the development of effective policies and programs must be based upon a clear understanding of: (a) the numbers of individuals potentially needing substance-related services, i.e., potential clientele, (b) their characteristics, and (c) the types of substances being used. Given this information, policymakers, planners, administrators and citizens may, for example, be guided in the allocation of resources to various types of services, the determination of the appropriateness of existing services in meeting a community's needs, and the identification of target populations needing services. However, the task of directly obtaining information about the nature and extent of substance abuse problems within a community is usually beyond the technical or financial capabilities of most local and State jurisdictions. As a result, most local planners have typically adopted a number of indirect strategies for developing needs assessment information including, for example, interviews with key community representatives, community forums with local residents, or using available data concerning rates of arrest, emergency room admissions, cirrhosis deaths, and other drug-related deaths. At best, such indirect and inexpensive approaches yield a global but useful picture of the needs in a community.
Most often, these methods are not entirely satisfactory for deriving an evenhanded estimate of the size of substance-related problems and need for services. The synthetic estimates method offers the promise of a useful alternative. To the extent that existing survey data are available which adequately reflect conditions in a planning area, reasonable estimates can be obtained of the number of problems that might be expected to occur, given the geographical and sociodemographic mix of the area. Although not without a number of technical limitations, the technique was considered to have sufficient merit that it was applied as one part of a study of substance abuse needs in the State of Oregon. The study was conducted by a research arm of the Oregon Mental Health Division in 1976. The results of the study are reported elsewhere (Froland 1976). What is presented here is an overview of the approach taken in deriving synthetic estimates for the counties and State of Oregon as well as findings related to the appropriateness of the resulting information. Additionally, some discussion is given to the uses made of the information by program planners at the State and local levels. Finally, a number of issues regarding the utility and potential applicability of synthetic estimates for needs assessment can be shared, based on the experience of Oregon.

STUDY OBJECTIVES

The study was conducted to serve a number of audiences. The primary objective was to derive information that could satisfactorily address questions at both local and State levels of planning concerning the accessibility, appropriateness and adequacy of service efforts. The State Legislature wished to know whether too much or too little money was being spent on substance abuse treatment.
State and regional planning specialists responsible for approving county plans and allocating legislative appropriations wanted to know which counties had the greatest need as well as the nature of local substance-related problems. County administrators and program directors not only were concerned with whether they were getting their fair share but also whether they were serving clients who were in some manner representative of the nature of their community's needs. Given this range of questions, three types of information were considered necessary. Estimates of the numbers of potential clientele for each county would address legislative and local concerns as to the adequacy of resources allocated to counties. Descriptions of the varying degrees of different classes of substance abuse within each county's population would permit State and local planners to assess the appropriateness of different mixes of service modalities in dealing with the communities' problems. A third type of information, estimates of the sociodemographic characteristics of the population of potential clientele, would allow local programs to assess the representativeness of the problems of clients actually served. Since the synthetic estimation technique could be based on a body of existing survey data that could provide rates of different classes of substance abuse specific to different demographic subgroups within a population sample, the method was capable of yielding these three types of information.

APPROACH

For use of the technique in Oregon, a search was undertaken to determine the best source of survey information. Several broad questions were considered in choosing among alternatives: (1) What population base is the survey information representing? (2) What are the technical attributes of the survey? (3) What kind of information does the survey provide?
Several sources were consulted which consistently indicated that the survey of greatest utility would be one conducted for the National Institute on Drug Abuse by the Social Research Group at George Washington University (1975). First, the survey's population sample focus was felt to be appropriate to the general population of Oregon. Second, the survey was technically acceptable. It was timely, having been conducted in 1974 and published in 1975. (The study which developed and used the synthetic estimates was conducted in 1976.) The survey protocol was administered by trained interviewers except for a self-report section. A reliability and validity study was conducted which demonstrated acceptable results. The sample size was sufficient to provide acceptable error rates in the survey information. Finally, the survey questionnaire items were appropriate for Oregon's purposes in that they covered a wide range of different types of substances; they were behaviorally focused and included a sufficient breadth of items to estimate current potentially abusive patterns of use. Beyond this, the information was tabulated for specific categories of several demographic and residence characteristics of the sample. Thus, the survey addressed all of our questions of acceptability to a satisfactory degree.

CASENESS

While the survey was concerned with identifying use patterns for many licit and illicit substances, it was not expressly concerned with identifying abusive patterns or individuals with a potential need for service by reason of their use of a substance. In general, the survey was simply concerned with providing information about varying degrees of frequency, duration and amount of use for a wide range of substances, some of which are illicit and/or potentially harmful. No information was provided as to the extent to which such use patterns implicated reduced physical, interpersonal or social functioning.
Thus, the first task was to identify that combination of frequency, duration and amount of drug use which could be used to approximate "caseness," i.e., an individual with a potential need for substance-related services. To some extent, the study relied upon common operational definitions used in sociobehavioral research (Elinson and Nurco 1975). Beyond this, a common decision rule was to define the use patterns of those individuals with a potential need for services on the basis of the most extreme patterns of use in terms of high frequency and amount with indications of problematic duration. Additional considerations involved adjusting for use of other types of drugs, i.e., polydrug abuse. Table 1 shows the resulting definitions.

COMPUTATIONAL OVERVIEW

Having identified and adjusted survey rates to reflect expected levels of potential clientele specific to various demographic and residence characteristics, the next step involved applying these expectations to the population base of the 36 counties of Oregon. Essentially, the approach taken could be characterized as actuarial. In general, this involved weighting the rates provided by the survey according to the respective characteristics of each of the counties. Several general steps were followed: (1) First, adjustments were made on the basis of urban/rural mix for each county. (2) Next, area-adjusted rates for each category of a demographic characteristic were weighted by corresponding census distributions of such characteristics. Four characteristics were employed: age, sex, race, and education. (3) Finally, area- and demographic-adjusted rates were weighted and summed to obtain an overall rate for a given substance class. To find numbers of potential clientele, rates were simply multiplied by the updated population count for each county.
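The actuarial weighting in these steps can be sketched in a few lines of code. This is a hypothetical illustration, not the Oregon study's actual program: the rates, shares, and county population are invented, and only a single demographic characteristic is shown where the study weighted across age, sex, race, and education.

```python
def synthetic_rate(survey_rates, urban_share, demo_shares):
    """Area-adjust survey rates by the county's urban/rural mix, then
    weight by the county's demographic distribution (one characteristic
    shown here for brevity)."""
    rate = 0.0
    for category, share in demo_shares.items():
        # Step 1: blend urban and rural survey rates for this category.
        area_rate = (urban_share * survey_rates[("urban", category)]
                     + (1 - urban_share) * survey_rates[("rural", category)])
        # Step 2: weight by the county's census share for the category.
        rate += share * area_rate
    return rate

# Hypothetical survey rates of potential clientele by area type and age group.
rates = {("urban", "18-25"): 0.12, ("rural", "18-25"): 0.08,
         ("urban", "26+"): 0.05, ("rural", "26+"): 0.03}

county_rate = synthetic_rate(rates, urban_share=0.6,
                             demo_shares={"18-25": 0.2, "26+": 0.8})
# Step 3: multiply the overall rate by the county population count.
clientele = round(county_rate * 50_000)   # hypothetical county population
```

In the full procedure this weighting would be repeated for each substance class and each of the 36 counties.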
Thus, the results yielded the three types of information desired: number, types of problems, and demographic distribution associated with potential clientele.

TABLE 1
Definition of Use Patterns

Alcohol: Drank an average of nine or more drinks each time they drank in past month,* and/or drank every day in past month.
Opiates: Used three or more times in past month and/or used in past month and will use again.
Depressants: Used five or more times in past month and/or used in past month and will use again.
Stimulants: Used five or more times in past month and/or used in past month and will use again.
Other: Used cocaine, inhalants and/or LSD nine or more days in past month** and/or used in past month and will use again.

*Youth sample used the following: Drank five or more times in past month an average of five or more drinks each time.
**Youth sample used the following: Five or more times in past month.

RESULTS

The resulting rates and numbers of users with a potential need for substance-related services for the State of Oregon are shown in Table 2 for alcohol, opiates (heroin, illegal methadone and other opiates), depressants (barbiturates and tranquilizers), stimulants (amphetamine and nonamphetamine stimulants), and other drugs (psychedelics, cocaine and inhalants). The rates for each substance may be interpreted as indicating populations whose use patterns leading to a potential need stem primarily from the specific substance. For example, the figure of 16.7 persons per 1,000 for other drug use refers to those persons who use principally either cocaine, psychedelics or inhalants in a manner which is indicative of a potential need for drug-related services.
TABLE 2
Statewide Rate and Number of Potential Clientele

Substance      Rate as a Proportion   Number (1975
               of Population          Population)
Total          .0870                  199,320
Alcohol        .0565                  129,510
Drugs          .0305                   69,810
  Opiates      .0020                    4,590
  Depressants  .0027                    6,240
  Stimulants   .0090                   20,710
  Other drugs  .0167                   38,270

The total for Drugs, which excludes alcohol users, refers to the total of opiate, depressant, stimulant and other drug users whose use of one of these drugs is indicative of a potential need for substance-related services. The overall total reflects the addition of all individual substance classes.

APPROPRIATENESS OF ESTIMATES

In extracting rates from the national survey and applying them to Oregon's population, the assumption was made that the survey's information was applicable, i.e., rates for Oregon would not differ markedly from other States in the Western region of the United States. Such an assumption was obviously open to question. Beyond this, the inability to precisely define a rate of substance abuse from the survey that would refer only to substance abusers who were clearly in need of treatment easily creates suspicion of the estimates that had been developed. These potential objections and others demanded that some means be developed to determine the extent to which the estimates approximated actual conditions. Since the prime reason for estimating numbers of potential clientele by the synthetic estimates technique was the absence of such information, an indirect method for examining the appropriateness of the estimates had to be explored. The approach relied upon the substance abuse problem indicators shown in Table 3. Composite indicators were developed within each class of substance-related problems and correlated with the corresponding synthetic estimate for that class.
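This validation amounts to computing a product-moment correlation between the 36 county estimates and the corresponding composite indicator for each substance class. A minimal sketch, with invented county values standing in for the actual Oregon data:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, used here to compare
    county-level synthetic estimates against a composite problem index."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

# Hypothetical data for a handful of counties: synthetic estimates of
# alcohol clientele vs. a composite indicator (e.g., an average of
# standardized ER admissions, arrests, and cirrhosis deaths).
estimates = [1290, 450, 2210, 800, 1500]
composite = [0.8, -0.5, 1.6, -0.2, 0.9]
r = pearson_r(estimates, composite)
```

With all 36 counties, r² then gives the share of variance in the problem index tracked by the estimates, as reported in Table 4.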
The results, shown in Table 4, demonstrate that with the exception of depressant-related problems, the synthetic estimates are significantly correlated with the distribution of the level of problems observed in the 36 counties of Oregon. One aspect of the appropriateness of the estimates could not be addressed definitively. While the estimates of potential clientele appeared to be distributed across counties in a manner that was reasonably close to the distribution of the level of substance-related problems, the magnitude of the estimates could not be readily substantiated. With regard to alcohol-related problems, the estimate of approximately 130,000 alcohol abusers compared favorably with that derived by the Jellinek method (102,500) that appeared in the Oregon State Alcohol Plan for 1976-1977.

TABLE 3
Substance Abuse Indicators

Substance Class: Indicators

Alcohol: alcohol-related emergency room admissions;1 percent of traffic accidents with blood alcohol involved;3 alcohol sales;5 State hospital admissions diagnosed alcoholic;4 cirrhosis deaths.4

Opiates: opiate-related emergency room admissions;1 opiate-related arrests (State and Federal);2 opiate-related deaths;4 serum hepatitis.4

Depressants: depressant-related emergency room admissions;1 depressant-related deaths.4

Stimulants: stimulant-related emergency room admissions;1 stimulant-related deaths.4

Other Drugs: other drug-related emergency room admissions;1 other dangerous drug arrests (State and Federal);2 other drug-related deaths.4
1 Drug and Alcohol Program Office, Mental Health Division, Salem, Oregon
2 Oregon State Criminal Justice Information System, Salem, Oregon; includes State and Federal agency cases
3 Oregon Department of Motor Vehicles
4 Oregon State Department of Health, Portland, Oregon
5 Oregon State Liquor Control Commission

TABLE 4
Correlation Results for Potential Clientele Estimates and Substance Abuse Problem Indexes (N=36)

Substance Class   r     r²
Alcohol           .41   .17    p < .05
Opiates           .64   .41    p < .05
Depressants       .03   .001   NS
Stimulants        .37   .14    p < .05
Other drugs       .61   .37    p < .05

An additional source of data provided estimates based on a different national survey of alcohol use patterns (Marden ND). While the demographic categories, as well as the problem definitions, were different from those employed in this study, the resulting estimate of total potential clientele for alcohol services based on the different survey was 125,000. Thus, the estimates of alcohol problems seemed to be in agreement with a number of sources. However, with regard to various categories of drug abuse, no similar figures were available for easy comparison. The Oregon State Drug Plan for 1976-1977 estimated that, overall, close to 30,000 persons had a conspicuous involvement with drugs, i.e., "it had resulted in arrest, incarceration, admission to a hospital or treatment program, or death." This estimate was not directly comparable to the estimates provided by this study, since different use patterns and potential problems were involved. The addition of the categories of opiates, depressants, stimulants, and other drugs yields a total roughly twice that of the estimates for conspicuous drug users (69,810 versus 30,000). However, the study was also interested in "inconspicuous," as well as "conspicuous" or known substance abuse, so that it should not be surprising for the numbers of potential clientele to be higher than the number of actual substance abuse-related clientele.
APPLICATION

In part because of its intuitive appeal, and in part because of a well-designed dissemination strategy, the information was accepted and utilized by a wide variety of audiences. Staff of the State Legislature used the estimates to prepare testimony in appropriations hearings for alcohol and drug programs funded by the State. A number of local county programs utilized the information to target needs within their counties as well as to compare the characteristics of those they were serving with the demographic distribution of the estimates of potential clientele. By identifying specific population groups that were underrepresented in their service strategies, these programs were able to make decisions about new programs that were needed as well as the appropriateness of existing eligibility and admission criteria. The information was employed in both the drug abuse and alcoholism statewide plans for fiscal year 1976-1977.

Perhaps the most concerted and systematic use made of the information was by a pilot project undertaken across the State involving the development of county alcoholism plans. The project may serve here as a case study of what is possible in actually utilizing the information in planning services (see Hardison 1977, for more detail). The project was carried out by the Office of Programs for Alcohol and Drug Problems and involved using the estimates to establish a uniform process for defining service needs across all Alcohol Subcontract Service Providers funded through the Oregon Mental Health Division. The planning process was implemented across all counties in the course of their plan development by means of a series of steps. First, ranges of the expected number of admissions to a program for a particular demographic category were computed.
For this purpose, a procedure was used that computed a 90% tolerance interval about the numbers of potential clientele for a given demographic category, adjusted by a utilization ratio formed by dividing total actual admissions by total potential clientele. The resulting computations are shown for an illustrative county in Table 5. Next, the distribution of expected admissions was compared with the distribution of actual admissions to determine whether a particular group was over- or underrepresented in their utilization of services. High Priority Groups were identified by rank ordering underrepresented groups on the basis of percent of need met, i.e., actual admissions divided by potential admissions, as illustrated in Table 6.

A number of further steps were carried out to complete the planning process. These may be highlighted. Having identified the high-priority population groups that existing services were considered to address inadequately, discussion groups composed of program representatives were held to identify those reasons that might be at the base of the problem. Issues of geographic, cultural, and psychological accessibility were generally surfaced. Further steps involved identifying what modifications in existing service procedures or capacity might be implemented to deal with such problems. Finally, local planning groups were set to the task of formulating measurable objectives whereby needed changes in services would be carried out.

ISSUES

All in all, the experience at the local and State levels demonstrated that the synthetic estimates could be of practical utility in structuring decision-making in policy and program development. During the course of working with planners and providers in Oregon, several key issues emerged which may be generalized as being of common concern in choosing the synthetic estimates technique as a needs assessment tool.
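The expected-admissions computation behind Table 5 can be sketched as follows. The interval here is a simple binomial approximation assumed for illustration; the exact 90% tolerance-interval formula used by the Oregon project is not specified in the text. The inputs are taken from Table 5's age 12-17 row and county totals.

```python
import math

def expected_admissions(n_at_risk, total_admissions, total_at_risk, z=1.645):
    """Expected admissions for one demographic group, scaled by the
    overall utilization ratio (total actual admissions / total potential
    clientele), with an approximate 90% interval.  The binomial standard
    error used below is an assumption, not the project's exact procedure."""
    u = total_admissions / total_at_risk           # utilization ratio
    expected = n_at_risk * u
    half = z * math.sqrt(n_at_risk * u * (1 - u))  # binomial std. error
    return expected - half, expected + half

# Age 12-17 group: 984 at risk; county totals 42,155 at risk, 4,813 admissions.
lo, hi = expected_admissions(984, 4813, 42155)
# The group's actual admissions (9) fall below the interval, so the group
# would be flagged as underrepresented ("Under").
```

Groups whose actual admissions fall below their expected range are then ranked by percent of need met to form the high-priority list shown in Table 6.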
TABLE 5
Identifying Underserved Client Groups

Group           Estimated       MHIS*        Percent of   Range of Expected   Representation
                Number at Risk  Admissions   Need Met     Admissions

Total           42,155          4,813

Race/Ethnicity
(illegible)      1,278            601        47.03         124-168            Over
Black            2,203            218         9.90         222-282            Under
(illegible)      1,452            115         7.92         142-190            Under
White           37,222          3,669         9.86        4068-4435           Under

Age
12-17              984              9          .91          93-132            Under
18-21            1,590            112         7.04         157-207            Under
22-25            2,747            229         8.34         280-348            Under
26-34            5,822            770        13.23         611-719            Over
35-49           11,265          1,761        15.63        1205-1369           Over
50-64           17,863          1,425         7.98        1930-2151           Under
65+              1,886            102         5.41         188-243            Under

Sex
Female          21,355            363         1.70        2315-2564           Under
Male            20,800          4,054        19.49        2254-2498           Over

* Mental Health Information System

TABLE 6
High Priority Populations

Rank Order   Population Descriptor   Percent of Need Met
1st          Age: 12-17               .91
2nd          Female                  1.70
3rd          Age: 65+                5.41
4th          Age: 18-21              7.04

Perhaps the major obstacle confronted in attempting to utilize the synthetic estimates technique in policy and programs concerned establishing some degree of understanding of what the estimates really meant. On the one hand, the information was in a form that was intuitively appealing, and those who most often had little or no data with which to compare were sorely tempted to use the estimates. On the other hand, the computational procedures used in the technique were somewhat obscure to the range of audiences to which the information was directed. Those responsible for policymaking were rightfully suspicious of information developed in a manner they did not understand. Concerns over the timeliness of the information compounded these issues. This may be endemic, to the extent the technique relies on secondary data which, allowing for dissemination lag, may be several years old.
In situations where the estimates are attempting to describe a condition which is unstable or in flux, the resulting information may be rejected out of hand by those working in the field. A number of issues hindered the precision of the technique. At some level of demographic or geographical detail, the size of the survey sample limits the ability of the survey to maintain representativeness and accuracy in the information disaggregated to a local area. Additionally, the application of the survey rates to the population base of a community is limited by the size of the area. Issues concerning the nature of the survey used (e.g., sample characteristics, representativeness, sampling error), the sampling error of census information, and the geographic uniqueness of the area all serve to limit the degree to which estimates for a small area may be accurate. These issues become more telling for areas which have unique characteristics or conditions. Such factors as special population pockets, geographical diversity and unique cultural features all serve to reduce the relevance of estimates based on more general expectations. For example, several counties in Oregon had Indian reservations, while others were influenced by transient farm labor. The estimates derived for these counties certainly could not adequately reflect the circumstances confronted by local programs serving such areas.

CONCLUSION

To what extent can synthetic estimates be relied upon in policy decisions, particularly if such decisions are to materially affect funding allocations, program operation, and the utilization of services? From one perspective, this question can be examined by assessing the technical validity or reliability of the estimates themselves.
Here, limitations in the body of information used for prediction or estimation, errors in computational procedures, as well as the nature of the area to which the technique is applied all serve to define reasonable boundaries for the use of the estimates in decisionmaking. However, one has to distinguish between what is statistically acceptable and what is useful in practice. While achieving technical standards of validity usually heightens practical utility, information can be of practical use that may not meet standards of statistical rigor. In part, it is a matter of degree; more often it is a question of what alternative sources of information are available and whether they are more or less technically acceptable. While the synthetic estimates technique has much to recommend it as part of the methodological armamentarium of quantitative needs assessment, a broader view recognizes that the utility of synthetic estimates rests on their ability to inject an element of objectivity into policy decisions. The experience of Oregon suggests that when the information was used as a basis for discussion or combined with other information or perspectives, planning decisions were relatively more systematic and comprehensive. Thus, when the estimates were not taken as being exact and precise statements of community need but rather used to structure a closer examination which included the qualitative and subjective viewpoints of those working in the field, their role was more useful in motivating program changes and improvement.

REFERENCES

Abelson, H., and Atkinson, R. Public Experience with Psychoactive Substances (National Survey--NIDA; Main Findings 1974). Princeton: Response Analysis, 1975.

Elinson, J., and Nurco, D., eds. Operational Definitions in Socio-Behavioral Drug Use Research. NIDA Research Monograph 2. DHEW Pub. No. (ADM) 76-292. Washington: Superintendent of Documents, U.S. Government Printing Office, 1975.

Froland, C. Substance Abuse in Oregon.
Salem, Oregon: Oregon Mental Health Division, Department of Human Resources, 1976.

Hardison, J. Criteria for a Minimum Definition of Need. Salem, Oregon: Management Support Services, Oregon Mental Health Division, 1977.

Lindbloom, C. The Policymaking Process. New York: John Wiley, 1973.

Marden, P. A procedure for estimating the potential number of alcoholism service program clientele. Washington, D.C.: NIAAA, unpublished, no date.

Oregon State Alcohol Plan for 1976-1977. Salem, Oregon: Mental Health Program Office, Oregon Mental Health Division, 1977.

Discussion

Reuben Cohen

Most of us present at this workshop deal with large data bases and design research or provide data at the national level. Charles Froland's paper has added a significant dimension to our discussion. He has told us how real-life decisions are made at the county and community level. For any given jurisdiction in which program funds are allocated, the number of dollars involved may be relatively small. But those local allocations aggregate to millions of dollars and affect large numbers of human lives. A recurring question has been posed at this workshop: What are the alternatives available to us? One message in Froland's paper is that there are few, if any, alternatives to making appropriate use of national survey data for needs assessment at the community level. Surveys adequate to the task of providing direct estimates of needs at the community level might cost as much as or more than the amount available for program use. Poorly conceived or loosely executed data collection procedures might be worse than none at all. I am reminded that I was involved in planning and interpreting results of a national survey preceding the Presidential election of 1968. Since Joe Waksberg told his election story yesterday, I will tell mine today.
Pre-election surveys may be unique in that, in addition to national samples, there are generally more State surveys than there are States, and the actual election results are available almost immediately to help evaluate the results of State as well as national polls. Many of you will recall that the pre-election poll results (and indeed the election itself) were very close to 50-50 between Nixon and Humphrey in 1968. Estimates for specific States became very important because electoral votes would actually elect the new President. Just prior to the election, one of my tasks was to estimate the electoral vote distribution based on survey results and any other information available to me. I made very little use of the State survey results and would have done better had I not used them at all. I discarded all but a few of the State surveys because (1) I was concerned about bias of the auspices (some of the surveys were done on behalf of the political candidates); (2) I was doubtful about the methodology (either the sampling or the interviewing was suspect); or (3) the sample size was too small to be useful. The alternative I had was to use regional data from the national survey and make State estimates based on relationships among the States within a region observed in earlier elections. Except in the South, those relationships had been reasonably consistent through the Presidential elections of the 1950's and 1960's. Some States consistently voted more Democratic than the region as a whole; others were more Republican. The point of these remarks is that a rough-and-ready "synthetic" procedure provided better estimates than State surveys of questionable quality. A significant point in Froland's paper is the importance of the strategy used to disseminate statistical results and the need to distinguish between what is statistically acceptable and what is useful in practice.
As he points out, estimates do not have to meet rigorous statistical standards in order to be useful. There is an urgent need to continue to suggest ways in which national survey data can be useful to community program administrators.

General Discussion

* Things have been put on a different level from what was talked about this morning. One point suggested by Charles Froland's paper is that the real issue is: How well did the individual characteristics in Reuben Cohen's survey correlate with the alternative data that he had available? Has someone ever done these kinds of correlations for local areas for which survey data were actually collected? That would really have been useful information for the process. A general point is that if there is an assessment of error, then the data are a lot more useful than if there is no assessment of error.

There is another point being made: The context in Froland's paper is different from the context of Cohen's paper. Froland did not have a policy situation with a great deal of money; this is different from the context where millions or billions of dollars are involved. When that is true, then a much higher standard of accuracy should be called for. I think it's amazing that the demand for accuracy is probably ten times the demand that we're talking about. It's remarkable to see the statisticians' desire to get error down below levels that don't really make any difference for the purpose.

* There is no indicator of demand here--there is an indicator of a crude level of use. There is an indicator of the proportion that met some arbitrary criterion which cannot clearly be defined as need or as demand in any sense. If you want a better match between the estimated number at risk and the MHIS admissions, I think this is one of the messages: All you have to do is pick a different arbitrary level to define risk and you get a better correspondence.
But these will not necessarily be the same people, which is another factor to consider.

* Look at the changes in levels of activity. If you compare it with Table 6, where you're really coming down to a few percentage points difference in what might be regarded as a policymaker's potential clientele, I'm a little overwhelmed by the mixture of levels of accuracy.

* One of the things that was considered was the issue of error. We knew it was there; how great it was was something that we didn't know. That was one of the primary reasons for structuring a process.

* It's appropriate for a lot of situations. People who from day to day have to deal with the problems can look at these as results of one method. There could be discussion of: What do you think about it; does that help you? It's a step up from what normally goes on in planning discussions.

* Sample size has not been mentioned, but there certainly are some startling findings. The fact that the female prevalence rate is higher than the male prevalence rate runs against a lot of experience, and it's hard for me to believe, because quite a different dependent variable was used. It certainly seems that the consumers of these things are probably more than happy to see a high prevalence rate; but what about the relative distribution problems?

* What the data represent is a combination of a proportion within the area times the rate. The rate was a sex-specific rate.

* In the alcohol field the issue of definition of what you want to measure has a degree of arbitrariness associated with it. There are material differences you can get simply by setting a standard of a few drinks more or less as to what is a drinking problem. What you want to measure is a much more difficult issue than the question of how to count it once you have defined it. The statistical aspects of this workshop have been very interesting.
But there are real problems of definition that are bigger than the problems of statistical differences and errors. Beyond that, it is necessary to be aware that there is a ten to one ratio between prevalence and utilization, and some utilization data aren't very accurate and deal with counts of admissions rather than individuals. You really have to wonder about this juxtaposition of differing levels of accuracy and interest.

* The ten to one ratio is actually a small one. A lot of literature found it much higher, depending upon the kind of problem, or the area, or the availability of services. So we weren't at all surprised to find that kind of difference.

* Are you implying that there are ten times as many people that need treatment as are getting treatment? Or are we talking here of prevention modality? It seems to me that we've discussed primarily secondary and tertiary treatment modality. I see most of these data as indicating some population at risk. Concerning the alcohol field--there are a lot of gradations of use which don't suggest that someone has a full-blown alcohol problem. But if they keep it up, most medical evidence would indicate that in a period of time they might have the problem.

* The data as I see them would be more useful really for people designing prevention programs than treatment programs.

* It used to be that if there was some physiological damage then you could be sure that you had an alcoholic on your hands. More and more the judgment in treatment services tends to take a much broader view of who is at risk. A second point about the difference in magnitude: From a practical standpoint, it really doesn't make any difference to the people in the field whether it is ten times or twenty times.

* Did you discover any groups that weren't served at all by any programs? The thing that is apparent is the volume of users that live in the suburbs: mostly white women, middle class, and there are very few programs for people like that.
I think one of the kinds of things that a needs analysis should do is to make a population estimate of groups that no program exists for. Did that occur?

* There has been some mention of data problems. Have some of the indicators used for testing the quality of your process been tried in regression estimates as a composite estimator with the synthetic estimates? I wonder what advice we would give you if you were asking--given the fact that you used something to test the method, but the results are also available for inclusion in the estimate.

* I want to clear up one point. I wasn't being critical of mixtures of levels of accuracy. I was thinking of the different levels of accuracy that were being discussed in different papers. I think this kind of work suggests the value of the observation that we sure know a lot more just by knowing admissions. We are a lot closer to some sort of reality for practical purposes in terms of predicting anticipated admissions for next year by looking at last year or this year. There seems to be a circle that's been traveled. To start, apparently planners and program people don't like their own current statistics and are looking for something that tells them more about prevalence and need. This is a logical step that is mediated through their presumed knowledge of treatment of needs. Then it circles all the way around and comes back to how many people they are seeing in the program anyway. It seems somewhat of a circular process. The only people, in a way, who are benefiting from it are the estimators and the people who get some benefit from having a prevalence that way outstrips their ability to serve that prevalence. It does seem that past admissions are a much more trustworthy figure.

* As has been noted, there are some data that had been used for synthetic estimates and there are some data on indicators. What advice would you give?
* As a general rule, any time you have two estimators for a group of small areas that you think are equally good (both are reasonably good, or both poor), you should consider combining them somehow; and you're not going to do any worse than the poorer of the two.

* Another theme that has been raised is the disparity, or seeming disparity, between what is acceptable when millions or billions of dollars are at stake versus what the person has to do when there is a small staff and little time. Is it really that different a problem?

* I want to comment on the point you made, which I think is a very fundamental one, not in terms of the question as posed but rather on: What are the requirements for precision when you are dealing with a billion-dollar program as compared with a program of tens of thousands of dollars? The real question is, when you're dealing with these billion-dollar programs, whether synthetic estimates are the right thing to do or whether you should be pressing for the kind of money that would give you better kinds of estimates. When you are dealing with small programs, such as Froland's, just from the point of view of any kind of cost-benefit ratio, it doesn't make sense to put more money into getting the statistics than you put into the program. If the estimates are synthetic and are crude, they would still be the best kind of allocation data for the purpose and the nominal cost involved. Several things suggest themselves as a result of the sessions to this point. They relate to the basic question asked: Under what conditions should we use synthetic estimates? Part of the answer is, use synthetic estimates when it doesn't pay to put a lot of money into trying to get individual survey data for individual places. Thus, there are times when previous studies may indicate that getting data from a survey would result in quality so poor that almost certainly synthetic estimates would give you better information for local areas than a survey or a census would.
The fact is that under some conditions, because of measurement error, we are likely to do better with synthetic estimates than with directly collected data. Unfortunately there isn't any nice set of rules that can be put down that would identify the specific circumstances. You have to think about the problem, and if it is likely there would be substantial measurement error, at least in some cases synthetic estimates would be a useful solution. Under some circumstances this would hold true even for larger places, up to and including the United States as a whole. If the figure on dilapidated units in the 1960 census is compared to what was obtained in a housing survey done simultaneously with the census, but by better trained and better quality interviewers, one figure is found to be fifty percent higher than the other. If an overall result for the United States is subject to problems of quality of this magnitude, imagine what it must be at the local area level, where a small number of interviewers are involved.

* There is a question of use which needs to be examined. Are data needed on level for a large area, or are data needed on the relative order of differences among small areas such as counties or tracts, as illustrated in Froland's paper? Attention needs to be paid to who the users are and what their data needs are, both on geographic level and quality. For if one ignores the users and there is distrust after data have been published for local areas, then users will ignore the data that have been compiled and use either synthetic estimators or direct collection of data that they believe are relevant and have the needed accuracy.

* For some purposes large relative errors for areas with small populations don't make very much difference, whereas small relative errors for large population areas make a lot of difference.
There hasn't been very much discussion about how synthetic estimates and direct estimates can be used jointly: where the direct estimates are used only for the very large States or for very large local units and synthetic estimates for the others. It is not only the size of an area in population that should be considered but also the importance of an area for analysis. For example, if I wanted to construct a conservation target for home heating fuel, it is going to make a big difference whether I'm talking about Minnesota or about a State with the same population in the deep South. I want more accurate data--not in terms of relative error, but in terms of absolute error--for the Northern State than for the Southern State in this instance.

* In a sense the thrust of the last comment needs to be kept in mind. That is, there are occasions when there is knowledge of an atypical situation, and the method of synthetic estimates does not (as it stands right now) take it into account. In fact, if we were designing a survey we would take it into account by treating it as a separate self-representing area, or we would do something special in estimation. It appears right now that for synthetic estimates we use two sets of data and a single algorithm and get the result. We ought to try to keep such possibilities of atypical areas in mind and suggest to the producers and users of the synthetic estimation approach how to deal with these kinds of identifiable problems. We've heard one method that has been proposed: the use of symptomatic indicators. But how do we provide that, in certain circumstances, the symptomatic indicators be used for problem areas where the synthetic estimator should not be used, while for the other areas the synthetic estimator is used? Perhaps we have to get away from what might be called a push-button approach and create a joint composite estimator approach that comes close to the complex kinds of sample designs we construct.
* A bit less general way of doing what you have just described was in Wes Schaible's paper.

* I think, in one place in Bob Fay's paper, there is a distinction between two populations: those above and those below the median. This, it seems to me, is the kernel of an idea with respect to use of measures of position for a symptomatic indicator for defining subsectors of the population. For one subsector there would be proper use of a single kind of estimator, say, a synthetic estimator; for other subsectors one would use a more complex system (including a composite estimator with varying weights for each of the specific subsectors).

* I'm not sure problems are quite so complicated. Of course, there are likely to be exceptions. The example of a conservation target for home heating fuel may be amenable to a less complex approach. It seems to me that if I were looking at heating fuel I would use weather information for classifying States into tiers. If it turns out that Minnesota is unusual among its tier of Northern States, then, of course, there would be trouble with the Minnesota estimate.

* Another point which is related is that you need to consider the properties of the variable that you are estimating. There has been reference to the fact that synthetic estimates for diseases seem to be relatively accurate compared with estimates for other variables. That seemed to make sense. But there are other variables where you would expect the synthetic estimates to be bad. I think unemployment is a very good example of that because of the nature of the economy. If you have a synthetic estimate that is based on industry--let's say the steel industry, for example--that doesn't mean that every steel plant lays off ten percent of its workers. There is likely to be variation. What actually happens is that Youngstown Steel closes down in Youngstown, so you have a place with a thirty percent unemployment rate.
They just happened to close down before Inland Steel did in Gary, so the unemployment rate in Gary would be lower. So there is reason to think that the synthetic estimates for local area unemployment would be bad, because this is how unemployment arises.

* Is the suggestion perhaps that it would be useful to build in a current indicator of local area variation if such data exist? Are we coming around full circle to the question: What are the resources, what do we know about the between and within variances, and how current are the indicator data?

* Perhaps it is a bit different. In the absence of other data there are substantial reasons why you would not use synthetic estimates for unemployment, but there are substantial reasons why you would use synthetic estimates of death rates due to certain diseases. All you have to do is look at unemployment rates as far back as they go. Where there was high unemployment, it was uneven, and that lends credence to the point. So, it is more an argument of when you don't use synthetic estimates in the absence of other information.

* Another variable in this situation is the group that is producing the estimates. Should the Census Bureau be producing synthetic estimates? It is a different thing than if the local area produces the local estimates. You expect the Census Bureau to do a thorough analysis of the methods and to try to understand the errors and develop a model that you feel reasonably sure fits the situation. It is no different from the conditions under which you do a survey, or under which you produce an estimate from the survey, or under which you will not. If a national statistical agency is putting out the data, it has a different connotation than if a local area is producing that local area estimate for its own use. That is part of the problem that we have here.
We may lose sight of an important part of the problem, and it may be that the national statistical agencies will simply refrain from producing certain local area statistics because they feel that the errors are too large or that the errors are not measurable. It isn't the size of the error that bothers you so much. It is whether or not you have an idea of how large that error is. If you feel that you don't have a reasonable fix on the size of the error, you may decide that, as a Federal agency, you will not produce the data. That does not prevent the local area from going ahead and using the data if it wants to, for its own purposes. There is an official character to the data that is produced by a Federal agency, and there is an expectation of accuracy, deliberateness, and thoughtfulness. I don't see why that aspect should be any different as it applies to synthetic or any other kinds of estimates than as it applies to the statistics produced directly from surveys.

* Federal agencies have a responsibility, when they work with local people who want to produce an estimate for some particular characteristic, to determine whether a situation exists where there are anomalies for small areas. It is this sort of thing that synthetic estimation has trouble handling. And it is up to the Federal agency at that point, whether or not estimates of the error involved in the particular statistics can be determined, to advise the local people that, based on previous experience, synthetic estimates won't work.

* It appears from the work that Bob Fay and Gene Ericksen have told us about that it really takes a good deal of work to understand what is going on with synthetic and regression estimators. It really requires that we dig into it quite hard to know what is going on.
It may be that if people ask (and if in fact that is what it takes and there don't appear to be any shortcuts that anyone has thought of to setting up criteria), you may have to say sometimes: "I really can't tell you. I don't have the experience or the knowledge, and I can't advise you to do this."

* I would like to be a devil's advocate for a minute. It seems that what we are trying to do is to provide only Grade A statistics, and if it is not Grade A, then data are not to be provided. Perhaps groups that like to have a Grade A symbol attached to their work need to examine whether some lower grade should be made available with an indication of the level of quality which is associated with the data. If it can't be done in a quantitative sense, it could be attempted in a qualitative sense. In Britain, for example, in certain programs, they do use this system of Grade A, Grade B, and Grade C as a way of distinguishing, in a qualitative sense, among a variety of statistical outputs. It's handled in a way so that users are put on notice that there are problems in the lower grade categories that demand attention.

* I don't think that was what I was saying. What I was saying was that you don't put out everything just because there might be a need for it. In survey work there is a screening. You consider what you can do and what you can't do. I don't see why the same kind of consciousness of what is best shouldn't apply to the statistics that don't come out of surveys as applies in statistics that do come out of surveys. It may be that there will be political reasons why you have to put out some poorer statistics anyway. But we should make the distinction between the political reason for doing it and us as statisticians proposing to do it.

* I think that we all agree that putting out good data is a good idea. The next question is: Is putting out bad data worse than no data?
I think Froland said at one point--this is better than giving the money to people who cry the loudest. I'd like to put in a good word for people who cry the loudest. Crying the loudest is often a very helpful thing for the system. It teaches people about argument; you've got to get in there and say what's going on at the level of providing service. This notion of providing a more rationalized system for the distribution of resources is not necessarily as desirable in all respects. I suppose one could go on and speak a whole essay about that. But, just one word for the people who cry loud.

* Have we found it or have we lost it? We started our discussion about synthetic estimates on the note that some would be enthusiastic and some skeptical. In the course of the presentations and the discussion a number of different methods have been discussed. Some have been carefully documented and have led to a feeling that synthetic estimates do provide useful means for creating estimates. On the other hand, there has been discussion leading to the feeling that perhaps we are not yet to the point where these methods can be and should be used in every instance.

* I just want to say regarding that: I think that people feared we might have a how-to-do-it manual coming out, and instead I think we're going to have a very fine consumer's report on synthetic estimates, which will serve the field very well.

(Contributing to the general discussion during this period were: Ira Cisin, Reuben Cohen, Eugene Ericksen, Dwight French, Charles Froland, Maria Gonzalez, Louise Richards, Ron Roizen, Wes Schaible, Walt Simmons, Monroe Sirken, Joseph Steinberg, Joseph Waksberg, and Robert Wilson.)

Expansion of Remarks

Walt R. Simmons

Recapping statements of my own and of others, we should follow these guidelines in order to increase the utility of a synthetic estimate. Let

    Z_c = the sum over categories a of W_ca * X'_a

where X'_a is the direct estimate for the a-th category of persons, secured from a probability survey; W_ca is the proportion of persons in community c that fall into category a; and Z_c is the synthetic estimate for community c. The efficiency of the Z-estimate depends upon four factors:

A. The variability of the X' measures among a-classes. Design should make this variability as great as feasible.

B. The variability of the X-measure among persons within an a-class. We seek a-classes for which this variability is relatively small.

C. The sampling variances of the estimates X'_a, which in turn are a function of B above and of sample size. This means that sample sizes of the a-classes must be adequate to yield tolerable sampling error.

D. The variability of the W_ca values among the c-communities of interest for a given a-class. The guidelines require a search for a-classes for which this variability is as great as available data permit.

It seemed to me that the majority view of conferees was that the best choice is a composite estimate that is a weighted average of a direct estimate and either a simple synthetic estimate or some form of regression estimate.

For many purposes, data for a homogeneous class of small areas -- where class is defined in socioeconomic terms; for example, central cities in the North Central U.S. with 200,000 to 1,000,000 population, median household income under $10,000, and more than 20 percent black -- are acceptable in lieu of data for a specific small area and may have greater validity. Average relative measurement error may be quite large for individual small areas, but may be substantially less for the direct estimate for the homogeneous class of areas, and thus lead to superior final conclusions.

Afterword

Joseph Steinberg

The participants in this workshop gave the existing techniques of Synthetic Estimates for Small Areas a mixed review.
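Simmons's synthetic estimator above, Z_c = sum over a of W_ca * X'_a, can be illustrated with a short numerical sketch. All category names, rates, and community proportions below are hypothetical illustration values, not figures from the workshop: a national probability survey would supply the category rates X'_a, census data the community proportions W_ca.

```python
# Sketch of the synthetic estimator Z_c = sum over a of W_ca * X'_a.
# All rates and proportions are hypothetical, for illustration only.

# X'_a: direct national estimates (e.g., prevalence rates) for each
# demographic category a, secured from a probability survey.
x_direct = {"male_18_24": 0.12, "male_25_plus": 0.06,
            "female_18_24": 0.09, "female_25_plus": 0.04}

# W_ca: proportion of persons in community c falling into category a
# (typically from census data); each community's proportions sum to 1.
w = {
    "community_1": {"male_18_24": 0.10, "male_25_plus": 0.40,
                    "female_18_24": 0.10, "female_25_plus": 0.40},
    "community_2": {"male_18_24": 0.25, "male_25_plus": 0.25,
                    "female_18_24": 0.25, "female_25_plus": 0.25},
}

def synthetic_estimate(w_c, x):
    """Z_c: weighted sum of category rates by community composition."""
    return sum(w_c[a] * x[a] for a in x)

for c, w_c in w.items():
    print(c, round(synthetic_estimate(w_c, x_direct), 4))
```

Note how the sketch reflects Simmons's factor D: the two communities receive different estimates only because their W_ca compositions differ; if all communities had the same composition, every Z_c would coincide.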
Synthetic estimates are useful in some situations where small area data are not available. There are other situations where synthetic estimates are not useful, and in some cases they may be worse than no data at all. Throughout the course of the workshop, there have been comments and advice concerning criteria about when to use and when not to use synthetic estimates. Walt Simmons in his Expansion of Remarks suggests guidelines for increasing the utility of a synthetic estimate.

It was felt that where there were going to be important decisions involving substantial sums of money, there should be significant efforts to obtain funding of direct survey estimates with usable precision. For other situations, especially where funds were limited for program needs and where cost-benefit analyses dictated it, synthetic estimates may serve in the absence of anything else. In such situations they are likely to be better as a basis for decisions than opinions or pressure (although some may prefer pressure as a decision-making tool).

Surveys or census results may not provide the answer to small area data needs if there are relatively large measurement errors in direct data collection. If the data are needed retrospectively, there will be no opportunity to do surveys, and all that is feasible is one or another form of indirect estimation, if anything is to be provided.

Anomalies need to be recognized. Symptomatic data may be helpful in recognizing such situations. Sometimes the symptomatic data, used in a regression function, may provide one useful component of a composite local area estimator. The other component could be a direct estimate or a synthetic estimate. James-Stein estimators should be considered.

Symptomatic data may be helpful in the efficient design of a basic sample survey geared to the needs of synthetic estimation. Multilevel survey design strategies need to be considered.
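A composite estimator of the kind described above -- a weighted combination of a direct estimate with a synthetic or regression component -- can be sketched as follows. The inverse-variance weighting rule and all numbers are illustrative assumptions; in practice the variances would themselves have to be estimated, and the weights would vary by area.

```python
# Sketch of a composite small-area estimator: a weighted average of a
# direct survey estimate and a synthetic estimate. The weight on the
# direct estimate shrinks toward the synthetic component as the direct
# estimate's sampling variance grows. All numbers are hypothetical.

def composite_estimate(direct, synthetic, var_direct, var_synthetic):
    # Weight each component inversely to its (assumed known) variance.
    k = var_synthetic / (var_direct + var_synthetic)  # weight on direct
    return k * direct + (1.0 - k) * synthetic

# A small area with a noisy direct estimate leans mostly on the
# synthetic component:
blended = composite_estimate(direct=0.10, synthetic=0.06,
                             var_direct=0.0009, var_synthetic=0.0001)
print(round(blended, 4))
```

This mirrors the shrinkage idea behind the James-Stein estimators mentioned above: the poorer the direct estimate, the more the composite borrows strength from the synthetic one, and it cannot do worse than the poorer of two comparable components.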
The efficiencies of designs using random digit dialing techniques for one aspect should be explored.

A variety of estimators have been discussed during the workshop. Each use was related to particular circumstances. The nature of the variable being estimated may suggest the desirability (or its lack) of use of a simple synthetic or regression estimator or a composite estimator. Synthetic estimates may not be a good way of ordering areas if they are based on demographic characteristics, since such characteristics may not vary much among local areas; care was advised for such intended use.

If there is some means of determining quality of estimate, publication of synthetic estimates could be considered. Availability only of average approximate measures of quality should be considered reasonable for synthetic estimates, as are average approximate standard errors when publishing probability sample survey data. After evaluation of likely quality, it seems clear that professional statistical judgment needs to be exercised before synthetic estimation use is recommended.

There is a need for continuing research on estimators and evaluation methods. It is unlikely that many small area data needs--including some where substantial resource allocation is involved--are going to be met by direct surveys. Continuing efforts to improve small area estimation techniques are needed to serve the many and varied policy and administrative needs of our society for objective planning, allocation, and decision.

Appendix A: Attendees at Workshop
Appendix B: Workshop Program

Appendix A: Attendees at Workshop

Herbert I. Abelson, Ph.D.
Response Analysis Corporation
Box 158
Princeton, New Jersey 08540

Barbara A. Bailar, Ph.D.
Research Center for Measurement Methods
Bureau of the Census
Washington, DC 20233

Ira Cisin, Ph.D.
Social Research Group The George Washington University 2401 Virginia Avenue, NW Washington, DC 20037 Judy Coakley Division of Extramural Research National Institute on Alcoholism and Alcohol Abuse Rockville, Maryland 20857 Reuben Cohen Response Analysis Corporation Box 158 Princeton, New Jersey 08540 Steven B. Cohen, Ph.D. National Center for Health Services Research Hyattsville, Maryland 20782 Eugene P. Ericksen, Ph.D. Institute for Survey Research Temple University 1601 North Broad Street Philadelphia, Pennsylvania 19122 Donna 0. Farley Suburban Cook County - DuPage County Health Systems Agency 1010 Lake Street Oak Park, Illinois 60301 Robert E. Fay, Ph.D. Statistical Research Division Bureau of the Census Washington, DC ~ 20233 Dwight K. French National Center for Health Statistics Hyattsville, Maryland 20782 Charles Froland, Ph.D. Regional Research Institute Portland State University Portland, Oregon 97207 274 Charles J. Furst, Ph.D. Neuropsychiatric Institute - Department of Psychiatry University of California Center for the Health Sciences Los Angeles, California 90024 Maria Elena Gonzalez Office of Federal Statistical Policy and Standards Department of Commerce Washington, DC 20230 Warren G. Holland National Clearinghouse for Alcohol Information - Aleohol Epidemiologic Data System PO Box 2345 Rockville, Maryland 20852 Gary G. Koch, Ph.D. Department of Biostatistics School of Public Health University of North Carolina Chapel Hill, North Carolina 27514 Paul S. Levy, Sc.D. Department of Biostatistics School of Health Sciences University of Massachusetts Amherst, Massachusetts 01003 (affiliation at time of workshop: School of Public Health University of Illinois at the Medical Center Chicago, Illinois 60680) Lillian H. Madow Bureau of Labor Statistics 441 G Street, NW Washington, DC 20212 Harold Nisselson Statistical Standards and Methodology Bureau of the Census Washington, DC 20233 Frederick J. 
Oeltjen
Division of Community Assistance
National Institute on Drug Abuse
Rockville, Maryland 20857

David M. Promisel, Ph.D.
Policy Studies and Special Reports Branch
National Institute on Alcohol Abuse and Alcoholism
Rockville, Maryland 20857

Louise G. Richards, Ph.D.
Psychosocial Branch, Division of Research
National Institute on Drug Abuse
Rockville, Maryland 20857

Joan Rittenhouse, Ph.D.
Office of Medical and Professional Affairs
National Institute on Drug Abuse
Rockville, Maryland 20857

Ron Roizen
Social Research Group
School of Public Health
University of California
1918 Bonita Avenue
Berkeley, California 94704

Richard M. Royall, Ph.D.
Department of Biostatistics
School of Hygiene and Public Health
The Johns Hopkins University
615 North Wolfe Street
Baltimore, Maryland 21205

Wesley L. Schaible, Ph.D.
Office of Statistical Research
National Center for Health Statistics
Hyattsville, Maryland 20782

Walt R. Simmons
1525 Belle Haven Road
Alexandria, Virginia 22307

Monroe G. Sirken, Ph.D.
Office for Mathematical Statistics
National Center for Health Statistics
Hyattsville, Maryland 20782

Joseph Steinberg
Survey Design, Inc.
1320 Fenwick Lane
Silver Spring, Maryland 20910

Joseph Waksberg
Westat, Inc.
11600 Nebel Street
Rockville, Maryland 20852

Robert Wilson, Ph.D.
College of Urban Affairs and Public Policy
University of Delaware
Newark, Delaware 19711

Philip Wirtz
Social Research Group
The George Washington University
2401 Virginia Avenue, NW
Washington, DC 20037

Thomas H. Woteki, Ph.D.
Office of Data Development
Energy Information Administration
Department of Energy
Washington, DC 20461

Appendix B: Workshop Program

Thursday, April 13, 1978

9:00 CONVENING OF WORKSHOP
Remarks by Chair
Louise G. Richards, National Institute on Drug Abuse
Monroe G. Sirken, National Center for Health Statistics
Joseph Steinberg, Survey Design, Inc.

9:15 PAPER
Small Area Estimation -- Synthetic and Other Procedures, 1968-1978
Discussants
Paul S.
Levy, School of Public Health, University of Illinois at the Medical Center
Walt R. Simmons, Consultant, National Academy of Sciences
Gary G. Koch, School of Public Health, The University of North Carolina at Chapel Hill

10:45 PAPER
Drug Abuse Applications: Some Regression Explorations with National Survey Data
Discussants
Reuben Cohen, Response Analysis Corporation
Monroe G. Sirken
Ira Cisin, Social Research Group, The George Washington University

2:00 PAPER
A Composite Estimator for Small Area Statistics
Wesley L. Schaible, National Center for Health Statistics
Discussant
Barbara A. Bailar, Bureau of the Census

3:00 PAPER
Prediction Models in Small Area Estimation
Richard M. Royall, School of Hygiene and Public Health, The Johns Hopkins University
Discussant
Harold Nisselson, Bureau of the Census

4:00 PAPER
A Modified Approach to Small Area Estimation
Steven B. Cohen
Discussant
Joseph Waksberg, Westat, Inc.

Friday, April 14, 1978

9:00 PAPER
Applications of Synthetic Estimates to Alcoholism and Problem Drinking
David M. Promisel, National Institute on Alcohol Abuse and Alcoholism
Discussant
Donna O. Farley, Suburban Cook County-DuPage County Health Systems Agency

10:15 PAPERS
Case Studies on the Use and Accuracy of Synthetic Estimates: Unemployment and Housing Applications
Maria E. Gonzalez, Office of Federal Statistical Policy and Standards
Some Recent Census Bureau Applications of Regression Techniques to Estimation
Robert E. Fay, Bureau of the Census
Discussant
Eugene P. Ericksen, Institute for Survey Research, Temple University

2:00 PAPER
Synthetic Estimates as an Approach to Needs Assessment: Issues and Experience
Charles Froland, Berkeley Planning Associates
Discussant
Reuben Cohen

3:00 Summary Remarks
Joseph Steinberg

National Institute on Drug Abuse Research Monograph Series

While limited supplies last, single copies of the monographs may be obtained free of charge from the National Clearinghouse for Drug Abuse Information (NCDAI).
Please contact NCDAI also for information about availability of coming issues and other publications of the National Institute on Drug Abuse relevant to drug abuse research. Additional copies may be purchased from the U.S. Government Printing Office (GPO) and/or the National Technical Information Service (NTIS) as indicated. NTIS prices are for paper copy. Microfiche copies, at $3, are also available from NTIS. Prices from either source are subject to change.

Addresses are:

NCDAI
National Clearinghouse for Drug Abuse Information
Room 10-A-56
5600 Fishers Lane
Rockville, Maryland 20857

GPO
Superintendent of Documents
U.S. Government Printing Office
Washington, D.C. 20402

NTIS
National Technical Information Service
U.S. Department of Commerce
Springfield, Virginia 22161

1 FINDINGS OF DRUG ABUSE RESEARCH. An annotated bibliography of NIMH- and NIDA-supported extramural grant research, 1964-74. Volume 1, 384 pp.; Volume 2, 377 pp.
Vol. 1: GPO out of stock  NTIS PB #272 867/AS $14
Vol. 2: GPO Stock #017-024-0466-9 $5.05  NTIS PB #272 868/AS $15

2 OPERATIONAL DEFINITIONS IN SOCIO-BEHAVIORAL DRUG USE RESEARCH 1975. Jack Elinson, Ph.D., and David Nurco, Ph.D., editors. Task Force articles proposing consensual definitions of concepts and terms used in psychosocial research. 167 pp.
GPO out of stock  NTIS PB #246 338/AS $8

3 AMINERGIC HYPOTHESES OF BEHAVIOR: REALITY OR CLICHE? Bruce J. Bernard, Ph.D., editor. Articles examining the relation of the brain monoamines to a range of animal and human behaviors. 149 pp.
GPO Stock #017-024-00486-3 $2.25  NTIS PB #246 687/AS $8

4 NARCOTIC ANTAGONISTS: THE SEARCH FOR LONG-ACTING PREPARATIONS. Robert Willette, Ph.D., editor. Articles reporting on alternative inserted sustained-release and long-acting drug devices. 45 pp.
GPO Stock #017-024-00488-0 $1.10  NTIS PB #247 096/AS $4.50

5 YOUNG MEN AND DRUGS: A NATIONWIDE SURVEY. John A. O'Donnell, Ph.D., et al.
Report of a national survey of drug use by men 20-30 years old in 1974-5. 144 pp.
GPO Stock #017-024-00511-8 $2.25  NTIS PB #247 446/AS $8

6 EFFECTS OF LABELING THE "DRUG ABUSER": AN INQUIRY. Jay R. Williams, Ph.D. Analysis and review of the literature examining effects of drug use apprehension or arrest on the adolescent. 39 pp.
GPO Stock #017-024-00512-6 $1.05  NTIS PB #249 092/AS 50

7 CANNABINOID ASSAYS IN HUMANS. Robert Willette, Ph.D., editor. Articles describing current developments in methods for measuring cannabinoid levels in the human body by immunoassay, liquid and dual column chromatography and mass spectroscopy techniques. 120 pp.
GPO Stock #017-024-00510-0 $1.95  NTIS PB #251 905/AS $7.25

8 Rx: 3x/WEEK LAAM - ALTERNATIVE TO METHADONE. Jack Blaine, M.D., and Pierre Renault, M.D., editors. Comprehensive summary of development of LAAM (levo-alpha-acetyl methadol), a new drug for treatment of narcotic addiction. 127 pp.
Not available from GPO  NTIS PB #253 763/AS $7.25

9 NARCOTIC ANTAGONISTS: NALTREXONE. Demetrios Julius, M.D., and Pierre Renault, M.D., editors. Progress report of development, preclinical and clinical studies of naltrexone, a new drug for treatment of narcotic addiction. 182 pp.
GPO Stock #017-024-00521-5 $2.55  NTIS PB #255 833/AS $9

10 EPIDEMIOLOGY OF DRUG ABUSE: CURRENT ISSUES. Louise G. Richards, Ph.D., and Louise B. Blevens, editors. Conference proceedings. Examination of methodological problems in surveys and data collection. 259 pp.
GPO Stock #017-024-00571-1 $2.60  NTIS PB #266 691/AS $10.75

11 DRUGS AND DRIVING. Robert Willette, Ph.D., editor. State-of-the-art review of current research on the effects of different drugs on performance impairment, particularly on driving. 137 pp.
GPO Stock #017-024-00576-2 $1.70  NTIS PB #269 602/AS $8

12 PSYCHODYNAMICS OF DRUG DEPENDENCE. Jack D. Blaine, M.D., and Demetrios A. Julius, M.D., editors.
A pioneering collection of papers to discover the part played by individual psychodynamics in drug dependence. 187 pp.
GPO Stock #017-024-00642-4 $2.75  NTIS PB #276 084/AS $9

13 COCAINE: 1977. Robert C. Petersen, Ph.D., and Richard C. Stillman, M.D., editors. A series of reports developing a picture of the extent and limits of current knowledge of the drug, its use and misuse. 223 pp.
GPO Stock #017-024-00592-4 $3  NTIS PB #269 175/AS $9.25

14 MARIHUANA RESEARCH FINDINGS: 1976. Robert C. Petersen, Ph.D., editor. Technical papers on epidemiology, chemistry and metabolism, toxicological and pharmacological effects, learned and unlearned behavior, genetic and immune system effects, and therapeutic aspects of marihuana use. 251 pp.
GPO Stock #017-024-00622-0 $3  NTIS PB #271 279/AS $10.75

15 REVIEW OF INHALANTS: EUPHORIA TO DYSFUNCTION. Charles Wm. Sharp, Ph.D., and Mary Lee Brehm, Ph.D., editors. A broad review of inhalant abuse, including sociocultural, behavioral, clinical, pharmacological, and toxicological aspects. Extensive bibliography. 347 pp.
GPO Stock #017-024-00650-5 $4.25  NTIS PB #275 798/AS $12.50

16 THE EPIDEMIOLOGY OF HEROIN AND OTHER NARCOTICS. Joan Dunne Rittenhouse, Ph.D., editor. Task Force report on measurement of heroin-narcotic use, gaps in knowledge and how to address them, improved research technologies, and research implications. 249 pp.
GPO Stock #017-024-00690-4 $3.50  NTIS PB #276 357/AS $9.50

17 RESEARCH ON SMOKING BEHAVIOR. Murray E. Jarvik, M.D., Ph.D., et al., editors. State-of-the-art of research on smoking behavior, including epidemiology, etiology, socioeconomic and physical consequences of use, and approaches to behavioral change. From a NIDA-supported UCLA conference. 383 pp.
GPO Stock #017-024-00694-7 $4.50  NTIS PB #276 353/AS $13

18 BEHAVIORAL TOLERANCE: RESEARCH AND TREATMENT IMPLICATIONS. Norman A. Krasnegor, Ph.D., editor.
Conference papers discuss theoretical and empirical studies of nonpharmacologic factors in development of tolerance to a variety of drugs in animal and human subjects. 151 pp.
GPO Stock #017-024-00699-8 $2.75  NTIS PB #276 337/AS $8

19 THE INTERNATIONAL CHALLENGE OF DRUG ABUSE. Robert C. Petersen, Ph.D., editor. A monograph based on papers presented at the World Psychiatric Association 1977 meeting in Honolulu. Emphasis is on emerging patterns of drug use, international aspects of research, and therapeutic issues of particular interest worldwide.
GPO Stock #017-024-00822-2 $4.50  NTIS PB # to be assigned

20 SELF-ADMINISTRATION OF ABUSED SUBSTANCES: METHODS FOR STUDY. Norman A. Krasnegor, Ph.D., editor. Papers from a technical review on methods used to study self-administration of abused substances. Discussions include overview, methodological analysis, and future planning of research on a variety of substances: drugs, ethanol, food, and tobacco. 246 pp.
Not available from GPO  NTIS PB #288 471/AS $10.75

21 PHENCYCLIDINE (PCP) ABUSE: AN APPRAISAL. Robert C. Petersen, Ph.D., and Richard C. Stillman, M.D., editors. Monograph derived from a technical review to assess the present state of knowledge about phencyclidine and to focus on additional areas of research. Papers are aimed at a professional and scientific readership concerned about how to cope with the problem of PCP abuse. 313 pp.
GPO Stock #017-024-00785-4 $4.25  NTIS PB #288 472/AS $11.75

22 QUASAR: QUANTITATIVE STRUCTURE ACTIVITY RELATIONSHIPS OF ANALGESICS, NARCOTIC ANTAGONISTS, AND HALLUCINOGENS. Gene Barnett, Ph.D., Milan Trsic, Ph.D., and Robert E. Willette, Ph.D., editors. Reports an interdisciplinary conference on the molecular nature of drug-receptor interactions. A broad range of quantitative techniques were applied to questions of molecular structure, correlation of molecular properties with biological activity, and molecular interactions with the receptor(s). 487 pp.
GPO Stock #017-024-00786-2 $5.25  NTIS PB #292 265/AS $15

23 CIGARETTE SMOKING AS A DEPENDENCE PROCESS. Norman A. Krasnegor, Ph.D., editor. A review of the biological, behavioral, and psychosocial factors involved in the onset, maintenance, and cessation of tobacco/nicotine use is presented, together with an agenda for further research on the cigarette smoking habit process. In Press

U.S. GOVERNMENT PRINTING OFFICE: 1979 O-293-965