Public Health Service Publication No. 1000-Series 2-No. 13
For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C., 20402 - Price 35 cents

NATIONAL CENTER FOR HEALTH STATISTICS
Series 2, Number 13

VITAL and HEALTH STATISTICS
DATA EVALUATION AND METHODS RESEARCH

Computer Simulation of Hospital Discharges

Micro-simulation of measurement errors in hospital discharge data reported in the Health Interview Survey.

Washington, D.C. February 1966

U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE
John W. Gardner, Secretary
Public Health Service
William H. Stewart, Surgeon General

NATIONAL CENTER FOR HEALTH STATISTICS
FORREST E. LINDER, Ph. D., Director
THEODORE D. WOOLSEY, Deputy Director
OSWALD K. SAGEN, Ph. D., Assistant Director
WALT R. SIMMONS, M.A., Statistical Advisor
ALICE M. WATERHOUSE, M.D., Medical Advisor
JAMES E. KELLY, D.D.S., Dental Advisor
LOUIS R. STOLCIS, M.A., Executive Officer

OFFICE OF HEALTH STATISTICS ANALYSIS
Iwao M. Moriyama, Ph. D., Chief
DIVISION OF VITAL STATISTICS
Robert D. Grove, Ph. D., Chief
DIVISION OF HEALTH INTERVIEW STATISTICS
Philip S. Lawrence, Sc. D., Chief
DIVISION OF HEALTH RECORDS STATISTICS
Monroe G. Sirken, Ph. D., Chief
DIVISION OF HEALTH EXAMINATION STATISTICS
Arthur J. McDowell, Chief
DIVISION OF DATA PROCESSING
Sidney Binder, Chief

Library of Congress Catalog Card Number 65-62273

PREFACE

The purpose of the study described in this report was two-fold: (1) the underlying consideration was methodology, with emphasis on model building and on experience to be gained in the use of computer simulation techniques employed in analysis of health statistics; and (2) the immediate target was a better understanding of the impact of certain measurement deficiencies present in health interview surveys. The specific problems studied are set forth in sections I and II of the report.
The subject matter is hospital discharges, and more especially the discrepancies between the number of discharges as reported by household respondents to interview and those that actually occur. The Health Interview Survey of the National Center for Health Statistics in its household inquiry includes questions asking for the number and characteristics of hospital discharges experienced by household members in the year prior to interview. There are many reasons for discrepancy between the reported number of discharges and the true number. Two of these causes have been given particular attention. One is that hospital experience during the reference period for persons not living at the time of interview is not reported in a survey of living persons. This deficiency is relatively more important the longer the reference period. A second principal cause of discrepancy between reported and true data is the response error in the report for a living person. Empirical data and theory have indicated that this error, too, increases with length of reference period.

The interaction of these factors and their impact on reported data have been explored previously in a variety of ways, using record-check techniques, internal analysis of reported data, and hypothetical models. This research has contributed substantially to better knowledge of the subject but has left several questions unanswered. It seemed likely that understanding would be further promoted, and especially that better judgments could be made of the effect of changes in interview procedure, if the process were to be studied through a technique for simulating on a computer the hospital experience of a model population of individual persons, and subsequently simulating interviews of this population. Such an undertaking might have particular merit since the main threads of logic for the hospital problem might have considerably wider potential application; for example, a close analogy can be made between periods of unemployment and hospital episodes.

Accordingly, through a contractual arrangement the present study was carried out by Research Triangle Institute, Durham, N.C., in close cooperation with staff members of the National Center for Health Statistics. Dr. D. G. Horvitz of the Research Triangle Institute was the project director and principal author of this report. He was assisted by Dr. D. T. Searls, formerly on the Research Triangle Institute staff, and by Irving Drutman (deceased) of North Carolina State University. Mr. Drutman did most of the computer programming. Other contributors to the study were Mr. Joseph Snavely of the North Carolina State University Computing Center and Mr. Francis Giesbrecht of the Research Triangle Institute, who developed appropriate expected values and variances for the computer-generated discharge rates. Walt R. Simmons prepared an initial outline of the problem, proposed the simulation approach, and coordinated contributions of the staff of the Center to the project. Wilbur M. Sartwell of the Center staff supervised much of the computer calculation.

CONTENTS

Preface
I. Introduction
II. Project Objectives
III. Procedures
    Summary
    A Stochastic Model for Hospital Episodes
    Hospital Admissions Model
    Duration-of-Stay Model
    Computer Simulation of Hospital Episodes
    Interview Simulation Model
    Underreporting of Hospital Episodes
    Length-of-Stay Response Errors
    Month-of-Discharge Response Errors
    Computer Simulation of Interviews
    Simulation Estimates of Errors in Hospital Discharge Data
IV. Results
    Evaluation of Hospital Episodes Simulation
    Evaluation of Interview Simulation
    Estimates of Specific Error Components
    Methods for Increasing Accuracy
V. Conclusions
Detailed Tables
Appendix. Outline for Computer Simulation of Hospital Discharges

IN THIS REPORT a study is presented on computer micro-simulation of discharges from short-stay hospitals, and on the associated measurement errors that occur in household interview surveys, as set forth in the preface. A synthetic universe of 10,000 persons was established with demographic characteristics similar to those of the U.S. civilian, noninstitutional population. On the basis of earlier theoretical work and empirical record-check studies, this universe was subjected to a series of stochastic operations to simulate hospital experience, and the reporting of that experience in household interviews.
Each individual person was moved from one state to another (e.g., from not-in-a-hospital to in-a-hospital, or from in-a-hospital to discharged-alive) by a random process with probabilities which varied by such factors as age, sex, distance from death, number of days already in the hospital, and a general health index. Thus it was possible to count the simulated hospital discharges over a 12-month period, and to tabulate them in a variety of ways.

At monthly intervals the living persons in the synthetic population then were "interviewed" by the computer and reported their hospital experience over the previous year. Two sets of simulated interview data were tabulated. In one, respondents reported without error. For this set, comparison with total experience reflected the impact on discharge statistics of the missing data for persons not living at the time of interview. In the other, response was conditioned by probabilities of reporting correctly, which varied by distance between interview and discharge, length of stay, reason for hospitalization, and other less significant factors. Comparison of this latter set of data with total experience gives a mechanism for studying a wide range of problems found in the interview data.

Throughout the study, emphasis was placed on the development and use of a flexible method of analysis. The report is not an evaluation of the reporting of hospital discharges in the Health Interview Survey.

SYMBOLS

Data not available                                             ---
Category not applicable                                        ...
Quantity zero                                                  -
Quantity more than 0 but less than 0.05                        0.0
Figure does not meet standards of reliability or precision     *

COMPUTER SIMULATION OF HOSPITAL DISCHARGES

I. INTRODUCTION

The Health Interview Survey of the National Center for Health Statistics provides estimates of the number of discharges from hospitals on an annual basis for the living, civilian, noninstitutional population.
The data are gathered in a household interview survey by means of personal interviews conducted each week, during a 52-week period, in area probability samples of households throughout the United States. The information on discharges (along with hospital utilization) is obtained for each resident in the sample households for a reference period of 12 months prior to the week of interview.

There are some readily recognized factors in the survey procedure which cause the number of discharges reported by the respondents to differ from the actual number which occurred in hospitals during the reference year. One important factor is the failure of the respondents to report correctly each hospital episode during the reference year. A second factor is that the survey covers only persons living on the date of interview. The hospital experience of persons who died in the year prior to interview is not included.

If the difference between reported discharges and all discharges taking place during the reference year is examined on a weekly or monthly basis, a definite decreasing trend or decay, moving backward in time from the date of interview, of the number of discharges reported by the respondents in the Health Interview Survey is observed. Explanations for this decay curve include the following factors.

1. Response errors. Underreporting can be expected to increase with increasing length of the recall period. In other words, recent discharges are more likely to be recalled and reported accurately than discharges which occurred earlier in the reference year.
2. Persons in their last year of life. A study of hospital utilization in the last year of life reports that the "daily discharge rate per 1,000 deaths increases gradually from less than 1 during the twelfth month before death to about 3 on the day before death."¹ The Health Interview Survey obtains information from persons who will die in the year following the date of interview.
The discharges for these persons for the reference year are more frequent for the period immediately prior to the date of interview than for earlier periods in the reference year, thus contributing to the observed decay curve.
3. Population growth. Only living persons residing in the sample households on the date of the interview are eligible for the survey. The size of this population is probably at least 1.5 percent smaller 12 months prior to the date of interview, since during this period there are births and other additions to the household population such as returnees from mental and penal institutions. During this same period, losses in the household population occur, but these are not recorded since they involve persons who died or were institutionalized.
4. Hospital discharge trend. A portion of the observed trend may be a legitimate consequence of natural phenomena related to the hospitalization needs of the population. If there is an increasing trend in hospital admission rates, then the same trend will be present in the discharge rates. Such a trend is not expected to be very great during a period as short as 1 year.

Response errors in reported hospital discharges have been studied by the Survey Research Center, University of Michigan, in cooperation with the Bureau of the Census and the National Center for Health Statistics. The first study employed a sample of individuals with known hospitalization records.² These persons were interviewed concerning their hospital experience, and the results were compared with the records obtained from hospitals. The comparisons confirmed that underreporting of hospitalization increases with length of recall period. For discharges occurring near the beginning of the 12-month period prior to interview such underreporting was particularly serious. The study estimated underreporting of hospital episodes for the reference year to be 10 percent.
A second study compared three survey procedures for obtaining hospital episode data, including the Health Interview Survey procedure which was used as the standard.³ Reporting accuracy was found to be significantly improved by using a revised interview schedule with a mail followup to obtain information concerning hospital stays that had been overlooked in the interview.

With respect to decedents during the reference year, the Division of Vital Statistics of the Center conducted a study of hospitalizations during the last year of life from the records of a sample of deaths in the Middle Atlantic States, i.e., New York, New Jersey, and Pennsylvania.¹ The study estimated that the hospital discharges reported in the Health Interview Survey for the Middle Atlantic States needed to be adjusted upward by approximately 8 percent to include the experience of decedents. A similar study on a national scale is now nearing completion.

The Health Interview Survey collects data from a new sample of households each week.⁴ It is therefore possible to compare the hospital discharges reported for a particular calendar period by two or more of these weekly samples. For example, consider the number of hospital discharges reported for the month prior to interview of each weekly sample and compare this with the number of hospital discharges reported for the same month by each sample interviewed 4 weeks later. The average discrepancy for the paired weekly samples represents an estimate of the combined effects of mortality and response errors for the second month prior to interview. Such factors as population growth or hospitalization trends are not included in the observed difference.
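The paired-sample comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the use of a relative (rather than absolute) shortfall, and the counts in the usage example are all hypothetical and do not come from the Health Interview Survey.

```python
from statistics import mean


def paired_discrepancy(first_reports, later_reports):
    """Average relative shortfall between paired weekly samples.

    first_reports[i]: discharges reported for a calendar month by the
        sample interviewed immediately after that month.
    later_reports[i]: discharges reported for the same calendar month by
        the sample interviewed 4 weeks later.
    Both lists are hypothetical inputs; equal sample sizes are assumed.
    """
    if len(first_reports) != len(later_reports):
        raise ValueError("each early sample needs exactly one paired later sample")
    shortfalls = [(a - b) / a for a, b in zip(first_reports, later_reports)]
    return mean(shortfalls)


# Made-up counts, for illustration only.
first = [100, 120, 80]
later = [95, 114, 76]
print(paired_discrepancy(first, later))
```

Because both members of a pair refer to the same calendar month, population growth and hospitalization trend cancel out of the comparison, which is why the result reflects only mortality and response errors.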
Analyses of this type have been carried out with Health Interview Survey data to estimate the relationship between underreporting (including mortality and response errors) and the time interval between discharge and date of interview. Simmons and Bryant derived adjustment factors based on these internal analyses by which hospital discharges reported in the Health Interview Survey need to be inflated according to the distance between discharge and interview to produce an estimate of total hospital discharges, including discharges for persons dying during the reference year. Although so extensive an adjustment procedure has not been adopted, publication of hospital discharges reported in the Health Interview Survey is now based on data for the most recent 6 months of the reference year. The 12-month reference period is retained in the interview.

While research has resulted in greater understanding and knowledge of the role played by various factors affecting observed discrepancies, this understanding and knowledge is still insufficient for specification of a completely satisfactory procedure of data collection and estimation. Part of this difficulty might be explained by the fact that the major studies of response error and mortality factors have been carried out independently.

An ideal research design might conduct a prospective study on a large population sample for 1 year, observe (independently) the actual hospitalization experience of this sample, and interview those persons living at the end of the year. The required data for a fuller understanding would probably result from such a study. However, this is not considered a feasible research project; it might be impossible to carry it out satisfactorily.

An alternative research approach is to simulate this prospective study on a computer.
This implies specifying a population to be followed over time, with the initial state of each individual known, such as age, whether or not in a hospital, and if so, the number of days the individual has already spent in a hospital. It also requires the specification of the transition probabilities for each pair of possible states for each time period (such as a week), including mortality. The division of the population into the various states for each time period is then generated successively by means of the transition probabilities. In this way the hospital discharges can be counted for each time period, including those of individuals discharged dead as well as those of individuals who die in subsequent time periods.

The household interview among living persons in the generated population at the end of 1 year can also be simulated. This simulation uses a probability function relating failure to report hospital episodes to the number of weeks between discharge and interview. The simulated interview data can then be compared with the generated hospital discharge data and the distribution of the discrepancy among the contributing factors determined for each time period.

The computer simulation approach was used in this project.

II. PROJECT OBJECTIVES

The major purpose of this project was to develop a research tool for comparison of alternative hospital episode interview survey procedures. It was expected that the computer simulation approach could lead to relatively inexpensive evaluation of the effects of alternative procedures and eventually to more efficient and accurate procedures for the continuous collection and estimation of hospital discharge statistics.

Specific objectives of the project were:

1. To develop probability models for generating (a) hospital admissions and durations of stay for a given population, and (b) interview data on hospital episodes as collected in the Health Interview Survey.
2. To determine suitable parameter inputs for the models from existing data.
3. To program an IBM 1410 computer for experimental simulation under the models.
4. To estimate, through computer simulations, the specific effects of the various factors related to the discrepancy between hospital discharges reported in the interview survey and all discharges.
5. To suggest, on the basis of the research results, a method for continuous collection and adjustment of hospital discharge data.

III. PROCEDURES

SUMMARY

The initial phase of this project was concerned primarily with developing a probability model for generating hospital episodes for individuals on a computer. The model adopted assumes that each individual in the population of interest has a particular probability of being hospitalized each week. It further assumes that this weekly hospital admission probability remains constant for a given individual over the time period of interest (provided he is not in his last year of life), but varies from individual to individual. Based on empirical studies of data available from the Health Interview Survey and on theoretical considerations, it was determined that the generalized gamma distribution provides a suitable and consistent model for the distribution of the weekly admission probabilities over the population. Once an individual is hospitalized, the model provides for discharge from the hospital on a daily probability basis with the chance of discharge conditional on the number of days already hospitalized. The log-normal distribution was adopted as the duration-of-stay model, following empirical analysis of length-of-stay data available from the Health Interview Survey.

A computer program was developed in the second phase of this project to generate hospitalization histories for each individual in a model U.S. population.
The weekly admission probabilities and daily discharge probabilities employed in the computer program were estimated for individuals in each of 12 age-sex groups consistent with the hospital episodes model developed in the first phase. In brief, the computer program generates uniform random numbers to compare with the appropriate weekly hospital admission probability for an individual during each week that the individual is not hospitalized. When an individual is hospitalized by the computer, it then generates uniform random numbers to compare with the appropriate daily discharge probabilities until the individual is discharged. The computer records the day of admission and day of discharge for each hospital episode generated.

This basic computer program, with some modifications, was carried out for an initial population of 10,000 individuals, distributed by age and sex to represent the U.S. civilian, noninstitutional population, for a period of 108 weeks or 756 days. The modifications included introducing births and deaths in order to give a dynamic dimension to the population and using a separate set of daily hospital admission probabilities for individuals in their last year of life. These latter probabilities increased gradually as the day of death approached. Except for deliveries, reasons for hospitalization were not assigned in the computer simulation program. The computer determined on a random basis those deliveries which were to occur in a hospital.

In the third phase of the project a relatively simple model was devised to simulate the responses obtained in household interviews for individuals experiencing one or more hospital episodes in the year prior to interview. For each hospital episode, the model simulates on a probability basis failure to report the episode, reported length of stay (if the episode is reported), and reported month of discharge.
The model treats reporting of each hospital episode as a random event dependent on length of the recall period and length of hospital stay for the episode. The distribution of errors in reported length of stay is approximated in the model by a normal or Gaussian distribution. Response errors in the reported month of discharge are simulated in the model by first approximating errors in the reported date of admission by a normal distribution. The reported length of stay is then added to the reported date of admission to obtain the reported discharge date.

A computer program to generate interview results consistent with the interview simulation model was developed in the fourth phase of the project. The input data for this program consisted of the 108 weeks of hospital episode data generated by the first computer program together with parameter values for the interview simulation model. Estimates of the necessary parameters were based on evidence from exploratory work which had been done in the National Center for Health Statistics and especially on the results obtained in the previously mentioned response error study conducted by the Survey Research Center, University of Michigan. This interview simulation computer program was run for 13 separate interview dates 4 weeks apart beginning with week 60 of the 108-week period for which hospital episode data had been generated. The results were tabulated in three separate categories by the computer for each interview date.
The results were tabulated in three separate cate- gories by the computer for each interview date, These results included number of discharges and number of hospital days, by sex, age, and each of 13 four-week periods prior to the interview date, The three tabulation categories were 'interview reported" results for persons alive on the date of interview, which include simulated interview re- porting errors; ''perfect interview' results for persons alive on thedateof interview, which sim- ulate the results which would be obtained by the household interviews if there were no response errors of any kind; and "all discharges' which consist of the actual results for all hospital epi- sodes generated by the first computer program for the year prior to the interview date for all persons, whether alive or dead on the interview date. The data generated by the computer for the 13 interview dates were averaged and estimates of annual hospital discharge rates and annual hos- pital days per 1,000 persons by age and sex were derived for each of the three tabulation categories. Using these results, both separate and combined estimates of the effects of interview response errors and of exclusion of persons who died dur- ing the reference year on hospital discharge data collected in the Health Interview Survey can be derived. A STOCHASTIC MODEL FOR HOSPITAL EPISODES Hospital Admissions Model The model for hospital admissions was deter- mined soon after the project was initiated. This was primarily due to a fortunate exposure to re- search on a mathematical model of an index of health by Dr. Chin Long Chiang, University of California at Berkeley.5 The hospital admissions of an individual during a time interval of length t can be treated as random events in time, that is, as a stochastic process. A simplified model assumes that the probability of the individual being hospitalized during a small time interval dt is given by Adt, where X is a positive constant. ? 
If it is further assumed that this probability λ dt is independent of the number of previous hospital admissions for the individual, then the process is a Poisson process. It follows that the probability of exactly x admissions of the individual occurring during the time t is given by

$$P_x(t) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}, \qquad x = 0, 1, 2, \ldots \tag{1}$$

If the time interval t is taken as 1 year (i.e., t = 1), then the probability density function for the number of hospitalizations annually for the individual is Poisson, where the parameter λ is the expected number of hospital episodes during this period.

Suppose now that the probability of being hospitalized in a small time interval varies from individual to individual in a population so that λ varies over the population. If the distribution of the λ's is gamma, then the distribution of the population by number of hospital episodes yearly is negative binomial, derived as follows.

ᵃ More rigorously, the probability of one or more hospital admissions for an individual in the small interval dt is given by λ dt + o(dt), where the term o(dt) denotes a quantity which is of smaller order of magnitude than dt and is the probability that more than one admission occurs.

From equation (1) above, the distribution of admissions annually for an individual with parameter λ is

$$f(x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots \tag{2}$$

For all individuals in the population, the distribution of λ's is assumed to be a gamma distribution, i.e.,

$$g(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0,\ \alpha > 0,\ \beta > 0. \tag{3}$$

Then the joint distribution of x and λ is

$$f(x, \lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)\, x!} \lambda^{\alpha + x - 1} e^{-\lambda(\beta + 1)}. \tag{4}$$

The distribution of the population by number of hospital episodes annually, that is f(x), is found by integrating equation (4) with respect to λ. Thus,

$$f(x) = \int_0^{\infty} f(x \mid \lambda)\, g(\lambda)\, d\lambda = \frac{\Gamma(\alpha + x)}{x!\, \Gamma(\alpha)} \left(\frac{\beta}{1 + \beta}\right)^{\alpha} \left(\frac{1}{1 + \beta}\right)^{x}, \qquad x = 0, 1, 2, \ldots \tag{5}$$

which is the negative binomial distribution.
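As a numerical check on the negative binomial form in equation (5), the short Python sketch below evaluates the distribution and confirms that it sums to one with mean α/β and variance α(1+β)/β². The function name is an illustrative choice, not part of the original IBM 1410 program, and the α and β values are the Table A estimates for males under 15 years.

```python
import math


def neg_binomial_pmf(x, alpha, beta):
    """Probability of exactly x annual hospital episodes under equation (5).

    Log-gamma functions are used so that non-integer alpha is handled
    without overflow.
    """
    log_p = (math.lgamma(alpha + x) - math.lgamma(alpha) - math.lgamma(x + 1)
             + alpha * math.log(beta / (1.0 + beta))
             - x * math.log(1.0 + beta))
    return math.exp(log_p)


# Table A parameters for males under 15 years.
alpha, beta = 0.3097, 4.9090

# Expected proportions of persons with 0, 1, 2, 3 and 4-or-more episodes,
# the same grouping used for the goodness-of-fit comparison.
first_four = [neg_binomial_pmf(x, alpha, beta) for x in range(4)]
four_or_more = 1.0 - sum(first_four)
for x, p in enumerate(first_four):
    print(f"{x} episodes: {p:.5f}")
print(f"4+ episodes: {four_or_more:.5f}")

# Moments of the mixture, summed far enough into the tail to be exact
# to floating-point precision.
terms = [(x, neg_binomial_pmf(x, alpha, beta)) for x in range(200)]
mean = sum(x * p for x, p in terms)
variance = sum((x - mean) ** 2 * p for x, p in terms)
print(mean, alpha / beta)
print(variance, alpha * (1.0 + beta) / beta ** 2)
```

The variance exceeds the mean by α/β², which is the extra spread contributed by person-to-person variation in λ.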
Data available from the Health Interview Survey for the period July 1958-June 1960 were used to determine the goodness of fit of the negative binomial distribution to the observed frequencies of persons with 0, 1, 2, 3, and 4 or more hospital episodes in the average year. A separate fit was made for males and females in each of the following six age groups: under 15 years, 15-24, 25-34, 35-44, 45-64, and 65 years and older. Each fit was accomplished by estimating the parameters α and β by the method of moments, that is, from the relations

$$\bar{x} = \frac{\alpha}{\beta}, \qquad s^2 = \frac{\alpha (1 + \beta)}{\beta^2}$$

where $\bar{x}$ and $s^2$ are the observed mean and variance respectively. The comparisons of the observed and expected frequencies for the 12 age-sex groups were considered to be fairly good.

While a satisfactory fit of the negative binomial distribution is not sufficient evidence to claim the model to be valid, it does indicate that the model provides an excellent basis for generating hospital episodes reasonably consistent with observation.

Duration-of-Stay Model

Once an individual is hospitalized, his length of stay depends largely on the reason for the hospitalization. Each diagnosis can be considered to generate its own length-of-stay distribution; for example, the length-of-stay distribution for tonsillectomies will be different from that for pneumonia cases. Since the overall length-of-stay distribution is a mixture of many different distributions, it is not expected that any one distribution will fit well. For purposes of computer simulation, the distribution of duration of stay observed in the Health Interview Survey could have been used, except that the data had been grouped into fairly large intervals, particularly for the upper tail of the distribution. A smoothed distribution was preferred.
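Returning briefly to the method-of-moments fit used for the admissions model: the two moment relations (mean equal to α/β, variance equal to α(1+β)/β²) can be inverted in closed form, since the ratio of variance to mean equals 1 + 1/β. The Python sketch below shows that inversion. The function name is hypothetical, and the round-trip check uses the Table A values for males under 15 years rather than any observed survey moments, which are not given in this section.

```python
def fit_negative_binomial(sample_mean, sample_variance):
    """Method-of-moments estimates (alpha, beta) for the negative binomial.

    Solves  mean = alpha / beta  and  variance = alpha * (1 + beta) / beta**2.
    The fit requires variance > mean; otherwise no positive beta exists.
    """
    if sample_variance <= sample_mean:
        raise ValueError("method of moments needs variance greater than mean")
    beta = sample_mean / (sample_variance - sample_mean)
    alpha = sample_mean * beta
    return alpha, beta


# Round trip: start from Table A parameters, compute the implied moments,
# and recover the parameters from those moments.
alpha0, beta0 = 0.3097, 4.9090
implied_mean = alpha0 / beta0
implied_variance = alpha0 * (1.0 + beta0) / beta0 ** 2
print(fit_negative_binomial(implied_mean, implied_variance))
```

The guard on variance reflects that a negative binomial is always overdispersed relative to a Poisson distribution with the same mean.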
In order to obtain some insight into an appropriate theoretical distribution for duration of stay, the conditional probabilities of discharge on a particular day, given that the individual has been hospitalized up to that day, were computed for the July 1958-June 1960 Health Interview Survey data for grouped periods on an average daily basis. The rise and fall of these conditional probabilities as duration of stay increased was characteristic of the log-normal distribution. Accordingly, this distribution was fitted to the available duration-of-stay data separately within age and sex groups. Since the agreement between these expected and observed proportions was considered satisfactory, the log-normal distribution was adopted as the duration-of-stay model.

Computer Simulation of Hospital Episodes

The stochastic models for hospital admission and duration of stay developed above suggest that hospital episodes for the U.S. civilian, noninstitutional population can be readily simulated on a computer by means of a set of daily (or weekly) transition probabilities for each individual. These probabilities are assumed to remain constant over time for an individual, at least for periods up to 2 years, but to vary from individual to individual.

On a given day, say i, an individual can be in one of S+1 states. These states are:

    H_0 = not in hospital
    H_j = in hospital j days for a particular episode, j = 1, 2, ..., S.

For each state on day i, transition probabilities are specified for the two eligible states for the individual on day i+1. Thus, for individual k in state H_0 on day i:

    P_{0k} = the probability of being hospitalized on day i+1
    1 - P_{0k} = the probability of remaining out of the hospital on day i+1.

Similarly, for individual k in state H_j on day i:

    P_{jk} = the probability of being discharged on day i+1 (i.e., going to state H_0)
    1 - P_{jk} = the probability of remaining in the hospital on day i+1 (i.e., going to state H_{j+1}).
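The states and transition probabilities defined above can be sketched in Python as follows. This is a minimal illustration, not the original IBM 1410 program: it uses a daily admission probability throughout, ignores births, deaths, and the last-year-of-life probabilities, and the log-normal parameters `mu` and `sigma`, the admission probability, and the cap of 60 days are all hypothetical values chosen only to make the example run.

```python
import math
import random


def lognormal_cdf(t, mu, sigma):
    """P(length of stay <= t) under a log-normal duration-of-stay model."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def daily_discharge_probs(mu, sigma, max_days):
    """Conditional discharge probabilities for in-hospital days 1..max_days.

    Entry j-1 is the chance that a stay ends within day j, given that it
    has lasted beyond day j-1.
    """
    probs = []
    for j in range(1, max_days + 1):
        still_in_before = 1.0 - lognormal_cdf(j - 1, mu, sigma)
        still_in_after = 1.0 - lognormal_cdf(j, mu, sigma)
        probs.append((still_in_before - still_in_after) / still_in_before)
    probs[-1] = 1.0  # force discharge in the last state so S stays finite
    return probs


def simulate_history(p_admit, discharge_probs, n_days, rng):
    """Generate (admission_day, discharge_day) pairs for one individual.

    One uniform random number is drawn per day and compared with the
    transition probability of the current state. An episode still in
    progress on the final day is not recorded.
    """
    episodes = []
    days_in = 0          # 0 means the not-in-hospital state
    admitted_on = None
    for day in range(1, n_days + 1):
        r = rng.random()
        if days_in == 0:
            if r < p_admit:
                days_in, admitted_on = 1, day
        elif r < discharge_probs[days_in - 1]:
            episodes.append((admitted_on, day))
            days_in = 0
        else:
            days_in += 1
    return episodes


rng = random.Random(1966)  # fixed seed so the run is repeatable
probs = daily_discharge_probs(mu=1.5, sigma=1.0, max_days=60)
print(simulate_history(p_admit=0.01, discharge_probs=probs, n_days=756, rng=rng))
```

With these log-normal parameters the conditional discharge probability first rises and then falls with days already spent in hospital, which is the pattern the report describes as characteristic of the log-normal distribution.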
In brief, then, by specification of S+1 probabilities (P_{0k} and P_{jk}, j = 1, 2, ..., S) for individual k, a computer can be programmed to generate a hospitalization history for this individual during a designated time period. If the individual is not in the hospital initially, the computer generates a uniform random number R_1 between zero and one to compare with P_{0k}. If R_1 < P_{0k}, individual k is hospitalized on the first day (i.e., transferred from state H_0 to state H_1). The computer then generates a second uniform random number R_2 to compare with P_{1k}. If R_2 < P_{1k}, individual k is discharged on the second day; otherwise individual k remains in the hospital for a second day and a third uniform random number R_3 is generated for comparison with P_{2k}, etc., until discharge occurs. Following discharge, the next uniform random number is again compared with P_{0k}. If the initial random number R_1 > P_{0k}, individual k remains in state H_0 and R_2 is compared with P_{0k}, etc., until hospitalization occurs or the designated time period is exhausted. The computer is programmed to record the day of admission and the day of discharge for each hospital episode generated.

The probability of hospital admission (P_{0k}) was specified on a weekly basis rather than a daily basis, except for individuals in their last year of life. This change was necessary in order to reduce computer time. If an individual was admitted to the hospital in a given week, the computer assigned the specific day of the week, and hence the day of admission, by means of a random sequence.

The weekly admission probabilities were estimated by first fitting a negative binomial distribution to the distribution of the population by number of hospital episodes annually, as observed in the July 1958-June 1960 Health Interview Surveys, for each of 12 age-sex groups. Delivery episodes were excluded from the female age groups.
The α and β parameters estimated in the fitting process for a particular age-sex group (table A) are also, in accordance with the hospital admissions model, the parameters of the gamma distribution of λ (equation 3), where λ is the expected annual number of hospital episodes for a given individual in the group. While it would have been possible to determine a λ for each individual in a group by sampling the appropriate gamma distribution at random, this was not considered necessary. Rather, each of the 12 age-sex groups was divided further into 10 equal subgroups. It was planned initially to assign the first subgroup in each age-sex group a value of λ corresponding to the 5% point of the appropriate gamma distribution, the second a λ corresponding to the 15% point, and so on to the λ corresponding to the 95% point for the 10th subgroup. Since the gamma distributions of interest were highly skewed, the tables of the incomplete gamma-function used to determine these λ values were lacking in some detail. The tables are entered for arguments $u$ and $p$ where $u = \beta\lambda/\sqrt{\alpha}$ and $p = \alpha - 1$. However, the tables did not give values of the argument $u$ below the 40th percentile in all cases of interest and below the 50th percentile in a few cases. Thus, the first four or five subgroups in each age-sex group were assigned λ's corresponding to the interpolated 20th percentile (or 25th percentile) values of $u$. The average value of the assigned λ's in each age-sex group was adjusted to the observed mean of the distribution of hospital episodes annually by adjusting the λ corresponding to the 95% point. The constant weekly admission probability $P_{0k}$, which applied to all individuals in a subgroup, was obtained by dividing each assigned λ by 52. These weekly admission probabilities for the 120 subgroups are given in table B. Each newborn individual was assigned to one of the 10 subgroups in the "under 15 years" age group of the same sex.

Table A. α and β parameters of the negative binomial distributions fitted to the distribution of the population in 12 age-sex groups by number of annual hospital episodes
[See equation 5]

                          Male                 Female
    Age               α         β          α         β
    Under 15 years    0.3097    4.9090     0.2432    4.7083
    15-24 years       0.2369    3.7665     0.1398    1.4480
    25-34 years       0.2824    4.2410     0.2290    1.7924
    35-44 years       0.2834    3.6292     0.3901    3.3889
    45-64 years       0.2833    2.6920     0.3622    3.2396
    65+ years         0.3906    2.6129     0.3569    2.8701

Table B. Estimated weekly hospital admission rates per 1,000 persons not in their last year of life (computer input probabilities x 10^3), excluding deliveries, by age, sex, and 10 percent subgroups, and average weekly and annual hospital admission rates for all subgroups combined

                                            Age groups
    Subgroup                 Under 15   15-24     25-34     35-44     45-64     65+
                             years      years     years     years     years     years
    Male
    1                         0.146     0.233     0.273     0.318     0.430     0.548
    2                         0.146     0.233     0.273     0.318     0.430     0.548
    3                         0.146     0.233     0.273     0.318     0.430     0.548
    4                         0.146     0.233     0.273     0.318     0.430     0.548
    5                         0.219     0.233     0.273     0.318     0.430     0.822
    6                         0.439     0.350     0.409     0.477     0.644     1.370
    7                         0.768     0.700     0.793     0.927     1.250     2.284
    8                         1.382     1.325     1.202     1.404     1.894     3.655
    9                         2.522     2.574     2.788     3.257     4.395     6.076
    10                        5.549¹    6.074     6.144     7.177     9.685    12.336
    Average weekly rate for
      all subgroups combined  1.1463    1.2188    1.2701    1.4832    2.0018    2.8735
    Average annual rate for
      all subgroups combined 59.608    63.378    66.045    77.126   104.094   149.422

    Female
    1                         0.187     0.327     0.344     0.422     0.332     0.375
    2                         0.187     0.327     0.344     0.422     0.332     0.375
    3                         0.187     0.327     0.344     0.422     0.332     0.375
    4                         0.187     0.327     0.344     0.422     0.332     0.375
    5                         0.187     0.327     0.344     0.633     0.499     0.563
    6                         0.280     0.327     0.516     1.055     0.890     1.005
    7                         0.560     0.491     1.238     1.759     1.521     1.729
    8                         1.060     1.325     2.475     2.814     2.564     2.895
    9                         2.061     3.435     5.054     4.678     4.452     5.025
    10                        4.904    11.354    13.709     9.497    10.102    11.404
    Average weekly rate for
      all subgroups combined  0.9800    1.8567    2.4712    2.2124    2.1366    2.4121
    Average annual rate for
      all subgroups combined 50.960    96.548   128.502   115.045   111.103   125.429

    ¹This rate was incorrectly computed. The correct value is 6.227. The error was not discovered until after the computer runs. The expected annual rate for the computer generated episodes would have been raised from 59.6 to 63.1 per 1,000 persons by use of the correct value.

Table C. Estimated daily hospital admission probabilities for persons in their last year of life as a function of time period to death

                          Daily            Persons not
    Period prior          admissions       in hospital      Daily admission
    to death              per 1,000        per 1,000        probabilities¹
                          deaths           deaths
    1 and 2 days           41.8            674.9            0.061935
    2 and 3 days           30.4            702.3            0.043286
    3 and 4 days           31.8            731.1            0.043496
    4 and 5 days           18.8            746.9            0.025171
    5 and 6 days           23.1            767.0            0.030117
    6 and 7 days           27.5            791.5            0.034744
    1 and 2 weeks           7.3            813.5            0.008919
    2 and 3 weeks           6.2            845.9            0.007329
    3 and 4 weeks           7.0            880.2            0.007953
    1 and 2 months          3.1            915.2            0.003387
    2 and 3 months          2.6            949.3            0.002739
    3 and 4 months          1.8            963.9            0.001867
    4 and 5 months          1.1            967.1            0.001137
    5 and 6 months          1.3            977.4            0.001330
    6-12 months             0.65           985.1            0.000660

    ¹Ratio of first to second column.

Table D. Probability of birth occurring in a hospital, by age of mother, 15-44 years

                      Total annual      Annual births
    Age of mother     births per        in hospital       Probability of
                      1,000 females,    per 1,000         delivery in
                      1960              females           hospital
    15-24 years       166.32            135.86            0.816859
    25-34 years       152.86            145.79            0.953749
    35-44 years        36.60             31.40            0.857923

A slightly different model was used to generate the hospital histories of persons in their last year of life. Prior to generating a random number to determine if an individual would be hospitalized in the week of interest, the computer first checked whether or not the individual had entered his last year of life. If so, the computer changed to a set of daily probabilities of being hospitalized which increased gradually as the day of death approached. These probabilities were estimated from data on hospital utilization during selected time periods prior to death reported in the Middle Atlantic States study.¹ First, rough estimates of admission rates per 1,000 deaths and number of persons per 1,000 deaths not in the hospital as a function of the time period prior to death were derived from changes (first differences) in the nights of care rates and from the discharge rates.
The ratio of these two quantities provided the required estimates of daily admission probabilities as a function of days to death. These estimates, shown in table C, were then plotted and the function smoothed graphically. The smoothed function provided 365 admission probabilities, one for each day in the last year of life.

Except for deliveries, reasons for hospitalization were not assigned in the simulation program. Females with delivery dates less than 31 days away from the day of interest were not admitted to hospital during this period. On the assigned delivery dates, the computer determined on a random basis which deliveries were to occur in hospitals. The probability of a delivery taking place in a hospital was estimated for three age groups of mothers by dividing the number of births in hospitals per 1,000 females by the rate for all births. These probabilities are shown in table D.

The log-normal distribution

$$f(t) = \frac{1}{\sigma t \sqrt{2\pi}} \exp\left[-\frac{(\ln t - \mu)^2}{2\sigma^2}\right], \qquad t \ge 0 \qquad (6)$$

was fitted to the observed distribution of length of hospital stay (excluding deliveries) for each of the 12 age-sex groups using unpublished Health Interview Survey data for the period July 1958-June 1960. The parameters, $\mu$ and $\sigma$, in $f(t)$ were estimated from the equations

$$\bar{x} = e^{\mu + \sigma^2/2}, \qquad \frac{s^2}{\bar{x}^2} = e^{\sigma^2} - 1,$$

where $\bar{x}$ and $s^2$ are the mean and variance of the observed duration-of-stay distribution. The conditional probabilities ($P_{tk}$) of discharge on day $t$, given that the individual had been hospitalized for the previous $t-1$ days, were then estimated from the fitted log-normal duration-of-stay distributions. The computer program limited length of stay to a maximum of 100 days so that $P_{Sk}$ was set equal to 0.999999.

Separate sets of discharge probabilities were estimated for females 15-24, 25-34, and 35-44 years of age hospitalized for deliveries.
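The fitting and conversion steps for the nondelivery duration-of-stay model can be sketched in Python as follows. This is a hedged illustration rather than the study's procedure: the function names are hypothetical, and the report does not state how the continuous log-normal was converted to daily probabilities, so the sketch assumes that a stay of t days corresponds to the interval from t-1 to t and computes the conditional probability as [F(t) - F(t-1)] / [1 - F(t-1)], where F is the log-normal distribution function.

```python
import math


def lognormal_params(mean, variance):
    """Moment estimates of mu and sigma from the observed mean and variance."""
    sigma2 = math.log(1.0 + variance / mean ** 2)   # from s^2 / xbar^2 = e^(sigma^2) - 1
    mu = math.log(mean) - sigma2 / 2.0              # from xbar = e^(mu + sigma^2 / 2)
    return mu, math.sqrt(sigma2)


def lognormal_cdf(t, mu, sigma):
    """Log-normal distribution function F(t); zero for t <= 0."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def discharge_probabilities(mu, sigma, max_days=100):
    """Conditional probability of discharge on day t, for t = 1..max_days.

    Assumption: day t covers the interval (t - 1, t] of the fitted distribution.
    The final day's probability is forced to 0.999999, as in the report.
    """
    probs = []
    for t in range(1, max_days + 1):
        survive_prev = 1.0 - lognormal_cdf(t - 1, mu, sigma)
        if survive_prev > 0.0:
            p = (lognormal_cdf(t, mu, sigma) - lognormal_cdf(t - 1, mu, sigma)) / survive_prev
        else:
            p = 1.0
        probs.append(p)
    probs[-1] = 0.999999
    return probs


if __name__ == "__main__":
    # mu = 1.22 and sigma = 1.12 are used here purely as example inputs.
    p = discharge_probabilities(1.22, 1.12)
    print([round(x, 4) for x in p[:10]])
```

Under this assumed discretization the probabilities first rise and then fall with duration, which is the pattern the text attributes to the log-normal model.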
The estimates were derived in the same manner as discussed above, using unpublished length-of-stay data for deliveries obtained from the Health Interview Survey, July 1958-June 1960. Length of stay was limited to a maximum of 21 days for females 15-24 years, 24 days for females 25-34 years, and 30 days for females 35-44 years.

Duration-of-stay distributions were not available for persons in their last year of life. However, average length-of-stay estimates by sex in age classes under 45, 45-64, and 65 years and over were obtained from the study of hospital utilization by decedents in the Middle Atlantic States.¹ The variances of the duration-of-stay distributions for these age-sex classes were imputed by using the relationship observed between $s^2$ and $\bar{x}$ for these distributions among persons not in their last year of life. Thus, estimates of the conditional discharge probabilities were derived as above with length of stay limited to a maximum of 100 days.

The estimates of the parameters $\mu$ and $\sigma$ for the log-normal fit of the duration-of-stay distributions in each of the above cases are given in table E.

The computer operations for generating hospitalization histories for persons not in their last year of life (Phase I) and for persons in their last year of life (Phase II) are given in detail in the Appendix.

The basic computer program, with modifications as discussed below, was carried out for an initial population of 10,000 individuals for 108 weeks or 756 days. This population was distributed by age and sex to represent the U.S. civilian, noninstitutional population.

The initial population was given a dynamic dimension by introducing births and deaths. The births were distributed over a 2-year period according to 1960 monthly birth rates and then assigned specific days within months at random. A total of 237 births (121 male and 116 female) were assigned the first year and 240 (123 male and 117 female) the second year.
Coinciding with the birth dates, deliveries were assigned to females in the 15-24, 25-34, and 35-44 years of age groups. A simple three-digit code was used to record dates on the computer, with the first day of the 108-week period coded 001.

The first 26 days of the hospital episodes simulation program were utilized to establish the appropriate initial distribution of the population over the states $H$ and $H_j$. This was necessary since all individuals were in state $H$ (i.e., not in hospital) on day 001. An alternative procedure would have required assignment of about 22 individuals to the hospital states $H_j$ on day 001. Since the average length of stay in short-term hospitals is approximately 8 days and less than 10 percent of the episodes exceed 15 days, allowing the computer 26 days to establish an equilibrium distribution over the states $H$ and $H_j$ is considered adequate. There were no additions to the population from births assigned prior to day 027. Hospitalization histories for newborn infants were generated by the computer only for the days following birth.

In order to introduce appropriate hospital admission rates for individuals entering their last year of life, death dates were assigned by age and sex covering a 3-year period. A total of 93 deaths were assigned in the first year, 94 in the second, and 89 in the third. As with the birth dates, these were distributed first according to 1960 monthly death rates and then were assigned specific days within months at random. The third year death dates were necessary since individuals scheduled to die in that year enter their last year of life sometime during the second year.

Table E. Estimates of the parameters μ and σ for log-normal distributions fitted to duration-of-stay distributions, by sex and age

    Persons not in their last year of life

                                              Female
                        Male           Deliveries      Deliveries
    Age                                excluded        only
                       μ      σ        μ      σ        μ      σ
    Under 15 years     1.22   1.12     1.16   1.15     ...    ...
    15-24 years        1.51   1.10     1.19   1.01     1.32   0.47
    25-34 years        1.63   1.02     1.46   0.94     1.33   0.53
    35-44 years        1.74   1.00     1.65   0.90     1.37   0.69
    45-64 years        2.08   0.93     1.94   0.9      ...    ...
    65+ years          2.30   0.89     2.33   0.85     ...    ...

    Persons in their last year of life

                        Male           Female
    Age                μ      σ        μ      σ
    Under 45 years     2.21   0.90     1.96   0.93
    45-64 years        2.53   0.82     2.94   0.70
    65+ years          2.64   0.80     2.36   0.84

A four-digit number was used to code the day of death for computer purposes; all individuals not in their last year of life at the end of the second year were assigned 9999 as their day of death. No deaths were assigned prior to day 0027.

INTERVIEW SIMULATION MODEL

A relatively simple model was devised for simulating the responses obtained in interviews with individuals experiencing one or more hospital episodes during the 12 months prior to the date of interview. For each hospital episode, the model simulates on a probability basis failure to report the episode, reported length of stay (if the episode is reported), and reported month of discharge.

Underreporting of Hospital Episodes

The response error study by the Survey Research Center, University of Michigan, reported three major factors related to underreporting of hospital episodes.²
It was found that underreporting increases with increasing time between discharge and interview, decreases with increasing length of stay, and increases for personally embarrassing or threatening types of illness. Only the first two factors are included in the interview simulation model. The Michigan study reported percent underreporting by number of weeks between hospital discharge and interview for three length-of-stay groups.² The Center also had produced, through internal analysis of reported data, rough distributions of underreporting by number of weeks between discharge and interview for four length-of-stay classes. After study of data from these sources, smooth curves were fitted for each of the length-of-stay groups, and estimates of underreporting rates for hospital episodes as a function of the time interval between discharge and interview (in 4-week periods) were obtained for the model. The model treats reporting of each hospital episode as a random event dependent on length of the recall period and length of the hospital stay for the episode.

Table F. Probability of failure to report hospital episodes, by length of stay and number of weeks between discharge and interview, and average probability of failure

                                    Length of stay
    Weeks between discharge      1        2-4       5+
    and interview                day      days      days

    Nondelivery episodes
    1-4 weeks                    0.07     0.04      0.01
    5-8 weeks                    0.13     0.05      0.02
    9-12 weeks                   0.18     0.06      0.04
    13-16 weeks                  0.22     0.07      0.05
    17-20 weeks                  0.24     0.08      0.06
    21-24 weeks                  0.26     0.09      0.07
    25-28 weeks                  0.28     0.11      0.08
    29-32 weeks                  0.29     0.14      0.09
    33-36 weeks                  0.30     0.18      0.09
    37-40 weeks                  0.30     0.22      0.10
    41-44 weeks                  0.31     0.27      0.10
    45-48 weeks                  0.32     0.33      0.11
    49-52 weeks                  0.32     0.39      0.46
    53-56 weeks                  0.32     0.39      0.46
    57-60 weeks                  0.32     0.39      0.46
    Average probability
      of failure                 0.257    0.187     0.147

    Delivery episodes
    1-4 weeks                    0.00     0.00      0.00
    5-8 weeks                    0.00     0.00      0.00
    9-12 weeks                   0.01     0.01      0.00
    13-16 weeks                  0.01     0.01      0.00
    17-20 weeks                  0.02     0.02      0.01
    21-24 weeks                  0.02     0.02      0.01
    25-28 weeks                  0.03     0.03      0.03
    29-32 weeks                  0.03     0.03      0.03
    33-36 weeks                  0.03     0.03      0.03
    37-40 weeks                  0.04     0.04      0.04
    41-44 weeks                  0.05     0.04      0.04
    45-48 weeks                  0.05     0.05      0.05
    49-52 weeks                  0.06     0.05      0.05
    53-56 weeks                  0.07     0.06      0.06
    57-60 weeks                  0.07     0.06      0.06
    Average probability
      of failure                 0.033    0.030     0.027

These estimated underreporting rates were used for nondelivery episodes only. Since the data upon which they were based included all episodes, these estimates are slightly optimistic. The response error study mentioned above found only 3 percent underreporting of deliveries, whereas the average underreporting for all diagnoses was 10 percent. A separate set of underreporting rates, averaging 3 percent, was constructed for delivery episodes. These were also made dependent on length of recall period and length of hospital stay.

The estimated rates of underreporting of nondelivery and delivery episodes were treated as probabilities in the computer simulation. They are shown in table F for 15 four-week periods prior to interview.
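The use of table F as a set of probabilities can be sketched in Python as below. The table values are transcribed from table F; everything else is an assumption made for illustration: the function names are hypothetical, days 1-28 before interview are taken as the first 4-week period (with day 0 treated the same way), and an episode is counted as not recalled when the uniform random number falls below the tabulated probability.

```python
import random

# Rows are the 15 four-week periods before interview; columns are the
# length-of-stay classes 1 day, 2-4 days, 5+ days (values from table F).
NONDELIVERY = [
    (0.07, 0.04, 0.01), (0.13, 0.05, 0.02), (0.18, 0.06, 0.04),
    (0.22, 0.07, 0.05), (0.24, 0.08, 0.06), (0.26, 0.09, 0.07),
    (0.28, 0.11, 0.08), (0.29, 0.14, 0.09), (0.30, 0.18, 0.09),
    (0.30, 0.22, 0.10), (0.31, 0.27, 0.10), (0.32, 0.33, 0.11),
    (0.32, 0.39, 0.46), (0.32, 0.39, 0.46), (0.32, 0.39, 0.46),
]
DELIVERY = [
    (0.00, 0.00, 0.00), (0.00, 0.00, 0.00), (0.01, 0.01, 0.00),
    (0.01, 0.01, 0.00), (0.02, 0.02, 0.01), (0.02, 0.02, 0.01),
    (0.03, 0.03, 0.03), (0.03, 0.03, 0.03), (0.03, 0.03, 0.03),
    (0.04, 0.04, 0.04), (0.05, 0.04, 0.04), (0.05, 0.05, 0.05),
    (0.06, 0.05, 0.05), (0.07, 0.06, 0.06), (0.07, 0.06, 0.06),
]


def failure_probability(days_since_discharge, length_of_stay, delivery=False):
    """Look up the probability of failure to report one episode."""
    if not 0 <= days_since_discharge <= 420:
        raise ValueError("table F covers only 15 four-week periods (420 days)")
    # Assumed mapping: days 1-28 -> first period, 29-56 -> second, and so on.
    period = max(0, (days_since_discharge - 1) // 28)
    if length_of_stay <= 1:
        stay_class = 0
    elif length_of_stay <= 4:
        stay_class = 1
    else:
        stay_class = 2
    table = DELIVERY if delivery else NONDELIVERY
    return table[period][stay_class]


def is_recalled(days_since_discharge, length_of_stay, delivery, rng):
    """Random recall decision: not recalled when the draw is below the probability."""
    return rng.random() >= failure_probability(days_since_discharge, length_of_stay, delivery)


if __name__ == "__main__":
    rng = random.Random(7)
    print(failure_probability(100, 3), is_recalled(100, 3, False, rng))
```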
The last two intervals (53-56 weeks and 57-60 weeks) were included to allow for overreporting of episodes occurring more than 12 months prior to interview. These were included in the model by telescoping forward, again on a probability basis as discussed below, episodes reported by the respondent with actual discharge dates in the 14th or 15th 4-week periods prior to interview. The same underreporting rates were used for these latter two periods as were estimated for weeks 49-52 (the 13th 4-week period).

Length-of-Stay Response Errors

The Michigan study found the average length of stay reported in household interviews to be slightly greater than the average calculated from hospital records.² One explanation given for this is that underreporting is more likely for short-stay episodes than for longer episodes, so that the average of reported episodes has an upward bias. Thus, it is quite possible that duration-of-stay response errors are symmetrically distributed about zero. The model for interview simulation in this study made use of this hypothesis, but also introduced a slight positive shift in the mean of the distribution of reporting errors in length of hospital stay.

The model approximates the distribution of length-of-stay response errors by a normal or Gaussian distribution with a mean error of zero in an expected 95 percent of the responses and a mean error of 2 days in the remaining 5 percent. Thus, the overall distribution of errors is assumed normal with mean equal to 0.05 x 2.0 or 0.1 day. Unit variance was assigned these normal error distributions; this is considered a conservative value for this parameter.

A reported length of stay for a given episode is generated in two steps according to this model. First, a uniform random number between zero and one is compared with 0.05. If it is less than 0.05, 2 days are added to the actual length of stay; otherwise the actual length of stay is left unchanged.
Second, a random normal deviate is generated and added to either the adjusted length of stay or the actual length of stay, depending on the previous comparison of the random number with 0.05. The resulting length of stay in days is accepted as the reported duration of stay.

Month-of-Discharge Response Errors

The first Michigan study found that for 82 percent of the episodes, the respondent correctly reported the month of admission; about 11 percent were reported 1 or more months later than shown in the hospital records, and 7 percent were earlier by 1 or more months.² The later study, comparing three alternative hospitalization survey procedures, showed 14 percent reported the month of discharge later, 9 percent earlier, and 77 percent correctly, using the Health Interview Survey procedure. The month of discharge is calculated by use of the reported admission date and the reported length of hospitalization. The evidence in these two studies indicates a greater tendency to telescope the hospital episode forward rather than backward in time, although the shift is a modest one. The bulk of the inaccurate reports were plus or minus 1 month of the correct month.

The model adopted for simulation of response errors leading to incorrect classification of the month of discharge also approximates errors in the date of admission by a normal distribution. As with the length-of-stay response errors, this distribution is a weighted combination of two normal distributions, the first with mean zero to apply in an expected 95 percent of the episodes and the second with a mean of 10 days applicable to the remaining 5 percent. The overall error distribution has mean equal to 0.05 x 10 or 0.5 days. The variance assigned these distributions depended on the number of weeks between date of interview and date of admission.
This interval was divided into 4-week periods and the assigned standard deviation was set equal to 0.4 times the number of 4-week periods in the interval. Thus, the model permits larger errors in reported date of admission with increasing length of recall period. As with the length-of-stay model, these parameters are considered conservative.

A reported month of discharge for a given episode is generated in three steps. In the first step a uniform random number between zero and one is compared with 0.05. If it is less than 0.05, 10 days are added to the actual admission date; otherwise the actual admission date is left unchanged. In the second step, a random normal deviate is generated and multiplied by a standard deviation σ depending on the number of weeks between the interview date and the date of admission. This product is added to either the adjusted admission date or the actual admission date, depending on the prior comparison of the random number with 0.05. In the third step, the reported length of stay is added to the adjusted admission date obtained in step two to yield the reported discharge date and hence the reported month of discharge.

Computer Simulation of Interviews

The output of each computer-generated hospitalization includes the day admitted, whether the episode was for a delivery or not, and the day discharged. The output also includes the age, sex, and day of death for each individual experiencing one or more episodes during the 108 weeks of interest. These data make up the input for computer simulation of interviews on a specified interview date. The basic steps in the computer program for this simulation are outlined below.

1. The death date for each individual is compared with the interview date to determine if the individual is alive and hence eligible for interview. If the individual has died the computer proceeds to the next individual.

2. If the individual is alive on the interview date, the computer determines whether the admission date for the first episode occurred prior to the interview date. If not, the next episode is examined.

3. If the admission date is earlier than the interview date, the discharge date for the episode is checked to determine if it is a completed episode. If not, the computer records an incomplete episode and proceeds to the next episode.

4. If the episode is completed prior to the interview date, the number of days between interview and discharge is computed to determine if discharge occurred more than 420 days prior. If so, the computer proceeds to the next episode.

5. If the episode is completed less than 420 days prior to the interview date, a uniform random number is generated and compared with the appropriate probability of failure to report the episode (based on the number of weeks between interview and discharge dates, length of stay, and reason for hospitalization as shown in table F). If the generated random number is less than this probability, the episode is recorded as nonrecalled and the computer proceeds to the next episode.

6. If the episode is recalled, a second uniform random number is generated and compared with 0.05. If it is less than 0.05, the computer adds 10 days to the actual admission date and continues. If not, the computer continues.

7. A random normal deviate is generated and multiplied by the appropriate standard deviation σ (based on number of weeks between interview and admission dates). The resulting product is added to the adjusted or actual admission date, whichever is appropriate as per step (6), to obtain the reported admission date of the episode.

8. A third uniform random number is generated and compared with 0.05. If it is less than 0.05, the computer adds 2 days to the actual length of stay for the episode and continues. If not, the computer continues.

9. A second random normal deviate is generated and added to the adjusted or actual length of stay, whichever is appropriate as per step (8), to obtain the reported length of stay.

10. The reported length of stay is added to the reported admission date to determine the reported discharge date.

11. The interval between the interview date and reported discharge date is compared with 364 to determine if the episode is reported with discharge date in the year prior to interview. If so, the computer records the appropriate output data for the reported episode and proceeds to obtain "interview data" for the next episode. If the reported discharge date is more than 364 days prior to the interview date, the computer proceeds to the next episode.

This interview simulation program (Phase III) was carried out for 13 interview dates 28 days apart beginning with day 418. The hospitalization histories for the 1,870 individuals with one or more episodes generated by the hospital simulation program (Phases I and II) over the 108-week period provided the interview simulation input data.

The results of the simulation for each interview date were tabulated by the computer and the following tables printed out.

1. Number of nonrecalled discharges by sex and age in each of 13 four-week periods prior to the interview date.

2. Number of nonrecalled delivery discharges for females by age in each of the 13 four-week periods.

3. Number of incomplete episodes by sex, age, and type of episode (i.e., nondelivery and delivery).

4. Number of reported discharges of 1-day stays by sex and age for the 13 four-week periods.

5. Number of reported discharges of 2-4-day stays by sex and age for the 13 four-week periods.

6. Number of reported discharges of 5-or-more-day stays by sex and age for the 13 four-week periods.

7. Number of reported discharges by sex and age for the 13 four-week periods.

8. Number of reported delivery discharges for females by age for the 13 four-week periods.

9. Number of reported hospital days associated with reported discharges in the 13 four-week periods by sex and age.

10. Number of persons by sex and age and reported number of completed episodes in the year prior to interview.

11. Number of persons by sex and age and reported number of completed nondelivery episodes in the year prior to interview.

12. Number of reported days in hospital in each of 17 four-week periods prior to interview for reported discharges by sex and age.

13. Number of reported days in hospital in each of 17 four-week periods prior to interview for reported delivery discharges for females by age.

The computer print-out of these tables is designated by the heading "interview reported." The computer program also tabulated this same set of tables using actual results for all episodes with discharge in the year prior to interview experienced by the persons alive on the date of interview, that is, with no response errors of any kind. These tables are designated in the computer print-out by the heading "perfect interview."

Finally, the results for persons who died in the year prior to the interview date were tabulated by the computer and added to the "perfect interview" tables. The computer print-out of these tables is designated by the heading "all discharges."

SIMULATION ESTIMATES OF ERRORS IN HOSPITAL DISCHARGE DATA

The computer-generated data for the 13 interview dates were averaged and estimates of annual hospital discharge rates by age and sex derived for the "interview reported," "perfect interview," and "all discharges" data tabulation categories. Similar sets of estimates were also derived for discharge rates excluding deliveries, annual hospital days per 1,000 persons with and without deliveries included, and average length of stay. These estimates are given in tables 1-5. The population bases for these rate estimates are given in table 6.
Estimates of the effects of interview response errors (using data for the full 12 months prior to interview) and of exclusion of persons who died during the reference year on hospital discharge data can be derived from tables 1-5. For example, interview response errors are estimated to reduce the annual discharge rate per 1,000 living persons by 106.0 - 94.0 = 12.0 or 11.3 percent (table 1). In addition, exclusion of persons who died during the reference year reduces the annual discharge rate by an estimated additional 6.6 discharges per 1,000 persons (112.6 - 106.0) or 5.9 percent. The overall annual rate based on the interview procedure is estimated to be less than the actual annual discharge rate by 112.6 - 94.0 = 18.6 per 1,000 persons or 16.5 percent. Similar estimates of effects of procedural errors on hospital discharge data can be determined from the tables for specific age-sex groups.

Although input parameters for this study were based in part on empirical data, the specific output estimates of underreporting should be considered illustrative rather than necessarily a reflection of the situation which prevails in the Health Interview Survey.

Estimates of the percent underreporting of hospital discharges by number of weeks between discharge and interview for all discharges, deliveries only, and discharges excluding deliveries were computed for "interview reported" versus "perfect interview," "perfect interview" versus "all discharges," and "interview reported" versus "all discharges." These estimates are given in tables 7-9. A similar set of percent underreporting estimates was computed for hospital discharges by recall period and actual length of stay and is shown in tables 10-12.

IV. RESULTS

EVALUATION OF HOSPITAL EPISODES SIMULATION

Several aspects of the computer-generated hospital episode data were examined in order to evaluate the accuracy of the simulation.
First, the generated distributions of the persons in each of the 12 age-sex groups by number of annual nondelivery episodes (perfect interview data) were compared with the expected distributions. With but minor exceptions, the computer simulation program generated distributions of the number of nondelivery episodes equivalent to the expected negative binomial distributions. It is noted that, except for females 35-44 years of age, the expected frequencies of two or more episodes were higher than generated. This tendency on the low side could be due to inadequate representation of the upper tail of the gamma distribution of the weekly admission probabilities (i.e., the λ values). It is possible that this aspect could be improved by subdividing the 10th subgroup in order to include λ values corresponding, for example, to the 99th percentile. An alternative explanation of the observed deficiency of persons with two or more episodes is that the uniform random number subroutine, used in the computer program, failed to generate small random numbers in close order proximity as frequently as expected statistically.

The second aspect examined was a comparison of the generated annual discharge rates by age and sex, excluding deliveries, with the expected rates (table G). The sampling errors indicate that the differences in these rates are not statistically significant. The annual discharge rates generated by the computer for males and females 65 years and older are greater than the expected rates shown in table G since they include persons in their last year of life who were alive on the interview date (and hence subject to higher admission rates). The expected rates were not adjusted for the higher admission probabilities assigned to persons in their last year of life.
The Health Interview Survey annual discharge rates, excluding deliveries, reported for the period July 1958-June 1960 are higher than the expected rates for the computer simulation since the published rates are based on data reported for the most recent 6 months of the year prior to interview. On the other hand, the weekly admission probabilities were derived from unpublished Health Interview Survey data on the distribution of the population by number of annual nondelivery episodes based on reported experiences for the 12 months prior to interview.

Table G. Comparison of computer generated and expected number of nondelivery episodes per 1,000 persons per year, simulated population base, and standard deviation of observed rate, by sex and age

Sex and age         Observed    Expected     Simulated    Standard deviation
                     number¹      number    population      of observed rate
                                                  base
Male
Under 15 years         64.9        62.1         1,740                  5.20
15-24 years            58.7        62.2           608                   S77
25-34 years            54.0        64.5           613                  8.87
35-44 years            68.5        75.1           641                  9.44
45-64 years           102.0       100.1           965                  9.04
65+ years             168.1       142.0           345                 18.11

Female
Under 15 years         51.5        50.2         1,675                  4.70
15-24 years            97.8        95.0           683                 10.95
25-34 years           105.8       125.8           669                 12.85
35-44 years           122.3       112.4           695                 11.22
45-64 years           105.0        97.4         1,043                  8.63
65+ years             135.2       119.3           429                 14.92

¹The observed rates are inflated slightly by the experience of persons in their last year of life. These persons are not included in the expected number.

The third aspect examined in evaluating the computer simulation of hospitalization histories was the distribution of persons in the hospital on the interview date by age in comparison with the unpublished Health Interview Survey distribution for the Sunday prior to interview. The data, given in table H, show the two distributions to be in close agreement.

Table H. Number and percent distribution of persons in hospital on day of interview, by age: computer simulation¹ versus Health Interview Survey²

                      Computer simulation        Health Interview Survey
Age                  Number       Percent        Number in       Percent
                             distribution        thousands  distribution

All ages                344         100.0              367         100.0
Under 15 years           43          12.5               48          13.1
15-24 years              37          10.8               42          11.4
25-34 years              40          11.6               43          11.7
35-44 years              42          12.2               54          14.7
45-64 years             110          32.0              106          28.9
65+ years                72          20.9               74          20.2

¹Total of incomplete episodes for 13 interview dates.
²Average number of persons in short-stay hospitals last Sunday night, United States, July 1959-June 1960.

Fourth, the average length of stay in days by sex and age for the computer episodes (perfect interview data) is compared with the July 1958-June 1960 Health Interview Survey results in table J. Agreement, slightly better for females than males, is fairly good. The sample size (episodes) for males 15-24, 25-34, and 35-44 years of age is only about 30 for each of these age classes, accounting in part for the variability observed in their length-of-stay averages.

Table J. Comparison of average length of stay in days, by sex and age: computer generated¹ versus Health Interview Survey²

Sex and age           Computer     Health Interview
                     generated               Survey
                        Length of stay in days
Male
All ages                  10.1                 10.3
Under 15 years             6.0                  6.1
15-24 years                9.6                  8.2
25-34 years               10.7                  9.3
35-44 years                8.4                 11.8
45-64 years               13.3                 12.2
65+ years                 13.7                 15.9

Female
All ages                   6.9                  7.2
Under 15 years             5.6                  5.8
15-24 years                4.4                  4.5
25-34 years                4.6                  5.2
35-44 years                6.6                  6.7
45-64 years               10.9                 11.4
65+ years                 15.4                 14.0

¹Perfect interview data; average of 13 interview dates.
²See table 1, p. 14, in reference 8.

The distribution of the generated lengths of stay has not been tabulated in detail. However, the distribution for 1-day, 2-4-day, and 5-or-more-day stays is available from table 10. This distribution is compared with the distribution of discharge rates for these same length-of-stay groups as derived from unpublished July 1958-June 1960 Health Interview Survey data in table K. Agreement is quite good.

Table K. Comparison of length-of-stay distributions: computer generated discharges¹ versus Health Interview Survey discharges²

                   Computer generated discharges    Health Interview Survey discharges
Length of stay        Number          Percent        Rate per 1,000          Percent
                                 distribution               persons     distribution

Total                1,071.1            100.0                 114.5            100.0
1 day                  131.8             12.3                  12.6             11.0
2-4 days               383.5             35.8                  41.0             35.8
5+ days                555.8             51.9                  60.9             53.2

¹Perfect interview data; average of 13 interview dates.
²Unpublished data, July 1958-June 1960.

It seems clear from the above analysis that the hospital episodes simulation model and computer program are quite satisfactory. Further improvements, one of which has already been mentioned, are possible. It would be desirable that the various hospitalization statistics within age-sex groups generated by the computer have greater reliability than can be obtained with a population run of 10,000. The computer program should also be revised to permit individuals to shift over time from their initial age group to the next higher age group. This is particularly important for the two older age groups, as will be made clear from results discussed in later sections. For example, under the present program when 2-year histories are generated, the number of persons 65 years and older for the second year is reduced significantly due to deaths during the first year. The assignment of reasons for hospitalization within age-sex groups can be added to the computer program with relatively little difficulty. Length-of-stay distributions for each reason or condition would be more realistic if this change were made in the program.

EVALUATION OF INTERVIEW SIMULATION

The interview simulation model introduced errors due to failure to report hospital discharges which occurred in the year prior to interview, failure to report discharge dates accurately, and failure to report length of stay accurately. As discussed previously, the parameters for generating these errors were based largely on results obtained in the Michigan study. Percent underreporting of hospital discharges as generated by the computer is compared with the Michigan study data in table L separately by length of stay and by weeks between discharge and interview. As expected, since the assigned probabilities were based on these two factors, the generated results essentially reproduced the Michigan study data.

A more detailed comparison of the computer-generated underreporting rates with the assigned rates jointly by length of stay and interval between discharge and interview is given in table M. As in table L, the generated underreporting rates include the effect of reporting the discharge date inaccurately. Thus, the computer overreported 2-4-day stays and 5-or-more-day stays for the 4-week period immediately prior to interview. The agreement between the observed and expected results in table M is fairly good, but not outstanding.
The total number of episodes for each cell was not large for any one interviewing date, ranging from 10 for the 1-day stays to 30 for the 2-4-day stays and 40 for the 5-or-more-day stays. However, the generated results shown are averages for 13 interviewing dates, and hence are based on fairly substantial numbers of cases. The effect of inaccurately reported discharge dates may be responsible for the several instances of somewhat larger differences than expected.

Table L. Percent underreporting of hospital discharges, by length of stay and number of weeks between discharge and interview: computer generated¹ versus Michigan study²

Length of stay and weeks between        Computer     Michigan
discharge and interview                generated        study

Length of stay
Total                                       11.3         12.9
1 day                                       23.2         26.9
2-4 days                                    11.3         1%.0
5+ days                                      8.5          9.9

Weeks between discharge and interview
Total                                       11.3         12.0
1-20 weeks                                   4.9          5.0
21-40 weeks                                 10.7          9.0
41-52 weeks                                 23.0         24.0

¹Interview reported versus perfect interview; average of 13 interview dates. Includes errors in reported discharge dates.
²See table 15, p. 21, and table 40, p. 36, in reference 2.

The computer simulations of failure to report the discharge date and/or the length of stay accurately have not been evaluated in detail. As discussed in the next section, the net shifting of discharge dates by the computer was essentially negligible. The proportion of discharge dates reported accurately (i.e., within the same 4-week period as the actual discharge date) has not been determined. The average length of stay for the interview reported discharges was 0.3 of a day greater than for the perfect interview discharges, which agrees with the Michigan study.² The distributions of reported length of stay by actual length of stay have not been tabulated, however.

Based on this limited evaluation, the interview simulation program appears to have been fairly successful.
Further analysis is necessary before any suggestions regarding revisions in the model and computer program can be made.

Table M. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between hospital discharge and interview: computer generated¹ versus assigned rates²

                          1-day stay            2-4-days stay           5+-days stay
Weeks between         Computer  Assigned     Computer  Assigned     Computer  Assigned
discharge and        generated      rate    generated      rate    generated      rate
interview

Total                     23.2      24.8         12.3      15.6         5.53       9.8
1-4 weeks                  3.2       7.0         ³0.3       4.0         ³0.3       1.0
5-8 weeks                 16.3      13.0          4.9       5.0          2.3       2.0
9-12 weeks                20.0      18.0          2.8       6.0          1.2       4.0
13-16 weeks               21.6      22.0          3.1       7.0          S21       5.0
17-20 weeks               18.8      24.0         11.4       8.0          5.1       6.0
21-24 weeks               23.5      26.0          4.4       9.0          4.9       7.0
25-28 weeks               22.8      23.0         10.3      11.0          9.1       8.0
29-32 weeks               31.0      29.0          8.1      14.0          9.6       9.0
33-36 weeks               25.5      30.0         13.3      18.0        10. k       9.0
37-40 weeks               32.1      30.0         11.1      22.0          5.0      10.0
41-44 weeks               20.4      31.0         21.7      27.0          6.7      10.0
45-48 weeks                3.5      32.0         26.4      33.0         15.0      11.0
49-52 weeks               32.4      32.0         29.0      39.0         37.2      46.0

¹Interview reported versus perfect interview; average of 13 interview dates. Includes errors in reported discharge dates (see table 10).
²Nondelivery episodes only.
³Percent overreported.

ESTIMATES OF SPECIFIC ERROR COMPONENTS

As mentioned in the introduction, a definite decreasing trend can be observed in the number of discharges reported in the Health Interview Survey when tabulated by month prior to interview. It is of considerable interest to determine the factors contributing to this decay curve and the magnitude of their respective effects.
Accordingly, estimates have been derived of the component parts of the discrepancy between the interview reported discharges and all discharges in 4-week intervals prior to interview, using the computer generated hospital episode and interview simulation data. These estimates are given in absolute numbers of discharges (average of 13 interview dates) and also as a percent of all discharges in each of the 13 four-week periods in the year prior to interview in table N. The average estimates for 12, 24, 36, and 52 weeks prior to interview are also shown in this table.

The observed decay curve is shown in the column headed "interview reported." The discrepancy (i.e., all discharges less interview reported discharges) increases as the interval between discharge and interview increases, as does the number of not reported discharges and also the number of discharges of persons who died in the year prior to interview (all discharges less perfect interview discharges). The error component due to shifting of discharge dates fluctuates from positive (back in time) to negative (forward in time), but remains at a fairly low level; the average of this component is essentially zero for the year prior to interview.

It is clear that the number of discharges of persons who died in the year prior to interview should increase as the interval between discharge and interview increases, since this group is somewhat larger numerically at the beginning of the year of interest and decreases in size as the interview date is approached. This might suggest that the total number of discharges should also increase as the interval between discharge and interview increases. This is incorrect, although the average of the generated "all discharges" over the 13 interview dates does exhibit this incorrect relationship in table N and also in table 8. This error is due to the unfortunate oversight of failing to age the population in the computer simulation program.
Since the living population is aging and also increasing in size during the year and since the number of persons living on the date of interview, but already in their last year of life, is somewhat larger on the date of interview than at the beginning of the reference year, the number of discharges of persons alive on the interview date (perfect interview discharges) should decrease as the time interval between discharge and interview increases. This is the key phenomenon previously stated in the introduction. Hence "all discharges" should either decrease or remain constant as the interval between discharge and interview increases.

Table N. Estimated contribution of error components to discrepancy between interview reported and all discharges, by number of weeks between discharge and interview

[Average of 13 interview dates]

Weeks between           All     Perfect   Interview   Discrepancy:        Not   All discharges   Net shifting
discharge and    discharges   interview    reported       all less   reported     less perfect   of discharge
interview                                                interview                  interview¹          date²
                                                          reported

Number of discharges
1-4 weeks              85.5        82.2        82.2            3.3        1.5              3.3           -1.5
5-8 weeks              86.2        81.8        77.8            8.4        3.8              4.4            0.2
9-12 weeks             86.5        81.7        78.4            8.1        4.2              4.8           -0.9
13-16 weeks            88.6        83.5        78.1           10.5        5.3              5.1            0.1
17-20 weeks            88.1        82.3        74.9           13.2        5.8              5.8            1.6
21-24 weeks            87.6        82.1        76.3           11.3        7.2              5.5           -1.4
25-28 weeks            87.8        82.2        73.0           14.8        8.2              5.6            1.0
29-32 weeks            88.3        82.4        72.8           15.5        9.5              5.9            0.1
33-36 weeks            88.1        82.1        71.2           16.9        9.9              6.0            1.0
37-40 weeks            89.1        82.7        73.8           15.3       10.4              6.4           -1.5
41-44 weeks            89.3        82.9        71.4           17.9       11.3              6.4            0.2
45-48 weeks            89.0        82.5        64.8           24.2       16.2              6.5            1.5
49-52 weeks            89.4        82.7        54.8           34.6       28.5              6.7           -0.6

Average estimate for:
1-12 weeks             86.1        81.9        79.5            6.6        3.2              4.1           -0.7
1-24 weeks             87.1        82.3        78.0            9.1        4.6              4.8           -0.3
1-36 weeks             87.4        82.3        76.1           11.3        6.1              5.2           0.02
1-52 weeks             88.0        82.4        73.0           15.0        9.4              5.6          -0.02

Percent distribution of all discharges
1-4 weeks             100.0        96.1        96.1            3.9        1.8              3.9           -1.8
5-8 weeks             100.0        94.9        90.3            9.7        4.4              5.1            0.2
9-12 weeks            100.0        94.5        90.6            9.4        4.9              5.5           -1.0
13-16 weeks           100.0        94.2        88.1           11.9        6.0              5.8            0.1
17-20 weeks           100.0        93.4        85.0           15.0        6.6              6.6            1.8
21-24 weeks           100.0        93.7        87.1           12.9        8.2              6.3           -1.6
25-28 weeks           100.0        93.6        83.1           16.9        9.3              6.4            1.2
29-32 weeks           100.0        93.3        82.4           17.6       10.8              6.7            0.1
33-36 weeks           100.0        93.2        80.8           19.2       11.2              6.8            1.2
37-40 weeks           100.0        92.8        82.8           17.2       11.7              7.2           -1.7
41-44 weeks           100.0        92.8        80.0           20.0       12.6              7.2            0.2
45-48 weeks           100.0        92.7        72.8           27.2       18.2              7.3            1.7
49-52 weeks           100.0        92.5        61.3           38.7       31.9              7.5           -0.7

Average estimate for:
1-12 weeks            100.0        95.2        92.3            7.7        3.8              4.8           -0.9
1-24 weeks            100.0        94.5        89.5           10.5        5.4              5.5           -0.4
1-36 weeks            100.0        94.1        87.1           12.9        7.0              5.9           0.02
1-52 weeks            100.0        93.7        82.9           17.1       10.7              6.4          -0.02

¹Discharges of persons who died during the year prior to interview.
²A negative value means discharge date shifted forward in time.
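The decomposition in table N rests on a simple identity: the discrepancy (all discharges less interview reported) equals the sum of the not reported discharges, the discharges of persons who died (all discharges less perfect interview), and the net shifting of discharge dates. The sketch below checks this identity for a few rows of table N; the function name and the choice of rows are illustrative only.

```python
# Selected rows of table N (number of discharges, average of 13 interview dates):
# (all discharges, perfect interview, interview reported,
#  not reported, net shifting of discharge date)
TABLE_N = {
    "1-4 weeks": (85.5, 82.2, 82.2, 1.5, -1.5),
    "17-20 weeks": (88.1, 82.3, 74.9, 5.8, 1.6),
    "49-52 weeks": (89.4, 82.7, 54.8, 28.5, -0.6),
    "1-52 weeks (average)": (88.0, 82.4, 73.0, 9.4, -0.02),
}


def decompose(all_discharges, perfect, reported, not_reported, net_shift):
    """Return (discrepancy, deaths component, residual of the identity)."""
    discrepancy = all_discharges - reported
    deaths = all_discharges - perfect
    residual = discrepancy - (not_reported + deaths + net_shift)
    return discrepancy, deaths, residual


results = {weeks: decompose(*row) for weeks, row in TABLE_N.items()}
for weeks, (discrepancy, deaths, residual) in results.items():
    print(f"{weeks:22s} discrepancy={discrepancy:5.1f} "
          f"deaths={deaths:4.1f} residual={residual:+.2f}")
```

Residuals of a few hundredths reflect rounding of the published figures to one decimal place rather than a failure of the identity.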
The computer incorrectly generated a relatively constant monthly number of discharges during the reference year for persons alive on the interview date (perfect interview discharges), at least on the average for the 13 interview dates (see table N), because persons 65 years and older who died were not replaced by new persons from the 45-64 year age group. This reduced the 65 years and over age group over time. The number of discharges of living persons was reduced from 1,088 in the year prior to the first interview date to 1,049 in the year prior to the last interview date. Similarly, the number of all discharges was reduced from 1,162 in the year prior to the first interview date to 1,111 in the year prior to the last interview date. Without these decreases (which should not have occurred) the total number of discharges by weeks between discharge and interview would have remained approximately constant and the number of discharges among persons living on the date of interview would have decreased with increasing time interval between discharge and interview.

While the average levels shown in table N (and in table 8) for all discharges, perfect interview discharges, and interview reported discharges are not correct as to level, the estimates of the error components and of the discrepancy itself are considered satisfactory. This should be clear, since the weaknesses in the generation model tend to be compensating when the discrepancy and its components are computed.

Table N shows the underestimate of all discharges from an interview procedure using data reported for the entire reference year to be 17.1 percent. If only the data reported for the 24 weeks (approximately 6 months) immediately prior to interview are used, the underestimate of all discharges is reduced to 10.5 percent. The major source of this reduction is the not reported error component, which is cut in half (5.4 versus 10.7 percent).
It is of interest to note that, even if no response errors were made, the number of reported discharges in the interview is estimated to be lower than all discharges by approximately 4 percent if reporting is confined to the 4 weeks immediately prior to interview and by 6.4 percent when reporting for the year prior to interview.

METHODS FOR INCREASING ACCURACY

Inspection of tables 1-4 shows that the average annual hospital discharges and hospital days for persons alive on the interview date within each age-sex group are underestimated by approximately 11 percent when a procedure using all data reported for the 12 months prior to interview is employed. The estimates are improved when they are based only on the episodes with reported discharge dates occurring in the most recent 6 months prior to interview. The generated data have not been tabulated on this basis, so the improvement for each of the age-sex groups has not been ascertained. However, the average underestimate is reduced by a factor of two, approximately, with this procedure. It is doubtful that basing the estimates of interest only on hospitalizations reported within a shorter time interval than 6 months between interview and discharge would be economically efficient.

Apparently it is possible to further increase accuracy by use of Procedure B as reported in the study by the University of Michigan in which three alternative survey procedures were compared.³ The relative biases in the average annual number of discharges and hospital days by age and sex with this procedure can be estimated by means of the interview simulation program on the computer. The program would require a set of parameters (i.e., probabilities of failure to report the episode, etc.) appropriate to Procedure B. Apparently, the data for estimating these parameters are available from the study which compared Procedure B with the standard procedure used in this project.
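The idea of rerunning the interview simulation with a different parameter set can be sketched as follows. This is a minimal illustration, not the original program: the probabilities of failing to report an episode are hypothetical placeholders (they are neither the rates assigned in this study nor estimates for Procedure B), the episode file is randomly generated, and errors in reported discharge date and length of stay are omitted.

```python
import random


def make_parameters(base_by_stay, increase_per_period):
    """Hypothetical failure-to-report probabilities by length-of-stay class
    and 4-week period since discharge (period 0 = most recent 4 weeks)."""
    return {
        stay: [min(1.0, base + increase_per_period * period) for period in range(13)]
        for stay, base in base_by_stay.items()
    }


# Placeholder parameter sets; the labels are illustrative only.
STANDARD = make_parameters({"1 day": 0.07, "2-4 days": 0.04, "5+ days": 0.01}, 0.02)
ALTERNATIVE = make_parameters({"1 day": 0.04, "2-4 days": 0.02, "5+ days": 0.01}, 0.01)


def stay_class(days):
    if days <= 1:
        return "1 day"
    return "2-4 days" if days <= 4 else "5+ days"


def simulate_interview(episodes, parameters, rng):
    """Return the episodes that the simulated respondent reports."""
    reported = []
    for weeks_since_discharge, days in episodes:
        period = min((weeks_since_discharge - 1) // 4, 12)
        p_fail = parameters[stay_class(days)][period]
        if rng.random() >= p_fail:
            reported.append((weeks_since_discharge, days))
    return reported


def percent_underreported(episodes, parameters, seed=0):
    reported = simulate_interview(episodes, parameters, random.Random(seed))
    return 100.0 * (len(episodes) - len(reported)) / len(episodes)


# Hypothetical "perfect interview" file: (weeks since discharge, days of stay).
source = random.Random(42)
episodes = [(source.randint(1, 52), source.choice([1, 3, 8])) for _ in range(20_000)]

standard_pct = percent_underreported(episodes, STANDARD)
alternative_pct = percent_underreported(episodes, ALTERNATIVE)
print(round(standard_pct, 1), round(alternative_pct, 1))
```

Because the same perfect interview episodes are fed to both parameter sets, the difference between the two percentages isolates the effect of the interview procedure, which is the comparison the text proposes for Procedure B.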
Further improvement in the accuracy of the hospital statistics based on the Health Interview Survey through changes in the interview procedure is doubtful. A method of adjusting the survey statistics is necessary. One such method, discussed briefly in the introductory section, uses the J-analysis technique of Simmons and Bryant to derive inflation factors by which reported hospital discharges are weighted to estimate total actual discharges, including those of persons not alive on the interview date. Because of limited time, evaluation of the Simmons and Bryant approach by means of the generated data was not carried out.

Estimation of inflation factors to improve the accuracy of published hospital statistics based on the Health Interview Survey appears both feasible and desirable. Using the observed data to derive the adjustment factors has considerable appeal. It seems advisable to explore alternative methods of estimating adjustment factors using simulation models.

V. CONCLUSIONS

A probability model for generating hospital admissions and duration of stay for the U.S. population, together with an IBM 1410 computer program for simulation of hospitalization histories under the model, were developed in this project. The simulation program was carried out for an initial population of 10,000 individuals for a period of 108 weeks; while the results were judged very satisfactory, there is room for improvement in several aspects. These are:

Estimation of weekly admission probabilities should, at the very minimum, be based on data obtained in the Health Interview Survey for the most recent 6 months prior to interview. These probabilities should be improved further by appropriate adjustment of the observed episodes distributions to reflect all hospitalizations rather than reported hospitalizations.

The estimated daily admission probabilities for persons in their last year of life were based on sketchy data and should be improved, using data obtained from a national study.

The simulation program should permit individuals in specific age-sex groups to shift to the next older group over time. This is particularly essential for the 45-64 and 65 years and over age groups, since deaths reduce these groups significantly over time if the population is age-static. This could be accomplished, with relatively little change in the existing program, by adding an age-shifting date to be treated in a manner similar to the birth and death dates already in the program.

Reasons for hospitalization should be included in the program, to be assigned on a probability basis, provided sufficient data are available for developing length-of-stay distributions by reason.

A probability model and computer program for simulating interview data on hospital episodes as collected in the Health Interview Survey were also developed in this project. The computer program was carried out for 13 interview dates 28 days apart using the data generated by the hospital episodes simulation program as input. The generated interview data were also judged satisfactory, providing estimates of the relative biases due to measurement errors for each of the principal hospitalization statistics obtained in the Health Interview Survey. It is noted that the estimated relative biases are fairly substantial.

The interview simulation model was not analyzed intensively, due to limited time available to complete this project. The parameters associated with errors in reporting length of stay and discharge date are considered conservative. Further study and analysis are necessary before any suggestions on revisions in the model and computer program can be made.
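The age-shifting date proposed in the list of improvements above can be sketched as one more event date carried on each simulated person, alongside the death date. The record layout, names, and week-based time unit below are assumptions made for illustration; the layout used in the IBM 1410 program is not described here. A single shift per person is assumed, which suffices for a 108-week run because every age group spans at least 10 years.

```python
from dataclasses import dataclass
from typing import Optional

# Lower age bound, in years, of each of the six age groups used in the tables.
AGE_GROUP_BOUNDS = [0, 15, 25, 35, 45, 65]
AGE_GROUP_LABELS = ["Under 15", "15-24", "25-34", "35-44", "45-64", "65+"]
WEEKS_PER_YEAR = 52


@dataclass
class Person:
    age_group: int                    # index into AGE_GROUP_LABELS at week 0
    shift_week: Optional[int]         # week of move to next group, None if none
    death_week: Optional[int] = None  # week of death, None if alive throughout

    def group_at(self, week):
        """Age group index at the given week, or None once the person has died."""
        if self.death_week is not None and week >= self.death_week:
            return None
        if self.shift_week is not None and week >= self.shift_week:
            return min(self.age_group + 1, len(AGE_GROUP_LABELS) - 1)
        return self.age_group


def shift_week_from_age(age_in_weeks, age_group):
    """Week, counted from the start of the run, at which the next group is reached."""
    if age_group == len(AGE_GROUP_BOUNDS) - 1:
        return None  # oldest group: there is no older group to enter
    return AGE_GROUP_BOUNDS[age_group + 1] * WEEKS_PER_YEAR - age_in_weeks


# A person aged 64 years 40 weeks at the start enters the 65+ group in week 12.
person = Person(age_group=4, shift_week=shift_week_from_age(64 * 52 + 40, 4))
print(AGE_GROUP_LABELS[person.group_at(11)], AGE_GROUP_LABELS[person.group_at(12)])
```

With such a date in place, persons leaving the 45-64 group would replenish the 65 years and over group as deaths deplete it.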
It is doubtful that further significant reductions in the measurement errors of hospitalization data collected in the Health Interview Survey are possible without adding unduly to the cost. The survey design suggests that satisfactory adjustment factors can be estimated from the collected data. The simulation models and computer programs developed in this project provide a useful research tool for studying alternative methods of adjustment.

The computer program for generating hospitalization histories is essentially a program for distributing episodes in the population consistent with the negative binomial distribution. Hence, it should be useful, with but minor revisions, for simulating the distributions of other events which have been observed to be negative binomial. These include, for example, the distribution of the population by number of colds annually and by number of doctor visits annually. Undoubtedly there are other health variables in this class.

The hospital episodes computer program, revised as suggested, should also be useful for studies of the effects on the demand for hospital beds of trends in such variables as age, sex, reasons for hospitalization, and duration of stay.

REFERENCES

¹National Center for Health Statistics: Hospital utilization in the last year of life. Vital and Health Statistics. PHS Pub. No. 1000-Series 2-No. 10. Public Health Service. Washington. U.S. Government Printing Office, July 1965.

²National Center for Health Statistics: Reporting of hospitalization in the Health Interview Survey. Vital and Health Statistics. PHS Pub. No. 1000-Series 2-No. 6. Public Health Service. Washington. U.S. Government Printing Office, July 1965.

³National Center for Health Statistics: Comparison of hospitalization reporting in three survey procedures. Vital and Health Statistics. PHS Pub. No. 1000-Series 2-No. 8. Public Health Service. Washington. U.S. Government Printing Office, July 1965.

⁴U.S. National Health Survey: The statistical design of the Health Household-Interview Survey. Health Statistics. PHS Pub. No. 584-A2. Public Health Service. Washington. U.S. Government Printing Office, July 1958.

⁵Simmons, Walt R., and Bryant, E. E.: An evaluation of hospitalization data from the Health Interview Survey. Am.J.Pub.Health 52(10):1638-1647, Oct. 1962.

⁶National Center for Health Statistics: An index of health, mathematical models. Vital and Health Statistics. PHS Pub. No. 1000-Series 2-No. 5. Public Health Service. Washington. U.S. Government Printing Office, May 1965.

⁷Pearson, K., ed.: Tables of the Incomplete Γ-Function. Cambridge, England. Cambridge University Press, 1957 printing of original 1922 edition.

⁸U.S. National Health Survey: Hospital discharges, United States, 1958-1960. Health Statistics. PHS Pub. No. 584-B32. Public Health Service. Washington. U.S. Government Printing Office, Apr. 1962.

DETAILED TABLES

Table 1. Average annual number, number per 1,000 persons, and percent distribution of patients discharged in year prior to interview for each of three types of simulation, by sex and age
Table 2. Average annual number, number per 1,000 persons, and percent distribution of patients discharged in year prior to interview, excluding deliveries, for each of three types of simulation, by sex and age
Table 3. Average annual number, days per 1,000 persons, and percent distribution of hospital days in year prior to interview, for each of three types of simulation, by sex and age
Table 4. Average annual number, days per 1,000 persons, and percent distribution of hospital days in year prior to interview, excluding deliveries, for three types of simulation, by sex and age
Table 5. Average length of stay in days for each of three types of simulation, by sex and age
Table 6. Population changes during year prior to interview and population bases used in obtaining rates
Table 7. Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: interview reported versus perfect interview
Table 8. Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: perfect interview versus all discharges
Table 9. Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: interview reported versus all discharges
Table 10. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: interview reported versus perfect interview
Table 11. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: perfect interview versus all discharges
Table 12. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: interview reported versus all discharges

Table 1.
Average annual number, number per 1,000 persons, and percent distribution of patients discharged in year prior to interview for each of three types of simulation, by sex and age [Average of 13 interview dates] For living persons Interview reported Perfect interview All discharges discharges discharges Sex and age Nusoee Percent Nuee Percent Pubes Percent Number P distri- | Number P distri- | Number P distri- 1,000 bution 1,000 bution 1,000 bution persons persons persons Both sexes All ages-- 949.6 94.0 100.0; 1,071.0 106.0 100.0 1,143.4 112.6 100.0 Under 15 years-- 167.0 48.9 17.6 199.2 58.3 18.6 202.8 59.3 17.7 15-24 years----- 176.8 136.9 18.6 196.6 152.3 18.4 196.8 152.3 17.2 25-34 years----- 186.2 145.2 19.6 201.4 157.1 18.8 201.9 157.4 17.7 35-44 years----- 133.0 99.6 14.0 149.9 112.2 14.0 157.6 117.7 13.8 45-64 years=---- 183.4 91.3 19.3 207.9 103.5 19.4 221.1 109.5 19.3 65+ years------- 103.2 133.3 10.9 116.0 149.9 10.8 163.2 203.5 14.3 Male All ages-- 332.9 67.8 100.0 382.1 77.8 100.0 421.7 85.4 100.0 Under 15 years-- 95.5 54.9 28.7 113.0 64.9 29.6 116.3 66.8 27.6 15-24 years----- 29.3 48.5 8.9 35.7 58.7 9.3 35.7 58.6 8.5 25-34 years----- 30.2 49.3 9.1 33.1 54.0 8.7 33.1 53.9 7.8 35-44 years----- 39.2 61.2 11.3 43.9 68.5 11.5 47.3 73.7 11.2 45-64 years=---- 87.3 20.5 26.2 98.4 102.0 25,8 104.9 107.9 24.9 65+ years--=-=-=--- 51.2 148.4 15.3 58.0 168.1 15.1 84.4 235.1 20.0 Female All ages-- 616.7 118.7 100.0 688.9 132.6 100.0 721.7 138.4 100.0 Under 15 years-- 71.5 42.7 11.6 86.2 51.3 12.5 86.5 51.6 12.0 15-24 years=-=--- 147.3 215.7 23.9 160.9 235.6 23.4 161.1 235.9 22.3 25-34 years~---- 156.0 233,2 25.3 168.3 251.6 24.4 168.8 251.9 23.4 35-44 years----- 93.8 135.0 15.2 106.0 1532.5 15.4 110.3 158.5 15.3 45-64 years----- 96.1 92.1 15.6 109.5 105.0 15.9 116.2 111.0 16.1 65+ years------- 52.0 121.2 8.4 58.0 135.2 8.4 78.8 177.9 10.9 26 Table 2. 
Average annual number, number per 1,000 persons, and percent distribution of patients discharged in year prior to interview, excluding deliveries, for each of three types of simula- tion, by sex and age [Average of 13 interview dates] For living persons Interview reported Perfect interview All discharges discharges discharges Sex and age Number Percent Funboe Percent uvose Percent Number PeX | distri-| Number pz distri- | Number P distri- 1,000 bution 1,000 bution 1,000 bution persons persons persons Both sexes Excluding deliveries All ages-- 741.4 73.4 100.0 858.4 84.9 100.0 930.0 91.6 100.0 Under 15 years-- 167.0 48.9 22.5 199.2 58.3 23.2 202.8 59.3 21.8 15-24 years----- 84.9 65.8 11.5 102.5 79.4 11.9 102.6 79.4 11.0 25-34 years=====- 90.1 70.3 12,2 103.9 81.0 12.1 104.5 81.4 11,2 35-44 years-~=== 112.5% 84.4 15.2 128.9 96.5 15,0 135.8 101.4 14.6 45-64 years====-- 183.4 91.3 24,7 207.9 103.53 24,2 221.1 109.5 23.8 65+ yearg=====-- 103.2 133.3 13.9 116.0 149.9 13.6 163,2 203.5 17.6 Male All ages-- 332.9 67.8 100.0 332.1 77.8 100.0 421.7 85.4 100.0 Under 15 years-- 95.5 54.9 25.7 113.0 64.9 29.6 116.3 66.8 27.6 15-24 years=----- 29.5 48.5 8.9 35.7 58.7 92.3 35.7 58.6 8.5 25-34 years=---- 30.2 49.3 9.1 33.1 54,0 87 33.1 53.9 7.8 35-44 years===--=- 39.2 61,2 11.3 43.9 68.5 11.5 47.3 73.7 11.2 45-64 years----- 87.3 90.5 26.2 98.4 102.0 25.53 104.9 107.9 24.9 65+ years======= 51.2 148.4 15.3 58.0 168.1 15,1 84.4 235.1 20.0 Female All ages-- 408,5 © 78.6 100,0 476,3 91.7 100,0 508.3 97.5 100.0 Under 15 years-- 71.5 42.7 17.5 86.2 51.5 18.1 86.5 51.6 17.0 15-24 years=--=-- 55.4 81.1 13.6 66.8 97.8 14.0 66.9 98.0 13.2 25-34 years----=- 59.9 89.5 14.7 70.8 105.8 14,9 71.4 106.6 14.0 35-44 years=-=--- 73.6 105.9 18.0 85,0 122.3 17.3 88.5 127.2 17.4 45-64 years----- 96.1 92.1 23.5 109.5 105.0 23.0 116.2 111.0 22,9 65+ years====---- 52.0 121.2 12.7 58.0 135.2 12.2 78.8 177.9 13.5 27 Table 3. 
Average annual number, days per 1,000 persons, and percent distribution of hospital days in year prior to interview for each of three types of simulation, by sex and age

[Average of 13 interview dates. Hospital days]

                --------------------- For living persons ---------------------
                      Interview reported            Perfect interview               All discharges
Sex and age           Days Per 1,000   Percent      Days Per 1,000   Percent      Days Per 1,000   Percent

Both sexes
All ages           7,917.1     783.3     100.0   8,604.6     851.4     100.0   9,303.4     916.1     100.0
Under 15 years     1,066.3     312.1      13.5   1,164.4     340.9      13.5   1,186.2     346.8      12.8
15-24 years          992.6     768.9      12.5   1,057.1     818.8      12.4   1,057.6     818.6      11.3
25-34 years        1,082.0     844.0      13.7   1,133.9     884.5      13.2   1,135.2     884.8      12.2
35-44 years          988.9     740.2      12.5   1,068.1     799.5      12.4   1,105.6     825.6      11.9
45-64 years        2,271.2   1,131.1      28.7   2,496.1   1,243.1      29.0   2,678.3   1,326.5      28.8
65+ years          1,516.1   1,958.8      19.1   1,685.0   2,177.0      19.5   2,140.5   2,669.0      23.0

Male
All ages           3,497.4     711.9     100.0   3,844.8     782.6     100.0   4,238.2     857.9     100.0
Under 15 years       622.5     357.8      17.8     681.4     391.6      17.7     702.2     403.1      16.6
15-24 years          303.7     499.5       8.7     342.9     564.0       8.9     342.9     563.1       8.1
25-34 years          335.2     546.8       9.6     352.7     575.4       9.2     352.7     574.4       8.3
35-44 years          337.0     525.7       9.6     368.1     574.3       9.6     392.7     611.7       9.3
45-64 years        1,186.5   1,229.5      33.9   1,305.2   1,352.5      34.0   1,360.2   1,399.4      32.1
65+ years            712.5   2,065.2      20.4     794.5   2,302.9      20.6   1,087.5   3,029.2      25.6

Female
All ages           4,419.7     850.9     100.0   4,759.8     916.4     100.0   5,065.2     971.1     100.0
Under 15 years       443.8     265.0      10.1     483.0     288.4      10.2     484.0     288.6       9.6
15-24 years          688.9   1,008.6      15.6     714.2   1,045.7      15.0     714.7   1,046.4      14.1
25-34 years          746.8   1,116.3      16.9     781.2   1,167.7      16.4     782.5   1,167.9      15.4
35-44 years          651.9     938.0      14.7     700.0   1,007.2      14.7     712.9   1,024.3      14.1
45-64 years        1,084.7   1,040.0      24.5   1,190.9   1,141.8      25.0   1,318.1   1,258.9      26.0
65+ years            803.6   1,873.2      18.2     890.5   2,075.8      18.7   1,053.0   2,377.0      20.8

Table 4. Average annual number, days per 1,000 persons, and percent distribution of hospital days in year prior to interview, excluding deliveries, for three types of simulation, by sex and age

[Average of 13 interview dates. Hospital days excluding deliveries]

                --------------------- For living persons ---------------------
                      Interview reported            Perfect interview               All discharges
Sex and age           Days Per 1,000   Percent      Days Per 1,000   Percent      Days Per 1,000   Percent

Both sexes
All ages           7,042.0     696.7     100.0   7,740.9     765.9     100.0   8,439.7     831.1     100.0
Under 15 years     1,066.3     312.1      15.1   1,164.4     340.9      15.0   1,186.2     346.8      14.1
15-24 years          618.5     479.1       8.8     688.9     533.6       8.9     689.4     533.6       8.2
25-34 years          698.7     545.0       9.9     756.4     590.0       9.8     757.7     590.6       9.0
35-44 years          871.2     652.1      12.4     950.1     711.2      12.3     987.6     737.6      11.7
45-64 years        2,271.2   1,131.1      32.3   2,496.1   1,243.1      32.2   2,678.3   1,326.5      31.7
65+ years          1,516.1   1,958.8      21.5   1,685.0   2,177.0      21.8   2,140.5   2,669.0      25.3

Male
All ages           3,497.4     711.9     100.0   3,844.8     782.6     100.0   4,238.2     857.9     100.0
Under 15 years       622.5     357.8      17.8     681.4     391.6      17.7     702.2     403.1      16.6
15-24 years          303.7     499.5       8.7     342.9     564.0       8.9     342.9     563.1       8.1
25-34 years          335.2     546.8       9.6     352.7     575.4       9.2     352.7     574.4       8.3
35-44 years          337.0     525.7       9.6     368.1     574.3       9.6     392.7     611.7       9.3
45-64 years        1,186.5   1,229.5      33.9   1,305.2   1,352.5      34.0   1,360.2   1,399.4      32.1
65+ years            712.5   2,065.2      20.4     794.5   2,302.9      20.6   1,087.5   3,029.2      25.6

Female
All ages           3,544.6     682.4     100.0   3,896.1     750.1     100.0   4,201.5     805.5     100.0
Under 15 years       443.8     265.0      12.5     483.0     288.4      12.4     484.0     288.6      11.5
15-24 years          314.8     460.9       8.9     346.0     506.6       8.9     346.5     507.3       8.2
25-34 years          363.5     543.3      10.3     403.7     603.4      10.4     405.0     604.5       9.6
35-44 years          534.2     768.6      15.1     582.0     837.4      14.9     594.9     854.7      14.2
45-64 years        1,084.7   1,040.0      30.6   1,190.9   1,141.8      30.6   1,318.1   1,258.9      31.4
65+ years            803.6   1,873.2      22.6     890.5   2,075.8      22.8   1,053.0   2,377.0      25.1

Table 5. Average length of stay in days for each of three types of simulation, by sex and age

[Average of 13 interview dates]

                --------------------- For living persons ---------------------
                      Interview reported            Perfect interview               All discharges
                    Number    Number   Average    Number    Number   Average    Number    Number   Average
                   of days   of dis-    length   of days   of dis-    length   of days   of dis-    length
Sex and age                  charges   of stay             charges   of stay             charges   of stay

Both sexes
All ages           7,917.1     949.6       8.3   8,604.6   1,071.0       8.0   9,303.4   1,143.4       8.1
Under 15 years     1,066.3     167.0       6.4   1,164.4     199.2       5.8   1,186.2     202.8       5.8
15-24 years          992.6     176.8       5.6   1,057.1     196.6       5.4   1,057.6     196.8       5.4
25-34 years        1,082.0     186.2       5.8   1,133.9     201.4       5.6   1,135.2     201.9       5.6
35-44 years          988.9     133.0       7.4   1,068.1     149.9       7.1   1,105.6     157.6       7.0
45-64 years        2,271.2     183.4      12.4   2,496.1     207.9      12.0   2,678.3     221.1      12.1
65+ years          1,516.1     103.2      14.7   1,685.0     116.0      14.5   2,140.5     163.2      13.1

Male
All ages           3,497.4     332.9      10.5   3,844.8     382.1      10.1   4,238.2     421.7      10.1
Under 15 years       622.5      95.5       6.5     681.4     113.0       6.0     702.2     116.3       6.0
15-24 years          303.7      29.5      10.3     342.9      35.7       9.6     342.9      35.7       9.6
25-34 years          335.2      30.2      11.1     352.7      33.1      10.7     352.7      33.1      10.7
35-44 years          337.0      39.2       8.6     368.1      43.9       8.4     392.7      47.3       8.3
45-64 years        1,186.5      87.3      13.6   1,305.2      98.4      13.3   1,360.2     104.9      13.0
65+ years            712.5      51.2      13.9     794.5      58.0      13.7   1,087.5      84.4      12.9

Female
All ages           4,419.7     616.7       7.2   4,759.8     688.9       6.9   5,065.2     721.7       7.0
Under 15 years       443.8      71.5       6.2     483.0      86.2       5.6     484.0      86.5       5.6
15-24 years          688.9     147.3       4.7     714.2     160.9       4.4     714.7     161.1       4.4
25-34 years          746.8     156.0       4.8     781.2     168.3       4.6     782.5     168.8       4.6
35-44 years          651.9      93.8       6.9     700.0     106.0       6.6     712.9     110.3       6.5
45-64 years        1,084.7      96.1      11.3   1,190.9     109.5      10.9   1,318.1     116.2      11.3
65+ years            803.6      52.0      15.5     890.5      58.0      15.4   1,053.0      78.8      13.4

Table 6. Population changes during year prior to interview and population bases used in obtaining rates

[Average of 13 interview dates]

                              Births    Deaths                                --------- Rate bases ---------
                   Initial  prior to  prior to    Births    Deaths            Interview   Perfect       All
                   popula-  first day first day   during    during             reported    inter-      dis-
Sex and age         tion¹   of year   of year      year      year   Persons                  view   charges

Both sexes
All ages            10,000     144.5      58.6     235.6      96.4    10,225    10,107    10,107    10,155
Under 15 years       3,167     144.5       5.4     235.6       7.9     3,534     3,416     3,416     3,420
15-24 years          1,293       ...       0.3       ...       2.0     1,291     1,291     1,291     1,292
25-34 years          1,286       ...       1.4       ...       2.2     1,282     1,282     1,282     1,283
35-44 years          1,343       ...       1.5       ...       5.3     1,336     1,336     1,336     1,339
45-64 years          2,045       ...      14.9       ...      22.6     2,008     2,008     2,008     2,019
65+ years              866       ...      35.1       ...      56.4       774       774       774       802

Male
All ages             4,866      74.0      32.8     119.5      53.3     4,973     4,913     4,913     4,940
Under 15 years       1,615      74.0       3.6     119.5       4.5     1,800     1,740     1,740     1,742
15-24 years            610       ...       0.2       ...       1.8       608       608       608       609
25-34 years            615       ...       0.6       ...       1.0       613       613       613       614
35-44 years            645       ...       1.0       ...       2.6       641       641       641       642
45-64 years            989       ...       9.4       ...      14.7       965       965       965       972
65+ years              392       ...      18.0       ...      28.7       345       345       345       359

Female
All ages             5,134      70.5      25.8     116.1      43.1     5,252     5,194     5,194     5,219
Under 15 years       1,552      70.5       1.8     116.1       3.4     1,733     1,675     1,675     1,677
15-24 years            683       ...       0.1       ...       0.2       683       683       683       683
25-34 years            671       ...       0.8       ...       1.2       669       669       669       670
35-44 years            698       ...       0.5       ...       2.7       695       695       695       696
45-64 years          1,056       ...       5.5       ...       7.9     1,043     1,043     1,043     1,047
65+ years              474       ...      17.1       ...      27.7       429       429       429       443

¹Distribution based on table 29, p. 42, of reference 8.

Table 7.
Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: interview reported versus perfect interview

[Average of 13 interview dates]

                   Delivery and nondelivery            Delivery discharges          Discharges excluding deliveries
                          discharges
Weeks between    Interview   Perfect   Percent Interview   Perfect   Percent Interview   Perfect   Percent
discharge and     reported interview    under-  reported interview    under-  reported interview    under-
interview                             reported                      reported                      reported

Total                949.5   1,071.1      11.4     208.3     212.7       2.1     741.2     858.4      13.7
1-4                   82.2      82.2       0.0      16.8      16.2      ¹3.7      65.4      66.0       0.9
5-8                   77.8      81.8       4.9      16.4      16.5       0.6      61.4      65.3       6.0
9-12                  78.4      81.7       4.0      16.2      16.5       1.8      62.2      65.2       4.6
13-16                 78.1      83.5       6.5      16.3      16.5       1.2      61.8      67.0       8.8
17-20                 74.9      82.3       9.0      15.7      16.5       4.8      59.2      65.8      10.0
21-24                 76.3      82.1       7.1      16.4      16.7       1.8      59.9      65.4       8.4
25-28                 73.0      82.2      11.2      15.8      16.5       4.2      57.2      65.7      12.9
29-32                 72.8      82.4      11.7      16.0      16.4       2.4      56.8      66.0      13.9
33-36                 71.2      82.1      13.3      15.8      16.2       2.5      55.4      65.9      15.9
37-40                 73.8      82.7      10.8      15.6      16.0       2.5      58.2      66.7      12.7
41-44                 71.4      82.9      13.9      16.1      16.2       0.6      55.3      66.7      17.1
45-48                 64.8      82.5      21.5      15.0      16.2       7.4      49.8      66.3      24.9
49-52                 54.8      82.7      33.7      16.2      16.3       0.6      38.6      66.4      41.9

¹Percent overreported.

Table 8.
Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: perfect interview versus all discharges

[Average of 13 interview dates]

                   Delivery and nondelivery            Delivery discharges          Discharges excluding deliveries
                          discharges
Weeks between      Perfect       All   Percent   Perfect       All   Percent   Perfect       All   Percent
discharge and    interview              under- interview              under- interview              under-
interview                             reported                      reported                      reported

Total              1,071.1   1,143.5       6.3     212.7     212.7       0.0     858.4     930.8       7.8
1-4                   82.2      85.5       3.9      16.2      16.2       0.0      66.0      69.3       4.8
5-8                   81.8      86.2       5.1      16.5      16.5       0.0      65.3      69.7       6.3
9-12                  81.7      86.5       5.3      16.5      16.5       0.0      65.2      70.0       6.9
13-16                 83.5      88.6       5.8      16.5      16.5       0.0      67.0      72.1       7.1
17-20                 82.3      88.1       6.6      16.5      16.5       0.0      65.8      71.6       8.1
21-24                 82.1      87.6       6.8      16.7      16.7       0.0      65.4      70.9       5.3
25-28                 82.2      87.8       6.4      16.5      16.5       0.0      65.7      71.3       7.9
29-32                 82.4      88.3       6.7      16.4      16.4       0.0      66.0      71.9       8.2
33-36                 82.1      88.1       6.8      16.2      16.2       0.0      65.9      71.9       8.3
37-40                 82.7      89.1       7.2      16.0      16.0       0.0      66.7      73.1       8.8
41-44                 82.9      89.3       7.2      16.2      16.2       0.0      66.7      73.1       8.8
45-48                 82.5      89.0       7.3      16.2      16.2       0.0      66.3      72.8       8.9
49-52                 82.7      89.4       7.5      16.3      16.3       0.0      66.4      73.1       9.2

Table 9. Percent underreporting of hospital discharges, by type of discharge and number of weeks between discharge and interview: interview reported versus all discharges

[Average of 13 interview dates]

                   Delivery and nondelivery            Delivery discharges          Discharges excluding deliveries
                          discharges
Weeks between    Interview       All   Percent Interview       All   Percent Interview       All   Percent
discharge and     reported              under-  reported              under-  reported              under-
interview                             reported                      reported                      reported

Total                949.5   1,143.5      17.0     208.3     212.7       2.1     741.2     930.8      20.4
1-4                   82.2      85.5       3.9      16.8      16.2      ¹3.7      65.4      69.3       5.6
5-8                   77.8      86.2       9.7      16.4      16.5       0.6      61.4      69.7      11.9
9-12                  78.4      86.5       9.4      16.2      16.5       1.8      62.2      70.0      11.1
13-16                 78.1      88.6      11.9      16.3      16.5       1.2      61.8      72.1      14.9
17-20                 74.9      88.1      15.0      15.7      16.5       4.8      59.2      71.6      17.3
21-24                 76.3      87.6      12.9      16.4      16.7       1.8      59.9      70.9      15.5
25-28                 73.0      87.8      16.9      15.8      16.5       4.2      57.2      71.3      19.7
29-32                 72.8      88.3      17.6      16.0      16.4       2.4      56.8      71.9      21.0
33-36                 71.2      88.1      19.2      15.8      16.2       2.5      55.4      71.9      22.9
37-40                 73.8      89.1      17.2      15.6      16.0       2.5      58.2      73.1      20.4
41-44                 71.4      89.3      20.0      16.1      16.2       0.6      55.3      73.1      24.4
45-48                 64.8      89.0      27.2      15.0      16.2       7.4      49.8      72.8      31.6
49-52                 54.8      89.4      38.7      16.2      16.3       0.6      38.6      73.1      47.2

¹Percent overreported.

Table 10. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: interview reported versus perfect interview

[Average of 13 interview dates]

                          1-day stay                     2-4-day stay                     5+-day stay
Weeks between    Interview   Perfect   Percent Interview   Perfect   Percent Interview   Perfect   Percent
discharge and     reported interview    under-  reported interview    under-  reported interview    under-
interview          dis-      dis-     reported   dis-      dis-     reported   dis-      dis-     reported
                  charges   charges             charges   charges             charges   charges

Total                101.2     131.8      23.2     339.9     383.5      11.4     508.8     555.8       8.5
1-4                    9.1       9.4       3.2      29.5      29.4      ¹0.3      43.7      43.5      ¹0.5
5-8                    8.2       9.8      16.3      27.1      28.5       4.9      42.5      43.5       2.3
9-12                   8.0      10.0      20.0      27.3      28.3       2.3      42.9      43.4       1.2
13-16                  8.0      10.2      21.6      28.4      29.3       3.1      41.7      44.0       5.2
17-20                  7.8       9.6      18.8      26.4      29.8      11.4      40.7      42.9       5.1
21-24                  7.5       9.8      23.4      28.2      29.5       4.4      40.7      42.8       4.9
25-28                  7.8      10.1      22.8      26.2      29.2      10.3      39.0      42.9       9.1
29-32                  6.9      10.0      31.0      27.2      29.6       8.1      38.7      42.8       9.6
33-36                  7.6      10.2      25.5      25.5      29.4      13.3      38.2      42.5      10.1
37-40                  7.2      10.6      32.1      26.5      29.8      11.1      40.1      42.2       5.0
41-44                  8.6      10.8      20.4      23.8      30.4      21.7      39.0      41.8       6.7
45-48                  7.4      10.8      31.5      22.3      30.3      26.4      35.1      41.3      15.0
49-52                  7.1      10.5      32.4      21.3      30.0      29.0      26.5      42.2      37.2

¹Percent overreported.

Table 11. Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: perfect interview versus all discharges

[Average of 13 interview dates]

                          1-day stay                     2-4-day stay                     5+-day stay
Weeks between      Perfect       All   Percent   Perfect       All   Percent   Perfect       All   Percent
discharge and    interview      dis-    under- interview      dis-    under- interview      dis-    under-
interview          dis-      charges  reported   dis-      charges  reported   dis-      charges  reported
                  charges                       charges                       charges

Total                131.8     136.1       3.2     383.5     405.4       5.4     555.8     602.2       7.7
1-4                    9.4       9.6       2.1      29.4      30.4       3.3      43.5      45.5       4.4
5-8                    9.8      10.2       3.9      28.5      30.0       5.0      43.5      46.1       5.6
9-12                  10.0      10.3       2.9      28.3      30.0       5.7      43.4      46.2       6.1
13-16                 10.2      10.5       2.9      29.3      31.1       5.8      44.0      47.1       6.6
17-20                  9.6       9.8       2.0      29.8      31.6       3.7      42.9      46.6       7.9
21-24                  9.8      10.0       2.0      29.5      31.2       5.4      42.8      46.5       8.0
25-28                 10.1      10.4       2.2      29.2      30.8       5.2      42.9      46.6       7.9
29-32                 10.0      10.3       2.2      29.6      31.2       5.1      42.8      46.8       8.5
33-36                 10.2      10.5       2.9      29.4      31.0       5.2      42.5      46.6       8.8
37-40                 10.6      11.0       3.6      29.8      31.7       6.0      42.2      46.4       9.1
41-44                 10.8      11.2       3.6      30.4      32.3       5.9      41.8      45.8       8.7
45-48                 10.8      11.3       4.4      30.4      32.2       5.6      41.3      45.5       9.2
49-52                 10.5      11.0       4.5      30.0      31.9       6.0      42.2      46.5       2.2

Table 12.
Percent underreporting of hospital discharges, by actual length of stay and number of weeks between discharge and interview: interview reported versus all discharges

[Average of 13 interview dates]

                          1-day stay                     2-4-day stay                     5+-day stay
Weeks between    Interview       All   Percent Interview       All   Percent Interview       All   Percent
discharge and     reported      dis-    under-  reported      dis-    under-  reported      dis-    under-
interview          dis-      charges  reported   dis-      charges  reported   dis-      charges  reported
                  charges                       charges                       charges

Total                101.2     136.1      25.6     339.9     405.4      16.2     508.8     602.2      15.3
1-4                    9.1       9.6       5.2      29.5      30.4       3.0      43.7      45.5       4.0
5-8                    8.2      10.2      19.6      27.1      30.0       9.7      42.5      46.1       7.3
9-12                   8.0      10.3      22.3      27.3      30.0       8.3      42.9      46.2       7.1
13-16                  8.0      10.5      23.8      28.4      31.1       8.7      41.7      47.1      11.5
17-20                  7.8       9.8      20.4      26.4      31.6      16.5      40.7      46.6      12.7
21-24                  7.5      10.0      25.0      28.2      31.2       9.6      40.7      46.5      12.5
25-28                  7.8      10.4      25.0      26.2      30.8      14.9      39.0      46.6      16.3
29-32                  6.9      10.3      33.0      27.2      31.2      12.8      38.7      46.8      17.3
33-36                  7.6      10.5      27.6      25.5      31.0      17.7      38.2      46.6      18.0
37-40                  7.2      11.0      34.5      26.5      31.7      16.4      40.1      46.4      13.6
41-44                  8.6      11.2      23.2      23.8      32.3      26.3      39.0      45.8      14.8
45-48                  7.4      11.3      34.5      22.3      32.2      30.7      35.1      45.5      22.9
49-52                  7.1      11.0      35.5      21.3      31.9      33.2      26.5      46.5      43.0

APPENDIX

OUTLINE FOR COMPUTER SIMULATION OF HOSPITAL DISCHARGES

[Input data are found in table B for the MP1 matrix, in table C for the MP2 matrix, and in table D for the MP3 matrix. For other matrices in the computer program, data are not reproduced in this report because of their bulk.]

Each age-sex group of n individuals is assigned birth dates b_k, delivery dates c_k, and death dates d_k, where k = 1, 2, ..., n. The input data also includes:

1. Weekly admission probabilities P appropriate to the kth individual according to his age, sex, and subgroup as per the MP1 matrix;
2. Daily discharge probabilities P appropriate to the kth individual according to his age, sex, and number of days already hospitalized as per the MP2 matrix;
3. Probabilities P of being hospitalized for a delivery according to age as per the MP3 matrix;
4. Daily discharge probabilities P for delivery hospitalizations according to age and number of days already hospitalized as per the MP4 matrix.

These probability matrices are all used in Phase I. In Phase II, the input data consist of birth dates, delivery dates, death dates, and the number of days to death m for individuals determined in Phase I to be in their last year of life. The input data for Phase II also includes:

1. Daily admission probabilities P according to the number of days of life remaining to the individual as per the MP7 matrix;
2. Daily discharge probabilities P according to age, sex, and number of days already hospitalized as per the MP8 matrix.

Histories are generated separately for each of the n individuals in an age-sex group. Starting with the first individual, the basic steps in the computer program are as follows:

I. Determine whether d_k - b_k > 364. If no, set m = 365 - (d_k - b_k) and day i = 1 and proceed to III (Phase II). If yes, set i = b_k and:

   a. Generate uniform random numbers R_i for each day from b_k to 756 as outlined below. First, however, check, is d_k - i < 365? If yes, set m = 365 - (d_k - i) and proceed to II. If no, is c_k - 30 [...] 756? If yes, record i+1 as the discharge date for this admission and loop back to I for the next (k+1st) individual. If no, proceed to I-e.
   e. Generate R_(i+j) (j = 1 to 100) and proceed to I-f.
   f. Is R_(i+j) < P? If no, proceed to I-g. If yes, proceed to I-h.
   g. Is i+j = 756? If yes, record 757 as the discharge date and loop to I for the next individual. If no, loop back to I-e, taking j = j+1.
   h. Record i+j as the discharge date for this admission. Is i+j = 756? If yes, loop back to I for the next individual. If no, loop back to I-a, taking i = i+j+1.

II. Is c_k = [...]? If no, loop back to I-a, taking i = i+1. If yes, proceed to II-a.

   a. Generate random number R_i. [...] If no, loop back to I-a, taking i = i+1. [...] proceed to II-b.
   b. Record i as a delivery admission date; then proceed to II-c.
   c. Generate R_(i+s) (s = 1 to 30). [...]
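The admission and discharge loop outlined above can be illustrated with a short sketch. It is not the program used in the study: it assumes a single constant weekly admission probability and a short list of daily discharge probabilities indexed by days already hospitalized, both hypothetical stand-ins for the MP1 and MP2 matrices, and it omits deliveries, deaths, and Phase II. Stays still open on day 756 are simply closed on that day, which is a simplification. The helper percent_underreported assumes the definition implied by the totals of table 7 (shortfall of the reported count as a percent of the reference count).

```python
import random

HORIZON = 756  # last simulated day, as in the outline


def simulate_person(p_admit_week, p_discharge_by_day, rng, start_day=1):
    """Return (admission_day, discharge_day) pairs for one simulated person.

    p_admit_week and p_discharge_by_day are hypothetical inputs standing in
    for the MP1 and MP2 probability matrices.
    """
    episodes = []
    day = start_day
    while day <= HORIZON:
        # One uniform draw per week decides whether an admission occurs.
        if rng.random() >= p_admit_week:
            day += 7
            continue
        admit = day
        stay = 0
        # One uniform draw per hospital day decides whether discharge occurs;
        # the probability depends on the days already hospitalized.
        while day < HORIZON:
            p = p_discharge_by_day[min(stay, len(p_discharge_by_day) - 1)]
            if rng.random() < p:
                break
            day += 1
            stay += 1
        episodes.append((admit, day))
        day += 1  # resume the search for admissions on the following day
    return episodes


def percent_underreported(reported, reference):
    """Shortfall of `reported` relative to `reference`, in percent."""
    return 100.0 * (reference - reported) / reference


if __name__ == "__main__":
    rng = random.Random(1966)
    history = simulate_person(0.02, [0.10, 0.20, 0.30], rng)
    print(history)
    # Totals from table 7: interview reported versus perfect interview.
    print(round(percent_underreported(949.5, 1071.1), 1))
```

Passing a seeded random.Random instance keeps each simulated history reproducible, which makes it easier to compare the "interview reported," "perfect interview," and "all discharges" counts for the same set of generated histories.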
OUTLINE OF REPORT SERIES FOR VITAL AND HEALTH STATISTICS

Public Health Service Publication No. 1000

Series 1. Programs and collection procedures. Reports which describe the general programs of the National Center for Health Statistics and its offices and divisions, data collection methods used, definitions, and other material necessary for understanding the data.
Reports number 1-4

Series 2. Data evaluation and methods research. Studies of new statistical methodology including: experimental tests of new survey methods, studies of vital statistics collection methods, new analytical techniques, objective evaluations of reliability of collected data, contributions to statistical theory.
Reports number 1-13

Series 3. Analytical studies. Reports presenting analytical or interpretive studies based on vital and health statistics, carrying the analysis further than the expository types of reports in the other series.
Reports number 1-4

Series 4. Documents and committee reports. Final reports of major committees concerned with vital and health statistics, and documents such as recommended model vital registration laws and revised birth and death certificates.
Reports number 1 and 2

Series 10. Data From the Health Interview Survey. Statistics on illness, accidental injuries, disability, use of hospital, medical, dental, and other services, and other health-related topics, based on data collected in a continuing national household interview survey.
Reports number 1-26

Series 11. Data From the Health Examination Survey. Statistics based on the direct examination, testing, and measurement of national samples of the population, including the medically defined prevalence of specific diseases, and distributions of the population with respect to various physical and physiological measurements.
Reports number 1-12

Series 12. Data From the Health Records Survey. Statistics from records of hospital discharges and statistics relating to the health characteristics of persons in institutions, and on hospital, medical, nursing, and personal care received, based on national samples of establishments providing these services and samples of the residents or patients.
Reports number 1-3

Series 20. Data on mortality. Various statistics on mortality other than as included in annual or monthly reports: special analyses by cause of death, age, and other demographic variables, also geographic and time series analyses.
Reports number 1

Series 21. Data on natality, marriage, and divorce. Various statistics on natality, marriage, and divorce other than as included in annual or monthly reports: special analyses by demographic variables, also geographic and time series analyses, studies of fertility.
Reports number 1-7

Series 22. Data From the National Natality and Mortality Surveys. Statistics on characteristics of births and deaths not available from the vital records, based on sample surveys stemming from these records, including such topics as mortality by socioeconomic class, medical experience in the last year of life, characteristics of pregnancy, etc.
Reports number 1

For a list of titles of reports published in these series, write to:
National Center for Health Statistics
U.S. Public Health Service
Washington, D.C. 20201

NATIONAL CENTER FOR HEALTH STATISTICS, Number 14

Replication
An Approach to the Analysis of Data From Complex Surveys

U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE
Public Health Service

Public Health Service Publication No. 1000-Series 2-No. 14

For sale by the Superintendent of Documents, U.S.
Government Printing Office, Washington, D.C., 20402 - Price 35 cents

NATIONAL CENTER FOR HEALTH STATISTICS | Series 2, Number 14

VITAL and HEALTH STATISTICS
DATA EVALUATION AND METHODS RESEARCH

Replication
An Approach to the Analysis of Data From Complex Surveys

Development and evaluation of a replication technique for estimating variance.

Washington, D.C. April 1966

U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE
John W. Gardner, Secretary

Public Health Service
William H. Stewart, Surgeon General

NATIONAL CENTER FOR HEALTH STATISTICS
FORREST E. LINDER, Ph. D., Director
THEODORE D. WOOLSEY, Deputy Director
OSWALD K. SAGEN, Ph. D., Assistant Director
WALT R. SIMMONS, M.A., Statistical Advisor
ALICE M. WATERHOUSE, M.D., Medical Advisor
JAMES E. KELLY, D.D.S., Dental Advisor
LOUIS R. STOLCIS, M.A., Executive Officer

OFFICE OF HEALTH STATISTICS ANALYSIS
Iwao M. Moriyama, Ph. D., Chief

DIVISION OF VITAL STATISTICS
Robert D. Grove, Ph. D., Chief

DIVISION OF HEALTH INTERVIEW STATISTICS
Philip S. Lawrence, Sc. D., Chief

DIVISION OF HEALTH RECORDS STATISTICS
Monroe G. Sirken, Ph. D., Chief

DIVISION OF HEALTH EXAMINATION STATISTICS
Arthur J. McDowell, Chief

DIVISION OF DATA PROCESSING
Sidney Binder, Chief

Library of Congress Catalog Card Number 66-60010

PREFACE

The theory of design of surveys has advanced greatly in the past three decades. One result is that many surveys now rest upon complex designs involving such factors as stratification and poststratification, multistage cluster sampling, controlled selection, and ratio, regression, or composite estimation. Another result is a growing concern and search for valid and efficient techniques for analysis of the output from such complex surveys.
A central difficulty is that most of the standard classical techniques for statistical analysis assume that observations are independent of one another and are the result of simple random sampling, often from a universe of normal or other known distribution, a situation that does not prevail in modern complex design.

This report reviews several aspects of the problem and the limited literature on the topic. It offers a new method of balanced half-sample pseudoreplication as a solution to one phase of the problem.

The entire matter of how best to analyze data from complex surveys is nearly as broad as statistical theory itself. It encompasses not only the technical features of analysis, but also relationships among purpose and design of the survey, and the character of inferences which may be drawn about populations other than the finite universe which was sampled. This report treats only a very small sector of the subject but, it is believed, introduces a scheme which may be widely useful. There is good reason to hope that the method, or possibly variations of it, may have utility beyond the somewhat narrow area with which it deals specifically.

The exploration and developments reported here are the outgrowth of discussions among a number of people, as is nearly always true when the subject is a pervasive one. But they are particularly the product of a study by Philip J. McCarthy of Cornell University under a contractual arrangement with the National Center for Health Statistics. Contributions of the Center were coordinated by Walt R. Simmons. Professor McCarthy wrote the report. Garrie J. Losee of the Center was responsible for initial work on half-sample replication, as employed in NCHS surveys, and prepared the appendix to this report.
CONTENTS

Preface
Introduction
Complex Sample Surveys and Problems of Critical Analysis
General Approaches for Solving Problems of Critical Analysis
Two Extreme Approaches
Obtaining "Exact" Solutions
Replication Methods of Estimating Variances
Pseudoreplication
Half-Sample Replication Estimates of Variance From Stratified Samples
Balanced Half-Sample Replication
Partially Balanced Half Samples
Half-Sample Replication and the Sign Test
Jackknife Estimates of Variance From Stratified Samples
Half-Sample Replication and the Jackknife Method With Stratified Ratio-Type Estimators
Summary
Bibliography
Appendix. Estimation of Reliability of Findings From the First Cycle of the Health Examination Survey
Survey Design
Requirements of a Variance Estimation Technique
Development of the Replication Technique
Computer Output

A key feature of statistical techniques necessary to the analysis of data
Earlier divect computational procedures are either in- appropriate or much too difficult, even with high speed electronic com- puters, to cope with the elaborate stratification, multistage cluster sam- pling, and intricate estimation schemes found in many current sample surveys. A different approach is needed. A number of statisticians have attempted solution through a variety of schemes which employ some form of replication ov random grouping of observations. These efforts are recalled in this veport, as a part of the background review of principal issues present in choice of analytic methods suitable to the complex survey. Among the estimating schemes in vecent use is a half-sample pseudo- replication technique adapted by the National Center fov Health Statis- tics from an approach developed by the U.S. Bureau of the Census. This method is described in detail in the report. Typically, it involves sub- sampling a parent sample in such a way that 20-40 pseudoveplicated es- timates of any specified statistic are produced, withthe precision of the corresponding statistic fromthe parent sample being estimated from the variability among the replicated estimates. One difficulty in using this method is that the 20-40 estimates are cho- sen from among the thousands ov millions of possible replicates of the same character, and hence may yield an unstable estimate of the var- tability among the possible replicates. The report presents a system for controlled choice of a limited number of pseudoreplicates— often no move than 20-40 for a major national suvvey— such that for some classes of statistics the chosen small number of veplicates has a variance al- gebraically identical with that of all possible replicates of the same character within the parent sample, and the same expected value as the variance of all possible replicates of the same character for all possi- ble parent samples of the same design. Illustrations of the technique and guides for its use are included. 
SYMBOLS

Data not available: ---
Category not applicable: ...
Quantity zero: -
Quantity more than 0 but less than 0.05: 0.0
Figure does not meet standards of reliability or precision: *

REPLICATION
AN APPROACH TO THE ANALYSIS OF DATA FROM COMPLEX SURVEYS

Philip J. McCarthy, Ph. D., Cornell University

INTRODUCTION

A considerable body of theory and practice has been developed relating to the design and analysis of sample surveys. This material is available in such books as Cochran (1963), Deming (1950), Hansen, Hurwitz, and Madow (1953), Kish (1965), Sukhatme (1954), and Yates (1960), and in numerous journal articles. Much of this theory and practice has the following characteristics: the sampled populations contain finite numbers of elements; no assumptions are made concerning the distributions of the pertinent variables in the population; major emphasis is placed on the estimation of simple population parameters such as percentages, means, and totals; and the samples are assumed to be "large" so that the sampling distributions of estimates can be approximated by normal distributions. Furthermore, it has frequently been appropriate to regard the principal goal of sample design as that of achieving a stated degree of precision for minimum cost, or alternatively, of maximizing precision for fixed cost.

Sample surveys in which major emphasis is placed on the estimation of population parameters such as percentages, means, or totals have been variously called "descriptive" or "enumerative" surveys and, as noted above, the work in finite-population sampling theory has been primarily concentrated on the design of such surveys. Increasingly, however, one finds reference in the sample survey literature to "analytical" surveys or to the use of "analytical statistics." Cochran (1963, p.
4), for example, says:

In a descriptive survey the objective is simply to obtain certain information about large groups: for example, the numbers of men, women, and children who view a television program. In an analytical survey, comparisons are made between different subgroups of the population, in order to discover whether differences exist among them that may enable us to form or to verify hypotheses about the forces at work in the population. . . . The distinction between descriptive and analytical surveys is not, of course, clear-cut. Many surveys provide data that serve both purposes.

Although there are some differences in emphasis, Deming (1950, chap. 7), Hartley (1959), and Yates (1960, p. 297) make essentially the same distinction. Kish (1957, 1965) more or less automatically assumes that data derived from most complex sample surveys will be subjected to some type of detailed analysis, and applies the term "analytical statistical methods" to procedures that go much beyond the mere estimation of population percentages, means, and totals.

The Health Examination Survey (HES) of the National Center for Health Statistics (NCHS) is one example of a sample survey that, in some respects, might be classified as an enumerative survey, but whose principal value will undoubtedly be in providing data for analytical purposes. Some of the basic features of Cycle I of HES are presented in a publication of the National Center for Health Statistics (Series 11, No. 1). A brief description of the survey, quoted from the report, is as follows:

The first cycle of the Health Examination Survey was the examination of a sample of adults. It was directed toward the collection of statistics on the medically defined prevalence of certain chronic diseases and of a particular set of dental findings and physical and physiological measurements.
The probability sample consisted of 7,710 of all noninstitutional, civilian adults in the age range 18-79 years in the United States. Altogether, 6,672 persons were examined during the period of the Survey which began in October 1959 and was completed in December 1962.

A rather detailed account of the survey design has been published by the Center (Series 1, No. 4). The enumerative and analytical aspects of this survey, and the inevitable blending of one into the other, are well illustrated in two reports that have been published on the blood pressure of adults (NCHS, Series 11, Nos. 4 and 5). Not only does one find in these reports the distribution of blood pressure readings for the entire sample, but one also finds the comparison (with respect to blood pressure) of subgroups of the population defined by a variety of combinations of such demographic variables as age, sex, arm girth, race, area of the United States, and size of place of residence. It seems unnecessary to argue where the enumerative aspects end and the analytical aspects begin. For all practical purposes, and by any definition one chooses to adopt, the survey is analytical in character. The same will be true of almost any sample survey that one examines, at least as far as many users of the data are concerned. Certainly this is the view of the staff at NCHS.

The principal goal of this report will be to examine some of the problems that arise when data from a complex sample survey operation are subjected to detailed and critical analysis, and to discuss some of the procedures that have been suggested for dealing with these problems. Particular emphasis will be placed on a procedure for estimating variances which is especially suitable for sample designs similar to those used in the Health Interview Survey and the Health Examination Survey.
COMPLEX SAMPLE SURVEYS AND PROBLEMS OF CRITICAL ANALYSIS

Simple random sampling, usually without replacement, provides the base upon which the presently existing body of sample survey theory has been constructed. Major modifications of random sampling have been dictated by one or both of two considerations. These are as follows.

(1) One rarely attempts to survey a finite population without having some prior knowledge concerning either individual elements in the population, or subgroups of population elements, or the population as an entity. This prior information, depending upon its nature, can be used in the sample design or in the method of estimation to increase the precision of estimates over that which would be achieved by simple random sampling. Thus we have such techniques as stratification, stratification after the selection of the sample (poststratification), selection with probabilities proportional to the value of some auxiliary variable, ratio estimation, and regression estimation.

(2) Many finite populations chosen for survey study are characterized by one or both of the following two circumstances: the ultimate population elements are dispersed over a wide geographic area, and groups or "clusters" of elements can be readily identified in advance of taking the survey, whereas the identification of individual population elements would be much more costly. These circumstances have led to the use of multistage sampling procedures, where one first selects a sample of clusters and then selects a sample of elements from within each of the chosen clusters.

In addition to these two main streams of development, whose results are frequently combined in any one survey undertaking, there is a wide variety of related and special techniques from which choices can be made for sample design and for estimation.
Thus one can use systematic sampling, rotation sampling (in which a population is sampled over time with some sample elements remaining constant from time to time), two-phase sampling (in which the results of a preliminary sample are used to improve design or estimation for a second sample), unbiased ratio estimators instead of ordinary ratio estimators, and so on. Finally, it is necessary to recognize that measurements may not be obtained from all elements that should have been included in a sample, and that such nonresponse may influence the estimation procedure and the interpretation of results.

As sample design and estimation move from simple random sampling and the straightforward estimation of population means, percentages, or totals to a stratified, multistage design with ratio or regression forms of estimation, it becomes increasingly difficult to operate in an "ideal" manner even for the purest of enumerative surveys. Ideally, one would like to be assured that the "best" possible estimate has been obtained for the given expenditure of funds, that the bias of the estimate is either negligible or measurable, and that the precision of the estimate has been appropriately evaluated on the basis of the sample selected. Numerous difficulties are encountered in achieving this goal. Among these are: (1) the expressions that must be evaluated from sample data become exceedingly complex, (2) in many instances, these expressions are only approximate in that their validity depends upon having "large" samples, and (3) most surveys provide estimates for many variables (that is, they are multivariate in character) and this, in conjunction with the first point, implies an extremely large volume of computations, even for modern electronic computing equipment. This last point is accentuated when one wishes to study the relationships among many variables in numerous subpopulations.
The foregoing difficulties are well illustrated by the Health Interview Survey (No. A-2), the Health Examination Survey (Series 1, No. 4), and the Current Population Survey (Technical Paper No. 7). The sample designs and estimation techniques for these three surveys are somewhat similar, although the Current Population Survey employs a composite estimation technique (made possible by the rotation of sample elements) that is not employed in the other two surveys.

We have at various points in the preceding discussion used the term "complex" sample surveys, implying thereby that the sample design is in some sense or other complex. Little is to be gained by arguing the distinction between simple or complex under these circumstances, although several observations are perhaps in order. We are, of course, primarily concerned with the complexities of analysis that result from the use of a particular sample design and estimation procedure. These complexities arise from various combinations of such factors as the following. The assumption of a functional form for the distribution of a random variable over a finite population is rarely feasible and thus analytical power for devising statistical procedures is lost. The selection of elements without replacement, or in clusters, introduces dependence among observations. Estimators are usually nonlinear and we are forced to use approximate procedures for evaluating their characteristics. Some design techniques that are known to increase the precision of estimates almost invariably lead to the negation of assumptions required by such common statistical procedures as the analysis of variance (e.g., strata having unequal within-variances). Further comments on these points will be made later in this report.

New dimensions of complexity, both conceptual and technical, arise as one progresses from a truly enumerative survey situation to a purely analytical survey. Each of these will now be discussed briefly.
On the conceptual side, the major question concerns the manner in which one chooses to view a finite population: either as a fixed set of elements for which a statistical description is desired, or as a sample from an infinite superpopulation to which inferences are to be made. In simplest terms, this can be viewed as follows. An infinite superpopulation, characterized by random variable $y$ with mean $E(y) = \mu$, and with variance $E(y - \mu)^2 = \sigma^2$, is assumed as a basis for the sampling process. $N$ independent observations on $y$ lead to a finite population with mean

$$\frac{1}{N} \sum_{i=1}^{N} y_i = \bar{Y}$$

and with variance

$$\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2 = S^2,$$

while a simple random sample, drawn without replacement, from the finite population has the observed mean

$$\frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$$

and variance

$$\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = s^2.$$

Ordinary sampling theory assumes that we wish to describe the realized finite population of $N$ elements, and we have

$$E(\bar{y} \mid \text{fixed } N \text{ values of } y) = \bar{Y}$$

$$V(\bar{y} \mid \text{fixed } N \text{ values of } y) = \frac{N-n}{N} \, \frac{S^2}{n}$$

$$\hat{V}(\bar{y} \mid \text{fixed } N \text{ values of } y) = \frac{N-n}{N} \, \frac{s^2}{n}$$

where the symbol $\hat{\ }$ indicates an estimator of a population parameter. If, however, we wish to draw inferences for the infinite superpopulation from our observed sample, and therefore take expectations over an infinite set of finite populations of $N$ elements, then it is straightforward to demonstrate that

$$E(\bar{y}) = \mu, \qquad V(\bar{y}) = \sigma^2/n, \qquad \hat{V}(\bar{y}) = s^2/n.$$

In effect, the only formal difference in the two views is that the finite population correction is omitted in the variance of $\bar{y}$ and in the estimate of the variance of $\bar{y}$. This point has been made by Deming (1950, p. 251) and Cochran (1963, p. 37). Cochran says, in reference to the comparison of two subpopulation means:

One point should be noted. It is seldom of scientific interest to ask whether $\bar{Y}_j = \bar{Y}_k$, because these means would not be exactly equal in a finite population, except by a rare chance, even if the data in both domains were drawn at random from the same infinite population.
Instead, we test the null hypothesis that the two domains were drawn from infinite populations having the same mean. Consequently we omit the fpc when computing $V(\bar{y}_j)$ and $V(\bar{y}_k)$.

Actually, it can also be argued that one would rarely expect to find two infinite populations with identical means. Careful accounts of statistical inference sometimes emphasize this fact by distinguishing between "statistically significant difference" and "practically significant difference," and by pointing out that null hypotheses are probably never "exactly" true.

In practice, the survey sampler is ordinarily in a position to control only the inference from the sample to the finite population. He may know that the finite population is indeed a sample, drawn in some completely unknown fashion from an infinite superpopulation, but when he tries to specify this superpopulation, his definition will ordinarily be blurred and indistinct. Professional knowledge and judgment will therefore play a major role in such further inferences. Furthermore, comparisons with other studies, comparisons among subgroups in his finite population, and a consideration of related data must be brought to bear on the problem. There seems to be little that one can say in a definite way at present about this general problem, but very perceptive comments on this subject have been made by Deming and Stephan (1941) and by Cornfield and Tukey (1956, sec. 5). Even the answer to the specific question of whether or not to use finite population corrections in the comparison of domain means would appear to depend upon the circumstances.

The technical problems raised by the analytical use of data from complex surveys differ in degree but not really in kind from those faced in the consideration of enumerative survey data. These problems are primarily of two types: 1.
As indicated earlier, most analytical uses of survey data involve the comparison of subgroups of the finite population from which the sample is selected. These subgroups have been frequently referred to as "domains of study." The basic difficulty raised by this fact is that various sample sizes, which in ordinary sampling theory would be regarded as fixed from sample to sample, now become random variables. Furthermore, this occurs in such a manner that it is usually not possible to use a conditional argument (that is, it is not possible to consider the drawing of repeated samples in which the various sample sizes are viewed as being equal to the size actually observed) as can be done when estimating the mean of a domain on the basis of simple random sampling.

2. In making critical analyses of survey data, one is much more apt to use statistical techniques that go beyond the mere estimation of population means, percentages, and totals (e.g., multiple regression). Ordinary survey theory has attacked the problem of providing estimates of sampling error for certain estimates, e.g., ratio and regression estimates, but the body of available theory leaves much to be desired.

An example of each of these problems will now be described briefly.

One of the most frequently used techniques from sampling theory is that of stratification. A population is divided into $L$ mutually exclusive and exhaustive strata containing $N_1, N_2, \ldots, N_L$ elements; random samples of predetermined size $n_1, n_2, \ldots, n_L$ are drawn from the respective strata; a value of the variable $y$ is obtained for each of the sample elements; and the population mean is estimated by

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h$$

where $\bar{y}_h$ is the mean of the $n_h$ elements drawn from the $h$th stratum, $N$ is the total size of the population, and $W_h = N_h/N$. It is easily shown that an unbiased estimate of the variance of $\bar{y}_{st}$ is given by

$$\hat{V}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 (1 - f_h) \, s_h^2 / n_h$$

where $s_h^2$
is the variance for the variable $y$ as estimated in the $h$th stratum, and $f_h = n_h/N_h$ is the sampling fraction in the $h$th stratum.

For purposes of illustration, let us assume that the strata are geographic areas of the United States, that $n_h$, $N_h$, and $N$ refer to all adults (18 years of age and over), and that the variable $y$ is blood pressure. Suppose now that one wishes to estimate the average blood pressure for males in the 40-45 year age range with arm girth between 38 and 40 centimeters. This special group of adults is a subpopulation, or domain of study, with reference to the total finite population, and elements of the domain will be found in each of the defined strata. The weights for the strata and the fixed sample sizes do not refer to this subpopulation and, over repeated drawings of the main sample, the number of domain elements drawn from a stratum will be a random variable. Furthermore, the total number of domain elements in a stratum is unknown. Under these circumstances, let

$n_{h,d}$ = the number of sample elements in the $h$th stratum falling in domain $d$.
$y_{h,i,d}$ = the value of the variable for the $i$th sample element from domain $d$ in the $h$th stratum.
$\hat{N}_d = \sum_h (1/f_h) \, n_{h,d}$ = the estimated total number of elements in the domain.
$\bar{y}_{h,d} = (1/n_{h,d}) \sum_i y_{h,i,d}$ = sample mean, for the $h$th stratum, of elements falling in domain $d$.

Then an estimate of the domain mean and its estimated variance are given by

$$\bar{y}_d = \frac{\sum_h (1/f_h) \sum_i y_{h,i,d}}{\sum_h (1/f_h) \, n_{h,d}}$$

$$\hat{V}(\bar{y}_d) = \frac{1}{\hat{N}_d^2} \sum_h \frac{N_h^2 (1 - f_h)}{n_h (n_h - 1)} \left[ \sum_i (y_{h,i,d} - \bar{y}_{h,d})^2 + n_{h,d} \left( 1 - \frac{n_{h,d}}{n_h} \right) (\bar{y}_{h,d} - \bar{y}_d)^2 \right]$$

where $h = 1, 2, \ldots, L$ and $i = 1, 2, \ldots, n_{h,d}$. These expressions have been presented and discussed by a number of authors: Durbin (1958, p. 117), Hartley (1959, p. 15), Yates (1960, p. 202), Kish (1961, p. 383), and Cochran (1963, p. 149; the factor $1/\hat{N}_d^2$ was evidently omitted in the printing of this formula). Three points concerning these results are worthy of note in the context of the present discussion. These are: 1.
The estimate is actually a ratio estimate (technically, a combined ratio estimate). It is therefore almost always biased for small sample sizes, and the variance formula is only approximately correct.

2. The complexity of the formulas, as regards derivation and computation, has been increased considerably over that of ordinary stratification.

3. The variance has a between-strata component as well as a within-strata component and, if $n_{h,d}$ is small as compared with $n_h$, this between-strata component can contribute substantially to the variance.

Thus we see that changing emphasis from the total population to a subpopulation has introduced added complexities in theory and computations.

As regards the use of more advanced statistical techniques in the critical analysis of survey data, we shall simply refer to the difficulties that have been encountered in obtaining exact theory when ordinary regression techniques are applied to random samples drawn from a finite population. Cochran (1963, p. 193) summarizes this very well:

The theory of linear regression plays a prominent part in statistical methodology. The standard results of this theory are not entirely suitable for sample surveys because they require the assumptions that the population regression of y on x is linear, that the residual variance of y about the regression line is constant, and that the population is infinite. If the first two assumptions are violently wrong, a linear regression estimate will probably not be used. However, in surveys in which the regression of y on x is thought to be approximately linear, it is helpful to be able to use $\bar{y}_{lr}$ without having to assume exact linearity or constant residual variance. Consequently we present an approach that does not demand that the regression in the population be linear. The results hold only in large samples. They are analogous to the large-sample theory for the ratio estimate.

Somewhat the same point is made by Hartley (1959, p.
24) in his paper on analyses for domains of study. He says:

. . . nevertheless we shall not employ regression estimators. The reason for this is not that we consider regression theory inappropriate, but that this theory for finite populations requires considerable development before it can be applied in the present situation.

Some developments have arisen since Hartley's paper and reference to these will be given in the next section.

As a final complicating factor, we note that certain techniques used in some sample survey designs are such that their effects on the precision of estimates cannot be evaluated from a sample, even in the case of an enumerative survey. We refer specifically to the technique of controlled selection, described by Goodman and Kish (1950), and to instances in which only one first-stage sampling unit is selected from each of a set of strata (Cochran, 1963, p. 141).

In order to provide a convenient illustration of some of the foregoing points, the appendix presents a brief description of the "complex" sample design and estimation procedure employed in the Health Examination Survey, together with a selection of examples that arose in the more or less routine analysis of information collected on blood pressure. The data given are estimates of the percentage of individuals with hypertension in various subclasses of the population of adults, with the subclasses defined in terms of such demographic variables as race, sex, age, family income, education, occupation, and industry of employment. These subclasses cut across the strata used in the selection of primary sampling units, and the variances of the estimates are also affected by the various clustering and estimation features of the design. Most of the cited cases refer to the estimation of variance for the percentage of hypertension in a single subclass, although several examples are given in which the percentages in two subclasses are compared.
The variances were estimated by a replication technique that will be introduced in "Balanced Half-Sample Replication," a technique that to some extent overcomes the problems of analysis that have just been raised. The results obtained through the application of this technique will be used for illustration at several points throughout this report.

GENERAL APPROACHES FOR SOLVING PROBLEMS OF CRITICAL ANALYSIS

Two Extreme Approaches

It is possible to identify two extreme views that one may hold with respect to the problems raised in the preceding section. First of all, one might conceivably argue that analytical work with survey data should be done only "by design." That is, areas and methods of analysis should be set forth in advance of taking the survey and the sample should be selected so as to conform as closely as possible to the requirements of the stated methods. On the other extreme, one might decide to throw up his hands in dismay, ignore all the complicating factors of an already executed survey design, and treat the observations as though they had been obtained by random sampling, presumably from some extremely ill-defined superpopulation.

The first approach, that of "design for analysis," is certainly the most rational view that one can adopt. No careful examination of the literature was made to search out actual experiences on this point, but it would appear unlikely that one could find any examples of large-scale, complex, and multipurpose surveys in which this approach had been attempted. A possible exception might be the Census Enumerator Variation Study of the 1950 U.S. Census, as described by Hanson and Marks (1958), although this study was based primarily on the complete enumeration of designated areas rather than on a sample of individuals selected in accordance with a complex sample design.
In other cases individuals have been randomly selected from a defined population of adults to provide observations for a complex "experimental design," as in the Durbin and Stuart (1951) experiment on response rates of experienced and inexperienced interviewers, but again this differs considerably from the type of problem raised in the preceding section. Another example in which an experimental design has been applied to survey data is provided by Keyfitz (1953) and discussed by Yates (1960, pp. 308-314). In this case, the sample elements were obtained by cluster sampling, but the author investigates the possible effects of the clustering and concludes that it can be ignored in the analysis of variance.

Some recent work by Sedransk (1964, 1964a, 1964b) bears directly on the problem of design for analysis and it assumes that the primary goal of an analytical survey is to compare the means of different domains of study. If $\bar{y}_i$ and $\bar{y}_j$ are the estimated means for the $i$th and $j$th domains, Sedransk places constraints on the variance of their difference, for all $i$ and $j$, and searches for sample-size allocations that will minimize simple cost functions. A variety of different situations are considered. Random samples can be selected from each of the domains; random samples can be selected from the overall population, but the number of elements falling in each domain then becomes a random variable; two-stage cluster samples can be selected from each of the domains; and two-stage cluster samples can be selected from the total population, but again the number of elements falling in each domain is a random variable. In the second and fourth cases, the author considers double sampling procedures and obtains approximate solutions to guide one in choosing sample sizes for sampling from the total population so as to satisfy the constraints which are phrased in terms of all possible pair-wise domain comparisons.
Even if one does not wish to impose constraints on domain comparisons and on minimization of cost, the cited papers contain of necessity many developments in theory that will be of assistance in attacking the problems raised in the preceding section. It should be observed that the complexity of the designs considered is still far from that of the Current Population Survey or the Health Examination Survey.

Major difficulties in designing for analysis are and will continue to be encountered when the primary goal of a survey is to describe a large and dispersed population with respect to many variables, as the analytical purposes are somewhat ill defined at the design stage. Thus the broad primary purposes of HES were to provide statistics on the medically defined prevalence in the total U.S. population of a variety of specific diseases, using standardized diagnostic criteria; and to secure distributions of the general population with respect to certain physical and physiological measurements. Nevertheless, analysis of relationships among variables is also an important product of the survey. A similar set of circumstances arises with respect to data on unemployment collected by the Current Population Survey. Clearly the primary goal is to describe the incidence of employment and unemployment in the total U.S. population, and yet the data obtained must also be used for comparison and analysis. Faced with difficulties of analysis, as described in the preceding sections, one may wish to retreat to the opposite extreme from design for analysis and view the observations as coming from a simple random sample.

Actually, this type of retreat would appear to place the analyst in a difficult, if not untenable, position.
Cornfield and Tukey (1956) speak of an inference from observations to conclusions as being composed of two parts, where the first part is a statistical bridge from observations to an island (the island being the studied population) and the second part is a subject-matter span from the island to the far shore (this being, in some vague sense, a population of populations obtained by changes in time, space, or other dimensions). The first bridge is the one that can be controlled by the use of proper procedures of sampling and of statistical inference. One may be willing to introduce some uncertainty into the position of the island, for example by ignoring finite population corrections, in the hope of placing it nearer the far shore than would otherwise be the case. However, there seem to be no grounds for suggesting statistical procedures that may, unbeknownst to their user, succeed only in moving the island a short distance from the near shore.

The Health Examination Survey was carried out on a sample chosen to be broadly "representative" of the total U.S. population. Among other characteristics of design, the sample was of necessity a highly clustered one. As is well known, a highly clustered sample leads to estimates that have much larger standard errors than would be predicted on the basis of simple random sampling theory if the elements within clusters tend to be homogeneous with respect to the variables of interest. Since geographic clustering leads to homogeneity on such characteristics as racial background, socioeconomic status, food habits, availability and use of medical care, and the like, it can therefore be expected that there will also be homogeneity with respect to many of the variables of interest in the Health Examination Survey. A portion of this loss of precision is undoubtedly recovered by stratification and poststratification, but there is no guarantee that the two effects will balance one another.
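The loss of precision from clustering can be made concrete with a small artificial example. The sketch below is not taken from the Health Examination Survey; the data values, the equal cluster sizes, and the function names are assumptions made purely for illustration. It treats each cluster mean as a single observation, which is reasonable when equal-sized clusters are drawn with replacement or with a negligible sampling fraction, and compares the resulting variance estimate with the one obtained by treating all elements as a simple random sample.

```python
from statistics import mean, variance


def srs_variance_of_mean(values):
    """Variance estimate of the mean if the elements are treated as a
    simple random sample (finite population correction ignored)."""
    return variance(values) / len(values)


def cluster_variance_of_mean(clusters):
    """Variance estimate of the mean based on variability among cluster means.

    Assumes equal-sized clusters drawn with replacement (an assumption of
    this sketch, not a description of any particular survey).
    """
    cluster_means = [mean(c) for c in clusters]
    return variance(cluster_means) / len(cluster_means)


# Hypothetical data: three clusters whose elements resemble one another.
clusters = [[1, 1, 2, 1], [5, 6, 5, 6], [9, 9, 8, 9]]
elements = [y for c in clusters for y in c]

v_srs = srs_variance_of_mean(elements)
v_cluster = cluster_variance_of_mean(clusters)
ratio = v_cluster / v_srs  # exceeds 1 when clusters are internally homogeneous
print(v_srs, v_cluster, ratio)
```

With elements this homogeneous within clusters, the random sampling formula understates the variance several times over; with heterogeneous clusters the ratio would fall toward, or below, 1.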
Hence the ignoring of sample design features might well lead to gross errors in determining the magnitude of standard errors of survey estimates. In effect, the situation might be viewed as one in which inferences are being made to some ill-defined population of adults, rather than to the population from which the sample was so carefully chosen. These points have been emphasized by Kish (1957, 1959) as they relate to social surveys in general.

The effect of these sample design and estimation features on the variances of estimates for the Health Examination Survey is illustrated in the data presented in the appendix. For each of 30 designated subclasses, the variance of the estimated percentage of adults in a subclass with hypertension was estimated by two methods: (1) the replication technique to be introduced in "Balanced Half-Sample Replication" was employed, thus accounting for most of the survey features, and (2) the observations falling in a subclass were treated as if they had arisen from simple random sampling. In the second case, variances were computed as $pq/n$, where $p$ was the observed fraction of hypertensive individuals among the $n$ sample individuals falling in a particular subclass and $q = 1 - p$. The ratios of the first of these variances to the second, for the 30 comparisons, ranged from .45 to 2.87, with an average value of 1.31; the ratios of the standard errors ranged from .67 to 1.69, with an average value of 1.12. These ratios probably overestimate slightly the true ratios, since the replication technique uses the method of collapsed strata and in this instance does not account for the effects of controlled selection. Furthermore, they are subject to sampling variability.

Also included in the appendix are three examples which refer to the estimated difference between the percentages of hypertensive individuals in each of two subclasses. In this instance, the average ratio of variances is 1.51 while the average ratio of standard errors is 1.23.
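The arithmetic behind these comparisons is simple enough to sketch. In the block below, the subclass fraction, the subclass sample size, and the replication variance are hypothetical values chosen only for illustration; they are not figures from the appendix. The only quantities taken from the text are the formula pq/n and the reported extremes of the variance ratios.

```python
import math


def srs_variance_of_proportion(p, n):
    """Simple random sampling variance of a proportion, pq/n with q = 1 - p."""
    return p * (1.0 - p) / n


# Hypothetical subclass: 15 percent hypertensive among 400 sample persons.
p, n = 0.15, 400
v_srs = srs_variance_of_proportion(p, n)

# Hypothetical variance from a replication method for the same subclass.
v_replication = 0.00042

variance_ratio = v_replication / v_srs
se_ratio = math.sqrt(variance_ratio)  # ratio of the two standard errors

# For any single subclass the standard-error ratio is the square root of the
# variance ratio; applied to the extremes quoted in the text:
low_se_ratio = round(math.sqrt(0.45), 2)
high_se_ratio = round(math.sqrt(2.87), 2)
print(variance_ratio, se_ratio, low_se_ratio, high_se_ratio)
```

The averages do not obey the same relation. The mean of the standard-error ratios (1.12) is smaller than the square root of the mean variance ratio (about 1.14), as one would expect when square roots are averaged.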
These comparisons are not as "clean" as the ones for single subclasses since the random sampling variances were computed on the assumption of independence of the two estimates, and this is not necessarily the case. This set of data, limited though it may be, tends to confirm the general experience that estimates made from stratified cluster samples will tend to have larger sampling variances than would be the case for simple random samples of the same size, although the differences are not so pronounced as in situations in which the intraclass correlation is stronger than it is for this statistic.

Obtaining "Exact" Solutions

If one wishes to consider that the principal goal of analytical surveys is either the estimation or the direct comparison of the means of various domains of study, then there already exists in the literature a number of results that can assist in achieving this goal. This "exact" theory can be generally characterized as follows.

1. Ratio estimates of population means are employed, primarily because sample size is a random variable as a result of sampling clusters with unequal and unknown sizes. This of course introduces the possibility of bias for these estimates, although empirical research (e.g., Kish, Namboodiri, and Pillai, 1962) indicates that the amount of bias is apt to be negligible.

2. Expressions for the variance of a single estimate and the covariances of two or more estimates are obtained from the Taylor series approximation, and variance estimates are constructed by direct substitution into these expressions. Hence variance estimates are subject to possible bias.

3. In multistage sampling, it is either assumed that the first-stage units are drawn with replacement or that the first-stage sampling fractions are very small. This means that variance estimates can be obtained without explicitly treating within-first-stage unit sampling variability.
4. The most powerful tool for deriving results for domain-of-study estimates has been that of the "pseudovariable" and the "count variable." That is,

$$y'_{hij} = \begin{cases} y_{hij}, & \text{if the $j$th element in the $i$th first-stage unit of the $h$th stratum belongs to the domain of interest} \\ 0, & \text{otherwise} \end{cases}$$

$$u_{hij} = \begin{cases} 1, & \text{if the $j$th element in the $i$th first-stage unit of the $h$th stratum belongs to the domain of interest} \\ 0, & \text{otherwise} \end{cases}$$

Using this approach, which is related to Cornfield's (1944) earlier work, it is possible to specialize ordinary results to domain-of-study results.

A very brief summary of some of the literature on these aspects of analysis is as follows, where no attempt has been made to assign priorities to the various authors. Results that are directly phrased in terms of domains of study are given by Cochran (1963), Durbin (1958), Hartley (1959), Kish (1961, 1965), and Yates (1960). Related work on the estimation of the variance of a variety of functions of ratio estimators is presented by Jones (1956), Keyfitz (1957), Kish (1962), and Kish and Hess (1959). Aoyama (1955), Garza (1961), and Okamoto (1963) discuss chi-square contingency-table analyses in the presence of stratification. McCarthy (1965) considers the problem of determining distribution-free confidence intervals for a population median on the basis of a stratified sample.

Finally, we observe that, in the case of "small" samples, not even the ordinary normality assumptions are able to do away with difficulties, even without domain-of-study complications. Thus unequal strata variances lead to difficulties in obtaining tests and confidence intervals for a population mean, although some approximate solutions are available (e.g., Aspin (1949), Meier (1953), Satterthwaite (1946), and Welch (1947)). If one has essentially unrecognized stratification (that is, normal variables with common variance but differing means), then it is necessary to work with noncentral $t$, $\chi^2$, and $F$ distributions as described by Weibull (1953).
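The pseudovariable and count-variable device of item 4 above can be illustrated with a minimal sketch. The records, the equal weighting of all elements, and the function name `domain_ratio_mean` are assumptions made only for illustration; they are not taken from the report.

```python
# Sketch of the pseudovariable / count-variable device (hypothetical data).
# Each record: (stratum h, first-stage unit i, value y, element in domain?)
# Assumption: every element carries the same weight (a self-weighting sample).
records = [
    (1, 1, 4.0, True), (1, 1, 7.0, False), (1, 2, 5.0, True),
    (2, 1, 9.0, True), (2, 1, 3.0, False), (2, 2, 6.0, False),
]


def domain_ratio_mean(records):
    """Ratio estimate of a domain mean: total of y' divided by total of u."""
    y_prime_total = 0.0
    u_total = 0
    for _h, _i, y, in_domain in records:
        y_prime = y if in_domain else 0.0  # pseudovariable y'_hij
        u = 1 if in_domain else 0          # count variable u_hij
        y_prime_total += y_prime
        u_total += u
    return y_prime_total / u_total


domain_mean = domain_ratio_mean(records)
```

Because both totals are ordinary sums over the whole sample, any formula for the variance of a ratio of two sample totals can be applied to them unchanged, which is the sense in which ordinary results specialize to domain-of-study results.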
Replication Methods of Estimating Variances

As a result of the indicated theoretical and practical difficulties associated with the estimation of variances from complex sample surveys, interest has long been evidenced in developing shortcut methods for obtaining these estimates. For example, we noted earlier that Keyfitz (1957) and Kish and Hess (1959) have emphasized the computational simplicity that can result when primary sampling units are drawn with replacement from each of a number of strata, and when one can work with variate values associated with the primary sampling units. There are, however, other approaches that have been suggested and applied to accomplish these same ends. We refer in particular to methods that have variously been referred to as interpenetrating samples, duplicated samples, replicated samples, or random groups. In the succeeding discussion, the term "replicated sampling" will be used to cover all of these possibilities. References will be made to the pertinent literature but no attempt will be made to assign priorities or to be exhaustive. Deming has been a consistent and firm advocate of replicated sampling. He first wrote of it as the Tukey plan (1950); his recent book (1960) presents descriptions of the applications of replicated sampling to many different situations, and contains a wide variety of ingenious devices that he has developed for solving particular problems.

In simplest form, replicated sampling is as follows. Suppose one obtains a simple random sample of $n$ observations (drawn with replacement from a finite population or drawn independently from an infinite population) and that the associated values of the variable of interest are $y_1, y_2, \ldots, y_n$. Then, if $\bar y$ denotes the sample mean and $\bar Y$ the population mean, $E(\bar y) = \bar Y$ and

$$\hat V_1(\bar y) = \sum_{i=1}^{n} (y_i - \bar y)^2 / n(n-1)$$

provides an unbiased estimate of $V(\bar y)$.
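The unbiasedness claim for sampling with replacement from a finite population can be checked exactly on a toy case by enumerating every ordered sample. The sketch below does this; the four-element population and the sample size are hypothetical, and `v1` is simply an illustrative name for the estimator just given.

```python
from itertools import product

population = [2.0, 5.0, 6.0, 11.0]  # hypothetical finite population
n = 2                               # sample size, drawn with replacement


def v1(sample):
    """Sum of squared deviations about the sample mean, over n(n-1)."""
    m = len(sample)
    ybar = sum(sample) / m
    return sum((y - ybar) ** 2 for y in sample) / (m * (m - 1))


N = len(population)
pop_mean = sum(population) / N
# Variance of one draw (divisor N), so V(ybar) = sigma2 / n with replacement.
sigma2 = sum((y - pop_mean) ** 2 for y in population) / N
true_var = sigma2 / n

# All N**n equally likely ordered samples.
samples = list(product(population, repeat=n))
expected_mean = sum(sum(s) / n for s in samples) / len(samples)
expected_v1 = sum(v1(s) for s in samples) / len(samples)
```

Averaged over all sixteen samples, `expected_mean` reproduces the population mean and `expected_v1` reproduces `true_var`, up to floating-point rounding.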
Suppose now that $n$ observations are randomly divided into $t$ mutually exclusive and exhaustive groups, each containing $(n/t)$ elements, and that the means of these groups are denoted by $\bar y_1, \bar y_2, \ldots, \bar y_t$. It is clear that

$$\bar y = \sum_{j=1}^{t} \bar y_j / t$$

and that the variance of $\bar y$ can be estimated by

$$\hat V_2(\bar y) = \sum_{j=1}^{t} (\bar y_j - \bar y)^2 / t(t-1).$$

In this simple case, the advantages gained by using $\hat V_2(\bar y)$ rather than $\hat V_1(\bar y)$ lie in the fact that one has to compute the sum of $t$ squared deviations instead of the sum of $n$ squared deviations. If $t$ is considerably smaller than $n$ and if such computations must be carried out for many variables, the savings in computational time may be substantial. Also, the kurtosis of the distribution of the $\bar y_j$ is less than that of the $y_i$, possibly offsetting some of the effect of having fewer degrees of freedom to estimate $V(\bar y)$.

There is, of course, a loss of information associated with the subsample approach for estimating variances since $\hat V_2(\bar y)$ is subject to greater sampling variability than is $\hat V_1(\bar y)$. A variety of ways have been suggested for measuring this loss of information. Hansen, Hurwitz, and Madow (1953, vol. I, pp. 438-449), who designate this the random group method of estimating variances, make the comparison in terms of the relative-variance of the variance estimate. For example, they show by way of illustration that the relative-variance of a variance estimate based on 1,200 observations drawn from a normal distribution is 4.1 percent while the relative-variance based on a sample of 60 random groups of 20 observations each is 18.3 percent. Actually, this approach places emphasis on the variance estimate itself rather than on the fact that one usually wants to use the variance estimate in setting confidence limits for a population mean or in testing hypotheses about a population mean. Under these circumstances, Fisher (1942, sec. 74) suggests a measure for the amount of information that a sample mean provides respecting a population mean.
Since his approach has been questioned by numerous authors (e.g., Bartlett, 1936), we shall simply adopt the expedient of taking the ratio of, say, the 97.5 percentiles of Student's $t$ distribution for 1,200 and 60 degrees of freedom, which is .981, and interpreting this as a measure of the relative width of the desired confidence intervals. This is evidently the approach used by Lahiri (1954, p. 307). The foregoing is, of course, the familiar argument that a sample of roughly 30 or more is a "large" sample when dealing with normal populations, since $t_{.975}$ for 30 degrees of freedom is 2.042 and for a normal distribution is 1.960. Ninety-five percent confidence intervals will, on the average, be only about four percent wider when $\sigma^2$ is estimated with 30 degrees of freedom than when $\sigma^2$ is known. A slightly different measure has been proposed by Walsh (1949), formulated in terms of the power of a $t$-test.

If one wishes to use replicated sampling in conjunction with drawing without replacement from a finite population, then two different possibilities arise. One can first draw without replacement a sample of $(n/t)$ elements, then replace these elements in the population and draw a second sample of $(n/t)$ elements, and continue this process until $t$ samples have been selected. Denoting the sample means by $\bar y_1, \bar y_2, \ldots, \bar y_t$, we have

$$\bar y = \sum_{j=1}^{t} \bar y_j / t, \qquad \hat V(\bar y) = \sum_{j=1}^{t} (\bar y_j - \bar y)^2 / t(t-1)$$

and

$$V(\bar y) = \frac{N - (n/t)}{N} \cdot \frac{S^2}{n}.$$

This type of replication makes the successive samples independent of one another, but it does permit the possible duplication of elements in successive samples and hence lowers the precision of $\bar y$ as compared with the original drawing of a sample of $n$ elements without replacement. However, there may be a saving in the cost of measuring the duplicated items.
Hence a slightly larger sample could be drawn for the same total cost, recapturing some of the loss of precision.

Finally, one can draw a sample of $(n/t)$ elements without replacement, a second sample without replacing the first sample, and so on. This is, of course, equivalent to drawing a sample of $n$ elements without replacement and then randomly dividing the sample into $t$ groups. It follows that

$$\bar y = \sum_{j=1}^{t} \bar y_j / t, \qquad \hat V(\bar y) = \frac{N - n}{N} \sum_{j=1}^{t} (\bar y_j - \bar y)^2 / t(t-1)$$

and

$$V(\bar y) = \frac{N - n}{N} \cdot \frac{S^2}{n}.$$

This latter variance is smaller than the preceding one because of the difference in finite population corrections. Note, however, that the independence achieved by the first method of drawing makes it easy to apply nonparametric methods (e.g., the use of order statistics) for estimation and hypothesis testing. These points have been discussed, in a more general framework, by Koop (1960) and by Lahiri (1954).

Although replicated sampling for simple random sampling, as described in the preceding paragraphs, does provide the possibility of achieving some gains in terms of computational effort, the principal advantages of replication arise from other facets of the variance estimation problem. Some of these facets may be identified as follows.

1. There are instances of sample designs in which no estimate of sampling precision can be obtained from a single sample unless certain assumptions are made concerning the population. Systematic sampling is a case in point. (See Cochran, 1963, pp. 225, 226.) If the total sample is obtained as the combination of a number of replicated systematic samples, then one can obtain a valid estimate of sampling precision. This approach was suggested by Madow and Madow (1944, pp. 8, 9) and has been discussed at greater length and with a number of variations by Jones (1956).
In some instances, estimates made from replicated systematic samples may be less efficient than from a single systematic sample, and then one must choose between loss of efficiency and ease of variance estimation, as discussed by Gautschi (1957).

2. As is well known, the ordinary Taylor series approximation for obtaining the variance and the estimated variance of a ratio estimate, even for simple random sampling, provides a possibly biased estimate of sampling precision. As an alternative, one can consider drawing a number of independent random samples, computing a ratio estimate for each sample, and then averaging these ratio estimates for the final estimate. A valid estimate of sampling precision can then be obtained from the replicated values of the estimate. It is true, however, that the bias of the estimator itself is undoubtedly larger for the average of the separate estimates than it is for a ratio estimate computed for the complete sample since this bias is ordinarily a decreasing function of sample size. Thus gains may be achieved in one respect, while losses may be increased in the other. As far as the author knows, no completed research is available to guide one in making a choice between these two specific alternatives. This problem is, however, related to some suggestions and work by Mickey, Quenouille, Tukey, and others, and their results will be discussed in some detail in the following section.

3. After an estimate and an estimated variance have been obtained, confidence intervals are ordinarily set by appealing to large sample normality and to the approximate validity of Student's $t$ distribution. Replication can sometimes assist in providing "better" solutions. For example, consider a stratified population in which the variable of interest has a normal distribution within each stratum, but where the variance is different for the separate strata.
Difficulty is then encountered in applying the chi-square distribution to the ordinary estimate of variance, as discussed by Satterthwaite (1946), Welch (1947), Aspin (1949), and Meier (1953). However, the mean of a replicate will be normally distributed, being a linear combination of normally distributed variables, and the chi-square distribution can be applied directly to a variance estimated from the means of a number of independent replicates. This aspect of the problem has been discussed at some length by Lahiri (1954, p. 309).

4. If one is using a highly complex sample design and estimation procedure, and if independent replicates can be obtained, then replicated sampling permits one to bypass the extremely complicated variance estimation formulas and the attendant heavy programming burdens. Variance estimates based upon the replicated estimates will mirror the effects of all aspects of sampling and estimation that are permitted to vary randomly from replicate to replicate. This, of course, includes the troublesome domain-of-study type of problem.

One major disadvantage of replicated sampling has been mentioned in the preceding paragraphs, namely that the variance estimate refers to the average of replicate estimates rather than to an estimate prepared for the entire sample. If the estimates are linear in the individual observations, the two will be the same. They will not be the same, however, for the frequently employed ratio estimator and the other nonlinear estimators, and the average of the replicate estimators may possibly be subject to greater bias than is the case for the overall sample estimate.

Another major disadvantage arises from the difficulty of obtaining a sufficient number of replicates to provide adequate stability for the estimated variance.
Thus the commonly used design of two primary sampling units per stratum (frequently obtained by collapsing strata from each of which only a single unit has been drawn) gives only two independent replicates, and the resulting confidence intervals for an estimate are much wider than they should be or need to be. Some suggestions have been made for attacking this problem, and they will be discussed in the following section.

Another, but subsidiary, problem arises with replication if one wishes to estimate components of variance, that is, to determine what fraction of the total variance of an estimate arises from the sampling of primary sampling units, what fraction arises from sampling within primary sampling units, and the like. This problem does not appear to have been discussed at any great length in the sampling literature and will not be considered here since it bears more directly on design than on analysis. Some of Sedransk's work (1964, 1964a, and 1964b) does relate to the problem, and McCarthy (1961) has discussed the matter in connection with sampling for the construction of price indexes.

PSEUDOREPLICATION

Half-Sample Replication Estimates of Variance From Stratified Samples

If a set of primary sampling units is stratified to a point where the sample design calls for the selection of two primary sampling units per stratum, there are only two independent replicates available for the estimation of sampling precision. Confidence intervals for the corresponding population parameter will then be much wider than they need to be. To overcome this difficulty, at least partially, the U.S. Bureau of the Census originated a pseudoreplication scheme called half-sample replication. The scheme has been adapted and modified by the NCHS staff and has been used in the HES reliability measurements. A brief description of this approach is given in a report of the U.S. Bureau of the Census (Technical Paper No. 7, p.
57), and a reference to the Census Bureau method of half-sample replication was made by Kish (1957, p. 164). We shall first present a technical description of half-sample replication as used by NCHS in the Health Examination Survey. The theoretical development duplicates, in part, work by Gurney (1962). We shall then suggest several ways in which the method can be modified to increase the precision of variance estimates.

Consider a stratified sampling procedure where two independent selections are made in each stratum. Let the population and sample characteristics be denoted as follows:

Stratum   Weight   Population mean   Population variance   Sample observations   Sample mean
1         $W_1$    $\bar Y_1$        $S_1^2$               $y_{11}, y_{12}$      $\bar y_1$
2         $W_2$    $\bar Y_2$        $S_2^2$               $y_{21}, y_{22}$      $\bar y_2$
...
h         $W_h$    $\bar Y_h$        $S_h^2$               $y_{h1}, y_{h2}$      $\bar y_h$
...
L         $W_L$    $\bar Y_L$        $S_L^2$               $y_{L1}, y_{L2}$      $\bar y_L$

An unbiased estimate of the population mean $\bar Y$ is $\bar y_{st} = \sum_{h=1}^{L} W_h \bar y_h$, and the ordinary sample estimate of $V(\bar y_{st})$ is

$$v(\bar y_{st}) = (1/2) \sum_{h=1}^{L} W_h^2 s_h^2 = (1/4) \sum_{h=1}^{L} W_h^2 d_h^2$$

where $d_h = (y_{h1} - y_{h2})$.

Under these circumstances, a half-sample replicate is obtained by choosing one of $y_{11}$ and $y_{12}$, one of $y_{21}$ and $y_{22}$, ..., and one of $y_{L1}$ and $y_{L2}$. The half-sample estimate of the population mean is

$$\bar y_{hs} = \sum_{h=1}^{L} W_h y_{hi}$$

where $i$ is either one or two for each $h$. There are $2^L$ possible half samples, and it is easy to see that the average of all half-sample estimates is equal to $\bar y_{st}$. That is, for a randomly selected half sample

$$E(\bar y_{hs} \mid y_{11}, y_{12}, \ldots, y_{L1}, y_{L2}) = \bar y_{st}.$$

If one considers the deviation of the mean determined by a particular half sample, for example $\bar y_{hs,1} = \sum W_h y_{h1}$, from the overall sample mean, the result is obtained that

$$(\bar y_{hs,1} - \bar y_{st}) = \sum W_h y_{h1} - (1/2) \sum W_h (y_{h1} + y_{h2}) = (1/2) \sum W_h (y_{h1} - y_{h2}) = (1/2) \sum W_h d_h.$$

In general, these deviations are of the form

$$(\bar y_{hs} - \bar y_{st}) = (1/2)(\pm W_1 d_1 \pm W_2 d_2 \pm \cdots \pm W_L d_L)$$

where the deviation for a particular half sample is determined by making an appropriate choice of a plus or minus sign for each stratum. In the example given above, each sign was taken as plus.
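The half-sample construction above can be sketched in a few lines by enumerating all $2^L$ half samples for a small stratified sample. The weights and observations below are hypothetical values chosen only for illustration, and the variable names are not from the report.

```python
from itertools import product

weights = [0.5, 0.3, 0.2]                            # W_h (hypothetical)
pairs = [(10.0, 14.0), (20.0, 17.0), (31.0, 35.0)]   # (y_h1, y_h2) per stratum

# Overall stratified mean and the ordinary variance estimate (1/4) sum W^2 d^2.
y_st = sum(w * (a + b) / 2 for w, (a, b) in zip(weights, pairs))
d = [a - b for a, b in pairs]
v_ordinary = sum((w * dh) ** 2 for w, dh in zip(weights, d)) / 4

half_sample_estimates = []
deviations_match = True
for choice in product((0, 1), repeat=len(pairs)):    # one unit per stratum
    y_hs = sum(w * p[i] for w, p, i in zip(weights, pairs, choice))
    half_sample_estimates.append(y_hs)
    # Sign is plus when y_h1 is chosen and minus when y_h2 is chosen.
    signs = [1 if i == 0 else -1 for i in choice]
    signed = 0.5 * sum(s * w * dh for s, w, dh in zip(signs, weights, d))
    deviations_match = deviations_match and abs((y_hs - y_st) - signed) < 1e-9

count = len(half_sample_estimates)
mean_hs = sum(half_sample_estimates) / count
mean_sq_dev = sum((e - y_st) ** 2 for e in half_sample_estimates) / count
```

With all eight half samples included, `mean_hs` equals `y_st`, and because the cross-product terms cancel over the full set, `mean_sq_dev` equals `v_ordinary`.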
The squared deviation from the overall sample mean is therefore of the general form

$$(\bar y_{hs} - \bar y_{st})^2 = (1/4) \sum_{h} W_h^2 d_h^2 + (1/2) \sum_{h<k} \pm W_h W_k d_h d_k$$

... $\bar q = \sum_{i=1}^{2L} q_{(i)} / 2L$ and its variance by

$$\sum_{i=1}^{2L} (q_{(i)} - \bar q)^2 / 2.$$

In the simple linear case that is being considered here, it is easy to show that an analysis in terms of the $q_{(i)}$'s produces results that are identical with those obtained by the standard analysis. For example, with three strata the six values $q_{(1)}, \ldots, q_{(6)}$ are

$$\bar y_{st} + W_h d_h/2 \quad \text{and} \quad \bar y_{st} - W_h d_h/2, \qquad h = 1, 2, 3,$$

so that $\bar q = \bar y_{st}$. Furthermore

$$\sum_{i=1}^{6} (q_{(i)} - \bar q)^2 / 2 = (W_1^2 d_1^2 + W_2^2 d_2^2 + W_3^2 d_3^2)/4$$

which is the ordinary estimate of the variance of $\bar y_{st}$, just as in the case of the balanced half-sample replicates.

There is another variant of the Quenouille type of estimate which is closely related to the half-sample approach. Suppose that a particular half sample is chosen. Denote its elements by $y_1, y_2, \ldots, y_L$ and its mean by $\bar y_{hs}$. The set of remaining elements, one in each stratum, then constitutes an independent half sample whose mean we denote by $\bar y'_{hs}$. A Quenouille-type estimate is then defined by

$$2\bar y_{st} - (\bar y_{hs} + \bar y'_{hs})/2$$

which, in the simple linear case that is being considered here, is identically equal to $\bar y_{st}$. This approach does not provide an estimate of variance in the present instance since only one estimate is obtained. In more complicated situations, however, different half samples will provide different values of the Quenouille half-sample estimate, and it might be possible to base estimates of variance on these different values. This possibility will be discussed briefly in the following section.

Half-Sample Replication and the Jackknife Method With Stratified Ratio-Type Estimators

We have introduced half-sample replication and the Jackknife method in the setting of a simple linear situation, where they obviously have no real utility. Under these circumstances, they merely reproduce results that can be obtained by direct analysis.
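A small numerical check of the linear case may help here. The sketch below assumes that each Jackknife value is formed in the usual Quenouille way for two units per stratum, as twice the full-sample estimate minus the estimate with one observation deleted; that form, the data, and all names are illustrative assumptions rather than the report's own definitions.

```python
weights = [0.5, 0.3, 0.2]                            # W_h (hypothetical)
pairs = [(10.0, 14.0), (20.0, 17.0), (31.0, 35.0)]   # (y_h1, y_h2) per stratum


def stratified_mean(values_by_stratum):
    """Weighted sum of within-stratum means."""
    return sum(w * sum(v) / len(v) for w, v in zip(weights, values_by_stratum))


y_st = stratified_mean(pairs)
v_ordinary = sum((w * (a - b)) ** 2 for w, (a, b) in zip(weights, pairs)) / 4

# Assumed form of the Jackknife values: delete one observation at a time.
pseudo_values = []
for h in range(len(pairs)):
    for i in (0, 1):
        reduced = [list(p) for p in pairs]
        del reduced[h][i]                       # drop observation i of stratum h
        pseudo_values.append(2 * y_st - stratified_mean(reduced))

q_bar = sum(pseudo_values) / len(pseudo_values)
v_jackknife = sum((q - q_bar) ** 2 for q in pseudo_values) / 2

# Quenouille-type estimate from one half sample and its complement.
y_hs = sum(w * p[0] for w, p in zip(weights, pairs))
y_hs_complement = sum(w * p[1] for w, p in zip(weights, pairs))
quenouille_hs = 2 * y_st - (y_hs + y_hs_complement) / 2
```

Under these assumptions `q_bar` and `quenouille_hs` both equal `y_st`, and `v_jackknife` equals `v_ordinary`, which is the sense in which the two methods merely reproduce the direct analysis for a linear estimator.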
If, however, more complicated methods of sampling and estimation are employed, then direct methods of analysis may not be available, may require a prohibitive amount of computation in comparison with the methods being considered here, or may even give results that are in one way or another inferior to those provided by half-sample replication and the Jackknife.

Although one may accept on intuitive grounds the general premise that half-sample replication and the Jackknife do permit the "easy" computation of variance estimates that in one way or another mirror most of the standard complexities of sample design and estimation, the exact characteristics of the resulting estimates and their corresponding estimates of variance are, for the most part, unknown. This is particularly true for half-sample replication, even though the intuitive appeal of this method may be more direct than that of the Jackknife. No published or unpublished references to the behavior of half-sample replication in complex situations were discovered, and the notion of balanced half samples was introduced in this report for the first time. On the other hand, there is a growing body of literature and data relating to the Jackknife. We shall now summarize this material on the Jackknife and then report the results of a very small experiment which compares results obtained by balanced half-sample replication and by the Jackknife.

Although Quenouille (1956) introduced his method of adjustment as a means for reducing the bias of an estimator, our interest in the Jackknife is primarily focused on its utility for variance estimation.
One is naturally interested in obtaining any reductions in bias that are possible, but there is a considerable body of empirical evidence, notably in the work of Kish, Namboodiri, and Pillai (1962), which indicates that the "combined ratio estimator" for population means, subpopulation means, and differences of subpopulation means probably has negligible bias in most practical surveys. On p. 863, Kish et al. say "Our empirical investigations, set in a theoretical framework, show that the bias in most practical surveys is usually negligible; the ratio of bias to standard error (B/s) was small in every test, even those based on small sub-classes."

There is actually very little published material which has a direct bearing on our present concern. The pertinent items are briefly summarized.

1. Quenouille (1956) has shown by formal analysis that the variance of his estimator, where such an estimator is appropriate, differs from the variance of the unadjusted estimator by terms of order $1/n^2$.

2. Durbin (1959) applied the method of Quenouille to the ratio estimator $r = y/x$, where a random sample was divided into two groups of equal size. Thus he considered the estimator, of $E(y)/E(x)$,

$$t = 2r - (r_1 + r_2)/2$$

where $r = (y/x)$, $r_1 = (y_1/x_1)$, $r_2 = (y_2/x_2)$, $y$ and $x$ are sample totals, $y_1$ and $x_1$ are half-sample totals, and $y_2$ and $x_2$ are the other half-sample totals. Durbin considers two cases: (1) $x$ is a normal variable with variance $O(n^{-1})$, and the regression of $y$ on $x$ is linear, not necessarily through the origin; (2) $x$ is a gamma variable with mean $m$ and the regression of $y$ on $x$ is linear. For the first case, when terms of higher order in $n^{-1}$ are ignored, the result is obtained that the variance of $t$ is smaller than the variance of $r$. For the second case, it was not necessary to use an approximate form of analysis. Durbin concludes that ". . .
whenever the coefficient of variation of $x$ is less than 1/4, which will be satisfied by all except the most inaccurate estimators, Quenouille's estimator has a smaller mean square error than the ordinary ratio estimator. This is an exact result for any sample size."

3. Brillinger (1964) studies the properties of these estimators in relation to maximum likelihood estimators. His conclusions are: "Summing up the results of the paper, one may say that Tukey's general technique of setting approximate confidence limits is asymptotically correct, under regularity conditions, when applied to maximum likelihood estimates and that the technique provides a useful method of estimating the variance of an estimate. Also one may say that the estimate proposed by Quenouille will on many occasions have reduced variance, smaller mean-squared error and closer to asymptotic normality properties, when compared to the usual maximum likelihood estimates."

4. Robson and Whitlock (1964) apply Quenouille's method of construction to obtain estimates of a truncation point of a distribution. One of the interesting features of their work relates to the construction of estimators that successively eliminate bias terms of order $n^{-2}$, $n^{-3}$, etc. They find for their particular problem that the variance of these estimators increases as the bias is decreased.

5. Miller (1964) is concerned with conditions under which a Jackknife estimator, and its associated estimated variance, will asymptotically have a Student's $t$ distribution. Both of the situations described by Miller are ones in which the unjackknifed estimator had a proper finite or limiting distribution under weaker conditions than required for the Jackknife.

The foregoing five references are concerned with estimators and with their bias and variance. None deals with the problem of estimating variances. However, Lauh and Williams (1963) do present some Monte Carlo results which relate primarily to the estimation of variance.
They are again concerned with the estimation of a ratio, $E(y)/E(x)$, but the Quenouille procedure is applied to the individual sample observations instead of to half samples as in the Durbin investigation. That is, they compare the behavior of

$$q = \Big(\sum_{i=1}^{n} y_i\Big) \Big/ \Big(\sum_{i=1}^{n} x_i\Big)$$

with the behavior of

$$q^* = \sum_{i=1}^{n} q^*_{(i)} / n$$

where

$$q_{(i)} = \Big(\sum_{j=1}^{n} y_j - y_i\Big) \Big/ \Big(\sum_{j=1}^{n} x_j - x_i\Big)$$

and

$$q^*_{(i)} = nq - (n-1)\, q_{(i)}.$$

This is similar to the estimator that was proposed for stratified sampling in the preceding section.

Lauh and Williams define two populations which are used for empirical sampling: (1) $x$ is a normal variable, while the regression of $y$ on $x$ is linear through the origin; (2) $x$ is a chi-square variable with 2 degrees of freedom, and the regression of $y$ on $x$ is linear through the origin. Since the regressions are forced to go through the origin, both $q$ and $q^*$ are unbiased estimators of $E(y)/E(x)$. For each population 1,000 samples of $n$ are drawn, $n = 2, 3, \ldots, 9$, and a variety of variance estimators are considered. In particular, the ordinary estimate of variance obtained from a Taylor series approximation was employed, denoted by $v_2(q)$; also an estimate of variance was obtained from the $q^*_{(i)}$'s, namely

$$v(q^*) = \sum_{i=1}^{n} (q^*_{(i)} - q^*)^2 / n(n-1).$$

The results of this investigation are most interesting, particularly the fact that the precision of $v(q^*)$ is much better than that of $v_2(q)$ when $x$ has an exponential distribution, and are summarized by the authors as follows:

    From the results of these two studies, it may be inferred that the bias of the estimator $v_2(q)$ is dependent upon the degree of skewness of the original $y$ and $x$ populations.

    Estimators of the true variance taken from higher order approximations lead only to slight improvements over the second order approximation $v_2(q)$, and in some cases the estimate is actually worse. The precision of $v(q^*)$ is nearly double that of $v_2(q)$ for exponential $x$ distributions and the bias of $v(q^*)$ is smaller than that of $v_2(q)$.
    Thus it appears that the split-sample estimator $q^*$ may be definitely preferable to $q$ in some situations.

Finally, we note that extensive Monte Carlo investigations of many of these points have been initiated by Dr. Benjamin Tepping, Director of the Center for Measurement Research in the U.S. Bureau of the Census. Results of these investigations are not yet available.

The foregoing references are concerned only with random samples drawn from infinite populations. Our principal concern is with stratified samples drawn, usually without replacement, from finite populations where complex estimation procedures are applied to the basic sample data. Under these circumstances, a careful investigation of the behavior of estimators and of variance estimators would undoubtedly require a large-scale, Monte Carlo type of program, integrated with as much analytic work as possible. This was not feasible within the confines of the present study, even in terms of planning. Nevertheless, we did desire some small numerical model that would illustrate the various points that have been raised.

As an example, we started with the small artificial population that is used for illustrative purposes in Cochran (1963, pp. 178, 179). However, since we wished to enumerate all possible samples and thereby investigate the behavior of both balanced half-sample replication and the Jackknife method of variance estimation, and since computations were to be carried out on a desk calculator, we were not able to use the full population as given by Cochran. (It did not appear worthwhile to invest computer programming time on one isolated and artificial example.) Accordingly one observation was dropped from each stratum and the following population was used as shown in table 5.

Table 5. Artificial population

           Stratum I      Stratum II     Stratum III
            y      x       y      x       y      x
            3      4       5      4       7      3
            4      6       9      8       9      4
           11     20      24     23      25     12
Total      18     30      38     35      41     19

$R_1 = 18/30 = .6000$    $R_2 = 38/35 = 1.0857$    $R_3 = 41/19 = 2.1579$    $R = 97/84 = 1.1548$

For this population, all possible samples of six, $n_h = 2$ in each stratum, were enumerated: $3^3 = 27$ possible samples. For each sample, the following quantities were computed.

1. The combined ratio estimate.

2. The estimate of variance based on the ordinary Taylor series approximation to the variance of the combined ratio estimate.

3. The Quenouille estimate of the population ratio, using individual observations as previously described.

4. The Jackknife estimate of variance, as described in the preceding section.

5. The average of four balanced half-sample estimates, as described in "Balanced Half-Sample Replication." It was necessary to consider two sets of balanced half samples, one complementary to the other, since this is a nonlinear situation. It is assumed that one of these two sets will be chosen randomly in practical applications.

6. The estimate of variance based on the four balanced half-sample estimates. This estimate of variance is the sum of squared deviations of four half-sample estimates about the combined ratio estimate, divided by four, and multiplied by the finite population correction. This is the manner in which the half-sample estimate has been applied in the work of the U.S. Bureau of the Census and in HES. Again the estimate was made for each of the two sets of balanced half samples.

7. The Quenouille estimate based on the balanced half samples. That is, a Quenouille estimate was obtained from a half sample and its complement. This was carried out for each half sample and then averaged over the set of four balanced half samples.

The results of these computations are summarized in table 6.

Table 6. Behavior of estimates and estimates of variance obtained by enumerating samples drawn from artificial population of table 5

Estimate                                        Bias     Standard   Variance   Mean square   Average of   Variance of
                                                         error                 error         variance     variance
                                                                                             estimates    estimates
Combined ratio estimate                         +.0118   .122       .0148      .0149         .0099        .000034
Quenouille estimate, individual observations    +.0034   .126       .0160      .0160         .0110        .000040
Balanced half-sample estimate                   +.0428   .104       .0109      .0127         .0122        .000047
Quenouille estimate, half samples               -.0198   .137       .0188      .0192         ...          ...

In obtaining the variance estimates a finite population correction of $(1 - \tfrac{2}{3})$ was applied uniformly. It can be readily demonstrated that this is appropriate for the Jackknife and balanced half-sample estimates of variance, at least in the simple linear case that was used to introduce these techniques. Jones (1965) describes a modification of the Jackknife, whose purpose is to introduce the finite population correction into the "bias-reducing" argument upon which the Quenouille adjustment rests. This modification was not used here.

Although it is clearly impossible to draw any general conclusions from one artificial example such as the above, the results are interesting. In particular, we find that the combined ratio estimate has almost negligible bias and that the ordinary variance estimate seriously underestimates the true variance; that the Quenouille estimate with individual observations does reduce the bias at the expense of increasing the variance by about 8 percent, and that the Jackknife estimate of variance again seriously underestimates the true variance; that the average of the four balanced half-sample estimates has the smallest variance and the largest bias, while at the same time providing a reasonable estimate of the corresponding variance. The variance of the variance estimates is largest for the estimate based on the balanced half samples.
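The enumeration behind the first row of table 6 is small enough to sketch directly. The code below takes the nine (y, x) pairs of table 5, forms all 27 samples of two units per stratum, and computes the combined ratio estimate for each. It assumes that, with equal stratum sizes and equal sample sizes, the combined ratio estimate reduces to the ratio of the sample totals; the variable names are illustrative only.

```python
from itertools import combinations, product

# Table 5 population: (y, x) pairs by stratum.
population = {
    "I": [(3, 4), (4, 6), (11, 20)],
    "II": [(5, 4), (9, 8), (24, 23)],
    "III": [(7, 3), (9, 4), (25, 12)],
}

total_y = sum(y for units in population.values() for y, _ in units)
total_x = sum(x for units in population.values() for _, x in units)
true_ratio = total_y / total_x                      # R = 97/84

# Every choice of 2 units out of 3 in each stratum: 3 * 3 * 3 = 27 samples.
per_stratum = [list(combinations(units, 2)) for units in population.values()]
estimates = []
for sample in product(*per_stratum):
    sample_y = sum(y for chosen in sample for y, _ in chosen)
    sample_x = sum(x for chosen in sample for _, x in chosen)
    estimates.append(sample_y / sample_x)           # combined ratio estimate

mean_estimate = sum(estimates) / len(estimates)
bias = mean_estimate - true_ratio
variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates)
mean_square_error = variance + bias ** 2
```

Under that assumption the bias, variance, and mean square error obtained this way agree, to the four decimal places shown, with the combined ratio estimate row of table 6. The remaining rows would need the Jackknife and balanced half-sample computations, which are not attempted in this sketch.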
Even if one makes the comparison on the basis of mean square error, the balanced half-sample average is still superior to the other estimates in spite of its larger bias, except for the variance of the variance estimate. As a final point, the Quenouille adjustment applied to the balanced half samples does reduce the bias, but at the expense of a marked increase in variance.

This example, trivial and artificial as it may be, does raise one question that concerns half-sample replication. The use of the Quenouille estimate and the Jackknife method of estimating variances have usually been considered simultaneously. On the other hand, the half-sample replication method of estimating variances has always been used to estimate the variance of the estimate obtained from the entire sample, and not the variance of the average of the half-sample estimates. Of course when one does not have an "exhaustive" set of half samples, as in the case where they are drawn with replacement, the average of half-sample estimates would not be appropriate. Here, however, we do have an "exhaustive" set of balanced half-sample estimates, and we might well consider using their average in place of the combined ratio estimate. In terms of our example, a variance estimate with average value .0122 is being used to estimate the variance of the combined ratio estimate, whose true value is .0148. This is somewhat better than the ordinary Taylor series variance estimate, whose average value is .0099, but not nearly as good as if one uses the half-sample estimate of variance to estimate the mean square error of the average of the balanced half-sample estimates, namely, a quantity whose average is .0122 to estimate a quantity whose true value is .0127.

A small amount of data which relates to the foregoing point is presented in table IV of the appendix.
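The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (x, y) values in `POPULATION` are hypothetical and are not the table 5 figures, the function names are invented for the sketch, and the balanced set shown is one of the two complementary sets of four half samples that exist for three strata.

```python
from itertools import combinations, product

# Hypothetical (x, y) values for three strata of three units each.
# Illustrative only: these are NOT the table 5 figures.
POPULATION = [
    [(2.0, 1.0), (5.0, 4.0), (9.0, 8.0)],
    [(4.0, 3.0), (7.0, 9.0), (12.0, 10.0)],
    [(6.0, 8.0), (10.0, 13.0), (15.0, 21.0)],
]

# One balanced set of four half samples for three strata. Rows are half
# samples, columns are strata; +1 takes the first sampled unit, -1 the second.
BALANCED_SET = [(1, 1, 1), (-1, 1, -1), (1, -1, -1), (-1, -1, 1)]


def all_samples(pop, n_h=2):
    """Every stratified sample that takes n_h units from each stratum."""
    return list(product(*(combinations(stratum, n_h) for stratum in pop)))


def ratio(sample, pop):
    """Combined ratio estimate: inflated y total over inflated x total."""
    num = den = 0.0
    for stratum, units in zip(pop, sample):
        weight = len(stratum) / len(units)  # inflation factor N_h / n_h
        num += weight * sum(y for _, y in units)
        den += weight * sum(x for x, _ in units)
    return num / den


def half_sample_variance(sample, pop):
    """Mean squared deviation of the four half-sample ratios about the
    full-sample ratio, multiplied by the finite population correction."""
    n = sum(len(units) for units in sample)
    big_n = sum(len(stratum) for stratum in pop)
    full = ratio(sample, pop)
    squares = []
    for signs in BALANCED_SET:
        half = [(units[0],) if s > 0 else (units[1],)
                for s, units in zip(signs, sample)]
        squares.append((ratio(half, pop) - full) ** 2)
    return (1 - n / big_n) * sum(squares) / len(squares)


def summarize(pop):
    """Bias, variance, and mean square error of the combined ratio estimate
    over all samples, plus the average half-sample variance estimate."""
    total_y = sum(y for stratum in pop for _, y in stratum)
    total_x = sum(x for stratum in pop for x, _ in stratum)
    true_ratio = total_y / total_x
    samples = all_samples(pop)
    estimates = [ratio(s, pop) for s in samples]
    mean_est = sum(estimates) / len(estimates)
    return {
        "samples": len(samples),
        "bias": mean_est - true_ratio,
        "variance": sum((e - mean_est) ** 2 for e in estimates) / len(estimates),
        "mse": sum((e - true_ratio) ** 2 for e in estimates) / len(estimates),
        "avg_variance_estimate":
            sum(half_sample_variance(s, pop) for s in samples) / len(samples),
    }


if __name__ == "__main__":
    for key, value in summarize(POPULATION).items():
        print(f"{key}: {value:.6g}")
```

With three units per stratum and two drawn from each, the enumeration yields 27 samples, and the mean square error equals the variance plus the squared bias, which is the relation that ties together the corresponding columns of table 6.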
For each of the six subclasses for which comparisons of percentages are presented, the estimate obtained from the entire sample can be compared with the average of the 16 balanced half-sample estimates. It follows from the argument used in developing the Quenouille-type estimate that the difference between the two can be viewed as an estimate of bias for the overall estimate. This approach has been used by Deming (1960, p. 425). These estimates of bias, expressed as fractions of the estimated standard errors, range from approximately .03 to about .38. These data are reassuring, but they are also too fragmentary to support any general conclusions concerning the bias of estimates for HES analysis.

In conclusion, attention should be drawn to another avenue of approach to estimation and variance estimation which is somewhat related to the Jackknife method, although this relation has not been explored or even noticed in the literature. Mickey (1959) presents a general method for obtaining finite population unbiased ratio and regression estimators, building on the work of Goodman and Hartley (1958). In addition, he constructs unbiased estimates of the estimator variance by the process of breaking up the sample into subsamples, more or less along the lines of Tukey's general version of the Jackknife. Williams (1958, 1961) specializes these results to regression estimators, and considers their properties in some detail. No detailed attention has been given to this topic in connection with the preparation of the present report.

SUMMARY

Sampling theory provides a wide variety of techniques which can be applied in sample design to obtain estimates having essentially maximum precision for fixed cost.
These techniques are particularly useful when populations are spread over wide geographic areas so that highly clustered samples must be obtained, and when extensive prior information about the population under study can be used in sample selection or in estimation. Such complex sample designs do, however, require extremely complicated and only approximate expressions for estimating from a sample the variance of survey estimates. If an extremely large number of widely differing types of estimates are to be made from a single large-scale survey, the burden of developing appropriate variance expressions, of programming these for a computer, and of carrying out the computations may become excessive.

The foregoing problems are intensified, although not appreciably changed in kind, if survey analysis is to go much beyond the estimation of population means, percentages, and totals. This is particularly true when the goals of analysis are to compare and study the relationships among subpopulations, or domains of study. Investigators are then interested in applying such standard statistical techniques as multiple regression or analysis of variance, and find that many of the assumptions required for the application of these techniques are violated by the complexities of the sample design. Some authors have used the term "analytical survey" to refer to any survey in which extensive comparisons are made among subpopulations; other authors reserve this term for surveys that are specifically designed to control the precision of these comparisons. There seemed to be little point in arguing this issue in the present report, since most surveys are multipurpose in character and it is usually impossible to design for a specific comparison. The major portions of the first three sections of the report are devoted to a literature survey and discussion of these topics.
Survey design (as opposed to the analysis of survey data) requires the use of "exact" variance expressions since it is necessary to balance the effects on precision of a wide variety of sampling techniques. It is possible, however, to bypass the corresponding detailed variance estimation techniques in the actual analysis of survey data through the use of replication. This approach is discussed in "Replication Methods of Estimating Variances," where an attempt is made to set forth its advantages and limitations. Emphasis has been placed upon variance estimation, although it is clear that covariances can also be treated in the same manner.

One of the most serious limitations of replication as applied to the analysis of complex sample survey data arises from the difficulty of obtaining a sufficient number of independent replications to assure reasonably stable variance estimates. This fact has been particularly obvious when a set of primary sampling units is stratified to a point where the sample design calls for the selection of two primary sampling units per stratum, thus leading to only two independent replicates. To overcome this difficulty the U.S. Bureau of the Census and the National Center for Health Statistics have been using a pseudoreplication method for variance estimation, called half-sample replication. This procedure is described in "Half-Sample Replication Estimates of Variances From Stratified Samples," and several improvements, balanced half-sample replication and partially balanced half-sample replication, are introduced in "Balanced Half-Sample Replication" and "Partially Balanced Half Samples." Still another, but related, variance estimation technique, the Jackknife, is described in "Jackknife Estimates of Variance From Stratified Samples."
The application of these methods is illustrated on an artificial set of data in "Half-Sample Replication and the Jackknife Method With Stratified Ratio-Type Estimators," and the appendix shows how balanced half-sample replication has been used in analyzing data obtained from the Health Examination Survey.

It would appear that replication and pseudoreplication are extremely useful procedures for obtaining variance estimates when one is making detailed analyses of data derived from complex sample surveys. Nevertheless, there are many unresolved problems relating to the application of these methods. Among these are the following:

(1) The effects of certain sampling techniques on variances will not be picked up, e.g., the selection of one primary unit per stratum and controlled selection;
(2) The variance estimate ordinarily refers to the average of the replication estimates, whereas the ordinary procedure is to use an overall sample estimate, and the two will not be the same except in the rare case that the estimate is linear in form;
(3) No investigations have been carried out of the applicability of these approaches to such problems as contingency table analyses and standard analysis of variance approaches; and
(4) It is extremely difficult to attack any of these problems analytically, and the development of empirical approaches that will have widespread applicability seems most difficult.

As a final point, we call attention to the problems that arise when survey data, as opposed to experimental data, are used to develop general scientific conclusions. This topic has not been more than mentioned in this report, but reference may be made to discussions by Yates (1960), Kish (1959), and Blalock (1964).

BIBLIOGRAPHY

Aoyama, H.: A study of the stratified random sampling. Annals of the Institute of Statistical Mathematics VI(1):1-36, 1954-55.
Aspin, A.: Tables for use in comparisons whose accuracy involves two variances, separately estimated. Biometrika 36:290-293, 1949.
Bartlett, M. S.: The information available in small samples. Proc.Camb.Phil.Soc. 32:560, 1936.
Blalock, H. M., Jr.: Causal Inferences in Nonexperimental Research. Chapel Hill. University of North Carolina Press, 1964.
Brillinger, D. R.: The asymptotic behavior of Tukey's general method of setting confidence levels (the Jackknife) when applied to maximum likelihood estimates. Review of International Statistical Institute 32:202-206, 1964.
Cochran, W. G.: Sampling Techniques, ed. 2. New York. John Wiley and Sons, Inc., 1963.
Cornfield, J.: On samples from finite populations. J.Am.statist.Ass. 39(226):236-239, June 1944.
Cornfield, J., and Tukey, J. W.: Average values of mean squares in factorials. Ann.math.Statist. 27:907-949, 1956.
Deming, W. E.: Some Theory of Sampling. New York. John Wiley and Sons, Inc., 1950.
Deming, W. E.: Sample Design in Business Research. New York. John Wiley and Sons, Inc., 1960.
Deming, W. E.: On the correction of mathematical bias by use of replicated designs. Metrika 6:37-42, 1963.
Deming, W. E., and Stephan, F. F.: On the interpretation of censuses as samples. J.Am.statist.Ass. 36(213):45-49, Mar. 1941.
Durbin, J.: Sampling theory for estimates based on fewer individuals than the number selected. Bulletin of the International Statistical Institute 36:113-119, 1958.
Durbin, J.: A note on the application of Quenouille's method of bias reduction to the estimation of ratios. Biometrika 46:477-480, 1959.
Durbin, J., and Stuart, A.: Differences in response rates of experienced and inexperienced interviewers. Journal of the Royal Statistical Society, Series A (General), 114, Pt. II, pp. 163-206, 1951.
Fisher, R. A.: The Design of Experiments, ed. 3. London. Oliver and Boyd, Ltd., 1942.
Gautschi, W.: Some remarks on systematic sampling. Ann.math.Statist. 28:385-394, 1957.
Garza-Hernandez, T.: An Approximate Test of Homogeneity on the Basis of a Stratified Random Sample. M.S. thesis, New York State School of Industrial and Labor Relations, Cornell University, 1961.
Goodman, L. A., and Hartley, H. O.: The precision of unbiased ratio-type estimators. J.Am.statist.Ass. 53(282):491-508, June 1958.
Goodman, R., and Kish, L.: Controlled selection - a technique in probability sampling. J.Am.statist.Ass. 45(251):350-372, Sept. 1950.
Gupta, S. S.: Probability integrals of multivariate normal and multivariate t. Ann.math.Statist. 34:792-828, 1963.
Gurney, M.: The Variance of the Replication Method for Estimating Variances for the CPS Sample Design. Unpublished memorandum, U.S. Bureau of the Census, 1962.
Gurney, M.: McCarthy's Orthogonal Replications for Estimating Variances, With Grouped Strata. Unpublished memorandum, U.S. Bureau of the Census, 1964.
Hansen, M. H., Hurwitz, W. N., and Madow, W. G.: Sample Survey Methods and Theory, Vols. I and II. New York. John Wiley and Sons, Inc., 1953.
Hanson, R. H., and Marks, E. S.: Influence of the interviewer on the accuracy of survey results. J.Am.statist.Ass. 53(283):635-655, Sept. 1958.
Hartley, H. O.: Analytic Studies of Survey Data. Instituto di Statistica, Rome, Volume in onore di Corrado Gini. Ames, Iowa. Statistical Laboratory, Iowa State University of Science and Technology. Reprint Series 63. 1959.
Jones, H. L.: Investigating the properties of a sample mean by employing random subsample means. J.Am.statist.Ass. 51(273):54-83, Mar. 1956.
Jones, H. L.: The Jackknife method. Proceedings of the IBM Scientific Computing Symposium on Statistics. White Plains, New York. IBM Data Processing Division, 1965.
Keyfitz, N.: A factorial arrangement of comparisons of family size. Am.J.Soc. 58(5):470-480, Mar. 1953.
Keyfitz, N.: Estimates of sampling variance where two units are selected from each stratum. J.Am.statist.Ass. 52(280):503-510, Dec. 1957.
Kish, L.: Confidence intervals for clustered samples. American Sociological Review 22(2):154-165, Apr. 1957.
Kish, L.: Some statistical problems in research design. American Sociological Review 24:328-338, June 1959.
Kish, L.: Efficient allocation of a multi-purpose sample. Econometrica 29:363-385, 1961.
Kish, L.: Variances for indexes from complex samples. Proceedings of the Social Statistics Section of the American Statistical Association, 1962, pp. 190-199.
Kish, L.: Survey Sampling. New York. John Wiley and Sons, Inc., 1965.
Kish, L., and Hess, I.: On variances of ratios and their differences in multistage samples. J.Am.statist.Ass. 54(286):416-446, June 1959.
Kish, L., Namboodiri, N. K., and Pillai, R. K.: The ratio bias in surveys. J.Am.statist.Ass. 57(300):863-876, Dec. 1962.
Koop, J. C.: On theoretical questions underlying the technique of replicated or interpenetrating samples. Proceedings of the Social Statistics Section of the American Statistical Association, 1960, pp. 196-205.
Lahiri, D. B.: Technical paper on some aspects of the development of the sample design, in P. C. Mahalanobis 'Technical Paper No. 5 on the National Sample Survey.' Sankhya 14:264-316, 1954.
Lauh, E., and Williams, W. H.: Some small sample results for the variance of a ratio. Proceedings of the Social Statistics Section of the American Statistical Association, 1963, pp. 273-283.
Madow, W. G., and Madow, L. H.: On the theory of systematic sampling. Ann.math.Statist. XV:1-24, 1944.
McCarthy, P. J.: Sampling considerations in the construction of price indexes with particular reference to the United States Consumer Price Index. U.S. Congress, Joint Economic Committee, Government Price Statistics, Hearings. Washington. U.S. Government Printing Office. Part 1, pp. 197-232, 1961.
McCarthy, P. J.: Stratified sampling and distribution-free confidence intervals for a median. Accepted for publication in J.Am.statist.Ass., 1965.
Meier, P.: Variance of a weighted mean. Biometrics 9(1):59-73, Mar. 1953.
Mickey, M. R.: Some finite population unbiased ratio and regression estimators. J.Am.statist.Ass. 54(287):594-612, Sept. 1959.
Miller, R. G., Jr.: A trustworthy Jackknife. Ann.math.Statist. 35(4):1594-1605, 1964.
National Center for Health Statistics: Cycle I of the Health Examination Survey, sample and response, United States, 1960-62. Vital and Health Statistics. PHS Pub. No. 1000-Series 11-No. 1. Public Health Service. Washington. U.S. Government Printing Office, Apr. 1964.
National Center for Health Statistics: Blood pressure of adults by age and sex, United States, 1960-62. Vital and Health Statistics. PHS Pub. No. 1000-Series 11-No. 4. Public Health Service. Washington. U.S. Government Printing Office, June 1964.
National Center for Health Statistics: Blood pressure of adults by race and area, United States, 1960-62. Vital and Health Statistics. PHS Pub. No. 1000-Series 11-No. 5. Public Health Service. Washington. U.S. Government Printing Office, July 1964.
National Center for Health Statistics: Plan and initial program of the Health Examination Survey. Vital and Health Statistics. PHS Pub. No. 1000-Series 1-No. 4. Public Health Service. Washington. U.S. Government Printing Office, July 1965.
National Health Survey: The statistical design of the Health Household-Interview Survey. Health Statistics. PHS Pub. No. 584-A2. Public Health Service. Washington. U.S. Government Printing Office, July 1958.
Okamoto, M.: Chi-square statistic based on the pooled frequencies of several observations. Biometrika 50:524-528, 1963.
Plackett, R. L., and Burman, J. P.: The design of optimum multifactorial experiments. Biometrika 33:305-325, 1943-46.
Quenouille, M. H.: Notes on bias in estimation. Biometrika 43:353-360, 1956.
Robson, D. S., and Whitlock, J. H.: Estimation of a truncation point. Biometrika 51:33-39, 1964.
Satterthwaite, F. E.: An approximate distribution of estimates of variance components. Biometrics 2:110-114, 1946.
Sedransk, J.: Sample Size Determination in Analytical Surveys. Ph.D. dissertation, Harvard University, 1964.
Sedransk, J.: Analytical Surveys With Cluster Sampling. Unpublished paper, Iowa State University, 1964a.
Sedransk, J.: A Double Sampling Scheme for Analytical Surveys. Unpublished paper, Iowa State University, 1964b.
Sukhatme, P. V.: Sampling Theory of Surveys With Applications. Ames, Iowa. Iowa State College Press, 1954.
Tukey, J. W.: Bias and confidence in not-quite large samples. Abstracted in Ann.math.Statist. 29:614, 1958.
U.S. Bureau of the Census: The Current Population Survey, A Report on Methodology. Technical Paper No. 7. Washington. U.S. Government Printing Office, 1963.
Walsh, J. E.: Concerning the effect of intraclass correlation on certain significance tests. Ann.math.Statist. 18(1):88-96, 1947.
Walsh, J. E.: On the "information" lost by using a t-test when the population variance is known. J.Am.statist.Ass. 44(245):122-125, Mar. 1949.
Weibull, M.: The distributions of t- and F-statistics and of correlation and regression coefficients in stratified samples from normal populations with different means. Skand.AktuarTidskr. 36(suppl. 1, 2):9-106, 1953.
Welch, B. L.: The generalization of "Student's" problem when several different population variances are involved. Biometrika 34:28-35, 1947.
Williams, W. H.: Unbiased Regression Estimators and Their Efficiencies. Ph.D. dissertation, Iowa State College, 1958.
Williams, W. H.: Generating unbiased ratio and regression estimators. Biometrics 17(2):267-274, June 1961.
Yates, F.: Sampling Methods for Censuses and Surveys, ed. 3. New York. Hafner Publishing Co., 1960.

APPENDIX

ESTIMATION OF RELIABILITY OF FINDINGS FROM THE FIRST CYCLE OF THE HEALTH EXAMINATION SURVEY

Survey Design

The sampling plan of the first cycle of the Health Examination Survey followed a highly stratified, multistage probability design in which a sample of the civilian, noninstitutional population of the conterminous United States, 18-79 years of age, was selected.
In the first stage of this design, the 1,900 primary sampling units (PSU's), geographic units into which the United States was divided, were grouped into 42 strata. Here a PSU is either a standard metropolitan statistical area (SMSA) or one to three contiguous counties. By virtue of their size in population, the six largest SMSA's were considered to be separate strata and were included in the first-stage sample with certainty. As New York was about three times the size of other strata and Chicago twice the average size, New York was counted as three strata and Chicago as two, making a total of nine certainty strata. One PSU was selected from among the PSU's in each of the 33 noncertainty strata to complete the first-stage sample.

Later stages resulted in the random selection of clusters of typically four persons from segments of households within the sample PSU's. The total sampling included some 7,700 persons in 29 different States.

All examination findings for sample persons are included in tabulations as weighted frequencies, the weight being a product of the reciprocal of the probability of selecting the individual, an adjustment for nonresponse cases, a stratified ratio adjustment of the first-stage sample to 1960 Census population controls within 6 region-density classes, and a poststratified ratio adjustment at the national level to independent population controls for the midsurvey period (October 1961) within 12 age-sex classes.

The sample design is such that each person has roughly the same probability of selection. However, there were sufficient deviations from that principle in the selection and through the technical adjustments to produce the following distribution of sample weights as required to inflate to U.S. civilian, noninstitutional population levels:

                                       1-digit      Percent distribution
Weight class        Class average   relative weight  of examined persons
 7,000-20,999          14,000              1                78.7
21,000-34,999          28,000              2                18.4
35,000-48,999          42,000              3                 1.9
49,000-62,999          56,000              4                 0.6
63,000-76,999          70,000              5                 0.0
77,000-90,999          84,000              6                 0.4

A more detailed description of the sampling plan and estimation procedures is included in Vital and Health Statistics, Series 11, No. 1, 1964: "Cycle I of the Health Examination Survey, Sample and Response."

Requirements of a Variance Estimation Technique

The Health Examination Survey is obviously complex in its sampling plan and estimation procedure. A method for estimating the reliability of findings is required which reflects both the losses from clustering sample cases at two stages and the gains from stratification, ratio estimation, and poststratification. Ideally, an appropriate method once programmed for an electronic computer can be used for a wide range of statistics with little or no modification to the program.

This feature of adaptability is an important and special requirement in HES. The small staff of analysts in the Division of Health Examination Statistics typically works on only a few sections of the examination and laboratory results at a time. Consequently tabulation specifications and edited input for a sizable variety of report topics are not available until shortly prior to the need for estimates of sampling error. New tables of sampling error have been prepared for each of the 12 reports published to date and at least a dozen more will be prepared before the Cycle I publication series is completed.

Development of the Replication Technique

The method adopted for estimating variances in the Health Examination Survey is the half-sample replication technique. The method was developed at the U.S. Bureau of the Census prior to 1957 and has at times been given limited use in the estimation of the reliability of results from the Current Population Survey. A description of the half-sample replication technique, however, has not previously been published, although some references to the technique have appeared in the literature.

The half-sample replication technique is particularly well suited to the Health Examination Survey because the sample, although complex in design, is relatively small (7,000 cases) in sample size. Only a few minutes are required for a pass of all cases through the computer. This feature permitted the development of a variance estimation program which is an adjunct to the general computer tabulation program. Every data table comes out of the computer with a table of desired estimates of aggregates, means, or distributions together with a table identical in format but with the estimated variances instead of the estimated statistics. The computations required by the method are indeed simple and the internal storage requirements are well within the limitation of an IBM 1401-1410 computer system.

The variance estimates computed for the first few reports of Cycle I findings were based on 20 random half-sample replications. A half sample was formed by randomly selecting one sample PSU from each of 16 pairs of sample PSU's, the sample representatives of 16 pairings of similar noncertainty strata, and 8 of 16 random groups of clusters of sample persons selected from the 9 certainty strata and the San Francisco SMSA, the largest sample PSU of the 33 noncertainty strata.

The concept of balanced half samples is utilized in present variance estimates for HES. The variance estimates are derived from 16 balanced half-sample replications. The composition of the 16 half samples, shown in tables I and II, was determined by an orthogonal plan. In the tables an "X" indicates that the PSU or random group was included in the half sample.
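An orthogonal plan of the kind referred to above can be illustrated with a Hadamard matrix. The sketch below uses the Sylvester doubling construction, which is an assumption made for illustration; the report does not say how the plan of tables I and II was generated, and the function names are invented here.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of a power-of-two order, built by Sylvester doubling."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def half_sample_plan(n_pairs, n_reps):
    """Rows are pairs, columns are replications. True means the first unit of
    the pair is in the half sample, False means the second unit is."""
    if n_pairs > n_reps:
        raise ValueError("need at least as many replications as pairs")
    h = sylvester_hadamard(n_reps)
    return [[v > 0 for v in h[pair]] for pair in range(n_pairs)]


def is_balanced(plan):
    """True when every two pairs agree in exactly half of the replications,
    i.e. their +1/-1 codings are orthogonal."""
    signs = [[1 if v else -1 for v in row] for row in plan]
    return all(
        sum(a * b for a, b in zip(signs[i], signs[j])) == 0
        for i in range(len(signs))
        for j in range(i + 1, len(signs))
    )


if __name__ == "__main__":
    plan = half_sample_plan(16, 16)
    for row in plan:
        print("".join("X" if v else "." for v in row))
    print("balanced:", is_balanced(plan))
```

Using all 16 rows of an order-16 matrix means that one pair contributes the same unit to every replication, while each of the other pairs contributes each of its units to exactly 8 replications.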
The construction using 16 balanced half-sample replications results from viewing the certainty and noncertainty strata as independent universes. This is only approximately true as the poststratified ratio adjustment to independent population controls is made across both certainty and noncertainty strata. An alternative construction, and perhaps a slightly more accurate one, would have been to use 24 balanced half-sample replications, viewing the 16 pairs of noncertainty strata and 8 pairs of randomly grouped clusters from the certainty strata as a single universe.

Table I. Composition of the 16 balanced half-sample replicates: certainty strata

Rows: pairs 1-8, each pair consisting of two random groups of segments in certainty PSU's. Columns: balanced half-sample replications 1-16. An "X" marks the random group included in the replication. [The pattern of X's is not legible in this copy.]

Table II. Composition of the 16 balanced half-sample replicates: noncertainty strata

Rows: the sample PSU's listed below, in pairs. Columns: balanced half-sample replications 1-16. An "X" marks the PSU included in the replication. [Apart from pair 1, where Pittsburgh is marked in all 16 replications, the pattern of X's is not legible in this copy.]

Pair   Sample PSU's
  1    Pittsburgh, Pa., SMSA; Providence, R.I., SMSA (see note)
  2    Columbus, Ohio, SMSA; Akron, Ohio, SMSA
  3    York, Pa., SMSA; Muskegon-Ottawa, Mich.
  4    Cayuga-Wayne, N.Y.; York, Me.
  5    Baltimore, Md., SMSA; Louisville, Ky., SMSA
  6    Nashville, Tenn., SMSA; San Antonio, Tex., SMSA
  7    Savannah, Ga., SMSA; Midland, Tex., SMSA
  8    Barbour, Ala.; Independent cities in Virginia in 1950
  9    Brooks-Echols-Lowndes, Ga.; Jackson-Lawrence, Ark.
 10    Horry, S.C.; Franklin-Nash, N.C.
 11    Lafayette-Panola, Miss.; E. Feliciana-St. Helena, La.
 12    San Jose, Calif., SMSA; Minneapolis-St. Paul, SMSA
 13    Ft. Wayne, Ind., SMSA; Topeka, Kans., SMSA
 14    Grant, Wash.; Apache-Navajo, Ariz.
 15    Dunklin-Pemiscot, Mo.; Franklin-Jackson-Williamson, Ill.
 16    Bates, Mo.; Bayfield, Wis.

NOTE: Providence, R.I., SMSA figures into the variance computations as it is always a part of the right-hand side of the difference z'_i - z' in the variance equation.

After the composition of each of the balanced half samples was determined, the resulting half samples were then separately subjected to all the estimation procedures and tabulations used to produce the final estimates from the entire sample.

An estimated variance $s_{z'}^{2}$ of an estimated statistic $z'$ of the parameter $z$ is obtained by applying the formula

$$ s_{z'}^{2} = \frac{1}{16} \sum_{i=1}^{16} \left( z'_i - z' \right)^{2} $$

where $z'_i$ is the estimate of $z$ based on the ith half sample and $z'$ is the estimate of $z$ based on the entire sample.

Table III. Estimates of the percent of demographic subgroups of the U.S. adult population with hypertension and estimates of variance in percent

Columns: (1) HES estimate of percent; (2) replication estimate of variance; (3) SRS estimate of percent; (4) SRS estimate of variance; (5) sample persons examined; (6) ratio of variances.

Demographic subgroup                                       (1)       (2)       (3)       (4)       (5)       (6)
Males aged 35-44 with income less than $2,000            19.69   34.1056     22.22   27.4348        63    1.2432
Females aged 55-64 with income of $4,000-$6,999          25.75   17.2225     24.49   18.8697        98    0.9127
Females with income of $10,000+                          11.75    4.7524     11.95    2.7326       385    1.7391
White males with income of $4,000-$6,999                 12.21    1.2321     12.39    1.2114       896    1.0171
White females with income of $7,000-$9,999               11.48    5.7600     10.59    2.0066       934    2.8705
Negroes with income of $2,000-$3,999                     23.08   10.2400     23.41    8.7474       205    1.1706
Males aged 18-24 with 9-12 years of school                2.45    1.0609      2.37    0.9151       253    1.1593
Females aged 25-34 with none or less than 5 years
  of school                                               2.49    5.3361      3.33   10.7407        30    0.4968
Males with 5-8 years of school                           17.82    1.8225     17.92    1.7805       826    1.0236
White males with 13+ years of school                      9.34    2.3716      9.01    1.4211       577    1.6688
White females with 9-12 years of school                  10.33    0.5329      9.54    0.5187     1,666    1.0274
Negroes with 5-8 years of school                         30.73   10.3684     31.19    7.4948       286    1.3834
Males aged 65-74 who are married                         27.26   19.2721     26.87    9.7751       201    1.9716
Females aged 35-44 who are separated                     18.80   78.3225     22.22   96.0217        18    0.8157
Females who are divorced                                 13.47   13.7641     14.50    9.4658       131    1.4541
White males who are single                                9.05    1.9321      9.98    2.2394       401    0.8628
White females who are widowed                            35.77   15.1321     33.55    7.1223       313    2.1246
Negroes who are married                                  27.79    5.1076     28.79    3.8316       535    1.3330
Males aged 55-64 who are craftsmen                      15.15?   29.9209     14.29   19.4363        63    1.5394
Females aged 35-44 who are private household
  workers                                                10.60   21.2521     10.67   12.7052        75    1.6727
Males who are laborers                                   19.93   11.4921     18.25    5.6730       263    2.0258
White males who are farmers or farm managers             10.89    2.8900     11.39    6.3889       158    0.4523
White females who are clerical and sales workers          9.80    2.8561      9.78    1.9522       451    1.4630
Negroes who are professional workers                     16.57   31.2481     16.22   36.7202        37    0.8510
Males aged 25-34 who are employed in construction
  and mining                                              7.46    9.3025      9.52   13.6775        63    0.6801
Females aged 18-24 who are employed in wholesale
  and retail trade                                        2.29    6.8121      2.27    5.0477        44    1.3495
Males who are employed in transportation                 11.29    4.1209     10.76    4.3067       223    0.9569
White males who are employed in finance,
  insurance and real estate                              12.34   24.5025     11.54   13.0860        78    1.8724
White females who are employed in services               10.48    3.0276     10.59    2.3324       406    1.2981
Negroes who are employed in Government                   22.19    9.3636     22.65    9.6800       243    0.9673

Table IV. Estimates of differences in percent between demographic subgroups of the U.S. adult population with hypertension and estimates of variance in percent

Columns: (1) HES estimate of percent; (2) replication estimate of variance; (3) SRS estimate of percent; (4) SRS estimate of variance; (5) sample persons examined; (6) ratio of variance; (7) average of replicate percents.

Demographic subgroup                                       (1)       (2)       (3)       (4)       (5)       (6)       (7)
1. Adults with income less than $2,000                   26.23    4.5796     26.66    1.7791     1,099    2.5741     26.06
2. Adults with income of $10,000+                        11.75    1.3924     12.17    1.3933       764    0.9994     11.34
   Difference (1-2)                                      14.48    5.2850     14.48    3.1784              1.6628     15.72
3. Males with income $2,000-$3,999                       15.36    4.2025     15.31    2.4499       535    1.7154     14.75
4. Males with income $4,000-$6,999                       12.83    1.4641     13.24    1.1797       975    1.2411     12.35
   Difference (3-4)                                       2.33    4.5600      2.27    3.6296              1.2563      2.40
5. Females aged 55-64 with income $4,000-$6,999          25.75   17.2225     24.49   18.8697        98    0.9127     25.91
6. Females aged 55-64 with income $7,000-$9,999          25.25   99.8001     24.32   49.7500        37    2.0060     26.67
   Difference (5-6)                                       0.50  110.1069      0.17   68.6197              1.6046      0.66

Computer Output

For the Health Examination Survey the variance tabulations and prepublication tabulations of estimates are derived from the same computer output. Since the findings are generally expressed as rates, means, or percentages, each output "table" actually consists of three tables: the statistic of interest, such as the percent of persons with hypertension; the numerator of each cell in the "table"; and the denominator of each cell. The cells of the table are a cross-classification of the statistic by age and sex with one of about a dozen demographic variables for which information was collected in the survey. The analyst can also receive a printout of the same three tables for each of the 16 half-sample replications. The replication tables are useful when estimates of the variance of estimated differences between statistics or of such derived statistics as medians are needed or for evidence to support or refute a hypothesis concerning observed patterns in the data.
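The variance formula given above, and the use of the replication tables for differences, can be sketched as follows. The function names are hypothetical, and the sign convention for the bias estimate (average of the replicate estimates minus the full-sample estimate) is an assumption; the report states only that the difference between the two can be viewed as an estimate of bias.

```python
from math import sqrt


def replication_variance(replicates, full_estimate):
    """Mean squared deviation of the half-sample estimates z'_i about the
    full-sample estimate z' (divisor = number of replicates, 16 in HES)."""
    return sum((z_i - full_estimate) ** 2 for z_i in replicates) / len(replicates)


def difference_variance(replicates_a, replicates_b, full_a, full_b):
    """Variance of a difference between two statistics, obtained by applying
    the same formula to the replicate-by-replicate differences."""
    diffs = [a - b for a, b in zip(replicates_a, replicates_b)]
    return replication_variance(diffs, full_a - full_b)


def bias_estimate(replicates, full_estimate):
    """Average of the replicate estimates minus the full-sample estimate
    (assumed sign convention)."""
    return sum(replicates) / len(replicates) - full_estimate


def relative_bias(average_of_replicates, full_estimate, variance):
    """Absolute bias estimate as a fraction of the estimated standard error."""
    return abs(average_of_replicates - full_estimate) / sqrt(variance)


if __name__ == "__main__":
    # Row 1 of table IV: HES estimate 26.23, replicate average 26.06,
    # replication variance 4.5796.
    print(round(relative_bias(26.06, 26.23, 4.5796), 3))
```

For row 1 of table IV this gives a bias estimate of roughly 8 percent of the estimated standard error.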
In addition to the "table" of findings, the output also includes a "table" of estimated standard errors (of the statistic, its numerator, and its denominator), a "table" of estimated relative variances (the estimated variance of a statistic divided by the square of the statistic), and a "table" of the number of sample observations on which the statistic, its numerator, and its denominator are based. The last table together with the others gives some insight into the effect of the sampling plan and estimation procedures.

Illustration

The figures in table III are estimates from the Health Examination Survey of the percent of demographic subgroups of the adult population with hypertension and their estimated variances. The official HES estimates, based on unbiased inflation factors adjusted for nonresponse and ratio adjusted to independent population controls, are shown in column 1. Estimates of their variance derived from 16 balanced half-sample replications, treating the estimated percent of replicate $i$ as $z'_i$, are shown in column 2. For comparison, the estimates of percent and variance which would have resulted if the 6,600 examined persons had been a simple random sample of the U.S. population, with the sample size in each demographic subgroup or domain considered to be fixed, are shown in columns 3 and 4. The number of examined sample persons in the demographic subgroup or domain (the bases of the percents) is shown in column 5. The ratios of the two variance estimates are shown in column 6. These ratios are indicative of the net effect of clustering and stratification in the sample design, deviations from equal probabilities of selection, and nonresponse and ratio adjustment in the estimation procedures, and reflect as well the variance of the estimated variance. The median ratio of replication variance to simple random variance (i.e., of an appropriate variance to a much cruder measure) is 1.30. The mean ratio is 1.31.
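The column 6 ratios and the relative variances mentioned above are simple quotients, and a short Python sketch may make the arithmetic concrete. The three rows are copied from the portion of table III reproduced in this report; the function names are hypothetical, and the median printed here covers only these three rows, not the full table.

```python
from statistics import median


def variance_ratio(replication_var, srs_var):
    """Column 2 divided by column 4: replication variance over SRS variance."""
    if srs_var <= 0:
        raise ValueError("simple random sampling variance must be positive")
    return replication_var / srs_var


def relative_variance(variance, statistic):
    """Estimated variance divided by the square of the statistic."""
    if statistic == 0:
        raise ValueError("statistic must be nonzero")
    return variance / statistic ** 2


# (label, HES percent, replication variance, SRS variance) from table III.
rows = [
    ("Negroes who are professional workers", 16.57, 31.2481, 36.7202),
    ("Males who are employed in transportation", 11.29, 4.1209, 4.3067),
    ("Negroes who are employed in Government", 22.19, 9.3636, 9.6800),
]

ratios = [variance_ratio(rep, srs) for _, _, rep, srs in rows]
for (label, pct, rep, _), ratio in zip(rows, ratios):
    print(label, round(ratio, 4), round(relative_variance(rep, pct), 4))
print("median of these three ratios:", round(median(ratios), 4))
```

The rounded ratios agree with the published column 6 values for these rows.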
As one would expect, there is a tendency for the ratio to be higher for larger values of the statistic, although this tendency is not very pronounced.

The criterion for hypertension was 160 mm. Hg. or over systolic blood pressure and 95 mm. Hg. or over diastolic. The average of three blood pressures taken over a 30-minute period was used for each examined person.

Table IV is similar to table III but it also includes the estimated difference in percent between two demographic subgroups. Estimates of variance of the difference between two estimated percents which would have resulted if the sample had been a simple random sample were obtained by summing the estimated variances of the two estimated percents. The average of the estimated percents over the 16 replicates is shown in column 7.

U.S. GOVERNMENT PRINTING OFFICE: 1966 O-211-926

OUTLINE OF REPORT SERIES FOR VITAL AND HEALTH STATISTICS
Public Health Service Publication No. 1000

Series 1. Programs and collection procedures. Reports which describe the general programs of the National Center for Health Statistics and its offices and divisions, data collection methods used, definitions, and other material necessary for understanding the data. Reports number 1-4

Series 2. Data evaluation and methods research. Studies of new statistical methodology including: experimental tests of new survey methods, studies of vital statistics collection methods, new analytical techniques, objective evaluations of reliability of collected data, contributions to statistical theory. Reports number 1-15

Series 3. Analytical studies. Reports presenting analytical or interpretive studies based on vital and health statistics, carrying the analysis further than the expository types of reports in the other series.
Reports number 1-4

Series 4. Documents and committee reports. Final reports of major committees concerned with vital and health statistics, and documents such as recommended model vital registration laws and revised birth and death certificates. Reports number 1 and 2

Series 10. Data From the Health Interview Survey. Statistics on illness, accidental injuries, disability, use of hospital, medical, dental, and other services, and other health-related topics, based on data collected in a continuing national household interview survey. Reports number 1-29

Series 11. Data From the Health Examination Survey. Statistics based on the direct examination, testing, and measurement of national samples of the population, including the medically defined prevalence of specific diseases, and distributions of the population with respect to various physical and physiological measurements. Reports number 1-12

Series 12. Data From the Health Records Survey. Statistics from records of hospital discharges and statistics relating to the health characteristics of persons in institutions, and on hospital, medical, nursing, and personal care received, based on national samples of establishments providing these services and samples of the residents or patients. Reports number 1-4

Series 20. Data on mortality. Various statistics on mortality other than as included in annual or monthly reports: special analyses by cause of death, age, and other demographic variables, also geographic and time series analyses. Reports number 1

Series 21. Data on natality, marriage, and divorce. Various statistics on natality, marriage, and divorce other than as included in annual or monthly reports: special analyses by demographic variables, also geographic and time series analyses, studies of fertility.
Reports number 1-8

Series 22. Data From the National Natality and Mortality Surveys. Statistics on characteristics of births and deaths not available from the vital records, based on sample surveys stemming from these records, including such topics as mortality by socioeconomic class, medical experience in the last year of life, characteristics of pregnancy, etc. Reports number 1

For a list of titles of reports published in these series, write to: National Center for Health Statistics, U.S. Public Health Service, Washington, D.C. 20201

NATIONAL CENTER FOR HEALTH STATISTICS
Series 2, Number 15

VITAL and HEALTH STATISTICS
DATA EVALUATION AND METHODS RESEARCH

Evaluation of Psychological Measures Used in the Health Examination Survey of Children Ages 6-11

A critical review of literature pertaining to the psychological measures used in Cycle II, with recommendations concerning validity, reliability, and applicability to the Survey data.

Washington, D.C. March 1966

U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE
John W. Gardner, Secretary
Public Health Service
William H. Stewart, Surgeon General

Public Health Service Publication No. 1000-Series 2-No. 15
For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C. 20402 - Price 45 cents

NATIONAL CENTER FOR HEALTH STATISTICS
FORREST E. LINDER, Ph. D., Director
THEODORE D. WOOLSEY, Deputy Director
OSWALD K. SAGEN, Ph. D., Assistant Director
WALT R. SIMMONS, M.A., Statistical Advisor
ALICE M. WATERHOUSE, M.D., Medical Advisor
JAMES E. KELLY, D.D.S., Dental Advisor
LOUIS R. STOLCIS, M.A., Executive Officer
OFFICE OF HEALTH STATISTICS ANALYSIS: Iwao M. Moriyama, Ph. D., Chief
DIVISION OF VITAL STATISTICS: Robert D. Grove, Ph. D., Chief
DIVISION OF HEALTH INTERVIEW STATISTICS: Philip S. Lawrence, Sc.
D., Chief
DIVISION OF HEALTH RECORDS STATISTICS: Monroe G. Sirken, Ph. D., Chief
DIVISION OF HEALTH EXAMINATION STATISTICS: Arthur J. McDowell, Chief
DIVISION OF DATA PROCESSING: Sidney Binder, Chief

Library of Congress Catalog Card Number 65-62272

FOREWORD

The practice of comparing one individual with another is as old as recorded history. Man's earliest writings are replete with statements indicating that he has long viewed his fellow man in terms of whether or not he measured up to an expected ideal. Similarly, the performance of a man has traditionally been described in terms of how it compares with that of another man. However, subjecting these "known" differences to the scientific method of inquiry is a recent development.

In the area of individual differences in behavior and psychological characteristics, research has progressed from the simple to the complex. The first studies dealt with the simple functions of speed of reaction time. Today, studies are aimed at measuring individual differences in the complex functions of motivation, ego-integration, and cognition.

The technology for measuring behavior has developed in a similar manner. Instruments are available which, most scientists will agree, accurately measure the speed with which an individual taps his finger in response to a given signal. Scientists do not agree, however, on the adequacy of the equipment used to measure individual differences in intelligence. Moreover, there will even be some disagreement over the use of the word "intelligence" to describe certain aspects of behavior.

Because of the present state of the art of psychological measurement, studies such as those conducted by the Health Examination Survey encounter difficult problems in attempting to estimate the prevalence of various mental health factors in the population.

The Health Examination Survey is part of the U.S.
National Health Survey, authorized by Congress in 1956 to collect information about the Nation's health. Data are collected by direct examinations of individual persons chosen to constitute a probability sample of some segment of the total population of the United States.

The first sample represented the adult population aged 18 through 79 years. Since the study was primarily concerned with the prevalence of chronic physical disease, the examination did not include psychological measurements.

The second sample consisted of noninstitutionalized children ages 6 through 11, among whom the incidence of chronic disease is insignificant. The important health factors in this group are found in those functions which result in growth and development. These, then, were the factors to be studied.

Many authorities in the field of growth and development contributed to the planning phase of the Survey. Although they generally agreed on what factors should be measured, they could not agree on how the measurements should be obtained. They did conclude that present instruments were inadequate but that these were the only tools available.

The tests which are discussed in the following report were those selected for use by the Health Examination Survey. In choosing these instruments, primary consideration was given to those which best met the following criteria:

1. They were capable of yielding data in those areas considered most important to the study of growth and development.
2. They would produce data in a form which would be meaningful to the individuals responsible for children's health.
3. They were suitable for use in a survey operation where examiners change frequently, where only 1 hour is available to conduct the examination, and where examining conditions are less than optimal.

The selected instruments are not ideal, but they are felt to be the best compromise offered by the present state of the art of measurement. How much was compromised?
What can be said about the growth and development of children from the data obtained by the use of these instruments? Through a contractual arrangement with Dr. Sells, the first step has been taken in answering these questions.

Lois R. Chatham, Ph.D.
Psychological Advisor
Division of Health Examination Statistics

CONTENTS

Foreword
Introduction
I. The Wechsler Intelligence Scale for Children, the Vocabulary and Block Design Subtests
   Description of the WISC
   Research on Short Forms of the WISC
   Reliability and Stability
   Validity
   Factors Affecting WISC Scores
   Anxiety
   Sex Differences
   Qualitative Differences by Level
   Developmental Factors
   Special Groups
   Reading Disability
   Auditory Disability
   Visually Handicapped
   Stutterers
   Cerebral Palsy
   Organic Impairment of Central Nervous System
   Gifted
   Mentally Retarded and Defective
   Bilingual
   Negro
   Socioeconomic Status
   Comparison of WISC and Stanford-Binet IQ's
   Summary and Conclusions
   Bibliography
II. The Wide Range Achievement Test, the Oral Reading and Arithmetic Subtests
   Evaluative Criteria
   1946 Edition of WRAT
   Research on the 1946 WRAT
   Reading
   Arithmetic
   1963 Edition of WRAT
   Validity and Norms
   Comparison of the Two Editions
   Validation of 1963 Edition
   Validity Variances
   Validity Data in 1963 Manual
   Grade Equivalents
   Standard Scores
   Percentiles
   Summary and Conclusions
   Bibliography
III. The Goodenough Draw-A-Man Test
   Background and Development
   Rationale
   Point-Scoring System
   Standardization
   Perspective
   Evaluation of Intelligence by Human Figure Drawings
   Effective Range
   Relation to Artistic Ability
   Perturbing Factors
   [entry illegible in source]
   Sex Differences
   Personality Study by Children's Drawings
   Research on the Goodenough Test
   Reliability Studies
   Correlations With Other Tests
   The Harris Revision of the Goodenough Test
   Comparison of Goodenough and Goodenough-Harris Scores
   Recommendation
   Summary and Conclusions
   Bibliography
IV. The Thematic Apperception Test
   Review of the Literature on the TAT
   Overview
   Research Demonstrating Developmental Factors
   Other Relevant Research
   Prospects for Developing an Objective Scoring Key for the Survey's TAT
   Bibliography
V. Total Psychological Test Battery
VI. Cross-Disciplinary Analyses
   Data Available
   Analyses Indicated
   Growth Indexes
   Other Factors Related to Test Scores
Acknowledgments
Glossary of Abbreviations

IN THIS REPORT the psychological procedures used in the Health Examination Survey conducted between June 1963 and December 1965 for children ages 6 through 11 are critically evaluated. In his analysis, the author combines his own professional competence with the information obtained in an extensive survey of literature pertaining to the four procedures used: the Wechsler Intelligence Scale for Children, the Wide Range Achievement Test, a modification of the Draw-A-Man Test, and the Thematic Apperception Test.
The result is an evaluation of the instruments which is made in terms of their validity, reliability, and applicability for use in the Health Examination Survey. Finally, the author points out the strengths and weaknesses of each procedure and makes recommendations concerning the eventual use of data obtained in the Survey.

SYMBOLS

Data not available: ---
Category not applicable: ...
Quantity zero: -
Quantity more than 0 but less than 0.05: 0.0
Figure does not meet standards of reliability or precision: *

EVALUATION OF PSYCHOLOGICAL MEASURES USED IN THE HEALTH EXAMINATION SURVEY OF CHILDREN AGES 6-11

S. B. Sells, Ph.D., Institute of Behavioral Research, Texas Christian University

INTRODUCTION

This report is the outcome of a contract with the National Center for Health Statistics. The purpose of the contract was to obtain an objective critical evaluation of the psychological procedures chosen for use in the Health Examination Survey of children ages 6 through 11. The objectives may be summarized as follows:

1. To prepare a critical review concerning the development and use of the psychological procedures used in Cycle II based on available literature and unpublished reports (theses, dissertations, and others). These measures include the Vocabulary and Block Design subtests of the Wechsler Intelligence Scale for Children, the Oral Reading and Arithmetic subtests of the Wide Range Achievement Test (1963 edition), the Draw-A-Man Test, and cards 1, 2, 5, 8BM, and 16 of the Thematic Apperception Test.
2. To make recommendations concerning the appropriate inferences which can be made concerning individual growth and development based on scores derived from the test battery described above.
3. To recommend what research must be done if the objectives of the Health Examination Survey are to be accomplished.
4. To make original recommendations concerning the types of cross-disciplinary analyses that can be performed on data obtained in the Health Examination Survey of children.

An extensive survey of the literature was made, but only the most relevant material was included in this final report. Literature was considered relevant if it was either empirical research or a review which included or made reference to the tests used in the Survey. Empirical studies which were conducted on samples of U.S. children ages 6 to 12 years were given preference. A few important reports which did not meet these criteria were included because of their methodological features or their significant content. Unpublished master's theses and dissertations were obtained, as extensively as possible, by interlibrary loan. Information was sought and, with some success, obtained from the publishers and selected users of the reviewed tests.

One empirical study was carried out under this contract. Its results are included in the section on the Goodenough Draw-A-Man Test. The study was stimulated by a recent publication by Dale B. Harris entitled Children's Drawings as Measures of Intellectual Maturity. This text is basically a revision of the 1926 book by Florence L. Goodenough entitled Measurement of Intelligence by Drawings. In his publication, Harris includes new point-score scales and modernized norms for scoring drawings of the human figure.

The text of this report is divided into six sections. Sections I-IV present critical discussions of various tests used by the Health Examination Survey. The tests are discussed in the following order:

I. The Wechsler Intelligence Scale for Children, Vocabulary and Block Design subtests
II. The Wide Range Achievement Test, the Oral Reading and Arithmetic subtests
III. The Goodenough Draw-A-Man Test
IV. The Thematic Apperception Test

Section V briefly discusses some of the issues which arise when these tests are used as a battery.
Finally, section VI considers the cross-disciplinary relationships between "psychological" and "nonpsychological" measures.

Each research study or review referred to in this report is identified by a number placed in parentheses immediately following the cited reference. Bibliographies following each of the first four sections of the report contain all references cited in the respective sections.

Research studies which were abstracted as part of the literature-review portion of this contract are also included in the four bibliographies. The actual abstracts of the reviewed literature appear as appendixes to the report. For convenience, numbers which identify the abstracts correspond to the number given when the reference is cited in the text of the report.

These abstracts have been deposited as document number 8486 with the Library of Congress. A copy may be secured by sending the document number and $28.80 for photoprints or $3.20 for 35mm. microfilm to the American Documentation Institute Auxiliary Publication Project, Photoduplication Service, Library of Congress, Washington, D.C., 20541. Advance payment is required. Checks or money orders should be made payable to Chief, Photoduplication Service, Library of Congress.

I. THE WECHSLER INTELLIGENCE SCALE FOR CHILDREN, THE VOCABULARY AND BLOCK DESIGN SUBTESTS

This section reviews the measurement characteristics of the Vocabulary (Voc.) and Block Design (BD) subtests of the Wechsler Intelligence Scale for Children (WISC), both as a separate unit and as a WISC short form. It also reviews behavioral correlates of intelligence as reported in the literature and critically evaluates the appropriateness of their use in Cycle II of the Health Examination Survey.

The selection of the Vocabulary and Block Design subtests for use as part of the psychological test battery for Cycle II, in effect, treats these subtests as a short form of the WISC.
In addition to providing an estimate of the WISC score, the two subtests may be interpreted separately, in combination with other test scores, or in conjunction with other Survey data. Combinations of these measures with other data obtained in the Survey are discussed in section II.

DESCRIPTION OF THE WISC

The WISC, which was published in 1949, extended the well-known Wechsler intelligence scales for adolescents and adults into the childhood range of 5 to 15 years. During the decade and a half since its publication the WISC has been the subject of extensive investigation and has achieved wide school and clinic use where individual measures of intelligence are desired.

The WISC is patterned after the Wechsler-Bellevue Intelligence Scale both in the structure of the subtests and the scales and in the use of the deviation intelligence quotient. The test consists of 12 subtests (6 Verbal and 6 Performance), of which 2 (Digit Span of the Verbal Scale and Mazes of the Performance Scale) are supplementary and not routinely used. The 5 subtests comprising the Verbal Scale are as follows: Information, Comprehension, Arithmetic, Similarities, and Vocabulary. The 5 Performance Scale subtests are Picture Completion, Picture Arrangement, Block Design, Object Assembly, and Coding (Digit Symbols).

An important innovation in the Wechsler intelligence tests is the use of the deviation IQ. This device supplants the mental age concept and evaluates the performance of each individual on the basis of the distribution of scores of a representative sample of his own chronological age. In the standardization of the WISC, Wechsler kept the standard deviation of intelligence quotients constant from year to year, with the result that "a child's obtained IQ does not vary unless his actual test performance as compared with his peers varies." Raw scores for each subtest are converted to scaled scores which have a mean of 10 and standard deviation of 3 for each age level.
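The idea of a score expressed relative to one's own age group can be illustrated with a simple standard-score transform. The Python sketch below is an assumption-laden illustration only: it uses a linear rescaling and made-up age-group norms, whereas the actual WISC conversions come from the published norm tables, which are not reproduced here. The function names are hypothetical.

```python
def standard_score(raw, group_mean, group_sd, scale_mean, scale_sd):
    """Rescale a raw score so that the age group's mean and SD map onto
    the chosen scale mean and SD (linear transform, an assumption)."""
    if group_sd <= 0:
        raise ValueError("group standard deviation must be positive")
    z = (raw - group_mean) / group_sd
    return scale_mean + scale_sd * z


def scaled_score(raw, age_mean, age_sd):
    """Subtest scaled score on the mean-10, SD-3 convention."""
    return standard_score(raw, age_mean, age_sd, 10.0, 3.0)


# Hypothetical norms for one age level: mean raw score 24, SD 6.
print(scaled_score(30, 24, 6))   # one SD above the age-group mean
print(scaled_score(24, 24, 6))   # exactly at the age-group mean
```

The same helper with a scale mean of 100 and standard deviation of 15 gives a deviation quotient on the IQ metric the report describes; because the reference group is always the child's own age level, the result changes only when standing relative to peers changes.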
The sum of five scaled scores for the Verbal Series constitutes the Verbal Scale score (VS), and similarly the Performance Scale score (PS) is the sum of the five Performance Series scaled scores. The Full Scale score (FS) is the sum of the Verbal Scale and the Performance Scale. Deviation intelligence quotients have been derived by a similar conversion process for VS, PS, and FS. The IQ scales at each age have a mean of 100 and standard deviation of 15.

The standardization of the WISC is reported in Wechsler's manual (101), and the standardization sample is summarized in terms of age, sex, geographic representation, urban-rural composition, and composition by socioeconomic status (reflected by occupation of fathers). The WISC was standardized on a total sample of 2,200 cases, including 100 white boys and 100 white girls at each age from 5 to 15 years. The proportion of urban children in the sample was slightly higher than in comparable United States population statistics.

Reviewers have commented very favorably on the WISC as a test of superior quality (102-104), but, as in all areas of mental measurement, imperfections have been noted and users have attempted to employ it for purposes for which it was not specifically designed. In general, the deviation IQ has been accepted as an improvement over the IQ computed by dividing mental age by chronological age. Except for a slight bias for urban and smalltown areas, as opposed to rural areas, for a native white population, the sampling basis of the WISC has been regarded as good. Maxwell (106) and Wilson (139) have criticized the linearity of the transformation of raw scores to scaled scores, which may be a problem when sampling extreme cases and widely varying regional, ethnic, and linguistic groups. Hite (112) reported that the WISC lacks items of middle-range difficulty at all age levels and is too difficult for young children, particularly those in the age range 5 to 6 years.
In the studies reviewed, WISC Full Scale IQ's have indeed tended to be lower than comparable Stanford-Binet IQ's. This is especially true at the lower age levels. McCandless (103) noted that girls tend to test lower than boys on the WISC, but support for this generalization is equivocal in the present review.

In evaluating the utility of the Vocabulary and Block Design short form of the WISC for the Survey it is appropriate to consider shortcomings of these tests in relation to alternatives that might have been considered, given the constraints of testing time available in the Survey schedule and the general problems of a national survey. It may be noted that although the WISC norms are inappropriate in varying degrees for Negro, bilingual and foreign-born, illiterate, retarded, defective, rural, and other special groups for which the test was not designed, there is no adequate measure that can be applied to all. On the other hand, because of the extensive research on the WISC, reported below, it may be possible to estimate errors in the Vocabulary and Block Design subtests and in the scores derived from them for various components of the Survey sample. In addition, relationships of these variables to the Goodenough Draw-A-Man Test offer further opportunities for compensatory analysis.

RESEARCH ON SHORT FORMS OF THE WISC

Several investigators have combined two or more subtests in order to develop an efficient short form of the WISC that correlates well with the Full Scale and produces comparable means and standard deviations (175-179, 231, and 235). Of these, only one article, by Simpson and Bridges (177), reported favorable results with the combination of Vocabulary and Block Design. They used a sample of 120 children over the age range of 65 to 192 months. Finley and Thompson (231) developed for a sample of 309 mentally retarded persons a short form with five subtests, including Block Design, which correlated 0.89 with FS IQ.
Significantly, their report included correlations of 0.55 and 0.45, respectively, for Voc. and BD with FS IQ, while the correlation of Voc. and BD was only 0.1. Further, estimation of mean FS IQ by proration of the sum of Voc. and BD, as reported by these authors, approximated the actual FS IQ quite closely.

Schwartz and Levitt (235) also reported a short form of the WISC for educable retarded children, consisting of six subtests including Voc. and BD, which correlated 0.95 with FS IQ. However, their best combination of five subtests, which reduced the correlation to 0.92, eliminated Block Design. Osborne and Allen (239), on the other hand, cross-validated two triads of WISC subtests including Voc. and BD, one with Picture Completion and one with Picture Arrangement, using samples of 240 (initial) and 50 (validation) retarded children aged 7 to 14 years, with correlations with FS IQ of 0.88 to 0.90.

At the same time, Hite (112) has confirmed Wechsler's data (101) indicating that Vocabulary and Block Design are the most reliable subtests in the WISC battery. Hagen (109) and Cohen (111) in the United States and Gault (110) in Australia have reported that both of these subtests are highly loaded on the general factor obtained in factor analysis of the WISC over the entire age range of 5 to 15 years. Cohen found that Vocabulary was the strongest single measure of the general factor. Nevertheless, a problem exists in determining the optimal combination of these subtests to estimate the FS IQ and various parameters related to the Survey objectives.

Simpson and Bridges (177) estimated the FS IQ on the basis of a simple sum of the scaled scores of Voc. and BD and reported a conversion table for this purpose. Inasmuch as their results have not been replicated, so far as is known, cross-validation on a substantial sample should be considered before this table is adopted.
The importance of this recommendation is illustrated by some computations based on the Finley and Thompson data (231). The sum of mean Voc. and BD scaled scores, 11, multiplied by 5 to prorate the FS score, gives a WISC Full Scale IQ of 70 (as compared with the actual mean of 68), while the score of 11 in the Simpson and Bridges tables yields an FS IQ of 77. Further, in view of Maxwell's criticism of the transformation of raw scores to scaled scores (106), it may be advisable also to explore empirically the alternative of predicting the FS IQ from raw scores.

In reviewing the WISC literature every effort was made to focus on the Voc. and BD subtests, and considerable data have been assembled. Nevertheless, the major portion of the information referred to in this report is based on the full test, and assumptions of equivalence of short form scores to the Full Scale must be made in generalizing the results reported. As indicated above, this assumption is not entirely inappropriate, but caution is certainly indicated.

RELIABILITY AND STABILITY

Wechsler's manual (101, p. 13) reported corrected split-half reliability coefficients of 0.77, 0.91, and 0.90, respectively, for Vocabulary, and 0.84, 0.87, and 0.88, respectively, for Block Design for samples of 200 children at each of the following age levels: 7 1/2, 10 1/2, and 13 1/2 years. The corresponding FS reliabilities were 0.92, 0.95, and 0.94, respectively. As noted above, these two subtests were the most reliable of all the WISC subtests. These results for Voc. and BD have been confirmed by Hite (112) for children in the age range of 5 to 7 years.

Stability of the WISC on retest has also been found satisfactory by Gehman and Matyas (113) over a 4-year period (age 11 years at initial test), by Reger (115), who tested a sample at ages 10, 11, and 12 years, and by Whatley and Plant (116), who used a 17-month interval.
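Two pieces of arithmetic in the passage above lend themselves to a small worked sketch in Python: prorating a two-subtest sum to a ten-subtest Full Scale sum, and the Spearman-Brown step-up that is conventionally meant by a "corrected" split-half coefficient. The function names are hypothetical, the individual subtest values 5 and 6 are invented so that they sum to the reported 11, and the conversion of a prorated sum to an IQ requires the test manual's tables, which are not reproduced here.

```python
def prorate_scaled_sum(subtest_scores, full_subtest_count=10):
    """Scale the sum of the administered subtests up to the number of
    subtests in the Full Scale (10 for the WISC as described above)."""
    if not subtest_scores:
        raise ValueError("at least one subtest score is required")
    return sum(subtest_scores) * full_subtest_count / len(subtest_scores)


def spearman_brown(half_test_r):
    """Step a half-test correlation up to an estimate of full-length
    reliability: 2r / (1 + r)."""
    if not -1.0 < half_test_r <= 1.0:
        raise ValueError("correlation must lie in (-1, 1]")
    return 2.0 * half_test_r / (1.0 + half_test_r)


# Two scaled scores summing to 11 (the split into 5 and 6 is made up).
print(prorate_scaled_sum([5, 6]))        # 11 multiplied by 5
# A hypothetical half-test correlation of 0.80.
print(round(spearman_brown(0.80), 3))
```

The proration step only produces a sum of scaled scores; whether the IQ read from that sum is trustworthy is exactly the cross-validation question the text raises.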
In these studies, retest correlations were generally of the order of the corrected split-half reliabilities. These and related data are summarized in table 1.

Table 1. Studies reporting reliability coefficients of the WISC

Investigator | Year | Subjectsᵃ | Age range | Σ | M | F | Voc. | BD | VS | PS | FS | Type of coefficient
Throne, Schulman, and Kaspar (227) | 1962 | Retarded | 11-0 to 14-11 | 39 | 39 | - | 0.79 | 0.82 | 0.92 | 0.89 | 0.95 | Test-retest
Armstrong (175) | 1955 | Guidance clinic | 5-0 to 14-11 | 200 | 100 | 100 | 0.94 | N.R. | N.R. | N.R. | N.R. | Split-half, Spearman-Brown
 |  |  | 5-7 years | 20 | 20 | - | 0.92 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 5-7 years | 20 | - | 20 | 0.90 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 7-9 years | 20 | 20 | - | 0.93 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 7-9 years | 20 | - | 20 | 0.91 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 9-11 years | 20 | 20 | - | 0.87 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 9-11 years | 20 | - | 20 | 0.89 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 11-13 years | 20 | 20 | - | 0.88 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 11-13 years | 20 | - | 20 | 0.88 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 13-15 years | 20 | 20 | - | 0.90 | N.R. | N.R. | N.R. | N.R. |
 |  |  | 13-15 years | 20 | - | 20 | 0.96 | N.R. | N.R. | N.R. | N.R. |
Gehman and Matyas (113) | 1956 | Normals | 11-1 | 60 | 29 | 31 | N.R. | N.R. | 0.77 | 0.74 | 0.77 | Test-retestᵇ
Caldwell (252) | 1954 | Normals (Negro) | 9-7 to 10-6 | 60 | --- | --- | 0.70 | 0.89 | 0.82 | 0.90 | 0.84 | Split-half
Jones (154) | 1962 | Normals (England) | --- | 240 | 120 | 120 | --- | --- | --- | --- | --- | Split-half, [illegible]
 |  |  | 7-6 to 8-5 | 80 | 40 | 40 | 0.70 | 0.74 | 0.86 | 0.80 | 0.89 |
 |  |  | 8-6 to 9-5 | 80 | 40 | 40 | 0.70 | 0.68 | 0.87 | 0.81 | 0.90 |
 |  |  | 9-6 to 10-5 | 80 | 40 | 40 | 0.70 | 0.75 | 0.90 | 0.85 | 0.94 |
Wechsler (101) | 1949 | Normals (WISC standardization data) | --- | 600 | 300 | 300 | --- | --- | --- | --- | --- | Split-half, Spearman-Brown
 |  |  | 7-6 | 200 | 100 | 100 | 0.77 | 0.84 | 0.88 | 0.86 | 0.92 |
 |  |  | 10-6 | 200 | 100 | 100 | 0.91 | 0.87 | 0.96 | 0.89 | 0.95 |
 |  |  | 13-6 | 200 | 100 | 100 | 0.90 | 0.88 | 0.96 | 0.90 | 0.94 |
Hite (112) | 1953 | Normals | --- | 200 | 117 | 83 | --- | --- | --- | --- | --- | Split-half
 |  |  | 5-6 | 50 | 34 | 16 | 0.71 | 0.77 | 0.77 | 0.81 | 0.90 |
 |  |  | 6-6 | 100 | 56 | 44 | 0.72 | 0.84 | 0.89 | 0.89 | 0.91 |
 |  |  | 7-6 | 50 | 27 | 23 | 0.76 | 0.89 | 0.89 | 0.86 | 0.94 |
Hagen (109)ᶜ | 1952 | Normals (WISC standardization data) | --- | 400 | 200 | 200 | --- | --- | --- | --- | --- | Split-half, Spearman-Brown
 |  |  | 5 years | 200 | 100 | 100 | 0.68 | 0.77 | N.R. | N.R. | N.R. |
 |  |  | 15 years | 200 | 100 | 100 | 0.91 | 0.89 | N.R. | N.R. | N.R. |

ᵃ Designations of subjects are always white Americans unless otherwise specified.
ᵇ Time between testings was 49 months.
ᶜ Data are from the WISC standardization sample, but were not reported in the WISC manual.
NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. Σ = total population; M = male; F = female; Voc. = Vocabulary; BD = Block Design; VS = Verbal Scale; PS = Performance Scale; FS = Full Scale; N.R. = not reported.

VALIDITY

Despite the fact that Wechsler developed the WISC in protest against the measurement concept of mental age (and the IQ based on it) implicit in the Stanford-Binet test, and despite the additional fact that the validity of the WISC must be judged principally in relation to the logic of Wechsler's approach and the adequacy of his development and standardization of the test, a surprisingly large number of papers dealing with the validity of the WISC have used the Stanford-Binet as a criterion. As may be expected, unless one assumes naively that the theoretical objections to mental age scores involve gross discrepancies, which they usually do not, the correlations between WISC Full Scale IQ's and Stanford-Binet IQ's are generally high, in about the same range as the respective reliabilities of these tests. (See table 2.) There seems to be little doubt that both the WISC and the Stanford-Binet merit their reputations as outstanding individual intelligence tests.

Table 2. Studies reporting correlation between the WISC and Stanford-Binetᵃ

Investigator | Year | Subjectsᵇ | Age range | Σ | M | F | Voc. | BD | VS | PS | FS
Nale (216) | 1951 | Mental defectives | 8-10 to 15-11 | 104 | 54 | 50 | N.R. | N.R. | N.R. | N.R. | 0.91
Stacey and Levin (228) | 1951 | Mental defectives | 7-2 to 15-11 | 70 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.68
Sloan and Schneider (217) | 1951 | Mental defectives | N.R. | 40 | 20 | 20 | N.R. | N.R. | 0.75 | 0.64 | 0.76
[illegible] | 1950 | Retarded | N.R. | 10 | --- | --- | N.R. | N.R. | 0.81 | 0.49 | 0.71
Sharp (229) | 1957 | Slow learners | 8-0 to 16-5 | 50 | --- | --- | N.R. | N.R. | 0.62 | 0.67 | 0.69
Post (198) | 1952 | Stutterers | 5-5 to 15-10 | 30 | 27 | 3 | N.R. | N.R. | 0.80 | 0.37 | 0.78
Kent and Davis (207) | 1957 | Normals and clinic referrals (England) | 8-12 years | 213 | 133 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
 |  | Normals |  | [illegible] | 59 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
 |  | Delinquents |  | 55 | 48 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
 |  | Psychiatric outpatients |  | 40 | 26 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
Muhr (119) | 1952 | Institutional (orphans and various problems) | 5-0 to 6-11 | 42 | --- | --- | N.R. | N.R. | 0.46 | 0.52 | 0.62
 |  |  | 5 years | 21 | --- | --- | N.R. | N.R. | 0.65 | 0.66 | 0.74
 |  |  | 6 years | 21 | --- | --- | N.R. | N.R. | 0.44 | 0.39 | 0.49
Davidson (162) | 1954 | Normals | 14-0 to 14-3 | 30 | --- | --- | N.R. | N.R. | 0.79 | 0.71 | 0.83
Kardos (161) | 1954 | Normals | 11-11 to 13-0 | 100 | 50 | 50 | N.R. | N.R. | 0.87 | 0.82 | 0.89
Matyasᵈ (184) | [illegible] | Normals | --- | 60 | 29 | 31 | --- | --- | --- | --- | ---
 |  | Grade 5 | 11-1 (mean) | 60 | 29 | 31 | N.R. | N.R. | 0.78 | 0.46 | 0.73
 |  | Grade 9 (retest) | 15-2 (mean) | 60 | 29 | 31 | N.R. | N.R. | 0.76 | 0.64 | 0.77
Raleigh (191) | 1952 | Normals | 10-8 to 14-9 | 100 | 52 | 48 | N.R. | N.R. | 0.77 | 0.59 | 0.80
Schwitzgoebel (189) | 1952 | Normals | 9-11 to 13-8 | 100 | 52 | 48 | N.R. | N.R. | 0.78 | 0.61 | 0.84
Clarke (160) | 1950 | Normals | 9-7 to 12-9 | 84 | 39 | 45 | N.R. | N.R. | 0.83 | 0.57 | 0.79
Frandsen and Higginson (159) | 1951 | Normals | 9-1 to 10-3 | 54 | --- | --- | N.R. | N.R. | 0.71 | 0.63 | 0.80
Reidy (171) | 1952 | Normals | 9-0 to 11-11 | 60 | 30 | 30 | N.R. | N.R. | 0.87 | 0.69 | 0.86
Jones (154) | 1962 | Normals (England) | 8-10 years | 240 | 120 | 120 | N.R. | N.R. | 0.84 | 0.59 | 0.81
 |  |  | 8 years | 40 | 40 | - | N.R. | N.R. | 0.77 | 0.48 | 0.72
 |  |  | 8 years | 40 | - | 40 | N.R. | N.R. | 0.79 | 0.46 | 0.76
 |  |  | 8 years (Σ) | 80 | 40 | 40 | N.R. | N.R. | 0.78 | 0.47 | 0.74
 |  |  | 9 years | 40 | 40 | - | N.R. | N.R. | 0.89 | 0.65 | 0.90
 |  |  | 9 years | 40 | - | 40 | N.R. | N.R. | 0.78 | 0.58 | 0.75
 |  |  | 9 years (Σ) | 80 | 40 | 40 | N.R. | N.R. | 0.84 | 0.61 | 0.84
 |  |  | 10 years | 40 | 40 | - | N.R. | N.R. | 0.86 | 0.64 | 0.83
 |  |  | 10 years | 40 | - | 40 | N.R. | N.R. | 0.90 | 0.67 | 0.86
 |  |  | 10 years (Σ) | 80 | 40 | 40 | N.R. | N.R. | 0.88 | 0.66 | 0.85
Arnold and Wagner (158) | 1955 | Normals | 8-9 years | 50 | --- | --- | N.R. | N.R. | 0.85 | 0.75 | 0.88
Wagner (156) | 1951 | Normals | 8-9 years | 50 | --- | --- | N.R. | N.R. | 0.77 | 0.87 | 0.81
Scott (155) | 1950 | Normals | 7-7 to 11-1 | 30 | --- | --- | 0.63 | 0.60 | 0.86 | 0.86 | 0.92
Beeman (153) | 1960 | Normals | 7-2 to 11-9 | 36 | --- | --- | N.R. | N.R. | 0.64 | 0.42 | 0.67
Harlow, Price, Tatham, and Davidson (148) | 1957 | Normals | --- | 60 | --- | --- | --- | --- | --- | --- | ---
 |  |  | 6-6 to 6-7 | 30 | --- | --- | N.R. | N.R. | 0.64 | 0.61 | 0.64
 |  |  | 10-0 to 10-1 | 30 | --- | --- | N.R. | N.R. | 0.88 | 0.52 | 0.83
Cohen and Collier (124) | 1952 | Normals | 6-5 to 8-9 | 51 | --- | --- | N.R. | N.R. | 0.82 | 0.80 | 0.85
Tatham (152) | 1952 | Normals | 6-5 to 6-7 | 30 | --- | --- | N.R. | N.R. | 0.64 | 0.51 | 0.64
Mussen, Dean, and Rosenberg ([illegible]) | 1952 | Normals | 6-0 to 13-1 | 39 | --- | --- | N.R. | N.R. | 0.83 | 0.72 | 0.85
Krugman, Justman, Wrightstone, and Krugman (144) | 1951 | Normals | --- | 222 | --- | --- | --- | --- | --- | --- | ---
 |  |  | 6 years | 38 | --- | --- | N.R. | N.R. | 0.73 | 0.74 | 0.82
 |  |  | 7 years | 43 | --- | --- | N.R. | N.R. | 0.64 | 0.49 | 0.73
 |  |  | 8 years | 44 | --- | --- | N.R. | N.R. | 0.78 | 0.57 | 0.82
 |  |  | 9 years | 31 | --- | --- | N.R. | N.R. | 0.83 | 0.79 | 0.87
 |  |  | 10 years | 29 | --- | --- | N.R. | N.R. | 0.88 | 0.54 | 0.86
 |  |  | 11 years | 37 | --- | --- | N.R. | N.R. | 0.69 | 0.53 | 0.76
Pastovicᵉ (121) | 1951 | Normals | --- | 100 | --- | --- | --- | --- | --- | --- | ---
 |  |  | - | 50 | --- | --- | N.R. | N.R. | 0.63 | 0.57 | 0.71
 |  |  | - | 50 | --- | --- | N.R. | N.R. | 0.82 | 0.71 | 0.88
Winpenny (105) | 1951 | Normals | --- | 185 | --- | --- | --- | --- | --- | --- | ---
 |  | Kindergarten | 5-4 to 5-8 | 50 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.71
 |  | Grade 2 | 7-4 to 7-8 | 50 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.88
 |  | Grade 5 | 9-7 to 12-9 | 85 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.79
Dunsdon and Roberts (170) | 1955 | Normals (England) | 5-0 to 14-11 | 1,947 | 980 | 967 | --- | --- | --- | --- | ---
 |  |  |  | 980 | 980 | - | N.R. | N.R. | N.R. | N.R. | 0.82
 |  |  |  | 967 | - | 967 | N.R. | N.R. | N.R. | N.R. | 0.77
[illegible]oruszak (146) | 1954 | Normals | 5-14 years | 80 | 40 | 40 | N.R. | N.R. | 0.87 | 0.78 | 0.90
 |  |  | 5-14 years | 40 | 40 | - | N.R. | N.R. | 0.89 | 0.72 | 0.93
 |  |  | 5-14 years | 40 | - | 40 | N.R. | N.R. | 0.86 | 0.71 | 0.93
Holland (149) | 1953 | Normals | 5-13 years | 52 | --- | --- | N.R. | N.R. | 0.88 | 0.73 | 0.87
Weider, Noller, and Schramm (150) | 1951 | Normals | 5-0 to 11-11 | 106 | --- | --- | N.R. | N.R. | 0.89 | 0.77 | 0.89
 |  |  | 5-0 to 7-11 | 44 | --- | --- | N.R. | N.R. | 0.82 | 0.79 | 0.90
 |  |  | 8-0 to 11-11 | 62 | --- | --- | N.R. | N.R. | 0.92 | 0.78 | 0.90
Kureth, Muhr, and Weisgerber (118) | 1952 | Normals | 5-6 years | 100 | --- | --- | 0.51 | 0.61 | 0.75 | 0.71 | 0.81
 |  |  | 5 years | 50 | --- | --- | 0.42 | 0.65 | 0.79 | 0.73 | 0.84
 |  |  | 6 years | 50 | --- | --- | 0.65 | 0.55 | 0.71 | 0.71 | 0.79
Rottersman (151) | 1950 | Normals | 6 years | 50 | 21 | 29 | N.R. | N.R. | 0.71 | 0.49 | 0.7[illegible]
Triggs and Cartee (148) | 1953 | Normals (S-B, Form M) | 5 years | 46 | --- | --- | N.R. | N.R. | 0.58 | 0.48 | 0.61
Orr (188) | 1950 | Normals | --- | 40 | --- | --- | --- | --- | --- | --- | ---
 |  | Grade 1 | N.R. | 15 | --- | --- | N.R. | N.R. | 0.63 | 0.62 | 0.77
 |  | Grade [illegible] | N.R. | 14 | --- | --- | N.R. | N.R. | 0.64 | 0.65 | 0.67
 |  | Grade [illegible] | N.R. | 11 | --- | --- | N.R. | N.R. | 0.88 | 0.66 | 0.79
Stanley (157) | 1955 | Normals (from Frandsen and Higginson, 159, above) | N.R. | 50 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.71
Schachter and Apgar (147) | 1958 | Normals, mixed sample | N.R. | 113 | 61 | 52 | N.R. | N.R. | 0.64 | 0.48 | 0.67
 |  | White |  | 39 | --- | --- |  |  |  |  |
 |  | Negro |  | 66 | --- | --- |  |  |  |  |
 |  | Puerto Rican |  | 6 | --- | --- |  |  |  |  |
 |  | Oriental |  | 2 | --- | --- |  |  |  |  |
Estes, Curtin, DeBurger, and Denny (125) | 1961 | Normals, Grades 1-8 | --- | 82 | 47 | 35 | --- | --- | --- | --- | ---
 |  | Form L | N.R. | 82 | 47 | 35 | N.R. | N.R. | N.R. | N.R. | 0.80
 |  | Form L-M | N.R. | 82 | 47 | 35 | N.R. | N.R. | N.R. | N.R. | 0.74

ᵃ Unless otherwise noted, Stanford-Binet, Form L.
ᵇ Designations of subjects are always white Americans unless otherwise specified.
ᶜ Rank difference correlation.
ᵈ Also reported by Gehman and Matyas in 1956.
ᵉ Also reported by Pastovic and Guthrie in 1951.
ᶠ Intraclass correlation.
ᵍ Average time between S-B and WISC administration was 50.8 months.
NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. Σ = total population; M = male; F = female; Voc. = Vocabulary; BD = Block Design; VS = Verbal Scale; PS = Performance Scale; FS = Full Scale; N.R. = not reported.

Table 3. Studies reporting correlation between the WISC and other measures

Investigator | Year | Test or criterion | Subjectsᵃ | Age range | Σ | M | F | Voc. | BD | VS | PS | FS
Smith (126) | 1961 | Full Range Picture Vocabulary Test | Normals | 6-11 to 8-10 | 100 | 51 | 49 | N.R. | N.R. | 0.63 | 0.42 | 0.60
McBrearty (123) | 1951 | Arthur Point Scale of Performance Tests | Normals | 10-3 to 12-11 | 52 | 22 | 30 | N.R. | N.R. | N.R. | 0.65 | 0.71
Cohen and Collier (124) | 1952 | Arthur Point Scale of Performance Tests | Normals | 6-5 to 8-9 | 49 | --- | --- | N.R. | N.R. | 0.77 | 0.81 | 0.80
Winpenny (105) | 1951 | Arthur Point Scale of Performance Tests | Normals | 9-7 to 12-9 | 85 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.70
Armstrong and Hauck (130) | 1960 | Visual Motor Gestalt Test | Nonorganic child guidance population | 6-12 years | 98 | 49 | 49 | N.R. | N.R. | -0.22 | -0.07 | -0.23
Winpenny (105) | 1951 | Bernreuter-Winpenny | Normals | --- | --- | --- | --- | --- | --- | --- | --- | ---
 |  |  | Kindergarten | 5-4 to 5-8 | 50 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.92
 |  |  | Grade 2 | 7-4 to 7-8 | 50 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.92
 |  |  | Grade 5 | 9-7 to 12-9 | 85 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.97
Cooper (242) | 1958 | California Achievement Tests | Bilinguals (Guam), Grade 5 | N.R. | 51 | --- | --- | N.R. | N.R. | 0.80 | 0.54 | 0.77
Altus (122) | 1952 | California Test of Mental Maturity | Normals, junior high | N.R. | 55 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.81
Altus (134) | 1955 | California Test of Mental Maturity | Retarded, elementary school | N.R. | 100 | 71 | 29 | --- | --- | --- | --- | ---
 |  | Language |  |  | --- | --- | --- | N.R. | N.R. | 0.71 | 0.57 | 0.70
 |  | Non-language |  |  | --- | --- | --- | N.R. | N.R. | 0.65 | 0.67 | 0.68
 |  | Total |  |  | --- | --- | --- | N.R. | N.R. | 0.76 | 0.68 | 0.77
Cooper (242) | 1958 | California Test of Mental Maturity | Bilinguals (Guam), Grade 5 | N.R. | 51 | --- | --- | N.R. | N.R. | 0.66 | 0.68 | 0.74
Schwitzgoebel (189) | 1952 | California Test of Mental Maturity | Normals | 9-11 to 13-8 | 100 | 52 | 48 | N.R. | N.R. | 0.55 | 0.59 | 0.75
Barratt (138) | 1956 | Columbia Mental Maturity Scale | Normals | 9-2 to 10-1 | 60 | 26 | 34 | 0.45 | 0.47 | 0.56 | 0.48 | 0.61
Warren and Collier (224) | 1960 | Columbia Mental Maturity Scale | Retarded | 9-30 years | 49 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.68
Thompson (193) | 1961 | Gates Advanced Primary Reading Tests | Normals | 6-4 to 8-0 | 105 | 62 | 43 | --- | --- | --- | --- | ---
 |  | Word Recognition |  |  | --- | --- | --- | N.R. | N.R. | 0.58 | 0.42 | 0.55
 |  | Paragraph Reading |  |  | --- | --- | --- | N.R. | N.R. | 0.55 | 0.46 | 0.56
 |  | Composite Reading |  |  | --- | --- | --- | N.R. | N.R. | 0.57 | 0.47 | 0.58
Warren and Collier (224) | 1960 | Goodenough Intelligence Test | Retarded | 9-30 years | 49 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.43
Armstrong and Hauck (130) | 1960 | Goodenough Intelligence Test | Child guidance clinic | 6-12 years | 98 | 49 | 49 | N.R. | N.R. | 0.37 | 0.51 | 0.49
Rottersman (151) | 1950 | Goodenough Intelligence Test | Normals | 6 years | 50 | 21 | 29 | N.R. | N.R. | 0.38 | 0.43 | 0.47
Kimbrell (136) | 1960 | Grade placement | Mental defectives | 10.5 to 15.8 | 62 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.40
Smith (126) | 1961 | Wide Range Achievement Test | Normals | 6-11 to 8-10 | 100 | 51 | 49 | N.R. | N.R. | 0.55 | 0.47 | 0.61
Delp (135) | 1953 | Kent EGY Test | Normals | 6-15 years | 74 | --- | --- | N.R. | N.R. | 0.60 | 0.55 | 0.62
Cooper (242) | 1958 | Leiter International Performance Scale | Bilinguals (Guam), Grade 5 | N.R. | 51 | --- | --- | N.R. | N.R. | 0.73 | 0.78 | 0.83
Sharp (229) | 1957 | Leiter International Performance Scale | Slow learners | 8-0 to 16-5 | 50 | --- | --- | N.R. | N.R. | 0.78 | 0.80 | 0.83
Alper (221) | 1958 | Leiter International Performance Scale | Mental defectives | 7-2 to 17-3 | 30 | 15 | 15 | N.R. | N.R. | 0.40 | 0.79 | 0.77
Dunn and Brooks ([illegible]) | 1960 | Peabody Picture Vocabulary Test | Retarded | N.R. | 56 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.61
Kimbrell (136) | 1960 | Peabody Picture Vocabulary Test | Mental defectives | 10.5 to 15.8 | 62 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.30
Himelstein and Herndon (137) | 1962 | Peabody Picture Vocabulary Test | Emotionally disturbed | 6-2 to 14-8 | 48 | --- | --- | N.R. | N.R. | 0.64 | 0.52 | 0.63
McBrearty (123) | 1951 | Progressive Achievement Tests | Normals | 10-3 to 12-11 | 52 | 22 | 30 | N.R. | N.R. | 0.78 | 0.50 | 0.81
Dunsdon and Roberts (170) | 1955 | Mill Hill Vocabulary Scale | Normals (England) | 5-0 to 14-11 | 1,947 | 980 | 967 | --- | --- | --- | --- | ---
 |  | Form A |  |  | 980 | 980 | - | 0.83 | N.R. | N.R. | N.R. | N.R.
 |  | Form A |  |  | 967 | - | 967 | 0.81 | N.R. | N.R. | N.R. | N.R.
 |  | Form B |  |  | 980 | 980 | - | 0.85 | N.R. | N.R. | N.R. | N.R.
 |  | Form B |  |  | 967 | - | 967 | 0.82 | N.R. | N.R. | N.R. | N.R.
Brown, Hakes, and Malpass (233) | 1959 | Raven Progressive Matrices | Retarded | N.R. | N.R. | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.39 to 0.49
Malpass, Brown, and Hakes (140) | 1960 | Raven Progressive Matrices | Retarded | 11-8 (mean) | 104 | --- | --- | N.R. | N.R. | N.R. | N.R. | 0.51
Barratt (138) | 1956 | Raven Progressive Matrices | Normals | 9-2 to 10-1 | 60 | 26 | 34 | 0.56 | 0.60 | 0.69 | 0.70 | 0.75
Wilson (139) | 1952 | Raven Progressive Matrices | British Columbia | 5-6 to 13-0 | 90 | --- | --- | N.R. | N.R. | N.R. | N.R. | ---
 |  |  | Hospitalized American Indians |  | 30 | --- | --- | --- | --- | --- | --- | 0.75; 0.27
 |  |  | Hospitalized whites |  | 30 | --- | --- | --- | --- | --- | --- | 0.83; 0.42
 |  |  | High socioeconomic whites |  | 30 | --- | --- | --- | --- | --- | --- | 0.51; 0.49
Martin and Wiechers (142) | 1954 | Coloured Progressive Matrices | Normals | 9-0 to 10-0 | 100 | 60 | 40 | 0.73 | 0.74 | 0.84 | 0.83 | 0.91
Stacey and Carleton (141) | 1955 | Coloured Progressive Matrices | Mental defectives | 7-5 to 15-9 | 150 | --- | --- | N.R. | N.R. | 0.54 | 0.52 | 0.55
 |  |  |  |  |  |  |  | 0.36 | 0.41 | 0.51 | 0.55 | 0.62
Hite (112) | 1953 | SRA Primary Mental Abilities Test | Normals | 5-6 years | 50 | 34 | 16 | --- | --- | --- | --- | ---
 |  | Verbal |  |  |  |  |  | 0.38 | [illegible] | N.R. | N.R. | N.R.
 |  | Perception |  |  |  |  |  | 0.30 | 0.83 | N.R. | N.R. | N.R.
 |  | Quantitative |  |  |  |  |  | 0.35 | 0.53 | N.R. | N.R. | N.R.
 |  | [illegible] |  |  |  |  |  | [illegible] | 0.68 | N.R. | N.R. | N.R.
Stempel (143) | 1953 | SRA Primary Mental Abilities | Superior intelligence | 8-5 to 10-4 | 50 | --- | --- | --- | --- | --- | --- | ---
 |  | Space |  |  |  |  |  | N.R. | N.R. | 0.45 | 0.34 | N.R.
 |  | Number |  |  |  |  |  | N.R. | N.R. | 0.15 | 0.38 | N.R.
 |  | Reasoning |  |  |  |  |  | N.R. | N.R. | 0.63 | 0.55 | N.R.
 |  | Perception |  |  |  |  |  | N.R. | N.R. | 0.18 | 0.42 | N.R.
 |  | Verbal |  |  |  |  |  | N.R. | N.R. | 0.68 | 0.40 | N.R.
 |  | IQ |  |  |  |  |  | N.R. | N.R. | N.R. | N.R. | 0.68
Jones (154) | 1962 | Teacher ratings | Normals (England) | 7-6 to 10-5 | 240 | 120 | 120 | N.R. | N.R. | 0.73 | 0.57 | 0.74
 |  |  |  | 8 years | 80 | 40 | 40 | N.R. | N.R. | 0.70 | 0.48 | 0.70
 |  |  |  | 9 years | 80 | 40 | 40 | N.R. | N.R. | 0.71 | 0.59 | 0.73
 |  |  |  | 10 years | 80 | 40 | 40 | N.R. | N.R. | 0.76 | 0.62 | 0.76
Stark (163) | 1954 | The Drawing-Completion Test | Normals | 8-4 to 9-10 | 50 | 30 | 20 | 0.72 | 0.49 | N.R. | N.R. | 0.79
Bacon (127) | 1954 | Wechsler-Bellevue Intelligence Scale, Form I | Normals | 11-9 to 12-3 | 32 | 16 | 16 | 0.84 | 0.65 | 0.86 | 0.65 | 0.77
Delattre and Cole (128) | 1952 | Wechsler-Bellevue Intelligence Scale, Form I | Normals | 10-5 to 15-7 | 50 | --- | --- | 0.55 | 0.49 | 0.86 | 0.82 | 0.87

ᵃ Designations of subjects are always white Americans unless otherwise specified.
ᵇ ETA coefficient.
ᶜ WISC scaled scores.
ᵈ Partial correlations with chronological age removed.
ᵉ Raw scores.
ᶠ Scaled scores.
NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. Σ = total population; M = male; F = female; Voc. = Vocabulary; BD = Block Design; VS = Verbal Scale; PS = Performance Scale; FS = Full Scale; N.R. = not reported.
There are, however, differences between the WISC and Stanford-Binet in score levels. As noted above, the WISC IQ's tend to be substantially lower than the corresponding Stanford-Binet IQ's for the very young and for the gifted (153 and 215), as well as for many samples reported across the normal range (119, 120, 124, 147, 148, 151, 154, 156, 159, and 161). This problem is discussed below.

The WISC has been correlated with a wide range of verbal and performance tests that purport to measure various aspects of intelligence. Correlations with the Wechsler-Bellevue, Form I, have been reported by Bacon (127) for a sample of 36 children in the age range 11 years 9 months to 12 years 3 months and by Delattre (128) for 50 students aged 10-5 to 15-7. Their results for FS were 0.77 and 0.87, respectively, while both obtained a correlation of 0.86 for VS. For PS their respective correlations were 0.65 and 0.82; for Voc., 0.84 and 0.55. Finally, for BD their results were 0.65 and 0.49. Variations of the magnitude indicated must be expected for small samples from different settings.

Dunsdon and Roberts (170) administered four vocabulary tests including the WISC to nearly 2,000 British children and obtained intercorrelations exceeding 0.8 for both sexes.

Table 3 summarizes reported correlation coefficients between WISC scores and other tests of intelligence, mental maturity, and achievement in school subjects, teacher ratings, and related criteria. For the FS IQ these are generally quite high and positive, considering sample size and variation in sample composition and setting. In view of these variations, the specific coefficients are of less interest than the general trend, which supports the validity of the WISC as a general measure of what Wechsler labels "the total effective intelligence of the individual" (101, pp. 4 and 5).

For the purposes of a national survey, the robustness of the validity data over wide sample fluctuations is very encouraging. This robustness is shown by the use of the WISC on samples of varying geographic and ethnic characteristics, on samples of varying abilities ranging from defective to gifted, and with special groups such as retarded readers (133), bilinguals (242), stutterers (198), and low school achievers (190).

FACTORS AFFECTING WISC SCORES

Both qualitative and quantitative variations in WISC scores have been reported by various investigators in relation to a wide range of factors. Those discussed in this section are considered relevant to the objectives and problems of the Survey. Where feasible and appropriate, implications and recommendations are noted.

Anxiety

Hafner, Pollie, and Wapner (132) and Carrier, Orton, and Malpass (205) have both reported negative correlations between the WISC FS and the Children's Manifest Anxiety Scale (CMAS), indicating that anxiety, as measured by this scale, tends to interfere with effective WISC performance. Hafner and others found a significant correlation of -0.31 between CMAS and BD. The Carrier study observed the relationship (-0.54) over a range of ability but not among the exceptionally bright. It appears to be most marked in the subnormal; Feldhusen and Klausmeier (167) found the following mean CMAS scores for three groups at different IQ levels: low IQ, 20.2; average, 14.8; and high, 12. These results are not entirely consistent with those of Burns (206), however, who found similar correlations between WISC Vocabulary and California Personality Test measures of Social Adjustment (0.55) and Personal Adjustment (0.45) but obtained nonsignificant coefficients of 0.12 and 0.10, respectively, for Block Design.

Although anxiety and adjustment may be regarded generally as factors that tend to depress WISC (Voc. and BD) scores for some segments of the child population on some occasions, it would seem unwise to attempt any correction for these factors. Presumably, some valid evidence on adjustment will become available from the Thematic Apperception Test (TAT), the School Information Form, and the extensive background and medical information being collected in the Health Examination Survey. However, the relationships are not clearly enough defined for fine quantitative manipulation. One alternative is to regard fluctuations on these variables as a source of error which may possibly be crudely estimated later but is probably well randomized in the total sample. Another is to accept the error pragmatically with the attitude that depressed scores resulting from affective factors probably reflect depressed ability of the individual to function effectively.

Sex Differences

The statement by McCandless (103), cited earlier, that boys do better on the WISC than girls, is not supported by the present review. Data on sex differences are presented in nine studies (130, 146, 154, 169, 175, 192, 194, 196, and 232), and only one (130) reports a significant mean difference favoring boys on FS IQ. However, none of them employed a sampling design encouraging confidence in the group comparisons.

Some correlational differences mentioned by several authors do appear interesting: The correlation of WISC Full Scale IQ with Bender-Gestalt was negative and higher for boys (-0.34, p<0.01) than for girls (-0.09, n.s.) (130). The correlation of WISC Full Scale IQ with the Ammons Picture Vocabulary Test was 0.71 for boys and 0.45 for girls (169). The correlations of WISC FS and VS IQ's with the spelling subtest of the Iowa Test of Basic Skills were higher for boys than for girls. No data were reported in which sex differences favored girls.
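Whether two such coefficients differ by more than sampling error can be examined with Fisher's z transformation. The sketch below is illustrative only: the function name is hypothetical, the two samples are treated as independent, and the sample sizes of 49 boys and 49 girls are an assumption taken from the entry for study 130 in table 3.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided z test for the difference between two independent Pearson correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal tail area
    return z, p


# Boys r = -0.34, girls r = -0.09; n = 49 per group is assumed from table 3.
z, p = fisher_z_difference(-0.34, 49, -0.09, 49)
print(round(z, 2), round(p, 2))
```

With groups of this size the standard error of the difference is about 0.21 in z units, so fairly large differences between coefficients are needed before such a test reaches conventional significance levels.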
The absence of sex differences in studies of normal American (146) and English (154) children, deaf American (194) and English (196) children, and retarded American children (232) suggests considerable generality for the negative conclusion.

Qualitative Differences by Level

Gallagher and Lucito (164) found a negative rank-order relationship between the mean subtest scores of gifted and retarded children on the WISC. The three highest and three lowest subtests for five comparison groups in their study are shown below.

Group | Number of subjects (N) | Three highest subtests | Three lowest subtests
1 Gifted | 50 | Similarities, Information, Vocabulary | Picture Completion, Picture Arrangement, Digit Span
2 Gifted | 43 | Vocabulary, Information, Similarities | Picture Completion, Picture Arrangement, Digit Span
3 Average | 565 | Arithmetic, Digit Symbol, Picture Arrangement | Block Design, Information, Similarities
4 Retarded | 150 | Object Assembly, Picture Completion, Digit Span | Information, Vocabulary, Arithmetic
5 Retarded | 52 | Object Assembly, Digit Span, Picture Completion | Vocabulary, Information, Picture Arrangement

These results agree with others, to be discussed below, which indicate that Block Design scores are least affected by population variations, in contrast with Vocabulary, which is the highest test of the gifted groups and the lowest of the retarded. Baroff (223) described a WISC profile for a sample of 53 low-IQ patients with a mean FS IQ of 63; Block Design was highest, and Vocabulary ranked 11 out of 12. Although Fisher (225) failed to verify the Baroff patterning, Baroff's results are in agreement with those of Gallagher and Lucito with respect to Vocabulary. Matthews (230) found that nonachievers in school tend to be higher on Block Design than on Vocabulary. Levinson (243 and 244), working with Jewish children in New York, and Altus (240), with Mexican and Anglo-American children in California, both found that monolinguals exceeded bilinguals on Vocabulary, but that the differences on Block Design were not significant. Burks and Bruce (186) found that poor readers score significantly high on Block Design, and Kallos, Grabow, and Guarino (180) obtained a significant difference between Block Design and Vocabulary, favoring Block Design, for a sample of poor readers.

Results such as these suggest the possibility of investigating a Voc.-BD ratio which may prove to have some diagnostic use, in conjunction with the Goodenough Draw-A-Man Test, the Wide Range Achievement Test (WRAT), the Thematic Apperception Test, and school information, in evaluating various categories of subnormal and deviant performance such as those enumerated above.

On the Vocabulary subtest, Stacey and Portnoy (168) also observed qualitative differences between a borderline group (IQ range 66-79) and a defective group (IQ range 50-65) in conceptual approaches to word definition. Defectives exceeded borderlines significantly in the use of functional definitions, while the borderlines were significantly higher in use of descriptive definitions. Neither group used abstract concepts to more than a slight degree.
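One way the Voc.-BD ratio suggested above might be explored is sketched here. Nothing in the studies reviewed defines such an index: the function name, the companion difference score, and the scaled scores for the three children are all hypothetical, and no cutoff values are implied.

```python
def voc_bd_indices(voc, bd):
    """Return (ratio, difference) of Vocabulary to Block Design scaled scores."""
    if bd <= 0:
        raise ValueError("Block Design scaled score must be positive to form a ratio")
    return voc / bd, voc - bd


# Hypothetical scaled scores for three children: (Vocabulary, Block Design).
children = {"A": (14, 10), "B": (6, 11), "C": (9, 9)}
for label, (voc, bd) in children.items():
    ratio, diff = voc_bd_indices(voc, bd)
    print(label, round(ratio, 2), diff)
```

A ratio becomes unstable when the denominator is small, so a simple difference between the two scaled scores is carried along as an alternative index; which of the two, if either, has diagnostic value would have to be settled empirically.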
Carleton and Stacey (219) made an item analysis of the Vocabulary and Block Design subtests with a sample of 366 low-IQ children (mean FS IQ 67) and found four Voc. items and two BD items displaced. In view of the greater dependence on these two subtests in a short form than is usually required with the full test, consideration might well be given by the Survey staff to a repetition of this study for a substantial sample.

Maxwell (211) observed that the WISC variances for a sample of neurotic children were greater than for a normal sample, which led him to criticize the transformations of raw scores to scaled scores. This point was also made by Wilson (139), whose work was with Indian children. Walker (209), in a highly creative study, enumerated a lengthy list of qualitative variations of WISC responses that appear to have promise for personality diagnosis. Walker's study merits further followup.

Developmental Factors

Klausmeier and Check (166) investigated a number of developmental correlates of the WISC. They reported that children with high intelligence quotients grow taller than those in the average or low range, but that weight is not significantly related to sex or IQ. On strength of grip, they found low-IQ children weaker than those with average or high IQ's, the average group weaker than the high-IQ group, and girls weaker than boys. Girls were found to have more permanent teeth and a higher carpal age than boys of the same age. No sex differences or IQ differences were found in relation to emotional adjustment. Girls also exceeded boys on achievement in relation to capacity, integration of self concept, and estimation of own ability. These observations are of interest in suggesting cross-disciplinary analysis of psychological and biomedical data.
SPECIAL GROUPS

The following discussion includes research on the WISC with reference to a number of special groups (those involving various disabilities, afflictions, deviations, social and ethnic characteristics, and other definitive attributes commonly recognized in the literature) for which at least some information has been found. Each of these groups involves some variables which affect WISC scores, and this review might properly have been included in the preceding section. However, most of the research referred to here was organized in terms of samples of persons in various categories rather than by underlying variables. As a result, the organization of the discussion follows the organization of the material reviewed.

Reading Disability

As noted earlier, Kallos and others (180) found that Block Design scores were significantly higher than Vocabulary scores for a reading disability sample of 37 boys aged 9 to 14 years whose IQ's ranged from 90 to 109. The elevation of BD was supported by Burks and Bruce (186). Altus (181), Sheldon and Garton (182), and Karlsen (185) published WISC profiles for retarded readers, based on small but similar groups. No consistent pattern is unequivocally shown. Robeck (183) used a more sophisticated method to study subtest patterning of problem readers on the WISC, representing subtest scores as deviations of scaled scores from the respective age-group means. By this method problem readers were significantly higher than the norms on both Block Design and Vocabulary (as well as on Comprehension, Similarities, and Picture Arrangement) and lower on Digit Span, Arithmetic, Information, and Coding. Rogge (187) reported no significant differences on WISC VS, PS, or FS IQ's between a sample of 132 delinquents 14 to 16 years of age and a control sample of good readers.

Correlations of WISC scales with reading tests are generally moderate, in the range of 0.3 to 0.5 (171, 172, and 173).
On the other hand, approaches involving score patterns or profiles, such as discussed above, and qualitative analyses of responses, exemplified by the analyses of the understanding of the concept of opposite, by Robinowitz (108) and by Flamand (172), appear to offer greater promise than linear regression methods for the evaluation of reading disability cases. The latter approach does not appear feasible with only Voc. and BD in the battery, but the pattern approach, as discussed above, merits consideration. In the Survey battery the WRAT is, of course, most directly related to estimation of reading disability, but a Voc.-BD ratio may be a useful supplement.

Auditory Disability

Murphy (196) administered the WISC to an equally divided sample of 300 deaf boys and girls in English schools for the deaf. Deaf children did not differ significantly from normal children on the Performance Scale in this study, and there was no meaningful relation between hearing loss and PS. It is of interest, though, that Block Design correlated 0.71 with PS in this sample. In addition, teacher ratings of emotional adjustment correlated 0.76 with PS, suggesting that here also, as in the samples evaluated in relation to the Children's Manifest Anxiety Scale, anxiety may be a deterrent to effective performance.

Graham and Shapiro (195) compared the performance of the deaf and normal children on the WISC with standard and pantomime instructions. Both groups did equally well on PS with pantomime instructions, but the normals were superior with standard instructions. Mean scores on BD were approximately equal under all three conditions. For deaf children, then, the pantomime instructions are appropriate on BD.

Glowatsky (194) found that WISC Performance Scale IQ's were comparable with Draw-A-Man Test IQ's for a sample of 24 deaf and hard-of-hearing children in Santa Fe.
PS scores were substantially higher than VS scores in this group, but bilingualism (noted in 13 cases) was not a factor.

Thompson gave Wepman's Auditory Discrimination Test, the WISC, and other tests of reading and auditory acuity to 105 children, including good and poor readers. She found that a significant and substantial proportion of first graders (71 percent) had inadequate auditory discrimination, but that this number was reduced to 24 percent by the second grade. Auditory Discrimination scores correlated more highly with reading (0.59 to 0.66) than with WISC IQ's (0.55 to 0.58). The correlation of Auditory Discrimination with WISC Verbal Scale IQ, the highest correlation reported, was 0.61.

Where hearing disability is noted by audiometer test it would be advantageous to estimate intelligence level by a combination of Draw-A-Man and Block Design scores.

Visually Handicapped

According to a study by Scholl (197), the Block Design test may be administered with normal procedures to the partially blind. For the totally blind only the Vocabulary test would be appropriate in the Survey, and no data are available to evaluate their scores adequately.

Stutterers

Post (198) found no significant differences between the mean scores of 30 stutterers and 30 controls, predominantly boys in the age range of 5-5 to 15-10, on the Stanford-Binet (S-B) and the WISC. The correlation of WISC Full Scale IQ with the S-B was 0.78 for the stutterers. The only difference found between the two groups was in the correlation of WISC Verbal Scale and Performance Scale IQ's, which was 0.26 for the stutterers and 0.60 (the same as in Wechsler's standardization sample) for the controls. Both group means were higher on PS than VS.

Cerebral Palsy

Bortner and Birch (199) studied the administration of the Block Design subtests with twenty-eight 13-year-old cerebral palsied children.
They found, as may be expected, that the ability to discriminate block designs in a choice situation may be intact even though motor factors impair reproductive ability.

Organic Impairment of Central Nervous System

Beck and Lam (200) found that WISC Full Scale IQ's of diagnosed organics were lower than those of nonorganics, but failed, as others have, to verify Wechsler's subtest diagnostic pattern for organics. Young and Pitts (202) compared the WISC scores of 40 rural juvenile congenital syphilitics (aged 6 to 16 years) with 40 normal controls matched on age, sex, race, region, and father's occupation. The controls were significantly superior on IQ's and on Vocabulary, but not on Block Design, where the critical ratio was marginal.

Gifted

In Edmonton, Chalmers (213) administered the WISC to 57 superior children with IQ's above 120 (mean FS IQ 128) and found that 11 obtained perfect scores on one or more tests. However, there were no perfect scores on Vocabulary and only one on Block Design. Nevertheless, Chalmers questioned the adequacy of the WISC ceilings for precise measurement in the very high range. Trauba (214), with a similar sample of 71 gifted Kansas children, found that WISC Vocabulary has a correlation of 0.71 with the McCall-Crabbs Standard Test Lesson in Reading. Lucito and Gallagher (215) obtained a mean WISC Full Scale IQ of 141 for a sample of 50 children whose mean S-B IQ was 161. In this group the boys' scores were slightly higher than those of the girls. In agreement with Gallagher and Lucito (164), mentioned earlier, Similarities, Information, and Vocabulary were the three highest tests for boys and girls. Object Assembly, Coding, and Picture Arrangement were lowest for boys, while Digit Span, Picture Arrangement, and Picture Completion were lowest for girls (only partially in agreement with Gallagher and Lucito).
The adequacy of the WISC for precise measurement of the gifted may be questioned, but it is possible that more accurate measurement may be obtained by use of the present short form of Vocabulary and Block Design than with the Full Scale. This is a problem, however, that will require further attention.

Mentally Retarded and Defective

The research on the use of the WISC with retarded and defective groups is very favorable, in contrast with research on its use for the gifted. This is indicated by virtually all the studies reviewed: (a) reliabilities reported: Throne and others (227) obtained retest reliabilities over 3 to 4 months of 0.79 for Vocabulary and 0.82 for Block Design on a sample of 39 retarded boys aged 11 to 14 years; (b) correlations of the WISC with other tests: Stanford-Binet (216, 217, 228, and 229), Leiter International Performance Scale (221 and 229), Wechsler Adult Intelligence Scale (222), Columbia Mental Maturity Scale (224), Goodenough Draw-A-Man Test (224), Progressive Matrices (233), Peabody Picture Vocabulary Test (234), and grade placement (238); (c) patterning studies, mentioned earlier; (d) absence of sex differences (232); and (e) amenability to short forms based on Vocabulary and Block Design, as discussed above. (See Research on Short Forms of the WISC.) Differences between WISC and Stanford-Binet IQ's are smaller in this range than in any other. It appears that estimates of retardation in the population should be justified on the basis of a composite score of Voc. and BD, but the desirability of further research to develop a conversion table to the Full Scale should not be minimized.

Bilingual

The effect of bilingualism appears to be in the direction of lowering the Vocabulary scores; no effects have been reported on Block Design. Altus (240) reported such results for Mexicans in California; Kralovich (241), for children of Slavic origin in New Jersey; and Levinson (243 and 244), for Jewish children in New York.
Kralovich reported a correlation of 0.61 between the Verbal and Performance scales of the WISC for 28 monolinguals and -0.04 for 28 bilinguals. Where bilingualism is known to exist, verbal tests may be expected to be invalid measures and greater reliance on performance-type tests such as Block Design and Draw-A-Man is indicated.

Negro

The WISC norms do not apply to Negro children, and research by Young and Bright (251), Caldwell (252), Blakemore (253), and Racheile (254), as well as others, does nothing to alter this fact. Negroes score lower than whites, and it is generally accepted that cultural experience and caste factors not only account for the Negro-white differences, but also render comparable measurement by culture-fair or culture-free methods as difficult as other ethnic comparisons. The sampling designs of the studies cited, which used the WISC, were not adequate to qualify them for any detailed comment on differences found.

Socioeconomic Status

Laird (250) compared children of different socioeconomic status (SES) on the WISC and noted, in common with the general trend in the literature, superior performance at upper levels. Estes (247 and 248) found similar differences at grade 2 but not at grade 5. At both grades the WISC Full Scale IQ was more highly correlated with the Metropolitan Achievement Test for the higher SES sample.

COMPARISON OF WISC AND STANFORD-BINET IQ'S

Despite the theoretical objections to the mental age concept, discussed earlier, which led to the adoption of the deviation IQ as a distinctive feature of the Wechsler scales and which set them apart from the venerable Stanford-Binet test, the relation of the WISC to the S-B has been a matter of great interest, as evidenced by the number of papers on this topic in the present review.
The Stanford-Binet is indeed one of the giants among psychological tests, a veritable landmark in the history of psychological measurement, and still enjoys extensive school and clinical use, notwithstanding the fact that its popularity has been somewhat reduced by the success of the relatively recent WISC. Although the standardization of the WISC has been impressive and supported by sophisticated conceptualization, many users have been relieved to find that it is highly correlated with the Stanford-Binet. The correlation is in fact so high (accounting for over 80 percent of common variance, that is, a correlation above about 0.89) that one wonders about the significance of the theorizing which describes them so differently.

The impression of similarity of measurement results given by the correlations does not, however, stand up when mean scores of different groups are compared. As noted earlier, WISC IQ's tend to be lower than Stanford-Binet IQ's at the lower age levels and among the gifted. These observations are illustrated by data extracted from the following 12 studies in which comparison means were cited: 119, 120, 124, 147, 148, 151, 153, 156, 159, 161, 215, and 216. Their results are epitomized briefly in the tabulation below. Data from Jones' (154) British study of 240 children in the age range 8 to 10 years are also of interest. For this group the WISC means were, on the average, 7.2 IQ points below the S-B, the WISC always being administered first.

[Figure 1. Summary of the amount Stanford-Binet Intelligence Test scores differ from Wechsler Intelligence Test scores. Horizontal axis: age in years, 5 to 14. Curves: Gifted; Normal range; Retarded-Defective.]

Normal (White) Samples

Study                           Group                                 N               Mean S-B  Mean WISC  Difference
Schachter and Apgar (147) [1]   Mean ages 4-1 (S-B), 8-3 (WISC)       113 (61m, 62f)  104.3     98.9       -5.4
Triggs and Cartee (148)         Kindergarten, age 5                   48              124.1     107.6      -16.5
Muhr (119)                      5-year group                          21              97.4      88.1       -9.3
                                6-year group                          21              102.2     96.6       -5.6
Pastovic and Guthrie (120)      5-year group                          50              113.0     103.2      -9.8
                                7-year group                          50              115.1     111.5      -3.6
Rottersman (151)                6-year group                          50              110.2     101.5
Cohen and Collier (124)         6- to 9-year group, ages 6-5 to 8-9   53              104.8     99.3       -5.0
Wagner (156)                    8- to 9-year group                    50              104.5     103.3      -1.2
Frandsen and Higginson (159)    9-year group                          50              105.8     102.4      -3.4
Kardos (161)                    13- to 14-year group                  100             113.7     109.4

Gifted (White) Samples

Study                           Group                                 N               Mean S-B  Mean WISC  Difference
Beeman (153)                    Full sample                           36                                   -15
                                IQ over 130                                                                -20
                                IQ 120-129                                                                 -11.4
Lucito and Gallagher (215)                                            50              160.8     141.2

Retarded Samples

Study                           Group                                 N               Mean S-B  Mean WISC  Difference
Nale (216)                      9- to 11-year group                   104             55.4      58.0       +2.6

[1] Interval between S-B and WISC administration, 50 months.
NOTE: N = number; m = male; f = female. Difference is mean WISC compared with mean S-B.

Allowing for sampling fluctuations and errors of measurement in routine testing, there nevertheless appears to be a common trend in these reports which can be summarized as follows. The differences between WISC and S-B IQ's are greatest among the gifted. In the normal range they are high among the very young, dropping off as age increases, but persisting to some degree throughout the age range 5 to 14 years. The data suggest an upturn after age 9, but this is not certain. No significant differences appear for the subnormal. The schematic chart in figure 1 suggests the nature of the age- and level-related difference functions on the basis of the results cited.
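The trend just summarized can be checked directly against the tabulated means. The following is a minimal sketch in Python that recomputes WISC minus S-B for every sample reporting both means and averages the differences within each level. It is an illustration rather than part of the review: the averages are unweighted (sample sizes are ignored), the function names are invented for the example, and because the differences are recomputed from rounded means they need not match a printed difference exactly.

```python
# Mean IQs transcribed from the comparison tabulation above:
# (sample label, level, mean S-B, mean WISC).
# Only samples that report both means are included (Beeman is omitted).
ROWS = [
    ("Schachter and Apgar (147)", "normal", 104.3, 98.9),
    ("Triggs and Cartee (148)", "normal", 124.1, 107.6),
    ("Muhr (119), 5-year", "normal", 97.4, 88.1),
    ("Muhr (119), 6-year", "normal", 102.2, 96.6),
    ("Pastovic and Guthrie (120), 5-year", "normal", 113.0, 103.2),
    ("Pastovic and Guthrie (120), 7-year", "normal", 115.1, 111.5),
    ("Rottersman (151)", "normal", 110.2, 101.5),
    ("Cohen and Collier (124)", "normal", 104.8, 99.3),
    ("Wagner (156)", "normal", 104.5, 103.3),
    ("Frandsen and Higginson (159)", "normal", 105.8, 102.4),
    ("Kardos (161)", "normal", 113.7, 109.4),
    ("Lucito and Gallagher (215)", "gifted", 160.8, 141.2),
    ("Nale (216)", "retarded", 55.4, 58.0),
]


def differences(rows):
    """Return {sample label: mean WISC minus mean S-B}, to one decimal."""
    return {label: round(wisc - sb, 1) for label, _level, sb, wisc in rows}


def mean_difference_by_level(rows):
    """Unweighted average of (WISC - S-B) within each ability level."""
    sums = {}
    for _label, level, sb, wisc in rows:
        total, count = sums.get(level, (0.0, 0))
        sums[level] = (total + (wisc - sb), count + 1)
    return {level: round(total / count, 1)
            for level, (total, count) in sums.items()}


if __name__ == "__main__":
    for label, diff in differences(ROWS).items():
        print(f"{label:36s} {diff:+.1f}")
    for level, diff in mean_difference_by_level(ROWS).items():
        print(f"average, {level:9s} {diff:+.1f}")
```

A weighted average, or a plot of the differences against age, would be the natural next step toward the curves that figure 1 only sketches.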
Unfortunately it is possible only to speculate on the nature of the true curves which those in figure 1 are intended to suggest, and speculation on what they would be for a short form composed only of Vocabulary and Block Design is difficult. Some of the data presented earlier for these subtests suggest that the differences might be smaller, but in the absence of empirical evidence this is only an educated guess.

For the purposes of the Survey there are only two alternatives. One is to carry out some ad hoc research on the short form, as suggested earlier, for the purpose of estimating the Full Scale IQ from Voc. and BD, using the results to conform to Wechsler's norms. The other is to regard the full Survey sample as the unprecedented opportunity to carry out a complete new standardization of the short form on a basis that, in sampling sophistication, far exceeds any work of its kind in the history of testing. There are a number of problems related to the second alternative, including the availability of funds for this purpose. However, if this standardization were accomplished, the new norms for Voc. and BD would be superior to those now available, and the computations of FS IQ based on them would permit more accurate population estimates than any others conceivable for the age range included.

SUMMARY AND CONCLUSIONS

This review is based on 154 published studies, reviews, and unpublished theses and dissertations related to the WISC, interpreted in a frame of reference of measurement theory and psychometric principles. The evidence considered strongly supports the judgment of the Survey staff in the selection of the WISC Vocabulary and Block Design subtests as a short form of the WISC for the national survey, but at the same time it raises questions concerning the acceptance of either the scaled scores of these subtests or of prorated Full Scale Intelligence Quotients based on them without further empirical research.
It is the reviewer's considered opinion that, given the alternatives presented, the selection was an eminently wise one. The research recommended reflects principally the nature of the unprecedented testing problems and the generally imprecise nature of psychological measurement.

The most important recommended investigations discussed in this section involve the following steps:

1. Restandardization of the Vocabulary and Block Design tests on the full Survey sample. As part of this study, item difficulties should be checked and a formula or set of formulas should be developed for estimating Full Scale IQ's from revised Voc. and BD scaled scores (based on samples of normal, gifted, and retarded groups, and if possible several ethnic groups, such as Negroes or Mexicans, to whom the Full Scale has been administered). Consideration should be given to estimation of IQ's directly from raw scores by age group.

2. Research on correlates of a Voc.-BD ratio, for use with the WRAT and with the Draw-A-Man Test in the identification of poor readers, bilinguals, and verbally impaired children and in estimating IQ's of culturally deviant ethnic groups.

3. Cross-disciplinary developmental analyses of Vocabulary, Block Design, and derived scores and of item responses with biomedical data obtained in other sections of the Survey. This area is discussed in detail elsewhere. See Klausmeier and Check (166).

BIBLIOGRAPHY

General References to WISC

101. Wechsler, D.: Wechsler Intelligence Scale for Children. New York. Psychological Corp., 1949.
102. Littell, W. M.: The Wechsler Intelligence Scale for Children, review of a decade of research. Psychological Bull. 57:132-156, 1960.
103. McCandless, B. R.: Review of the WISC, in O. K. Buros, ed., Fourth Mental Measurements Yearbook. Highland Park, N.J. The Gryphon Press, 1953. pp. 480-481.
104. Frost, B. P.: An application of the method of extreme deviations to the Wechsler Intelligence Scale for Children.
J.Clin.Psychol. 16:420, 1960.
105. Winpenny, N.: An Investigation of the Use and the Validity of Mental Age Scores on the Wechsler Intelligence Scale for Children. Unpublished master’s thesis, Pennsylvania State College, 1951.
106. Maxwell, A. E.: Inadequate reporting of normative test data. J.Clin.Psychol. 17:99-101, 1961.
107. Seashore, H. G.: Differences between verbal and performance IQ’s on the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 15:62-67, 1951.
108. Robinowitz, R.: Learning the relation of opposition as related to scores on the Wechsler Intelligence Scale for Children. J.Genet.Psychol. 88:25-30, 1956.

Factor Analytic Studies

109. Hagen, E. P.: A Factor Analysis of the Wechsler Intelligence Scale for Children. Unpublished doctoral dissertation, Columbia University, 1952.
110. Gault, U.: Factorial patterns on the Wechsler Intelligence Scales. Aust.J.Psychol. 6:85-90, 1954.
111. Cohen, J.: The factorial structure of the WISC at ages 7-6, 10-6, and 13-6. J.Consult.Psychol. 23:285-299, 1959.

Reliability and Stability

112. Hite, L.: Analysis of Reliability and Validity of the Wechsler Intelligence Scale for Children. Unpublished doctoral dissertation, Western Reserve University, 1953.
113. Gehman, I. H., and Matyas, R. P.: Stability of the WISC and Binet tests. J.Consult.Psychol. 20:150-152, 1956.
114. Matyas, R. P.: A Longitudinal Study of the Revised Stanford-Binet and the WISC. Unpublished master’s thesis, Pennsylvania State University, 1954.
115. Reger, R.: Repeated measurements with the WISC. Psychol.Rep. 11:418, 1962.
116. Whatley, R. G., and Plant, W. T.: The stability of WISC IQ’s for selected children. J.Psychol. 44:165-167, 1957.

Validity

117. Mussen, P., Dean, S., and Rosenberg, M.: Some further evidence on the validity of the WISC. J.Consult.Psychol. 16:410-411, 1952.
118. Kureth, G., Muhr, J. P., and Weisgerber, C. A.: Some data on the validity of the Wechsler Intelligence Scale for Children.
Child Development 23:281-287, 1952.
119. Muhr, J. P.: Validity of the Wechsler Intelligence Scale for Children at the Five and Six Year Level. Unpublished master’s thesis, University of Detroit, 1952.
120. Pastovic, J. J., and Guthrie, G. M.: Some evidence on the validity of the WISC. J.Consult.Psychol. 15:385-386, 1951.
121. Pastovic, J. J.: A Validation Study of the Wechsler Intelligence Scale for Children at the Lower Age Level. Unpublished master’s thesis, Pennsylvania State College, 1951.
122. Altus, G. T.: A note on the validity of the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 16:231, 1952.

Relations with Other Tests: Batteries

123. McBrearty, J. F.: Comparison of the WISC With the Arthur Performance Scale, Form I, and Their Relationship to the Progressive Achievement Test. Unpublished master’s thesis, Pennsylvania State College, 1951.
124. Cohen, B. D., and Collier, M. J.: A note on WISC and other tests of children six to eight years old. J.Consult.Psychol. 16:226-227, 1952.
125. Estes, B. W., Curtin, M. E., DeBurger, R. A., and Denny, C.: Relationships between 1960 Stanford-Binet, 1937 Stanford-Binet, WISC, Raven, and Draw-A-Man. J.Consult.Psychol. 25:388-391, 1961.
126. Smith, B. S.: The relative merits of certain verbal and non-verbal tests at the second-grade level. J.Clin.Psychol. 17:53-54, 1961.

Relations with Other Tests: Wechsler-Bellevue

127. Bacon, C. S.: A Comparative Study of the Wechsler-Bellevue Intelligence Scale for Adolescents and Adults, Form I, and the Wechsler Intelligence Scale for Children at the Twelve-Year Level. Unpublished master’s thesis, University of North Dakota, 1954.
128. Delattre, L., and Cole, D.: A comparison of the WISC and the Wechsler-Bellevue. J.Consult.Psychol. 16:228-230, 1952.

Relations with Other Tests: Bender-Gestalt Perceptual Tests

129. Koppitz, E. M.: Relationships between the Bender-Gestalt Test and the Wechsler Intelligence Scale for Children. J.Clin.Psychol.
14:413-416, 1958.
130. Armstrong, R. G., and Hauck, P. A.: Correlates of the Bender-Gestalt scores in children. J.Psychol.Stud. 11:153-158, 1960.
131. Goodenough, D. R., and Karp, S. A.: Field dependence and intellectual functioning. J.Abnorm.&Social Psychol. 63:241-246, 1961.

Relations with Other Tests: CMAS

132. Hafner, A. J., Pollie, D. M., and Wapner, I.: The relationship between the CMAS and WISC functioning. J.Clin.Psychol. 16:322-323, 1960.

Relations with Other Tests: Ammons Full Range Picture Vocabulary

133. Smith, L. M., and Fillmore, A. R.: The Ammons FRPV Test and the WISC for remedial reading cases; abstracted, J.Consult.Psychol. 18:332, 1954.

Relations with Other Tests: CTMM

134. Altus, G. T.: Relationships between verbal and non-verbal parts of the CTMM and WISC. J.Consult.Psychol. 19:143-144, 1955.

Relations with Other Tests: Kent EGY

135. Delp, H. A.: Correlations between the Kent EGY and the Wechsler batteries. J.Clin.Psychol. 9:73-75, 1953.

Relations with Other Tests: Peabody Picture Vocabulary Test

136. Kimbrell, D. L.: Comparison of Peabody, WISC, and academic achievement scores among educable mental defectives. Psychol.Rep. 7:502, 1960.
137. Himelstein, P., and Herndon, J. D.: Comparison of the WISC and Peabody Picture Vocabulary Test with emotionally disturbed children. J.Clin.Psychol. 18:82, 1962.

Relations with Other Tests: Raven Progressive Matrices

138. Barratt, E. S.: The relationship of the Progressive Matrices (1938) and the Columbia Mental Maturity Scale to the WISC. J.Consult.Psychol. 20:294-296, 1956.
139. Wilson, L.: A Comparison of the Raven Progressive Matrices (1947) and the Performance Scale of the Wechsler Intelligence Scale for Children for Assessing the Intelligence of Indian Children. Unpublished master’s thesis, University of British Columbia, 1952.
140. Malpass, L. F., Brown, R., and Hade, D.: The utility of the Progressive Matrices (1956 edition) with normal and retarded children. J.Clin.Psychol.
16:350, 1960.
141. Stacey, C. L., and Carleton, F. O.: The relationship between Raven’s Colored Progressive Matrices and two tests of general intelligence. J.Clin.Psychol. 11:84-85, 1955.
142. Martin, A. W., and Wiechers, J. E.: Raven’s Colored Progressive Matrices and the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 18:143-144, 1954.

Relations with Other Tests: SRA-PMA

143. Stempel, E. F.: The WISC and the SRA Primary Mental Abilities Test. Child Development 24:257-261, 1953.

Relations with Other Tests: Stanford-Binet

144. Krugman, J. I., Justman, J., Wrightstone, J. W., and Krugman, M.: Pupil functioning on the Stanford-Binet and the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 15:475-483, 1951.
145. Harlow, J. E., Jr., Price, A. C., Tatham, L. J., and Davidson, J. F.: Preliminary study of comparison between Wechsler Intelligence Scale for Children and Form L of the Revised Stanford Binet Scale at three age levels. J.Clin.Psychol. 13:72-73, 1957.
146. Boruszak, R. J.: A Comparative Study to Determine the Correlation Between the IQ’s of the Revised Stanford Binet Scale, Form L, and the IQ’s of the Wechsler Intelligence Scale for Children. Unpublished master’s thesis, Wisconsin State College, 1954.
147. Schachter, F. F., and Apgar, V.: Comparison of preschool Stanford-Binet and school-age WISC IQ’s. J.Educ.Psychol. 49:320-323, 1958.
148. Triggs, F. O., and Cartee, J. K.: Pre-school pupil performance on the Stanford-Binet and the Wechsler Intelligence Scale for Children. J.Clin.Psychol. 9:27-29, 1953.
149. Holland, G. A.: A comparison of the WISC and Stanford-Binet IQ’s of normal children. J.Consult.Psychol. 17:147-152, 1953.
150. Weider, A., Noller, P. A., and Schraumm, T. A.: The Wechsler Intelligence Scale for Children and the Revised Stanford-Binet. J.Consult.Psychol. 15:330-333, 1951.
151. Rottersman, L.: A Comparison of the IQ Scores on the New Revised Stanford Binet, Form L, the Wechsler Intelligence Scale for Children, and the Goodenough “Draw A Man” Test at the Six Year Age Level. Unpublished master’s thesis, University of Nebraska, 1950.
152. Tatham, L. J.: Statistical Comparison of the Revised Stanford-Binet Intelligence Test Form L With the Wechsler Intelligence Scale for Children Using the Six and One-Half Year Level. Unpublished master’s thesis, University of Florida, 1952.
153. Beeman, G.: A comparative study of the WISC and Stanford-Binet with a group of more able and gifted 7-11 year old students. Calif.J.Educ.Res. 11:77, 1960.
154. Jones, S.: The Wechsler Intelligence Scale for Children applied to a sample of London primary school children. Br.J.Educ.Psychol. 32(2):119-133, 1962.
155. Scott, G. R.: A Comparison Between the Wechsler Intelligence Scale for Children and the Revised Stanford-Binet Scales. Unpublished master’s thesis, Southern Methodist University, 1950.
156. Wagner, W. K.: A Comparison of Stanford-Binet Mental Ages and Scaled Scores on the Wechsler Intelligence Scale for Children for Fifty Bowling Green Pupils. Unpublished master’s thesis, Bowling Green State University, 1951.
157. Stanley, J. C.: Statistical analysis of scores from counterbalanced tests. J.Exp.Educ. 23:187-207, 1955.
158. Arnold, F. C., and Wagner, W. K.: A comparison of Wechsler Children’s Scale and Stanford-Binet scores for eight- and nine-year-olds. J.Exp.Educ. 24:91-94, 1955.
159. Frandsen, A. N., and Higginson, J. B.: The Stanford-Binet and the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 15:236-238, 1951.
160. Clarke, F. R.: A Comparative Study of the Wechsler Intelligence Scale for Children and the Revised Stanford Binet Intelligence Scale, Form L, in Relation to the Scholastic Achievement of a 5th Grade Population. Unpublished master’s thesis, Pennsylvania State College, 1950.
161. Kardos, M. S.: A Comparative Study of the Performance of Twelve-Year-Old Children on the WISC and the Revised Stanford-Binet, Form L, and the Relationship of Both to the California Achievement Tests. Unpublished master’s thesis, Marywood College, 1954.
162. Davidson, J. F.: A Preliminary Study in Statistical Comparison of the Revised Stanford-Binet Intelligence Test Form L With the Wechsler Intelligence Scale for Children Using the Fourteen Year Level. Unpublished master’s thesis, University of Florida, 1954.

Relations with Other Tests: Wartegg Drawing Completion Test

163. Stark, R.: A Comparison of Intelligence Test Scores on the Wechsler Intelligence Scale for Children and the Wartegg Drawing Completion Test with School Achievement of Elementary School Children. Unpublished master’s thesis, University of Detroit, 1954.

WISC: Response Patterns of Gifted, Average, and Retarded

164. Gallagher, J. J., and Lucito, L. L.: Intellectual patterns of gifted compared with average, and retarded. Except.Children 27:479-482, 1961.
165. Klausmeier, H. J., and Feldhusen, J. F.: Retention in arithmetic among children of low, average, and high intelligence at 117 months of age. J.Educ.Psychol. 50:88-92, 1959.
166. Klausmeier, H. J., and Check, J.: Relationships among physical, mental, achievement, and personality measures in children of low, average, and high intelligence at 113 months of age. Am.J.Ment.Deficiency 63:1059-1068, 1959.
167. Feldhusen, J. F., and Klausmeier, H. J.: Anxiety, intelligence, and achievement in children of low, average, and high intelligence. Child Development 33:403-409, 1962.

WISC: Vocabulary, Language Skills, Reading

168. Stacey, C. L., and Portnoy, B.: A study of the differential responses on the vocabulary subtest of the Wechsler Intelligence Scale for Children. J.Clin.Psychol. 6:401-403, 1950.
169. Winitz, H.: A Comparative Study of Certain Language Skills in Male and Female Kindergarten Children. Unpublished doctoral dissertation, State University of Iowa, 1959.
170. Dunsdon, M. I., and Roberts, J. A. F.: A study of the performance of 2,000 children on four vocabulary tests. Br.J.Statist.Psychol. 8:3-15, 1955.
171. Reidy, M. E.: A Validity Study of the Wechsler-Bellevue Intelligence Scale for Children and Its Relationship to Reading and Arithmetic. Unpublished master’s thesis, Catholic University of America, 1952.
172. Flamand, E. K.: The Relationship Between Various Measures of Vocabulary and Performance in Beginning Reading. Unpublished doctoral dissertation, Temple University, 1961.
173. Triggs, F. O., Cartee, J. K., Binks, V., Foster, D., and Adams, N. A.: The relationship between specific reading skills and general ability at the elementary and junior-senior high school levels. Educ.Psychol.Measur. 14:176-185, 1954.
174. Fitzgerald, L. A.: Some Effects of Reading Ability on Group Intelligence Test Scores in the Intermediate Grades. Unpublished doctoral dissertation, State University of Iowa, 1960; abstracted, Diss.Abstr. 21:1844, 1961.

WISC: Short Forms

175. Armstrong, R. G.: A reliability study of a short form of the WISC vocabulary subtest. J.Clin.Psychol. 11:413-414, 1955.
176. Throne, J. M.: A Short Form of the Wechsler-Bellevue Intelligence Test for Children. Unpublished master’s thesis, University of Florida, 1951.
177. Simpson, W. H., and Bridges, C. C., Jr.: A short form of the Wechsler Intelligence Scale for Children. J.Clin.Psychol. 15:424, 1959.
178. Carleton, F. O., and Stacey, C. L.: Evaluation of selected short forms of the Wechsler Intelligence Scale for Children. J.Clin.Psychol. 10:258-261, 1954.
179. Yalowitz, J. M., and Armstrong, R. G.: Validity of short forms of the Wechsler Intelligence Scale for Children (WISC). J.Clin.Psychol. 11:275-277, 1955.

WISC: Reading Disability

180. Kallos, G. L., Grabow, J. M., and Guarino, E. A.: The WISC profile of disabled readers. Personnel Guid.J. 39:476-478, 1961.
181. Altus, G. T.: A WISC profile for retarded readers. J.Consult.Psychol. 20:155-156, 1956.
182. Sheldon, M. S., and Garton, J.: A note on "a WISC profile for retarded readers." Alberta J.Educ.Res. 5:264-267, 1959.
183. Robeck, M. C.: Subtest patterning of problem readers on WISC. Calif.J.Educ.Res. 11:110-115, 1960.
184. Abrams, J. C.: A Study of Certain Personality Characteristics of Non-Readers and Achieving Readers. Unpublished doctoral dissertation, Temple University, 1955.
185. Karlsen, B.: A Comparison of Some Educational and Psychological Characteristics of Successful and Unsuccessful Readers at the Elementary School Level. Unpublished doctoral dissertation, University of Minnesota, 1954.
186. Burks, H. F., and Bruce, P.: The characteristics of poor and good readers as disclosed by the Wechsler Intelligence Scale for Children. J.Educ.Psychol. 46:488-493, 1955.
187. Rogge, H. J.: A Study of the Relationships of Reading Achievement to Certain Other Factors in a Population of Delinquent Boys. Unpublished doctoral dissertation, University of Minnesota, 1959.

WISC: School Achievement

188. Orr, K. N.: The Wechsler Intelligence Scale for Children as a Predictor of School Success. Unpublished master’s thesis, Indiana State Teachers College, 1950.
189. Schwitzgoebel, R. R.: The Predictive Value of Some Relationships Between the Wechsler Intelligence Scale for Children and Academic Achievement in Fifth Grade. Unpublished doctoral dissertation, University of Wisconsin, 1952.
190. Barratt, E. S., and Baumgarten, D. L.: The relationship of the WISC and Stanford-Binet to school achievement. J.Consult.Psychol. 21:144, 1957.
191. Raleigh, W. H.: A Study of the Relationships of Academic Achievement in Sixth Grade With the Wechsler Intelligence Scale for Children and Other Variables. Unpublished doctoral dissertation, Indiana University, 1952.
192. Stroud, J. B., Blommers, P., and Lauber, M.: Correlation of WISC and achievement tests. J.Educ.Psychol. 48:18-26, 1957.
WISC: Auditory Disability, Visual Handicap, Stuttering, Cerebral Palsy, Brain Damage 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. Thompson, B. B.: The Relation of Auditory Discrimina- tion and Intelligence Test Scores to Success in Primary Reading. Unpublished doctoral dissertation, Indiana Uni- versity, 1961. Glowatsky, E.: The verbal element in the intelligence scores of congenitally deaf and hard of hearing children. Amer. Ann.Deaf 98:328-385, 1953. Graham, E. E., and Shapiro, E.: Use of the Performance Scale of the Wechsler Intelligence Scale for Children with the deaf child. J.Consult.Psychol. 17:396-398, 1953. Murphy, L. J.: Tests of abilities and attainments, pupils in schools for the deafaged six to ten, in A. W. G. Ewing, ed., Educational Guidance and the Deaf Child. Man- chester, England. Manchester University Press, 1957. pp. 213-251. Scholl, G.: Intelligence tests for visually handicapped children. Ezcep.Children 20:116-120, 1953. Post, D. P.: 4 Comparative Study of the Revised Stan- ford Binet and the Wechsler Intelligence Scale for Chil- dren Administered to a Group of Thirty Stutterers. Un- published master’s thesis, University of Southern Cali- fornia, 1952. Bortner, M., and Birch, H. G.: Perceptual and perceptual- motor dissociation in cerebral palsied children. J.Nerv. EMent.Dis. 134:108-108, 1962. Beck, H. S., and Lam, R. L.: Use of the WISC in pre- dicting organicity. J.Clin.Psychol. 11:154-157, 1955. Kilman, B. A., and Fisher, G. M.: An evaluation of the Finley-Thompson abbreviated form of the WISC for un- differentiated, brain damaged and functional retardates. Am .J Ment.Deficiency 64:742-746, 1960. Young, F.M., and Pitts, V. A.: The performance of con- genital syphilitics on the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 15:239-242, 1951. 203. Rowley, V. N.: Analysis of the WISC performance of brain damaged and emotionally disturbed children. J.Con- sult.Psychol. 25:553, 1961. 
WIS C: Personality Measures (Normal), Discipline, Delinquency 204. Gourevitch, V., and Feffer, M. H.: A study of motivational development. J.Genet.Psychol. 100:361-375, 1962. 205. Carrier, ‘N. A., Orton, K. D., and Malpass, L. F.: Re- sponses of bright, normal, and EMH children to an orally- administered manifest anxiety scale. J.Educ.Psychol. 53:271-274, 1962. 206. Burns, L.: A Correlation of Scores on the Wechsler In- telligence Scale for Children and the California Test of Personality Obtained by a Group of 5th Graders. Unpub- lished master’s thesis, Pennsylvania State College, 1954. 207. Kent, N., and Davis, D. R.: Discipline in the home and intellectual development. Brit.J.M.Psychol. 30:27-33, 1957. 208. Wall, H. R.: 4 Differential Analysis of Some Intellective and Affective Characteristics of Peer Accepted and Re- jected Pre-Adolescent Children. Unpublished doctoral dissertation, University of Kansas, 1960. 209. Walker, H. A.: The Wechsler Intelligence Scale for Chil- dren as a Diagnostic Device. Unpublished master’s the- sis, Utah State Agricultural College, 1956. 210. Schonborn, R.: A comparative study of the differences between adolescent and child male enuretics and non- enuretics as shown by an intelligence test. Psychol. Newsletter 6:1-9, 1954. 211. Maxwell, A. E.: Discrepancies in the variances of test results for normal and neurotic children. Br.J.Statist. Psychol. 13:165-172, 1960. 212. Richardson, H. M., and Surko, E. F.: WISC scores and status in reading and arithmetic of delinquent children. J.Genet.Psychol. 89:251-262, 1956. WISC: Gifted 213. Chalmers, J. M.: An Analysis of Results Obtained on the Wechsler Intelligence Scale for Children by Mentally Superior Subjects. Unpublished master’s thesis, Uni- versity of Alberta, 19583. 214. Trauba R.G.:4 Study of the Aspects of Differentiation of Abilities in Interpretation of Reading With a Group of Gifted Children. Unpublished doctoral dissertation, University of Kansas, 1959. 215. 
Lucito, L., and Gallagher, J.: Intellectual patterns of highly gifted children on the WISC. Peabody J.Educ. 38:131-136, 1960. WISC: Mental Defectives 216. Nale, S.: The Childrens-Wechsler ana the Binet on 104 mental defectives at the Polk State School. Am.J.Ment. Deficiency 56:419-423, 1951. 217. Sloan, W., and Schneider, B.: A study of the Wechsler Intelligence Scale for Children with mental defectives. Am .J .Ment.Deficiency 55:573-575, 1951. 21 218. 219. 220. 221. 222. 223. 224. 225. 226. 2217. Atchison, C.O.: Use of the Wechsler-Intelligence Scale for Children with eighty mentally defective Negro chil- dren. Am.J.Ment.Deficiency 60:378-879, 1955. Carleton, F. O., and Stacey, C. L.: An item analysis of the Wechsler Intelligence Scale for Children. J.Clin. Psychol. 11:149-154, 1955. Newman, J. R., and Loos, F. M.: Differences between verbal and performance IQ’s with mentally defective chil- dren on the Wechsler Intelligence Scale for Children. J.Consult.Psychol. 19:16, 1955. Alper, A. E.: A comparison of the WISC and the Arthur adaptation of the Leiter International Performance Scale with mental defectives. Am.J.Ment.Deficiency 63:312- 316, 1958. Fleming, J. W.: The Relationships Among Psychometric, Experimental, and Observational Measures of Learning Ability in Institutionalized Endogenous Mentally Re- tarded Persons. Unpublished doctoral dissertation, Uni- versity of Colorado, 1959. Baroff, G. S.: WISC patterning in endogenous mental de- ficiency. Am.J.Ment.Deficiency 64:482-485, 1959. Warren, S. A., and Collier, H. L.: Suitability of the Co- lumbia Mental Maturity Scale for mentally retarded in- stitutionalized females. Am.J.Ment.Deficiency 64:916- 920, 1960. Fisher, G. M.: A cross-validation of Baroff’s WISC pat- terning in endogenous mental deficiency. Am.J.Ment.De- ficiency 65:349-350, 1960. Baumeister, A., and Bartlett, C. J.: Further factorial in- vestigations of WISC performance of mental defectives. Am.J .Ment.Deficiency 67:257-261, 1962. Throne, F. 
M., Schulman, J. L., and Kasper, J. C.: Re- liability and stability of the Wechsler Intelligence Scale for Children for a group of mentally retarded boys. Am. J.Ment.Deficiency 67:455-457, 1962. WISC: Mentally Retarded 228. 229. 230. 231. 232. 233. 22 Stacey, C. L., and Levin, J.: Correlation analysis of scores of subnormal subjects on the Stanford-Binet and Wechsler Intelligence Scale for Children. Am.J.Ment. Deficiency 55:590-597, 1951. Sharp, H. C.: A comparison of slow learner’s scores on three individual intelligence scales. J.Clin.Psychol. 13:872-374, 1957. Matthews, C. G.: Differential Performances of Non- Achieving Children on the Wechsler Intelligence Scale. Unpublished doctoral dissertation, Purdue University, 1958. Finley, C. J., and Thompson, J.: An abbreviated Wech- sler Intelligence Scale for Children foruse with educable mentally retarded. Am.J.Ment.Deficiency 63:473-480, 1958. Finley, C., and Thompson, J.: Sex differences in intel- ligence of educable mentally retarded children. Calif. J.Educ.Res. 10:167-170, 1959. Brown, R., Hakes, D., and Malpass, L.: The utility of the Progressive Matrices Test(1956 revision); abstract- ed, Am.Psychologist 14:341, 1959. 234. 235. 236. 237. 238. 239. Dunn, L. M., and Brooks, S. T.: Peabody Picture Vocab- ulary Test performance of educable mentally retarded children. Train.Sch.Rull. 57:35-40, 1960. Schwartz, L., and Levitt, E.: Short forms of the Wechsler Intelligence Scale for Children in the educable, non-in- stitutionalized mentally retarded. J.Educ.Psychol. 51: 187-190, 1960. Salvati, S. R.: 4 Comparison of WISC 1Q’s and Altitude Scores as Predictors of Learning Ability of Mentally Re- tarded Subjects. Unpublished doctoral dissertation, New York University, 1960; abstracted, Diss.Abs¢r. 21:2370, 1961. Baumeister, A. A.: The Dimensions of Abilities in Re- tardates as Measured by the Wechsler Intelligence Scale for Children. Unpublished doctoral dissertation, George Peabody College for Teachers, 1961. 
Thompson, J. M., and Finley, C. J.: The validation of an abbreviated Wechsler Intelligence Scale for Children for use with the educable mentally retarded. Educ.Psy- chol.Measur. 22:539-542, 1962. Osborne, R. T., and Allen, J.: Validity of short forms of the WISC for mental retardates. Psychol.Rep. 11:167- 170, 1962. WISC: Bilingualism 240. 241. 242. 243. 244. Altus, G. T.: WISC patterns of a selective sample of bi- lingual school children. J.Genet.Psychol. 83:241-248. 1953. Kralovich, A. M.: The Effect of Bilingualism on Intelli- gence Test Scores as Measured by the Wechsler Intelli- gence Scale for Children. Unpublished master’s thesis, Fordham University, 1954. Cooper, J. G.: Predicting school achievement for bilin- gual pupils. J.Educ.Psychol. 49:31-36, 1958. Levinson, B.M.: A comparison of the performance of bi- lingual and monolingual native born Jewish preschool children of traditional parentage on four intelligence tests. J.Clin.Psychol. 15:74-76, 1959. Levinson, B. M.: A comparative study of the verbal and performance ability of monolingual and bilingual native born Jewish preschool children of traditional parentage. J.Genet.Psychol. 97:93-112, 1960. WISC: Cultural Variations 245. 246. Levinson, B. M.: Traditional Jewish cultural values and performance on the Wechsler tests. J.Educ.Psychol. 50:177-181, 1959. Levinson, B. M.: Subcultural variations in verbal and performance ability at the elementary school level. J. Genet.Psychol. 97:149-160, 1960. WISC: Socioeconomic Status 247. 248. Estes, B. W.: Influence of socioeconomic status on Wech- sler Intelligence Scale for Children, an exploratory study. J.Consult.Psychol. 17:58-62, 1953. Estes, B. W.: Influence of socioeconomic status on Wech- sler Intelligence Scale for Children, addendum. J.Con- sult.Psychol. 19:225-226, 1955. 249. Roy, I., and Cohen, N.: Some psychometric variables relative to change in sociometric status; abstracted, Am. Psychologist 10:328, 1955. 250. Laird, D. 
II. THE WIDE RANGE ACHIEVEMENT TEST, THE ORAL READING AND ARITHMETIC SUBTESTS

The requirement of the Survey for an individually administered, brief, well-standardized, reliable, valid, and flexible school achievement test was filled by the selection of the Reading and Arithmetic subtests of the 1963 revision of the Wide Range Achievement Test. The 1963 WRAT, by J. F. Jastak, replaces the original 1946 edition by Jastak and S. W. Bijou and appears to be quite similar to the original in design and item content, except that the new edition is divided, for the convenience of users, into two levels (Level I covers ages 5 to 12 years; Level II, 12 years through adulthood), in contrast with the broad sweep of the original, from kindergarten through adulthood.

The principal difference between the two editions appears to be in the method of standardization.
The 1946 norms were computed to conform to those of the New Stanford Achievement Test (Reading, to New Stanford Word and Paragraph Reading, and Arithmetic Computation, to New Stanford Arithmetic Computation), whereas the 1963 norms, in each age bracket, depend on "probability samplings based on IQ's... that would correspond to the achievement of mentally average groups with representative dispersions of scores above and below the mean" (301).

The purpose of this section is both to review the literature on the WRAT and to evaluate the test's suitability for the objectives of the Survey. Unfortunately this must be done almost entirely on the basis of the tests, manuals, and research available on the 1946 edition, and that research is itself extremely limited. Appropriate data for critical evaluation of the 1963 edition are almost totally lacking. Although the test was released for sale in 1963, the manual for this edition was still incomplete in June 1964 (301), and no independent data on validity have been found.

EVALUATIVE CRITERIA

Measurement experts believe that in addition to the standard questions concerning such issues as reliability, validity, representativeness of standardization sample, and agreement of norms with criterion levels, some problems are inherent in the wide-range type of design. These are stated forthrightly by Chauncey and Dobbin (310), in a discussion of various defects of tests:

The "wide-range" test . . . is the too-short test in disguise. There are only a few of them around. They are promoted as being suitable measures of ability (or achievement) for people of many ages, from third grade through second year of college, for example. Since only a small part of any such test can be material suitable in difficulty for one individual, the effective part of the test may amount to no more than half a dozen questions, making it a very short test, indeed.

These remarks, by the president and one of the project directors of the Educational Testing Service, in a book written expressly to defend educational testing at a time when it is under attack from many sources, command the attention and concern of users of wide-range tests such as the WRAT. The particular implication of the critique is that reliabilities, validities, and score levels must be evaluated at every level covered (or at least at every level at which the test is used) and that broad-band coefficients of reliability and concurrent validity are likely to be misleading.

The problem of selecting a suitable achievement test for the Survey is highly complex. Time restrictions favor short forms and short-cut methods (such as the wide-range approach), provided that they meet reasonable standards of acceptability. However, it is just as true in testing as in all other areas that "you cannot get more out than you put in." Compromises with reality in testing often mean less reliable measures and less adequate coverage of appropriate universes of content; sometimes they mean penalties in relation to validity and consequent generalizability of measures.

The application of these points to the WRAT is considered as judicially as possible in this review, and the reality demands are weighed against possible shortcomings of this wide-range test in relation to alternatives available in the situation. A brief review of the 1946 edition and the general conceptualization of the WRAT is followed by a review of the 1963 edition used in Cycle II.

1946 EDITION OF WRAT

The conceptualization and rationale of this test (302) could not help but appeal to clinical psychologists in schools and mental health services. Jastak made an extremely strong case for the clinical use of his test, and it is not surprising that the WRAT has enjoyed considerable popularity in clinical circles despite psychometricians' prejudice against wide-range tests.
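
The psychometric objection to wide-range tests can be made concrete with the Spearman-Brown formula, which predicts how reliability changes with test length. The following is a minimal sketch, not a calculation taken from either WRAT manual: the 60-item length, the overall reliability of 0.95, and the six "effective" items are hypothetical figures chosen only to mirror the Chauncey and Dobbin remark, and the formula assumes parallel (equally informative) items, which a wide-range test does not strictly satisfy.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by `length_factor`.

    length_factor > 1 lengthens the test; length_factor < 1 shortens it.
    Assumes the added or removed items are parallel to the rest.
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    if length_factor <= 0.0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical figures (not from the WRAT manuals):
full_items = 60        # items in the whole wide-range test
effective_items = 6    # items of suitable difficulty for one examinee
full_reliability = 0.95

effective = spearman_brown(full_reliability, effective_items / full_items)
print(f"Predicted reliability of the effective part: {effective:.2f}")  # 0.66
```

Under these assumed figures a broad-band coefficient of 0.95 would correspond to roughly 0.66 for the handful of items that actually discriminate at one level, which is why level-by-level coefficients are asked for.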

Jastak's arguments are briefly as follows:

1. A thorough psychological examination should include tests of school fundamentals as well as intelligence tests. Intelligence tests account for only a portion of the variance in school achievement, and failure in school and life adjustment may result from factors other than low intelligence.

2. Reliable (and valid) school tests should be used to assess discrepancies between intellectual capacity and performance in basic school subjects as well as discrepancies in the organization of learning abilities. Wide range discrepancies in school achievement are the rule rather than the exception, and their discovery is important for the understanding of personality and school performance problems and for the institution of proper remedial programs.

3. Clinically recognized discrepancy patterns in children are illustrated by the tendency of neurotic and disorganized children to be more proficient in reading than in arithmetic. In addition, "if neurotic tendencies and special reading handicaps occur together the child may function far below the level of his true capacity in all school subjects." Of course, failure in reading and in arithmetic may also reflect unrelated processes.

Jastak's criteria of a satisfactory school achievement test for (individual) clinical use are (a) low cost, (b) individual standardization, (c) ease and economy of administration, (d) suitability of contents, (e) relevance of the functions studied, and (f) comparability of results over the entire range of the skills in question. It is apparent that these criteria do in effect exclude such standard school achievement batteries as the Stanford, Iowa, Cooperative, and other well-known and highly respected batteries that are designed for group administration within a narrow grade range and cover a large universe of content, requiring considerable time to administer and score.
These criteria certainly appear to be "tailor made" for the Survey (as well as for clinical practice). However, in view of the testing conditions for individually selected members of the national sample, the question is, how well are they implemented in the WRAT?

Jastak's views on test content are of particular interest. The WRAT focuses entirely on three basic school study skills (reading, spelling, and arithmetic) "around which most school studies revolve." The range of the subtests for each is indeed wide, from kindergarten to college. The test content is concerned principally with mastery of the mechanics of the subject rather than with comprehension. Thus the reading test is in effect a test of reading as a motor skill; the spelling test focuses on words without sentence contexts; and the arithmetic test involves number facility with minimal dependence on reading.

This emphasis is a reflection of the author's conception of the WRAT as an adjunct to tests of intelligence and behavior adjustment. Information concerning the subject's ability to comprehend can be obtained from intelligence tests, but accurate measurement of mechanics in the basic tools chosen is essential because of the dependence of most other studies on them. Further, it is argued that correct answers can often be given in conventional reading, arithmetic, and other subject-matter achievement tests on the basis of general knowledge and intellectual ability, even when mastery of mechanics is poor; thus, important diagnostic cues are overlooked. Although the WRAT Reading and Arithmetic tests were reported to correlate satisfactorily with other achievement tests, their limitations of content and intended use were clearly outlined in the manual.

As stated above, the 1946 edition of the WRAT was standardized by anchoring the WRAT norms to those of corresponding subtests of the New Stanford Achievement Test.
The standardization sample consisted of the scores of 4,052 students for Spelling and Arithmetic (about 1,500 were individually tested; the remainder were tested in groups) and 1,429 students, individually tested, for Reading. Reliability coefficients (retest) were reported as 0.95 for Reading (N=110) and 0.90 for Arithmetic (N=120). The WRAT Reading subtest was reported to have correlated 0.81 with the Paragraph Meaning section of the New Stanford Achievement Test (and 0.84 with Word Meaning; see table 5); the WRAT Arithmetic subtest correlated 0.91 with Stanford Arithmetic Computation.

The detailed composition of the various samples was not reported in the 1946 manual, and the validation data were not specified by age level as would be required to conform with the evaluative criteria discussed above. This was not exceptional in 1946, however, when the professional demands for rigorous reporting of critical information by test publishers were less stringent than they are today.

Nevertheless, despite the absence of comprehensive statistical information, the WRAT became a favorite of a large number of clinicians, and its use was extensive in the United States and abroad within a short time of its publication. It may appear surprising that so popular a test generated so little research. However, it appears that the principal use of the test was by clinicians, whose attitudes toward tests are usually validated more by clinical experience than by statistics and whose opportunities and motivations to conduct and publish research are generally limited.

RESEARCH ON THE 1946 WRAT

It is noteworthy that only seven research reports have been found dealing with the 1946 edition and that of these seven, two were unpublished mimeographed papers (303 and 306) furnished by Dr. Jastak. Reliability coefficients and correlations of the WRAT with other tests, abstracted from these reports and the two test manuals (301 and 302), are reported in tables 4 and 5.
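
The sampling uncertainty attached to coefficients based on samples of this size can be gauged with the Fisher z transformation. The sketch below is an illustration added for the reader, not an analysis reported in either manual. It applies the method to the retest coefficients quoted from the 1946 manual (0.95 with N=110; 0.90 with N=120), and it assumes simple random sampling and approximately bivariate normal scores; the 95-percent level is an arbitrary choice.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z transformation; z_crit = 1.96 gives a 95-percent
    interval. Requires n > 3 and -1 < r < 1.
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Retest coefficients as quoted from the 1946 manual
for label, r, n in [("Reading", 0.95, 110), ("Arithmetic", 0.90, 120)]:
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r:.2f}, N = {n}, interval {low:.3f} to {high:.3f}")
```

Intervals of this kind describe sampling error only; they say nothing about the heterogeneity of the samples, which the text treats as a separate concern.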

Reading

Hopkins, Dobson, and Oldridge (304) quoted Sundberg (312), in a 1961 paper, to the effect that although the WRAT was the second most popular achievement test in clinics, Sundberg could not find a single empirical study of it. They administered the Reading subtest to 502 children in grades 1 to 5 and correlated the scores with teacher ratings and scores on the California Reading Test (CRT). The correlations with teacher ratings were high for grades 1 to 5: 0.79, 0.74, 0.86, 0.86, and 0.85, respectively. The correlations with the total score of the California Reading Test were 0.86 for grade 3 and 0.71 for grade 5. The mean grade placements on the WRAT, for the five grades in order, were 1.4, 2.4, 3.5, 4.1, and 4.7.

Wagner and McCoy (303) reported correlations of the WRAT Reading subtest with the Sangren-Woody Silent Reading Test (grade level) for two samples, one of 29 fifth graders and the other of 57 primary school juvenile offenders. The correlations were 0.78 and 0.74. In the first sample, the WRAT Reading subtest correlated 0.78 both with teacher ratings and with rank order of midterm grades. The correlation with the Stanford Reading Test, in the second sample, was 0.80.

Table 4. Studies reporting reliability coefficients of the WRAT

Investigator | Year | Subjects | Method | Age range | Reading N | Reading reliability | Arithmetic N | Arithmetic reliability
Jastak and Bijou (302) | 1946 | Normals (a) | Test-retest | N.R. | 110 | 0.95 | 120 | 0.90
Jastak (301) | 1963 | N.R. | Split-half | Level II: | | | |
 | | | | 20+ years | 200 | 0.99 | 200 | 0.97
 | | | | 18-19 years | 200 | 0.98 | 200 | 0.97
 | | | | 16-17 years | 200 | 0.99 | 200 | 0.95
 | | | | 15 years | 200 | 0.99 | 200 | 0.97
 | | | | 14 years | 200 | 0.99 | 200 | 0.96
 | | | | 13 years | 200 | 0.99 | 200 | 0.96
 | | | | 12 years | 200 | 0.99 | 200 | 0.94
 | | | | Level I: | | | |
 | | | | 11 years | 200 | 0.99 | 200 | 0.95
 | | | | 10 years | 200 | 0.99 | 200 | 0.95
 | | | | 9 years | 200 | 0.99 | 200 | 0.94
 | | | | 8 years | 200 | 0.99 | 200 | 0.95
 | | | | 7 years | 200 | 0.99 | 200 | 0.96
 | | | | 6 years | 200 | 0.99 | 200 | 0.96
 | | | | 5 years | 200 | 0.98 | 200 | 0.97
 | | Standardization population | Form I with Form II | 14-0 to 14-11 | 89 | 0.88 | 87 | 0.86
 | | | | 13-0 to 13-11 | 224 | 0.90 | 194 | 0.87
 | | | | 12-6 to 12-11 | 180 | 0.94 | 165 | 0.85
 | | | | 12-0 to 12-5 | 179 | 0.92 | 164 | 0.86
 | | | | 11-6 to 11-11 | 252 | 0.91 | 225 | 0.85
 | | | | 11-0 to 11-5 | 197 | [illegible] | 191 | 0.82
 | | | | 10-6 to 10-11 | 214 | 0.93 | 195 | 0.89
 | | | | 10-0 to 10-5 | 207 | 0.90 | 190 | 0.84
 | | | | 9-6 to 9-11 | 165 | 0.91 | 160 | 0.79
 | | | | 9-0 to 9-5 | 81 | 0.90 | 78 | 0.88

(a) Level of subjects and time interval between tests not reported.
NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. N.R. = not reported.

Table 5. Studies reporting correlation between the WRAT and other measures

WRAT Reading Test

Investigator | Year | Test or criterion variable | Subjects (a) | Age range | T | M | F | Correlation
Smith (126) | 1961 | Full Range Picture Vocabulary Test | Normals, Grade 2 | 6-11 to 8-10 | 100 | 51 | 49 | 0.42
Hopkins, Dobson, and Oldridge (304) | 1962 | California Achievement Test | Normals | N.R. | 257 | --- | --- | ---
 | | Reading Vocabulary | Grade 3 | N.R. | 171 | --- | --- | 0.83
 | | | Grade 5 | N.R. | 86 | --- | --- | 0.67
 | | Reading Comprehension | Grade 3 | N.R. | 171 | --- | --- | 0.84
 | | | Grade 5 | N.R. | 86 | --- | --- | 0.67
 | | Total Reading | Grade 3 | N.R. | 171 | --- | --- | 0.86
 | | | Grade 5 | N.R. | 86 | --- | --- | 0.71
Smith (126) | 1961 | California Test of Mental Maturity | Normals, Grade 2 | N.R. | 100 | 51 | 49 | 0.47
Lawson and Avila (305) | 1952 | Gray Standardized Oral Reading Paragraphs Test | Mental defectives | 16-45 years | 30 | 19 | 11 | 0.94
Reger (307) | 1962 | Metropolitan Achievement Tests, Reading | Retarded boys | 9-9 to 14-6 | 25 | --- | --- | 0.76 (b)
Wagner and McCoy (303) | N.R. | Midterm grades (rank order) | Normals, Grade 5 | N.R. | 29 | --- | --- | 0.78
Jastak and Bijou (302) | 1946 | Stanford Achievement Test, Reading | Normals, Grades 7 and 8 | N.R. | | | |
 | | Word Meaning | | N.R. | 389 | --- | --- | 0.84
 | | Paragraph Meaning | | N.R. | 389 | --- | --- | 0.81
Wagner and McCoy (303) | N.R. | Sangren-Woody Reading Test | Normals, Grade 5 | N.R. | 29 | --- | --- | 0.78
 | | | Juvenile offenders | N.R. | 57 | --- | --- | 0.74
 | | Stanford Reading Tests | Juvenile offenders | N.R. | 47 | --- | --- | 0.80
 | | Teacher rating of reading ability | Normals, Grade 5 | N.R. | 29 | --- | --- | 0.78
Hopkins, Dobson, and Oldridge (304) | 1962 | Teacher rating of reading ability | Normals | | 502 | --- | --- | ---
 | | | Grade 1 | N.R. | 90 | --- | --- | 0.79
 | | | Grade 2 | N.R. | 106 | --- | --- | 0.74
 | | | Grade 3 | N.R. | 171 | --- | --- | 0.86
 | | | Grade 4 | N.R. | 49 | --- | --- | 0.86
 | | | Grade 5 | N.R. | 86 | --- | --- | 0.85
Smith (126) | 1961 | Wechsler Intelligence Scale for Children | Normals, Grade 2 | N.R. | 100 | 51 | 49 | ---
 | | Verbal Score | | | | | | 0.55
 | | Performance Score | | | | | | 0.47
 | | Full Score | | | | | | 0.61

WRAT Arithmetic Test

Investigator | Year | Test or criterion variable | Subjects (a) | Age range | T | M | F | Correlation
Holowinsky (309) | 1961 | California Reading Test | Normals and retarded | 12-17 years | 600 | --- | --- | 0.61
Murphy (306) | N.R. | First-quarter grades | Normals | | 241 | --- | --- | ---
 | | | Grade 5 | N.R. | 135 | --- | --- | 0.64
 | | | Grade 6 | N.R. | 106 | --- | --- | 0.56
Holowinsky (309) | 1961 | Grade placement | Normals and retarded | 12-17 years | 600 | --- | --- | 0.31
Reger (307) | 1962 | Metropolitan Achievement Tests, Arithmetic | Retarded boys | 9-9 to 14-6 | 25 | --- | --- | 0.87 (b)
Jastak and Bijou (302) | 1946 | Stanford Achievement Tests, Arithmetic Computation | Normals, Grades 7 and 8 | N.R. | 140 | --- | --- | 0.91
Holowinsky (309) | 1961 | Otis Quick Scoring Mental Ability Tests | Normals, retarded | 12-17 years | 600 | --- | --- | 0.30
 | | | | 12-13 years | N.R. | --- | --- | 0.59
 | | | | 13-14 years | N.R. | --- | --- | 0.39
 | | | | 14-15 years | N.R. | --- | --- | 0.54
 | | | | 15-16 years | N.R. | --- | --- | 0.02
 | | | | 16-17 years | N.R. | --- | --- | 0.09
Murphy (306) | N.R. | Stanford Achievement Tests, Arithmetic, and school grades | Normals | | 241 | --- | --- | ---
 | | | | N.R. | 135 | --- | --- | 0.59
 | | | | N.R. | 106 | --- | --- | 0.35
 | | Stanford Achievement Tests, Arithmetic, and school grades | Normals | | 241 | --- | --- | ---
 | | | Grade 5 | N.R. | 135 | --- | --- | 0.75 (Multiple r)
 | | | Grade 6 | N.R. | 106 | --- | --- | 0.70 (Multiple r)

(a) Subjects are always white Americans unless otherwise specified.
(b) Spurious correlation with age for small N.
NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. T = total population; M = male; F = female; N.R. = not reported; r = correlation.

The report by Lawson and Avila (305) of a correlation of 0.94 between the WRAT Reading subtest and the Gray Oral Reading Test, administered to a sample of retarded adults ranging widely in age and IQ, is probably inflated because of the nature of the sample. Similarly, Reger's (307) sample of 25 emotionally disturbed, retarded boys (age range 9-9 to 14-6) is also quite a diverse population. Reger reported a correlation of 0.76 between the WRAT Reading subtest and the Metropolitan Achievement Test.

Holowinsky (309) had an apparently well-designed sample of 600, including 75 children at each age from 12 to 16 years. Each group was divided into three categories on the basis of IQ scores. The categories were as follows: 80-89 IQ, 90-99 IQ, and 100-109 IQ. For the total sample of 600 children, the California Reading Test correlated 0.61 with the WRAT Arithmetic subtest. Students of lower intellectual ability tended to show better achievement in arithmetic than in reading. For the total sample of 600 children the WRAT had a correlation of 0.31 with grade placement.

These limited results tend to support the claims for the WRAT with regard to concurrent validity both with other reading tests and with grade placement. The evidence is far from sufficient to permit definitive evaluation, and the lack of information on many points is obvious.
However, no contrary evidence was found, and as far as these papers are concerned, the report for the WRAT Reading subtest is favorable.

Arithmetic

The most adequate independent study of the WRAT Arithmetic subtest is that of Murphy (306), who tested 241 fifth and sixth graders (135 in grade 5 and 106 in grade 6, as shown in table 5; average IQ of 114) with the WRAT and the Stanford Achievement Test (SAT). The correlation of the two tests was 0.59 for grade 5 and 0.35 for grade 6. The correlations between Arithmetic grades and the WRAT were 0.64 for grade 5 and 0.56 for grade 6. Correlations between the SAT and Arithmetic grades were 0.68 for grade 5 and 0.59 for grade 6. In Reger's sample, noted above (307), the WRAT Arithmetic test had a correlation of 0.87 with the Metropolitan Achievement Test. Holowinsky's study mentions a correlation of 0.59 between the IQ scores of 12-year-olds and the WRAT Arithmetic subtest, as compared with 0.71 for the Reading subtest.

These results are less satisfactory than those for Reading in that the correlations reported compare less favorably with those mentioned in the manual. This type of cross-validation is imperative and demonstrates the importance of independent reports to supplement the data provided in a test manual. To Dr. Jastak's credit, however, it should be noted that the Murphy report, in which the lower correlations appear, is an unpublished paper which he furnished unsolicited for this review. These studies are insufficient for an evaluation of the WRAT Arithmetic subtest, to be sure. As the only information available, they leave the case for the Arithmetic test without strong independent support.

1963 EDITION OF WRAT

Two major changes appear in the 1963 edition. One is the division of the test into two levels. Level I covers the age range of 5 to 12 years; Level II covers the age range 12 years through adulthood.
It is pointed out in the mimeographed manual for this edition that this change not only has reduced the time of test administration, but also has increased the number of items at each level, thereby increasing "the already high reliability" of the test. Indeed, the test has been lengthened, and the reliabilities have been listed for samples of 200 each for ages 5 through 11 years (Level I). For Reading, all ages except 5 years show a coefficient of 0.99. (Age 5 shows 0.98.) Similarly computed reliabilities for Arithmetic are listed at or above 0.94, with the highest correlation, 0.97, occurring at 5 years of age.

Since these coefficients are based on correlations between two forms of the test, they are considered by the authors to be inflated. The text of the reliability section of the manual (301, p. 47) states that the reliability coefficients are more likely within the range 0.90 to 0.95 with a mean of 0.92. At this level, they do not seem perceptibly higher than the reliabilities reported in the 1946 manual.

The second major change is in method of standardization. The 1963 manual (301) describes the development of norms and the normative population sample as follows:

The revised WRAT was administered to school children and adults in a number of states: Delaware, Pennsylvania, New Jersey, Maryland, Florida, Washington, and California. No attempt was made to obtain a representative national sampling. Nor is such a sampling considered essential for proper standardization. (italics added)

The groups of children were selected from schools of known socioeconomic levels. The IQ's of the children were also known from group tests such as the Lorge-Thorndike, the Kuhlmann-Anderson, and the California Mental Maturity Test, administered at the schools. Many of the cases (over 1,000) in the standardization group had been given individual tests such as the Stanford-Binet, Wechsler Intelligence Scale for Children, and others.

In each age bracket, probability samplings based on IQ's were studied to develop WRAT norms that would correspond to the achievement of mentally average groups with representative dispersions of scores above and below the mean. (italics added)

From the standpoint of the Health Examination Survey, with particular reference to Cycle II (children aged 6-11 years), the first of the two mentioned changes is an advantage. The age range of Level I fits the age range of Cycle II perfectly, and the increased length of the test and more extensive reliability studies reported support the claim of excellent reliability. The second change, in standardization and norm development, does, however, present a potential problem which is accentuated by the absence of validity data. This is discussed below.

Validity and Norms

Although published in 1963, the validity section of the revised WRAT was not available for review until late in June 1964. The delay was explained by the author of the test as occasioned by comparison of the WRAT "with a number of other tests in order to determine the meaning and diagnostic value of the three subtests in relation to other abilities." In addition, his letter disclosed that "specific methods to identify, in individual cases, the size of the independent and separate variances will have to be developed. Since this is somewhat of a novel and pioneering venture, it takes more time than routine manual preparation." The latter quotation is discussed separately below.

The basis for the present evaluation is, then, a comparison of the content and structure of the 1946 and 1963 editions of the WRAT, supplemented by the limited independent literature on the 1946 edition, reviewed above, and the limited data on the 1963 edition provided in the manual furnished by the author. No independent studies of the 1963 edition were available.
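
The manual's argument, noted above, that adding items raises reliability is conventionally expressed by the Spearman-Brown formula. The following is a minimal sketch of that formula only. The function name and the example values (a starting reliability of 0.90 and a test lengthened by half) are assumptions chosen for the arithmetic and are not figures taken from the WRAT manual.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test whose length is multiplied by length_factor.

    reliability   : reliability of the original test (between 0 and 1)
    length_factor : new length divided by old length (2.0 doubles the test)
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0.0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical example: a test with reliability 0.90 is lengthened by 50 percent.
predicted = spearman_brown(0.90, 1.5)
print(round(predicted, 3))  # 0.931
```

With a length factor of 2.0 the same function gives the usual correction applied to a split-half correlation. Under the assumed values the predicted gain is only about 0.03, which illustrates why lengthening an already reliable test changes its coefficient very little.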
Comparison of the Two Editions

Examination of the two booklets indicates close similarity in item content, format, administration, and scoring. The Reading test for Level I, in the revised edition, contains 55 words that were in the 1946 edition, and the rank-order correlation of their sequential positions in the two editions is about 0.99. It is presumed that the 20 new words were empirically calibrated to fit into the previously established word order. The arithmetic items of the new test are of the same general type as in the earlier test, although the format is slightly different and the number of items is increased.

In view of this similarity, it appears reasonable to expect that the network of correlations of the revised test with other measures would be approximately the same as that reported for the 1946 edition. In fact, the correlations might even be slightly higher as a result of the greater length of the revision. To the extent that concurrent validity could be accepted for the 1946 edition, therefore, there is no reason to doubt that it will be upheld with the 1963 edition. Although the data are quite inadequate, tentative acceptance on this point appears warranted, based on the authors' reputations and the statements in the manual. However, this is only part of the problem.

Validation of 1963 Edition

It is equally important to be able to interpret meaningfully the grade ratings, standard scores, and percentiles in relation to individual age and grade placement and in relation to population parameters. In the absence of empirical information on this issue, nothing definite can be concluded. It is appropriate to raise some questions which have been generated by statements made in the 1963 manual.

In the first place, the reviewer would take issue with the test author's statement that a representative national sampling is not essential for proper standardization. A national sample is certainly necessary if national norms are to be promulgated.
Although the 1946 edition was developed on a restricted (as opposed to national) sample, its norms were presumably keyed to the grade norms of the New Stanford Achievement Test, for which a more extensive base existed. Even though regional, ethnic, and other perturbing effects were not known, it was at least possible to invoke the Stanford norms in interpreting grade levels.

With the 1963 edition, however, no such anchoring process was followed. The only indications concerning age-grade levels are, in fact, disquieting. The manual goes on to say that intelligence quotients of a number of group and individual tests (which are generally known to vary in level among themselves) were used to select samples in each age bracket "that would correspond to the achievement of mentally average groups with representative dispersions of scores above and below the mean." (italics added) It would indeed be remarkable if such a procedure could produce a standard reference sample of known characteristics for normative purposes. Therefore it is doubtful that the resulting norms could have dependable accuracy for individual assessment or for analysis of groups in the manner required for the national sample of the Health Examination Survey. Perhaps the test author's current concern with comparisons with other tests, referred to above, reflects realization of this problem.

Furthermore, in view of the professed clinical purposes of the WRAT, it is surprising that the standardization research is confined to "mentally average groups," and that no studies were undertaken of such groups as gifted pupils, students retarded in reading, arithmetic, and other school subjects, disturbed children, and subnormal children. For the purposes of a national survey, problems of ethnic and regional variations in test performance are important, as are other sources of perturbation attributable to deviations of ability, personality, and physical and social factors.
The absence of such data for the 1963 WRAT is certainly not the sole responsibility of the author-publisher; ordinarily test producers do not assume responsibility for all possible research of interest to all possible users. If a test attracts interest, information about it in various situations gradually accumulates in the literature. However, in the present case it appears fair to say that the author's confidence in his test led him to publish the revision before he had completed his own research and before research on it by any users could be reported. The test was issued without a formal designation of the norms as "tentative" and without any qualifications.

Validity Variances

Instead, the 1963 manual (301, p. 2) concludes its introductory section with the following paragraph:

In addition to the three operational aspects (of mechanics and comprehension in relation to each skill test) the basic skills have several unique validities which will be explained later by reference to appropriate research. The validity variances will not only support the empirical distinctness of mechanics and comprehension, but will provide the degrees to which each is important in learning to read, spell and figure and the impact the relationship between them has on the total learning process.

The burden of proof is on the author. The development of such an analytic scheme for interpretation of test scores is indeed both novel and ambitious and deserves all the time required to complete it. It seems regrettable, however, that the test was released before critical users could evaluate not only these devices, but even the grade ratings, percentiles, and standard scores included in the manual.

Validity Data in 1963 Manual

The section of the manual entitled "Validity of the WRAT" (301, p. 51) contains a table of means and standard deviations of raw scores for the Reading, Spelling, and Arithmetic subtests, which indicates considerable need for refinement of the tests in order to produce an even progression of scores from grade to grade. The difficulties are considerable at some levels (8.0 to 8.5, 9.5 to 10.0, and 10.5 to 11.0, on the Reading test, for example), to say nothing of the fact that the basic difficulties reported about the standardization sample are not only not clarified, but are not even referred to in this section of the manual.

Two paragraphs on the validity of the Reading test (301, p. 50) refer only to the studies cited above, which involve the 1946 edition of the WRAT. No validity data on the 1963 edition are presented. Similarly, data are presented (301, p. 52) on correlations of the WRAT with achievement tests and on the validity of the Arithmetic subtest, but these are also identified as relating to the 1946 edition.

Internal consistency data cited by the author (301, p. 53) involve intercorrelations among the three WRAT subtests and not validity, despite the author's assertion that "criteria of internal consistency, if properly interpreted, are usually more valid than are external criteria of comparison." These data are also presented as "one method of cross-validation."

Correlations of the Wide Range Achievement Test with the California Test of Mental Maturity are given (301, p. 54) for a sample of 74 children spanning the age range of 5 to 15 years. They range from 0.74 to 0.84 and may be spuriously high in view of the heterogeneity of the sample.
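
The point about heterogeneity can be made concrete with a small simulation. The sketch below is illustrative only and rests on stated assumptions: two hypothetical test scores each equal the child's age plus independent random error, so that within any single age the two scores are unrelated, yet pooling ages 5 through 15 produces a high correlation. The function names, the sample size per age, and the error spread are all invented for the illustration and do not describe the manual's data.

```python
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(ages=range(5, 16), per_age=200, seed=0):
    """Return (pooled correlation, mean within-age correlation)."""
    rng = random.Random(seed)
    pooled_a, pooled_b, within = [], [], []
    for age in ages:
        # Both scores depend on age only; their errors are independent.
        a = [age + rng.gauss(0, 1) for _ in range(per_age)]
        b = [age + rng.gauss(0, 1) for _ in range(per_age)]
        within.append(pearson(a, b))
        pooled_a.extend(a)
        pooled_b.extend(b)
    return pearson(pooled_a, pooled_b), sum(within) / len(within)


pooled_r, mean_within_r = simulate()
print(f"pooled across ages: {pooled_r:.2f}")
print(f"mean within one age: {mean_within_r:.2f}")
```

In this constructed case the pooled coefficient reflects nothing but the shared dependence on age, which is the sense in which a coefficient from a sample spanning ages 5 to 15 may be called spuriously high.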
Similarly structured comparisons with the WISC for 300 boys (aged 5 to 15 years) and 244 girls (aged 5 to 15 years) are reported which indicate correlations as follows:

Sex and test          Reading    Arithmetic
Boys
  Vocabulary (1)       0.65       0.56
  Block Design         0.41       0.41
Girls
  Vocabulary (1)       0.56       0.56
  Block Design         0.39       0.50

(1) Based on Jastak's short-form revision (311).

In view of the composition of the sample, these are surprisingly low.

The manual also reports (301, p. 55) correlations of WISC Verbal Scale, Performance Scale, and Full Scale with the WRAT (1963), with samples covering narrower age ranges of 5 to 7 years and 8 through 11 years. The results here are the most impressive concurrent validity data in the manual, although they indicate correlations in the 0.6 to 0.7 range with intelligence rather than achievement criteria, for which they are intended.

As stated several times earlier, the accuracy of score levels in the WRAT norms is regarded as a more pressing problem for empirical demonstration than the concurrent validity (covariation with related measures) of the test. On this point the validity section of the manual is silent.

Grade Equivalents

The 1963 manual (301, p. 22) states that grade norms were derived from "the actual mean grade levels of the children in each grade group." Despite variations in school grade-placement practices over time, grade rating is characterized as "rather stable." The manual further asserts "striking comparability of grade ratings of the old and the new WRAT's" "through nearly all educational levels except the upper ranges." Grade ratings below 14 years of age are said to be less arbitrary than grade ratings over 14 years of age. The grade scores are intended to be comparable to mental ages.

Standard Scores

The WRAT standard scores can be converted from raw scores by age group in a table provided in the manual.
The standard score has a mean of 100 and a standard deviation of 15 and is intended to be equivalent to an IQ from the WAIS, WISC, Stanford-Binet (Form L-M) or any of the major intelligence scales. Although these scales are not comparable themselves (as developed in some detail in section I of this report), the manual states that "the results from the WRAT test can thus be directly compared with the major individual intelligence scales."

The standard score is asserted to be the "most precise and most meaningful score." It is the only score that is comparable between subtests and that provides for uniform differences between scores.

Percentiles

Percentiles are included "because of their present popularity and convenience," but the manual appropriately downgrades them and discourages their use.

SUMMARY AND CONCLUSIONS

The foregoing review of the WRAT is necessarily incomplete because of lack of adequate information on which to base a technical evaluation. The test is well conceptualized and has much face validity, but standardization information on the 1946 edition was inadequate, and on the 1963 edition it is thus far insufficient.

Published research on the 1946 WRAT has been extremely limited and fails to answer most of the questions left unanswered by the authors' manual. Moreover, analysis of the available information on the 1963 edition raises doubts about normative score levels.

The selection of the WRAT over other available school achievement tests may be defended on the grounds of administrative expediency and suitability of the material for the purposes of the Survey, in spite of the fact that inadequate data exist to support the author's claims of validity. It is possible that such data may be produced, and every effort should be made to obtain them.
However, unless these results are convincing (and reason to doubt that they will be has been expressed), it is recommended that serious consideration be given to carrying out a complete restandardization of the Reading and Arithmetic subtests on the entire national sample. Unless this is done, projections of estimates to the population may be seriously in error.

BIBLIOGRAPHY

Research References and Manuals

301. Jastak, J. F.: Wide Range Achievement Test, rev. ed. Wilmington, Del. Guidance Associates, 1963.
302. Jastak, J. F., and Bijou, S. W.: The Wide Range Achievement Test. Wilmington, Del. C. L. Story Co., 1946.
303. Wagner, R. F., and McCoy, F.: Two validity studies of the Wide Range Achievement Reading Test. Personal communication.
304. Hopkins, K. D., Dobson, J. C., and Oldridge, O. A.: The concurrent and congruent validities of the Wide Range Achievement Test. Educ.Psychol.Measur. 22:791-793, 1962.
305. Lawson, J. R., and Avila, D.: Comparison of Wide Range Achievement Test and Gray Oral Reading paragraphs reading scores of mentally retarded adults. Percept.Mot.Skills 14:474, 1962.
306. Murphy, G. M.: An investigation of the utility of mathematics sub-test from the Wide Range Achievement Test, as applied to intermediate level groups. Personal communication.
307. Reger, R.: Brief tests of intelligence and academic achievement. Psychol.Rep. 11:82, 1962.
308. Warren, S. A.: Academic achievement of trainable pupils with five or more years of schooling. Train.Sch.Bull. 60:75-88, 1963.
309. Holowinsky, I.: The relationship between intelligence (80-110 I.Q.) and achievement in basic educational skills. Train.Sch.Bull. 58:14-22, 1961.

Other References

310. Chauncey, H., and Dobbin, J. E.: Testing, Its Place in Education Today. New York. Harper and Row, 1963.
311. Jastak, J. F., and Jastak, S. R.: Short forms of the WAIS and WISC vocabulary subtests. J.Clin.Psychol. 20:167-199, 1964.
312. Sundberg, N. D.: The practice of psychological testing in clinical services in the United States. Am.Psychologist 16:79-83, 1961.

ll. THE GOODENOUGH DRAW-A-MAN TEST

BACKGROUND AND DEVELOPMENT

A comprehensive historical survey of the study of children's drawings appeared recently in an important new book by Dale B. Harris (522), a former colleague of Florence Goodenough and apparent successor to her in the leadership role in the measurement of children's intelligence by point scales based on drawings of the human figure. The present review does not duplicate Harris' scholarly survey, but focuses more specifically on the problems of the Goodenough Test as used in the Health Examination Survey.

The first formal intelligence test based on the analysis of children's drawings was published by Florence Goodenough (595) in 1926, but the literature on this subject goes back at least to 1885 (595, ch. I). Some of the early papers are summarized in this study, but the major emphasis has been placed on recent critical research on the Draw-A-Man Test and its variants. Nevertheless, it is of interest that in 1893 Herrick (501) demonstrated the developmental significance of profile drawings and that in the same year Barnes (502) recognized that drawings are used by young children as a means of expressing their ideas. Meanwhile, Lukens (503), in 1896, outlined many details of human figure drawings which were later incorporated in the point-scoring systems of Goodenough (595) and of Harris (522).

The Goodenough Test is referred to in this discussion as the Draw-A-Man Test although the specific instructions in Cycle II of the Survey are to "make a picture of a person." However, the instructions go on to state that "when a bust picture has been drawn intentionally, the child is given another sheet of paper with the instruction 'Now make a picture of a whole person.'" Only one picture is used.
Rationale

In this procedure emphasis is placed on the representation of details in the drawing to measure conceptual maturity. Drawing technique is minimized, and distortions potentially usable as cues for personality evaluation are not scored. Recent drawing tests focused on personality study have used two or more drawings. For example, Machover (596) instructs the subject to "draw a person" and then to draw a person of the sex opposite to the one previously drawn, while Buck (594) uses drawings of a house, a tree, and a person. In general, the cues and signs interpreted in personality study of drawings are different from those employed for the measurement of intelligence.

Point-Scoring System

The point system developed by Goodenough (595) for drawings which can be recognized as attempts to represent the human figure, no matter how crude, involves the presence or absence of 51 detailed points, which are listed as follows:

1-4a   Head, legs, arms, trunk present
4b     Length of trunk greater than breadth
4c     Shoulders definitely indicated
5a     Attachment of arms and legs
5b     Legs attached to trunk; arms attached to trunk at correct point
6a     Neck present
6b     Outline of neck continuous with that of the head, of trunk, or both
7a-c   Eyes, nose, mouth present
7d     Both nose and mouth shown in two dimensions; two lips shown
7e     Nostrils shown
8a     Hair shown
8b     Hair on more than circumference of head; nontransparent
9a     Clothing present
9b     At least two clothing items nontransparent
9c     Entire drawing free from transparencies of any sort; sleeves and trousers shown
9d     At least four clothing items definitely indicated
9e     Costume complete without incongruities
10a    Fingers present
10b    Correct number of fingers shown
10c    Detail of fingers correct
10d    Opposition of thumb shown
10e    Hand shown as distinct from fingers or arm
11a    Arm joint shown (elbow, shoulder, or both)
11b    Leg joint shown (knee, hip, or both)
12a-e  Proportion: head, arms, legs, feet, two dimensions
13     Heel shown
14a-f  Motor coordination
       a  Lines reasonably firm and joining usually accurate
       b  Increased firmness of lines and increased accuracy of line junctions
       c  Head outline free from unintentional irregularity
       d  Trunk outline free from unintentional irregularity
       e  Arms and legs without irregularities, narrowing at point of body junction
       f  Features symmetrical
15a    Ears present
15b    Ears in correct position and proportion
16a-d  Eye detail: brow, lashes, or both shown; pupil shown; proportion; glance
17a    Both chin and forehead shown
17b    Projection of chin shown; chin clearly differentiated from lower lip
18a-b  Profile drawings

Standardization

In Goodenough's original research, point scores based on these items were equated to age norms from which intelligence quotients could be computed in the same manner as in the Stanford-Binet test. Data on reliability and validity were reported in the 1926 book (595) and also in a monograph (504) published the same year. Split-half and retest reliabilities were computed on a basic standardization sample of 5,627 school children from kindergarten to the sixth grade, aged 4 to 12 years. A split-half reliability of 0.77 (corrected) was found to be constant from 5 to 10 years of age, and a retest reliability coefficient of 0.94 was reported for 194 first-grade children. Correlations with Stanford-Binet were 0.76 for mental ages and 0.74 for intelligence quotients.

The experimental work, analysis, and reporting which characterized this undertaking would be regarded as impressive today, and the critical reader of Goodenough's book can well appreciate Lewis M. Terman's description of it (in the foreword) as "a notable accomplishment."

Perspective

In 1950, a quarter of a century after the publication of her book, Goodenough collaborated with Dale Harris in a review (510) of the extensive literature generated by her test.
This review was critical of many studies of graphic expression that lacked quantification, but it acknowledged the value of drawings used projectively as a source of diagnostic cues. Goodenough and Harris made special note of some writers' attempts to interpret discrepancies between the Draw-A-Man Test and the Stanford-Binet (in which Draw-A-Man IQ's are markedly lower) as possible diagnostic cues of emotional or nervous instability or of brain damage. They also cautioned about the use of the Draw-A-Man Test in cross-cultural comparisons, pointing out that the Draw-A-Man is not a culture-free test, as many users have incorrectly assumed. This point is most dramatically illustrated by the Near Eastern study of Dennis (555).

In the Fourth Mental Measurement Yearbook, 1953, Stewart (514), while presenting a very favorable evaluation, suggested that the Goodenough norms might require revision due to social changes which have occurred since the original standardization. Such a revision was apparently justified, and the new Goodenough-Harris Drawing Test (552), published in 1963, fills an important need. This modified procedure consists of three drawings: a man, a woman, and "yourself." Separate point scales are provided for drawings of men and drawings of women; separate norms are also provided for drawings made by boys (men) and drawings made by girls (women).

An empirical study on a sample of 195 drawings taken from the Health Examination Survey population, in which the Harris scoring and norms were compared with the original Goodenough scoring and norms, is reported below. This study supports a recommendation that the Harris revision be adopted for scoring the Goodenough test in this Survey.
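
The intelligence quotient referred to under Standardization above follows the ratio convention of the early Stanford-Binet: mental age divided by chronological age, multiplied by 100. The sketch below shows only that final step. The mental age is taken as given, since the table equating point scores to age norms is not reproduced in this review, and the function name and the example ages are illustrative assumptions.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: 100 times mental age over chronological age (both in months)."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child: mental age 8 years 6 months, chronological age 8 years 0 months.
iq = ratio_iq(8 * 12 + 6, 8 * 12)
print(iq)  # 106.25
```

Because the quotient depends on the age norms used to assign the mental age, any revision of those norms, such as the one Stewart suggested, shifts every IQ derived from the same drawing.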
EVALUATION OF INTELLIGENCE BY HUMAN FIGURE DRAWINGS

Effective Range

Barnes' (502) early observation that children draw candidly up to about 14 years of age and then more abstractly is supported by Barnhart (507), who described three types of drawings: schematic (graphic representation), predominating in the age range 5 to 9 years; mixed, in the range 8 to 13 years; and visual realistic (abstracted, esthetic, nonspecific as to factual details), principally in the range 10 to 16 years. This apparently explains why the point scores cannot be validly extended above 14 years of age (522).

The increase in point scores with age, up to 14 years of age, apparently reflects mental maturity and not chronological age. This was noted by Smith (506) and by McElwee (524), who reported a correlation of 0.72 between the Draw-A-Man and the Stanford-Binet mental ages for a sample of 45 subnormal 14-year-old children. Israelite (562) found a correlation of 0.71 between the Draw-A-Man and the Stanford-Binet for 256 mental defectives. Others have also successfully tested mentally defective adults with the Draw-A-Man Test.

Relation to Artistic Ability

An area of special interest in the interpretation of children's drawings has been the relation of drawing "maturity," as reflected in point score, and artistic ability. Goodenough acknowledged that drawings could be influenced by special coaching (as can most human responses) but that ordinary art instruction in school has little effect on the Draw-A-Man score. She reported a correlation of 0.44 between the Draw-A-Man and teacher ratings of drawing ability (504).

Perturbing Factors

Intelligence scores based on drawings are relatively independent of artistic ability. However, there is evidence that both internal factors, such as health, emotions, and attitudes, and external environmental factors affect the drawing content.
In the present review, studies have been found which demonstrate the influence on drawings of factors such as height and weight (543), sex and body image (512, 537-539, and 541), physical handicaps (571 and 572), mental age (521), affective states experienced and experimentally induced (529, 530, and 532), institutionalization (540), teacher attitude (533), sociometric popularity (534), social acceptance (531), and social class (536).

Although size of drawings appears to increase with mental age over the effective range of the Draw-A-Man, size standards have not been incorporated in any of the published point scores. In general, the studies referred to in the preceding paragraph may be viewed as minor perturbing influences within a homogeneous cultural framework. Variability among drawings attributable to perturbing factors of the types enumerated within the social boundaries of the American culture appears to have significance for the study of personality and social behavior, but it does not appear to influence measures of intelligence derived from children's drawings in the age range 5 to 12 years.

Culture

The factors which influence children's drawings of the human figure most are those that reflect the effects of a culture's customs and values, since these determine the way in which children are exposed to different representations of the human figure in dress, art, photographs, religious practices, and sex roles and attitudes. Hunkin (554) found the Goodenough norms inapplicable to Bantu school children, and Dennis (555) attributed the steady decline in mean Draw-A-Man IQ from 5 to 10 years of age (among Egyptian and Lebanese children in the Near East) to the Arab culture, which restricts access to representations of the human figure.
Studies of the Draw-A-Man with children of various American Indian tribes on reservations (558-560) have produced varying results which may perhaps be understood only in the context of their respective culture patterns.

On the other hand, Anastasi and DeJesus (556) found sex differences in agreement with Harris, discussed below, but found no ethnic differences in a comparison of Draw-A-Man scores of 50 Puerto Rican children of low socioeconomic class in New York City with those of Negro and white children of similar status which were reported by other investigators. Similarly, Levinson (243) found that the Draw-A-Man, as well as WISC Block Design, is culturally "fair" for native-born Jewish bilingual children in New York City.

The importance of taking into account cultural variations when dealing with a heterogeneous population such as that sampled by the Health Examination Survey is illustrated by the following quotations from Harris (522, pp. 131 and 132). These quotations have been excerpted to illustrate how the customary dress of Eskimo children affects point scores on drawings of the human figure.

Eskimo children are less likely to depict the neck, the ears, and to correctly place the ears. These facts seem to reflect the greater prevalence of parkas in the Eskimo group's drawings and [this] is thus an artifact of the drawing situation. Due to the voluminous parka garments, elbow joints, knee joints and modeling of the hips are less likely [to be] shown, resulting in greater stiffness of figures portrayed.

Since the Eskimo boot does not have a heel, Eskimo children are less likely to indicate heels in their drawings. [Several instances], however, show that when the garb is appropriate, the heel is shown. The children do have the concept of heels; their drawings are quite appropriate to the type of figure they are representing at the time.
Eskimo children are also less likely to portray the arm and shoulder performing some type of movement, probably due to the loose parka, though this is not invariably the case.

On the other hand, Eskimo children are more likely to portray with exactness the nostrils, the bridge of the nose, and, when portrayed at all, the thumb or fingers. The characteristic tendency of the Eskimo children to show a mittened hand earns for them a greater credit on the thumb opposition point and on the hand as distinct from fingers or arm in the age group ten to thirteen inclusive. In this age group also the Eskimo is more likely to draw the arms down at the side than held out stiffly from the body. The Eskimo child is more likely to show the feet with a wide stance, that is, with toes pointing apart, or in perspective in either full-face or profile drawings. The Eskimo drawings include fewer transparencies in these age groups, and a larger percentage of them earn credit for showing a distinct costume, which of course follows from the tendency to draw the parka, the everyday costume in this part of Alaska.

Aspects of the Eskimo drawings that are distinctive and that are not apparent in the detailed scoring technique of the Goodenough method include: a greater emphasis on the eyebrow, on the nostrils and nose (as indicated above), and on general detail of facial features. There is some evidence of a general decrease in quality of the drawing in adolescence. This is not sufficiently great, however, to reveal itself markedly in the trend of median scores as in the normative group. It is most noticeable in the increased tendency to draw the facial features and hands "sketchily." Particularly among young Eskimo children there is a very distinct tendency to draw shorter arms and legs than in the norm group. Here again there is the possibility that the proportions of the body are distorted somewhat by so many children depicting the figures in parkas.
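
The item-level observations quoted above amount to comparing, point by point, the proportion of drawings that earn credit in two groups. A minimal sketch of such a comparison follows. The item names echo the quotation, but every scored drawing below is invented for illustration, the function names are hypothetical, and none of the numbers come from Harris's data.

```python
def pass_rates(drawings, items):
    """Proportion of drawings credited with each item.

    drawings : list of sets, each holding the items credited to one drawing
    items    : item names to tabulate
    """
    if not drawings:
        raise ValueError("at least one drawing is required")
    n = len(drawings)
    return {item: sum(item in d for d in drawings) / n for item in items}


def rate_differences(group_a, group_b, items):
    """Pass rate in group_a minus pass rate in group_b, item by item."""
    rates_a = pass_rates(group_a, items)
    rates_b = pass_rates(group_b, items)
    return {item: rates_a[item] - rates_b[item] for item in items}


ITEMS = ["neck", "heel", "thumb_opposition"]

# Invented records: each set lists the points credited to one drawing.
norm_group = [
    {"neck", "heel"},
    {"neck"},
    {"neck", "heel", "thumb_opposition"},
    {"heel"},
]
comparison_group = [
    {"thumb_opposition"},
    {"neck", "thumb_opposition"},
    {"thumb_opposition"},
    set(),
]

diffs = rate_differences(comparison_group, norm_group, ITEMS)
for item in ITEMS:
    print(f"{item}: {diffs[item]:+.2f}")
```

Differences of opposite sign on different points, as in this constructed example, can leave the total point score nearly unchanged while the pattern of credited items differs sharply between groups.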
Cultural factors influence drawings in many obvious ways such as type of garb, vehicles, implements, and actions portrayed, but the nature of the influence on a Goodenough-type point score is subtle, as illustrated in the preceding quotations from Harris. Because such variations are often inconsequential within the mainstream of American culture, there has been a wide temptation to use the Draw-A-Man as a culture-free intelligence test. Nevertheless, as Harris properly insisted (522, p. 133), "the data . . . suggest that the child's drawing of certain body features or parts is influenced by garb, and possibly by other conditions of living that call attention to particular parts or their functions. Allowance would have to be made, both in scoring and in the norms, for parts omitted in one of these cultures included in the present scoring system. Such allowance would have to be worked out empirically within each culture group." (italics added)

Goodenough and Harris (510), in their 1950 review, affirmed that although the test may be unsuited to comparing children across cultures, it may still rank children within a culture according to relative intellectual maturity. In his 1963 publication (522, p. 133) Harris has further amended this position to state that "for the most valid results, the points of the scale should be restandardized for every group having a distinctly different pattern of dress, mode of living, and quality or level of academic education." In Harris' judgment, "This conclusion virtually rules out the scale for cross-cultural comparisons; indeed, psychologists increasingly believe that mean differences among large, representative samples drawn from varying cultures express the gross differences in conceptual experience and training these groups have had.
Further work, to determine exactly which aspects of intellectual or conceptual maturity the drawing task expresses, will be necessary to explain scientifically these observed cultural differences."

No systematic research such as Harris delineated with respect to Eskimo children has been done on the detailed effects of microvariations within the American culture. Yet there is little reason to doubt that subtle differences between urban and rural, industrial and suburban, warm climate and cold, eastern and western, and other prominent contrasting situations within the continental United States (to say nothing of Alaska and Hawaii) might produce some significant variations. Undoubtedly, some of these subcultural variations reflect ethnic factors, such as the superstitious reluctance of some southwestern children of Mexican origin to draw eyes because of fear of the "evil eye."

It is also possible that secular trends, which are revealed in the comparison of the 1926 and 1963 norms, may be occurring at differential rates in different localities and segments of the culture and that these also may subtly affect point scores. For example, the high-fashion announcements of transparent garments for females not only aroused different reactions among different segments of the population but also received widely varying prominence in different localities. Although this is an extreme example, it is nevertheless possible that some children might draw the female figure appropriately reflecting a sophisticated transparent garment and be penalized on the point score for what could be considered a "bright" response.

Sex Differences

Both Goodenough (504) and Harris (522) have reported qualitative and quantitative differences in drawings which are related to the sex of the person doing the drawing. Harris' more recent work is of greater relevance.
He believes that these sex differences cannot be attributed to differential selection of boys and girls according to intellect. Harris' recent data show that sex differences in total point scores appear at an early age and are considerably greater than those reported by Goodenough. Harris found that for the drawing of a man, the mean score difference favors girls by about one-half year of growth at each year of age, while for the drawing of a woman, this difference is roughly equal to a full year of growth. The Harris point scale, applied differentially to Man and Woman drawings by boys and by girls, appears to reduce mean differences.

Sex differences in drawing point scores reflect differences in maturation, cultural factors (including sex role and awareness), and perhaps some degree of difference in drawing proficiency. However, it is believed that these will be minimized by the adoption of the Harris norms and scoring system and that the remaining residual error probably will be inconsequential. Without doubt, the error will be smaller than that which would result from the blanket use of one uniform scoring system for the entire population.

PERSONALITY STUDY BY CHILDREN'S DRAWINGS

Although personality evaluation is not the primary reason for including the Draw-A-Man Test in the Survey, a review of the potentialities for such analysis is relevant. Since this topic has been covered more extensively by Harris in his recent publication than in this review, the following discussion is organized in relation to Harris' summary.

Below are eight widely accepted but not necessarily established generalizations concerning personality measurement by children's drawings. These were evaluated by Harris in his recent book (522, p. 52). As will be noted, several of the generalizations are rejected.

1. Drawing interpretation is more valid when based on a series of a subject's protocols than when based on one drawing.
Despite the lack of clear-cut empirical evidence on this issue, Harris regards additional pictures as having the effect of increasing the length, and therefore the reliability, of the test. From this logical viewpoint, he considers it justified.

2. Drawings are most useful for psychological analysis when teamed with other available information about the child.

This, too, is a logically sound principle, "especially when it is the content of drawings alone that is being used for psychological interpretation."

3. Free drawings are more meaningful psychologically than drawings of assigned topics.

This is probably true for certain purposes, such as exploration of interests, but systematic comparison of individuals, as in a national survey, requires control of the task.

4. When a human figure drawing is assigned, the sex of the figure first drawn relates to the image the drawer holds of his own sex role.

Of the studies summarized in Appendix III, those most relevant to the study of children ages 6 to 12 years are as follows: 512, 537-539, 541, and 542. According to Brown and Tolor (541), normal individuals of both sexes tend to draw their own sex first, while persons with behavior disorders draw the opposite sex first. Harris agrees that most children of either sex will draw their own sex first when asked to "draw a person." He further elaborates that as girls grow older there is an increasing tendency for them to draw a male figure. This, he feels, reflects both the cultural preference given to the male role and an increasing dissatisfaction with the female role. Harris also hypothesizes that the male figure is more culturally stereotyped and easier to draw than is the female figure. He considers deviates from this norm to be psychologically different from nondeviates. He also feels that the deviation has different meanings for the two sexes and has unique, idiosyncratic meanings to individuals.
Since many deviations from the norm occur and since the meaning of such deviations is as yet unknown, it is unlikely that the principle (the figure drawn first relates to the image the drawer holds of his own sex role) is universally valid. Therefore, even though about 86 percent of boys and 65 percent of girls have been reported to draw their own sex first, it is not possible to formulate any reliable interpretation for those who do not.

5. A child adopts a schema or style of drawing which is peculiar to him and which becomes highly significant psychologically.

Most of the evidence is opposed to this and suggests rather that developmental patterns do exist among children's drawings.

6. The manner in which certain elements are portrayed in drawings may be used as signs of certain psychological states or conditions in the artist.

In agreement with Harris, the present writer regards this statement as one of the eternal, unfulfilled wishful myths of the "depth psychologist." Two particular statements by Harris are relevant to possible further research in this frustrating area. First, "whether or not 'signs' are selected by an empirical or deductive procedure, there is still the question whether form or content will provide the cues. Size, quality or texture of line, degree of angularity, pattern or shape, and placement on the page are often thought to be highly significant avenues for 'projecting' unconscious motives or needs." References 512, 521, 537, 540, 543, 564, and 566 support this view, but neither form nor content signs of unequivocal value have thus far been validated. Thus, Harris' second statement, that "useful and valid signs leading to dependable conclusions are, for the most part, still to be ascertained," disposes of this generalization.

7. Drawings must be interpreted as wholes rather than segmentally or analytically.
This, too, has been a strong sentimental favorite, but the evidence is mostly the other way, particularly in personality assessment. In fact, the history of psychometric progress has been away from global analysis toward specific analysis, has favored linear over curvilinear relations, and generally has demonstrated that quantitative procedures are more valid, even if less spectacular, than those based on scorer judgment.

Harris has cited analytic studies of component qualities of children's drawings, by Martin and Damrin and by Stewart (522, p. 56), which suggest that "drawings are actually appraised in terms of a few general dimensions, although they may be rated on a number of specifically defined elements or qualities." Harris believes that these studies lend credence to the belief that broad, dimensional evaluations (rather than highly particularistic ones), based on such analytic results, may be made more readily and more reliably. He also believes that they suggest the direction these quantitatively and factorially defined "global" ratings may take. "Their findings in relation to personality qualities, however, are not of such magnitude as to support the use of drawings in diagnosing individual cases."

8. The use of color in drawings can be significant for studying personality.

This is another popular clinical belief, on which the empirical evidence is equivocal.

RESEARCH ON THE GOODENOUGH TEST

Reliability Studies

Table 6 summarizes the reliability coefficients reported for the Draw-A-Man Test in the studies included in this review (523-528). In general, the reliabilities obtained by independent investigators have confirmed those reported by Goodenough. The reliability of the point scale holds up in the mentally retarded range (523 and 524), and scorer agreement is high (526).

One problem observed in interscorer comparisons by the reviewer which is mentioned in connection with the Goodenough vs.
the Goodenough-Harris comparison is that while the results of two scorers may show a very high correlation, there may nevertheless be a constant difference in score levels between them, reflecting individual idiosyncrasies of their interpretations. The safest method of coping with such constant errors, in a survey in which a number of scorers may be used for different segments of the total sample, would be to have at least two people score every test and to use the average of the two for record.

Correlations With Other Tests

Correlations of the Draw-A-Man with the Stanford-Binet are summarized in table 7, and its correlations with other tests, in table 8. Similar tables appear in Harris (522, pp. 96 and 97). With few exceptions, correlations of the Draw-A-Man with the Stanford-Binet (in which coefficients are based on IQ's) reported by other investigators have averaged lower than those reported by Goodenough in 1926 (504). The exceptions found are Williams (505), Israelite (562), White (565), and Ellis (unpublished master's colloquium paper, University of Minnesota, 1953), whose data agree substantially with those of Goodenough.

Unfortunately, most of the publications cited which involve correlations of the Draw-A-Man with the Stanford-Binet and a number of other tests are based on very small samples (rarely more than 100), are usually not representative of their respective subuniverses, and do not always present assurance of testing under standard conditions. As a result, the collection of correlation coefficients can only be interpreted very generally.

These results indicate a considerable association between the Draw-A-Man Test and general intelligence tests, such as the Stanford-Binet and the WISC, which measure mental maturity. The common variance is probably about 50 percent. Maturationally, the original rationale presented by Goodenough (that drawing point scores largely reflect the ability to form concepts) is supported by the network of correlations compiled from a variety of tests and by studies such as that of McHugh (549), which analyzed Draw-A-Man items. McHugh computed biserial correlations of Goodenough items with the Stanford-Binet and reported positive correlations for 29 items; the remainder were zero or slightly negative. The highest correlations, which support the conceptual interpretation stated, were the following:

Item | Correlation
2 (legs present) | 0.48
7a (eyes present) | 0.47
9a (clothing present) | 0.40
11b (leg joint shown) | 0.35
12e (proportion, two dimensions) | 0.54
13 (heel shown) | 0.35

Table 6. Studies reporting reliability coefficients of human figure drawing tests

Investigator | Year | Test | Subjects [a] | Age range | Number: Σ | M | F | Type of coefficient | Reliability
Yepsen (523) | 1929 | Goodenough | Feebleminded | 9.0 - 18.2 | 37 | 37 | - | Test-retest |
 | | | | | | | | Administration 1-2 | 0.89
 | | | | | | | | Administration 2-3 | 0.91
 | | | | | | | | Administration 1-3 | 0.91
Brill (525) | 1935 | Goodenough | Feebleminded | N.R. | N.R. | --- | --- | Test-retest |
 | | | | | 71 | 73 | - | Administration 1-2 | 0.77
 | | | | | 65 | 65 | - | Administration 2-3 | 0.80
 | | | | | 67 | 67 | - | Administration 1-3 | 0.68
Albee and Hamlin (579) | 1949 | Human Figure Drawing, Paired Comparisons | VA Mental Hygiene Clinic. Range: normals to psychotics | N.R. | N.R. | --- | --- | Interjudge | 0.95
 | | | | | | | | [illegible] | 0.98
Albee and Hamlin (581) | 1950 | Machover | Neurotic, schizophrenic, normal | N.R. | 72 | --- | --- | Interjudge | 0.89
Hinrichs (586) | 1935 | Goodenough | Normals | 10-18 years | 81 | --- | --- | Split-half, Spearman-Brown | 0.88-0.90
Herron (532) | 1957 | Goodenough | Normals, Grades 3 and 4 | 113 months (mean) | 16 | 16 | - | Test-retest, group A [b] |
 | | | | | | | | Administration 1-2 | 0.52
 | | | | | | | | Administration 2-3 | 0.51
 | | | | | | | | Administration 1-3 | 0.27
 | | | | | 28 | - | 28 | Test-retest, group A [b] |
 | | | | | | | | Administration 1-2 | 0.79
 | | | | | | | | Administration 2-3 | 0.69
 | | | | | | | | Administration 1-3 | 0.85
 | | | | | 24 | 24 | - | Test-retest, group B [b] |
 | | | | | | | | Administration 1-2 | 0.92
 | | | | | | | | Administration 2-3 | 0.40
 | | | | | | | | Administration 1-3 | 0.86
 | | | | | 15 | - | 15 | Test-retest, group B [b] |
 | | | | | | | | Administration 1-2 | 0.85
 | | | | | | | | Administration 2-3 | 0.73
 | | | | | | | | Administration 1-3 | 0.63
McCurdy (527) | 1947 | Goodenough | Normals | 83.2 months (mean) | 59 | 59 | - | Test-retest | 0.69
Buhrer, de Navarro, and Velasco (511) | 1951 | Goodenough | Normals, Spanish-speaking | 7-14 years | 1,936 | --- | --- | N.R. | 0.97
Frankiel (518) | 1957 | Goodenough and Frankiel | Normals | | 200 | 100 | 100 | |
 | | | | 7 years | 100 | 50 | 50 | Intrajudge | 0.83
 | | | | 7 years | 100 | 50 | 50 | Interjudge | 0.71-0.84
 | | | | 12 years | 100 | 50 | 50 | Intrajudge | 0.89
 | | | | 12 years | 100 | 50 | 50 | Interjudge | 0.81-0.86
McHugh (508) | 1945 | Goodenough | Normals, preschool | 62.0 months (mean) | 83 | --- | --- | Test-retest | 0.46 (IQ), 0.51 (MA)
Goodenough (504) | 1926 | Goodenough | Normals | 4-12 years | 5,627 | --- | --- | Split-half, Spearman-Brown | 0.77
 | | | | | | | | Test-retest, Grade 1 only | 0.94
Williams (505) | 1935 | Goodenough | Normals | 3-15 years | 100 | 50 | 50 | Interrater | 0.80-0.96
Smith (506) | 1937 | Goodenough | Normals | | 1000 | --- | --- | [illegible] |
 | | | | 6 years | 100 | --- | --- | | 0.91
 | | | | 7 years | 100 | --- | --- | | 0.91
 | | | | 8 years | 100 | --- | --- | | 0.95
 | | | | 9 years | 100 | --- | --- | | 0.96
 | | | | 10 years | 100 | --- | --- | | 0.93
 | | | | 11 years | 100 | --- | --- | | 0.95
 | | | | 12 years | 100 | --- | --- | | 0.92
 | | | | 13 years | 100 | --- | --- | | 0.92
 | | | | 14 years | 100 | --- | --- | | 0.94
 | | | | 15-16 years | 100 | --- | --- | | 0.84
McCarthy (526) | 1944 | Goodenough | Normals, Grades 3 and 4 | N.R. | [illegible] | --- | --- | Intrascorer | 0.94
 | | | | | | | | Interscorer | 0.90
 | | | | | | | | Test-retest | 0.68
 | | | | | | | | Odd-even, Spearman-Brown | 0.89
McHugh (529) | 1952 | Goodenough | Normals, Grade 3 | N.R. | 118 | [illegible] | 60 | Intrajudge | 0.98
 | | | | | | | | Interjudge | 0.97
Stone (582) | 1952 | Machover | Normals, Grade 6 | N.R. | [illegible] | --- | --- | Split-half |
 | | | | | | | | First drawing | 0.82
 | | | | | | | | Second drawing | 0.76
 | | | | | | | | Test-retest |
 | | | | | | | | Drawings 1 and 2, males | 0.56
 | | | | | | | | Drawings 1 and 2, females | 0.39
 | | | | | | | | Drawings 1 and 2, total | 0.50

[a] Designations of subjects are always white Americans unless otherwise specified.
[b] Indicates conditions preceding Draw-A-Man testing:
Group: A | B
Satisfying activity | Frustrating activity
Initial test
Second test
Satisfying activity | Frustrating activity
Third test
Frustrating activity | Satisfying activity

NOTES: Unless otherwise indicated, it is assumed that reliability coefficients were Pearson Product-Moment and were computed from raw scores. Σ = total population; M = male; F = female; N.R. = not reported; IQ = intelligence quotient; MA = mental age.

Table 7. Studies reporting correlations between the Goodenough and Stanford-Binet

Investigator | Year | Subjects [a] | Age range | Number: Σ | M | F | Correlation: IQ | MA
McElwee (524) | 1932 | Retarded | 14 years | 45 | --- | --- | N.R. | 0.72
Rohrs and Haworth (569) | 1962 | Retarded | | 46 | 23 | 23 | 0.28 (Form L-M) | N.R.
 | | Familial | 12.57 years (mean) | 20 | 10 | 10 | N.R. | N.R.
 | | Organic | 9.2 years (mean) | 26 | 13 | 13 | N.R. | N.R.
Birch (550) | 1949 | Retarded | 10-6 - 16-3 | 68 | 43 | 25 | 0.62 | 0.69
Israelite (562) | [illegible] | Feebleminded | 6-3 - 40 years | 256 | 162 | 94 | N.R. | 0.71
Johnson, Ellerd, and Lahey (592) | 1950 | State hospital population | 6-9 - 17 years | 209 | --- | --- | 0.48 | N.R.
White (565) | 1945 | | | 141 | --- | --- | |
 | | Feebleminded | 8-[illegible] - 19-4 | 47 | --- | --- | 0.63 | N.R.
 | | Epileptic | 8-0 - 19.4 | 47 | --- | --- | 0.52 | N.R.
 | | Normal | 4-8 - 10-6 | 47 | --- | --- | 0.71 | N.R.
Havighurst and Janke (544) | 1944 | Normals | 10 years | 114 | --- | --- | 0.50 | N.R.
Fowler (531) | 1953 | Normals | 9-2 - 12-1 | 41 | 19 | 22 | 0.41 | N.R.
Lessing (551) | 1961 | Normals | 8-9 years | [illegible] | 21 | 20 | 0.5[illegible] | N.R.
McHugh (549) | 1945 | Normals | 64 months (mean) | 90 | 43 | 47 | 0.41 | 0.45
Thompson and Finley (552) | 1963 | Guidance clinic referrals | 5-9 years | 164 | 81 | 83 | 0.67 (Form L-M) | [illegible]
Goodenough (504) | 1926 | Normals | 4-12 years | 5627 | --- | --- | 0.74 | 0.76
Williams (505) | 1935 | Normals | 3-15 years | 100 | 50 | 50 | 0.65 | 0.80

[a] Designations of subjects are always white Americans unless otherwise specified.

NOTES: Unless otherwise indicated all correlations are Pearson Product-Moment, with the Stanford-Binet, Form L. Σ = total population; M = male; F = female; IQ = intelligence quotient; MA = mental age; N.R. = not reported.

Table 8.
Studies reporting correlations between the Goodenough and other measures

Investigator | Year | Test or criterion variable | Subjects [a] | Age range | Number: Σ | M | F | Correlation
Havighurst, Gunther, and Pratt (558) | 1946 | Arthur Point Scale of Performance Tests (IQ) | American Indians | 6-11 years | 294 | --- | --- |
 | | | Zuni | | 42 | --- | --- | 0.10
 | | | Hopi | | 78 | --- | --- | 0.21
 | | | Navaho | | 47 | --- | --- | 0.23
 | | | Sioux | | 53 | --- | --- | 0.33
 | | | Papago | | 74 | --- | --- | 0.64
Albee and Hamlin (579) | 1949 | Clinical ratings of adjustments | VA Mental Hygiene Clinic. Range: normals to psychotics | N.R. | N.R. | --- | --- | 0.62 (rank order), 0.64 (product moment)
Havighurst and Janke (544) | 1944 | Cornell-Coxe Performance Ability Scale | Normals | 10 years | 114 | --- | --- | 0.63
Havighurst, Gunther, and Pratt (558) | 1946 | Cornell-Coxe Performance Ability Scale | Normals | 6-11 years | 66 | 28 | 38 | 0.63
Hinrichs (586) | 1935 | Furfey Revised Scale for Measuring Developmental Age in Boys | Delinquents | 9-18 years | 425 | --- | --- | 0.35
Johnson (557) | 1953 | Hoffman Bilingual Schedule | Spanish bilinguals (U.S.) | N.R. | 30 | --- | --- | 0.05
Boehncke (546) | 1938 | Leiter International Performance Scale | Normals | 5-12 years | 257 | --- | --- | 0.83
Ansbacher (553) | 1952 | MacQuarrie Test for Mechanical Ability | Normals | 10 years | 100 | --- | --- |
 | | Tracing | | | | | | 0.34
 | | Tapping | | | | | | 0.23
 | | Dotting | | | | | | 0.16
Brenner and Morse (517) | 1956 | Metropolitan Readiness Tests, Number Readiness (IQ) | Normals | 4-7 - 5-11 | 16 | 7 | 9 | 0.58 (rank order)
Havighurst and Janke (544) | 1944 | Revised Minnesota Paper Form Board Test, Form AR | Normals | 10 years | 110 | --- | --- | 0.48
Brenner and Morse (517) | 1956 | Monroe Visual subtest (IQ) | Normals | 4-7 - 5-11 | 16 | 7 | 9 | 0.64 (rank order)
Hornowski (547) | 1961 | Moray House Picture Intelligence Test | Normals (Scotland) | N.R. | N.R. | --- | --- | 0.34 (M), 0.49 (F)
Johnson (557) | 1953 | Otis Self-Administering Tests of Mental Ability | Spanish bilinguals (U.S.) | N.R. | 30 | --- | --- | -0.02
Brenner and Morse (517) | 1956 | Picture Judgment of Maturity (IQ) | Normals | 4-7 - 5-11 | 16 | 7 | 9 | 0.64 (rank order)
 | | Pintner-Cunningham Primary Mental Test (MA) | | | | | | 0.66 (rank order)
Shirley and Goodenough (575) | 1932 | Pintner Non-Language Primary Mental Test (IQ) | Deaf | 5+ years | 229 | --- | --- | 0.33
Norman and Midkiff (559) | 1955 | Progressive Matrices | Normals, American Indian | 6-6 - 15-6 | 96 | --- | --- | 0.24 (IQ), 0.35 (MA)
Harris (548) | 1959 | Progressive Matrices | Normals | 5-1 - 6-1 | 98 | 45 | 53 | 0.22
Johnson (557) | 1953 | Reaction time | Spanish bilinguals (U.S.) | N.R. | 30 | --- | --- | 0.43
Brenner and Morse (517) | 1956 | Sangren Information Mental Age | Normals | 4-7 - 5-11 | 16 | 7 | 9 | 0.67 (rank order)
Buhrer, de Navarro, and Velasco (511) | 1951 | School grades | Normals, Spanish-speaking | 7-14 years | 1,936 | --- | --- |
 | | Mathematics | | | | | | -0.04
 | | Language | | | | | | -0.10
 | | Language and Mathematics | | | | | | -0.01
 | | Drawing | | | | | | 0.27
Fowler (531) | 1953 | Social Distance Scale (Fowler) | Normals | 9-2 - 12-1 | 41 | 19 | 22 | 0.40
Shirley and Goodenough (575) | 1932 | Stanford Achievement, Education (quotient) | Deaf | 5+ years | 41 | --- | --- | 0.34
Ansbacher (553) | 1952 | SRA Primary Mental Abilities | Normals | 10 years | 100 | --- | --- |
 | | Word Vocabulary | | | | | | 0.23
 | | Picture Vocabulary | | | | | | 0.19
 | | Total Verbal Meaning | | | | | | 0.26
 | | [illegible] | | | | | | 0.38
 | | Word Grouping | | | | | | 0.28
 | | Figure Grouping | | | | | | 0.34
 | | Total Reasoning | | | | | | 0.40
 | | Perception | | | | | | 0.37
 | | Number | | | | | | 0.24
 | | Total Nonreading | | | | | | 0.45
 | | Total Score | | | | | | 0.41
 | | [illegible] | | | | | | 0.48
Harris (548) | 1959 | SRA Primary Mental Abilities | Normals | 5-1 - 6-1 | 98 | 45 | 53 |
 | | Verbal | | | | | | 0.50
 | | Perception | | | | | | 0.44
 | | Quantitative | | | | | | 0.54
 | | Motor | | | | | | 0.40
 | | Space | | | | | | 0.51
Brenner and Morse (517) | 1956 | Teacher rank of school readiness | Normals | 4-7 - 5-11 | 16 | 7 | 9 | 0.69 (rho)
Britton (536) | 1954 | Warner's Index of Status Characteristics | Normals | 9-11 years | 232 | 102 | 130 | 0.11
Hanvik (593) | 1953 | WISC Full Scale (IQ) | Psychiatric patients | 5-12 years | 25 | --- | --- | 0.18 (rank order)
Rohrs and Haworth (569) | 1962 | Wechsler Intelligence Scale for Children (IQ) | Retarded, familial and organic | N.R. | 46 | 23 | 23 |
 | | Verbal Scale | | | | | | 0.28
 | | Performance Scale | | | | | | 0.53
 | | Full Scale | | | | | | 0.46

[a] Designations of subjects are always white Americans unless otherwise specified.

NOTES: All correlation coefficients are Pearson Product-Moment unless otherwise specified. Σ = total population; M = male; F = female; IQ = intelligence quotient; N.R. = not reported; MA = mental age.

It is of interest that a careful survey of the literature spanning a period of over 40 years fails to disclose any definitive pattern of the particular components of mental maturity measured by the Goodenough test.
Harris believes that this may be attributed to the fact that such components are themselves not clearly differentiated in young children. The correlational results do, however, suggest strongly that the Draw-A-Man is more highly associated with factors measured by performance tests than with verbal abilities.

In the Health Examination Survey, correlations of the Draw-A-Man with WISC and, more particularly, with the short form composed of WISC Vocabulary and Block Design would be most relevant. Table 3 includes three reports (115, 130, and 224) which mention correlations between the Draw-A-Man Test and the Full Scale IQ of the WISC. Of these, none mentions correlations between the Draw-A-Man and the short form of the WISC. Harris' summary also cites the following unpublished data by Ellis.

Correlation with:
Age | Number | FS | VS | PS
8 years | 16 | 0.70 | 0.77 | 0.67
9 years | 34 | 0.67 | 0.63 | 0.59
10 years | 20 | 0.24 | 0.17 | 0.26
11 years | 17 | 0.50 | 0.45 | 0.46
12 years | 19 | 0.62 | 0.50 | 0.68
13 years | 17 | 0.13 | 0.05 | 0.15

Disregarding the 13-year-old group, since it is outside the effective range of the test as well as outside the age range of the Survey, Ellis' results for the total sample of 106 have an average correlation with the WISC Full Scale IQ of 0.57. Again, this is higher than the correlations reported by others.

In summary, it appears that the WISC correlations with the Draw-A-Man Test are substantial but lower than those of the Stanford-Binet. They are, however, higher with the Performance Scale than with the Verbal Scale (except in Ellis' two lowest grades).

In comparing Draw-A-Man scores with WISC Full Scale estimates, there is no reason to assume any systematic differences in mean levels across the entire population.
However, for statistical estimation as well as analytic purposes, it is most appropriate to compute the regression of Draw-A-Man on Voc., BD, and Total Score and then to work with differences between regressed and actual scores for discrepancy analysis, rather than with differences between scaled scores.

In view of the Draw-A-Man's sensitivity to cultural variations, cases in which there are large discrepancies between the Draw-A-Man and the WISC should be thoroughly evaluated in the light of the WRAT scores and other information from the Health Examination Survey. Although Harris' summary and the reports consulted in this review have suggested a number of promising diagnostic score patterns, none of them seem well enough established to be adopted.

THE HARRIS REVISION OF THE GOODENOUGH TEST

Dale Harris' 1963 publication (522), which he has named the Goodenough-Harris Drawing Test, is a thorough revision and extension of Goodenough's test. As already mentioned, it bases the lengthier point-score scales on both drawings of the male figure and drawings of the female figure, for which it provides separate norms for boys and for girls. A third picture, in which the child draws a representation of himself, has not been empirically standardized.

Standardization of the Harris revision was completed on a total sample of 2,965 children, representative of four major geographic areas of the country. The sample was also representative of the 1960 census distribution of fathers' occupations. Total point scores are converted to standard scores with a mean of 100 and a standard deviation of 15. Conceptually, these are equivalent to the WISC deviation IQ's. The new scales overlap extensively with the original point scales, and Harris found that children now earn substantially higher scores when the 1963 norms, rather than the 1926 ones, are utilized. The explanation for this phenomenon is not clear.
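The standard-score scale described above can be illustrated with a short sketch. It assumes the usual linear deviation-score transformation, in which a raw point score is expressed relative to the mean and standard deviation of the child's norm group and then rescaled to a mean of 100 and a standard deviation of 15. The function name and the norm-group values are hypothetical; they are not taken from Harris' published tables, which should be used for any actual scoring.

```python
def standard_score(raw_points, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Rescale a raw point score to a deviation-type standard score.

    norm_mean and norm_sd describe the reference (norm) group for the child's
    age and sex; the values used below are hypothetical illustrations only.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw_points - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return scale_mean + scale_sd * z


# Hypothetical norm group: mean of 30 points, standard deviation of 8 points.
at_mean = standard_score(30, 30, 8)       # a score at the norm-group mean
one_sd_above = standard_score(38, 30, 8)  # a score one SD above the norm-group mean
print(at_mean, one_sd_above)
```

Under this assumption the same raw point score yields a different standard score whenever the reference mean or standard deviation changes, which is one reason scores computed from two different sets of norms are not directly interchangeable.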
The new norms do appear to take into account technical and social changes which have occurred between 1926 and 1963. They also offer the advantages of greater length (hence, higher reliability) and more adequate provision for sex differences.

Comparison of Goodenough and Goodenough-Harris Scores

It seems desirable to inquire whether the Harris scales and norms could be used to score human figure drawings obtained in the Health Examination Survey. As noted above, in this Survey only one picture is drawn by each child, who is instructed, "Make a picture of a person. Make the very best person that you can." To use the Harris scales in the Survey it would be necessary for the scorer to decide whether each drawing was of a "Man" or of a "Woman."

A sample of 200 drawings, 100 drawn by boys and the other 100 drawn by girls, was taken at random from the Survey files. These drawings were then carefully scored using Harris' norms, and the scores obtained were compared with the scores the drawings had already received on the 1926 Goodenough scale. (Scoring by the 1926 method is completed in the field by Survey staff psychologists.) Of the 200 cases, 195 were usable. Three drawings were rejected because they contained a face only, and for two cases age had been inadvertently omitted, precluding the computation of standard scores. For the remaining drawings, neither scorer reported any difficulty in identifying the sex represented, and their agreement on this was perfect.

Table 9. Means of Goodenough-Harris and Goodenough variables and correlations between scorers and between methods for total sample and six subsamples

                                              Total     Woman     Man       Woman,    Woman,    Man,      Man,
Variable                                      group     (all)     (all)     by boys   by girls  by boys   by girls
                                              N=195     N=94      N=101     N=17      N=77      N=83      N=18

1. Goodenough-Harris point (A)                30.75     31.41     30.13     28.12     32.13     30.20     29.78
2. Goodenough-Harris SS (A)                   96.59     95.89     97.24     93.06     96.52     97.29     97.00
3. Goodenough-Harris point (B)                36.02     36.62     35.47     34.71     37.04     35.54     35.11
4. Goodenough-Harris SS (B)                   105.97    105.15    106.73    104.06    105.39    106.63    107.22
1+3. Average Goodenough-Harris point (A,B)    33.39     34.02     32.80     31.42     34.39     32.37     32.45
2+4. Average Goodenough-Harris SS (A,B)       101.28    100.52    101.99    98.56     100.96    101.96    102.11
5. Goodenough point                           26.38     25.57     27.14     24.29     25.86     27.20     26.83
6. Subject's CA                               115.01    111.89    117.92    118.35    110.47    118.10    117.11
7. Goodenough MA                              114.61    112.48    116.59    108.88    113.27    116.71    116.06
8. Goodenough IQ                              101.23    102.27    100.27    92.59     104.42    100.10    101.06
r13                                           0.90      0.89      0.91      0.82      0.91      0.90      0.95
r24                                           0.90      0.88      0.91      0.79      0.89      0.92      0.83
r28                                           0.78      0.76      0.81      0.60      0.78      0.87      0.47
r48                                           0.81      0.78      0.84      0.58      0.82      0.89      0.48

NOTE: N = number; A = scorer A; B = scorer B; SS = standard score; CA = chronological age; MA = mental age; r = correlation.

The usable sample of 195 cases consisted of 100 boys and 95 girls. Of these, 17 boys drew a Woman figure and 18 girls drew a Man figure. The remaining 82 percent of the total group (83 percent of the boys and 81 percent of the girls) drew their own sex. The following eight variables were recorded for all 195 cases:

1. Harris method, point score, scorer A
2. Harris method, standard score, scorer A
3. Harris method, point score, scorer B
4. Harris method, standard score, scorer B
5. Goodenough point score
6. Subject's chronological age in months
7. Goodenough mental age
8. Goodenough IQ

Means, standard deviations, and intercorrelations were computed for the total sample and for the following six subsamples: (1) Woman drawings (N=94), (2) Man drawings (N=101), (3) Woman drawings by boys (N=17), (4) Woman drawings by girls (N=77), (5) Man drawings by boys (N=83), and (6) Man drawings by girls (N=18).
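As an illustration only, the subsample computations described above (means, standard deviations, and the correlation between the two scorers) might be sketched as follows. The records are invented for the example and are not Survey data; the names RECORDS, select, summarize, and GROUPS are hypothetical, and the product-moment formula is assumed to be the correlation intended in the report.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical records, invented for illustration only (not Survey data):
# (child's sex, figure drawn, standard score by scorer A, standard score by scorer B)
RECORDS = [
    ("boy", "man", 95, 104), ("boy", "man", 102, 110), ("boy", "man", 88, 99),
    ("boy", "woman", 90, 101), ("boy", "woman", 97, 103), ("boy", "woman", 85, 98),
    ("girl", "woman", 99, 108), ("girl", "woman", 106, 112), ("girl", "woman", 93, 105),
    ("girl", "man", 100, 107), ("girl", "man", 91, 104), ("girl", "man", 104, 109),
]


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select(rows, sex=None, figure=None):
    """Keep rows matching the child's sex and/or the figure drawn."""
    return [r for r in rows
            if (sex is None or r[0] == sex) and (figure is None or r[1] == figure)]


def summarize(rows):
    """N, means, standard deviations, and inter-scorer correlation for one group."""
    a = [r[2] for r in rows]
    b = [r[3] for r in rows]
    return {
        "n": len(rows),
        "mean_a": mean(a), "sd_a": stdev(a),
        "mean_b": mean(b), "sd_b": stdev(b),
        # Mean of the two scorers' averaged scores (the "Average" rows of table 9).
        "mean_ab": mean((x + y) / 2 for x, y in zip(a, b)),
        "r_ab": pearson_r(a, b),
    }


# Total sample plus the six subsamples named in the text.
GROUPS = {
    "total": {},
    "woman drawings": {"figure": "woman"},
    "man drawings": {"figure": "man"},
    "woman drawings by boys": {"sex": "boy", "figure": "woman"},
    "woman drawings by girls": {"sex": "girl", "figure": "woman"},
    "man drawings by boys": {"sex": "boy", "figure": "man"},
    "man drawings by girls": {"sex": "girl", "figure": "man"},
}

SUMMARIES = {name: summarize(select(RECORDS, **kw)) for name, kw in GROUPS.items()}

for name, s in SUMMARIES.items():
    print(f"{name:24s} N={s['n']:2d}  mean A={s['mean_a']:6.2f}  "
          f"mean B={s['mean_b']:6.2f}  r={s['r_ab']:.2f}")
```

With real data, each record would also carry the Goodenough scores, so that the same summarize step could be extended to the between-method correlations.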
A summary of the most relevant results, for all seven sample combinations, appears in table 9.

The correlations between the two scorers (r13 and r24) are high despite a systematic tendency for scorer B's results to exceed those of scorer A (they average 5.25 above scorer A on point score and 9.38 higher on standard score). As a more stable estimate of the Harris scores for comparison with the Goodenough, average mean scores for the two scorers were computed. These appear in table 9 between variables 4 and 5.

Although agreement between the two scorers is generally high, the lowest correlations were found for the 17 boys who elected to draw a female figure (subsample 3). The standard score correlations for the 18 girls who elected to draw a male figure (subsample 6) are also comparatively low. These opposite-sex drawings also reflect the lowest correlations between Harris and Goodenough IQ's for both scorers (r28 and r48). Thus scorer agreement is lowest on opposite-sex drawings, and the results for these show the poorest agreement, correlation-wise, between the Goodenough-Harris and Goodenough IQ's. It is possible that these differences could be eliminated by further training of scorers. Certainly these results illustrate the importance of quality control of scoring. The averaging process is also highly recommended if systematic scorer differences cannot be eliminated.

The principal support, indicating an advantage of the Goodenough-Harris scale, appears in the comparison of mean scores for boys and girls on Woman and Man drawings as abstracted in table 10. In accordance with Harris' own findings, girls score higher than boys, but the differences are greater on the Goodenough scale than on the Goodenough-Harris scales and are greater on the Woman drawings than on the Man drawings. The greatest discrepancy and resulting scoring penalty by the Goodenough scale occurs in the case of the 17 percent of boys (subsample 3) who elected to draw a Woman.
At the same time, the 81 percent of girls (subsample 4) who elected to draw their own sex received disproportionately high scores on the Goodenough, in comparison with the mean levels on the Goodenough-Harris. The Goodenough-Harris scores are higher than the Goodenough for both sexes on the Man drawing.

The problems with the Woman drawing clearly support the observation, first pointed out by Goodenough and strongly reiterated by Harris, that the female figure is more culture-bound than the male, is less stereotyped, and is more susceptible to individual interpretation. Although the data on which the present analysis is based are limited, they do suggest that the Harris revision does less violence to the female figure than does the Goodenough scoring and that, in general, the Harris revision is more adequate for opposite-sex drawings.

These data, which indicate a superiority of girls over boys in drawing scores, a tendency for the Goodenough-Harris scores to be higher than the Goodenough scores, and a tendency for girls who draw male figures to be older than girls who draw their own sex (while no such differentiation occurs among boys), are all consistent with trends reported elsewhere in the literature. However, the most important argument in favor of using the Goodenough-Harris scoring system is that the variation of mean scores among the four subsamples is thereby greatly reduced around a mean of 100. This range is from 92.59 to 104.42 (11.83) on the Goodenough and from 98.56 to 102.11 (3.55) on the Goodenough-Harris. Although the standard deviations of the Goodenough-Harris and Goodenough scores were not shown in table 9, the relative variability of scores based on the two systems is indicated in table 11, which reports coefficients of variation (standard deviation divided by mean) for Goodenough-Harris standard scores and for Goodenough IQ's for each of the subsamples. It is apparent that in every case the relative variability is lower for the Harris scores.

Table 10. Comparison of Goodenough-Harris and Goodenough mean IQ's for boys and girls on same-sex and opposite-sex drawings

              Drawing of a woman                       Drawing of a man
Sex           Goodenough   Goodenough-  Difference    Goodenough   Goodenough-  Difference
              IQ           Harris IQ                  IQ           Harris IQ

Boys          92.59        98.56        +5.97         100.10       101.96       +1.86
Girls         104.42       100.96       -3.46         101.06       102.11       +1.05
Difference    11.83        2.50         ...           0.96         0.15         ...

Table 11. Coefficients of variation of Harris and Goodenough IQ's for total sample and six subsamples

                          Total     Woman     Man       Woman,    Woman,    Man,      Man,
                          group     (all)     (all)     by boys   by girls  by boys   by girls

Harris standard score     0.16      0.15      0.15      0.10      0.16      0.17      0.13
Goodenough IQ             0.19      0.18      0.19      0.14      0.18      0.21      0.18

Recommendation

On the basis of this analysis it is recommended that the following steps be adopted in relation to the Draw-A-Man Test in the Survey: (1) the Goodenough-Harris system should be used; (2) the entire sample should be scored centrally by uniform standards, with adequate training of scorers and quality control procedures routinely followed; and (3) if scorer variations cannot be eliminated by training, the procedure of averaging the results of two or more scorers should be adopted.

SUMMARY AND CONCLUSIONS

The foregoing review of the Draw-A-Man Test supports the view that it is a reliable and valid nonlanguage measure of mental maturity, although highly sensitive to cultural influences on the child's conceptual representation of the human figure. Its use in a national survey in the 6 to 12 age range, in conjunction with the WISC and WRAT, is logical and desirable, particularly as a means of assessing intellectual development in cases in which there is impairment of verbal development or verbal performance. Personality assessment by means of thematic and qualitative assessment of children's drawings would probably be unrewarding.
Some indications justifying further research have been noted; however, such research is not sufficiently promising to warrant the expenditure of Survey funds. On the other hand, several lines of empirical work appear worthwhile. These are enumerated below.

As discussed in the final portion of the review of the Draw-A-Man, there is strong evidence for the adoption of the Harris revision of the Draw-A-Man with central scoring by trained scorers, and averaging of scores of two or more scorers, if scorer variations cannot be eliminated in training. This procedure need not be regarded as expensive, since it could leave the field psychologists free to test more children while the scoring is done centrally by lower paid workers.

Although research on personality-assessment uses of the drawings within the Survey program is not recommended, the following lines of empirical study and analysis are regarded as useful and even important:

1. A systematic study of cultural variations related to the principal geographic areas in which Survey data were collected to evaluate the effects of factors such as customs, attitudes, dress, art, and social roles in relation to the items in the point scales by which the Draw-A-Man is scored. Even if the results of such an analytic study should be negative, they would be very reassuring in relation to the use of the Draw-A-Man scores in the Survey.

2. Regression studies of Draw-A-Man scores with other psychometric variables in the Survey so that comparisons can be made on the basis of differences between regressed and actual scores rather than directly between raw scores.

3. Further restandardization of the Goodenough-Harris norms on a national sample would be a valuable contribution to psychological measurement of children that could only reflect credit on the Survey and would be of major importance for future use of this well-established and useful intelligence test.
This significant undertaking, if approved, should include a complete item analysis as well as recomputation of norms.

Some additional suggestions regarding cross-disciplinary studies with reference to the Draw-A-Man Test are presented in a later section of this report.

BIBLIOGRAPHY

General References to Draw-A-Man

501. Herrick, M. A.: Children's drawings. Ped.Sem. 3:338-339, 1893.
502. Barnes, E.: A study of children's drawings. Ped.Sem. 2:455-463, 1893.
503. Lukens, H.: A study of children's drawings in the early years. Ped.Sem. 4:79-110, 1896.
504. Goodenough, F. L.: A new approach to the measurement of the intelligence of young children. J.Genet.Psychol. 33:185-211, 1926.
505. Williams, J. H.: Validity and reliability of the Goodenough intelligence test. Sch.&Soc. 41:653-656, 1935.
506. Smith, F. O.: What the Goodenough intelligence test measures. Psychological Bull. 34:760-761, 1937.
507. Barnhart, E. N.: Developmental stages in compositional construction in children's drawings. J.Exp.Educ. 11:156-184, 1942.
508. McHugh, G.: Changes in Goodenough IQ at the public school kindergarten level. J.Educ.Psychol. 36:17-30, 1945.
509. Lehner, G. F. J., and Silver, H.: Some relations between own age and ages assigned on the Draw-a-Person Test; abstracted, Am.Psychologist 3:341, 1948.
510. Goodenough, F. L., and Harris, D. B.: Studies in the psychology of children's drawings, II, 1928-1949. Psychological Bull. 47:369-433, 1950.
511. Buhrer, L., de Navarro, R., and Velasco, E. S.: Ensayo de tipificacion de la prueba mental "Dibujo de un hombre" de F. Goodenough. Publ.Inst.Biotipol.Exp.U.Cuyo, 2:113, 1951.
512. Weider, A., and Noller, P. A.: Objective studies of children's drawings of human figures, II, Sex, age, intelligence. J.Clin.Psychol. 9:20-23, 1953.
513. Tuska, S. A.: Developmental Concepts With the Draw-a-Person Test at Different Grade Levels. Unpublished master's thesis, Ohio University, 1953.
514. Stewart, N.: Review of Goodenough Draw-A-Man Test, in O. K. Buros, ed., The Fourth Mental Measurements Yearbook. Highland Park, N.J. The Gryphon Press, 1953.
515. Woods, W. A., and Cook, W. E.: Proficiency in drawing and placement of hands in drawings of the human figure. J.Consult.Psychol. 18:119-121, 1954.
516. Bliss, M., and Berger, A.: Measurement of mental age as indicated by the male figure drawings of the mentally subnormal using Goodenough and Machover instructions. Am.J.Ment.Deficiency 59:73-79, 1954.
517. Brenner, A., and Morse, N. C.: The measurement of children's readiness for school. Pap.Mich.Acad.Sci. 41:333-340, 1956.
518. Frankiel, R. V.: A Quality Scale for the Goodenough Draw-a-Man Test. Unpublished master's thesis, University of Minnesota, 1957.
519. Zuk, G. H.: Children's spontaneous object elaborations on a visual-motor test. J.Clin.Psychol. 16:280-283, 1960.
520. Stoltz, R. E., and Coltharp, F. C.: Clinical judgments and the Draw-A-Person Test. J.Consult.Psychol. 25:43-45, 1961.
521. Zuk, G. H.: Relation of mental age to size of figure on the Draw-A-Person Test. Percept.Mot.Skills 14:410, 1962.
522. Harris, D. B.: Children's Drawings as Measures of Intellectual Maturity. New York. Harcourt, Brace & World, Inc., 1963.

Goodenough: Reliability Studies

523. Yepsen, L. N.: The reliability of the Goodenough drawing test with feebleminded subjects. J.Educ.Psychol. 20:448-451, 1929.
524. McElwee, E. W.: The reliability of the Goodenough intelligence test used with sub-normal children fourteen years of age. J.Appl.Psychol. 16:217-218, 1932.
525. Brill, M.: The reliability of the Goodenough Draw-a-Man Test and the validity and reliability of an abbreviated scoring method. J.Educ.Psychol. 26:701-708, 1935.
526. McCarthy, D.: A study of the reliability of the Goodenough Drawing Test of Intelligence. J.Psychol. 18:201-206, 1944.
527. McCurdy, H. G.: Group and individual variability on the Goodenough Draw-A-Man Test. J.Educ.Psychol. 38:428-436, 1947.
528. Harris, D. B.: Intra-individual vs. inter-individual consistency in children's drawings of a man; abstracted, Am.Psychologist 5:293, 1950.

Goodenough: Factors Affecting Drawing Productions

529. McHugh, A. F.: The Effect of Preceding Affective States on the Goodenough Draw-A-Man Test of Intelligence. Unpublished master's thesis, Fordham University, 1952.
530. Reichenberg-Hackett, W.: Changes in Goodenough Drawings after a gratifying experience. Am.J.Orthopsychiat. 23:501-517, 1953.
531. Fowler, R. D.: The Relationship of Social Acceptance to Discrepancies Between the IQ Scores on the Stanford-Binet Intelligence Scale and the Goodenough Draw-a-Man Test. Unpublished master's thesis, University of Alabama, 1953.
532. Herron, W. G.: The Effect of Preceding Affective States on the Goodenough Drawing Test of Intelligence. Unpublished master's thesis, Fordham University, 1957.
533. Koppitz, E. M.: Teacher's attitude and children's performance on the Bender-Gestalt Test and Human Figure Drawings. J.Clin.Psychol. 16:204-208, 1960.
534. Richey, M. H., and Spotts, J. V.: The relationship of popularity to performance on the Goodenough Draw-A-Man Test. J.Consult.Psychol. 23:147-150, 1959.
535. Tolor, A.: Teachers' judgments of the popularity of children from their human figure drawings. J.Clin.Psychol. 11:158-162, 1955.
536. Britton, J. H.: Influence of social class upon performance on the Draw-A-Man Test. J.Educ.Psychol. 45:44-51, 1954.

Goodenough: Body Image, Sexual Identification

537. Weider, A., and Noller, P. A.: Objective studies of children's drawings of human figures. I, Sex awareness and socioeconomic level. J.Clin.Psychol. 6:319-325, 1950.
538. Knopf, I. J., and Richards, T. W.: The child's differentiation of sex as reflected in drawings of the human figure. J.Genet.Psychol. 81:99-112, 1952.
539. Swenson, C. H., and Newton, K. R.: The development of sexual differentiation on the Draw-a-Person Test. J.Clin.Psychol. 11:417-419, 1955.
540. Lakin, M.: Certain formal characteristics of human figure drawings by institutionalized aged and by normal children. J.Consult.Psychol. 20:471-474, 1956.
541. Brown, D. G., and Tolor, A.: Human figure drawings as indicators of sexual identification and inversion. Percept.Mot.Skills 7:199-211, 1957.
542. Fisher, G. M.: Sexual identification in mentally subnormal females. Am.J.Ment.Deficiency 66:266-269, 1961.
543. Silverstein, A. B., and Robinson, H. A.: The representation of physique in children's figure drawing. J.Consult.Psychol. 25:146-148, 1961.

Goodenough: Relation to Other Tests

544. Havighurst, R. J., and Janke, L. L.: Relations between ability and social status in a midwestern community. I, Ten-year-old children. J.Educ.Psychol. 35:357-368, 1944.
545. Condell, J. F.: Note on the use of the Ammons Full-Range Picture Vocabulary Test with retarded children. Psychol.Rep. 5:150, 1959.
546. Boehncke, C. F.: A Comparative Study of the Goodenough Drawing Test and the Leiter International Performance Scale. Unpublished master's thesis, University of Southern California, 1938.
547. Hornowski, B.: Interpretation psychologique des differences entre sexes dans le dessin du bonhomme chez les jeunes adolescents (Psychological interpretation of sex differences in the Draw-a-Man Test among young adolescents). Revue Psychol.Appl. 11:7-9, 1961.
548. Harris, D. B.: A note on some ability correlates of the Raven Progressive Matrices (1947) in the kindergarten. J.Educ.Psychol. 50:228-229, 1959.
549. McHugh, G.: Relationship between the Goodenough Draw-a-Man Test and the 1937 revision of the Stanford-Binet Test. J.Educ.Psychol. 36:119-124, 1945.
550. Birch, J. W.: The Goodenough Drawing Test and older mentally retarded children. Am.J.Ment.Deficiency 54:218-224, 1949.
551. Lessing, E. E.: A note on the significance of discrepancies between Goodenough and Binet IQ scores. J.Consult.Psychol. 25:456-457, 1961.
552. Thompson, J. M., and Finley, C. J.: The relationship between the Goodenough Draw-a-Man Test and the Stanford-Binet Form L-M in children referred for school guidance services. Calif.J.Educ.Res. 14:19-22, 1963.
553. Ansbacher, H. L.: The Goodenough Draw-A-Man Test and primary mental abilities. J.Consult.Psychol. 16:176-180, 1952.

Goodenough: Cultural Variations, Bilingualism

554. Hunkin, V.: Validation of the Goodenough Draw-a-Man Test for African children. J.Soc.Res.Pretoria 1:52-63, 1950.
555. Dennis, W.: Performance of Near Eastern children on the Draw-a-Man Test. Child Development 28:427-430, 1957.
556. Anastasi, A., and DeJesus, C.: Language development and nonverbal IQ of Puerto Rican preschool children in New York City. J.Abnorm.&Social Psychol. 48:357-366, 1953.
557. Johnson, G. B., Jr.: Bilingualism as measured by a reaction-time technique and the relationship between a language and a non-language intelligence quotient. J.Genet.Psychol. 82:3-9, 1953.
558. Havighurst, R. J., Gunther, M. X., and Pratt, I. E.: Environment and the Draw-A-Man Test, the performance of Indian children. J.Abnorm.&Social Psychol. 41:50-63, 1946.
559. Norman, R. D., and Midkiff, K. L.: Navaho children on Raven Progressive Matrices and Goodenough Draw-A-Man Tests. SWest.J.Anthrop. 11:129-136, 1955.
560. Carney, R. E., and Trowbridge, N.: Intelligence test performance of Indian children as a function of type of test and age. Percept.Mot.Skills 14:511-514, 1962.

Goodenough: With Subnormal, Retarded, and Mentally Defective Children

561. McElwee, E. W.: Profile drawings of normal and subnormal children. J.Appl.Psychol. 18:599-603, 1934.
562. Israelite, J.: A comparison of the difficulty of items for intellectually normal children and mental defectives on the Goodenough drawing test. Am.J.Orthopsychiat. 6:494-503, 1936.
563. Spoerl, D. T.: Personality and drawing in retarded children. Character.Pers. 8:227-239, 1940.
564. Spoerl, D. T.: The drawing ability of mentally retarded children. J.Genet.Psychol. 57:259-277, 1940.
565. White, M. R.: The Performance of Epileptic, Feebleminded and Normal Children on the Goodenough Test of Intelligence. Unpublished master's thesis, State University of Iowa, 1945.
566. Gunzburg, H. C.: The significance of various aspects in drawings by educationally subnormal children. J.Ment.Sc. 96:951-975, 1950.
567. Fabian, A. A.: Clinical and experimental studies of school children who are retarded in reading. Quart.J.Child Behav. 3:15-18, 1951.
568. Hunt, B., and Patterson, R. M.: Performance of familial mentally deficient children in response to motivation on the Goodenough Draw-A-Man Test. Am.J.Ment.Deficiency 62:326-329, 1957.
569. Rohrs, F. W., and Haworth, M. R.: The 1960 Stanford-Binet, WISC, and Goodenough Tests with mentally retarded children. Am.J.Ment.Deficiency 66:853-859, 1962.

Goodenough: Chronic Encephalitis

570. Bender, L.: The Goodenough Test (Drawing a Man) in chronic encephalitis in children. J.Child Psychiat. 3:449-459, 1951.

Goodenough: Physically Handicapped

571. Martorana, A. A.: A Comparison of the Personal, Emotional, and Family Adjustment of Crippled and Normal Children. Unpublished doctoral dissertation, University of Minnesota, 1954.
572. Silverstein, A. B., and Robinson, H. A.: The representation of orthopedic disability in children's figure drawings. J.Consult.Psychol. 20:333-341, 1956.
573. Johnson, O. G., and Wawrzasek, F.: Psychologists' judgments of physical handicap from H-T-P drawings. J.Consult.Psychol. 25:284-287, 1961.

Goodenough: Intelligence of Deaf Children

574. Peterson, E. G., and Williams, J. M.: Intelligence of deaf children as measured by drawings. Am.Ann.Deaf 75:273-290, 1930.
575. Shirley, M., and Goodenough, F. L.: A survey of the intelligence of deaf children in Minnesota schools. Am.Ann.Deaf 77:238-247, 1932.
576. Springer, N. N.: A comparative study of the intelligence of deaf and hearing children. Am.Ann.Deaf 83:138-152, 1938.

Goodenough: Measurement of Adjustment

577. Brill, M.: A study of instability using the Goodenough drawing scale. J.Abnorm.&Social Psychol. 32:288-302, 1937.
578. Springer, N. N.: A study of drawings of maladjusted and adjusted children. J.Genet.Psychol. 58:131-138, 1941.
579. Albee, G. W., and Hamlin, R. M.: An investigation of the reliability and validity of judgments of adjustment inferred from drawings. J.Clin.Psychol. 5:389-392, 1949.
580. Ochs, E.: Changes in Goodenough drawings associated with changes in social adjustment. J.Clin.Psychol. 3:282-284, 1950.
581. Albee, G. W., and Hamlin, R. M.: Judgment of adjustment from drawings; the applicability of rating scale methods. J.Clin.Psychol. 6:363-365, 1950.
582. Stone, P. M.: A Study of Objectively Scored Drawings of Human Figures in Relation to the Emotional Adjustment of 6th Grade Pupils. Unpublished doctoral dissertation, Yeshiva University, 1952.
583. Palmer, H. R.: The Relationship of Differences Between Stanford-Binet and Goodenough IQ's to Personal Adjustment as Indicated by the California Test of Personality. Unpublished master's thesis, University of Alabama, 1953.
584. Popplestone, J. A.: Male Human Figure Drawing in Normal and Emotionally Disturbed Children. Unpublished doctoral dissertation, Washington University, 1958.
585. Feldman, M. J., and Hunt, R. G.: The relation of difficulty in drawing to ratings of adjustment based on human figure drawings. J.Consult.Psychol. 22:217-219, 1958.

Goodenough: With Delinquents

586. Hinrichs, W. E.: The Goodenough drawing test in relation to delinquency and problem behavior. Archs.Psychol., N.Y. No. 175, 1935.
587. Starke, P.: An Attempt To Differentiate Delinquents From Non-delinquents by Tests of Dominance Behavior, Dominance Feeling and the Goodenough Drawing of a Man. Unpublished master's thesis, University of Minnesota, 1950.

Goodenough: With Disturbed Persons

588. Berrien, F. K.: A study of the drawings of abnormal children. J.Educ.Psychol. 26:143-150, 1935.
589. Despert, J. L.: Emotional Problems in Children. Utica. State Hospitals Press, 1938.
590. Des Lauriers, A., and Halpern, F.: Psychological tests in childhood schizophrenia. Am.J.Orthopsychiat. 17:57-67, 1947.
591. Holzberg, J. D., and Wexler, M.: The validity of human form drawings as a measure of personality deviation. J.Project.Tech. 14:343-361, 1950.
592. Johnson, A. P., Ellerd, A. A., and Lahey, T. H.: The Goodenough Test as an aid to interpretation of children's school behavior. Am.J.Ment.Deficiency 54:516-520, 1950.
593. Hanvik, L. J.: The Goodenough Test as a measure of intelligence in child psychiatric patients. J.Clin.Psychol. 9:71-72, 1953.

Goodenough: Other References Cited in Text

594. Buck, J. N.: The H-T-P technique, a qualitative and quantitative scoring manual. J.Clin.Psychol. 4:317-396, 1948.
595. Goodenough, F.: Measurement of Intelligence by Drawings. New York. Harcourt, Brace and World, Inc., 1926.
596. Machover, K.: Personality Projection in the Drawing of the Human Figure. Springfield, Ill. Charles C. Thomas, 1949.

IV. THE THEMATIC APPERCEPTION TEST

The technology of personality measurement lags far behind that of ability and achievement measurement. This lag makes it difficult for organizations (such as the Division of Health Examination Statistics) which seek to estimate population parameters on the basis of definitive test scores. At present there is not a single personality test for children that could be recommended without qualification. In view of the extensive use of personality tests in clinical practices and in school situations, this sweeping statement may appear extreme. It is, nevertheless, regrettably true.
Perhaps clinical psychologists can justify their use of various personality measures on the basis of intensive individual case study in which test responses and scores are interpreted, by the clinician, in relation to consistent patterns of performance in the context of a total life record. The clinician usually feels free to accept or disregard information in this frame of reference, and he often employs informal, unstandardized "tests" as well as published procedures without regard for formal considerations of reliability and validity. Furthermore, since clinical judgments are confined to individual cases, they are not subject to verification by the rules of evidence observed in scientific studies. Educators often justify their personality testing as contributing to research, a purpose which is important and is the only tenable position in the light of the facts.

In contrast with the clinical and research uses of personality measures, where legitimacy is not primarily a function of the proven adequacy of the measurement instruments employed, surveys such as this one (HES) operate under severe constraints. The survey scientist must defend the validity and reliability of his instruments as well as the adequacy of his sampling design for the purposes of his survey; both considerations affect the validity of population estimates from sample data.

The choice of a personality measurement instrument for Cycle II must be considered in the context of the preceding discussion. Although the California Personality Test and Cattell's Junior Personality Quiz are, in the opinion of the writer, the most adequately documented of the currently published and objectively scored personality tests for children, neither meets the reliability and validity standards necessary for Survey use and neither is appropriate for the entire age range of 6 through 11 years. Apart from these, no available tests even approach the requirements of this Survey.
In the psychometric sense, the Thematic Apperception Test (TAT) is not a test. It is a projective device consisting of a series of ambiguous (unstructured) pictures individually presented to the subject (or patient), who is asked to imagine and relate a story. The rationale of the procedure is that people will seek to create structure when a stimulus situation is unstructured and that in doing so they will draw on their own experience, needs, attitudes, and values to provide the details. This process is viewed as a "projection" of inner processes on the unstructured stimulus.

The TAT was developed by Henry A. Murray of Harvard University in 1938 (788). At the same time he presented a report which outlined a motivational system of organismic needs and environmental presses. This report was highly influential and stimulated much research. Five years later (in 1943), the TAT pictures and a manual for their use were published (799).

From the objective scoring standpoint, it is necessary to recognize that all projective methods share a major problem, since in all of them the testing strategy depends on the process by which subjects add structure to ambiguous stimuli. Although this structuring process does involve projection, in the sense defined above, it also simultaneously involves other factors. Indeed, the structuring process may be as much a function of external, situational factors, to which the subject is responding, as of internal factors. How these various factors combine is only imperfectly understood in the scientific study of perception and has not, to the writer's knowledge, been investigated in relation to the TAT pictures. In spite of these facts, for the past 60 or more years users of projective techniques have continued to assume that responses to various stimuli represent projection only.
Cattell (796) has suggested that "projective" tests (which he thinks should be called "misperception tests") should employ stimuli of a much lower order of complexity than those of the TAT and the Rorschach inkblots in order to simplify interpretation. Technically this may be an improvement, as Cattell has shown in the misperception tests which he designed for his objective test batteries. In these tests the subject's latitude of response to a specific ambiguity (e.g., estimating the number of communist party members in the United States or the value of a college degree) is extremely limited. A similar conclusion is also implicit in the modifications of the TAT pictures made by McClelland (798) in his studies of motivation measurement in fantasy.

In a complex projective technique such as the TAT, the story produced by a subject may represent his response to the entire picture or only to certain parts of the stimulus picture. In addition, the story itself necessarily requires technical interpretation by the examiner to the extent that it employs idiosyncratic language, symbols, and ideation. Because of the freedom and informality of the method, which is deliberate (in order to avoid prompting or the addition of extraneous variance contributed by the examiner), it is virtually impossible to relate responses to specific internal and external cues or patterns of cues. The very looseness of the interpretative procedure, in contrast to fixed scoring keys in the case of questionnaires (usually answered "yes," "no," or "?"), led George Kelly (797), in an Annual Review article, to observe that while in the case of questionnaires the subject tries to guess what the examiner is thinking, in projective techniques the examiner must guess what the subject is thinking. In either case, there is a good deal of guessing going on.
The TAT has some similarity to the Draw-A-Man Test in that the Draw-A-Man provides an unstructured stimulus (the instruction to draw a person) and permits wide latitude of response structuring on the part of the subject. It is noteworthy that the Draw-A-Man has produced no acceptable schemes for personality interpretation. However, as pointed out in the discussion of the Draw-A-Man, the most promising results in personality, as well as in cognitive assessment, have been those employing detailed, objective techniques of scoring, such as the point scales.

The selection of five cards of the TAT for the Survey undoubtedly reflects (1) the appraisal of existing personality tests mentioned above, combined with (2) the recognition of apparent widespread acceptance of the TAT as a projective technique and (3) the belief that an appropriate method of objective scoring of responses to them can be developed for the specific use of the Survey as well as for later more general use by professional workers. The basis for this appraisal cannot be documented here, although the writer is prepared to defend it. Reference to the forthcoming Sixth Mental Measurements Yearbook (O. Buros, ed., New Brunswick, N.J., The Gryphon Press) might be sufficient for this purpose. The evidence for the recognition of acceptance of the TAT is discussed below, together with an evaluation of the prospects for successful development of an objective scoring procedure.

REVIEW OF THE LITERATURE ON THE TAT

The present review includes abstracts of published research articles, theses, and critical reviews of the TAT literature, as well as 5 general references on the thematic apperception method.
These constitute only a small portion of the extensive psychological, anthropological, and sociological research on the TAT and its variants which have appeared in undiminished quantity over the years (e.g., Thompson's Negro edition of the TAT, Symonds' Picture Story Test, Bellak's Children's Apperception Test (CAT), Van Lennep's Four Picture Test, Phillipson's Object Relations Technique, and numerous other techniques which can be traced to the Murray version). Both the TAT procedure and the Murray "need-press" concepts have been used extensively in personality studies and studies of motivation. The items selected for inclusion in this report were judged relevant if they (1) used a measurement approach, (2) were validation or normative studies, (3) had an applicable sample in terms of age, or (4) used an adequate scoring procedure.

Overview

Treatment of the TAT by different writers ranges from uncritical acceptance on the basis of a priori assumptions, illustrated by Henry (749) and Piotrowski (702), through qualified acceptance with a "soft" attitude toward the contradictory evidence, as demonstrated by Mayman (701) and Lindzey (703), to objective evaluation, illustrated by Eron (706), Windle (704), and others. Windle's comment, that there is little agreement among results reported by different investigators, seems to describe accurately this field of research. One area in which some agreement may be found, however, is that of cognitive evaluation (714 and 737-739); this is highly reminiscent of the Draw-A-Man.

The TAT literature abounds in elaborate but largely untested (critically, that is) scoring systems. Most of these are too extensive for brief summarization and go beyond the purposes of this report. However, they have been reviewed in anticipation of a further empirical study of the Survey's Thematic Apperception Test data, and references to 21 additional selected reports are included in the bibliography of section IV.
Most of these, as well as a number of other suggested analytic methods of scoring the TAT, are well summarized in a 1951 publication by Edwin S. Shneidman, Walther Joel, and Kenneth B. Little (800). Although the modes of analysis vary in detail and in terminology, the typical one involves interpretation and frequency counting or evaluation on a rating scale of all or part of the following types of information, usually across all of the stories obtained for a selection of cards. (The full series of cards is often abridged because of practical time limitations, as it is in the Survey.)

Formal (structural) aspects of the stories
    Compliance with instructions (including card rejection)
    Consistency of stories
    Length of stories; vocabulary level
    Grammatical forms (nouns, pronouns, verbs, incomplete sentences)
    Number and type of situations described
    Number and type of characters included
    Outcome of stories
    Level of response (from description to imaginative interpretation)

Interpretive categories
    Feelings, moods, worries, emotional tone
    Needs expressed (or implied)
    Conflict areas
    Presses: physical, emotional, mental, economic, social, religious
    Characters: strivings, attitudes, obstacles, barriers, traits, and roles of hero, major characters, and minor characters
    Outcomes reflecting success, failure
    Thematic content: family dynamics, inner adjustment, sexual adjustment, interpersonal relations, aggression (physical, nonphysical)
    Developmental level in Freudian (psychosexual) context
    Defense mechanisms utilized
    Manner in which environment is assimilated

The number of variables enumerated under these categories is extensive (Murray's need-press system alone exceeds 83), and in most cases the variables require detailed, careful definition and intensive training of scorers.
High reliabilities have often been achieved among scorers within a particular laboratory for a given period of tenure of the staff members involved, but these have not generally been maintained with staff changes or when systems have been tried out at other institutions. Often, definitions change over time as new generations of protocols appear, requiring decisions in relation to categories developed on the basis of earlier samples.

In spite of the logical (from some theoretical positions) appeal of these analytic approaches, they do not fit the requirements of psychometric procedures. Such analytic approaches satisfy the needs of various clinicians or investigators in their individual practices and researches, but for survey purposes they are useful primarily because they suggest areas which may be suitable for objective study. With the exception of some formal characteristics (such as length of story and other items that can be counted fairly accurately) which have been related to developmental rather than personality-adjustment concepts, there is so little agreement in the literature on most scoring categories that an investigator seeking to develop an objective scoring procedure might as well start from "scratch."

Research Demonstrating Developmental Factors

Edelstein (737) completed an interesting pilot study demonstrating a system for scoring TAT stories. From her system a total age-adjusted score, correlating well with Stanford-Binet IQ's, could be derived. She used the following six scoring categories: number of words, qualifier/word ratio, number of conditions, number of responses, number of situations involved, and number of characters. Her sample included only 15 boys and 13 girls (ages 9-5 to 12-5), but from a methodological viewpoint her study is promising.

In a conceptually related study, Armstrong (714) administered the CAT (cards 1, 2, 4, 8, and 10) to a sample of 60 children in grades 1 to 3 in the University of Minnesota elementary school.
The findings of her study relevant to the present review are as follows: (1) length of story increases with grade, (2) girls' protocols are longer than those of boys, (3) the use of first person pronouns shows a slight but consistent decline with grade progression, (4) girls tend to make more subjective and personalized statements than boys, and (5) girls have a consistently longer reaction time than boys.

Slack (761) gave the TAT to 15 exogenous feebleminded boys and 12 endogenous ones at the Vineland Training School. He correlated a score reflecting the number of causally and purposefully connected statements with the Stanford-Binet and with Thurstone's test of Primary Mental Abilities (PMA). With chronological age held constant, causally or purposefully connected statements correlated with other variables as follows: S-B MA, 0.58; PMA MA, 0.70; PMA Verbal MA, 0.51; PMA Motor MA, 0.72. Length of stories (number of words) correlated as follows with the same variables (CA held constant): S-B MA, 0.31 (ns); PMA MA, 0.34 (ns); PMA Verbal MA, 0.53; PMA Motor MA, 0.48. The age-corrected correlation of number of purposeful relations with the PMA Verbal MA was 0.90, and the correlation of number of causal relations with the same measure was 0.42. Slack also reported a significant difference between the endogenous and exogenous groups on length of stories.

These studies lend some limited support to the possibility of developing an objective scoring system based on developmental criteria for the five TAT pictures used in the Survey.

Other Relevant Research

The following studies were selected for citation on the basis of their relevance to the Survey problems. Lesser (720) demonstrated how a Guttman-type scale could be developed for measurement of aggressive fantasy. Bijou and Kenny (732) and Murstein (734) investigated ambiguity values of TAT cards.
The former found the following ambiguity ranks (out of 21) for the four picture cards used in the Survey (card 16, blank, was not rated):

    Card number    Rank
    1              2
    2              3
    5              17
    8BM            11

The latter reported that cards with medium ambiguity (8BM) were most productive of thematic content among college students.

Milam (735) demonstrated the sensitivity of TAT responses to examiner influence. Apparently, the attitudes and behavior of the examiner, as perceived by the subject, account for variance in the TAT responses. This is true of all psychological tests. It is not possible to say whether this is a greater problem on the TAT than on the WISC, for example, but it must be kept in mind as a significant source of uncontrolled variation.

Gurevitz and Klapper (763) found that schizophrenic children characteristically respond to CAT cards with bizarre outcomes, evaluation of stimuli, use of titles, hostility, and verbosity. Holden (766) compared a small sample of cerebral palsied children with normal controls. His results clearly suggest that cerebral palsied respondents tend to describe the cards, while normal controls give more thematic content. The average number of descriptions (out of 10 cards) was 6.0 for the palsied children and 2.8 for the controls. Leitch and Schafer (770) reported a number of response criteria identifying psychotic responses.

From the standpoint of further research on the development of a scoring procedure for the TAT, the following list of specific items has been recorded and evaluated in one or more of the studies reviewed (reference numbers shown in parentheses). In most cases the results were not included in the main discussion either because of sample limitations, subjective methods of scoring, inconclusiveness of results, or unrelatedness to the present problem. Many of them, however, do appear definable and worthy of further study.
Frequency and duration
    RT latency (705 and 747)
    Total reaction time (705 and 747)
    Number of words (707, 714, 737, 741, 746, 747, and 764)
    Number of adjectives (737)
    Number of adverbs (737)
    Number of nouns (714)
    Number of pronouns (714)
    Number of verbs (714)
    Number of questions (705)
    Number of ego words (714)
    Number of situations (737)
    Number of characters (707 and 737)
        Male, female
Nature of action
    Crying (718)
    Dancing (737)
    Disaster (713)
    Drunkenness (737)
    Escape solutions (705 and 718)
    Fear of punishment (742)
    Fighting (720)
    Hardship (713)
    Illness (713)
    Loss of ability, skill, money (737)
    Suicide (705)
    Frightening (737)
    Killing (720)
    Ridiculing (720)
    Making fun of (737)
    Punishment (705 and 743)
    Stealing (737)
    Receiving aid (705)
    Giving aid (705)
    Teaching (737)
    Laughing (737)
    Singing (737)
Book or movie cited as source (705)
Criticism of picture (705)
Liked, disliked (705)
Title (763)
Number of themes (707, 712, and 764)
Card description
    Parts referred to (705)
    Number of rare picture details (705)
Compliance with instructions (705, 707, and 721)
Examiner included in story (770)
Response
    Bizarre (705 and 763)
    Queer (770)
    Contradictory (770)
    Incoherent (705 and 770)
    Transcendental (707 and 714)
Number of references
    Future events (705 and 721)
    Past events (705 and 721)
    Present events (705 and 721)
Level (712, 721, 755, 766, and 776)
    Enumerative
    Descriptive
    Interpretive
Language
    Neologisms (770)
    Stereotyped (705)
    Vocabulary level (705)
    Unusual wording (770)
    Fluency (705)
    Repetitions (770)
    Foreign expressions
Relative age of characters (705)
    Older
    Peer
    Younger
Sex role identification (705)
    Own
    Opposite
    Ambiguous
Tone of story (712)
    Emotional
    Submission to fate
    Rebellion
    Fear
    Worry
    Lack of affect
    Aspiration
    Shift of tone
Theme of story
    Unrelated (770)
    Curiosity (738)
    Scorning (720)
    Social approval (713)
        Positive
        Negative
        Evasive
    Stressful (725)
    Ordinary family activity (712)
    Mental inadequacy (713)
    Motivational inadequacy (713)
    Physical inadequacy (713)
Perceptual distortions (705, 712, and 770)
Neatness or orderliness of story (705)
Overspecific statements (770)
Overgeneralizations (770)
Autistic logic (770)
Feelings
    Anger toward parent(s) (743)
    Aesthetic (705)
    Ambivalent (705)
    Benign (705)
    Conflict (705)
    Empathy (723)
    Frustration (705 and 713)
    Guilt (705 and 713)
    Happiness (747)
    Hate (720)
    Independence (713)
    Inferiority (705)
    Paranoid (705)
    Parental anger to child (743)
    Pleasant (705)
    Pleasure (713)
    Sadistic (705)
    Security (713)
Number of causal relations (761)
Number of purposeful relations (761)
Outcomes (713, 763, 772, and 775)
    Failure
    Success
    Aggressive (772)
Clarity of statement (705)
Bizarre (763)
Self-reference (705)
Number of personalized statements (705 and 714)
Degree of response certainty (705)
Level of interpretation (Eron, 712)
    Symbolic
    Abstract
    Descriptive
    Unreal
    Fairy tale
    Central character not in picture
    Autobiographical
    Continuations
    Alternate themes
    Comments
    Denial of theme
    Rejection
    Peculiar
    Confused
    Includes examiner in story
    No connection between story and picture
    Humorous

PROSPECTS FOR DEVELOPING AN OBJECTIVE SCORING KEY FOR THE SURVEY'S TAT

Although the TAT literature is scientifically "sloppy" in comparison with the material reviewed in relation to the WISC and the Draw-A-Man Test, the following assumptions seemed warranted: (1) a substantial number of items (both formal-structural and thematic-interpretive) can be reliably defined and accurately scored, (2) discriminating developmental criteria can be devised, and (3) an objectively defined scoring system can be developed which will contribute useful information regarding development between ages 6 and 12 years.

It seems unlikely, in light of the literature reviewed, that scoring scales can be constructed which will measure factors such as motivation, affective states, and personality traits. However, this is not serious since there is no indication that these factors have any developmental implications.
The anticipated developmental scales would greatly enrich the information obtained in the Survey by possibly providing developmental norms with regard to behavioral aspects not encompassed by the other tests, such as verbal expression, thematic content of imagination in standard test situations, associations to standard stimuli, role concepts and attitudes in relation to self, peers of same and opposite sex, parental and adult figures, and common cultural values.

While the picture samples are limited, they appear to be well chosen for the purpose. Card 1 has a boy as the central figure; card 2, a girl; card 5, an adult-parental (mother) figure; and card 8BM, a possible stressful situation (involving a father figure) within the experience background of most school-age children. Card 16, the blank card, is completely unstructured. As a set of cards having nearly universal applicability in a United States national sample, the selection appears excellent.

One of the advantages that an investigator working on this problem would have over most of those who have published reports in this area is the large sample obtained under standardized survey conditions. With adequate funds to work with a fairly large sample of perhaps 1,000 or more cases, a good test of these conclusions could be made. Of course, there is no guarantee that the results will be entirely satisfactory, although the prognosis appears good. However, the Survey is committed to doing something with these data, and no suitable scoring procedure is presently available. In the writer's judgment, the options available were nearly all unsatisfactory, and the one taken may prove to be a wise decision.

BIBLIOGRAPHY

General References to TAT
701. Mayman, M.: Review of the literature on the Thematic Apperception Test, in David Rapaport, Diagnostic Psychological Testing. Vol. II, The Theory, Statistical Evaluation, and Diagnostic Application of a Battery of Tests. Chicago. Year Book Publishers, 1946. pp. 496-506.
702. Piotrowski, Z. A.: A new evaluation of the Thematic Apperception Test. Psychoanalyt.Rev. 37:101-127, 1950.
703. Lindzey, G.: Thematic Apperception Test, interpretive assumptions and related empirical evidence. Psychol.Bull. 49:1-25, 1952.
704. Windle, C.: Psychological tests in psychopathological prognosis. Psychol.Bull. 49:451-482, 1952.
705. Hartman, A. A.: An experimental examination of the Thematic Apperception Technique in clinical diagnosis. Psychological Monographs. Vol. 63, No. 8 (Whole No. 303). Washington, D.C. American Psychological Association, Inc., 1950.
706. Eron, L. D.: Some problems in the research application of the Thematic Apperception Test. J.Project.Tech. 19:125-129, 1955.
707. Lindzey, G., and Silverman, M.: Thematic Apperception Test, techniques of group administration, sex differences, and the role of verbal productivity. J.Personality 27:311-323, 1959.
708. Sanford, R. N., and others: Physique, personality and scholarship; a cooperative study of school children, in Society for Research in Child Development, Monograph, Vol. 8, No. 2. Washington, D.C. National Research Council, 1943.

TAT: Normative Data
709. Cox, B. F., and Sargent, H. D.: The common responses of normal children to ten pictures of the Thematic Apperception Test series; abstracted, Am.Psychologist 3:363, 1948.
710. Bell, J. E.: A comparison of children's fantasies in two equated projective techniques; abstracted, Am.Psychologist 3:263, 1948.
711. Whitehouse, E.: Norms for certain aspects of the Thematic Apperception Test on a group of nine and ten year old children; abstracted, Persona 1:12-15, 1949.
712. Eron, L. D.: A normative study of the Thematic Apperception Test. Psychological Monographs. Vol. 64, No. 9. Washington, D.C. American Psychological Association, Inc., 1950.
713. Cox, B., and Sargent, H. D.: TAT responses of emotionally disturbed and emotionally stable children, clinical judgment versus normative data. J.Project.Tech. 14:61-74, 1950.
714. Armstrong, M. A. S.: Children's responses to animal and human figures in thematic pictures. J.Consult.Psychol. 18:67-70, 1954.
715. Fisher, G. M., and Shotwell, A. M.: Preference rankings of the TAT cards by adolescent normals, delinquents, and mental retardates. J.Project.Tech. 25:41-43, 1961.
716. Brayer, R., Craig, G., and Teichner, W.: Scaling difficulty values of TAT cards. J.Project.Tech. 25:272-276, 1961.

TAT: Scoring Schemes
717. Eron, L. D., Terry, D., and Callahan, R.: The use of rating scales for emotional tone of TAT stories. J.Consult.Psychol. 14:473-478, 1950.
718. Fine, R.: A scoring scheme for the TAT and other verbal projective techniques. J.Project.Tech. 19:306-309, 1955.
719. Friedman, I.: Objectifying the subjective, a methodological approach to the TAT. J.Project.Tech. 21:243-247, 1957.
720. Lesser, G. S.: Application of Guttman's scaling method to aggressive fantasy in children. Educ.Psychol.Measur. 18:543-551, 1958.
721. Dana, R. H.: Proposal for objective scoring of the TAT. Percept.Mot.Skills 9:27-43, 1959.

TAT: Stability, Reliability
722. Porter, F. S.: A Study of Certain Aspects of the Reliability and Validity of the Thematic Apperception Test. Unpublished master's thesis, Iowa State University, 1944.
723. Harrison, R., and Rotter, J. B.: A note on the reliability of the Thematic Apperception Test. J.Abnorm.&Social Psychol. 40:97-99, 1945.
724. Jeffre, M. F. D.: A Critical Study of the Thematic Apperception Test Performance of Normal Children. Unpublished master's thesis, University of Iowa, 1945.
725. Mayman, M., and Kutner, B.: Reliability in analyzing Thematic Apperception Test stories. J.Abnorm.&Social Psychol. 42:365-368, 1947.
726. Kagan, J.: The stability of TAT fantasy and stimulus ambiguity. J.Consult.Psychol. 23:266-271, 1959.

TAT: Validity Studies
727.
Calvin, J. S., and Ward, L. C.: An attempted experimental validation of the Thematic Apperception Test. J.Clin.Psychol. 6:377-381, 1950.
728. Saxe, C. H.: A quantitative comparison of psychodiagnostic formulations from the TAT and therapeutic contacts. J.Consult.Psychol. 14:116-127, 1950.
729. Davenport, B. F.: The semantic validity of TAT interpretations. J.Consult.Psychol. 16:171-175, 1952.
730. Bendig, A. W.: Predictive and postdictive validity of need achievement measures. J.Ed.Res. 52:119-120, 1958.
731. Henry, W. E., and Farley, J.: The validity of the Thematic Apperception Test in the study of adolescent personality. Psychological Monographs. Vol. 73, No. 17 (Whole No. 487). Washington, D.C. American Psychological Association, Inc., 1959.

TAT: Ambiguity Values of Cards
732. Bijou, S. W., and Kenny, D. T.: The ambiguity values of TAT cards. J.Consult.Psychol. 15:208-209, 1951.
733. Davenport, B. F.: The Ambiguity, Universality, and Reliable Discrimination of TAT Interpretations. Unpublished doctoral dissertation, University of Southern California, 1951.
734. Murstein, B. I.: The relationship of stimulus ambiguity on the TAT to the productivity of themes. J.Consult.Psychol. 22:348, 1958.

TAT: Examiner Influence, Interpreter Influence
735. Milam, J. R.: Examiner influences on Thematic Apperception Test stories. J.Project.Tech. 18:221-226, 1954.
736. Young, R. D., Jr.: The Effect of the Interpreter's Personality on the Interpretation of Thematic Apperception Test's Protocols. Unpublished doctoral dissertation, University of Texas, 1953.

TAT: Effects of Intelligence, Achievement
737. Edelstein, R. T.: The Evaluation of Intelligence From TAT Protocols. Unpublished master's thesis, College of the City of New York, 1956.
738. Kagan, J., Sontag, L. W., Baker, D. T., and Nelson, V. L.: Personality and IQ change. J.Abnorm.&Social Psychol. 56:261-266, 1958.
739. Murstein, B. I., and Collier, H.
L.: The role of the TAT in the measurement of achievement as a function of expectancy. J.Project.Tech. 26:96-101, 1962.

TAT: Personality Variables
740. McDowell, J. V.: Development Aspects of Phantasy Production on the Thematic Apperception Test. Unpublished doctoral dissertation, Oklahoma State University, 1952.
741. Cook, R. A.: Identification and ego defensiveness in thematic apperception. J.Project.Tech. 17:312-319, 1953.
742. Mussen, P. H., and Naylor, H. K.: Relationships between overt and fantasy aggression. J.Abnorm.&Social Psychol. 49:235-240, 1954.
743. Kagan, J.: Socialization of aggression and the perception of parents in fantasy. Child Development 29:311-320, 1958.
744. Fitzgerald, B. J.: The Relationship of Two Projective Measures to a Sociometric Measure of Dependent Behavior. Unpublished doctoral dissertation, Ohio State University, 1959.
745. Breger, L.: Conformity and the Expression of Hostility. Unpublished doctoral dissertation, Ohio State University, 1961.

TAT: Effects of Set, Recent Experience, Stimulus Variables
746. Lubin, B.: Some effects of set and stimulus property on TAT stories. J.Project.Tech. 24:11-16, 1960.
747. Newbigging, P. L.: Influence of a stimulus variable on stories told to certain TAT pictures. Can.J.Psychol. 9:195-206, 1955.
748. Coleman, W.: The Thematic Apperception Test. I, Effect of recent experience. II, Some quantitative observations. J.Clin.Psychol. 3:257-264, 1947.

TAT: Environmental Variations; Culture, Social Class, Race, Ethnic Group, Home Conditions, Sex Role, Sociometric Status, Social Acceptance
749. Henry, W. E.: The Thematic Apperception Technique in the study of culture-personality relations. Genet.Psychol.Monogr. 35:3-135, 1947.
750. Mason, B., and Ammons, R. B.: Note on social class and the Thematic Apperception Test. Percept.Mot.Skills 6:88, 1956.
751. Fisher, S., and Fisher, R. L.: A projective test analysis of ethnic subculture themes in families. J.Project.Tech. 24:366-369, 1960.
752. Mitchell, H. E.: Social class and race as factors affecting the role of the family in Thematic Apperception Test stories; abstracted, Am.Psychologist 5:299-300, 1950.
753. Mussen, P. H.: Differences between the TAT responses of Negro and white boys. J.Consult.Psychol. 17:373-376, 1953.
754. Mussen, P. H.: Some personality and social factors related to changes in children's attitudes toward Negroes. J.Abnorm.&Social Psychol. 45:423-441, 1950.
755. Shields, D. L.: An Investigation of the Influences of Disparate Home Conditions Upon the Level at Which Children Responded to the Thematic Apperception Test. Unpublished master's thesis, University of Pittsburgh, 1950.
756. McArthur, C.: Personality differences between middle and upper classes. J.Abnorm.&Social Psychol. 50:247-254, 1955.
757. Cox, F. N.: Sociometric status and individual adjustment before and after play therapy. J.Abnorm.&Social Psychol. 48:354-356, 1953.
758. Herman, G. N.: A Comparison of the TAT Stories of Preadolescent School Children Differing in Social Acceptance. Unpublished master's thesis, University of Toronto, 1952.
759. Milner, E.: Effects of sex role and social status on the early adolescent personality. Genet.Psychol.Monogr. 40:231-325, 1949.
760. Butler, O. P.: Parent Figures in Thematic Apperception Test Records of Children in Disparate Family Situations. Unpublished doctoral dissertation, University of Pittsburgh, 1948.

TAT: With Feebleminded, Retarded, Handicapped, Brain Injured, Palsied, Disturbed, and Psychotic Children
761. Slack, C. W.: Some intellective functions in the Thematic Apperception Test and their use in differentiating endogenous feeblemindedness from exogenous feeblemindedness. Train.Sch.Bull. 47:156-169, 1950.
762. Tolman, N. G., and Johnson, A. P.: Need for achievement as related to brain injury in mentally retarded children. Am.J.Ment.Deficiency 62:692-697, 1958.
763. Gurevitz, S., and Klapper, Z.
S.: Techniques for and evaluation of the responses of schizophrenic and cerebral palsied children to the Children's Apperception Test (C.A.T.). Quart.J.Child Behavior 3:38-65, 1951.
764. Abel, T. M.: Responses of Negro and white morons to the Thematic Apperception Test. Am.J.Ment.Deficiency 49:463-468, 1945.
765. Beier, E. G., Gorlow, L., and Stacey, C. L.: The fantasy life of the mental defective. Am.J.Ment.Deficiency 55:582-589, 1951.
766. Holden, R. H.: The Children's Apperception Test with cerebral palsied and normal children. Child Development 27:3-8, 1956.
767. Hood, P. N., Shank, K. H., and Williamson, D.: Environmental factors in relation to the speech of cerebral palsied children. J.Speech & Hearing Disorders 13:325-331, 1948.
768. Bergman, M., and Fisher, L. A.: The value of the Thematic Apperception Test in mental deficiency. Psychiat.Quart.Suppl. 27:22-42, 1953.
769. Ericson, M.: A study of the Thematic Apperception Test as applied to a group of disturbed children; abstracted, Am.Psychologist 2:271, 1947.
770. Leitch, M., and Schafer, S.: A study of the Thematic Apperception Tests of psychotic children. Am.J.Orthopsychiat. 17:337-342, 1947.
771. Shank, K. H.: An Analysis of the Degree of Relationship Between the Thematic Apperception Test and an Original Projective Test in Measuring Symptoms of Personality Dynamics of Speech Handicapped Children. Unpublished doctoral dissertation, University of Denver, 1954.
772. Christensen, A. H.: A Quantitative Study of Personality Dynamics in Stuttering and Non-Stuttering Siblings. Unpublished master's thesis, University of Southern California, 1951.
773. Young, F. M.: Responses of juvenile delinquents to the Thematic Apperception Test. J.Genet.Psychol. 88:251-259, 1956.

TAT: With CAT and Michigan Picture Test
774. Symonds, P. M.: Adolescent Fantasy, an Investigation of the Picture-Story Method of Personality Study. New York. Columbia University Press, 1949.
775. Light, B.
H.: Comparative study of a series of TAT and CAT cards. J.Clin.Psychol. 10:179-181, 1954.
776. Andrew, G., Walton, R. E., Hartwell, S. W., and Hutt, M. L.: The Michigan Picture Test, the stimulus value of the cards. J.Consult.Psychol. 15:51-54, 1951.

Special Bibliography of TAT Scoring Systems (see also 717 to 721)
777. Andrew, G., Hartwell, S. W., Hutt, M. L., and Walton, R. E.: The Michigan Picture Test. Chicago. Science Research Associates, Inc., 1953.
778. Arnold, M. B.: A demonstration analysis of the Thematic Apperception Test in a clinical setting. J.Abnorm.&Social Psychol. 44:97-111, 1949.
779. Aron, B.: A Manual for Analysis of the Thematic Apperception Test. Berkeley, Calif. Willis E. Berg, 1949.
780. Bellak, L.: A Guide to the Interpretation of the Thematic Apperception Test. New York. The Psychological Corporation, 1947.
781. Cox, B., and Sargent, H.: TAT responses of emotionally disturbed and emotionally stable children. J.Project.Tech. 14:61-74, 1950.
782. Dana, R. H.: Norms for three aspects of TAT behavior. J.Genet.Psychol. 57:83-89, 1957.
783. Fine, R.: Manual for Scoring Scheme for Verbal Projective Techniques (TAT, MAPS, Stories, and the Like). Washington, D.C. Veterans Administration, 1948.
784. Fry, F. D.: Manual for scoring the TAT. J.Psychol. 35:181-195, 1958.
785. Hartman, A. A.: An experimental examination of the Thematic Apperception Technique in clinical diagnosis. Psychological Monographs. Vol. 63, No. 8 (Whole No. 303). Washington, D.C. American Psychological Association, Inc., 1950. pp. 1-48.
786. Henry, W. E.: The Analysis of Fantasy. New York. John Wiley and Sons, Inc., 1956.
787. Klebanoff, S.: Personality factors in symptomatic chronic alcoholism as indicated by the Thematic Apperception Test. J.Consult.Psychol. 11:111-119, 1947.
788. Murray, H. A.: Explorations in Personality. New York. Oxford University Press, 1938.
789. Rapaport, D.: The Thematic Apperception Test, Ch.
IV, in Diagnostic Psychological Testing, Vol. II, Chicago. Yearbook Publishers, Inc., 1946.
790. Shorr, J. E.: A proposed system for scoring the TAT. J.Clin.Psychol. 4:189-195, 1948.
791. Stone, H.: The TAT Aggressive Content Scale. J.Proj.Tech. 20:445-452, 1956.
792. Terry, D.: The use of a rating scale of level of response in TAT stories. J.Abnorm.&Social Psychol. 47:507-511, 1952.
793. Tomkins, S. S., and Tomkins, E. S.: The Thematic Apperception Test, the Theory and Technique of Interpretation. New York. Grune and Stratton, 1948.
794. White, R. K.: Value Analysis, the Nature and Use of the Method. New York. Society for the Psychological Study of Social Issues, 1951.
795. Wyatt, F.: The scoring and analysis of the Thematic Apperception Test. J.Psychol. 24:319-330, 1947.

Other References Cited in Text
796. Cattell, R. B.: Personality and Motivation Structure and Measurement. New York. Harcourt, Brace and World, 1959.
797. Kelly, G. A.: The theory and technique of assessment, in P. R. Farnsworth and Q. McNemar, eds., Annual Review of Psychology, Vol. 9. Palo Alto, Calif. Annual Reviews, Inc., 1958.
798. McClelland, D.: Studies in Motivation. New York. Appleton-Century-Crofts, Inc., 1955.
799. Murray, H. A.: Thematic Apperception Test, Pictures and Manual. Cambridge. Harvard University Press, 1943.
800. Shneidman, E. S., Joel, W., and Little, K. B.: Thematic Test Analysis. New York. Grune and Stratton, 1951.

V. TOTAL PSYCHOLOGICAL TEST BATTERY

The foregoing reviews of the several components of the Survey's psychological test battery have discussed the strengths and weaknesses of each test and the problems involved in estimating population parameters on a national scale from the sample data. In each case a number of specific problems were raised, and suggestions for treatment of data or for further research have been made in the respective sections of the report.
However, the most important common problem derives from the examination of the standardization basis of these tests. The norms for the WISC are unquestionably the most satisfactory, with the Draw-A-Man being second; the adequacy of the Wide Range Achievement Test norms has been questioned (see section II). Finally, new norms, related to the scoring system to be developed for the TAT, are yet to be constructed.

In order to achieve the soundest possible basis for population estimates with this battery, it is recommended that new national norms, based on the total Survey sample, be developed for all of the tests before any final population estimates are published. While some preliminary estimates may be warranted, using norms provided by the test publishers, the discussions in the individual sections of the report point up the necessity of the recommended restandardization. In the event that this work cannot be fully supported, the order of priority indicated by the review would place the reanalysis of the WRAT first, the Draw-A-Man Test second, and the WISC third. It is assumed that this must be done for the TAT when a new scoring procedure is completed and adopted.

The issues in relation to the WRAT are as follows: (1) No adequate sampling plan was followed in standardizing the 1963 revision, and, in fact, the bias of the sample is clearly mentioned in the manual. (2) The test scores used to compile the sample by levels are not equivalent; therefore, only limited confidence can be placed in the resulting norm levels, even though substantial correlation of the WRAT scales with concurrent criteria appears likely.

In the case of the Draw-A-Man Test, it is recognized that (1) the Goodenough norms are outmoded, and that (2) the use of the Harris norms (which is recommended) without analysis of the raw score distributions on the national sample might lead to some errors.
The administration of the Draw-A-Man Test in the Survey was different from that recommended by Harris, and it would be prudent to proceed empirically rather than to assume that the Survey drawings are equivalent. In addition, Harris' own norms do not reflect as good a national sample as even the WISC, for which further standardization is unquestionably justified.

One of the major problems with the WISC subtests is that of examining further the optimal basis for estimating Full Scale IQ's from the Vocabulary and Block Design scores. Even if restandardization should reveal no need for rescaling the subtest items, the adoption of published conversion tables or direct proration is considered unjustified without further research. This is discussed in more detail in section I.

The information expected from the test battery may be summarized as follows:
1. WISC Vocabulary - score. This test individually provides a good estimate of "g," the common "general intelligence" factor in the WISC, and may be accepted as a good measure of the verbal component of the general measure of intelligence.
2. WISC Block Design - score. This test is also well saturated in "g" and second only to Vocabulary in reliability. It should be accepted as a strong nonverbal intelligence test and as an estimate of the nonverbal component of the full test.
3. Draw-A-Man Test - Goodenough-Harris standard score. The Goodenough-Harris standard score (preferably restandardized on the total Survey sample) can be interpreted as a deviation IQ, in a manner comparable to the WISC IQ's. This score is a reliable and reasonably valid nonlanguage measure of mental maturity.
4. WRAT Oral Reading - grade equivalent (RQ).
5. WRAT Oral Reading - standard score (Rss).
6. WRAT Arithmetic - grade equivalent (Aq).
7. WRAT Arithmetic - standard score (Ass).
Both the grade equivalents and the standard scores will be useful for the WRAT Reading and Arithmetic subtests (particularly if they are restandardized on the total Survey sample). The grade equivalents will permit assessment of school retardation, while the standard scores, which have the same characteristics as deviation IQ's, will be more appropriate in pattern analytic combination with the WISC and Draw-A-Man scores.
8. TAT - developmental score(s). This may actually be a series of scores. It is entered "symbolically" at this time.

It is possible to think of these data as providing individual profiles or patterns which supplement information represented by the individual scores. For example, some children may rank high or low on all scales, indicating general excellence or retardation in comparison with the general population. There may also be discriminable test patterns associated with such special conditions as reading disability, mental deficiency, scholastic retardation, verbal impairment due to physical or social reasons, behavior disorders, and cultural deprivation. If such patterns exist, it should be possible to identify them by a standard research design based on discrimination of experimentally formed criterion groups. A hierarchical grouping analysis of score profiles, seeking to identify characteristic profiles of groups, would be an alternative approach. In this procedure, identification of criterion characteristics of the groups would follow rather than precede the main analysis. In either case, criterion data would be obtained from record sources within the Health Examination Survey. In this type of analysis it might also be profitable to explore patterns based on scores representing discrete residuals, with common variance partialled out and represented by an additional variable.
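The grouping idea described above can be sketched in a few lines. What follows is only an illustration under stated assumptions, not the procedure the report has in mind: the six score rows are invented, the scale labels are shorthand for four of the battery scores, average-linkage merging is just one of several hierarchical grouping rules, and subtracting each child's mean standardized score is a simple stand-in for "partialling out common variance" (the mean itself plays the role of the additional variable).

```python
from statistics import mean, pstdev

# Hypothetical scores for six children on four scales (invented for illustration).
SCALES = ["Vocabulary", "Block Design", "Reading", "Arithmetic"]
SCORES = [
    [110, 108, 85, 107],
    [95, 97, 72, 96],
    [120, 118, 96, 121],
    [100, 101, 99, 100],
    [85, 84, 86, 85],
    [115, 116, 114, 115],
]


def standardize(rows):
    """Re-express every scale as z-scores computed within this sample."""
    z_columns = []
    for col in zip(*rows):
        m, s = mean(col), pstdev(col)
        z_columns.append([(x - m) / s for x in col])
    return [list(r) for r in zip(*z_columns)]


def split_level_and_shape(profile):
    """Return (overall level, residual profile with the level removed)."""
    level = mean(profile)
    return level, [x - level for x in profile]


def distance(a, b):
    """Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def group_profiles(profiles, n_groups):
    """Average-linkage grouping: merge the two closest clusters until n_groups remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean([distance(profiles[a], profiles[b])
                          for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


z_profiles = standardize(SCORES)
levels = [split_level_and_shape(p)[0] for p in z_profiles]
shapes = [split_level_and_shape(p)[1] for p in z_profiles]
groups = group_profiles(shapes, 2)
print(groups)
```

With these invented scores the grouping is driven by profile shape (reading low or high relative to the child's other scores) rather than by overall level, which is held separately in `levels`. Criterion characteristics of each group would then be examined after the grouping, as the text describes.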
Computer programs for these types of analysis are available, and such studies could be conducted economically on subsamples of the Survey sample.

The inclusion of these psychological tests in the National Health Survey was a very important step which has tremendous practical value to the health, education, and welfare fields and which also has immense scientific value in the life sciences concerned with child development. Despite the technical criticisms, which are inevitable in a problem of the magnitude of this national survey, the tests have been judged to be either a good choice or at least an eminently reasonable compromise with reality within the constraints of the Survey.

The research recommended should be looked on as an unprecedented opportunity to contribute toward adequate mental measurement of children. It is important for those working in this Survey to bear in mind that this is the first general survey of psychological functions of children ever conducted on a sophisticated national sample. The standardization programs for the tests reviewed, and for others referred to, fail to qualify for this distinction. National psychological surveys of adults have been made in both World Wars, and recently a national survey of adolescents was conducted by Project TALENT. However, Cycle II is, to the writer's knowledge, the first one of its kind in the age range of 6 to 12 years.

VI. CROSS-DISCIPLINARY ANALYSES

The complete data of Cycle II may be regarded as composing a matrix of several thousand variables (specific measures or components of measurement procedures) over a sample of nearly 8,000 children. In the processes of data reduction and analysis, many of these variables will remain in the matrix without further manipulation (e.g., height, weight, body temperature, family income level, twin status, number of siblings, and ages of parents).
Some will require prescheduled analysis and computation of indexes according to established procedures in the respective fields (e.g., visual acuity, exercise tolerance, and electrocardiogram), while others will require extensive processing on the basis of empirically constructed or revised scoring keys and norms, as in the case of the psychological tests discussed in this review.

Upon completion of segmental analysis of each testing and examining procedure and reduction of all data to indexes and primary variables, it would be desirable to consider multivariate analysis of the resulting matrix. This type of approach will undoubtedly reveal many significant interrelationships not previously investigated because of lack of appropriate data. It is premature to consider it now, however, before the reduced data schedule is more definitely known.

The primary purpose of the present discussion is to explore possible linkages between the psychological tests in the Survey battery and other variables. This, too, is a formidable task, but some important areas of investigation are opened up by this Survey, and these opportunities for significant research deserve special mention.

DATA AVAILABLE

From various sources within the Survey, data on items such as the following, which have important behavioral implications, will be available:

Parents - age, nativity, education, income level, language spoken, psychiatric history, marital status, handedness, and use of medical care. (The distributions of these variables are of interest. In addition, an SES index of socioeconomic level can be derived.)

Siblings - number, twins, ages, education, marital status, work status. (From these data an additional variable, birth ordinal position, can be derived.)

Family - size, living status, ethnic classification, race, SES.
Child - school information: grade placement; progress rate; absences; characterization as requiring special provision for hard of hearing, visually handicapped, speech therapy, orthopedically handicapped, gifted, slow learning, mentally retarded, emotionally disturbed; description in relation to adjustment, attention, interpersonal relations, discipline, popularity, intellectual ability, academic performance. (These data are worthy of some detailed analysis in order to formulate external rating criteria for independent test validation and to derive further indexes, such as peer rejection (based on interpersonal relations and popularity), general adjustment, and general adequacy (based on a frequency count of negative citation).)

Child - medical history: prenatal and birth circumstances, food habits, enuresis, thumbsucking, age of walking, talking, early learning rate, attendance at kindergarten, experience of unconsciousness, bad burns (with resulting scars), serious illness, weakness, nightmares, sleeping arrangements, age at puberty (girls). (Frequency distributions of these items, particularly of food habits, which would also provide a basis for judging food idiosyncrasies, and sleeping arrangements, which should correlate with SES but may also relate to other variables, should be of great interest. Correlations of many of these items with other data may be extremely important, as, for example, the investigation of sequelae of early unconsciousness and the development of a growth retardation classification, a disturbance index, and a "weakness" index.)

Child - sensory and motor indexes: visual acuity, color vision, hearing indexes, handedness, grip strength, vital capacity, exercise tolerance.

Child - body measurements: height, weight, anthropometry, X-ray, dentition.

Child - psychophysiological indexes: blood pressure, temperature, electrocardiogram, phonocardiogram.

Child - medical findings: health status, pathology.
Child - psychological tests: IQ estimates; verbal ability level; performance ability level; reading, arithmetic, maturity level; adjustment index.

ANALYSES INDICATED

The organization and ordering of the lines of analysis suggested in this section are tentative and are not intended to suggest priorities. In most cases, further study of the literature in the particular areas and consultation with qualified professional persons would be appropriate before committing time and funds to particular studies. Nevertheless, the richness of this "data bank" is recognized as a source of new scientific knowledge, and it is hoped that it can be adequately exploited.

Growth Indexes

It is expected that mean growth indexes for boys and girls will be computed for as many functions as possible over the six age periods. Analysis of relations among growth trends, separately for boys and for girls, and of growth rate patterns would be of direct interest and would also permit comparison of pattern indexes with psychological test scores. Sex differences in growth patterns and relations of sex-related patterns to test scores are also of great interest.

Other Factors Related to Test Scores

Discriminant pattern analyses might be undertaken systematically in a multivariate design to investigate parental, sibling (including birth order and twin resemblance for the twin sample), family, school, medical, sensory and motor, anthropometric, psychophysiological, and medical correlates of psychological test scores. While this recommendation may appear forbidding in magnitude, the multivariate approach is actually more efficient and economical in total perspective than piecemeal analyses. Among the studies implied in this broad prescription are the following types of investigations:
1. Reading disability. Effects of visual and auditory impairment; handedness; SES; growth trends; developmental history; early, recent, and continuing emotional disturbance; illness; birth order, etc.
2. Mental retardation. Every item in the above enumeration is potentially related to mental retardation.
3. School retardation. Same as above.
4. Analyses of discrepancies between actual and predicted status in relation to concomitant or associated factors. These data offer an excellent opportunity to look for significant variance associated with overachievement and underachievement in school grade placement, reading achievement (WRAT and school report), scholastic achievement (school report, WRAT Arithmetic), and peer relations (deviation from central tendency).

While more detailed and specific investigations could be enumerated, it is more constructive to emphasize the advisability of using the multivariate approach, since computer equipment and programs are available for such analyses and since results of greater value can be obtained at a far lower unit cost.

Acknowledgments

The literature review and preparation of abstracts were under the immediate direction of Samuel H. Cox, Research Associate at the Institute of Behavioral Research. Principal persons assisting Mr. Cox were Robert M. Marx, John McCrady, Henry Orloff, and Max S. Taggart II. The project also was greatly expedited through the efforts of Miss Johnoween Gill, Reference Librarian, Texas Christian University.

Without the loyal and competent help of these individuals this report could not have been completed in only 3 months.
GLOSSARY OF ABBREVIATIONS

BD: Block Design subtest of the Wechsler Intelligence Scale for Children
CA: Chronological age
CAT: Children's Apperception Test
CMAS: Children's Manifest Anxiety Scale
CRT: California Reading Test
CTMM: Chicago Tests of Primary Mental Abilities
E-G-Y: Kent E-G-Y Test (Scale D, Kent Series of Emergency Scales)
FRPV: Full-Range Picture Vocabulary Test (by Ammons)
FS: Full Scale (or Full Score) of the Wechsler Intelligence Scales
g: General, or "global," intelligence factor
HES: Health Examination Survey
IQ: Intelligence quotient
M: Mean
MA: Mental age
N: Number
ns: Not significant
PPVT: Peabody Picture Vocabulary Test
PS: Performance Scale (or Performance Score) of the Wechsler Intelligence tests
R: Range
r: Correlation
RT: Response time
SAT: Stanford Achievement Test
S-B: Stanford-Binet Intelligence Scale
SES: Socioeconomic status
SRA: Science Research Associates, Inc.
SRA-PMA: SRA Primary Mental Abilities
SS: Standard score
TAT: Thematic Apperception Test
Voc.: Vocabulary subtest of the Wechsler Intelligence Scales
VS: Verbal Scale (or Verbal Score) of the Wechsler Intelligence tests
WAIS: Wechsler Adult Intelligence Scale
WISC: Wechsler Intelligence Scale for Children
WRAT: Wide Range Achievement Test

U.S. GOVERNMENT PRINTING OFFICE: 1966 O - 206-188

OUTLINE OF REPORT SERIES FOR VITAL AND HEALTH STATISTICS
Public Health Service Publication No. 1000

Series 1. Programs and collection procedures: Reports which describe the general programs of the National Center for Health Statistics and its offices and divisions, data collection methods used, definitions, and other material necessary for understanding the data.
Reports number 1-4

Series 2. Data evaluation and methods research: Studies of new statistical methodology including: experimental tests of new survey methods, studies of vital statistics collection methods, new analytical techniques, objective evaluations of reliability of collected data, contributions to statistical theory.
Reports number 1-15

Series 3. Analytical studies: Reports presenting analytical or interpretive studies based on vital and health statistics, carrying the analysis further than the expository types of reports in the other series.
Reports number 1-4

Series 4. Documents and committee reports: Final reports of major committees concerned with vital and health statistics, and documents such as recommended model vital registration laws and revised birth and death certificates.
Reports number 1 and 2

Series 10. Data From the Health Interview Survey: Statistics on illness, accidental injuries, disability, use of hospital, medical, dental, and other services, and other health-related topics, based on data collected in a continuing national household interview survey.
Reports number 1-27

Series 11. Data From the Health Examination Survey: Statistics based on the direct examination, testing, and measurement of national samples of the population, including the medically defined prevalence of specific diseases, and distributions of the population with respect to various physical and physiological measurements.
Reports number 1-12

Series 12. Data From the Health Records Survey: Statistics from records of hospital discharges and statistics relating to the health characteristics of persons in institutions, and on hospital, medical, nursing, and personal care received, based on national samples of establishments providing these services and samples of the residents or patients.
Reports number 1-4

Series 20. Data on mortality: Various statistics on mortality other than as included in annual or monthly reports; special analyses by cause of death, age, and other demographic variables, also geographic and time series analyses.
Reports number 1

Series 21. Data on natality, marriage, and divorce: Various statistics on natality, marriage, and divorce other than as included in annual or monthly reports; special analyses by demographic variables, also geographic and time series analyses, studies of fertility.
Reports number 1-7

Series 22. Data From the National Natality and Mortality Surveys: Statistics on characteristics of births and deaths not available from the vital records, based on sample surveys stemming from these records, including such topics as mortality by socioeconomic class, medical experience in the last year of life, characteristics of pregnancy, etc.
Reports number 1

For a list of titles of reports published in these series, write to:
National Center for Health Statistics
U.S. Public Health Service
Washington, D.C. 20201