A Guide to the Development of Job Knowledge Tests: A Reference Kit for Measurement Specialists

United States Civil Service Commission
Bureau of Policies and Standards
Technical Memorandum 76-13
Objective PR10.02

Lynnette B. Plumlee, Ph.D.
Personnel Research and Development Center
United States Civil Service Commission
Washington, D.C. 20415
1976

Abstract

This Kit presents a suggested set of procedures for item writing and assembling job knowledge tests for use in merit system programs. It does not include detailed instructions concerning the establishment of scoring procedures and norms, the conducting of the final test analysis, or the demonstration of test validity. Consideration is given to such topics as the selection of a panel to oversee the project, the planning of selection procedures, the selection and training of item writers, the writing and review of test items, pretesting, item revision, and the assembly, review, and production of the final test. The Appendices supplement the text with such related material as forms, sample memoranda and checklists, procedural guides on various phases of the item writing and review process, a brief guide on norming and equating procedures, explanatory articles on test difficulty, chance scores, and basic testing principles for the nonspecialist, and tables for converting and equating item statistics. A Bibliography provides references to additional sources of background reading.

Preface

This reference Kit was designed primarily to provide psychologists and other measurement specialists with a set of procedures for item writing and assembling job knowledge tests for use in merit systems programs. It should not be interpreted as reflecting the only way such test development should be done. It does not include detailed instructions on establishing scoring procedures and norms, conducting final test analysis, and demonstrating test validity.

Whether the user is a psychologist or not, it is assumed that he or she will have a working knowledge of such fields as elementary psychology, basic principles of testing, individual differences, industrial psychology, elementary statistics, and basic test theory. Generally, such preparation would require at least a master's degree in psychology or education, with significant course work and experience equivalent to a major in educational and psychological measurement. Without such knowledge and experience, a person should not attempt any test development work.

Test development should not be considered complete until the final test has been analyzed and its validity demonstrated. This Kit is being issued without recommended procedures and explanatory material for these phases beyond the production of the final test. Separate documents are planned which will include such topics as how to establish scoring procedures and norms, how to conduct the final test analysis, and how to validate the test for its intended use.

This Kit was prepared by the author in an effort to bring together techniques and procedures used by those who have been extensively involved in test development. References to tests and other sources which will provide further background material are given in footnotes throughout the text and the Appendices.
These sources are listed, with complete reference information, in the Bibliography at the end of the Kit. Complete citations for in-text references appear at the end of the section or appendix to which they apply. Some changes to the text have been made by the staff of PRDC to reflect internal views on different aspects of test development. Your comments and suggestions on improving this Kit, which will be periodically updated, are welcome.

The author prepared this publication in fulfillment of Purchase Order No. 74-2249, U.S. Civil Service Commission. She has a Ph.D. in Psychology from the University of Chicago, was Director of the Test Development Division of the Educational Testing Service, and is currently in private practice as a consultant on test development and validation procedures.

Table of Contents

I. Introduction 1
II. General Considerations in Planning the Test Development Project 1
III. Selecting the Supervisory Panel 3
   A. Function and Duties of the Supervisory Panel 3
   B. Qualifications Important for Panel Membership 4
   C. Procedures for Selection of Panel Members 4
   D. Size of Panel 5
IV. Planning the Selection Procedures and Test Development 5
   A. Preparation Prior to the First Working Meeting of the Panel 5
   B. Panel Work Session 7
V. Selecting the Item Writers 17
   A. Responsibilities of Item Writers 17
   B. Writer Qualifications 17
   C. Number of Items and Writers Needed 18
   D. Steps Prior to Item Writing Workshop 20
VI. Training the Item Writers 20
   A. Orientation 20
   B. Trial Item Writing 21
VII. Item Writing and Reviewing 24
   A. Selected Background References on Item Writing 24
   B. Efficient Use of Workshop Time 25
   C. Achieving Appropriate Item Difficulty 26
   D. Item Review 26
   E. Item Revision 28
   F. Handling the Low-Productivity Writer 29
VIII. Pretesting 30
   A. Purpose of Pretesting 30
   B. Pretest Criterion 30
   C. Population Sample for Pretest 31
   D. Description of the Pretest 32
   E. Administration of the Pretest 33
   F. Statistics to be Collected by the Pretest Analysis 34
IX. Item Revision and Final Test Assembly 36
   A. Preparation of Items 36
   B. Use of Item Analysis Data to Revise Items 36
   C. Tentative Assembly 37
   D. Meeting of Item Writers for Final Test Assembly 38
X. Final Test Review 40
XI. Final Test Production 41
References 43
Glossary 44
Appendices
   I. Standards and Regulations 47
      A. APA Standards for Educational and Psychological Tests 49
      B. FPM Supplements 271-1, 271-2, 330-1, and 335-1, USCSC Guidelines on Examining Practices 51
      C. EEOC Federal Guidelines on Employee Selection Procedures 53
      D. OFCC Federal Guidelines for Reporting Validity 55
      E. Division 14, APA, Guidelines for Choosing Consultants for Psychological Selection Validation Research and Implementation 57
   II. Forms and Sample Memoranda 61
      A. Initial Material for Panel 63
      B. Sample Memorandum for Nominated Item Writers 70
      C. Form DL, Sample Daily Log 72
      D. Form MN, Measurement Need Definition 73
      E. Form BD, Background Data for Test Development 74
      F. Form TS, Test Specifications 75
      G. Form PA, Pretest Analysis 76
      H. Form IAR, Item Analysis Results 77
      I. Sample Memorandum on Control of Test Materials 78
   III. Checklists 79
      A. Schedule Guide 81
      B. Meeting Arrangements Checklist 82
      C. Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests 84
   IV. Procedural Guides 87
      A. Suggested Source Material for Job Analysis 89
      B. Item Writing Guide, Sample Items, and Examples of Item Modification 90
      C. Item Types 104
      D. Item Writing Rules for Multiple-Choice Tests of Job Knowledge 111
      E. Suggestions for Developing Item Ideas 114
      F. Instructions Regarding Item Format 115
      G. Item Review Guide 120
      H. Test Format and Sample Test Instructions 122
      I. Procedures for Obtaining Pretest Statistics 126
      J. Using Pretest Statistics 131
      K. Test Review Guide and Test Review Sheet 135
      L. Brief Guide on Norming and Equating Procedures 139
   V. Explanatory Articles 143
      A. Test Difficulty 145
      B. Chance Scores 148
      C. Basic Testing Principles for the Non-specialist 150
   VI. Tables 153
      A. Percent Correct and Delta Equivalents 155
      B. Conversion of rbis (biserial) to rpbi (point biserial) 157
      C. Standard Errors of rbis for Selected Values of rbis, p, and N 158
Bibliography 159

I. Introduction

This Kit has been prepared to assist the Measurement Specialist (MS) in properly developing job knowledge tests through the test assembly phase.[1] It does not include detailed information on establishing scoring procedures and norms, final test analysis, and test validation.[2] (The instructions in this Kit are not designed for developing tests using the Job Element Approach developed by E. S. Primoff. Such instructions are contained in other USCSC publications.)

The Kit discusses the role of the MS as that of supervising and coordinating the total test development operation to insure that all essential steps are carried out and that procedures and documentation are in accord with his agency's requirements and with Federal regulations. The Kit provides examples and detailed discussions of methodology to assist the MS in carrying out this role of providing the technical background and expertise in selection techniques and test development. For ease of presentation, an overview paragraph at the beginning of each section calls attention to the type of information contained in the section.

To properly carry out his role, the MS is expected to be knowledgeable about those aspects of the test which affect test effectiveness, such as test instructions, item characteristics, format, conditions of administration, and examinee background and attitude. He must also be competent to select and advise a Supervisory Panel in planning and reviewing the test, and a Committee of Item Writers in writing the test questions. He[3] should know the relative usefulness of different selection procedures, including interviews, references, application data, transcripts of school records, and various testing techniques (ability, job knowledge, performance, job element).

[1] As used in this Kit, the term "job knowledge tests" includes written skills tests.
[2] Refer to the Preface and Appendix I-E for an explanation of the qualifications needed by the MS. If the MS does not possess all of these qualifications, he should consult a specialist in dealing with technical problem areas outside of his expertise.
[3] The pronoun "he" has been used throughout in order to avoid the cumbersomeness of the double pronoun.

II. General Considerations in Planning the Test Development Project

Overview. Determining the agency's need which led to the request for assistance; evaluating the adequacy of the job analysis; planning the job analysis, if one is needed; deciding the kind and number of subject-matter experts required; planning the time schedule; and preparing the instructional materials.

The Measurement Specialist's first task will be to determine fairly precisely the basis for the agency's request for test instruments or other selection procedures. What specific positions are involved?
Are the requested instruments or procedures to be used in selecting employees for employment, promotion, training, transfer, layoff, or other purposes? What State and local laws govern such selection? (For example, do the laws specify part or all of the selection techniques? Do they require ranking of the best qualified?) What selection procedures have been used previously? To what extent must future selection procedures relate to past procedures? Is the selection need unique to this agency or do other agencies have the same needs? If so, can a cooperative development be undertaken?

Since the adequacy of the selection procedure will depend largely upon its relevance to the requirements of the job, the MS should determine whether appropriate job analyses have been performed and detailed job descriptions written. This is one of the most critical aspects of test development. If an adequate job analysis has not been performed and a detailed job description written (one which describes the essential job tasks and/or knowledges, skills, and abilities required), these will need to be done.[1] (In rare cases, the job may be adequately described in other sources and a job analysis not be required.)

For job knowledge test development purposes, the job analysis should describe the duties of the job in sufficient detail that it can be used for determining the abilities, skills, knowledges, and/or other worker characteristics required to perform the job. (In some cases of job knowledge test development, the knowledge requirements alone would be sufficient to describe; however, it is preferable to describe the whole job.) The report should also specify the method of analysis used, i.e., whether obtained through interview, observation, questionnaire, worker record, committee of job incumbents and/or supervisors, or other procedures.

Where no job analysis has been performed, the MS will need to plan for performing an analysis before test development starts. He will need to determine the method to be used and who will perform the analysis, whether it can and will be done by the Supervisory Panel (Panel), by the agency's personnel staff, or by the MS in cooperation with one or both of these. Whether or not the MS is involved in the job analysis, he will probably wish to provide direction and to see that the necessary data are obtained.

Although the Panel will eventually make the specific decisions regarding selection techniques, the MS must determine, prior to selection of the Panel, the scope of the assignment in terms of job knowledge and skill requirements. This will be needed for determining the background requirements of Panel members and hence the size of the Panel (see Section III which follows).

[1] See Appendix IV-A for suggested sources of job analysis information.

Before assembling the Panel, the Measurement Specialist must also prepare a rough plan for the entire assignment that lists the specific steps to be performed by the Panel members together with a time schedule for completing each step (see Appendix III-A). It is easy for a person who has not previously conducted a project of this nature to underestimate the time required for certain steps, especially the time intervals that must be allowed for the persons involved to attend to their other ongoing responsibilities in their regular jobs, and the time required for mail transmittal when this is necessary.
The Measurement Specialist must also allow himself adequate time for collecting or preparing the instructional materials that will be used by the Panel and the Item Writers. (See Section IV-A for a description of the preparation needed prior to the first working meeting of the Panel.)

III. Selecting the Supervisory Panel

Overview. Duties of Panel members; number needed; qualifications important for Panel membership; method of selection.

A. Function and Duties of the Supervisory Panel

The function of the Supervisory Panel is to assist the Measurement Specialist in assuring that selection procedures are relevant to performance on the job for which candidates are being selected. The Panel's duties include assisting the Measurement Specialist in the definition of selection criteria based on a job analysis, and the definition of the employee selection plan (including test plans where required). The Panel will also approve the content of final tests and other selection procedures. Under the general supervision of the Measurement Specialist, the Panel will:

1. Review the job description for present accuracy;[1]

2. If not already done in the job analysis, identify the general knowledges, skills, and abilities essential to performing the job;

3. Recommend to the Measurement Specialist whether the applicant's possession of these qualifications should be established by interview, references, a job sample test, a paper and pencil test, or by other means;

4. Help define and plan the validation strategy to be used, when requested by the Measurement Specialist;[2]

5. Work with the Measurement Specialist to determine test specifications which the Item Writing Committee will use as a guide in preparing test material;

6. Review the final test after the Item Writers have given it their approval, but before it is reproduced as an operational form, in order to provide a final check on appropriateness and accuracy of content.

[1] The Panel may be asked to participate in the job analysis, as discussed in Section II.
[2] It is essential for content validity demonstration that the validation strategies for a test be decided upon at the time the test is being planned. For content validity, the match between the job requirements and test content must be demonstrated. The present Kit does not include a section on how this should be demonstrated.

Generally, it is not recommended that item writing responsibilities be assigned to the Panel. Giving this committee too broad an assignment may detract both from the needed focus on the planning phase and the fresh view required at the final review stage. Also, since all will have important ongoing job responsibilities, there may be a limit to the amount of time they could devote to this added assignment. However, if the Panel members are able to work full-time on such an assignment for a long period of time, then such an item writing assignment could be considered.

B. Qualifications Important for Panel Membership

To serve their function effectively, it is advisable for Panel members to be familiar with the job either through having performed it or through having supervised it. Since their duties will require substantial time, insight, and careful thought, potential Panel members should be screened for available time, interest, and motivation. Consideration should be given to selecting persons who will be willing to listen to the recommendations of others but who will not defer to other members simply because of subordinate job status. Special consideration should be given to including persons in the Panel from minority groups to reduce the possibility of any biasing factors occurring in the test specifications.
C. Procedures for Selection of Panel Members

Recommendations for Panel membership usually will come first from agency heads, with those nominated asked to recommend additional candidates for membership. Those asked to make nominations should be made aware of the responsibilities and the nature of the task and the need for motivated individuals. It is well to give those nominated the opportunity to evaluate their own interest in and suitability for the assignment. (See Appendix II-B for a Sample Memorandum for Nominated Item Writers.) The Panel will ordinarily consist of a balance of first- or second-level supervisors and outstanding, fully qualified operating personnel.

D. Size of Panel

Six Panel members may be sufficient, but this can be varied to include more if required for representation of different schools of thought or different specialties in the occupation. Going beyond seven members may make control of, and groupwide participation in, discussion difficult. Cutting below four or five members may be possible for a well-defined job but may risk content bias when the job is not clearly defined. The most important consideration is that the Panel's expertise covers the job requirements.

IV. Planning the Selection Procedures and Test Development

Overview. Preparation for meeting with the Panel; Panel's role in evaluating the adequacy of the present job description; determining selection procedures relative to the job description; defining the criteria of successful job performance; identifying those elements of the criteria which the selection procedure is intended to predict; decisions to be made by the Panel relative to such areas as test content, difficulty, format, scoring, and norming; presentation of the specifications to facilitate item writing; considering minority group factors in test planning.

A. Preparation Prior to the First Working Meeting of the Panel

As a rule, the Panel members will be provided, before their first meeting, with a clear statement of their responsibilities and goals. (See Appendix II-A-2 for a sample statement.) If Panel members are located within easy commuting distance of one another, it may be efficient and effective to have an orientation meeting lasting an hour or two to discuss the overall plan and the Panel's responsibility. In this case the Measurement Specialist must be prepared with a statement of agenda to guide discussion and to insure that all critical points are covered. Handouts of explanatory and working material as listed below should be available for distribution.

If the group is scattered, the same material may be mailed, substituting for the agenda a written discussion of major points covered by the agenda. The procedures which follow assume an existing job analysis. These procedures will need modification if the Panel participates in the job analysis discussed in Section II.

Preliminary materials (or handouts at the orientation session) may include:

1. Covering letter;

2. Existing job analysis (If duties are listed with intervening space, the Panel members can use the list as a worksheet for recording the position requirements.);
3. Statement of Panel responsibilities and work to be accomplished (see Appendix II-A-2);

4. Instructions as needed on performing the assigned tasks (see Appendix II-A-3).

The Panel members should be asked to do the following:

1. Examine the job description and suggest additions, deletions, or modifications on the basis of their own experience with the job. (The Measurement Specialist should carefully review these suggestions to be confident that they do not reflect personal biases before he incorporates them into a revised description.)[1] Each Panel member should be asked to rate each job duty on an appropriate scale and submit his ratings to the Measurement Specialist two weeks before the work session. One example of such a scale is shown below:

Frequency Scale for Rating Duties

Scale Value - Description
5 - Required several times a day
4 - Required on a daily basis
3 - Required weekly
2 - Required several times a year
1 - Rarely required

It is important that these Panel ratings be made independently by Panel members.

[1] When the job description has been developed by professional job analysts following standard procedures, the Measurement Specialist should check with the job analyst responsible for this job before changing it.

2. Study the job description and identify the knowledges, skills, and other qualifications required to perform each job duty satisfactorily. These position requirements will be submitted to the Measurement Specialist together with the job duty ratings. Position requirements at this stage may be somewhat generalized: knowledge of algebra through quadratic equations; can operate a desk computer; can read at fifth grade level. (The Item Writers will later be asked to identify the requirements more specifically as a basis for item writing.)

3. Give thought to means of identifying whether the applicant's qualifications meet these position requirements, i.e., whether determined by references from previous supervisors, interview, test, etc.

4. Prepare a list of job performance criteria (identifiable behaviors and skills which characterize the top-level performer and those behaviors and skills which characterize the poor or marginal performer).

When the position requirement lists are received from the Panel members, the Measurement Specialist will compile them and prepare the worksheet which the Panel can use later to rate the importance of the requirements.

B. Panel Work Session

After a brief review of the Panel's role in the test development process, the Measurement Specialist should proceed with the following steps:

1. Rating the position requirements. Give each Panel member a copy of the compiled list of position requirements and ask him to rate the requirements according to a scale developed for this purpose. One example of such a scale is shown below:

Criticality Scale for Rating Job Requirements

Scale Value - Description
5 - Critical to performance (employee must have it to do the job).
4 - Very important, but not critical (employee can operate with less supervision if he has the knowledge or skill).
3 - Of some importance (some employees must have it if job is to get done).
2 - Not important.
1 - Never required.

It is important that these ratings be done without collaboration among the Panel members. The compiled ratings may be used later as a basis for content validity for some elements of the selection procedure, and independence of ratings is necessary to confirm agreement among raters.[1] As Panel members complete their ratings, the Measurement Specialist can tally them on his work sheet.

[1] The procedures for documenting the content validity are not included in this Kit.
When completed, the tally sheet can be photocopied for the Panel. Mean criticality ratings should be used in conjunction with the frequency ratings for identifying those requirements to be measured by the test.

2. Defining the measurement need. Formulating the purpose of any proposed selection procedure is important, both for increasing the likelihood that it will serve the intended purpose and for providing documentation should a question later be raised about the professional nature of the development process, or should the procedure be challenged in court. For the latter purpose records must be kept of all major decisions, together with evidence on which the decision was based. (See Appendix II-C for a suggested format for a Daily Log.) Sample forms are included in the Kit to facilitate keeping a full record, but these records should be supplemented, insofar as possible, with all evidence and data which served as the basis for decisions. All such records must be dated and the persons involved in the decision-making should be identifiable.

Most test development for which this Kit will be used will be directed at selecting persons qualified to handle certain jobs. Written tests generally will be used only if they provide better or more reliable information about the applicant's potential job performance than other selection techniques. The specific need for a test must be derived from a review of the job analysis and position requirements rather than determined on an a priori basis. As a rule, the Supervisory Panel will participate with the Measurement Specialist in the decision to develop or use a test.

As a group, the Panel then suggests to the Measurement Specialist the means by which each requirement can be identified. The nature of this needs to be fairly specific, such as: structured interview; informal interview; level of school completed; references from former employers; achievement test; etc. The Measurement Specialist will assist in these discussions by advising the Panel regarding various means (including testing) of identifying skills, knowledges, and other aspects of job potential. The outcome of this phase of the workshop will be a composite statement such as laid out in Appendix II-D, Form MN.

In addition, a list of criteria of successful job performance can be compiled. A copy of the composite list should be given to each member of the Panel (and subsequently to each Item Writer who has had experience with the job), asking him to rate each criterion on a scale developed for this purpose. One example of such a scale is shown below:

Scale for Rating Job Performance Criteria

Scale Value - Description
5 - Clearly indicates a high level (or low level) of an important aspect of job performance.
4 - Generally indicates a high (or low) level of performance.
3 - Somewhat related to performance on the job.
2 - Does not necessarily distinguish between the person who performs the job well and the one who performs it poorly.
1 - Unimportant or irrelevant for job performance.
N - I have never observed whether employees exhibit this characteristic.[1]

Again, it is essential that these ratings be performed independently. As a group, the Panel should then match each selection procedure with the job performance criterion or criteria it is intended to predict.

[1] Responses to N should not be counted in calculations.
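The compilation of independent ratings described above is simple enough to script if the number of requirements or raters is large. The following is a minimal sketch, not part of the Kit's prescribed procedures, of how mean frequency and criticality ratings might be tallied for each position requirement; the requirement names, the ratings themselves, and the screening value of 3.0 are illustrative assumptions only.

```python
# Illustrative tally of independent Panel ratings.
# Requirement names, ratings, and the 3.0 screening value are hypothetical.
from statistics import mean

# Each requirement maps to the list of independent ratings, one per Panel member.
frequency = {
    "Operates desk calculator": [5, 4, 5, 4, 5, 4],
    "Knowledge of algebra":     [2, 3, 2, 2, 3, 2],
}
criticality = {
    "Operates desk calculator": [4, 5, 4, 4, 5, 4],
    "Knowledge of algebra":     [3, 3, 2, 3, 3, 2],
}

for req in frequency:
    f_bar = mean(frequency[req])
    c_bar = mean(criticality[req])
    # Flag requirements whose mean ratings suggest they should be measured by the test.
    flag = "candidate for measurement" if f_bar >= 3.0 and c_bar >= 3.0 else "review further"
    print(f"{req:28s} mean frequency {f_bar:.1f}  mean criticality {c_bar:.1f}  -> {flag}")
```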
3. Planning the test. In those cases where a test is specified, the next task of the work session will be to develop the specifications for the test. Form BD in Appendix II-E was designed to assist the Panel in stating the rationale for the test. The purpose of specifying other selection procedures is to avoid undue weighting of one aspect of the position requirements through duplication across selection procedures. The selection ratio provides a basis for determining test difficulty. If collected, ethnic group data may help focus attention on any problems of testing fairly across cultures as well as provide a basis for specifying pretest requirements to insure the representativeness of the sample.

Form TS in Appendix II-F was designed to serve as a summary sheet for recording Panel decisions on test specifications. This information will guide the Item Writers in preparing the actual test. The MS can assist the Panel in establishing test specifications. Following are some considerations for the MS to discuss with the Panel relative to some of the decisions to be made in planning the test development project:

a. Content. Content coverage will be based on the job analysis and position requirements as summarized in the Statement of Measurement Needs (Appendix II-D, Form MN). Using the importance ratings agreed upon by the Panel, a tentative weight should be assigned to each major content area. The MS may decide to break out major areas into sub-areas and weight each sub-area appropriately to the total weight of the major area.

b. Number of parts. The reason for separating the test into several parts may be one of the following:

(1) Analysis of test performance separately by content area is desired.

(2) Different item types require different instructions or mental set.

(3) To increase the likelihood that applicants are able to respond to all items they are capable of answering correctly.

(4) To provide a rest period in a long test.[1]

In general, unless one of these reasons or another equally strong reason exists, it is desirable to have all of the test items given at one time under one time limit.

[1] For security reasons, rest periods are not generally recommended for a test requiring four hours or less.

c. Item types. The assumption is made here that most tests will be of the multiple-choice type. Four or five choices are standard. Since some examinees will not have had experience taking tests with different numbers of choices in the same instrument, it is best to use the same number for all items. The number of choices used will depend on the possible effect of guessing relative to the selection range, the difficulty of finding reasonable distracters, and the probable loss of candidate time in reading ineffective distracters (see Appendix V-B).

Because of the difficulty of establishing uniform and defensible scoring guidelines, essay tests are not recommended unless fewer than 15 to 20 candidates will be tested. An extensive discussion regarding the development of essay tests will be found in "Essay Examinations" (Coffman, 1971).

d. Time. The time allowance may be restricted by administrative considerations or the feasibility of carrying out all planned selection procedures. However, sufficient length is needed to provide reliable measurement. (See the following discussion under "Number of items.") If nearly all examinees have adequate time to respond to all items, there is probably little to be gained by timing parts separately, unless part-scores are to be used.
Items covering different content areas may be interspersed and arranged in order of difficulty to increase the likelihood of candidates reaching all items they are capable of answering. If, however, there is concern that some candidates may waste time on items about which they have little knowledge, or if scores on different content areas are to be studied independently, it may be desirable to group items by area and time each area separately.

e. Number of items. This is a very technical area requiring great care. The actual relative weight carried by the items in each content area depends primarily on the standard deviation of this set of items relative to other sets;[1] the standard deviation, in turn, tends to vary with the number of items when item difficulty distributions and item discrimination indices are similar from content area to content area. Thus, the number of items per content area should generally reflect the desired weighting. It is preferable to increase the number of items to achieve the desired relative weighting among content areas rather than assign a different weight to items in the various areas (see the following discussion). It may not be possible for the Panel to specify the total number of items exactly at this time, especially where there has been no previous experience with writing items in the given area. However, much can be done at the item writing stage to increase the efficient use of test time (see Appendix IV-B). It is suggested, therefore, that the number of items be specified with the following considerations in mind:

(1) Error of measurement. The ratio of the measurement error to the number of items decreases with the number of items, although the actual size of error increases. In a non-speed test, the standard error of measurement (SEM)[2] is approximately .43 √(number of items) (Cureton et al., 1973, Lord, 1959, Swineford, 1959), or about 1.9 for 20 items, 2.5 for 35 items, 3.0 for 50 items, and 4.3 for 100 items; however, note that 25 items given double weight (i.e., 50 points) will yield an effective standard error of measurement of 2 x .43 √25 (or 4.3) as compared with a standard error of measurement of 3.0 for 50 items given single weight. It is suggested that where the test is to carry considerable weight, a goal should be set of 75 to 100 items with a planned standard deviation (SD) at least three times the standard error of measurement. This ratio is easier to achieve with larger numbers of items. The square of the inverse of this ratio, SEM²/SD², is equal to 1 - reliability. A ratio of three would yield an estimated reliability of .89, which is generally considered to be acceptable. If the test is a speed test, it is better to compute the standard error of measurement from the reliability rather than from the approximation shown here. (A worked check of this arithmetic follows this list.)

(2) Chance scores. A score on which a decision is being made should be clearly above the chance range (see Appendix V-B).

(3) Power vs. speed. Examinees should be given sufficient time per item to consider each question. Unless ability to work under heavy time pressure is an important requirement of the job, pressure for speed should not enter excessively into selection considerations.

[1] See Guilford and Fruchter, 1973, pp. 385-386, and Guilford, 1954, p. 405 and pp. 443-447, for a discussion of effective weighting of part content in the total test score. Also see Thorndike, 1971, p. 3, and Tinkelman, 1971, pp. 68-70.
[2] The standard error of measurement indicates the amount of variation one can expect in an individual's score if he were to take other parallel forms. (See Glossary; also Guilford and Fruchter, 1973, p. 401, and Guion, 1965, pp. 45-46, for further explanation of the standard error of measurement.)
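The figures quoted in consideration (1) above can be checked directly from the approximation SEM = .43 √(number of items) and the relation reliability = 1 - SEM²/SD². The short sketch below simply reproduces that arithmetic; it is offered only as an illustration (it assumes a non-speed test, for which the approximation is stated to hold) and is not part of the Kit's prescribed procedures.

```python
# Worked check of the SEM approximation and the reliability relation cited above.
from math import sqrt

def sem(n_items, weight=1.0):
    """Approximate standard error of measurement for a non-speed test:
    weight x 0.43 x square root of the number of items."""
    return weight * 0.43 * sqrt(n_items)

for n in (20, 35, 50, 100):
    print(f"{n:3d} items: SEM about {sem(n):.1f}")

# 25 items given double weight (50 points) versus 50 single-weight items:
print(f"25 items, double weight: SEM about {sem(25, weight=2):.1f} "
      f"(vs. {sem(50):.1f} for 50 single-weight items)")

# If the planned SD is three times the SEM, the estimated reliability is
# 1 - SEM^2/SD^2 = 1 - 1/9, or about .89.
ratio = 3
print(f"SD/SEM = {ratio}: estimated reliability = {1 - 1 / ratio**2:.2f}")
```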
f. Scoring formula. If every examinee responds to every item, applying a correction for guessing does not change the ranking of candidates. If, however, one anticipates many omissions, use of the scoring formula (conventionally, rights minus a fraction of wrongs: R - W/(k - 1) for k choices per item) may be justified, especially if there are only three or four choices per item. It should be recognized, however, that when a scoring formula is used, those applicants who base their answers on inadequate knowledge will be handicapped relative to those who guess at random. For a properly constructed item, the former very often finds his "answer" among the misleads and then stands no chance of a correct answer, whereas the guesser stands some chance of obtaining the correct answer (one-fourth chance in a four-choice item).[1]

[1] See also the discussion regarding guessing in Thorndike, 1971, pp. 59-61.

g. Weighting of parts. As indicated earlier, it is desirable to increase the number of items rather than give double or triple weight to one set of items. One or more of the following steps may be taken to accomplish this:

(1) Increase the time to permit more items.

(2) Use item types which require less time per item.

(3) Examine the possibility of breaking some items into two or more parts (see Appendix IV-B).

h. Setting selection standards. There are many factors relevant to the decision of whether to rank candidates on the basis of test scores (with those receiving higher scores given job preference) or to set a minimum cut-point that the candidate must meet to be given further consideration. These include psychometric as well as legal considerations. This Kit will not provide the user guidance on this matter.[1]

[1] See Cronbach, 1970, p. 421ff., Guion, 1965, pp. 486-493, and the APA Standards for Educational and Psychological Tests, 1974, paragraph I4, for further information on this topic.

i. Norms.[2] If openings for the job in question occur very rarely such that this test is likely to be used only once, or if the nature of the content is expected to change before a new opening occurs, there is little basis for establishing normative data. If the opening is similar to openings in other parts of the country and recruiting is done on a nationwide basis, there may be some benefit in relating local norms to national norms. In this case items from a nationally normed test should be incorporated in the local test to facilitate equating. The primary benefit in equating would be the comparison of candidate capability with that elsewhere. Items from another test must not be used without clearance from the publisher of that test.

In the majority of cases, however, it is likely that where a decision is made to build a test locally, it will be because the job has aspects unique to the locality, in which case national norms will have little meaning. It may still be desirable to develop local norms as a means of comparing candidate groups from one year to the next.

[2] The present Kit does not include detailed instructions on how to set up norms and on equating test items and parts; however, a brief discussion is presented in Appendix IV-L.

Percentile norms have the advantage of being readily computed and easily understood.[1] They have the disadvantage that differences in the middle of the range appear to have more significance than they really do.

[1] See Ebel, 1972, p. 285ff., or Guilford and Fruchter, 1973, p. 36ff., for computing procedures.
A given raw score difference may be equivalent to 12 percentile points in the middle of the range and only two percentile points at the extremes. In a normal distribution the difference between the 40th percentile and the 50th percentile represents only .25 of a standard deviation, whereas the difference between the 85th percentile and the 95th percentile represents a difference of .6 of a standard deviation. (A unit equal to one standard deviation is generally considered to represent the same ability unit at different parts of the score range.) For this reason norms are often stated on a standard score scale for which there is an established mean and standard deviation.

Various standard score scales may be used, and these are described in most texts on testing. A useful scale is one in which the mean is 50 and the standard deviation is ten. If the distribution is fairly normal, one simply sets the obtained raw mean equal to 50 and the standard deviation equal to ten and develops a scale of equivalent scores (Guilford & Fruchter, 1973, pp. 463-467). Future operational tests may be put on the same scale as this test by including in the future test a set of items from the current test. When this is used in merit systems, this standard scale can be linearly converted to a rating scale with 70 at the cut-point and 100 at the practical top score.

If the population is considered to be normally distributed but test scores are not, the raw scores may be put on the standard score scale through setting the median equal to the scale mean and using a table of normal curve values to find the number of standard deviation units corresponding to each percentile (Angoff, 1971, pp. 515-516). This procedure should be used with care.

If the test being developed is a new form of an operational test, a scale may already be established. If so, the new test may be put on this scale such that scores on the new test may be compared with those on the earlier comparable test or other tests in a battery. If this is to be done, it is important to make the decision regarding method at the time test specifications are established. The various methods are as follows:

(1) Common items. This involves using a set of not less than 20 items or 20% of the total number of items, whichever is greater, of each form as a set of common equating items (see Appendix IV-L for statistical procedures). The items should be representative of the content of both tests. They may be set up as a separate section or spaced throughout the test. (Avoid placing equating items near the end of the test.)

(2) Common population. In this procedure a common group is asked to take both test forms. Since this is less commonly used, procedures are not described in this Kit (see Angoff, 1971, pp. 573-576).

(3) Equi-percentile. This is used only if the same population takes both tests, or the populations taking the two tests can be considered equivalent. In this case a score on one test is set equal to the score on the other test that has the same percentile equivalent.

A comprehensive discussion of score scales and norming is provided by Angoff (1971).
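The linear conversions described above, first to a mean-50, standard-deviation-10 scale and then to a rating scale with 70 at the cut-point and 100 at the practical top score, involve nothing more than the transformations sketched below. This is a minimal illustration rather than a prescribed routine; the raw-score statistics, the raw cut-point of 55, and the practical top score of 88 are assumed values chosen only for the example.

```python
# Illustrative linear score conversions for the standard score scale discussed above.
# The raw-score mean and SD, cut-point, and practical top score are assumed values.
def to_standard(raw, raw_mean, raw_sd, scale_mean=50.0, scale_sd=10.0):
    """Linear conversion of a raw score to a scale with a chosen mean and SD."""
    return scale_mean + scale_sd * (raw - raw_mean) / raw_sd

def to_rating(standard, cut_standard, top_standard, cut_rating=70.0, top_rating=100.0):
    """Linear conversion so that the cut-point maps to 70 and the top score to 100."""
    slope = (top_rating - cut_rating) / (top_standard - cut_standard)
    return cut_rating + slope * (standard - cut_standard)

raw_mean, raw_sd = 62.0, 9.0                   # obtained from the candidate sample (assumed)
cut_std = to_standard(55, raw_mean, raw_sd)    # assumed raw cut-point of 55
top_std = to_standard(88, raw_mean, raw_sd)    # assumed practical top raw score of 88

for raw in (55, 62, 75, 88):
    s = to_standard(raw, raw_mean, raw_sd)
    print(f"raw {raw:3d} -> standard score {s:5.1f} -> rating {to_rating(s, cut_std, top_std):5.1f}")
```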
j. Desired statistical characteristics.[1] In general, a mean of 65% to 75% of the number of items for a four-choice test, or 60% to 70% for a five-choice test, is probably reasonable (see discussion in Appendix V-A). A reasonable goal for the standard deviation would be at least three times the standard error of measurement as discussed above under "Number of Items" (Section IV-B-3e). There is no absolute standard for setting cut-points. However, it is generally assumed that the best procedure is to establish cut-points on the basis of validity information or known labor market conditions relative to the job to be filled.[2]

[1] The Kit does not include a section on the statistical analysis of the test.
[2] See the APA Standards for Educational and Psychological Tests, 1974, for a discussion of this.

k. Percent completion desired. For job knowledge tests the aim should be to have every examinee able to read and react to every item; however, regardless of the instructions with respect to guessing, some examinees will omit items about which they have little or no knowledge. Accordingly, test results may show a number of no-responses even when all have had an opportunity to read every item. Furthermore, some examinees will stop trying and will quit before reading all items, especially in a lengthy test. A 75- to 100-item test may be considered relatively non-speeded if 80% or more of the examinees record answers to the last item.

l. Chance mean and standard deviation. These statistics should be calculated to determine whether the anticipated cut-point score could be achieved through chance by someone who has substantially less knowledge than the cut-point normally indicates. Few examinees guess at all items, though occasionally one will if he has little background in the content area; for example, if a candidate taking a five-choice multiple-choice test knows 20 out of 100 item answers, he will of course be more likely to achieve the cut-point by guessing at the remaining 80 than if he guessed at all 100 (see the discussion in Appendix V-B).
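The chance mean and standard deviation referred to above follow from the usual binomial model of blind guessing: for r items guessed at random on a k-choice test, the expected number correct is r/k and the standard deviation is the square root of r(1/k)(1 - 1/k). The sketch below works through the example given in the text (a candidate who knows 20 of 100 five-choice items); it is an illustration only, and the assumption of purely random guessing on the unknown items is, as the text notes, rarely met exactly in practice.

```python
# Chance mean and standard deviation for blind guessing on the items not known.
# Example from the text: 20 of 100 five-choice items known, 80 guessed at random.
from math import sqrt

def chance_stats(items_guessed, n_choices):
    p = 1.0 / n_choices
    mean = items_guessed * p
    sd = sqrt(items_guessed * p * (1 - p))
    return mean, sd

known, total, choices = 20, 100, 5
m, s = chance_stats(total - known, choices)
print(f"Expected score: {known} known + {m:.0f} by chance = {known + m:.0f}")
print(f"Chance standard deviation: {s:.1f}")
# A score of roughly the chance mean plus two chance SDs can arise from guessing alone,
# so the anticipated cut-point should lie clearly above this range.
print(f"Approximate top of the chance range: {known + m + 2 * s:.0f}")
```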
m. Federal regulations regarding test development. The preceding procedures may seem unnecessarily detailed; however, the usefulness of the results is in large part a function of the care taken in planning the test and developing it according to specifications. Most of the technical background has been developed and has been available for many years. The American Psychological Association's Standards for Educational and Psychological Tests (1974) has been periodically updated to inform test builders and users of the requirements of good testing practice.

More recently the Federal agencies charged with enforcing equal employment opportunities, concerned that minority group potential may not have been adequately measured by many of the tests used by employers, have prepared guidelines which in effect restate as regulations many of the principles regarding sound test development that have been long held in the psychological profession. Copies of these regulations are included in Appendix I. The Equal Employment Opportunity Coordinating Council, which consists of the U.S. Equal Employment Opportunity Commission, the U.S. Office of Federal Contract Compliance (Department of Labor), the U.S. Civil Service Commission, the Civil Rights Division of the Department of Justice, and the U.S. Civil Rights Commission, has been working on developing common principles or guidelines to assist employers in conforming to the law. When it is published, it will become a part of this Kit.

The procedures described here for item writing and assembling job knowledge tests are those for a professionally developed test such as are required by the Federal regulations. It is expected, of course, that the person supervising the test development will have a background considerably beyond the contents of this Kit. Professional development of the test, though recommended, is not sufficient under the law. One must be able to demonstrate that the test is valid and does not discriminate unfairly (for non-job-related reasons) against any group of applicants. The need for validation is equally important for other selection procedures and is also covered by the regulations. Another kit is planned to cover these requirements.

V. Selecting the Item Writers

Overview. Item Writer responsibilities; qualifications important for successful item writing; selection of Item Writers; number of Item Writers needed; pre-training assignments.

A. Responsibilities of Item Writers

The Item Writers will review the list of general position requirements the Panel specified in Form MN and will identify the specific knowledges and skills required to perform the job. They will then write test items to measure candidates' possession of the knowledges and skills and will review items written by others. They will also revise the items as necessary on the basis of the pretest item analysis and will either review the Measurement Specialist's tentative test assembly or carry out the assembly themselves under the Measurement Specialist's supervision.

B. Writer Qualifications

Good item writing involves a large element of art. It does not appear to be a skill that everyone can develop. To enhance the probability of success, the Item Writer should have, as a minimum, the following qualifications:

1. Experience in the performance or supervision of the job for which the test is being designed.

2. Strong knowledge of relevant content areas. The knowledge of the Item Writer must be substantially beyond that required to answer the test questions in his specialty area.

3. Interest in the project, with motivation toward the item writing task.

4. Originality in approach to assignments.

5. Ability to work in a group and openness to suggestions regarding his work.

6. Willingness to persevere with a task and try a variety of approaches.
It provides the individual with the opportunity to try item writing as a basis for evaluating his interest in accepting the assignment. The exact wording of the memorandum will vary depending on the anticipated number of nominees showing an interest. C. Number of Items and Writers Needed The workshop should be expected to produce at least double thenumber of items needed, even though steps are taken to reduce duplicatesand insure wide coverage. Many items will fail to meet the difficulty and discriminating power requirements set by the Supervisory Panel. The number of new items required may be reduced if there arereusable items in the files or if an item file can be secured fromother sources.l If there are several times as many items in the files as are needed.for one form, the problem of candidate familiarity withthe items from having taken earlier forms is probably not great. Ifthere has been a loss in security with a given form of the test, theseitems should be marked and not reused within a period of several years, or be reused only in a changed form. Often the test development project is designed to update oldertests. For example, under the Merit System Standards program of the II litems from commercial, copyrighted sources may not be used withoutII the author's or publisher's written permission. The U.S. Civil Service'Commission will provide State and local governments with test bookle1ts forsome position classes under the Merit System Standards program. In/anycase the user would need to show the relevance of these materials throughcontent or criterion-related validation. 19 U.S. Civil Service Commission, State and local governments might be asked to cooperate in such a project. Where there is only one previous form, parallel items can often be written; however, the Measurement Specialist should be careful not. to use too many of these, because an advantage could be given to a competitor who has taken the earl~er form and is able to recall some of the items for his specific study prior to taking the new test. Where several earlier forms are available, items from these forms may be parallelled as a means o~ saving time in writing new items to test the same concepts. As a general rule, the Measurement Specialist will need to decide early how many items will be from the file or will be parallels of such items. (Regardless of the number of items in the files, a minimum of 15% of the items might be required to be entirely new to keep the test abreast of new developments in the field.) Next, the Measurement Specialist must decide what fraction of the items from the files will be used in their present form. Then, in deciding the time required for the total item production phase, the Measurement Specialist should allow possibly half as·much time for-·parallelling an old item as for writing a new item, and perhaps one-fourth as much time for selecting old items from the files. Parallelled items and items from the files should go through the same review process as new items to make certain they are up-to-date in terms of content relevant to the job as presently described, and free of psychometric and subject-matter flaws. Training of the Item Writers may be expected to require a week to reach the point of productivity. (Training will continue beyond this initial week, but Writers should be producing usable items at a reasonable rate by the end of this time.) 
Items written in the first week will be usable, but they will ordinarily not be considered in estimating the time required to produce the necessary number. After the first week, the Measurement Specialist can probably count on an average of about four items per person per day. This may seem like an excessive amount of time per item, but many starts do not yield items, and any effort to test understanding, application, or other skills requires much more thought and development than does a test of simple knowledge. (These items are rough items which will need editing and review before actual inclusion in a test. The Measurement Specialist should not expect that final polished items will be turned out routinely at this rate.) The Measurement Specialist might expect to be able to accomplish the task with about six good writers, if they are versatile in the content areas designated. More may be needed to provide coverage of specialties. A few writers may be expected to fall by the way. The Measurement Specialist can either start with more than the required number or bring in new writers as others drop out. If extras are brought in initially, it may be possible to complete the task in less 20 than the allotted time should all prove to be successful writers~ but it can result in more training cost than otherwise required. If extra specialists are needed to handle some specialized content, they can be brought in for the time needed. D. Steps Prior to Item Writing Workshop When the individual is notified of his appointment as an ItemWriter, he should be asked to accept or reject the invitation promptly(preferably within a working day, so that someone else may be appointed if he rejects).l When he accepts, a summary of relevant informationshould be sent to him (time, place, schedule, etc.). He should alsobe sent a copy of the list of job duties and position requirements andasked to rate them, independently, on the same scale as the Panel ratedthem. These ratings must be returned prior to the first meeting sothat the Measurement Specialist can summarize them and compare themwith those provided by the Panel. This is done simply as a check onthe "universality" of the Panel's ratings and to provide further backupdata on content validity. If all ratings have been made independently,it is unlikely that there will be significant differences between thegroups. If there is a substantial difference between the two sets ofratings, the Measurement Specialist may wish to discuss the ratingsagain with the Panel. If the Measurement Specialist has not done considerable itemwriting, he will need to familiarize himself with the suggestionsregarding the writing of items in Appendix IV-B and try writing severalitems to measure position requirement concepts for the particular jobfor which a test is being written. Even if the test is in a field inwhich he has little background, he can probably find a few elementaryconcepts which are familiar to the nonexpert and for which he will beable to devise items. VI. Training the Item Writers Overview. Training period; orientation and procedures for securityof test material; training techniques; expediting training; helping ItemWriters prepare non-routine items to test application and understanding. A. 
A. Orientation

Although the Item Writers will be generally familiar with their role by the time of the first session, they will usually welcome a general review of their planned activities, the expected daily schedule, and other routine matters, as well as a review of what will be expected of them. It may help avoid premature discouragement if they are reassured before they start that they are not expected to be skilled item writers immediately, and therefore should not become depressed by the small number of items produced during the first few days. This reassurance will also alert the Item Writers to the fact that long-range item quality is expected to be above that of the first day's output. It may be appropriate at this time to remind the Item Writers that criticism of items is a necessary part of the development process; if effective revisions are to be achieved, it is important that criticisms be both objectively given and accepted.

The Writers must be told at this point of the precautions which must be taken to assure security of the items. Their cooperation must be enlisted in disposing of scratch work in designated security bags, locking up work when the workroom is left unattended, and not taking items home with them. (Probably few will wish to do so.) The Writers should not be discouraged from jotting down ideas which occur to them between work sessions, but should be urged to observe reasonable security precautions with such materials.

Following the general orientation, the Measurement Specialist may show the group examples of items which are considered good; however, it is desirable that these items cover a variety of approaches and different formats. He must be wary of spending so much time with these that the Writers model their writing after these particular items and neglect desirable analytical approaches to specific measurement problems. Also, in order to take advantage of the initial motivation to start the assignment, it is suggested that the Item Writers have an immediate opportunity to try their skill at item writing.

B. Trial Item Writing

The Measurement Specialist will usually select a general position requirement (preferably an area of knowledge) that is highly important. As a group, the Item Writers are then asked to analyze the specific knowledge and skill principles which must be understood. They should review the principles carefully to insure that those which will be taught on the job, and which therefore do not need to be understood at the time of entry, are not included in the test.

The Measurement Specialist can then take one of the job-relevant principles and ask, "What can the applicant do to demonstrate that he knows this principle?" He should help the group convert one or more responses into question form. It is best to wait for their suggestions, asking leading questions if necessary. The Measurement Specialist may need to write most of the first item. The group approach is continued until the Writers are offering item ideas. He should try other questions as needed to stimulate thinking (see Appendix IV-B).

Then, with each Writer in a comfortable location at a table, he should next ask them to try writing an item to measure another one of these principles. It is advisable that they work alone initially until they have a rough item.
As two Writers finish, they can exchange their items for review, then discuss and revise them. (While some Item Writers are concentrating on their individual assignments, this discussion is best done at a remote table to avoid distracting them.) The Measurement Specialist may need to assist some individuals in getting started. He should draw the individual (or a pair of individuals, if two are having difficulty) aside for individual training. He may wish to go through the initial training step again, asking each to demonstrate his own knowledge of the principle, then help him convert this to item form; or he may try a different approach in getting them started.

If two Writers finish a joint review and revisions ahead of the others, they can assist other Writers or try writing an item to measure a different principle. (The Measurement Specialist will probably wish to examine the "finished items" before allowing the Writers to assist others.) When all items have been written to the satisfaction of two Writers, the Writers can take a break while the Measurement Specialist reviews the items and prepares for group discussion. (The Measurement Specialist should be prepared to suggest desirable revisions if satisfactory modifications are not proposed by the Writers.) When the group reassembles, the items should be written on the board. (An alternative, where time and facilities permit, is to type them in multiple copy or photocopy them so that each Writer may have a copy.)

The Measurement Specialist should give each Writer an opportunity to review the items, and then consider the items with the Committee, one at a time, discussing the merits and shortcomings of each and methods of improving them. The Measurement Specialist should listen to the group's suggestions for improvement before offering his own. For this first set of items, a particular effort should be made to produce usable revisions in order to demonstrate the kind of improvements which can be made and to avoid a tendency to discard later items too hastily. The Measurement Specialist may wish to tell the Writers that when they are in the production phase later on, it is sometimes more efficient and economical to discard an item and begin again. Too hasty rejection, though, may eliminate an item which tests an important point but which is simply more difficult to measure.

After the first set of items is in acceptable form (some may be discarded), the process should be repeated with a new set to test another knowledge requirement. Item Writers can now begin following the procedures outlined in Appendix IV-F for submitting items on half-sheets and in the prescribed format. When some skill has been developed in writing items to test knowledge, the Measurement Specialist should suggest that it is time to go to a more difficult item category: one measuring ability to apply knowledge. The Writers can be introduced gradually to the Item Writing Guide (Appendix IV-B) at this time through drawing on the examples given there. The process used with the knowledge items should be repeated. The rules for writing multiple-choice items (Appendix IV-D) should not be introduced until after the group has had some successful experience in writing satisfactory items. Premature introduction of these rules could be discouraging. After some experience, many of them will already be familiar.
After the Item Writers feel they understand the general approach to item writing well enough to start working independently, the task of spelling out the specific position requirements as item objectives can be resumed. It may be more efficient to assign this stating of item objectives to pairs (or trios) of Writers. If groups are assigned different parts of the position requirement outline, the resulting list of objectives produced by one team should be reviewed by the others before becoming a list from which to work. (If item ideas occur to the Writers as they are defining the item objectives, the Writers should be encouraged to jot them down sufficiently to be able to come back to them later in actual item writing.) Where some individuals have been assigned to the Item Writing Committee because of special competences, they will normally be assigned relevant parts of the outline for stating item objectives; however, it is desirable to postpone item writing in their specialty until they have demonstrated skill in writing items which the other Writers are better able to review.

When the objectives have been defined and agreed upon, the Measurement Specialist should assign different parts of the outline to different Writers or pairs of Writers. The Measurement Specialist will probably find that some can work better in pairs while others work better as individuals. Although this is the beginning of the item production phase, it may also be considered a continuation of training, since Writers will continue to learn from review. As a few items are written, they should be reviewed by someone not involved in the writing. If the item survives the initial review, it should be submitted to others who have not yet seen it. Those items which are subject to disagreement or to difficulties in polishing may be made the subject of group discussion as part of training. These group discussion sessions can also be used to point out the kinds of difficulties which members are encountering and to discuss suggestions for overcoming these difficulties. Items which appear to be especially good should likewise be identified in the group sessions. As item writing improves, it will probably be more efficient to accumulate a number of items before reviewing them.

VII. Item Writing and Reviewing

Overview. Selected references; making efficient use of workshop time; estimating item difficulty; modifying items to make them more effective; purpose of and procedures for item review; handling the unsuccessful Item Writer.

A. Selected Background References on Item Writing1

Much has been written on suggestions for writing effective test questions, especially multiple-choice questions. Some of these have been excerpted or summarized in the Item Writing Guide (Appendix IV-B). The following references contain suggestions and sample items illustrating "good" and "poor" item writing: Adkins, 1974, chap. 2; Ebel, 1972, chaps. 5 and 8; Educational Testing Service, 1973a; and Wesman, 1971. In addition, Bloom, Hastings, and Madaus' Handbook on Formative and Summative Evaluation of Student Learning (1971) is a helpful reference; it discusses approaches to item writing by subject field, and considers ways of testing comprehension and higher levels of application and analysis. This publication includes a condensation of Bloom's Taxonomy of Educational Objectives (1956), which may also be helpful to have as a resource.
The suggested set of item writing rules (Appendix IV-D) consists of those rules which seem most important and relevant to problems most likely to occur in situations for which this Kit is designed. The rules given here and in the reference texts should not be considered inviolable if the principle underlying the rule is understood and there is a good reason for violating it. Other rules may be suggested by the Writers as they gain experience, and these may be added to the list. If many suggestions are offered, they can be consolidated or combined with rules already in the list to avoid excessive length.

1 See Bibliography for complete information on references cited.

B. Efficient Use of Workshop Time

Although learning to write items requires time, the efficiency with which items can be produced can be improved by adherence to certain principles of operation such as the following:

1. Establish an assignment and checklist procedure to avoid unwanted duplication. It may be helpful to set up a chart showing item writing goals and completions to provide a ready reference to current status. Periodically check to see that test specifications requirements are being met.

2. Set up a routine which will process items systematically through the review phase, once the training phase is completed and items are accumulated for review. Items with minor flaws can be routed through three reviewers before returning them to the author for revision. Items with major flaws should be returned to the author's "in-folder" before further review.

3. Keep a record of individual productivity. If an individual is turning out an unusually small number of acceptable items, investigate the reasons and attempt a remedy.

4. Use brainstorming sessions to facilitate producing items to measure principles or applications which are causing particular difficulty.

5. Incubate those ideas which are very difficult. Rather than pursue tenaciously a difficult measurement task, it may help to leave it and come back to it later.

6. When difficulties occur, try different approaches rather than sticking with the initial idea.

7. Allow individuals to work alone, in pairs, or in groups, depending on their work style and efficiency. Before assigning writing on an extended group basis, however, make certain that productivity, in terms of number and/or quality of items, is better than that obtained by the same individuals working alone.

8. Avoid personality conflicts in assigning team membership.

9. If leadership in the workshop is being shared by the Measurement Specialist and an agency staff person, determine in advance the roles and decision-making authority of each.

10. Follow the order of review recommended in D below, proceeding to the next review stage only when the item meets approval at the current stage.

C. Achieving Appropriate Item Difficulty

Item difficulty should center at a percent pass corresponding to the desired mean as a percent of the total number of items; i.e., if the goal is a mean of 70% of the number of items, item difficulty (percent pass) should center at .70. (Without previous experience, the Item Writers will not be able to estimate item difficulty well.) If past experience confirms that actual item difficulty is greater than the planned item difficulty, then Item Writers should aim at a somewhat higher percent pass, perhaps .80 (see Appendix V-A).
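The arithmetic behind this guideline can be made explicit. The following sketch is illustrative only and is not part of the Kit; the function names, the adjustment figure, and the sample p-values are the writer's own assumptions.

```python
# A minimal sketch of the difficulty-target arithmetic described above.
# Numbers are illustrative, not taken from the Kit.

def target_percent_pass(desired_mean_pct, expected_shortfall=0.0):
    """Percent pass at which writers should aim their items.

    desired_mean_pct   -- desired test mean as a proportion of the number
                          of items (e.g., 0.70 for "70% of the items")
    expected_shortfall -- how far, from past experience, observed percent
                          pass tends to fall below the writers' target
    """
    return min(desired_mean_pct + expected_shortfall, 1.0)

def expected_mean(p_values):
    """Expected number-right mean of a test built from items with these
    percent-pass values (rights-only scoring): the sum of the p's."""
    return sum(p_values)

if __name__ == "__main__":
    # Aiming for a mean of 70% of the items when past items have run
    # about .10 harder than intended -> center writing near .80.
    print(round(target_percent_pass(0.70, expected_shortfall=0.10), 2))  # 0.8
    # Ten items centered near .70 give an expected mean of about 7 right.
    print(round(expected_mean([0.65, 0.70, 0.75, 0.70, 0.68,
                               0.72, 0.70, 0.66, 0.74, 0.70]), 2))       # 7.0
```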
When a person is writing items in his own area of specialization, there is a tendency to underestimate the difficulty of a concept or principle he is testing; in an effort to get plausible misleads he may make the item even more difficult. For this reason an estimate of item difficulty made by the reviewers will probably be more accurate than one made by the author of the item.

Since item difficulty can usually be decreased (or increased) by revision, the Item Writer need not be preoccupied with difficulty when writing the item. (See suggestions regarding revision of item difficulty in E below.) He should focus on achieving a valid measure of the concept he is attempting to measure. Suggestions are made in Appendix IV-B which may help the Item Writer achieve the desired difficulty. The Measurement Specialist can assist by observing item writing in progress and making a casual review of items which appear lengthy, checking whether there seems to be unnecessary wordiness or the Writer is devising too complicated a problem situation. In the latter case the Measurement Specialist can question whether the Item Writer is attempting to test more than one concept or process and, if so, whether these can be tested with separate items.

D. Item Review

A list of review suggestions which may be copied for Item Writers is included in Appendix IV-G. These may be distributed when the Writers begin to work individually or in teams. Prior to that time the Measurement Specialist should make sure that all items written in the training phase meet these requirements.

The major areas on which the Measurement Specialist and Item Writers should concentrate their review are shown below. There is an advantage in considering each of these four areas separately and in this order. If there is a need to revise an item on the basis of one stage of the review, the changes can be agreed upon before going further. The changes at each stage could well affect those reviews which follow. For example, a criticism which appears to affect only one option may lead to changes in other parts of the item, and time spent reviewing the item for grammar and punctuation may be wasted.

1. Content. The reviewer needs to check that the content is relevant to the job, in terms of both the concept and the level of the concept, and that it is content which is required before entry to employment. It should not measure job knowledge that is normally expected to be learned on the job. The reviewer should make certain that there is one and only one correct answer to each question. Since all Item Writers presumably have working or supervisory experience with the job, they should as a general rule be able to answer any item required for selection. If more than one writer has trouble with a given item, it is well to examine it carefully for appropriateness. However, a level of knowledge beyond that required for job performance may be required to spot errors in assumptions or points which are open to disagreement among experts. Therefore, items that cover specialized content and have been written by specialist members of the committee should be referred for review to at least one other specialist. If no one else on the committee has this specialist background, these items must be submitted to specialists outside the committee. The Measurement Specialist should submit the items in a reasonably finished form and allow the reviewer adequate time for his review.1

1 Proper security precautions should be taken by the Measurement Specialist to insure that test materials are not compromised in mailing them to these specialists.

2. Difficulty level. Each reviewer should be asked to make two estimates of difficulty:
a. The percentage of those on the job(s) for which the test is designed that could answer the question;

b. The percentage of the applicant population that could answer it. (If the subject is taught in high school, this estimate can be directed at a typical high school population taking the course.)

It may help the reviewer to think of specific individuals, one on-job employee and one typical applicant, and evaluate the likelihood that such individuals could answer the item. What aspects of the stem or options might cause the most difficulty? Has the item been made artificially difficult, i.e., can a person understand the principle being tested and still miss the item?

Estimates of difficulty made by reviewers may range widely. Unless there is some reason to doubt the estimates of some reviewers, the average estimate may be taken as a basis for judging the suitability of item difficulty for the pretest. (Items may be included in the pretest even if some estimates for the applicant population are as low as .30 or as high as .90.)

3. Measurement principles. The reviewers will need to review the item relative to the list of Item Writing Rules. This list appears to be long, but with practice there should be less need to refer to it with each item reviewed. This review should be performed by both the Item Writers and the Measurement Specialist.

4. Editorial review. This review will ordinarily be done only after the item is considered in final pretest form. It should be done by someone with competence in English usage and expression. The Measurement Specialist or one of the Item Writers may have such a background and be qualified. If not, the Measurement Specialist should enlist the services of another person who can do this review. There are some advantages in having the item reviewed for this purpose by someone not familiar with the area being tested. Such a person can check not only the clarity, grammar, expression, spelling, and punctuation, but can determine whether the item is answerable by a person without knowledge of the field. The catching of errors is facilitated if the editor reads through a set of items once for each of the following: clarity of expression, grammar, spelling, and punctuation.

E. Item Revision

When an author has accumulated several items which have been returned by the reviewers with review notes, he should interrupt his writing at a convenient time and attempt revision. Some revisions can be made easily and the items forwarded to those who have not yet reviewed them. The item authors will wish to refer other items back to the person(s) who made the comments to determine whether the author's modifications meet the objections. Still other items may be held for sessions where two or more Item Writers endeavor to work out a mutually satisfactory revision on a number of items. Frequent interruptions should be avoided.

If an item appears too complex and time consuming, it may help to write out the steps required and see whether two or three items can be written to test these steps or the underlying principles separately.

If item difficulties appear to be running too high (i.e., low percent pass estimates), the following revision techniques may help:

1. Examine the stem for unnecessarily difficult wording, and simplify it.
2. Identify those misleads which are likely to be most attractive to better applicants and substitute misleads which will be less attractive to this group.

3. Try splitting the item into two items if it is measuring two concepts or processes.

If the item is too easy, substitute misleads which have an element of truth but are defensibly wrong.

F. Handling the Low-Productivity Writer

Not everyone becomes a successful item writer. Despite efforts at preselection, some individuals may decide after trying it for two or three days that they are not interested in continuing, especially if they have other employment responsibilities. If the job can be handled either without them or with replacements, they should be allowed to drop out. Some, for a variety of reasons, may have difficulty producing their share of usable items. An effort should be made to determine whether lack of productivity results from a lack of self-confidence, from a too-rigid approach to an assignment requiring flexibility, or from other reasons. The following remedial steps can be taken by the Measurement Specialist:

1. If the difficulty is due to lack of self-confidence, discuss the problem with the Writer, encouraging him to relax and focus on item writing rather than on his skills. It is possible that he is concerned because other Item Writers are producing more than he is.

2. Try a two- or three-person brainstorming session where a number of item ideas that the Writer can work on are jotted down.

3. If the Writer has other skills, for example, good verbal expression, team him with an Item Writer who has more ideas than he can develop.

4. Try him on developing item ideas which others have set aside. Some persons are fluent in generating item ideas but may have less patience with putting them in operational form. Others who have difficulty in generating ideas may be skillful in reworking rather embryonic ideas into effective items.

If these efforts fail to produce an effective Item Writer, and if the individual cannot serve a useful role in complementing the skills of others, it may become necessary for the Measurement Specialist to talk with him about dropping out. Since the Measurement Specialist has responsibilities to the others, he must decide when the gains from working with weak Item Writers are outweighed by the loss in time available for working with those who are productive. The nonproductive Item Writer will often welcome the opportunity to drop out. Any such departures should, if possible, be in a cordial atmosphere. Adverse reaction as to the quality or worth of the project by dissatisfied participants can be detrimental to otherwise productive programs and should be avoided.

VIII. Pretesting

Overview. Purpose of pretesting; selecting a criterion; selecting a population sample; data to be collected from the pretest; precautions to insure that item analysis results will be useful in preparing the final form; relative merits of separate pretests vs. pretesting in an operational form; predicting operational form item difficulties from pretest difficulties.

A. Purpose of Pretesting

The final test will do a more accurate job of measurement if items can be selected with pretest knowledge concerning each of them. The purposes of this pretest are to:

1. Determine the ability of the item to distinguish between applicants who have an overall grasp of the content area being tested and those who do not;

2. Determine the difficulty of the item;

3. Provide clues to ambiguities and other needed revisions;
4. Provide clues to testing time required to make the final test a test of power rather than speed.

B. Pretest Criterion

To evaluate an item's ability to differentiate candidates relative to proficiency in the content area, a valid and reliable criterion is needed. It is usually assumed for item analysis purposes that a total score on a set of items designed to measure a given content area is a reasonable measure of knowledge of that content area. Thus the total score on the test of which this item is a part is generally used as the criterion.

The usefulness of a criterion depends upon the accuracy with which it represents the qualities the test is designed to measure. General performance criteria are relevant, but they include many qualities, such as motivation, which the test is not designed to measure and hence are not useful for item analysis. For example, knowledge of mathematics can be appraised more accurately on the basis of some other measure of mathematics. It is preferable to avoid including in the same time limit items from substantially different content areas, such as economic principle items and actual accounting problems requiring extensive calculations, both because an individual may spend a disproportionate amount of time on those he finds more difficult (or easier) and because it helps for item and test analysis purposes to have separate scores for each.

C. Population Sample for Pretest

The population sample for the pretest should be comparable to the group taking the operational test with respect to education, cultural background, age, and sex. While a sample of two or three hundred is desirable, even a group of 30 is better than not pretesting at all. Samples of only 30 will not yield reliable indices, nor will they permit analysis of minority group responses to the items, but they will provide some information on difficulty and may give some clues to possible ambiguities.

The following three options are commonly used in selecting pretest populations. Options 1 and 2 provide the most representative populations.

Option 1. The pretest may be included in or with an operational test in the same content area.

Option 2. If the job is one for which there is frequent recruitment, but this content has not been included in previous testing, the test can be administered experimentally to an applicant population. This is an ideal procedure if the test is not used in selecting from among the first applicants who take it.

Option 3. The pretest can sometimes be administered to on-roll employees. This has the problem of a restricted range of ability with respect to the qualifications required to perform the job. If the group includes some with minimal ability in this particular area, the analysis will generally be useful. In deciding whether an on-roll population will provide helpful data, consider the following:

a. What selection criteria were used; i.e., is there likely to be a spread in the on-roll population relative to the ability being measured? The Measurement Specialist ordinarily should not use an on-roll population if the employees are highly selected relative to the knowledge or skill being tested.

b. What percentage of the candidate population has typically been selected?

The on-roll group could also present problems in test security; if used, it is advisable to give the pretest before the operational test is announced.

NOTE: An item analysis using operational population data should also be run on the items in the final test phase, at least the first time the test is used.
This is especially important when the pretest sample is small or not typical of the operational population. If the number taking the operational form is also small, it may be necessary to accumulate data and run the analysis on the results of more than one administration. In many practical situations it is impossible to meet desirable population sizes, and the Measurement Specialist must exercise discretion regarding means of obtaining evidence on individual items and operational test results. (If the Measurement Specialist does not have a strong background in statistics, he should seek professional advice with respect to his particular problem from someone so qualified. Such assistance may be obtained from professional consultants and Psychology Departments of large universities.)

D. Description of the Pretest

In Option 1, it is best if pretest items are added to the operational test as a separately timed section. (If the pretest items are within the same time limit, their influence on the items in the operational part is unknown.) If the applicant group is large, more than one set of pretest items may be used, with each set being taken by different examinees. In this case the different pretest forms should be distributed alternately to the examinees to reduce sampling bias. In Options 2 and 3, the pretest should have a format and content similar to that of the operational test if item statistics are to be used in estimating total test statistics. Regardless of the option, the time per item should be comparable with that in the operational form so that item statistics will reflect operational test conditions. The Measurement Specialist should try to arrange the items in order of difficulty from easy to hard. Preferably he will start the test with one or two items which are short and seem easy to the examinee. (This is less important for Option 1.) Items which could be time-consuming should be placed toward the end, but these items should be few in number and used primarily to keep early finishers busy. The Measurement Specialist should then review the assembled test as suggested in Appendix IV-K.

If the pretest is in a separate booklet, it is recommended that the Measurement Specialist use standard instructions such as those shown in Appendix IV-H. If the pretest is appended as a separate part, the instructions may be abbreviated as for part instructions. The Measurement Specialist must make certain that the pretest item numbering and option designation agree with the answer sheet; i.e., if the options are lettered A, B, C, D, and E on the answer sheet, do the same on the pretest. Where there are separately timed parts, he should continue numbering across parts or use an answer sheet which provides for parts. If the answer sheet and pretest cannot be made to match, he should either find an answer sheet which does or make a special answer sheet for the pretest items. (The pretest may have fewer items than shown on the answer sheet.)

If the pretest population consists of on-roll persons (Option 3), it will usually be necessary for them to know the purpose of testing, but the Measurement Specialist should avoid relating it to a specific upcoming recruitment or operational testing. They should be given an explanation for participating similar to that given the Item Writers, with the further explanation that results will not be given by name to anyone other than the participant. (Motivation is likely to be greater if names are entered on the answer sheet.)
The Measurement Specialist may also wish to tell examinees that if they want to know their own results (score, relative standing, etc.), they may ask for them and results will be given to them privately. This, too, may improve test-taking motivation.

E. Administration of the Pretest

Pretest administration conditions should match those of the operational tests with respect to factors which may affect test results. A room which is well-ventilated, well-lighted, and free from distractions should be used for testing. Conversations in the testing room should not be permitted while testing is under way. A clock must be visible to the participants, and the test administrator should have a stopwatch (preferably two, in case one breaks down). Instructions for test administration must be prepared in advance. The Measurement Specialist will need to plan and record exactly what will be said, keeping in mind that essentially the same instructions should be used as for the operational test.1 The test administrator should follow the instructions as given.

1 See Clemans, 1971, chap. 7.

The same procedures should be used to control access to the test as are used for the operational test:

1. Keep materials locked up before and after administration.

2. Count materials before and after administration and do not dismiss the examinees until materials are accounted for. Make certain the test booklet, answer sheet (or answer booklet), and all scratchwork are returned. (Normally, scratchwork on pretests will be done in the test booklet, since it is a good practice to use a test booklet only once. If separate paper is allowed, and if more than one sheet of scratch paper is needed, staple such sheets in booklet form.) If procedures allow those who finish early to leave early, account carefully for materials before allowing the individual to leave the room.

3. Allow only one examinee at a time to leave the room during a timed part of the test.

4. Avoid answering questions from examinees in a way which will give some examinees an advantage over others.

If it is not feasible to adhere precisely to the operational form procedures, the Measurement Specialist must determine whether a specific deviation is likely to affect test results and take responsibility for authorizing the deviation.

F. Statistics to be Collected by the Pretest Analysis

There are several statistics which can be calculated to help in the analysis of test items during the pretest phase. These are:

1. Difficulty indices. The most commonly used index in evaluating item difficulty is the percent pass (p or %+). It may also be used in estimating the mean and standard deviation of a set of items. A disadvantage is that p cannot easily be adjusted if the population on which p is based is not fairly similar to that taking the operational test. Also, p-values are not typically linear with respect to difficulty, i.e., a given difference in p means a greater difference in ability in the middle of the range than at the extremes.

Another index of difficulty is the delta. The delta is based on p but is a normalized item statistic, i.e., it is converted to a scale (mean = 13, standard deviation = 4) such that a difference of one delta unit represents about the same difference in difficulty at different points on the scale.1 Deltas may be averaged and can be used for putting item difficulties on the same scale from one population to another, where the populations differ primarily in level (see Appendix IV-I for procedures).

1 Delta is based on the assumption that ability is normally distributed. See Henrysson, 1971, pp. 139-140, for a discussion of delta.
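Where computer facilities are available, the two difficulty indices can be computed directly. The sketch below is illustrative only and is not part of the Kit's procedures; it assumes Python with the numpy and scipy libraries, the function names are the writer's own, and the delta conversion shown is the conventional normal-deviate transformation onto a scale with mean 13 and standard deviation 4 (larger delta indicating a harder item).

```python
# Illustrative sketch (not from the Kit): percent pass and delta for one
# item, assuming Python with numpy and scipy available.
import numpy as np
from scipy.stats import norm

def percent_pass(item_responses):
    """Proportion of examinees answering the item correctly.
    item_responses -- sequence of 0/1 scores for one item."""
    return float(np.mean(item_responses))

def delta(p):
    """Conventional delta transformation: mean 13, SD 4, larger values
    meaning harder items (assumes normally distributed ability)."""
    return 13.0 + 4.0 * norm.ppf(1.0 - p)

if __name__ == "__main__":
    scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]      # ten examinees, one item
    p = percent_pass(scores)                     # 0.70
    print(round(p, 2), round(delta(p), 1))       # 0.7  10.9
```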
2. Indices of item discriminating power. The biserial correlation (r bis) and the point biserial correlation (r pbi) are the two most common indices. The point biserial is useful in estimating the standard deviation of the proposed test. (If the pretest population is not similar to the operational population, and if there is reason to doubt whether similar results would be obtained with the operational population, the item analysis should be used only for clues to item weaknesses and difficulty, and not for estimation.) The biserial correlation cannot be used directly in estimating the standard deviation, but it has the advantage of being more easily interpretable. The point biserial's maximum value depends on p, and it is thus less likely than the biserial to yield the same result when used with populations at different ability levels.1 The biserial correlation can be converted to the point biserial through use of Table B, Appendix VI, if one wishes to estimate the standard deviation of a set of items.

If computer facilities are available, either the biserial or the point biserial can be easily obtained. If computer facilities are not available, two short-cut methods give approximations of the biserial. If there are 100 cases or more in the population sample, the performance of the top 27% (based on the criterion) may be compared with the performance of the bottom 27% by using Fan's tables (Fan, 1952). These tables provide an estimate of the biserial, p, and delta. (See Appendix IV-I for guidelines on using the tables.) Where the sample size is less than 100, one can either use all of the cases and compute the biserial or point biserial, or use a method proposed by Diederich (1973), which compares the top 50% and the bottom 50% (see Appendix IV-I for procedures). Sample sizes under 500, regardless of method, lead to appreciable error in the index, as shown in Table C, Appendix VI.

1 The influence of difficulty on the point biserial is indicated in Appendix VI, Table B. The multiplier in the top row indicates the maximum point biserial value for given values of p when assumptions basic to the point biserial are met. The biserial correlation can exceed 1 with a skewed distribution of scores, but this is usually not a critical consideration. See Lord and Novick, 1968, pp. 340-344, for a more detailed comparison of the biserial and the point biserial.

Since the number of cases in the pretest population is likely to be quite small in the situations for which this Kit is intended, pretesting can provide only a rough estimate of item discriminating power. If the items are analyzed each time they are used, however, data can be built up, and items which have consistently higher indices can be used with greater confidence.

3. Information on answer options. In addition to the information on difficulty and discriminating power, the item analysis should provide data on the answer options. When a computer is used for the analysis, it should provide distributions of scores of those choosing each option. At a minimum, the mean score of those choosing each option should be provided. In the high-low methods it is customary to tally the number in the high group and the number in the low group who chose each option (see Appendix IV-I).
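Where a computer is used rather than the Fan or Diederich short-cuts, the two discrimination indices can be computed directly from the standard formulas. The sketch below is illustrative only and is not part of the Kit; it assumes Python with numpy and scipy, takes the total score as the criterion as described in B above, and the response data shown are invented.

```python
# Illustrative sketch (not from the Kit): point biserial and biserial
# discrimination indices for one item against a total-score criterion.
import numpy as np
from scipy.stats import norm

def discrimination_indices(item, total):
    """Return (p, r_pbi, r_bis) for one item.

    item  -- 0/1 scores on the item, one per examinee
    total -- criterion scores (usually total test score), same order
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    p = item.mean()
    q = 1.0 - p
    s_x = total.std()                        # population SD of the criterion
    m_pass = total[item == 1].mean()         # criterion mean, item answered right
    m_fail = total[item == 0].mean()         # criterion mean, item answered wrong
    r_pbi = (m_pass - m_fail) / s_x * np.sqrt(p * q)
    y = norm.pdf(norm.ppf(p))                # normal ordinate at the p/q cut
    r_bis = r_pbi * np.sqrt(p * q) / y       # standard conversion r_bis = r_pbi*sqrt(pq)/y
    return p, r_pbi, r_bis

if __name__ == "__main__":
    item  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
    total = [30, 42, 28, 35, 31, 40, 29, 33, 38, 27]
    # Roughly p = 0.60, r_pbi = 0.58, r_bis = 0.74 for these made-up data.
    print(discrimination_indices(item, total))
```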
IX. Item Revision and Final Test Assembly

Overview. Revising items on the basis of item analysis; making changes without re-pretesting; considerations in test assembly; predicting final test statistics from pretest data.

A. Preparation of Items

If the test items are likely to be used again, it will probably be worthwhile for the Measurement Specialist to have each typed or pasted on a 5 x 8 card with the item analysis data recorded on the back, as suggested in Appendix IV-F. This facilitates working with the items and moving them into different classification schemes for comparison or making choices at the assembly stage. It enables the user to spread the items out on a table for easy review of content while assembling the final test. It also provides a convenient form for filing the items between uses.

B. Use of Item Analysis Data to Revise Items

The Measurement Specialist should make a preliminary review of the item analysis data to identify those items which may be in need of revision. The following are bases for such identification:

1. Item is too easy or too difficult;
2. Index of discriminating power is below that established as desirable;
3. As many low scorers as high scorers chose the correct option;
4. More high scorers than low scorers chose a wrong option;
5. More high scorers chose a wrong option than chose the keyed option;
6. Few or no examinees chose a given option;
7. Many omitted the item.

See Appendix IV-J for a discussion of possible corrective measures. On the basis of this review, the Measurement Specialist may be able to make some changes, but if he is not trained in the field of the test content he should simply make notes to call attention to clues to poor item performance and defer revision until he meets with the Committee of Item Writers.

C. Tentative Assembly

The Measurement Specialist can make a tentative assembly of the final test using the Item Writers' classifications of the items and the item analysis data. He should be guided by the outline in the test specifications as well as by the item analysis data. Items testing the same area of content and considered usable (with or without revision) may be clipped together pending the meeting with the Committee of Item Writers. He can then make a quick tentative check on the test mean and standard deviation as part of this test assembly. A sum of p-values will provide an approximation to the mean if rights-only scores will be used and if the population on which the p-values were obtained is similar in ability level to the candidate population. Allowances should also be made for any planned or expected differences between pretest and final populations in percent completing the test. The standard deviation may be estimated from the formula

SD = sum over items of sqrt(p q) x r pbi, where q = 1 - p

(Gulliksen, 1950, p. 377; Lord and Novick, 1968, p. 330).1

1 If the pretest and candidate populations are not sufficiently similar to warrant using the p's directly, and if equated deltas have been obtained, the mean may be estimated by the following procedure suggested by Angoff (1971, p. 586): convert equated deltas to p and sum.
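A minimal sketch of these two estimates (the sum-of-p approximation to the mean and the Gulliksen formula for the standard deviation) is given below. It is illustrative only and not part of the Kit; it assumes Python with numpy, and the item statistics shown are invented.

```python
# Illustrative sketch (not from the Kit): estimating the number-right
# mean and standard deviation of a rights-only test from item statistics.
# Assumes the pretest population resembles the candidate population.
import numpy as np

def estimated_mean(p_values):
    """Estimated number-right mean: the sum of the item p-values."""
    return float(np.sum(p_values))

def estimated_sd(p_values, r_pbi_values):
    """Gulliksen's estimate: SD = sum over items of sqrt(p*q) * r_pbi."""
    p = np.asarray(p_values, dtype=float)
    r = np.asarray(r_pbi_values, dtype=float)
    return float(np.sum(np.sqrt(p * (1.0 - p)) * r))

if __name__ == "__main__":
    # Five items with made-up pretest p-values and point biserials.
    p_vals = [0.80, 0.70, 0.65, 0.75, 0.60]
    r_pbi  = [0.35, 0.42, 0.30, 0.38, 0.45]
    print(estimated_mean(p_vals))                    # 3.5 items right, on average
    print(round(estimated_sd(p_vals, r_pbi), 2))     # about 0.86 for these values
```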
D. Meeting of Item Writers for Final Test Assembly

It will facilitate review if each Writer has a copy of the tentative selection of items and a summary showing how these items fit the test content specifications. If the items are photocopied directly from the cards (see Appendix IV-F), the opportunity for error is decreased. The Measurement Specialist should call attention to the item analysis results as needed, and the Committee may readily suggest some minor changes. Other revisions should probably be left until after all the items are discussed, since the Committee may decide to make substitutions which will remove the need to revise some items. (Item Writers may wish to suggest changes for the unused items, but if the next need for a test in this area is far in the future, it may be unprofitable to spend much time in doing so.) The Measurement Specialist or one of the Writers should keep a record of all objections, however, with an indication as to whether there is consensus among the Writers, since a few replacement items may be required at the Panel review stage.

It will be obvious that some changes will affect the item difficulty substantially. Some educated guesses about the effect on difficulty may be made using the information on the number of individuals choosing options which are being deleted or changed, and by guessing the effect on the choices of these individuals on substituted options. Changed items may be used if the Measurement Specialist feels reasonably assured that the changes will improve rather than adversely affect the usefulness of the item and the test. Any changes should be examined carefully by the Committee of Item Writers to make certain that no biasing factors have been introduced. Also, the Committee should check that the classification of the item has not been changed by the revision. Items used to equate the difficulty of the new test to an earlier operational test must not be changed unless the change is certain not to affect their difficulty.

In examining items for possible ambiguities as a result of a low biserial, it is important to bear in mind that a low index does not necessarily mean a poor item. The index indicates the correlation between performance on the item and performance on the pretest as a whole. The item in question may test an important concept, but understanding of this concept may not be highly correlated with understanding of the other concepts tested. For example, an item may test an important trigonometry concept, but if there are only a few trigonometry concepts in a test which covers algebra and geometry as well, the trigonometry items may have lower biserials than they would have in a test of trigonometry only. If nothing can be found wrong with the item, and its dissimilarity to the remainder of the test could account reasonably for its low biserial, the item may be left in. It is undesirable to eliminate items just because they lack homogeneity with the rest of the test. A test with all items testing the same concept will have a higher reliability and higher biserials than one with a more heterogeneous content, but it is likely to have a lower validity, since it will cover fewer aspects of the relevant performance criterion. If the effect of revisions on the items is unknown, the items should be reserved for re-tryout in a future pretest.

When a tentative selection of items meets Committee approval, the Measurement Specialist should check the content against the test specifications and estimate the mean and standard deviation. He should compare the estimated standard deviation with the estimated standard error of measurement. He should summarize the content relative to the list of position requirements and check against the weights originally assigned. Finally, he should check the distribution of keyed answer options. The number in each option position should be roughly the same, but he should avoid having exactly the same number of each option.
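The last check, the tally of keyed answer positions, can be made by hand or, if a computer is at hand, with a few lines such as the following. The sketch is illustrative only and not part of the Kit; the answer key shown is hypothetical.

```python
# Illustrative sketch (not from the Kit): tallying keyed answer positions
# for an assembled test to confirm the counts are roughly, but not
# exactly, equal. The key below is made up for the example.
from collections import Counter

def keyed_option_counts(answer_key):
    """Count how often each option position (A-E) is the keyed answer."""
    return Counter(answer_key)

if __name__ == "__main__":
    key = "BDACEBADCABCDEAEBDCEBACDA"      # hypothetical 25-item key
    counts = keyed_option_counts(key)
    for option in sorted(counts):
        print(option, counts[option])      # A 6, B 5, C 5, D 5, E 4
```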
It is unlikely that an ideal choice of items can be made to meet all criteria: content, mean and standard deviation, and item discriminating power. Content should be given priority, if this does not result in substantial departure from the other ideals. The Measurement Specialist should remember, however, that the test is intended to sample the candidate's knowledge rather than measure every relevant aspect of it, and omission of one or two important concepts is unlikely to affect the test significantly if other criteria are met.

The Measurement Specialist should next prepare the test instructions (see Appendix IV-H). Enough examples of items should be included in the test instructions, when the test-taking procedures are unusual, to make certain that the examinee understands the instructions.

The test should be typed according to the format described in Appendix IV-H.1 Before reproducing it, an Item Writer should take the test, filling out an answer sheet as he does so. This "key" should then be checked carefully against answers keyed as correct on the item cards.

1 See Thorndike, 1971, chap. 6, for illustrations.

X. Final Test Review

Before the Item Writing Committee is dismissed, a photocopy of the test draft should be given to each Item Writer. The test should be checked against test specifications, and each Item Writer asked to take the test as if he were a candidate and to prepare a key. If the Item Writers, who are already familiar with the items, fail to finish in about half the time allotted the candidates, the test should be examined for items which are unduly time-consuming.

A copy of the test, as approved by the Item Writers, should then be sent to each Supervisory Panel member for review. Each member should be sent a copy of the review points as spelled out in Appendix IV-K, and two or three copies of the review comment sheet. The Measurement Specialist should arrange a meeting for the Panel to discuss their comments. A day or less should suffice for this meeting in most cases.

At this meeting the Measurement Specialist should be prepared with the unused items on cards (with item statistics recorded and Item Writer comments available). These may be needed in the event the Panel has a strong objection to certain items in the assembled test.

The focus of the Supervisory Panel will be on the suitability of the test as a whole: Will the qualified candidate be expected to do well on the test? Will the low-scoring individual be significantly more likely than the high scorer to do poorly on the job? Do the items, taken together, test a representative sample of the content judged important for the job? Will the test have face validity with the applicants, i.e., may they be expected to consider it a reasonable requirement for the job? Minority group members of the Panel should look for possible irrelevant cultural factors which might influence test performance but not job performance. The Measurement Specialist should encourage the Panel members to record their reactions to the individual items, since they may spot something not noticed by the Item Writers. If the focus is on the overall test, however, they may be less likely to seek minor flaws in items which are probably satisfactory in the form in which the Item Writers left them.
The Measurement Specialist should inform the Panel that changes should be made at this stage if there is reason to believe that measurement will be improved, but changes should not be made simply to satisfy a personal style preference which may be a matter of opinion.

XI. Final Test Production

In preparing final test copy, the Measurement Specialist should bear in mind factors which will adversely affect test scores in a manner not related to the knowledge or skill being tested. Spacing, size, and style of print should be such that the examinee can clearly perceive the intent of the question and what is expected of him. He should know where one item ends and the next begins, and the arrangement of options should not mislead him into marking his answer incorrectly on the answer sheet.1 Mathematical symbols should be clear, and fractions should be presented in a manner which will avoid misinterpretation; for example, a vertically written fraction is less likely to be misinterpreted than 1/2 x. If, because of the particular answer sheet being used, numbered options are used with numerical answers, or if lettered options are used with letter answers, spacing should be such as to avoid confusing the designated option with the answer. Also, the Measurement Specialist should avoid standard answer sheets which have odd-numbered items in the left-hand column and even-numbered items in the right-hand column. These are helpful in scoring odd- and even-numbered items separately for computing reliability, but they are confusing to use and will add to the examinee's problems in coping with the test situation.

1 See Thorndike, 1971, chap. 6, for illustrations.

The Measurement Specialist will need to be concerned with cost in preparing final test copy. If the test is to be used again at future administrations, it will probably reduce cost if booklets are reusable. However, this is not highly recommended. Separate answer sheets facilitate scoring. They also facilitate test analysis where computer facilities are available. Separate scratch paper will be needed for tests where computation is required. The necessity to account for all scratch paper may reduce the advantage of reusable booklets if there is much computation. Accounting for all test material after test administration can be simplified, where more than one sheet of scratch paper is required, by binding together several sheets. Use of a distinctive color will facilitate accounting for scratch paper and reduce the risk of scratchwork being done on sheets which could be carried out; but only very light colors should be used to avoid visual difficulties.

Control of the accuracy of final test copy is easier if the test is typed under the supervision of those responsible for test preparation rather than set by a typesetter. The Measurement Specialist should be sure that final copy is proofread twice against the cards. It may be desirable, especially in mathematical or scientific copy, to have the proofs proofread once again after copy has been multilithed. Even on photocopy an unwanted dot can appear, or part of a symbol can be accidentally deleted in cleaning copy.

This Kit does not cover all aspects of the test development process. It covers only those aspects dealing with item writing and assembling job knowledge tests.

References

American Psychological Association. Standards for educational and psychological tests. Washington, D.C.: Author, 1974.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Coffman, W. E. Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.

Cureton, E. E., Cook, J. A., Fischer, R. T., Laser, S. A., Rockwell, N. J., & Simmons, J. W. Length of test and standard error of measurement. Educational and Psychological Measurement, 1973, 33, 63-68.

Diederich, P. Short-cut statistics for teacher-made tests. Princeton, N.J.: Educational Testing Service, 1973.

Fan, C. T. Item analysis table. Princeton, N.J.: Educational Testing Service, 1952.

Guilford, J. P., & Fruchter, B. Fundamental statistics in psychology and education (5th ed.). New York: McGraw-Hill, 1973.

Gulliksen, H. Theory of mental tests. New York: John Wiley, 1950.

Lord, F. M. Tests of the same length do have the same standard error of measurement. Educational and Psychological Measurement, 1959, 19, 233-239.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Swineford, F. Note on "Tests of the same length do have the same standard error of measurement." Educational and Psychological Measurement, 1959, 19, 241-242.

Glossary of Some Common Testing Terms

The following terms are defined according to the sense in which they are used in testing. Some may have a more general meaning than that indicated here.

Alternative. An answer option on a multiple-choice test question.

Biserial correlation. A relationship between a criterion score and the score on the item.

Chance score. The score obtained by guessing.

Chance mean. The average score of persons guessing at random the answers to all items.

Chance range. Usually used to indicate the scores below the chance mean plus one or two chance standard deviations.

Chance standard deviation. The standard deviation of scores of persons guessing at random the answers to all items.

Content validity. Demonstrable relationship between the skills and knowledges required by the selection procedure and those required by the job. It is usually based on the judgment of experts.

Criterion-related validity. Statistical agreement between performance on the test and performance on the job.

Cut-off score. A score on a given test such that those who score higher are accepted and those who score lower are rejected. It also can include different levels of placement for different scores.

Discriminating power of an item. The extent to which those with high scores answer the item correctly, and those with low scores answer it incorrectly.

Distracter. An incorrect answer option.

Face validity. Appearing to test what is intended.

Item. A test question or problem, including, in multiple-choice tests, the set of options or alternative answers.
Job analysis. Refers to many procedures relating to analyzing jobs; as used here, it refers to analyzing duties performed on the job and the knowledges, skills, abilities, and other worker characteristics required to perform those duties.

Mean. The sum of scores divided by the number of scores.

Multiple-choice. An item type with two or more designated answer options.

Normal curve (or normal distribution).

Appendix II-B

Sample Memorandum for Nominated Item Writers

You have been recommended as a person who might be interested in participating in the development of a written test or other evaluation instrument to be used in the selection of candidates for the job of ______________________. It is important that the Agency have working on this assignment employees who are highly capable and who have experience on the job. The quality of the future work force depends on selecting those candidates most likely to be outstanding in their performance. This means having selection procedures which will identify candidates who have the qualities most important to job performance.

The assignment will consist first of analyzing the knowledges and skills judged essential to high level performance on this job and describing the specific ones which are critical to the job. It will require distinguishing the knowledges and skills which are truly important to successful job performance from those which "sound good" but are really unrelated to quality of performance. Persons participating will then be asked to write test questions that measure some of the pertinent knowledges and skills.

To assist you in deciding whether you would like to be considered further for this assignment, you will probably wish to try a sample of test item writing. First write down three or four examples of knowledge which you consider important to successful performance of the job of ______________________; next, try writing an item to measure one of these. If you wish, have it reviewed by someone else, and then try revising it on the basis of his comments.

If you participate, you will be given instruction in item writing and reviewing and will work as a member of a committee to write and review items for a test to be used in selecting among candidates for employment in ______________________.

I shall phone you next week to learn whether you are interested in participating. (Alternate statement if more persons are likely to be available than can be used: I shall phone you next week to learn whether you would like to be considered further for participating in this item writing project, or, if you wish, you may forward your item to me as soon as you have it ready.)

Arrangements have been made with your agency for those who participate to be excused from regular work responsibilities for a two- or three-week period (date), and some may be asked to participate in a later workshop of two days to assemble a final test.

Appendix II-C

Form DL  Sample Daily Log

Agency:
Job Title:
Date:
Purpose of Session:
Those present:

Columns: Topic; Decision; Pro; Con

It is recommended that all decisions be recorded in a daily log.

Appendix II-D

Form MN  Measurement Need Definition

Job Title:
Agency:
Location:
Date Completed:

Columns: Job Duties (Based on Job Description)1; Importance Rating; Position Requirement; Importance Rating; Evidence (Selection Procedure); If test, specify kind

1 Attach copy of job description.
Appendix II-E

Form BD  Background Data for Test Development

Agency:
Location:
Job Title:
DOT Code:
Date Form Completed:
Function of Test: Selection ___  Promotion ___  Other ___
Test Rationale:
Position requirements which the test is intended to identify:
Rationale for using a test:
Other selection procedures (if any) being used to identify the specified qualifications:
Use of test (e.g., weight to be assigned the test in the selection process, ranking or cut-off, etc.):
Number of job openings:
Selection ratio (number to be selected divided by number of applicants):

Population data:1 2
Columns: Number; Ratios for Black, American Indian, Oriental, Spanish Surname, Other, Total, Male, Female
Rows: Currently on job; Likely to apply; Source population

Educational background:
Columns: Less than 4 yrs. H.S.; H.S.; Less than 4 yrs. college; 4 yrs. college or more; Total
Rows: On roll; Applicant

Age:
Columns: Under 20; 20-29; 30-39; 40+; Total
Rows: On roll; Applicant

Validation strategy planned:

1 Ethnic group data collection is NOT required under USCSC regulations. Such data collection is useful for State and local governmental agencies in order to meet EEOC and OFCC regulations. Also, such data collection would be required under the draft of the Uniform Guidelines on Employee Selection Procedures by the EEOC.
2 The numerator for a given ratio is the number of persons in the given ethnic, education, or age group in a given row. The denominator for the ratio is the total number of persons for that row.

Appendix II-F

Form TS  Test Specifications

Agency:
Location:
Date Prepared:
Job Title:
DOT Code:

Columns: Content; Part; No. of Items; Item Type; Number of Choices; Time; Desired Statistical Characteristics (M, SD, % Com.); Chance (M, SD); SEM

1. Scoring Procedures: Rights only ___  R - W/(k-1) ___  Other (specify):
2. Weighting of Parts: Simple sum ___  Other (give reason):
3. Selection Standards: Minimum cut-off ___  Multiple cut-off ___  Ranking, sole predictor ___  Ranking, combined predictors ___
   If more than one predictor, what are these and how will they be combined?
   How will cut-point(s) be determined?
   If cut-point, how does it compare with chance range?
4. Norms: One-time use ___  Repeated use: %-ile norms ___  Scaled score norms (M = ___, SD = ___) ___
   Equate to existing operational form? Form ___
   Equate to future operational form(s)? Form(s) ___
   Equating method: Common items ___  Common population ___  Equate M and SD ___  Equi-percentile ___  Other:

Appendix II-G

Form PA  Pretest Analysis

Agency:
Location:
Job Title:
DOT Code:
Test Title:
Description of Population Sample:
Date of Analysis:

Columns: Part or Form; Content; No. Items; Mean; SD; p Mean; p Range; r bis Mean; r bis Range

1. Equating items, if included: Delta base population: ___  Item numbers: ___  Delta plot (attach plot of operational [y-axis] vs. pretest [x-axis])
2. Table of equivalent deltas: Pretest delta / Operational delta
3. Power vs. speed data: Percent completing pretest: ___  Last item reached by all examinees: ___  Plot of R [y-axis] vs. R + W [x-axis] (Rights vs. Rights plus Wrongs): attach
4. Item analysis results (see next sheet):

Appendix II-H

Form IAR  Item Analysis Results

Agency:
Test:
Location:
Population:  Total N:
Analysis Sample Description:
Date of Analysis:

Columns: Item No.; p; Raw Delta; Equated Delta; r bis; NR1; NO2; Criterion Group3; Total Group; Breakout of Ethnic Group4 (Black, American Indian, Oriental, Spanish Surname, Other, Male, Female)
(Under Total and under each breakout group, three columns: Hi | Low | Total.)
Mean:
S.D.:

1 Number who did not reach item.
2 Number who omitted item but responded to subsequent items.
3 If the high/low method of item analysis is used, use the three columns in each breakout to enter the number who answered correctly in the high group, in the low group, and in the total group, respectively. If rbis is computed, use the columns to enter the total score mean of those answering correctly and those answering incorrectly, respectively.
4 The USCSC does not require the collection of ethnic data. See note on Appendix II-E.

Appendix II-I

Sample Memorandum on Control of Test Materials

The Agency urgently requests your cooperation in preserving the confidential nature of all examination materials, both those you construct and those you review. The following controls are suggested:

1. Submit, in longhand, questions that you construct, and retain no copy or notes pertaining to them or to any written sources used.
2. Make no copies of questions you are asked to review or of your comments on them except for those copies used for review.
3. Keep all materials pertaining to examination questions under lock and key when they are not in actual use.
4. Permit no other person to have access to examination materials without prior clearance with the Measurement Specialist or the head of the Agency.
5. Avoid discussing the content of examination questions in such a way as to identify the topic under consideration with a civil service examination.
6. Keep in mind that you may have a tendency to inadvertently give an undue advantage to persons whom you supervise, teach, or merely talk to, and who may take examinations containing some of the questions you have worked on.
7. Except for necessary clearances, avoid disclosure of the fact that you are working on a civil service examination.

Because of the importance of security considerations in examinations for the public service, we would appreciate your signing this note and returning it to us as an indication that the foregoing types of safeguards, and any additional ones that seem desirable, have been applied. We should also appreciate your notifying us if in any case circumstances beyond your control cause you to have any doubts as to whether the confidential nature of the materials has been maintained.

Signature                    Agency                    Date

Appendix III

Checklists

A. Schedule Guide
B. Meeting Arrangements Checklist
C. Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests

Appendix III-A

Schedule Guide

The following schedule is provided as a guide to the Measurement Specialist in setting up the overall program. Those who have not previously participated in a test development program of this nature may underestimate the total amount of time required. Such an underestimate could result in agreeing to an unrealistic completion date, with a resulting pressure to spend less time than desirable on some steps.

                                                              Minimum elapsed time
                                                              from request (in weeks)
1. Program planning completed                                            1
2. Selection of Supervisory Panel completed                              2
3. Work session of Supervisory Panel initiated                           6
   (Allow 2 to 4 days for the work session, depending on adequacy of the existing job description.)
   Session completed                                                     7
4. Selection of Item Writers completed                                   7
5. Work session of Item Writers initiated                                9
   (The training phase may be expected to overlap with the production phase, but a minimum of 2 weeks total should normally be allowed.)
   Session completed                                                    11
6. Pretest plans made                                                   11
7. Pretest(s) administered                                              13
   (If pretests are administered as part of the operational battery, the schedule must of course be adjusted accordingly.)
8. Pretest analysis completed                                           15
9. Meeting of Item Writers to review analysis and make final test
   assembly decisions initiated                                         17
10. Meeting of Panel to review test                                     19
    (Allow 1 to 2 days for the meeting in setting the schedule.)
11. Test produced and ready for administration                          22

Appendix III-B

Meeting Arrangements Checklist

A. Preliminary arrangements
   1. Meeting room(s)
      a. Blackboard
      b. Tables (The room for Item Writers should have tables to permit pair or group conferences as well as space with tables for individual concentrated work free from distractions.)
      c. Clock
      d. Locked file
      e. Cloak room
   2. Hotel accommodations (if required for some)
   3. Transportation (if required for reaching meeting place)
   4. Coffee, if to be provided
   5. Luncheon reservations, if group will lunch together
   6. General supplies
      a. Pads, pencils, erasers, paper clips, stapler, transparent tape, ruler, ash trays (if smoking is permitted), black ink pens
      b. Additional supplies for Item Writers: half sheets, typewriter, drafting equipment as needed for sketching diagrams
   7. Handouts
   8. Reference materials (For Item Writers this includes: dictionary, thesaurus, Dictionary of Occupational Titles, references which give item writing suggestions and illustrative items, and texts in the content area of the test.)
   9. Confidential trash pick-up
   10. Copying facilities
B. Day of meeting
   1. Set up conference room
      a. Supplies
      b. Handouts
      c. Necessary reference materials
      d. Working materials
   2. Provide the receptionist and/or local telephone operator with a list of participant names and instructions on interruptions.
   3. See that the room is locked or materials are stored in locking files when conferees are out of the room.

Appendix III-C

Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests

I. Test
   A. Instructions
      1. Content and complexity
      2. Sample items
      3. Instructions on guessing
         a. working speed
         b. timing of parts
         c. comparison of item numbers on answer sheet and test
   B. Mechanical features
      1. Format clarity
         a. spacing
         b. size and style of print
         c. arrangement of options
         d. labelling of options
         e. form of answer sheet
      2. Item difficulty
         a. distribution
         b. level
         c. order of items relative to difficulty
      3. Speededness
         a. time limit vs. work limit
         b. short and long items under one time limit
      4. Method of scoring
         a. weighting
         b. guessing correction
         c. accuracy of scoring
   C. Item construction
      1. Item types
         a. choice of item type
         b. number of different item types
         c. arrangement in test
         d. coachability
      2. Framing of question
         a. readability
         b. length
         c. setting
         d. level of terminology
      3. Answer options
         a. number of choices
         b. variation in number of choices
         c. placement of correct choices
II. Examinee
   A. Mental set
   B. Personal background factors
   C. Educational background factors
      1. Reading ability
      2. Training
         a. recency
         b. amount
         c. specialized training
III. Conditions of Administration
   A. Test Administrator's instruction
      1. Accuracy of timing
      2. Accuracy in reading instructions
   B. Physical surroundings
      1. Seating arrangement
      2. Lighting, etc.
      3. Distractions
   C. Administration of other tests
   D. Order in which tests or parts are administered

Appendix IV

Procedural Guides

A. Suggested Source Material for Job Analysis
B. Item Writing Guide, Sample Items, and Examples of Item Modification
C. Item Types
D. Item Writing Rules for Multiple-Choice Tests of Job Knowledge
E. Suggestions for Developing Item Ideas
F. Instructions Regarding Item Format
G. Item Review Guide
H. Test Format and Sample Test Instructions
I. Procedures for Obtaining Pretest Statistics
J. Using Pretest Statistics
K. Test Review Guide and Test Review Sheet
L. Brief Guide on Norming and Equating Procedures

Appendix IV-A

Suggested Source Material for Job Analysis

There is no one job analysis method which is clearly superior to all others. In determining an adequate approach to follow, the MS should select one which has been used successfully by others for determining job requirements for selection purposes rather than for job classification or job evaluation purposes. Also, the method chosen should provide some type of listing of the important job requirements in terms of knowledges and skills (or at least provide a basis for inferring important knowledges and skills) for job knowledge test development purposes.

Of most value are procedures which use one or more of such methods as interviews, questionnaires, checklists, diaries, brainstorming techniques, and critical incidents. Of little value are such methods as the Position Analysis Questionnaire and the Occupational Analysis Inventory, both of which concentrate on underlying cognitive, affective, and physical abilities; however, these two methods are very useful in the development of other types of tests. Good sources of job analysis information can be found in standard textbooks in industrial psychology. An article by Prien and Ronan1 reviews many of these procedures. The job element job analysis procedure of Ernest Primoff also can be used for developing lists of important knowledges and skills. The following two manuals may also be of some use to those who have effectively tried them out.

U. S. Civil Service Commission. Job analysis: Developing and documenting data. Washington, D. C.: Author, 1973.
U. S. Civil Service Commission. Job analysis for improved job-related selection. Washington, D. C.: Author, 1975.

1 Prien, E. P., and Ronan, W. W. Job analysis: A review of research findings. Personnel Psychology, 1971, 24, 371-396.

Appendix IV-B

Item Writing Guide, Sample Items, and Examples of Item Modification

This Guide is intended to assist the Item Writer in working from a position requirement to a finished item ready for review. It is intended both for a general orientation and for reference use. Throughout the item writing process, one must keep in focus the purpose of the item as a measure of the candidate's ability to handle some aspect of the job, and avoid being diverted into testing knowledge for its own sake.

Five stages of item development will be considered. The form of the item at each stage is indicated in parentheses.

1. General statement of the knowledge or skill required to perform the job as described in the job analysis. (Position requirement)
2. Identification of the specific concept, principle, or skill required. (Item objective)
3. Preparation of a rough statement of the item, including the stem and as many options as the Item Writer can think of. (Item idea, or tentative item)
4. Item Writer's review of his tentative item.
5. Statement of the item in a form which the Writer is willing to have reviewed by others. (Provisional item)
Stage 1: Selection of a position requirement to be tested.

The position requirement should be derived from the job analysis with a specific task in mind. For example:

First-Aid Attendant
Requirements--Knowledge of conditions requiring first aid; procedures for treating conditions of shock.

Stage 2: Development of item objectives.

For example:

In the event of an accident, examine the patient to determine whether an emergency exists.
Sunstroke symptoms include a rapid full pulse and dry skin.
Symptoms of shock include weakness, paleness, and perspiration on the upper lip and forehead.
For a patient in shock, avoid overheating or chilling.

Although the Writer may understand his field well enough to write questions without frequent reference to a text, it will help assure coverage to go through two or three currently accepted texts to identify concepts. Those concepts covered by more than one text may be important enough to consider testing. This analysis will also help in developing the overall outline. The Writer should not, however, accept this agreement among texts as sufficient evidence of importance for job performance. He should think of his own experience and evaluate whether the skill or knowledge is of sufficient importance to warrant testing, i.e., could he have performed his job satisfactorily without it? Is it something he could have learned on the job as the occasion required it? This evaluation step is very important, as the Agency may later be called upon to justify having tested a particular knowledge or skill.

The Writer should recognize that testing is a sampling process, and should not become concerned if the list of item objectives is so long that not all of them can be covered in the number of items allotted. If the number of potential items is large, it is important not only that those concepts or skills selected be critical, but also that one area of skill or knowledge not be represented disproportionately in the final test. The Writer must avoid placing excessive weight on the areas of particular concern to him individually.

Stage 3: Devising a way of measuring the objective.

What might the individual do who does not understand the principle? For example, what steps may be taken by the individual who does not know the proper treatment of shock? (Don't worry about wording at this time; just get the idea down.) The following question might be written:

E1. In case of shock, which one of the following should be done for a patient?
(A) Give the patient a drink.
(B) Cover the patient warmly.
(C) Raise the patient's head above his feet.
(D) Place an ice pack on the patient's head.
(E) Cover the patient with a light blanket.

One should write down as much of the item idea as comes to mind, including the stem and any distracters which seem plausible, before going on to the next item; otherwise, they may be forgotten. If more than the needed number of answer options occur to the Writer, they should be written down. Later consideration may indicate that certain distracters are better than others.

The above item is a straightforward test of knowledge. The same knowledge is required in the following item, which also tests the candidate's ability to recognize the symptoms of shock:

E2. A traffic accident victim is weak and pale and shows perspiration on his upper lip and forehead. Which one of the following should the ambulance attendant do?
(A) Remove the victim's coat and shirt.
(B) Give the victim a drink of water.
(C) Cover the victim with a light blanket.
(D) Give the victim a pillow to raise his head.
(E) Place an ice pack on the victim's head.

Both of these items need rewording, but the present statement is sufficient for recording the idea.

The item situation should be realistic in order to make the item seem reasonable to the candidate, i.e., to give it "face validity." The length of the item, however, should not be appreciably increased to add "window dressing." Avoid asking for the information one usually has and giving the information one would normally seek. An example of such an undesirable inverted item would be: "If it takes 12.5 cubic feet of concrete to build a square loading pad 6 inches thick, what is the length of one side of the pad?" Avoid using terms in wrong options which imply the candidate needs to know something irrelevant to the job.

The following questions may help in converting an item objective to a test question:

1. What are common misconceptions about the principle?
2. Why is the principle important to satisfactory job performance?
3. In what sort of circumstances might it be important to understand the principle?
4. What might the individual do who does not understand the principle?
5. What might be the consequences of a lack of knowledge of the principle?
6. How can the individual demonstrate that he has the knowledge?

See also the list of suggestions for developing item ideas in Appendix IV-E. The list should be referred to only after the item objective has been stated and one knows the principle or process one wishes to test. At the item idea stage, the focus should be on relating the principle to the job, and the Writer should avoid being led, in the search for an item statement, to measurement of an abstract concept which is related to the concept being tested but which is not essential to job performance. Most questions will not start with the phrases in Appendix IV-E, but these phrases may provide an idea that will stimulate the Writer when he is trying to put an idea into item form, or they may suggest a new approach when the one which the Writer has been trying does not seem to be working.

In writing distracters, some Item Writers find it helpful to project themselves into the situation and ask themselves, "If I were taking a test which included this question as a completion item, and I didn't know the answer, how would I answer it? What sounds reasonable?" For example, in item E2 above, the uninformed individual, seeing perspiration, might consider the victim to be too hot and remove the coat; another might try to make the patient more comfortable by giving him a pillow.

Sometimes a Writer has only one or two plausible distracters. He may be tempted to turn to one of the following solutions:

1. Using "None of the above" as an alternative. This is all right if the other options are clearly right or wrong, but this is often not the case.
2. Putting in a "filler." Because it is obviously wrong or unrelated, this serves no useful purpose other than to give the requisite number of options.
3. Throwing the item out. If the point is an important one, an effort should be made to salvage the item.

Consider the following example, which has three logical choices as shown:

E3. A test analysis shows that all students answered correctly all items which they answered. The test is most probably
(A) a test of speed
(B) a test of power
(C) a test in which both power and speed are important.
In some cases where there are only two good distracters, a fourth option--"Cannot be determined from the information given"--can be used effectively. In item E3, however, this option might attract the better examinee, who could argue that one needs to know more about the test and the population before being certain that the examinees have really tried all items to which they responded. The response "None of these" is not a good solution, since the three options fairly well cover the possibilities, assuming the information is sufficient. It would be better in this case to reconsider what is being tested and make a fresh start. The item as it stands could be answered on the basis of rote learning. One might try measuring whether the candidate knows how to make use of the analysis information, as follows:

E4. Scores on a 100-item test of tool recognition ranged from 50 to 90; however, every student answered correctly most of the items to which he recorded an answer. Which one of the following is the most reasonable conclusion?
(A) Fewer, and some more difficult, items should be used.
(B) Many items are ambiguous and need revision.
(C) The test is too difficult.
(D) The time per item is probably adequate.

Stage 4: Review of the tentative item.

Before spending time working over the item, ask the following questions:

1. What specifically am I trying to measure? One should try verbalizing it as if he were telling someone else. If not obvious, the answer and associated logic can sometimes be clarified by writing down the objective and the approach.
2. Could I do the job effectively without being able to answer the question? If so, is it because the content is inappropriate or because the wording is not clear?
3. About what fraction of qualified workers on this job could answer it? What fraction of my acquaintances who are working in unrelated jobs could answer it? If these fractions are roughly the same, would those who can answer it be expected to do better on this job than those who cannot? (Ignore other factors entering into job performance, such as interest, motivation, special skills, etc. Look at those aspects of the job which involve the skill being measured.)
4. Is the item time-consuming? If so, it may be possible to reword the item or split it into more than one question.
5. Will the wording be clear to someone else reading it for the first time? Could it be stated more simply and still provide the necessary information?
6. Look at each option. Is the correct answer clearly correct? Is it based on current thinking? Do the options cover popular misconceptions? Is any option likely to be chosen by only a few candidates?
7. Would this item be defensible if printed in a newspaper?
8. If the item involves computation, could values be chosen which would take less computation time without changing what is being tested?

The Writer may wish to accumulate several items which have reached this stage and then reread the "Item Writing Rules for Job Knowledge Tests" in Appendix IV-D before attempting to polish the items. The procedure of accumulating saves time in going over the rules, and it also gives the Writer a fresher look at his items, thus increasing the likelihood of spotting possible weaknesses.

Stage 5: Improving the item.

The item should now be put in a form in which it can be reviewed by others. The following discussion considers certain characteristics of "good" items and techniques for improving weak items.

Simplifying wording:

1.
Will the candidate know clearly what he is expected to do? Does he have all the information he needs to work with? Does answering the item depend on certain assumptions which must be stated? For example, in E2, option B may be misleading. Since small amounts of water may be given a conscious patient in shock, B might be considered correct. (Option C is the intended key.) Change "drink" to "glass" to suggest a larger amount, or, if only four choices will be used, this option is a candidate for deletion. Option D might be stated better as "Place a pillow under the patient's head."

2. Does the item read smoothly? For example, even though E3 is understandable as stated, the first sentence of the stem reads more smoothly as worded in E4.

3. Is the terminology unnecessarily difficult, i.e., could a candidate who understands the concept being measured fail the item because he doesn't understand the words which are used? This is especially important if candidates have a native tongue other than English. One should not avoid terms which the candidate must understand to perform the job, but if the terminology itself is not critical, one should look for more generally understood words and phrasing. For example, instead of asking "What are the detrimental effects of ___?" ask "What are the harmful effects of ___?" Where terminology is important, one may wish to consider testing for knowledge of terminology separately from knowledge of concepts.

4. Is the wording as concise as is consistent with a clear statement of the question or problem? If the statement of the stem runs more than three typed lines, could it be shortened? Since the accuracy of measurement depends to a large extent on the number of items in the test, time should not be wasted in unnecessary reading. Consider it a challenge to state the stem in one line less, as if space were limited. Can you do this and make the question equally clear? Such a focus on revision may help to make the wording even clearer than it was initially. One should not, of course, shorten the wording if doing so leaves out essential information or makes the question more confusing. It may even be necessary to add wording to provide all the information needed.

5. May the candidate overlook in hasty reading certain words which are essential to understanding the task? For example, by the time the examinee has finished reading and comparing the options, he may forget he was supposed to look for the "least important reason" and choose the "most important." The Writer should capitalize such words as NOT, LEAST, and other negatives. If the examinee may work a numerical problem in a unit of measure different from the one given, underline or capitalize the unit of measure asked for.

6. Put as much of the question as possible in the stem. For example, compare the following two items for ease of reading:

E5. The standard deviation is a more sensitive measure of variability than the range because
(A) the standard deviation shows how scores relate to the mean
(B) the standard deviation is based on all the scores
(C) the standard deviation gives more weight to extreme scores
(D) the standard deviation is more complex mathematically.

E6. The standard deviation is a more sensitive measure of score variability than the range because the standard deviation
(A) shows how the scores relate to the mean
(B) is based on all the scores
(C) gives more weight to extreme scores
(D) is more complex mathematically.

Phrasing the correct option:

The intended answer must be stated clearly and concisely.
It need not be a complete sentence, but it must follow grammatically from the stem. It must be clearly the best answer, but make certain it cannot be identified as correct without reading the stem. Item ideas often occur as the result of thinking of a common misconception. In later work on the idea, it may become apparent that the correct answer is obviously correct, and the author may be tempted to make it a little less obvious by disguising it. For example, the author may think of weaknesses in item writing and decide to write an item to measure understanding of these. He writes the following question:

E7. A good multiple-choice item is characterized by which one of the following?
(A) uses double negatives
(B) tests knowledge of the views of an authority
(C) tests an important point
(D) asks the examinee his opinion

Unfortunately, the correct answer is the only one which sounds reasonable. The author might consider using a less obvious correct answer such as "short answer options," but this weakens the item by reducing the importance of the principle being tested, and it is still fairly obvious. Perhaps a more satisfactory solution would be to make the correct answer a weakness in item writing. Such an alternative might be phrased as follows:

E8. The stem of an item reads "Which one of the following is the most important role of education in the United States today?" What is the primary weakness in this question?
(A) The geographical area is not specific enough.
(B) The term "today" may be misleading.
(C) The term "most" is too strong.
(D) The correct answer is a matter of opinion.

Item testing breadth:

E9. The purpose of defining the abilities required by the job is to increase the likelihood that the test will
(A) distinguish between high and low capability candidates
(B) appear reasonable to the examinee
(C) have more easy items
(D) have less ambiguous items.

Item testing depth:

E10. The purpose of defining the abilities required by the job is to increase the likelihood that the test will
(A) distinguish between high and low capability candidates
(B) cover all the duties actually performed on the job
(C) identify individuals with the most knowledge of the subject-matter
(D) appear to the examinee to be related to the job.

Each distracter should be plausible to someone who does not understand the principle being tested, though care must be taken that the distracters do not become plausible enough to mislead the better candidates. All the distracters should constitute possible answers to a direct question implied or stated in the stem. As with the correct option, they must be related to the stem grammatically and in content. A distracter must not be identifiable as incorrect without reading the stem.

Clues to the correctness or incorrectness of an option should not be provided. Such clues, called specific determiners, evolve from various weaknesses in the options:

1. An option which does not follow grammatically from the stem may indicate an incorrect response.
2. An option which is longer than the others may suggest the correct answer. The extra length is often due to stating the assumptions that are essential to make the option correct. (Such qualifications, if necessary, can sometimes be incorporated in the stem.)
3. Two options which are synonymous automatically disclose that both are incorrect.
4. Two options which are mutually exclusive may give a clue to one being correct.
5. An option which encompasses another option may enable the examinee to eliminate one of these.
6. An option which uses closely similar terminology to that in the stem may be attractive to one searching for the correct answer.
7. A non-parallel option may give a clue to the correct answer.

In writing items using "None of the above" as an option, each statement must be clearly correct or incorrect. If one is simply better than the others, the examinee has no basis for judging how correct the correct answer must be to meet the examiner's intentions. This is one reason for avoiding true-false items in employment tests, since it is difficult to phrase many important concepts as statements which are clearly correct or clearly incorrect and which do not depend on unstated conditions or assumptions.

"All of the above" must not be used as an option. Although this may be justifiable in a course examination where students have been trained by the teacher to expect such a choice and be held responsible for it, a job applicant meeting this kind of item for the first time may stop after finding a correct answer and not discover there is more than one. He is in a hurry, and unless he sees the need to read all options, he may not do so.

Appendix IV-D, "Item Writing Rules for Multiple-Choice Tests of Job Knowledge," contains a longer list of suggested do's and don'ts.

Decreasing the length of long options:

The following item requires more reading time than necessary:

E11. If the discriminating power of an item is less than .35,
(A) one should discard it since it is too low to be discriminating
(B) one should consider leaving it as it is, since it may be a good item which is not like other items
(C) change any options which less than 10% of the examinees select
(D) try it out on another population sample.

This item is testing more than one principle. The first of the principles is that an item's discriminating power (correlation with total test score) depends both on its inherent capacity to discriminate between high ability and low ability examinees and on its correlation with (similarity to) other items in the test. The following item defines the problem further in the stem and leads to shorter options:

E12. For which one of the following reasons might a test developer NOT discard an item that has an item-total correlation of .30 and no detectable flaws?
(A) An item-total (biserial) correlation of .30 indicates a very discriminating item.
(B) The content is dissimilar to that of other items.
(C) It is a very easy item.
(D) It is a very difficult item.

The similarity of C and D may lead the examinee to assume that one of these is correct. Since neither is correct, the rule of avoiding similarly stated options can probably be safely ignored.

The second principle being measured in E11 (by option C) is that a distracter may be effective even though not highly popular. It might be better to focus more precisely on the significance of the response pattern, i.e., the numbers in the high and low scoring groups who chose each option.

E13. An item analysis shows the number of high scoring and low scoring candidates who chose each answer option. Option 2 is the correct answer.

              Option:    1     2*    3     4
    High 27% group:      2    54    18    26
    Low 27% group:      12    18    10    60

Of the following, which one would be the best option to replace?
(A) option 1, because fewer than 10% chose it
(B) option 3, because more in the top group than in the bottom group chose it
(C) option 4, because so many of the top group chose it
(D) option 4, because more chose it than chose the correct option
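The kind of response-pattern reasoning required by E13 can also be checked mechanically once pretest tallies are in hand. The short Python sketch below is offered as an illustration only; it is not part of the Kit, and the variable names and 10-percent threshold are simply taken from the discussion above. It tallies the E13 counts, flags any wrong option chosen by a larger share of high scorers than of low scorers, and notes options chosen by fewer than 10 percent of examinees.

    # Illustrative sketch (not part of the Kit): screening the E13 response pattern.
    # Counts are those shown for item E13; the high and low groups are assumed equal in size.
    high = {"1": 2, "2": 54, "3": 18, "4": 26}   # choices made by the high scoring group
    low = {"1": 12, "2": 18, "3": 10, "4": 60}   # choices made by the low scoring group
    key = "2"                                    # option 2 is the correct answer

    n_high = sum(high.values())
    n_low = sum(low.values())

    for option in sorted(high):
        p_high = high[option] / n_high
        p_low = low[option] / n_low
        p_total = (high[option] + low[option]) / (n_high + n_low)
        if option != key and p_high > p_low:
            print(f"Option {option}: chosen by more high than low scorers -- "
                  "look for a defensible second answer or revise the option.")
        if p_total < 0.10:
            print(f"Option {option}: chosen by fewer than 10% overall -- "
                  "a weak distracter, though it may still be serviceable.")

Run on the E13 counts, the sketch flags option 3 (more high than low scorers chose it) and option 1 (chosen by fewer than 10 percent), which parallels the reasoning behind the keyed answer, (B).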
Obtaining more items per given amount of time:

Perhaps the most obvious way of increasing the number of items for a given amount of time is to write items which require less time to answer. If the previous recommendations on wording are followed, there may be little more to be accomplished along this line. Use of simpler numbers has been suggested, as well as breaking a long item down into two or more separate items. There are, however, techniques of item format which facilitate covering more ground in a limited amount of time. Some of these are described here.

A common set of options saves reading time and can be useful where this approach does not result in testing one area of content too heavily. This is also useful when a set of diagrams involves two or more important concepts and the diagrams take some study to comprehend thoroughly. For example:

Use the options at the right in answering items E14 through E16. Which one of the score systems shown at the right relates a person's performance to

E14. a given level of attainment?                        (A) standard score
E15. the mean in terms of deviation units?               (B) percentile score
E16. the proportion of the norm population which         (C) criterion referenced score
     achieves scores below his?                          (D) raw score

The following graphs are to be used in answering items E17 through E20. A given answer option may be used for more than one question.

[Four scatter-diagrams, labeled A through D, each plotting scores on Y against scores on X; illustration not reproduced.]

E17. Which one of the scatter-diagrams most indicates a lack of relationship between the two variables?
(A) A  (B) B  (C) C  (D) D

E18. Which one of the scatter-diagrams indicates a perfect relationship?
(A) A  (B) B  (C) C  (D) D

E19. Which kind of scatter-diagram would one expect if only those examinees who scored above a certain point on X were tested on Y?
(A) A  (B) B  (C) C  (D) D

E20. Which scatter-diagram shows that a high score on X is associated with a low score on Y and vice versa?
(A) A  (B) B  (C) C  (D) D

This format is similar to a matching item. Ordinarily in matching items, one list may be shorter than the other, but each element is usually used only once. With the common set of options, it is not unusual to have the same option correct for more than one item. If this is the case, make sure the examinee understands that some options may be used more than once and others not at all.

More items per time period can also be obtained by using the same set of data, a complex diagram, or a paragraph for two or more items. As with other items in the test, items based on a common set of data must be independent of one another. The answer to one item must not depend on answering another item correctly.

Sometimes different item types can be useful in getting more information about the candidates' knowledge in a given amount of time.
The following type may be useful where one wants a quick check on quantitative types of knowledge:

Mark each of the following statements in items E21 through E25 according to the following answer options:

Mark Option A if the value in column A is greater
     Option B if the value in column B is greater
     Option C if the values in columns A and B are equal
     Option D if the relative size of the values in the two columns cannot be determined from the given information

            A                            B
E21.  √.70                         .70
E22.  [entries not legible in source]
E23.  mean                         standard deviation
E24.  3 cm                         1 inch
E25.  boiling point of water       boiling point of cooking oil

Making the item less difficult:

The following item illustrates two points. First, it shows a sample multiple true-false item for which examinees may consider more than one option, or none of the options, to be correct. If this type is used, it should probably be used only with candidates for higher level jobs. Good college students have sometimes objected to its use, even though only one of the lettered answer options is correct. As with the true-false item type, one must be certain that each option is clearly right or wrong. Secondly, it illustrates an item which requires excessive time to read and understand for one point of credit.

E26. On the basis of the information given, which of the three conclusions given below can be drawn regarding the following set of raw test scores on a 40-item test of arithmetic computation?

    Jones  20    I. Smith has twice as much computational ability as Jones has.
    Smith  40    II. Smith should be able to handle any job requiring arithmetic computation.
    Brown  35    III. Brown is nearer to Smith than to Jones in arithmetic ability.

(A) None
(B) III only
(C) II and III only
(D) I, II, and III

This item tests at least three concepts: 1) that a score twice as high does not mean twice as much ability; 2) that one needs to compare test content with job content before knowing whether the test covers the job requirements; and 3) that raw scores cannot be interpreted directly on a scale of ability.

Try breaking long items such as this into two or more less complex items. For example:

E27. Three candidates for a job which requires a high level of computational accuracy make scores of 20, 35, and 40 on an arithmetic test. Which one of the following items will be LEAST useful to the employment interviewer in evaluating the qualifications of the candidates?
(A) a copy of the test
(B) a review of the test by the job supervisor
(C) the standard error of measurement
(D) national norms on the test for high school graduates

E28. With which one of the following score systems does a score of 60 generally reflect twice as much ability as a score of 30?
(A) raw score
(B) raw score corrected for guessing
(C) standard score scale with M = 45 and SD = 10
(D) none of these scales

Parts I and II of The Construction of Test Questions (see Bibliography for reference), obtainable from the U. S. Civil Service Commission, include suggestions on item writing and are the sources for some of the ideas included here.

Answer Key to Items in Item Writing Guide

E1. E    E8. D     E15. A    E22. C
E2. C    E9. A     E16. B    E23. D
E3. A    E10. A    E17. D    E24. A
E4. A    E11. B    E18. C    E25. B
E5. B    E12. B    E19. B    E26. A
E6. B    E13. B    E20. C    E27. D
E7. C    E14. C    E21. A    E28. D

Appendix IV-C

Item Types1

Although multiple-choice items are almost always preferred, there are occasions when other item types can be used.
This appendix briefly presents some samples and suggestions for the item writer to follow in writing these various other common types, i.e., true-false, matching, completion, and essay.

Item writing is a creative process. Often considerable ingenuity is needed to write items that require the candidate to go through the required thought processes and to demonstrate that he possesses the necessary depth and scope of knowledge or skill required in the subject. Just as there are no set formulas for producing a good story or a good painting, so there are no set rules that will guarantee the production of good test items. The guidelines presented here can be summarized by stating that the item should be expressed as clearly and simply as possible. Quite often an item is too difficult and discriminates poorly because the request for information was not stated clearly.

1 This material is excerpted from: Kraft, J. D. Instructor's guide to the construction of writs and examinations. West Point: U. S. Military Academy, 1968. This publication includes examples of questions from two other publications: U. S. Navy, Constructing and using achievement tests. NAVPERS 16808-A. Washington, D. C.: Author, 1949; and U. S. Army, Techniques of military instruction. FM 21-6. Washington, D. C.: Author, 1967, Chap. 13.

1. True-false items. The true-false test item consists of a simple statement which candidates must identify as true or false. However, because of very serious problems with it, this item type is not recommended for use in employment tests.

a. Characteristics.

(1) Positive points. The true-false item can sample wide ranges of material and is easily and objectively scored. It can be written as a factual question or as a question that requires reasoning.

(2) Negative points. The pitfalls and shortcomings of this type of item warrant careful consideration before it is used very extensively. Since there are only two alternative answers, it encourages guessing. Half of the questions might be answered correctly without any knowledge of the subject. In addition, it is very difficult to make a statement which is absolutely true or absolutely false without giving some hint as to the correct answer. Finally, a relatively large number of such questions must be used in an examination if there is any expectation that the test will separate the more knowledgeable candidates from those less knowledgeable. Contrary to appearances, it is the most difficult type of item to write.

b. Principles of construction.

(1) One-half of the items should be true and one-half false.
(2) The true statements should not be consistently longer than the false statements, or vice versa.
(3) The application of knowledge should be required in as many of the items as possible.
(4) The crucial element should come near the end of the statement.
(5) One part of the item should not contain a true idea and another part a false idea. Make it all true or all false.
(6) Double negatives or involved statements should be avoided.
    Poor (confusing): One should not work on any electrical or radio equipment if he is not sure that no circuits are energized.
    Better: Before working on any electrical or radio equipment, one should be certain that all circuits are deenergized.
(7) The words "all," "only," "never," and "always" in the statement should be avoided. These words usually indicate that the item is false.
    Poor (correct answer given away by use of conditional words):
    All airplane propellers are made of aluminum.
    Never run an airplane in a hangar.
    Only casein glue may be used in joining wooden parts of a model.
(8) The words "generally" and "usually" in the statement should be avoided. These words usually indicate that the item is true.
    Poor (correct answer given away by use of conditional words):
    Sea plane floats are generally equipped with rudders.
    In general, jet planes are faster than prop planes.

2. Matching items. Matching items generally include two lists of related words, phrases, or symbols. The candidates are required to match each item in one list with some one item in the second list with which it is most closely related.

a. Characteristics.

(1) Positive points. The matching item is especially valuable for testing knowledge of relationships and making associations. The matching exercise may require candidates to match (1) terms or words with their definitions, (2) short questions with their answers, (3) symbols with their proper names, (4) descriptive phrases with other phrases, (5) causes with their effects, and (6) principles with situations in which the principles apply. A large number of responses can be obtained in a small space and with one set of directions. It can be totally objective and is easy to grade.

(2) Negative points. Overlapping items are difficult to control. Also, the process of elimination can give clues to the correct answer. Finally, it is difficult to use with a standard answer sheet.

b. Principles of construction.

(1) The directions should be specific, indicating on exactly what basis the matching should be done. Include these directions with each matching exercise.
(2) Generally, candidates should be required to make at least five and not more than twelve responses in completing each matching exercise. (If a standard answer sheet will be used, usually only four or five responses are allowed.)
(3) At least three extra items should be included from which responses must be chosen, or responses should be allowed to be used more than once. This tends to reduce the possibility of guessing or answering by a process of elimination.
(4) Only related materials should be included in any one exercise. Nothing should be listed in either column that is not a part of the subject in question.
(5) The column containing the longer phrases or clauses should be placed on the left-hand side of the page. If answer sheets are not used, candidates should record their responses at the left of this column. This makes the selection easier.
(6) At least three plausible responses should be included from which each correct response must be selected. If, in order to do this, it is necessary to include three times as many items in one column as in the other, another type of test item should be used.
(7) In setting up the test, all of a given matching exercise should appear on one page or on facing pages.

3. Completion items. The simple completion item requires the candidate to recall and supply one or more key words that have been omitted from statements. These words, when placed in appropriate blanks, make the statement complete, meaningful, or true. The statements may be isolated and more or less unrelated, or they may be combined to form short paragraphs that carry a continuous line of thought.

a. Characteristics.

(1) Positive points. This item form can be used to test a candidate's knowledge of specific facts, and it demands accurate information. It can be used effectively to sample a wide range of subject matter.
This item form is particularly suitable for testing memory for material which must be recalled in a precise way, such as technical terms that must be known, abbreviations, weights, measures, and tolerances.

(2) Negative points. This item form is difficult to score. It is very difficult to develop a scoring key which includes all acceptable responses.

b. Principles of construction.

(1) The item should be so written that only a limited number of responses will be acceptable. This is very difficult to insure.
    Poor (many answers possible): One should never smoke in a ship's ________.
    Better (only one answer): The wrench used to measure the amount of twisting force being applied to a nut is called a ________ wrench.
(2) It is generally best to use only one or two blanks in a single sentence.
    Poor (for mind readers): A ________ is a device for converting ________ energy into ________ energy.
    Better: A generator is a device for converting mechanical energy into ________ energy.
(3) If possible, the blanks should be kept near the end rather than the beginning of the sentence.
    Poor (must read twice): A ________ is used to prevent an overload in an electric light circuit.
    Better: A device to prevent an overload in an electric light circuit is a ________.
(4) The question should be so arranged that the answer may be written in a column at the right. This will increase scoring speed and accuracy.
(5) Except in rare cases, verbs should not be omitted.
(6) Statements should not be copied directly from textbooks to make a completion item.
(7) The statement should be complete enough that there can be no doubt as to its meaning. Avoid being too brief.
(8) Only those key words which the candidate should know should be omitted. Do not ask for the recall of trivial details.

c. Related item forms. These forms are similar to the completion and short essay.

(1) Short-answer items. The simplest recall item is the question or statement that demands a short answer to be written in given spaces.
(2) Listing items. These are very similar to the above and are frequently considered essay items. The candidate is asked to list or outline the parts of a problem or subject.
(3) Problem-solving items. Data are presented to the candidate, and he is asked to manipulate the data to obtain an answer.
(4) Identification items. The candidate is asked to label parts of a diagram, identify a group of formulas, or write the titles of books after authors' names.

4. Essay items. The essay item calls on a candidate to describe, compare, discuss, or explain some aspect of the subject he is studying. It is often used when the number of candidates is below 15.

a. Characteristics.

(1) Positive points. The essay item can be used effectively to measure a candidate's ability to organize and express thoughts. It is very useful for selecting persons for some high level jobs where the ability to organize and present facts is extremely important, such as for hearing examiners. In comparison with objective-type items, good essay items are not easier to write, because the Item Writer must consider in advance exactly what will be considered acceptable for an answer.

(2) Negative points. Its greatest disadvantage is that the subjectivity involved in grading significantly reduces the accuracy and consistency of the scores assigned by the rater.
Its scoring may become subject to the rater's interest and range of knowledge and other similar factors. Handwriting, style, grammar, and other commonly extraneous factors significantly influence the grade assigned. Responding to an essay item requires much candidate time; this greatly restricts the sample of test material which may be asked. Unfortunately, essay items allow the candidate, rather than the Item Writer, to sample and treat in depth that part of the subject he knows best. Scoring the items requires much more time than is required for other item types. On the one hand, the essay item provides candidates an opportunity to bluff; on the other hand, candidates who know the subject matter well but are not skilled in writing may be penalized on an essay examination.

b. Principles of construction.

(1) Specific answers should be requested. The question should be worded in such a manner that it provides the candidate with an outline that he can use in formulating his response (unless the question is designed specifically to measure this ability to outline and expand his subject).
(2) The item should be stated in a simple, direct manner.
(3) The candidates should be told the basis for grading at the time they are given the test.
(4) The essay item should be designed to require candidates to organize, outline, compare, explain why, give a reason, describe, or explain how.

c. Principles of scoring.

(1) The Item Writer should write out, before the test is given, the answer or answers he will accept as sufficient and correct. He should include every point that is to be accepted. He should use this as a standard in grading the test papers.
(2) He should not grade an entire test paper at one time. Rather, he should score one essay item on all the test papers before he proceeds to the next essay item.
(3) The Item Writer generally should allow one unit of credit for each point covered in the answer. However, if a particular point or principle is considerably more important than others, it should be given more than one unit of credit. The grader then merely totals the points earned.
(4) Code numbers should be used instead of names on the candidates' papers where the candidates might be known by the grader.

Appendix IV-D

Item Writing Rules for Multiple-Choice Tests of Job Knowledge

A. Concept measured

1. The concept measured must be important to the ability to perform the job for which the test is being used.
2. In general, limit the item to one concept (or one specific type of error) unless synthesis of concepts is being tested.
3. Avoid items which require reference to an authority to determine the correct answer.
4. When more than one item is based on the same set of information, do not make a correct answer to one item dependent upon answering another item correctly.

B. Stem

1. Define the question or task in the stem.
2. Put as much of the question or statement as possible in the stem.
3. State the stem as concisely as possible, but provide the necessary information.
4. State the question at the level of the candidate.
5. Avoid terminology which is not important to knowledge of the concept or to the job.
6. For problems involving computation, keep the values as simple as possible while still testing the desired skill.
7. Avoid asking the candidate his opinion.
8. Avoid imprecise words.
9. Avoid double negatives.
10. Where a single negative is needed to measure the concept, underline it or put it in capitals: LEAST, NEVER, NOT, etc.
11. Where items require the solution to a problem, check to determine that the right answer cannot be obtained by a common wrong method. For example, 2 squared = ? (2 + 2 is also equal to 4.)
12. Specify in the stem the unit of measure in which answer options will be given. Ordinarily, the unit of measure need not be repeated in the options.
13. The item should have face validity, i.e., it should seem a reasonable question to ask.

C. Answer options

1. There must be one and only one correct answer or best answer.
2. The correct option should generally be at the same level of difficulty as the distracters.
3. Avoid unnecessary wordiness.
4. Do not use options which are based on careless reading, unless reading is being tested (e.g., wrong unit of measure, "most" rather than "least").
5. In true-false, multiple true-false, or none-of-these type items, make each option statement clearly right or wrong. "Best answer" cannot be used with these item types.
6. Do not use "None of these" with numerical answers where the answer can appear reasonably in more than one form (e.g., 44/7 and 6.28).
7. Vary the correct answer position; avoid a pattern. (A simple tally of keyed positions, sketched following these rules, is one way to check.)
8. If options, such as numerical options, have a logical sequence, put them in logical order. (If this conflicts with the preceding rule, change the order where the logical sequence is not obvious, but avoid making illogical order a clue.)
9. Do not put in the options information which could reasonably be put in the stem.
10. Avoid options which appear unrelated to the job.
11. Avoid "specific determiners," which give a clue to the correct answer for a person who does not understand the correct answer. For example:
    a. a distracter which does not follow grammatically from the stem;
    b. an option which can be judged correct or incorrect without reading the stem;
    c. synonymous options (which rule out both options for a person who recognizes the equivalence);
    d. an option which includes another option (for example, less than 5, less than 3; in China, in Peking; all of the above);
    e. implausible distracters;
    f. similar terminology in two options if one is correct;
    g. similar terminology in the correct option and the stem;
    h. nonparallel options;
    i. mutually exclusive answer options, where one is correct;
    j. a correct answer which is longer than the distracters;
    k. qualifiers in the correct answer (probably, ordinarily, etc.), unless also used in the distracters;
    l. words such as "never" or "always," which suggest a wrong option;
    m. a correct option that differs from the distracters in favorableness, style, or terminology.

D. Format

1. Put different options on the same line only when all can be put on one line. Otherwise, list the options vertically.
2. If an answer sheet with numbered spaces must be used with numerical answers, put parentheses around the option number and a space between the option number and the option. (For example, (1) 64, not (1)64; certainly not 1. 64.)
3. If the options include the whole numbers 1, 2, 3, 4, or 5, and answer sheet spaces are numbered, put such a whole number with the correspondingly numbered space. (Ignore Rule C8 above in this case.)
4. Single space the stem and the options, except where fractions or symbols require more space. Use 1-1/2 or double spacing between the stem and the options, and between the options, where space permits.
5. Use vertical fractions. (For example, write 23 over 2k as a built-up fraction, not 23/2k.)
6. Put all of an item on the same page.
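The option-level rules above lend themselves to rough mechanical spot checks once a draft item pool has been typed. The following Python sketch is offered as an illustration only; it is not part of the Kit, and the sample items, option text, and length threshold are hypothetical. It tallies keyed answer positions (rule C7) and flags a keyed option that is conspicuously longer than its distracters (rule C11, point j).

    # Illustrative sketch (not part of the Kit): two mechanical checks on a draft item pool.
    # The items below are invented placeholders; only the structure matters.
    from collections import Counter

    items = [
        # (list of option texts, index of the keyed option)
        (["raise the load", "lower the load", "lock the brake", "sound the horn"], 2),
        (["10 ft", "15 ft", "20 ft", "25 ft"], 2),
        (["at the start of each shift, after the equipment has warmed up",
          "weekly", "monthly", "annually"], 0),
    ]

    # Rule C7: a lopsided count of keyed positions suggests a pattern.
    key_positions = Counter(key for _, key in items)
    print("Keyed positions used:", dict(key_positions))

    # Rule C11(j): flag a keyed option that is much longer than every distracter.
    for number, (options, key) in enumerate(items, start=1):
        longest_distracter = max(len(opt) for i, opt in enumerate(options) if i != key)
        if len(options[key]) > 1.5 * longest_distracter:
            print(f"Item {number}: keyed option is much longer than its distracters.")

Checks of this kind supplement, but do not replace, the judgmental review covered by the Item Review Guide (Appendix IV-G).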
Appendix IV-E

Suggestions for Developing Item Ideas1

The following question leads may suggest an approach when one encounters difficulty in developing an item idea:

What conclusions should be drawn from ___?
Which of the following describes the concept of ___?
What is the effect of ___?
What should be done?
When should the worker ___?
What should the worker do when ___?
What constitutes an error?
When ___ happens, what (else) will happen?
What is likely to be the result of ___?
What recognized principle is violated?
Which is best for the purpose of ___?
What is the important difference between ___ and ___?
How are ___ and ___ alike?
What is the cause of ___?
Under what condition(s) does (is, are) ___?
What purpose is served by ___?
Which comes first in order to ___?
Which comes last in order to ___?
Which follows ___?
Which step has been left out?
Those who agree on ___ (theory) support it because ___.
Which alternative doesn't belong among ___ according to the principle of ___?

1 These suggestions came from USCSC item writing materials.

Appendix IV-F

Instructions Regarding Item Format

Newly Written Items

Use 5 x 8 cards, or cut sheets of 8-1/2 x 11 heavy paper in half.

Front of card or half-sheet:

1. Using black ink, start about 1/4 inch from the top and 1 inch from the left margin and print the item. Keep it compact, but make it legible. (These handwritten copies may be used for reproducing, and small print will facilitate reproducing more items per page.)
2. Letter the answer options (A), (B), (C), (D), and (E) beneath the stem. Unless all options can be put on the same line, put the options in a vertical list with the option letters in line with the left-hand margin of the stem. (Options are sometimes numbered, but letters are preferable when answers are numerical.)
3. If several items are based on the same paragraph or set of data, write the common information on one card and each of the items on a separate card. Clip the items together.
4. When a drawing accompanies an item, refer to it in the stem of the item. For example, "From the data shown in Figure ___."
5. The unit of measurement in which the answer is provided should be given in the stem. For example, "What is the distance, in feet?"
6. If negatives are used, they should be capitalized or underlined, for example, NEVER, NONE, EXCEPT, etc. Similarly, any important word or phrase which could be overlooked in reading should be capitalized or underlined: "What is the LEAST ___?" If a problem may be solved in inches and the answer is given in feet, capitalize or underline the unit of measurement.

Back of card or half-sheet:

1. Put the answer key in the upper left corner.
2. Put the concept tested in the upper center; also include appropriate outline identification.
3. Record the Writer's last name (first initial, if necessary) in the upper right corner.
4. If a passage is taken from a copyrighted publication, record the source. A credit by-line should be provided at the end of the passage. If the passage is original with the Item Writer, no reference need be made. If it is unrecognizable as a modification of a published passage, record the reference under the author's name on the back and let someone else double-check recognizability.
5. In items which require solution of a problem, give the basis for any incorrect answer option which is not self-evident.
6. Try to keep the information for steps 1 to 5 above compact, to allow room for reviewers' comments below.
In reviewing an item authored by someone else, the reviewer records his initials at the left margin, his comments in the center section, and the approximate time for answering at the right. The codes shown on the Item Review Guide, Appendix IV-G, may be used, where appropriate, for comments and time.

Pretested Items

1. Use the same procedures for the front of the file card as are given for newly written items, except that the item should be typed or cut from the test copy and pasted.
2. Follow the instructions for the back of the card as given above for steps 1, 2, and 4. Steps 3 and 5 are optional.
3. Record the item analysis data on the back of the card: N = number in sample; Nt = number who answered this or a later item; p = number correct/Nt; O = number who omitted this item but answered a later item; NR = number who omitted this and all subsequent items.

New Item

Front of item card or half-sheet: [facsimile of a handwritten sample item, not legible in this copy]

... use an option which is clearly wrong to the person who understands the principle, in order to avoid attracting to this option those with moderate to high scores. If the index of discriminating power is high, it may be better to leave the item in its present form. It is possible that only three plausible answers can be written and that those choosing this particular option were doing so on the basis of random guessing.

2. A wrong option which attracts more high scorers than low scorers. Look for the possibility that this option is defensible as a correct answer. Reread the item as if you had chosen this answer and were trying to defend it. Delete the option or revise it to make it more clearly unacceptable. If few low scorers chose it, it should be replaced, since it is contributing negatively to prediction.

3. A correct option which attracts more low scorers than high scorers. Look for the possibility that the correct answer can be obtained by a wrong method of reasoning. Change the values or wording to eliminate this possibility.

4. A correct answer which attracts fewer high scorers than another option. If the biserial is relatively low, look for two defensible answers. If the biserial is relatively high, it may be that this is a very difficult, but good, item. Unless the correct option is very unattractive to the low scorers, or an ambiguity is discovered which can be corrected, the item should be dropped or revised, since chance is likely to be playing more of a role than understanding in gaining credit for this item.

5. A high number of omits among high scorers (or the mean of the omits is high). This may be due to limited time, to a difficult or complex item which the examinee has decided to leave for the moment, or to ambiguities which leave the examinee uncertain.

It may help to review both the Item Writing Guide and the Item Writing Rules before attempting to revise items.

B. Evaluating Speed Characteristics

If there are many omits, make a plot of the number of items answered correctly against the total number of items for which an answer was recorded (i.e., R vs. R + W); examples of such plots are shown below.
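How these quantities might be tallied from the pretest answer records can be sketched briefly. The sketch below is illustrative only and is not part of the Kit's procedures: it assumes each examinee's record is a list with one entry per item, coded "R" (right), "W" (wrong), or None (omitted), and the function names are arbitrary. It covers both the card statistics N, Nt, p, O, and NR and each examinee's R and R + W values used in the plots.

```python
def item_statistics(records, item_index):
    """N, Nt, p, O, and NR for one item, as defined on the back of the pretested item card."""
    n = len(records)
    nt = o = nr = right = 0
    for rec in records:
        answered_here = rec[item_index] is not None
        answered_later = any(a is not None for a in rec[item_index + 1:])
        if answered_here or answered_later:
            nt += 1
            if rec[item_index] == "R":
                right += 1
            elif not answered_here:
                o += 1          # omitted this item but answered a later one
        else:
            nr += 1             # omitted this item and all subsequent items
    p = right / nt if nt else None
    return {"N": n, "Nt": nt, "p": p, "O": o, "NR": nr}

def speed_points(records):
    """(R, R + W) for each examinee, for the plot of rights against items attempted."""
    points = []
    for rec in records:
        r = sum(1 for a in rec if a == "R")
        attempted = sum(1 for a in rec if a is not None)
        points.append((r, attempted))
    return points

# Three short illustrative records for a four-item pretest:
records = [
    ["R", "W", "R", "R"],      # answered every item
    ["R", None, "R", None],    # omitted item 2, did not reach item 4
    ["W", "R", None, None],    # stopped after item 2
]
print(item_statistics(records, 1))   # N=3, Nt=3, p=0.33, O=1, NR=0
print(speed_points(records))         # [(3, 4), (2, 2), (1, 2)]
```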
[Three illustrative scatter plots of R against R + W (axes marked to 50), labeled 1, 2, and 3, appear here in the original.]

The first diagram represents the kind of plot one would find if nearly everyone recorded an answer to every item; the scores are distributed along the right-hand margin. This is considered a power test. The third diagram represents a situation in which each examinee tends to answer correctly all items he attempts. This is the kind of distribution one expects to find for a test of speed in doing something which contains very easy items. The middle diagram shows a test which measures both speed and power. For a test where speed is not critical, the Measurement Specialist might aim for a distribution between 1 and 2, since many will omit those items they are unable to answer.¹

¹ A more extensive discussion of this plot and its use is given in Swineford, 1974, pp. 9-12. (This manual was prepared for internal use, but the discussion is generally applicable.)

As a further check on whether sufficient time has been allowed, check the number who did not reach items toward the end of the test (NR). If everyone recorded an answer to the last item, the test is generally considered to be not speeded. Before making this assumption, however, determine whether easy but more time-consuming items have been answered. If many of the high group omitted many of the items, this is evidence that they were stopped by time and simply were doing those which looked easiest to them. If 80% complete the test with few omits among the high scorers, and if almost all get 75-80% of the way through, the test is generally considered to be primarily testing power.

C. Estimating Test Mean and Standard Deviation

When pretest and final test populations are comparable in ability level:¹

M = Σp (i.e., add the p's of the selected items to obtain an estimate of the mean)²

SD = Σ √(pq) · r_pbi (i.e., for each item, find the product of p times 1 - p; multiply the square root of the product by the point biserial; and sum over items to obtain an estimate of the standard deviation)

When pretest and final test populations are not comparable in ability level:

To obtain a rough estimate of the mean: convert the pretest p's to deltas (using Table A-1 in Appendix VI); estimate the operational form deltas (equated deltas) as described in the preceding Appendix IV-I; convert the equated deltas to p; and sum these p's over items (Angoff, 1971, p. 586). It is probably inadvisable to try to estimate the standard deviation where the two populations are not comparable, because of the change in point biserials to be expected in going from one population to the other.

¹ See Angoff, 1971, p. 586, and Gulliksen, 1950, pp. 376-377.

² To the extent that the pretest Nt (number answering this or a later item) differs from N (total number of examinees), use of p = N+/Nt will ordinarily yield an overestimate of the mean if those completing the test have more ability in the area tested than those who omit later items. If many fail to complete the pretest and if the final test will be similarly speeded, it may be better for estimating purposes to use p = N+/N.

Appendix IV-K

Test Review Guide and Test Review Sheet

Test Review Guide

PLEASE READ THESE INSTRUCTIONS BEFORE LOOKING AT THE TEST.

The accompanying test materials are to be locked up when not being worked on and should not be shown to others.

Your approach to this test should be toward evaluating the test as a whole. The items have been reviewed individually by the members of the Item Writing Committee both before and after pretesting.
An editor has reviewed the test for grammar, expression, punctuation, and spelling. We especially need the Panel's review of the overall test: Does it appear to do what it is designed to do? You will be asked specific questions about the test in these instructions. Although your primary focus will be on the total test rather than on individual items, the first few times you go through the test, make notes on problems you encounter with individual items, either in answering the item or in reviewing the test relative to specific questions. Make the note long enough to enable you to recall the difficulty later, but try not to become sidetracked by a particular item. Please go through the test step by step, rather than trying to do two steps at once. This will facilitate focusing on one problem at a time.

1. Take the test as if you were a candidate, recording responses on the enclosed answer sheet. It is difficult to see the test as the candidate will see it if you study the test over before taking it. Read the instructions before starting to count time. Note the time you start and the time you finish. Try to finish in the time allotted, as you would if you were a candidate. If you do not complete it, draw a line under the last item attempted and continue working. It would be most helpful to your later evaluation if you change to a different color pencil at the end of the time limit. DO NOT CHECK YOUR ANSWERS AGAINST THE SCORING KEY UNTIL YOU HAVE FINISHED.

2. When you have completed the test, check your answers against the scoring key and go over the items with which you find disagreement. For each item there must be a completely defensible answer, and only one. If you feel an answer other than the one keyed is defensible, circle both answers on the answer sheet and record your comments on the Test Review Sheet under the section for "Comments on Specific Items."

3. Fill in the blanks and answer the questions on the Test Review Sheet. In providing information on timing, you will wish to consider whether your own time is representative of what should be expected of the candidate. In considering difficulty, look through the test again and consider whether any of the items might cause special difficulties for any on-the-job persons whom you know; e.g., is the terminology appropriate? Would a person who has been working on the job in one locality have an unfair advantage over someone equally qualified who comes from another area and is not familiar with local practices and terminology? In considering coverage, keep in mind the test specifications which the Panel set up. Does the content appear to be well balanced relative to the intent?

4. Go over the test once more, this time item by item, reading the stem and each of the options slowly. Look especially for these possible weaknesses:
- Does any item have a distracter which could be defended as the correct answer?
- Do any items depend on correctly answering another item?
- Does information in any item give a clue to the correct answer in another item?
- On rereading any item, did you get an interpretation different from that which you had the first time?
- Record any comments you have on the Test Review Sheet under "Comments on Specific Items."

Attach your comment sheets, answer sheet, test key, and test specifications to your copy of the test and file them in a secure place until the meeting. Be sure to bring all of these items with you to the meeting.

TEST REVIEW SHEET

Reviewer ____________________    Test ____________________    Date ____________

Stage:  Pretest draft ____    Final draft ____

If more space is needed for comments, use another sheet.
Read the separate set of instructions in the Test Review Guide before filling in these sheets.

1. Appropriateness of test length:

   Present timing is about right ____
   Should allow ____ minutes for the test, or ____ items for the present time.

   Comments:

2. Appropriateness of difficulty:

   Are the first two items easy enough that all candidates will feel at ease in answering them? (It is not necessary that they be able to answer them correctly.)

   Comments:

3. Appropriateness of coverage:

   Did the content as a whole seem appropriate? Would qualified employees whom you know be likely to do well? Would the poorly qualified be likely to do substantially less well, i.e., would the test be likely to distinguish the qualified employee from the poorly qualified employee or applicant? Did you feel in taking the test that you were being tested on relevant knowledge? Will the candidates be likely to consider it relevant? Will minority candidates encounter any difficulties with the test which are unrelated to job performance? Does the content seem to you to fulfill the test specifications? Were some aspects covered too heavily at the expense of others?

   Comments:

The Test Review Continuation Sheet (following page) is to be used for comments on specific items.

Test Review Continuation Sheet
(Comments on Specific Items)

Reviewer ____________________    Date ____________

Test:  Pretest ____    Final ____

Code:
0 - Retain, no change necessary
1 - Retain with suggested modifications
2 - Retain if problem can be corrected
3 - Discard for reason given

Only those items on which you have comments need be listed. It will be assumed that those not listed are judged to be satisfactory.

Item Number        Comments        Code

Appendix IV-L

Brief Guide on Norming and Equating Procedures

Norming Procedures

To transform a set of raw scores to a given scale, use the following formula:

Y = (S_Y / S_X)(X - X̄) + Ȳ

where Y is the scale score equivalent to a raw score of X; X̄ and S_X are the obtained raw-score mean and standard deviation; and Ȳ and S_Y are the chosen scale-score mean and standard deviation.

If one has chosen to use a scale with a mean of 50 and a standard deviation of 10, the following steps would be taken: substitute into the equation the obtained values for X̄ and S_X, with Ȳ = 50 and S_Y = 10. This will give a formula of the form Y = bX + a. To obtain a table of scaled Y values equivalent to each X value, substitute each possible value of X in the obtained formula.

Equating Procedures

To transform scores on a current test to an established scale, one standard procedure is as follows:¹ Use in the pretest a set of items from an earlier form of the test for which scores on the established scale have already been obtained. (A set of 20 items, or 20% of the number of items in the test, whichever is greater, is recommended.) Obtain the statistics indicated in the following table, which also defines the symbols used in the equations below.

¹ These procedures, with adapted notation, are taken from Angoff, 1971, pp. 579-580. Several procedures are described there which depend on the randomness of the groups, the manner of administration, and the use of common items. He also includes procedures developed by Swineford and Fan for converting scores through item statistics.

                          Earlier form (group b)            Current form (group a)
                          Total test     Common items       Total test     Common items
                          raw score W    raw score U        raw score X    raw score U
  Mean                    W̄_b            Ū_b                X̄_a            Ū_a
  Standard deviation      S_Wb           S_Ub               S_Xa           S_Ua

where the subscript t is used to indicate data on the common items for the two populations combined.
Obtain the correlation, r_WU, between W and U for the earlier population. Obtain the correlation, r_XU, between X and U for the current population. Substitute the obtained values in the following equations:¹

X̄_t = X̄_a + r_XU (S_Xa / S_Ua)(Ū_t - Ū_a)

W̄_t = W̄_b + r_WU (S_Wb / S_Ub)(Ū_t - Ū_b)

S²_Xt = S²_Xa + (r_XU · S_Xa / S_Ua)² (S²_Ut - S²_Ua)

S²_Wt = S²_Wb + (r_WU · S_Wb / S_Ub)² (S²_Ut - S²_Ub)

¹ These equations are based on the assumption that the regression slope r_XU · S_X / S_U and the error variance S²_X (1 - r²_XU) are the same for group t as for group a, with the corresponding assumption for W, U, and group b (Angoff, 1971, p. 580).

Substitute these four values in the equation:

(W - W̄_t) / S_Wt = (X - X̄_t) / S_Xt

(Note that the denominators are the square roots of the variance values obtained above.) This gives an equation of the form W = AX + B, where A = S_Wt / S_Xt and B = W̄_t - A X̄_t, such that raw scores on X (the current form) can be converted to raw-score equivalents on W (the earlier form). A table of equivalents for X and W scaled scores can then be set up using the table of equivalents established earlier for W scaled scores and raw scores, or an equation for converting X to scaled scores can be produced by substituting the equation obtained above in the equation derived previously for Form W for converting to scaled scores. If Y = mW + k is the equation for converting W raw scores to a standard-score scale, substitute AX + B for W and obtain the formula Y = m(AX + B) + k, which can be used to convert X scores directly to scaled scores.

References

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Diederich, P. Short-cut statistics for teacher-made tests. Princeton, N. J.: Educational Testing Service, 1973.

Fan, C. T. Item analysis table. Princeton, N. J.: Educational Testing Service, 1952.

Appendix V

Explanatory Articles

A. Test Difficulty
B. Chance Scores
C. Basic Testing Principles for the Nonspecialist

Appendix V-A

Test Difficulty

The usefulness of a test in a given selection situation depends in part on the difficulty characteristics of the items. The following concepts are relevant:

1. With indices of item discriminating power such as one normally finds in achievement tests, the spread of scores tends to increase as item difficulties approach .50. If item discriminating power indices are very high, using items all with the same difficulty could give a bimodal distribution, but this is unlikely to be found in practice.

2. When chance responses are not involved (i.e., items are write-in answer only), selection at a given cut-point is theoretically improved by use of items which are at the .50 difficulty level for persons at the ability level represented by the cut-point.

3. For multiple-choice tests the average item difficulty for those at the cut-point should be above .50, depending on the number of answer options. Test specialists are not in complete agreement regarding the best average item difficulty.

4. There is also disagreement on the importance of aiming at a narrow range of item difficulties when examinees are to be ranked, and it has been suggested that less attention be given to the spread in item difficulty and more to other factors which affect test results (Nunnally, 1967, pp. 250-254; Swineford, 1974, p. 13). The following arguments are offered in favor of a wider range of item difficulty:

   a. It is not easy to develop all items within a narrow range of .50 to .80.

   b. Some important concepts may not readily be tested in that range.
   c. Gains in reliability from a greater concentration of item difficulty may be offset by possible losses in validity due to the elimination of some concepts.

The average item difficulty should be such that random guessing will be a minimal factor in scores which fall in the range where discrimination is wanted. Where discrimination is desired primarily in the middle of the score range, one might reasonably seek an average difficulty of .60 to .70 for the usual multiple-choice test with 4 or 5 choices (Tinkelman, 1971, p. 63; Lord & Novick, 1968, p. 392). A review of Educational Testing Service test analyses over the years indicates that, as long as a few very easy items (p = .85 to .95) are included in the test, most scores will fall outside the range reasonably attributable solely to chance. This situation occurs even when the average p value falls below .50 (Swineford, 1975). Scores throughout the total range will, of course, be influenced by guessing, though random guessing may be expected to decrease as scores increase.

If one were devising a test to select scholarship winners or, on the contrary, to screen out persons with essentially no knowledge, a very difficult or a very easy test, respectively, would be reasonable. In most cases where a cut-point is used, factors other than the test will be considered in selection. In such situations the cut-point will likely be based on considerations of what constitutes an acceptable level of knowledge and will be used only to screen out those with less knowledge, the final selection being made by other procedures.

The spread in item difficulties will be influenced somewhat by the selection procedure. If one is using a cut-point, statistical goals will dictate a narrow range of item difficulties, but one may wish to include a few highly important concepts which fall outside the middle difficulty range. If one wishes to compare examinees throughout the range, a wider spread in item difficulties may be necessary to test adequately those at the extremes of the ability distribution. As Swineford (1974) explains:

    Theoretical studies have shown that a test composed of items of middle difficulty for a group is more reliable for that group than one whose items cover a wide range in difficulty. In practice, however, the difference is so slight that it can be disregarded in favor of other test characteristics. Item biserial correlations are closely related to score variability, for example. When a test is to be used for various subgroups, which probably differ from one another with respect to level of ability, or when different test users select different cutting scores, then of course it is impossible to provide middle difficulty for each subgroup; for this reason a range in item difficulty is desirable. (p. 13)

One should avoid item difficulties below about .30, where chance may be contributing more to correct answers than knowledge. One or two items over .90 can help put the candidates at ease. The majority, however, should be in the range .35 to .85.

Of the two indices of difficulty which are used, p is the more common, but delta has certain advantages. The following outline compares them.

p: The number of examinees who answer this item correctly, divided by the number who record an answer to this or a subsequent item.

Delta: The standard score of the item, scaled so that the mean of delta is 13 and its standard deviation is 4.

Advantages of p:

1. More familiar.
2. When p values appropriate to the population are known, they can be summed to obtain the estimated mean of a newly assembled test. (See Appendix IV-J for formula and limitations.)

Disadvantages of p:

1. A given difference in p values has a different meaning at different points of the scale.
2. Cannot as easily be used in converting from one difficulty scale to another.

Advantages of delta:

1. Can be used for equating the item difficulties obtained on one population to those on a similar population of a somewhat different ability level.
2. A given range in delta values represents roughly the same ability increment at different levels of difficulty.

Disadvantages of delta:

1. Is not as easily computed.
2. Less information is available on it in the literature.

Appendix V-B

Chance Scores

In multiple-choice tests there is a possibility that the examinee can, by guessing, receive a substantial score. It is important to take this possibility into consideration. Even though most examinees do not guess at random, some do, especially when the material is so difficult that they have no knowledge at all. Guessing is likely to have more effect on certain individual scores than on test results as a whole.

The score an individual can be expected to obtain on a chance basis is computed as follows.¹ On the average the guesser will be expected to answer correctly a fraction 1/k of those items he does not know, where k is the number of options. For example, in a 50-item, five-choice test, if he guesses at all items, he would be expected to answer 10 items correctly on the average. Sometimes his chance score will be higher, sometimes lower. We are most concerned about how high a score he could get solely by chance. To determine this we compute the standard deviation of chance, √(n(k - 1)/k²), where n is the number of items. For the 50-item, five-choice test, this would be 2.8. Sixteen percent of the time he would be expected to obtain a score 1 or more standard deviations of chance above the chance mean, in this case 10 + 2.8, or 12.8. (Another way of looking at this would be to say that 16% of those who guess at all items would be expected to obtain scores of 12.8 or above. Such a fractional score ordinarily is not obtainable, but it is used here to illustrate the statistical principle.)

¹ These statistics are based on the assumption that each option is equally frequently keyed as correct.

Of course, most will not guess at all items. Suppose an individual records what he believes is the correct answer to t items and guesses at the rest. The same formulas apply, but only to the items for which he guesses the answer. In this case he would be expected to get (n - t)/k points by guessing, on the average. The standard deviation of chance in this case is √((n - t)(k - 1)/k²). As more items are known, less guessing will be done, and the effect of chance responses becomes less. For this reason, if one's goal is that the average examinee will answer 70% of the items correctly, the effect of chance will be relatively small for most examinees.

In a four-choice, 50-item test, the person who knows the answers to 30 items (60% of 50) and guesses at random for the remaining 20 may be expected to obtain a score of 35 or more half the time (30 + (50 - 30)/4 = 35) and a score of 37 or higher 16% of the time (35 + √((50 - 30)(4 - 1))/4 ≈ 37). Thus there is a substantial probability that someone who knows the answers to 60% of the items could correctly answer 70% of the items by guessing.
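The arithmetic in this appendix is simple enough to verify directly. The following is a minimal sketch only; the function names are arbitrary, the binomial assumptions are those of footnote 1 above, and it merely reproduces the 50-item examples just given.

```python
import math

def chance_mean(n_guessed, k):
    """Expected number right from random guessing on n_guessed k-choice items."""
    return n_guessed / k

def chance_sd(n_guessed, k):
    """Standard deviation of the number right from random guessing (binomial)."""
    return math.sqrt(n_guessed * (k - 1)) / k

# 50-item, five-choice test, guessing at every item:
print(chance_mean(50, 5), round(chance_sd(50, 5), 1))        # 10.0  2.8

# Four-choice, 50-item test; examinee knows 30 items and guesses at the other 20:
known, guessed, k = 30, 20, 4
mean_score = known + chance_mean(guessed, k)                 # 35, reached half the time
high_score = mean_score + chance_sd(guessed, k)              # about 37, reached ~16% of the time
print(round(mean_score, 1), round(high_score, 1))
```

As footnote 1 notes, these figures assume that each option is equally often keyed correct and that the guessing is truly random.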
On the other hand, persons who actually know 30 of the 50 items are not likely to make random guesses; they will have partial knowledge which will lead them to some correct answers and some incorrect answers. Accordingly, wholly random guessing will occur infrequently on most tests. An individual is ordinarily better off trying to answer those questions about which he has some knowledge, since his ability to eliminate some answers as wrong will enhance his probability of answering correctly, as it should.

As the test is made more speeded or more difficult, guessing increases and correction for guessing (by the formula R - W/(k - 1)) becomes desirable; however, if everyone records an answer to every item, the ranking of the examinees will be the same regardless of whether such a scoring formula is used.

One can reduce the effect of guessing by increasing the number of answer choices per item, but little is gained beyond five options. As the number of options increases, so does the time spent reading noneffective options. A better plan is to use fewer options and include more items. In general, for a power test where the mean is around 70% of the total number of items, a rights-only score is preferable. See also the discussion of the problem of guessing by Thorndike (1971, pp. 59-61).

Appendix V-C

Basic Testing Principles for the Nonspecialist

1. A test is a sampling process; it measures a sample of all the understanding needed for a given purpose. Rarely does it measure everything the examinee should know.

2. A person who can answer most of the items correctly is considered to have more knowledge or skill in the subject area measured by these items than one who can answer correctly a significantly smaller number of items.

3. If a person takes many parallel forms of the test, his scores will be expected to vary around his "true score" (i.e., his average score over all possible parallel tests). This variability is represented by the standard error of measurement (usually referred to as the SEM).

4. The mean score of a test is the average score of those taking it, and it is a measure of the difficulty of the test relative to the total number of items in the test.

5. The standard deviation (SD) of scores on a test indicates how much scores vary around the mean. If all the scores are near one another, the standard deviation will be small. If the scores are spread out, the standard deviation will be large.

6. The larger the standard deviation is relative to the standard error of measurement, the more accurately a test measures whatever it is measuring. This is represented by the "reliability." A test's reliability depends on such things as the number of items, the ability of individual items to discriminate among examinees (see principle 11), the difficulty of the test, and the extent to which different items measure the same thing.

7. A test's validity is the extent to which it measures what it is intended to measure. For job selection tests, it indicates how well the examinees' scores are related to one or more important aspects of what is required on the job.

8. A test can be reliable without being valid, but it cannot have high validity without having high reliability.

9. A test which few examinees finish may be measuring substantially how quickly a person can respond in a test situation, as well as the ability he has in the content area.
(This is called a "speeded test" or a "test of speed.") Allowing more time for such a test may change the rank order of examinees with respect to final scores. When most examinees have time to respond to all items on the test, the test is called a "power test."

10. A raw score (number of items answered correctly) by itself gives little information about an examinee's ability. It must be related to the specific kind of content in the test or to a "norm" group in order to be interpreted. A standard score shows how the examinee stands relative to the "average" person of a defined group in terms of the standard deviation; i.e., it shows whether the individual is 1 standard deviation above the mean, 1.5 standard deviations above the mean, etc.

11. The ability of a test item to distinguish high performers from low performers on the job depends on such things as its relevance to abilities or skills required on the job, ambiguities in the item, defensibility of the intended correct answer or an intended wrong answer, the number of answer options, the difficulty of the item, unintentional clues to the correctness or incorrectness of an option, how the test is designed and administered, etc.

References

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Swineford, F. The test analysis manual (ETS SR-74-06). Princeton, N. J.: Educational Testing Service, 1974.

Swineford, F. Personal communication, 1975.

Thorndike, R. L. (Ed.). Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Tinkelman, S. N. Planning the objective test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Appendix VI

Tables

A. Percent Correct and Delta Equivalents
   1. Table of Delta for Selected Values of p
   2. Table of p for Selected Values of Delta
B. Conversion of r_bis (biserial) to r_pbi (point biserial)
C. Standard Errors of r_bis for Selected Values of r_bis, p, and N

Appendix VI-A-1

Table A-1
Table of Delta for Selected Values of p¹

Read p as the row value (tens) plus the column value (units); for example, p = 72.5% gives a delta of 10.6.

p (tens)                          Units of p
            0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5
90          7.9   7.8   7.6   7.5   7.4   7.2   7.1   6.9   6.8   6.6
80          9.6   9.6   9.5   9.4   9.3   9.3   9.2   9.1   9.0   8.9
70         10.9  10.8  10.8  10.7  10.7  10.6  10.5  10.5  10.4  10.4
60         12.0  11.9  11.9  11.8  11.8  11.7  11.7  11.6  11.6  11.5
50         13.0  13.0  12.9  12.8  12.8  12.7  12.7  12.6  12.6  12.5
40         14.0  14.0  13.9  13.9  13.8  13.8  13.7  13.7  13.6  13.6
30         15.1  15.0  15.0  14.9  14.9  14.8  14.8  14.7  14.6  14.6
20         16.4  16.3  16.2  16.2  16.1  16.0  16.0  15.9  15.8  15.8
10         18.1  18.0  17.9  17.8  17.7  17.6  17.5  17.4  17.3  17.2
 0           —     —     —     —     —     —     —     —     —     —

p (tens)                          Units of p
            5.0   5.5   6.0   6.5   7.0   7.5   8.0   8.5   9.0   9.5
90          6.4    —     —     —     —     —     —     —     —     —
80          8.9   8.8   8.7   8.6   8.5   8.4   8.3   8.2   8.1   8.0
70         10.3  10.2  10.2  10.1  10.0  10.0   9.9   9.8   9.8   9.7
60         11.5  11.4  11.4  11.3  11.2  11.2  11.1  11.1  11.0  11.0
50         12.5  12.4  12.4  12.3  12.3  12.2  12.2  12.1  12.1  12.0
40         13.5  13.5  13.4  13.4  13.3  13.3  13.2  13.2  13.1  13.0
30         14.5  14.5  14.4  14.4  14.3  14.3  14.2  14.2  14.1  14.1
20         15.7  15.6  15.6  15.5  15.5  15.4  15.3  15.3  15.2  15.2
10         17.1  17.1  17.0  16.9  16.8  16.7  16.7  16.6  16.5  16.4
 0         19.6  19.4  19.2  19.1  18.9  18.8  18.6  18.5  18.4  18.2

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.
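Because delta is simply a normalized item difficulty with mean 13 and standard deviation 4, the conversion tabled in Table A-1 (and its inverse in Table A-2, which follows) can also be computed directly through the inverse normal distribution. A minimal sketch, for illustration only (the function names are arbitrary; statistics.NormalDist is the Python standard-library normal distribution):

```python
from statistics import NormalDist

def p_to_delta(p):
    """ETS delta: the normal deviate exceeded by a proportion p of examinees, scaled to mean 13, SD 4."""
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)

def delta_to_p(delta):
    """Inverse conversion, as tabled in Table A-2."""
    return 1.0 - NormalDist().cdf((delta - 13.0) / 4.0)

# Spot checks against the tables:
print(round(p_to_delta(0.90), 1))   # 7.9
print(round(p_to_delta(0.50), 1))   # 13.0
print(round(p_to_delta(0.20), 1))   # 16.4
print(round(delta_to_p(10.0), 2))   # 0.77
```

Rounded as in the tables, these spot checks agree with the tabled entries.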
Appendix VI-A-2

Table A-2
Table of p for Selected Values of Delta¹

Read delta as the row value (units) plus the column value (tenths); for example, delta = 11.3 gives p = .66.

Delta                         Tenths of delta
           .0    .1    .2    .3    .4    .5    .6    .7    .8    .9
19.0      .07   .06   .06   .06   .05   .05   .05   .05    —     —
18.0      .11   .10   .10   .09   .09   .08   .08   .08   .07   .07
17.0      .16   .15   .15   .14   .14   .13   .13   .12   .12   .11
16.0      .23   .22   .21   .20   .20   .19   .18   .18   .17   .16
15.0      .31   .30   .29   .28   .27   .27   .26   .25   .24   .23
14.0      .40   .39   .38   .37   .36   .35   .34   .34   .33   .32
13.0      .50   .49   .48   .47   .46   .45   .44   .43   .42   .41
12.0      .60   .59   .58   .57   .56   .55   .54   .53   .52   .51
11.0      .69   .68   .67   .66   .66   .65   .64   .63   .62   .61
10.0      .77   .77   .76   .75   .74   .73   .73   .72   .71   .70
 9.0      .84   .84   .83   .82   .82   .81   .80   .80   .79   .78
 8.0      .89   .89   .88   .88   .87   .87   .86   .86   .85   .85
 7.0      .93   .93   .93   .92   .92   .92   .91   .91   .90   .90
 6.0       —     —     —    .95   .95   .95   .95   .94   .94   .94

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.

Appendix VI-B

Table B
Conversion of r_bis (biserial) to r_pbi (point biserial)¹
(Assumes r_bis was obtained from a normal distribution)

p:           .50   .40   .30   .25   .20   .15   .12   .10   .08
                    or    or    or    or    or    or    or    or
                   .60   .70   .75   .80   .85   .88   .90   .92
Multiplier:  .798  .789  .759  .734  .700  .653  .616  .585  .548

r_bis                            r_pbi
.26          .21   .21   .20   .19   .18   .17   .16   .15   .14
.28          .22   .22   .21   .21   .20   .18   .17   .16   .15
.30          .24   .24   .23   .22   .21   .20   .18   .18   .16
.32          .26   .25   .24   .23   .22   .21   .20   .19   .18
.34          .27   .27   .26   .25   .24   .22   .21   .20   .19
.36          .29   .28   .27   .26   .25   .24   .22   .21   .20
.38          .30   .30   .29   .28   .27   .25   .23   .22   .21
.40          .32   .32   .30   .29   .28   .26   .25   .23   .22
.42          .34   .33   .32   .31   .29   .27   .26   .25   .23
.44          .35   .35   .33   .32   .31   .29   .27   .26   .24
.46          .37   .36   .35   .34   .32   .30   .28   .27   .25
.48          .38   .38   .36   .35   .34   .31   .30   .28   .26
.50          .40   .39   .38   .37   .35   .33   .31   .29   .27
.52          .41   .41   .39   .38   .36   .34   .32   .30   .28
.54          .43   .43   .41   .40   .38   .35   .33   .32   .30
.56          .45   .44   .43   .41   .39   .37   .34   .33   .31
.58          .46   .46   .44   .43   .41   .38   .36   .34   .32
.60          .48   .47   .46   .44   .42   .39   .37   .35   .33
.62          .49   .49   .47   .46   .43   .40   .38   .36   .34
.64          .51   .50   .49   .47   .45   .42   .39   .37   .35
.66          .53   .52   .50   .48   .46   .43   .41   .39   .36
.68          .54   .54   .52   .50   .48   .44   .42   .40   .37
.70          .56   .55   .53   .51   .49   .46   .43   .41   .38

For values of r_bis higher than .70, multiply by the multiplier at the head of the column.

¹ Prepared by the author using the formula r_pbi = r_bis · y / √(p(1 - p)), where y is the ordinate of the normal curve at the point dividing the distribution in the proportions p and 1 - p (Lord, F. M., & Novick, M. R., Statistical theories of mental test scores, p. 340).

Appendix VI-C

Table C
Standard Errors of r_bis for Selected Values of r_bis, p, and N¹

                              p
r_bis         .50    .40 or .60    .30 or .70    .20 or .80

N = 2,000
.60          .020       .020          .021          .024
.40          .024       .025          .026          .028
.20          .027       .027          .029          .031

N = 1,000
.60          .028       .029          .030          .034
.40          .035       .035          .037          .040
.20          .038       .039          .040          .044

N = 500
.60          .040       .041          .043          .048
.40          .049       .050          .052          .057
.20          .054       .055          .057          .062

N = 300
.60          .052       .052          .055          .062
.40          .063       .064          .067          .073
.20          .070       .071          .074          .080

N = 150
.60          .073       .074          .078          .087
.40          .090       .090          .095          .104
.20          .099       .100          .104          .113

N = 100
.60          .089       .091          .096          .107
.40          .109       .111          .116          .127
.20          .121       .123          .128          .139

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.

Bibliography

Adkins, D. C. Construction and analysis of achievement tests. Washington, D. C.: U. S. Civil Service Commission, 1947. (Out of print)
Adkins, D. C. Test construction (2nd ed.). Columbus, Ohio: Charles E. Merrill, 1974.

American Psychological Association. Standards for educational and psychological tests. Washington, D. C.: Author, 1974.

Anastasi, A. Psychological testing (4th ed.). New York: Macmillan, 1976.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Bloom, B. S. (Ed.). Taxonomy of educational objectives: The classification of educational goals, handbook 1: Cognitive domain. New York: McKay, 1956.

Bloom, B. S., Hastings, J. T., & Madaus, G. F. Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill, 1971.

Clemans, W. V. Test administration. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Cronbach, L. J. Essentials of psychological testing (3rd ed.). New York: Harper and Row, 1970.

Ebel, R. L. Essentials of educational measurement (2nd ed.). Englewood Cliffs, N. J.: Prentice-Hall, 1972.

Educational Testing Service. Making the classroom test: A guide for teachers. Princeton, N. J.: Author, 1973. (a)

Educational Testing Service. Multiple-choice questions: A close look. Princeton, N. J.: Author, 1973. (b)

Equal Employment Opportunity Commission, 29 CFR 1607. Guidelines on employee selection procedures. 35 Fed. Reg. 12333.

Guion, R. M. Personnel testing. New York: McGraw-Hill, 1965.

Guilford, J. P. Psychometric methods (2nd ed.). New York: McGraw-Hill, 1954.

Guilford, J. P., & Fruchter, B. Fundamental statistics in psychology and education (5th ed.). New York: McGraw-Hill, 1973.

Gulliksen, H. Theory of mental tests. New York: John Wiley, 1950.

Henrysson, S. Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Katz, M. Selecting an achievement test: Principles and procedures. Princeton, N. J.: Educational Testing Service, 1973.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

McNemar, Q. Psychological statistics (3rd ed.). New York: John Wiley, 1962.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Office of Federal Contract Compliance, Equal Employment Opportunity, Department of Labor, 41 CFR 60-3. Employee testing and other selection procedures. 36 Fed. Reg. 19307.

Office of Federal Contract Compliance, 41 CFR 60-3. Employee testing and other selection procedures: Guidelines for reporting validity. 38 Fed. Reg. 4413.

Swineford, F. The test analysis manual (ETS SR-74-06). Princeton, N. J.: Educational Testing Service, 1974.

Tinkelman, S. N. Planning the objective test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Thorndike, R. L. (Ed.). Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Thorndike, R. L. Reproducing the test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

U. S. Civil Service Commission. The selection of employees (prepared for the Department of the Navy). Washington, D. C.: Author, November 1959.

U. S. Civil Service Commission. Benchmark position descriptions for the factor ranking/benchmark approach to the evaluation of general schedule positions, GS-1 through GS-15. Washington, D. C.: Author, July 1973.
U. S. Civil Service Commission. Instructions for field test of the factor ranking/benchmark approach to the evaluation of general schedule positions, GS-1 through GS-15. Washington, D. C.: Author, October 1973.

U. S. Civil Service Commission. Achieving job-related selection for entry-level police officers and firefighters. Washington, D. C.: Author, November 1973.

U. S. Department of Health, Education, and Welfare. Information for examination review panel members. Washington, D. C.: Author, 1962.

U. S. Department of Health, Education, and Welfare. The construction of test questions, part I: Forms of test questions. Washington, D. C.: Author, 1963.

U. S. Department of Health, Education, and Welfare. The construction of test questions, part II: Techniques of item construction. Washington, D. C.: Author, 1967.

U. S. Department of Labor. Handbook for analyzing jobs. Washington, D. C.: Author, 1972.

Wesman, A. G. Writing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Wood, D. A. Test construction. Columbus, Ohio: Charles E. Merrill, 1960.