A Guide to the Development of Job Knowledge Tests: A Reference Kit for Measurement Specialists

United States Civil Service Commission
Bureau of Policies and Standards
Technical Memorandum 76-13
Objective PR10.02

Lynnette B. Plumlee, Ph.D.
Personnel Research and Development Center
United States Civil Service Commission
Washington, D.C. 20415
1976

Abstract

This Kit presents a suggested set of procedures for item writing and assembling job knowledge tests for use in merit system programs. It does not include detailed instructions concerning the establishment of scoring procedures and norms, the conducting of the final test analysis, or the demonstration of test validity. Consideration is given to such topics as the selection of a panel to oversee the project, the planning of selection procedures, the selection and training of item writers, the writing and review of test items, pretesting, item revision, and the assembly, review, and production of the final test. The Appendices supplement the text with such related material as forms, sample memoranda and checklists, procedural guides on various phases of the item writing and review process, a brief guide on norming and equating procedures, explanatory articles on test difficulty, chance scores, and basic testing principles for the nonspecialist, and tables for converting and equating item statistics. A Bibliography provides references to additional sources of background reading.

Preface

This reference Kit was designed primarily to provide psychologists and other measurement specialists with a set of procedures for item writing and assembling job knowledge tests for use in merit systems programs. It should not be interpreted as reflecting the only way such test development should be done. It does not include detailed instructions on establishing scoring procedures and norms, conducting final test analysis, and demonstrating test validity.

Whether the user is a psychologist or not, it is assumed that he or she will have a working knowledge of such fields as elementary psychology, basic principles of testing, individual differences, industrial psychology, elementary statistics, and basic test theory. Generally, such preparation would require at least a master's degree in psychology or education, with significant course work and experience equivalent to a major in educational and psychological measurement. Without such knowledge and experience, a person should not attempt any test development work.

Test development should not be considered complete until the final test has been analyzed and its validity demonstrated. This Kit is being issued without recommended procedures and explanatory material for these phases beyond the production of the final test. Separate documents are planned which will include such topics as how to establish scoring procedures and norms, how to conduct the final test analysis, and how to validate the test for its intended use.

This Kit was prepared by the author in an effort to bring together techniques and procedures used by those who have been extensively involved in test development. References to tests and other sources which will provide further background material are given in footnotes throughout the text and the Appendices.
These sources are listed, with complete reference information, in the Bibliography at the end of the Kit. Complete citations for in-text references appear at the end of the section or appendix to which they apply. Some changes to the text have been made by the staff of PRDC to reflect internal views on different aspects of test development. Your comments and suggestions on improving this Kit, which will be periodically updated, are welcome.

The author prepared this publication in fulfillment of Purchase Order No. 74-2249, U.S. Civil Service Commission. She has a Ph.D. in Psychology from the University of Chicago, was Director of the Test Development Division of the Educational Testing Service, and is currently in private practice as a consultant on test development and validation procedures.

Table of Contents

I. Introduction 1
II. General Considerations in Planning the Test Development Project 1
III. Selecting the Supervisory Panel 3
   A. Function and Duties of the Supervisory Panel 3
   B. Qualifications Important for Panel Membership 4
   C. Procedures for Selection of Panel Members 4
   D. Size of Panel 5
IV. Planning the Selection Procedures and Test Development 5
   A. Preparation Prior to the First Working Meeting of the Panel 5
   B. Panel Work Session 7
V. Selecting the Item Writers 17
   A. Responsibilities of Item Writers 17
   B. Writer Qualifications 17
   C. Number of Items and Writers Needed 18
   D. Steps Prior to Item Writing Workshop 20
VI. Training the Item Writers 20
   A. Orientation 20
   B. Trial Item Writing 21
VII. Item Writing and Reviewing 24
   A. Selected Background References on Item Writing 24
   B. Efficient Use of Workshop Time 25
   C. Achieving Appropriate Item Difficulty 26
   D. Item Review 26
   E. Item Revision 28
   F. Handling the Low-Productivity Writer 29
VIII. Pretesting 30
   A. Purpose of Pretesting 30
   B. Pretest Criterion 30
   C. Population Sample for Pretest 31
   D. Description of the Pretest 32
   E. Administration of the Pretest 33
   F. Statistics to be Collected by the Pretest Analysis 34
IX. Item Revision and Final Test Assembly 36
   A. Preparation of Items 36
   B. Use of Item Analysis Data to Revise Items 36
   C. Tentative Assembly 37
   D. Meeting of Item Writers for Final Test Assembly 38
X. Final Test Review 40
XI. Final Test Production 41
References 43
Glossary 44
Appendices
   I. Standards and Regulations 47
      A. APA Standards for Educational and Psychological Tests 49
      B. FPM Supplements 271-1, 271-2, 330-1, and 335-1, USCSC Guidelines on Examining Practices 51
      C. EEOC Federal Guidelines on Employee Selection Procedures 53
      D. OFCC Federal Guidelines for Reporting Validity 55
      E. Division 14, APA, Guidelines for Choosing Consultants for Psychological Selection Validation Research and Implementation 57
   II. Forms and Sample Memoranda 61
      A. Initial Material for Panel 63
      B. Sample Memorandum for Nominated Item Writers 70
      C. Form DL, Sample Daily Log 72
      D. Form MN, Measurement Need Definition 73
      E. Form BD, Background Data for Test Development 74
      F. Form TS, Test Specifications 75
      G. Form PA, Pretest Analysis 76
      H. Form IAR, Item Analysis Results 77
      I. Sample Memorandum on Control of Test Materials 78
   III. Checklists 79
      A. Schedule Guide 81
      B. Meeting Arrangements Checklist 82
      C. Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests 84
   IV. Procedural Guides 87
      A. Suggested Source Material for Job Analysis 89
      B. Item Writing Guide, Sample Items, and Examples of Item Modification 90
      C. Item Types 104
      D. Item Writing Rules for Multiple-Choice Tests of Job Knowledge 111
      E. Suggestions for Developing Item Ideas 114
      F. Instructions Regarding Item Format 115
      G. Item Review Guide 120
      H. Test Format and Sample Test Instructions 122
      I. Procedures for Obtaining Pretest Statistics 126
      J. Using Pretest Statistics 131
      K. Test Review Guide and Test Review Sheet 135
      L. Brief Guide on Norming and Equating Procedures 139
   V. Explanatory Articles 143
      A. Test Difficulty 145
      B. Chance Scores 148
      C. Basic Testing Principles for the Non-specialist 150
   VI. Tables 153
      A. Percent Correct and Delta Equivalents 155
      B. Conversion of rbis (biserial) to rpbi (point biserial) 157
      C. Standard Errors of rbis for Selected Values of rbis, p, and N 158
Bibliography 159

I. Introduction

This Kit has been prepared to assist the Measurement Specialist (MS) in properly developing job knowledge tests through the test assembly phase.[1] It does not include detailed information on establishing scoring procedures and norms, final test analysis, and test validation.[2] (The instructions in this Kit are not designed for developing tests using the Job Element Approach developed by E. S. Primoff. Such instructions are contained in other USCSC publications.)

The Kit discusses the role of the MS as that of supervising and coordinating the total test development operation to insure that all essential steps are carried out and that procedures and documentation are in accord with his agency's requirements and with Federal regulations. The Kit provides examples and detailed discussions of methodology to assist the MS in carrying out this role of providing the technical background and expertise in selection techniques and test development. For ease of presentation, an overview paragraph at the beginning of each section calls attention to the type of information contained in the section.

To properly carry out his role, the MS is expected to be knowledgeable about those aspects of the test which affect test effectiveness, such as test instructions, item characteristics, format, conditions of administration, and examinee background and attitude. He must also be competent to select and advise a Supervisory Panel in planning and reviewing the test, and a Committee of Item Writers in writing the test questions. He[3] should know the relative usefulness of different selection procedures, including interviews, references, application data, transcripts of school records, and various testing techniques (ability, job knowledge, performance, job element).

[1] As used in this Kit, the term "job knowledge tests" includes written skills tests.
[2] Refer to the Preface and Appendix I-E for an explanation of the qualifications needed by the MS. If the MS does not possess all of these qualifications, he should consult a specialist in dealing with technical problem areas outside of his expertise.
[3] The pronoun "he" has been used throughout in order to avoid the cumbersomeness of the double pronoun.

II. General Considerations in Planning the Test Development Project

Overview. Determining the agency's need which led to the request for assistance; evaluating the adequacy of the job analysis; planning the job analysis, if one is needed; deciding the kind and number of subject-matter experts required; planning the time schedule; and preparing the instructional materials.

The Measurement Specialist's first task will be to determine fairly precisely the basis for the agency's request for test instruments or other selection procedures. What specific positions are involved?
Are the requested instruments or procedures to be used in selecting employees for employment, promotion, training, transfer, layoff, or other purposes? What State and local laws govern such selection? (For example, do the laws specify part or all of the selection techniques? Do they require ranking of the best qualified?) What selection procedures have been used previously? To what extent must future selection procedures relate to past procedures? Is the selection need unique to this agency or do other agencies have the same needs? If so, can a cooperative development be undertaken?

Since the adequacy of the selection procedure will depend largely upon its relevance to the requirements of the job, the MS should determine whether appropriate job analyses have been performed and detailed job descriptions written. This is one of the most critical aspects of test development. If an adequate job analysis has not been performed and a detailed job description written (one which describes the essential job tasks and/or knowledges, skills, and abilities required), these will need to be done.[1] (In rare cases, the job may be adequately described in other sources and a job analysis not be required.)

For job knowledge test development purposes, the job analysis should describe the duties of the job in sufficient detail that it can be used for determining the abilities, skills, knowledges, and/or other worker characteristics required to perform the job. (In some cases of job knowledge test development, the knowledge requirements alone would be sufficient to describe; however, it is preferable to describe the whole job.) The report should also specify the method of analysis used, i.e., whether obtained through interview, observation, questionnaire, worker record, committee of job incumbents and/or supervisors, or other procedures.

Where no job analysis has been performed, the MS will need to plan for performing an analysis before test development starts. He will need to determine the method to be used and who will perform the analysis, whether it can and will be done by the Supervisory Panel (Panel), by the agency's personnel staff, or by the MS in cooperation with one or both of these. Whether or not the MS is involved in the job analysis, he will probably wish to provide direction and to see that the necessary data are obtained.

Although the Panel will eventually make the specific decisions regarding selection techniques, the MS must determine, prior to selection of the Panel, the scope of the assignment in terms of job knowledge and skill requirements. This will be needed for determining the background requirements of Panel members and hence the size of the Panel (see Section III which follows).

[1] See Appendix IV-A for suggested sources of job analysis information.

Before assembling the Panel, the Measurement Specialist must also prepare a rough plan for the entire assignment that lists the specific steps to be performed by the Panel members together with a time schedule for completing each step (see Appendix III-A). It is easy for a person who has not previously conducted a project of this nature to underestimate the time required for certain steps, especially the time intervals that must be allowed for the persons involved to attend to their other ongoing responsibilities in their regular jobs, and the time required for mail transmittal when this is necessary.
The Measurement Specialist must also allow himself adequate time for collecting or preparing the instructional materials that will be used by the Panel and the Item Writers. (See Section IV-A for a description of the preparation needed prior to the first working meeting of the Panel.)

III. Selecting the Supervisory Panel

Overview. Duties of Panel members; number needed; qualifications important for Panel membership; method of selection.

A. Function and Duties of the Supervisory Panel

The function of the Supervisory Panel is to assist the Measurement Specialist in assuring that selection procedures are relevant to performance on the job for which candidates are being selected. The Panel's duties include assisting the Measurement Specialist in the definition of selection criteria based on a job analysis, and the definition of the employee selection plan (including test plans where required). The Panel will also approve the content of final tests and other selection procedures. Under the general supervision of the Measurement Specialist, the Panel will:

1. Review the job description for present accuracy;[1]

2. If not already done in the job analysis, identify the general knowledges, skills, and abilities essential to performing the job;

3. Recommend to the Measurement Specialist whether the applicant's possession of these qualifications should be established by interview, references, a job sample test, a paper and pencil test, or by other means;

4. Help define and plan the validation strategy to be used, when requested by the Measurement Specialist;[2]

5. Work with the Measurement Specialist to determine test specifications which the Item Writing Committee will use as a guide in preparing test material;

6. Review the final test after the Item Writers have given it their approval, but before it is reproduced as an operational form, in order to provide a final check on appropriateness and accuracy of content.

[1] The Panel may be asked to participate in the job analysis, as discussed in Section II.
[2] It is essential for content validity demonstration that the validation strategies for a test be decided upon at the time the test is being planned. For content validity, the match between the job requirements and test content must be demonstrated. The present Kit does not include a section on how this should be demonstrated.

Generally, it is not recommended that item writing responsibilities be assigned to the Panel. Giving this committee too broad an assignment may detract both from the needed focus on the planning phase and the fresh view required at the final review stage. Also, since all will have important ongoing job responsibilities, there may be a limit to the amount of time they could devote to this added assignment. However, if the Panel members are able to work full-time on such an assignment for a long period of time, then such an item writing assignment could be considered.

B. Qualifications Important for Panel Membership

To serve their function effectively, it is advisable for Panel members to be familiar with the job either through having performed it or through having supervised it. Since their duties will require substantial time, insight, and careful thought, potential Panel members should be screened for available time, interest, and motivation. Consideration should be given to selecting persons who will be willing to listen to the recommendations of others but who will not defer to other members simply because of subordinate job status. Special consideration should be given to including persons in the Panel from minority groups to reduce the possibility of any biasing factors occurring in the test specifications.
C. Procedures for Selection of Panel Members

Recommendations for Panel membership usually will come first from agency heads, with those nominated asked to recommend additional candidates for membership. Those asked to make nominations should be made aware of the responsibilities and the nature of the task and the need for motivated individuals. It is well to give those nominated the opportunity to evaluate their own interest in and suitability for the assignment. (See Appendix II-B for a Sample Memorandum for Nominated Item Writers.) The Panel will ordinarily consist of a balance of first- or second-level supervisors and outstanding, fully qualified operating personnel.

D. Size of Panel

Six Panel members may be sufficient, but this can be varied to include more if required for representation of different schools of thought or different specialties in the occupation. Going beyond seven members may make control of, and groupwide participation in, discussion difficult. Cutting below four or five members may be possible for a well-defined job but may risk content bias when the job is not clearly defined. The most important consideration is that the Panel's expertise covers the job requirements.

IV. Planning the Selection Procedures and Test Development

Overview. Preparation for meeting with the Panel; Panel's role in evaluating the adequacy of the present job description; determining selection procedures relative to the job description; defining the criteria of successful job performance; identifying those elements of the criteria which the selection procedure is intended to predict; decisions to be made by the Panel relative to such areas as test content, difficulty, format, scoring, and norming; presentation of the specifications to facilitate item writing; considering minority group factors in test planning.

A. Preparation Prior to the First Working Meeting of the Panel

As a rule, the Panel members will be provided, before their first meeting, with a clear statement of their responsibilities and goals. (See Appendix II-A-2 for a sample statement.) If Panel members are located within easy commuting distance of one another, it may be efficient and effective to have an orientation meeting lasting an hour or two to discuss the overall plan and the Panel's responsibility. In this case the Measurement Specialist must be prepared with a statement of agenda to guide discussion and to insure that all critical points are covered. Handouts of explanatory and working material as listed below should be available for distribution.

If the group is scattered, the same material may be mailed, substituting for the agenda a written discussion of major points covered by the agenda. The procedures which follow assume an existing job analysis. These procedures will need modification if the Panel participates in the job analysis discussed in Section II.

Preliminary materials (or handouts at the orientation session) may include:

1. Covering letter;

2. Existing job analysis (If duties are listed with intervening space, the Panel members can use the list as a worksheet for recording the position requirements.);
3. Statement of Panel responsibilities and work to be accomplished (see Appendix II-A-2);

4. Instructions as needed on performing the assigned tasks (see Appendix II-A-3).

The Panel members should be asked to do the following:

1. Examine the job description and suggest additions, deletions, or modifications on the basis of their own experience with the job. (The Measurement Specialist should carefully review these suggestions to be confident that they do not reflect personal biases before he incorporates them into a revised description.)[1] Each Panel member should be asked to rate each job duty on an appropriate scale and submit his ratings to the Measurement Specialist two weeks before the work session. One example of such a scale is shown below:

Frequency Scale for Rating Duties

Scale Value - Description
5 - Required several times a day
4 - Required on a daily basis
3 - Required weekly
2 - Required several times a year
1 - Rarely required

It is important that these Panel ratings be made independently by Panel members.

[1] When the job description has been developed by professional job analysts following standard procedures, the Measurement Specialist should check with the job analyst responsible for this job before changing it.

2. Study the job description and identify the knowledges, skills, and other qualifications required to perform each job duty satisfactorily. These position requirements will be submitted to the Measurement Specialist together with the job duty ratings. Position requirements at this stage may be somewhat generalized: knowledge of algebra through quadratic equations; can operate a desk computer; can read at fifth grade level. (The Item Writers will later be asked to identify the requirements more specifically as a basis for item writing.)

3. Give thought to means of identifying whether the applicant's qualifications meet these position requirements, i.e., whether determined by references from previous supervisors, interview, test, etc.

4. Prepare a list of job performance criteria (identifiable behaviors and skills which characterize the top-level performer and those behaviors and skills which characterize the poor or marginal performer).

When the position requirement lists are received from the Panel members, the Measurement Specialist will compile them and prepare the worksheet which the Panel can use later to rate the importance of the requirements.

B. Panel Work Session

After a brief review of the Panel's role in the test development process, the Measurement Specialist should proceed with the following steps:

1. Rating the position requirements. Give each Panel member a copy of the compiled list of position requirements and ask him to rate the requirements according to a scale developed for this purpose. One example of such a scale is shown below:

Criticality Scale for Rating Job Requirements

Scale Value - Description
5 - Critical to performance (employee must have it to do the job).
4 - Very important, but not critical (employee can operate with less supervision if he has the knowledge or skill).
3 - Of some importance (some employees must have it if job is to get done).
2 - Not important.
1 - Never required.

It is important that these ratings be done without collaboration among the Panel members. The compiled ratings may be used later as a basis for content validity for some elements of the selection procedure, and independence of ratings is necessary to confirm agreement among raters.[1] As Panel members complete their ratings, the Measurement Specialist can tally them on his work sheet.

[1] The procedures for documenting the content validity are not included in this Kit.
When completed, the tally sheet can be photocopied for the Panel. Mean criticality ratings should be used in conjunction with the frequency ratings for identifying those requirements to be measured by the test.

2. Defining the measurement need. Formulating the purpose of any proposed selection procedure is important, both for increasing the likelihood that it will serve the intended purpose and for providing documentation should a question later be raised about the professional nature of the development process, or should the procedure be challenged in court. For the latter purpose records must be kept of all major decisions, together with evidence on which the decision was based. (See Appendix II-C for a suggested format for a Daily Log.) Sample forms are included in the Kit to facilitate keeping a full record, but these records should be supplemented, insofar as possible, with all evidence and data which served as the basis for decisions. All such records must be dated and the persons involved in the decision-making should be identifiable.

Most test development for which this Kit will be used will be directed at selecting persons qualified to handle certain jobs. Written tests generally will be used only if they provide better or more reliable information about the applicant's potential job performance than other selection techniques. The specific need for a test must be derived from a review of the job analysis and position requirements rather than determined on an a priori basis. As a rule, the Supervisory Panel will participate with the Measurement Specialist in the decision to develop or use a test.

As a group, the Panel then suggests to the Measurement Specialist the means by which each requirement can be identified. The nature of this needs to be fairly specific, such as: structured interview; informal interview; level of school completed; references from former employers; achievement test; etc. The Measurement Specialist will assist in these discussions by advising the Panel regarding various means (including testing) of identifying skills, knowledges, and other aspects of job potential. The outcome of this phase of the workshop will be a composite statement such as laid out in Appendix II-D, Form MN.

In addition, a list of criteria of successful job performance can be compiled. A copy of the composite list should be given to each member of the Panel (and subsequently to each Item Writer who has had experience with the job), asking him to rate each criterion on a scale developed for this purpose. One example of such a scale is shown below:

Scale for Rating Job Performance Criteria

Scale Value - Description
5 - Clearly indicates a high level (or low level) of an important aspect of job performance.
4 - Generally indicates a high (or low) level of performance.
3 - Somewhat related to performance on the job.
2 - Does not necessarily distinguish between the person who performs the job well and the one who performs it poorly.
1 - Unimportant or irrelevant for job performance.
N - I have never observed whether employees exhibit this characteristic.[1]

Again, it is essential that these ratings be performed independently. As a group, the Panel should then match each selection procedure with the job performance criterion or criteria it is intended to predict.

[1] Responses to N should not be counted in calculations.
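The compilation of independent ratings described above is simple enough to script if the number of requirements or raters is large. The following is a minimal sketch, not part of the Kit's prescribed procedures, of how mean frequency and criticality ratings might be tallied for each position requirement; the requirement names, the ratings themselves, and the screening value of 3.0 are illustrative assumptions only.

```python
# Illustrative tally of independent Panel ratings.
# Requirement names, ratings, and the 3.0 screening value are hypothetical.
from statistics import mean

# Each requirement maps to the list of independent ratings, one per Panel member.
frequency = {
    "Operates desk calculator": [5, 4, 5, 4, 5, 4],
    "Knowledge of algebra":     [2, 3, 2, 2, 3, 2],
}
criticality = {
    "Operates desk calculator": [4, 5, 4, 4, 5, 4],
    "Knowledge of algebra":     [3, 3, 2, 3, 3, 2],
}

for req in frequency:
    f_bar = mean(frequency[req])
    c_bar = mean(criticality[req])
    # Flag requirements whose mean ratings suggest they should be measured by the test.
    flag = "candidate for measurement" if f_bar >= 3.0 and c_bar >= 3.0 else "review further"
    print(f"{req:28s} mean frequency {f_bar:.1f}  mean criticality {c_bar:.1f}  -> {flag}")
```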
3. Planning the test. In those cases where a test is specified, the next task of the work session will be to develop the specifications for the test. Form BD in Appendix II-E was designed to assist the Panel in stating the rationale for the test. The purpose of specifying other selection procedures is to avoid undue weighting of one aspect of the position requirements through duplication across selection procedures. The selection ratio provides a basis for determining test difficulty. If collected, ethnic group data may help focus attention on any problems of testing fairly across cultures as well as provide a basis for specifying pretest requirements to insure the representativeness of the sample.

Form TS in Appendix II-F was designed to serve as a summary sheet for recording Panel decisions on test specifications. This information will guide the Item Writers in preparing the actual test. The MS can assist the Panel in establishing test specifications. Following are some considerations for the MS to discuss with the Panel relative to some of the decisions to be made in planning the test development project:

a. Content. Content coverage will be based on the job analysis and position requirements as summarized in the Statement of Measurement Needs (Appendix II-D, Form MN). Using the importance ratings agreed upon by the Panel, a tentative weight should be assigned to each major content area. The MS may decide to break out major areas into sub-areas and weight each sub-area appropriately to the total weight of the major area.

b. Number of parts. The reason for separating the test into several parts may be one of the following:

(1) Analysis of test performance separately by content area is desired.

(2) Different item types require different instructions or mental set.

(3) To increase the likelihood that applicants are able to respond to all items they are capable of answering correctly.

(4) To provide a rest period in a long test.[1]

In general, unless one of these reasons or another equally strong reason exists, it is desirable to have all of the test items given at one time under one time limit.

[1] For security reasons, rest periods are not generally recommended for a test requiring four hours or less.

c. Item types. The assumption is made here that most tests will be of the multiple-choice type. Four or five choices are standard. Since some examinees will not have had experience taking tests with different numbers of choices in the same instrument, it is best to use the same number for all items. The number of choices used will depend on the possible effect of guessing relative to the selection range, the difficulty of finding reasonable distracters, and the probable loss of candidate time in reading ineffective distracters (see Appendix V-B).

Because of the difficulty of establishing uniform and defensible scoring guidelines, essay tests are not recommended unless fewer than 15 to 20 candidates will be tested. An extensive discussion regarding the development of essay tests will be found in "Essay Examinations" (Coffman, 1971).

d. Time. The time allowance may be restricted by administrative considerations or the feasibility of carrying out all planned selection procedures. However, sufficient length is needed to provide reliable measurement. (See the following discussion under "Number of items.") If nearly all examinees have adequate time to respond to all items, there is probably little to be gained by timing parts separately, unless part-scores are to be used.
Items covering different content areas may be interspersed and arranged in order of difficulty to increase the likelihood of candidates reaching all items they are capable of answering. If, however, there is concern that some candidates may waste time on items about which they have little knowledge, or if scores on different content areas are to be studied independently, it may be desirable to group items by area and time each area separately.

e. Number of items. This is a very technical area requiring great care. The actual relative weight carried by the items in each content area depends primarily on the standard deviation of this set of items relative to other sets;[1] the standard deviation, in turn, tends to vary with the number of items when item difficulty distributions and item discrimination indices are similar from content area to content area. Thus, the number of items per content area should generally reflect the desired weighting. It is preferable to increase the number of items to achieve the desired relative weighting among content areas rather than assign a different weight to items in the various areas (see the following discussion). It may not be possible for the Panel to specify the total number of items exactly at this time, especially where there has been no previous experience with writing items in the given area. However, much can be done at the item writing stage to increase the efficient use of test time (see Appendix IV-B). It is suggested, therefore, that the number of items be specified with the following considerations in mind:

(1) Error of measurement. The ratio of the measurement error to the number of items decreases with the number of items, although the actual size of error increases. In a non-speed test, the standard error of measurement (SEM)[2] is approximately .43 √(number of items) (Cureton et al., 1973, Lord, 1959, Swineford, 1959), or about 1.9 for 20 items, 2.5 for 35 items, 3.0 for 50 items, and 4.3 for 100 items; however, note that 25 items given double weight (i.e., 50 points) will yield an effective standard error of measurement of 2 x .43 √25 (or 4.3) as compared with a standard error of measurement of 3.0 for 50 items given single weight. It is suggested that where the test is to carry considerable weight, a goal should be set of 75 to 100 items with a planned standard deviation (SD) at least three times the standard error of measurement. This ratio is easier to achieve with larger numbers of items. The square of the inverse of this ratio, SEM²/SD², is equal to 1 - reliability. A ratio of three would yield an estimated reliability of .89, which is generally considered to be acceptable. If the test is a speed test, it is better to compute the standard error of measurement from the reliability rather than from the approximation shown here. (A worked check of this arithmetic follows this list.)

(2) Chance scores. A score on which a decision is being made should be clearly above the chance range (see Appendix V-B).

(3) Power vs. speed. Examinees should be given sufficient time per item to consider each question. Unless ability to work under heavy time pressure is an important requirement of the job, pressure for speed should not enter excessively into selection considerations.

[1] See Guilford and Fruchter, 1973, pp. 385-386, and Guilford, 1954, p. 405 and pp. 443-447, for a discussion of effective weighting of part content in the total test score. Also see Thorndike, 1971, p. 3, and Tinkelman, 1971, pp. 68-70.
[2] The standard error of measurement indicates the amount of variation one can expect in an individual's score if he were to take other parallel forms. (See Glossary; also Guilford and Fruchter, 1973, p. 401, and Guion, 1965, pp. 45-46, for further explanation of the standard error of measurement.)
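The figures quoted in consideration (1) above can be checked directly from the approximation SEM = .43 √(number of items) and the relation reliability = 1 - SEM²/SD². The short sketch below simply reproduces that arithmetic; it is offered only as an illustration (it assumes a non-speed test, for which the approximation is stated to hold) and is not part of the Kit's prescribed procedures.

```python
# Worked check of the SEM approximation and the reliability relation cited above.
from math import sqrt

def sem(n_items, weight=1.0):
    """Approximate standard error of measurement for a non-speed test:
    weight x 0.43 x square root of the number of items."""
    return weight * 0.43 * sqrt(n_items)

for n in (20, 35, 50, 100):
    print(f"{n:3d} items: SEM about {sem(n):.1f}")

# 25 items given double weight (50 points) versus 50 single-weight items:
print(f"25 items, double weight: SEM about {sem(25, weight=2):.1f} "
      f"(vs. {sem(50):.1f} for 50 single-weight items)")

# If the planned SD is three times the SEM, the estimated reliability is
# 1 - SEM^2/SD^2 = 1 - 1/9, or about .89.
ratio = 3
print(f"SD/SEM = {ratio}: estimated reliability = {1 - 1 / ratio**2:.2f}")
```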
f. Scoring formula. If every examinee responds to every item, applying a correction for guessing does not change the ranking of candidates. If, however, one anticipates many omissions, use of the scoring formula (conventionally, rights minus a fraction of wrongs: R - W/(k - 1) for k choices per item) may be justified, especially if there are only three or four choices per item. It should be recognized, however, that when a scoring formula is used, those applicants who base their answers on inadequate knowledge will be handicapped relative to those who guess at random. For a properly constructed item, the former very often finds his "answer" among the misleads and then stands no chance of a correct answer, whereas the guesser stands some chance of obtaining the correct answer (one-fourth chance in a four-choice item).[1]

[1] See also the discussion regarding guessing in Thorndike, 1971, pp. 59-61.

g. Weighting of parts. As indicated earlier, it is desirable to increase the number of items rather than give double or triple weight to one set of items. One or more of the following steps may be taken to accomplish this:

(1) Increase the time to permit more items.

(2) Use item types which require less time per item.

(3) Examine the possibility of breaking some items into two or more parts (see Appendix IV-B).

h. Setting selection standards. There are many factors relevant to the decision of whether to rank candidates on the basis of test scores (with those receiving higher scores given job preference) or to set a minimum cut-point that the candidate must meet to be given further consideration. These include psychometric as well as legal considerations. This Kit will not provide the user guidance on this matter.[1]

[1] See Cronbach, 1970, p. 421ff., Guion, 1965, pp. 486-493, and the APA Standards for Educational and Psychological Tests, 1974, paragraph I4, for further information on this topic.

i. Norms.[2] If openings for the job in question occur very rarely such that this test is likely to be used only once, or if the nature of the content is expected to change before a new opening occurs, there is little basis for establishing normative data. If the opening is similar to openings in other parts of the country and recruiting is done on a nationwide basis, there may be some benefit in relating local norms to national norms. In this case items from a nationally normed test should be incorporated in the local test to facilitate equating. The primary benefit in equating would be the comparison of candidate capability with that elsewhere. Items from another test must not be used without clearance from the publisher of that test.

In the majority of cases, however, it is likely that where a decision is made to build a test locally, it will be because the job has aspects unique to the locality, in which case national norms will have little meaning. It may still be desirable to develop local norms as a means of comparing candidate groups from one year to the next.

[2] The present Kit does not include detailed instructions on how to set up norms and on equating test items and parts; however, a brief discussion is presented in Appendix IV-L.

Percentile norms have the advantage of being readily computed and easily understood.[1] They have the disadvantage that differences in the middle of the range appear to have more significance than they really do.

[1] See Ebel, 1972, p. 285ff., or Guilford and Fruchter, 1973, p. 36ff., for computing procedures.
A given raw score difference may be equivalent to 12 percentile points in the middle of the range and only two percentile points at the extremes. In a normal distribution the difference between the 40th percentile and the 50th percentile represents only .25 of a standard deviation, whereas the difference between the 85th percentile and the 95th percentile represents a difference of .6 of a standard deviation. (A unit equal to one standard deviation is generally considered to represent the same ability unit at different parts of the score range.) For this reason norms are often stated on a standard score scale for which there is an established mean and standard deviation.

Various standard score scales may be used, and these are described in most texts on testing. A useful scale is one in which the mean is 50 and the standard deviation is ten. If the distribution is fairly normal, one simply sets the obtained raw mean equal to 50 and the standard deviation equal to ten and develops a scale of equivalent scores (Guilford & Fruchter, 1973, pp. 463-467). Future operational tests may be put on the same scale as this test by including in the future test a set of items from the current test. When this is used in merit systems, this standard scale can be linearly converted to a rating scale with 70 at the cut-point and 100 at the practical top score.

If the population is considered to be normally distributed but test scores are not, the raw scores may be put on the standard score scale through setting the median equal to the scale mean and using a table of normal curve values to find the number of standard deviation units corresponding to each percentile (Angoff, 1971, pp. 515-516). This procedure should be used with care.

If the test being developed is a new form of an operational test, a scale may already be established. If so, the new test may be put on this scale such that scores on the new test may be compared with those on the earlier comparable test or other tests in a battery. If this is to be done, it is important to make the decision regarding method at the time test specifications are established. The various methods are as follows:

(1) Common items. This involves using a set of not less than 20 items or 20% of the total number of items, whichever is greater, of each form as a set of common equating items (see Appendix IV-L for statistical procedures). The items should be representative of the content of both tests. They may be set up as a separate section or spaced throughout the test. (Avoid placing equating items near the end of the test.)

(2) Common population. In this procedure a common group is asked to take both test forms. Since this is less commonly used, procedures are not described in this Kit (see Angoff, 1971, pp. 573-576).

(3) Equi-percentile. This is used only if the same population takes both tests, or the populations taking the two tests can be considered equivalent. In this case a score on one test is set equal to the score on the other test that has the same percentile equivalent.

A comprehensive discussion of score scales and norming is provided by Angoff (1971).
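The linear conversions described above, first to a mean-50, standard-deviation-10 scale and then to a rating scale with 70 at the cut-point and 100 at the practical top score, involve nothing more than the transformations sketched below. This is a minimal illustration rather than a prescribed routine; the raw-score statistics, the raw cut-point of 55, and the practical top score of 88 are assumed values chosen only for the example.

```python
# Illustrative linear score conversions for the standard score scale discussed above.
# The raw-score mean and SD, cut-point, and practical top score are assumed values.
def to_standard(raw, raw_mean, raw_sd, scale_mean=50.0, scale_sd=10.0):
    """Linear conversion of a raw score to a scale with a chosen mean and SD."""
    return scale_mean + scale_sd * (raw - raw_mean) / raw_sd

def to_rating(standard, cut_standard, top_standard, cut_rating=70.0, top_rating=100.0):
    """Linear conversion so that the cut-point maps to 70 and the top score to 100."""
    slope = (top_rating - cut_rating) / (top_standard - cut_standard)
    return cut_rating + slope * (standard - cut_standard)

raw_mean, raw_sd = 62.0, 9.0                   # obtained from the candidate sample (assumed)
cut_std = to_standard(55, raw_mean, raw_sd)    # assumed raw cut-point of 55
top_std = to_standard(88, raw_mean, raw_sd)    # assumed practical top raw score of 88

for raw in (55, 62, 75, 88):
    s = to_standard(raw, raw_mean, raw_sd)
    print(f"raw {raw:3d} -> standard score {s:5.1f} -> rating {to_rating(s, cut_std, top_std):5.1f}")
```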
j. Desired statistical characteristics.[1] In general, a mean of 65% to 75% of the number of items for a four-choice test, or 60% to 70% for a five-choice test, is probably reasonable (see discussion in Appendix V-A). A reasonable goal for the standard deviation would be at least three times the standard error of measurement as discussed above under "Number of Items" (Section IV-B-3e). There is no absolute standard for setting cut-points. However, it is generally assumed that the best procedure is to establish cut-points on the basis of validity information or known labor market conditions relative to the job to be filled.[2]

[1] The Kit does not include a section on the statistical analysis of the test.
[2] See the APA Standards for Educational and Psychological Tests, 1974, for a discussion of this.

k. Percent completion desired. For job knowledge tests the aim should be to have every examinee able to read and react to every item; however, regardless of the instructions with respect to guessing, some examinees will omit items about which they have little or no knowledge. Accordingly, test results may show a number of no-responses even when all have had an opportunity to read every item. Furthermore, some examinees will stop trying and will quit before reading all items, especially in a lengthy test. A 75- to 100-item test may be considered relatively non-speeded if 80% or more of the examinees record answers to the last item.

l. Chance mean and standard deviation. These statistics should be calculated to determine whether the anticipated cut-point score could be achieved through chance by someone who has substantially less knowledge than the cut-point normally indicates. Few examinees guess at all items, though occasionally one will if he has little background in the content area; for example, if a candidate taking a five-choice multiple-choice test knows 20 out of 100 item answers, he will of course be more likely to achieve the cut-point by guessing at the remaining 80 than if he guessed at all 100 (see the discussion in Appendix V-B).
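The chance mean and standard deviation referred to above follow from the usual binomial model of blind guessing: for r items guessed at random on a k-choice test, the expected number correct is r/k and the standard deviation is the square root of r(1/k)(1 - 1/k). The sketch below works through the example given in the text (a candidate who knows 20 of 100 five-choice items); it is an illustration only, and the assumption of purely random guessing on the unknown items is, as the text notes, rarely met exactly in practice.

```python
# Chance mean and standard deviation for blind guessing on the items not known.
# Example from the text: 20 of 100 five-choice items known, 80 guessed at random.
from math import sqrt

def chance_stats(items_guessed, n_choices):
    p = 1.0 / n_choices
    mean = items_guessed * p
    sd = sqrt(items_guessed * p * (1 - p))
    return mean, sd

known, total, choices = 20, 100, 5
m, s = chance_stats(total - known, choices)
print(f"Expected score: {known} known + {m:.0f} by chance = {known + m:.0f}")
print(f"Chance standard deviation: {s:.1f}")
# A score of roughly the chance mean plus two chance SDs can arise from guessing alone,
# so the anticipated cut-point should lie clearly above this range.
print(f"Approximate top of the chance range: {known + m + 2 * s:.0f}")
```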
m. Federal regulations regarding test development. The preceding procedures may seem unnecessarily detailed; however, the usefulness of the results is in large part a function of the care taken in planning the test and developing it according to specifications. Most of the technical background has been developed and has been available for many years. The American Psychological Association's Standards for Educational and Psychological Tests (1974) has been periodically updated to inform test builders and users of the requirements of good testing practice.

More recently the Federal agencies charged with enforcing equal employment opportunities, concerned that minority group potential may not have been adequately measured by many of the tests used by employers, have prepared guidelines which in effect restate as regulations many of the principles regarding sound test development that have been long held in the psychological profession. Copies of these regulations are included in Appendix I. The Equal Employment Opportunity Coordinating Council, which consists of the U.S. Equal Employment Opportunity Commission, the U.S. Office of Federal Contract Compliance (Department of Labor), the U.S. Civil Service Commission, the Civil Rights Division of the Department of Justice, and the U.S. Civil Rights Commission, has been working on developing common principles or guidelines to assist employers in conforming to the law. When it is published, it will become a part of this Kit.

The procedures described here for item writing and assembling job knowledge tests are those for a professionally developed test such as are required by the Federal regulations. It is expected, of course, that the person supervising the test development will have a background considerably beyond the contents of this Kit. Professional development of the test, though recommended, is not sufficient under the law. One must be able to demonstrate that the test is valid and does not discriminate unfairly (for non-job-related reasons) against any group of applicants. The need for validation is equally important for other selection procedures and is also covered by the regulations. Another kit is planned to cover these requirements.

V. Selecting the Item Writers

Overview. Item Writer responsibilities; qualifications important for successful item writing; selection of Item Writers; number of Item Writers needed; pre-training assignments.

A. Responsibilities of Item Writers

The Item Writers will review the list of general position requirements the Panel specified in Form MN and will identify the specific knowledges and skills required to perform the job. They will then write test items to measure candidates' possession of the knowledges and skills and will review items written by others. They will also revise the items as necessary on the basis of the pretest item analysis and will either review the Measurement Specialist's tentative test assembly or carry out the assembly themselves under the Measurement Specialist's supervision.

B. Writer Qualifications

Good item writing involves a large element of art. It does not appear to be a skill that everyone can develop. To enhance the probability of success, the Item Writer should have, as a minimum, the following qualifications:

1. Experience in the performance or supervision of the job for which the test is being designed.

2. Strong knowledge of relevant content areas. The knowledge of the Item Writer must be substantially beyond that required to answer the test questions in his specialty area.

3. Interest in the project, with motivation toward the item writing task.

4. Originality in approach to assignments.

5. Ability to work in a group and openness to suggestions regarding his work.

6. Willingness to persevere with a task and try a variety of approaches.
It provides the individual with the opportunity to try item writing as a basis for evaluating his interest in accepting the assignment. The exact wording of the memorandum will vary depending on the anticipated number of nominees showing an interest. C. Number of Items and Writers Needed The workshop should be expected to produce at least double thenumber of items needed, even though steps are taken to reduce duplicatesand insure wide coverage. Many items will fail to meet the difficulty and discriminating power requirements set by the Supervisory Panel. The number of new items required may be reduced if there arereusable items in the files or if an item file can be secured fromother sources.l If there are several times as many items in the files as are needed.for one form, the problem of candidate familiarity withthe items from having taken earlier forms is probably not great. Ifthere has been a loss in security with a given form of the test, theseitems should be marked and not reused within a period of several years, or be reused only in a changed form. Often the test development project is designed to update oldertests. For example, under the Merit System Standards program of the II litems from commercial, copyrighted sources may not be used withoutII the author's or publisher's written permission. The U.S. Civil Service'Commission will provide State and local governments with test bookle1ts forsome position classes under the Merit System Standards program. In/anycase the user would need to show the relevance of these materials throughcontent or criterion-related validation. 19 U.S. Civil Service Commission, State and local governments might be asked to cooperate in such a project. Where there is only one previous form, parallel items can often be written; however, the Measurement Specialist should be careful not. to use too many of these, because an advantage could be given to a competitor who has taken the earl~er form and is able to recall some of the items for his specific study prior to taking the new test. Where several earlier forms are available, items from these forms may be parallelled as a means o~ saving time in writing new items to test the same concepts. As a general rule, the Measurement Specialist will need to decide early how many items will be from the file or will be parallels of such items. (Regardless of the number of items in the files, a minimum of 15% of the items might be required to be entirely new to keep the test abreast of new developments in the field.) Next, the Measurement Specialist must decide what fraction of the items from the files will be used in their present form. Then, in deciding the time required for the total item production phase, the Measurement Specialist should allow possibly half as·much time for-·parallelling an old item as for writing a new item, and perhaps one-fourth as much time for selecting old items from the files. Parallelled items and items from the files should go through the same review process as new items to make certain they are up-to-date in terms of content relevant to the job as presently described, and free of psychometric and subject-matter flaws. Training of the Item Writers may be expected to require a week to reach the point of productivity. (Training will continue beyond this initial week, but Writers should be producing usable items at a reasonable rate by the end of this time.) 
Items written in the first week will be usable, but they will ordinarily not be considered in estimating the time required to produce the necessary number. After the first week, the Measurement Specialist can probably count on an average of about four items per person per day. This may seem like an excessive amount of time per item, but many starts do not yield items, and any effort to test understanding, application, or other skills requires much more thought and development than does a test of simple knowledge. (These items are rough items which will need editing and review before actual inclusion in a test. The Measurement Specialist should not expect that final polished items will be turned out routinely at this rate.) The Measurement Specialist might expect to be able to accomplish the task with about six good writers, if they are versatile in the content areas designated. More may be needed to provide coverage of specialties. A few writers may be expected to fall by the way. The Measurement Specialist can either start with more than the required number or bring in new writers as others drop out. If extras are brought in initially, it may be possible to complete the task in less 20 than the allotted time should all prove to be successful writers~ but it can result in more training cost than otherwise required. If extra specialists are needed to handle some specialized content, they can be brought in for the time needed. D. Steps Prior to Item Writing Workshop When the individual is notified of his appointment as an ItemWriter, he should be asked to accept or reject the invitation promptly(preferably within a working day, so that someone else may be appointed if he rejects).l When he accepts, a summary of relevant informationshould be sent to him (time, place, schedule, etc.). He should alsobe sent a copy of the list of job duties and position requirements andasked to rate them, independently, on the same scale as the Panel ratedthem. These ratings must be returned prior to the first meeting sothat the Measurement Specialist can summarize them and compare themwith those provided by the Panel. This is done simply as a check onthe "universality" of the Panel's ratings and to provide further backupdata on content validity. If all ratings have been made independently,it is unlikely that there will be significant differences between thegroups. If there is a substantial difference between the two sets ofratings, the Measurement Specialist may wish to discuss the ratingsagain with the Panel. If the Measurement Specialist has not done considerable itemwriting, he will need to familiarize himself with the suggestionsregarding the writing of items in Appendix IV-B and try writing severalitems to measure position requirement concepts for the particular jobfor which a test is being written. Even if the test is in a field inwhich he has little background, he can probably find a few elementaryconcepts which are familiar to the nonexpert and for which he will beable to devise items. VI. Training the Item Writers Overview. Training period; orientation and procedures for securityof test material; training techniques; expediting training; helping ItemWriters prepare non-routine items to test application and understanding. A. 
A. Orientation

Although the Item Writers will be generally familiar with their role by the time of the first session, they will usually welcome a general review of their planned activities, the expected daily schedule, and other routine matters, as well as a review of what will be expected of them. It may help avoid premature discouragement if they are reassured before they start that they are not expected to be skilled item writers immediately, and therefore should not become depressed by the small number of items produced during the first few days. This reassurance will also alert the Item Writers to the fact that long-range item quality is expected to be above that of the first day's output. It may be appropriate at this time to remind the Item Writers that criticism of items is a necessary part of the development process; if effective revisions are to be achieved, it is important that criticisms be both objectively given and accepted.

The Writers must be told at this point of the precautions which must be taken to assure security of the items. Their cooperation must be enlisted in disposing of scratch work in designated security bags, locking up work when the workroom is left unattended, and not taking items home with them. (Probably few will wish to do so.) The Writers should not be discouraged from jotting down ideas which occur to them between work sessions, but should be urged to observe reasonable security precautions with such materials.

Following the general orientation, the Measurement Specialist may show the group examples of items which are considered good; however, it is desirable that these items cover a variety of approaches and different formats. He must be wary of spending so much time with these that the Writers model their writing after these particular items and neglect desirable analytical approaches to specific measurement problems. Also, in order to take advantage of the initial motivation to start the assignment, it is suggested that the Item Writers have an immediate opportunity to try their skill at item writing.

B. Trial Item Writing

The Measurement Specialist will usually select a general position requirement (preferably an area of knowledge) that is highly important. As a group, the Item Writers are then asked to analyze the specific knowledge and skill principles which must be understood. They should review the principles carefully to insure that those which will be taught on the job, and which therefore do not need to be understood at the time of entry, are not included in the test.

The Measurement Specialist can then take one of the job-relevant principles and ask, "What can the applicant do to demonstrate that he knows this principle?" He should help the group convert one or more responses into question form. It is best to wait for their suggestions, asking leading questions if necessary. The Measurement Specialist may need to write most of the first item. The group approach is continued until the Writers are offering item ideas. He should try other questions as needed to stimulate thinking (see Appendix IV-B).

Then, with each Writer in a comfortable location at a table, he should next ask them to try writing an item to measure another one of these principles. It is advisable that they work alone initially until they have a rough item.
As two Writers finish, they can exchange their items for review, then discuss and revise them. (While some Item Writers are concentrating on their individual assignments, this discussion is best done at a remote table to avoid distracting them.) The Measurement Specialist may need to assist some individuals in getting started. He should draw the individual (or a pair of individuals, if two are having difficulty) aside for individual training. He may wish to go through the initial training step again, asking each to demonstrate his own knowledge of the principle, then help him convert this to item form; or he may try a different approach in getting them started.

If two Writers finish a joint review and revisions ahead of the others, they can assist other Writers or try writing an item to measure a different principle. (The Measurement Specialist will probably wish to examine the "finished items" before allowing the Writers to assist others.) When all items have been written to the satisfaction of two Writers, the Writers can take a break while the Measurement Specialist reviews the items and prepares for group discussion. (The Measurement Specialist should be prepared to suggest desirable revisions if satisfactory modifications are not proposed by the Writers.) When the group reassembles, the items should be written on the board. (An alternative, where time and facilities permit, is to type them in multiple copy or photocopy them so that each Writer may have a copy.)

The Measurement Specialist should give each Writer an opportunity to review the items, and then consider the items with the Committee, one at a time, discussing the merits and shortcomings of each and methods of improving them. The Measurement Specialist should listen to the group's suggestions for improvement before offering his own. For this first set of items, a particular effort should be made to produce usable revisions in order to demonstrate the kind of improvements which can be made and to avoid a tendency to discard later items too hastily. The Measurement Specialist may wish to tell the Writers that when they are in the production phase later on, it is sometimes more efficient and economical to discard an item and begin again. Too hasty rejection, though, may eliminate an item which tests an important point but which is simply more difficult to measure.

After the first set of items is in acceptable form (some may be discarded), the process should be repeated with a new set to test another knowledge requirement. Item Writers can now begin following the procedures outlined in Appendix IV-F for submitting items on half-sheets and in the prescribed format. When some skill has been developed in writing items to test knowledge, the Measurement Specialist should suggest that it is time to go to a more difficult item category: one measuring ability to apply knowledge. The Writers can be introduced gradually to the Item Writing Guide (Appendix IV-B) at this time through drawing on the examples given there. The process used with the knowledge items should be repeated. The rules for writing multiple-choice items (Appendix IV-D) should not be introduced until after the group has had some successful experience in writing satisfactory items. Premature introduction of these rules could be discouraging. After some experience, many of them will already be familiar.
After the Item Writers feel they understand the general approach to item writing well enough to start working independently, the task of spelling out the specific position requirements as item objectives can be resumed. It may be more efficient to assign this stating of item objectives to pairs (or trios) of Writers. If groups are assigned different parts of the position requirement outline, the resulting list of objectives produced by one team should be reviewed by the others before becoming a list from which to work. (If item ideas occur to the Writers as they are defining the item objectives, the Writers should be encouraged to jot them down sufficiently to be able to come back to them later in actual item writing.) Where some individuals have been assigned to the Item Writing Committee because of special competences, they will normally be assigned relevant parts of the outline for stating item objectives; however, it is desirable to postpone item writing in their specialty until they have demonstrated skill in writing items which the other Writers are better able to review.

When the objectives have been defined and agreed upon, the Measurement Specialist should assign different parts of the outline to different Writers or pairs of Writers. The Measurement Specialist will probably find that some can work better in pairs while others work better as individuals. Although this is the beginning of the item production phase, it may also be considered a continuation of training, since Writers will continue to learn from review. As a few items are written, they should be reviewed by someone not involved in the writing. If the item survives the initial review, it should be submitted to others who have not yet seen it. Those items which are subject to disagreement or to difficulties in polishing may be made the subject of group discussion as part of training. These group discussion sessions can also be used to point out the kinds of difficulties which members are encountering and to discuss suggestions for overcoming these difficulties. Items which appear to be especially good should likewise be identified in the group sessions. As item writing improves, it will probably be more efficient to accumulate a number of items before reviewing them.

VII. Item Writing and Reviewing

Overview. Selected references; making efficient use of workshop time; estimating item difficulty; modifying items to make them more effective; purpose of and procedures for item review; handling the unsuccessful Item Writer.

A. Selected Background References on Item Writing1

Much has been written on suggestions for writing effective test questions, especially multiple-choice questions. Some of these have been excerpted or summarized in the Item Writing Guide (Appendix IV-B). The following references contain suggestions and sample items illustrating "good" and "poor" item writing: Adkins, 1974, chap. 2; Ebel, 1972, chaps. 5 and 8; Educational Testing Service, 1973a; and Wesman, 1971. In addition, Bloom, Hastings, and Madaus' Handbook on Formative and Summative Evaluation of Student Learning (1971) is a helpful reference; it discusses approaches to item writing by subject field, and considers ways of testing comprehension and higher levels of application and analysis. This publication includes a condensation of Bloom's Taxonomy of Educational Objectives (1956), which may also be helpful to have as a resource.
The suggested set of item writing rules (Appendix IV-D) consists of those rules which seem most important and relevant to problems most likely to occur in situations for which this Kit is designed. The rules given here and in the reference texts should not be considered inviolable if the principle underlying the rule is understood and there is a good reason for violating it. Other rules may be suggested by the Writers as they gain experience, and these may be added to the list. If many suggestions are offered, they can be consolidated or combined with rules already in the list to avoid excessive length.

1 See Bibliography for complete information on references cited.

B. Efficient Use of Workshop Time

Although learning to write items requires time, the efficiency with which items can be produced can be improved by adherence to certain principles of operation such as the following:

1. Establish an assignment and checklist procedure to avoid unwanted duplication. It may be helpful to set up a chart showing item writing goals and completions to provide a ready reference to current status. Periodically check to see that test specifications requirements are being met.

2. Set up a routine which will process items systematically through the review phase, once the training phase is completed and items are accumulated for review. Items with minor flaws can be routed through three reviewers before returning them to the author for revision. Items with major flaws should be returned to the author's "in-folder" before further review.

3. Keep a record of individual productivity. If an individual is turning out an unusually small number of acceptable items, investigate the reasons and attempt a remedy.

4. Use brainstorming sessions to facilitate producing items to measure principles or applications which are causing particular difficulty.

5. Incubate those ideas which are very difficult. Rather than pursue tenaciously a difficult measurement task, it may help to leave it and come back to it later.

6. When difficulties occur, try different approaches rather than sticking with the initial idea.

7. Allow individuals to work alone, in pairs, or in groups, depending on their work style and efficiency. Before assigning writing on an extended group basis, however, make certain that productivity, in terms of number and/or quality of items, is better than that obtained by the same individuals working alone.

8. Avoid personality conflicts in assigning team membership.

9. If leadership in the workshop is being shared by the Measurement Specialist and an agency staff person, determine in advance the roles and decision-making authority of each.

10. Follow the order of review recommended in D below, proceeding to the next review stage only when the item meets approval at the current stage.

C. Achieving Appropriate Item Difficulty

Item difficulty should center at a percent pass corresponding to the desired mean as a percent of the total number of items; i.e., if the goal is a mean of 70% of the number of items, item difficulty (percent pass) should center at .70. (Without previous experience, the Item Writers will not be able to estimate item difficulty well.) If past experience confirms that actual item difficulty is greater than the planned item difficulty, then Item Writers should aim at a somewhat higher percent pass, perhaps .80 (see Appendix V-A).
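The arithmetic behind this guideline can be made explicit. The following sketch is illustrative only and is not part of the Kit; the function names, the adjustment figure, and the sample p-values are the writer's own assumptions.

```python
# A minimal sketch of the difficulty-target arithmetic described above.
# Numbers are illustrative, not taken from the Kit.

def target_percent_pass(desired_mean_pct, expected_shortfall=0.0):
    """Percent pass at which writers should aim their items.

    desired_mean_pct   -- desired test mean as a proportion of the number
                          of items (e.g., 0.70 for "70% of the items")
    expected_shortfall -- how far, from past experience, observed percent
                          pass tends to fall below the writers' target
    """
    return min(desired_mean_pct + expected_shortfall, 1.0)

def expected_mean(p_values):
    """Expected number-right mean of a test built from items with these
    percent-pass values (rights-only scoring): the sum of the p's."""
    return sum(p_values)

if __name__ == "__main__":
    # Aiming for a mean of 70% of the items when past items have run
    # about .10 harder than intended -> center writing near .80.
    print(round(target_percent_pass(0.70, expected_shortfall=0.10), 2))  # 0.8
    # Ten items centered near .70 give an expected mean of about 7 right.
    print(round(expected_mean([0.65, 0.70, 0.75, 0.70, 0.68,
                               0.72, 0.70, 0.66, 0.74, 0.70]), 2))       # 7.0
```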
When a person is writing items in his own area of specialization, there is a tendency to underestimate the difficulty of a concept or principle he is testing; in an effort to get plausible misleads he may make the item even more difficult. For this reason an estimate of item difficulty made by the reviewers will probably be more accurate than one made by the author of the item.

Since item difficulty can usually be decreased (or increased) by revision, the Item Writer need not be preoccupied with difficulty when writing the item. (See suggestions regarding revision of item difficulty in E below.) He should focus on achieving a valid measure of the concept he is attempting to measure. Suggestions are made in Appendix IV-B which may help the Item Writer achieve the desired difficulty. The Measurement Specialist can assist by observing item writing in progress and making a casual review of items which appear lengthy, checking whether there seems to be unnecessary wordiness or the Writer is devising too complicated a problem situation. In the latter case the Measurement Specialist can question whether the Item Writer is attempting to test more than one concept or process and, if so, whether these can be tested with separate items.

D. Item Review

A list of review suggestions which may be copied for Item Writers is included in Appendix IV-G. These may be distributed when the Writers begin to work individually or in teams. Prior to that time the Measurement Specialist should make sure that all items written in the training phase meet these requirements.

The major areas on which the Measurement Specialist and Item Writers should concentrate their review are shown below. There is an advantage in considering each of these four areas separately and in this order. If there is a need to revise an item on the basis of one stage of the review, the changes can be agreed upon before going further. The changes at each stage could well affect those reviews which follow. For example, a criticism which appears to affect only one option may lead to changes in other parts of the item, and time spent reviewing the item for grammar and punctuation may be wasted.

1. Content. The reviewer needs to check that the content is relevant to the job, in terms of both the concept and the level of the concept, and that it is content which is required before entry to employment. It should not measure job knowledge that is normally expected to be learned on the job. The reviewer should make certain that there is one and only one correct answer to each question. Since all Item Writers presumably have working or supervisory experience with the job, they should as a general rule be able to answer any item required for selection. If more than one writer has trouble with a given item, it is well to examine it carefully for appropriateness. However, a level of knowledge beyond that required for job performance may be required to spot errors in assumptions or points which are open to disagreement among experts. Therefore, items that cover specialized content and have been written by specialist members of the committee should be referred for review to at least one other specialist. If no one else on the committee has this specialist background, these items must be submitted to specialists outside the committee. The Measurement Specialist should submit the items in a reasonably finished form and allow the reviewer adequate time for his review.1

1 Proper security precautions should be taken by the Measurement Specialist to insure that test materials are not compromised in mailing them to these specialists.

2. Difficulty level. Each reviewer should be asked to make two estimates of difficulty:
a. The percentage of those on the job(s) for which the test is designed that could answer the question;

b. The percentage of the applicant population that could answer it. (If the subject is taught in high school, this estimate can be directed at a typical high school population taking the course.)

It may help the reviewer to think of specific individuals, one on-job employee and one typical applicant, and evaluate the likelihood that such individuals could answer the item. What aspects of the stem or options might cause the most difficulty? Has the item been made artificially difficult, i.e., can a person understand the principle being tested and still miss the item?

Estimates of difficulty made by reviewers may range widely. Unless there is some reason to doubt the estimates of some reviewers, the average estimate may be taken as a basis for judging the suitability of item difficulty for the pretest. (Items may be included in the pretest even if some estimates for the applicant population are as low as .30 or as high as .90.)

3. Measurement principles. The reviewers will need to review the item relative to the list of Item Writing Rules. This list appears to be long, but with practice there should be less need to refer to it with each item reviewed. This review should be performed by both the Item Writers and the Measurement Specialist.

4. Editorial review. This review will ordinarily be done only after the item is considered in final pretest form. It should be done by someone with competence in English usage and expression. The Measurement Specialist or one of the Item Writers may have such a background and be qualified. If not, the Measurement Specialist should enlist the services of another person who can do this review. There are some advantages in having the item reviewed for this purpose by someone not familiar with the area being tested. Such a person can check not only the clarity, grammar, expression, spelling, and punctuation, but can determine whether the item is answerable by a person without knowledge of the field. The catching of errors is facilitated if the editor reads through a set of items once for each of the following: clarity of expression, grammar, spelling, and punctuation.

E. Item Revision

When an author has accumulated several items which have been returned by the reviewers with review notes, he should interrupt his writing at a convenient time and attempt revision. Some revisions can be made easily and the items forwarded to those who have not yet reviewed them. The item authors will wish to refer other items back to the person(s) who made the comments to determine whether the author's modifications meet the objections. Still other items may be held for sessions where two or more Item Writers endeavor to work out a mutually satisfactory revision on a number of items. Frequent interruptions should be avoided.

If an item appears too complex and time consuming, it may help to write out the steps required and see whether two or three items can be written to test these steps or the underlying principles separately.

If item difficulties appear to be running too high (i.e., low percent pass estimates), the following revision techniques may help:

1. Examine the stem for unnecessarily difficult wording, and simplify it.
2. Identify those misleads which are likely to be most attractive to better applicants and substitute misleads which will be less attractive to this group.

3. Try splitting the item into two items if it is measuring two concepts or processes.

If the item is too easy, substitute misleads which have an element of truth but are defensibly wrong.

F. Handling the Low-Productivity Writer

Not everyone becomes a successful item writer. Despite efforts at preselection, some individuals may decide after trying it for two or three days that they are not interested in continuing, especially if they have other employment responsibilities. If the job can be handled either without them or with replacements, they should be allowed to drop out. Some, for a variety of reasons, may have difficulty producing their share of usable items. An effort should be made to determine whether lack of productivity results from a lack of self-confidence, from a too-rigid approach to an assignment requiring flexibility, or from other reasons. The following remedial steps can be taken by the Measurement Specialist:

1. If the difficulty is due to lack of self-confidence, discuss the problem with the Writer, encouraging him to relax and focus on item writing rather than on his skills. It is possible that he is concerned because other Item Writers are producing more than he is.

2. Try a two- or three-person brainstorming session where a number of item ideas that the Writer can work on are jotted down.

3. If the Writer has other skills, for example, good verbal expression, team him with an Item Writer who has more ideas than he can develop.

4. Try him on developing item ideas which others have set aside. Some persons are fluent in generating item ideas but may have less patience with putting them in operational form. Others who have difficulty in generating ideas may be skillful in reworking rather embryonic ideas into effective items.

If these efforts fail to produce an effective Item Writer, and if the individual cannot serve a useful role in complementing the skills of others, it may become necessary for the Measurement Specialist to talk with him about dropping out. Since the Measurement Specialist has responsibilities to the others, he must decide when the gains from working with weak Item Writers are outweighed by the loss in time available for working with those who are productive. The nonproductive Item Writer will often welcome the opportunity to drop out. Any such departures should, if possible, be in a cordial atmosphere. Adverse reaction as to the quality or worth of the project by dissatisfied participants can be detrimental to otherwise productive programs and should be avoided.

VIII. Pretesting

Overview. Purpose of pretesting; selecting a criterion; selecting a population sample; data to be collected from the pretest; precautions to insure that item analysis results will be useful in preparing the final form; relative merits of separate pretests vs. pretesting in an operational form; predicting operational form item difficulties from pretest difficulties.

A. Purpose of Pretesting

The final test will do a more accurate job of measurement if items can be selected with pretest knowledge concerning each of them. The purposes of this pretest are to:

1. Determine the ability of the item to distinguish between applicants who have an overall grasp of the content area being tested and those who do not;

2. Determine the difficulty of the item;

3. Provide clues to ambiguities and other needed revisions;
4. Provide clues to testing time required to make the final test a test of power rather than speed.

B. Pretest Criterion

To evaluate an item's ability to differentiate candidates relative to proficiency in the content area, a valid and reliable criterion is needed. It is usually assumed for item analysis purposes that a total score on a set of items designed to measure a given content area is a reasonable measure of knowledge of that content area. Thus the total score on the test of which this item is a part is generally used as the criterion.

The usefulness of a criterion depends upon the accuracy with which it represents the qualities the test is designed to measure. General performance criteria are relevant, but they include many qualities, such as motivation, which the test is not designed to measure and hence are not useful for item analysis. For example, knowledge of mathematics can be appraised more accurately on the basis of some other measure of mathematics. It is preferable to avoid including in the same time limit items from substantially different content areas, such as economic principle items and actual accounting problems requiring extensive calculations, both because an individual may spend a disproportionate amount of time on those he finds more difficult (or easier) and because it helps for item and test analysis purposes to have separate scores for each.

C. Population Sample for Pretest

The population sample for the pretest should be comparable to the group taking the operational test with respect to education, cultural background, age, and sex. While a sample of two or three hundred is desirable, even a group of 30 is better than not pretesting at all. Samples of only 30 will not yield reliable indices, nor will they permit analysis of minority group responses to the items, but they will provide some information on difficulty and may give some clues to possible ambiguities.

The following three options are commonly used in selecting pretest populations. Options 1 and 2 provide the most representative populations.

Option 1. The pretest may be included in or with an operational test in the same content area.

Option 2. If the job is one for which there is frequent recruitment, but this content has not been included in previous testing, the test can be administered experimentally to an applicant population. This is an ideal procedure if the test is not used in selecting from among the first applicants who take it.

Option 3. The pretest can sometimes be administered to on-roll employees. This has the problem of a restricted range of ability with respect to the qualifications required to perform the job. If the group includes some with minimal ability in this particular area, the analysis will generally be useful. In deciding whether an on-roll population will provide helpful data, consider the following:

a. What selection criteria were used; i.e., is there likely to be a spread in the on-roll population relative to the ability being measured? The Measurement Specialist ordinarily should not use an on-roll population if the employees are highly selected relative to the knowledge or skill being tested.

b. What percentage of the candidate population has typically been selected?

The on-roll group could also present problems in test security; if used, it is advisable to give the pretest before the operational test is announced.

NOTE: An item analysis using operational population data should also be run on the items in the final test phase, at least the first time the test is used.
This is especially important when the pretest sample is small or not typical of the operational population. If the number taking the operational form is also small, it may be necessary to accumulate data and run the analysis on the results of more than one administration. In many practical situations it is impossible to meet desirable population sizes, and the Measurement Specialist must exercise discretion regarding means of obtaining evidence on individual items and operational test results. (If the Measurement Specialist does not have a strong background in statistics, he should seek professional advice with respect to his particular problem from someone so qualified. Such assistance may be obtained from professional consultants and Psychology Departments of large universities.)

D. Description of the Pretest

In Option 1, it is best if pretest items are added to the operational test as a separately timed section. (If the pretest items are within the same time limit, their influence on the items in the operational part is unknown.) If the applicant group is large, more than one set of pretest items may be used, with each set being taken by different examinees. In this case the different pretest forms should be distributed alternately to the examinees to reduce sampling bias. In Options 2 and 3, the pretest should have a format and content similar to that of the operational test if item statistics are to be used in estimating total test statistics. Regardless of the option, the time per item should be comparable with that in the operational form so that item statistics will reflect operational test conditions. The Measurement Specialist should try to arrange the items in order of difficulty from easy to hard. Preferably he will start the test with one or two items which are short and seem easy to the examinee. (This is less important for Option 1.) Items which could be time-consuming should be placed toward the end, but these items should be few in number and used primarily to keep early finishers busy. The Measurement Specialist should then review the assembled test as suggested in Appendix IV-K.

If the pretest is in a separate booklet, it is recommended that the Measurement Specialist use standard instructions such as those shown in Appendix IV-H. If the pretest is appended as a separate part, the instructions may be abbreviated as for part instructions. The Measurement Specialist must make certain that the pretest item numbering and option designation agree with the answer sheet; i.e., if the options are lettered A, B, C, D, and E on the answer sheet, do the same on the pretest. Where there are separately timed parts, he should continue numbering across parts or use an answer sheet which provides for parts. If the answer sheet and pretest cannot be made to match, he should either find an answer sheet which does or make a special answer sheet for the pretest items. (The pretest may have fewer items than shown on the answer sheet.)

If the pretest population consists of on-roll persons (Option 3), it will usually be necessary for them to know the purpose of testing, but the Measurement Specialist should avoid relating it to a specific upcoming recruitment or operational testing. They should be given an explanation for participating similar to that given the Item Writers, with the further explanation that results will not be given by name to anyone other than the participant. (Motivation is likely to be greater if names are entered on the answer sheet.)
The Measurement Specialist may also wish to tell examinees that if they want to know their own results (score, relative standing, etc.), they may ask for them and results will be given to them privately. This, too, may improve test-taking motivation.

E. Administration of the Pretest

Pretest administration conditions should match those of the operational tests with respect to factors which may affect test results. A room which is well-ventilated, well-lighted, and free from distractions should be used for testing. Conversations in the testing room should not be permitted while testing is under way. A clock must be visible to the participants, and the test administrator should have a stopwatch (preferably two, in case one breaks down). Instructions for test administration must be prepared in advance. The Measurement Specialist will need to plan and record exactly what will be said, keeping in mind that essentially the same instructions should be used as for the operational test.1 The test administrator should follow the instructions as given.

1 See Clemans, 1971, chap. 7.

The same procedures should be used to control access to the test as are used for the operational test:

1. Keep materials locked up before and after administration.

2. Count materials before and after administration and do not dismiss the examinees until materials are accounted for. Make certain the test booklet, answer sheet (or answer booklet), and all scratchwork are returned. (Normally, scratchwork on pretests will be done in the test booklet, since it is a good practice to use a test booklet only once. If separate paper is allowed, and if more than one sheet of scratch paper is needed, staple such sheets in booklet form.) If procedures allow those who finish early to leave early, account carefully for materials before allowing the individual to leave the room.

3. Allow only one examinee at a time to leave the room during a timed part of the test.

4. Avoid answering questions from examinees in a way which will give some examinees an advantage over others.

If it is not feasible to adhere precisely to the operational form procedures, the Measurement Specialist must determine whether a specific deviation is likely to affect test results and take responsibility for authorizing the deviation.

F. Statistics to be Collected by the Pretest Analysis

There are several statistics which can be calculated to help in the analysis of test items during the pretest phase. These are:

1. Difficulty indices. The most commonly used index in evaluating item difficulty is the percent pass (p or %+). It may also be used in estimating the mean and standard deviation of a set of items. A disadvantage is that p cannot easily be adjusted if the population on which p is based is not fairly similar to that taking the operational test. Also, p-values are not typically linear with respect to difficulty, i.e., a given difference in p means a greater difference in ability in the middle of the range than at the extremes.

Another index of difficulty is the delta. The delta is based on p but is a normalized item statistic, i.e., it is converted to a scale (mean = 13, standard deviation = 4) such that a difference of one delta unit represents about the same difference in difficulty at different points on the scale.1 Deltas may be averaged and can be used for putting item difficulties on the same scale from one population to another, where the populations differ primarily in level (see Appendix IV-I for procedures).

1 Delta is based on the assumption that ability is normally distributed. See Henrysson, 1971, pp. 139-140, for a discussion of delta.
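Where computer facilities are available, the two difficulty indices can be computed directly. The sketch below is illustrative only and is not part of the Kit's procedures; it assumes Python with the numpy and scipy libraries, the function names are the writer's own, and the delta conversion shown is the conventional normal-deviate transformation onto a scale with mean 13 and standard deviation 4 (larger delta indicating a harder item).

```python
# Illustrative sketch (not from the Kit): percent pass and delta for one
# item, assuming Python with numpy and scipy available.
import numpy as np
from scipy.stats import norm

def percent_pass(item_responses):
    """Proportion of examinees answering the item correctly.
    item_responses -- sequence of 0/1 scores for one item."""
    return float(np.mean(item_responses))

def delta(p):
    """Conventional delta transformation: mean 13, SD 4, larger values
    meaning harder items (assumes normally distributed ability)."""
    return 13.0 + 4.0 * norm.ppf(1.0 - p)

if __name__ == "__main__":
    scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]      # ten examinees, one item
    p = percent_pass(scores)                     # 0.70
    print(round(p, 2), round(delta(p), 1))       # 0.7  10.9
```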
2. Indices of item discriminating power. The biserial correlation (r bis) and the point biserial correlation (r pbi) are the two most common indices. The point biserial is useful in estimating the standard deviation of the proposed test. (If the pretest population is not similar to the operational population, and if there is reason to doubt whether similar results would be obtained with the operational population, the item analysis should be used only for clues to item weaknesses and difficulty, and not for estimation.) The biserial correlation cannot be used directly in estimating the standard deviation, but it has the advantage of being more easily interpretable. The point biserial's maximum value depends on p, and it is thus less likely than the biserial to yield the same result when used with populations at different ability levels.1 The biserial correlation can be converted to the point biserial through use of Table B, Appendix VI, if one wishes to estimate the standard deviation of a set of items.

If computer facilities are available, either the biserial or the point biserial can be easily obtained. If computer facilities are not available, two short-cut methods give approximations of the biserial. If there are 100 cases or more in the population sample, the performance of the top 27% (based on the criterion) may be compared with the performance of the bottom 27% by using Fan's tables (Fan, 1952). These tables provide an estimate of the biserial, p, and delta. (See Appendix IV-I for guidelines on using the tables.) Where the sample size is less than 100, one can either use all of the cases and compute the biserial or point biserial, or use a method proposed by Diederich (1973), which compares the top 50% and the bottom 50% (see Appendix IV-I for procedures). Sample sizes under 500, regardless of method, lead to appreciable error in the index, as shown in Table C, Appendix VI.

1 The influence of difficulty on the point biserial is indicated in Appendix VI, Table B. The multiplier in the top row indicates the maximum point biserial value for given values of p when assumptions basic to the point biserial are met. The biserial correlation can exceed 1 with a skewed distribution of scores, but this is usually not a critical consideration. See Lord and Novick, 1968, pp. 340-344, for a more detailed comparison of the biserial and the point biserial.

Since the number of cases in the pretest population is likely to be quite small in the situations for which this Kit is intended, pretesting can provide only a rough estimate of item discriminating power. If the items are analyzed each time they are used, however, data can be built up, and items which have consistently higher indices can be used with greater confidence.

3. Information on answer options. In addition to the information on difficulty and discriminating power, the item analysis should provide data on the answer options. When a computer is used for the analysis, it should provide distributions of scores of those choosing each option. At a minimum, the mean score of those choosing each option should be provided. In the high-low methods it is customary to tally the number in the high group and the number in the low group who chose each option (see Appendix IV-I).
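Where a computer is used rather than the Fan or Diederich short-cuts, the two discrimination indices can be computed directly from the standard formulas. The sketch below is illustrative only and is not part of the Kit; it assumes Python with numpy and scipy, takes the total score as the criterion as described in B above, and the response data shown are invented.

```python
# Illustrative sketch (not from the Kit): point biserial and biserial
# discrimination indices for one item against a total-score criterion.
import numpy as np
from scipy.stats import norm

def discrimination_indices(item, total):
    """Return (p, r_pbi, r_bis) for one item.

    item  -- 0/1 scores on the item, one per examinee
    total -- criterion scores (usually total test score), same order
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    p = item.mean()
    q = 1.0 - p
    s_x = total.std()                        # population SD of the criterion
    m_pass = total[item == 1].mean()         # criterion mean, item answered right
    m_fail = total[item == 0].mean()         # criterion mean, item answered wrong
    r_pbi = (m_pass - m_fail) / s_x * np.sqrt(p * q)
    y = norm.pdf(norm.ppf(p))                # normal ordinate at the p/q cut
    r_bis = r_pbi * np.sqrt(p * q) / y       # standard conversion r_bis = r_pbi*sqrt(pq)/y
    return p, r_pbi, r_bis

if __name__ == "__main__":
    item  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
    total = [30, 42, 28, 35, 31, 40, 29, 33, 38, 27]
    # Roughly p = 0.60, r_pbi = 0.58, r_bis = 0.74 for these made-up data.
    print(discrimination_indices(item, total))
```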
IX. Item Revision and Final Test Assembly

Overview. Revising items on the basis of item analysis; making changes without re-pretesting; considerations in test assembly; predicting final test statistics from pretest data.

A. Preparation of Items

If the test items are likely to be used again, it will probably be worthwhile for the Measurement Specialist to have each typed or pasted on a 5 x 8 card with the item analysis data recorded on the back, as suggested in Appendix IV-F. This facilitates working with the items and moving them into different classification schemes for comparison or making choices at the assembly stage. It enables the user to spread the items out on a table for easy review of content while assembling the final test. It also provides a convenient form for filing the items between uses.

B. Use of Item Analysis Data to Revise Items

The Measurement Specialist should make a preliminary review of the item analysis data to identify those items which may be in need of revision. The following are bases for such identification:

1. Item is too easy or too difficult;
2. Index of discriminating power is below that established as desirable;
3. As many low scorers as high scorers chose the correct option;
4. More high scorers than low scorers chose a wrong option;
5. More high scorers chose a wrong option than chose the keyed option;
6. Few or no examinees chose a given option;
7. Many omitted the item.

See Appendix IV-J for a discussion of possible corrective measures. On the basis of this review, the Measurement Specialist may be able to make some changes, but if he is not trained in the field of the test content he should simply make notes to call attention to clues to poor item performance and defer revision until he meets with the Committee of Item Writers.

C. Tentative Assembly

The Measurement Specialist can make a tentative assembly of the final test using the Item Writers' classifications of the items and the item analysis data. He should be guided by the outline in the test specifications as well as by the item analysis data. Items testing the same area of content and considered usable (with or without revision) may be clipped together pending the meeting with the Committee of Item Writers. He can then make a quick tentative check on the test mean and standard deviation as part of this test assembly. A sum of p-values will provide an approximation to the mean if rights-only scores will be used and if the population on which the p-values were obtained is similar in ability level to the candidate population. Allowances should also be made for any planned or expected differences between pretest and final populations in percent completing the test. The standard deviation may be estimated from the formula

SD = sum over items of sqrt(p q) x r pbi, where q = 1 - p

(Gulliksen, 1950, p. 377; Lord and Novick, 1968, p. 330).1

1 If the pretest and candidate populations are not sufficiently similar to warrant using the p's directly, and if equated deltas have been obtained, the mean may be estimated by the following procedure suggested by Angoff (1971, p. 586): convert equated deltas to p and sum.
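A minimal sketch of these two estimates (the sum-of-p approximation to the mean and the Gulliksen formula for the standard deviation) is given below. It is illustrative only and not part of the Kit; it assumes Python with numpy, and the item statistics shown are invented.

```python
# Illustrative sketch (not from the Kit): estimating the number-right
# mean and standard deviation of a rights-only test from item statistics.
# Assumes the pretest population resembles the candidate population.
import numpy as np

def estimated_mean(p_values):
    """Estimated number-right mean: the sum of the item p-values."""
    return float(np.sum(p_values))

def estimated_sd(p_values, r_pbi_values):
    """Gulliksen's estimate: SD = sum over items of sqrt(p*q) * r_pbi."""
    p = np.asarray(p_values, dtype=float)
    r = np.asarray(r_pbi_values, dtype=float)
    return float(np.sum(np.sqrt(p * (1.0 - p)) * r))

if __name__ == "__main__":
    # Five items with made-up pretest p-values and point biserials.
    p_vals = [0.80, 0.70, 0.65, 0.75, 0.60]
    r_pbi  = [0.35, 0.42, 0.30, 0.38, 0.45]
    print(estimated_mean(p_vals))                    # 3.5 items right, on average
    print(round(estimated_sd(p_vals, r_pbi), 2))     # about 0.86 for these values
```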
D. Meeting of Item Writers for Final Test Assembly

It will facilitate review if each Writer has a copy of the tentative selection of items and a summary showing how these items fit the test content specifications. If the items are photocopied directly from the cards (see Appendix IV-F), the opportunity for error is decreased. The Measurement Specialist should call attention to the item analysis results as needed, and the Committee may readily suggest some minor changes. Other revisions should probably be left until after all the items are discussed, since the Committee may decide to make substitutions which will remove the need to revise some items. (Item Writers may wish to suggest changes for the unused items, but if the next need for a test in this area is far in the future, it may be unprofitable to spend much time in doing so.) The Measurement Specialist or one of the Writers should keep a record of all objections, however, with an indication as to whether there is consensus among the Writers, since a few replacement items may be required at the Panel review stage.

It will be obvious that some changes will affect the item difficulty substantially. Some educated guesses about the effect on difficulty may be made using the information on the number of individuals choosing options which are being deleted or changed, and by guessing the effect on the choices of these individuals on substituted options. Changed items may be used if the Measurement Specialist feels reasonably assured that the changes will improve rather than adversely affect the usefulness of the item and the test. Any changes should be examined carefully by the Committee of Item Writers to make certain that no biasing factors have been introduced. Also, the Committee should check that the classification of the item has not been changed by the revision. Items used to equate the difficulty of the new test to an earlier operational test must not be changed unless the change is certain not to affect their difficulty.

In examining items for possible ambiguities as a result of a low biserial, it is important to bear in mind that a low index does not necessarily mean a poor item. The index indicates the correlation between performance on the item and performance on the pretest as a whole. The item in question may test an important concept, but understanding of this concept may not be highly correlated with understanding of the other concepts tested. For example, an item may test an important trigonometry concept, but if there are only a few trigonometry concepts in a test which covers algebra and geometry as well, the trigonometry items may have lower biserials than they would have in a test of trigonometry only. If nothing can be found wrong with the item, and its dissimilarity to the remainder of the test could account reasonably for its low biserial, the item may be left in. It is undesirable to eliminate items just because they lack homogeneity with the rest of the test. A test with all items testing the same concept will have a higher reliability and higher biserials than one with a more heterogeneous content, but it is likely to have a lower validity, since it will cover fewer aspects of the relevant performance criterion. If the effect of revisions on the items is unknown, the items should be reserved for re-tryout in a future pretest.

When a tentative selection of items meets Committee approval, the Measurement Specialist should check the content against the test specifications and estimate the mean and standard deviation. He should compare the estimated standard deviation with the estimated standard error of measurement. He should summarize the content relative to the list of position requirements and check against the weights originally assigned. Finally, he should check the distribution of keyed answer options. The number in each option position should be roughly the same, but he should avoid having exactly the same number of each option.
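The last check, the tally of keyed answer positions, can be made by hand or, if a computer is at hand, with a few lines such as the following. The sketch is illustrative only and not part of the Kit; the answer key shown is hypothetical.

```python
# Illustrative sketch (not from the Kit): tallying keyed answer positions
# for an assembled test to confirm the counts are roughly, but not
# exactly, equal. The key below is made up for the example.
from collections import Counter

def keyed_option_counts(answer_key):
    """Count how often each option position (A-E) is the keyed answer."""
    return Counter(answer_key)

if __name__ == "__main__":
    key = "BDACEBADCABCDEAEBDCEBACDA"      # hypothetical 25-item key
    counts = keyed_option_counts(key)
    for option in sorted(counts):
        print(option, counts[option])      # A 6, B 5, C 5, D 5, E 4
```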
It is unlikely that an ideal choice of items can be made to meet all criteria: content, mean and standard deviation, and item discriminating power. Content should be given priority, if this does not result in substantial departure from the other ideals. The Measurement Specialist should remember, however, that the test is intended to sample the candidate's knowledge rather than measure every relevant aspect of it, and omission of one or two important concepts is unlikely to affect the test significantly if other criteria are met.

The Measurement Specialist should next prepare the test instructions (see Appendix IV-H). Enough examples of items should be included in the test instructions, when the test-taking procedures are unusual, to make certain that the examinee understands the instructions.

The test should be typed according to the format described in Appendix IV-H.1 Before reproducing it, an Item Writer should take the test, filling out an answer sheet as he does so. This "key" should then be checked carefully against answers keyed as correct on the item cards.

1 See Thorndike, 1971, chap. 6, for illustrations.

X. Final Test Review

Before the Item Writing Committee is dismissed, a photocopy of the test draft should be given to each Item Writer. The test should be checked against test specifications, and each Item Writer asked to take the test as if he were a candidate and to prepare a key. If the Item Writers, who are already familiar with the items, fail to finish in about half the time allotted the candidates, the test should be examined for items which are unduly time-consuming.

A copy of the test, as approved by the Item Writers, should then be sent to each Supervisory Panel member for review. Each member should be sent a copy of the review points as spelled out in Appendix IV-K, and two or three copies of the review comment sheet. The Measurement Specialist should arrange a meeting for the Panel to discuss their comments. A day or less should suffice for this meeting in most cases.

At this meeting the Measurement Specialist should be prepared with the unused items on cards (with item statistics recorded and Item Writer comments available). These may be needed in the event the Panel has a strong objection to certain items in the assembled test.

The focus of the Supervisory Panel will be on the suitability of the test as a whole: Will the qualified candidate be expected to do well on the test? Will the low-scoring individual be significantly more likely than the high scorer to do poorly on the job? Do the items, taken together, test a representative sample of the content judged important for the job? Will the test have face validity with the applicants, i.e., may they be expected to consider it a reasonable requirement for the job? Minority group members of the Panel should look for possible irrelevant cultural factors which might influence test performance but not job performance. The Measurement Specialist should encourage the Panel members to record their reactions to the individual items, since they may spot something not noticed by the Item Writers. If the focus is on the overall test, however, they may be less likely to seek minor flaws in items which are probably satisfactory in the form in which the Item Writers left them.
The Measurement Specialist should inform the Panel that changes should be made at this stage if there is reason to believe that measurement will be improved, but changes should not be made simply to satisfy a personal style preference which may be a matter of opinion.

XI. Final Test Production

In preparing final test copy, the Measurement Specialist should bear in mind factors which will adversely affect test scores in a manner not related to the knowledge or skill being tested. Spacing, size, and style of print should be such that the examinee can clearly perceive the intent of the question and what is expected of him. He should know where one item ends and the next begins, and the arrangement of options should not mislead him into marking his answer incorrectly on the answer sheet.1 Mathematical symbols should be clear, and fractions should be presented in a manner which will avoid misinterpretation; for example, a vertically written fraction is less likely to be misinterpreted than 1/2 x. If, because of the particular answer sheet being used, numbered options are used with numerical answers, or if lettered options are used with letter answers, spacing should be such as to avoid confusing the designated option with the answer. Also, the Measurement Specialist should avoid standard answer sheets which have odd-numbered items in the left-hand column and even-numbered items in the right-hand column. These are helpful in scoring odd- and even-numbered items separately for computing reliability, but they are confusing to use and will add to the examinee's problems in coping with the test situation.

1 See Thorndike, 1971, chap. 6, for illustrations.

The Measurement Specialist will need to be concerned with cost in preparing final test copy. If the test is to be used again at future administrations, it will probably reduce cost if booklets are reusable. However, this is not highly recommended. Separate answer sheets facilitate scoring. They also facilitate test analysis where computer facilities are available. Separate scratch paper will be needed for tests where computation is required. The necessity to account for all scratch paper may reduce the advantage of reusable booklets if there is much computation. Accounting for all test material after test administration can be simplified, where more than one sheet of scratch paper is required, by binding together several sheets. Use of a distinctive color will facilitate accounting for scratch paper and reduce the risk of scratchwork being done on sheets which could be carried out; but only very light colors should be used to avoid visual difficulties.

Control of the accuracy of final test copy is easier if the test is typed under the supervision of those responsible for test preparation rather than set by a typesetter. The Measurement Specialist should be sure that final copy is proofread twice against the cards. It may be desirable, especially in mathematical or scientific copy, to have the proofs proofread once again after copy has been multilithed. Even on photocopy an unwanted dot can appear, or part of a symbol can be accidentally deleted in cleaning copy.

This Kit does not cover all aspects of the test development process. It covers only those aspects dealing with item writing and assembling job knowledge tests.

References

American Psychological Association. Standards for educational and psychological tests. Washington, D.C.: Author, 1974.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Coffman, W. E. Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.

Cureton, E. E., Cook, J. A., Fischer, R. T., Laser, S. A., Rockwell, N. J., & Simmons, J. W. Length of test and standard error of measurement. Educational and Psychological Measurement, 1973, 33, 63-68.

Diederich, P. Short-cut statistics for teacher-made tests. Princeton, N.J.: Educational Testing Service, 1973.

Fan, C. T. Item analysis table. Princeton, N.J.: Educational Testing Service, 1952.

Guilford, J. P., & Fruchter, B. Fundamental statistics in psychology and education (5th ed.). New York: McGraw-Hill, 1973.

Gulliksen, H. Theory of mental tests. New York: John Wiley, 1950.

Lord, F. M. Tests of the same length do have the same standard error of measurement. Educational and Psychological Measurement, 1959, 19, 233-239.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Swineford, F. Note on "Tests of the same length do have the same standard error of measurement." Educational and Psychological Measurement, 1959, 19, 241-242.

Glossary of Some Common Testing Terms

The following terms are defined according to the sense in which they are used in testing. Some may have a more general meaning than that indicated here.

Alternative. An answer option on a multiple-choice test question.

Biserial correlation. A relationship between a criterion score and the score on the item.

Chance score. The score obtained by guessing.

Chance mean. The average score of persons guessing at random the answers to all items.

Chance range. Usually used to indicate the scores below the chance mean plus one or two chance standard deviations.

Chance standard deviation. The standard deviation of scores of persons guessing at random the answers to all items.

Content validity. Demonstrable relationship between the skills and knowledges required by the selection procedure and those required by the job. It is usually based on the judgment of experts.

Criterion-related validity. Statistical agreement between performance on the test and performance on the job.

Cut-off score. A score on a given test such that those who score higher are accepted and those who score lower are rejected. It also can include different levels of placement for different scores.

Discriminating power of an item. The extent to which those with high scores answer the item correctly, and those with low scores answer it incorrectly.

Distracter. An incorrect answer option.

Face validity. Appearing to test what is intended.

Item. A test question or problem, including, in multiple-choice tests, the set of options or alternative answers.
Job analysis. Refers to many procedures relating to analyzing jobs; as used here, it refers to analyzing duties performed on the job and the knowledges, skills, abilities, and other worker characteristics required to perform those duties.

Mean. The sum of scores divided by the number of scores.

Multiple-choice. An item type with two or more designated answer options.

Normal curve (or normal distribution).

Appendix II-B

Sample Memorandum for Nominated Item Writers

You have been recommended as a person who might be interested in participating in the development of a written test or other evaluation instrument to be used in the selection of candidates for the job of ______________________. It is important that the Agency have working on this assignment employees who are highly capable and who have experience on the job. The quality of the future work force depends on selecting those candidates most likely to be outstanding in their performance. This means having selection procedures which will identify candidates who have the qualities most important to job performance.

The assignment will consist first of analyzing the knowledges and skills judged essential to high level performance on this job and describing the specific ones which are critical to the job. It will require distinguishing the knowledges and skills which are truly important to successful job performance from those which "sound good" but are really unrelated to quality of performance. Persons participating will then be asked to write test questions that measure some of the pertinent knowledges and skills.

To assist you in deciding whether you would like to be considered further for this assignment, you will probably wish to try a sample of test item writing. First write down three or four examples of knowledge which you consider important to successful performance of the job of ______________________; next, try writing an item to measure one of these. If you wish, have it reviewed by someone else, and then try revising it on the basis of his comments.

If you participate, you will be given instruction in item writing and reviewing and will work as a member of a committee to write and review items for a test to be used in selecting among candidates for employment in ______________________.

I shall phone you next week to learn whether you are interested in participating. (Alternate statement if more persons are likely to be available than can be used: I shall phone you next week to learn whether you would like to be considered further for participating in this item writing project, or, if you wish, you may forward your item to me as soon as you have it ready.)

Arrangements have been made with your agency for those who participate to be excused from regular work responsibilities for a two- or three-week period (date), and some may be asked to participate in a later workshop of two days to assemble a final test.

Appendix II-C

Form DL  Sample Daily Log

Agency:
Job Title:
Date:
Purpose of Session:
Those present:

Columns: Topic; Decision; Pro; Con

It is recommended that all decisions be recorded in a daily log.

Appendix II-D

Form MN  Measurement Need Definition

Job Title:
Agency:
Location:
Date Completed:

Columns: Job Duties (Based on Job Description)1; Importance Rating; Position Requirement; Importance Rating; Evidence (Selection Procedure); If test, specify kind

1 Attach copy of job description.
Appendix II-E

Form BD  Background Data for Test Development

Agency:
Location:
Job Title:
DOT Code:
Date Form Completed:
Function of Test: Selection ___  Promotion ___  Other ___
Test Rationale:
Position requirements which the test is intended to identify:
Rationale for using a test:
Other selection procedures (if any) being used to identify the specified qualifications:
Use of test (e.g., weight to be assigned the test in the selection process, ranking or cut-off, etc.):
Number of job openings:
Selection ratio (number to be selected divided by number of applicants):

Population data:1 2
Columns: Number; Ratios for Black, American Indian, Oriental, Spanish Surname, Other, Total, Male, Female
Rows: Currently on job; Likely to apply; Source population

Educational background:
Columns: Less than 4 yrs. H.S.; H.S.; Less than 4 yrs. college; 4 yrs. college or more; Total
Rows: On roll; Applicant

Age:
Columns: Under 20; 20-29; 30-39; 40+; Total
Rows: On roll; Applicant

Validation strategy planned:

1 Ethnic group data collection is NOT required under USCSC regulations. Such data collection is useful for State and local governmental agencies in order to meet EEOC and OFCC regulations. Also, such data collection would be required under the draft of the Uniform Guidelines on Employee Selection Procedures by the EEOC.
2 The numerator for a given ratio is the number of persons in the given ethnic, education, or age group in a given row. The denominator for the ratio is the total number of persons for that row.

Appendix II-F

Form TS  Test Specifications

Agency:
Location:
Date Prepared:
Job Title:
DOT Code:

Columns: Content; Part; No. of Items; Item Type; Number of Choices; Time; Desired Statistical Characteristics (M, SD, % Com.); Chance (M, SD); SEM

1. Scoring Procedures: Rights only ___  R - W/(k-1) ___  Other (specify):
2. Weighting of Parts: Simple sum ___  Other (give reason):
3. Selection Standards: Minimum cut-off ___  Multiple cut-off ___  Ranking, sole predictor ___  Ranking, combined predictors ___
   If more than one predictor, what are these and how will they be combined?
   How will cut-point(s) be determined?
   If cut-point, how does it compare with chance range?
4. Norms: One-time use ___  Repeated use: %-ile norms ___  Scaled score norms (M = ___, SD = ___) ___
   Equate to existing operational form? Form ___
   Equate to future operational form(s)? Form(s) ___
   Equating method: Common items ___  Common population ___  Equate M and SD ___  Equi-percentile ___  Other:

Appendix II-G

Form PA  Pretest Analysis

Agency:
Location:
Job Title:
DOT Code:
Test Title:
Description of Population Sample:
Date of Analysis:

Columns: Part or Form; Content; No. Items; Mean; SD; p Mean; p Range; r bis Mean; r bis Range

1. Equating items, if included: Delta base population: ___  Item numbers: ___  Delta plot (attach plot of operational [y-axis] vs. pretest [x-axis])
2. Table of equivalent deltas: Pretest delta / Operational delta
3. Power vs. speed data: Percent completing pretest: ___  Last item reached by all examinees: ___  Plot of R [y-axis] vs. R + W [x-axis] (Rights vs. Rights plus Wrongs): attach
4. Item analysis results (see next sheet):

Appendix II-H

Form IAR  Item Analysis Results

Agency:
Test:
Location:
Population:  Total N:
Analysis Sample Description:
Date of Analysis:

Columns: Item No.; p; Raw Delta; Equated Delta; r bis; NR1; NO2; Criterion Group3; Total Group; Breakout of Ethnic Group4 (Black, American Indian, Oriental, Spanish Surname, Other, Male, Female)
(Under Total and under each breakout group, three columns: Hi | Low | Total.)
Mean:
S.D.:

1 Number who did not reach item.
2 Number who omitted item but responded to subsequent items.
3 If the high/low method of item analysis is used, use the three columns in each breakout to enter the number who answered correctly in the high group, in the low group, and in the total group, respectively. If rbis is computed, use the columns to enter the total score mean of those answering correctly and those answering incorrectly, respectively.
4 The USCSC does not require the collection of ethnic data. See note on Appendix II-E.

Appendix II-I

Sample Memorandum on Control of Test Materials

The Agency urgently requests your cooperation in preserving the confidential nature of all examination materials, both those you construct and those you review. The following controls are suggested:

1. Submit, in longhand, questions that you construct, and retain no copy or notes pertaining to them or to any written sources used.
2. Make no copies of questions you are asked to review or of your comments on them except for those copies used for review.
3. Keep all materials pertaining to examination questions under lock and key when they are not in actual use.
4. Permit no other person to have access to examination materials without prior clearance with the Measurement Specialist or the head of the Agency.
5. Avoid discussing the content of examination questions in such a way as to identify the topic under consideration with a civil service examination.
6. Keep in mind that you may have a tendency to inadvertently give an undue advantage to persons whom you supervise, teach, or merely talk to, and who may take examinations containing some of the questions you have worked on.
7. Except for necessary clearances, avoid disclosure of the fact that you are working on a civil service examination.

Because of the importance of security considerations in examinations for the public service, we would appreciate your signing this note and returning it to us as an indication that the foregoing types of safeguards, and any additional ones that seem desirable, have been applied. We should also appreciate your notifying us if in any case circumstances beyond your control cause you to have any doubts as to whether the confidential nature of the materials has been maintained.

Signature                    Agency                    Date

Appendix III

Checklists

A. Schedule Guide
B. Meeting Arrangements Checklist
C. Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests

Appendix III-A

Schedule Guide

The following schedule is provided as a guide to the Measurement Specialist in setting up the overall program. Those who have not previously participated in a test development program of this nature may underestimate the total amount of time required. Such an underestimate could result in agreeing to an unrealistic completion date, with a resulting pressure to spend less time than desirable on some steps.

                                                              Minimum elapsed time
                                                              from request (in weeks)
1. Program planning completed                                            1
2. Selection of Supervisory Panel completed                              2
3. Work session of Supervisory Panel initiated                           6
   (Allow 2 to 4 days for the work session, depending on adequacy of the existing job description.)
   Session completed                                                     7
4. Selection of Item Writers completed                                   7
5. Work session of Item Writers initiated                                9
   (The training phase may be expected to overlap with the production phase, but a minimum of 2 weeks total should normally be allowed.)
   Session completed                                                    11
6. Pretest plans made                                                   11
7. Pretest(s) administered                                              13
   (If pretests are administered as part of the operational battery, the schedule must of course be adjusted accordingly.)
8. Pretest analysis completed                                           15
9. Meeting of Item Writers to review analysis and make final test
   assembly decisions initiated                                         17
10. Meeting of Panel to review test                                     19
    (Allow 1 to 2 days for the meeting in setting the schedule.)
11. Test produced and ready for administration                          22

Appendix III-B

Meeting Arrangements Checklist

A. Preliminary arrangements
   1. Meeting room(s)
      a. Blackboard
      b. Tables (The room for Item Writers should have tables to permit pair or group conferences as well as space with tables for individual concentrated work free from distractions.)
      c. Clock
      d. Locked file
      e. Cloak room
   2. Hotel accommodations (if required for some)
   3. Transportation (if required for reaching meeting place)
   4. Coffee, if to be provided
   5. Luncheon reservations, if group will lunch together
   6. General supplies
      a. Pads, pencils, erasers, paper clips, stapler, transparent tape, ruler, ash trays (if smoking is permitted), black ink pens
      b. Additional supplies for Item Writers: half sheets, typewriter, drafting equipment as needed for sketching diagrams
   7. Handouts
   8. Reference materials (For Item Writers this includes: dictionary, thesaurus, Dictionary of Occupational Titles, references which give item writing suggestions and illustrative items, and texts in the content area of the test.)
   9. Confidential trash pick-up
   10. Copying facilities
B. Day of meeting
   1. Set up conference room
      a. Supplies
      b. Handouts
      c. Necessary reference materials
      d. Working materials
   2. Provide the receptionist and/or local telephone operator with a list of participant names and instructions on interruptions.
   3. See that the room is locked or materials are stored in locking files when conferees are out of the room.

Appendix III-C

Non-Content Factors Which May Affect Test Scores on Multiple-Choice Tests

I. Test
   A. Instructions
      1. Content and complexity
      2. Sample items
      3. Instructions on guessing
         a. working speed
         b. timing of parts
         c. comparison of item numbers on answer sheet and test
   B. Mechanical features
      1. Format clarity
         a. spacing
         b. size and style of print
         c. arrangement of options
         d. labelling of options
         e. form of answer sheet
      2. Item difficulty
         a. distribution
         b. level
         c. order of items relative to difficulty
      3. Speededness
         a. time limit vs. work limit
         b. short and long items under one time limit
      4. Method of scoring
         a. weighting
         b. guessing correction
         c. accuracy of scoring
   C. Item construction
      1. Item types
         a. choice of item type
         b. number of different item types
         c. arrangement in test
         d. coachability
      2. Framing of question
         a. readability
         b. length
         c. setting
         d. level of terminology
      3. Answer options
         a. number of choices
         b. variation in number of choices
         c. placement of correct choices
II. Examinee
   A. Mental set
   B. Personal background factors
   C. Educational background factors
      1. Reading ability
      2. Training
         a. recency
         b. amount
         c. specialized training
III. Conditions of Administration
   A. Test Administrator's instruction
      1. Accuracy of timing
      2. Accuracy in reading instructions
   B. Physical surroundings
      1. Seating arrangement
      2. Lighting, etc.
      3. Distractions
   C. Administration of other tests
   D. Order in which tests or parts are administered

Appendix IV

Procedural Guides

A. Suggested Source Material for Job Analysis
B. Item Writing Guide, Sample Items, and Examples of Item Modification
C. Item Types
D. Item Writing Rules for Multiple-Choice Tests of Job Knowledge
E. Suggestions for Developing Item Ideas
F. Instructions Regarding Item Format
G. Item Review Guide
H. Test Format and Sample Test Instructions
I. Procedures for Obtaining Pretest Statistics
J. Using Pretest Statistics
K. Test Review Guide and Test Review Sheet
L. Brief Guide on Norming and Equating Procedures

Appendix IV-A

Suggested Source Material for Job Analysis

There is no one job analysis method which is clearly superior to all others. In determining an adequate approach to follow, the MS should select one which has been used successfully by others for determining job requirements for selection purposes rather than for job classification or job evaluation purposes. Also, the method chosen should provide some type of listing of the important job requirements in terms of knowledges and skills (or at least provide a basis for inferring important knowledges and skills) for job knowledge test development purposes.

Of most value are procedures which use one or more of such methods as interviews, questionnaires, checklists, diaries, brainstorming techniques, and critical incidents. Of little value are such methods as the Position Analysis Questionnaire and the Occupational Analysis Inventory, both of which concentrate on underlying cognitive, affective, and physical abilities; however, these two methods are very useful in the development of other types of tests. Good sources of job analysis information can be found in standard textbooks in industrial psychology. An article by Prien and Ronan1 reviews many of these procedures. The job element job analysis procedure of Ernest Primoff also can be used for developing lists of important knowledges and skills. The following two manuals may also be of some use to those who have effectively tried them out.

U. S. Civil Service Commission. Job analysis: Developing and documenting data. Washington, D. C.: Author, 1973.
U. S. Civil Service Commission. Job analysis for improved job-related selection. Washington, D. C.: Author, 1975.

1 Prien, E. P., and Ronan, W. W. Job analysis: A review of research findings. Personnel Psychology, 1971, 24, 371-396.

Appendix IV-B

Item Writing Guide, Sample Items, and Examples of Item Modification

This Guide is intended to assist the Item Writer in working from a position requirement to a finished item ready for review. It is intended both for a general orientation and for reference use. Throughout the item writing process, one must keep in focus the purpose of the item as a measure of the candidate's ability to handle some aspect of the job, and avoid being diverted into testing knowledge for its own sake.

Five stages of item development will be considered. The form of the item at each stage is indicated in parentheses.

1. General statement of the knowledge or skill required to perform the job as described in the job analysis. (Position requirement)
2. Identification of the specific concept, principle, or skill required. (Item objective)
3. Preparation of a rough statement of the item, including the stem and as many options as the Item Writer can think of. (Item idea, or tentative item)
4. Item Writer's review of his tentative item.
5. Statement of the item in a form which the Writer is willing to have reviewed by others. (Provisional item)
Stage 1: Selection of a position requirement to be tested.

The position requirement should be derived from the job analysis with a specific task in mind. For example:

First-Aid Attendant
Requirements--Knowledge of conditions requiring first aid; procedures for treating conditions of shock.

Stage 2: Development of item objectives.

For example:

In the event of an accident, examine the patient to determine whether an emergency exists.
Sunstroke symptoms include a rapid full pulse and dry skin.
Symptoms of shock include weakness, paleness, and perspiration on the upper lip and forehead.
For a patient in shock, avoid overheating or chilling.

Although the Writer may understand his field well enough to write questions without frequent reference to a text, it will help assure coverage to go through two or three currently accepted texts to identify concepts. Those concepts covered by more than one text may be important enough to consider testing. This analysis will also help in developing the overall outline. The Writer should not, however, accept this agreement among texts as sufficient evidence of importance for job performance. He should think of his own experience and evaluate whether the skill or knowledge is of sufficient importance to warrant testing, i.e., could he have performed his job satisfactorily without it? Is it something he could have learned on the job as the occasion required it? This evaluation step is very important, as the Agency may later be called upon to justify having tested a particular knowledge or skill.

The Writer should recognize that testing is a sampling process, and should not become concerned if the list of item objectives is so long that not all of them can be covered in the number of items allotted. If the number of potential items is large, it is important not only that those concepts or skills selected be critical, but also that one area of skill or knowledge not be represented disproportionately in the final test. The Writer must avoid placing excessive weight on the areas of particular concern to him individually.

Stage 3: Devising a way of measuring the objective.

What might the individual do who does not understand the principle? For example, what steps may be taken by the individual who does not know the proper treatment of shock? (Don't worry about wording at this time; just get the idea down.) The following question might be written:

E1. In case of shock, which one of the following should be done for a patient?
(A) Give the patient a drink.
(B) Cover the patient warmly.
(C) Raise the patient's head above his feet.
(D) Place an ice pack on the patient's head.
(E) Cover the patient with a light blanket.

One should write down as much of the item idea as comes to mind, including the stem and any distracters which seem plausible, before going on to the next item; otherwise, they may be forgotten. If more than the needed number of answer options occur to the Writer, they should be written down. Later consideration may indicate that certain distracters are better than others.

The above item is a straightforward test of knowledge. The same knowledge is required in the following item, which also tests the candidate's ability to recognize the symptoms of shock:

E2. A traffic accident victim is weak and pale and shows perspiration on his upper lip and forehead. Which one of the following should the ambulance attendant do?
(A) Remove the victim's coat and shirt.
(B) Give the victim a drink of water.
(C) Cover the victim with a light blanket.
(D) Give the victim a pillow to raise his head.
(E) Place an ice pack on the victim's head.

Both of these items need rewording, but the present statement is sufficient for recording the idea.

The item situation should be realistic in order to make the item seem reasonable to the candidate, i.e., to give it "face validity." The length of the item, however, should not be appreciably increased to add "window dressing." Avoid asking for the information one usually has and giving the information one would normally seek. An example of such an undesirable inverted item would be: "If it takes 12.5 cubic feet of concrete to build a square loading pad 6 inches thick, what is the length of one side of the pad?" Avoid using terms in wrong options which imply the candidate needs to know something irrelevant to the job.

The following questions may help in converting an item objective to a test question:

1. What are common misconceptions about the principle?
2. Why is the principle important to satisfactory job performance?
3. In what sort of circumstances might it be important to understand the principle?
4. What might the individual do who does not understand the principle?
5. What might be the consequences of a lack of knowledge of the principle?
6. How can the individual demonstrate that he has the knowledge?

See also the list of suggestions for developing item ideas in Appendix IV-E. The list should be referred to only after the item objective has been stated and one knows the principle or process one wishes to test. At the item idea stage, the focus should be on relating the principle to the job, and the Writer should avoid being led, in the search for an item statement, to measurement of an abstract concept which is related to the concept being tested but which is not essential to job performance. Most questions will not start with the phrases in Appendix IV-E, but these phrases may provide an idea that will stimulate the Writer when he is trying to put an idea into item form, or they may suggest a new approach when the one which the Writer has been trying does not seem to be working.

In writing distracters, some Item Writers find it helpful to project themselves into the situation and ask themselves, "If I were taking a test which included this question as a completion item, and I didn't know the answer, how would I answer it? What sounds reasonable?" For example, in item E2 above, the uninformed individual, seeing perspiration, might consider the victim to be too hot and remove the coat; another might try to make the patient more comfortable by giving him a pillow.

Sometimes a Writer has only one or two plausible distracters. He may be tempted to turn to one of the following solutions:

1. Using "None of the above" as an alternative. This is all right if the other options are clearly right or wrong, but this is often not the case.
2. Putting in a "filler." Because it is obviously wrong or unrelated, this serves no useful purpose other than to give the requisite number of options.
3. Throwing the item out. If the point is an important one, an effort should be made to salvage the item.

Consider the following example, which has three logical choices as shown:

E3. A test analysis shows that all students answered correctly all items which they answered. The test is most probably
(A) a test of speed
(B) a test of power
(C) a test in which both power and speed are important.
In some cases where there are only two good distracters, a fourth option--"Cannot be determined from the information given"--can be used effectively. In item E3, however, this option might attract the better examinee, who could argue that one needs to know more about the test and the population before being certain that the examinees have really tried all items to which they responded. The response "None of these" is not a good solution, since the three options fairly well cover the possibilities, assuming the information is sufficient. It would be better in this case to reconsider what is being tested and make a fresh start. The item as it stands could be answered on the basis of rote learning. One might try measuring whether the candidate knows how to make use of the analysis information, as follows:

E4. Scores on a 100-item test of tool recognition ranged from 50 to 90; however, every student answered correctly most of the items to which he recorded an answer. Which one of the following is the most reasonable conclusion?
(A) Fewer, and some more difficult, items should be used.
(B) Many items are ambiguous and need revision.
(C) The test is too difficult.
(D) The time per item is probably adequate.

Stage 4: Review of the tentative item.

Before spending time working over the item, ask the following questions:

1. What specifically am I trying to measure? One should try verbalizing it as if he were telling someone else. If not obvious, the answer and associated logic can sometimes be clarified by writing down the objective and the approach.
2. Could I do the job effectively without being able to answer the question? If so, is it because the content is inappropriate or because the wording is not clear?
3. About what fraction of qualified workers on this job could answer it? What fraction of my acquaintances who are working in unrelated jobs could answer it? If these fractions are roughly the same, would those who can answer it be expected to do better on this job than those who cannot? (Ignore other factors entering into job performance, such as interest, motivation, special skills, etc. Look at those aspects of the job which involve the skill being measured.)
4. Is the item time-consuming? If so, it may be possible to reword the item or split it into more than one question.
5. Will the wording be clear to someone else reading it for the first time? Could it be stated more simply and still provide the necessary information?
6. Look at each option. Is the correct answer clearly correct? Is it based on current thinking? Do the options cover popular misconceptions? Is any option likely to be chosen by only a few candidates?
7. Would this item be defensible if printed in a newspaper?
8. If the item involves computation, could values be chosen which would take less computation time without changing what is being tested?

The Writer may wish to accumulate several items which have reached this stage and then reread the "Item Writing Rules for Job Knowledge Tests" in Appendix IV-D before attempting to polish the items. The procedure of accumulating saves time in going over the rules, and it also gives the Writer a fresher look at his items, thus increasing the likelihood of spotting possible weaknesses.

Stage 5: Improving the item.

The item should now be put in a form in which it can be reviewed by others. The following discussion considers certain characteristics of "good" items and techniques for improving weak items.

Simplifying wording:

1.
Will the candidate know clearly what he is expected to do? Does he have all the information he needs to work with? Does answering the item depend on certain assumptions which must be stated? For example, in E2, option B may be misleading. Since small amounts of water may be given a conscious patient in shock, B might be considered correct. (Option C is the intended key.) Change "drink" to "glass" to suggest a larger amount, or, if only four choices will be used, this option is a candidate for deletion. Option D might be stated better as "Place a pillow under the patient's head."

2. Does the item read smoothly? For example, even though E3 is understandable as stated, the first sentence of the stem reads more smoothly as worded in E4.

3. Is the terminology unnecessarily difficult, i.e., could a candidate who understands the concept being measured fail the item because he doesn't understand the words which are used? This is especially important if candidates have a native tongue other than English. One should not avoid terms which the candidate must understand to perform the job, but if the terminology itself is not critical, one should look for more generally understood words and phrasing. For example, instead of asking "What are the detrimental effects of ___?" ask "What are the harmful effects of ___?" Where terminology is important, one may wish to consider testing for knowledge of terminology separately from knowledge of concepts.

4. Is the wording as concise as is consistent with a clear statement of the question or problem? If the statement of the stem runs more than three typed lines, could it be shortened? Since the accuracy of measurement depends to a large extent on the number of items in the test, time should not be wasted in unnecessary reading. Consider it a challenge to state the stem in one line less, as if space were limited. Can you do this and make the question equally clear? Such a focus on revision may help to make the wording even clearer than it was initially. One should not, of course, shorten the wording if doing so leaves out essential information or makes the question more confusing. It may even be necessary to add wording to provide all the information needed.

5. May the candidate overlook in hasty reading certain words which are essential to understanding the task? For example, by the time the examinee has finished reading and comparing the options, he may forget he was supposed to look for the "least important reason" and choose the "most important." The Writer should capitalize such words as NOT, LEAST, and other negatives. If the examinee may work a numerical problem in a unit of measure different from the one given, underline or capitalize the unit of measure asked for.

6. Put as much of the question as possible in the stem. For example, compare the following two items for ease of reading:

E5. The standard deviation is a more sensitive measure of variability than the range because
(A) the standard deviation shows how scores relate to the mean
(B) the standard deviation is based on all the scores
(C) the standard deviation gives more weight to extreme scores
(D) the standard deviation is more complex mathematically.

E6. The standard deviation is a more sensitive measure of score variability than the range because the standard deviation
(A) shows how the scores relate to the mean
(B) is based on all the scores
(C) gives more weight to extreme scores
(D) is more complex mathematically.

Phrasing the correct option:

The intended answer must be stated clearly and concisely.
It need not be a complete sentence, but it must follow grammatically from the stem. It must be clearly the best answer, but make certain it cannot be identified as correct without reading the stem. Item ideas often occur as the result of thinking of a common misconception. In later work on the idea, it may become apparent that the correct answer is obviously correct, and the author may be tempted to make it a little less obvious by disguising it. For example, the author may think of weaknesses in item writing and decide to write an item to measure understanding of these. He writes the following question:

E7. A good multiple-choice item is characterized by which one of the following?
(A) uses double negatives
(B) tests knowledge of the views of an authority
(C) tests an important point
(D) asks the examinee his opinion

Unfortunately, the correct answer is the only one which sounds reasonable. The author might consider using a less obvious correct answer such as "short answer options," but this weakens the item by reducing the importance of the principle being tested, and it is still fairly obvious. Perhaps a more satisfactory solution would be to make the correct answer a weakness in item writing. Such an alternative might be phrased as follows:

E8. The stem of an item reads "Which one of the following is the most important role of education in the United States today?" What is the primary weakness in this question?
(A) The geographical area is not specific enough.
(B) The term "today" may be misleading.
(C) The term "most" is too strong.
(D) The correct answer is a matter of opinion.

Item testing breadth:

E9. The purpose of defining the abilities required by the job is to increase the likelihood that the test will
(A) distinguish between high and low capability candidates
(B) appear reasonable to the examinee
(C) have more easy items
(D) have less ambiguous items.

Item testing depth:

E10. The purpose of defining the abilities required by the job is to increase the likelihood that the test will
(A) distinguish between high and low capability candidates
(B) cover all the duties actually performed on the job
(C) identify individuals with the most knowledge of the subject-matter
(D) appear to the examinee to be related to the job.

Each distracter should be plausible to someone who does not understand the principle being tested, though care must be taken that the distracters do not become plausible enough to mislead the better candidates. All the distracters should constitute possible answers to a direct question implied or stated in the stem. As with the correct option, they must be related to the stem grammatically and in content. A distracter must not be identifiable as incorrect without reading the stem.

Clues to the correctness or incorrectness of an option should not be provided. Such clues, called specific determiners, evolve from various weaknesses in the options:

1. An option which does not follow grammatically from the stem may indicate an incorrect response.
2. An option which is longer than the others may suggest the correct answer. The extra length is often due to stating the assumptions that are essential to make the option correct. (Such qualifications, if necessary, can sometimes be incorporated in the stem.)
3. Two options which are synonymous automatically disclose that both are incorrect.
4. Two options which are mutually exclusive may give a clue to one being correct.
5. An option which encompasses another option may enable the examinee to eliminate one of these.
6. An option which uses closely similar terminology to that in the stem may be attractive to one searching for the correct answer.
7. A non-parallel option may give a clue to the correct answer.

In writing items using "None of the above" as an option, each statement must be clearly correct or incorrect. If one is simply better than the others, the examinee has no basis for judging how correct the correct answer must be to meet the examiner's intentions. This is one reason for avoiding true-false items in employment tests, since it is difficult to phrase many important concepts as statements which are clearly correct or clearly incorrect and which do not depend on unstated conditions or assumptions.

"All of the above" must not be used as an option. Although this may be justifiable in a course examination where students have been trained by the teacher to expect such a choice and be held responsible for it, a job applicant meeting this kind of item for the first time may stop after finding a correct answer and not discover there is more than one. He is in a hurry, and unless he sees the need to read all options, he may not do so.

Appendix IV-D, "Item Writing Rules for Multiple-Choice Tests of Job Knowledge," contains a longer list of suggested do's and don'ts.

Decreasing the length of long options:

The following item requires more reading time than necessary:

E11. If the discriminating power of an item is less than .35,
(A) one should discard it since it is too low to be discriminating
(B) one should consider leaving it as it is, since it may be a good item which is not like other items
(C) change any options which less than 10% of the examinees select
(D) try it out on another population sample.

This item is testing more than one principle. The first of the principles is that an item's discriminating power (correlation with total test score) depends both on its inherent capacity to discriminate between high ability and low ability examinees and on its correlation with (similarity to) other items in the test. The following item defines the problem further in the stem and leads to shorter options:

E12. For which one of the following reasons might a test developer NOT discard an item that has an item-total correlation of .30 and no detectable flaws?
(A) An item-total (biserial) correlation of .30 indicates a very discriminating item.
(B) The content is dissimilar to that of other items.
(C) It is a very easy item.
(D) It is a very difficult item.

The similarity of C and D may lead the examinee to assume that one of these is correct. Since neither is correct, the rule of avoiding similarly stated options can probably be safely ignored.

The second principle being measured in E11 (by option C) is that a distracter may be effective even though not highly popular. It might be better to focus more precisely on the significance of the response pattern, i.e., the numbers in the high and low scoring groups who chose each option.

E13. An item analysis shows the number of high scoring and low scoring candidates who chose each answer option. Option 2 is the correct answer.

              Option:    1     2*    3     4
    High 27% group:      2    54    18    26
    Low 27% group:      12    18    10    60

Of the following, which one would be the best option to replace?
(A) option 1, because fewer than 10% chose it
(B) option 3, because more in the top group than in the bottom group chose it
(C) option 4, because so many of the top group chose it
(D) option 4, because more chose it than chose the correct option
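The kind of response-pattern reasoning required by E13 can also be checked mechanically once pretest tallies are in hand. The short Python sketch below is offered as an illustration only; it is not part of the Kit, and the variable names and 10-percent threshold are simply taken from the discussion above. It tallies the E13 counts, flags any wrong option chosen by a larger share of high scorers than of low scorers, and notes options chosen by fewer than 10 percent of examinees.

    # Illustrative sketch (not part of the Kit): screening the E13 response pattern.
    # Counts are those shown for item E13; the high and low groups are assumed equal in size.
    high = {"1": 2, "2": 54, "3": 18, "4": 26}   # choices made by the high scoring group
    low = {"1": 12, "2": 18, "3": 10, "4": 60}   # choices made by the low scoring group
    key = "2"                                    # option 2 is the correct answer

    n_high = sum(high.values())
    n_low = sum(low.values())

    for option in sorted(high):
        p_high = high[option] / n_high
        p_low = low[option] / n_low
        p_total = (high[option] + low[option]) / (n_high + n_low)
        if option != key and p_high > p_low:
            print(f"Option {option}: chosen by more high than low scorers -- "
                  "look for a defensible second answer or revise the option.")
        if p_total < 0.10:
            print(f"Option {option}: chosen by fewer than 10% overall -- "
                  "a weak distracter, though it may still be serviceable.")

Run on the E13 counts, the sketch flags option 3 (more high than low scorers chose it) and option 1 (chosen by fewer than 10 percent), which parallels the reasoning behind the keyed answer, (B).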
Obtaining more items per given amount of time:

Perhaps the most obvious way of increasing the number of items for a given amount of time is to write items which require less time to answer. If the previous recommendations on wording are followed, there may be little more to be accomplished along this line. Use of simpler numbers has been suggested, as well as breaking a long item down into two or more separate items. There are, however, techniques of item format which facilitate covering more ground in a limited amount of time. Some of these are described here.

A common set of options saves reading time and can be useful where this approach does not result in testing one area of content too heavily. This is also useful when a set of diagrams involves two or more important concepts and the diagrams take some study to comprehend thoroughly. For example:

Use the options at the right in answering items E14 through E16. Which one of the score systems shown at the right relates a person's performance to

E14. a given level of attainment?                        (A) standard score
E15. the mean in terms of deviation units?               (B) percentile score
E16. the proportion of the norm population which         (C) criterion referenced score
     achieves scores below his?                          (D) raw score

The following graphs are to be used in answering items E17 through E20. A given answer option may be used for more than one question.

[Four scatter-diagrams, labeled A through D, each plotting scores on Y against scores on X; illustration not reproduced.]

E17. Which one of the scatter-diagrams most indicates a lack of relationship between the two variables?
(A) A  (B) B  (C) C  (D) D

E18. Which one of the scatter-diagrams indicates a perfect relationship?
(A) A  (B) B  (C) C  (D) D

E19. Which kind of scatter-diagram would one expect if only those examinees who scored above a certain point on X were tested on Y?
(A) A  (B) B  (C) C  (D) D

E20. Which scatter-diagram shows that a high score on X is associated with a low score on Y and vice versa?
(A) A  (B) B  (C) C  (D) D

This format is similar to a matching item. Ordinarily in matching items, one list may be shorter than the other, but each element is usually used only once. With the common set of options, it is not unusual to have the same option correct for more than one item. If this is the case, make sure the examinee understands that some options may be used more than once and others not at all.

More items per time period can also be obtained by using the same set of data, a complex diagram, or a paragraph for two or more items. As with other items in the test, items based on a common set of data must be independent of one another. The answer to one item must not depend on answering another item correctly.

Sometimes different item types can be useful in getting more information about the candidates' knowledge in a given amount of time.
The following type may be useful where one wants a quick check on quantitative types of knowledge:

Mark each of the following statements in items E21 through E25 according to the following answer options:

Mark Option A if the value in column A is greater
     Option B if the value in column B is greater
     Option C if the values in columns A and B are equal
     Option D if the relative size of the values in the two columns cannot be determined from the given information

            A                            B
E21.  √.70                         .70
E22.  [entries not legible in source]
E23.  mean                         standard deviation
E24.  3 cm                         1 inch
E25.  boiling point of water       boiling point of cooking oil

Making the item less difficult:

The following item illustrates two points. First, it shows a sample multiple true-false item for which examinees may consider more than one option, or none of the options, to be correct. If this type is used, it should probably be used only with candidates for higher level jobs. Good college students have sometimes objected to its use, even though only one of the lettered answer options is correct. As with the true-false item type, one must be certain that each option is clearly right or wrong. Secondly, it illustrates an item which requires excessive time to read and understand for one point of credit.

E26. On the basis of the information given, which of the three conclusions given below can be drawn regarding the following set of raw test scores on a 40-item test of arithmetic computation?

    Jones  20    I. Smith has twice as much computational ability as Jones has.
    Smith  40    II. Smith should be able to handle any job requiring arithmetic computation.
    Brown  35    III. Brown is nearer to Smith than to Jones in arithmetic ability.

(A) None
(B) III only
(C) II and III only
(D) I, II, and III

This item tests at least three concepts: 1) that a score twice as high does not mean twice as much ability; 2) that one needs to compare test content with job content before knowing whether the test covers the job requirements; and 3) that raw scores cannot be interpreted directly on a scale of ability.

Try breaking long items such as this into two or more less complex items. For example:

E27. Three candidates for a job which requires a high level of computational accuracy make scores of 20, 35, and 40 on an arithmetic test. Which one of the following items will be LEAST useful to the employment interviewer in evaluating the qualifications of the candidates?
(A) a copy of the test
(B) a review of the test by the job supervisor
(C) the standard error of measurement
(D) national norms on the test for high school graduates

E28. With which one of the following score systems does a score of 60 generally reflect twice as much ability as a score of 30?
(A) raw score
(B) raw score corrected for guessing
(C) standard score scale with M = 45 and SD = 10
(D) none of these scales

Parts I and II of The Construction of Test Questions (see Bibliography for reference), obtainable from the U. S. Civil Service Commission, include suggestions on item writing and are the sources for some of the ideas included here.

Answer Key to Items in Item Writing Guide

E1. E    E8. D     E15. A    E22. C
E2. C    E9. A     E16. B    E23. D
E3. A    E10. A    E17. D    E24. A
E4. A    E11. B    E18. C    E25. B
E5. B    E12. B    E19. B    E26. A
E6. B    E13. B    E20. C    E27. D
E7. C    E14. C    E21. A    E28. D

Appendix IV-C

Item Types1

Although multiple-choice items are almost always preferred, there are occasions when other item types can be used.
This appendix briefly presents some samples and suggestions for the item writer to follow in writing these various other common types, i.e., true-false, matching, completion, and essay.

Item writing is a creative process. Often considerable ingenuity is needed to write items that require the candidate to go through the required thought processes and to demonstrate that he possesses the necessary depth and scope of knowledge or skill required in the subject. Just as there are no set formulas for producing a good story or a good painting, so there are no set rules that will guarantee the production of good test items. The guidelines presented here can be summarized by stating that the item should be expressed as clearly and simply as possible. Quite often an item is too difficult and discriminates poorly because the request for information was not stated clearly.

1 This material is excerpted from: Kraft, J. D. Instructor's guide to the construction of writs and examinations. West Point: U. S. Military Academy, 1968. This publication includes examples of questions from two other publications: U. S. Navy, Constructing and using achievement tests. NAVPERS 16808-A. Washington, D. C.: Author, 1949; and U. S. Army, Techniques of military instruction. FM 21-6. Washington, D. C.: Author, 1967, Chap. 13.

1. True-false items. The true-false test item consists of a simple statement which candidates must identify as true or false. However, because of very serious problems with it, this item type is not recommended for use in employment tests.

a. Characteristics.

(1) Positive points. The true-false item can sample wide ranges of material and is easily and objectively scored. It can be written as a factual question or as a question that requires reasoning.

(2) Negative points. The pitfalls and shortcomings of this type of item warrant careful consideration before it is used very extensively. Since there are only two alternative answers, it encourages guessing. Half of the questions might be answered correctly without any knowledge of the subject. In addition, it is very difficult to make a statement which is absolutely true or absolutely false without giving some hint as to the correct answer. Finally, a relatively large number of such questions must be used in an examination if there is any expectation that the test will separate the more knowledgeable candidates from those less knowledgeable. Contrary to appearances, it is the most difficult type of item to write.

b. Principles of construction.

(1) One-half of the items should be true and one-half false.
(2) The true statements should not be consistently longer than the false statements, or vice versa.
(3) The application of knowledge should be required in as many of the items as possible.
(4) The crucial element should come near the end of the statement.
(5) One part of the item should not contain a true idea and another part a false idea. Make it all true or all false.
(6) Double negatives or involved statements should be avoided.
    Poor (confusing): One should not work on any electrical or radio equipment if he is not sure that no circuits are energized.
    Better: Before working on any electrical or radio equipment, one should be certain that all circuits are deenergized.
(7) The words "all," "only," "never," and "always" in the statement should be avoided. These words usually indicate that the item is false.
    Poor (correct answer given away by use of conditional words):
    All airplane propellers are made of aluminum.
    Never run an airplane in a hangar.
    Only casein glue may be used in joining wooden parts of a model.
(8) The words "generally" and "usually" in the statement should be avoided. These words usually indicate that the item is true.
    Poor (correct answer given away by use of conditional words):
    Sea plane floats are generally equipped with rudders.
    In general, jet planes are faster than prop planes.

2. Matching items. Matching items generally include two lists of related words, phrases, or symbols. The candidates are required to match each item in one list with some one item in the second list with which it is most closely related.

a. Characteristics.

(1) Positive points. The matching item is especially valuable for testing knowledge of relationships and making associations. The matching exercise may require candidates to match (1) terms or words with their definitions, (2) short questions with their answers, (3) symbols with their proper names, (4) descriptive phrases with other phrases, (5) causes with their effects, and (6) principles with situations in which the principles apply. A large number of responses can be obtained in a small space and with one set of directions. It can be totally objective and is easy to grade.

(2) Negative points. Overlapping items are difficult to control. Also, the process of elimination can give clues to the correct answer. Finally, it is difficult to use with a standard answer sheet.

b. Principles of construction.

(1) The directions should be specific, indicating on exactly what basis the matching should be done. Include these directions with each matching exercise.
(2) Generally, candidates should be required to make at least five and not more than twelve responses in completing each matching exercise. (If a standard answer sheet will be used, usually only four or five responses are allowed.)
(3) At least three extra items should be included from which responses must be chosen, or responses should be allowed to be used more than once. This tends to reduce the possibility of guessing or answering by a process of elimination.
(4) Only related materials should be included in any one exercise. Nothing should be listed in either column that is not a part of the subject in question.
(5) The column containing the longer phrases or clauses should be placed on the left-hand side of the page. If answer sheets are not used, candidates should record their responses at the left of this column. This makes the selection easier.
(6) At least three plausible responses should be included from which each correct response must be selected. If, in order to do this, it is necessary to include three times as many items in one column as in the other, another type of test item should be used.
(7) In setting up the test, all of a given matching exercise should appear on one page or on facing pages.

3. Completion items. The simple completion item requires the candidate to recall and supply one or more key words that have been omitted from statements. These words, when placed in appropriate blanks, make the statement complete, meaningful, or true. The statements may be isolated and more or less unrelated, or they may be combined to form short paragraphs that carry a continuous line of thought.

a. Characteristics.

(1) Positive points. This item form can be used to test a candidate's knowledge of specific facts, and it demands accurate information. It can be used effectively to sample a wide range of subject matter.
This item form is particularly suitable for testing memory for material which must be recalled in a precise way, such as technical terms that must be known, abbreviations, weights, measures, and tolerances.

(2) Negative points. This item form is difficult to score. It is very difficult to develop a scoring key which includes all acceptable responses.

b. Principles of construction.

(1) The item should be so written that only a limited number of responses will be acceptable. This is very difficult to insure.
    Poor (many answers possible): One should never smoke in a ship's ________.
    Better (only one answer): The wrench used to measure the amount of twisting force being applied to a nut is called a ________ wrench.
(2) It is generally best to use only one or two blanks in a single sentence.
    Poor (for mind readers): A ________ is a device for converting ________ energy into ________ energy.
    Better: A generator is a device for converting mechanical energy into ________ energy.
(3) If possible, the blanks should be kept near the end rather than the beginning of the sentence.
    Poor (must read twice): A ________ is used to prevent an overload in an electric light circuit.
    Better: A device to prevent an overload in an electric light circuit is a ________.
(4) The question should be so arranged that the answer may be written in a column at the right. This will increase scoring speed and accuracy.
(5) Except in rare cases, verbs should not be omitted.
(6) Statements should not be copied directly from textbooks to make a completion item.
(7) The statement should be complete enough that there can be no doubt as to its meaning. Avoid being too brief.
(8) Only those key words which the candidate should know should be omitted. Do not ask for the recall of trivial details.

c. Related item forms. These forms are similar to the completion and short essay.

(1) Short-answer items. The simplest recall item is the question or statement that demands a short answer to be written in given spaces.
(2) Listing items. These are very similar to the above and are frequently considered essay items. The candidate is asked to list or outline the parts of a problem or subject.
(3) Problem-solving items. Data are presented to the candidate, and he is asked to manipulate the data to obtain an answer.
(4) Identification items. The candidate is asked to label parts of a diagram, identify a group of formulas, or write the titles of books after authors' names.

4. Essay items. The essay item calls on a candidate to describe, compare, discuss, or explain some aspect of the subject he is studying. It is often used when the number of candidates is below 15.

a. Characteristics.

(1) Positive points. The essay item can be used effectively to measure a candidate's ability to organize and express thoughts. It is very useful for selecting persons for some high level jobs where the ability to organize and present facts is extremely important, such as for hearing examiners. In comparison with objective-type items, good essay items are not easier to write, because the Item Writer must consider in advance exactly what will be considered acceptable for an answer.

(2) Negative points. Its greatest disadvantage is that the subjectivity involved in grading significantly reduces the accuracy and consistency of the scores assigned by the rater.
Its scoring may become subject to the rater's interest and range of knowledge and other similar factors. Handwriting, style, grammar, and other commonly extraneous factors significantly influence the grade assigned. Responding to an essay item requires much candidate time; this greatly restricts the sample of test material which may be asked. Unfortunately, essay items allow the candidate, rather than the Item Writer, to sample and treat in depth that part of the subject he knows best. Scoring the items requires much more time than is required for other item types. On the one hand, the essay item provides candidates an opportunity to bluff; on the other hand, candidates who know the subject matter well but are not skilled in writing may be penalized on an essay examination.

b. Principles of construction.

(1) Specific answers should be requested. The question should be worded in such a manner that it provides the candidate with an outline that he can use in formulating his response (unless the question is designed specifically to measure this ability to outline and expand his subject).
(2) The item should be stated in a simple, direct manner.
(3) The candidates should be told the basis for grading at the time they are given the test.
(4) The essay item should be designed to require candidates to organize, outline, compare, explain why, give a reason, describe, or explain how.

c. Principles of scoring.

(1) The Item Writer should write out, before the test is given, the answer or answers he will accept as sufficient and correct. He should include every point that is to be accepted. He should use this as a standard in grading the test papers.
(2) He should not grade an entire test paper at one time. Rather, he should score one essay item on all the test papers before he proceeds to the next essay item.
(3) The Item Writer generally should allow one unit of credit for each point covered in the answer. However, if a particular point or principle is considerably more important than others, it should be given more than one unit of credit. The grader then merely totals the points earned.
(4) Code numbers should be used instead of names on the candidates' papers where the candidates might be known by the grader.

Appendix IV-D

Item Writing Rules for Multiple-Choice Tests of Job Knowledge

A. Concept measured

1. The concept measured must be important to the ability to perform the job for which the test is being used.
2. In general, limit the item to one concept (or one specific type of error) unless synthesis of concepts is being tested.
3. Avoid items which require reference to an authority to determine the correct answer.
4. When more than one item is based on the same set of information, do not make a correct answer to one item dependent upon answering another item correctly.

B. Stem

1. Define the question or task in the stem.
2. Put as much of the question or statement as possible in the stem.
3. State the stem as concisely as possible, but provide the necessary information.
4. State the question at the level of the candidate.
5. Avoid terminology which is not important to knowledge of the concept or to the job.
6. For problems involving computation, keep the values as simple as possible while still testing the desired skill.
7. Avoid asking the candidate his opinion.
8. Avoid imprecise words.
9. Avoid double negatives.
10. Where a single negative is needed to measure the concept, underline it or put it in capitals: LEAST, NEVER, NOT, etc.
11. Where items require the solution to a problem, check to determine that the right answer cannot be obtained by a common wrong method. For example, 2 squared = ? (2 + 2 is also equal to 4.)
12. Specify in the stem the unit of measure in which answer options will be given. Ordinarily, the unit of measure need not be repeated in the options.
13. The item should have face validity, i.e., it should seem a reasonable question to ask.

C. Answer options

1. There must be one and only one correct answer or best answer.
2. The correct option should generally be at the same level of difficulty as the distracters.
3. Avoid unnecessary wordiness.
4. Do not use options which are based on careless reading, unless reading is being tested (e.g., wrong unit of measure, "most" rather than "least").
5. In true-false, multiple true-false, or none-of-these type items, make each option statement clearly right or wrong. "Best answer" cannot be used with these item types.
6. Do not use "None of these" with numerical answers where the answer can appear reasonably in more than one form (e.g., 44/7 and 6.28).
7. Vary the correct answer position; avoid a pattern. (A simple tally of keyed positions, sketched following these rules, is one way to check.)
8. If options, such as numerical options, have a logical sequence, put them in logical order. (If this conflicts with the preceding rule, change the order where the logical sequence is not obvious, but avoid making illogical order a clue.)
9. Do not put in the options information which could reasonably be put in the stem.
10. Avoid options which appear unrelated to the job.
11. Avoid "specific determiners," which give a clue to the correct answer for a person who does not understand the correct answer. For example:
    a. a distracter which does not follow grammatically from the stem;
    b. an option which can be judged correct or incorrect without reading the stem;
    c. synonymous options (which rule out both options for a person who recognizes the equivalence);
    d. an option which includes another option (for example, less than 5, less than 3; in China, in Peking; all of the above);
    e. implausible distracters;
    f. similar terminology in two options if one is correct;
    g. similar terminology in the correct option and the stem;
    h. nonparallel options;
    i. mutually exclusive answer options, where one is correct;
    j. a correct answer which is longer than the distracters;
    k. qualifiers in the correct answer (probably, ordinarily, etc.), unless also used in the distracters;
    l. words such as "never" or "always," which suggest a wrong option;
    m. a correct option that differs from the distracters in favorableness, style, or terminology.

D. Format

1. Put different options on the same line only when all can be put on one line. Otherwise, list the options vertically.
2. If an answer sheet with numbered spaces must be used with numerical answers, put parentheses around the option number and a space between the option number and the option. (For example, (1) 64, not (1)64; certainly not 1. 64.)
3. If the options include the whole numbers 1, 2, 3, 4, or 5, and answer sheet spaces are numbered, put such a whole number with the correspondingly numbered space. (Ignore Rule C8 above in this case.)
4. Single space the stem and the options, except where fractions or symbols require more space. Use 1-1/2 or double spacing between the stem and the options, and between the options, where space permits.
5. Use vertical fractions. (For example, write 23 over 2k as a built-up fraction, not 23/2k.)
6. Put all of an item on the same page.
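The option-level rules above lend themselves to rough mechanical spot checks once a draft item pool has been typed. The following Python sketch is offered as an illustration only; it is not part of the Kit, and the sample items, option text, and length threshold are hypothetical. It tallies keyed answer positions (rule C7) and flags a keyed option that is conspicuously longer than its distracters (rule C11, point j).

    # Illustrative sketch (not part of the Kit): two mechanical checks on a draft item pool.
    # The items below are invented placeholders; only the structure matters.
    from collections import Counter

    items = [
        # (list of option texts, index of the keyed option)
        (["raise the load", "lower the load", "lock the brake", "sound the horn"], 2),
        (["10 ft", "15 ft", "20 ft", "25 ft"], 2),
        (["at the start of each shift, after the equipment has warmed up",
          "weekly", "monthly", "annually"], 0),
    ]

    # Rule C7: a lopsided count of keyed positions suggests a pattern.
    key_positions = Counter(key for _, key in items)
    print("Keyed positions used:", dict(key_positions))

    # Rule C11(j): flag a keyed option that is much longer than every distracter.
    for number, (options, key) in enumerate(items, start=1):
        longest_distracter = max(len(opt) for i, opt in enumerate(options) if i != key)
        if len(options[key]) > 1.5 * longest_distracter:
            print(f"Item {number}: keyed option is much longer than its distracters.")

Checks of this kind supplement, but do not replace, the judgmental review covered by the Item Review Guide (Appendix IV-G).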
Appendix IV-E

Suggestions for Developing Item Ideas1

The following question leads may suggest an approach when one encounters difficulty in developing an item idea:

What conclusions should be drawn from ___?
Which of the following describes the concept of ___?
What is the effect of ___?
What should be done?
When should the worker ___?
What should the worker do when ___?
What constitutes an error?
When ___ happens, what (else) will happen?
What is likely to be the result of ___?
What recognized principle is violated?
Which is best for the purpose of ___?
What is the important difference between ___ and ___?
How are ___ and ___ alike?
What is the cause of ___?
Under what condition(s) does (is, are) ___?
What purpose is served by ___?
Which comes first in order to ___?
Which comes last in order to ___?
Which follows ___?
Which step has been left out?
Those who agree on ___ (theory) support it because ___.
Which alternative doesn't belong among ___ according to the principle of ___?

1 These suggestions came from USCSC item writing materials.

Appendix IV-F

Instructions Regarding Item Format

Newly Written Items

Use 5 x 8 cards, or cut sheets of 8-1/2 x 11 heavy paper in half.

Front of card or half-sheet:

1. Using black ink, start about 1/4 inch from the top and 1 inch from the left margin and print the item. Keep it compact, but make it legible. (These handwritten copies may be used for reproducing, and small print will facilitate reproducing more items per page.)
2. Letter the answer options (A), (B), (C), (D), and (E) beneath the stem. Unless all options can be put on the same line, put the options in a vertical list with the option letters in line with the left-hand margin of the stem. (Options are sometimes numbered, but letters are preferable when answers are numerical.)
3. If several items are based on the same paragraph or set of data, write the common information on one card and each of the items on a separate card. Clip the items together.
4. When a drawing accompanies an item, refer to it in the stem of the item. For example, "From the data shown in Figure ___."
5. The unit of measurement in which the answer is provided should be given in the stem. For example, "What is the distance, in feet?"
6. If negatives are used, they should be capitalized or underlined, for example, NEVER, NONE, EXCEPT, etc. Similarly, any important word or phrase which could be overlooked in reading should be capitalized or underlined: "What is the LEAST ___?" If a problem may be solved in inches and the answer is given in feet, capitalize or underline the unit of measurement.

Back of card or half-sheet:

1. Put the answer key in the upper left corner.
2. Put the concept tested in the upper center; also include appropriate outline identification.
3. Record the Writer's last name (first initial, if necessary) in the upper right corner.
4. If a passage is taken from a copyrighted publication, record the source. A credit by-line should be provided at the end of the passage. If the passage is original with the Item Writer, no reference need be made. If it is unrecognizable as a modification of a published passage, record the reference under the author's name on the back and let someone else double-check recognizability.
5. In items which require solution of a problem, give the basis for any incorrect answer option which is not self-evident.
6. Try to keep the information for steps 1 to 5 above compact, to allow room for reviewers' comments below.
In reviewing an item authored by someone else, the reviewer records his initials at the left margin, his comments in the center section, and the approximate time for answering at the right. The codes shown on the Item Review Guide, Appendix IV-G, may be used, where appropriate, for comments and time.

Pretested Items

1. Use the same procedures for the front of the file card as are given for newly written items, except that the item should be typed or cut from the test copy and pasted.
2. Follow the instructions for the back of the card as given above for steps 1, 2, and 4. Steps 3 and 5 are optional.
3. Record the item analysis data on the back of the card: N = number in sample; Nt = number who answered this or a later item; p = number correct/Nt; O = number who omitted this item but answered a later item; NR = number who omitted this and all subsequent items.

New Item

Front of item card or half-sheet: [facsimile of a handwritten sample item, not legible in this copy]

... use an option which is clearly wrong to the person who understands the principle, in order to avoid attracting to this option those with moderate to high scores. If the index of discriminating power is high, it may be better to leave the item in its present form. It is possible that only three plausible answers can be written and that those choosing this particular option were doing so on the basis of random guessing.

2. A wrong option which attracts more high scorers than low scorers. Look for the possibility that this option is defensible as a correct answer. Reread the item as if you had chosen this answer and were trying to defend it. Delete the option or revise it to make it more clearly unacceptable. If few low scorers chose it, it should be replaced, since it is contributing negatively to prediction.

3. A correct option which attracts more low scorers than high scorers. Look for the possibility that the correct answer can be obtained by a wrong method of reasoning. Change the values or wording to eliminate this possibility.

4. A correct answer which attracts fewer high scorers than another option. If the biserial is relatively low, look for two defensible answers. If the biserial is relatively high, it may be that this is a very difficult, but good, item. Unless the correct option is very unattractive to the low scorers, or an ambiguity is discovered which can be corrected, the item should be dropped or revised, since chance is likely to be playing more of a role than understanding in gaining credit for this item.

5. A high number of omits among high scorers (or the mean of the omits is high). This may be due to limited time, to a difficult or complex item which the examinee has decided to leave for the moment, or to ambiguities which leave the examinee uncertain.

It may help to review both the Item Writing Guide and the Item Writing Rules before attempting to revise items.

B. Evaluating Speed Characteristics

If there are many omits, make a plot of the number of items answered correctly against the total number of items for which an answer was recorded (i.e., R vs. R + W); examples of such plots are shown below.
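How these quantities might be tallied from the pretest answer records can be sketched briefly. The sketch below is illustrative only and is not part of the Kit's procedures: it assumes each examinee's record is a list with one entry per item, coded "R" (right), "W" (wrong), or None (omitted), and the function names are arbitrary. It covers both the card statistics N, Nt, p, O, and NR and each examinee's R and R + W values used in the plots.

```python
def item_statistics(records, item_index):
    """N, Nt, p, O, and NR for one item, as defined on the back of the pretested item card."""
    n = len(records)
    nt = o = nr = right = 0
    for rec in records:
        answered_here = rec[item_index] is not None
        answered_later = any(a is not None for a in rec[item_index + 1:])
        if answered_here or answered_later:
            nt += 1
            if rec[item_index] == "R":
                right += 1
            elif not answered_here:
                o += 1          # omitted this item but answered a later one
        else:
            nr += 1             # omitted this item and all subsequent items
    p = right / nt if nt else None
    return {"N": n, "Nt": nt, "p": p, "O": o, "NR": nr}

def speed_points(records):
    """(R, R + W) for each examinee, for the plot of rights against items attempted."""
    points = []
    for rec in records:
        r = sum(1 for a in rec if a == "R")
        attempted = sum(1 for a in rec if a is not None)
        points.append((r, attempted))
    return points

# Three short illustrative records for a four-item pretest:
records = [
    ["R", "W", "R", "R"],      # answered every item
    ["R", None, "R", None],    # omitted item 2, did not reach item 4
    ["W", "R", None, None],    # stopped after item 2
]
print(item_statistics(records, 1))   # N=3, Nt=3, p=0.33, O=1, NR=0
print(speed_points(records))         # [(3, 4), (2, 2), (1, 2)]
```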
[Three illustrative scatter plots of R against R + W (axes marked to 50), labeled 1, 2, and 3, appear here in the original.]

The first diagram represents the kind of plot one would find if nearly everyone recorded an answer to every item; the scores are distributed along the right-hand margin. This is considered a power test. The third diagram represents a situation in which each examinee tends to answer correctly all items he attempts. This is the kind of distribution one expects to find for a test of speed in doing something which contains very easy items. The middle diagram shows a test which measures both speed and power. For a test where speed is not critical, the Measurement Specialist might aim for a distribution between 1 and 2, since many will omit those items they are unable to answer.¹

¹ A more extensive discussion of this plot and its use is given in Swineford, 1974, pp. 9-12. (This manual was prepared for internal use, but the discussion is generally applicable.)

As a further check on whether sufficient time has been allowed, check the number who did not reach items toward the end of the test (NR). If everyone recorded an answer to the last item, the test is generally considered to be not speeded. Before making this assumption, however, determine whether easy but more time-consuming items have been answered. If many of the high group omitted many of the items, this is evidence that they were stopped by time and simply were doing those which looked easiest to them. If 80% complete the test with few omits among the high scorers, and if almost all get 75-80% of the way through, the test is generally considered to be primarily testing power.

C. Estimating Test Mean and Standard Deviation

When pretest and final test populations are comparable in ability level:¹

M = Σp (i.e., add the p's of the selected items to obtain an estimate of the mean)²

SD = Σ √(pq) · r_pbi (i.e., for each item, find the product of p times 1 - p; multiply the square root of the product by the point biserial; and sum over items to obtain an estimate of the standard deviation)

When pretest and final test populations are not comparable in ability level:

To obtain a rough estimate of the mean: convert the pretest p's to deltas (using Table A-1 in Appendix VI); estimate the operational form deltas (equated deltas) as described in the preceding Appendix IV-I; convert the equated deltas to p; and sum these p's over items (Angoff, 1971, p. 586). It is probably inadvisable to try to estimate the standard deviation where the two populations are not comparable, because of the change in point biserials to be expected in going from one population to the other.

¹ See Angoff, 1971, p. 586, and Gulliksen, 1950, pp. 376-377.

² To the extent that the pretest Nt (number answering this or a later item) differs from N (total number of examinees), use of p = N+/Nt will ordinarily yield an overestimate of the mean if those completing the test have more ability in the area tested than those who omit later items. If many fail to complete the pretest and if the final test will be similarly speeded, it may be better for estimating purposes to use p = N+/N.

Appendix IV-K

Test Review Guide and Test Review Sheet

Test Review Guide

PLEASE READ THESE INSTRUCTIONS BEFORE LOOKING AT THE TEST.

The accompanying test materials are to be locked up when not being worked on and should not be shown to others.

Your approach to this test should be toward evaluating the test as a whole. The items have been reviewed individually by the members of the Item Writing Committee both before and after pretesting.
An editor has reviewed the test for grammar, expression, punctuation, and spelling. We especially need the Panel's review of the overall test: Does it appear to do what it is designed to do? You will be asked specific questions about the test in these instructions. Although your primary focus will be on the total test rather than on individual items, the first few times you go through the test, make notes on problems you encounter with individual items, either in answering the item or in reviewing the test relative to specific questions. Make the note long enough to enable you to recall the difficulty later, but try not to become sidetracked by a particular item. Please go through the test step by step, rather than trying to do two steps at once. This will facilitate focusing on one problem at a time.

1. Take the test as if you were a candidate, recording responses on the enclosed answer sheet. It is difficult to see the test as the candidate will see it if you study the test over before taking it. Read the instructions before starting to count time. Note the time you start and the time you finish. Try to finish in the time allotted, as you would if you were a candidate. If you do not complete it, draw a line under the last item attempted and continue working. It would be most helpful to your later evaluation if you change to a different color pencil at the end of the time limit. DO NOT CHECK YOUR ANSWERS AGAINST THE SCORING KEY UNTIL YOU HAVE FINISHED.

2. When you have completed the test, check your answers against the scoring key and go over the items with which you find disagreement. For each item there must be a completely defensible answer, and only one. If you feel an answer other than the one keyed is defensible, circle both answers on the answer sheet and record your comments on the Test Review Sheet under the section for "Comments on Specific Items."

3. Fill in the blanks and answer the questions on the Test Review Sheet. In providing information on timing, you will wish to consider whether your own time is representative of what should be expected of the candidate. In considering difficulty, look through the test again and consider whether any of the items might cause special difficulties for any on-the-job persons whom you know; e.g., is the terminology appropriate? Would a person who has been working on the job in one locality have an unfair advantage over someone equally qualified who comes from another area and is not familiar with local practices and terminology? In considering coverage, keep in mind the test specifications which the Panel set up. Does the content appear to be well balanced relative to the intent?

4. Go over the test once more, this time item by item, reading the stem and each of the options slowly. Look especially for these possible weaknesses:
- Does any item have a distracter which could be defended as the correct answer?
- Do any items depend on correctly answering another item?
- Does information in any item give a clue to the correct answer in another item?
- On rereading any item, did you get an interpretation different from that which you had the first time?
- Record any comments you have on the Test Review Sheet under "Comments on Specific Items."

Attach your comment sheets, answer sheet, test key, and test specifications to your copy of the test and file them in a secure place until the meeting. Be sure to bring all of these items with you to the meeting.

TEST REVIEW SHEET

Reviewer ____________________    Test ____________________    Date ____________

Stage:  Pretest draft ____    Final draft ____

If more space is needed for comments, use another sheet.
Read the separate set of instructions in the Test Review Guide before filling in these sheets.

1. Appropriateness of test length:

   Present timing is about right ____
   Should allow ____ minutes for the test, or ____ items for the present time.

   Comments:

2. Appropriateness of difficulty:

   Are the first two items easy enough that all candidates will feel at ease in answering them? (It is not necessary that they be able to answer them correctly.)

   Comments:

3. Appropriateness of coverage:

   Did the content as a whole seem appropriate? Would qualified employees whom you know be likely to do well? Would the poorly qualified be likely to do substantially less well, i.e., would the test be likely to distinguish the qualified employee from the poorly qualified employee or applicant? Did you feel in taking the test that you were being tested on relevant knowledge? Will the candidates be likely to consider it relevant? Will minority candidates encounter any difficulties with the test which are unrelated to job performance? Does the content seem to you to fulfill the test specifications? Were some aspects covered too heavily at the expense of others?

   Comments:

The Test Review Continuation Sheet (following page) is to be used for comments on specific items.

Test Review Continuation Sheet
(Comments on Specific Items)

Reviewer ____________________    Date ____________

Test:  Pretest ____    Final ____

Code:
0 - Retain, no change necessary
1 - Retain with suggested modifications
2 - Retain if problem can be corrected
3 - Discard for reason given

Only those items on which you have comments need be listed. It will be assumed that those not listed are judged to be satisfactory.

Item Number        Comments        Code

Appendix IV-L

Brief Guide on Norming and Equating Procedures

Norming Procedures

To transform a set of raw scores to a given scale, use the following formula:

Y = (S_Y / S_X)(X - X̄) + Ȳ

where Y is the scale score equivalent to a raw score of X; X̄ and S_X are the obtained raw-score mean and standard deviation; and Ȳ and S_Y are the chosen scale-score mean and standard deviation.

If one has chosen to use a scale with a mean of 50 and a standard deviation of 10, the following steps would be taken: substitute into the equation the obtained values for X̄ and S_X, with Ȳ = 50 and S_Y = 10. This will give a formula of the form Y = bX + a. To obtain a table of scaled Y values equivalent to each X value, substitute each possible value of X in the obtained formula.

Equating Procedures

To transform scores on a current test to an established scale, one standard procedure is as follows:¹ Use in the pretest a set of items from an earlier form of the test for which scores on the established scale have already been obtained. (A set of 20 items, or 20% of the number of items in the test, whichever is greater, is recommended.) Obtain the statistics indicated in the following table, which also defines the symbols used in the equations below.

¹ These procedures, with adapted notation, are taken from Angoff, 1971, pp. 579-580. Several procedures are described there which depend on the randomness of the groups, the manner of administration, and the use of common items. He also includes procedures developed by Swineford and Fan for converting scores through item statistics.

                          Earlier form (group b)            Current form (group a)
                          Total test     Common items       Total test     Common items
                          raw score W    raw score U        raw score X    raw score U
  Mean                    W̄_b            Ū_b                X̄_a            Ū_a
  Standard deviation      S_Wb           S_Ub               S_Xa           S_Ua

where the subscript t is used to indicate data on the common items for the two populations combined.
Obtain the correlation, r_WU, between W and U for the earlier population. Obtain the correlation, r_XU, between X and U for the current population. Substitute the obtained values in the following equations:¹

X̄_t = X̄_a + r_XU (S_Xa / S_Ua)(Ū_t - Ū_a)

W̄_t = W̄_b + r_WU (S_Wb / S_Ub)(Ū_t - Ū_b)

S²_Xt = S²_Xa + (r_XU · S_Xa / S_Ua)² (S²_Ut - S²_Ua)

S²_Wt = S²_Wb + (r_WU · S_Wb / S_Ub)² (S²_Ut - S²_Ub)

¹ These equations are based on the assumption that the regression slope r_XU · S_X / S_U and the error variance S²_X (1 - r²_XU) are the same for group t as for group a, with the corresponding assumption for W, U, and group b (Angoff, 1971, p. 580).

Substitute these four values in the equation:

(W - W̄_t) / S_Wt = (X - X̄_t) / S_Xt

(Note that the denominators are the square roots of the variance values obtained above.) This gives an equation of the form W = AX + B, where A = S_Wt / S_Xt and B = W̄_t - A X̄_t, such that raw scores on X (the current form) can be converted to raw-score equivalents on W (the earlier form). A table of equivalents for X and W scaled scores can then be set up using the table of equivalents established earlier for W scaled scores and raw scores, or an equation for converting X to scaled scores can be produced by substituting the equation obtained above in the equation derived previously for Form W for converting to scaled scores. If Y = mW + k is the equation for converting W raw scores to a standard-score scale, substitute AX + B for W and obtain the formula Y = m(AX + B) + k, which can be used to convert X scores directly to scaled scores.

References

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Diederich, P. Short-cut statistics for teacher-made tests. Princeton, N. J.: Educational Testing Service, 1973.

Fan, C. T. Item analysis table. Princeton, N. J.: Educational Testing Service, 1952.

Appendix V

Explanatory Articles

A. Test Difficulty
B. Chance Scores
C. Basic Testing Principles for the Nonspecialist

Appendix V-A

Test Difficulty

The usefulness of a test in a given selection situation depends in part on the difficulty characteristics of the items. The following concepts are relevant:

1. With indices of item discriminating power such as one normally finds in achievement tests, the spread of scores tends to increase as item difficulties approach .50. If item discriminating power indices are very high, using items all with the same difficulty could give a bimodal distribution, but this is unlikely to be found in practice.

2. When chance responses are not involved (i.e., items are write-in answer only), selection at a given cut-point is theoretically improved by use of items which are at the .50 difficulty level for persons at the ability level represented by the cut-point.

3. For multiple-choice tests the average item difficulty for those at the cut-point should be above .50, depending on the number of answer options. Test specialists are not in complete agreement regarding the best average item difficulty.

4. There is also disagreement on the importance of aiming at a narrow range of item difficulties when examinees are to be ranked, and it has been suggested that less attention be given to the spread in item difficulty and more to other factors which affect test results (Nunnally, 1967, pp. 250-254; Swineford, 1974, p. 13). The following arguments are offered in favor of a wider range of item difficulty:

   a. It is not easy to develop all items within a narrow range of .50 to .80.

   b. Some important concepts may not readily be tested in that range.
   c. Gains in reliability from a greater concentration of item difficulty may be offset by possible losses in validity due to the elimination of some concepts.

The average item difficulty should be such that random guessing will be a minimal factor in scores which fall in the range where discrimination is wanted. Where discrimination is desired primarily in the middle of the score range, one might reasonably seek an average difficulty of .60 to .70 for the usual multiple-choice test with 4 or 5 choices (Tinkelman, 1971, p. 63; Lord & Novick, 1968, p. 392). A review of Educational Testing Service test analyses over the years indicates that, as long as a few very easy items (p = .85 to .95) are included in the test, most scores will fall outside the range reasonably attributable solely to chance. This situation occurs even when the average p value falls below .50 (Swineford, 1975). Scores throughout the total range will, of course, be influenced by guessing, though random guessing may be expected to decrease as scores increase.

If one were devising a test to select scholarship winners or, on the contrary, to screen out persons with essentially no knowledge, a very difficult or a very easy test, respectively, would be reasonable. In most cases where a cut-point is used, factors other than the test will be considered in selection. In such situations the cut-point will likely be based on considerations of what constitutes an acceptable level of knowledge and will be used only to screen out those with less knowledge, the final selection being made by other procedures.

The spread in item difficulties will be influenced somewhat by the selection procedure. If one is using a cut-point, statistical goals will dictate a narrow range of item difficulties, but one may wish to include a few highly important concepts which fall outside the middle difficulty range. If one wishes to compare examinees throughout the range, a wider spread in item difficulties may be necessary to test adequately those at the extremes of the ability distribution. As Swineford (1974) explains:

    Theoretical studies have shown that a test composed of items of middle difficulty for a group is more reliable for that group than one whose items cover a wide range in difficulty. In practice, however, the difference is so slight that it can be disregarded in favor of other test characteristics. Item biserial correlations are closely related to score variability, for example. When a test is to be used for various subgroups, which probably differ from one another with respect to level of ability, or when different test users select different cutting scores, then of course it is impossible to provide middle difficulty for each subgroup; for this reason a range in item difficulty is desirable. (p. 13)

One should avoid item difficulties below about .30, where chance may be contributing more to correct answers than knowledge. One or two items over .90 can help put the candidates at ease. The majority, however, should be in the range .35 to .85.

Of the two indices of difficulty which are used, p is the more common, but delta has certain advantages. The following outline compares them.

p: The number of examinees who answer this item correctly, divided by the number who record an answer to this or a subsequent item.

Delta: The standard score of the item, scaled so that the mean of delta is 13 and its standard deviation is 4.

Advantages of p:

1. More familiar.
2. When p values appropriate to the population are known, they can be summed to obtain the estimated mean of a newly assembled test. (See Appendix IV-J for formula and limitations.)

Disadvantages of p:

1. A given difference in p values has a different meaning at different points of the scale.
2. Cannot as easily be used in converting from one difficulty scale to another.

Advantages of delta:

1. Can be used for equating the item difficulties obtained on one population to those on a similar population of a somewhat different ability level.
2. A given range in delta values represents roughly the same ability increment at different levels of difficulty.

Disadvantages of delta:

1. Is not as easily computed.
2. Less information is available on it in the literature.

Appendix V-B

Chance Scores

In multiple-choice tests there is a possibility that the examinee can, by guessing, receive a substantial score. It is important to take this possibility into consideration. Even though most examinees do not guess at random, some do, especially when the material is so difficult that they have no knowledge at all. Guessing is likely to have more effect on certain individual scores than on test results as a whole.

The score an individual can be expected to obtain on a chance basis is computed as follows.¹ On the average the guesser will be expected to answer correctly a fraction 1/k of those items he does not know, where k is the number of options. For example, in a 50-item, five-choice test, if he guesses at all items, he would be expected to answer 10 items correctly on the average. Sometimes his chance score will be higher, sometimes lower. We are most concerned about how high a score he could get solely by chance. To determine this we compute the standard deviation of chance, √(n(k - 1)/k²), where n is the number of items. For the 50-item, five-choice test, this would be 2.8. Sixteen percent of the time he would be expected to obtain a score 1 or more standard deviations of chance above the chance mean, in this case 10 + 2.8, or 12.8. (Another way of looking at this would be to say that 16% of those who guess at all items would be expected to obtain scores of 12.8 or above. Such a fractional score ordinarily is not obtainable, but it is used here to illustrate the statistical principle.)

¹ These statistics are based on the assumption that each option is equally frequently keyed as correct.

Of course, most will not guess at all items. Suppose an individual records what he believes is the correct answer to t items and guesses at the rest. The same formulas apply, but only to the items for which he guesses the answer. In this case he would be expected to get (n - t)/k points by guessing, on the average. The standard deviation of chance in this case is √((n - t)(k - 1)/k²). As more items are known, less guessing will be done, and the effect of chance responses becomes less. For this reason, if one's goal is that the average examinee will answer 70% of the items correctly, the effect of chance will be relatively small for most examinees.

In a four-choice, 50-item test, the person who knows the answers to 30 items (60% of 50) and guesses at random for the remaining 20 may be expected to obtain a score of 35 or more half the time (30 + (50 - 30)/4 = 35) and a score of 37 or higher 16% of the time (35 + √((50 - 30)(4 - 1))/4 ≈ 37). Thus there is a substantial probability that someone who knows the answers to 60% of the items could correctly answer 70% of the items by guessing.
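The arithmetic in this appendix is simple enough to verify directly. The following is a minimal sketch only; the function names are arbitrary, the binomial assumptions are those of footnote 1 above, and it merely reproduces the 50-item examples just given.

```python
import math

def chance_mean(n_guessed, k):
    """Expected number right from random guessing on n_guessed k-choice items."""
    return n_guessed / k

def chance_sd(n_guessed, k):
    """Standard deviation of the number right from random guessing (binomial)."""
    return math.sqrt(n_guessed * (k - 1)) / k

# 50-item, five-choice test, guessing at every item:
print(chance_mean(50, 5), round(chance_sd(50, 5), 1))        # 10.0  2.8

# Four-choice, 50-item test; examinee knows 30 items and guesses at the other 20:
known, guessed, k = 30, 20, 4
mean_score = known + chance_mean(guessed, k)                 # 35, reached half the time
high_score = mean_score + chance_sd(guessed, k)              # about 37, reached ~16% of the time
print(round(mean_score, 1), round(high_score, 1))
```

As footnote 1 notes, these figures assume that each option is equally often keyed correct and that the guessing is truly random.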
On the other hand, persons who actually know 30 of the 50 items are not likely to make random guesses; they will have partial knowledge which will lead them to some correct answers and some incorrect answers. Accordingly, wholly random guessing will occur infrequently on most tests. An individual is ordinarily better off trying to answer those questions about which he has some knowledge, since his ability to eliminate some answers as wrong will enhance his probability of answering correctly, as it should.

As the test is made more speeded or more difficult, guessing increases and correction for guessing (by the formula R - W/(k - 1)) becomes desirable; however, if everyone records an answer to every item, the ranking of the examinees will be the same regardless of whether such a scoring formula is used.

One can reduce the effect of guessing by increasing the number of answer choices per item, but little is gained beyond five options. As the number of options increases, so does the time spent reading noneffective options. A better plan is to use fewer options and include more items. In general, for a power test where the mean is around 70% of the total number of items, a rights-only score is preferable. See also the discussion of the problem of guessing by Thorndike (1971, pp. 59-61).

Appendix V-C

Basic Testing Principles for the Nonspecialist

1. A test is a sampling process; it measures a sample of all the understanding needed for a given purpose. Rarely does it measure everything the examinee should know.

2. A person who can answer most of the items correctly is considered to have more knowledge or skill in the subject area measured by these items than one who can answer correctly a significantly smaller number of items.

3. If a person takes many parallel forms of the test, his scores will be expected to vary around his "true score" (i.e., his average score over all possible parallel tests). This variability is represented by the standard error of measurement (usually referred to as the SEM).

4. The mean score of a test is the average score of those taking it, and it is a measure of the difficulty of the test relative to the total number of items in the test.

5. The standard deviation (SD) of scores on a test indicates how much scores vary around the mean. If all the scores are near one another, the standard deviation will be small. If the scores are spread out, the standard deviation will be large.

6. The larger the standard deviation is relative to the standard error of measurement, the more accurately a test measures whatever it is measuring. This is represented by the "reliability." A test's reliability depends on such things as the number of items, the ability of individual items to discriminate among examinees (see principle 11), the difficulty of the test, and the extent to which different items measure the same thing.

7. A test's validity is the extent to which it measures what it is intended to measure. For job selection tests, it indicates how well the examinees' scores are related to one or more important aspects of what is required on the job.

8. A test can be reliable without being valid, but it cannot have high validity without having high reliability.

9. A test which few examinees finish may be measuring substantially how quickly a person can respond in a test situation, as well as the ability he has in the content area.
(This is called a "speeded test" or a "test of speed.") Allowing more time for such a test may change the rank order of examinees with respect to final scores. When most examinees have time to respond to all items on the test, the test is called a "power test."

10. A raw score (number of items answered correctly) by itself gives little information about an examinee's ability. It must be related to the specific kind of content in the test or to a "norm" group in order to be interpreted. A standard score shows how the examinee stands relative to the "average" person of a defined group in terms of the standard deviation; i.e., it shows whether the individual is 1 standard deviation above the mean, 1.5 standard deviations above the mean, etc.

11. The ability of a test item to distinguish high performers from low performers on the job depends on such things as its relevance to abilities or skills required on the job, ambiguities in the item, defensibility of the intended correct answer or an intended wrong answer, the number of answer options, the difficulty of the item, unintentional clues to the correctness or incorrectness of an option, how the test is designed and administered, etc.

References

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Swineford, F. The test analysis manual (ETS SR-74-06). Princeton, N. J.: Educational Testing Service, 1974.

Swineford, F. Personal communication, 1975.

Thorndike, R. L. (Ed.). Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Tinkelman, S. N. Planning the objective test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Appendix VI

Tables

A. Percent Correct and Delta Equivalents
   1. Table of Delta for Selected Values of p
   2. Table of p for Selected Values of Delta
B. Conversion of r_bis (biserial) to r_pbi (point biserial)
C. Standard Errors of r_bis for Selected Values of r_bis, p, and N

Appendix VI-A-1

Table A-1
Table of Delta for Selected Values of p¹

Read p as the row value (tens) plus the column value (units); for example, p = 72.5% gives a delta of 10.6.

p (tens)                          Units of p
            0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5
90          7.9   7.8   7.6   7.5   7.4   7.2   7.1   6.9   6.8   6.6
80          9.6   9.6   9.5   9.4   9.3   9.3   9.2   9.1   9.0   8.9
70         10.9  10.8  10.8  10.7  10.7  10.6  10.5  10.5  10.4  10.4
60         12.0  11.9  11.9  11.8  11.8  11.7  11.7  11.6  11.6  11.5
50         13.0  13.0  12.9  12.8  12.8  12.7  12.7  12.6  12.6  12.5
40         14.0  14.0  13.9  13.9  13.8  13.8  13.7  13.7  13.6  13.6
30         15.1  15.0  15.0  14.9  14.9  14.8  14.8  14.7  14.6  14.6
20         16.4  16.3  16.2  16.2  16.1  16.0  16.0  15.9  15.8  15.8
10         18.1  18.0  17.9  17.8  17.7  17.6  17.5  17.4  17.3  17.2
 0           —     —     —     —     —     —     —     —     —     —

p (tens)                          Units of p
            5.0   5.5   6.0   6.5   7.0   7.5   8.0   8.5   9.0   9.5
90          6.4    —     —     —     —     —     —     —     —     —
80          8.9   8.8   8.7   8.6   8.5   8.4   8.3   8.2   8.1   8.0
70         10.3  10.2  10.2  10.1  10.0  10.0   9.9   9.8   9.8   9.7
60         11.5  11.4  11.4  11.3  11.2  11.2  11.1  11.1  11.0  11.0
50         12.5  12.4  12.4  12.3  12.3  12.2  12.2  12.1  12.1  12.0
40         13.5  13.5  13.4  13.4  13.3  13.3  13.2  13.2  13.1  13.0
30         14.5  14.5  14.4  14.4  14.3  14.3  14.2  14.2  14.1  14.1
20         15.7  15.6  15.6  15.5  15.5  15.4  15.3  15.3  15.2  15.2
10         17.1  17.1  17.0  16.9  16.8  16.7  16.7  16.6  16.5  16.4
 0         19.6  19.4  19.2  19.1  18.9  18.8  18.6  18.5  18.4  18.2

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.
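Because delta is simply a normalized item difficulty with mean 13 and standard deviation 4, the conversion tabled in Table A-1 (and its inverse in Table A-2, which follows) can also be computed directly through the inverse normal distribution. A minimal sketch, for illustration only (the function names are arbitrary; statistics.NormalDist is the Python standard-library normal distribution):

```python
from statistics import NormalDist

def p_to_delta(p):
    """ETS delta: the normal deviate exceeded by a proportion p of examinees, scaled to mean 13, SD 4."""
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)

def delta_to_p(delta):
    """Inverse conversion, as tabled in Table A-2."""
    return 1.0 - NormalDist().cdf((delta - 13.0) / 4.0)

# Spot checks against the tables:
print(round(p_to_delta(0.90), 1))   # 7.9
print(round(p_to_delta(0.50), 1))   # 13.0
print(round(p_to_delta(0.20), 1))   # 16.4
print(round(delta_to_p(10.0), 2))   # 0.77
```

Rounded as in the tables, these spot checks agree with the tabled entries.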
Appendix VI-A-2

Table A-2
Table of p for Selected Values of Delta¹

Read delta as the row value (units) plus the column value (tenths); for example, delta = 11.3 gives p = .66.

Delta                         Tenths of delta
           .0    .1    .2    .3    .4    .5    .6    .7    .8    .9
19.0      .07   .06   .06   .06   .05   .05   .05   .05    —     —
18.0      .11   .10   .10   .09   .09   .08   .08   .08   .07   .07
17.0      .16   .15   .15   .14   .14   .13   .13   .12   .12   .11
16.0      .23   .22   .21   .20   .20   .19   .18   .18   .17   .16
15.0      .31   .30   .29   .28   .27   .27   .26   .25   .24   .23
14.0      .40   .39   .38   .37   .36   .35   .34   .34   .33   .32
13.0      .50   .49   .48   .47   .46   .45   .44   .43   .42   .41
12.0      .60   .59   .58   .57   .56   .55   .54   .53   .52   .51
11.0      .69   .68   .67   .66   .66   .65   .64   .63   .62   .61
10.0      .77   .77   .76   .75   .74   .73   .73   .72   .71   .70
 9.0      .84   .84   .83   .82   .82   .81   .80   .80   .79   .78
 8.0      .89   .89   .88   .88   .87   .87   .86   .86   .85   .85
 7.0      .93   .93   .93   .92   .92   .92   .91   .91   .90   .90
 6.0       —     —     —    .95   .95   .95   .95   .94   .94   .94

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.

Appendix VI-B

Table B
Conversion of r_bis (biserial) to r_pbi (point biserial)¹
(Assumes r_bis was obtained from a normal distribution)

p:           .50   .40   .30   .25   .20   .15   .12   .10   .08
                    or    or    or    or    or    or    or    or
                   .60   .70   .75   .80   .85   .88   .90   .92
Multiplier:  .798  .789  .759  .734  .700  .653  .616  .585  .548

r_bis                            r_pbi
.26          .21   .21   .20   .19   .18   .17   .16   .15   .14
.28          .22   .22   .21   .21   .20   .18   .17   .16   .15
.30          .24   .24   .23   .22   .21   .20   .18   .18   .16
.32          .26   .25   .24   .23   .22   .21   .20   .19   .18
.34          .27   .27   .26   .25   .24   .22   .21   .20   .19
.36          .29   .28   .27   .26   .25   .24   .22   .21   .20
.38          .30   .30   .29   .28   .27   .25   .23   .22   .21
.40          .32   .32   .30   .29   .28   .26   .25   .23   .22
.42          .34   .33   .32   .31   .29   .27   .26   .25   .23
.44          .35   .35   .33   .32   .31   .29   .27   .26   .24
.46          .37   .36   .35   .34   .32   .30   .28   .27   .25
.48          .38   .38   .36   .35   .34   .31   .30   .28   .26
.50          .40   .39   .38   .37   .35   .33   .31   .29   .27
.52          .41   .41   .39   .38   .36   .34   .32   .30   .28
.54          .43   .43   .41   .40   .38   .35   .33   .32   .30
.56          .45   .44   .43   .41   .39   .37   .34   .33   .31
.58          .46   .46   .44   .43   .41   .38   .36   .34   .32
.60          .48   .47   .46   .44   .42   .39   .37   .35   .33
.62          .49   .49   .47   .46   .43   .40   .38   .36   .34
.64          .51   .50   .49   .47   .45   .42   .39   .37   .35
.66          .53   .52   .50   .48   .46   .43   .41   .39   .36
.68          .54   .54   .52   .50   .48   .44   .42   .40   .37
.70          .56   .55   .53   .51   .49   .46   .43   .41   .38

For values of r_bis higher than .70, multiply by the multiplier at the head of the column.

¹ Prepared by the author using the formula r_pbi = r_bis · y / √(p(1 - p)), where y is the ordinate of the normal curve at the point dividing the distribution in the proportions p and 1 - p (Lord, F. M., & Novick, M. R., Statistical theories of mental test scores, p. 340).

Appendix VI-C

Table C
Standard Errors of r_bis for Selected Values of r_bis, p, and N¹

                              p
r_bis         .50    .40 or .60    .30 or .70    .20 or .80

N = 2,000
.60          .020       .020          .021          .024
.40          .024       .025          .026          .028
.20          .027       .027          .029          .031

N = 1,000
.60          .028       .029          .030          .034
.40          .035       .035          .037          .040
.20          .038       .039          .040          .044

N = 500
.60          .040       .041          .043          .048
.40          .049       .050          .052          .057
.20          .054       .055          .057          .062

N = 300
.60          .052       .052          .055          .062
.40          .063       .064          .067          .073
.20          .070       .071          .074          .080

N = 150
.60          .073       .074          .078          .087
.40          .090       .090          .095          .104
.20          .099       .100          .104          .113

N = 100
.60          .089       .091          .096          .107
.40          .109       .111          .116          .127
.20          .121       .123          .128          .139

¹ From The Test Analysis Manual (ETS SR-74-06) by F. Swineford. Princeton, N. J.: Educational Testing Service, 1974. Reprinted by permission.

Bibliography

Adkins, D. C. Construction and analysis of achievement tests. Washington, D. C.: U. S. Civil Service Commission, 1947. (Out of print)
Adkins, D. C. Test construction (2nd ed.). Columbus, Ohio: Charles E. Merrill, 1974.

American Psychological Association. Standards for educational and psychological tests. Washington, D. C.: Author, 1974.

Anastasi, A. Psychological testing (4th ed.). New York: Macmillan, 1976.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Bloom, B. S. (Ed.). Taxonomy of educational objectives: The classification of educational goals, handbook 1: Cognitive domain. New York: McKay, 1956.

Bloom, B. S., Hastings, J. T., & Madaus, G. F. Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill, 1971.

Clemans, W. V. Test administration. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Cronbach, L. J. Essentials of psychological testing (3rd ed.). New York: Harper and Row, 1970.

Ebel, R. L. Essentials of educational measurement (2nd ed.). Englewood Cliffs, N. J.: Prentice-Hall, 1972.

Educational Testing Service. Making the classroom test: A guide for teachers. Princeton, N. J.: Author, 1973. (a)

Educational Testing Service. Multiple-choice questions: A close look. Princeton, N. J.: Author, 1973. (b)

Equal Employment Opportunity Commission, 29 CFR 1607. Guidelines on employee selection procedures. 35 Fed. Reg. 12333.

Guion, R. M. Personnel testing. New York: McGraw-Hill, 1965.

Guilford, J. P. Psychometric methods (2nd ed.). New York: McGraw-Hill, 1954.

Guilford, J. P., & Fruchter, B. Fundamental statistics in psychology and education (5th ed.). New York: McGraw-Hill, 1973.

Gulliksen, H. Theory of mental tests. New York: John Wiley, 1950.

Henrysson, S. Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Katz, M. Selecting an achievement test: Principles and procedures. Princeton, N. J.: Educational Testing Service, 1973.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

McNemar, Q. Psychological statistics (3rd ed.). New York: John Wiley, 1962.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Office of Federal Contract Compliance, Equal Employment Opportunity, Department of Labor, 41 CFR 60-3. Employee testing and other selection procedures. 36 Fed. Reg. 19307.

Office of Federal Contract Compliance, 41 CFR 60-3. Employee testing and other selection procedures: Guidelines for reporting validity. 38 Fed. Reg. 4413.

Swineford, F. The test analysis manual (ETS SR-74-06). Princeton, N. J.: Educational Testing Service, 1974.

Tinkelman, S. N. Planning the objective test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Thorndike, R. L. (Ed.). Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Thorndike, R. L. Reproducing the test. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

U. S. Civil Service Commission. The selection of employees (prepared for the Department of the Navy). Washington, D. C.: Author, November 1959.

U. S. Civil Service Commission. Benchmark position descriptions for the factor ranking/benchmark approach to the evaluation of general schedule positions, GS-1 through GS-15. Washington, D. C.: Author, July 1973.
U. S. Civil Service Commission. Instructions for field test of the factor ranking/benchmark approach to the evaluation of general schedule positions, GS-1 through GS-15. Washington, D. C.: Author, October 1973.

U. S. Civil Service Commission. Achieving job-related selection for entry-level police officers and firefighters. Washington, D. C.: Author, November 1973.

U. S. Department of Health, Education, and Welfare. Information for examination review panel members. Washington, D. C.: Author, 1962.

U. S. Department of Health, Education, and Welfare. The construction of test questions, part I: Forms of test questions. Washington, D. C.: Author, 1963.

U. S. Department of Health, Education, and Welfare. The construction of test questions, part II: Techniques of item construction. Washington, D. C.: Author, 1967.

U. S. Department of Labor. Handbook for analyzing jobs. Washington, D. C.: Author, 1972.

Wesman, A. G. Writing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Wood, D. A. Test construction. Columbus, Ohio: Charles E. Merrill, 1960.