Psychological Test Score Analysis

The document discusses different methods for interpreting psychological test scores, including raw scores, norm-referenced scores, and criterion-referenced scores. Norm-referenced scores compare an individual's performance to that of a standardized sample group and can be expressed as percentiles, standard deviations, or age/grade equivalents. Criterion-referenced scores measure performance against a specified standard or criterion rather than comparative performance. The document also covers topics like developing norms, comparing scores over time/across tests, and using item response theory for "sample-free" measurement.

Lecture 3 Psychological Testing and Measurement

Sunthud Pornpresertmanit
Why interpret raw scores?

            Student A    Student B
Math            50           60
English         40           45
Science         70           40
Why interpret raw scores?
 A raw score on any psychological test is, by itself, meaningless.
 Raw score
 Percentage
 The difficulty level of the items making up each test
will determine the meaning of the score.
How to interpret scores?
 Relative Score (Norm-referenced)
 Absolute Score (Criterion-referenced)
Norm-referenced
 Relative scores are designed to serve a dual purpose.
 Evaluate the individual's performance against a normative sample
 Provide comparable measures across different tests
 Relative scores can be expressed in the same units.
Norm-referenced
 Relative scores are expressed in one of two major ways.
 Developmental level attained
 Relative position within a specified group
Developmental Norms
 How far along the normal developmental path the
individual has progressed
 They are psychometrically crude and do not lend
themselves well to precise statistical treatment.
Developmental Norms
 Mental Age
 Grade Equivalent
 Ordinal Scale
Mental Age: Age-equivalent items
 Items were grouped into year levels.
 Scores were based on the highest year level the child
attained.
 In actual practice, individual performance showed a
certain amount of scatter.
 Therefore, it was customary to compute a basal age and
give partial credit for all tests passed at higher year
levels.
Mental Age: Age norms
 The performance of the standardization sample at each
age constitutes the age norm on the test.
 All raw scores on such a test can then be expressed as
age equivalents by reference to the age norms.
Mental Age: Limitations
 The mental age unit tends to shrink with advancing
years.
 Intellectual development progresses faster in early
childhood and slows as the individual approaches maturity.
Grade-equivalent
 Like mental age norms
 Computed by finding the mean raw score obtained by
children in each grade and then equating raw scores to
the grade means
Grade-equivalent: Limitations
 Content of instruction varies from grade to grade
 Units are unequal across grades and subjects
 Grade norms tend to be incorrectly regarded as
performance standards
 Grade norms are subject to misinterpretation
Ordinal scale
 Developmental norms derive from research in child
psychology (empirical observation).
 Gesell Developmental Schedules
 Sequential development in motor, adaptive, language,
personal-social from 4 weeks to 36 months
Ordinal scale
 Jean Piaget
 Concerned with specific concepts in cognitive development
rather than broad abilities (e.g., object permanence,
conservation)
 His theory has been used to build standardized scales.
 Critical analyses and empirical evaluations have
highlighted both its constructive features and
limitations.
Ordinal scale
 In conclusion, ordinal scales are designed to identify
the stage reached by the child in the development of
specific behavior functions.
 Scores are interpreted both quantitatively (age) and
qualitatively (description).
 They share important features with criterion-referenced
tests.
Discussing Question
 Why are grade norms subject to misinterpretation?
 What criterion determines whether a behavior can be
developed into an ordinal scale?
Within-group Norms
 The individual’s performance is evaluated in terms of
the performance of the most nearly comparable
standardized group.
 Scores have a clearly defined quantitative meaning.
Within-group Norms
 Percentile: percentage of persons in the
standardization sample who fall below a given raw
score
 Standard score (z-score): linear transformation of the raw
score to M = 0, SD = 1
 Derived standard score: further transformation of the
standard score to a scale with a more convenient mean and
SD (see the sketch below)
 Deviation IQ: M = 100, SD = 15 or 16 (depending on the test)
 T-score: M = 50, SD = 10
 CEEB score: M = 500, SD = 100
 Stanine: M = 5, SD = 2
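
To make these transformations concrete, here is a minimal Python sketch (an illustration, not part of the lecture; the norm-group scores are hypothetical) computing each within-group score for one raw score:

import statistics

raw_scores = [42, 45, 48, 53, 55, 58, 59, 61, 66, 70]  # hypothetical norm group
x = 61                                                  # one examinee's raw score

mean = statistics.mean(raw_scores)
sd = statistics.stdev(raw_scores)

# Percentile: percentage of the standardization sample falling below x
percentile = 100 * sum(s < x for s in raw_scores) / len(raw_scores)

# Standard score (z): linear transformation to M = 0, SD = 1
z = (x - mean) / sd

# Derived standard scores: further linear transformations of z
deviation_iq = 100 + 15 * z                  # M = 100, SD = 15
t_score = 50 + 10 * z                        # M = 50, SD = 10
ceeb = 500 + 100 * z                         # M = 500, SD = 100
stanine = max(1, min(9, round(5 + 2 * z)))   # M = 5, SD = 2, clipped to 1-9

print(percentile, z, deviation_iq, t_score, ceeb, stanine)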
Within-group Norms
 Standard scores and derived standard scores
 Linear transformation:
Raw score  Standard score
 Normalized transformation:
Raw score  Cumulative percent  Standard score
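
The contrast between the two paths can be sketched in a few lines of Python (hypothetical data; a rough illustration rather than an operational norming procedure). The linear path preserves the shape of the raw-score distribution; the normalized path replaces each raw score by its cumulative proportion and then by the normal deviate having that proportion:

import statistics
from statistics import NormalDist

raw = [3, 5, 5, 6, 8, 9, 12, 15, 21, 30]    # hypothetical, skewed raw scores
mean, sd = statistics.mean(raw), statistics.stdev(raw)

# Linear transformation: raw score -> standard score
linear_z = [(x - mean) / sd for x in raw]

# Normalized transformation: raw score -> cumulative percent -> standard score
n = len(raw)
normalized_z = []
for x in sorted(raw):
    # mid-rank cumulative proportion, then the normal deviate that matches it
    cum_p = (sum(s < x for s in raw) + 0.5 * raw.count(x)) / n
    normalized_z.append(NormalDist().inv_cdf(cum_p))

print(linear_z)       # keeps the original skew
print(normalized_z)   # forced toward a normal shape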
Within-group Norms
 Why transform to a normal distribution?
 Most raw score distributions approximate the normal
curve more closely than they do any other type of curve.
 The normal curve has many useful mathematical properties.
 Normalized transformations should be carried out only
when the sample is large and representative.
 In such samples, marked deviation from normality usually
results from defects in the test.
Discussing Question
 What are the advantages and disadvantages of
percentiles, standard scores, and derived standard scores?
 Why is the ratio IQ not comparable across age levels?
 What are the characteristics of the samples that make up
norms?
 Why does the population used to establish norms affect
score interpretation and the comparability of derived
scores within a person?
Discussing Question
Fill in which properties apply to each type of score:

                                 Percentile   z score   Derived z
Easy to compute and understand
Inequality of units
Negative values and decimals
Norms
 Test scores cannot be interpreted in the abstract; they
must be referred to a particular test.
 There are three principal reasons for systematic
variations among the scores obtained by the same
individual on different tests:
 Tests may differ in content despite similar labels.
 The scale units may not be comparable.
 The composition of the standardization samples used in
establishing norms for different tests may vary.
Norms
 Which population should be used to establish norms?
 How should a sample be drawn from that population?
 How large a sample is needed to establish norms?
 How high are the response rates?
Discussing Question
 Why are institutional samples not representative of the
population?
 School
 Hospital
 Prison
 What are the advantages and disadvantages of a
restricted population?
Equating Norms Procedures
 How to compare scores of individuals or groups across
time or test
 Alternate form
 Anchor test
 Fixed reference group
 Simultaneous norming
Alternate form
 Two or more versions of a test
 Alternate forms = tests matched in content sampling
 Parallel forms = matched in content, M, SD, reliability,
and validity
Simultaneous norming
 By norming tests at the same time and on the same
group of people, one can readily compare the
performance of individuals or subgroups on more than
one test, using the same standard.
 Equipercentile method (sketched below)
 Example: Metropolitan Achievement Test
 300,000 fourth-, fifth-, and sixth-grade schoolchildren
 7 batteries that contain reading comprehension and
vocabulary subtests
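
A rough stdlib sketch of the equipercentile idea (an assumed implementation for illustration, not the Metropolitan procedure): scores on two tests normed on the same group are treated as equivalent when they hold the same percentile rank.

def percentile_rank(score, norms):
    """Percentage of the norm group below `score` (mid-rank for ties)."""
    below = sum(s < score for s in norms)
    ties = norms.count(score)
    return 100 * (below + 0.5 * ties) / len(norms)

def equipercentile_equate(x_score, x_norms, y_norms):
    """Return the test Y score whose percentile rank is closest to x_score's."""
    target = percentile_rank(x_score, x_norms)
    return min(set(y_norms), key=lambda y: abs(percentile_rank(y, y_norms) - target))

# Hypothetical scores from one simultaneously normed group
x_norms = [10, 12, 15, 18, 20, 22, 25, 27, 30, 33]
y_norms = [40, 44, 48, 55, 58, 61, 66, 70, 75, 80]
print(equipercentile_equate(20, x_norms, y_norms))  # -> 58, the Y equivalent of X = 20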
Anchor tests
 An anchor test consists of a common set of items
administered to different groups in the context of two
or more tests.
 These items allow test scores to be compared from one
test to another (one simple linking approach is sketched
below).
 The accuracy of the comparison depends on the correlation
between the tests being equated and on the reliability and
validity of each test.
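
One simple way to use an anchor link is chained linear equating, sketched below with hypothetical data (an illustration of the idea, not any specific operational method): map a new-test score onto the anchor scale within its own group, then map that anchor score onto the old-test scale within the other group.

import statistics

def linear_link(score, from_scores, to_scores):
    """Map a score between distributions by matching z-scores."""
    z = (score - statistics.mean(from_scores)) / statistics.stdev(from_scores)
    return statistics.mean(to_scores) + statistics.stdev(to_scores) * z

# Hypothetical data: group A took the old test plus the anchor items,
# group B took the new test plus the same anchor items.
old_A    = [55, 60, 62, 66, 70, 73, 78, 81]
anchor_A = [12, 14, 15, 17, 18, 20, 21, 23]
new_B    = [48, 52, 54, 57, 60, 63, 67, 70]
anchor_B = [10, 12, 13, 15, 16, 18, 19, 21]

def equate_new_to_old(new_score):
    # Chain through the anchor: new test -> anchor (group B),
    # then anchor -> old test (group A).
    v = linear_link(new_score, new_B, anchor_B)
    return linear_link(v, anchor_A, old_A)

print(equate_new_to_old(60))  # old-scale equivalent of a new-test score of 60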
Specific Norms
 For many testing purposes, highly specific norms are
desirable.
 Subgroups may be formed with respect to age, grade,
type of curriculum, sex, geographical region, urban or
rural environment, socioeconomic level etc.
 Local norms are developed by test users within
particular settings.
 Local norms are more appropriate than broad national
norms for many testing purposes, such as the prediction
of subsequent performance or college achievement.
Fixed reference groups
 Norms are fixed to one test administration (the reference
group, as with a specific norm).
 When a new form is developed, some items from the earlier
form are included in it to equate the new form to the old
norms (as with an anchor test).
 The score scale is thus maintained over time.
Fixed reference groups
 Example: Scholastic Aptitude Test
 1926-1941  norms computed each year
 After 1941  the 11,000 test takers of 1941 used as the
reference group
 Any change in the candidate population over time can be
detected only with a fixed-score scale.
 In 1995, the SAT changed its fixed reference group from
the 1941 sample to 1990 test takers.
Item response theory
 Item response theory (IRT) or latent trait theory
 Based on the probability that a person of specified ability
(the so-called latent trait) succeeds on an item of
specified difficulty (see the sketch below).
 IRT models are used to establish a uniform “sample-
free” scale of measurement.
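
The simplest such model is the one-parameter logistic (Rasch) model. A minimal sketch, assuming the standard Rasch form: the probability of a correct answer depends only on the gap between the person's ability and the item's difficulty, both expressed on the same latent scale.

import math

def p_correct(theta, b):
    """Rasch model: P(correct) given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(p_correct(0.0, 0.0))  # ability equals difficulty -> 0.50
print(p_correct(1.5, 0.0))  # abler person, same item   -> ~0.82
print(p_correct(0.0, 1.5))  # harder item, same person  -> ~0.18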
Item response theory
 IRT models place both persons and test items on a
common scale.
 If item parameter estimates (such as item difficulty)
are invariant across populations, it is not necessary to
tie performance to a reference group.
Item response theory
 IRT is well suited for use in computerized adaptive
testing (CAT).
 Test takers' ability levels can be estimated from their
responses as the testing proceeds.
 These estimates are used to select subsequent test items
that are appropriate to the test taker's ability level
(a toy version of this loop is sketched below).
 Problems of CAT include test security, test costs, and the
inability of examinees to review and amend their
responses.
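
A toy version of the CAT loop described above (hypothetical items and a deliberately crude up/down ability update; operational CAT uses maximum-likelihood or Bayesian estimation): each response nudges the ability estimate, and the next item is the unused one whose difficulty lies closest to that estimate.

items = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}  # Rasch difficulties

def next_item(theta, unused):
    # Under the Rasch model an item is most informative when its
    # difficulty matches the current ability estimate.
    return min(unused, key=lambda i: abs(items[i] - theta))

def run_cat(answer, n_items=3, theta=0.0, step=0.7):
    unused = set(items)
    for _ in range(n_items):
        item = next_item(theta, unused)
        unused.discard(item)
        correct = answer(item)               # administer the item
        theta += step if correct else -step  # crude ability update
        step *= 0.7                          # smaller steps as evidence accumulates
    return theta

# Usage: a hypothetical examinee who answers every easy item (b <= 0) correctly
print(run_cat(lambda item: items[item] <= 0.0))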
Criterion-referenced test
 Criterion-, content-, domain-, or objective-referenced
test
 Criterion-referenced testing uses a specified criterion or
standard as its interpretive frame of reference.
 It is found most often in educational and occupational
settings.
 The CRT focus is on what test takers can do and what
they know, not on how they compare with others.
Criterion-referenced test
 The CRT can be used from two perspectives:
 Knowledge testing
 Mastery testing
Knowledge testing
 The essential first step in constructing such a test is to
define clearly the domain of knowledge or skills to be
assessed and the application of that knowledge.
 Table of specifications
 Then, items are prepared to sample each objective.
 Test items may be written following this protocol.
 These tests are usually described as measures of
“achievement.”
Knowledge testing
 The CRT score has qualitative meaning.
 This use of the CRT is equivalent to interpreting test
scores in the light of the demonstrated validity of the
particular test (and can be combined with NRT).
Performance Assessment
 Sometimes the CRT is used to ascertain or certify
competence in tasks that are more realistic.
 The question is: “Does this test taker display
mastery of the skill in question?”
 It can be answered by subjective judgment or by
developing a systematic method for applying the criteria.
Mastery testing
 Procedures that evaluate test performance on the basis
of whether the test taker does or does not demonstrate a
preestablished level of mastery are known as mastery
testing.
 After suitable training, nearly everyone can achieve
complete mastery.
 Therefore, individual differences in performance are of
little or no interest.
Mastery testing
 Two important questions:
 How many items must be used for a reliable assessment of
each of the specific instructional objectives covered by
the test? (Number of items)
 What proportion of items must be correct for the
reliable establishment of mastery? (Cutoff)
 These questions can be addressed with statistical
techniques (one simple approach is sketched below).
 A good example is the qualifying test for a driver’s
license.
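
For instance, a binomial model offers one hedged answer to both questions (an illustration, not the lecture's method; the skill levels are assumed): treat the number of correct answers as binomial and check how often a true master and a true nonmaster would clear a candidate cutoff, then adjust the test length and cutoff until both error rates are acceptable.

from math import comb

def p_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_items, cutoff = 20, 15             # hypothetical test length and cutoff score
p_master, p_nonmaster = 0.85, 0.60   # assumed true proportions correct

print(p_at_least(cutoff, n_items, p_master))     # chance a true master passes
print(p_at_least(cutoff, n_items, p_nonmaster))  # chance a nonmaster slips through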
Mastery testing
 The utilities of mastery and nonmastery decisions can be
used to specify the cutoff score.
 Mastery tests are suitable for basic skills.
Relation between NRT and CRT
 Beyond basic skills, mastery testing is inapplicable or
insufficient.
 Therefore, NRT is usually used for complex skills.
 Some published tests are constructed so as to permit
both norm-referenced and domain-referenced
applications.
 An example is the Stanford Diagnostic Test in
reading and in mathematics.
Relation between NRT and CRT
 A normative framework is implicit in all testing.
 Applying a cutoff point to dichotomize performance
simply ignores the remaining individual differences
within the two categories and discards potentially
useful information.
Discussing Question
 Are letter grades suitable for assessing academic
performance? Why?
 Suppose that you are developing a licensing test for
clinical psychologists. Construct the test by following
these steps:
 Define the content domain and objectives
 Decide how to measure them
 Develop a cutoff standard (in your own judgment)
