Boston College Lynch School of Education
Psychometric Theory - ED669
(Spring 2000)
Psychometric Theory is that discipline which addresses the measurement and quantification of psychological phenomena (latent traits). Strictly speaking, psychological phenomena are not directly observable. Typically, they must be inferred from observations taken on some behavior that may be observed and is assumed to operationally define the unobservable characteristic that is of interest. An operational definition is most useful when it delineates boundaries of behavior and differential points between those boundaries. Ideally, a "scale" comprised of independent items is developed to measure a hypothesized unidimensional trait. Data are gathered and various statistical models are then employed to determine the extent to which the scale, or measurement instrument, functioned as intended.
Instructor:
Prof. Larry H. Ludlow
Campion Hall 336C
617-552-4221
Ludlow@bc.edu
1. "The Reader may here observe the Force of Numbers, which
can be successfully applied even to those things, which one would imagine are
subject to no Rules. There are very few things which we know, which are not
capable of being reduc'd to a Mathematical Reasoning; and when they cannot,
it's a sign our Knowledge of them is very small and confus'd; and where a mathematical
reasoning can be had, it's a great folly to make use of any other, as to grope
for a thing in the dark, when you have a Candle standing by you." John Arbuthnot,
1692.
In I. Todhunter, A History of the Mathematical Theory of
Probability. (Macmillan, pp. 48-51, 1865).
2. "Psychometry, it is hardly necessary to say, means the art
of imposing measurement and number upon operations of the mind...".
F. Galton, Psychometric Experiments. Brain, II, 149-162,
1879.
3. "...that until the phenomena of any branch of knowledge have
been subjected to measurement and number, it cannot assume the status and dignity
of a science."
Galton.
4. "I often say that when you can measure what you are speaking
about, and express it in numbers, you know something about it; but when you
cannot measure it, when you cannot express it in numbers, your knowledge is
of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but
you have scarcely, in your thoughts, advanced to the stage of science, whatever
the matter may be."
Sir William Thomson, Lord Kelvin. Electrical Units
of Measurement. Popular Lectures and Addresses, Vol 1 of 3. (London: Macmillan,
1889, p. 73-74)
5. "The grand, and indeed only, character of truth is its capability
of enduring the test of universal experience, and coming unchanged out of every
possible form of fair discussion".
Sir John Herschel.
6. "Whatever exists, exists in some amount."
E. L. Thorndike.
Ludlow's Challenge:
If it exists, it can be measured; If it can't be measured, it doesn't exist.
Ludlow, L.H. Psychometrics Lectures, Boston College, February 1996
A) Introduce you to Classical Test Theory (True Score Theory), Item Response Theory, and the Rasch model (in particular); and
B) Provide an opportunity for you to develop competent, practical data analysis/consulting skills.
You will spend considerable time in the library and on the computer. [It is assumed that you will exert individual initiative in solving computing/analysis problems as they arise.]
COURSE ASSESSMENTS
You will be evaluated on the following components:
a) data analyses (Classical, Rasch)
b) literature reactions
c) measurement essay
d) Rasch presentation (Final)
e) class participation
Literature Reactions
The literature reactions (theory memos, reviews, reaction papers) will be 1-2 pages long (greater length is acceptable but not encouraged), typed and double-spaced. They will be handed in at the first seven class meetings after the initial lecture. Their purpose is to introduce the literature to you and, in turn, your interests to me.
1) Begin the main body of your discussion with a direct quote from the article and its page number. Following the quote, write an analysis of its meaning to you. Your analysis should not be a paraphrased rendition of the quote but illustrative of your independent thinking on an interesting idea. For example, identify what may be wrong with the author's thinking on a question and suggest how the approach could be improved. Or, when your quote captures the brilliance of someone's thinking suggest ways its application may be broadened. Or, how can what we typically accept as "standard procedure" be improved by an obviously better way? Or, when you have encountered a particularly interesting topic, discuss its research potential for you or its potential for incorporation into your current employment. Or, you may wish to challenge "Ludlow's Challenge."
2) Organize the reaction papers and reviews according to the format shown below. In this form, your name and date are in the upper right hand corner and the full literature citation is in the upper left hand corner of the document.
| Pearson, K. The Grammar of Science. | Your Name |
| London: Adam and Charles Black, 1900. | Date |
Page 6
Now would follow your reaction to the quote.
3) Your first Reaction Paper is to answer the question "What is Measurement?" You may consult any of the materials in this syllabus. BUT, I want to know what you in your own words think constitutes measurement. Your remaining Reaction Papers will be of the form presented in steps (1) and (2) above.
4) No papers are due for the evening in which analyses are submitted.
Data Analyses
The data analyses will consist of your output from the measurement programs and a complete report stating the results. You may supply your own data or you may solicit School of Education faculty for data. A reasonable way to satisfy this course component is to analyze the same data set for each psychometric model. The report should describe the sample, the variable being measured, items of the instrument (including their number and scoring format), the psychometric model and its psychometric properties, the interpretation of whether or not the data fit the model, and what modifications (if any) would improve the instrument.
Measurement Essay
The measurement essay will integrate your literature reactions and your understanding of class discussions. This is an opportunity for you to formally summarize your understanding of the essentials of measurement. One reasonable way in which to satisfy this component is to take a single topic and focus each reaction paper on some aspect of that topic. The measurement essay would then trace the development of your research. This essay should be 5-10 pages in length (potentially longer), typed, double-spaced, and fully referenced. In your essay you may include a discussion of topics that remain confusing, or appear as potentially researchable. Potential topics might include: authentic assessment, item banking, tailored testing, computer adaptive testing, Rasch applications, latent trait model fit, standard setting, one-parameter versus three-parameter models, differential item functioning (DIF), comparisons of estimation algorithms, goodness of fit tests, etc. You might even address how, if at all, your interpretation of the first reaction paper "What is measurement?" has shifted, been clarified, or been re-defined over the course of the semester.
Rasch Presentation
Your last data analysis will close with the Rasch model. You will provide a brief (15-20 minute) class presentation of your results.
Required Texts
Andrich, D. (1988). Rasch Models for Measurement. Newbury Park: Sage.
Crocker, L. & Algina, J. (1986). Introduction to Classical & Modern Test Theory. NY: Holt, Rinehart & Winston.
Hambleton, R.K., Swaminathan, H. & Rogers, H.J. (1991). Fundamentals of Item Response Theory. Newbury Park: Sage.
Wright, B.D. & Masters, G.N. (1982). Rating Scale Analysis. Chicago: MESA Press.
1. History: Psychophysics to Psychometrics.
Principles and theoretical development.
Required Readings:
1. Chapters 1 and 3 of Crocker & Algina.
2. "Foreword" of Wright & Masters.
3. Ludlow, L.H. (1998). Galton: The first psychometrician?. Popular Measurement,
1, 13-14.
4. Thurstone, L.L., Psychology as a quantitative rational science. In Thurstone,
L.L. The Measurement of Values. University of Chicago Press, 1959.
5. Boring, E.G. The beginning and growth of measurement in psychology. In Woolf,
H. (Ed.) Quantification. Bobbs-Merrill, 1961.
Suggested Readings:
1. Thurstone, L.L., Attitudes can be measured. In Thurstone,
L.L. The Measurement of Values. University of Chicago Press, 1959.
2. Boring, E.G. Gustav Theodor Fechner. In Boring, E.G. A History of Experimental
Psychology (2nd ed.). Prentice-Hall, 1950.
4. Kuhn, T.S. The function of measurement in modern physical science. In Woolf,
H. (Ed.) Quantification. Bobbs-Merrill, 1961.
5. Thurstone, L.L. Psychophysical analysis. In Thurstone, L.L. The Measurement
of Values. University of Chicago Press, 1959.
6. Jones, L.V. The nature of measurement. In Thorndike, R.L. (Ed.), Educational
Measurement (2nd ed.). American Council on Education, 1971.
7. Stevens, S.S. Mathematics, measurement, and psychophysics. In Stevens, S.S.
(Ed). Handbook of Experimental Psychology. Wiley, 1951.
8. Galton, F. (1879). Psychometric experiments. Brain, II, 149-162.
2. Classical True Score Theory:
Theory, assumptions, applications.
SPSS reliability and factor analysis computer output interpretation of TASC data.
Required Readings:
1. Chapters 5-7, 13-14 of Crocker & Algina.
2. Spearman, C. (1904). The proof and measurement of association between two
things. American Journal of Psychology, 15, 72-101.
3. Allen, M.J. & Yen, W.M. Classical True-Score Theory (Ch. 3) in Introduction
to Measurement Theory. Monterey, CA: Brooks/Cole, 1979.
Suggested Readings:
1. Traub, R.E. & Rowley, G.L. (1991). Understanding reliability.
Educational Measurement: Issues and Practice, 10, 37-45.
2. Loevinger, J. (1965). Person and population as psychometric concepts. Psychological
Review, 72, 143-155.
3. Loevinger, J. (1954). The attenuation paradox in test theory. Psychological
Bulletin, 51, 493-504.
4. Thurstone, L.L. Psychological Implications of Factor Analysis. Psychometric
Laboratory Paper #44. The University of Chicago, Sept., 1947.
5. Thurstone, L.L. Psychological Assumptions of Factor Analysis. Psychometric
Laboratory Paper #51. The University of Chicago, Feb.,1949.
6. Gould, S.J. (1981). Chapter 6 in The Mismeasure of Man. NY: Norton.
7. Spearman, C. (1904). "General Intelligence," Objectively Determined and Measured.
American Journal of Psychology, 15, 201-293.
8. Hattie, J., Jaeger, R.M. & Bond, L. (1999). Persistent methodological questions
in educational testing. Review of Research in Education, 24, Chapter 11. Washington,
DC: AERA.
9. Traub, R.E. (1997). Classical test theory in historical perspective. Educational
Measurement: Issues and Practice, 8-14.
3. Guttman's Scale Theory:
Theory, assumptions, applications.
Interpretation of Hillocks' Taxonomy of Reading Skills Hierarchy.
Required Readings:
1. Stouffer, S.A. An Overview of the Contributions to Scaling
and Scale Theory. In Measurement and Prediction, Stouffer, S.A. et al., Princeton
University Press, 1950.
2. Guttman, L.L. The Basis for Scalogram Analysis. In
op cit.
3. Ludlow, L.H. & Hillocks, Jr., G. (1985). Psychometric Considerations in the
Analysis of Reading Skill Hierarchies. Journal of Experimental Education, 54,
15-21.
4. Item Response Theory:
Basics - item and test characteristic curves, the information function, one-parameter
dichotomous/rating scale/partial credit models.
Required Readings:
1. Chapter 15 of Crocker & Algina.
2. Jaeger, R.M. (1987). Two decades of revolution in educational measurement!?
Educational Measurement: Issues and Practice, 6-14.
3. Ludlow, L.H. & Haley, K.C. (1999). Newton: The pinball wizard?. Popular Measurement,
2, 5-7.
Suggested Overview Readings-Past/Present/Future:
1. Bock, R.D. (1997). A brief history of item response theory.
Educational Measurement: Issues and Practice, 21-32.
2. Fischer, G.H. & Molenaar, I.W. (Eds.) (1995). Rasch Models: Foundations,
recent developments, and applications. NY: Springer. (see Ch 1).
3. Hambleton, R. & Swaminathan, H. (1985). Item response theory: Principles and
applications. Boston: Kluwer. (see Ch 1 and 2)
4. Hambleton, R. (1989). Principles and selected applications of item response
theory. In Linn, R.L. (Ed.). Educational Measurement. (3rd ed). NCME, AERA:
Macmillan.
5. Mislevy, R.J. (1987). Recent developments in item response theory with implications
for teacher certification. In Review of Research in Education, Rothkopf, E.F.
(Ed.) Vol. 14, Washington: AERA.
6. Mislevy, R.J. (1996). Test theory reconceived. Journal of Educational Measurement,
379-416.
7. Reckase, M.D. The past and future of multidimensional item response theory.
Applied Psychological Measurement, 21, 25-36.
8. Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement,
26, 191-208.
9. Van der Linden, W. & Hambleton, R. (Eds.). (1997). Handbook of modern item
response theory. NY: Springer. (see Ch 1).
General Measurement Articles:
1. Andrich, D. (1989). Distinctions between assumptions and
requirements in measurement in the social sciences. In J. A. Keats, R. Taft,
R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings
of the 24th International Congress of Psychology of the International Union
of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science
Publishers.
2. Andrich, D. (1996). Measurement criteria for choosing among models with graded
responses. In Categorical variables in developmental research: Methods of analysis
(pp. 3-35). Academic Press, Inc.
3. Fisher, W. P., Jr. (1994). The Rasch debate: Validity and revolution in educational
measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice.
Vol. II (pp. 36-72). Norwood, New Jersey: Ablex Publishing Corporation.
4. Michell, J. (1986). Measurement scales and statistics: A clash of paradigms.
Psychological Bulletin, 100, 398-407.
5. Michell, J. (1997). Quantitative science and the definition of measurement
in psychology. British Journal of Psychology, 88, 355-383.
6. Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary
Education Review, 3(1), 281-288.
7. Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson
& S. L. Hershberger (Eds.), The new rules of measurement: What every educator
and psychologist should know. Hillsdale, NJ: LEA.
5. The Rasch Model:
Purpose, assumptions, estimation procedures, item and
person fit, residual analysis, applications.
Computer output interpretation of TASC and TAMP data sets.
Required Readings:
1. Chapters 1-5 of Wright & Masters.
2. Wright, B.D. (1980). "Foreword" and "Afterword". In Rasch, G. Probabilistic
Models for Some Intelligence and Attainment Tests. University of Chicago Press.
3. Ludlow, L.H. & Haley, S.M. (1995). Rasch model logits: Interpretation, use,
and transformation. Educational and Psychological Measurement, 55, 967-975.
4. Ludlow, L.H. & O'Leary, M. (1999). Omitted and not reached items: Practical
data analysis implications. Educational and Psychological Measurement, 59, 615-630.
Suggested Readings:
1. Wright, B.D. (1967). Sample-free test calibration and person
measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems.
Princeton: Educational Testing Service, 85-101.
2. Whitely, S.E. & Dawis, R.V. (1974). The nature of objectivity with the Rasch
model. Journal of Educational Measurement, 163-178.
3. Andrich, D. (1978). Relationships between the Thurstone and Rasch Approaches
to Item Scaling. Applied Psychological Measurement, 449-460.
4. Engelhard, G. (1984). Thorndike, Thurstone, and Rasch: A Comparison of their
methods of scaling psychological and educational tests. Applied Psychological
Measurement, 21-38.
5. Hambleton, R. Principles and selected applications of item response theory.
Chapter 4. In Educational Measurement (3rd ed). Linn, R. (Ed). NY: Macmillan.
6. Brink, N. (1972). Rasch's logistic model vs. the Guttman model. Educational
and Psychological Measurement, 32, 921-927.
7. Hambleton, R. & Jones, R. (1993). An NCME instructional module on comparison
of classical test theory and item response theory and their application to test
development. Educational Measurement: Issues and Practice, 38-47.
8. Gable, R., Ludlow, L. & Wolf, M. (1990). The use of classical and Rasch latent
trait models to enhance the validity of affective measures. Educational and
Psychological Measurement, 50, 869-878.
9. McNamara, T. (1996). Raters and ratings: Introduction to multi-faceted measurement.
Concepts and procedures in Rasch measurement. Ch 5 & 6 in Measuring Second Language
Performance. London: Longman.
Two related articles:
1. Leonard, M. (1980). Rasch promises: A layman's guide to the
Rasch method of item analysis. Educational Researcher, 22, 188-192.
2. Willmont, A. (1980). What does Rasch promise? A reply to Rasch promises by
Martin Leonard. Educational Researcher, 22, 193-197.
Five related articles:
1. Divgi, D.R. (1986). Does the Rasch model really work for
multiple choice items? Not if you look closely. Journal of Educational Measurement,
23, 283-298.
2. Henning, G. (1989). Does the Rasch model work for multiple-choice items? Take
another look: A response to Divgi. Journal of Educational Measurement, 26, 91-97.
3. Andrich, D. (1989). Statistical reasoning in psychometric models and educational
measurement. Journal of Educational Measurement, 26, 81-90.
4. Goldstein, H. (1979). Consequences of using the Rasch model for educational
assessment. British Educational Research Journal, 5, 211-220.
5. Goldstein, H. (1980). Dimensionality, bias, independence and measurement
scale problems in latent trait test score models. British Journal of Mathematical
and Statistical Psychology, 33, 234-246.
Two general other-discipline articles:
1. Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert
or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing,
20, 196-201.
2. Spray, J. (1987). Recent developments in measurement and possible applications
to the measurement of psychomotor behavior. Research Quarterly for Exercise
and Sport, 58, 203-209.
Related Books:
1. Fischer, G.H. & Molenaar, I.W. Rasch Models: Foundations,
Recent Developments, and Applications. NY: Springer, 1995.
2. Wilson, M. (Ed.). Objective Measurement: Theory Into Practice. Volumes 1-4.
Norwood, NJ: Ablex, 1992-1997.
3. Wright, B.D. & Stone, M.H. Best Test Design. Chicago: MESA Press, 1979.
Other:
Any issue of Rasch Measurement: Transactions of the Rasch Measurement Special Interest Group. (see me for their location)
Variable Development and Application Examples:
1. Hillocks, Jr. G. & Ludlow, L.H. (1984). A taxonomy of skills
in reading and interpreting fiction. American Educational Research Journal,
7-24.
2. Ludlow, L.H. (1985). A strategy for the graphical representation of Rasch
model residuals. Educational and Psychological Measurement, 45, 851-860.
3. Ludlow, L.H. (1986). Graphical analysis of item response theory residuals.
Applied Psychological Measurement, 10, 217-229.
4. Ludlow, L.H. & Hwang, R. (1990). Evaluating district-level performance relative
to the system. Educational Research Quarterly, 14, 29-37.
5. Ludlow, L.H. & Guida, F.V. (1992). The Test Anxiety Scale for Children as
a Measure of academic anxiety. Educational and Psychological Measurement, 51,
1013-1021.
6. Ludlow, L.H. & Lunz, M. (1998). The Job Responsibilities Scale: Invariance
in a longitudinal prospective study. Journal of Outcome Measurement, 2, 326-337.
7. Ludlow, L.H. (1998). Scale invariance from a three-dimensional graphical
perspective: Visualizing an eigenvector. Educational and Psychological Measurement,
58, 166-178.
8. Ludlow, L.H. (1999). The structure of the Job Responsibilities Scale: A multi-method
analysis. Educational and Psychological Measurement, 59, 962-975.
9. Coster, W.J., Mancini, M.C. & Ludlow, L.H. (1999). Factor structure of the
School Function Assessment. Educational and Psychological Measurement, 59, 665-677.
10. Coster, W., Ludlow, L.H. & Mancini, M. (1999). Using IRT variable maps to
enrich understanding of rehabilitation data. Journal of Outcome Measurement,
3, 123-133.
TAMP/PEDI Projects:
1. Gans & Haley, et al. (1988). Description and interobserver
reliability of the Tufts Assessment of Motor Performance. American Journal of
Physical Medicine and Rehabilitation, 2, 202-210.
2. Haley & Ludlow, et al. (1991). Tufts Assessment of Motor Performance: An
empirical approach to identifying motor performance categories. Archives of
Physical Medicine and Rehabilitation, 72, 359-366.
3. Ludlow & Haley. (1991). Polytomous Rasch models for behavioral assessment:
The Tufts Assessment of Motor Performance. In Objective Measurement, Vol. 1,
Wilson, M. (Ed.) Ablex.
4. Ludlow, Haley & Gans. (1992). A hierarchical model of functional performance
in rehabilitation medicine: The Tufts Assessment of Motor Performance. Evaluation
and the Health Professions, 15, 59-74.
5. Haley & Ludlow. (1992). Applicability of the hierarchical scales of the Tufts
Assessment of Motor Performance for school-aged children and adults with disabilities.
Physical Therapy, 72, 191-206.
6. Fisher, A.G., Bryze, K.A., Granger, C.V., Haley, S.M., Hamilton, B.B., Heineman,
A.W., Puderbaugh, J.K., Linacre, J.M., Ludlow, L.H., McCabe, M.A. & Wright,
B.D. (1994). Applications of conjoint measurement to the development of functional
assessment. International Journal of Educational Research, 21, 579-593.
7. Haley, S.M., Ludlow, L.H. & Coster, W.J. (1993). Pediatric Evaluation of
Disability Inventory: Clinical Interpretation of summary scores using Rasch
rating scale methodology. Physical Medicine and Rehabilitation Clinics of North
America: New Developments in Functional Assessment, 4, 529-540.
8. Ludlow, L.H. & Haley, S.M. (1996). Effect of context in rating of mobility
activities in children with disabilities. Educational and Psychological Measurement,
56, 122-129.
6. Operation of Psychometric computer programs:
SCALE, WINSTEPS, RUMM, PARSCALE, BILOG-MG.
7. The Two- and Three-parameter IRT Models:
Purpose, assumptions, estimation, model fit, applications.
· Baker, F.B. (1992). Item Response Theory: Parameter Estimation
Techniques. NY: Marcel Dekker.
· Hambleton, R.K. (Ed) (1983). Applications of Item Response Theory. Vancouver,
BC: Educational Research Institute of British Columbia.
· Hambleton, R.K. & Swaminathan, H. (1985). Item Response
Theory: Principles and Applications, Boston: Nijhoff.
· Hambleton, R.K., Swaminathan, H. & Rogers, J. (1991). Fundamentals of Item
Response Theory. Sage.
· Harris, D. (1989). Comparison of 1-,2-, and 3-parameter IRT models. Educational
Measurement: Issues and Practice. NCME Instructional Module, Spring, 35-41.
· Hulin, C.L., Drasgow, F. & Parsons, C.K. (1983). Item Response Theory: Application
to Psychological Measurement. Homewood, IL: Dow Jones-Irwin.
· Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing
Problems. Hillsdale, NJ: Erlbaum.
· Van der Linden, W. & Hambleton, R.K. (1996). Handbook of Modern Item Response
Theory. NY: Springer.
· Wainer, H. & Messick, S. (1983). Principles of Modern Psychological Measurement.
Hillsdale, NJ: Erlbaum.
8. Technical Applications of IRT:
Item banking, adaptive testing, item and test bias, equating, test construction,
differential item functioning (DIF), scale anchoring, cut-scores, plausible
values.
Differential Item Functioning:
· Berk, R.A. (Ed.) (1982). Handbook of Methods for Detecting
Test Bias. Baltimore, MD: Johns Hopkins University Press.
· Holland, P.W. & Wainer, H. (1993). Differential Item Functioning. Hillsdale,
NJ: Erlbaum.
· Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential
Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework
for Binary and Likert-type (Ordinal) Item Scores. Ottawa ON: Directorate of
Human Resources Research and Evaluation, Department of National Defense.
Computerized Adaptive Testing:
· Sands, W.A., Waters, B.K. & McBride, J.R. (Eds). (1997). Computerized
Adaptive Testing: From Inquiry to Operation. Washington, DC: APA.
· Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R., Steinberg,
L. & Thissen, D. (1990). Computerized Adaptive Testing: A Primer. Hillsdale,
NJ: Erlbaum.
Equating:
· Angoff, W.H. (1984). Scales, Norms, and Equivalent Scores.
Princeton: ETS.
· Holland, P.W. & Rubin, D.B. (1982). Test Equating. NY: Academic Press.
· Kolen, M.J. & Brennan, R.L. (1995). Test Equating: Methods and Practices.
NY: Springer.
· Linn, R.L. & Kiplinger, V.L. (1995). Linking statewide tests to the NAEP:
Stability of results. Applied Measurement in Education, 8, 135-155.
· Mislevy, R.J., Sheehan, K.M. & Wingersky, M. (1993). How to equate tests with
little or no data. Journal of Educational Measurement, 30, 55-78.
Cut-scores:
· Berk, R.A. (1986). A consumer's guide to setting performance
standards on criterion-referenced tests. Review of Educational Research, 56,
137-172.
· Glass, G.V. (1978). Standards and criteria. Journal of Educational Measurement,
15, 237-261.
· Jaeger, R.M. (1989). Certification of Student Competence. In R.L. Linn (Ed.),
Educational measurement (3rd ed., pp 485-514). New York: American Council on
Education and Macmillan.
· Kane, M. (1994). Validating the performance standards associated with cutscores.
Review of Educational Research, 64, 425-461.
General References on Science and Measurement
Rasch Bibliography
(in my files)
Guttman Bibliography
(in my files)
Relevant Dissertations
(Chapters with Rasch model descriptions)
Special Edition Journals
Proceedings:
Relevant Journals
Web Sites
Finally, check these interesting IRT-related web sites:
http://www.rasch.org/
http://quarles.unbc.ca/psyc/itc/index.html
I have not checked this next site yet but it sounds interesting:
e-PSYCHOMETRICS, a user-friendly on-screen book, which provides
useful internet resources for measurement theory. The main topics include introduction
to measurement theory; introduction to reliability and validity; introduction
to classical test theory and the corresponding procedures for estimating reliability
and validity; introduction to item response theory including Rasch models, computerized
adaptive testing (CAT), differential item functioning (DIF) and test equating.
Other topics include online documentation and abstracts, books and journals,
mail-servers, and professional organizations.
The address is: http://go.to/EricWong
Please note that it will take several minutes to download the webbook. Kindly send me comments and suggestions. Eric WONG
Classical True-Score Theory Assignment
(Spring 2000: 100 points)
Use SPSS procedures to perform a classical true-score theory (CTT) item analysis of your data set. Provide an answer to all of the following questions. An outline format is preferable. There is no need to try to write the assignment as a mini-publication at this point.
1. Instrument and sample:
Explain the purpose of your measurement instrument. What does
the instrument purport to measure? Who developed it (wrote the items)? How many
items are included? What is the scoring format? How many response options are
provided? Is it a speeded test? How long does it take to answer? Is it a standardized
or non-standardized instrument? Is it primarily for norm-referenced or criterion-referenced
purposes?
Where did your sample come from? Who collected the data?
How many subjects are there? Are they a subset of a larger study and, if so,
briefly explain why they were specifically chosen. Are there any special characteristics
about them? What is the population to which they are generalizable?
2. Measurement model:
Explain the statistical form of the true score model (present
the relevant equations and explain them). What are its primary assumptions (present
the equations and explain them)? Do they appear reasonably well met for your
data?
For your data, show how the following were computed (what equations led to the
statistics): item difficulty (for dichotomous data), discrimination (corrected
item-total correlation), reliability (for internal consistency), and standard
error of measurement (based on the internal consistency estimate). Explain the
various components of the equations. Why is the item-total correlation corrected?
Explain the general purpose of a common factor analysis when it is applied to
items of a test. Briefly explain what you think an eigenvalue is and explain
what the factor loadings are. What is the purpose of the scree plot and a varimax
rotation? What general procedures are normally conducted in order to determine
the appropriateness of factoring a correlation matrix?
3. Analysis:
Discuss the distributional characteristics of your item difficulties
and person total scores, e.g., are they as intended, are they surprising? Discuss
whether your discrimination estimates are reasonable or not. Are there any particular
items with statistical problems (what are the statistical problems)? What might
have caused the problems, if there are any? Should any items be removed or revised?
Interpret the Cronbach alpha you obtained.
Discuss the results of your initial factor analysis and how you subsequently
decided on the number of final factors to retain. What percent of variance was
extracted by those factors and what is your opinion of the magnitude of the
percent that was accounted for? Was the rotated and plotted final solution interpretable
(just plot the first two factors)? What verbal labels did you apply to "name"
the factors (and explain why you applied those names)? Was your solution expected
or surprising (did you have any idea about what might result from the factor
analysis)? What is the reliability of each of the final factors in your solution?
How many scores for each tested person would you recommend should be reported?
4. Submit your write-up and output. A useful way to write your analysis is to cut and paste into it the appropriate tables/graphs/figures that are output by SPSS rather than referring the reader to the pages of your output. (NOTE: pay attention to typos and notation errors.)
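As a cross-check on the SPSS output requested in part 2, the item statistics can be reproduced by hand. The following is only a minimal Python sketch under stated assumptions: the six-person, four-item response matrix is made up for illustration, the function names (pearson, item_analysis) are hypothetical, and the sample (n - 1) variance is used throughout. It computes item difficulty, the corrected item-total correlation, Cronbach's alpha, and the standard error of measurement based on alpha.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists.

    Assumes both lists have non-zero variance.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_analysis(responses):
    """Classical item statistics for a persons-by-items matrix of 0/1 scores."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    items = [[row[j] for row in responses] for j in range(n_items)]

    # Item difficulty: proportion of persons answering the item correctly.
    difficulty = [mean(col) for col in items]

    # Corrected item-total correlation: each item is correlated with the
    # total of the REMAINING items, so the item is not correlated with itself.
    discrimination = [
        pearson(col, [t - x for t, x in zip(totals, col)]) for col in items
    ]

    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
    total_var = variance(totals)
    item_var_sum = sum(variance(col) for col in items)
    alpha = (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)

    # Standard error of measurement: SD of total scores * sqrt(1 - reliability).
    sem = math.sqrt(total_var) * math.sqrt(1 - alpha)

    return {
        "difficulty": difficulty,
        "discrimination": discrimination,
        "alpha": alpha,
        "sem": sem,
    }


# Hypothetical data: rows are persons, columns are dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

stats = item_analysis(responses)
for name, value in stats.items():
    print(name, value)
```

Comparing the corrected correlations with uncorrected ones (each item against the full total score) on your own data is one way to see why the correction matters, especially for short instruments.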
Item Response Theory Assignment
(Spring 2000: 100 points)
Using your data set, run SCALE, WINSTEPS, RUMM, or PARSCALE to perform an item response theory analysis.
1. Instrument and sample:
Explain the purpose of your measurement instrument. Who developed
it? How many items are included? What is the scoring format? How many response
options are provided? Is it a speeded test? How long does it take to answer?
Where did your sample come from? How many subjects are there? Are there any
special characteristics about them? Basically, I want you to remind me of the
characteristics of the data used for the classical analysis.
For your data, what is the "variable" that is being measured? That is, what
is the hypothesized structure that is to be tested by the Rasch model?
2. Measurement model details:
Explain the statistical components of the Rasch model. Why is
it called a one-parameter model when clearly there is a parameter for both persons
and items? What are the primary assumptions of the model? Do they appear reasonably
well met for your data?
Explain how the initial PROX person ability estimates and item difficulty estimates
are computed. Why are persons and items with perfect or zero scores
removed from analysis? What does the term "sufficient statistic" refer to? Explain
(in your own words) what person and item "logits" are. How is the "expected"
value for a person on any item computed?
Explain the difference between the PROX and "UCON" estimation procedures. What does the term
"likelihood" refer to? Explain, in general basic terms, how the Newton-Raphson
algorithm operates. What is its function?
How are person and item weighted fit statistics computed? Explain how person
and item positive and negative fit statistics may be interpreted. What might
be done if an item or person is considered to misfit the model?
3. Analysis:
Discuss the initial distributional characteristics of your item
difficulties (does it appear to be a relatively easy or hard instrument) and
person abilities (do they appear relatively capable or not). Are these findings
as intended? For your data, what do "difficulty" and "ability" translate into?
Are there any particular persons with statistical problems? What might have
caused them if there are? Are there any particular items with statistical problems?
What might have caused them if there are? (How are you defining a "problem"
and what have you done to try to locate their source?)
Explain what the "variable map" is and what it reveals about your data. Was
your solution expected or surprising? What modifications, if any, would you
suggest if the instrument were to be revised and re-administered?
Finally, compare and contrast the Rasch results to your
previous classical analysis results. For example, is there any additional insight
you have gained about your data? In addition, how does the standard error of
measurement associated with a person's performance differ between the two models?
4. Submit your write-up and output. A useful way to write your analysis is to cut and paste into it the appropriate tables/graphs/figures that are output by the software rather than referring the reader to the pages of your output. (NOTE: pay attention to typos and notation errors.)
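Several of the questions in part 2 (the expected value for a person on an item, the role of Newton-Raphson, and why perfect and zero scores are set aside) can be explored with a small numerical experiment. The following Python sketch is an illustration under stated assumptions, not the algorithm used by SCALE, WINSTEPS, RUMM, or PARSCALE: it treats the item difficulties as already known, estimates a single person's ability by maximum likelihood, and uses hypothetical function names and made-up difficulty values.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of a correct response.

    Both arguments are in logits. For a 0/1 item this probability is also
    the expected score of the person on the item.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(raw_score, difficulties, tol=1e-6, max_iter=50):
    """Newton-Raphson estimate of one person's ability (item difficulties fixed).

    The raw score is the sufficient statistic: the estimate is the ability at
    which the expected total score equals the observed raw score.
    """
    n = len(difficulties)
    if raw_score <= 0 or raw_score >= n:
        # A zero or perfect score would push the estimate to -inf or +inf.
        raise ValueError("zero and perfect scores have no finite estimate")

    # Starting value: log-odds of the observed proportion correct.
    ability = math.log(raw_score / (n - raw_score))

    for _ in range(max_iter):
        probs = [rasch_probability(ability, d) for d in difficulties]
        expected = sum(probs)                       # expected raw score
        info = sum(p * (1.0 - p) for p in probs)    # test information
        step = (raw_score - expected) / info        # Newton-Raphson step
        ability += step
        if abs(step) < tol:
            break

    # Standard error from the information at the final estimate.
    probs = [rasch_probability(ability, d) for d in difficulties]
    info = sum(p * (1.0 - p) for p in probs)
    return ability, 1.0 / math.sqrt(info)


# Hypothetical item difficulties (logits) and a raw score of 3 out of 5.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
ability, std_error = estimate_ability(3, difficulties)
expected_total = sum(rasch_probability(ability, d) for d in difficulties)

print("ability (logits):", ability)
print("standard error:", std_error)
print("expected total score:", expected_total)
```

Re-running the sketch with different raw scores shows that the standard error varies with the person's location relative to the items, which bears on the comparison with the classical standard error of measurement asked for in part 3.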