Year: 2018 | Volume: 1 | Issue: 2 | Pages: 33-35
Forty-five common rater errors in medical and health professions education
Kenneth D Royal
Department of Clinical Sciences, North Carolina State University, Raleigh, North Carolina, USA
Date of Web Publication: 7-Feb-2019
Correspondence Address:
Dr. Kenneth D Royal, Department of Clinical Sciences, North Carolina State University, Raleigh, North Carolina, USA
Source of Support: None, Conflict of Interest: None
Minimizing the influence of rater errors is a persistent and considerable challenge for educators in the medical and health professions. This article presents a list of 45 common rater errors that assessors and evaluators should be cognizant of while rating performance assessments. Readers are encouraged to examine each rater error type, reflect on the extent to which s/he has previously committed each error, and identify strategies for mitigating and preventing errors in future performance assessment scenarios.
Keywords: Assessment, clinical education, evaluation, grading, medical education, performance assessment, standardized patients
How to cite this article:
Royal KD. Forty-five common rater errors in medical and health professions education. Educ Health Prof 2018;1:33-5
In the medical and health professions, raters are commonly used in both real practice and simulated settings to directly observe and evaluate an individual while performing/demonstrating a variety of skills, tasks, procedures, and/or behaviors. Using rubrics, checklists, and other instruments, raters provide scores that may be used for formative (e.g., teaching), summative (e.g., determining competency), or other (e.g., documenting clinical skills for accreditation) purposes. Score results often carry moderate to high stakes for examinees; thus, it is imperative that the scores/ratings are valid indicators of performance.
However, obtaining valid scores through performance assessments is typically much more challenging than more objective types of assessments, such as multiple choice examinations. Unlike multiple choice examinations that involve three primary sources of measurement error (instrumentation, examinees, and conditions of administration), performance assessments are much more complex. More specifically, the inclusion of human raters introduces an inescapable element of subjectivity that poses an additional and significant threat to score validity. Suffice it to say, there is a minimum of four potential sources of measurement error in performance assessment scenarios: instrument, examinees, conditions of administration, and raters. Although numerous strategies are available to minimize sources of error, research has long noted that reducing rater errors is the most difficult.
The purpose of this brief article is threefold. First, the author intends to bring attention to the critical issue of rater errors in performance assessments. Second, the author identifies and describes 45 types of rater errors that were identified from a multidisciplinary review of the literature. Third, it is the author's hope that this list will help raters not only become more aware of potential cognitive biases that might affect examinees' scores and distort score validity but also become better equipped to mitigate and prevent many of these errors in future performance assessments [Table 1].
Discussion and Recommendations
The list presented in [Table 1] provides a sobering perspective on the challenges raters face when assigning ratings in performance assessment scenarios. Fortunately, there are some tips that can help reduce many rater errors.
First, any individual who is tasked with rating performances should have received prior training on the topic of rater inconsistencies and undergone a series of rater calibration exercises (also known as “norming”) with other raters. The purpose of these exercises is to standardize raters in such a way that no examinee will be unduly advantaged or disadvantaged as a result of being evaluated by a given rater. Persons unfamiliar with the rater calibration/norming process should consult works by Allen and Maki for a thorough overview. If raters have never engaged in this activity, they should immediately consult an expert in educational assessment who can provide the requisite training and/or guidance on how to set up a robust rater training program.
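One simple way to operationalize a calibration exercise is to have all raters score the same set of recorded cases and then compare each rater against the panel consensus. The sketch below illustrates this idea; the rater names, scores, and the 0.5-point drift threshold are hypothetical, not part of the original article.

```python
# Hypothetical calibration check: compare each rater's scores on shared
# calibration cases against the panel mean, flagging raters whose average
# signed deviation suggests leniency or stringency. All data are illustrative.

calibration_scores = {           # rater -> scores on the same 4 recorded cases
    "Rater A": [4, 3, 5, 4],
    "Rater B": [5, 4, 5, 5],     # tends to score above the panel mean (lenient)
    "Rater C": [3, 3, 4, 4],
}

n_cases = len(next(iter(calibration_scores.values())))
panel_mean = [
    sum(scores[i] for scores in calibration_scores.values()) / len(calibration_scores)
    for i in range(n_cases)
]

def mean_deviation(scores, consensus):
    """Average signed gap between one rater and the panel consensus."""
    return sum(s - c for s, c in zip(scores, consensus)) / len(scores)

for rater, scores in calibration_scores.items():
    dev = mean_deviation(scores, panel_mean)
    flag = "re-norm" if abs(dev) > 0.5 else "ok"
    print(f"{rater}: mean deviation {dev:+.2f} ({flag})")
```

Flagged raters would then discuss the discrepant cases with the panel and re-score until deviations fall within the agreed tolerance, which is the essence of norming.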
Second, it is important to identify the type and quantity of errors that raters have committed in the past. As George Santayana famously stated, “Those who cannot remember the past are condemned to repeat it.” Therefore, raters are encouraged to review each type of error and mark each error that she/he has committed in the past. The rater should thoughtfully consider why each flagged error occurred previously and what she/he can do to avoid committing this error again in the future. Simply becoming aware of one's tendency to commit a particular error is often enough to avoid committing that same error again. Of course, some types of errors may pose a more persistent challenge.
Third, raters are encouraged to discuss errors with other individuals who also provide ratings of the same examinees. It is critical that raters understand that mitigating rater errors requires a combination of planning, teamwork, ongoing communication, and evaluation. Thus, raters should frequently converse with fellow raters not only to re-calibrate but also to discuss any issues, such as new information or other changes, that might affect one's ratings in some way. These conversations typically are particularly effective for mitigating some of the most common rater errors, such as “drift” and “fatigue,” and may help mitigate or prevent many other types of errors.
Finally, those responsible for analyzing data should become familiar with various techniques for scoring performance assessment data. Perhaps the most common approaches to data analysis include calculating traditional summary statistics and inter-rater reliability estimates as a validity check. Although these techniques are fundamental to understanding the data, they leave much to be desired methodologically. More recently, specialized techniques such as generalizability theory and Many-Faceted Rasch Measurement (MFRM) modeling have become commonplace in high-stakes settings. Although detailed discussion of these techniques is beyond the scope of this article, readers are encouraged to learn more about these techniques, as they may be useful for identifying and differentiating various sources of error and, in the case of the MFRM, producing linear measures that account for differences among facets (e.g., task difficulty and rater leniency/stringency) before calculating an examinee's score.
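As a concrete example of the basic inter-rater reliability check mentioned above, Cohen's kappa corrects raw percent agreement for the agreement two raters would reach by chance. The sketch below computes it from scratch for two raters' pass/fail decisions; the score data are hypothetical and not drawn from the article.

```python
# Illustrative inter-rater reliability check using Cohen's kappa,
# computed from scratch for two raters' pass/fail decisions on 8 examinees.
# The rating data are hypothetical.

from collections import Counter

rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal proportion for that category
    pa, pb = Counter(a), Counter(b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(rater1, rater2):.3f}")  # kappa = 0.500
```

Here observed agreement is 6/8 = 0.75 but chance agreement is 0.50, yielding kappa = 0.50, a reminder that raw agreement alone overstates consistency. Generalizability theory and MFRM go further by partitioning and adjusting for multiple error sources simultaneously.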
Conclusion
Minimizing the influence of rater errors is a persistent and considerable challenge for educators in the medical and health professions. Readers also are encouraged to remain cognizant of rater errors and do their best to ensure that minimal error emanating from subjective elements manifests in examinees' scores. To help accomplish this goal, a list of rater errors believed to be the most comprehensive ever assembled was presented. Readers are encouraged to examine each rater error type, reflect on the extent to which she/he has previously committed each error, and identify strategies for mitigating and preventing errors in future performance assessment scenarios.
Financial support and sponsorship
Conflicts of interest
Dr. Royal is the editor-in-chief of Education in the Health Professions. All peer-review activities relating to this manuscript were independently performed by other members of the editorial board.
References
1. Royal KD, Hecker KG. Rater errors in clinical performance assessments. J Vet Med Educ 2016;43:5-8.
2. Linacre JM. Many-Facet Rasch Measurement. Chicago, IL: MESA Press; 1989.
3. Johnson RL, Penny JA, Gordon B. Assessing Performance: Developing, Scoring, and Validating Performance Tasks. New York: Guilford Press; 2009.
4. Wesolowski BC, Wind SA, Engelhard G. Rater fairness in music performance assessment: Evaluating model data fit and differential rater functioning. Music Sci 2015;19:147-70.
5. Allen M. Assessing Academic Programs in Higher Education. San Francisco: Jossey-Bass; 2004.
6. Maki P. Assessing for Learning: Building a Sustainable Commitment across the Institution. Sterling: Stylus Publishing; 2004.
7. Brennan RL. Generalizability Theory. New York: Springer-Verlag; 2001.
8. Chiu CW. Scoring Performance Assessments Based on Judgements: Generalizability Theory. New York: Kluwer; 2001.
9. Linacre JM, Engelhard G, Tatum DS, Myford CM. Measurement with judges: Many-faceted conjoint measurement. Int J Educ Res 1994;21:569-77.
10. Lunz ME, Schumacker RE. Scoring and analysis of performance examinations: A comparison of methods and interpretations. J Appl Meas 1997;1:219-38.