Year: 2020 | Volume: 3 | Issue: 1 | Page: 16-21
Exploring the effects of authoring and answering peer-generated multiple-choice questions
Lysa Pam Posner1, Regina Schoenfeld-Tacher1, Mari-Wells Hedgpeth2, Kenneth Royal2
1 Department of Molecular Biomedical Sciences, College of Veterinary Medicine, North Carolina State University, Raleigh, NC, USA
2 Department of Clinical Sciences, College of Veterinary Medicine, North Carolina State University, Raleigh, NC, USA
Date of Submission: 25-Oct-2019
Date of Acceptance: 20-Nov-2019
Date of Web Publication: 13-Mar-2020
Dr. Lysa Pam Posner
Department of Molecular Biomedical Sciences, College of Veterinary Medicine, North Carolina State University, Raleigh, NC 27607
Source of Support: None, Conflict of Interest: None
Background: Many students believe that completing practice test questions improves their examination performance. This study was designed to investigate the effects of authoring and answering peer-generated multiple-choice questions. Methods: First-year Doctor of Veterinary Medicine students were voluntarily enrolled in the study. Each student was required to create at least three questions and encouraged to answer as many items as they wanted. Following the examination, participating students were required to complete a questionnaire characterizing the usefulness and enjoyability of the program. Results: A total of 94/101 students utilized the PeerWise program. Students believed that developing peer-generated questions improved their understanding of the material (79% agreed or strongly agreed). Fifty-six percent of students agreed or strongly agreed that they would use peer-generated questions as a study tool if no extra credit was associated with it; however, none of them used the technique when not incentivized. Of the 290 questions generated, only 4% required a deep understanding of the content, whereas 62% required recall only. Conclusions: Students generally perceived the program to be useful, but items of questionable quality may have limited students' learning gains.
Keywords: Assessment, item writing, medical education, peer learning
How to cite this article:
Posner LP, Schoenfeld-Tacher R, Hedgpeth MW, Royal K. Exploring the effects of authoring and answering peer-generated multiple-choice questions. Educ Health Prof 2020;3:16-21
How to cite this URL:
Posner LP, Schoenfeld-Tacher R, Hedgpeth MW, Royal K. Exploring the effects of authoring and answering peer-generated multiple-choice questions. Educ Health Prof [serial online] 2020 [cited 2020 May 30];3:16-21. Available from: http://www.ehpjournal.com/text.asp?2020/3/1/16/280540
Introduction
The perception that practice test questions improve subsequent examination performance, referred to as the “testing effect,” is typically held by students in health professions programs. Practice questions are routinely used by medical students; however, the style has evolved from paper “flashcards” to computer programs.
Previous research has shown that fostering student interaction with course content and teaching that material to others increases student comprehension. A free online program, PeerWise, allows students enrolled in a course to author, answer, and evaluate their own multiple-choice questions (MCQs), as well as those of their peers. The potential benefits for students of authoring and answering practice questions include increased engagement with the material to be learned (through both repetition and seeing the material in an alternative form), learning that results from answering practice questions, learning that comes from constructing items, and social engagement through discussion. The novelty of using a computer-based program may make the process more interesting and engaging to students. Thus, it would be expected that the increased time on task and reflective engagement with the course material needed to formulate the study questions would support deeper learning of the content than simply answering existing MCQs.
A number of studies have previously examined the effectiveness of PeerWise in multiple contexts. For example, the use of PeerWise was shown to enhance academic performance in undergraduate science and computer science courses. At the professional level, PeerWise has been shown to increase students' perceived learning and satisfaction with a pharmacology course. Within veterinary medical education, Rhind and Pettigrew evaluated PeerWise across three separate courses (anatomy/physiology, pathology, and clinical sciences), with mixed results. In some veterinary courses, there was a positive correlation between the number of PeerWise questions answered and examination score, whereas in other courses, there was no correlation. However, in all instances, veterinary students reportedly enjoyed the exercises and recognized the need to develop a deep understanding of the topic prior to authoring MCQs.
Interestingly, some research has reported that students were able to produce surprisingly accurate practice MCQs despite having been given no formal training in multiple-choice item writing. Bottomley and Denny reported that undergraduate biochemistry students given a formatting example (a stem, one correct answer, three incorrect answers, and an explanation) were able to create MCQs without incorrect information in the stem or answer 91.2% of the time without further instructor supervision. Similarly, the success of the PeerWise activity was independent of instructor supervision of MCQ generation and subsequent monitoring of student discussion. Although some studies reported that the MCQs generated were of “good quality” based on Bloom's taxonomy, the questions were generally of a superficial nature: ~33% in category 1 (memorization), ~43% in category 2 (comprehension), and ~25% in category 3 (application).
While there is some evidence that having students construct and answer MCQs increases comprehension of the material, the quality of the MCQs generated is likely important to the process. Poorly constructed MCQs, due in large part to the student author's potentially inadequate comprehension of the material and inexperience in developing sound test items, might limit the effectiveness of the process. Inaccurate questions (either from wrong answers or poorly written questions) might contribute to misunderstanding of the material and greater student confusion. When undergraduate biochemistry students were given the ability to comment on questions written by their peers, only 50% of participants noticed or commented on the content flaws present in the 8.8% of MCQs with factual errors. Therefore, an argument can be made that a potential unintended consequence of PeerWise is that some student-authored items may be detrimental to student learning.
The present study was designed to investigate the effects of authoring and answering peer-generated MCQs. More specifically, we sought to answer the following research questions (RQs):
- RQ1: Do Doctor of Veterinary Medicine (DVM) students exhibit better comprehension of material if they authored more practice examination questions?
- RQ2: Do DVM students exhibit better comprehension of material if they answered more peer-generated questions?
- RQ3: Do DVM students report enjoying the use of PeerWise as a study aid?
- RQ4: Do DVM students voluntarily use PeerWise as a study aid when not incentivized?
- RQ5: Do DVM students produce items that would be suitable for the instructor to use on a subsequent assessment?
Methods
A total of 101 students were enrolled in the 1st-year curriculum of the DVM program at North Carolina State University. Students ranged in age from 20 to 42 years with a median age of 23 years. Eighty-five percent identified as female, 15% as male, and 1% as Other. With respect to race/ethnicity variables, 73% identified as white, 26% of students identified as a member of one of nine minority groups, and 1% did not provide this information.
Setting and context
All students were enrolled in a required 4-credit-hour physiology course. Although the course consisted of four sections, only the respiratory physiology section was utilized for this study. The respiratory physiology section spanned 3 weeks; students were provided with lecture notes and copies of PowerPoint presentations and had access to a physiology reference book reserved in the library. There were a total of ten 1-h lectures and two 3-h laboratories associated with the section. An optional 50-min review session was held the day before the respiratory physiology examination, which consisted of 65 MCQs.
On the 1st day of the course, students were provided an incentive of 15 extra credit points (which accounted for 2.9% of the total points for the course) for participation in the study. Students were informed that their participation was voluntary, and the items they contributed would be treated anonymously and their responses to each question would be treated with strict confidentiality. To receive the 15 points, students were required to complete both parts of the study (participation using the PeerWise program and completion of a survey at the end of the study) which are described below.
The first part of the study consisted of students self-enrolling in the PeerWise program and generating three MCQ or true/false questions. Students were not provided any instruction on how to write the questions and the quality of questions they authored did not affect their earning extra credit points. All students had free access to the PeerWise program and the questions generated by their peers. Students were told the questions would not be monitored by the instructor for accuracy.
Administrative controls of the PeerWise program made it possible for the instructor to identify the number of students who self-enrolled, the number of questions generated, and the number of questions each student answered. The study was approved by the University's Institutional Review Board.
Impact of PeerWise use on exam grades
To assess whether authoring or answering peer-generated questions improved comprehension of the material, the number of questions answered and the number of questions generated were compared with students' examination scores (see data analysis below).
Item quality classification and evaluation
An expert in examination item quality reviewed and categorized all items according to the item-writing guidelines presented by Haladyna et al. The types and frequency of technical item-writing flaws in either the stem or the answer options were noted. Technical flaws of the stem included negatively phrased items, unfocused items, opinion-based items, and items that used absolute language. Technical flaws of the answer options included answer options that were not of equal length; the use of “none of the above,” “all of the above,” or complex (Type K) formats; answer options that were not listed in a logical order; distractors that were not parallel in structure; and distractors that were not plausible. In addition, technical flaws of cuing (testwiseness) included items that contained grammatical cues and items that provided hints to answers to items on the assessment.
Assessment of item domains
Items were classified based on the cognitive processes required by examinees to answer the questions. Items that assessed facts without the use of application were classified as recall items, whereas items that required problem-solving (e.g., interpretation of data and decision-making) were classified as application of knowledge. Items classified as recall were deemed to be of a lower cognitive order. Questions for which none of the answers, or more than one answer, could be considered correct were labeled as “incorrect” items. The content was evaluated by the course instructor, and each question was rated using the following schema: 1 = incorrect question, 2 = required recall/memorization, 3 = required moderate application of knowledge, and 4 = required advanced application of content.
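The four-level rating schema lends itself to a simple tally. The following is a minimal sketch, not the authors' procedure: the function name `summarize_ratings` and the ten example ratings are hypothetical and are used only to show how counts and percentages per category could be derived from a list of instructor ratings.

```python
from collections import Counter

# Rating schema as described in the text (1-4).
RATING_LABELS = {
    1: "incorrect question",
    2: "recall/memorization",
    3: "moderate application of knowledge",
    4: "advanced application of content",
}


def summarize_ratings(ratings):
    """Return {rating: (count, percent)} for every rating in the schema."""
    unknown = set(ratings) - set(RATING_LABELS)
    if unknown:
        raise ValueError(f"ratings outside the 1-4 schema: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    return {
        rating: (counts[rating],
                 round(100 * counts[rating] / total, 1) if total else 0.0)
        for rating in RATING_LABELS
    }


# Hypothetical ratings for ten items (illustrative only, not study data).
example = [2, 2, 3, 1, 2, 4, 2, 3, 2, 2]
summary = summarize_ratings(example)

for rating, (count, percent) in summary.items():
    print(f"{rating} = {RATING_LABELS[rating]}: {count} items ({percent}%)")
```

Reporting every category, including those with zero items, keeps the summary comparable across question sets of different sizes.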
Concurrent with the start of the next physiology content section (renal physiology), which began 3 days after the respiratory section examination, an identical peer question generation opportunity was created for the students to use but was not incentivized. The number of students who enrolled themselves, the number of questions generated, and the number of questions answered were tracked by the PeerWise program.
The final part of the study consisted of students completing a 12-item survey (Addendum A) about their experience using the PeerWise program with a Likert-type scale. Students were required to complete the survey within 24 h of the respiratory physiology examination. The goal of the survey was to determine if students thought it was worthwhile to write or answer content questions and if they found the process enjoyable.
Data analysis consisted of several components. First, students' performance on the examination was compared based on the number of items generated and the number of items completed in PeerWise. Scatterplots were generated, and Spearman's rho correlation coefficients were produced to assess the relationship. Next, items were evaluated for quality according to the aforementioned classification methodology, and descriptive statistics were produced. Finally, students' survey data regarding their experience with the PeerWise program were analyzed through descriptive statistics. All data analyses were performed using IBM SPSS Statistics for Windows, Version 25.0 (Armonk, NY: IBM Corp), and all significance testing was performed with alpha set to 0.05.
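For readers unfamiliar with Spearman's rho, the coefficient can be obtained by ranking each variable (tied values share the mean of their positions) and then computing a Pearson correlation on the ranks. The sketch below illustrates that idea in plain Python; it is not the SPSS procedure used in the study, the function names are hypothetical, and the six data points are invented for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data (not study data): items answered and exam grade out of 130.
items_answered = [0, 0, 5, 12, 40, 289]
exam_scores = [118, 125, 110, 121, 127, 119]
rho = spearman_rho(items_answered, exam_scores)
print(f"Spearman's rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, a heavily skewed variable such as the number of items answered does not let a few extreme values dominate the result, which is one reason a rank-based measure suits this kind of data.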
Results
A total of 94 (93%) students completed the study and generated a total of 290 questions (average of approximately three questions per student) during the incentivized trial.
Students' examination performance was assessed relative to the number of items generated and the number of items answered. The number of items generated ranged from 3 to 6, with 90 students generating three items, two students generating four items, one student generating five items, and one student generating six items. The average number of items generated was 3.07 (standard deviation [SD] = 0.395) with a median of 3. [Figure 1] illustrates the relationship between the number of items authored and examination performance. A Spearman's rho correlation coefficient of 0.093 indicates that the relationship between the number of items generated and eventual examination performance was negligible.
Figure 1: Number of questions authored and examination grades (maximum grade 130)
The number of peer-generated items answered was highly variable. A total of 36 (38.3%) students did not attempt to answer any items, whereas 58 (61.7%) answered items authored by other students. The range of items answered was 0–289, with a mean of 47.31 (SD = 75.49) and a median of 5. [Figure 2] illustrates the relationship between the number of items answered relative to examination performance. A Spearman's rho correlation coefficient of 0.009 indicates the relationship between the number of items answered and examination performance also was negligible.
Figure 2: Number of questions answered and examination grade (maximum grade 130)
Seven students (6.9%) did not complete the assignment (generation of PeerWise questions and completion of postexamination survey). Of these seven students, 2 (2.13%) did not attempt any part of the exercise. Interestingly, the two nonparticipating students finished the course with grades of 80.4% (C+) and 96.5% (A), respectively.
When not incentivized, none of the students created or answered questions for the following section in the course (renal physiology).
Student survey findings
Students were administered a 12-item survey inquiring about their experience with the PeerWise program. Responses are summarized in [Table 1]. Survey questions with the largest consensus of opinion included “Answering peer-generated questions helped to improve my understanding of the material” (64% agreed or strongly agreed), “Badly written peer-generated questions were a major distraction to the exercise” (54% agreed or strongly agreed), and “Answering peer-generated questions increased my stress regarding the material” (59% disagreed or strongly disagreed).
A breakdown of item quality is presented by the total number of item flaws [Table 2] and by item flaw type [Table 3]. Of the 12 items that were identified as having two flaws, four items contained two flaws in the stem, one item had two flaws in the answer options, six items had one flaw in the stem and one in the answer options, and one item had one flaw in the answer options and one flaw from cuing. Items were also reviewed for content. A breakdown of results is presented in [Table 4].
Discussion
When incentivized, 93% of the DVM students participated in this study, using the PeerWise program to author and answer peer-generated questions. Results indicate that authoring and answering examination questions produced a nominal positive benefit in examination grades. This is in contrast to findings in undergraduate students, who demonstrated a positive benefit from engaging in these tasks, but supports the mixed results of another study involving DVM students. Although the results of this study were somewhat unexpected, there are a number of potential reasons for the lack of effect. First, it is possible that there is a difference in the inherent motivation for deeper learning in veterinary students compared with undergraduate students. DVM students are generally older, most have already completed an undergraduate degree, and they are looking to become skilled veterinarians. In addition, one subset of students that PeerWise participation particularly helps is students in the lower quartile, and since most DVM students are typically high performers (this is necessary to gain admission to a doctoral program), there may be less demonstrable effects. This study was not designed to look at effects across students at different performance levels, so no additional commentary on this area can be made.
Second, students in this study were incentivized to write three questions and were then allowed to use the rest of the program as desired. Other studies have indicated that PeerWise can be used as a standalone technology, meaning that minimal instructor involvement is necessary. The authors intentionally did not provide instruction on the question type, construction, or other uses of the program. It is likely that the peer grading/rating and discussion that were required in other studies enhanced the positive effects of those studies. Other studies have identified that peer rating and discussion forums contributed to deeper learning and better performance. The “PeerWise score” for a student increases with participation, answering questions correctly, rating others' questions fairly, and contributing to discussion that is considered valuable by their peers. Another complicating factor was that the overall examination scores were relatively high (92.1% ± 5.3%), and thus, the lack of variability made identifying any positive effects of PeerWise use more difficult. This also suggests that if using the PeerWise program, students would benefit from more instructor involvement.
A related issue is the quality of the questions generated by the students. In studies where students were not instructed on how to write the questions but were graded on the quality and clarity of the questions, most of the questions were rated as “good,” had minimal errors, and spanned Bloom's taxonomy categories 1, 2, and 3. It is possible that the lack of feedback on the questions generated and the lack of incentive to create useful questions reduced the positive effects in this study. It is also possible that poorly written questions or answers with wrong or misleading information negatively impacted performance. In this study, of the 290 questions submitted, only 9.7% were considered incorrect or had multiple correct answers. This is consistent with other studies that reported approximately 9% of questions considered to be “incorrect.”
Course instructors may benefit from the use of this type of program due to the production of hundreds of potential examination questions. However, in this study, only 4% of student-generated questions required a deep understanding of the content, whereas 62% required recall. This is similar to the 56% of questions that required recall in another study. Furthermore, more than half (53%) of the questions generated were of poor technical quality and would require revision in the event they were to be used in subsequent assessments.
It is possible that because students were creating and answering questions of a more superficial nature, they did not work toward developing a deeper understanding of the material, and thus, PeerWise was not as helpful in preparation for an examination assessing this type of learning. Interestingly, the questions requiring more depth of understanding had a similar rate of technical flaws (41%) to the questions requiring recall (47%). While most of the studies on the use of PeerWise did not instruct students on how to write proper examination questions, the fact that almost half of all questions written were technically flawed and most were written at the recall level indicates that considerable revision would be necessary before potentially using any peer-generated items.
With respect to survey findings on the usefulness and enjoyableness of using the PeerWise program, the results were positive. Students felt that developing and answering peer-generated questions helped to improve their understanding of the material and that they would use peer-generated questions as a study tool if no extra credit was associated with it. However, when not incentivized by the instructor, none of the students enrolled or used the PeerWise program in the next section of the physiology course. The disconnect between their response and action is surprising since most students responded that the process helped them learn the material, and logically would, therefore, help with success on assessments. Since the subsequent section of physiology started within 3 days of the previous section in which they used the program, and the instructor made a point of announcing in class that the new section of the program was set up for them, it is unlikely that the students forgot about its availability. A variety of reasons may explain this disconnect. It is possible that DVM students who carry a large academic course load thought they would use such a program, but when presented with a time shortage opted to forego that option. Alternatively, students may have been confident they could succeed academically without the use of PeerWise. This is plausible because most students did well on the previous section, but did not know how the use of the PeerWise program may have actually affected their performance. Finally, some students may have responded with bias of a supportive nature stemming from the fact that the instructor introduced them to the program and favorably discussed how participation could be helpful.
With respect to limitations of this work, we are unable to speak to the degree to which students made an earnest and sincere effort to construct quality items and answer peer-generated items to the best of their ability. Further, we are unaware of how students may have used the information. It is possible that students answered peer-generated items both individually and/or as part of a group given the extra credit opportunity was to be completed outside of class. It is unknown if dialogue occurred about the items, which might include specific content discussed, the quality of discussions with peers, and other important considerations that could have some bearing on the results. Finally, given the PeerWise program involves both generating and consuming items, and because students in the course generally performed very well, we could not discern which potential action (generating items or answering items) may be more beneficial for students.
While this study evaluated PeerWise use among veterinary medical students, the findings should be applicable to most 1st-year students in any content-intensive, science-based professional program (e.g., medical school).
Conclusions
A nominal, positive effect was demonstrated with respect to students' performance as a result of both generating items for peers and answering peer-generated items. Peer-generated items were generally correct (had a single correct answer) with respect to substantive information. However, a considerable number of items contained various types of technical flaws, and the majority were written at the recall level of knowledge. Most student-authored items would not be appropriate for the instructor to use on a subsequent assessment.
Financial support and sponsorship
None.
Conflicts of interest
There are no conflicts of interest.
References
Roediger HL, Karpicke JD. Test-enhanced learning: Taking memory tests improves long-term retention. Psychol Sci 2006;17:249-55.
Augustin M. How to learn effectively in medical school: Test yourself, learn actively, and repeat in intervals. Yale J Biol Med 2014;87:207-12.
Schmidmaier R, Ebersbach R, Schiller M, Hege I, Holzer M, Fischer MR. Using electronic flashcards to promote learning in medical students: Retesting versus restudying. Med Educ 2011;45:1101-10.
Benware CA, Deci EL. Quality of learning with an active versus passive motivational set. Am Educ Res J 1984;21:755-65.
Denny P, Luxton-Reilly A, Hamer J. The PeerWise System of Student Contributed Assessment Questions. Proceedings of the Tenth Conference on Australasian Computing Education, Wollongong, NSW, Australia. 2008;78:69-74.
Bottomley S, Denny P. A participatory learning approach to biochemistry using student authored and evaluated multiple-choice questions. Biochem Mol Biol Educ 2011;39:352-61.
Devon J, Paterson JH, Moffat DC, McCrae J. Evaluation of student engagement with peer feedback based on student-generated MCQs. Innov Teach Learn Info Comp Sci 2012;11:27-37.
Galloway KW, Burns S. Doing it for themselves: Students creating a high quality peer-learning environment. Chem Educ Res Pract 2015;16:82-92.
McQueen HA, Shields C, Finnegan DJ, Higham J, Simmen MW. PeerWise provides significant academic benefits to biological science students across diverse learning tasks, but with minimal instructor intervention. Biochem Mol Biol Educ 2014;42:371-81.
Tatachar A, Li F, Gibson CM, Kominski C. Pharmacy students' perception of learning and satisfaction with various active learning exercises. Curr Pharm Teach Learn 2016;8:577-83.
Rhind SM, Pettigrew GW. Peer generation of multiple-choice questions: student engagement and experiences. J Vet Med Educ 2012;39:375-9.
Bloom BS, Engelhart MD, Furst EJ, Hill WH, Krathwohl DR. Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.; 1956.
Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ 2002;15:309-33.
[Figure 1], [Figure 2]
[Table 1], [Table 2], [Table 3], [Table 4]