Validity and Reliability of Scores Obtained on Multiple-Choice Questions: Why Functioning Distractors Matter

Plausible distractors are important for accurate measurement of knowledge via multiple-choice questions (MCQs). This study demonstrates the impact of higher distractor functioning on the validity and reliability of scores obtained on MCQs. Free-response (FR) and MCQ versions of a neurohistology practice exam were given to four cohorts of Year 1 medical students. Consistently non-functioning multiple-choice distractors (<5% selection frequency) were replaced with those developed from incorrect responses on the FR version of the items, followed by administration of the revised MCQ version to the two subsequent cohorts. Validity was assessed by comparing an index of expected MCQ difficulty with an index of observed MCQ difficulty, while reliability was assessed via Cronbach's alpha coefficient before and after replacement of consistently non-functioning distractors. Pre-intervention, the effect size (Cohen's d) of the difference between mean expected and observed MCQ difficulty indices was 0.4 to 0.59. Post-intervention, this difference was reduced to 0.15, along with an increase in the Cronbach's alpha coefficient of scores obtained on the MCQ version of the exam. Through this study, we showed that multiple-choice distractors developed from incorrect responses on the free-response version of the items enhance the validity and reliability of scores obtained on MCQs.


Introduction
Valid scores are necessary for any assessment instrument, irrespective of the level of examinees' education or the domain or subject under assessment. A search of the literature on the scholarship of teaching and learning reveals a plethora of studies on the topic, ranging from the cultural validity of assessment (Shaw, 1997) to the impact of the clarity of an assessment's design on learners' performance (Solano-Flores & Nelson-Barber, 2001). In medical education, the desire to yield valid assessment scores is even stronger, since learner competence has immediate and serious implications for patient care. Although the study presented here was conducted in the context of undergraduate medical education, it demonstrates how the multiple-choice question, an assessment instrument prevalent in science and humanities education, can be improved to help educator scholars draw more definitive conclusions about the competence of learners and the effectiveness of curricula.
Journal of the Scholarship of Teaching and Learning, Vol. 16, No. 1, February 2016. Josotl.Indiana.edu
... non-functioning distractors offer very little in terms of validity of scores, while unnecessarily increasing the response time needed per MCQ.

Reliability of Scores Obtained on Multiple-choice Questions
The concept of reliability is ingrained in Classical Test Theory, the central tenet of which is that an examinee's observed score (X) can be decomposed into her/his true score (T) and a random error component (E) (X = T + E) (De Champlain, 2010). The true score (T) is the score that would be obtained if the exam measured the ability of interest perfectly (i.e. with no measurement error). A reliability coefficient, which ranges from 0 (lowest) to 1 (highest), estimates the level of concordance between the observed and true scores of an examinee (De Champlain, 2010).
The type of reliability frequently discussed in the context of MCQs is internal consistency, which is meant for exams that require a single administration to a group of examinees (Downing, 2004). Internal consistency reliability assesses the correlation between scores obtained on two parallel forms of an exam, i.e., forms assessing the same content and on which examinees have the same true scores and equal errors of measurement. Cronbach's alpha is its widely used coefficient; a coefficient of 0.8 or more is desired for high-stakes in-house exams (De Champlain, 2010; Downing, 2004).
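As an illustration of how Cronbach's alpha can be computed from a single administration, the following is a minimal sketch rather than the procedure used in the study. It applies the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), to dichotomously scored items. The function name `cronbach_alpha` and the small response matrix are hypothetical, and population variances are assumed throughout.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (hypothetical helper).

    scores: list of rows, one per examinee; each row holds that
    examinee's item scores (e.g. 1 = correct, 0 = incorrect).
    """
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees (population variance assumed).
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of examinees' total scores.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: four examinees, three items.
example = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(example)
```

Because total-score variance sits in the denominator, a wider spread of total scores raises alpha when item variances are held constant.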
It has been suggested that reliability can be improved by increasing the number of items given in an exam (Downing, 2004). Such an improvement can be estimated using the Spearman-Brown "prophecy" formula ($\alpha_{new} = \frac{k\alpha}{1 + (k - 1)\alpha}$), where "$\alpha$" is the Cronbach's alpha coefficient of the current exam and "k" is the factor by which the number of items in the exam is changed (Karras, 1997). However, owing to the usually fixed number of items given in high-stakes in-house or licensure exams, an alternate way to improve reliability is to increase the spread of scores obtained on an exam (total test variance). A wider distribution of scores can be obtained by eliciting a wider range of performances from examinees, i.e. by including a greater number of moderately difficult (difficulty index: 0.4 to 0.8) and sufficiently discriminatory (point biserial correlation ≥ 0.2) items in the exam (Hutchinson et al., 2002). McManus et al. discuss in greater detail how this approach may increase the standard deviation, and hence the variance, of observed scores (McManus et al., 2003).
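The Spearman-Brown estimate can be sketched in a few lines. This is an illustrative sketch in the formula's standard form, predicted reliability = k * alpha / (1 + (k - 1) * alpha), where k is taken to be the factor by which test length changes (an assumption about how the formula is applied here). The function names and the starting reliability of 0.6 are hypothetical.

```python
def spearman_brown(alpha, k):
    """Predicted reliability when test length is multiplied by factor k."""
    return (k * alpha) / (1 + (k - 1) * alpha)


def length_factor_needed(alpha, target):
    """Factor by which the test must be lengthened to reach `target`.

    Obtained by solving the Spearman-Brown formula for k.
    """
    return (target * (1 - alpha)) / (alpha * (1 - target))


# Hypothetical exam with alpha = 0.6: doubling its length (k = 2).
doubled = spearman_brown(0.6, 2)
# Lengthening factor needed to reach the 0.8 benchmark from 0.6.
factor = length_factor_needed(0.6, 0.8)
```

Under these assumed numbers, doubling the exam raises the predicted coefficient from 0.6 to 0.75, and reaching 0.8 would require an exam roughly 2.7 times as long, which illustrates why lengthening is often impractical for fixed-length exams.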
In the study presented here, two versions (FR and MCQ) of the same neurohistology exam were randomly distributed among six cohorts of Year 1 medical students. The evidence of validity pertaining to Relations to other variables, described above, was gathered before and after replacement of consistently non-functioning distractors with those developed from incorrect responses on the FR version of the items. Specifically, an index of expected MCQ difficulty was calculated (see Methods) and compared with the index of observed MCQ difficulty. This comparison was based on the assumptions that (1) the FR version of an item elicits true knowledge, and (2) faculty responsible for the assessment of basic science content write reasonably plausible MCQ distractors. The effect of distractor functioning on the range of ability elicited from examinees, and its impact on the reliability of obtained scores, was also studied.

Research hypotheses
The research hypothesis of the validity part of the study was: There is no difference between expected and observed MCQ difficulty indices when selection of all provided options is accounted for in calculating the expected index. To date, no such comparisons of actual performance on multiple-choice questions (observed difficulty index) with what it ought to have been (expected difficulty index) have been reported, especially in the context of assessment in undergraduate medical education, which highlights the novelty of the presented study. The research hypothesis of the reliability part of the study was: Enhanced distractor functioning increases the standard deviation and, therefore, the reliability coefficient of scores obtained on multiple-choice exams.

Research Design
An experimental research design with random distribution of the free-response (FR) and multiple-choice (MCQ) versions of an exam was employed. The study was approved and adjudged exempt from detailed review by the Institutional Review Board of the University of North Dakota.

Subjects and Setting
Six cohorts of Year 1 medical students at the University of North Dakota School of Medicine and Health Sciences served as subjects.
The school's medical education curriculum is a hybrid of Patient-Centered Learning (PCL) and traditional, discipline-based instruction. Neurohistology is taught during the neuroscience curricular block scheduled at the end of academic Year 1 via a combination of lectures and laboratory exercises by faculty with expertise in neuroscience.

Sample of Questions
A neurohistology exam comprising 25 items with a mix of knowledge (factual recall) and application-type questions was used. An FR (fill-in-the-blank) and an MCQ (one-best answer) version of this exam were created; the only difference between these two versions was in the format of the question asked (example: Figure 1). Of the 25 FR-MCQ item-sets, two were excluded from analysis since their FR version contained options, thereby not meeting the criterion needed for comparison with the MCQ version.

Procedure
Each cohort of students was invited, via email, to attend a non-mandatory practice session 5 days prior to the end-of-block neurohistology exam. No information regarding the design of the study was shared in advance. No points were granted for participation in the study. Once the subjects were seated, an approximately equal number of free-response and multiple-choice printouts of the exam were randomly distributed amongst them. Then, the purpose of the study was shared, and subjects were asked not to provide any personal or identifiable information on the answer sheets. Neurohistology images (example: Figure 1) were projected on a screen and one minute was provided to answer each question. After the exam, each question was discussed openly and students were asked not to change their answers. The answer sheets were collected, codified and scored according to pre-developed answer keys.

Intervention
The following revisions were made to the MCQ version of the exam based on examinee performance in Cohorts 1-4.
a. Thirty-one distractors in 15 MCQs with a consistent selection frequency of 0% were replaced with new distractors developed from frequent incorrect responses on the FR version of the items.
b. Five 5-option MCQs were converted to 4-option MCQs via removal of a distractor with a consistent selection frequency of 0%.
The number of 5-, 4- and 3-option MCQs in the original (unrevised) version was 21, 1 and 1, respectively; these numbers were 16, 6 and 1 in the revised MCQ version of the exam. In order to gauge the extent of distractor functioning in a larger sample of subjects, the revised MCQ version of the exam was given to all subjects in Cohort 5. In Cohort 6, the revised MCQ version was given to a random half of the subjects while the other half received the FR version of the exam.
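Identifying candidate distractors for replacement amounts to tabulating how often each incorrect option is chosen. The sketch below is one hedged way to do this for a single item. The function names, the option labels and the response counts are hypothetical, and the default 5% cut-off follows the definition of a non-functioning distractor given in the abstract.

```python
from collections import Counter


def distractor_frequencies(responses, options, key):
    """Selection frequency of each distractor for one MCQ.

    responses: option labels chosen by examinees, e.g. ["A", "C", ...]
    options:   all option labels offered for the item
    key:       label of the correct option (excluded from the result)
    """
    if not responses:
        raise ValueError("no responses recorded for this item")
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts[opt] / n for opt in options if opt != key}


def non_functioning(responses, options, key, threshold=0.05):
    """Distractors chosen by fewer than `threshold` of examinees."""
    freqs = distractor_frequencies(responses, options, key)
    return sorted(opt for opt, f in freqs.items() if f < threshold)


# Hypothetical 5-option item answered by 20 examinees; "A" is the key.
responses = ["A"] * 14 + ["B"] * 4 + ["C"] * 2
flagged = non_functioning(responses, "ABCDE", "A")  # options D and E
```

Running the same tabulation across several cohorts would show which distractors are consistently unselected, which is the condition the revisions above acted on.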

Data Collection and Analysis
The following variables were calculated from student performance:
a. Individual scores, as well as the mean and standard deviation of scores, in each cohort.
b. Psychometric characteristics, i.e. the difficulty and discriminatory ability of each item. Difficulty was calculated via the difficulty index (number of correct answers / number of all answers), while discriminatory ability was calculated via the point biserial (item-total) correlation (Tavakol & Dennick, 2011).
c. The index of expected MCQ difficulty, calculated as follows. Suppose the FR version of an item is correctly answered by 60% of examinees (FR difficulty index: 0.6). The proportion of examinees with an incorrect answer on the FR version would then be 40% (0.4). Now suppose that the MCQ version of this item contains 5 options. It would be anticipated that a certain proportion of the examinees who answered the item incorrectly on its FR version might have chosen the correct MCQ option, using random or educated guessing, had they taken the MCQ version of the exam. Probability would suggest that this proportion, among the 40% (0.4) of examinees, would be at least 8% (0.08) (0.4 / 5 = 0.08). This proportion of examinees (0.08) can be added to the FR difficulty index to generate the index of expected MCQ difficulty (0.6 + 0.08 = 0.68) (Table 1).
d. Effect size (Cohen's d) of the difference between mean expected and observed MCQ difficulty indices (Hojat & Xu, 2004).
e. Number of MCQ distractors with ≥5%, ≥10%, ≥20%, and ≥33% selection frequency in each cohort.
f. Cronbach's alpha coefficient of scores, before and after revision, on the MCQ version of the exam.
g. Standard Error of Measurement ($SEM = SD\sqrt{1 - \alpha}$, where $\alpha$ is the Cronbach's alpha coefficient), which is the standard deviation of an examinee's observed score, given her true score (Karras, 1997). SEM describes the precision of measurement and is used to establish a confidence interval within which an examinee's true score is expected to fall.
Exam performance data from all cohorts were stored in Microsoft Excel (2010) and analyzed via MS-Excel and SigmaStat v. 20.
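The expected-difficulty calculation and the effect-size comparison can be sketched as follows. This is an illustrative sketch, not the authors' spreadsheet. The function names are hypothetical, the choice of observed indices as the "experimental" group and expected indices as the "control" group is an assumption, and so is the use of the sample standard deviation. Cohen's d follows the paper's stated definition: the difference between group means divided by the control group's standard deviation.

```python
from statistics import mean, stdev


def expected_mcq_difficulty(fr_difficulty, n_options):
    """FR difficulty index plus the share of FR-incorrect examinees
    expected to reach the correct MCQ option by guessing alone."""
    if n_options < 1:
        raise ValueError("an MCQ needs at least one option")
    return fr_difficulty + (1 - fr_difficulty) / n_options


def cohens_d(experimental, control):
    """(mean of experimental - mean of control) / SD of control.

    Sample standard deviation is assumed here.
    """
    return (mean(experimental) - mean(control)) / stdev(control)


# Worked example from the text: FR difficulty 0.6, five options -> 0.68.
worked = expected_mcq_difficulty(0.6, 5)

# Hypothetical per-item indices for three items.
fr_indices = [0.50, 0.60, 0.70]
n_options = [5, 5, 4]
expected = [expected_mcq_difficulty(p, k) for p, k in zip(fr_indices, n_options)]
observed = [0.75, 0.80, 0.85]  # hypothetical observed MCQ difficulty indices
d = cohens_d(observed, expected)
```

A positive d under these assumptions means the MCQs were easier in practice than the guessing-adjusted FR performance predicts, which is the pattern the study attributes to non-functioning distractors.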

Results
Table 2 displays the number of students taking the FR and MCQ versions of the exam, score means and their standard deviations, mean item difficulty indices and mean point biserial correlations. As expected, scores on the FR version tended to be lower in all cohorts than scores on the MCQ version of the exam. Moreover, the revised MCQ version (Cohorts 5 and 6) exhibited greater difficulty and discriminatory ability than the original MCQ version (Cohorts 1-4) of the exam.
Table 3 and Figure 2 display the effect size (Cohen's d) of the difference between mean expected and observed MCQ difficulty indices before (Cohorts 1-4) and after (Cohort 6) replacement of previously non-functioning distractors; Cohen's d could not be calculated for Cohort 5, since all subjects in that cohort received the revised MCQ version of the exam. A considerable increase in MCQ difficulty was noted after replacement of consistently non-functioning distractors (Cohorts 5 and 6), with a concomitant reduction in the disparity between mean expected and observed MCQ difficulty indices (Cohort 6) (d = 0.15).
Table 4 and Figure 3 display the number of distractors with ≥5%, ≥10%, ≥20%, and ≥33% selection frequency in the MCQ version of the exam before (Cohorts 1-4) and after (Cohorts 5 and 6) replacement of consistently non-functioning distractors. Table 4 also displays the number of total as well as functioning (≥5% selection frequency) distractors per MCQ. Both higher distractor selection in most categories and a greater number of functioning distractors per MCQ were noted after replacement of consistently non-functioning distractors (Cohorts 5 and 6).
Table 5 and Figure 4 display the reliability coefficients (Cronbach's alpha) and Standard Errors of Measurement (SEM) of scores obtained on the FR and MCQ versions of the exam. After replacement of previously non-functioning distractors (Cohorts 5 and 6), scores obtained on the MCQ version of the exam exhibited a greater standard deviation (3.61), higher Cronbach's alpha coefficients (0.74 and 0.78) and a slightly higher Standard Error of Measurement (1.84 and 1.66). Figure 4 demonstrates the directly proportional relationship between the standard deviation and the reliability coefficient of exam scores. A peculiar finding was the high standard deviation and reliability coefficient of scores on the MCQ version of the exam in Cohort 1, since examinees in that cohort had received the unrevised MCQ version of the exam. See Discussion for a possible explanation of this finding.

Discussion
The first observation, in line with previously published studies (Ward, 1982; Norman et al., 1987; Schuwirth, 1996; Norman, 1988), was that performance on the FR version of an exam is consistently lower than performance on its MCQ version (Table 1). Since FR and MCQ versions were randomly distributed in each cohort, the consistently disparate performance is attributable to the nature of the two versions; the MCQ version contains options and allows for some degree of cueing and correct guessing, while the FR version requires production of an answer spontaneously from memory.
Secondly, the difficulty of an MCQ-based exam is lower than expected when the number of distractors with sufficient plausibility (≥5%, ≥10%, ≥20% and ≥33% selection frequencies) is low. Tables 3 and 4 highlight this finding. The effect size of the difference between mean expected and observed MCQ difficulty indices was found to be higher in cohorts with lower overall distractor functioning (Cohorts 1-4). However, when consistently non-functioning distractors were replaced with those developed from frequent incorrect answers on the FR version of the items (Cohorts 5 and 6), higher overall distractor functioning and a reduced disparity between mean expected and observed MCQ difficulty indices were noted. In other words, when incorrect responses on FR versions of the items are used to construct MCQ distractors, the MCQs tend to demonstrate their expected difficulty, thereby enhancing the evidence of validity of scores.
... (SEM) ($SEM = SD\sqrt{1 - \alpha}$) (Hutchinson et al., 2002; Harvill, 1991), since an increased range of ability (standard deviation) elicited by an exam increases not only the reliability coefficient but also the error of measurement of the assessment instrument. This theory has been reported on by Tighe et al., who studied the interrelationships among standard deviation, Standard Error of Measurement and exam reliability via a Monte Carlo simulation of 10,000 candidates taking a postgraduate exam (Tighe et al., 2010). They found that scores obtained on the very same exam experienced a decrease in reliability coefficient when it was retaken by only those examinees who had already passed it. In other words, allowing very weak (unprepared) candidates to take an exam can artificially inflate the reliability of scores obtained on that exam. Tighe et al. suggested that, when the ability range of examinees is noted to be narrow, the Standard Error of Measurement may be enough for assessment of measurement precision. We agree with this suggestion and advise interpretation of the reliability coefficient in light of the psychometric characteristics (difficulty index and point biserial correlations) and distractor functioning of MCQs.
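The relationship among standard deviation, reliability and SEM can be checked numerically with the standard formula SEM = SD * sqrt(1 - alpha). The sketch below is illustrative only. Pairing the reported standard deviation of 3.61 with the reported alpha of 0.74 is an assumption about which cohort that standard deviation belongs to, the observed score of 18 is hypothetical, and the 95% interval assumes normally distributed measurement error (z = 1.96).

```python
import math


def standard_error_of_measurement(sd, alpha):
    """SEM = SD * sqrt(1 - alpha)."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha is expected to lie between 0 and 1")
    return sd * math.sqrt(1 - alpha)


def true_score_interval(observed, sd, alpha, z=1.96):
    """Interval of observed score +/- z * SEM (95% when z = 1.96,
    assuming normally distributed measurement error)."""
    half_width = z * standard_error_of_measurement(sd, alpha)
    return observed - half_width, observed + half_width


# Assumed pairing of reported values: SD 3.61 with alpha 0.74.
sem = standard_error_of_measurement(3.61, 0.74)

# Hypothetical observed score of 18 on the 23 analyzed items.
low, high = true_score_interval(18, 3.61, 0.74)
```

Under the assumed pairing, the formula gives approximately 1.84, in line with one of the reported SEM values. Because SEM scales with the standard deviation, a revision that widens the score distribution can raise alpha and SEM at the same time.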
A number of limitations apply to the presented study. The first is the small number of investigated items (n=23). Although suitable for assessment of knowledge of neurohistology, this number may be insufficient for an experiment of this nature, and the study may be expanded to include more items. The second potential limitation is the no-stakes nature of the exam used in this study; it was given as practice for the high-stakes neurohistology exam. Despite the no-stakes nature of the experiment, it is worth noting that our research question focused solely on differential performance on the FR and MCQ versions of an exam at single time-points. The third potential limitation is the generalizability of our findings. Although we anticipate considerable generalizability owing to the nature of our intervention (using common responses on the FR version of the items as distractors on the MCQ version of the same items), we are yet to see a replication of our experimental design in settings other than undergraduate medical education. We invite educator scholars in the sciences and humanities to replicate our design and study the validity and reliability of scores obtained on MCQs revised on the principle described in this study. We predict that many educator scholars will find this approach to be resource-friendly and efficient.
For its ease of administration and objective grading, multiple-choice testing is the prevalent form of assessment in science and humanities education. However, it relies on recognition of the most credible answer from a brief list of options, some of which may be barely plausible. Such an examination is a far cry from the real-life situations that healthcare, science and humanities professionals face every day. Novel problems in any discipline are rarely solved simply by choosing from among a limited list of presented options. For example, in a healthcare setting, although signs and symptoms of an illness allow for some cueing and educated guessing, patients do not present the healthcare provider with five options from among which the "single best answer" is chosen (Veloski et al., 1999). In that setting, the "single best answer" is expected to be chosen based on knowledge, analysis and reason. Therefore, it is imperative that multiple-choice questions undergo strict scrutiny for their ability to elicit true knowledge. Using an adequate yardstick for comparison, such as performance on an open-ended, free-response version of the same questions, is a useful step in this direction and helps assess the validity of scores obtained on such questions. In medicine, licensure bodies such as the National Board of Medical Examiners recognize the importance of conducting such comparisons, and a few studies of this nature have been published in the past (Case & Swanson, 1994; Swanson et al., 2005; Swanson et al., 2006). In our experience, administering two versions (free-response and multiple-choice) of the same exam as practice for a high-stakes multiple-choice exam allows learners to detect areas of needed improvement, and instructors to encourage deep, rather than superficial, learning strategies. An attempt to improve the ability of MCQs to accurately serve their purpose, through such ventures, may truly be worthy of faculty time and effort.

Figure 1. Example of Free-Response (FR) and Multiple-Choice (MCQ) version of an item.

Figure 2. Effect size of the difference between mean expected and observed MCQ difficulty indices.

Figure 3. Percentage of MCQ distractors with different selection frequencies.

Figure 4. Standard deviation and reliability coefficient (Cronbach's alpha) of scores obtained on MCQ version of the exam.

Table 1. Calculation of expected MCQ difficulty index
Effect size [Cohen's d] of the difference between mean expected and observed MCQ difficulty indices. Effect size represents the extent to which the research hypothesis is considered to be true, or the degree to which the findings of an experiment have practical significance in the study population regardless of the size of the study sample (Hojat & Xu, 2004). Cohen's d is a statistic that is equal to the difference between the means of the experimental ($M_e$) and control ($M_c$) groups divided by the standard deviation of the control group ($\sigma_c$): Cohen's $d = \frac{M_e - M_c}{\sigma_c}$.

Table 2. Number of students taking the Free-Response (FR) and Multiple-Choice (MCQ) versions of the exam in all cohorts.
Mean score, standard deviation, mean item difficulty (diff.) and mean point biserial correlations (pbi) are also displayed.

Table 3. Mean Free-Response (FR) and Multiple-Choice (MCQ) difficulty indices and their Standard Deviations (SD) in all cohorts.
Effect size (Cohen's d) of the difference between Mean Observed and Expected MCQ difficulty indices is also displayed.