---
title: "Notation Efficace des Questions à Choix Multiples"
title_en: "Efficient Scoring of Multiple-Choice Tests"
authors:
  - name: "Alexis Direr"
    affiliation: "Université d'Orléans, LEO"
    orcid: "0000-0002-4459-7780"
doi: "10.3917/reco.763.0417"
keywords: [estimation theory, multiple choice tests, decision making, loss aversion, narrow framing, formula scoring, number right scoring, mean squared error]
keywords_fr: [théorie de l'estimation, questions à choix multiples, prise de décision, aversion aux pertes]
jel_codes: [A20, C93, D80]
language: [fr, en]
type: research-article
---

# Notation Efficace des Questions à Choix Multiples / Efficient Scoring of Multiple-Choice Tests

**Author**
- Alexis Direr — Université d'Orléans, LEO (rue de Blois, BP 26739, 45067 Orléans Cedex 02) — *alexis.direr@univ-orleans.fr* — ORCID: [0000-0002-4459-7780](https://orcid.org/0000-0002-4459-7780)

**DOI**: [10.3917/reco.763.0417](https://doi.org/10.3917/reco.763.0417)

**Keywords (EN)**: estimation theory, multiple choice tests, decision making, loss aversion.
**Mots-clés (FR)**: théorie de l'estimation, questions à choix multiples, prise de décision, aversion aux pertes.
**JEL codes**: A20, C93, D80.

---

## Abstract

**English.** This paper studies the optimal scoring of multiple-choice tests in which the marks for wrong selections and omissions jointly minimize the mean squared difference between scores and examinees' abilities. Examinees are loss averse and, as a result, reluctant to risk answers on the basis of their knowledge. I find that it is efficient to incentivize the least able to omit, except when the test has a very large number of items. The mark for omission is positive when the test size is limited and negative when it is large. Loss aversion generally improves estimator efficiency by spontaneously inducing more omission and thereby reducing the need to bias the mark upward to encourage omission. The model sheds light on the statistical properties of two widely used scoring methods, Number right scoring and Formula scoring.

**Français.** Cet article étudie la notation efficace des questions à choix multiples dans lesquelles les points minimisent l'écart quadratique moyen entre le score et les connaissances des candidats. Les candidats présentent une aversion aux pertes et sont réticents à risquer des réponses sur la base de leurs connaissances. Je trouve qu'il est généralement efficace d'inciter les candidats les moins informés à omettre, sauf lorsque le test comporte un très grand nombre de questions. La note en cas d'omission est positive lorsque la taille du test est limitée et négative dans le cas inverse. L'aversion aux pertes améliore généralement l'efficacité des estimateurs en induisant spontanément davantage d'omissions. Le modèle éclaire les propriétés statistiques de deux méthodes de notation populaires : le comptage du nombre de bonnes réponses (Number right scoring) et la notation par formule (Formula scoring).

---

## 1. Motivation and contributions

Multiple-choice questions (MCQs) are widespread in education and high-stakes assessment (SAT, GRE) because they can be graded quickly and automatically, sample the curriculum broadly, and avoid examiner bias. Their main weakness is the difficulty of dealing with random answers: examinees with no relevant knowledge can still earn points by luck, and it is impossible to tell a posteriori whether a correct answer stems from full knowledge, partial knowledge, or a random guess. Chance therefore adds measurement error to scores. The law of large numbers eliminates this error asymptotically, but real tests have a limited number of questions, and the scoring rule itself can attenuate the residual error.

The paper makes the following contributions.

1. **A statistical-efficiency framing of MCQ scoring.** The marking rule is designed to minimize the mean squared error (MSE) between observed and true scores, where the true score is the score an examinee would receive if their ability were perfectly observed. Both the wrong-answer mark $\theta$ and the omission mark $\gamma$ are endogenous, in contrast to most of the prior literature where only the wrong-answer penalty is optimized (Espinosa and Gardeazabal 2010; Budescu and Bo 2015; Akyol, Key and Krishna 2022).

2. **Joint loss aversion and narrow framing.** Examinees evaluate each question separately (narrow framing, Tversky and Kahneman 1981) and overweight losses (loss aversion, Kahneman and Tversky 1979). This generates a spontaneous bias toward omission, consistent with empirical evidence that examinees skip questions even when the expected mark from random guessing exceeds the omission mark (Sherriffs and Boomer 1954; Ebel 1968; Cross and Frary 1977; Bliss 1980; Pekkarinen 2015; Akyol, Key and Krishna 2022).

3. **A test-size-dependent efficient rule.** The efficient $\gamma$ is positive when the test is short — the least able are encouraged to omit and reveal their type — and turns negative when the test is sufficiently long, at which point all examinees are required to answer. The transition is sharp in the simulations and depends on the strength of loss aversion.

4. **Loss aversion as a substitute for biased marks.** Because loss-averse examinees omit more spontaneously, the designer needs less upward bias on $\gamma$ to induce the desired omission rate. RMSE decreases with loss aversion in short tests, and the regime switch to forced answering occurs at smaller $n$ as loss aversion strengthens.

5. **A re-reading of Number Right Scoring (NRS) and Formula Scoring (FS).** Both rules set $\gamma = 0$ regardless of test length and are therefore inefficient over a broad range of test sizes. The paper reframes the long-standing NRS–FS debate as a debate over the wrong answer mark only, missing that the omission mark is the more powerful lever for the behavior of low-ability examinees.

---

## 2. Model

### 2.1 Setup

The test has $n$ items, each with $m$ options (one correct, $m-1$ incorrect). Items are equally difficult, free of obvious answers or ambiguity, and properly randomized; time is unconstrained. Each examinee has a constant probability $p$ of selecting the correct option, where $p$ measures their domain knowledge. Marks are normalized to:

- $1$ for a correct selection,
- $\theta$ for a wrong selection,
- $\gamma$ for an omission,

with the minimal restriction $\theta \leq \gamma < 1$. Let $z$ be the number of omissions and $\tilde{x} \sim B(n - z, p)$ the number of correct selections among answered items. The realized score is

$$\tilde{s} = \frac{\tilde{x} + \gamma z + \theta (n - z - \tilde{x})}{n}.$$
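The score formula can be sketched in a few lines of Python. This is an illustrative simulation only: the function name `realized_score` and the parameter values are assumptions chosen for the example, not taken from the paper.

```python
import random

def realized_score(n, p, z, gamma, theta, rng):
    """Average mark of an examinee who omits z of n items and answers the rest."""
    answered = n - z
    # Number of correct selections, drawn as Binomial(n - z, p) via Bernoulli trials.
    x = sum(rng.random() < p for _ in range(answered))
    return (x + gamma * z + theta * (answered - x)) / n

rng = random.Random(0)

# Corner cases that do not depend on chance.
all_omitted = realized_score(10, 0.6, 10, 0.2, -0.5, rng)  # every item omitted
all_correct = realized_score(10, 1.0, 0, 0.2, -0.5, rng)   # every item answered correctly

# Monte Carlo mean versus the analytical expectation
# (15 * 0.6 + 0.2 * 5 - 0.5 * 15 * 0.4) / 20 = 0.35 for the illustrative values below.
draws = [realized_score(20, 0.6, 5, 0.2, -0.5, rng) for _ in range(20000)]
simulated_mean = sum(draws) / len(draws)
```

An examinee who omits everything receives exactly $\gamma$, which is why the omission mark acts as a constant estimator for that group.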

### 2.2 True score and the role of $\theta^*$

The true score is the expected mark in a benchmark test with mark $1$ for correct answers and notional penalty $\theta^*$ for wrong answers, assuming examinees never omit:

$$s(p) = p + (1 - p)\theta^*.$$

Setting $\theta^* = -1/(m-1)$ (the Formula Scoring penalty) makes the true score of a pure random guesser equal to zero: $s(1/m) = 0$. Misinformation is ruled out by setting the lowest possible ability to $p_0 = 1/m$.
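As a quick check of this normalization, substituting $p = 1/m$ and $\theta^* = -1/(m-1)$ into the true-score formula gives

$$s(1/m) = \frac{1}{m} + \left(1 - \frac{1}{m}\right)\left(-\frac{1}{m-1}\right) = \frac{1}{m} - \frac{1}{m} = 0.$$

For example, with $m = 3$ options the true score is $s(p) = 1.5\,p - 0.5$, which runs from $0$ at $p = 1/3$ to $1$ at $p = 1$.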

### 2.3 Risk preferences: narrow framing and loss aversion

Three behavioral assumptions govern the choice between answering and omitting:

1. **Narrow framing.** Utility is derived from the marks of each individual question, not from the aggregate or average score (Tversky and Kahneman 1981; Read, Loewenstein and Rabin 1999; Bereby-Meyer, Meyer and Flascher 2002).
2. **Loss aversion.** Utility from gains equals the gain itself ($u(1) = 1$, $u(\gamma) = \gamma$), but losses are scaled up by a coefficient $\lambda$: $u(\theta) = \lambda\theta$. Loss aversion is summarized by $\theta(\lambda - 1) \leq 0$, i.e. $\lambda \geq 1$ when $\theta < 0$ and $\lambda \leq 1$ when $\theta > 0$.
3. **Linearity** of utility in marks within the gain or loss domain.

Given $(\gamma, \theta)$, an examinee with success probability $p$ omits if $\gamma > p + (1 - p)\lambda\theta$. The indifference threshold is

$$\bar{p} = \frac{\gamma - \lambda\theta}{1 - \lambda\theta} > \frac{\gamma - \theta}{1 - \theta} \quad \text{whenever } \lambda > 1 \text{ and } \theta < 0,$$

so loss aversion mechanically raises the omission threshold relative to risk neutrality.
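The threshold is simple to evaluate numerically. The sketch below uses a hypothetical helper `omission_threshold` and the rounded marks reported later for $n = 5$ ($\gamma = 0.30$, $\theta = -0.48$) purely as illustrative inputs.

```python
def omission_threshold(gamma, theta, lam=1.0):
    """Ability p_bar at which answering and omitting yield the same utility."""
    return (gamma - lam * theta) / (1.0 - lam * theta)

# Same marks, two levels of loss aversion.
risk_neutral = omission_threshold(0.30, -0.48, lam=1.0)
loss_averse = omission_threshold(0.30, -0.48, lam=1.5)

# The threshold rises with lambda, so more examinees fall below it and omit.
threshold_shift = loss_averse - risk_neutral
```

With these inputs the loss-averse threshold is $1.02 / 1.72$, above the risk-neutral value $0.78 / 1.48$, in line with the inequality above.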

### 2.4 Mean squared error criterion

Under constant per-question success probability, an examinee with $p > \bar{p}$ answers all items and the realized score $\tilde{s} = (\tilde{x} + (n - \tilde{x})\theta)/n$ is a linear estimator of $s(p)$ with mean squared error

$$\text{mse}(\theta; p) = E[(\tilde{s} - s(p))^2] = V(\tilde{s}; p) + (E[\tilde{s}] - s(p))^2.$$

Examinees with $p \leq \bar{p}$ omit every question and receive the constant score $\gamma$. Their squared error is the squared bias $sb(\gamma; p) = (s(p) - \gamma)^2$.
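For reference, the binomial moments give a closed form for the answerers' error. This is a sketch derived from the definitions above rather than a formula quoted from the paper. Writing $\tilde{s} = \theta + (1 - \theta)\,\tilde{x}/n$ with $\tilde{x} \sim B(n, p)$:

$$E[\tilde{s}] = p + (1 - p)\theta, \qquad V(\tilde{s}; p) = \frac{(1 - \theta)^2\, p(1 - p)}{n},$$

$$\text{mse}(\theta; p) = \frac{(1 - \theta)^2\, p(1 - p)}{n} + (1 - p)^2 (\theta - \theta^*)^2.$$

The variance term shrinks at rate $1/n$, while the squared-bias term vanishes only when $\theta = \theta^*$.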

Letting $f(p)$ denote the (assumed known) population density of $p$, the designer minimizes the population-average error:

$$\min_{\gamma, \theta}\ \text{MSE}(\gamma, \theta) = \int_{p_0}^{\bar{p}} (s(p) - \gamma)^2 f(p)\,dp + \int_{\bar{p}}^{1} E[(\tilde{s} - s(p))^2] f(p)\,dp.$$
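A minimal numerical sketch of this objective follows. It assumes a uniform ability density on $[1/m, 1]$ (the calibration used in the simulations), uses the binomial mean and variance of $\tilde{s}$ for the answerers' error, and approximates the integrals with a midpoint rule. The function name `population_mse` and the step count are choices made for the example.

```python
import math

def population_mse(gamma, theta, n, m=3, lam=1.5, steps=20000):
    """Population MSE for a uniform ability density on [1/m, 1] (midpoint rule)."""
    p0 = 1.0 / m
    theta_star = -1.0 / (m - 1)
    # Indifference threshold, clamped to the ability support.
    p_bar = (gamma - lam * theta) / (1.0 - lam * theta)
    p_bar = min(max(p_bar, p0), 1.0)
    h = (1.0 - p0) / steps
    total = 0.0
    for i in range(steps):
        p = p0 + (i + 0.5) * h
        true_score = p + (1.0 - p) * theta_star
        if p <= p_bar:
            # Omitters receive the constant score gamma.
            err = (true_score - gamma) ** 2
        else:
            # Answerers: binomial variance plus squared bias.
            variance = (1.0 - theta) ** 2 * p * (1.0 - p) / n
            bias = (1.0 - p) * (theta - theta_star)
            err = variance + bias ** 2
        total += err * h / (1.0 - p0)  # uniform density f(p) = 1 / (1 - p0)
    return total

# Forced answering: gamma is low enough that nobody omits.
rmse_forced = math.sqrt(population_mse(gamma=-0.2, theta=-0.5, n=200))
# Rounded marks reported in the simulations for n = 5 and lambda = 1.5.
rmse_n5 = math.sqrt(population_mse(gamma=0.30, theta=-0.48, n=5))
```

Under these assumptions the second value lands close to the RMSE of about $0.22$ reported for $n = 5$ in §4.2, which suggests the sketch captures the same objective up to rounding of the marks.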

The omission component decomposes into a conditional variance plus a squared bias:

$$\frac{1}{F(\bar{p})} \int_{p_0}^{\bar{p}} (s(p) - \gamma)^2 f(p)\,dp = V_{|\text{omit}}\big(s(p)\big) + \big(\bar{s}(p) - \gamma\big)^2,$$

where $\bar{s}(p)$ is the mean true score among omitters. The conditional-variance term is a lower bound on the omission error that does not vanish with $n$. This is a key asymmetry with the answer component, whose error can be made arbitrarily small by increasing $n$.

---

## 3. Efficient scoring (analytical properties)

When $n \to \infty$, ability is estimated arbitrarily well from answers, so the efficient rule sets $\gamma$ low enough that all examinees answer ($\gamma < p_0 + (1 - p_0)\lambda\theta^*$) and the wrong-answer mark $\hat{\theta} \to \theta^*$.
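As a worked instance of the forced-answer bound, take the calibration used in the simulations ($m = 3$, so $p_0 = 1/3$ and $\theta^* = -0.5$) with $\lambda = 1.5$:

$$p_0 + (1 - p_0)\lambda\theta^* = \frac{1}{3} + \frac{2}{3}\times 1.5 \times (-0.5) = \frac{1}{3} - \frac{1}{2} = -\frac{1}{6} \approx -0.167.$$

Any omission mark below this value makes even a pure guesser prefer to answer. The limiting omission mark of $-0.17$ reported in §4.2 for $n = \infty$ sits at this bound, up to rounding.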

For finite $n$, the first-order conditions are:

$$\frac{\partial \text{MSE}}{\partial \gamma}: \big[sb(\hat{\gamma}; \bar{p}) - \text{mse}(\hat{\theta}; \bar{p})\big] \frac{d\bar{p}}{d\gamma} f(\bar{p}) + \int_{p_0}^{\bar{p}} \frac{\partial sb}{\partial \gamma}(\hat{\gamma}; p) f(p)\,dp = 0,$$

$$\frac{\partial \text{MSE}}{\partial \theta}: \big[sb(\hat{\gamma}; \bar{p}) - \text{mse}(\hat{\theta}; \bar{p})\big] \frac{d\bar{p}}{d\theta} f(\bar{p}) + \int_{\bar{p}}^{1} \frac{\partial \text{mse}}{\partial \theta}(\hat{\theta}; p) f(p)\,dp = 0,$$

with $d\bar{p}/d\gamma = 1/(1 - \lambda\hat{\theta}) > 0$ and $-d\bar{p}/d\theta = (1 - \bar{p})\lambda/(1 - \lambda\hat{\theta}) > 0$. The bracketed term

$$sb(\hat{\gamma}; \bar{p}) - \text{mse}(\hat{\theta}; \bar{p}) = (\hat{\gamma} - s(\bar{p}))^2 - E[(\tilde{s} - s(\bar{p}))^2]$$

is the net effect on MSE of a marginal examinee switching from selection to omission. The system has no closed-form solution in the general case, motivating the simulations in §4.
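The two threshold derivatives entering the first-order conditions can be checked against central finite differences. The evaluation point below ($\gamma = 0.30$, $\theta = -0.48$, $\lambda = 1.5$) is an arbitrary illustrative choice, and the helper name `p_bar` is an assumption of this sketch.

```python
def p_bar(gamma, theta, lam):
    """Indifference threshold between answering and omitting."""
    return (gamma - lam * theta) / (1.0 - lam * theta)

gamma, theta, lam, eps = 0.30, -0.48, 1.5, 1e-6
pb = p_bar(gamma, theta, lam)

# Central finite differences.
d_gamma_numeric = (p_bar(gamma + eps, theta, lam) - p_bar(gamma - eps, theta, lam)) / (2 * eps)
d_theta_numeric = (p_bar(gamma, theta + eps, lam) - p_bar(gamma, theta - eps, lam)) / (2 * eps)

# Analytical expressions stated in the text.
d_gamma_formula = 1.0 / (1.0 - lam * theta)
d_theta_formula = -(1.0 - pb) * lam / (1.0 - lam * theta)
```

Raising $\gamma$ moves the threshold up, while raising $\theta$ (a milder penalty) moves it down, so the two marks push the marginal examinee in opposite directions.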

---

## 4. Simulated efficient scoring

### 4.1 Calibration

- Three loss-aversion levels: $\lambda \in \{1, 1.5, 2.5\}$. The reference value $\lambda = 2.25$ from Tversky and Kahneman (1992) lies between the two non-neutral cases, although it is unclear how a coefficient estimated from monetary outcomes transfers to test marks.
- Ability distribution: uniform on $[p_0, 1]$, with $p_0 = 1/m$.
- Per-question options: $m = 3$, so $\theta^* = -1/(m-1) = -0.5$.
- Test sizes: $n \in \{1, 5, 10, 20, 40, 80, 200, \infty\}$.
- Grid search: $\theta \in [\underline{\theta}, \theta^*]$ and $\bar{p} \in [1/m, 1]$, each discretized into $2{,}500$ points; $\gamma$ is recovered from $\gamma = \bar{p} + (1 - \bar{p})\lambda\theta$. The MSE is evaluated at $2{,}500^2 = 6{,}250{,}000$ pairs.
- Robustness check (not tabulated): varying $m \in \{2, 3, 4, 5\}$ leaves error measures roughly unchanged provided the total option count $m \times n$ is held fixed.
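The grid search can be sketched on a much coarser grid. The code below is an illustration under stated assumptions, not the paper's implementation: it uses a closed-form MSE that holds for the uniform density (binomial variance for answerers, polynomial integrals for both groups), a $201 \times 201$ grid instead of $2{,}500 \times 2{,}500$, and a hypothetical search range $\theta \in [-1, 0]$, since the lower bound $\underline{\theta}$ is not specified here.

```python
import math

def mse_uniform(theta, p_bar, n, m=3, lam=1.5):
    """Closed-form population MSE for a uniform ability density on [1/m, 1]."""
    p0 = 1.0 / m
    theta_star = -1.0 / (m - 1)
    gamma = p_bar + (1.0 - p_bar) * lam * theta  # omission mark implied by p_bar
    a = 1.0 - theta_star                         # slope of s(p) = a * p + theta_star
    # Omitters: integral of (s(p) - gamma)^2 over [p0, p_bar].
    upper = a * p_bar + theta_star - gamma
    lower = a * p0 + theta_star - gamma
    omit = (upper ** 3 - lower ** 3) / (3.0 * a)
    # Answerers: integrals of the variance and squared bias over [p_bar, 1].
    prim = lambda p: p * p / 2.0 - p ** 3 / 3.0  # antiderivative of p * (1 - p)
    var = (1.0 - theta) ** 2 / n * (prim(1.0) - prim(p_bar))
    bias2 = (theta - theta_star) ** 2 * (1.0 - p_bar) ** 3 / 3.0
    return (omit + var + bias2) / (1.0 - p0), gamma

def grid_search(n, points=201, theta_lo=-1.0, theta_hi=0.0, m=3, lam=1.5):
    """Exhaustive search over (theta, p_bar); returns (mse, theta, gamma, p_bar)."""
    p0 = 1.0 / m
    best = None
    for i in range(points):
        theta = theta_lo + (theta_hi - theta_lo) * i / (points - 1)
        for j in range(points):
            p_bar = p0 + (1.0 - p0) * j / (points - 1)
            mse, gamma = mse_uniform(theta, p_bar, n, m, lam)
            if best is None or mse < best[0]:
                best = (mse, theta, gamma, p_bar)
    return best

mse, theta_hat, gamma_hat, p_bar_hat = grid_search(n=10)
rmse = math.sqrt(mse)
share_omitting = (p_bar_hat - 1.0 / 3.0) / (2.0 / 3.0)
```

Searching over $\bar{p}$ rather than $\gamma$ keeps the grid inside the ability support, and $\gamma$ is then read off the indifference condition, as in the calibration above.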

### 4.2 Reference results — moderate loss aversion ($\lambda = 1.5$, $m = 3$)

| Number of questions $n$ | 1 | 5 | 10 | 20 | 40 | 80 | 200 | $\infty$ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Efficient $\hat{\theta}$ (wrong) | $-1.79$ | $-0.48$ | $-0.46$ | $-0.45$ | $-0.45$ | $-0.46$ | $-0.49$ | $-0.50$ |
| Efficient $\hat{\gamma}$ (omit) | $0.59$ | $0.30$ | $0.22$ | $0.16$ | $0.11$ | $0.08$ | $-0.16$ | $-0.17$ |
| Incentive to omit $\hat{\gamma} - \hat{\theta}$ | $2.39$ | $0.78$ | $0.68$ | $0.61$ | $0.57$ | $0.54$ | $0.33$ | $0.33$ |
| Omission bias $100\,(\hat{\gamma} - \bar{s}(p))$ | $17.7$ | $10.81$ | $6.6$ | $3.3$ | $1.0$ | $-0.7$ | $0.00$ | $0.00$ |
| Share omitting (%) | $83.5$ | $39.4$ | $30.6$ | $24.9$ | $21.1$ | $18.6$ | $0.00$ | $0.00$ |
| RMSE | $0.386$ | $0.221$ | $0.166$ | $0.122$ | $0.089$ | $0.066$ | $0.046$ | $0.000$ |

Two scoring regimes emerge as the test size varies. For **short tests** ($n \lesssim 170$ in Figure 1), the omission mark $\hat{\gamma}$ is positive and, for $n \leq 40$ in the table, lies *above* the mean true score of those who omit (positive omission bias): the designer trades estimator bias against the variance reduction obtained by routing the least informed into omission. The bias shrinks as $n$ grows and is slightly negative at $n = 80$. For **long tests**, $\hat{\gamma}$ jumps to negative values around $-0.16$, the omission rate falls to zero, and selection is forced because ability is now precisely estimated even for low-$p$ examinees. In the table, the incentive to omit $\hat{\gamma} - \hat{\theta}$ drops from $0.54$ at $n = 80$ to $0.33$ at $n = 200$, on either side of the regime switch.

The wrong-answer penalty $\hat{\theta}$ stays close to the notional $\theta^* = -0.5$ across all $n > 1$, suggesting that fixing $\theta = \theta^*$ is a good approximation in practice. The omission mark is much more sensitive to $n$ than the wrong-answer mark — consistent with the observation that omission marks target only the low-ability subpopulation, whereas the wrong-answer mark applies to everyone.
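The rows of the table are linked by the indifference condition of §2.3: given $(\hat{\theta}, \hat{\gamma})$ and a uniform ability density, the share of omitters follows from $\bar{p}$. The sketch below recomputes that share from the rounded marks and compares it with the reported row. The helper `implied_share` is a name introduced for this example, and small gaps are expected because the marks are rounded to two decimals.

```python
# Rounded values from the lambda = 1.5 table: n -> (theta_hat, gamma_hat, reported share in %).
TABLE = {
    1: (-1.79, 0.59, 83.5),
    5: (-0.48, 0.30, 39.4),
    10: (-0.46, 0.22, 30.6),
    20: (-0.45, 0.16, 24.9),
    40: (-0.45, 0.11, 21.1),
    80: (-0.46, 0.08, 18.6),
}

def implied_share(theta, gamma, lam=1.5, m=3):
    """Share of omitters (%) implied by the marks under a uniform ability density."""
    p0 = 1.0 / m
    p_bar = (gamma - lam * theta) / (1.0 - lam * theta)
    p_bar = min(max(p_bar, p0), 1.0)  # clamp to the ability support
    return 100.0 * (p_bar - p0) / (1.0 - p0)

# Gap between the implied and the reported share, in percentage points.
gaps = {n: implied_share(t, g) - s for n, (t, g, s) in TABLE.items()}
largest_gap = max(abs(v) for v in gaps.values())
```

The implied shares stay within about one percentage point of the reported ones, so the marks and omission rates in the table are mutually consistent.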

> **Author's critical reading.** The omission error is bounded below by the conditional-variance term (§2.4), which does not vanish with $n$. Pooling too many examinees into omission therefore exposes the score to an irreducible measurement error from heterogeneous partial knowledge. The efficient share of omitters reflects a trade-off, not an unconditional preference for omission.

### 4.3 Efficient scoring and risk preferences

Loss aversion has two opposing effects on the omission rate (Table 1):

| Number of questions $n$ | 1 | 5 | 10 | 20 | 40 | 80 | 200 | $\infty$ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Risk neutrality ($\lambda = 1$) | $43.4$ | $26.0$ | $19.6$ | $14.4$ | $10.5$ | $7.6$ | $4.9$ | $0.00$ |
| Moderate ($\lambda = 1.5$) | $83.5$ | $39.4$ | $30.6$ | $24.9$ | $21.1$ | $18.6$ | $0.00$ | $0.00$ |
| Strong ($\lambda = 2.5$) | $85.7$ | $46.8$ | $38.0$ | $32.8$ | $29.9$ | $0.00$ | $0.00$ | $0.00$ |

*Share of examinees omitting (%) under the efficient rule, $\theta^* = -0.5$.*

For a given $n$, more loss-averse examinees omit more often. But the regime switch to forced answering occurs at smaller $n$ as $\lambda$ rises: the threshold sits between $n = 200$ and $\infty$ for $\lambda = 1$, between $80$ and $200$ for $\lambda = 1.5$, and between $40$ and $80$ for $\lambda = 2.5$. The intuition is that a higher $\lambda$ lets the designer reach any target omission share with a less distorted (lower) $\gamma$, which reduces the cost of switching to a forced-answer regime.

Loss aversion also lowers the population RMSE in short tests (Table 2):

| Number of questions $n$ | 1 | 5 | 10 | 20 | 40 | 80 | 200 | $\infty$ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Risk neutrality ($\lambda = 1$) | $0.406$ | $0.241$ | $0.181$ | $0.133$ | $0.097$ | $0.070$ | $0.045$ | $0.000$ |
| Moderate ($\lambda = 1.5$) | $0.386$ | $0.221$ | $0.166$ | $0.122$ | $0.089$ | $0.066$ | $0.046$ | $0.000$ |
| Strong ($\lambda = 2.5$) | $0.324$ | $0.196$ | $0.151$ | $0.119$ | $0.097$ | $0.072$ | $0.046$ | $0.000$ |

The efficiency gain is substantial for $n \leq 20$, narrows at $n = 40$ (where the RMSE under $\lambda = 2.5$ equals the risk-neutral value of $0.097$), and disappears for larger tests. Loss aversion partly substitutes for biased omission marks: when the least able omit on their own, the designer needs less upward distortion of $\gamma$ to obtain the efficient sorting.
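The entries of Table 2 in the forced-answer regime can be reproduced by hand. As a sketch, assume that nobody omits, that $\hat{\theta} = \theta^* = -0.5$ exactly, and that ability is uniform on $[1/3, 1]$. The squared bias is then zero and only the binomial variance remains:

$$E[p(1 - p)] = \frac{3}{2}\int_{1/3}^{1} (p - p^2)\,dp = \frac{3}{2}\cdot\frac{10}{81} = \frac{5}{27}, \qquad \text{MSE} = \frac{(1 - \theta^*)^2}{n}\cdot\frac{5}{27} = \frac{5}{12\,n}.$$

This gives $\text{RMSE} \approx 0.072$ at $n = 80$ and $\approx 0.046$ at $n = 200$, matching the table entries for which the share of omitters is zero. In that regime $\lambda$ no longer affects the error, which is why the rows converge.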

---

## 5. Relation to existing scoring methods

**Number Right Scoring (NRS)** sets $\theta = \gamma = 0$. Pure random guessers earn an expected score of $1/m > 0$.

**Formula Scoring (FS)** sets $\gamma = 0$ and $\theta = -1/(m-1)$, equalizing the expected score of pure guessing and omission (Thurstone 1919; Holzinger 1924).
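The two rules differ in what a pure random guesser can expect per item. A short check with exact fractions, where the helper name `guess_expected_mark` is introduced only for this illustration:

```python
from fractions import Fraction

def guess_expected_mark(m, theta):
    """Expected per-item mark of a pure random guesser facing m options."""
    p = Fraction(1, m)
    return p * 1 + (1 - p) * theta

options = (2, 3, 4, 5)
# Number Right Scoring: no penalty for wrong selections.
nrs = {m: guess_expected_mark(m, Fraction(0)) for m in options}
# Formula Scoring: penalty of -1/(m - 1) for wrong selections.
fs = {m: guess_expected_mark(m, Fraction(-1, m - 1)) for m in options}
```

Under NRS guessing earns $1/m$ in expectation, strictly more than the omission mark of zero, whereas under FS guessing and omitting are both worth zero in expectation.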

The psychometric debate over NRS vs FS has focused on the wrong-answer mark: FS is praised for raising reliability through implicit encouragement of omission (Lord 1975; Mattson 1975; Burton 2001) and criticized for confounding ability with risk attitudes (Votaw 1936; Frary 1988; Budescu and Bar-Hillel 1993).

The model relocates the debate. Both NRS and FS impose $\gamma = 0$ regardless of $n$, which is generically inefficient: short tests call for $\gamma > 0$ (positive points for omission to credit partial knowledge and incentivize self-selection by the least able), large tests for $\gamma < 0$ (force everyone to answer). FS attempts to incentivize omission only through the wrong-answer penalty, a less precisely targeted lever. The efficient mark $\hat{\theta}$ is in fact close to $\theta^* = -1/(m-1)$ across calibrations, so a fixed FS-style penalty is approximately right — the missing piece is a test-size-dependent omission mark.

---

## 6. Conclusion

Three lessons are extracted from the model.

1. **Use long tests when feasible.** Simulations suggest at least 40 and ideally up to 100 questions to exploit the law of large numbers. Constructing many high-quality items is itself a constraint, and inefficiencies from poorly written, ambiguous or redundant items are a separate source of error not modeled here.
2. **Target a decreasing share of omitters as the test grows.** Short tests warrant positive $\gamma$ to encourage omission by the least able; long tests warrant negative $\gamma$ to force selection. Test instructions should be aligned with the rule: candidates should be told to omit when uncertain in short tests, and to answer everything in long ones.
3. **Target low-ability examinees through the omission mark, not the wrong-answer penalty.** This reframes the long-standing NRS–FS debate, which has focused exclusively on the value of the wrong-answer mark.

### Limitations and research extensions

- **Overconfidence.** Examinees are typically overconfident (Keren 1991; Yates 1990; Lichtenstein and Fischhoff 1977; Heath and Tversky 1991), which lowers omission and may interact with ability. Misinformation (Burton 2004) is similarly outside the model.
- **Heterogeneous question difficulty.** Real tests have items of varying difficulty (e.g. increasing through the test); per-item adjustment of marks could be efficient.
- **Heterogeneous risk preferences.** The model assumes examinees differ only in ability. Akyol, Key and Krishna (2022) document that risk-preference heterogeneity affects responses to penalties, requiring richer estimation procedures.
- **Ranking-based criteria.** MSE is not the right loss when relative ranking is the primary aim of the exam; rank-statistic optimization typically requires non-parametric methods.

---

## Acknowledgments

The author thanks Marcel Voia, Christoph Heinzel, the two journal referees, and participants of the AFSE 2019 Conference and the 2024 International Niort Conference on Economic and Financial Risks for helpful comments.

---

## Main references

Akyol, P., Key, J., Krishna, K. (2022). Hit or miss? Test taking behavior in multiple choice exams. *Annals of Economics and Statistics* 147, 3–50.

Bereby-Meyer, Y., Meyer, J., Flascher, O. M. (2002). Prospect theory analysis of guessing in multiple choice tests. *Journal of Behavioral Decision Making* 15, 313–327.

Budescu, D. V., Bar-Hillel, M. (1993). To guess or not to guess: A decision-theoretic view of formula scoring. *Journal of Educational Measurement* 30(4), 277–291.

Budescu, D. V., Bo, Y. (2015). Analyzing test-taking behavior: Decision theory meets psychometric theory. *Psychometrika* 80(4), 1105–1122.

Burton, R. F. (2001). Quantifying the effects of chance in multiple choice and true/false tests: question selection and guessing of answers. *Assessment and Evaluation in Higher Education* 26(1), 41–50.

Espinosa, M. P., Gardeazabal, J. (2010). Optimal correction for guessing in multiple-choice tests. *Journal of Mathematical Psychology* 54(5), 415–425.

Frary, R. B. (1988). Formula scoring of multiple-choice tests (correction for guessing). *Educational Measurement: Issues and Practice* 7, 33–38.

Harvill, L. M. (1991). Standard error of measurement. *Educational Measurement: Issues and Practice* 10, 33–41.

Holzinger, K. J. (1924). On scoring multiple-response tests. *Journal of Educational Psychology* 15, 445–447.

Iriberri, N., Rey-Biel, P. (2021). Brave boys and play-it-safe girls: gender differences in willingness to guess in a large scale natural field experiment. *European Economic Review* 131, 103603.

Kahneman, D., Tversky, A. (1979). Prospect theory: An analysis of decision under risk. *Econometrica* 47(2), 263–292.

Lord, F. M. (1975). Formula scoring and number-right scoring. *Journal of Educational Measurement* 12, 7–12.

Pekkarinen, T. (2015). Gender differences in behaviour under competitive pressure: Evidence on omission patterns in university entrance examinations. *Journal of Economic Behavior and Organization* 115, 94–110.

Thurstone, L. L. (1919). A method for scoring tests. *Psychological Bulletin* 16, 235–240.

Tversky, A., Kahneman, D. (1981). The framing of decisions and the psychology of choice. *Science* 211, 453–458.

Tversky, A., Kahneman, D. (1992). Advances in prospect theory: Cumulative representation of uncertainty. *Journal of Risk and Uncertainty* 5, 297–323.

*The full reference list appears in the PDF.*
