Theory
Understanding the mechanism behind missing data is essential for realistic simulation and effective imputation strategies. In statistics, missingness is typically classified into three types: MCAR, MAR, and MNAR.
Introduction
Let \(\boldsymbol{X} \in \mathbb{R}^{n \times k}\) be a complete data matrix and \(\boldsymbol{M} \in \{0,1\}^{n \times k}\) be a missingness mask, where \(M_{ij} = 1\) denotes an observed value and \(M_{ij} = 0\) a missing one. The conditional distribution of missingness is:
\[ f(\boldsymbol{M} \mid \boldsymbol{X}, \Psi) \]
where \(\Psi\) denotes the parameters governing the missingness process. Depending on how \(f\) depends on \(\boldsymbol{X}\), we obtain different mechanisms.
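The notation can be made concrete with a small array example. This is a minimal sketch in plain NumPy, not MissMecha's API; the matrix values and the variable names (`X_incomplete`, `M_recovered`) are made up for illustration.

```python
import numpy as np

# Hypothetical 4 x 3 complete data matrix X (values chosen only for illustration).
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

# Mask M with the convention used in this section: 1 = observed, 0 = missing.
M = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
])

# Apply the mask: keep X where M == 1 and write NaN where M == 0.
X_incomplete = np.where(M == 1, X, np.nan)

# The mask can be read back from the incomplete matrix.
M_recovered = (~np.isnan(X_incomplete)).astype(int)

# Overall fraction of missing cells.
missing_rate = 1.0 - M.mean()
```

Note that some libraries use the opposite convention (1 = missing), so it is worth checking which one applies before combining masks from different sources.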
Note
MissMecha supports simulation under MCAR, MAR, and MNAR for both numerical and categorical data.
Missing Completely At Random (MCAR)
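Under MCAR, the missingness does not depend on either the observed or missing values:
\[ f(\boldsymbol{M} \mid \boldsymbol{X}, \Psi) = f(\boldsymbol{M} \mid \Psi) \]
This means that the missingness pattern is statistically independent of the data: knowing any value in \(\boldsymbol{X}\) says nothing about which cells are missing.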
Example: Suppose a storage glitch randomly deletes cells from a spreadsheet. The probability of missingness is unrelated to the data itself.
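The spreadsheet glitch can be mimicked with a few lines of NumPy. This is a sketch under simple assumptions, not MissMecha's implementation: the helper `mcar_mask` is hypothetical, every cell shares one missing probability (playing the role of \(\Psi\)), and the data are synthetic.

```python
import numpy as np

def mcar_mask(shape, missing_rate, seed=0):
    """Return a mask (1 = observed, 0 = missing) drawn without looking at any data."""
    rng = np.random.default_rng(seed)
    # Each cell is dropped independently with the same probability.
    return (rng.random(shape) >= missing_rate).astype(int)

# Synthetic complete data; the mask below never consults these values.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(10_000, 3))

M = mcar_mask(X.shape, missing_rate=0.2)
X_mcar = np.where(M == 1, X, np.nan)

observed_rate = M.mean()

# Under MCAR the observed values are a random subsample,
# so their mean should stay close to the mean of the complete column.
mean_gap = abs(np.nanmean(X_mcar[:, 0]) - X[:, 0].mean())
```

Because the mask is generated from the shape alone, any statistic of the observed cells is an unbiased (if noisier) estimate of the complete-data statistic.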
Missing At Random (MAR)
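Under MAR, the missingness depends on observed values only:
\[ f(\boldsymbol{M} \mid \boldsymbol{X}, \Psi) = f(\boldsymbol{M} \mid \boldsymbol{X}_{\text{obs}}, \Psi) \]
where \(\boldsymbol{X}_{\text{obs}}\) denotes the observed entries of \(\boldsymbol{X}\). This allows us to condition on known variables when modeling the missingness process.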
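Example: If all missing salaries belong to female employees, and gender is fully observed, then the missingness can be explained by the gender column, provided that among female employees the chance of a missing salary does not depend on the salary itself.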
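The salary example can be simulated as follows. This is a minimal sketch in plain NumPy, not MissMecha's API; the data are synthetic, and the 40% missing probability for female employees is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Fully observed covariate: 1 = female, 0 = male (synthetic).
female = rng.integers(0, 2, size=n)

# Synthetic salaries, generated independently of gender to keep the sketch simple.
salary = rng.normal(50_000, 10_000, size=n)

# The missing probability depends only on the observed gender column,
# never on the salary value itself.
p_missing = np.where(female == 1, 0.4, 0.0)
missing = rng.random(n) < p_missing
salary_mar = np.where(missing, np.nan, salary)

rate_female = missing[female == 1].mean()
rate_male = missing[female == 0].mean()

# Within the female group, missing and observed salaries should look alike,
# because salary played no role in deciding which cells were dropped.
gap_within_female = abs(
    salary[(female == 1) & missing].mean()
    - salary[(female == 1) & ~missing].mean()
)
```

The within-group comparison is only possible here because the simulation keeps the complete `salary` column; with real data the missing values are unavailable, which is why MAR is an assumption rather than something that can be verified directly.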
Missing Not At Random (MNAR)
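Under MNAR, the missingness depends on the missing values themselves, even after conditioning on observed data:
\[ f(\boldsymbol{M} \mid \boldsymbol{X}, \Psi) = f(\boldsymbol{M} \mid \boldsymbol{X}_{\text{obs}}, \boldsymbol{X}_{\text{mis}}, \Psi) \]
where \(\boldsymbol{X}_{\text{obs}}\) and \(\boldsymbol{X}_{\text{mis}}\) denote the observed and missing entries of \(\boldsymbol{X}\), and the dependence on \(\boldsymbol{X}_{\text{mis}}\) cannot be dropped. This makes modeling and imputation more difficult, as missing values carry information about why they are missing.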
Example: Individuals with very high income might avoid reporting their income. In this case, missingness is driven by the unobserved value itself.
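The income example corresponds to a self-masking rule, sketched below in plain NumPy. This is not MissMecha's API: the income distribution, the 80th-percentile cutoff, and the two non-response probabilities are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Synthetic, right-skewed incomes.
income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

# Self-masking: the chance of non-response depends on the income value itself.
threshold = np.quantile(income, 0.8)
p_missing = np.where(income > threshold, 0.7, 0.05)
missing = rng.random(n) < p_missing
income_mnar = np.where(missing, np.nan, income)

true_mean = income.mean()
observed_mean = np.nanmean(income_mnar)

rate_high = missing[income > threshold].mean()
rate_low = missing[income <= threshold].mean()
```

Since high incomes are removed preferentially, `observed_mean` falls below `true_mean`, and no fully observed column is available to explain the gap.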
Illustrative Table
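The table below (from ~cite{Missing_Mechanisms}) illustrates how different mechanisms affect a toy dataset. The MCAR, MAR, and MNAR columns show the Ratings values that remain observed under each mechanism, with ? marking a missing entry: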
| IQ | Ratings | MCAR | MAR | MNAR |
|---|---|---|---|---|
| 78 | 9 | ? | ? | 9 |
| 84 | 13 | 13 | ? | 13 |
| 85 | 8 | 8 | ? | ? |
| 105 | 11 | ? | 11 | 11 |
| 118 | 16 | 16 | 16 | 16 |
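The toy table can be loaded into arrays to see how each mechanism shifts a simple summary. This is a minimal sketch that uses only the five rows shown above, with `?` encoded as NaN; the variable names are illustrative.

```python
import numpy as np

nan = np.nan
iq = np.array([78, 84, 85, 105, 118], dtype=float)

# Ratings column and its three incomplete versions, copied from the table.
columns = {
    "complete": np.array([9, 13, 8, 11, 16], dtype=float),
    "MCAR": np.array([nan, 13, 8, nan, 16]),
    "MAR": np.array([nan, nan, nan, 11, 16]),
    "MNAR": np.array([9, 13, nan, 11, 16]),
}

# Mean of the ratings that remain observed under each mechanism.
observed_means = {name: float(np.nanmean(col)) for name, col in columns.items()}

# In the MAR column, the missing rows are exactly the ones with the lowest IQ,
# so the fully observed IQ column separates missing from observed ratings.
mar_missing = np.isnan(columns["MAR"])
max_iq_missing = iq[mar_missing].max()
min_iq_observed = iq[~mar_missing].min()
```

With only five rows these means are not statistically meaningful; the point is that in the MAR column the split between missing and observed ratings lines up with the observed IQ values, whereas in the MNAR column the missing rating is the lowest one.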
References
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley-Interscience.
Enders, C. K. (2010). Applied Missing Data Analysis. The Guilford Press.
Gomer, B., & Yuan, K. H. (2021). Subtypes of the Missing Not At Random Missing Data Mechanism. *Psychological Methods, 26*(5), 559–598. https://doi.org/10.1037/met0000377