Generator Demo
This notebook demonstrates how to use MissMecha to simulate missingness patterns under three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
We will:
- Create a synthetic dataset
- Apply column-specific missingness
- Systematically simulate various mechanisms
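As a conceptual reference for the three mechanisms, the following minimal sketch uses only NumPy and pandas. It is not MissMecha's implementation: the probabilities (0.3, 0.5, 0.1), the age threshold, and the column suffixes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 60, size=n),
    "income": rng.normal(60000, 10000, size=n),
})

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness in income depends on an observed column (age).
# Assumption for illustration: people aged 40+ skip the question more often.
p_mar = np.where(df["age"] >= 40, 0.5, 0.1)
mar_mask = rng.random(n) < p_mar

# MNAR: missingness in income depends on the income value itself.
# Assumption for illustration: above-median incomes are hidden more often.
p_mnar = np.where(df["income"] > df["income"].median(), 0.5, 0.1)
mnar_mask = rng.random(n) < p_mnar

# Series.mask replaces entries with NaN wherever the condition is True.
for name, mask in [("mcar", mcar_mask), ("mar", mar_mask), ("mnar", mnar_mask)]:
    df[f"income_{name}"] = df["income"].mask(mask)

# Realized share of missing values under each mechanism.
print(df.filter(like="income_").isna().mean().round(3))
```

Under MNAR the mean of the observed incomes is biased downward, because the values most likely to be removed are the large ones.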
Import Required Libraries
[8]:
import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
from missmecha.analysis import compute_missing_rate
Generate a Sample Dataset
[9]:
np.random.seed(42)
X_train = pd.DataFrame({
    "age": np.random.randint(20, 60, size=200),
    "income": np.random.normal(60000, 10000, size=200),
    "gender": np.random.choice([0, 1], size=200)
})
X_train.head()
[9]:
| | age | income | gender |
|---|---|---|---|
| 0 | 58 | 50094.636749 | 0 |
| 1 | 48 | 54337.022704 | 0 |
| 2 | 34 | 60996.513651 | 1 |
| 3 | 27 | 54965.243459 | 0 |
| 4 | 40 | 44493.365689 | 0 |
Check Number of Available Mechanisms
[10]:
from missmecha.generator import MECHANISM_LOOKUP
print(f"Supported MCAR types: {len(MECHANISM_LOOKUP['mcar'])}")
print(f"Supported MAR types: {len(MECHANISM_LOOKUP['mar'])}")
print(f"Supported MNAR types: {len(MECHANISM_LOOKUP['mnar'])}")
Supported MCAR types: 3
Supported MAR types: 8
Supported MNAR types: 6
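Column-Wise Missingness Simulation Example
MissMecha allows flexible missingness configuration per column via the info dictionary. Each key is a column name, and each value sets the mechanism, its type, the target rate, and any mechanism-specific options (in this example, depend_on for the MAR column and parameter for the MNAR column).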
[11]:
columnwise_df = X_train.copy()
info = {
    "age": {
        "mechanism": "mcar",
        "type": 1,
        "rate": 0.2
    },
    "income": {
        "mechanism": "mar",
        "type": 3,
        "rate": 0.3,
        "depend_on": "age"
    },
    "gender": {
        "mechanism": "mnar",
        "type": 4,
        "rate": 0.4,
        "parameter": {"q": 0.1, "p": 0.8, "cut": "low"}
    }
}
gen_info = MissMechaGenerator(info=info, cat_cols=["gender"])
df_missing_info = gen_info.fit_transform(columnwise_df)
print("Dataset with column-specific missingness:")
df_missing_info.head()
[MARType3] No label provided. Using synthetic labels instead.
Dataset with column-specific missingness:
[11]:
| | age | income | gender |
|---|---|---|---|
| 0 | 58.0 | NaN | 0.0 |
| 1 | 48.0 | NaN | 0.0 |
| 2 | 34.0 | 60996.513651 | 1.0 |
| 3 | 27.0 | 54965.243459 | NaN |
| 4 | 40.0 | 44493.365689 | NaN |
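After a column-wise simulation it can be useful to check how closely each column's realized share of missing values matches the rate requested in info. The sketch below does this with plain pandas. The helper name compare_rates and the small stand-in frame are hypothetical and used only for illustration; in the notebook, df_missing_info and info would be passed instead.

```python
import numpy as np
import pandas as pd

def compare_rates(df_missing: pd.DataFrame, info: dict) -> pd.DataFrame:
    """Hypothetical helper: requested rate vs. realized NaN share per column."""
    realized = df_missing.isna().mean()
    target = pd.Series({col: cfg["rate"] for col, cfg in info.items()})
    out = pd.DataFrame({
        "target": target,
        "realized": realized.reindex(target.index),
    })
    out["diff"] = out["realized"] - out["target"]
    return out

# Stand-in data (not produced by MissMecha): 2 of 10 ages and 3 of 10 incomes are NaN.
demo = pd.DataFrame({
    "age": [58, np.nan, 34, 27, 40, 31, np.nan, 45, 52, 29],
    "income": [np.nan, np.nan, 61000.0, 55000.0, 44500.0,
               np.nan, 70200.0, 58100.0, 49900.0, 63300.0],
})
demo_info = {"age": {"rate": 0.2}, "income": {"rate": 0.3}}

report = compare_rates(demo, demo_info)
print(report)
```

With only 200 rows, as in this notebook, some deviation between target and realized rates is expected for mechanisms that sample each cell independently.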
Accessing Missingness Mask
[14]:
# Binary mask: 1 = observed, 0 = missing
mask = gen_info.get_mask()
print(mask[:5])
# Boolean mask: True = observed, False = missing
bool_mask = gen_info.get_bool_mask()
print(bool_mask[:5])
[[1 0 1]
[1 0 1]
[1 1 1]
[1 1 0]
[1 1 0]]
[[ True False True]
[ True False True]
[ True True True]
[ True True False]
[ True True False]]
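The same mask convention (1 or True for observed, 0 or False for missing) can be derived from any DataFrame that contains NaN values. The sketch below uses only pandas and NumPy on a small stand-in frame; it does not call MissMecha, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few missing entries.
df_missing = pd.DataFrame({
    "age": [58.0, 48.0, 34.0],
    "income": [np.nan, np.nan, 60996.5],
    "gender": [0.0, 0.0, np.nan],
})

# Boolean mask: True = observed, False = missing.
bool_mask = df_missing.notna().to_numpy()

# Binary mask: 1 = observed, 0 = missing.
int_mask = bool_mask.astype(int)

# Per-column missing rate is one minus the column mean of the binary mask.
missing_rate = 1.0 - int_mask.mean(axis=0)

# Re-apply the same missingness pattern to a complete array of equal shape.
complete = np.arange(9, dtype=float).reshape(3, 3)
masked = np.where(bool_mask, complete, np.nan)

print(int_mask)
print(missing_rate)
```

Keeping the mask separate from the data makes it possible to score an imputation method only on the cells that were removed.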
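Systematically Test Different Missing Mechanisms
We will:
- Loop through mechanism types (the cell below covers the three MCAR types; MAR and MNAR can be added to missing_type_list)
- Try different missingness rates (30%, 70%)
- Report the missingness structure
- Optionally follow up with Little’s MCAR test (not run in the cell shown here)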
[12]:
missing_type_list = ["mcar"]
mechanism_type_list = [1, 2, 3]
missing_rate_list = [0.3, 0.7]
for missing_type in missing_type_list:
    for mechanism_type in mechanism_type_list:
        for missing_rate in missing_rate_list:
            print(f"\nMechanism: {missing_type.upper()}-{mechanism_type} | Missing rate: {missing_rate}")
            # Initialize generator
            mecha = MissMechaGenerator(
                mechanism=missing_type,
                mechanism_type=mechanism_type,
                missing_rate=missing_rate,
                seed=42
            )
            # Fit and apply missingness
            X_missing = mecha.fit_transform(X_train)
            # Report missing rate summary
            print("Missingness summary:")
            compute_missing_rate(X_missing)
            print("-----------------------------------------------------------")
Mechanism: MCAR-1 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.17%
181 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| income | 67 | 33.5 | 133 | float64 | 200 |
| gender | 58 | 29.0 | 2 | float64 | 200 |
| age | 56 | 28.0 | 38 | float64 | 200 |
-----------------------------------------------------------
Mechanism: MCAR-1 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.83%
425 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| gender | 144 | 72.0 | 2 | float64 | 200 |
| income | 141 | 70.5 | 59 | float64 | 200 |
| age | 140 | 70.0 | 32 | float64 | 200 |
-----------------------------------------------------------
Mechanism: MCAR-2 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.00%
180 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| gender | 64 | 32.0 | 2 | float64 | 200 |
| age | 60 | 30.0 | 39 | float64 | 200 |
| income | 56 | 28.0 | 144 | float64 | 200 |
-----------------------------------------------------------
Mechanism: MCAR-2 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.00%
420 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| age | 144 | 72.0 | 30 | float64 | 200 |
| gender | 141 | 70.5 | 2 | float64 | 200 |
| income | 135 | 67.5 | 65 | float64 | 200 |
-----------------------------------------------------------
Mechanism: MCAR-3 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.00%
180 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| age | 60 | 30.0 | 38 | float64 | 200 |
| income | 60 | 30.0 | 140 | float64 | 200 |
| gender | 60 | 30.0 | 2 | float64 | 200 |
-----------------------------------------------------------
Mechanism: MCAR-3 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.00%
420 / 600 total values are missing.
Top variables by missing rate:
| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|---|---|---|---|---|---|
| age | 140 | 70.0 | 30 | float64 | 200 |
| income | 140 | 70.0 | 60 | float64 | 200 |
| gender | 140 | 70.0 | 2 | float64 | 200 |
-----------------------------------------------------------
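The summary tables above can be approximated with plain pandas, which helps when inspecting a dataset outside MissMecha. The function below is a hypothetical stand-in, not the library's compute_missing_rate; it only mirrors the column names seen in the printed output, and the rounding and sort order are assumptions.

```python
import numpy as np
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a per-column missingness summary."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "missing_rate (%)": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),        # NaN is not counted as a value
        "dtype": df.dtypes.astype(str),
        "n_total": len(df),
    })
    summary.index.name = "column"
    # Columns with the most missing values come first.
    return summary.sort_values("missing_rate (%)", ascending=False)

# Small stand-in frame with 3 missing cells out of 8.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 2.0],
})

summary = summarize_missing(demo)
overall = demo.isna().to_numpy().mean()

print(f"Overall missing rate: {overall:.2%}")
print(summary)
```

The n_unique column counts observed values only, so it shrinks as the missing rate grows, consistent with the drop for income in the outputs above (133 unique values at MCAR-1 with a 0.3 rate, 59 at 0.7).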
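Key Takeaways
This batch experiment shows how the MCAR variants differ in missingness structure. In the runs above, MCAR-1 produces per-column rates that fluctuate around the target (28.0% to 33.5% at a 30% target, 30.17% overall), MCAR-2 matches the target overall (30.00%) while per-column rates still vary (28.0% to 32.0%), and MCAR-3 matches the target in every column.
MissMecha makes it easy to systematically evaluate and validate simulation settings.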
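In the outputs above, per-column rates under MCAR-1 scatter around the target, while under MCAR-3 each column has exactly 60 or 140 missing values. One plausible reading, which is an assumption and not confirmed by the MissMecha source, is the difference between drawing each cell independently and removing a fixed number of cells per column. The NumPy sketch below contrasts the two strategies.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols, rate = 200, 3, 0.3

# Strategy A: independent Bernoulli draw for every cell.
# The realized rate per column is random and only close to `rate`.
bernoulli_missing = rng.random((n_rows, n_cols)) < rate

# Strategy B: remove an exact number of cells in each column.
# The realized rate per column equals `rate` whenever rate * n_rows is an integer.
n_drop = int(round(rate * n_rows))
exact_missing = np.zeros((n_rows, n_cols), dtype=bool)
for j in range(n_cols):
    rows = rng.choice(n_rows, size=n_drop, replace=False)
    exact_missing[rows, j] = True

print("Bernoulli per-column rates:", bernoulli_missing.mean(axis=0))
print("Exact-count per-column rates:", exact_missing.mean(axis=0))
```

Both strategies are MCAR in the sense that missingness does not depend on the data. They differ only in how tightly the realized rate is controlled, which matters when comparing methods across small datasets.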