Analysis Demo
This notebook demonstrates how to use MissMecha’s analysis module to:
Summarize missingness
Perform baseline imputation
Evaluate imputation quality
Test missingness mechanism (MCAR vs non-MCAR)
Import Required Libraries
[1]:
import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
from missmecha.impute import SimpleSmartImputer
from missmecha.analysis import compute_missing_rate, evaluate_imputation, MCARTest
Create a Synthetic Dataset
[2]:
np.random.seed(42)
data = pd.DataFrame({
    "age": np.random.randint(20, 65, size=100),
    "income": np.random.normal(60000, 10000, size=100),
    "gender": np.random.choice([0, 1], size=100)
})
data.head()
[2]:
 | age | income | gender |
---|---|---|---|
0 | 58 | 69305.844008 | 0 |
1 | 48 | 66777.674097 | 1 |
2 | 34 | 66984.402592 | 1 |
3 | 62 | 61736.020637 | 1 |
4 | 27 | 66622.845136 | 1 |
We simulate a mixed-type dataset including both numerical and categorical variables.
Generate Missingness (MCAR)
Apply a Missing Completely At Random (MCAR) pattern with a target missing rate of 50%.
[3]:
mecha = MissMechaGenerator(mechanism="mcar", missing_rate=0.5)
mcar_missing = mecha.fit_transform(data)
mcar_missing.head()
[3]:
 | age | income | gender |
---|---|---|---|
0 | 58.0 | 69305.844008 | NaN |
1 | 48.0 | NaN | NaN |
2 | 34.0 | NaN | 1.0 |
3 | NaN | 61736.020637 | 1.0 |
4 | NaN | 66622.845136 | NaN |
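To make the MCAR idea concrete, here is a minimal NumPy/pandas sketch of one way to hide cells completely at random. It is not MissMecha's implementation: the helper name apply_mcar, the seed, and the convention that True marks a hidden cell are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def apply_mcar(df: pd.DataFrame, missing_rate: float, seed: int = 0):
    """Hide each cell independently with probability `missing_rate`.

    Hypothetical helper for illustration; not part of MissMecha.
    """
    rng = np.random.default_rng(seed)
    # True marks a cell to hide. The draw never looks at the data values,
    # which is what makes the mechanism "completely at random".
    mask = rng.random(df.shape) < missing_rate
    # DataFrame.mask replaces entries with NaN wherever the condition is True.
    return df.mask(mask), mask

demo = pd.DataFrame({"a": np.arange(5000.0), "b": np.arange(5000.0) * 2})
masked, mask = apply_mcar(demo, missing_rate=0.5)
print(f"realized missing rate: {mask.mean():.3f}")
```

Because each cell is drawn independently, the realized rate only approximates the target, which matches the 51% overall rate reported for a 50% target in this notebook.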
Compute Missing Rate
[4]:
missing_rate = compute_missing_rate(mcar_missing)
Overall missing rate: 51.00%
153 / 300 total values are missing.
Top variables by missing rate:
column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
---|---|---|---|---|---|
gender | 55 | 55.0 | 2 | float64 | 100 |
income | 51 | 51.0 | 49 | float64 | 100 |
age | 47 | 47.0 | 32 | float64 | 100 |
This summarizes:
Overall missing rate
Per-column missing rates
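For readers who want to see what such a summary involves, the following pandas-only sketch computes the same kinds of quantities on a tiny hand-made frame. It is an approximation of the report's contents, not the code behind compute_missing_rate(); the frame toy and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: 3 of its 8 cells are missing.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

n_missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": n_missing,
    "missing_rate (%)": 100 * n_missing / len(toy),
    "n_unique": toy.nunique(),        # nunique() skips NaN by default
    "dtype": toy.dtypes.astype(str),
    "n_total": len(toy),
}).sort_values("missing_rate (%)", ascending=False)

# Overall rate: fraction of all cells that are NaN.
overall = toy.isna().to_numpy().mean()
print(f"Overall missing rate: {overall:.2%}")
print(summary)
```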
Impute Missing Values
Use SimpleSmartImputer to fill missing values.
[5]:
imp = SimpleSmartImputer(cat_cols=["gender"])
data_imputed = imp.fit_transform(mcar_missing)
[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 41.45283018867924
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 60593.13322338924
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = 1.0
By specifying cat_cols, the imputer knows to treat gender as categorical (mode imputation).
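The log lines above are consistent with mean imputation for numerical columns and mode imputation for categorical ones. A rough pandas equivalent of that baseline might look like the sketch below; baseline_impute is a hypothetical helper, not the SimpleSmartImputer source, and the real class may handle type detection and edge cases differently.

```python
import numpy as np
import pandas as pd

def baseline_impute(df: pd.DataFrame, cat_cols=()) -> pd.DataFrame:
    """Fill NaN with the column mean, or with the mode for columns in cat_cols.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if col in cat_cols:
            # mode() returns all modes sorted ascending; take the first one.
            fill = out[col].mode(dropna=True).iloc[0]
        else:
            fill = out[col].mean()  # mean() skips NaN by default
        out[col] = out[col].fillna(fill)
    return out

toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "gender": [1.0, 1.0, np.nan],
})
filled = baseline_impute(toy, cat_cols=["gender"])
print(filled)
```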
Evaluate Imputation Quality
Using RMSE for Numerical Features
[6]:
eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    method="rmse"
)
eval_results
--------------------------------------------------
Column RMSE Scaled (0-1)
--------------------------------------------------
age 13.060 0.297
income 9781.570 0.210
gender 0.661 0.661
--------------------------------------------------
Overall 3265.097 0.389
This uses Root Mean Squared Error (RMSE) to measure reconstruction quality.
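As a sketch of the underlying computation, the RMSE can be restricted to the entries that were hidden, since observed entries are left unchanged by imputation. The code below assumes that True in the mask marks a hidden entry, which may not match the convention of mecha.bool_mask. The range scaling is also an inference from the printed numbers rather than a documented rule: for age, 13.060 / 44 ≈ 0.297 if the sampled ages span 20 to 64, and each "Overall" value matches the unweighted mean of the column values.

```python
import numpy as np

def masked_rmse(truth, imputed, mask) -> float:
    """RMSE over the entries where mask is True (assumed to mean 'was hidden')."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([10.0, 20.0, 30.0, 40.0])
imputed = np.array([10.0, 26.0, 30.0, 32.0])
mask = np.array([False, True, False, True])

rmse = masked_rmse(truth, imputed, mask)
# Assumed scaling: divide by the range of the true column.
scaled = rmse / (truth.max() - truth.min())
print(f"RMSE = {rmse:.3f}, scaled = {scaled:.3f}")
```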
Using AvgErr for Mixed Types
If cat_cols is specified, evaluate_imputation() automatically applies:
RMSE/MAE for numerical columns
Accuracy for categorical columns
[7]:
eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    cat_cols=["gender"]
)
eval_results
--------------------------------------------------
Column AvgErr Scaled (0-1)
--------------------------------------------------
age 13.060 0.297
income 9781.570 0.210
gender 0.564 0.564
--------------------------------------------------
Overall 3265.065 0.357
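A Note on AvgErr
The AvgErr metric combines numerical and categorical evaluations in a single report:
\[
\text{AvgErr}(v_j) =
\begin{cases}
\text{RMSE or MAE}, & \text{if } v_j \text{ is numerical} \\
\text{Accuracy}, & \text{if } v_j \text{ is categorical}
\end{cases}
\]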
✨ It adapts metric choice based on each column’s type.
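The per-type rule described in this section can be sketched as a small dispatch over columns. This is an illustrative reading, not the evaluate_imputation() source; column_scores and the True-means-hidden mask convention are assumptions. It is at least consistent with the outputs shown earlier: on 0/1 data an RMSE of 0.661 implies a mismatch rate of about 0.437 (24 of 55 hidden entries), and the AvgErr value for gender, 0.564, equals the complementary accuracy 31/55. One consequence worth keeping in mind is that the categorical score is higher-is-better while the numerical scores are lower-is-better.

```python
import numpy as np
import pandas as pd

def column_scores(truth, imputed, mask, cat_cols=()):
    """Per-column score on hidden entries: accuracy if categorical, else RMSE.

    Hypothetical helper; `mask` is a boolean frame where True means 'was hidden'.
    """
    scores = {}
    for col in truth.columns:
        hidden = mask[col].to_numpy()
        t = truth.loc[hidden, col].to_numpy()
        p = imputed.loc[hidden, col].to_numpy()
        if col in cat_cols:
            scores[col] = float(np.mean(t == p))                  # accuracy
        else:
            scores[col] = float(np.sqrt(np.mean((t - p) ** 2)))   # RMSE
    return scores

truth = pd.DataFrame({"income": [10.0, 20.0, 30.0, 40.0], "gender": [0, 1, 1, 0]})
imputed = pd.DataFrame({"income": [10.0, 25.0, 30.0, 35.0], "gender": [0, 1, 1, 1]})
mask = pd.DataFrame({"income": [False, True, False, True],
                     "gender": [False, True, False, True]})

scores = column_scores(truth, imputed, mask, cat_cols=["gender"])
print(scores)
```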
Statistical Test: Little’s MCAR Test
Test for MCAR
[8]:
pval_mcar = MCARTest(method="little")(mcar_missing)
print(f"Little's MCAR test p-value (MCAR case): {pval_mcar:.4f}")
Method: Little's MCAR Test
Test Statistic p-value: 0.251537
Decision: Fail to reject the null hypothesis (α = 0.05)
→ There is insufficient evidence to reject MCAR.
Little's MCAR test p-value (MCAR case): 0.2515
A p-value above 0.05 means the test fails to reject MCAR. This is consistent with MCAR but does not prove it.
Test for Non-MCAR (MAR Example)
Now simulate Missing At Random (MAR) missingness and run the test again.
[9]:
mecha_mar = MissMechaGenerator(mechanism="mar", mechanism_type=5, missing_rate=0.2)
mar_missing = mecha_mar.fit_transform(data)
pval_mar = MCARTest(method="little")(mar_missing)
print(f"Little's MCAR test p-value (MAR case): {pval_mar:.4f}")
[MARType5] Selected column 1 as dependency (xd).
Method: Little's MCAR Test
Test Statistic p-value: 0.017166
Decision: Reject the null hypothesis (α = 0.05)
→ The data is unlikely to be Missing Completely At Random (MCAR).
Little's MCAR test p-value (MAR case): 0.0172
A low p-value (<0.05) suggests the data is not MCAR.
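The intuition behind the two results can be illustrated without the test itself. Under MCAR, an observed variable should look the same whether or not another variable is missing; under MAR, it will not. The sketch below is not Little's test and not MissMecha's MAR generator: it is an informal mean comparison on simulated data, and the threshold rule used to create the MAR mask is an assumption chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)  # fully observed variable

# Missingness indicators for a second variable y (y's values are not needed here).
mcar_mask = rng.random(n) < 0.5   # MCAR: hiding y ignores x entirely
mar_mask = x > np.median(x)       # MAR: y is hidden whenever x is large

def mean_gap(values, mask) -> float:
    """Mean of `values` where y is hidden minus mean where y is observed."""
    return float(values[mask].mean() - values[~mask].mean())

gap_mcar = mean_gap(x, mcar_mask)
gap_mar = mean_gap(x, mar_mask)
print(f"MCAR gap: {gap_mcar:+.3f}   MAR gap: {gap_mar:+.3f}")
```

A gap near zero is what MCAR predicts, while a large gap signals that missingness depends on observed data. Little's test formalizes this comparison across all missing-data patterns and variables at once.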
Key Takeaways
compute_missing_rate() summarizes missingness patterns.
SimpleSmartImputer offers quick baseline imputation.
evaluate_imputation() adapts metric choice based on variable types.
MCARTest provides statistical evidence on whether missingness is completely at random.
The MissMecha analysis module makes missingness study practical and systematic.