Analysis Demo

This notebook demonstrates how to use MissMecha’s analysis module to:

Summarize missingness
Perform baseline imputation
Evaluate imputation quality
Test the missingness mechanism (MCAR vs. non-MCAR)

Import Required Libraries

[1]:

import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
from missmecha.impute import SimpleSmartImputer
from missmecha.analysis import compute_missing_rate, evaluate_imputation, MCARTest

Create a Synthetic Dataset

[2]:

np.random.seed(42)

data = pd.DataFrame({
    "age": np.random.randint(20, 65, size=100),
    "income": np.random.normal(60000, 10000, size=100),
    "gender": np.random.choice([0, 1], size=100)
})

data.head()

[2]:

	age	income	gender
0	58	69305.844008	0
1	48	66777.674097	1
2	34	66984.402592	1
3	62	61736.020637	1
4	27	66622.845136	1

We simulate a mixed-type dataset: age and income are numerical, and gender is a binary categorical variable encoded as 0/1.

Generate Missingness (MCAR)

Apply a Missing Completely At Random (MCAR) pattern with a target missing rate of 50%.

[3]:

mecha = MissMechaGenerator(mechanism="mcar", missing_rate=0.5)
mcar_missing = mecha.fit_transform(data)

mcar_missing.head()

[3]:

	age	income	gender
0	58.0	69305.844008	NaN
1	48.0	NaN	NaN
2	34.0	NaN	1.0
3	NaN	61736.020637	1.0
4	NaN	66622.845136	NaN
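
To make the mechanism concrete, here is a minimal NumPy/pandas sketch of MCAR masking, assuming each cell is hidden independently with the same probability. It is not MissMecha's implementation, and the helper name `mask_mcar` is hypothetical.

```python
import numpy as np
import pandas as pd


def mask_mcar(df: pd.DataFrame, missing_rate: float, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: hide each cell independently with probability `missing_rate`."""
    rng = np.random.default_rng(seed)
    # One uniform draw per cell. The draw never looks at the data values,
    # which is what makes the pattern "completely at random".
    drop = rng.random(df.shape) < missing_rate
    # Cast to float so NaN can be stored, then blank out the selected cells.
    return df.astype(float).mask(drop)


toy = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) * 2.0})
toy_missing = mask_mcar(toy, missing_rate=0.5)
observed_rate = toy_missing.isna().to_numpy().mean()
print(f"Empirical missing rate: {observed_rate:.3f}")
```

The cast to float mirrors what happens in the output above, where age and gender become floating-point once NaN values appear.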

Compute Missing Rate

[4]:

missing_rate = compute_missing_rate(mcar_missing)

Overall missing rate: 51.00%
153 / 300 total values are missing.

Top variables by missing rate:

	n_missing	missing_rate (%)	n_unique	dtype	n_total
column
gender	55	55.0	2	float64	100
income	51	51.0	49	float64	100
age	47	47.0	32	float64	100

This summarizes:

Overall missing rate (51%, close to the 50% target)
Per-column missing rates, along with each column's unique-value count and dtype
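
The same kind of summary can be approximated with plain pandas. The sketch below is an illustration on a toy frame, not the code behind compute_missing_rate(); the column names are chosen to mirror the table above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of all cells that are NaN.
overall_rate = toy.isna().to_numpy().mean()

# Per-column counts and percentages, sorted from most to least missing.
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "missing_rate (%)": toy.isna().mean() * 100,
    "n_unique": toy.nunique(),          # nunique() ignores NaN by default
    "dtype": toy.dtypes.astype(str),
    "n_total": len(toy),
}).sort_values("missing_rate (%)", ascending=False)

print(f"Overall missing rate: {overall_rate:.2%}")
print(summary)
```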

Impute Missing Values

Use SimpleSmartImputer to fill missing values.

[5]:

imp = SimpleSmartImputer(cat_cols=["gender"])
data_imputed = imp.fit_transform(mcar_missing)

[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 41.45283018867924
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 60593.13322338924
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = 1.0

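Specifying cat_cols tells the imputer to treat gender as categorical and fill it with the mode. The remaining columns are treated as numerical; the non-integer fill value for age is consistent with mean imputation.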
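
As a rough picture of this kind of baseline, the sketch below fills categorical columns with the mode and all other columns with the mean. It is an assumption-laden stand-in, not SimpleSmartImputer itself; the function name `impute_mean_mode` is hypothetical.

```python
import numpy as np
import pandas as pd


def impute_mean_mode(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:
    """Hypothetical baseline: mode for `cat_cols`, mean for every other column."""
    out = df.copy()
    for col in out.columns:
        if col in cat_cols:
            # Most frequent observed value; ties resolve to the smallest value.
            fill = out[col].mode(dropna=True).iloc[0]
        else:
            # Series.mean() skips NaN by default.
            fill = out[col].mean()
        out[col] = out[col].fillna(fill)
    return out


toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0],
    "gender": [1.0, 1.0, np.nan],
})
filled = impute_mean_mode(toy, cat_cols=["gender"])
print(filled)
```

Filling every gap in a column with a single constant shrinks that column's variance, which is one reason to treat this only as a baseline.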

Evaluate Imputation Quality

Using RMSE for Numerical Features

[6]:

eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    method="rmse"
)

eval_results

--------------------------------------------------
Column                 RMSE   Scaled (0-1)
--------------------------------------------------
age                  13.060          0.297
income             9781.570          0.210
gender                0.661          0.661
--------------------------------------------------
Overall            3265.097          0.389

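This uses Root Mean Squared Error (RMSE) to measure reconstruction quality; lower is better. Because cat_cols is not passed here, gender is scored with RMSE as well. The Overall row is the unweighted mean of the per-column values, so the unscaled figure is dominated by income, and the Scaled (0-1) column is the better basis for comparing columns.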
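
The sketch below shows one way to compute RMSE over the masked cells only. It assumes that True in the mask marks a hidden cell and that the scaled variant divides by the range of the true column; both are assumptions for illustration, not documented behavior of evaluate_imputation(). The function name `masked_rmse` is hypothetical.

```python
import numpy as np


def masked_rmse(truth, imputed, mask, scale=False):
    """RMSE over the cells where `mask` is True (the cells that were hidden)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if scale:
        # Assumption: min-max scaling by the range of the true column.
        span = truth.max() - truth.min()
        truth, imputed = truth / span, imputed / span
    err = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(err ** 2)))


truth = np.array([20.0, 30.0, 40.0, 60.0])
imputed = np.array([20.0, 36.0, 40.0, 52.0])   # cells 1 and 3 were imputed
mask = np.array([False, True, False, True])

raw = masked_rmse(truth, imputed, mask)
scaled = masked_rmse(truth, imputed, mask, scale=True)
print(f"RMSE = {raw:.3f}, scaled = {scaled:.3f}")
```

Restricting the error to masked cells matters: observed cells are copied through unchanged, so including them would only dilute the score.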

Using AvgErr for Mixed Types

If cat_cols is specified, evaluate_imputation() automatically applies:

RMSE/MAE for numerical columns (lower is better)
Accuracy for categorical columns (higher is better)

[7]:

eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    cat_cols=["gender"]
)

eval_results

--------------------------------------------------
Column               AvgErr   Scaled (0-1)
--------------------------------------------------
age                  13.060          0.297
income             9781.570          0.210
gender                0.564          0.564
--------------------------------------------------
Overall            3265.065          0.357
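
The two tables can be cross-checked with a little arithmetic. The snippet below recomputes each Overall value as the plain mean of the per-column entries, and shows why the gender entry under AvgErr appears to be an accuracy: for a 0/1 column, squared RMSE equals the share of wrong values. The count of 31 correct out of the 55 masked gender cells is an illustrative figure that happens to reproduce both printed values; it is not reported by the library.

```python
import math

# Per-column values copied from the RMSE and AvgErr tables above.
rmse_cols = {"age": 13.060, "income": 9781.570, "gender": 0.661}
avgerr_cols = {"age": 13.060, "income": 9781.570, "gender": 0.564}

# Unweighted means across the three columns.
overall_rmse = sum(rmse_cols.values()) / len(rmse_cols)
overall_avgerr = sum(avgerr_cols.values()) / len(avgerr_cols)
print(f"Overall RMSE: {overall_rmse:.3f}   Overall AvgErr: {overall_avgerr:.3f}")

# For a binary column each error is 0 or 1, so RMSE**2 is the error rate
# and accuracy = 1 - RMSE**2. Illustrative count: 31 correct of 55 masked.
n_masked, n_correct = 55, 31
accuracy = n_correct / n_masked
binary_rmse = math.sqrt(1 - accuracy)
print(f"Accuracy: {accuracy:.3f}   Equivalent RMSE: {binary_rmse:.3f}")
```

Since accuracy rewards high values while RMSE rewards low ones, the Overall row in the AvgErr table averages quantities that point in opposite directions and is best read column by column.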

A Note on AvgErr

The AvgErr metric reports one score per column, choosing the underlying metric by column type:

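\[
\text{AvgErr}(v_j) =
\begin{cases}
\text{RMSE or MAE}, & \text{if } v_j \text{ is numerical} \\
\text{Accuracy}, & \text{if } v_j \text{ is categorical}
\end{cases}
\]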

✨ It adapts metric choice based on each column’s type.
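
A per-column dispatch of this kind might look like the following sketch. It assumes the mask is a boolean DataFrame with True marking hidden cells and uses RMSE for the numerical branch; the function name `per_column_score` is hypothetical and this is not the library's code.

```python
import numpy as np
import pandas as pd


def per_column_score(truth, imputed, mask, cat_cols):
    """Hypothetical sketch: accuracy for `cat_cols`, RMSE for the other columns."""
    scores = {}
    for col in truth.columns:
        hidden = mask[col].to_numpy()            # True where the value was masked
        t = truth.loc[hidden, col].to_numpy()
        p = imputed.loc[hidden, col].to_numpy()
        if col in cat_cols:
            # Share of masked cells recovered exactly.
            scores[col] = float(np.mean(t == p))
        else:
            # Root mean squared error over the masked cells.
            scores[col] = float(np.sqrt(np.mean((t - p) ** 2)))
    return scores


truth = pd.DataFrame({"age": [20.0, 30.0, 40.0], "gender": [0, 1, 1]})
imputed = pd.DataFrame({"age": [20.0, 33.0, 36.0], "gender": [1, 1, 1]})
mask = pd.DataFrame({"age": [False, True, True], "gender": [True, True, False]})

scores = per_column_score(truth, imputed, mask, cat_cols=["gender"])
print(scores)
```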

Statistical Test: Little’s MCAR Test

Test for MCAR

[8]:

pval_mcar = MCARTest(method="little")(mcar_missing)
print(f"Little's MCAR test p-value (MCAR case): {pval_mcar:.4f}")

Method: Little's MCAR Test
Test Statistic p-value: 0.251537
Decision: Fail to reject the null hypothesis (α = 0.05)
→ There is insufficient evidence to reject MCAR.
Little's MCAR test p-value (MCAR case): 0.2515

A high p-value (> 0.05) means MCAR cannot be rejected at the 5% level. It does not prove that the data are MCAR.

Test for Non-MCAR (MAR Example)

Now simulate a MAR pattern (mechanism_type=5, 20% missing rate) and test again.

[9]:

mecha_mar = MissMechaGenerator(mechanism="mar", mechanism_type=5, missing_rate=0.2)
mar_missing = mecha_mar.fit_transform(data)

pval_mar = MCARTest(method="little")(mar_missing)
print(f"Little's MCAR test p-value (MAR case): {pval_mar:.4f}")

[MARType5] Selected column 1 as dependency (xd).
Method: Little's MCAR Test
Test Statistic p-value: 0.017166
Decision: Reject the null hypothesis (α = 0.05)
→ The data is unlikely to be Missing Completely At Random (MCAR).
Little's MCAR test p-value (MAR case): 0.0172

A low p-value (< 0.05) suggests the data are not MCAR. The test only rejects MCAR; it cannot tell MAR apart from MNAR.
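
Little's test compares the means of observed variables across missing-data patterns. The simulation below illustrates that intuition with NumPy only: it is not Little's test and produces no p-value. The missingness probabilities (0.2 for MCAR, 0.35 and 0.05 for MAR) are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)   # fully observed variable

# MCAR: y is hidden with the same probability in every row.
mcar_mask = rng.random(n) < 0.2

# MAR: y is hidden more often when the observed x is positive.
p_missing = np.where(x > 0, 0.35, 0.05)
mar_mask = rng.random(n) < p_missing


def mean_gap(values, mask):
    """Mean of `values` where y is missing minus the mean where y is observed."""
    return float(values[mask].mean() - values[~mask].mean())


mcar_gap = mean_gap(x, mcar_mask)
mar_gap = mean_gap(x, mar_mask)
print(f"MCAR gap: {mcar_gap:+.3f}   MAR gap: {mar_gap:+.3f}")
```

Under MCAR the two groups of rows look alike, so the gap stays near zero. Under MAR the rows with missing y have systematically larger x, and it is this kind of shift in pattern means that gives a test like Little's its power.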

Key Takeaways

compute_missing_rate() summarizes missingness patterns.
SimpleSmartImputer offers quick baseline imputation.
evaluate_imputation() adapts metric choice based on variable types.
MCARTest provides statistical evidence on whether missingness is completely at random.
The MissMecha analysis module makes the study of missingness practical and systematic.