Analysis Demo

This notebook demonstrates how to use MissMecha’s analysis module to:

  • Summarize missingness

  • Perform baseline imputation

  • Evaluate imputation quality

  • Test the missingness mechanism (MCAR vs non-MCAR)


Import Required Libraries

[1]:
import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
from missmecha.impute import SimpleSmartImputer
from missmecha.analysis import compute_missing_rate, evaluate_imputation, MCARTest

Create a Synthetic Dataset

[2]:
np.random.seed(42)

data = pd.DataFrame({
    "age": np.random.randint(20, 65, size=100),
    "income": np.random.normal(60000, 10000, size=100),
    "gender": np.random.choice([0, 1], size=100)
})

data.head()
[2]:
   age        income  gender
0   58  69305.844008       0
1   48  66777.674097       1
2   34  66984.402592       1
3   62  61736.020637       1
4   27  66622.845136       1

We simulate a mixed-type dataset including both numerical and categorical variables.


Generate Missingness (MCAR)

Apply a Missing Completely At Random (MCAR) pattern with a 50% missing rate, so that values are removed independently of the data.

[3]:
mecha = MissMechaGenerator(mechanism="mcar", missing_rate=0.5)
mcar_missing = mecha.fit_transform(data)

mcar_missing.head()
[3]:
    age        income  gender
0  58.0  69305.844008     NaN
1  48.0           NaN     NaN
2  34.0           NaN     1.0
3   NaN  61736.020637     1.0
4   NaN  66622.845136     NaN
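
To make the mechanism concrete, here is a minimal sketch of MCAR masking in plain NumPy/pandas. It is not MissMecha's implementation: the helper name `apply_mcar` and its `seed` parameter are hypothetical. Each cell is hidden with the same probability, regardless of any data value, which is the defining property of MCAR.

```python
import numpy as np
import pandas as pd

def apply_mcar(df, missing_rate, seed=0):
    """Hide each cell independently with probability `missing_rate` (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # True marks a cell to hide; the draw never looks at the data values.
    mask = pd.DataFrame(
        rng.random(df.shape) < missing_rate,
        index=df.index,
        columns=df.columns,
    )
    # DataFrame.mask replaces the cells where the condition is True with NaN.
    return df.mask(mask), mask

demo = pd.DataFrame({"age": [58, 48, 34, 62, 27], "gender": [0, 1, 1, 1, 1]})
demo_missing, demo_mask = apply_mcar(demo, missing_rate=0.5, seed=0)
print(demo_missing)
```

Integer columns are upcast to float once they contain NaN, which is why age appears as 58.0 in the output above.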

Compute Missing Rate

[4]:
missing_rate = compute_missing_rate(mcar_missing)
Overall missing rate: 51.00%
153 / 300 total values are missing.

Top variables by missing rate:
        n_missing  missing_rate (%)  n_unique    dtype  n_total
column
gender         55              55.0         2  float64      100
income         51              51.0        49  float64      100
age            47              47.0        32  float64      100

This summarizes:

  • Overall missing rate

  • Per-column missing rates
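
The same summary can be approximated with plain pandas. The sketch below is an assumption about how such a report could be built, not the code behind compute_missing_rate(); the helper name `summarize_missing` is hypothetical, and the column labels simply mirror the output above.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Per-column missingness summary, sorted by missing rate (illustrative helper)."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "missing_rate (%)": df.isna().mean() * 100,
        "n_unique": df.nunique(),          # nunique() ignores NaN by default
        "dtype": df.dtypes.astype(str),
        "n_total": len(df),
    })
    return summary.sort_values("missing_rate (%)", ascending=False)

toy = pd.DataFrame({
    "age": [58.0, 48.0, np.nan, np.nan],
    "income": [69305.8, np.nan, np.nan, np.nan],
})
summary = summarize_missing(toy)
overall_rate = toy.isna().to_numpy().mean()   # share of all cells that are NaN
print(summary)
print(f"Overall missing rate: {overall_rate:.2%}")
```

The overall rate is the number of missing cells divided by the total number of cells, which is how 153 / 300 gives 51.00% in the report above.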


Impute Missing Values

Use SimpleSmartImputer to fill missing values.

[5]:
imp = SimpleSmartImputer(cat_cols=["gender"])
data_imputed = imp.fit_transform(mcar_missing)
[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 41.45283018867924
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 60593.13322338924
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = 1.0

Passing cat_cols=["gender"] tells the imputer to treat gender as categorical and fill it with its mode (here 1.0); age and income are treated as numerical, as the log shows.
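
As a rough picture of this kind of baseline, the sketch below fills categorical columns with the mode and every other column with the mean of its observed values. It is an illustration under that assumption, not SimpleSmartImputer's source; the function name `simple_impute` is hypothetical.

```python
import numpy as np
import pandas as pd

def simple_impute(df, cat_cols=()):
    """Fill categorical columns with the mode and all others with the mean (illustrative)."""
    filled = df.copy()
    for col in filled.columns:
        if col in cat_cols:
            # Most frequent observed value; ties resolve to the smallest value.
            fill_value = filled[col].mode(dropna=True).iloc[0]
        else:
            # Series.mean() skips NaN by default.
            fill_value = filled[col].mean()
        filled[col] = filled[col].fillna(fill_value)
    return filled

toy = pd.DataFrame({
    "age": [20.0, 40.0, np.nan, 60.0],
    "gender": [1.0, np.nan, 1.0, 0.0],
})
toy_imputed = simple_impute(toy, cat_cols=["gender"])
print(toy_imputed)
```

Without the cat_cols hint, a 0/1 column like gender would receive a fractional mean, which is not a valid category.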


Evaluate Imputation Quality

Using RMSE for Numerical Features

[6]:
eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    method="rmse"
)

eval_results

--------------------------------------------------
Column                 RMSE   Scaled (0-1)
--------------------------------------------------
age                  13.060          0.297
income             9781.570          0.210
gender                0.661          0.661
--------------------------------------------------
Overall            3265.097          0.389

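With method="rmse", every column, including the binary gender column, is scored by Root Mean Squared Error (RMSE) between the original and imputed values at the entries flagged by mecha.bool_mask. Lower values mean better reconstruction.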
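
A masked RMSE can be sketched in a few lines of NumPy. Two assumptions are made here and should be checked against the library: that True in the mask marks a cell that was made missing, and that the Scaled (0-1) column divides the RMSE by the range of the true column. The second assumption is consistent with the age row above (13.060 / 44 ≈ 0.297 for values spanning 20 to 64), but the notebook does not define it. The function names are hypothetical.

```python
import numpy as np

def masked_rmse(truth, imputed, mask):
    """RMSE over the cells that were hidden (mask == True)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def scaled_rmse(truth, imputed, mask):
    """Masked RMSE divided by the range of the true column (assumed scaling)."""
    truth = np.asarray(truth, dtype=float)
    value_range = truth.max() - truth.min()
    return masked_rmse(truth, imputed, mask) / value_range

truth = [20.0, 30.0, 40.0, 60.0]
imputed = [20.0, 36.0, 40.0, 52.0]        # cells 1 and 3 were imputed
mask = [False, True, False, True]

rmse = masked_rmse(truth, imputed, mask)   # sqrt((6**2 + 8**2) / 2)
scaled = scaled_rmse(truth, imputed, mask)
print(rmse, scaled)
```

Restricting the error to masked cells matters: observed cells are copied unchanged, so including them would dilute the error.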

Using AvgErr for Mixed Types

When cat_cols is specified, evaluate_imputation() chooses the metric for each column automatically:

  • RMSE/MAE for numerical columns

  • Accuracy for categorical columns

[7]:
eval_results = evaluate_imputation(
    data,
    data_imputed,
    mecha.bool_mask,
    cat_cols=["gender"]
)

eval_results
--------------------------------------------------
Column               AvgErr   Scaled (0-1)
--------------------------------------------------
age                  13.060          0.297
income             9781.570          0.210
gender                0.564          0.564
--------------------------------------------------
Overall            3265.065          0.357

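A Note on AvgErr

The AvgErr metric reports one score per column and chooses the metric from the column's type:

\[
\text{AvgErr}(v_j) =
\begin{cases}
\text{RMSE}(v_j) & \text{if } v_j \text{ is numerical} \\
\text{Accuracy}(v_j) & \text{if } v_j \text{ is categorical}
\end{cases}
\]

In the output above, age and income keep their RMSE values, while gender is scored by accuracy (0.564). Lower is better for RMSE, but higher is better for accuracy, so read the Overall row with care.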
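
The per-column choice of metric can be sketched as a small dispatch function. This is an illustration of the idea, not evaluate_imputation() itself; the name `column_score` and its `categorical` flag are hypothetical, and categories are assumed to be numerically encoded as in this notebook.

```python
import numpy as np

def column_score(truth, imputed, mask, categorical):
    """Accuracy for a categorical column, RMSE for a numerical one (masked cells only)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    t, p = truth[mask], imputed[mask]
    if categorical:
        return float(np.mean(t == p))              # share of exact matches
    return float(np.sqrt(np.mean((t - p) ** 2)))   # root mean squared error

gender_true = [0, 1, 1, 0]
gender_imputed = [1, 1, 1, 1]          # mode imputation with fill value 1
all_masked = [True, True, True, True]

accuracy = column_score(gender_true, gender_imputed, all_masked, categorical=True)
rmse = column_score(gender_true, gender_imputed, all_masked, categorical=False)
print(accuracy, rmse)
```

For a 0/1 column every error has magnitude 1, so accuracy equals 1 − RMSE². The two reports above agree with this: 1 − 0.661² ≈ 0.563, which matches the reported 0.564 up to rounding.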


Statistical Test: Little’s MCAR Test

Test for MCAR

[8]:
pval_mcar = MCARTest(method="little")(mcar_missing)
print(f"Little's MCAR test p-value (MCAR case): {pval_mcar:.4f}")
Method: Little's MCAR Test
Test Statistic p-value: 0.251537
Decision: Fail to reject the null hypothesis (α = 0.05)
→ There is insufficient evidence to reject MCAR.
Little's MCAR test p-value (MCAR case): 0.2515

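A p-value above 0.05 means MCAR cannot be rejected at the 5% significance level. This is an absence of evidence against MCAR, not proof that the data are MCAR.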
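
For reference, the statistic usually associated with Little's test is sketched below. This is the textbook form and is not taken from MissMecha's source, so the notation is an assumption: J is the number of distinct missingness patterns, m_j the number of rows with pattern j, \bar{y}_j the mean of the variables observed in pattern j, and \hat{\mu}_j, \hat{\Sigma}_j the overall mean and covariance estimates restricted to those variables.

\[
d^2 = \sum_{j=1}^{J} m_j \, (\bar{y}_j - \hat{\mu}_j)^\top \hat{\Sigma}_j^{-1} (\bar{y}_j - \hat{\mu}_j),
\qquad
d^2 \sim \chi^2_{\sum_{j} p_j - p} \quad \text{(asymptotically, under MCAR)}
\]

Here p_j is the number of variables observed in pattern j and p is the total number of variables. The reported p-value is the upper-tail probability of d^2 under this chi-squared distribution, so a large d^2 (pattern means far from the overall means) yields a small p-value.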

Test for Non-MCAR (MAR Example)

Now simulate Missing At Random (MAR) data, where missingness depends on observed values, and run the test again.

[9]:
mecha_mar = MissMechaGenerator(mechanism="mar", mechanism_type=5, missing_rate=0.2)
mar_missing = mecha_mar.fit_transform(data)

pval_mar = MCARTest(method="little")(mar_missing)
print(f"Little's MCAR test p-value (MAR case): {pval_mar:.4f}")
[MARType5] Selected column 1 as dependency (xd).
Method: Little's MCAR Test
Test Statistic p-value: 0.017166
Decision: Reject the null hypothesis (α = 0.05)
→ The data is unlikely to be Missing Completely At Random (MCAR).
Little's MCAR test p-value (MAR case): 0.0172

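A p-value below 0.05 leads to rejecting MCAR at the 5% significance level. The test does not say which non-MCAR mechanism (MAR or MNAR) produced the missingness.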
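
To see why the test reacts to MAR data, it helps to look at a generic MAR mask. The sketch below hides values of one column more often in rows where another, fully observed column is large. It is a generic illustration and makes no claim about what MissMecha's mechanism_type=5 does; the helper `apply_simple_mar` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_simple_mar(df, target, driver, missing_rate, seed=0):
    """Hide `target` more often where the fully observed `driver` is large (illustrative)."""
    rng = np.random.default_rng(seed)
    # Rank-based score in (0, 1]: rows with a larger driver value score higher.
    score = df[driver].rank(pct=True).to_numpy()
    # Rescale so the mean hiding probability is missing_rate (exact if nothing is capped).
    prob = np.clip(score * missing_rate / score.mean(), 0.0, 1.0)
    hide = rng.random(len(df)) < prob
    out = df.copy()
    out[target] = out[target].astype(float)   # allow NaN in the target column
    out.loc[hide, target] = np.nan
    return out, hide

n = 2000
demo = pd.DataFrame({
    "income": np.arange(n, dtype=float),
    "age": np.arange(n) % 45 + 20,
})
demo_mar, hidden = apply_simple_mar(demo, target="age", driver="income",
                                    missing_rate=0.2, seed=0)
print(demo["income"][hidden].mean(), demo["income"][~hidden].mean())
```

Rows with missing age have a clearly higher mean income than rows with observed age. Little's test looks for exactly this kind of difference in observed means across missingness patterns.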


Key Takeaways

  • compute_missing_rate() summarizes missingness patterns.

  • SimpleSmartImputer offers quick baseline imputation.

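  • evaluate_imputation() scores imputed values against the ground truth and, when cat_cols is given, picks the metric by variable type.

  • MCARTest provides statistical evidence on whether missingness is completely at random.

  • Together, these tools in the MissMecha analysis module make the study of missingness practical and systematic.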