Generator Demo

This notebook demonstrates how to use MissMecha to simulate missingness under three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

We will:

  • Create a synthetic dataset

  • Apply column-specific missingness

  • Systematically simulate various mechanisms
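Before turning to MissMecha, it may help to see the three mechanisms in their simplest form. The sketch below uses plain NumPy only; the variable names, the 0.5 / 0.1 drop probabilities, and the median split are illustrative assumptions and are not taken from MissMecha's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 60, size=n).astype(float)
income = rng.normal(60000, 10000, size=n)

# MCAR: every income value has the same 30% chance of being dropped,
# regardless of any data value.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of dropping income depends on another, fully observed
# column (here: age). Older respondents are dropped more often.
p_mar = np.where(age > np.median(age), 0.5, 0.1)
mar_mask = rng.random(n) < p_mar

# MNAR: the chance of dropping income depends on income itself,
# i.e. on the very value that becomes unobserved.
p_mnar = np.where(income > np.median(income), 0.5, 0.1)
mnar_mask = rng.random(n) < p_mnar

# Apply each mask (True = drop the value).
income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)

print("MCAR rate:", mcar_mask.mean())
print("MAR rate: ", mar_mask.mean())
print("MNAR rate:", mnar_mask.mean())
```

The only difference between the three cases is what the drop probability is allowed to depend on: nothing (MCAR), an observed column (MAR), or the masked column itself (MNAR).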

Import Required Libraries

[8]:
import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
from missmecha.analysis import compute_missing_rate

Generate a Sample Dataset

[9]:
np.random.seed(42)
X_train = pd.DataFrame({
    "age": np.random.randint(20, 60, size=200),
    "income": np.random.normal(60000, 10000, size=200),
    "gender": np.random.choice([0, 1], size=200)
})

X_train.head()
[9]:
   age        income  gender
0   58  50094.636749       0
1   48  54337.022704       0
2   34  60996.513651       1
3   27  54965.243459       0
4   40  44493.365689       0

Check the Number of Available Mechanisms

[10]:
from missmecha.generator import MECHANISM_LOOKUP

print(f"Supported MCAR types: {len(MECHANISM_LOOKUP['mcar'])}")
print(f"Supported MAR types: {len(MECHANISM_LOOKUP['mar'])}")
print(f"Supported MNAR types: {len(MECHANISM_LOOKUP['mnar'])}")
Supported MCAR types: 3
Supported MAR types: 8
Supported MNAR types: 6

Column-Wise Missingness Simulation Example

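MissMecha allows missingness to be configured per column via the info dictionary. Each key is a column name, and each value specifies the mechanism, its type, and the target rate for that column, plus optional settings such as depend_on or parameter. In the example below, the gender column is also declared as categorical through cat_cols.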

[11]:
columnwise_df = X_train.copy()

info = {
    "age": {
        "mechanism": "mcar",
        "type": 1,
        "rate": 0.2
    },
    "income": {
        "mechanism": "mar",
        "type": 3,
        "rate": 0.3,
        "depend_on": "age"
    },
    "gender": {
        "mechanism": "mnar",
        "type": 4,
        "rate": 0.4,
        "parameter": {"q": 0.1, "p": 0.8, "cut": "low"}
    }
}

gen_info = MissMechaGenerator(info=info, cat_cols=["gender"])
df_missing_info = gen_info.fit_transform(columnwise_df)

print("Dataset with column-specific missingness:")
df_missing_info.head()
[MARType3] No label provided. Using synthetic labels instead.
Dataset with column-specific missingness:
[11]:
    age        income  gender
0  58.0           NaN     0.0
1  48.0           NaN     0.0
2  34.0  60996.513651     1.0
3  27.0  54965.243459     NaN
4  40.0  44493.365689     NaN
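Since each column was given its own target rate (0.2, 0.3, and 0.4), it is worth checking the realized rate per column. The sketch below does this with pandas alone. The helper name summarize_missing is hypothetical and not part of MissMecha, and the toy frame simply reuses the five rows printed above (income rounded), so its rates will not match the targets for the full 200-row dataset.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Hypothetical helper: per-column missing counts and rates."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "missing_rate (%)": (df.isna().mean() * 100).round(2),
        "n_total": len(df),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("n_missing", ascending=False)

# Toy stand-in for df_missing_info: the five rows shown in the output above.
toy = pd.DataFrame({
    "age": [58.0, 48.0, 34.0, 27.0, 40.0],
    "income": [np.nan, np.nan, 60996.51, 54965.24, 44493.37],
    "gender": [0.0, 0.0, 1.0, np.nan, np.nan],
})

summary = summarize_missing(toy)
print(summary)
```

In the notebook itself, passing df_missing_info to the same helper should give a per-column view comparable to the targets in info.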

Accessing the Missingness Mask

[14]:
# Binary mask: 1 = observed, 0 = missing
mask = gen_info.get_mask()
print(mask[:5])


# Boolean mask: True = observed, False = missing
bool_mask = gen_info.get_bool_mask()
print(bool_mask[:5])
[[1 0 1]
 [1 0 1]
 [1 1 1]
 [1 1 0]
 [1 1 0]]
[[ True False  True]
 [ True False  True]
 [ True  True  True]
 [ True  True False]
 [ True  True False]]
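The two masks carry the same information, and either one can be used to reapply the pattern to a complete array or to compute missing rates. The sketch below assumes only the convention shown in the comments above (1 / True = observed) and hard-codes the five rows printed in this notebook rather than calling MissMecha.

```python
import numpy as np

# First five rows of the binary mask printed above (1 = observed, 0 = missing).
mask = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
])

# Binary -> boolean is a plain cast.
bool_mask = mask.astype(bool)

# The complete values for the same rows (from X_train.head()).
X = np.array([
    [58, 50094.636749, 0],
    [48, 54337.022704, 0],
    [34, 60996.513651, 1],
    [27, 54965.243459, 0],
    [40, 44493.365689, 0],
], dtype=float)

# Keep observed cells, replace missing cells with NaN.
X_missing = np.where(bool_mask, X, np.nan)

# Per-column missing rate is one minus the share of observed cells.
col_missing_rate = 1.0 - bool_mask.mean(axis=0)

print(X_missing)
print(col_missing_rate)
```

Storing the mask separately from the data makes it possible to apply an identical missingness pattern to several versions of the same dataset, for example before and after scaling.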

Systematically Test Different Missing Mechanisms

We will:

  • Loop through the three MCAR types (the same loop structure applies to MAR and MNAR)

  • Try two missingness rates (30% and 70%)

  • Report the missingness structure with compute_missing_rate

[12]:
missing_type_list = ["mcar"]
mechanism_type_list = [1, 2, 3]
missing_rate_list = [0.3, 0.7]

for missing_type in missing_type_list:
    for mechanism_type in mechanism_type_list:
        for missing_rate in missing_rate_list:
            print(f"\nMechanism: {missing_type.upper()}-{mechanism_type} | Missing rate: {missing_rate}")

            # Initialize generator
            mecha = MissMechaGenerator(
                mechanism=missing_type,
                mechanism_type=mechanism_type,
                missing_rate=missing_rate,
                seed=42
            )

            # Fit and apply missingness
            X_missing = mecha.fit_transform(X_train)

            # Report missing rate summary
            print("Missingness summary:")
            compute_missing_rate(X_missing)

            print("-----------------------------------------------------------")


Mechanism: MCAR-1 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.17%
181 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
income         67              33.5       133  float64      200
gender         58              29.0         2  float64      200
age            56              28.0        38  float64      200
-----------------------------------------------------------

Mechanism: MCAR-1 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.83%
425 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
gender        144              72.0         2  float64      200
income        141              70.5        59  float64      200
age           140              70.0        32  float64      200
-----------------------------------------------------------

Mechanism: MCAR-2 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.00%
180 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
gender         64              32.0         2  float64      200
age            60              30.0        39  float64      200
income         56              28.0       144  float64      200
-----------------------------------------------------------

Mechanism: MCAR-2 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.00%
420 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
age           144              72.0        30  float64      200
gender        141              70.5         2  float64      200
income        135              67.5        65  float64      200
-----------------------------------------------------------

Mechanism: MCAR-3 | Missing rate: 0.3
Missingness summary:
Overall missing rate: 30.00%
180 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
age            60              30.0        38  float64      200
income         60              30.0       140  float64      200
gender         60              30.0         2  float64      200
-----------------------------------------------------------

Mechanism: MCAR-3 | Missing rate: 0.7
Missingness summary:
Overall missing rate: 70.00%
420 / 600 total values are missing.

Top variables by missing rate:
column  n_missing  missing_rate (%)  n_unique    dtype  n_total
age           140              70.0        30  float64      200
income        140              70.0        60  float64      200
gender        140              70.0         2  float64      200
-----------------------------------------------------------
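The summaries above differ in how closely each MCAR type hits the requested rate: approximately (MCAR-1), exactly overall (MCAR-2), or exactly in every column (MCAR-3). One way such differences can arise is the sampling scheme used to build the mask. The sketch below contrasts two generic schemes in NumPy. It is an illustration under that assumption, not a description of how MissMecha implements its MCAR types, and both function names are hypothetical.

```python
import numpy as np

def bernoulli_mask(shape, rate, rng):
    """Drop each cell independently with probability `rate`.

    The realized rate fluctuates around `rate`. True = missing.
    """
    return rng.random(shape) < rate

def exact_per_column_mask(shape, rate, rng):
    """Drop exactly round(rate * n_rows) cells in every column.

    The realized rate matches `rate` in each column. True = missing.
    """
    n_rows, n_cols = shape
    n_drop = int(round(rate * n_rows))
    mask = np.zeros(shape, dtype=bool)
    for j in range(n_cols):
        # Sample distinct row indices so the count is exact.
        rows = rng.choice(n_rows, size=n_drop, replace=False)
        mask[rows, j] = True
    return mask

rng = np.random.default_rng(42)
approx = bernoulli_mask((200, 3), 0.3, rng)
exact = exact_per_column_mask((200, 3), 0.3, rng)

print("Bernoulli, missing per column:  ", approx.sum(axis=0))
print("Exact count, missing per column:", exact.sum(axis=0))
```

Note that these helpers use True for missing cells, which is the opposite of the 1 = observed convention returned by get_mask(). Both schemes are MCAR, since neither looks at the data values; they differ only in how tightly the realized rate is controlled.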

Key Takeaways

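  • In this run, the three MCAR types differ in how closely they hit the target rate: MCAR-1 only approximates it (per-column rates of 28.0% to 33.5% at a 30% target), MCAR-2 matches the overall rate exactly while per-column rates vary, and MCAR-3 matches the target in every column.

  • Comparing the reported rates against the requested missing_rate is a quick way to evaluate and validate simulation settings systematically with MissMecha.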