Custom Mechanism Demo

This notebook demonstrates how to use a custom missing data mechanism in MissMecha.

You will learn how to:

  • Define a custom masking class

  • Inject it into MissMechaGenerator

  • Visualize the result

[ ]:
# Import required modules
import numpy as np
import pandas as pd
from missmecha.generator import MissMechaGenerator
import matplotlib.pyplot as plt
[15]:
np.random.seed(0)
X = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.rand(100) * 10,
    'target': np.random.choice([0, 1], size=100)
})
X.head()
[15]:
   feature1  feature2  target
0  1.764052  4.238550       1
1  0.400157  6.063932       0
2  0.978738  0.191932       0
3  2.240893  3.015748       1
4  1.867558  6.601735       0
[16]:
import numpy as np
from missmecha.generator import MissMechaGenerator

class MyCustomMasker:
    """
    A simple custom mechanism that masks the first `missing_rate` proportion of rows.
    """
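    def __init__(self, missing_rate=0.1, seed=42):
        self.missing_rate = missing_rate
        self.seed = seed  # stored but unused: this mechanism is deterministic

    def fit(self, X, y=None):
        # Record the row count so transform() can compute the cutoff
        self.n_rows = X.shape[0]
        return self

    def transform(self, X):
        # Cast to float so the array can hold NaN, and work on a copy
        X_missing = X.astype(float).copy()
        cutoff = int(self.missing_rate * self.n_rows)
        # Positional slicing like this requires a NumPy array, not a DataFrame
        X_missing[:cutoff, :] = np.nan
        return X_missing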
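
The class above accepts a `seed` argument but never uses it, because masking the first rows is deterministic. As a minimal sketch of a mechanism that does use the seed, the hypothetical `RandomRowMasker` below masks a random subset of rows instead. It is not part of MissMecha; it assumes only what the class above suggests, namely that a custom mechanism needs `fit` and `transform` methods operating on a NumPy array.

```python
import numpy as np

class RandomRowMasker:
    """Hypothetical variant: masks a random `missing_rate` proportion of rows."""
    def __init__(self, missing_rate=0.1, seed=42):
        self.missing_rate = missing_rate
        self.seed = seed

    def fit(self, X, y=None):
        self.n_rows = X.shape[0]
        # Draw the rows to mask once, so repeated transform() calls agree
        rng = np.random.default_rng(self.seed)
        n_masked = int(self.missing_rate * self.n_rows)
        self.rows_ = rng.choice(self.n_rows, size=n_masked, replace=False)
        return self

    def transform(self, X):
        # Cast to float so the array can hold NaN; np.asarray also accepts a DataFrame
        X_missing = np.asarray(X, dtype=float).copy()
        X_missing[self.rows_, :] = np.nan
        return X_missing

# Standalone check on a small array, without MissMechaGenerator
data = np.arange(30).reshape(10, 3)
masked = RandomRowMasker(missing_rate=0.2, seed=0).fit(data).transform(data)
n_masked_rows = int(np.isnan(masked).all(axis=1).sum())
print(n_masked_rows)  # 2 of the 10 rows are fully masked
```

Drawing the row indices in `fit` rather than in `transform` keeps the mask reproducible for a given seed and input shape.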


[17]:
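# Use MissMechaGenerator with the custom mechanism
from missmecha.analysis import compute_missing_rate  # needed for the summary below

gen = MissMechaGenerator(mechanism='custom', custom_class=MyCustomMasker, missing_rate=0.2)
X_missing = gen.fit_transform(X)

# Print a per-column summary of the missingness and return it as a dict
compute_missing_rate(X_missing)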
Overall missing rate: 20.00%
60 / 300 total values are missing.

Top variables by missing rate:
          n_missing  missing_rate (%)  n_unique    dtype  n_total
column
feature1         20              20.0        80  float64      100
feature2         20              20.0        80  float64      100
target           20              20.0         2  float64      100
[17]:
{'report':           n_missing  missing_rate (%)  n_unique    dtype  n_total
 column
 feature1         20              20.0        80  float64      100
 feature2         20              20.0        80  float64      100
 target           20              20.0         2  float64      100,
 'overall_missing_rate': 20.0}
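
The reported rates can be cross-checked with plain pandas. The sketch below does not call MissMecha: it builds a stand-in frame of the same shape (named `X_check` here, an assumption for illustration, with different random values than the demo data), masks the first 20 of 100 rows by hand, and recomputes the per-column and overall rates.

```python
import numpy as np
import pandas as pd

# Stand-in for X_missing: 100 rows, 3 float columns, first 20 rows masked by hand
rng = np.random.default_rng(0)
X_check = pd.DataFrame({
    'feature1': rng.standard_normal(100),
    'feature2': rng.random(100) * 10,
    'target': rng.choice([0, 1], size=100).astype(float),
})
X_check.iloc[:20, :] = np.nan

per_column = X_check.isna().mean() * 100     # percentage missing per column
n_missing = int(X_check.isna().sum().sum())  # total number of missing cells
overall = n_missing / X_check.size * 100     # percentage missing overall

print(per_column)
print(f"{n_missing} / {X_check.size} values missing ({overall:.2f}%)")
```

Because the mechanism masks whole rows, every column ends up with the same rate, and the overall rate equals the requested `missing_rate`.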