Impute Demo

This notebook demonstrates how to use the SimpleSmartImputer class from MissMecha's impute module for basic missing-value imputation.

We will:

  • Create a small dataset with both numerical and categorical features

  • Apply automatic imputation strategies

  • Inspect the imputed results

[1]:
# --- Import Required Libraries ---
import pandas as pd
import numpy as np

# --- Set random seed for reproducibility ---
np.random.seed(42)

[2]:
# --- Create a Synthetic Dataset ---

data = {
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F']
}

df = pd.DataFrame(data)

print("Original dataset with missing values:")
df.head()

Original dataset with missing values:
[2]:
    age   income  gender
0  25.0  50000.0       M
1   NaN  60000.0       F
2  30.0      NaN       M
3  22.0  52000.0     NaN
4   NaN  58000.0       F
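
Before imputing, it can help to quantify how much is missing in each column. The sketch below rebuilds the same synthetic data and uses only standard pandas calls (`isna`, `sum`, `mean`); nothing in it is specific to MissMecha.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the cell above
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

# Number of missing entries per column
missing_counts = df.isna().sum()

# Fraction of missing entries per column (mean of a boolean mask)
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Here age and income each have three missing entries out of ten, and gender has two.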
[5]:
# --- Apply SimpleSmartImputer ---

from missmecha.impute import SimpleSmartImputer

# Instantiate the imputer
imp = SimpleSmartImputer(cat_cols=['gender'])

# Fit and transform the data
df_imputed = imp.fit_transform(df)

print("Imputed dataset:")
df_imputed.head()


[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 30.285714285714285
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 56714.28571428572
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = F
Imputed dataset:
[5]:
         age        income  gender
0  25.000000  50000.000000       M
1  30.285714  60000.000000       F
2  30.000000  56714.285714       M
3  22.000000  52000.000000       F
4  30.285714  58000.000000       F
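
The logged fill values can be cross-checked by hand. The following is a minimal sketch in plain pandas, assuming the imputer uses the column mean for numerical features and the most frequent value for categorical ones, as the log lines above suggest. It does not call MissMecha, and the variable names (`age_fill`, `income_fill`, `gender_fill`, `manual`) are illustrative only.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the notebook
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

# Mean of the observed values; pandas skips NaN by default
age_fill = df['age'].mean()
income_fill = df['income'].mean()

# Most frequent observed category; mode() returns a Series, so take its first entry
gender_fill = df['gender'].mode().iloc[0]

# Fill each column with its own value
manual = df.fillna({'age': age_fill, 'income': income_fill, 'gender': gender_fill})

print(age_fill, income_fill, gender_fill)
print(manual.head())
```

The age mean is 212/7 (about 30.29), the income mean is 397000/7 (about 56714.29), and the most frequent gender is F, which agrees with the values reported by SimpleSmartImputer.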

Note

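  • Numerical columns (e.g., age, income) are imputed using the mean of their observed (non-missing) values.

  • Categorical columns (e.g., gender) are imputed using the mode, i.e. the most frequent observed value.

  • Manual column type specification is optional because MissMecha detects types automatically; this example nevertheless declares gender explicitly through cat_cols.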

For more advanced control, users can manually specify column types or customize imputation behavior.

Key Takeaways

  • Automatically imputes numerical columns with mean and categorical columns with mode.

  • Supports a scikit-learn-style API (fit, transform, fit_transform).

  • Allows manual specification of categorical columns if needed.

  • Provides a quick and lightweight baseline for missing data imputation.
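
A practical consequence of the fit/transform split is that fill values can be learned on one dataset and reused on another. The sketch below illustrates the pattern with a small stand-in class; `MeanModeImputer` is hypothetical and is not MissMecha's implementation, and the `train` and `test` frames are made-up data for illustration.

```python
import numpy as np
import pandas as pd


class MeanModeImputer:
    """Hypothetical stand-in that mimics a fit/transform imputer (not MissMecha code)."""

    def __init__(self, cat_cols=None):
        self.cat_cols = list(cat_cols or [])
        self.fill_values_ = {}

    def fit(self, df):
        # Learn one fill value per column from the data passed to fit
        self.fill_values_ = {}
        for col in df.columns:
            is_categorical = col in self.cat_cols or not pd.api.types.is_numeric_dtype(df[col])
            if is_categorical:
                self.fill_values_[col] = df[col].mode().iloc[0]
            else:
                self.fill_values_[col] = df[col].mean()
        return self

    def transform(self, df):
        # Reuse the stored fill values; nothing is re-estimated here
        return df.fillna(self.fill_values_)

    def fit_transform(self, df):
        return self.fit(df).transform(df)


train = pd.DataFrame({'age': [20.0, 30.0, np.nan], 'gender': ['F', 'F', 'M']})
test = pd.DataFrame({'age': [np.nan, 50.0], 'gender': [np.nan, 'M']})

imp = MeanModeImputer(cat_cols=['gender']).fit(train)
test_imputed = imp.transform(test)
print(test_imputed)
```

The missing entries in `test` are filled with the mean age and most frequent gender of `train`, so no statistics from the test data leak into the imputation.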