Impute Demo

This notebook demonstrates how to use the SimpleSmartImputer class from MissMecha's impute module for basic missing value imputation.

We will:

Create a small dataset with both numerical and categorical features
Apply automatic imputation strategies
Inspect the imputed results

[1]:

# --- Import Required Libraries ---
import pandas as pd
import numpy as np

# --- Set random seed for reproducibility ---
np.random.seed(42)

[2]:

# --- Create a Synthetic Dataset ---

data = {
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F']
}

df = pd.DataFrame(data)

print("Original dataset with missing values:")
df.head()

Original dataset with missing values:

[2]:

	age	income	gender
0	25.0	50000.0	M
1	NaN	60000.0	F
2	30.0	NaN	M
3	22.0	52000.0	NaN
4	NaN	58000.0	F
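
Before imputing, it can help to confirm how many values are missing in each column. The sketch below rebuilds the same frame and counts the gaps with standard pandas calls; the name `missing_counts` is introduced here for illustration and is not part of MissMecha.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the cell above
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

# Count missing entries per column
missing_counts = df.isna().sum()
print(missing_counts)
```

For this dataset the counts are 3 for age, 3 for income, and 2 for gender.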

[5]:

# --- Apply SimpleSmartImputer ---

from missmecha.impute import SimpleSmartImputer

# Instantiate the imputer, declaring 'gender' as a categorical column
imp = SimpleSmartImputer(cat_cols=['gender'])

# Fit and transform the data
df_imputed = imp.fit_transform(df)

print("Imputed dataset:")
df_imputed.head()

[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 30.285714285714285
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 56714.28571428572
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = F
Imputed dataset:

[5]:

	age	income	gender
0	25.000000	50000.000000	M
1	30.285714	60000.000000	F
2	30.000000	56714.285714	M
3	22.000000	52000.000000	F
4	30.285714	58000.000000	F

Note

Numerical columns (e.g., age, income) are imputed using the mean.
Categorical columns (e.g., gender) are imputed using the mode.
The example above passes cat_cols=['gender'] explicitly, but manual column type specification is optional; MissMecha detects types automatically when it is omitted.
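
The fill values reported in the log above can be reproduced with plain pandas, which makes the mean and mode rule easy to check by hand. This is a minimal sketch of the equivalent computation, not MissMecha's internal code; the names `age_fill`, `income_fill`, `gender_fill`, and `manual` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the notebook
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

# Series.mean() skips NaN by default, so only observed values are averaged
age_fill = df['age'].mean()
income_fill = df['income'].mean()

# Series.mode() returns all modes sorted; take the first one
gender_fill = df['gender'].mode().iloc[0]

# Fill each column with its own value
manual = df.fillna({'age': age_fill, 'income': income_fill, 'gender': gender_fill})
print(age_fill, income_fill, gender_fill)
print(manual.head())
```

The seven observed ages sum to 212 and the seven observed incomes sum to 397000, which gives the means shown in the log (about 30.2857 and 56714.2857). 'F' appears five times against three for 'M', so it is the mode.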

For more advanced control, users can manually specify column types or customize imputation behavior.

Key Takeaways

Automatically imputes numerical columns with the mean and categorical columns with the mode.
Supports a scikit-learn-style API (fit, transform, fit_transform).
Allows manual specification of categorical columns through cat_cols if needed.
Provides a quick and lightweight baseline for missing data imputation.
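
The practical benefit of a separate fit and transform is that fill values learned on one dataset can be reused on another, for example on a held-out test set. The sketch below illustrates that pattern with a hypothetical stand-in class, `MeanModeImputer`, written in plain pandas. It is not MissMecha's implementation, and the `train` and `test` frames are made up for illustration; only the method names and the `cat_cols` argument mirror what this notebook shows.

```python
import numpy as np
import pandas as pd


class MeanModeImputer:
    """Hypothetical stand-in for illustration; not MissMecha's code."""

    def __init__(self, cat_cols=None):
        self.cat_cols = list(cat_cols or [])
        self.fill_values_ = {}

    def fit(self, df):
        # Learn one fill value per column from the data passed to fit
        self.fill_values_ = {}
        for col in df.columns:
            if col in self.cat_cols:
                self.fill_values_[col] = df[col].mode().iloc[0]
            else:
                self.fill_values_[col] = df[col].mean()
        return self

    def transform(self, df):
        # Reuse the learned values; nothing is recomputed here
        return df.fillna(self.fill_values_)

    def fit_transform(self, df):
        return self.fit(df).transform(df)


# Made-up data: fit on train, then apply to test
train = pd.DataFrame({'age': [20.0, 30.0, np.nan], 'gender': ['F', 'F', 'M']})
test = pd.DataFrame({'age': [np.nan, 50.0], 'gender': [np.nan, 'M']})

imp = MeanModeImputer(cat_cols=['gender']).fit(train)
test_imputed = imp.transform(test)
print(test_imputed)
```

Here the missing test age is filled with 25.0 and the missing test gender with 'F', both learned from `train` rather than from `test`. The sketch does not handle columns that are entirely missing at fit time.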