Impute Demo
This notebook demonstrates how to use the SimpleSmartImputer
class from MissMecha's impute module for basic missing-value imputation.
We will:
Create a small dataset with both numerical and categorical features
Apply automatic imputation strategies
Inspect the imputed results
[1]:
# --- Import Required Libraries ---
import pandas as pd
import numpy as np
# --- Set random seed for reproducibility ---
np.random.seed(42)
[2]:
# --- Create a Synthetic Dataset ---
data = {
'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F']
}
df = pd.DataFrame(data)
print("Original dataset with missing values:")
df.head()
Original dataset with missing values:
[2]:
 | age | income | gender |
---|---|---|---|
0 | 25.0 | 50000.0 | M |
1 | NaN | 60000.0 | F |
2 | 30.0 | NaN | M |
3 | 22.0 | 52000.0 | NaN |
4 | NaN | 58000.0 | F |
[5]:
# --- Apply SimpleSmartImputer ---
from missmecha.impute import SimpleSmartImputer
# Instantiate the imputer
imp = SimpleSmartImputer(cat_cols=['gender'])
# Fit and transform the data
df_imputed = imp.fit_transform(df)
print("Imputed dataset:")
df_imputed.head()
[SimpleSmartImputer] Column 'age' treated as numerical. Fill value = 30.285714285714285
[SimpleSmartImputer] Column 'income' treated as numerical. Fill value = 56714.28571428572
[SimpleSmartImputer] Column 'gender' treated as categorical. Fill value = F
Imputed dataset:
[5]:
 | age | income | gender |
---|---|---|---|
0 | 25.000000 | 50000.000000 | M |
1 | 30.285714 | 60000.000000 | F |
2 | 30.000000 | 56714.285714 | M |
3 | 22.000000 | 52000.000000 | F |
4 | 30.285714 | 58000.000000 | F |
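The fill values reported in the log can be cross-checked with plain pandas. The sketch below does not call MissMecha; it assumes, based on the log messages above, that the numerical fill is the mean of the observed values and the categorical fill is the most frequent observed category.

```python
import numpy as np
import pandas as pd

# Same synthetic dataset as in the notebook.
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

# Mean over the observed (non-missing) values; pandas skips NaN by default.
age_fill = df['age'].mean()
income_fill = df['income'].mean()

# Most frequent observed category; mode() drops NaN by default.
gender_fill = df['gender'].mode().iloc[0]

print(age_fill, income_fill, gender_fill)
```

Up to floating-point rounding, these should agree with the fill values printed by SimpleSmartImputer (about 30.29 for age, about 56714.29 for income, and F for gender).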
Note
Numerical columns (e.g., age, income) are imputed using the mean of the observed values.
Categorical columns (e.g., gender) are imputed using the mode.
No manual column type specification is required; MissMecha detects types automatically. The cat_cols=['gender'] argument in the cell above is an optional explicit override.
For more advanced control, users can manually specify column types or customize imputation behavior.
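One plausible way to detect column types is to inspect pandas dtypes. The sketch below is an assumption for illustration only: split_columns_by_dtype is a hypothetical helper, and nothing in this notebook confirms that MissMecha uses the same rule.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_columns_by_dtype(frame):
    """Hypothetical helper: partition columns into numerical and categorical by dtype."""
    num_cols = [c for c in frame.columns if is_numeric_dtype(frame[c])]
    cat_cols = [c for c in frame.columns if c not in num_cols]
    return num_cols, cat_cols


df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 32],
    'income': [50000, 60000, np.nan, 52000, 58000, np.nan, 61000, 59000, 57000, np.nan],
    'gender': ['M', 'F', 'M', np.nan, 'F', 'F', 'M', 'F', np.nan, 'F'],
})

num_cols, cat_cols = split_columns_by_dtype(df)
print(num_cols, cat_cols)  # ['age', 'income'] ['gender']
```

A dtype-based rule like this one would treat an integer-coded category (for example, 0/1 for a binary label) as numerical, which is one situation where listing categorical columns explicitly is the safer choice.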
Key Takeaways
Automatically imputes numerical columns with the mean and categorical columns with the mode.
Supports a scikit-learn-style API (fit, transform, fit_transform).
Allows manual specification of categorical columns if needed.
Provides a quick and lightweight baseline for missing data imputation.
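The scikit-learn-style split between fit and transform matters when fill values learned on one dataset are applied to another. The class below, MeanModeImputer, is a hypothetical stand-in written in plain pandas to illustrate that pattern; it is not MissMecha's implementation, and its only shared surface with SimpleSmartImputer is the method names and the cat_cols argument seen in this notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


class MeanModeImputer:
    """Hypothetical stand-in for illustration; not MissMecha's implementation."""

    def __init__(self, cat_cols=None):
        self.cat_cols = list(cat_cols) if cat_cols is not None else []

    def fit(self, X):
        # Learn one fill value per column from the data passed to fit.
        self.fill_values_ = {}
        for col in X.columns:
            if col in self.cat_cols or not is_numeric_dtype(X[col]):
                self.fill_values_[col] = X[col].mode().iloc[0]
            else:
                self.fill_values_[col] = X[col].mean()
        return self

    def transform(self, X):
        # Reuse the learned fill values; nothing is recomputed from X.
        return X.fillna(self.fill_values_)

    def fit_transform(self, X):
        return self.fit(X).transform(X)


train = pd.DataFrame({'age': [20.0, 30.0, np.nan], 'gender': ['F', 'F', 'M']})
new = pd.DataFrame({'age': [np.nan, 50.0], 'gender': [np.nan, 'M']})

imputer = MeanModeImputer(cat_cols=['gender']).fit(train)
new_imputed = imputer.transform(new)
print(new_imputed)
```

Here the missing age in the new data is filled with 25.0, the mean of the training ages, rather than with a statistic of the new data. Fitting on training data only and transforming held-out data keeps information from the held-out set out of the imputation step.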