Analysis Module¶
This module provides utilities for analyzing the structure of missing data and evaluating the quality of its imputations.
Function Overview¶
| Function | Description |
| --- | --- |
| compute_missing_rate | Compute and summarize missingness statistics for each column. |
| evaluate_imputation | Evaluate imputation quality by comparing imputed values to ground truth. |
| MCARTest | A class to perform MCAR (Missing Completely At Random) tests. |
Module Reference¶
compute_missing_rate¶
Summarize the extent and structure of missing data in a DataFrame or NumPy array.
- missmecha.analysis.compute_missing_rate(data, print_summary=True, plot=False)[source]¶
Compute and summarize missingness statistics for each column.
This function calculates the number and percentage of missing values for each column in a dataset, and optionally provides a summary table and barplot.
- Parameters:
data (pandas.DataFrame or numpy.ndarray) – The dataset to analyze for missingness. If ndarray, it will be converted to DataFrame.
print_summary (bool, default=True) – If True, prints the overall missing rate and top variables by missing rate.
plot (bool, default=False) – If True, displays a barplot of missing rates per column.
- Returns:
result – A dictionary with:
- ‘report’ : pandas.DataFrame with per-column missing statistics.
- ‘overall_missing_rate’ : float, overall percentage of missing entries.
- Return type:
dict
Examples
>>> import pandas as pd
>>> from missmecha.analysis import compute_missing_rate
>>> df = pd.read_csv("data.csv")
>>> stats = compute_missing_rate(df, print_summary=True, plot=True)
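The returned dictionary can be inspected directly; a minimal sketch using only the keys documented in the Returns section above:

>>> report_df = stats["report"]              # pandas.DataFrame with per-column missing statistics
>>> overall = stats["overall_missing_rate"]  # float, overall percentage of missing entries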
evaluate_imputation¶
Evaluate imputation quality by comparing filled values to the ground truth at missing positions.
- missmecha.analysis.evaluate_imputation(original_df, imputed_df, mask_array, method='rmse', cat_cols=None)[source]¶
Evaluate imputation quality by comparing imputed values to ground truth.
This function computes per-column and overall evaluation scores based on the positions that were originally missing. It supports mixed-type data by applying different metrics for categorical and numerical columns. Returns both original and scaled (0-1) versions of the evaluation metrics.
- Parameters:
original_df (pd.DataFrame) – The fully observed reference dataset (i.e., ground truth).
imputed_df (pd.DataFrame) – The dataset after imputation has been applied.
mask_array (np.ndarray or pd.DataFrame of bool) – Boolean array where True = originally observed, False = originally missing. Usually obtained from MissMechaGenerator.bool_mask.
method (str, default="rmse") – Evaluation method to use for numeric columns. One of {‘rmse’, ‘mae’, ‘accuracy’}.
cat_cols (list of str, optional) – Column names that should be treated as categorical; these always use accuracy. If not provided, all columns use the metric specified by method.
- Returns:
result – Dictionary with two sub-dictionaries:
- ‘original’: contains raw evaluation scores
  - ‘column_scores’: mapping from column name to evaluation score
  - ‘overall_score’: average of valid column scores (float)
- ‘scaled’: contains normalized scores (0-1 range)
  - ‘column_scores’: mapping from column name to scaled evaluation score
  - ‘overall_score’: average of valid scaled column scores (float)

For categorical columns, the scaled score equals the original accuracy score.
- Return type:
dict
- Raises:
ValueError – If an unsupported method or column type is used.
Notes
- If cat_cols is None: all columns use the selected method.
- If cat_cols is provided:
  - columns in cat_cols use accuracy
  - all other columns use method, which must be ‘rmse’ or ‘mae’
- Includes formatted print output.
Examples
>>> from missmecha.analysis import evaluate_imputation
>>> result = evaluate_imputation(X_true, X_filled, mask, method="rmse")
>>> result = evaluate_imputation(
...     original_df=X_true,
...     imputed_df=X_filled,
...     mask_array=mask,
...     method="mae",
...     cat_cols=["gender", "job_type"]
... )
>>> print(result["original"]["overall_score"])
0.872
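Following the return structure documented above, a minimal sketch of reading the raw and scaled scores; the key names come from the Returns section, and the exact values depend on your data:

>>> result["original"]["column_scores"]  # per-column raw scores (RMSE/MAE, or accuracy for categorical columns)
>>> result["scaled"]["overall_score"]    # average of valid column scores normalized to the 0-1 range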
MCARTest¶
This class supports two approaches to test the MCAR assumption:
- Little’s MCAR Test: a global test for whether the missingness is completely at random.
- Pairwise t-tests: individual tests that compare observed vs. missing groups.
- class missmecha.analysis.MCARTest(method: str = 'little')[source]¶
Bases: object
A class to perform MCAR (Missing Completely At Random) tests.
Supports Little’s MCAR test (global test for all variables) and pairwise MCAR t-tests (for individual variables).
- static little_mcar_test(X: DataFrame) → float [source]¶
Perform Little’s MCAR test on a DataFrame.
- Parameters:
X (pd.DataFrame) – Input dataset.
- Returns:
pvalue – P-value of the test.
- Return type:
float
- static mcar_t_tests(X: DataFrame) → DataFrame [source]¶
Perform pairwise MCAR t-tests between missing and observed groups.
- Parameters:
X (pd.DataFrame) – Input dataset.
- Returns:
p_matrix – Matrix of p-values (var vs var).
- Return type:
pd.DataFrame
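A minimal sketch of running the pairwise tests, assuming df is a pandas DataFrame that contains missing values; p_matrix is the variable-by-variable DataFrame of p-values described above:

>>> from missmecha.analysis import MCARTest
>>> p_matrix = MCARTest.mcar_t_tests(df)
>>> (p_matrix < 0.05).sum().sum()  # number of variable pairs with evidence against MCAR at the 5% level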
- static report(pvalue: float, alpha: float = 0.05, method: str = "Little's MCAR Test") → None [source]¶
Print a summary report of the MCAR test.
- Parameters:
pvalue (float) – The p-value from the MCAR test.
alpha (float, default=0.05) – Significance level.
method (str, default="Little's MCAR Test") – Method name shown in report.
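A minimal end-to-end sketch combining the static methods documented above; df is assumed to be a pandas DataFrame with missing values:

>>> from missmecha.analysis import MCARTest
>>> pvalue = MCARTest.little_mcar_test(df)  # global test across all variables
>>> MCARTest.report(pvalue, alpha=0.05)     # prints a summary at the chosen significance level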