Analysis Module

This module provides utilities for analyzing the structure of missing data and evaluating the quality of imputations.

Function Overview

compute_missing_rate

Compute and summarize missingness statistics for each column.

evaluate_imputation

Evaluate imputation quality by comparing imputed values to ground truth.

MCARTest

A class to perform MCAR (Missing Completely At Random) tests.

Module Reference

compute_missing_rate

Summarize the extent and structure of missing data in a DataFrame or NumPy array.

missmecha.analysis.compute_missing_rate(data, print_summary=True, plot=False)[source]

Compute and summarize missingness statistics for each column.

This function calculates the number and percentage of missing values for each column in a dataset, and optionally provides a summary table and barplot.

Parameters:
  • data (pandas.DataFrame or numpy.ndarray) – The dataset to analyze for missingness. If ndarray, it will be converted to DataFrame.

  • print_summary (bool, default=True) – If True, prints the overall missing rate and top variables by missing rate.

  • plot (bool, default=False) – If True, displays a barplot of missing rates per column.

Returns:

result – A dictionary with:

  • ‘report’ : pandas.DataFrame with per-column missing statistics.

  • ‘overall_missing_rate’ : float, overall percentage of missing entries.

Return type:

dict

Examples

>>> import pandas as pd
>>> from missmecha.analysis import compute_missing_rate
>>> df = pd.read_csv("data.csv")
>>> stats = compute_missing_rate(df, print_summary=True, plot=True)
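
The returned dictionary can then be inspected directly. A minimal sketch based on the documented keys:

>>> stats["report"]                 # pandas.DataFrame of per-column missing statistics
>>> stats["overall_missing_rate"]   # float, overall percentage of missing entries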

evaluate_imputation

Evaluate imputation quality by comparing filled values to the ground truth at missing positions.

missmecha.analysis.evaluate_imputation(original_df, imputed_df, mask_array, method='rmse', cat_cols=None)[source]

Evaluate imputation quality by comparing imputed values to ground truth.

This function computes per-column and overall evaluation scores based on the positions that were originally missing. It supports mixed-type data by applying different metrics for categorical and numerical columns. Returns both original and scaled (0-1) versions of the evaluation metrics.

Parameters:
  • original_df (pd.DataFrame) – The fully observed reference dataset (i.e., ground truth).

  • imputed_df (pd.DataFrame) – The dataset after imputation has been applied.

  • mask_array (np.ndarray or pd.DataFrame of bool) – Boolean array where True = originally observed, False = originally missing. Usually obtained from MissMechaGenerator.bool_mask.

  • method (str, default="rmse") – Evaluation method to use for numeric columns. One of {‘rmse’, ‘mae’, ‘accuracy’}.

  • cat_cols (list of str, optional) – Column names to treat as categorical; these columns are always evaluated with accuracy. If not provided, all columns use the metric specified by method.

Returns:

result – Dictionary with two sub-dictionaries:

  • ‘original’: contains raw evaluation scores
    • ‘column_scores’: mapping from column name to evaluation score

    • ‘overall_score’: average of valid column scores (float)

  • ‘scaled’: contains normalized scores (0–1 range)
    • ‘column_scores’: mapping from column name to scaled evaluation score

    • ‘overall_score’: average of valid scaled column scores (float)

For categorical columns, the scaled score equals the original accuracy score.

Return type:

dict

Raises:

ValueError – If an unsupported method or column type is used.

Notes

  • If cat_cols is None: all columns use the selected method.

  • If cat_cols is provided:
    • columns in cat_cols use accuracy

    • all other columns use method, which must be ‘rmse’ or ‘mae’

  • The function also prints a formatted summary of the evaluation results.

Examples

>>> from missmecha.analysis import evaluate_imputation
>>> result = evaluate_imputation(X_true, X_filled, mask, method="rmse")
>>> result = evaluate_imputation(
...     original_df=X_true,
...     imputed_df=X_filled,
...     mask_array=mask,
...     method="mae",
...     cat_cols=["gender", "job_type"]
... )
>>> print(result["original"]["overall_score"])
0.872
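
If MissMechaGenerator is not in use, a mask with the documented semantics (True = originally observed, False = originally missing) can be derived from the incomplete data itself. A minimal sketch, assuming X_incomplete is the DataFrame before imputation:

>>> mask = X_incomplete.notna().to_numpy()  # True where a value was observed
>>> result = evaluate_imputation(X_true, X_filled, mask, method="rmse")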

MCARTest

This class supports two approaches to test the MCAR assumption:

  • Little’s MCAR Test: a global test for whether the missingness is completely at random.

  • Pairwise T-Tests: individual tests that compare observed vs. missing groups.

class missmecha.analysis.MCARTest(method: str = 'little')[source]

Bases: object

A class to perform MCAR (Missing Completely At Random) tests.

Supports Little’s MCAR test (global test for all variables) and pairwise MCAR t-tests (for individual variables).

static little_mcar_test(X: DataFrame) → float[source]

Perform Little’s MCAR test on a DataFrame.

Parameters:

X (pd.DataFrame) – Input dataset.

Returns:

pvalue – P-value of the test.

Return type:

float

static mcar_t_tests(X: DataFrame) → DataFrame[source]

Perform pairwise MCAR t-tests between missing and observed groups.

Parameters:

X (pd.DataFrame) – Input dataset.

Returns:

p_matrix – Matrix of p-values (variable vs. variable); each entry tests whether one variable’s observed values differ between rows where the other variable is missing and rows where it is observed.

Return type:

pd.DataFrame

static report(pvalue: float, alpha: float = 0.05, method: str = "Little's MCAR Test") → None[source]

Print a summary report of the MCAR test.

Parameters:
  • pvalue (float) – The p-value from the MCAR test.

  • alpha (float, default=0.05) – Significance level.

  • method (str, default="Little's MCAR Test") – Method name shown in report.
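
Examples

A minimal usage sketch based on the static methods documented above; df is assumed to be a pandas DataFrame containing missing values.

>>> from missmecha.analysis import MCARTest
>>> pvalue = MCARTest.little_mcar_test(df)   # global test for all variables
>>> MCARTest.report(pvalue, alpha=0.05)      # formatted summary of the result
>>> p_matrix = MCARTest.mcar_t_tests(df)     # pairwise p-values (variable vs. variable)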