Numerical MAR Types

This section introduces classes for simulating Missing At Random (MAR) mechanisms in numerical data.

MAR assumes that the probability of missingness depends only on the observed variables, not on the values that are missing.

This makes the simulated missingness more realistic than Missing Completely At Random (MCAR), where missingness is independent of the data, while keeping it tractable for imputation and analysis.
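In the usual notation (the symbols are introduced here for illustration and are not taken from MissMecha), let M be the missingness indicator matrix and split the data into an observed part X_obs and a missing part X_mis. The two assumptions can then be written as:

```latex
\begin{align*}
\text{MAR:}  \quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M \mid X_{\mathrm{obs}}) \\
\text{MCAR:} \quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M)
\end{align*}
```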

All classes below are designed for continuous or ordinal numerical data, and cover feature dependencies ranging from a fitted logistic model to rank-based and threshold-based rules.

Note

All MAR types in MissMecha support tabular numerical data. Some types support full-matrix masking, while others work on a per-column basis.


Overview of MAR Mechanisms

Summary of MAR Types

Type      Dependency           Description
--------  -------------------  ------------------------------------------------------------------------------------------------
MARType1  Logistic model       Introduces missingness using a fitted logistic regression over observed features.
MARType2  Mutual information   Selects masking columns based on mutual information with synthetic labels.
MARType3  Point-biserial       Computes correlations with a (synthetic or real) binary label to guide missingness.
MARType4  Correlation ranking  Identifies least-correlated columns and masks them based on their most correlated partner columns.
MARType5  Ranking              Uses value rankings in a controlling column to assign missingness to the other columns.
MARType6  Binary grouping      Splits rows into high/low groups based on the median and applies missingness with skewed probability.
MARType7  Top-value rule       Selects rows with the top values in a controlling column and masks the other columns in those rows.
MARType8  Extreme-value        Masks rows with both the highest and lowest values in a selected column.

MARType1

class missmecha.generate.mar.MARType1(missing_rate=0.1, seed=1, para=0.3, depend_on=None)

Bases: object

MAR Mechanism - Type 1 (Logistic Missingness Based on Observed Features)

Introduces missingness based on a logistic model, where the missingness probability depends on a subset of observed features.

Parameters:
  • missing_rate (float, default=0.1) – Target proportion of missing entries.

  • seed (int, default=1) – Random seed for reproducibility.

  • para (float, default=0.3) – Proportion of features to sample as observed covariates when depend_on is None.

  • depend_on (list[int] or None) – Indices of features to use as observed covariates. If None, a random subset is sampled (see para).

fit(X, y=None, xs=None)

Fit the logistic model to determine missingness probabilities.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for compatibility.

  • xs (int or None) – Index of the feature to mask. If None, every feature that is not used as an observed covariate is masked.

Returns:

self – Fitted object with learned parameters.

Return type:

MARType1

transform(X)

Apply the learned MARType1 mechanism to the input data.

Parameters:

X (np.ndarray) – Input data to apply missingness to.

Returns:

X_missing – Array with NaN entries introduced based on the fitted logistic model.

Return type:

np.ndarray
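The logistic idea behind MARType1 can be sketched without the library. The following is a minimal, dependency-free illustration and not MissMecha's implementation: the function name logistic_mar_sketch, the random Gaussian weights, and the bisection used to calibrate the intercept are all assumptions made for this example. The missingness probability of each row depends only on the observed columns, which are never masked.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_mar_sketch(rows, observed, targets, missing_rate=0.1, seed=1):
    """Mask `targets` with a probability driven by the `observed` columns only."""
    rng = random.Random(seed)
    # Assumption: random Gaussian weights over the observed columns.
    weights = [rng.gauss(0.0, 1.0) for _ in observed]
    scores = [sum(w * row[j] for w, j in zip(weights, observed)) for row in rows]

    # Bisection on the intercept so the mean probability matches missing_rate.
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        mean_p = sum(sigmoid(s + mid) for s in scores) / len(scores)
        if mean_p > missing_rate:
            hi = mid
        else:
            lo = mid
    intercept = (lo + hi) / 2.0

    masked = [list(row) for row in rows]
    for i, s in enumerate(scores):
        p = sigmoid(s + intercept)  # depends on observed values only
        for j in targets:
            if rng.random() < p:
                masked[i][j] = math.nan
    return masked


data_rng = random.Random(0)
rows = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
masked = logistic_mar_sketch(rows, observed=[0, 1], targets=[2, 3])
```

Because the draws are random, the realized missing rate in the target columns only approximates missing_rate; the observed columns stay complete.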


MARType2

class missmecha.generate.mar.MARType2(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 2 (Mutual Information-Based Feature Ranking)

Selects features with high mutual information scores relative to a synthetic label, and introduces missingness proportionally across features.

Parameters:
  • missing_rate (float, default=0.1) – Overall proportion of missing entries.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – List of features to compute mutual information against. If None, all features are used.

fit(X, y=None)

Compute mutual information scores from observed features and fit internal parameters.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for compatibility.

Returns:

self – Fitted object.

Return type:

MARType2

transform(X)

Apply missingness proportionally across all features.

Parameters:

X (np.ndarray) – Input data to apply missingness to.

Returns:

X_missing – Transformed array with missing entries.

Return type:

np.ndarray
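To make the ranking step concrete, here is a small sketch of scoring columns by mutual information against a binary label. It is an illustration under stated assumptions rather than the library's code: the helper names, the equal-width binning, and the way the synthetic label is built (thresholding column 0 at its median) are all hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(values, labels, bins=4):
    """Plug-in MI estimate (in nats) between a binned feature and a discrete label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(binned, labels))
    px = Counter(binned)
    py = Counter(labels)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed cells.
    return sum(
        (c / n) * math.log(c * n / (px[b] * py[y]))
        for (b, y), c in joint.items()
    )


def rank_columns_by_mi(rows, labels, bins=4):
    """Return column indices sorted from highest to lowest MI, plus the scores."""
    columns = list(zip(*rows))
    scores = [mutual_information(col, labels, bins) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores


rng = random.Random(1)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
# Assumption: synthetic label is 1 when column 0 exceeds its median.
cut = statistics.median(r[0] for r in rows)
labels = [int(r[0] > cut) for r in rows]
order, scores = rank_columns_by_mi(rows, labels)
```

In this toy setup column 0 determines the label, so it ranks first; the other two columns are independent noise and score close to zero.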


MARType3

class missmecha.generate.mar.MARType3(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 3 (Point-Biserial Correlation with Observed or Synthetic Label)

Estimates the importance of each feature by computing the point-biserial correlation between each column and a binary target (real or synthetic). The average correlation score determines the intensity of random missingness.

Parameters:
  • missing_rate (float, default=0.1) – Overall proportion of missing values to introduce.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – Columns used to construct synthetic labels if y is not provided.

fit(X, y=None)

Compute feature correlations with a binary label and determine average correlation.

If a label y is not provided, a synthetic label is generated by projecting the data onto a random direction. Point-biserial correlation is then calculated between each feature and the binary label to estimate dependency strength.

Parameters:
  • X (np.ndarray) – Input data matrix (will be converted to float).

  • y (np.ndarray or None) – Optional binary label. If not provided, a synthetic label will be generated.

Returns:

self – Fitted object containing average correlation score.

Return type:

MARType3

transform(X)

Apply uniform missingness with intensity guided by average point-biserial correlation.

Missing entries are randomly introduced into the data matrix based on the fitted correlation score and the desired missing rate.

Parameters:

X (np.ndarray) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with missing values inserted.

Return type:

np.ndarray
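The two ingredients described for MARType3, a synthetic label from a random projection and a point-biserial correlation per column, can be sketched as follows. This is a stand-alone illustration, not the library's code: the function names are hypothetical, and thresholding the projection at its median is an assumption (the documentation above only states that the data are projected onto a random direction).

```python
import math
import random
import statistics


def synthetic_label(rows, seed=1):
    """Binary label from a random projection, thresholded at its median (assumed)."""
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    proj = [sum(d * x for d, x in zip(direction, row)) for row in rows]
    cut = statistics.median(proj)
    return [int(p > cut) for p in proj]


def point_biserial(values, labels):
    """Point-biserial correlation between a numeric column and a 0/1 label."""
    n = len(values)
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    if not ones or not zeros:
        return 0.0
    s = statistics.pstdev(values)  # population standard deviation
    if s == 0:
        return 0.0
    m1, m0 = statistics.fmean(ones), statistics.fmean(zeros)
    return (m1 - m0) / s * math.sqrt(len(ones) * len(zeros) / n**2)


def average_abs_correlation(rows, labels):
    """Mean absolute point-biserial correlation over all columns."""
    columns = list(zip(*rows))
    return statistics.fmean(abs(point_biserial(c, labels)) for c in columns)


rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
labels = synthetic_label(rows)
avg_corr = average_abs_correlation(rows, labels)
```

How the average score is turned into a masking intensity is left out here, since the documentation does not specify it.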


MARType4

class missmecha.generate.mar.MARType4(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 4 (Correlation-Driven Column Ranking with Pairwise Masking)

Selects features with weakest correlation to a binary label (real or synthetic), then introduces missing values into those features based on their relationship with the most correlated partner column.

Parameters:
  • missing_rate (float, default=0.1) – Target proportion of missing entries.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – Columns to use when generating synthetic label. If None, all columns are used.

fit(X, y=None)

Compute feature correlations to a binary label and rank features by relevance.

A synthetic label is generated from selected columns if y is not provided. Features with weakest correlation are selected as targets for masking. Their most correlated counterpart feature is later used to determine which rows to mask.

Parameters:
  • X (np.ndarray) – Input data matrix (will be converted to float).

  • y (np.ndarray or None) – Optional binary label. If not provided, a synthetic label is generated.

Returns:

self – Fitted object storing ranked feature indices.

Return type:

MARType4

transform(X)

Apply column-wise masking based on correlations with paired columns.

For each target column, the most correlated other column is identified. Rows with the smallest values in the correlated column are masked in the target column.

Parameters:

X (np.ndarray) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with missing values applied.

Return type:

np.ndarray
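The pairwise masking step can be illustrated with a short sketch: find the column most correlated with the target, then mask the target in the rows where that partner column is smallest. The helper names are hypothetical, the target column is passed in directly rather than chosen by label correlation, and using the absolute Pearson correlation to pick the partner is an assumption.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def mask_by_partner(rows, target, missing_rate=0.1):
    """Mask `target` in the rows with the smallest values of its partner column."""
    n = len(rows)
    columns = list(zip(*rows))
    others = [j for j in range(len(columns)) if j != target]
    # Assumption: the partner is the column with the largest absolute correlation.
    partner = max(others, key=lambda j: abs(pearson(columns[target], columns[j])))
    k = int(round(missing_rate * n))
    lowest = sorted(range(n), key=lambda i: columns[partner][i])[:k]
    masked = [list(row) for row in rows]
    for i in lowest:
        masked[i][target] = math.nan
    return masked, partner


noise = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
rows = [[float(i), 2.0 * i + 1.0, float(noise[i])] for i in range(10)]
masked, partner = mask_by_partner(rows, target=1, missing_rate=0.3)
```

Here column 1 is a linear function of column 0, so column 0 becomes the partner and the three rows with the smallest values in column 0 lose their entry in column 1.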


MARType5

class missmecha.generate.mar.MARType5(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 5 (Rank-Based Missingness from a Dependency Feature)

Selects a single column as the dependency feature (xd), and generates missingness in all other columns based on ranks in xd. Rows with higher values in xd are more likely to be selected for missingness.

Parameters:
  • missing_rate (float, default=0.1) – Target proportion of missing entries.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – Candidate columns to select as the dependency column. If None, all columns are considered.

fit(X, y=None)

Select a dependency feature to control missingness.

A single column is randomly selected from the specified candidates (or all columns if depend_on is None) and stored as the controlling feature for masking.

Parameters:
  • X (np.ndarray) – Input data matrix (converted to float).

  • y (Ignored) – Included for compatibility.

Returns:

self – Fitted object storing the selected dependency feature.

Return type:

MARType5

transform(X)

Introduce missing values based on rank probabilities from the selected feature.

The higher the rank (value) of a row in the selected dependency feature, the more likely it is to be chosen for masking across other columns.

Parameters:

X (np.ndarray) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with NaNs introduced based on ranked dependency.

Return type:

np.ndarray
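A rank-weighted selection of rows can be sketched in plain Python. This example is an assumption-laden illustration, not MissMecha's code: the function name is hypothetical, the selection weight of a row is taken to be its rank in xd, and rows are drawn without replacement using the Efraimidis-Spirakis key trick.

```python
import math
import random


def rank_weighted_mask(rows, xd, missing_rate=0.1, seed=1):
    """Mask all columns except `xd` in rows sampled with probability rising in rank."""
    rng = random.Random(seed)
    n = len(rows)
    # Rank 1 = smallest value in xd, rank n = largest.
    order = sorted(range(n), key=lambda i: rows[i][xd])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r

    # Efraimidis-Spirakis weighted sampling without replacement:
    # draw key u ** (1 / weight) per row and keep the k largest keys.
    keys = [rng.random() ** (1.0 / rank[i]) for i in range(n)]
    k = int(round(missing_rate * n))
    chosen = sorted(range(n), key=lambda i: keys[i], reverse=True)[:k]

    masked = [list(row) for row in rows]
    for i in chosen:
        for j in range(len(masked[i])):
            if j != xd:
                masked[i][j] = math.nan
    return masked, chosen


rows = [[float(i), 1.0, 2.0] for i in range(1000)]
masked, chosen = rank_weighted_mask(rows, xd=0, missing_rate=0.1)
```

The dependency column itself is left intact, so the missingness in the other columns can always be explained by an observed value.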


MARType6

class missmecha.generate.mar.MARType6(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 6 (Skewed Binary Grouping Based on Dependency Column)

Partitions the dataset into two groups (high vs. low) based on the median of a selected dependency column. Then introduces missingness with skewed probabilities between the groups (e.g., 90% from the high group, 10% from the low group).

Parameters:
  • missing_rate (float, default=0.1) – Proportion of total values to mask.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – Candidate columns to select the controlling feature (xd). If None, all columns are considered.

fit(X, y=None)

Select a dependency feature to define group-based masking.

Randomly selects one feature (xd) from the candidate list or all columns. This feature is later used to partition rows into high/low groups.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for compatibility.

Returns:

self – Fitted object storing the selected dependency column.

Return type:

MARType6

transform(X)

Apply missingness by sampling more frequently from one group.

The selected feature xd is used to split the rows into two groups based on median value. Rows from the higher-value group are sampled with greater probability to introduce missing values across other columns.

Parameters:

X (np.ndarray) – Input data to apply missingness.

Returns:

X_missing – Transformed array with missing values introduced.

Return type:

np.ndarray
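The median split with skewed sampling might look like the sketch below. The function name and the high_share parameter are hypothetical; the 90/10 split follows the example given in the class description, and treating rows equal to the median as part of the low group is an assumption.

```python
import math
import random
import statistics


def skewed_group_mask(rows, xd, missing_rate=0.1, high_share=0.9, seed=1):
    """Mask mostly rows whose `xd` value lies above the median of `xd`."""
    rng = random.Random(seed)
    n = len(rows)
    cut = statistics.median(row[xd] for row in rows)
    high = [i for i in range(n) if rows[i][xd] > cut]
    low = [i for i in range(n) if rows[i][xd] <= cut]  # ties go to the low group

    k = int(round(missing_rate * n))
    # Cap each draw by the group size so rng.sample never over-samples.
    k_high = min(int(round(high_share * k)), len(high))
    k_low = min(k - k_high, len(low))
    chosen = rng.sample(high, k_high) + rng.sample(low, k_low)

    masked = [list(row) for row in rows]
    for i in chosen:
        for j in range(len(masked[i])):
            if j != xd:
                masked[i][j] = math.nan
    return masked, cut


rows = [[float(i), 1.0] for i in range(100)]
masked, cut = skewed_group_mask(rows, xd=0, missing_rate=0.1)
```

With 100 rows and a 10% rate, this selects 10 rows: 9 from above the median and 1 from at or below it.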


MARType7

class missmecha.generate.mar.MARType7(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 7 (Top Value Masking Based on Dependency Column)

Selects a controlling feature (xd), ranks its values, and applies missingness to the top-ranked rows (those with the highest values) across the remaining columns.

Parameters:
  • missing_rate (float, default=0.1) – Target proportion of values to mask.

  • seed (int, default=1) – Random seed to ensure reproducibility.

  • depend_on (list[int] or None) – List of candidate features for controlling missingness. If None, selects from all columns.

fit(X, y=None)

Randomly select a column to use for top-value-based masking.

The selected feature (xd) will determine which rows receive missingness, by identifying the highest-valued entries.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for interface consistency.

Returns:

self – Fitted object containing the controlling feature.

Return type:

MARType7

transform(X)

Introduce missing values into the rows with the highest values in the selected feature.

For each non-controlling column, missingness is applied to a fixed number of rows corresponding to the top-ranked values in the dependency column.

Parameters:

X (np.ndarray) – Input data to transform.

Returns:

X_missing – Array with missing values inserted into top-ranked rows.

Return type:

np.ndarray
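The top-value rule is deterministic once the controlling column is fixed, which makes it easy to sketch. The function name is hypothetical, and taking round(missing_rate * n) as the number of masked rows is an assumption about how the "fixed number of rows" is derived.

```python
import math


def top_value_mask(rows, xd, missing_rate=0.1):
    """Mask every column except `xd` in the rows with the largest `xd` values."""
    n = len(rows)
    k = int(round(missing_rate * n))  # assumed number of rows to mask
    top = sorted(range(n), key=lambda i: rows[i][xd], reverse=True)[:k]
    masked = [list(row) for row in rows]
    for i in top:
        for j in range(len(masked[i])):
            if j != xd:
                masked[i][j] = math.nan
    return masked


rows = [[5.0, 1.0], [9.0, 2.0], [1.0, 3.0], [7.0, 4.0], [3.0, 5.0]]
masked = top_value_mask(rows, xd=0, missing_rate=0.4)
```

In this toy input the two rows with the largest values in column 0 (9.0 and 7.0) lose their entry in column 1, while column 0 is unchanged.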


MARType8

class missmecha.generate.mar.MARType8(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MAR Mechanism - Type 8 (Extreme Value Masking Based on Dependency Column)

Selects the rows with the most extreme values (both lowest and highest) in a controlling feature (xd) and masks the remaining columns in those rows.

Parameters:
  • missing_rate (float, default=0.1) – Desired overall proportion of missing values.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (list[int] or None) – Columns to choose from as the dependency column. If None, selects from all features.

fit(X, y=None)

Select a dependency feature and identify extreme-valued rows.

The selected column (xd) is used to rank all rows. Both low and high extremes will be targeted for masking during transformation.

Parameters:
  • X (np.ndarray) – Input data matrix (converted to float).

  • y (Ignored) – Included for compatibility.

Returns:

self – Fitted object storing the selected dependency column.

Return type:

MARType8

transform(X)

Apply missingness to extreme-value rows in the selected column.

Both the highest and lowest value rows in the dependency column are selected, and missing values are introduced into the remaining columns.

Parameters:

X (np.ndarray) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with missing entries introduced in extreme rows.

Return type:

np.ndarray
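Extreme-value masking differs from the top-value rule only in that rows are taken from both ends of the ranking. The sketch below uses a hypothetical function name and assumes the row budget is split evenly between the lowest and highest values, with any odd remainder going to the high end; the library may divide it differently.

```python
import math


def extreme_value_mask(rows, xd, missing_rate=0.1):
    """Mask every column except `xd` in the rows with the lowest and highest `xd`."""
    n = len(rows)
    k = int(round(missing_rate * n))
    k_low = k // 2         # assumed even split between the two tails
    k_high = k - k_low
    order = sorted(range(n), key=lambda i: rows[i][xd])
    # Slicing from n - k_high stays correct when k_high is 0.
    extremes = order[:k_low] + order[n - k_high:]
    masked = [list(row) for row in rows]
    for i in extremes:
        for j in range(len(masked[i])):
            if j != xd:
                masked[i][j] = math.nan
    return masked


values = [4, 8, 1, 9, 5, 2, 7, 3, 6, 0]
rows = [[float(v), 0.0] for v in values]
masked = extreme_value_mask(rows, xd=0, missing_rate=0.4)
```

For this input the rows holding 0 and 1 (lowest) and 8 and 9 (highest) in column 0 are masked in column 1.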