Numerical MNAR Types

This section introduces a suite of Missing Not At Random (MNAR) mechanisms specifically designed for numerical data.

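Unlike Missing Completely At Random (MCAR) and Missing At Random (MAR), MNAR assumes that the probability of missingness depends on the unobserved values themselves. The observed data alone therefore cannot reveal the missingness pattern, which makes MNAR the hardest class of mechanisms to simulate and often the closest to real-world data loss.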
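To make the distinction concrete, the following minimal NumPy sketch contrasts an MCAR mask with a self-dependent MNAR mask on synthetic data. The logistic slope and offset (2.0 and 1.0) are arbitrary illustrative choices and are not taken from missmecha.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)

# MCAR: every value has the same 30% chance of being masked.
mcar_mask = rng.random(x.size) < 0.3

# MNAR (self-masking): the masking probability grows with the value itself.
# The slope 2.0 and offset 1.0 are arbitrary choices for illustration.
p_missing = 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))
mnar_mask = rng.random(x.size) < p_missing

# Mean of the values that remain observed under each mechanism.
mcar_observed_mean = x[~mcar_mask].mean()
mnar_observed_mean = x[~mnar_mask].mean()
```

Under MCAR the observed mean stays close to the full-data mean, while under this MNAR mask it is pulled downward because large values are removed preferentially.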

All mechanisms below support numerical arrays (NumPy or pandas), and many allow column-wise or full-matrix control.

Note

Some types (e.g., MNARType5 and MNARType6) are designed to operate on each column independently (single-column masking). Others (e.g., MNARType1 to MNARType4) apply global strategies across multiple columns.


Overview of MNAR Mechanisms

Summary of MNAR Types

Type         Scope           Description
MNARType1    Global          Quantile-based masking using thresholds on both masked and observed columns.
MNARType2    Global          Logistic missingness using observed features to determine probabilities.
MNARType3    Global          Self-masking with logistic sampling, where each feature masks itself.
MNARType4    Global          Applies missingness above/below specific quantiles, with support for “upper”, “lower”, or “both” cuts.
MNARType5    Single-column   Applies self-masking to each feature independently using fitted logistic intercepts.
MNARType6    Single-column   Column-wise masking below percentile thresholds; supports both NumPy and pandas formats.

MNARType1

class missmecha.generate.mnar.MNARType1(missing_rate=0.1, seed=1, up_percentile=None, obs_percentile=0.5, depend_on=None)

Bases: object

MNAR Mechanism - Type 1 (Quantile-Based Threshold Masking)

Introduces missingness where values exceed a column-specific threshold defined by a quantile (e.g., the top 20%). The threshold is applied to the target columns, and masking can optionally be conditioned on extreme values in the observed columns.

Parameters:
  • missing_rate (float, default=0.1) – Approximate proportion of values to be masked.

  • seed (int, default=1) – Random seed for reproducibility.

  • up_percentile (float or None, default=None) – Quantile threshold above which values in the masking column are considered extreme.

  • obs_percentile (float, default=0.5) – Threshold for additional conditioning on observed values (used when available).

  • depend_on (Ignored) – Present for API compatibility; not used in this type.

fit(X, y=None)

Precompute masking thresholds for each target column using quantile cutoffs.

The data is scaled column-wise to [0, 1] before calculating quantiles. One threshold per column is stored for use in the transformation step.

Parameters:
  • X (np.ndarray) – Input numerical data.

  • y (Ignored) – Included for interface compatibility.

Returns:

self – Fitted object with threshold values stored.

Return type:

MNARType1

transform(X)

Apply quantile-based missingness to the dataset.

For each selected column, values greater than the quantile threshold are masked. Optionally, further filtering can be applied based on observed values in the remaining columns.

Parameters:

X (np.ndarray) – Input data to apply missingness.

Returns:

X_missing – Data with NaNs inserted based on precomputed thresholds.

Return type:

np.ndarray
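The threshold logic described for fit() and transform() above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the missmecha implementation: the helper names fit_thresholds and mask_above are hypothetical, up_percentile is set to 0.8 for the example, and the optional conditioning on observed columns (obs_percentile) and the missing_rate adjustment are left out.

```python
import numpy as np

def fit_thresholds(X, up_percentile=0.8):
    """Min-max scale each column to [0, 1] and return per-column quantile cutoffs."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range = np.where(col_range == 0, 1.0, col_range)  # guard constant columns
    X_scaled = (X - col_min) / col_range
    thresholds = np.quantile(X_scaled, up_percentile, axis=0)
    return col_min, col_range, thresholds

def mask_above(X, col_min, col_range, thresholds):
    """Set entries whose scaled value exceeds the column threshold to NaN."""
    X_scaled = (X - col_min) / col_range
    X_missing = X.astype(float)  # astype returns a copy, so X is untouched
    X_missing[X_scaled > thresholds] = np.nan
    return X_missing

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
params = fit_thresholds(X, up_percentile=0.8)
X_missing = mask_above(X, *params)
missing_per_column = np.isnan(X_missing).mean(axis=0)
```

Because the cutoff is a fixed quantile, this sketch masks a deterministic share of each column (about 20% here); how the library reconciles that share with missing_rate is not shown.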


MNARType2

class missmecha.generate.mnar.MNARType2(missing_rate=0.1, para=0.3, exclude_inputs=True, seed=1, depend_on=None)

Bases: object

MNAR Mechanism - Type 2 (Logistic Missingness Using Observed Features)

Simulates missingness by fitting a logistic model over a subset of the input features, and then masking values in the remaining columns based on predicted probabilities.

If exclude_inputs=True, input features are excluded from missingness and used only as predictors. Otherwise, all features can be masked.

Parameters:
  • missing_rate (float, default=0.1) – Target overall proportion of missing values.

  • para (float, default=0.3) – Proportion of columns used as logistic predictors.

  • exclude_inputs (bool, default=True) – Whether to exclude input (predictor) features from being masked.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (Ignored) – Present for compatibility; not used in this mechanism.

fit(X, y=None)

Fit a logistic model to predict missingness probabilities.

Randomly selects a subset of columns as predictors (based on para) and learns logistic coefficients and intercepts for the remaining columns. These will be used to determine masking during transform().

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for API compatibility.

Returns:

self – Fitted object with learned parameters.

Return type:

MNARType2

transform(X)

Apply logistic missingness using learned probabilities.

Probabilities are computed using the fitted logistic model, and missingness is introduced accordingly. If exclude_inputs=True, masking is restricted to the non-predictor columns.

Parameters:

X (np.ndarray) – Input data matrix.

Returns:

X_missing – Data matrix with missing values injected.

Return type:

np.ndarray
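As a rough illustration of predictor-driven logistic masking with exclude_inputs=True, the sketch below selects a share of columns as predictors, turns a weighted sum of them into logistic probabilities, and masks only the remaining columns. The function name, the random weights, and the proportional rescaling of probabilities to hit missing_rate are assumptions made for this example; the library may calibrate its coefficients and intercepts differently.

```python
import numpy as np

def logistic_mask_from_predictors(X, missing_rate=0.1, para=0.3, seed=1):
    """Hypothetical sketch: mask non-predictor columns via logistic probabilities."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape

    # Pick roughly `para` of the columns as predictors; the rest are targets.
    n_pred = max(1, int(round(para * n_cols)))
    pred_idx = rng.choice(n_cols, size=n_pred, replace=False)
    target_idx = np.setdiff1d(np.arange(n_cols), pred_idx)

    # Standardize the predictors and combine them with random weights (assumption).
    Z = X[:, pred_idx]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    weights = rng.normal(size=n_pred)
    score = Z @ weights

    # Logistic probabilities, rescaled so that their mean equals missing_rate.
    prob = 1.0 / (1.0 + np.exp(-score))
    prob = np.clip(prob * missing_rate / prob.mean(), 0.0, 1.0)

    X_missing = X.astype(float)
    for j in target_idx:
        drop = rng.random(n_rows) < prob  # independent draw per target column
        X_missing[drop, j] = np.nan
    return X_missing, pred_idx

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
X_missing, pred_idx = logistic_mask_from_predictors(X)
```

The predictor columns stay fully observed in this sketch, which mirrors the exclude_inputs=True behaviour described above.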


MNARType3

class missmecha.generate.mnar.MNARType3(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MNAR Mechanism - Type 3 (Self-Masking with Logistic Probabilities)

A self-masking mechanism where each feature determines its own missingness probability via a feature-wise logistic function. Coefficients and intercepts are learned for each column independently.

Parameters:
  • missing_rate (float, default=0.1) – Target proportion of missing values.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (Ignored) – Present for compatibility; not used in this mechanism.

fit(X, y=None)

Fit a logistic model for each feature using its own values as input.

For every column, a separate set of logistic coefficients and intercepts is computed so that the expected share of masked values matches the specified missing_rate.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for interface compatibility.

Returns:

self – Fitted object with per-feature logistic models.

Return type:

MNARType3

transform(X)

Apply self-masking based on feature-wise logistic models.

Each column masks its own values independently according to the logistic probability computed from the feature’s value and learned intercept.

Parameters:

X (np.ndarray) – Input data matrix.

Returns:

X_missing – Transformed matrix with missing entries introduced column-wise.

Return type:

np.ndarray
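The fit/transform split for self-masking can be sketched as a small class that calibrates one intercept per column and then samples the mask. This is a sketch under assumptions, not the library's code: the class name SelfMaskingSketch, the fixed slope coef, the standardization step, and the use of bisection to find the intercept are all illustrative choices.

```python
import numpy as np

class SelfMaskingSketch:
    """Illustrative self-masking model; not the missmecha implementation."""

    def __init__(self, missing_rate=0.1, coef=1.0, seed=1):
        self.missing_rate = missing_rate
        self.coef = coef  # hypothetical fixed slope shared by all columns
        self.seed = seed

    @staticmethod
    def _sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Z = (X - self.mean_) / self.std_
        self.intercepts_ = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            # The mean probability increases with the intercept, so bisection works.
            lo, hi = -50.0, 50.0
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if self._sigmoid(self.coef * Z[:, j] + mid).mean() > self.missing_rate:
                    hi = mid
                else:
                    lo = mid
            self.intercepts_[j] = 0.5 * (lo + hi)
        return self

    def transform(self, X):
        rng = np.random.default_rng(self.seed)
        Z = (X - self.mean_) / self.std_
        # Each entry's masking probability depends only on its own value.
        prob = self._sigmoid(self.coef * Z + self.intercepts_)
        X_missing = X.astype(float)
        X_missing[rng.random(X.shape) < prob] = np.nan
        return X_missing

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
model = SelfMaskingSketch(missing_rate=0.1).fit(X)
X_missing = model.transform(X)
```

With a positive slope, the masked entries are on average larger than the ones left observed, which is the defining property of self-masking.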


MNARType4

class missmecha.generate.mnar.MNARType4(missing_rate=0.1, q=0.25, p=0.5, cut='both', seed=1, depend_on=None)

Bases: object

MNAR Mechanism - Type 4 (Quantile Cutoff Masking with Optional Upper/Lower/Both)

Introduces missingness where feature values fall in the upper tail, the lower tail, or both tails of a column's distribution, as delimited by a quantile cutoff. The cut parameter selects which tail(s) are eligible for masking.

Parameters:
  • missing_rate (float, default=0.1) – Proportion of values to be masked.

  • q (float, default=0.25) – Quantile value used to define cutoff thresholds (e.g., q=0.25 for 25% tails).

  • p (float, default=0.5) – Proportion of columns to be affected.

  • cut ({"upper", "lower", "both"}, default="both") – Defines which side(s) of the distribution will be masked.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (Ignored) – Present for compatibility; not used in this mechanism.

fit(X, y=None)

Precompute cutoff thresholds for each column.

Depending on the cut parameter, stores upper, lower, or both quantile thresholds for selected columns. Columns are chosen randomly based on p.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • y (Ignored) – Included for API compatibility.

Returns:

self – Fitted object with quantile thresholds stored.

Return type:

MNARType4

transform(X)

Apply missingness to values beyond the selected quantile cutoffs.

Missing values are introduced into the selected columns where entries fall beyond the precomputed upper, lower, or both quantile cutoffs. Bernoulli sampling over those entries is then used to approximate the target missing_rate.

Parameters:

X (np.ndarray) – Input data matrix to transform.

Returns:

X_missing – Transformed data with missing entries injected.

Return type:

np.ndarray
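The quantile-cut logic can be illustrated with the following sketch. It is not the library's code: the function name quantile_cut_mask is hypothetical, and treating missing_rate as the target rate within each selected column (by thinning the tail with a Bernoulli draw) is an assumption made for the example.

```python
import numpy as np

def quantile_cut_mask(X, missing_rate=0.1, q=0.25, p=0.5, cut="both", seed=1):
    """Hypothetical sketch: mask tail values in a random subset of columns."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    n_sel = max(1, int(round(p * n_cols)))
    cols = rng.choice(n_cols, size=n_sel, replace=False)

    X_missing = X.astype(float)
    for j in cols:
        lower = np.quantile(X[:, j], q)
        upper = np.quantile(X[:, j], 1.0 - q)
        if cut == "upper":
            in_tail = X[:, j] >= upper
        elif cut == "lower":
            in_tail = X[:, j] <= lower
        elif cut == "both":
            in_tail = (X[:, j] <= lower) | (X[:, j] >= upper)
        else:
            raise ValueError("cut must be 'upper', 'lower', or 'both'")

        # Bernoulli thinning: drop only a fraction of the tail entries so that
        # the column's missing share is close to missing_rate (assumption).
        tail_fraction = in_tail.mean()
        drop_prob = min(1.0, missing_rate / tail_fraction) if tail_fraction > 0 else 0.0
        drop = in_tail & (rng.random(n_rows) < drop_prob)
        X_missing[drop, j] = np.nan
    return X_missing, cols

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 4))
X_missing, cols = quantile_cut_mask(X, cut="upper")
```

In this run only values at or above each selected column's 75th percentile can be masked, and the unselected columns are returned unchanged.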


MNARType5

class missmecha.generate.mnar.MNARType5(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MNAR Mechanism - Type 5 (Single-Column Self-Masking with Fitted Intercepts)

Introduces missingness for each column independently by fitting a logistic function to its own values. A coefficient and intercept are learned per feature.

This mechanism is suitable for per-column missingness and assumes that the missingness probability depends only on the value of the feature itself.

Parameters:
  • missing_rate (float, default=0.1) – Desired proportion of missing values per column.

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (Ignored) – Included for API compatibility.

fit(X, y=None)

Fit feature-wise logistic coefficients and intercepts.

For each column, learns a logistic intercept such that the expected proportion of missing values matches the missing_rate.

Parameters:
  • X (np.ndarray) – Input numerical data (n_samples, n_features).

  • y (Ignored) – Present for compatibility.

Returns:

self – Fitted object with per-column logistic parameters.

Return type:

MNARType5

transform(X)

Apply self-masking to each column based on learned probabilities.

For each feature, a logistic model is used to compute the probability of masking, and missing values are introduced accordingly.

Parameters:

X (np.ndarray) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with per-feature missing entries.

Return type:

np.ndarray


MNARType6

class missmecha.generate.mnar.MNARType6(missing_rate=0.1, seed=1, depend_on=None)

Bases: object

MNAR Mechanism - Type 6 (Percentile-Based Per-Column Thresholding)

Introduces missingness separately for each column, based on whether values fall below a specified percentile threshold. This allows for fine-grained, column-wise control of missingness and supports both NumPy arrays and pandas DataFrames.

Parameters:
  • missing_rate (float, default=0.1) – Threshold percentile for masking, expressed as a fraction (e.g., 0.1 means the bottom 10% of values in each column become missing).

  • seed (int, default=1) – Random seed for reproducibility.

  • depend_on (Ignored) – Present for compatibility.

fit(X, y=None)

Compute per-column thresholds based on the given percentile.

For each feature, a percentile cutoff is calculated and stored. During transform, values below this cutoff will be masked.

Parameters:
  • X (np.ndarray or pd.DataFrame) – Input data used to calculate percentile thresholds.

  • y (Ignored) – Present for API compatibility.

Returns:

self – Fitted object with threshold values stored.

Return type:

MNARType6

transform(X)

Apply per-column masking to values below the learned percentile thresholds.

Handles both NumPy arrays and pandas DataFrames automatically. If the input is a DataFrame, missing values are inserted by column name.

Parameters:

X (np.ndarray or pd.DataFrame) – Input data matrix to apply missingness.

Returns:

X_missing – Transformed data with missing entries inserted.

Return type:

np.ndarray or pd.DataFrame
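A minimal NumPy sketch of the percentile thresholding described above follows. The helper names fit_lower_thresholds and mask_below are hypothetical, and only the array case is shown; the DataFrame handling is not reproduced here.

```python
import numpy as np

def fit_lower_thresholds(X, missing_rate=0.1):
    """Return one cutoff per column at the `missing_rate` quantile."""
    return np.quantile(X, missing_rate, axis=0)

def mask_below(X, thresholds):
    """Set entries strictly below their column's cutoff to NaN."""
    X_missing = X.astype(float)  # astype returns a copy, so X is untouched
    X_missing[X < thresholds] = np.nan
    return X_missing

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
X_new = rng.normal(size=(500, 4))

thresholds = fit_lower_thresholds(X_train, missing_rate=0.1)
train_missing = mask_below(X_train, thresholds)
new_missing = mask_below(X_new, thresholds)  # reuses the fitted cutoffs

train_rate = np.isnan(train_missing).mean(axis=0)
```

The masking here is deterministic given the thresholds, so the seed plays no role in this sketch. On the data used for fitting, each column loses about 10% of its values; on new data the share depends on how that data is distributed relative to the stored cutoffs.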