Publications | Youran (Echo) Zhou

2025

ECAI

HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation

Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, and Sunil Aryal

In ECAI 2025, 2025

Handling incomplete and heterogeneous data remains a central challenge in real-world machine learning, where missing values may follow complex mechanisms (MCAR, MAR, MNAR) and features can be of mixed types (numerical and categorical). Existing methods often rely on imputation, which may introduce bias or privacy risks, or fail to jointly address data heterogeneity and structured missingness. We propose the Heterogeneous Incomplete Probability Mass Kernel (HI-PMK), a novel data-dependent representation learning approach that eliminates the need for imputation. HI-PMK introduces two key innovations: (1) a probability mass-based dissimilarity measure that adapts to local data distributions across heterogeneous features (numerical, ordinal, nominal), and (2) a missingness-aware uncertainty strategy (MaxU) that conservatively handles all three missingness mechanisms by assigning maximal plausible dissimilarity to unobserved entries. Our approach is privacy-preserving, scalable, and readily applicable to downstream tasks such as classification and clustering. Extensive experiments on over 15 benchmark datasets demonstrate that HI-PMK consistently outperforms traditional imputation-based pipelines and kernel methods across a wide range of missing data settings. Code is available at: github.com/echoid/Incomplete-Heter-Kernel.

CIKM

Toward Robust Machine Learning under Diverse Incomplete Data Mechanisms in Real-World Applications

Youran Zhou

In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Republic of Korea, 2025

Incomplete data is a pervasive challenge across a wide range of data types, including tabular, sensor, time-series, image, and textual data. Its presence stems from various real-world factors and gives rise to different missingness mechanisms. While much of the existing research focuses on the Missing Completely At Random (MCAR) assumption, the more complex and realistic mechanisms, Missing At Random (MAR) and Missing Not At Random (MNAR), remain relatively underexplored despite their prevalence and impact. This PhD project aims to systematically investigate the challenges posed by diverse incomplete data mechanisms and to develop robust machine learning methods that perform reliably across MCAR, MAR, and MNAR scenarios. The research spans multiple data modalities and focuses on improving both the theoretical understanding and practical handling of incomplete data. By addressing mechanism-specific imputation challenges and proposing broadly applicable solutions, this work contributes to building more resilient and trustworthy data-driven systems in real-world settings.
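
The three mechanisms named in the abstract can be illustrated with a small sketch. It is not taken from the paper or from any toolkit listed on this page: the function name `make_missing_masks`, the two-column layout, and the deterministic "hide the largest values" rule are all assumptions chosen to make the difference between the mechanisms easy to see.

```python
import numpy as np

def make_missing_masks(X, rate=0.3, seed=0):
    """Return boolean masks (True = missing) for column 1 of X.

    Hypothetical helper for illustration only. Column 0 is assumed
    to be fully observed; column 1 is the one that loses values.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = int(round(rate * n))  # number of entries to hide

    # MCAR: missingness is independent of every data value.
    mcar = np.zeros(n, dtype=bool)
    mcar[rng.choice(n, size=k, replace=False)] = True

    # MAR: missingness in column 1 depends only on the observed column 0
    # (here, rows with the k largest values of column 0 lose column 1).
    mar = np.zeros(n, dtype=bool)
    mar[np.argsort(X[:, 0])[n - k:]] = True

    # MNAR: missingness depends on the unobserved value itself
    # (here, the k largest values of column 1 are the ones hidden).
    mnar = np.zeros(n, dtype=bool)
    mnar[np.argsort(X[:, 1])[n - k:]] = True

    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}
```

In practice the MAR and MNAR rules are usually probabilistic rather than hard thresholds; the hard cut-off above is a simplification that keeps the dependence structure visible.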
CIKM

MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation

Youran Zhou, Mohamed Reda Bouadjenek, and Sunil Aryal

In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Republic of Korea, 2025

Diffusion models have recently emerged as powerful tools for missing data imputation by modeling the joint distribution of observed and unobserved variables. However, existing methods, typically based on stochastic denoising diffusion probabilistic models (DDPMs), suffer from high inference latency and variable outputs, limiting their applicability in real-world tabular settings. To address these deficiencies, we present MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation. While stochastic sampling enables diverse completions, it also introduces output variability that complicates downstream processing. MissDDIM replaces this with a deterministic, non-Markovian sampling path, yielding faster and more consistent imputations. To better leverage incomplete inputs during training, we introduce a self-masking strategy that dynamically constructs imputation targets from observed features, enabling robust conditioning without requiring fully observed data. Experiments on five benchmark datasets demonstrate that MissDDIM matches or exceeds the accuracy of state-of-the-art diffusion models, while significantly improving inference speed and stability. These results highlight the practical value of deterministic diffusion for real-world imputation tasks.
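
The self-masking idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the MissDDIM implementation: the function name `self_mask`, the `hide_frac` parameter, and the uniform random choice of hidden cells are all hypothetical.

```python
import numpy as np

def self_mask(observed, hide_frac=0.2, seed=0):
    """Split observed cells into conditioning cells and training targets.

    observed: boolean array, True where a value is actually present.
    Returns (cond_mask, target_mask): two disjoint boolean masks whose
    union equals `observed`. Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Draw one uniform score per cell; cells scoring below hide_frac are hidden.
    hide = rng.random(observed.shape) < hide_frac
    # Only cells that were truly observed can become targets,
    # because only those have a known ground-truth value.
    target_mask = observed & hide
    cond_mask = observed & ~hide
    return cond_mask, target_mask
```

A training loss would then be computed only on `target_mask` cells, with the model conditioned on `cond_mask` cells, so no fully observed rows are needed.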

Project

MissMecha: A Flexible Python Toolkit for Missing Data Mechanisms

Youran Zhou, Mohamed Reda Bouadjenek, and Sunil Aryal

2025

@misc{missmecha,
  title = {MissMecha: A Flexible Python Toolkit for Missing Data Mechanisms},
  author = {Zhou, Youran and Bouadjenek, Mohamed Reda and Aryal, Sunil},
  year = {2025},
  url = {https://echoid.github.io/MissMecha/},
}

2024

ECML'24

Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?

Youran Zhou, Mohamed Reda Bouadjenek, and Sunil Aryal

In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, 2024

Missing data poses a significant challenge in real-world data analysis, prompting the development of various imputation methods. However, existing literature often overlooks two critical limitations. Firstly, many methods assume a Missing Completely At Random (MCAR) mechanism, which is relatively easy to handle but may not reflect real-world scenarios where data is often missing due to underlying mechanisms (issues/problems) that are often unknown. This type of missing data is categorized as Missing At Random (MAR) and Missing Not At Random (MNAR). Secondly, the effectiveness of these methods is assessed primarily in terms of imputation accuracy using metrics such as Root Mean Square Error (RMSE), ignoring the practical utility of imputed data in downstream tasks. In this study, we comprehensively compare a broad spectrum of missing data imputation techniques, ranging from traditional statistical methods to advanced machine and deep learning approaches. Our evaluation considers their effectiveness in handling various missing mechanisms across different missing parameters. Furthermore, we assess the imputed data's quality not only in terms of RMSE but also its impact on downstream tasks, such as classification, regression, and clustering. Contrary to common assumptions, our findings reveal that complex deep learning-based methods are not guaranteed to outperform simple traditional techniques. Moreover, relying solely on RMSE for evaluation can be misleading. Instead, selecting an imputation method should prioritise its effectiveness in enhancing the performance of learning algorithms in downstream tasks.

@inproceedings{10.1007/978-3-031-70381-2_7,
  author = {Zhou, Youran and Bouadjenek, Mohamed Reda and Aryal, Sunil},
  editor = {Bifet, Albert and Krilavi{\v{c}}ius, Tomas and Miliou, Ioanna and Nowaczyk, Slawomir},
  title = {Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?},
  booktitle = {Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track},
  year = {2024},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  pages = {100--115},
  isbn = {978-3-031-70381-2},
  url = {https://link.springer.com/chapter/10.1007/978-3-031-70381-2_7},
  doi = {10.1007/978-3-031-70381-2_7},
}

ECML'24

Developing robust methods to handle missing data in real-world applications effectively

Youran Zhou, Mohamed Reda Bouadjenek, and Sunil Aryal

2024

This work was presented at the ECML PKDD 2024 PhD Forum. https://ecmlpkdd.org/2024/program-accepted-phd-forum/

@misc{zhou2025developingrobustmethodshandle,
  title = {Developing robust methods to handle missing data in real-world applications effectively},
  author = {Zhou, Youran and Bouadjenek, Mohamed Reda and Aryal, Sunil},
  year = {2024},
  booktitle = {ECML PKDD 2024 PhD Forum},
  url = {https://ecmlpkdd.org/2024/program-accept},
  note = {This work was presented at the ECML PKDD 2024 PhD Forum.
  https://ecmlpkdd.org/2024/program-accepted-phd-forum/}
}

2022

MSc Thesis

Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks

Youran Zhou and Jianzhong Qi

2022

@misc{zhou2025synthesizingtabulardatausing,
  title = {Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks},
  author = {Zhou, Youran and Qi, Jianzhong},
  year = {2022},
  archiveprefix = {arXiv},
}