Abstract

This thesis focuses on a theoretical understanding of statistical methods and machine learning schemes for Missing Not at Random (MNAR) data. By developing innovative theoretical frameworks and novel algorithms, we seek to improve the accuracy and robustness of data analysis in the presence of MNAR data, which poses significant challenges across various scientific domains.

Traditional methods for handling missing data, such as mean imputation or listwise deletion, are often inadequate, particularly when data are MNAR. Data are MNAR when the probability of missingness depends on the unobserved values themselves, which introduces biases that standard methods cannot correct [1]. This thesis aims to bridge the gap between theoretical statistics and machine learning by developing new methods that handle MNAR data robustly. Its primary objective is to formulate new theoretical models of MNAR mechanisms and to develop machine learning algorithms based on these models. The thesis will start with a comprehensive literature review to assess the current state of MNAR data handling and to identify critical gaps and limitations in existing methods.
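The bias described above can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the standard normal data, the logistic self-masking mechanism, and the function name `simulate_mnar_bias` are hypothetical choices, not part of the proposed methodology.

```python
import numpy as np


def simulate_mnar_bias(n=100_000, seed=0):
    """Compare mean estimates under a self-masking MNAR mechanism."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n)  # complete data, true mean 0

    # Assumed MNAR mechanism: the larger the value, the more likely it is missing.
    p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))
    missing = rng.random(n) < p_missing

    # Listwise deletion: average only the observed values.
    deletion_mean = x[~missing].mean()

    # Mean imputation: fill missing entries with the observed mean.
    imputed = np.where(missing, deletion_mean, x)
    imputation_mean = imputed.mean()

    return x.mean(), deletion_mean, imputation_mean


true_mean, deletion_mean, imputation_mean = simulate_mnar_bias()
print(f"complete-data mean:    {true_mean:.3f}")
print(f"listwise deletion:     {deletion_mean:.3f}")
print(f"mean imputation:       {imputation_mean:.3f}")
```

Because missingness depends on the value itself, both estimates are pulled toward the small values that remain observed. Mean imputation reproduces the deletion estimate exactly, so it does not remove the bias.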

Then, we will focus on developing new probabilistic and Bayesian models designed to capture the complexities of MNAR mechanisms while being robust to the possible presence of outliers.

Methodology

The literature on robust statistical modeling is vast and can be divided into outlier identification and outlier accommodation. In this thesis, we will focus on outlier accommodation, which aims to extract the maximum information from the observed data without flagging the observations that contain outliers. Nevertheless, the proposed approach can also provide useful information for outlier identification.

In this context, Welsh and Richardson [2] used the marginal distribution of the response vectors based on a t distribution, without reference to the hierarchical structure of the model. A similar approach has been considered in [3] within a Bayesian framework. Both approaches rely solely on a multivariate t model. In some practical applications, such a parametric model can be too restrictive: the true probability distribution of the data may fall outside the assumed multivariate t family, in which case the parametric model is said to be misspecified. To avoid model misspecification, which degrades the performance of the learning procedure, one can opt for a more general non-parametric model to characterize the statistical behavior of the collected data. However, non-parametric models typically suffer from the curse of dimensionality and thus require a large amount of data to provide accurate estimates.

To fill this gap, we propose a Complex Elliptically Symmetric (CES) model of the covariates to handle the possible presence of outliers. The main reason is that the CES family encompasses a wide range of distributions, including the Gaussian, generalized Gaussian, compound Gaussian, t-, W-, and K-distributions. These distributions accurately represent spiky measurements and other heavy-tailed observations. We will then consider an expectation-maximization (EM) based algorithm to handle the outliers. In its classical form, EM may not yield a closed-form expression for its surrogate function in this setting. To address this, stochastic approximations of EM can be employed [4]. These consist of successively simulating the random effects due to the outliers from their conditional distribution and updating the unknown parameters of the model. To cope with possible difficulties in sampling from the posterior distribution, one can construct a distribution that converges to the conditional one along the algorithm iterations. We will also consider a strategy to escape local maxima by flattening, or tempering, the conditional likelihood.
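To make the idea of outlier accommodation concrete, here is a minimal sketch of EM for one simple member of the elliptical family: a real-valued multivariate t model with fully observed data and a known number of degrees of freedom. In this special case the E-step is available in closed form, so the sketch does not use the stochastic approximation discussed above. The function name `fit_multivariate_t`, the value `nu=3.0`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def fit_multivariate_t(x, nu=3.0, n_iter=100, tol=1e-8):
    """EM estimates of location and scatter for a multivariate t model.

    x has shape (n, p); nu (degrees of freedom) is assumed known.
    Returns the location, the scatter matrix, and the weights
    computed in the last E-step.
    """
    n, p = x.shape
    mu = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        # E-step: expected latent precision of each sample, a decreasing
        # function of its squared Mahalanobis distance.
        diff = x - mu
        d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
        w = (nu + p) / (nu + d2)
        # M-step: weighted updates of location and scatter.
        mu_new = (w[:, None] * x).sum(axis=0) / w.sum()
        diff = x - mu_new
        sigma = (w[:, None] * diff).T @ diff / n
        converged = np.linalg.norm(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu, sigma, w


# Synthetic example: 500 clean points plus 25 outliers far from the bulk.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 2))
outliers = rng.normal(loc=10.0, scale=1.0, size=(25, 2))
data = np.vstack([clean, outliers])

mu_t, sigma_t, weights = fit_multivariate_t(data)
print("sample mean:            ", data.mean(axis=0))
print("t-model location:       ", mu_t)
print("mean weight of outliers:", weights[500:].mean())
```

No observation is flagged or discarded. Distant points simply receive small weights, which is the accommodation behavior described earlier, and the weights themselves give a by-product indication of which observations look outlying.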

Second, we will study statistical MNAR mechanisms. Unlike data that are Missing Completely at Random (MCAR) or Missing at Random (MAR), MNAR data are missing for reasons related to the unobserved values themselves [5]. This dependency introduces biases that complicate the analysis and interpretation of data [6]. In this thesis, we will develop theoretical and practical approaches to address MNAR data, ranging from probabilistic models to machine learning algorithms. These models will provide a foundation for designing novel imputation techniques and machine learning algorithms that can accurately estimate the missing values and account for the biases introduced by MNAR data. In developing these algorithms, we will explore machine learning approaches such as autoencoders and generative adversarial networks, which have shown potential in handling complex missing data patterns [7]. The research will also integrate classical statistical methods with modern machine learning techniques to create hybrid models that leverage the strengths of both approaches. These hybrid models will be designed to adapt dynamically to the type and extent of missing data, with the aim of ensuring robustness and accuracy in various scenarios [8]. A critical component of this work will be the rigorous theoretical and empirical evaluation of the developed methods. Extensive simulation studies will be conducted to assess the theoretical properties and performance of the new algorithms. Additionally, the methods will be tested on benchmark datasets from various domains to validate their effectiveness and generalizability.
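For reference, the three mechanisms named above are commonly formalized through the conditional distribution of a missingness indicator. The notation below is assumed for illustration and is not taken from the proposal: the data are split as X = (X_obs, X_mis), M is the binary mask indicating which entries are missing, and φ parametrizes the missingness mechanism.

```latex
\begin{align*}
\text{MCAR:}\quad & p(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}; \phi) = p(M; \phi), \\
\text{MAR:}\quad  & p(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}; \phi) = p(M \mid X_{\mathrm{obs}}; \phi), \\
\text{MNAR:}\quad & p(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}; \phi) \text{ depends on } X_{\mathrm{mis}}.
\end{align*}
```

Under MNAR the mechanism cannot be ignored: with θ denoting the parameters of the data model, the observed-data likelihood is the integral of p(X_obs, X_mis; θ) p(M | X_obs, X_mis; φ) over X_mis, so the data model and the missingness model have to be handled jointly.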

The expected outcomes of this thesis include advanced theoretical models for MNAR data with the possible presence of outliers, innovative machine learning algorithms, and hybrid methods that integrate statistical and machine learning approaches. These developments are expected to improve the accuracy and robustness of data analysis in the presence of MNAR data. The project aims to publish its findings in leading scientific journals and present them at international conferences, contributing to the broader scientific community’s understanding of MNAR data handling.

The timeline for the thesis is structured as follows. The initial phase involves the literature review and the development of the theoretical framework, followed by the design and implementation of new algorithms. Subsequent phases consist of integrating statistical and machine learning methods and evaluating the algorithms through simulation studies and benchmark testing. The associated publications and conference presentations will be prepared to disseminate the research. The final phase is dedicated to writing the dissertation.

Conclusion

This thesis will advance the theoretical and practical understanding of how to handle MNAR data in statistics and machine learning when outliers may be present. By developing innovative algorithms and hybrid methods, we will address a critical challenge in robust data analysis, leading to more reliable and accurate insights from corrupted MNAR data.

Zehua

PhD Topic: Statistical Learning for Handling Missing Not at Random Data
