Date of Award
Doctor of Philosophy (PhD)
School of Mathematical and Statistical Sciences
Missing data is common in real-world studies and can compromise statistical inference. Discarding cases with missing values, or replacing the missing values using inappropriate imputation techniques, can result in biased estimates. However, many imputation techniques rely on assumptions that are hard to assess in practice, so the most appropriate imputation technique is often unclear. This dissertation consists of two major projects on missing data imputation. The first project compares missing data techniques across data sets with different missingness characteristics. We develop a factorial simulation design to measure the impact of certain data set characteristics on the validity of several popular imputation techniques. The factors in the study are the missing data mechanism and the missing data percentage. The simulation output is evaluated in terms of the bias of the imputation estimates relative to the population parameters, the deviation of the imputation estimates from the full (or complete) data estimates, and the confidence interval coverage and width for the parameters of interest. Simulation results suggest that advanced missing data techniques such as multiple imputation (MI) and multivariate imputation by chained equations (MICE) are superior to conventional missing data techniques such as listwise deletion (LD) and arithmetic mean imputation (AMI). ANOVA results show that the missing data mechanism, the missing data percentage, and the missing data method significantly affect the quality of the parameter estimates after imputation. The second project focuses on the analysis of missing multilevel data, especially with binary variables. MI of data with both fixed and random effects (multilevel data) is an important area of recent statistical research. Joint modeling (JM) and fully conditional specification (FCS) are two common MI methods.
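As a rough illustration of the first project's evaluation criteria (bias and confidence interval coverage), the following minimal sketch compares LD and AMI for the mean of a single variable. It is not the dissertation's simulation code: the data-generating model, the sample size, the number of replications, the way MAR missingness is induced, and the function name `simulate_mean_estimates` are all assumptions chosen for brevity, and MI and MICE are omitted.

```python
import numpy as np

def simulate_mean_estimates(mechanism, miss_rate, n=500, reps=300, seed=0):
    """Monte Carlo bias and 95% CI coverage for the mean of y under
    listwise deletion (LD) and arithmetic mean imputation (AMI).
    Toy model (assumed): y = 0.6 * x + noise, so the true mean of y is 0."""
    rng = np.random.default_rng(seed)
    true_mean = 0.0
    est = {"LD": [], "AMI": []}
    hit = {"LD": [], "AMI": []}
    for _ in range(reps):
        x = rng.normal(size=n)                        # always observed
        y = 0.6 * x + rng.normal(scale=0.8, size=n)   # subject to missingness
        if mechanism == "MCAR":
            p_miss = np.full(n, miss_rate)            # unrelated to the data
        elif mechanism == "MAR":
            # Depends only on the observed x; needs miss_rate <= 0.5 so that
            # the overall missing rate still averages to miss_rate.
            p_miss = np.where(x > 0, 2 * miss_rate, 0.0)
        else:
            raise ValueError("mechanism must be 'MCAR' or 'MAR'")
        observed = rng.random(n) >= p_miss
        y_obs = y[observed]
        candidates = {
            "LD": y_obs,                                    # drop incomplete cases
            "AMI": np.where(observed, y, y_obs.mean()),     # fill with observed mean
        }
        for name, values in candidates.items():
            m = values.mean()
            # Naive standard error that treats imputed values as real data.
            se = values.std(ddof=1) / np.sqrt(values.size)
            est[name].append(m)
            hit[name].append(abs(m - true_mean) <= 1.96 * se)
    return {
        name: {
            "bias": float(np.mean(est[name]) - true_mean),
            "coverage": float(np.mean(hit[name])),
        }
        for name in est
    }

if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR"):
        print(mechanism, simulate_mean_estimates(mechanism, miss_rate=0.3))
```

In this toy setting AMI reproduces the LD point estimate but reports a narrower interval, so its coverage falls below that of LD, and both estimates become biased once missingness depends on `x`.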
We develop a factorial simulation design to measure the impact of certain data set characteristics on the validity of several of these imputation techniques. The factors in the study are the missing data percentage, the intraclass correlation (ICC), the number of clusters, and the number of observations per cluster. The simulation output is evaluated using the average values of the pooled parameter estimates and the corresponding Monte Carlo standard errors. Simulation results suggest that it is essential to include the clustering structure in the imputation model for multilevel data, especially when the ICC is large and there is a large amount of missing data. Careful investigation is also needed when the data involve a small within-cluster sample size and binary variables. Finally, we discuss a real data example to illustrate the applicability of missing data imputation methods for multilevel data.
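To make the design factors of the second project concrete, the sketch below generates balanced two-level data with a chosen ICC and recovers the ICC with the standard one-way ANOVA estimator. It is an illustration under stated assumptions, not the dissertation's code: the outcome is continuous rather than binary, the random-intercept model has unit total variance, the factor levels in the grid are hypothetical, and the helper names `simulate_clustered` and `anova_icc` are invented here. The missing data percentage factor and the imputation step are left out.

```python
import itertools
import numpy as np

def simulate_clustered(n_clusters, cluster_size, icc, rng):
    """Random-intercept data y_ij = u_j + e_ij with total variance 1,
    so that Var(u_j) = icc and Var(e_ij) = 1 - icc."""
    u = rng.normal(scale=np.sqrt(icc), size=n_clusters)
    e = rng.normal(scale=np.sqrt(1 - icc), size=(n_clusters, cluster_size))
    return u[:, None] + e          # shape: (n_clusters, cluster_size)

def anova_icc(y):
    """One-way ANOVA estimator of the ICC for balanced clusters."""
    k, m = y.shape
    cluster_means = y.mean(axis=1)
    msb = m * np.sum((cluster_means - y.mean()) ** 2) / (k - 1)       # between
    msw = np.sum((y - cluster_means[:, None]) ** 2) / (k * (m - 1))   # within
    return (msb - msw) / (msb + (m - 1) * msw)

if __name__ == "__main__":
    rng = np.random.default_rng(2021)
    # Hypothetical factor levels for a small factorial grid.
    grid = itertools.product([20, 100], [5, 30], [0.1, 0.5])
    for n_clusters, cluster_size, icc in grid:
        y = simulate_clustered(n_clusters, cluster_size, icc, rng)
        print(n_clusters, cluster_size, icc, round(float(anova_icc(y)), 3))
```

With few clusters or small clusters the estimated ICC varies noticeably around its target, which is one reason the number of clusters and the within-cluster sample size are natural factors in such a design.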
Yang, Tiantian, "Comparison of Missing Data Imputation Techniques and Analysis with Multilevel Data" (2021). All Dissertations. 2876.