Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)


School of Mathematical and Statistical Sciences

Committee Member

Yu-Bo Wang

Committee Member

Patrick Gerard

Committee Member

Qiong Zhang


Causal inference is one of the most significant and well-researched topics in the analysis of observational studies. It addresses the challenge of estimating the relationship between the treatment of interest and the outcome variable (i.e., the treatment effect) in the presence of background covariates. This study aims to determine how causal inference in observational studies can be extended to data sets in which post-treatment variables exist, cluster-level covariates exist, or the number of observations in the treatment groups is highly unbalanced. More specifically, the study comprises the following two sub-topics.

1. The first part of this study focuses on estimating the treatment effect when both a post-treatment variable and cluster-level variables exist in the data. In the previous literature, one popular method, principal stratification, properly handles a post-treatment variable and yields an unbiased estimate of the treatment effect. The traditional way to account for the cluster-level effect is to include the cluster labels as random effects in the model. When either the size of the clusters is large or each cluster contains only a small number of observations, we find that this method often results in poor estimates of the treatment effect.

2. The second part of this study focuses on estimating the treatment effect when the number of observations is highly unbalanced between the two treatment groups (i.e., the treatment or control is a rare event) and background covariates and cluster effects exist in the data. Propensity score analysis is the traditional method for adjusting for covariate imbalance between the treated and control groups. The logit or probit link is often used in the propensity score model to balance the background covariates between the treatment groups. When there is severe imbalance between the treatments, we find that this method often results in poor estimates of the treatment effect.

To tackle these issues, we propose comprehensive Bayesian frameworks for estimating the treatment effect in the presence of post-treatment variables, clustering, and/or imbalance.

For the first problem, we develop a comprehensive framework that handles a post-treatment variable together with a clustered structure. The proposed framework models the clustering structure as random effects with a spike-and-slab prior in a Bayesian hierarchical model. The key idea is to estimate the causal treatment effect with a more parsimonious model and thereby also reduce the computational complexity. This is especially useful when a large number of clusters have no significant influence on the outcome. Several data-generating scenarios (including combinations of clustering structure and post-treatment variable) are considered in simulation studies. The simulation results suggest that the proposed methodology generates the most consistent estimates. The advantages of the proposed methodology are also demonstrated using two case studies, one on educational performance and one on infant birth weight.

For the second problem, we propose a two-step Bayesian framework. The first step is to estimate the propensity score using a proposed generalized skewed link function. The generalized link function is adapted from a skewed link in which the skewness parameter follows a Dirac-spike prior and the error term has a mixture structure. The Dirac-spike prior, one of the three commonly used sparsity-inducing priors, also allows us to determine whether skewness is needed. The second step is to estimate the treatment effect, treating the propensity score as an additional latent variable that adjusts for covariate imbalance in the outcome analysis. The proposed framework includes the clusters as indicator variables in hierarchical models. The normal mixture of inverse gamma (NMIG) prior, one type of spike-and-slab prior, is used to allow many of the cluster effects to be non-significant. The framework can determine the true underlying relation between the background covariates and the binary response with the lowest misspecification rate.

The results of the simulations and of the case studies on real data show the advantages of the proposed methods. One distinct advantage is that both approaches can yield more parsimonious models. Another is that the Bayesian framework can use computationally efficient Markov chain Monte Carlo (MCMC) sampling algorithms by making use of data augmentation and rewriting techniques, such as the Pólya–Gamma technique for binary regression in the second problem.
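As an illustration of the NMIG spike-and-slab prior mentioned in the abstract, the following minimal sketch draws cluster effects from one common parameterization of that prior. Each effect is normal with variance r(delta) * psi, where psi is inverse gamma, and the indicator delta selects either a tiny variance multiplier (spike) or a multiplier of 1 (slab). The function name and the hyperparameter values (nu, Q, r, omega) are assumptions chosen for illustration. The abstract does not state the dissertation's exact specification, so this should not be read as the author's implementation.

```python
import numpy as np

def sample_nmig_prior(n_clusters, nu=5.0, Q=4.0, r=0.001, omega=0.5, seed=0):
    """Draw cluster effects from an NMIG spike-and-slab prior (illustrative).

    All hyperparameter defaults are hypothetical, not taken from the dissertation.
    """
    rng = np.random.default_rng(seed)
    # delta = True puts the effect in the slab, False in the spike.
    delta = rng.random(n_clusters) < omega
    # psi ~ InverseGamma(shape=nu, scale=Q), drawn as the reciprocal of a gamma variate.
    psi = 1.0 / rng.gamma(shape=nu, scale=1.0 / Q, size=n_clusters)
    # The spike shrinks the variance by the small factor r; the slab leaves it unchanged.
    variance = np.where(delta, 1.0, r) * psi
    effects = rng.normal(0.0, np.sqrt(variance))
    return effects, delta

effects, delta = sample_nmig_prior(5000)
print("median |effect| in spike:", np.median(np.abs(effects[~delta])))
print("median |effect| in slab: ", np.median(np.abs(effects[delta])))
```

Effects assigned to the spike are concentrated near zero, which is how a prior of this kind lets many cluster effects be treated as negligible while leaving the remaining ones essentially unshrunk.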


