Date of Award

December 2018

Document Type


Degree Name

Doctor of Philosophy (PhD)


Genetics and Biochemistry

Committee Member

Frank A Feltus

Committee Member

Julia Frugoli

Committee Member

Stephen Kresovich

Committee Member

Feng Luo

Committee Member

Hong Luo


Biomarkers can be described as molecular signatures that are associated with a trait or disease. RNA expression data facilitates discovery of biomarkers underlying complex phenotypes because it can capture dynamic biochemical processes that are regulated in tissue-specific and time-specific manners. Gene Coexpression Network (GCN) analysis is a method that utilizes RNA expression data to identify binary gene relationships across experimental conditions. Using a novel GCN construction algorithm, Knowledge Independent Network Construction (KINC), I provide evidence for novel polygenic biomarkers in both plant and animal use cases.

Kidney cancer is comprised of several distinct subtypes that demonstrate unique histological and molecular signatures. Using KINC, I have identified gene correlations that are specific to clear cell renal cell carcinoma (ccRCC), the most common form of kidney cancer. ccRCC is associated with two common mutation profiles that respond differently to targeted therapy. By identifying GCN edges that are specific to patients with each of these two mutation profiles, I discovered unique genes with similar biological function, suggesting a role for T cell exhaustion in the development of ccRCC.

Medicago truncatula is a legume that is capable of atmospheric nitrogen fixation through a symbiotic relationship between plant and rhizobium that results in root nodulation. This process is governed by complex gene expression patterns that are dynamically regulated across tissues over the course of rhizobial infection. Using de novo RNA sequencing data generated from the root maturation zone at five distinct time points, I identified hundreds of genes that were differentially expressed between control and inoculated plants at specific time points. To discover genes that were co-regulated during this experiment, I constructed a GCN using the KINC software. By combining GCN clustering analysis with differentially expressed genes, I present evidence for novel root nodulation biomarkers. These biomarkers suggest that temporal regulation of pathogen response related genes is an important process in nodulation.

Large-scale GCN analysis requires computational resources and stable data-processing pipelines. Supercomputers such as Clemson University’s Palmetto Cluster provide data storage and processing resources that enable terabyte-scale experiments. However, with the wealth of public sequencing data available for mining, petabyte-scale experiments are required to provide novel insights across the tree of life. I discuss computational challenges that I have discovered with large scale RNA expression data mining, and present two workflows, OSG-GEM and OSG-KINC, that enable researchers to access geographically distributed computing resources to handle petabyte-scale experiments.



To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.