Date of Award
Doctor of Philosophy (PhD)
William Bridges, Committee Chair
Topic modeling is widely used to extract the latent structures (topics) in a collection (corpus) of documents. One popular method is Latent Dirichlet Allocation (LDA). LDA assumes a Bayesian generative model with multinomial distributions over topics within documents and over vocabulary words within topics. The LDA result (i.e., the number and types of topics found in the corpus) depends on tuning parameters. Several methods, ad hoc or heuristic, have been proposed and analyzed for selecting these parameters, but all of them were developed using one or more real corpora. With real corpora, the true number and types of topics are unknown, and it is difficult to assess how well the data follow the assumptions of LDA. To address this issue, we developed a factorial simulation design to create corpora with known structure that varied on four factors: 1) the number of topics, 2) the proportions of topics in documents, 3) the size of the vocabulary in topics, and 4) the proportion of the vocabulary that is contained in documents. Results suggest that the quality of the LDA fit depends on the document-topic distribution and that the fit is best when document lengths are at least four times the vocabulary size. We also propose a pre-processing method that may be used to increase the quality of the LDA result in some of the worst-case scenarios from the factorial simulation study.
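To make the idea of a corpus with known structure concrete, the following is a minimal sketch of the standard LDA generative process in Python with NumPy. It is not the dissertation's simulation design. The function name `simulate_corpus`, the symmetric Dirichlet parameters `alpha` and `beta`, and all numeric settings are assumptions chosen for illustration only; the one value taken from the abstract is the document length of four times the vocabulary size.

```python
import numpy as np


def simulate_corpus(n_docs, doc_length, n_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Draw a document-term count matrix from the standard LDA generative model.

    Hypothetical helper for illustration; alpha and beta are symmetric
    Dirichlet concentration parameters (assumed values, not from the source).
    Returns the counts together with the true theta and phi, so a fitted
    model can be compared against the known structure.
    """
    rng = np.random.default_rng(seed)

    # phi[k] is the word distribution of topic k (each row sums to 1).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta[d] is the topic distribution of document d (each row sums to 1).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    counts = np.zeros((n_docs, vocab_size), dtype=np.int64)
    for d in range(n_docs):
        # Split the document's tokens among topics according to theta[d].
        tokens_per_topic = rng.multinomial(doc_length, theta[d])
        for k, n_k in enumerate(tokens_per_topic):
            # Draw the words for topic k's share of the tokens from phi[k].
            counts[d] += rng.multinomial(n_k, phi[k])
    return counts, theta, phi


# Example settings (assumed): document length is four times the vocabulary size.
vocab_size = 50
counts, theta, phi = simulate_corpus(
    n_docs=20, doc_length=4 * vocab_size, n_topics=3, vocab_size=vocab_size
)
```

Because `theta` and `phi` are returned alongside `counts`, the output of any LDA implementation fitted to `counts` can be checked against the true topic structure, which is the advantage of simulated over real corpora that the abstract describes.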
Feng, Haotian, "Performance of Latent Dirichlet Allocation with Different Topic and Document Structures" (2019). All Dissertations. 2448.