Date of Award

8-2019

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Mathematical Sciences

Committee Member

William Bridges, Committee Chair

Committee Member

Alexander Herzog

Committee Member

Jun Luo

Committee Member

Christopher McMahan

Committee Member

Ilya Safro

Abstract

Topic modeling is widely used to extract the latent structure (topics) of a collection (corpus) of documents. One popular method is Latent Dirichlet Allocation (LDA), which assumes a Bayesian generative model with multinomial distributions over topics within documents and over vocabulary within topics. The result of an LDA fit (i.e., the number and types of topics found in the corpus) depends on tuning parameters. Several methods, ad hoc or heuristic, have been proposed and analyzed for selecting these parameters, but all of them were developed using one or more real corpora. With real corpora, the true number and types of topics are unknown, and it is difficult to assess how well the data follow the assumptions of LDA. To address this issue, we developed a factorial simulation design that creates corpora with known structure, varying four factors: 1) the number of topics, 2) the proportions of topics in documents, 3) the size of the vocabulary in topics, and 4) the proportion of the vocabulary that is contained in documents. The results suggest that the quality of the LDA fit depends on the document-topic distribution and that fitting performs best when document lengths are at least four times the vocabulary size. We also propose a pre-processing method that may increase the quality of the LDA result in some of the worst-case scenarios from the factorial simulation study.
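
As a rough illustration of what "corpora with known structure" can mean, the sketch below draws a synthetic corpus from the standard LDA generative process and enumerates a small full factorial grid. It is not the dissertation's simulation code. The function names (`dirichlet`, `simulate_corpus`), the factor levels, the default concentration values, and the mapping of the four factors onto these particular parameters are all assumptions made for illustration.

```python
import itertools
import random


def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_corpus(n_docs, n_topics, vocab_size, doc_length,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate a corpus from the standard LDA generative process.

    Returns (docs, doc_topic, topic_word). `docs` is a list of lists of
    word ids; the other two are the known ground-truth distributions.
    The defaults for `alpha` and `beta` are hypothetical.
    """
    rng = random.Random(seed)
    vocab = range(vocab_size)
    topics = range(n_topics)
    # One word distribution per topic.
    topic_word = [dirichlet([beta] * vocab_size, rng) for _ in topics]
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # Topic proportions for this document.
        theta = dirichlet([alpha] * n_topics, rng)
        doc_topic.append(theta)
        words = []
        for _ in range(doc_length):
            z = rng.choices(topics, weights=theta)[0]         # sample a topic
            w = rng.choices(vocab, weights=topic_word[z])[0]  # sample a word
            words.append(w)
        docs.append(words)
    return docs, doc_topic, topic_word


# Hypothetical factor levels (illustrative, not the dissertation's values).
design = list(itertools.product(
    [2, 5],      # number of topics
    [0.1, 1.0],  # concentration of document-topic proportions (assumed proxy)
    [20, 50],    # vocabulary size
    [1, 4],      # document length as a multiple of vocabulary size (assumed proxy)
))
```

Each tuple in `design` is one cell of the grid. A cell `(k, a, v, m)` could be simulated with `simulate_corpus(n_docs, k, v, m * v, alpha=a)`, after which the fitted topics can be compared against the returned `topic_word` and `doc_topic` values, since these are known by construction.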

Recommended Citation

Feng, Haotian, "Performance of Latent Dirichlet Allocation with Different Topic and Document Structures" (2019). All Dissertations. 2448.
https://tigerprints.clemson.edu/all_dissertations/2448
