Date of Award

8-2016

Document Type

Thesis

Degree Name

Master of Science (MS)

Legacy Department

Computer Science

Committee Member

Dr. Amy Apon, Committee Chair

Committee Member

Dr. Brian Malloy

Committee Member

Dr. Paul Wilson

Abstract

Dynamic Topic Models (DTM) are a way to extract time-variant information from a collection of documents. The only available implementation of this is slow, taking days to process a corpus of 533,588 documents. In order to see how topics - both their key words and their proportional size in all documents - change over time, we analyze Clustered Latent Dirichlet Allocation (CLDA) as an alternative to DTM. This algorithm is based on existing parallel components, using Latent Dirichlet Allocation (LDA) to extract topics at local times, and k-means clustering to combine topics from different time periods. This method is two orders of magnitude faster than DTM, and allows for more freedom of experiment design. Results show that most topics generated by this algorithm are similar to those generated by DTM at both the local and global level using the Jaccard index and Sørensen-Dice coefficient, and that this method's perplexity compares favorably to DTM. We also explore tradeoffs in CLDA method parameters.
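
As a rough illustration of the two similarity measures named above, the following Python sketch compares two topics represented by their top key words. The word lists and function names are hypothetical and are not taken from the thesis; the abstract does not state how many key words per topic were compared or how topics were paired, so this only shows the standard set-based definitions.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical top-word lists for one CLDA topic and one DTM topic
# (invented for illustration, not results from the thesis).
clda_topic = ["network", "protocol", "wireless", "routing", "sensor"]
dtm_topic = ["network", "wireless", "sensor", "node", "energy"]

print(f"Jaccard:        {jaccard(clda_topic, dtm_topic):.3f}")
print(f"Sørensen-Dice:  {sorensen_dice(clda_topic, dtm_topic):.3f}")
```

Both measures range from 0 (no shared key words) to 1 (identical key-word sets). For the same pair of sets the Sørensen-Dice value is never lower than the Jaccard value, so the two should be read on their own scales rather than compared with each other.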
