Date of Award

12-2012

Document Type

Thesis

Degree Name

Master of Science (MS)

Legacy Department

Computer Engineering

Committee Chair/Advisor

Brooks, Richard R

Committee Member

Wang, Kuang-Ching

Committee Member

Hoover, Adam

Abstract

Data leak prevention (DLP) solutions monitor and control data flow. Current techniques find data that matches user-defined syntactic patterns. Unfortunately, large classes of DLP-relevant data are defined by information semantics rather than data syntax. Syntax refers to data format, whereas semantics refers to data meaning. The class of social security numbers can be adequately expressed using data syntax, whereas a new industrial process can only be adequately described using information semantics. In this paper, we propose methods for extracting and identifying document semantics using training sets of limited size (tens of documents). The first method is based on singular value decomposition, which uses linear algebra to automatically extract semantic features from the documents in the training set. The second method infers a hidden Markov model (HMM) that expresses relations between the features extracted by the singular value method. This HMM can detect documents that contain the semantic information of the intellectual property. The third method views the English language as a probabilistic context-free grammar (PCFG) and extracts semantic information from individual sentences in order to detect documents containing intellectual property. Test results on 5 document sets show that the proposed methods give true positive rates of at least 84% and false positive rates below 22%. Our methods are trained with only positive examples and have lower false positive rates than Latent Dirichlet Allocation (LDA) and Support Vector Machines (SVM).
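To make the first method more concrete, the following is a minimal sketch of how singular value decomposition can pull a dominant "semantic direction" out of a small positive-only training set. It is not the thesis's implementation: the training sentences, the function names (`term_doc_matrix`, `top_left_singular_vectors`, `subspace_score`), the use of raw word counts, and the scoring rule are all assumptions made for illustration. The leading left singular vectors are approximated by power iteration so that the sketch needs only the Python standard library.

```python
import math
import random
from collections import Counter


def term_doc_matrix(docs, vocab):
    """Rows are terms, columns are documents; entries are raw word counts."""
    counts = [Counter(d.lower().split()) for d in docs]
    return [[c[t] for c in counts] for t in vocab]


def top_left_singular_vectors(A, k, iters=200, seed=0):
    """Approximate the k leading left singular vectors of A.

    Runs power iteration on G = A A^T and removes the components along
    previously found vectors (deflation by Gram-Schmidt) at every step.
    """
    rng = random.Random(seed)
    m = len(A)
    G = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(m)]
         for i in range(m)]
    basis = []
    for _ in range(k):
        v = [rng.random() for _ in range(m)]
        converged = True
        for _ in range(iters):
            v = [sum(G[i][j] * v[j] for j in range(m)) for i in range(m)]
            for u in basis:
                dot = sum(x * y for x, y in zip(v, u))
                v = [x - dot * y for x, y in zip(v, u)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-12:  # A has fewer than k nonzero singular values
                converged = False
                break
            v = [x / norm for x in v]
        if not converged:
            break
        basis.append(v)
    return basis


def subspace_score(doc, vocab, basis):
    """Fraction of the document's word-count vector lying in the subspace.

    Words outside the training vocabulary cannot be projected, but they
    still count toward the document's length, which lowers the score.
    """
    c = Counter(doc.lower().split())
    full_norm = math.sqrt(sum(n * n for n in c.values()))
    if full_norm == 0:
        return 0.0
    x = [c[t] for t in vocab]
    proj = [sum(ui * xi for ui, xi in zip(u, x)) for u in basis]
    return math.sqrt(sum(p * p for p in proj)) / full_norm


# Hypothetical positive-only training set (illustrative, not thesis data).
training_docs = [
    "reactor catalyst temperature pressure",
    "catalyst temperature yield reactor",
    "pressure yield catalyst reactor",
]
vocab = sorted({w for d in training_docs for w in d.lower().split()})
A = term_doc_matrix(training_docs, vocab)
basis = top_left_singular_vectors(A, k=1)

related_score = subspace_score("reactor catalyst pressure", vocab, basis)
unrelated_score = subspace_score("quarterly sales meeting yield", vocab, basis)
print(related_score, unrelated_score)
```

In this toy setting the document that reuses the training vocabulary scores higher than the one that shares a single word with it. A real detector would need a decision threshold chosen on held-out data, and the abstract's second method goes further by modeling relations between the extracted features with an HMM, which this sketch does not attempt.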

Recommended Citation

Zhao, Lianyu, "Semantic Similarity Detection in Natural Language Documents" (2012). All Theses. 1526.
https://tigerprints.clemson.edu/all_theses/1526

Included in

Computer Sciences Commons
