Date of Award
Doctor of Philosophy (PhD)
School of Computing
Streaming data are produced in great velocity and diverse variety. The vision of this research is to build an end-to-end system that handles the collection, curation and analysis of streaming data. The streaming data used in this thesis contain both numeric type data and text type data. First, in the field of data collection, we design and evaluate a data delivery framework that handles the real-time nature of streaming data. In this component, we use streaming data in automotive domain since it is suitable for testing and evaluating our data delivery system. Secondly, in the field of data curation, we use a language model to analyze two online automotive forums as an example for streaming text data curation. Last but not least, we present our approach for automated query expansion on Twitter data as an example of streaming social media data analysis. This thesis provides a holistic view of the end-to-end system we have designed, built and analyzed.
To study the streaming data in automotive domain, a complex and massive amount of data is being collected from on-board sensors of operational connected vehicles (CVs), infrastructure data sources such as roadway sensors and traffic signals, mobile data sources such as cell phones, social media sources such as Twitter, and news and weather data services. Unfortunately, these data create a bottleneck at data centers for processing and retrievals of collected data, and require the deployment of additional message transfer infrastructure between data producers and consumers to support diverse CV applications. The first part of this dissertation, we present a strategy for creating an efficient and low-latency distributed message delivery system for CV systems using a distributed message delivery platform. This strategy enables large-scale ingestion, curation, and transformation of unstructured data (roadway traffic-related and roadway non-traffic-related data) into labeled and customized topics for a large number of subscribers or consumers, such as CVs, mobile devices, and data centers. We evaluate the performance of this strategy by developing a prototype infrastructure using Apache Kafka, an open source message delivery system, and compared its performance with the latency requirements of CV applications. We present experimental results of the message delivery infrastructure on two different distributed computing testbeds at Clemson University. Experiments were performed to measure the latency of the message delivery system for a variety of testing scenarios. These experiments reveal that measured latencies are less than the U.S. Department of Transportation recommended latency requirements for CV applications, which provides evidence that the system is capable for managing CV related data distribution tasks.
Human-generated streaming data are large in volume and noisy in content. Direct acquisition of the full scope of human-generated data is often ineffective. In our research, we try to find an alternative resource to study such data. Common Crawl is a massive multi-petabyte dataset hosted by Amazon. It contains archived HTML web page data from 2008 to date. Common Crawl has been widely used for text mining purposes. Using data extracted from Common Crawl has several advantages over a direct crawl of web data, among which is removing the likelihood of a user's home IP address becoming blacklisted for accessing a given web site too frequently. However, Common Crawl is a data sample, and so questions arise about the quality of Common Crawl as a representative sample of the original data. We perform systematic tests on the similarity of topics estimated from Common Crawl compared to topics estimated from the full data of online forums. Our target is online discussions from a user forum for car enthusiasts, but our research strategy can be applied to other domains and samples to evaluate the representativeness of topic models. We show that topic proportions estimated from Common Crawl are not significantly different than those estimated on the full data. We also show that topics are similar in terms of their word compositions, and not worse than topic similarity estimated under true random sampling, which we simulate through a series of experiments. Our research will be of interest to analysts who wish to use Common Crawl to study topics of interest in user forum data, and analysts applying topic models to other data samples.
Twitter data is another example of high-velocity streaming data. We use it as an example to study the query expansion application in streaming social media data analysis. Query expansion is a problem concerned with gathering more relevant documents from a given set that cover a certain topic. Here in this thesis we outline a number of tools for a query expansion system that will allow its user to gather more relevant documents (in this case, tweets from the Twitter social media system), while discriminating from irrelevant documents. These tools include a method for triggering a given query expansion using a Jaccard similarity threshold between keywords, and a query expansion method using archived news reports to create a vector space of novel keywords. As the nature of streaming data, Twitter stream contains emerging events that are constantly changing and therefore not predictable using static queries. Since keywords used in static query method often mismatch the words used in topics around emerging events. To solve this problem, our proposed approach of automated query expansion detects the emerging events in the first place. Then we combine both local analysis and global analysis methods to generate queries for capturing the emerging topics. Experiment results show that by combining the global analysis and local analysis method, our approach can capture the semantic information in the emerging events with high efficiency.
Du, Yuheng, "Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data" (2019). All Dissertations. 2329.