Date of Award
Doctor of Philosophy
Electrical and Computer Engineering
Computer Engineering( Software Systems)
Clustering is a fundamental data mining method for understanding and interpreting data. People apply clustering as an individual data analysis task or as one component in the machine learning pipelines. The goal of clustering is to partition the input objects into groups, which are named as clusters, such that for objects within the same cluster are similar, whereas objects in different clusters are not.
Many previous works used clustering as batch jobs, which assume all the data are ready in hand before doing the clustering. However, in many cases, the input data is not available all at once, but arrives as a continuous, possibly even unending sequence, which we call it as “data stream”. It is not applicable to directly apply the batch algorithms on the data streams, because once new data arrives, we need to recompute the clustering result from the beginning all the data received so far, which is time consuming.
In this thesis, we design efficient clustering algorithms on streaming data, while maintaining clustering accuracy. The efficiency of algorithm that we considered includes lower time or space cost. Specifically, we study two fundamental clustering models in this thesis: distance-based and density-based.
Among many distance-based clustering algorithms, k-means is the most widely used one and its streaming version is called “streaming k-means” algorithm. Previous works mainly focused on the aspects of single pass, incrementally update and low space cost. We point out the previous works do not perform well on the aspect of query speed to return the clustering results, we are thus motivated to design algorithms to answer queries fast, while also maintaining the desirable properties of high clustering accuracy and small space requirement.Our proposed clustering algorithms systematically reuse the “coresets” (summaries of data) computed for recent queries in answering the current clustering query, a novel technique which we refer to as “coreset caching”. We also present a new algorithm called OnlineCC that integrates the coreset caching idea with a simple sequential streaming k-means algorithm. In practice, OnlineCC algorithm can provide constant query time. We present both theoretical analysis and detailed experiments demonstrating the correctness, accuracy and efficiency of all our proposed clustering algorithms.
In terms of the density-based clustering methods, DBSCAN is the most representative one but works on static data. We consider DBSCAN clustering on streaming data with time decay, where each data point arrives sequentially and recently arrived points are considered more important than the older ones. In this scenario, clusters can change due to both new point insertion and time decay. The challenge is to design algorithm to efficiently maintain clusters on such dynamic data. We present an incremental DBSCAN algorithm which works for general time decay, that is, for any decay model. In addition, we study two specific time decay models-exponential and polynomial decay, which are widely used in practice. The expression of decay function is explicitly given and for these two models, we design algorithms to achieve faster runtime compared to the algorithm for general decay model.
Our incremental algorithm achieves better clustering quality compared to the previous works. We present detailed empirical evaluation to validate the accuracy and runtime efficiency of our proposed algorithms.
Zhang, Yu, "Towards efficient clustering on streaming data" (2021). Graduate Theses and Dissertations. 18645.