Date of Award
Doctor of Philosophy
In the unsupervised learning setting, where data labels are not available and few constraints are put on data structure before analysis, having a robust procedure is paramount for any method tasked with analyzing that data. This dissertation presents three distinct papers, each of which provides a mechanic for extending the use cases of the K-means algorithm.
Paper 1. Clustering is a difficult problem that is further challenged in higher dimensions where some of the information can be redundant. Such redundancy can be in the form of dimensions that have group information that is already present in other variables, or are simply irrelevant and contribute no useful information with regard to clustering. The K-means algorithm is arguably the most widely used clustering tool, but its performance is degraded by the presence of redundant dimensions. We provide a formal approach to identifying and removing these redundant features and demonstrate improved performance, as well as interpretability of the derived groupings. Our methodology also simultaneously estimates the number of groups while selecting dimensions informative for clustering. We evaluate performance on datasets simulated under many complexities and conditions as well as on a set of handwritten digits, and is used to identify different running styles among participants in a 100 km ultra-marathon race.
Paper 2. The K-means algorithm is extended to allow for partitioning of skewed groups. Our algorithm is called TiK-Means and contributes a K-means type algorithm that assigns observations to groups while estimating their skewness-transformation parameters. The resulting groups and transformation reveal general-structured clusters that can be explained by inverting the estimated transformation. Further, a modification of the jump statistic chooses the number of groups. Our algorithm is evaluated on simulated and real- life datasets and then applied to a long-standing astronomical dispute regarding the distinct kinds of gamma ray bursts.
Paper 3. This paper presents a method for processing handwritten documents and clustering components of the writing into groups based on structural attributes. The obtained cluster membership information is used to develop a statistical model for writer identification. The presented clustering algorithm creates a grouping structure for glyphs, which are small pieces of handwriting extracted using the handwriter R package developed by Berry. To facilitate the clustering of glyphs, a distance measure inspired by the graph edit distance and a method for calculating the center of a set of glyphs are both introduced. The clustering algorithm is applied to the MNIST dataset for demonstration and exploratory purposes. Various behaviors of the algorithm are explored using its relatively simple digit glyphs. We also establish a Bayesian hierarchical model for modeling a set of writers based on their propensity for writing glyphs that are assigned to certain clusters. We then perform a full scale writer identification analysis on handwritten documents from 27 writers in the Computer Vision Lab dataset.
Berry, Nicholas S., "Extending K-means" (2019). Graduate Theses and Dissertations. 16973.