Degree Type

Dissertation

Date of Award

2017

Degree Name

Doctor of Philosophy

Department

Statistics

Major

Statistics

First Advisor

Ranjan Maitra

Abstract

k-means clustering is the most common clustering technique for homogeneous data sets. In this thesis we introduce several contributions to problems related to k-means. First, we develop a modification of the k-means algorithm to efficiently partition massive data sets in a semi-supervised framework, i.e., when partial information is available. Our algorithms are designed to also work in cases where not all of the groups have representatives in the supervised part of the data set, as well as when the total number of groups is not known in advance. We provide strategies for initializing our algorithm and for determining the number of clusters. Second, we develop a methodology to model the distribution function of the difference in residuals for a K-groups model against a K'-groups model (K' > K), in order to assess whether a model with more groups fits better. This leads us to estimate the distribution of a sum of random variables. We provide two possible approaches here: the first relies on the theory of non-parametric kernel estimation, and the second is an approximate approach that uses the normal approximation for this tail probability. Finally, we introduce a new merging tool that does not require any distributional assumption. To achieve this, we compute the normed residuals for each cluster realization. These residuals form a sample from a non-negative distribution; using asymmetric kernel estimation, we estimate the misclassification probability. We further extend this non-parametric estimation to merge clusters.
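The semi-supervised setting described in the abstract can be illustrated with a minimal sketch. This is not the algorithm developed in the thesis: it is a generic Lloyd-style k-means in which points with a known group keep that group, and it assumes the total number of clusters `k` is given. The function name `semi_supervised_kmeans`, its parameters, and the convention that `-1` marks an unlabeled point are all hypothetical choices made for illustration.

```python
import numpy as np

def semi_supervised_kmeans(X, labels, k, n_iter=50, seed=0):
    """Lloyd-style k-means in which labeled points keep their known group.

    Illustrative sketch only, not the thesis method.
    X      : (n, d) array of observations
    labels : (n,) int array; known group id in 0..k-1, or -1 if unlabeled
    k      : total number of clusters (assumed known in this sketch)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    known = labels >= 0

    # Initialize centers: the mean of the labeled points where a group has
    # representatives, otherwise a randomly chosen (preferably unlabeled) point.
    unlabeled_idx = np.flatnonzero(~known)
    pool = unlabeled_idx if len(unlabeled_idx) > 0 else np.arange(len(X))
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            centers[j] = members.mean(axis=0)
        else:
            centers[j] = X[rng.choice(pool)]

    assign = labels.copy()
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance,
        # then restore the known group of every labeled point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        new_assign[known] = labels[known]
        if np.array_equal(new_assign, assign):
            break  # assignments are stable
        assign = new_assign
        # Update step: recompute each center from its current members;
        # an empty cluster keeps its previous center.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assign, centers
```

Groups without any labeled representative are handled here only by random initialization, and the number of clusters is not estimated; the abstract states that the thesis provides its own strategies for both.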

DOI

https://doi.org/10.31274/etd-180810-5901

Copyright Owner

Israel A. Almodovar-Rivera

Language

en

File Format

application/pdf

File Size

84 pages
