Campus Units

Statistics, Genetics, Development and Cell Biology

Document Type

Article

Publication Version

Submitted Manuscript

Publication Date

2020

Journal or Book Title

arXiv

Abstract

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but can not be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel, and all appropriate for k-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best intializer, a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.

Comments

This is a pre-print of the article Dorman, Karin S., and Ranjan Maitra. "An Efficient k-modes Algorithm for Clustering Categorical Datasets." arXiv preprint arXiv:2006.03936 (2020). Posted with permission.

Copyright Owner

The Author(s)

Language

en

File Format

application/pdf

Published Version

Share

COinS