A Hierarchical Clustering and Validity Index for Mixed Data

Yang, Rui

File

Yang_iastate_0097E_12533.pdf (1.51 MB)

Date

2012-01-01

Authors

Yang, Rui

Advisor

Sigurdur Olafsson

Organizational Unit

Industrial and Manufacturing Systems Engineering

The Department of Industrial and Manufacturing Systems Engineering teaches the design, analysis, and improvement of the systems and processes in manufacturing, consulting, and service industries by application of the principles of engineering. The Department of General Engineering was formed in 1929. In 1956 its name changed to Department of Industrial Engineering. In 1989 its name changed to the Department of Industrial and Manufacturing Systems Engineering.

Abstract

This study develops novel approaches for partitioning mixed data, that is, datasets containing both numeric and nominal attributes, into natural groups. Such data arise in many diverse applications. Our approach addresses two important issues in clustering mixed datasets. The first is how to find the optimal number of clusters, which matters because this number is unknown in many applications. The second is how to group the objects "naturally" according to a suitable similarity measure. Both problems are especially difficult for mixed datasets because they require unifying the two different representation schemes used for numeric and nominal data.

To address the issue of constructing clusters, that is, of grouping objects naturally, we compare the performance of four distances capable of handling mixed datasets when each is incorporated into a classical agglomerative hierarchical clustering approach. Based on these results, we conclude that the so-called co-occurrence distance performs well as a dissimilarity measure: it obtains good clustering results with reasonable computation, thus balancing effectiveness and efficiency.
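As a rough illustration of plugging a mixed-data distance into agglomerative hierarchical clustering, the sketch below uses average linkage with a simple Gower-style dissimilarity (range-scaled difference for numeric attributes, 0/1 mismatch for nominal ones). This stand-in is an assumption made for illustration only; it is not the co-occurrence distance evaluated in the thesis, whose definition is not given in this abstract. The function names and the toy data are likewise hypothetical.

```python
from itertools import combinations


def mixed_distance(a, b, numeric_idx, ranges):
    """Gower-style dissimilarity (an illustrative stand-in, not the thesis's
    co-occurrence distance): range-scaled absolute difference for numeric
    attributes, simple 0/1 mismatch for nominal attributes."""
    total = 0.0
    for k, (x, y) in enumerate(zip(a, b)):
        if k in numeric_idx:
            total += abs(x - y) / ranges[k] if ranges[k] else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def agglomerate(rows, numeric_idx, n_clusters):
    """Average-linkage agglomerative clustering.

    Returns a sorted list of clusters, each a sorted list of row indices."""
    ranges = {k: max(r[k] for r in rows) - min(r[k] for r in rows)
              for k in numeric_idx}
    # Precompute all pairwise dissimilarities once.
    dist = {(i, j): mixed_distance(rows[i], rows[j], numeric_idx, ranges)
            for i, j in combinations(range(len(rows)), 2)}
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        # Average pairwise dissimilarity between two clusters.
        pair_sum = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return pair_sum / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (p, q)] + [merged]
    return sorted(clusters)


# Hypothetical toy data: one numeric attribute and one nominal attribute.
rows = [(1.0, "red"), (1.2, "red"), (0.9, "red"),
        (5.0, "blue"), (5.3, "blue"), (4.8, "blue")]
result = agglomerate(rows, numeric_idx={0}, n_clusters=2)
print(result)
```

Stopping the merge loop at successive values of `n_clusters` yields the sequence of nested partitions that a validity index would then score.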

The second important contribution of this research is an entropy-based validity index for validating the sequence of partitions generated by hierarchical clustering with the co-occurrence distance. A cluster validity index called the BK index is modified for mixed data and used in conjunction with the proposed clustering algorithm. This index is compared to three well-known indices, namely the Calinski-Harabasz index (CH), the Dunn index (DU), and the Silhouette index (SI). The results show that the modified BK index outperforms the other three indices in its ability to identify the true number of clusters.
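To make the idea of validating a sequence of partitions concrete, the sketch below scores several candidate partitions with the Silhouette index, one of the comparison indices named above, and picks the number of clusters with the highest score. It is only a minimal sketch: the modified BK index itself is not defined in this abstract and is not reproduced here, and the dissimilarity matrix and candidate labelings are hypothetical toy values.

```python
def silhouette(dist, labels):
    """Mean silhouette width for a labeling, given a symmetric
    dissimilarity matrix (list of lists)."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    if len(groups) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in groups.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)


# Hypothetical dissimilarities built from six points on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]

# Hypothetical sequence of partitions, keyed by number of clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
    4: [0, 0, 1, 2, 2, 3],
}
scores = {k: silhouette(dist, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 2 for this toy data
```

Because the function takes a precomputed dissimilarity matrix, the same loop works with any mixed-data distance; only the matrix changes.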

Finally, the study identifies the limitation of hierarchical clustering with a co-occurrence distance and provides some remedies that improve not only the clustering accuracy but, especially, the ability to correctly identify the best number of classes in mixed datasets.

Copyright

2012

Collections

Theses and Dissertations