Data clustering using proximity matrices with missing values

Thumbnail Image
Date
2019-07-15
Authors
Karimzadeh, Samira
Olafsson, Sigurdur
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Authors
Person
Olafsson, Sigurdur
Associate Professor
Research Projects
Organizational Units
Organizational Unit
Industrial and Manufacturing Systems Engineering
The Department of Industrial and Manufacturing Systems Engineering teaches the design, analysis, and improvement of the systems and processes in manufacturing, consulting, and service industries by application of the principles of engineering. The Department of General Engineering was formed in 1929. In 1956 its name changed to Department of Industrial Engineering. In 1989 its name changed to the Department of Industrial and Manufacturing Systems Engineering.
Journal Issue
Is Version Of
Versions
Series
Department
Industrial and Manufacturing Systems Engineering
Abstract

In most applications of data clustering the input data includes vectors describing the location of each data point, from which distances between data points can be calculated and a proximity matrix constructed. In some applications, however, the only available input is the proximity matrix, that is, the distances between each pair of data point. Several clustering algorithms can still be applied, but if the proximity matrix has missing values no standard method is directly applicable. Imputation can be done to replace missing values, but most imputation methods do not apply when only the proximity matrix is available. As a partial solution to fill this gap, we propose the Proximity Matrix Completion (PMC) algorithm. This algorithm assumes that data is missing due to one of two reasons: complete dissimilarity or incomplete observations; and imputes values accordingly. To determine which case applies the data is modeled as a graph and a set of maximum cliques in the graph is found. Overlap between cliques then determines the case and hence the method of imputation for each missing data point. This approach is motivated by an application in plant breeding, where what is needed is to cluster new experimental seed varieties into sets of varieties that interact similarly to the environment, and this application is presented as a case study in the paper. The applicability, limitations and performance of the new algorithm versus other methods of imputation are further studied by applying it to datasets derived from three well-known test datasets.

Comments

This is a manuscript of an article published as Karimzadeh, Samira, and Sigurdur Olafsson. "Data Clustering using Proximity Matrices with Missing Values." Expert Systems with Applications 126 (2019): 265-276. DOI: 10.1016/j.eswa.2019.02.022. Posted with permission.

Description
Keywords
Citation
DOI
Copyright
Tue Jan 01 00:00:00 UTC 2019
Collections