Degree Type

Dissertation

Date of Award

2014

Degree Name

Doctor of Philosophy

Department

Statistics

First Advisor

Dianne Cook

Abstract

Statistical graphics play an important role in exploratory data analysis, model checking, and diagnosis. Recent developments suggest that visual inference helps to quantify the significance of findings made from graphics. In visual inference, a lineup embeds the plot of the data among a set of null plots, and a human observer is asked to select the plot that is most different from the rest. Selecting the data plot corresponds to rejecting the null hypothesis. With high-dimensional data, statistical graphics are obtained by plotting low-dimensional projections; for example, in classification tasks projection pursuit is used to find low-dimensional projections that reveal differences between labelled groups. In many contemporary data sets the number of observations is small relative to the number of variables, which is known as the high dimension, low sample size (HDLSS) problem. The research described in this thesis explores the use of visual inference for understanding low-dimensional pictures of HDLSS data. This approach may help to broaden the understanding of issues related to HDLSS data in the data analysis community. Methods are illustrated using data from a published paper that erroneously reported real separation in microarray data. The thesis also describes metrics developed to assist the use of lineups for making inferential statements. These metrics measure the quality of a lineup and help to understand what people see in the data plots. The null plots represent a finite sample from a null distribution, and the particular sample drawn can make a lineup easier or more difficult. Distance metrics are designed to describe how close the true data plot is to the null plots, and how close the null plots are to each other. The distribution of the distance metrics is studied to learn how well it matches what people detect in the plots, and to assess the effect of the null-generating mechanism and of plot choices for particular tasks. The analysis used data collected from Amazon Mechanical Turk studies in which lineups were applied to an array of exploratory data analysis tasks. Finally, an R package was constructed to provide open source tools for visual inference and distance metrics.
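The lineup protocol summarized in the abstract can be sketched in a few lines. This is a minimal illustration, not the thesis's R package: the function names `make_lineup` and `visual_p_value`, the lineup size `m = 20`, and the use of label permutation as the null-generating mechanism are all assumptions chosen for the example. The p-value uses the simplest model, in which each observer independently picks the data plot with probability 1/m under the null hypothesis.

```python
import random
from math import comb


def make_lineup(values, labels, m=20, seed=0):
    """Build m panels; one holds the real labels, the rest hold permuted labels.

    Hypothetical helper: permuting group labels is one possible
    null-generating mechanism, assumed here for illustration.
    Returns (panels, true_pos), where true_pos indexes the real data.
    """
    rng = random.Random(seed)
    true_pos = rng.randrange(m)  # hide the data plot at a random position
    panels = []
    for i in range(m):
        panel_labels = list(labels)
        if i != true_pos:
            # Shuffling labels breaks any association between group and value
            rng.shuffle(panel_labels)
        panels.append((list(values), panel_labels))
    return panels, true_pos


def visual_p_value(picks, n_observers, m=20):
    """P(X >= picks) for X ~ Binomial(n_observers, 1/m).

    Assumes independent observers who each choose the data plot
    with probability 1/m when the null hypothesis is true.
    """
    p = 1.0 / m
    return sum(
        comb(n_observers, k) * p**k * (1 - p) ** (n_observers - k)
        for k in range(picks, n_observers + 1)
    )


if __name__ == "__main__":
    panels, true_pos = make_lineup([1.2, 0.7, 3.1, 2.9], [0, 0, 1, 1])
    print(len(panels), true_pos)
    print(visual_p_value(picks=3, n_observers=10))
```

With a single observer and 20 panels, a correct pick gives a p-value of 1/20 = 0.05 under this model, which is why 20 is a common lineup size.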

Copyright Owner

Niladri Roy Chowdhury

Language

en

File Format

application/pdf

File Size

127 pages
