Degree Type

Dissertation

Date of Award

1989

Degree Name

Doctor of Philosophy

Department

Statistics

First Advisor

Wayne A. Fuller

Abstract

By law, government agencies such as the Census Bureau and the National Center for Health Statistics must disseminate collected data in such a way that a respondent cannot be identified. Deleting identifiers (e.g., name, address) from each data record is a standard technique used by releasing agencies to preserve the confidentiality of each respondent. Though this inhibits potential intruders from directly identifying a respondent, an additional confidentiality concern stems from the presence of non-confidential public use data files in which direct identifiers have not been removed. If statistical techniques can be used to link a public use data record to a released data record, an intruder may be able to identify a respondent's confidential attributes. One method of preventing disclosure when other files are available to the intruder is disguising, or "masking," each data vector in the file.

In this research, we concentrate on the data perturbation technique of masking each data vector by adding a random error vector. After describing the general procedure, we consider the approach an intruder might use in attempting to determine an individual's confidential attributes. It is shown that the conditional expected value of the attributes given the masked data and the public data is the best predictor of the unknown attributes.

We investigate the effect of the covariance structure of the error vectors on the success of the intruder. It is demonstrated that, if the variance of the added error is fixed at a fraction of the variance of the original variables, then the optimal correlation structure of the errors with respect to confidentiality protection is the correlation structure of the original variables.

We present a masking algorithm designed to preserve the moments and univariate distribution functions of masked variables, while providing disclosure protection. The degree of protection is a function of the variance of the added error. A computer program that implements the algorithm is outlined. The procedure is designed so that the covariance structure of the masked data is similar to that of the original data. Results of masking example data files with the computer program are summarized.
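
The additive-error idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's algorithm or program: it only adds normal errors whose covariance is a fixed fraction of the sample covariance of the original variables, so the errors share the correlation structure of the data. The function name `mask_with_additive_noise`, the `fraction` parameter, the normal error distribution, and the simulated data are all assumptions made for illustration. Unlike the algorithm summarized above, this sketch does not preserve the variances or the univariate distribution functions of the masked variables.

```python
import numpy as np


def mask_with_additive_noise(x, fraction, rng):
    """Return x plus zero-mean normal errors with covariance fraction * cov(x).

    Hypothetical helper for illustration only; not the dissertation's method.
    """
    x = np.asarray(x, dtype=float)
    # Sample covariance of the original variables (columns are variables).
    cov = np.cov(x, rowvar=False)
    # Errors have the same correlation structure as the original variables.
    errors = rng.multivariate_normal(
        np.zeros(x.shape[1]), fraction * cov, size=x.shape[0]
    )
    return x + errors


# Simulated example data (an assumption, not data from the dissertation).
rng = np.random.default_rng(0)
original = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000
)
masked = mask_with_additive_noise(original, fraction=0.25, rng=rng)

# The correlation is approximately unchanged, while each variance is
# inflated by roughly a factor of 1 + fraction.
corr_original = np.corrcoef(original, rowvar=False)[0, 1]
corr_masked = np.corrcoef(masked, rowvar=False)[0, 1]
variance_ratio = masked.var(axis=0, ddof=1) / original.var(axis=0, ddof=1)
```

Because the error covariance is proportional to the data covariance, the masked data have covariance of approximately (1 + fraction) times the original, which leaves correlations roughly intact. A larger `fraction` gives more perturbation at the cost of larger variance inflation.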

DOI

https://doi.org/10.31274/rtd-180813-9036

Publisher

Digital Repository @ Iowa State University, http://lib.dr.iastate.edu/

Copyright Owner

Gary R. Sullivan

Language

en

Proquest ID

AAI9014961

File Format

application/pdf

File Size

199 pages
