Date of Award: 1989
Doctor of Philosophy
Wayne A. Fuller
By law, government agencies like the Census Bureau and the National Center for Health Statistics must disseminate collected data in such a way that a respondent cannot be identified. Deleting identifiers (e.g., name, address) from each data record is a standard technique used by releasing agencies to preserve the confidentiality of each respondent. Though this inhibits potential intruders from directly identifying a respondent, an additional confidentiality concern stems from the presence of non-confidential public use data files from which direct identifiers have not been removed. If statistical techniques can be used to link a public use data record to a released data record, an intruder may be able to identify a respondent's confidential attributes. One method of preventing disclosure when other files are available to the intruder is disguising, or "masking," each data vector in the file.
In this research, we concentrate on the data perturbation technique of masking each data vector by adding a random error vector. After describing the general procedure, we consider the approach an intruder might use in attempting to determine an individual's confidential attributes. It is shown that the conditional expected value of the attributes given the masked data and the public data is the best predictor of the unknown attributes.
We investigate the effect of the covariance structure of the error vectors on the success of the intruder. It is demonstrated that, if the variance of the added error is fixed at a fraction of the variance of the original variables, then the optimal correlation structure of the errors with respect to confidentiality protection is the correlation structure of the original variables.
We present a masking algorithm designed to preserve the moments and univariate distribution functions of masked variables, while providing disclosure protection. The degree of protection is a function of the variance of the added error.
A computer program that implements the algorithm is outlined. The procedure is designed so that the covariance structure of the masked data is similar to that of the original data. Results of masking example data files with the computer program are summarized.
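The basic idea of masking by added error can be illustrated with a short sketch. This is not the algorithm or the computer program described in the dissertation; it only shows the simplest case, in which each record receives a zero-mean normal error vector whose covariance is a fixed fraction of the sample covariance of the original data, so that the errors share the correlation structure of the original variables. The function name `mask_with_added_error`, the parameter `error_fraction`, the choice of normal errors, and the simulated two-variable data set are all assumptions made for illustration.

```python
import numpy as np


def mask_with_added_error(data, error_fraction, rng):
    """Add zero-mean normal error with covariance error_fraction * cov(data).

    Hypothetical helper for illustration only. Because the error covariance is
    proportional to the sample covariance of the data, the errors have the same
    correlation structure as the original variables.
    """
    data = np.asarray(data, dtype=float)
    sample_cov = np.cov(data, rowvar=False)  # rows are records, columns are variables
    errors = rng.multivariate_normal(
        mean=np.zeros(data.shape[1]),
        cov=error_fraction * sample_cov,
        size=data.shape[0],
    )
    return data + errors


# Simulated "original" microdata (an assumption; not data from the dissertation).
rng = np.random.default_rng(0)
true_cov = np.array([[4.0, 1.2], [1.2, 1.0]])
original = rng.multivariate_normal([10.0, 5.0], true_cov, size=20000)

# Mask with error variance equal to 25% of the original variance.
masked = mask_with_added_error(original, error_fraction=0.25, rng=rng)

# Compare the second-moment structure before and after masking.
corr_original = np.corrcoef(original, rowvar=False)
corr_masked = np.corrcoef(masked, rowvar=False)
variance_ratio = masked.var(axis=0, ddof=1) / original.var(axis=0, ddof=1)
```

In this sketch the masked variances are inflated by a factor of roughly `1 + error_fraction`, while the correlations between variables stay close to those of the original data. A larger `error_fraction` gives more protection at the cost of more distortion, which mirrors the statement that the degree of protection is a function of the variance of the added error.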
Digital Repository @ Iowa State University, http://lib.dr.iastate.edu/
Gary R. Sullivan
Sullivan, Gary R., "The use of added error to avoid disclosure in microdata releases" (1989). Retrospective Theses and Dissertations. 9186.