The use of added error to avoid disclosure in microdata releases

Sullivan, Gary

File

r_9014961.pdf (2.67 MB)

Date

1989

Authors

Sullivan, Gary

Advisor

Wayne A. Fuller

Department

Statistics

Abstract

By law, government agencies like the Census Bureau and the National Center for Health Statistics must disseminate collected data in such a way that a respondent cannot be identified. Deleting identifiers (e.g., name, address) from each data record is a standard technique practiced by releasing agencies to preserve the confidentiality of each respondent. Though this inhibits potential intruders from directly identifying a respondent, an additional confidentiality concern stems from the presence of non-confidential public use data files in which direct identifiers have not been removed. If statistical techniques can be used to link a public use data record to a released data record, an intruder may be able to identify a respondent's confidential attributes. One method of preventing disclosure when other files are available to the intruder is disguising, or "masking," each data vector in the file.

In this research, we concentrate on the data perturbation technique of masking each data vector by adding a random error vector. After describing the general procedure, we consider the approach an intruder might use in attempting to determine an individual's confidential attributes. It is shown that the conditional expected value of the attributes, given the masked data and the public data, is the best predictor of the unknown attributes.

We investigate the effect of the covariance structure of the error vectors on the success of the intruder. It is demonstrated that, if the variance of the added error is fixed at a fraction of the variance of the original variables, then the optimal correlation structure of the errors with respect to confidentiality protection is the correlation structure of the original variables.

We present a masking algorithm designed to preserve the moments and univariate distribution functions of masked variables while providing disclosure protection. The degree of protection is a function of the variance of the added error. A computer program that implements the algorithm is outlined. The procedure is designed so that the covariance structure of the masked data is similar to that of the original data. Results of masking example data files with the computer program are summarized.
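
The basic idea of additive-error masking described in the abstract can be illustrated with a short sketch. This is not the thesis's algorithm or program (which is also designed to preserve moments and univariate distribution functions); it only shows, under simplifying assumptions, what happens when each record receives a normal error vector whose covariance is a fixed fraction c of the sample covariance of the original variables. The function names, the two-variable example file, and the value c = 0.25 are all hypothetical choices made for illustration.

```python
import math
import random


def sample_covariance(rows):
    """Return (mean vector, sample covariance matrix) for equal-length rows."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mean, cov


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    p = len(a)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def mask(rows, c, rng):
    """Add a normal error vector with covariance c * S to every row.

    S is the sample covariance of the original rows, so the errors share
    the correlation structure of the original variables (an assumption of
    this sketch, motivated by the result stated in the abstract).
    """
    _, cov = sample_covariance(rows)
    p = len(cov)
    L = cholesky([[c * cov[i][j] for j in range(p)] for i in range(p)])
    masked = []
    for r in rows:
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        e = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        masked.append([r[i] + e[i] for i in range(p)])
    return masked


def correlation(cov, i, j):
    """Correlation between variables i and j from a covariance matrix."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


rng = random.Random(0)

# Hypothetical two-variable data file in which y depends on x.
original = []
for _ in range(5000):
    x = rng.gauss(50.0, 10.0)
    original.append([x, 0.8 * x + rng.gauss(0.0, 6.0)])

masked = mask(original, c=0.25, rng=rng)
_, s_orig = sample_covariance(original)
_, s_mask = sample_covariance(masked)

print("variance ratio (masked / original):", s_mask[0][0] / s_orig[0][0])
print("correlation, original:", correlation(s_orig, 0, 1))
print("correlation, masked:  ", correlation(s_mask, 0, 1))
```

In this sketch the variances of the masked variables are inflated by a factor of roughly 1 + c, while the correlation between the variables stays close to its original value, because the error covariance is proportional to the covariance of the data. A larger c gives more perturbation per record at the cost of more variance inflation.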

Copyright

1989

Collections

Theses and Dissertations
