The use of added error to avoid disclosure in microdata releases

Date
1989
Authors
Sullivan, Gary
Major Professor
Advisor
Wayne A. Fuller
Organizational Unit
Department of Statistics, Iowa State University
Department
Statistics
Abstract

By law, government agencies such as the Census Bureau and the National Center for Health Statistics must disseminate collected data in such a way that a respondent cannot be identified. Deleting identifiers (e.g., name, address) from each data record is a standard technique practiced by releasing agencies to preserve the confidentiality of each respondent. Though this inhibits potential intruders from directly identifying a respondent, an additional confidentiality concern stems from the presence of non-confidential public use data files in which direct identifiers have not been removed. If statistical techniques can be used to link a public use data record to a released data record, an intruder may be able to identify a respondent's confidential attributes. One method of preventing disclosure when other files are available to the intruder is disguising or "masking" each data vector in the file.

In this research, we concentrate on the data perturbation technique of masking each data vector by adding a random error vector. After describing the general procedure, we consider the approach an intruder might use in attempting to determine an individual's confidential attributes. It is shown that the conditional expected value of the attributes given the masked data and the public data is the best predictor of the unknown attributes.

We investigate the effect of the covariance structure of the error vectors on the success of the intruder. It is demonstrated that, if the variance of the added error is fixed at a fraction of the variance of the original variables, then the optimal correlation structure of the errors with respect to confidentiality protection is the correlation structure of the original variables.

We present a masking algorithm designed to preserve the moments and univariate distribution functions of masked variables, while providing disclosure protection. The degree of protection is a function of the variance of the added error. A computer program that implements the algorithm is outlined. The procedure is designed so that the covariance structure of the masked data is similar to that of the original data. Results of masking example data files with the computer program are summarized.
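
The additive-error idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm or computer program: it does not preserve univariate distribution functions, and it ignores the public file available to the intruder. It assumes jointly normal data, and the function names, the example covariance matrix, and the variance fraction c = 0.25 are all hypothetical choices made for illustration. It shows two things: errors drawn with covariance c times the data covariance leave the correlation structure of the masked file close to the original, and under normality the conditional expectation of the original vector given its masked value shrinks the masked value toward the mean by a factor 1/(1 + c).

```python
import numpy as np


def mask_with_correlated_noise(x, c, rng):
    """Add zero-mean normal error with covariance c * cov(x) to each row of x.

    Hypothetical helper: the error correlation structure matches that of the
    data, with error variance equal to a fraction c of the data variance.
    """
    sigma = np.cov(x, rowvar=False)
    errors = rng.multivariate_normal(
        np.zeros(x.shape[1]), c * sigma, size=x.shape[0]
    )
    return x + errors


def intruder_prediction(z, mu, c):
    """Conditional mean of x given z = x + e, assuming joint normality and
    error covariance c * Sigma: mu + (z - mu) / (1 + c)."""
    return mu + (z - mu) / (1.0 + c)


rng = np.random.default_rng(0)

# Illustrative "confidential" data: two correlated variables (assumed values).
true_cov = np.array([[4.0, 1.5], [1.5, 1.0]])
x = rng.multivariate_normal([10.0, 5.0], true_cov, size=5000)

c = 0.25  # error variance as a fraction of the original variance (assumed)
z = mask_with_correlated_noise(x, c, rng)

# Masked covariance should be roughly (1 + c) times the original covariance,
# so the correlation matrix is approximately unchanged.
cov_ratio = np.cov(z, rowvar=False) / np.cov(x, rowvar=False)
corr_x = np.corrcoef(x, rowvar=False)
corr_z = np.corrcoef(z, rowvar=False)

# Compare an intruder who uses the masked values directly with one who uses
# the conditional-mean predictor (mean estimated from the masked file).
pred = intruder_prediction(z, z.mean(axis=0), c)
mse_raw = np.mean(np.sum((x - z) ** 2, axis=1))
mse_pred = np.mean(np.sum((x - pred) ** 2, axis=1))

print("covariance ratio (masked / original):")
print(cov_ratio)
print("correlation, original vs masked:", corr_x[0, 1], corr_z[0, 1])
print("intruder MSE, raw masked values vs conditional mean:", mse_raw, mse_pred)
```

In this sketch a larger c gives more protection, since the intruder's prediction error grows with the error variance, at the cost of inflating the variances in the released file by the factor 1 + c.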

Copyright
1989