BayesWipe: A scalable probabilistic framework for improving data quality
Document Type
Article
Publication Date
10-1-2016
Abstract
Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
Identifier
84994571337 (Scopus)
Publication Title
Journal of Data and Information Quality
External Full Text Location
https://doi.org/10.1145/2992787
e-ISSN
1936-1963
ISSN
1936-1955
Issue
1
Volume
8
Grant
1322406
Fund Ref
National Science Foundation
Recommended Citation
De, Sushovan; Hu, Yuheng; Meduri, Venkata Vamsikrishna; Chen, Yi; and Kambhampati, Subbarao, "BayesWipe: A scalable probabilistic framework for improving data quality" (2016). Faculty Publications. 10253.
https://digitalcommons.njit.edu/fac_pubs/10253
