TU Berlin

Database Systems and Information Management Group: Publications



Semi-Supervised Data Cleaning with Raha and Baran
Citation key MahdaviZ21
Author Mohammad Mahdavi, Ziawash Abedjan
Year 2021
Journal CIDR
Note accepted
Abstract Data cleaning is a tedious data preparation task that typically requires user supervision in the form of predefined configurations, such as rules, parameters, or patterns. We have recently developed two configuration-free systems, Raha and Baran, to detect and correct data errors in a semi-supervised manner. In this paper, we demonstrate how both systems can be used within an end-to-end data cleaning pipeline. Our demonstration shows how user supervision can be reduced to a negligible number of example corrections using effective feature representation, label propagation, and transfer learning methods. Although the two cleaning steps, detection and correction, face substantially different challenges, we have designed the corresponding systems based on the same intuition. Both systems internally leverage an automatically generatable set of base detectors and correctors and learn to combine them using a few user labels. In practice, with as few as 20 user-annotated tuples, it is possible to effectively identify and fix data quality problems inside a dataset. Furthermore, both systems benefit from knowledge of prior cleaning tasks. Using transfer learning, both systems can optimize the data cleaning task at hand in terms of error detection runtime and error correction effectiveness.
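
The intuition described in the abstract (run many simple base detectors, represent each cell by their outputs, and spread a few user labels to similar cells) can be illustrated with a toy sketch. This is not the Raha or Baran implementation: the three detectors, the function names, and the rule of grouping cells by identical feature vectors are all assumptions made for illustration only.

```python
# Toy sketch only: hypothetical base detectors and a simplified form of
# label propagation. None of these names come from Raha or Baran.

def is_missing(value, column):
    """Flag empty cells and common placeholder strings."""
    return int(value.strip().lower() in {"", "n/a", "null"})

def is_rare(value, column):
    """Flag values that occur once in a column that has repeated values."""
    return int(column.count(value) == 1 and len(set(column)) < len(column))

def has_non_digit(value, column):
    """Flag cells that are not purely numeric (assumes a numeric column)."""
    return int(not value.strip().isdigit())

DETECTORS = [is_missing, is_rare, has_non_digit]

def featurize(column):
    """Represent each cell by the tuple of base-detector outputs."""
    return [tuple(d(v, column) for d in DETECTORS) for v in column]

def propagate_labels(features, user_labels):
    """Copy each user label to every cell with the same feature vector.

    user_labels maps a cell index to True (error) or False (clean).
    Cells whose feature vector was never labeled get None.
    If two labels disagree for one feature vector, the later one wins.
    """
    label_by_vector = {features[i]: label for i, label in user_labels.items()}
    return [label_by_vector.get(f) for f in features]

ages = ["34", "27", "n/a", "41", "abc", "27", ""]
features = featurize(ages)
# The user labels three cells: "34" is clean, "n/a" and "abc" are errors.
predictions = propagate_labels(features, {0: False, 2: True, 4: True})
```

In this sketch the empty cell at the end is marked as an error without being labeled, because it shares a feature vector with the labeled "n/a" cell. The two "27" cells stay undecided (None), since no labeled cell has the same feature vector.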

