
Martin Pagel

The Demo Paper "Semi-Supervised Data Cleaning with Raha and Baran" was Accepted for Presentation at CIDR 2021

"Semi-Supervised Data Cleaning with Raha and Baran". Mohammad Mahdavi, Ziawash Abedjan. To be presented at the 11th Annual Conference on Innovative Data Systems Research (CIDR ’21), Virtual Event, January 11-15, 2021.

Data cleaning is a tedious data preparation task, which typically needs user supervision in the form of predefined configurations, such as rules, parameters, or patterns. We have recently developed two configuration-free systems, Raha and Baran, to detect and correct data errors in a semi-supervised manner. In this paper, we demonstrate how both systems can be used within an end-to-end data cleaning pipeline. Our demonstration shows how user supervision can be reduced to a negligible number of example corrections using effective feature representation, label propagation, and transfer learning methods. While the two cleaning steps, detection and correction, face substantially different challenges, we have designed the corresponding systems based on the same intuition. Both systems internally leverage an automatically generatable set of base detectors and correctors and learn to combine them using a few user labels. In practice, with only 20 user-annotated tuples, it is possible to effectively identify and fix data quality problems inside a dataset. Furthermore, both systems benefit from knowledge from prior cleaning tasks. Using transfer learning, both systems can optimize the data cleaning task at hand in terms of error detection runtime and error correction effectiveness.
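To make the intuition in the abstract more concrete, the following is a minimal Python sketch of how outputs of several base detectors can serve as a feature vector per cell, and how a handful of user labels can be propagated to cells with the same feature vector. It is an illustration only, not the implementation of Raha or Baran: the three detectors, the function names (`featurize`, `propagate_labels`), and the toy `cities` column are all assumptions made for this example, and propagation across identical feature vectors is a simplification of the label propagation mentioned above.

```python
import re

# Hypothetical base detectors (assumptions for this sketch, not Raha's detectors).
# Each one returns 1 if it considers the cell suspicious, else 0.
def is_empty(value, column):
    return int(value.strip() == "")

def has_unusual_chars(value, column):
    # Anything besides letters, digits, space, dot, or hyphen counts as unusual.
    return int(re.search(r"[^A-Za-z0-9 .\-]", value) is not None)

def is_rare_in_column(value, column):
    return int(column.count(value) == 1)

BASE_DETECTORS = [is_empty, has_unusual_chars, is_rare_in_column]

def featurize(column):
    """Represent each cell as the tuple of all base-detector outputs."""
    return [tuple(d(v, column) for d in BASE_DETECTORS) for v in column]

def propagate_labels(features, user_labels):
    """Copy each user label (1 = error, 0 = clean) to every cell that has
    the same feature vector; cells with an unseen vector stay None."""
    label_of_vector = {features[i]: label for i, label in user_labels.items()}
    return [label_of_vector.get(f) for f in features]

# Toy column with two noisy values and one missing value.
cities = ["Berlin", "Berlin", "Paris", "Paris", "B3rl!n", "", "Paris", "R0me?"]

# The user labels only two cells: index 0 is clean, index 4 is an error.
user_labels = {0: 0, 4: 1}

flags = propagate_labels(featurize(cities), user_labels)
print(flags)
```

In this toy run, two labels are enough to cover seven of the eight cells. The empty cell has a feature vector that no labeled cell shares, so it stays `None`, which signals that one more user label would be needed for that group.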

A preprint version is available here.

To learn more about CIDR 2021, please visit