
Martin Pagel

Vision Paper “From Cleaning Before ML to Cleaning for ML” Accepted for Publication in the IEEE Data Engineering Bulletin

The vision paper “From Cleaning Before ML to Cleaning for ML,” authored by Felix Neutatz, Binger Chen, Ziawasch Abedjan, and Eugene Wu, will be published in the upcoming March 2021 issue of the IEEE Data Engineering Bulletin (Special Issue on Data Validation for Machine Learning Models and Applications).

The Bulletin of the Technical Committee on Data Engineering is published quarterly and is distributed to all TC members. Its scope includes the design, implementation, modeling, theory and application of database systems and their technology.

Data cleaning is widely regarded as a critical piece of machine learning (ML) applications, as data errors can corrupt models in ways that cause the application to operate incorrectly, unfairly, or dangerously. Traditional data cleaning focuses on quality issues of a dataset in isolation from the application using the data (Cleaning Before ML), which can be inefficient and, counterintuitively, degrade the application further. While recent cleaning approaches take into account signals from the ML model, such as the model accuracy, they are still local to a specific model and do not take into account the entire application’s semantics and user goals. What is needed is an end-to-end, application-driven approach towards Cleaning For ML that can leverage signals throughout the entire ML application to optimize the cleaning for application goals and to reduce manual cleaning efforts. This paper briefly reviews recent progress in Cleaning For ML, presents our vision of a holistic cleaning framework, and outlines new challenges that arise when data cleaning meets ML applications.

> Preprint “From Cleaning Before ML to Cleaning for ML” (PDF)