L. Friedel

20.08.2018, 4 p.m. c.t., TU Berlin, EN building, seminar room EN 719 (7th floor), Einsteinufer 17, 10587 Berlin: "Bringing Order to Data" (Prof. Jarek Szlichta, University of Ontario Institute of Technology, Canada)

Poor data quality is a barrier to effective, high-quality decision making based on data. Declarative data cleaning encodes data semantics as constraints (rules), and errors arise when the data violates the constraints. Unified approaches that repair errors in both data and constraints have been proposed. However, both data-only and unified approaches are by and large static: they apply cleaning to a single snapshot of the data and constraints. We have proposed a continuous data cleaning framework that can be applied to dynamic data. Our approach permits both the data and its semantics to evolve and suggests repairs based on evidence accumulated as statistics. We built a machine learning classifier that predicts the types of repairs needed to resolve an inconsistency and learns from past user repair preferences to recommend more accurate repairs in the future. We also propose a quantitative approach to data cleaning that excels at ensuring that the repaired data has desired statistical properties.
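
To make the notion of a constraint violation concrete, the following minimal sketch detects violations of a single functional dependency on a small table. It is only an illustration of declarative error detection, not the continuous cleaning framework from the talk; the function name `fd_violations`, the attribute names, and the sample rows are assumptions made up for this example.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Each such group violates the functional dependency lhs -> rhs.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)
    return {
        key: grp
        for key, grp in groups.items()
        # More than one distinct right-hand side value means a violation.
        if len({tuple(r[a] for a in rhs) for r in grp}) > 1
    }


# Hypothetical sample data: the rule zip -> city should hold.
rows = [
    {"zip": "10587", "city": "Berlin"},
    {"zip": "10587", "city": "Berlim"},  # inconsistent spelling
    {"zip": "80331", "city": "Munich"},
]

violations = fd_violations(rows, ["zip"], ["city"])
for key, grp in violations.items():
    print(key, [r["city"] for r in grp])
```

A detected group can be resolved in either direction: by repairing the data (fixing the misspelled city) or by repairing the constraint (relaxing the rule), which is the choice the unified approaches mentioned above have to make.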

Integrity constraints (ICs) are useful for query optimization and for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human error, and may be excessively time-consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs) and, recently, order dependencies (ODs). We present a new OD discovery algorithm enabled by a novel polynomial mapping to a canonical form of ODs, and a sound and complete set of axioms for canonical ODs. We show orders-of-magnitude performance improvements over the prior state of the art.
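
As a small illustration of what an order dependency states, the sketch below checks whether a given list-based OD holds on a table: whenever one row precedes another in the lexicographic order of the left-hand attributes, it must also precede it in the order of the right-hand attributes. This is a brute-force validity check over all row pairs, not the discovery algorithm presented in the talk; the function name `holds_od`, the attribute names, and the sample rows are assumptions made up for this example.

```python
from itertools import permutations


def holds_od(rows, xs, ys):
    """Check the order dependency xs -> ys by comparing all ordered row pairs.

    The OD holds if, for every pair (s, t), s <= t on `xs` implies
    s <= t on `ys`, using lexicographic comparison of attribute lists.
    """

    def proj(row, attrs):
        return tuple(row[a] for a in attrs)

    return all(
        proj(s, ys) <= proj(t, ys)
        for s, t in permutations(rows, 2)
        if proj(s, xs) <= proj(t, xs)
    )


# Hypothetical sample data: tax grows with salary, bonus does not.
rows = [
    {"salary": 40000, "tax": 6000, "bonus": 500},
    {"salary": 55000, "tax": 9000, "bonus": 300},
    {"salary": 70000, "tax": 14000, "bonus": 900},
]

print(holds_od(rows, ["salary"], ["tax"]))
print(holds_od(rows, ["salary"], ["bonus"]))
```

The check is quadratic in the number of rows and tests only one candidate dependency. Discovery has to search over many candidate attribute lists, which is why an efficient algorithm matters on large datasets.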