
Martin Pagel

Industrial Paper “ExDRa: Exploratory Data Science on Federated Raw Data” Accepted for Presentation at SIGMOD 2021

The industrial paper “ExDRa: Exploratory Data Science on Federated Raw Data,” authored by Sebastian Baunsgaard, Matthias Boehm, Ankit Chaudhary, Behrouz Derakhshan, Stefan Geißelsöder, Philipp Marian Grulich, Michael Hildebrand, Kevin Innerebner, Volker Markl, Claus Neubauer, Sarah Osterburg, Olga Ovcharenko, Sergey Redyuk, Tobias Rieger, Alireza Rezaei Mahdiraji, Sebastian Benjamin Wrede, and Steffen Zeuch, has been accepted for presentation at the ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS 2021), June 20-25, 2021, Xi'an, Shaanxi, China. The publication is joint work by researchers at TU Berlin, DFKI GmbH, Graz University of Technology, and Siemens AG.

Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable, or even feasible (e.g., due to privacy and/or data ownership). The ExDRa system aims to provide system infrastructure for this exploratory data science process on federated, heterogeneous, raw data sources. Technical focus areas include (1) ad-hoc and federated data integration on raw data, (2) data organization and reuse of intermediates, and (3) optimization of the data science lifecycle, under awareness of partially accessible data. In this paper, we describe use cases, the system architecture, selected features of SystemDS' new federated backend, and promising results. Beyond existing work on federated learning, ExDRa focuses on enterprise federated ML and related data pre-processing challenges because, in this context, federated ML has the potential to create a more fine-grained spectrum of data ownership and thus new markets.

To learn more about SIGMOD/PODS, please visit