22.03.2021
Martin Pagel

Paper "Expand your Training Limits! Generating and Labeling Jobs for ML-based Data Management" Accepted for Presentation at SIGMOD 2021

The research paper "Expand your Training Limits! Generating and Labeling Jobs for ML-based Data Management," authored by Francesco Ventura, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, and Volker Markl, was accepted for presentation at the ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS 2021), to be held June 20-25, 2021, in Xi'an, Shaanxi, China.

Abstract:
Machine Learning (ML) is quickly becoming a core tool in many data management components, including query optimizers which have recently shown very promising results. However, the low availability of large query workloads with labels, e.g., query runtime, widely limits further advancement in research and compromises the technology transfer from research to industry. On the one hand, collecting a labeled workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic jobs to extract labels. On the other hand, a reliable training workload has to take into account users' use cases, i.e., pre-existing workload, input data, and available computational resources.
In this work, we face the unsolved problem of generating labeled workloads tailored to users' use cases. We present an innovative labeled workload augmentation framework that tackles all the above-mentioned complexities. We follow a data-driven white-box generation process to learn from pre-existing small workload patterns, from input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component.
We show that our framework outperforms the current state-of-the-art both in job generation and label estimation using synthetic and real datasets. It has up to 9x better labeling performance, in terms of R² score. More importantly, it allows users to reduce the cost of getting labeled query workloads by a 54x factor (and up to an estimated 104x factor) compared to standard approaches.

To learn more about SIGMOD/PODS, please visit https://2021.sigmod.org/.