TU Berlin

Database Systems and Information Management Group: Publications


Expand your Training Limits! Generating Training Data for ML-based Data Management
Citation key VenturaKJM21
Authors Francesco Ventura, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, Volker Markl
Year 2021
Journal SIGMOD
Note to be published
Abstract Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers, which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) widely limits further advancement in research and compromises the technology transfer from research to industry. Collecting a labeled query workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic queries/jobs. In this work, we face the problem of generating training data for data management components tailored to users’ needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white-box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state of the art both in query generation and label estimation using synthetic and real datasets. It has up to 9× better labeling performance, in terms of R² score. More importantly, it allows users to reduce the cost of getting labeled query workloads by 54× (and up to an estimated factor of 10⁴×) compared to standard approaches.