TU Berlin

Publications

Materialization and Reuse Optimizations for Production Data Science Pipelines, SIGMOD 2022, to Appear
Citation key: DerakhshanMKRM
Author: Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Zoi Kaoudi, Tilmann Rabl, Volker Markl
Year: 2022
Journal: SIGMOD Conference
Note: to appear
Abstract: Many companies and businesses train and deploy machine learning (ML) pipelines to answer prediction queries. In many applications, new training data continuously becomes available. A typical approach to ensure that ML models are up-to-date is to retrain the ML pipelines following a schedule, e.g., every day on the last seven days of data. Several use cases, such as A/B testing and ensemble learning, require many pipelines to be deployed in parallel. Existing solutions train and deploy one pipeline at a time, which generates redundant data processing since pipelines usually share similar operators. Our goal is to eliminate redundant data processing in production data science pipelines using materialization and reuse optimizations. We first categorize the generated artifacts of the pipeline operators into three groups, i.e., computed statistics, transformed data, and trained models. Then, we optimize the execution of the pipelines by materializing and reusing the generated artifacts. Our solution employs a materialization algorithm that, given a storage budget, materializes the subset of artifacts that minimizes the run time of subsequent executions. Furthermore, we offer a reuse algorithm that generates an optimal execution plan by combining the deployed pipelines into a directed acyclic graph (DAG) and reusing the materialized artifacts when appropriate. Our experiments show that our system can reduce the training time by up to an order of magnitude for different deployment scenarios.
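The two ideas named in the abstract, combining pipelines into a DAG and choosing artifacts under a storage budget, can be illustrated with a small sketch. This is not the paper's algorithm: the `Artifact` fields, the prefix-based merging in `merge_pipelines`, and the greedy time-per-MB ranking in `choose_artifacts` are all assumptions made here for illustration, and the greedy rule is a simple heuristic with no optimality guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical description of one operator output (not from the paper)."""
    name: str
    kind: str             # "statistics", "transformed_data", or "model"
    size_mb: float        # storage needed to materialize it (assumed > 0)
    saved_seconds: float  # recomputation time avoided per reuse (assumed known)


def merge_pipelines(pipelines):
    """Combine linear pipelines into one DAG by sharing identical prefixes.

    Each pipeline is a list of operator names. A node is identified by the
    full prefix of operators leading to it, so equal prefixes become one node.
    Returns a dict mapping node -> parent node (None for a root).
    """
    parents = {}
    for ops in pipelines:
        parent = None
        for i in range(1, len(ops) + 1):
            node = tuple(ops[:i])
            parents.setdefault(node, parent)  # shared prefixes are added once
            parent = node
    return parents


def choose_artifacts(artifacts, budget_mb):
    """Greedy heuristic: take artifacts in order of saved time per MB
    for as long as they still fit into the storage budget."""
    ranked = sorted(
        artifacts, key=lambda a: a.saved_seconds / a.size_mb, reverse=True
    )
    chosen, used = [], 0.0
    for artifact in ranked:
        if used + artifact.size_mb <= budget_mb:
            chosen.append(artifact)
            used += artifact.size_mb
    return chosen


if __name__ == "__main__":
    dag = merge_pipelines([["scale", "pca", "svm"], ["scale", "pca", "tree"]])
    print(len(dag), "nodes after merging")

    candidates = [
        Artifact("stats", "statistics", 1.0, 5.0),
        Artifact("features", "transformed_data", 100.0, 50.0),
        Artifact("model", "model", 10.0, 40.0),
    ]
    print([a.name for a in choose_artifacts(candidates, budget_mb=20.0)])
```

In this toy setup, the two pipelines share the `scale` and `pca` steps, so those operators appear only once in the merged graph and would be executed once. The three artifact kinds mirror the categories in the abstract; the sizes and time savings are made-up numbers.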
