TU Berlin

Publications

BlockJoin: Efficient Matrix Partitioning Through Joins.
Citation key KunftKSRM17
Author Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Tilmann Rabl, Volker Markl
Year 2017
Journal Proceedings of the VLDB Endowment, Vol. 10, No. 13, 2017 (to be presented at VLDB 2018)
Abstract Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of data flow engines like Hadoop, Spark, or Flink. These data flow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel data flow engines. Our experimental evaluation shows speedups of up to 6x and the skew resistance of BlockJoin compared to state-of-the-art pipelines implemented in Spark.
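
The difference between the row-partitioned and the block-partitioned representation mentioned in the abstract can be illustrated with a small single-process sketch. It shows only the baseline the abstract describes (join first, then re-partition the row-wise result into blocks). It is not the BlockJoin algorithm, and the names `join_rows`, `to_blocks`, and `block_size` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def join_rows(left, right):
    """Equi-join two lists of (key, feature_list) pairs, row by row.

    Assumes keys in `right` are unique; returns one concatenated
    feature row per matching key, in the order of `left`.
    """
    index = {key: features for key, features in right}
    return [feats + index[key] for key, feats in left if key in index]


def to_blocks(rows, block_size):
    """Re-partition a row-wise matrix into square blocks.

    Each cell (i, j) is assigned to block (i // block_size, j // block_size)
    and stored under its offset inside that block. In a distributed
    engine this regrouping is the step that would require a shuffle.
    """
    blocks = defaultdict(dict)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            block_id = (i // block_size, j // block_size)
            offset = (i % block_size, j % block_size)
            blocks[block_id][offset] = value
    return dict(blocks)


if __name__ == "__main__":
    left = [(1, [1.0, 2.0]), (2, [3.0, 4.0]), (3, [5.0, 6.0])]
    right = [(1, [10.0]), (2, [20.0])]
    joined = join_rows(left, right)
    print(to_blocks(joined, block_size=2))
```

In this sketch every cell of the joined result is touched a second time to assign it to a block. That second pass corresponds to the re-partitioning cost that, according to the abstract, BlockJoin avoids by producing block-partitioned output directly.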