TU Berlin

Database Systems and Information Management Group: Publications

Distributed Machine Learning - but at what COST?
Citation key: BodenRM2017
Authors: Christoph Boden, Tilmann Rabl, Volker Markl
Year: 2017
Venue: ML Systems Workshop @ NIPS 2017
Abstract: Training machine learning models at scale is a popular workload for distributed data flow systems such as Apache Spark. However, as these systems were originally built to fulfill quite different requirements, it remains an open question how effectively they actually perform for ML workloads. In this paper we argue that benchmarking of large-scale ML systems should consider state-of-the-art single-machine libraries as baselines, and we sketch such a benchmark for distributed data flow systems. We present an experimental evaluation of a representative problem for XGBoost, LightGBM, and Vowpal Wabbit and compare them to Apache Spark MLlib with respect to both runtime and prediction quality. Our results indicate that while current-generation data flow systems are able to scale robustly with increasing data set size, they are surprisingly inefficient at training machine learning models and need substantial resources to come within reach of the performance of single-machine libraries.
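The benchmark idea the abstract describes, recording both wall-clock runtime and prediction quality of a single-machine baseline so distributed systems can be compared against it, can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual harness: the plain-SGD logistic-regression trainer below is an invented stand-in for a library such as XGBoost or Vowpal Wabbit, and the synthetic data is made up for the example.

```python
# Illustrative skeleton of a COST-style benchmark: train on one machine,
# measure runtime and prediction quality, and use both numbers as the
# baseline a distributed system must beat. The trainer is a hypothetical
# stand-in for a real single-machine library; data is synthetic.
import math
import random
import time

def train_logreg_sgd(data, epochs=5, lr=0.1):
    """Plain SGD logistic regression on (features, label) pairs."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log loss w.r.t. z
            for i in range(dim):
                w[i] -= lr * g * x[i]
    return w

def accuracy(w, data):
    """Fraction of examples classified correctly at threshold 0."""
    correct = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(data)

random.seed(0)
# Synthetic linearly separable data: label is 1 iff x0 + x1 > 0;
# the trailing 1.0 is a bias feature.
data = []
for _ in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1), 1.0]
    data.append((x, 1 if x[0] + x[1] > 0 else 0))

start = time.perf_counter()
w = train_logreg_sgd(data)
runtime = time.perf_counter() - start
quality = accuracy(w, data)
print(f"single-machine baseline: {runtime:.3f}s, accuracy {quality:.3f}")
```

A distributed system would then be evaluated by asking how many machines (and how much time) it needs to reach the same quality level, which is the comparison the abstract argues current data flow systems lose against single-machine libraries.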

