direkt zum Inhalt springen

direkt zum Hauptnavigationsmenü

Sie sind hier

TU Berlin

Inhalt des Dokuments


Benchmarking Data Flow Systems for Scalable Machine Learning
Zitatschlüssel BodenSRM17
Autor Christoph Boden, Andrea Spina, Tilmann Rabl, Volker Markl.
Seiten 10
Jahr 2017
DOI https://doi.org/10.1145/3070607.3070612
Journal Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR'17), Jan Hidders (Ed.). ACM, New York, NY, USA, Article 5, 10 pages.
Nummer Article No. 5
Zusammenfassung Distributed data flow systems such as Apache Spark or Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large scale machine learning such as click-through rate prediction rely on models trained on billions of data points which are both highly sparse and high-dimensional. Existing Benchmarks attempt to assess the performance of data flow systems such as Apache Flink, Spark or Hadoop with non-representative workloads such as WordCount, Grep or Sort. They only evaluate scalability with respect to data set size and fail to address the crucial requirement of handling high dimensional data. We introduce a representative set of distributed machine learning algorithms suitable for large scale distributed settings which have close resemblance to industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters and run a comprehensive set of experiments to assess their scalability with respect to both: data set size and dimensionality of the data. We evaluate the systems for data up to four billion data points and 100 million dimensions. Additionally we compare the performance to single-node implementations to put the scalability results into perspective. Our results indicate that while being able to robustly scale with increasing data set sizes, current state of the art data flow systems are surprisingly inefficient at coping with high dimensional data, which is a crucial requirement for large scale machine learning algorithms
Link zur Publikation [1] Link zur Originalpublikation [2] Download Bibtex Eintrag [3]

------ Links: ------

Zusatzinformationen / Extras


Schnellnavigation zur Seite über Nummerneingabe

Copyright TU Berlin 2008