Publications

Efficient Fault Tolerance for Massively Parallel Dataflow Systems
Citation key Dudoladov16
Author Sergej Dudoladov
Year 2016
Journal Proceedings of the VLDB 2016 PhD Workshop, co-located with the 42nd International Conference on Very Large Data Bases
Volume 2016 (1671)
Abstract Dataflow systems provide fault tolerance by combining checkpointing and lineage but leave it to the data scientist to decide when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or recovery, e.g., if a data scientist forgets to checkpoint expensive operators that must then be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce checkpoint size for machine learning algorithms using qpoints, a compressed representation of the algorithms’ parameters. Qpoints enable the algorithms to run faster by spending less time on checkpointing. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from making tedious decisions about fault tolerance while retaining reasonable performance guarantees in case of failure.
