TU Berlin


Efficient Fault Tolerance for Massively Parallel Dataflow Systems
Citation key Dudoladov16
Author Sergej Dudoladov
Year 2016
Journal Proceedings of the VLDB 2016 PhD Workshop, co-located with the 42nd International Conference on Very Large Databases
Volume 2016 (1671)
Abstract Dataflow systems provide fault tolerance by combining checkpointing and lineage but leave it up to a data scientist to decide when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or recovery, e.g., if a data scientist forgets to checkpoint expensive operators that need to be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce checkpoint size for machine learning algorithms using qpoints, a compressed representation of the algorithms’ parameters. Qpoints enable the algorithms to run faster by spending less time on checkpointing. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from making tedious decisions about fault tolerance while retaining reasonable performance guarantees in case of failure.
