TU Berlin

Database Systems and Information Management Group: Publications


Efficient Fault Tolerance for Massively Parallel Dataflow Systems
Citation key Dudoladov16
Author Sergej Dudoladov
Year 2016
Journal Proceedings of the VLDB 2016 PhD Workshop, co-located with the 42nd International Conference on Very Large Databases
Volume 2016 (1671)
Abstract Dataflow systems provide fault tolerance by combining checkpointing and lineage but leave it up to a data scientist to decide when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or recovery, e.g., if a data scientist forgets to checkpoint expensive operators that need to be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce checkpoint size for machine learning algorithms using qpoints, a compressed representation of the algorithms' parameters. Qpoints enable the algorithms to run faster by spending less time on checkpointing. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from making tedious decisions about fault tolerance while retaining reasonable performance guarantees in case of failure.

