TU Berlin

Database Systems and Information Management GroupPublications

Logo FG DIMA-new  65px

Page Content

to Navigation


Efficient fault-tolerance for iterative graph processing on distributed dataflow systems
Citation key XuHKM16
Author Xu, Chen; Holzemer, Markus; Kaul, Manohar; Markl, Volker
Pages 613-624
Year 2016
DOI 10.1109/ICDE.2016.7498275
Location Helsinki, Finland
Journal 2016 IEEE 32nd International Conference on Data Engineering (ICDE)
Volume 00
Abstract Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
Link to publication Link to original publication Download Bibtex entry


Quick Access

Schnellnavigation zur Seite über Nummerneingabe