TU Berlin

Fachgebiet Datenbanksysteme und InformationsmanagementPublikationen

Logo FG DIMA-new  65px


zur Navigation


Efficient fault-tolerance for iterative graph processing on distributed dataflow systems
Zitatschl├╝ssel XuHKM16
Autor Xu, Chen; Holzemer, Markus; Kaul, Manohar; Markl, Volker
Seiten 613-624
Jahr 2016
DOI 10.1109/ICDE.2016.7498275
Ort Helsinki, Finland
Journal 2016 IEEE 32nd International Conference on Data Engineering (ICDE)
Jahrgang 00
Zusammenfassung Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
Link zur Publikation Link zur Originalpublikation Download Bibtex Eintrag



Schnellnavigation zur Seite über Nummerneingabe