Abstract
When clusters are scaled out to compute complex insights in long-running iterative jobs, failures become quite frequent.
Therefore, the goal of this thesis was to find a recovery mechanism for distributed dataflow systems that minimizes the recovery time of iterative jobs while keeping the runtime overhead during normal execution as low as possible.
To achieve this, we propose a non-blocking way of taking checkpoints and analyse three recovery methods (simple checkpointing, confined recovery, and replication-based recovery), both theoretically and through extensive experiments.