
K. Forster

09.06.2016, A PhD Workshop Paper Accepted at VLDB'16

We are happy to announce that the PhD workshop paper "Efficient Fault Tolerance for Massively Parallel Dataflow Systems" by Sergey Dudoladov (supervised by Prof. Dr. Markl) has been accepted for publication at the PhD Workshop of VLDB'16.

Dataflow systems provide fault tolerance by combining checkpointing and lineage, but they leave it to the data scientist to decide when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or during recovery, e.g., when a data scientist forgets to checkpoint expensive operators that must then be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce the checkpoint size of machine learning algorithms using qpoints, a compressed representation of the algorithms' parameters. Because less time is spent on checkpointing, qpoints let the algorithms run faster. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from tedious fault tolerance decisions while retaining reasonable performance guarantees in case of failure.
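The trade-off described above (checkpoint overhead during failure-free execution versus re-execution cost after a failure) can be illustrated with a small dynamic program. This is a minimal sketch under a deliberately simple, hypothetical cost model; it is not the smartpoints optimizer or its cost model. It assumes a linear chain of operators with known runtimes and checkpoint costs, a durable job input, at most one failure occurring with probability `p_fail` during a uniformly chosen operator (losing that operator's full work), and recovery that re-executes everything after the most recent checkpoint, ignoring the cost of reading the checkpoint back. The function `plan_checkpoints` and its parameters are illustrative names only.

```python
from functools import lru_cache
from typing import List, Tuple


def plan_checkpoints(runtime: List[float], ckpt_cost: List[float],
                     p_fail: float) -> Tuple[float, List[int]]:
    """Return (expected job cost, indices of operators whose output is checkpointed).

    Hypothetical toy model, not the smartpoints cost model:
    - operators form a linear chain 0..n-1 and the job input is durable;
    - with probability p_fail exactly one failure hits one operator, chosen
      uniformly at random, and that operator's full work is lost;
    - recovery re-executes every operator after the most recent checkpoint,
      up to and including the failed one (checkpoint read cost is ignored).
    """
    n = len(runtime)
    q = p_fail / n  # probability that the failure hits one particular operator

    def recovery(a: int, b: int) -> float:
        # Expected re-execution cost for failures in operators a+1..b when the
        # latest checkpoint holds the output of operator a (a = -1: job input).
        total, redo = 0.0, 0.0
        for j in range(a + 1, b + 1):
            redo += runtime[j]  # work redone if the failure hits operator j
            total += q * redo
        return total

    @lru_cache(maxsize=None)
    def best(a: int) -> Tuple[float, Tuple[int, ...]]:
        # Option 1: no further checkpoints after operator a.
        options = [(recovery(a, n - 1), ())]
        # Option 2: the next checkpoint stores the output of operator b. The
        # final operator is skipped because its output is the job result.
        for b in range(a + 1, n - 1):
            rest_cost, rest_plan = best(b)
            options.append((recovery(a, b) + ckpt_cost[b] + rest_cost,
                            (b,) + rest_plan))
        return min(options)

    extra, plan = best(-1)
    return sum(runtime) + extra, list(plan)


if __name__ == "__main__":
    # An expensive first operator followed by cheap ones, with cheap checkpoints.
    print(plan_checkpoints([100, 1, 1, 1], [1, 1, 1, 1], p_fail=1.0))
    # -> (130.5, [0])
```

In this toy model, checkpointing only the output of the expensive first operator wins: it costs one unit during failure-free execution but avoids re-running that operator in most failure scenarios. If the checkpoint costs are raised far above the expected re-execution cost, the same function returns an empty plan, which mirrors the abstract's point that the right placement depends on the job and on the cluster's failure behavior.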