DIMA Researchers@VLDB 2016
Dr. Tilmann Rabl, Scientific Coordinator of the BBDC, presented the current status of the standardization effort for the BigBench benchmark at the Transaction Processing Performance Council's Technology Conference (TPCTC), collocated with VLDB. Prof. Ziawasch Abedjan presented a paper on the experimental evaluation of error detection algorithms and had a second contribution that was presented by a co-author. Sergey Dudoladov gave a presentation on his work on fault tolerance at the PhD workshop, and Jonas Traub gave a tutorial on Apache Flink at the Workshop on Big data Open Source Systems (BOSS), which was organized by Dr. Tilmann Rabl and Dr. Sebastian Schelter.
Sergey Dudoladov presented a paper at the PhD workshop of VLDB 2016. The paper outlines two research directions to be pursued in 2016-2017: (a) compression of checkpoints optimized for machine learning applications and (b) automatic placement of checkpoints at runtime. The former exploits a property of machine learning algorithms such as neural networks or generalized linear models, namely their tolerance of approximate parameter values, to devise a compression scheme that is estimated to reduce the size of checkpoints by up to 75 percent compared to the uncompressed version. The latter aims to develop an optimizer that employs the randomized weighted majority algorithm to select the best checkpoint strategy at runtime based on the actual failure rates of a cluster. The overall goal of this research is to reduce fault tolerance overhead in a declarative fashion, which falls within the scope of the BBDC (Berlin Big Data Center, www.bbdc.berlin) work package 10 "Automatic parallelization and optimization". Sergey's main goal in visiting VLDB was to get feedback from world-leading experts in the area on both ideas and on their potential to automatically optimize machine learning applications.
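To make the second research direction more concrete, the following is a minimal sketch of the generic randomized weighted majority algorithm, not the optimizer described in the paper. It assumes a fixed set of candidate checkpoint strategies (the names below are hypothetical) and a 0/1 loss per round that marks which strategies would have performed poorly under the failures actually observed. How such a loss would be measured on a real cluster is left open here.

```python
import random

def randomized_weighted_majority(loss_rounds, beta=0.5, seed=0):
    """Generic randomized weighted majority over a fixed set of experts.

    loss_rounds: list of rounds; each round holds one 0/1 loss per expert
    beta: multiplicative penalty in (0, 1) for experts that incurred a loss
    Returns the expert index chosen in each round and the final weights.
    """
    rng = random.Random(seed)
    n = len(loss_rounds[0])
    weights = [1.0] * n
    choices = []
    for losses in loss_rounds:
        # Sample one expert with probability proportional to its weight.
        chosen = rng.choices(range(n), weights=weights, k=1)[0]
        choices.append(chosen)
        # Shrink the weight of every expert that did badly in this round.
        weights = [w * beta if loss else w for w, loss in zip(weights, losses)]
    return choices, weights

# Hypothetical checkpoint strategies acting as "experts" (illustrative names only).
strategies = ["no_checkpoints", "checkpoint_every_10_iters", "checkpoint_every_100_iters"]
# Assumed observations: 1 means the strategy would have been a poor choice that round.
rounds = [[1, 0, 1], [1, 0, 0], [0, 0, 1]]
choices, weights = randomized_weighted_majority(rounds)
best = strategies[weights.index(max(weights))]
```

In this sketch, strategies that repeatedly prove costly lose weight geometrically, so they are sampled less and less often, while the random choice keeps the selection from locking onto a single strategy too early.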
Professor Abedjan presented his paper “Detecting Data Errors: Where are we and what needs to be done?” on the experimental evaluation of error detection algorithms, which was joint work with colleagues from MIT, QCRI, and the University of Waterloo. His second contribution, “Temporal Rules Discovery for Web Data Cleaning”, was presented by his co-author Dr. Mourad Ouzzani from QCRI.
Tilmann Rabl presented his paper “From BigBench to TPCx-BB: Standardization of a Big Data Benchmark” at the annual TPCTC conference, collocated with VLDB. The paper gives an overview of the process of standardizing the BigBench big data benchmark and a summary of the changes that the benchmark underwent. He also organized the Second Workshop on Big data Open Source Systems (BOSS) with Sebastian Schelter. BOSS is a novel workshop format in which multiple hands-on tutorials on big data systems run in parallel and participants gain first-hand experience under the guidance of the systems' developers.
Jonas Traub gave a tutorial on Apache Flink in the BOSS workshop. The focus of the tutorial was the processing of data streams in real time, which has become a critical building block for many modern businesses. In stream processing, the data to be processed is viewed as a continuously produced series of events over time. This departs from the well-studied batch processing paradigm, where data is viewed as static. In fact, batch processing problems can be viewed as a subclass of stream processing problems in which the streams are finite. Building on this principle, Apache Flink has emerged as one of the most successful open-source stream processors, with an active community of 190 contributors from both industry and academia and several companies deploying it in production. The attendees of the tutorial learned about the most important notions of the stream processing paradigm, were introduced to the architecture and the main building blocks of Flink, and could explore different features in hands-on tasks.
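The idea that batch processing is stream processing over a finite stream can be illustrated with a small sketch. It uses plain Python generators rather than the Flink API, and all function names are hypothetical. The same streaming operator is applied once to an unbounded source and once to a finite collection.

```python
import itertools
from typing import Iterable, Iterator

def running_sum(events: Iterable[int]) -> Iterator[int]:
    """Stream-style operator: emit an updated result after every event."""
    total = 0
    for value in events:
        total += value
        yield total

def batch_sum(events: Iterable[int]) -> int:
    """Batch-style result: the last value the streaming operator emits
    when its input stream is finite (0 for an empty input)."""
    result = 0
    for result in running_sum(events):
        pass
    return result

def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source producing 1, 2, 3, ..."""
    return itertools.count(1)

# Unbounded stream: results arrive continuously; only the first five are taken here.
first_five = list(itertools.islice(running_sum(sensor_readings()), 5))
# Finite stream (a batch): the same operator runs to completion.
finite_total = batch_sum([1, 2, 3, 4, 5])
```

On the unbounded source the operator never finishes and intermediate results are the output, whereas on the finite input the final emitted value is the batch answer.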
Sebastian Schelter co-organized the BOSS workshop with Tilmann Rabl. Furthermore, he attended a panel discussion on the relationship between the machine learning and database communities. A main outcome of the discussion was that the database community needs a new “grand challenge” to be able to compete with prestigious projects like self-driving cars.
Sergey Dudoladov, Efficient Fault Tolerance for Massively Parallel Dataflow Systems, Proceedings of the VLDB 2016 PhD Workshop, co-located with the 42nd International Conference on Very Large Data Bases
Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker and Nan Tang, Detecting Data Errors: Where are we and what needs to be done?, Proceedings of the VLDB Endowment 9(12), August 2016, 993-1004.
Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Paolo Papotti and Michael Stonebraker, Temporal Rules Discovery for Web Data Cleaning
Paul Cao, Bhaskar Gowda, Seetha Lakshmi, Chinmayi Narasimhadevara, Patrick Nguyen, John Poelman, Meikel Poess and Tilmann Rabl, From BigBench to TPCx-BB: Standardization of a Big Data Benchmark, TPCTC, New Delhi, 09/05/2016
Apache Flink: flink.apache.org