- © DIMA
In August 2018, several current and former members of the DIMA research group attended the 44th International Conference on Very Large Data Bases (VLDB) in Rio de Janeiro. VLDB is one of the leading international database conferences. In total, we had seven contributions to the conference itself and to related workshops.
The DIMA researchers named the following program items as their personal highlights:
- Blockchain Tutorial by UC Santa Barbara, which laid out the opportunities, risks, pros, and cons of this technology. Notably, it was given by presenters who have no stake in any blockchain technology.
- Time Series Tutorial
- ADMS Keynotes: Quantum Computing and SIMD
- Industry Talk: Machine Learning for Online Advertising (Google)
- Amazon Time Series Tutorial (https://lovvge.github.io/Forecasting-Tutorial-VLDB-2018)
... and gave a list of recommended reads:
- Moment-based Quantile Sketches for Efficient High Cardinality Aggregate Queries
- Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses
- Join Query Optimization Techniques for Complex Event Processing Applications (Ilya Kolchinsky and Assaf Schuster; Technion, Israel Institute of Technology)
- On Optimizing Operator Fusion Plans for Large-scale Machine Learning in SystemML, Boehm, M., Reinwald, B., Hutchison, D., Sen, P., Evfimievski, A. V., & Pansare, N.
- Filter Before you Parse: Faster Analytics on Raw Data with Sparser, Palkar, S., Abuzaid, F., Bailis, P., & Zaharia, M.
»» BlockJoin: Efficient Matrix Partitioning Through Joins, Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Tilmann Rabl, Volker Markl.
Abstract: Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
Download Poster  | Link to the paper 
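The core idea can be illustrated with a toy sketch (this is not the paper's implementation: the block size, data, and single-machine join below are simplifications for illustration). A keyed join is evaluated with an index on the smaller input, and the joined feature rows are emitted directly as fixed-size row blocks, so no separate re-partitioning step is needed before the linear algebra:

```python
BLOCK_ROWS = 2  # rows per matrix block (hypothetical choice)

def block_join(left, right, block_rows=BLOCK_ROWS):
    """Join two keyed tables and pack the joined feature rows directly
    into row blocks, as a block-partitioned matrix engine expects."""
    right_index = {key: vec for key, vec in right}  # index on the smaller side
    joined = [lvec + right_index[key] for key, lvec in left if key in right_index]
    # Emit blocks of `block_rows` consecutive rows (the last block may be short).
    return [joined[i:i + block_rows] for i in range(0, len(joined), block_rows)]

left = [(1, [1.0]), (2, [2.0]), (3, [3.0])]      # (key, features)
right = [(1, [10.0]), (3, [30.0]), (4, [40.0])]
blocks = block_join(left, right)
# Keys 1 and 3 match, yielding one 2-row block of concatenated feature rows.
```

In the distributed setting of the paper, the point is precisely to avoid materializing `joined` as a row-partitioned intermediate before re-partitioning into blocks.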
»» Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models, Martin Kiefer, Max Heimel, Sebastian Breß, Volker Markl.
Abstract: Accurately predicting the cardinality of intermediate plan operations is an essential part of any modern relational query optimizer. The accuracy of said estimates has a strong and direct impact on the quality of the generated plans, and incorrect estimates can have a negative impact on query performance. One of the biggest challenges in this field is to predict the result size of join operations. Kernel Density Estimation (KDE) is a statistical method to estimate multivariate probability distributions from a data sample. Previously, we introduced a modern, self-tuning selectivity estimator for range scans based on KDE that outperforms state-of-the-art multidimensional histograms and is efficient to evaluate on graphics cards. In this paper, we extend these bandwidth-optimized KDE models to estimate the result size of single and multiple joins. In particular, we propose two approaches: (1) Building a KDE model from a sample drawn from the join result. (2) Efficiently combining the information from base table KDE models.
Link to the paper 
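As rough intuition for the KDE approach, here is a minimal single-column sketch (not the paper's GPU-accelerated, bandwidth-optimized estimator; the sample and bandwidth below are made up): the selectivity of a range predicate is estimated by averaging, over all sample points, the Gaussian kernel mass that falls inside the range.

```python
import math

def gaussian_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kde_range_selectivity(sample, lo, hi, bandwidth):
    """Estimate P(lo <= X <= hi) with a Gaussian KDE: each sample point
    contributes the probability mass its kernel places inside [lo, hi]."""
    return sum(
        gaussian_cdf((hi - x) / bandwidth) - gaussian_cdf((lo - x) / bandwidth)
        for x in sample
    ) / len(sample)

sample = [1.0, 2.0, 2.5, 4.0, 7.0]   # sample of a table column
est = kde_range_selectivity(sample, lo=1.5, hi=3.0, bandwidth=0.5)
# est is close to 2/5, the fraction of sample points inside [1.5, 3.0]
```

The quality of such an estimate hinges on the bandwidth parameter; tuning it numerically against query feedback is the central idea behind the bandwidth-optimized models the abstract refers to.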
»» Generating Custom Code for Efficient Query Execution on Heterogeneous Processors, Sebastian Breß, Bastian Köcher, Henning Funke, Steffen Zeuch, Tilmann Rabl, Volker Markl.
Link to the paper 
Workshop TPCTC 
Christoph Boden presented the two contributions to the Tenth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2018):
»» Benchmarking Distributed Data Processing Systems for Machine Learning Workloads, Christoph Boden, Tilmann Rabl, Sebastian Schelter, and Volker Markl.
Abstract: Distributed data processing systems have been widely adopted to robustly scale out computations on massive data sets to many compute nodes in recent years. These systems are also popular choices to scale out the training of machine learning models. However, there is a lack of benchmarks to assess how efficiently data processing systems actually perform at executing machine learning algorithms at scale. In this paper, we share our experience in evaluating novel data processing systems and present a core set of experiments of a benchmark for distributed data processing systems for machine learning workloads, a rationale for their necessity, as well as an experimental evaluation.
Link to the paper 
»» PolyBench: The first benchmark for polystores, Jeyhun Karimov, Tilmann Rabl, Volker Markl.
Abstract: Modern business intelligence requires data processing not only across a huge variety of domains but also across different paradigms, such as relational, stream, and graph models. This variety is a challenge for existing systems that typically only support a single or few different data models. Polystores were proposed as a solution for this challenge and received wide attention both in academia and in industry. The goal of this work is to develop the first benchmark for polystores.
Link to the paper 
Panel Workshop BIRTE 
- © Jonas Traub
»» Are we making any attempts towards solving the hardest problems in stream processing today?
Jonas Traub was one of the panel speakers at the International Workshop on Real-Time Business Intelligence and Analytics (BIRTE).
Most of today’s Internet applications are data-centric and generate vast amounts of data that need to be processed and analyzed for detailed reporting, enhancing user experience, and increasing monetization. Streaming data processing systems must be designed based on a varying set of requirements, which can be categorized along different properties of such systems:
1. Consistency: Does every record in the input (or, equivalently, every input event) need to be committed to the output exactly once, at least once, or at most once? Is the event committed atomically or eventually to all outputs?
2. Scale: How many events per second can the system process? Tens? Thousands? Millions? Billions or even more? Does the system auto-scale to a new workload?
3. Failure Resilience: What kinds of failures is the system resilient to? Machine-level, partial datacenter-level, or entire datacenter-level? Is it enough to ensure that the data processing system itself is failure-resistant? Does the output need to be stored in a globally consistent way? Is the system resilient to a bug in the input data, a bug in the user’s business logic, etc.?
4. Latency: How long does it take for an event to be committed after it is generated? Milliseconds, seconds, minutes, hours, or days? Should we target SLOs for median latencies, the 90th percentile, or higher tail latencies?
5. Expressiveness: What kind of operations can the user express in the system? From simple stateless operations (e.g. filter) to complex joins or stateful operations (e.g. HAVING clause in SQL)? How flexible is the system to add more input sources and output sinks?
6. Cost: This includes not only hardware cost (CPU, RAM, disk, network, etc.) but also engineering design complexity, the cost of production support to run the system as a service, and providing SLOs for latency and completeness. From a pure business perspective, all of this cost needs to be justified by the value the end user gets.
7. Service: Does the system run as a service for the users? Multi-tenant? What kind of isolation (e.g. performance, security, etc) is provided amongst users? How is business logic isolated from infrastructure? How easy is it for users to modify business logic in a self-service way?
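The exactly-once question in point 1 can be made concrete with a toy example (illustrative only, not modeled on any particular system): an at-least-once channel may redeliver events after retries, and the sink restores exactly-once output by deduplicating on a unique event id, i.e. by making commits idempotent.

```python
def deliver_at_least_once(events):
    """Simulate an at-least-once channel: every second event is redelivered."""
    for i, event in enumerate(events):
        yield event
        if i % 2 == 1:
            yield event  # duplicate delivery after a simulated retry

def exactly_once_sink(deliveries):
    """Commit each event id at most once, making output effectively exactly-once."""
    seen, committed = set(), []
    for event_id, payload in deliveries:
        if event_id in seen:
            continue  # idempotent commit: drop the duplicate
        seen.add(event_id)
        committed.append(payload)
    return committed

events = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
out = exactly_once_sink(deliver_at_least_once(events))
# out == ["a", "b", "c", "d"]: duplicates were suppressed at the sink
```

Real systems must additionally make the `seen` state itself failure-resilient (e.g. via checkpointing), which is where the consistency and failure-resilience requirements above intertwine.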
Lots of systems adopt a lambda architecture: use stream processing for best-effort (approximate) analysis, and use batch processing (e.g. daily) for strong consistency, high reliability, and so on. This represents an easy way out. But is it the right thing to do?
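A minimal sketch of that lambda pattern (hypothetical data and functions, purely for illustration): the speed layer counts every delivery as it arrives, duplicates included, while a daily batch job recomputes exact counts from the deduplicated master log and overrides the streaming view.

```python
from collections import Counter

def speed_layer(stream):
    """Approximate real-time counts: every delivery is counted, duplicates included."""
    return Counter(key for _event_id, key in stream)

def batch_layer(log):
    """Exact counts: deduplicate by event id before counting."""
    unique = {event_id: key for event_id, key in log}
    return Counter(unique.values())

# The master log holds each event once; the stream delivered event 2 twice.
log = [(1, "click"), (2, "click"), (3, "view")]
stream = [(1, "click"), (2, "click"), (2, "click"), (3, "view")]

approx = speed_layer(stream)   # {"click": 3, "view": 1}  (overcounts)
exact = batch_layer(log)       # {"click": 2, "view": 1}
```

A serving layer would answer queries from the exact batch view plus the streaming deltas that arrived since the last batch run; the operational cost of maintaining both paths is exactly the trade-off the panel questioned.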
Which of these pillars matters for which applications? Can we build a system that provides all these properties? What are the tradeoffs?