Document Contents
DIMA Colloquium Dates
Piotr Lasek, University of Rzeszów, Poland
"Beginnings and Current Challenges in Interactive Data Visualization"
Alberto Lerner (NYU)
"Riding the New Hardware Wave - Opportunities and Challenges for Peak Database Performance"
"Non-Invasive Progressive Optimization for In-Memory Databases"
James Clarkson, University of Manchester, UK
"Tornado: Practical Heterogeneous Programming in Java"
Frank McSherry
"Monitoring motifs in graph streams"
Guy Lohman, IBM Almaden Research Center
"Wildfire: Evolving Databases for New-Generation Big Data Applications"
Till Rohrmann, Data Artisans
"Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems"
Andrei Mihnea, SAP
"Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA"
28.11.2016, 2:00 pm
DFKI Projektbüro Berlin, 4th floor, room Weizenbaum, Alt-Moabit 91 C, 10559 Berlin
Prof. Dr. Stephan Günnemann, Technical University of Munich
"Beyond Independence: Efficient Learning Techniques for Networks and Temporal Data"
John Wilkes, Principal Software Engineer, Technical Infrastructure, Google
"Large-scale data analysis at cloud scale"
Prof. Rainer Gemulla, Universität Mannheim
"New Directions for Data Mining on Sequences and Matrices"
Piotr Lasek, University of Rzeszów, Poland
Beginnings and Current Challenges in Interactive Data Visualization
In today's world, interactive visualization of large data is a must.
Since the very beginning, filtering, sampling, and aggregation have been
the three basic ways of dealing with large amounts of data. All of these
methods helped, and still help, to squeeze a large number of objects
into a limited number of pixels on a computer's screen. Among the
three, however, aggregation seems to be the most meaningful way of
preprocessing data for visualization. Thus aggregation, specifically
so-called inductive aggregation, became the focus of our research.
During the presentation we will discuss challenges and architectures
of early visualization systems and present our prototype visualization
system called Skydive. We will also try to explain why inductive
aggregation may be useful for data visualization (in terms of
efficiency and meaningfulness); why it is not obvious which
aggregation function should be used; and how the paucity of graphical
channels can be addressed using the capabilities of modern graphics cards.
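The aggregation idea above can be made concrete with a toy sketch (hypothetical code, not Skydive itself): bin 2-D points into a fixed pixel grid and reduce each bin with an aggregation function, so that arbitrarily many objects collapse to at most one value per pixel.

```python
from collections import defaultdict

def aggregate_to_grid(points, width, height, agg=len):
    """Bin 2-D points with coordinates in [0, 1) into a width x height
    pixel grid and reduce each bin with `agg` (default: point count)."""
    bins = defaultdict(list)
    for x, y in points:
        px = min(int(x * width), width - 1)    # pixel column
        py = min(int(y * height), height - 1)  # pixel row
        bins[(px, py)].append((x, y))
    return {cell: agg(pts) for cell, pts in bins.items()}

# Three points collapse to two pixel-level aggregates.
grid = aggregate_to_grid([(0.1, 0.2), (0.12, 0.21), (0.9, 0.9)], 10, 10)
```

Any associative reducer (count, mean, min/max of an attribute) can be plugged in as `agg`, which is one reason the choice of aggregation function is not obvious.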
Piotr Lasek is currently an Assistant Professor at the University
of Rzeszów, Poland. He obtained his PhD at the Warsaw University of
Technology in the field of data mining - his thesis was devoted to
density-based data clustering. Over the past 2 years he was a
Postdoctoral Fellow with the Database Laboratory at York University,
Toronto working on efficient data visualization methods employing the
concept of inductive aggregation. His current research interests span
both interactive data exploration through visualization as well as
density-based clustering with constraints.
Alberto Lerner (NYU)
Riding the New Hardware Wave - Opportunities and Challenges for
Peak Database Performance
We are living in interesting times hardware-wise.
CPUs, which are already made of a mix of general and specialized
components, will soon have a reconfigurable portion as well; volatile
memory won't necessarily be so for much longer; flash memory, which
has been hidden—and slowed down—by layers of
block-device-compatibility logic, is being addressed in increasingly
direct ways; networks, which had a 10x boost from 1 to 10Gb not too
long ago, have gotten another 10x boost to 100Gb, now challenging
internal buses in terms of speed; and there are now new ways to lay
interconnecting fabrics that all but blur the notion of where one
computer ends and the next one starts.(*)
Seldom has the industry seen a time when so many technologies reached their commercial debut simultaneously.
And that means different tradeoffs and challenges for a systems researcher or practitioner interested in databases. The way in which each piece of datum is organized while in its resting state—if it is allowed to lay still at all—and the trajectory that it follows from there till it reaches a query result set can now be quite different than it used to be. We illustrate this by discussing two distinct use cases from classic database systems design: how to support fast (as in networking speeds fast) journalling and how to provide an elastic, distributed data structure that a query execution engine could be based on.
* Intel Xeon/Altera chip, Intel/Micron Optane, CNexLabs Host-based SSD FTL, Mellanox ConnectX-5, NVMoF/NTB PCIe/OpenCAPI, respectively.
Alberto Lerner is a consultant based in New York City, home of two very data-hungry industries: finance and advertisement. He helps teams in those areas build proprietary data pipelines and storage systems. Before that, he was part of the teams behind a few different database engines: IBM's DB2, working on robustness aspects of the query optimizer; Google's Bigtable, on elasticity aspects; and MongoDB, on general architecture. Alberto is formally trained in Computer Science and received his Ph.D. from ENST - Paris (now ParisTech), having done his research work at INRIA/Rocquencourt and NYU.
Frank McSherry
Monitoring motifs in graph streams
Imagine you are in charge of a high-volume stream of social interactions, and you would like to watch for certain graph structures in the stream of interactions. For example, Twitter recommends "who to follow" by looking for accounts followed by at least two accounts you follow, a structure which can be described by a four-node graph. There can be a substantial number of these motifs in a graph, but what if you only need to observe the changes as they happen, rather than enumerate all instances at once?
This work is an instance of the more general problem of maintaining cyclic (non-treelike) joins as the underlying relations change. We will first summarize recent work in worst-case optimal join evaluation (Ngo et al., 2014) which shows how to evaluate cyclic joins using resources asymptotically bounded by the largest possible output of the join (a property standard trees of binary joins cannot satisfy). We then build up a dataflow version of this algorithm, extend it to efficiently respond to changes in its inputs, and describe and evaluate its implementation in timely dataflow*.
This project is joint work with Khaled Ammar and Semih Salihoglu,
both of University of Waterloo.
*: Timely dataflow's cycles are not required for this implementation, so it could also be suitable for other, less crazy streaming systems.
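The "who to follow" motif can be maintained incrementally, as a toy illustration (a hypothetical sketch, not the timely dataflow implementation from the talk): for a single user, track how many of their followees follow each account, and emit a recommendation the moment that support reaches two.

```python
from collections import defaultdict

class WhoToFollow:
    """Incrementally maintain 'who to follow' candidates for one user:
    accounts followed by at least two accounts the user follows.
    Illustrative toy, not the general cyclic-join maintenance algorithm."""
    def __init__(self, user):
        self.user = user
        self.followees = set()           # accounts `user` follows
        self.support = defaultdict(int)  # account -> # followees following it
        self.recommended = set()

    def add_edge(self, src, dst):
        """Process one new follow edge from the stream."""
        if src == self.user:
            self.followees.add(dst)
            # A full implementation would also replay dst's existing
            # out-edges here; omitted to keep the sketch short.
        elif src in self.followees and dst != self.user:
            self.support[dst] += 1
            if self.support[dst] >= 2:
                self.recommended.add(dst)

w = WhoToFollow("me")
for e in [("me", "a"), ("me", "b"), ("a", "c"), ("b", "c")]:
    w.add_edge(*e)
# "c" is now followed by two of my followees, so it becomes a candidate
```

Each edge is processed in constant time here; the hard part the talk addresses is doing this for general cyclic motifs without materializing all instances.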
Frank McSherry is an independent scientist working on scalable computation. His recent work focuses on data-parallel dataflow computation, in particular elevating their expressive power. He currently develops and maintains several related projects (https://github.com/frankmcsherry ) and writes a somewhat sassy blog (https://github.com/frankmcsherry/blog ). Independently, he has also done foundational work on differential privacy, and continues to maintain an active presence in this community.
Talk James Clarkson, University of Manchester, UK
"Tornado: Practical Heterogeneous Programming in Java"
As the popularity of “big data” frameworks grows, a lot of effort is currently being exerted trying to improve the performance of JVM (Java Virtual Machine) based languages, such as Java and Scala. One way of doing this is to develop mechanisms that allow these languages to make use of hardware accelerators, such as GPGPUs. As a result, there have been a number of projects, such as Project Sumatra (OpenJDK), Rootbeer, and APARAPI (AMD), that have attempted to support programming GPGPUs from Java. However, a lot of this prior art focuses only on accelerating simple workloads or providing an interface into another programming language, making it difficult to use them to create real-world applications. In this talk I will discuss how we have developed a framework that moves beyond the prior art and allows developers to accelerate complex Java applications.
Our Java-based framework, Tornado, provides developers with a simple task-based programming model which allows tasks to be assigned to devices. Typically, tasks are assigned to execute on a diverse set of hardware resources such as GPGPUs, FPGAs, and Xeon Phi. Moreover, the design of Tornado allows those assignments to be changed dynamically, meaning that applications are not artificially restricted to using a specific class of device. Additionally, the Tornado API has been designed to avoid the need to re-engineer applications to utilise the framework. To achieve that, we added support for a wider range of language features than the prior art - exceptions, inheritance, and objects, to name a few. Finally, we will share our experiences porting and accelerating a complex Computer Vision application in pure Java.
James Clarkson is a 3rd year PhD student from the University of Manchester in the UK. He is a member of the Advanced Processor Technologies (APT) group, working under the supervision of Mikel Lujan and Christos Kotselidis. His research interests are programming languages and programming exotic hardware architectures (in Java!). He is actively contributing to the EPSRC-funded AnyScale and PAMELA projects, and has previously contributed to the EU-funded Mont Blanc project.
 AnyScale project - http://anyscale.org 
 PAMELA project - http://apt.cs.manchester.ac.uk/projects/PAMELA/ 
 Mont Blanc project - https://www.montblanc-project.eu 
 Project Sumatra - http://openjdk.java.net/projects/sumatra/ 
 Rootbeer - https://github.com/pcpratts/rootbeer1 
 APARAPI - https://code.google.com/archive/p/aparapi/ 
Non-Invasive Progressive Optimization for In-Memory Databases
Progressive optimization introduces robustness for database workloads
against wrong estimates, skewed data, correlated attributes, or outdated
statistics. Previous work focuses on cardinality estimates and relies on
expensive counting methods as well as complex learning algorithms.
In this paper, we utilize performance counters to drive progressive
optimization during query execution. The main advantages are that
performance counters introduce virtually no costs on modern CPUs and that
their usage enables non-invasive monitoring. We present fine-grained cost models
to detect differences between estimates and actual costs, which enables us to
kick-start reoptimization. Based on our cost models, we implement an
optimization approach that efficiently estimates the individual selectivities of a
multi-selection query. Furthermore, we are able to learn
properties like sortedness, skew, or correlation at run time. In our
evaluation we show that the overhead of our approach is negligible, while
the performance improvements are convincing. Using progressive optimization, we
improve runtime by up to a factor of three compared to average run-times and up
to a factor of 4.5 compared to worst-case run-times. As a result, we avoid
costly operator execution orders and thus make query execution highly robust.
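A minimal sketch of the reordering idea for multi-selection queries (illustrative only; the paper drives reoptimization with hardware performance counters rather than explicit sampling): measure each predicate's selectivity on a prefix of the input, then evaluate the most selective predicate first on the remaining rows, so non-qualifying rows are rejected as early as possible.

```python
def reorder_selections(rows, predicates, sample_size=1000):
    """Progressively optimize a multi-selection query: observe each
    predicate's selectivity on a sample, then evaluate the remaining rows
    with the most selective (lowest pass-rate) predicate first."""
    sample, rest = rows[:sample_size], rows[sample_size:]
    selectivity = {p: sum(1 for r in sample if p(r)) / max(len(sample), 1)
                   for p in predicates}
    ordered = sorted(predicates, key=selectivity.get)  # most selective first
    out = [r for r in sample if all(p(r) for p in predicates)]
    for r in rest:
        if all(p(r) for p in ordered):  # short-circuits on selective filters
            out.append(r)
    return out, ordered

rows = list(range(10000))
preds = [lambda r: r % 2 == 0, lambda r: r % 100 == 0]
out, order = reorder_selections(rows, preds)
```

Because `all` short-circuits, placing the 1%-selective predicate first skips the 50%-selective one for the vast majority of rows, which is where the runtime factors quoted above come from.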
Guy Lohman, IBM Almaden Research Center (Retired)
Wildfire: Evolving Databases for New-Generation Big Data Applications
The rising popularity of large-scale, real-time analytics applications (such as real-time inventory and pricing, mobile applications that give you suggestions, fraud detection, risk analysis, etc.) emphasizes the need for scalable data management systems that can handle both fast transactions and analytics concurrently. However, efficient processing of transactional and analytical requests requires very different optimizations and architectural decisions in a system.
This talk presents the Wildfire system, which targets
Hybrid Transactional and Analytical Processing (HTAP). Wildfire
leverages the Spark ecosystem to enable large-scale data processing
with different types of complex analytical requests, and columnar data
processing to facilitate fast transactions as well as analytics.
Dr. Guy M. Lohman recently retired from IBM’s Almaden Research Center in San Jose, California, where he worked for over 34 years as a Distinguished Research Staff Member and Manager.
His group contributed BLU Acceleration to DB2 for Linux, UNIX, and Windows (LUW) 10.5 (2013) and the query engine of the IBM Smart Analytics Optimizer for DB2 for z/OS V1.1 (2010) and
the Informix Warehouse Accelerator (2011) products. He was the architect of the Query Optimizer of DB2 LUW and was responsible for its development from 1992 to 1997 (versions 2 – 5),
as well as its Visual Explain, efficient sampling, and Index Advisor. Dr. Lohman was elected to the IBM Academy of Technology in 2002, and named an IBM Master Inventor in 2011.
He was the General Co-Chair (with Prof. Sang Cha) of the 2015 IEEE ICDE Conference and General Chair of the 2013 ACM Symposium on Cloud Computing.
He has been awarded 40 U.S. patents and is the (co-)author of over 80 technical papers in the refereed academic literature.
Till Rohrmann, Data Artisans
"Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems"
Prof. Rainer Gemulla, Universität Mannheim
New Directions for Data Mining on Sequences and Matrices
In this talk, I will summarize our current research at the Data
Analytics chair, University of Mannheim. After a general overview,
I'll talk in more detail about two specific directions: declarative
sequential pattern mining on the one hand and matrix factorization on
the other. I'll briefly summarize these two directions in turn.
Sequential pattern mining is a fundamental task in data mining. Given a database of sequences (e.g., customer transactions, event logs, or natural-language sentences), the goal of sequential pattern mining is to detect relevant and interesting patterns in the data. The stated goal of our research is to make sequential pattern mining usable, useful, and efficient. I will introduce Desq, a general-purpose system for declarative pattern mining. Desq allows data scientists to specify what they want, but abstracts away algorithmic and implementation details. Desq unifies many of the existing variants of sequential pattern mining (including length, gap, span, regular-expression, and hierarchy constraints) and additionally goes beyond what was possible before. I will describe how Desq improves usability and how mining can be performed efficiently and scalably, and outline directions for maximizing the usefulness of the found patterns.
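To give a flavor of the task (a toy example, not Desq's pattern-expression language): count ordered item pairs that occur within an optional gap constraint, and keep those meeting a minimum support, where support counts each input sequence at most once.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(db, min_support, max_gap=None):
    """Mine ordered item pairs (a, b) occurring in that order within a
    sequence, optionally within `max_gap` positions of each other."""
    counts = Counter()
    for seq in db:
        seen = set()
        for i, j in combinations(range(len(seq)), 2):
            if max_gap is None or j - i <= max_gap:
                seen.add((seq[i], seq[j]))  # count each sequence once
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

db = [["a", "b", "c"], ["a", "c", "b"], ["a", "b"]]
patterns = frequent_pairs(db, min_support=2)
```

Real systems generalize far beyond pairs (arbitrary subsequences, hierarchies, regular expressions), which is exactly the variability a declarative specification hides from the user.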
Matrix factorization methods have been called the Swiss Army Knife of data mining. In general, matrix factorization methods represent each row (e.g., objects) and each column (e.g., attributes) of a data matrix with latent feature vectors (or "embeddings"). They are an effective tool for tasks such as denoising, compression, imputation of missing data, clustering, link prediction, and more. In this talk, I'll focus on our recent work on factorizing large Boolean matrices. Here we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. I present an alternative approach that first computes a continuous factorization and subsequently applies a rounding procedure to obtain a discrete representation. I discuss our current answers to questions such as what can be gained by rounding, whether this approach achieves lower reconstruction errors, and how hard it is to obtain a good factorization. A key concept to approach these questions is the notion of the "rounding rank" of a binary matrix, which has relationships to linear classification, dimensionality reduction, and nested matrices.
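The rounding idea can be illustrated in a few lines (a hypothetical construction, not the paper's algorithms): the nested, upper-triangular all-ones matrix has full real rank, yet rounding the product of rank-1 continuous factors at a threshold reconstructs it exactly, showing how much rounding can gain over a plain continuous factorization.

```python
def round_product(L, R, tau):
    """Round each entry of the continuous product L x R at threshold tau,
    yielding a 0/1 matrix."""
    k = len(R)
    return [[1 if sum(L[i][t] * R[t][j] for t in range(k)) >= tau else 0
             for j in range(len(R[0]))]
            for i in range(len(L))]

n = 4
# Nested (upper-triangular all-ones) Boolean matrix: real rank is n.
T = [[1 if j >= i else 0 for j in range(n)] for i in range(n)]
# Rank-1 continuous factors: entry (i, j) of L x R is (j + 1) / (i + 1),
# which is >= 1 exactly when j >= i. Threshold slightly below 1 to be
# robust to floating-point rounding.
L = [[1.0 / (i + 1)] for i in range(n)]
R = [[float(j + 1) for j in range(n)]]
rounded = round_product(L, R, tau=0.99)
```

So this matrix has rounding rank 1 despite full real rank, which is the kind of gap the "rounding rank" concept quantifies.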
John Wilkes, Principal Software Engineer, Technical Infrastructure, Google
Large-scale data analysis at cloud scale
Google has been tackling large-scale big data problems for more than 15 years. Experience with the systems we built to do so has led us to develop a new set of tools for large-scale analysis and queries, including streaming. I’ll provide an overview of some of the systems we’ve built and are now making available for others to build on.
John Wilkes has been at Google since 2008, where he is working on
cluster management for Google's compute infrastructure; he was one
of the architects of Omega. He is interested in far too many aspects
of distributed systems, but a recurring theme has been technologies
that allow systems to manage themselves.
He received a PhD in computer science from the University of Cambridge, joined HP Labs in 1982, and was elected an HP Fellow and an ACM Fellow in 2002 for his work on storage system design. Along the way, he’s been program committee chair for SOSP, FAST, EuroSys and HotCloud, and has served on the steering committees for EuroSys, FAST, SoCC and HotCloud. He’s listed as an inventor on 40+ US patents, and has an adjunct faculty appointment at Carnegie-Mellon University. In his spare time he continues, stubbornly, trying to learn how to blow glass.
Prof. Dr. Stephan Günnemann, Technical University of Munich
Beyond Independence: Efficient Learning Techniques for Networks and Temporal Data
Going beyond independence, most of the data gathered in today's applications shows complex dependency structures: people, for example, interact with each other in social networks; similarly, sensors in a cyber-physical system continuously measure dependent signals over time. Networks and temporal data are the most frequently observed examples of such complex data. In this talk, I will focus on two data mining tasks that operate in these domains: (i) classification in (partially) labeled networks, and (ii) anomaly detection for temporal rating data. For both tasks I will present the underlying modeling principles, sketch how to derive efficient learning algorithms, and showcase their applications in different scenarios. The talk concludes with a summary of further research our group is working on.
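For task (i), a minimal stand-in (illustrative only, not the talk's actual models) is classic label propagation: unlabeled nodes repeatedly adopt the majority label among their neighbors, exploiting exactly the network dependencies that independence-based classifiers ignore.

```python
from collections import Counter

def propagate_labels(adj, seeds, iters=10):
    """Semi-supervised node classification by synchronous majority voting:
    each unlabeled node adopts the most common label among its labeled
    neighbors; seed labels stay fixed."""
    labels = dict(seeds)  # node -> label for the labeled seed nodes
    for _ in range(iters):
        updates = {}
        for node, nbrs in adj.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break  # converged
        labels.update(updates)
    return labels

# Two communities (1-2-3 and 5-6) joined by a bridge node 4.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
result = propagate_labels(adj, {1: "A", 6: "B"})
```

Labels flow outward from the seeds over a few rounds; tightly connected nodes end up agreeing with their community's seed.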
Stephan Günnemann is a Professor at the Department of Informatics, Technical University of Munich. He acquired his doctoral degree in 2012 at RWTH Aachen University in the field of computer science. From 2012 to 2015 he was an associate of Carnegie Mellon University, USA; initially as a postdoctoral fellow and later as a senior researcher. Stephan Günnemann has been a visiting researcher at Simon Fraser University, Canada, and a research scientist at the Research & Technology Center of Siemens AG. His research interests include efficient data mining and machine learning techniques for high-dimensional, temporal, and network data.
Andrei Mihnea, SAP
"Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA"
The talk briefly presents the HANA column store, then focuses on three historical versions of the snapshot isolation implementation, presenting for each what worked well and why we evolved to the next one.
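To make the underlying mechanism concrete, here is a didactic multi-version sketch of snapshot isolation (generic MVCC with first-committer-wins; it does not reflect any of HANA's three implementations): readers see the newest version committed before their snapshot, and a writer aborts if a conflicting version was committed after its snapshot.

```python
import itertools

class MVCCStore:
    """Toy multi-version key-value store providing snapshot isolation."""
    def __init__(self):
        self.clock = itertools.count(1)
        self.versions = {}  # key -> list of (commit_ts, value)

    def begin(self):
        return {"snap": next(self.clock), "writes": {}}

    def read(self, tx, key):
        if key in tx["writes"]:           # read-your-own-writes
            return tx["writes"][key]
        committed = [(ts, v) for ts, v in self.versions.get(key, [])
                     if ts <= tx["snap"]]  # only versions in the snapshot
        return max(committed)[1] if committed else None

    def write(self, tx, key, value):
        tx["writes"][key] = value

    def commit(self, tx):
        for key in tx["writes"]:          # first committer wins
            for ts, _ in self.versions.get(key, []):
                if ts > tx["snap"]:
                    raise RuntimeError("write-write conflict")
        ts = next(self.clock)
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((ts, value))

db = MVCCStore()
t0 = db.begin(); db.write(t0, "x", 1); db.commit(t0)
t1 = db.begin()                        # snapshot taken before t2 commits
t2 = db.begin(); db.write(t2, "x", 2); db.commit(t2)
# t1 still sees the old version of "x"; new transactions see the new one
```

The engineering differences between real implementations lie in where versions live, how visibility is checked, and how old versions are garbage-collected, which is what the three HANA generations vary.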
MS in Computer Science in 1988, Bucharest Polytechnic Institute,
Automatic Control and Computers engineering school (Prof. Cristian
Giumale). DEA in Machine Learning in 1990, Université Paris 6 (Prof.
Jean-Gabriel Ganascia). Joined Sybase in 1993; currently working
at SAP, which acquired Sybase in 2010.
Worked on the core engine of several RDBMSs (Sybase ASE and IQ; SAP HANA): query optimization, Abstract Plans (optimizer hints), query compilation and execution, eager-lazy aggregation, shared-disk and shared-nothing scale-out. Current focus: database stores (in-memory and on-disk, row- and column-oriented), transaction processing, data lifecycle.