DIMA Colloquium Schedule
Date/Location | Speaker/Topic |
---|---|
22.05.2017, 16:00, EN 719 | Piotr Lasek, University of Rzeszów, Poland, "Beginnings and Current Challenges in Interactive Data Visualization" |
08.05.2017, 16:00, EN 719 | Alberto Lerner (NYU), "Riding the New Hardware Wave - Opportunities and Challenges for Peak Database Performance" |
03.04.2017, 16:00, EN 719 | Steffen Zeuch, "Non-Invasive Progressive Optimization for In-Memory Databases" |
31.03.2017, 11:00, EN 719 | James Clarkson, University of Manchester, UK, "Tornado: Practical Heterogeneous Programming in Java" |
27.03.2017, 16:00, EN 719 | Frank McSherry, "Monitoring motifs in graph streams" |
13.03.2017, 15:00, EN 719 | Guy Lohman, IBM Almaden Research Center (Retired), "Wildfire: Evolving Databases for New-Generation Big Data Applications" |
27.02.2017, 16:00, EN 719 | Till Rohrmann, Data Artisans, "Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems" |
08.02.2017, 14:00, MA 041 | Andrei Mihnea, SAP, "Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA" |
28.11.2016, 14:00, DFKI Projektbüro Berlin, 4th floor, Room Weizenbaum, Alt-Moabit 91 C, 10559 Berlin | Prof. Dr. Stephan Günnemann, Technical University of Munich, "Beyond Independence: Efficient Learning Techniques for Networks and Temporal Data" |
14.11.2016, 16:00, MA 042 | John Wilkes, Principal Software Engineer, Technical Infrastructure, Google, "Large-scale data analysis at cloud scale" |
31.10.2016, 16:00, DIMA, EN 719 | Prof. Rainer Gemulla, Universität Mannheim, "New Directions for Data Mining on Sequences and Matrices" |
Piotr Lasek, University of Rzeszów, Poland
Title:
Beginnings and Current Challenges in Interactive Data Visualization
Abstract:
In today's world, interactive visualization of large data is a must. Since the very beginning, filtering, sampling, and aggregation have been the three basic ways of dealing with large amounts of data. All of these methods helped, and still help, to squeeze a large number of objects into the limited number of pixels on a computer screen. Among the three, however, aggregation seems to be the most meaningful way of preprocessing data for visualization. Thus aggregation, specifically so-called inductive aggregation, became the subject of our research.

During the presentation we will discuss the challenges and architectures of early visualization systems and present our prototype visualization system called Skydive. We will also try to explain why inductive aggregation may be useful for data visualization (in terms of efficiency and meaningfulness); why it is not obvious which aggregation function should be used as a data aggregation function; and how the paucity of graphical channels can be addressed using the capabilities of modern graphics cards.
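The core trick of aggregation-based visualization is easy to show in code. The sketch below is a toy illustration, not Skydive's implementation (all names are invented here): it replaces 100,000 points with one count per pixel, then derives a coarser level from the finer one, which is the essence of the inductive style of aggregation described above.

```java
import java.util.Arrays;

/** Toy sketch: aggregate many points into a small pixel grid, then coarsen. */
public class GridAggregation {

    public static void main(String[] args) {
        int width = 8, height = 8;                  // the "screen" resolution
        double[][] points = new double[100_000][2]; // synthetic data in [0,1)^2
        for (double[] p : points) {
            p[0] = Math.random();
            p[1] = Math.random();
        }

        // One aggregate (here: a count) per pixel, instead of one mark per point.
        long[][] counts = new long[height][width];
        for (double[] p : points) {
            int x = Math.min((int) (p[0] * width), width - 1);
            int y = Math.min((int) (p[1] * height), height - 1);
            counts[y][x]++;
        }

        // The "inductive" idea: aggregates for a coarser level are computed from
        // the finer level rather than by rescanning the raw points. This works
        // directly for counts and sums; averages need (sum, count) pairs.
        long[][] coarse = new long[height / 2][width / 2];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                coarse[y / 2][x / 2] += counts[y][x];

        for (long[] row : coarse) System.out.println(Arrays.toString(row));
    }
}
```

Note that the coarsening step also hints at why the choice of aggregation function matters: only some functions compose cleanly from one level to the next.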
Bio:
Piotr Lasek is currently an Assistant Professor at the University of Rzeszów, Poland. He obtained his PhD at the Warsaw University of Technology in the field of data mining; his thesis was devoted to density-based data clustering. Over the past two years he was a Postdoctoral Fellow with the Database Laboratory at York University, Toronto, working on efficient data visualization methods employing the concept of inductive aggregation. His current research interests span both interactive data exploration through visualization and density-based clustering with constraints.
Alberto Lerner (NYU)
Title:
Riding the New Hardware Wave - Opportunities and Challenges for Peak Database Performance
Abstract:
We are living in interesting times hardware-wise. CPUs, which are already made of a mix of general and specialized components, will soon have a reconfigurable portion as well; volatile memory won't necessarily be so for much longer; flash memory, which has been hidden (and slowed down) by layers of block-device-compatibility logic, is being addressed in increasingly direct ways; networks, which had a 10x boost from 1 to 10 Gb not too long ago, have gotten another 10x boost to 100 Gb, now challenging internal buses in terms of speed; and there are now new ways to lay interconnecting fabrics that all but blur the notion of where one computer ends and the next one starts.(*)
Seldom has the industry seen so many technologies reach their commercial debut at the same time.
And that means different tradeoffs and challenges for a systems researcher or practitioner interested in databases. The way each piece of data is organized while in its resting state (if it is allowed to lie still at all) and the trajectory it follows from there until it reaches a query result set can now be quite different from what they used to be. We illustrate this by discussing two distinct use cases from classic database systems design: how to support fast (as in networking-speeds fast) journalling, and how to provide an elastic, distributed data structure on which a query execution engine could be based.
(*) Intel Xeon/Altera chip, Intel/Micron Optane, CNexLabs host-based SSD FTL, Mellanox ConnectX-5, and NVMoF/NTB PCIe/OpenCAPI, respectively.
Bio:
Alberto Lerner is a consultant based in New York City, home of two very data-hungry industries: finance and advertisement. He helps teams in those areas build proprietary data pipelines and storage systems. Before that, he was part of the teams behind a few different database engines: IBM's DB2, working on robustness aspects of the query optimizer; Google's Bigtable, on elasticity aspects; and MongoDB, on general architecture. Alberto is formally trained in Computer Science and received his Ph.D. from ENST - Paris (now ParisTech), having done his research work at INRIA/Rocquencourt and NYU.
Frank McSherry
Title:
Monitoring motifs in graph streams
Abstract:
Imagine you are in charge of a high-volume stream of social interactions, and you would like to watch for certain graph structures in the stream of interactions. For example, Twitter recommends "who to follow" by looking for accounts followed by at least two accounts you follow, a structure which can be described by a four-node graph. There can be a substantial number of these motifs in a graph, but what if you only need to observe the changes as they happen, rather than enumerate all instances at once?
This work is an instance of the more general problem of maintaining cyclic (non-treelike) joins as the underlying relations change. We will first summarize recent work in worst-case optimal join evaluation (Ngo et al., 2014), which shows how to evaluate cyclic joins using resources asymptotically bounded by the largest possible output of the join (a property standard trees of binary joins cannot satisfy). We then build up a dataflow version of this algorithm, extend it to efficiently respond to changes in its inputs, and describe and evaluate its implementation in timely dataflow*.

This project is joint work with Khaled Ammar and Semih Salihoglu, both of the University of Waterloo.
*: Timely dataflow's cycles are not required for this implementation, so it could also be suitable for other, less crazy streaming systems.
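To make the incremental flavor concrete, consider the simplest motif, a triangle. The sketch below is my own illustration, not the authors' timely-dataflow implementation: it maintains a running triangle count under edge insertions, exploiting the fact that the triangles created by a new edge (u, v) are exactly the common neighbors of u and v, so changes can be reported without re-enumerating all motifs.

```java
import java.util.*;

/** Minimal sketch: maintain the triangle count of a growing undirected graph.
 *  Assumes each undirected edge arrives exactly once. */
public class StreamingTriangles {
    private final Map<Integer, Set<Integer>> adj = new HashMap<>();
    private long triangles = 0;

    /** Process one edge insertion and return the running triangle count. */
    public long insertEdge(int u, int v) {
        Set<Integer> nu = adj.computeIfAbsent(u, k -> new HashSet<>());
        Set<Integer> nv = adj.computeIfAbsent(v, k -> new HashSet<>());
        // New triangles closed by (u, v) = common neighbors of u and v.
        Set<Integer> smaller = nu.size() <= nv.size() ? nu : nv;
        Set<Integer> larger  = smaller == nu ? nv : nu;
        for (int w : smaller) if (larger.contains(w)) triangles++;
        nu.add(v);
        nv.add(u);
        return triangles;
    }

    public static void main(String[] args) {
        StreamingTriangles g = new StreamingTriangles();
        int[][] stream = {{1, 2}, {2, 3}, {1, 3}, {3, 4}, {2, 4}};
        for (int[] e : stream)
            System.out.println("edge " + Arrays.toString(e) + " -> " + g.insertEdge(e[0], e[1]));
    }
}
```

A four-node motif like the "who to follow" pattern requires maintaining a cyclic multi-way join rather than a single intersection, which is where the worst-case optimal machinery above comes in.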
Bio:
Frank McSherry is an independent scientist working on scalable computation. His recent work focuses on data-parallel dataflow computation, in particular elevating its expressive power. He currently develops and maintains several related projects (https://github.com/frankmcsherry) and writes a somewhat sassy blog (https://github.com/frankmcsherry/blog). Independently, he has also done foundational work on differential privacy, and continues to maintain an active presence in this community.
James Clarkson, University of Manchester, UK
Title:
"Tornado: Practical Heterogeneous Programming in Java"
Abstract:
As the popularity of "big data" frameworks grows, a lot of effort is currently being exerted to improve the performance of JVM (Java Virtual Machine) based languages, such as Java and Scala. One way of doing this is to develop mechanisms that allow these languages to make use of hardware accelerators, such as GPGPUs. As a result, there have been a number of projects, such as Project Sumatra (OpenJDK) [4], Rootbeer [5], and APARAPI (AMD) [6], that have attempted to support programming GPGPUs from Java. However, much of this prior art focuses only on accelerating simple workloads or on providing an interface into another programming language, making it difficult to use for creating real-world applications. In this talk I will discuss how we have developed a framework that moves beyond the prior art and allows developers to accelerate complex Java applications.
Our Java-based framework, Tornado, provides developers with a simple task-based programming model that allows tasks to be assigned to devices. Typically, tasks are assigned to execute on a diverse set of hardware resources such as GPGPUs, FPGAs, and Xeon Phi. Moreover, the design of Tornado allows those assignments to be changed dynamically, meaning that applications are not artificially restricted to using a specific class of device. Additionally, the Tornado API has been designed to avoid the need to re-engineer applications to utilise the framework. To achieve that, we added support for a wider range of language features than the prior art: exceptions, inheritance, and objects, to name a few. Finally, we will share our experiences porting a complex Computer Vision application to pure Java and accelerating it.
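To give a feel for what a task-based model of this kind looks like, here is a hypothetical API sketch; it is invented for this summary and is not Tornado's actual interface. The point it illustrates is that the device assignment is data, not code, so it can change dynamically without re-engineering the application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical task-based API in the spirit of the talk; NOT Tornado's real interface. */
public class TaskGraphSketch {

    enum Device { CPU, GPGPU, FPGA }

    /** A task pairs a kernel (plain Java code) with a mutable device assignment. */
    static final class Task {
        final String name;
        final Runnable kernel;
        Device device = Device.CPU;   // can be re-assigned at any time

        Task(String name, Runnable kernel) { this.name = name; this.kernel = kernel; }
    }

    private final List<Task> tasks = new ArrayList<>();

    TaskGraphSketch add(String name, Runnable kernel) {
        tasks.add(new Task(name, kernel));
        return this;
    }

    TaskGraphSketch mapTo(String name, Device device) {
        for (Task t : tasks) if (t.name.equals(name)) t.device = device;
        return this;
    }

    void execute() {
        // A real runtime would JIT-compile each kernel for its assigned device;
        // this sketch only simulates the dispatch decision.
        for (Task t : tasks) {
            System.out.println("running " + t.name + " on " + t.device);
            t.kernel.run();
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3}, b = {4, 5, 6}, c = new float[3];
        TaskGraphSketch graph = new TaskGraphSketch()
            .add("vectorAdd", () -> { for (int i = 0; i < c.length; i++) c[i] = a[i] + b[i]; })
            .mapTo("vectorAdd", Device.GPGPU);  // the assignment is data, not code
        graph.execute();
        System.out.println(Arrays.toString(c));
    }
}
```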
Bio:
James Clarkson is a 3rd-year PhD student at the University of Manchester in the UK. He is a member of the Advanced Processor Technologies (APT) group, working under the supervision of Mikel Lujan and Christos Kotselidis. His research interests are programming languages and programming exotic hardware architectures (in Java!). He is actively contributing to the EPSRC-funded AnyScale [1] and PAMELA [2] projects, and has previously contributed to the EU-funded Mont Blanc project [3].
[1] AnyScale project - http://anyscale.org
[2] PAMELA project - http://apt.cs.manchester.ac.uk/projects/PAMELA/
[3] Mont Blanc project - https://www.montblanc-project.eu
[4] Project Sumatra - http://openjdk.java.net/projects/sumatra/
[5] Rootbeer - https://github.com/pcpratts/rootbeer1
[6] APARAPI - https://code.google.com/archive/p/aparapi/
Steffen Zeuch
Title:
Non-Invasive Progressive Optimization for In-Memory Databases
Abstract:
Progressive optimization introduces robustness for database workloads against wrong estimates, skewed data, correlated attributes, or outdated statistics. Previous work focuses on cardinality estimates and relies on expensive counting methods as well as complex learning algorithms. In this paper, we utilize performance counters to drive progressive optimization during query execution. The main advantages are that performance counters introduce virtually no costs on modern CPUs and that their usage enables non-invasive monitoring. We present fine-grained cost models to detect differences between estimates and actual costs, which enables us to kick-start reoptimization. Based on our cost models, we implement an optimization approach that efficiently estimates the individual selectivities of a multi-selection query. Furthermore, we are able to learn properties like sortedness, skew, or correlation at run time. In our evaluation we show that the overhead of our approach is negligible, while the performance improvements are convincing. Using progressive optimization, we improve run time by up to a factor of three compared to average run times and by up to a factor of 4.5 compared to worst-case run times. As a result, we avoid costly operator execution orders and thus make query execution highly robust.
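As a hedged illustration of the general idea (not the paper's performance-counter mechanism), the sketch below monitors the observed selectivity of each predicate while a multi-selection query runs and periodically reorders the predicates so that the most selective one is applied first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch: reorder a conjunction of selections using run-time observed selectivities. */
public class ProgressiveSelection {

    /** A predicate that tracks how many tuples it has seen and passed. */
    static final class Filter {
        final String name;
        final IntPredicate pred;
        long seen = 0, passed = 0;

        Filter(String name, IntPredicate pred) { this.name = name; this.pred = pred; }

        boolean test(int x) {
            seen++;
            if (pred.test(x)) { passed++; return true; }
            return false;
        }

        double selectivity() { return seen == 0 ? 1.0 : (double) passed / seen; }
    }

    public static void main(String[] args) {
        List<Filter> filters = new ArrayList<>(List.of(
            new Filter("x % 2 == 0", x -> x % 2 == 0),
            new Filter("x < 100",    x -> x < 100)));  // far more selective on this data

        int reoptimizeEvery = 1_000;
        long qualified = 0;
        for (int x = 0; x < 100_000; x++) {
            boolean ok = true;
            for (Filter f : filters) if (!(ok = f.test(x))) break;
            if (ok) qualified++;
            // Progressive step: put the most selective filter first. A fuller
            // implementation would correct for later filters only seeing tuples
            // that survived earlier ones (the paper estimates individual selectivities).
            if ((x + 1) % reoptimizeEvery == 0)
                filters.sort(Comparator.comparingDouble(Filter::selectivity));
        }
        System.out.println("qualified: " + qualified);
        for (Filter f : filters)
            System.out.printf("%s -> observed selectivity %.4f%n", f.name, f.selectivity());
    }
}
```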
Guy Lohman, IBM Almaden Research Center (Retired)
Title:
Wildfire: Evolving Databases for New-Generation Big Data Applications
Abstract:
The rising popularity of large-scale, real-time analytics applications (such as real-time inventory and pricing, mobile applications that give you suggestions, fraud detection, risk analysis, etc.) emphasizes the need for scalable data management systems that can handle both fast transactions and analytics concurrently. However, efficient processing of transactional and analytical requests requires very different optimizations and architectural decisions in a system.
This talk presents the Wildfire system, which targets Hybrid Transactional and Analytical Processing (HTAP). Wildfire leverages the Spark ecosystem to enable large-scale data processing with different types of complex analytical requests, and columnar data processing to facilitate fast transactions as well as analytics concurrently.
Bio:
Dr. Guy M. Lohman recently retired from IBM's Almaden Research Center in San Jose, California, where he worked for over 34 years as a Distinguished Research Staff Member and Manager. His group contributed BLU Acceleration to DB2 for Linux, UNIX, and Windows (LUW) 10.5 (2013), as well as the query engine of the IBM Smart Analytics Optimizer for DB2 for z/OS V1.1 (2010) and the Informix Warehouse Accelerator (2011) products. He was the architect of the query optimizer of DB2 LUW and was responsible for its development from 1992 to 1997 (versions 2-5), as well as its Visual Explain, efficient sampling, and Index Advisor. Dr. Lohman was elected to the IBM Academy of Technology in 2002 and named an IBM Master Inventor in 2011. He was the General Co-Chair (with Prof. Sang Cha) of the 2015 IEEE ICDE Conference and General Chair of the 2013 ACM Symposium on Cloud Computing. He has been awarded 40 U.S. patents and is the (co-)author of over 80 technical papers in the refereed academic literature.
Till Rohrmann, Data Artisans
Title:
"Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems"
Abstract:
Bio:
Prof. Rainer Gemulla, Universität Mannheim
Title:
New Directions for Data Mining on Sequences and Matrices
Abstract:
In this talk, I will summarize our current research at the Data Analytics chair, University of Mannheim. After a general overview, I'll talk in more detail about two specific directions: declarative sequential pattern mining on the one hand and matrix factorization on the other. I'll briefly summarize these two directions below.
Sequential pattern mining is a fundamental task in data mining. Given a database of sequences (e.g., customer transactions, event logs, or natural-language sentences), the goal of sequential pattern mining is to detect relevant and interesting patterns in the data. The stated goal of our research is to make sequential pattern mining usable, useful, and efficient. I will introduce Desq, a general-purpose system for declarative pattern mining. Desq allows data scientists to specify what they want, but abstracts away algorithmic and implementation details. Desq unifies many of the existing variants of sequential pattern mining (including length, gap, span, regular-expression, and hierarchy constraints) and additionally goes beyond what was possible before. I will describe how Desq improves usability and how mining can be performed efficiently and scalably, and I will outline directions for maximizing the usefulness of the found patterns.
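Desq's actual pattern language is richer than anything that fits here, but the flavor of constraint-based sequence mining can be sketched. The toy program below (invented example data, not Desq code) counts ordered item pairs that occur within a bounded gap in at least a minimum number of sequences.

```java
import java.util.*;

/** Sketch: mine frequent item pairs (a, b) where b follows a within a gap of at most G. */
public class GapConstrainedPairs {

    public static void main(String[] args) {
        List<String[]> db = List.of(
            new String[]{"a", "x", "b", "c"},
            new String[]{"a", "b", "y", "c"},
            new String[]{"b", "a", "c"});
        int maxGap = 1;      // at most one item between a and b
        int minSupport = 2;  // the pattern must occur in >= 2 sequences

        Map<String, Integer> support = new HashMap<>();
        for (String[] seq : db) {
            Set<String> found = new HashSet<>();   // count each pattern once per sequence
            for (int i = 0; i < seq.length; i++)
                for (int j = i + 1; j <= i + 1 + maxGap && j < seq.length; j++)
                    found.add(seq[i] + " -> " + seq[j]);
            for (String p : found) support.merge(p, 1, Integer::sum);
        }
        // Prints "a -> b : 2" and "b -> c : 3" for the data above.
        support.forEach((p, s) -> { if (s >= minSupport) System.out.println(p + " : " + s); });
    }
}
```

A declarative system like Desq lets the gap, length, and regular-expression constraints be stated as a specification rather than hard-coded into nested loops like these.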
Matrix factorization methods have been called the Swiss Army knife of data mining. In general, matrix factorization methods represent each row (e.g., objects) and each column (e.g., attributes) of a data matrix with latent feature vectors (or "embeddings"). They are an effective tool for tasks such as denoising, compression, imputation of missing data, clustering, link prediction, and more. In this talk, I'll focus on our recent work on factorizing large Boolean matrices. Here we often have to choose between expensive combinatorial methods that retain the discrete nature of the data and continuous methods that can be more efficient but destroy the discrete structure. I present an alternative approach that first computes a continuous factorization and subsequently applies a rounding procedure to obtain a discrete representation. I discuss our current answers to questions such as what can be gained by rounding, whether this approach achieves lower reconstruction errors, and how hard it is to obtain a good factorization. A key concept for approaching these questions is the notion of the "rounding rank" of a binary matrix, which has relationships to linear classification, dimensionality reduction, and nested matrices.
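The factorize-then-round step is simple to state in code. In the sketch below, the real-valued factors are random stand-ins (in the actual approach they would come from a continuous factorization of the Boolean data matrix); rounding the product against a threshold tau yields the discrete reconstruction, and, roughly, the smallest rank at which such a rounding can reproduce a matrix exactly is its rounding rank.

```java
import java.util.Random;

/** Sketch: factorize-then-round for Boolean matrices. We take real-valued factors
 *  L (n x k) and R (k x m) and round their product against a threshold. */
public class RoundingSketch {

    public static void main(String[] args) {
        int n = 4, m = 5, k = 2;
        Random rnd = new Random(42);
        double[][] L = randomMatrix(n, k, rnd), R = randomMatrix(k, m, rnd);

        double tau = 0.5;                       // rounding threshold
        int[][] B = new int[n][m];              // discrete reconstruction
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                double dot = 0;
                for (int t = 0; t < k; t++) dot += L[i][t] * R[t][j];
                B[i][j] = dot >= tau ? 1 : 0;   // continuous value -> Boolean entry
            }

        for (int[] row : B) System.out.println(java.util.Arrays.toString(row));
    }

    static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] a = new double[rows][cols];
        for (double[] row : a) for (int j = 0; j < cols; j++) row[j] = rnd.nextDouble();
        return a;
    }
}
```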
Bio:
http://dws.informatik.uni-mannheim.de/en/people/professors/prof-dr-rainer-gemulla/#c15117
John Wilkes, Principal Software Engineer, Technical Infrastructure, Google
Title:
Large-scale data analysis at cloud scale
Abstract:
Google has been tackling large-scale big data problems for more than 15 years. Experience with the systems we built to do so has led us to develop a new set of tools for large-scale analysis and queries, including streaming. I'll provide an overview of some of the systems we've built and are now making available for others to build on.
Bio:
John Wilkes has been at Google since 2008, where he is working on cluster management for Google's compute infrastructure; he was one of the architects of Omega. He is interested in far too many aspects of distributed systems, but a recurring theme has been technologies that allow systems to manage themselves. He received a PhD in computer science from the University of Cambridge, joined HP Labs in 1982, and was elected an HP Fellow and an ACM Fellow in 2002 for his work on storage system design. Along the way, he's been program committee chair for SOSP, FAST, EuroSys, and HotCloud, and has served on the steering committees for EuroSys, FAST, SoCC, and HotCloud. He's listed as an inventor on 40+ US patents and has an adjunct faculty appointment at Carnegie Mellon University. In his spare time he continues, stubbornly, trying to learn how to blow glass.
Prof. Dr. Stephan Günnemann, Technical University of Munich
Title:
Beyond Independence: Efficient Learning Techniques for Networks and Temporal Data
Abstract:
Going beyond independence, most of the data gathered in today's applications show complex dependency structures: people, for example, interact with each other in social networks; similarly, sensors in a cyber-physical system continuously measure dependent signals over time. In general, networks and temporal data are the most frequently observed examples of such complex data. In this talk, I will focus on two data mining tasks that operate in these domains: (i) classification in (partially) labeled networks, and (ii) anomaly detection for temporal rating data. For both tasks I will present the underlying modeling principles, sketch how efficient learning algorithms can be derived, and showcase their applications in different scenarios. The talk concludes with a summary of further research our group is working on.
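For the first task, a minimal and deliberately generic baseline is plain label propagation: unlabeled nodes repeatedly adopt the majority label of their labeled neighbors. The sketch below is only meant to fix ideas about learning in partially labeled networks; it is not one of the models from the talk.

```java
import java.util.*;

/** Sketch: majority-vote label propagation in a partially labeled network. */
public class LabelPropagation {

    public static void main(String[] args) {
        int n = 6;                                   // nodes 0..5
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 3}};
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

        int[] labels = {0, -1, -1, -1, -1, 1};       // -1 = unlabeled; 0 and 5 are seeds
        boolean[] seed = new boolean[n];
        for (int v = 0; v < n; v++) seed[v] = labels[v] != -1;

        for (int iter = 0; iter < 10; iter++) {
            int[] next = labels.clone();
            for (int v = 0; v < n; v++) {
                if (seed[v]) continue;               // observed labels stay fixed
                int[] votes = new int[2];
                for (int u : adj.get(v)) if (labels[u] != -1) votes[labels[u]]++;
                if (votes[0] + votes[1] > 0)         // majority vote (ties -> label 0)
                    next[v] = votes[1] > votes[0] ? 1 : 0;
            }
            labels = next;
        }
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```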
Bio:
Stephan Günnemann is a Professor at the Department of Informatics, Technical University of Munich. He acquired his doctoral degree in 2012 at RWTH Aachen University in the field of computer science. From 2012 to 2015 he was an associate of Carnegie Mellon University, USA; initially as a postdoctoral fellow and later as a senior researcher. Stephan Günnemann has been a visiting researcher at Simon Fraser University, Canada, and a research scientist at the Research & Technology Center of Siemens AG. His research interests include efficient data mining and machine learning techniques for high-dimensional, temporal, and network data.
Andrei Mihnea, SAP
Title:
"Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA"
Abstract:
The talk quickly presents the HANA column store, then focuses on three historical versions of the snapshot isolation implementation, presenting for each what worked well and why we evolved to the next one.
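The abstract does not spell out the three designs, but the invariant every snapshot isolation implementation must provide can be sketched generically: each row version carries creation and deletion timestamps, and a transaction's snapshot sees exactly the versions committed before it started and not yet superseded. The code below is an illustrative toy, not one of HANA's implementations.

```java
import java.util.*;

/** Sketch: a generic MVCC visibility rule behind snapshot isolation
 *  (illustrative only; not one of HANA's three implementations). */
public class SnapshotVisibility {

    /** One version of a row, stamped with creator/deleter commit timestamps. */
    static final class Version {
        final String value;
        final long createdTs;             // commit timestamp of the writing transaction
        long deletedTs = Long.MAX_VALUE;  // commit timestamp of the superseding write

        Version(String value, long createdTs) { this.value = value; this.createdTs = createdTs; }
    }

    /** A snapshot sees a version iff it was created at or before, and not deleted
     *  at or before, the snapshot timestamp. */
    static boolean visible(Version v, long snapshotTs) {
        return v.createdTs <= snapshotTs && snapshotTs < v.deletedTs;
    }

    public static void main(String[] args) {
        List<Version> chain = new ArrayList<>();
        Version v1 = new Version("A", 10);
        chain.add(v1);
        v1.deletedTs = 20;                // overwritten at commit timestamp 20
        chain.add(new Version("B", 20));

        for (long snapshotTs : new long[]{5, 15, 25})
            chain.stream().filter(v -> visible(v, snapshotTs)).findFirst()
                 .ifPresentOrElse(
                     v -> System.out.println("snapshot ts " + snapshotTs + " sees " + v.value),
                     () -> System.out.println("snapshot ts " + snapshotTs + " sees nothing"));
    }
}
```

The engineering differences between implementations typically lie in where these timestamps live and how garbage versions are reclaimed, which is presumably where the "Darwinian evolution" of the title comes in.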
Bio:
MS in computer science in 1988, Bucharest Polytechnic Institute, Automatic Control and Computers engineering school (Prof. Cristian Giumale). DEA in Machine Learning in 1990, Université Paris 6 (Prof. Jean-Gabriel Ganascia). Joined Sybase in 1993; currently working at SAP, which acquired Sybase in 2010.
Worked on the core engine of several RDBMSs (Sybase ASE and IQ, SAP HANA): query optimization, Abstract Plans (optimizer hints), query compilation and execution, eager-lazy aggregation, and shared-disk and shared-nothing scale-out. Current focus: database stores (in-memory and on-disk, row- and column-oriented), transaction processing, and data lifecycle.