Contents of this document
Talks DIMA Research Seminar
Johannes Starlinger, HU Berlin
"Annotation and Structuring of Patient Cases for Similarity Search"
Piotr Lasek, University Rzeszów, Poland
"Beginnings and Current Challenges in Interactive Data Visualization"
Stefano Ceri, Abdulrahman Kaitoua, Politecnico di Milano
"Genomic Data Management – from “sense making” of genomic datasets to an efficient implementation on the cloud"
DFKI Projektbüro Berlin, Room: Weizenbaum, Alt-Moabit 91c, 10559 Berlin
Alexander Renz-Wieland, Universität Mannheim
"Distributed frequent sequence mining with declarative subsequence constraints"
"Non-Invasive Progressive Optimization for In-Memory Databases"
Johannes Starlinger, HU Berlin
"Annotation and Structuring of Patient Cases for Similarity Search"
In the simpatix project, we investigate similarity search over
electronic health records. These records consist of mostly unstructured
or semi-structured data, such as clinical notes from examinations and
treatments, tabularized data from quantitative tests (such as blood
screenings), or discharge summaries. This data encodes an implicit
process describing the individual patient’s disease history. In
simpatix, we extract this process from EHRs, together with rich
annotations of clinically relevant entities (e.g., diagnoses,
treatments, or procedures), and investigate similarity measures for such
process-structured case representations to compare and find similar
cases. In the end, we want to deploy these measures in similarity search
over large collections of patient cases to enable use cases such as
clinical decision support. This talk gives an overview of the project,
its conceptual and technical challenges, and the data sources currently accessed.
Dr.-Ing. Dr.med.univ. Johannes Starlinger is an MD and a postdoctoral
computer scientist at the Department of Computer Science at
Humboldt-Universität zu Berlin where he works in the research group for
Knowledge Management in Bioinformatics. After studying medicine at the
Medical University of Vienna and computer science at HU Berlin, Dr.
Starlinger joined the DFG-funded graduate program SOAMED in 2010 to
research service-oriented architectures in medical applications,
receiving his PhD in 2015. His current research focus is on
knowledge mining and similarity search over data relevant to the
biomedical domain, including scientific workflows, genomic data, and
medical data. As technical coordinator in the PREDICT project, Dr.
Starlinger researches and develops data integration systems to assist
precision oncology. As PI on the simpatix project, he investigates
process-oriented analysis and similarity assessment of patient cases and
Piotr Lasek, University Rzeszów, Poland
Beginnings and Current Challenges in Interactive Data Visualization
In today’s world, interactive visualization of large data is a must.
From the very beginning, filtering, sampling, and aggregation have been the
three basic ways of dealing with large amounts of data. All of these
methods helped, and still help, to squeeze a large number of objects
into a limited number of pixels on a computer screen. Among those
three, however, aggregation seems to be the most meaningful way of
preprocessing data for visualization. Thus, aggregation,
specifically so-called inductive aggregation, became the matter of our
During the presentation, we will discuss the challenges and architectures
of early visualization systems and present our prototype visualization,
called Skydive. We will also try to explain why inductive
aggregation may be useful for data visualization (in terms of
efficiency and meaningfulness); why it is not obvious which
function should be used as the data aggregation function; and
how the paucity of graphical channels could be addressed by using
the capabilities of modern graphics cards.
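To illustrate why the choice of aggregation function is not obvious, the following minimal sketch reads "inductive aggregation" as computing each coarser level of a grid only from the aggregates of the next finer level. That reading, the 2x2 merging scheme, and all function names are assumptions made for illustration; this is not Skydive's implementation. Each cell stores a (sum, count) pair rather than a mean, because a mean of means would weight sparse and dense cells equally.

```python
def finest_level(points, size):
    """Bin (x, y, value) points into a size x size grid of (sum, count) cells."""
    grid = [[(0.0, 0) for _ in range(size)] for _ in range(size)]
    for x, y, value in points:
        total, count = grid[y][x]
        grid[y][x] = (total + value, count + 1)
    return grid


def coarser_level(grid):
    """Merge each 2x2 block into one cell, using only the finer level's aggregates."""
    half = len(grid) // 2
    coarse = []
    for r in range(half):
        row = []
        for c in range(half):
            block = [grid[2 * r + dr][2 * c + dc] for dr in (0, 1) for dc in (0, 1)]
            row.append((sum(t for t, _ in block), sum(n for _, n in block)))
        coarse.append(row)
    return coarse


def mean(cell):
    """Derive the mean from a (sum, count) cell; empty cells have no mean."""
    total, count = cell
    return total / count if count else None


# Hypothetical sample data: three points fall into one coarse cell, one into another.
points = [(0, 0, 1.0), (1, 0, 3.0), (1, 0, 5.0), (3, 3, 10.0)]
fine = finest_level(points, 4)
coarse = coarser_level(fine)
```

Here the coarse mean is derived from merged sums and counts, so it matches the mean of the underlying points. Averaging the two fine-level means directly would give a different value, which is one reason a function such as the average cannot be applied level by level without extra bookkeeping.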
Piotr Lasek is currently an Assistant Professor at the University
of Rzeszów, Poland. He obtained his PhD at the Warsaw University of
Technology in the field of data mining; his thesis was devoted to
density-based data clustering. Over the past two years, he was a
Postdoctoral Fellow with the Database Laboratory at York University,
Toronto, working on efficient data visualization methods employing the
concept of inductive aggregation. His current research interests span
both interactive data exploration through visualization and
density-based clustering with constraints.
Stefano Ceri, Abdulrahman Kaitoua, Politecnico di Milano
Genomic Data Management – from “sense making” of genomic datasets to an efficient implementation on the cloud.
In this seminar, we describe our approach to querying genomic datasets. The talk is divided into two parts: first we describe our approach and vision, and then we focus on the technology that is currently deployed.
In the first part, we define our approach to genomic data management, focusing specifically on tertiary data management, i.e., the need to integrate region-based information describing heterogeneous experimental datasets in order to support biological and clinical discoveries. In this part, we define the GenoMetric Query Language (GMQL) as a high-level algebraic language for manipulating genomic datasets consisting of regions and metadata. We also explain our plans for building an integrated repository of open data and for supporting ontological search on metadata and pattern-based search on regions, thereby moving beyond the current state of the art.
In the second part, we describe how GMQL is currently implemented on a cluster of nodes, using Spark, Flink, and SciDB as the underlying parallel data flow engines and scientific databases. We illustrate how a query is translated into a DAG representing operations over metadata and regions, and then how some of the operations are translated, for example, into Spark. We next describe the architecture of our framework GDMS (Scalable Genomic Data Management System) at Cineca (https://www.cineca.it/), which makes use of a cluster of nodes.
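The data model described above (datasets made of samples, each carrying regions and metadata) can be pictured with a small sketch. It is written in plain Python and is not GMQL syntax; the class names, the helper functions, and the half-open coordinate convention are all assumptions chosen for illustration, and the two operations only loosely mirror the idea of selecting on metadata and then relating regions to one another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # inclusive (assumed convention)
    stop: int   # exclusive (assumed convention)


@dataclass
class Sample:
    metadata: dict  # attribute-value pairs describing the experiment
    regions: list   # list of Region objects


def select_by_metadata(dataset, **conditions):
    """Keep the samples whose metadata match every given attribute-value pair."""
    return [
        sample for sample in dataset
        if all(sample.metadata.get(key) == value for key, value in conditions.items())
    ]


def overlaps(a, b):
    """True if two regions share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def count_overlaps(reference, sample):
    """For each reference region, count the sample regions that overlap it."""
    return {ref: sum(overlaps(ref, r) for r in sample.regions) for ref in reference}


# Hypothetical toy dataset with made-up metadata values.
dataset = [
    Sample({"cell": "K562"}, [Region("chr1", 100, 200), Region("chr1", 150, 300),
                              Region("chr2", 10, 50)]),
    Sample({"cell": "HeLa"}, [Region("chr1", 120, 130)]),
]
selected = select_by_metadata(dataset, cell="K562")
reference = [Region("chr1", 180, 220), Region("chr2", 50, 60)]
counts = count_overlaps(reference, selected[0])
```

Keeping metadata and regions side by side in each sample is what lets a query first narrow the dataset by experimental conditions and only then perform the more expensive region computations.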
Alexander Renz-Wieland, Universität Mannheim
"Distributed frequent sequence mining with declarative subsequence constraints"
Frequent sequence mining extracts frequently occurring patterns
from sequential data. Some algorithms allow users to specify
constraints to control which sequences are of interest. Each algorithm
usually supports a particular subset of available constraints.
Recently, a declarative approach based on regular expressions was proposed
for specifying many of these constraints in a unified way.
Scalable algorithms are essential to mine large datasets efficiently. Such algorithms exist for particular subsets of constraints. However, no scalable algorithm with support for many constraints has been put forward yet.
In this talk, I present a distributed two-stage algorithm based on item-based partitioning. It processes input sequences in parallel and constructs partitions that can be mined for frequent sequences in parallel. We use nondeterministic finite automata as an efficient representation for the intermediary sequences we send to the partitions.
The proposed algorithm outperforms naive approaches for declarative constraints, is competitive with a state-of-the-art algorithm for traditional constraints, offers linear scalability, and makes it possible to mine datasets that cannot be mined efficiently using sequential algorithms.
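The general idea of a two-stage scheme with item-based partitioning can be sketched in a few lines of single-process Python. This is a toy version under strong assumptions and not the algorithm from the talk: candidates are restricted to contiguous subsequences of bounded length, the pivot rule (largest item) is hypothetical, and plain candidate lists are sent to the partitions instead of the compact automaton representation mentioned above.

```python
from collections import Counter, defaultdict


def candidates(sequence, max_len):
    """Distinct contiguous subsequences of length 1..max_len (simplifying assumption)."""
    found = set()
    for i in range(len(sequence)):
        for n in range(1, max_len + 1):
            if i + n <= len(sequence):
                found.add(tuple(sequence[i:i + n]))
    return found


def pivot(candidate):
    """Hypothetical pivot rule: the largest item decides the partition."""
    return max(candidate)


def stage_one(sequences, max_len):
    """Process each input sequence on its own and route candidates by pivot item."""
    partitions = defaultdict(list)
    for sequence in sequences:
        for candidate in candidates(sequence, max_len):
            partitions[pivot(candidate)].append(candidate)
    return partitions


def stage_two(partition, min_support):
    """Mine one partition without looking at any other partition."""
    counts = Counter(partition)
    return {cand: n for cand, n in counts.items() if n >= min_support}


def mine(sequences, min_support, max_len):
    """Run both stages sequentially; in a cluster, each loop could run in parallel."""
    frequent = {}
    for partition in stage_one(sequences, max_len).values():
        frequent.update(stage_two(partition, min_support))
    return frequent


# Hypothetical toy input.
sequences = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
result = mine(sequences, min_support=2, max_len=2)
```

Because every candidate has exactly one pivot item, each pattern is counted in exactly one partition, which is what allows the partitions to be mined without coordination.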
Non-Invasive Progressive Optimization for In-Memory Databases
Progressive optimization introduces robustness for database workloads
against wrong estimates, skewed data, correlated attributes, or outdated
statistics. Previous work focuses on cardinality estimates and relies on
expensive counting methods as well as complex learning algorithms.
In this paper, we utilize performance counters to drive progressive
optimization during query execution. The main advantages are that
performance counters introduce virtually no cost on modern CPUs and that their
use enables non-invasive monitoring. We present fine-grained cost models
that detect differences between estimated and actual costs, which enables us to
kick-start reoptimization. Based on our cost models, we implement an
optimization approach that estimates the individual selectivities of a
multi-selection query efficiently. Furthermore, we are able to learn
properties like sortedness, skew, or correlation at run time. In our
evaluation, we show that the overhead of our approach is negligible, while
the performance improvements are convincing. Using progressive optimization, we
improve runtime by up to a factor of three compared to average runtimes and by up
to a factor of 4.5 compared to worst-case runtimes. As a result, we avoid
costly operator execution orders and thus make query execution highly
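Why the execution order of a multi-selection query matters, and when reoptimization pays off, can be shown with a minimal sketch. It does not read performance counters and does not reproduce the cost models of this work: selectivities are passed in as plain numbers, predicates are assumed to be independent, and the function names and the 10% threshold are hypothetical. The ordering rule, sorting by (selectivity - 1) / cost, is the standard rank criterion for independent selection predicates.

```python
def expected_cost(order, selectivity, cost):
    """Expected per-tuple cost of evaluating the predicates in the given order."""
    total, surviving = 0.0, 1.0
    for predicate in order:
        total += surviving * cost[predicate]  # only surviving tuples are evaluated
        surviving *= selectivity[predicate]
    return total


def best_order(selectivity, cost):
    """Order predicates by ascending rank (selectivity - 1) / cost."""
    return sorted(selectivity, key=lambda p: (selectivity[p] - 1.0) / cost[p])


def should_reoptimize(current_order, observed, cost, threshold=1.1):
    """Reorder only if the current plan is clearly worse under observed selectivities."""
    current = expected_cost(current_order, observed, cost)
    best = expected_cost(best_order(observed, cost), observed, cost)
    return current > threshold * best


# Hypothetical numbers: the estimates turn out to be the reverse of reality.
cost = {"a": 1.0, "b": 1.0, "c": 1.0}
estimated = {"a": 0.1, "b": 0.5, "c": 0.9}
observed = {"a": 0.9, "b": 0.5, "c": 0.1}

plan = best_order(estimated, cost)
reorder = should_reoptimize(plan, observed, cost)
new_plan = best_order(observed, cost)
```

In this toy setting, the plan chosen from the wrong estimates evaluates the least selective predicate first, so switching to the order derived from observed selectivities roughly halves the expected cost per tuple.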