Talks DIMA Research Seminar
Talk/Location | Lecturer/Subject |
---|---|
29.05.2017 4pm EN719 | Johannes Starlinger, HU Berlin "Annotation and Structuring of Patient Cases for Similarity Search" |
22.05.2017 4pm EN719 | Piotr Lasek, University Rzeszów, Poland "Beginnings and Current Challenges in Interactive Data Visualization" |
11.05.2017 4pm EN719 | Stefano Ceri, Abdulrahman Kaitoua, Politecnico di Milano "Genomic Data Management – from “sense making” of genomic datasets to an efficient implementation on the cloud" |
26.04.2017 4pm DFKI Projektbüro Berlin, Room: Weizenbaum, Alt-Moabit 91c, 10559 Berlin | Alexander Renz-Wieland, Universität Mannheim "Distributed frequent sequence mining with declarative subsequence constraints" |
03.04.2017 4pm EN719 | Steffen Zeuch "Non-Invasive Progressive Optimization for In-Memory Databases" |
Johannes Starlinger, HU Berlin
Title:
"Annotation and Structuring of Patient Cases for Similarity Search"
Abstract:
In the simpatix project, we investigate similarity search over electronic
health records (EHRs). These records consist of mostly unstructured or
semi-structured data, such as clinical notes from examinations and
treatments, tabularized data from quantitative tests (such as blood
screenings), or discharge summaries. This data encodes an implicit
process describing the individual patient’s disease history.
In simpatix, we extract this process from EHRs, together with rich
annotations of clinically relevant entities (e.g., diagnoses, treatments,
or procedures), and investigate similarity measures for such
process-structured case representations to compare and find similar
cases. Ultimately, we want to deploy these measures in similarity search
over large collections of patient cases to enable use cases such as
clinical decision support. This talk gives an overview of the project,
its conceptual and technical challenges, and the data sources currently
accessed.
Bio:
Dr.-Ing. Dr.med.univ. Johannes Starlinger is an MD and a postdoctoral
computer scientist at the Department of Computer Science at
Humboldt-Universität zu Berlin where he works in the research group
for Knowledge Management in Bioinformatics. After studying medicine at
Medical University of Vienna and computer science at HU-Berlin, Dr.
Starlinger joined the DFG-funded graduate program SOAMED in 2010 to
research service-oriented architectures in a medical area of
application, receiving his PhD in 2015. His current research focus is on
knowledge mining and similarity search over data relevant to the
biomedical domain, including scientific workflows, genomic data, and
medical data. As technical coordinator in the PREDICT project, Dr.
Starlinger researches and develops data integration systems to assist
precision oncology. As PI on the simpatix project, he investigates
process-oriented analysis and similarity assessment of patient cases and
disease histories.
Piotr Lasek, University Rzeszów, Poland
Title:
Beginnings and Current Challenges in Interactive Data Visualization
Abstract:
In today’s world, interactive visualization of large data is a must.
Since the very beginning, filtering, sampling, and aggregation have been
the three basic ways of dealing with large amounts of data. All of these
methods helped, and still help, to squeeze a large number of objects
into a limited number of pixels on a computer’s screen. Among the three,
however, aggregation seems to be the most meaningful way of
preprocessing data for visualization. Thus aggregation, specifically
so-called inductive aggregation, became the subject of our research.
During the presentation we will discuss the challenges and architectures
of early visualization systems and present our visualization prototype,
called Skydive. We will also try to explain why inductive aggregation
may be useful for data visualization (in terms of efficiency and
meaningfulness); why it is not obvious which function should be used as
the data aggregation function; and how the paucity of graphical channels
could be addressed by using the capabilities of modern graphics cards.
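For readers new to the topic, the idea of squeezing many objects into few pixels can be sketched as below. This is only a minimal illustration of plain grid-based count aggregation combined with viewport filtering; it is not Skydive and not the inductive aggregation discussed in the talk. The function name `aggregate_to_grid` and its parameters are hypothetical.

```python
from collections import Counter


def aggregate_to_grid(points, width, height, x_range, y_range):
    """Count how many points fall into each cell of a width x height grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # filtering: drop points outside the viewport
        # Map each coordinate to a cell index; the upper edge is clamped
        # into the last cell so that x == x_max stays inside the grid.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        counts[(col, row)] += 1
    return counts


# Hypothetical data: five points, one of them outside the viewport.
points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (1.0, 1.0), (2.0, 2.0)]
grid = aggregate_to_grid(points, width=2, height=2,
                         x_range=(0.0, 1.0), y_range=(0.0, 1.0))
```

Each grid cell stands in for one pixel, so the amount of data sent to the screen depends on the resolution rather than on the number of objects.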
Bio:
Piotr Lasek is currently an Assistant Professor at the University
Rzeszów, Poland. He obtained his PhD at the Warsaw University of
Technology in the field of data mining; his thesis was devoted to
density-based data clustering. Over the past 2 years he was a
Postdoctoral Fellow with the Database Laboratory at York University,
Toronto, working on efficient data visualization methods employing the
concept of inductive aggregation. His current research interests span
both interactive data exploration through visualization and
density-based clustering with constraints.
Stefano Ceri, Abdulrahman Kaitoua, Politecnico di Milano
Title:
Genomic Data Management – from “sense making” of genomic datasets to an
efficient implementation on the cloud.
Abstract:
In this seminar, we describe our approach to querying genomic datasets.
The talk is divided into two parts: first we describe our approach and
vision, and then we focus on the technology that is currently deployed.
In the first part, we define our approach to genomic data management;
specifically, we focus on tertiary data management, i.e. the need to
integrate region-based information describing heterogeneous
experimental datasets in order to support biological and clinical
discoveries. In this part we define GenoMetric Query Language (GMQL)
as a high-level algebraic language for manipulating genomic datasets
consisting of regions and metadata. We also explain our plans for
building an integrated repository of open data and for supporting
ontological search on metadata and pattern-based search on regions,
thereby moving beyond the current state of the art.
In the second part, we describe how GMQL is currently implemented on a
cluster of nodes and uses Spark, Flink, and SciDB as underlying parallel
data flow engines and scientific databases. We illustrate how a query is
translated to a DAG representing operations over metadata and regions,
and then how some of the operations are translated into Spark (for
example). We next describe the architecture of our framework GDMS
(Scalable Genomic Data Management System) at Cineca
(https://www.cineca.it/), which makes use of a cluster of nodes.
Alexander Renz-Wieland, Universität Mannheim
Title: "Distributed frequent sequence mining with declarative subsequence constraints"
Abstract:
Frequent sequence mining extracts frequently occurring patterns
from sequential data. Some algorithms allow users to specify
constraints to control which sequences are of interest. Each algorithm
usually supports a particular subset of available constraints.
Recently, based on regular expressions, a declarative approach to
specifying many of these constraints in a unified way was proposed.
Scalable algorithms are essential to mine large datasets
efficiently. Such algorithms exist for particular subsets of
constraints. However, no scalable algorithm with support for many
constraints has been put forward yet.
In this talk, I present a
distributed two-stage algorithm based on item-based partitioning. It
processes input sequences in parallel and constructs partitions that
can mine for frequent sequences in parallel. We use nondeterministic
finite automata as an efficient representation for the intermediary
sequences we send to the partitions.
The proposed algorithm
outperforms naive approaches for declarative constraints, is
competitive with a state-of-the-art algorithm for traditional
constraints, offers linear scalability, and makes it possible to mine
datasets that cannot be mined efficiently using sequential
algorithms.
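To make the problem setting concrete, here is a minimal sketch of a naive, single-machine baseline for constrained frequent sequence mining. It is not the distributed two-stage algorithm from the talk. The function `frequent_subsequences` is hypothetical, and a plain Python regular expression over space-joined items stands in for the declarative constraint language, assuming items contain no spaces.

```python
import re
from collections import Counter
from itertools import combinations


def frequent_subsequences(sequences, min_support, max_length, pattern=None):
    """Return subsequences contained in at least min_support input sequences."""
    support = Counter()
    for seq in sequences:
        seen = set()
        for length in range(1, max_length + 1):
            # combinations() preserves the original order, so each tuple is
            # a (possibly non-contiguous) subsequence of seq.
            for idx in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idx))
        # Each input sequence counts at most once per subsequence.
        support.update(seen)
    regex = re.compile(pattern) if pattern else None
    return {
        sub: count
        for sub, count in support.items()
        if count >= min_support
        and (regex is None or regex.fullmatch(" ".join(sub)))
    }


# Hypothetical toy data; the constraint keeps only sequences starting with "a".
data = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
result = frequent_subsequences(data, min_support=2, max_length=2,
                               pattern=r"a( \w+)*")
```

Enumerating every subsequence grows exponentially with sequence length, which is why scalable algorithms of the kind the abstract describes are needed for large datasets.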
Steffen Zeuch
Title:
Non-Invasive Progressive Optimization for In-Memory Databases
Abstract:
Progressive optimization introduces robustness for database workloads
against wrong estimates, skewed data, correlated attributes, or outdated
statistics. Previous work focuses on cardinality estimates and relies on
expensive counting methods as well as complex learning algorithms.
In this paper, we utilize performance counters to drive progressive
optimization during query execution. The main advantages are that
performance counters introduce virtually no costs on modern CPUs and
their usage enables non-invasive monitoring. We present fine-grained
cost models to detect differences between estimates and actual costs,
which enables us to kick-start reoptimization. Based on our cost models,
we implement an optimization approach that estimates the individual
selectivities of a multi-selection query efficiently. Furthermore, we
are able to learn properties like sortedness, skew, or correlation
during run-time. In our evaluation we show that the overhead of our
approach is negligible, while performance improvements are convincing.
Using progressive optimization, we improve runtime by up to a factor of
three compared to average run-times and up to a factor of 4.5 compared
to worst-case run-times. As a result, we avoid costly operator execution
orders, thus making query execution highly robust.
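Why the execution order of a multi-selection query matters can be sketched as follows. This is only an illustration under simplifying assumptions: selectivities are measured by directly counting on a sample, which is the kind of explicit counting the abstract contrasts with performance counters. The functions `count_evaluations` and `order_by_selectivity` are hypothetical.

```python
def count_evaluations(rows, predicates):
    """Apply predicates in order with short-circuiting.

    Returns (number of matching rows, number of predicate evaluations).
    """
    matches = evaluations = 0
    for row in rows:
        for pred in predicates:
            evaluations += 1
            if not pred(row):
                break  # later predicates are skipped for this row
        else:
            matches += 1
    return matches, evaluations


def order_by_selectivity(sample, predicates):
    """Sort predicates so the one passing the fewest sample rows runs first."""
    return sorted(predicates,
                  key=lambda pred: sum(1 for row in sample if pred(row)))


# Hypothetical workload: three selections over the integers 0..99.
rows = list(range(100))
predicates = [lambda r: r >= 2, lambda r: r % 2 == 0, lambda r: r < 5]

before = count_evaluations(rows, predicates)
after = count_evaluations(rows, order_by_selectivity(rows[::10], predicates))
```

Both orders return the same matching rows, but running the most selective predicate first needs fewer than half the predicate evaluations in this toy example.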