DIMA Colloquium Schedule
Date/Location | Speaker/Topic |
---|---|
02.04.2013, 13:00, DIMA EN 719 | Sebastian Breß, Otto-von-Guericke-Universität Magdeburg "Automatic Selection of Processing Units for Coprocessing in Databases" |
13.06.2013, 14:00, DIMA EN 719 | Prof. Periklis Andritsos, University of Toronto "Finding and extracting structure in large datasets" |
13.06.2013, 16:00, DIMA EN 719 | Frank McSherry, Microsoft "Naiad: a system for iterative, incremental, and interactive distributed dataflow" |
14.06.2013, 10:00, DIMA EN 719 | Frank McSherry, Microsoft "Differential Dataflow" |
14.06.2013, 12:00, DIMA EN 719 | Asterios Katsifodimos, INRIA "Scalable View-based Techniques for Web Data: Algorithms and Systems" |
25.07.2013, 10:15, DIMA EN 719 | Jimmy Lin, Twitter and the University of Maryland "Real-Time Search at Twitter" |
Sebastian Breß, Otto-von-Guericke-Universität Magdeburg
Title:
Automatic Selection of Processing Units for Coprocessing in Databases
Abstract:
Specialized processing units such as GPUs or FPGAs provide great opportunities to speed up database operations by exploiting parallelism and relieving the CPU. But utilizing coprocessors efficiently poses major challenges to developers. Besides finding fine-grained data-parallel algorithms and tuning them for the available hardware, it has to be decided at runtime which (co)processor should be chosen to execute a specific task. Depending on input parameters, wrong decisions can lead to severe performance degradation, since involving coprocessors introduces significant overhead, e.g., for data transfers. We present a framework that automatically learns and adapts execution models for arbitrary algorithms on any (co)processor to find break-even points and support scheduling decisions. We demonstrate its applicability for three common use cases in modern database systems and show how their performance can be improved with wise scheduling decisions. Furthermore, we discuss preliminary results of our research.
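The break-even idea in the abstract can be illustrated with a minimal sketch. It is not the speaker's framework: it assumes that execution time grows linearly with input size, and the function names (`fit_linear`, `choose_processor`, `break_even_size`) and all measurements are made up for illustration.

```python
from statistics import mean

def fit_linear(sizes, times):
    """Fit time ~ overhead + per_item * size by ordinary least squares."""
    mx, my = mean(sizes), mean(times)
    sxx = sum((x - mx) ** 2 for x in sizes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes, times))
    per_item = sxy / sxx
    overhead = my - per_item * mx
    return overhead, per_item

def predict(model, size):
    """Predicted execution time for an input of the given size."""
    overhead, per_item = model
    return overhead + per_item * size

def choose_processor(models, size):
    """Return the name of the processor with the lowest predicted time."""
    return min(models, key=lambda name: predict(models[name], size))

def break_even_size(a, b):
    """Input size at which two linear models predict the same time, or None."""
    (overhead_a, per_item_a), (overhead_b, per_item_b) = a, b
    if per_item_a == per_item_b:
        return None  # parallel lines never cross
    return (overhead_b - overhead_a) / (per_item_a - per_item_b)

# Hypothetical observations: the GPU has a fixed transfer overhead
# but a lower per-item cost than the CPU.
models = {
    "cpu": fit_linear([1000, 2000, 4000], [1.0, 2.0, 4.0]),
    "gpu": fit_linear([1000, 2000, 4000], [1.7, 1.9, 2.3]),
}
small_choice = choose_processor(models, 500)     # small input
large_choice = choose_processor(models, 10_000)  # large input
crossover = break_even_size(models["cpu"], models["gpu"])
```

With these made-up numbers the CPU is predicted to win for small inputs and the GPU for large ones. A learning framework would refit the models as new measurements arrive.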
Speaker Biography:
Sebastian Breß studied computer science at Otto-von-Guericke-Universität Magdeburg, completing his bachelor's degree in 2010 and his master's degree in 2012. Since April 2012 he has been a doctoral candidate in Magdeburg at the Chair of Databases and Information Systems, working on the topic "Heterogeneous Scheduling of Database Queries for hybrid CPU/GPU Platforms". His work focuses in particular on the effective use of available computing resources (CPUs or GPUs) during query processing.
Everybody is cordially welcome!
Prof. Periklis Andritsos, University of Toronto
Title:
Finding and extracting structure in large datasets
Abstract:
Data design has been characterized as a process of arriving at a design that maximizes the
information content of each piece of data (or equivalently, one that minimizes redundancy).
Information content (or redundancy) is measured with respect to a prescribed model for the
data, a model that is often expressed as a set of constraints. In this talk, I consider
the problem of doing data redesign in an environment where the prescribed model is unknown
or incomplete or is the result of integrated information. Specifically, I consider the problem
of finding structural clues in a relational instance of data, missing values, and duplicate records.
We propose a set of clustering-based information-theoretic tools for finding structural summaries
that are useful in characterizing the information content of the data, and ultimately useful
in the design of new relational storage spaces. We study the use of summaries in one specific
physical design task. I also show how these information-theoretic tools can assist in information
extraction tasks and the building of attribute dictionaries in unstructured repositories of
product data.
Speaker Biography:
Periklis Andritsos is an Assistant Professor at the University of Toronto, Faculty of
Information (iSchool). He received his B.Sc. degree in Electrical and Computer Engineering from the
National Technical University of Athens, Greece. He then moved to Toronto for his graduate
studies and holds an M.Sc. and Ph.D. degree in Computer Science from the University of Toronto.
He has also been an Assistant Professor at the University of Trento and the Free University
of Bozen-Bolzano, both in Italy.
His research focuses on the analysis of large repositories and, more specifically, the structure
discovery in order to facilitate design and speed up querying. He has developed a clustering
algorithm for categorical data, which has also formed the basis of his novel work on discovering
alternative schemas in databases with inconsistencies and errors. His techniques have also been
used and patented in the industry. He is a senior member of the IEEE Computer Society and the
Association for Computing Machinery.
He is currently visiting the Database Systems and Information Management Group at the
Technical University of Berlin.
Everybody is cordially welcome!
Please forward this invitation to interested colleagues.
Frank McSherry, Microsoft
Title:
Differential Dataflow
Abstract:
This talk will cover a new computational framework supported by Naiad,
differential dataflow, which generalizes standard incremental dataflow
for far greater re-use of previous results when collections change.
Informally, differential dataflow distinguishes between the multiple
reasons a collection might change, including both loop feedback and
new input data, allowing a system to re-use the most appropriate
results from previously performed work when an incremental update
arrives. Our implementation of differential dataflow efficiently
executes queries with multiple (possibly nested) loops, while
simultaneously responding with low latency to incremental changes to
the inputs. We show how differential dataflow enables orders of
magnitude speedups for a variety of workloads on real data, and
enables new analyses previously not possible in an interactive
setting.
This is joint work with Derek G. Murray, Rebecca
Isaacs, and Michael Isard.
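As background for the abstract, the following toy sketch shows the standard incremental baseline that differential dataflow is said to generalize: changes to a collection are expressed as signed differences, and an operator emits only the differences in its output. It is not Naiad's API and does not cover loops or nested iteration. The class name `IncrementalWordCount` and the sample data are hypothetical.

```python
from collections import Counter

class IncrementalWordCount:
    """Keeps per-word counts and reports only what changed after each update."""

    def __init__(self):
        self.counts = Counter()

    def update(self, diffs):
        """Apply (word, delta) input differences; return output differences.

        Output differences are ((word, count), +1 or -1) pairs: a retraction
        of the old count and an assertion of the new one.
        """
        changed = Counter()
        for word, delta in diffs:
            changed[word] += delta

        out = []
        for word, delta in changed.items():
            if delta == 0:
                continue  # insertions and deletions cancelled out
            old = self.counts[word]
            new = old + delta
            if old > 0:
                out.append(((word, old), -1))
            if new > 0:
                out.append(((word, new), +1))
            if new != 0:
                self.counts[word] = new
            else:
                del self.counts[word]
        return out

wc = IncrementalWordCount()
first = wc.update([("a", +1), ("b", +1), ("a", +1)])
second = wc.update([("a", -1), ("c", +1)])
```

The second update touches only the words "a" and "c"; the count for "b" is re-used without recomputation.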
Speaker Biography:
http://research.microsoft.com/en-us/people/mcsherry/
Everybody is cordially welcome!
Please forward this invitation to interested colleagues.
Frank McSherry, Microsoft
Title:
Naiad: a system for iterative, incremental, and interactive distributed dataflow
Abstract:
In this talk I’ll describe the Naiad system, based on a new model for low-latency incremental and iterative dataflow. Naiad is designed to provide three properties we do not think yet exist in a single system: the expressive power of loops, concurrent vertex execution, and fine-grained edge completion. Removing any one of these requirements yields an existing class of solutions (respectively: streaming systems like StreamInsight, iterative incremental systems like Nephele, and callback systems like Percolator), but all three together appear to require a new system design. We will describe Naiad’s structured cyclic dataflow model and protocol for tracking and coordinating outstanding work, more closely resembling memory fences than traditional distributed systems barriers. We give several examples of how Naiad can be used to efficiently implement many of the currently popular “big data” programming patterns, as well as several new ones, and experimental results indicating that Naiad’s relative performance ranges from “as good as” to “much better than” existing systems.
This is joint work with Derek G. Murray, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi.
Speaker Biography:
http://research.microsoft.com/en-us/people/mcsherry/
Everybody is cordially welcome!
Asterios Katsifodimos, INRIA
Title:
Scalable View-based Techniques for Web Data: Algorithms and Systems
Abstract:
Materialized views have long
been used in databases to speed up queries. Materialized views can be
seen as precomputed query results that can be re-used to evaluate
(part of) another query, and have been a topic of intensive research,
in particular in the context of relational data warehousing. In this
talk we will investigate the applicability of materialized view-based
techniques to optimize the performance of Web data management systems,
in particular in distributed settings, considering XML data and
queries.
Bio:
Asterios Katsifodimos has been a PhD student at INRIA Saclay and
Université Paris-Sud since 2009, under the direction of Ioana Manolescu.
His PhD focuses on the management of Web data using materialized views.
Prior to his PhD, Asterios was a member of the High Performance
Computing Systems Laboratory at the University of Cyprus, where he
obtained his bachelor's and master's degrees. His work at the University
of Cyprus focused on Grid computing and information retrieval.
Jimmy Lin, Twitter and the University of Maryland
Title
Real-Time Search at Twitter
Abstract
Twitter aims to be an information platform that connects users to
what they care about, 140 characters at a time. Whether it's breaking
news events around the world, the latest celebrity gossip, or the
recent adventures of your closest friends, the search and discovery
services aim to surface relevant and personalized content in
real-time.
Focusing in particular on architectures for
search, in this talk I'll present Earlybird, the core retrieval
engine that powers Twitter's real-time search service. Although
Earlybird builds and maintains inverted indexes like nearly all
modern retrieval engines, its index structures differ from those built
to support traditional web search. We describe these differences and
present the rationale behind our design. A key requirement of
real-time search is the ability to ingest content rapidly and make it
searchable immediately, while concurrently supporting low-latency,
high-throughput query evaluation. We believe that our solution
represents an interesting point in the design space, and is well-suited
to Twitter's needs.
I'll conclude with a discussion of some
future challenges that span natural language processing, information
retrieval, text mining, and data management.
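The requirement named in the abstract, making content searchable immediately after ingestion, can be illustrated with a toy in-memory inverted index. This is a sketch and not Earlybird's design: the class `TinyRealtimeIndex`, the whitespace tokenization, the newest-first result order, and the single-threaded ingest are all assumptions made for illustration.

```python
from collections import defaultdict

class TinyRealtimeIndex:
    """Append-only inverted index; a document is searchable as soon as add() returns."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> doc ids in arrival order
        self.docs = []                     # doc id -> original text

    def add(self, text):
        """Ingest one document and return its id."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, query, limit=10):
        """Return ids of documents containing all query terms, newest first."""
        terms = query.lower().split()
        if not terms:
            return []
        lists = [self.postings.get(term, []) for term in terms]
        common = set(lists[0]).intersection(*lists[1:])
        return sorted(common, reverse=True)[:limit]

index = TinyRealtimeIndex()
index.add("Naiad talk in Berlin")
index.add("Earlybird powers real-time search")
index.add("real-time search talk")
recent = index.search("real-time search")
```

A production engine would additionally need to support concurrent readers and writers, which this sketch does not attempt.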
Bio
Jimmy Lin is an associate professor in the iSchool at the
University of Maryland, affiliated with the Department of Computer
Science and the Institute for Advanced Computer Studies. He graduated
with a Ph.D. in computer science from MIT in 2004. Lin's research
lies at the intersection of information retrieval and natural language
processing, and he has done work in a variety of areas, including
question answering, medical informatics, and bioinformatics. Lin's
current research focuses on massively-distributed data analytics in
cluster-based environments.
Lin recently completed
an extended sabbatical at Twitter, where from 2010 to 2012 he worked on
services designed to surface relevant content for users and the
distributed infrastructure that supports mining relevance signals
from massive amounts of data.