
Talks DIMA Research Seminar

Talks WS15/16
Date | Time | Location | Lecturer | Subject
15.03.2016 | 4 pm | DIMA EN 719 | Dimitris Trihinas, PhD Candidate, University of Cyprus | "Low-Cost Adaptive Monitoring Techniques for the Internet of Things"
07.03.2016 | 4 pm | DIMA EN 719 | David Broneske, PhD Student, Magdeburg | "Multi-Staged Query Compilation for modern Hardware"
04.02.2016 | 2 pm | TA 215 | Andrei Mihnea, SAP | "Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA"
01.02.2016 | 4 pm | DIMA EN 719 | Andreas Kunft, TU Berlin, DIMA | "Kernel Generation for Operators in Apache Flink with Substrate VM"
18.01.2016 | 4 pm | DIMA EN 719 | Juan Fumero, University of Edinburgh | "Enabling distributed data processing in R"
11.01.2016 | 4 pm | DIMA EN 719 | Quoc-Cuong To, INRIA, France | "Privacy-Preserving Querying Protocols using Tamper Resistant Hardware"
17.12.2015 | 12 am | DIMA EN 719 | Matthias Boehm, IBM Research - Almaden, San Jose | "SystemML's Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs"
14.12.2015 | 11 am | DIMA EN 719 | Prof. Rainer Gemulla, University of Mannheim | "Declarative Sequential Pattern Mining"
10.12.2015 | 10 am | Room EW 201, Eugene-Paul-Wigner-Gebäude, Hardenbergstraße 36, 10625 Berlin | Albert Bifet, Telecom ParisTech | "Data Stream Processing and Machine Learning for Data Streams"
23.11.2015 | 4 pm | DIMA EN 719 | Olga Streibel, FU Berlin | "On mining trends and recognizing situations"
19.11.2015 | 10 am | DIMA EN 719 | Prof. Chris Jermaine, Rice University | "Large-Scale Machine Learning With The SimSQL System"
13.11.2015 | 10 am | DIMA EN 719 | Prof. Reza Zadeh, Stanford University | "Matrix Computations and Neural Networks in Spark"
11.11.2015 | 4 pm | DIMA EN 719 | Prof. Jens Dittrich, Saarland University | "On Debunking Computational Models when Measuring Data-intensive Main-Memory Algorithms"
26.10.2015 | 4 pm | DIMA EN 719 | Thomas Bodner, SAP Potsdam | "Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLXP Workloads"
19.10.2015 | 4 pm | DIMA EN 719 | Markus Weimer, Microsoft, Redmond | "Apache REEF – The Retainable Evaluator Execution Framework"

Thomas Bodner, SAP Potsdam

Title:

Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLXP Workloads

Abstract and Bio:

In this talk, we present the architecture of the SAP HANA Scale-out Extension (SOE). The goal of HANA SOE is to complement the scale-up oriented HANA core data platform with massive scale-out capabilities for large-scale analytics over real-time data. This is achieved by decoupling the main database components and providing them as services in a distributed landscape, a design choice made possible by recent advances in high-throughput, low-latency networks and storage devices. We detail three central components of HANA SOE and their interplay: a distributed shared log, a transaction broker and a distributed query executor. Furthermore, we report on the ongoing integration of HANA SOE with the Apache big data ecosystem.
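As a rough illustration of how decoupled services around a shared log can interact, here is a minimal Python sketch; the class and method names (SharedLog, TransactionBroker, QueryExecutor) are hypothetical stand-ins for the pattern described above, not SAP HANA SOE interfaces.

# Toy sketch of a log-centric, service-decoupled design (names are hypothetical,
# not SAP HANA SOE APIs): writers append versioned records to a shared log,
# a transaction broker hands out commit timestamps, and stateless query
# executors answer reads as of a snapshot by replaying the log.
import itertools

class SharedLog:
    """Append-only, totally ordered log shared by all services."""
    def __init__(self):
        self.entries = []                      # (commit_ts, key, value)

    def append(self, commit_ts, key, value):
        self.entries.append((commit_ts, key, value))

    def replay(self, as_of_ts):
        """Yield all entries visible at snapshot time as_of_ts."""
        return (e for e in self.entries if e[0] <= as_of_ts)

class TransactionBroker:
    """Hands out monotonically increasing commit timestamps."""
    def __init__(self):
        self._clock = itertools.count(1)

    def commit_timestamp(self):
        return next(self._clock)

class QueryExecutor:
    """Stateless reader: materializes a snapshot from the shared log."""
    def __init__(self, log):
        self.log = log

    def get(self, key, as_of_ts):
        value = None
        for _, k, v in self.log.replay(as_of_ts):
            if k == key:
                value = v                      # last write before the snapshot wins
        return value

log, broker = SharedLog(), TransactionBroker()
log.append(broker.commit_timestamp(), "account:42", 100)
snapshot = broker.commit_timestamp()
log.append(broker.commit_timestamp(), "account:42", 250)
print(QueryExecutor(log).get("account:42", snapshot))   # -> 100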
 
Thomas Bodner is a developer with the Innovation Center at SAP. He received a M.S. from Technische Universität Berlin and a B.S. from Duale Hochschule Baden-Württemberg Stuttgart. His current work involves building parts of a new system software stack for large-scale data management to support the growing data-related needs of SAP’s enterprise applications.


Markus Weimer, Microsoft, Redmond

Title:

Apache REEF – The Retainable Evaluator Execution Framework

Abstract and Bio:

Resource managers like Apache YARN emerged as a critical layer in the cloud computing stack. They offer a flexible, low level abstraction for leasing cluster resources and instantiating application logic on them. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault-tolerance, task scheduling and coordination) and re-implement common mechanisms (e.g., caching, bulk-data transfers).
 
I present REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a resource manager. REEF provides mechanisms that facilitate resource re-use for data caching, and state management abstractions that greatly ease the development of elastic data processing work-flows on resource managed cloud platforms.
 
REEF is used to develop several commercial offerings such as the Azure Stream Analytics service at Microsoft.  REEF is also an Apache Incubator project that has attracted contributors from several institutions.
 
Dr. Markus Weimer is a Principal Scientist with the Cloud and Information Services Lab at Microsoft, Redmond and a committer to Apache REEF. His work focusses on big data systems with a special emphasis on machine learning and graph computation applications. Markus has several years of experience with Big Data machine learning systems and applications. You can follow him @markusweimer

 



Prof. Reza Zadeh, Stanford University

Title:

"Matrix Computations and Neural Networks in Spark"

Abstract:

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM.
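As a rough illustration of the "keep vector operations local, ship matrix operations to the cluster" idea, here is a minimal PySpark sketch of a distributed matrix-vector product; it uses only plain RDD operations and is not the mllib.linalg code discussed in the talk.

# Toy distributed matrix-vector product in PySpark: the (large) row-partitioned
# matrix lives on the cluster, the (small) vector stays local on the driver and
# is broadcast to the workers. Illustrative only, not the mllib.linalg code.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="matvec-sketch")

n_rows, n_cols = 10000, 50
# Row-partitioned matrix as an RDD of (row_index, row_vector) pairs.
A = sc.parallelize([(i, np.random.rand(n_cols)) for i in range(n_rows)])
v = np.random.rand(n_cols)          # small vector, kept on the driver
v_bc = sc.broadcast(v)              # shipped once to every worker

# The matrix work (one dot product per row) runs on the cluster ...
products = A.map(lambda row: (row[0], float(np.dot(row[1], v_bc.value))))
# ... and only the small result vector is collected back to the driver.
result = np.array([x for _, x in sorted(products.collect())])

print(result.shape)                 # (10000,)
sc.stop()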


Bio:
http://stanford.edu/~rezab/bio.html

 



Prof Rainer Gemulla, Uni Mannheim

Title:

"Declarative Sequential Pattern Mining"

Abstract:


Sequential pattern mining is a fundamental task in data mining. Given a database of sequences (e.g., customer transactions, event logs, or natural-language sentences), the goal of sequential pattern mining is to detect relevant and interesting patterns in the data. In this talk, I describe some of our work in this area, focusing on large-scale sequential pattern mining and declarative sequential pattern mining.
 
The first part of the talk introduces MG-FSM and LASH, two publicly available systems for frequent sequential pattern mining. MG-FSM provides a distributed framework for mining very large sequence databases in a scalable way. LASH has the same objective, but additionally incorporates item hierarchies to detect patterns that would otherwise be missed. Such hierarchies occur in a number of applications; e.g., products bought by customers are often arranged in a product hierarchy, and words and named entities in natural-language text form a semantic hierarchy. Example patterns that make use of hierarchies are "DSLR Camera->Tripod->Flash" (for customer transactions) or "PERSON was born in CITY" (for natural-language text).
 
A key problem with many of the available algorithms and tools for sequential pattern mining---including MG-FSM and LASH---is that they are not sufficiently tailored to the underlying application: They may produce many irrelevant patterns and/or miss relevant ones. In the second part of this talk, I introduce Desq, a general-purpose system for declarative pattern mining. The goal of declarative pattern mining is to let applications specify in an intuitive way which patterns are considered relevant. Desq provides a powerful pattern-specification language, which is inspired by regular expressions but allows the use of additional features such as item hierarchies. The specified pattern expressions are transparently translated into extended finite state transducers, which are subsequently simulated in a way suitable for pattern mining. An example pattern expression for typed relational phrases in natural-language text is given by "(ENTITY^ VERB+ NOUN+? PREP? ENTITY^)"; the sequential pattern "PERSON was born in CITY" mentioned above is relevant for this expression.
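To give a feel for such pattern expressions, the following toy Python snippet matches a comparable pattern over typed tokens with an ordinary regular expression; this is only an analogy for illustration and does not reflect Desq's transducer-based implementation.

# Toy analogy: match a Desq-style pattern expression against a sequence of
# (word, type) tokens by running a plain regular expression over the type tags.
# This only illustrates the idea of declaring patterns over typed items; Desq
# itself compiles pattern expressions into extended finite state transducers.
import re

sentence = [("Mozart", "ENTITY"), ("was", "VERB"), ("born", "VERB"),
            ("in", "PREP"), ("Salzburg", "ENTITY")]

tags = " ".join(tag for _, tag in sentence)
# Rough counterpart of "(ENTITY^ VERB+ NOUN+? PREP? ENTITY^)".
pattern = r"ENTITY (?:VERB )+(?:NOUN )*(?:PREP )?ENTITY"

if re.search(pattern, tags):
    print("relevant pattern: PERSON was born in CITY")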

 

 

 

Prof. Jens Dittrich, Saarland University

Title:

"On Debunking Computational Models when Measuring Data-intensive Main-Memory Algorithms"

Abstract:


Assume you measure the weight of a car and the scale shows minus 42 kg. You double-check the scale: it is correct. You remove all wheels: the scale shows plus 5 metric tons. You buy another, more expensive scale: it shows minus 100 kg. You flip the car upside down: both scales show about plus 200 kg. Great! Now we are in a fantastic situation to compare the weights of five different cars. We publish the results and know for sure which car is heavier than others. Or, ... maybe not?

I will present some of our recent research in experiments and analyses of data-intensive main-memory algorithms and systems. This includes: adaptive indexing [PVLDB 2013/VLDB 2014 best paper award], compressed radix tries [ICDE 2015], hash tables [PVLDB 2016], join algorithms [ongoing], and systems [ongoing]. Our experimental results show several surprises and indicate that order of magnitude runtime differences may be obtained, even for super-well researched building blocks like indexes and joins. I will also briefly demo our open source pdbf-janiform toolkit for archiving experimental results inside your pdf-papers [VLDB 2015], see https://github.com/uds-datalab/PDBF.

DISCLAIMER: This is not a talk showing you how to make your car pass environmental testing situations.
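As one concrete illustration of why simple computational models can mislead, the following small NumPy benchmark (not from the talk) compares two traversals with identical asymptotic cost but very different cache behavior, and repeats each measurement instead of trusting a single reading.

# Hedged micro-benchmark (not from the talk): two traversals with the same
# O(n) cost model, very different wall-clock behavior because of cache effects.
import time
import numpy as np

n = 10_000_000
data = np.random.rand(n)
seq_idx = np.arange(n)                 # sequential access pattern
rnd_idx = np.random.permutation(n)     # random access pattern

def timed_sum(indices):
    start = time.perf_counter()
    total = data[indices].sum()        # gather + sum
    return total, time.perf_counter() - start

for label, idx in [("sequential", seq_idx), ("random", rnd_idx)]:
    # Repeat and report all runs instead of a single (possibly cold) measurement.
    times = [timed_sum(idx)[1] for _ in range(5)]
    print(label, [round(t, 3) for t in times])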


CV:
Jens Dittrich is a Full Professor of Computer Science in the area of Databases, Data Management, and Big Data at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He is also associated with CISPA (Center for IT-Security, Privacy and Accountability). He received an Outrageous Ideas and Vision Paper Award at CIDR 2011; a BMBF VIP Grant in 2011; a best paper award at VLDB 2014 (the first ever given to an E&A paper); two CS teaching awards in 2011 and 2013; as well as several presentation awards, including a qualification for the interdisciplinary German science slam finals in 2012 and three presentation awards at CIDR (2011, 2013, and 2015). He has been a PC member of prestigious international database conferences such as PVLDB, SIGMOD, and ICDE. In addition, he has been an area chair at PVLDB and a group leader at SIGMOD.

His research focuses on fast access to big data, including in particular: data analytics on large datasets, main-memory databases, database indexing, and reproducibility (see https://github.com/uds-datalab/PDBF). Since 2013 he has been teaching some of his classes on data management as flipped classrooms. See http://datenbankenlernen.de or http://youtube.com/jensdit for a list of freely available videos on database technology in German (introduction to databases) and English (database architectures and implementation techniques). A textbook/e-book complementing the latter set of videos is about to appear.

Albert Bifet, Telecom ParisTech

Title:

"Data Stream Processing and Machine Learning for Data Streams"

Abstract:


In this talk I will introduce the basic concepts of stream processing and machine learning for data streams. Data stream mining, or real-time analytics, is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends, helping to improve their performance. I will discuss lessons learned and open source projects started at the companies where I have worked, Huawei and Yahoo. In particular, I will present Apache SAMOA, which includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also on several other distributed stream processing engines such as Storm and Samza.
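For readers unfamiliar with stream mining, the following generic Python sketch shows the test-then-train (prequential) evaluation loop that such systems are built around; the stream and the trivial majority-class learner are hypothetical, and this is not SAMOA's API.

# Generic sketch of prequential (test-then-train) evaluation on a data stream.
# The "learner" is a trivial majority-class classifier; real systems such as
# Apache SAMOA plug in Hoeffding trees, clustering, etc. behind the same loop.
import random
from collections import Counter

def stream(n=10_000):
    """Hypothetical labeled stream: the label correlates with the feature."""
    for _ in range(n):
        x = random.random()
        yield x, int(x > 0.7)

class MajorityClass:
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

model, correct, seen = MajorityClass(), 0, 0
for x, y in stream():
    correct += int(model.predict(x) == y)   # first test ...
    model.learn(x, y)                       # ... then train on the same item
    seen += 1

print(f"prequential accuracy: {correct / seen:.3f}")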

Bio:


Albert Bifet has been an Associate Professor at Telecom ParisTech since September 2015. Previously he was a Senior Researcher at Huawei. He is the author of a book on Adaptive Stream Mining and Pattern Learning and Mining from Evolving Data Streams. His main research interest is in learning from data streams. He has published more than 60 articles. He served as Industrial Track Co-Chair of ECML-PKDD 2015. He is one of the leaders of the MOA and Apache SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams. He has been Co-Chair of BigMine (2012-2015) and of the ACM SAC Data Streams Track (2012-2016).

 

 

 

 

Prof. Chris Jermaine, Rice University

Title: "Large-Scale Machine Learning With The SimSQL System"

Abstract:
In this talk, I’ll describe the SimSQL system, which is a platform for writing and executing statistical codes over large data sets, particularly for machine learning applications. Codes that run on SimSQL can be written in a very high-level, declarative language called Buds. A Buds program looks a lot like a mathematical specification of an algorithm, and statistical codes written in Buds are often just a few lines long.

At its heart, SimSQL is really a relational database system, and like other relational systems, SimSQL is designed to support data independence. That is, a single declarative code for a particular statistical inference problem can be used regardless of data set size, compute hardware, and physical data storage and distribution across machines. One concern is that a platform supporting data independence will not perform well. But we’ve done extensive experimentation, and have found that SimSQL performs as well as other competitive platforms that support writing and executing machine learning codes for large data sets.

Bio:
Chris Jermaine is an associate professor of computer science at Rice University. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and an ACM SIGMOD Best Paper Award. In his spare time, Chris enjoys outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, Chris and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.

 

 

Matthias Boehm, IBM Research - Almaden; San Jose, CA, USA

Title:

"SystemML's Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs"

Abstract:

Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations to distributed computations on MapReduce or Spark. The compilation of large-scale ML programs exhibits many opportunities for automatic optimization, which is crucial to achieve both high efficiency and scalability where required. In this talk, we give an up-to-date overview of SystemML's compilation chain and selected optimization techniques. In particular, we discuss the end-to-end compilation chain, static and dynamic simplification rewrites, operator selection, and dynamic recompilation.
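To give a flavor of what a simplification rewrite can buy, the following NumPy check (not SystemML code) illustrates one well-known algebraic identity of the kind such rewrites exploit: the trace of a matrix product can be computed without materializing the product.

# Illustration (not SystemML code) of the kind of algebraic identity that
# static simplification rewrites exploit: trace(X %*% Y) == sum(X * t(Y)),
# which avoids materializing the full matrix product.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 300))
Y = rng.random((300, 500))

naive     = np.trace(X @ Y)          # builds a 500 x 500 intermediate
rewritten = np.sum(X * Y.T)          # elementwise product only, no full matmul

assert np.isclose(naive, rewritten)
print(naive, rewritten)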

Bio:

Matthias Boehm is a Research Staff Member at IBM Research - Almaden; San Jose, CA, USA. His current research focuses on optimization and runtime techniques for declarative, large-scale machine learning in SystemML. He received his PhD degree in computer science in 2011 from Dresden University of Technology, Germany with a dissertation on cost-based optimization of integration flows under the supervision of Prof. Wolfgang Lehner. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.

 

 

 

 

Olga Streibel, FU Berlin

Title:

On mining trends and recognizing situations.

Abstract:

Dealing with big data brings many challenges and huge opportunities. In the years since the term big data was introduced, research on many interesting issues around it has arisen and progressed significantly. Among these research aspects, the analysis, interpretation, and thus useful harvesting of big data remains an interesting problem.
In this talk we focus on trend mining and discuss its potential for situation recognition. Trend mining is based on the idea of harvesting information and knowledge from trends while also building an understanding of them. It is defined as the extraction of implicit, previously unknown, and potentially useful information from time-ordered texts or data, assuming that this data contains a trend. A trend template, as a knowledge-based approach to trend mining, assumes that mining trends with background knowledge helps us understand a trend.
Moreover, by extending this approach to different kinds of data, we may be able to recognize an upcoming situation that a given trend may cause.
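As a minimal, purely numeric illustration of what "this time-ordered data contains a trend" can mean, the following hypothetical Python sketch fits a least-squares slope to per-window term counts; the knowledge-based trend templates discussed in the talk go well beyond such a signal.

# Hypothetical, minimal trend detector over time-ordered term counts: fit a
# least-squares slope per term and flag terms whose frequency is rising.
import numpy as np

# term -> counts per time window (e.g., mentions per day)
counts = {
    "olympics": [2, 3, 5, 9, 14, 22],
    "weather":  [7, 6, 8, 7, 6, 7],
}

for term, series in counts.items():
    y = np.array(series, dtype=float)
    t = np.arange(len(y))
    slope = np.polyfit(t, y, deg=1)[0]      # least-squares slope
    if slope > 1.0:                          # crude threshold
        print(f"{term}: trending (slope {slope:.2f})")
    else:
        print(f"{term}: stable (slope {slope:.2f})")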

Bio:

 

As a DAAD postdoctoral researcher at the National Institute of Informatics (NII) in Tokyo, Japan, Dr. Olga Streibel has been working on trend mining and wireless networks since March 2014. Her TreMiSR project combined the results of trend mining research with specific aspects of situation recognition based on data and signals from wireless networks. Before starting the project at NII in Tokyo, she worked as a doctoral student and research assistant at the Institute of Computer Science of Freie Universität Berlin in the NBI and CSW groups. As a former member of the Corporate Semantic Web project, during her research at FU Berlin she focused on selected aspects of semantics in data, in particular semantic search and trend mining.

Quoc-Cuong To, INRIA, France

Title:
Privacy-Preserving Querying Protocols using Tamper Resistant Hardware

Abstract:

Current applications, from complex sensor systems to online e-markets, acquire vast quantities of personal information, which usually ends up on central servers. Centralizing and processing all of one's data in a single server, where it is exposed to prying eyes, poses a major privacy problem. Conversely, decentralized architectures that help individuals keep full control of their data complicate global treatments and queries, impeding the development of innovative services and applications. We aim to reconcile individuals' privacy on one side with the global benefits for businesses on the other. Recent advances in low-cost secure hardware promote the idea of pushing security to secure hardware devices that control the data at the place of its acquisition. Thanks to these tangible physical elements of trust, secure distributed querying protocols can re-establish the capacity to perform global computations without revealing any sensitive information to central servers. We also show that our protocol can be applied to MapReduce to support the security of this framework when processing large amounts of personal data in the cloud.

Bio:
Quoc-Cuong To is a PhD student in the SMIS team (Secured & Mobile Information Systems, http://www-smis.inria.fr/) in France, a joint team between INRIA, the University of Versailles, and CNRS. Previously, he was a lecturer at the Ho Chi Minh City University of Technology and spent five months in Switzerland for his master's internship. He serves on the program committees of the FDSE and ACOMP conferences and as an external reviewer for EDBT. His research interests are privacy and security in databases, trusted hardware, indexing structures, and distributed systems. He is the author of nine peer-reviewed publications and has received several young-researcher awards from Japan and Vietnam.

Juan Fumero, University of Edinburgh

Title:

Enabling distributed data processing in R

Abstract:

During the past few years R has become an important language for data analysis, data representation, and visualization. R is a very expressive language which combines functional and dynamic aspects with laziness and object-oriented programming. However, the default R implementation is neither fast nor distributed, both features crucial for "big data" processing.

In this talk I will present FastR-Flink, a compiler based on Oracle's R implementation FastR with support for some constructs of Apache Flink, a Java/Scala framework for distributed data processing. The Apache Flink constructs such as map, reduce, or filter are integrated at the compiler level to allow the execution of distributed stream and batch data processing applications directly from the R programming language.

This work was initiated at Oracle Labs during my recent internship with them in Linz, Austria.

Short Bio:
Juan Fumero is a second-year PhD student at The University of Edinburgh, supervised by Christophe Dubach, working on runtime systems and JIT compilation for heterogeneous computing. He is currently working on runtime compilation and data management from Java bytecode to C/OpenCL using the Oracle Graal compiler.

He received his bachelor's and master's degrees in Computer Science from La Laguna University (Spain). His final master's project, with Prof. F. de Sande and one of his PhD students (Ruyman Reyes), was on the YaCF compiler. YaCF (Yet Another Compiler Framework) is a source-to-source compiler that translates C99 code into CUDA and OpenCL code within their compiler and runtime for GPU systems; they also adapted this compiler to support OpenACC directives for heterogeneous computing.

He has been working on compilers, parallelism, and HPC since 2010 at several organizations, including the HPC service at La Laguna University, the ITER supercomputer (Spain), CERN openlab (Switzerland), and, most recently, Oracle Labs (Austria).

More info: http://homepages.inf.ed.ac.uk/s1369892/

Andreas Kunft, TU Berlin, DIMA

Title:

Kernel Generation for Operators in Apache Flink with Substrate VM

Abstract:

Apache Flink is a massively parallel execution engine written in Java. It extends the concepts of MapReduce with support for more expressive operators such as joins and iterations, combined with a database-style optimizer. We extract these operators and generate independent binary compute kernels for the specific program at runtime. This is done with minimal changes to the Flink code base, as the kernels reuse the code for I/O and encoding/decoding of data. The resulting executables are self-contained and only need pointers to their input and output.

In this talk we will discuss the general idea, status and future work of the project.

Andrei Mihnea, SAP

Title: Darwinian evolution: 3 implementations of snapshot isolation in SAP HANA

Abstract: The talk briefly presents the HANA column store and then focuses on three historical versions of the snapshot isolation implementation, presenting for each what worked well and why we evolved to the next one.
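For context, the following Python sketch shows a generic multi-version visibility check, the textbook core of snapshot isolation; it is illustrative only and corresponds to none of the three HANA implementations covered in the talk.

# Generic multi-version visibility check for snapshot isolation (textbook
# illustration, not any of the SAP HANA implementations from the talk):
# a transaction reading at snapshot timestamp S sees the newest version of a
# row that was committed at or before S and not deleted at or before S.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: int
    created_ts: int                  # commit timestamp of the creating txn
    deleted_ts: Optional[int] = None # commit timestamp of the deleting txn

def visible(version, snapshot_ts):
    created_visible = version.created_ts <= snapshot_ts
    not_yet_deleted = version.deleted_ts is None or version.deleted_ts > snapshot_ts
    return created_visible and not_yet_deleted

def read(version_chain, snapshot_ts):
    """Return the newest visible version of a row, or None."""
    candidates = [v for v in version_chain if visible(v, snapshot_ts)]
    return max(candidates, key=lambda v: v.created_ts, default=None)

row = [Version(value=100, created_ts=5, deleted_ts=20),
       Version(value=250, created_ts=20)]

print(read(row, snapshot_ts=10).value)   # -> 100 (old snapshot)
print(read(row, snapshot_ts=25).value)   # -> 250 (newer snapshot)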

Bio:
MS in computer science in 1988; the Bucharest Polytechnic Institute, Automatic Control and Computers engineering school; Prof. Cristian Giumale
DEA in Machine Learning in 1990; Universite Paris 6; Prof. Jean-Gabriel Ganascia
Joined Sybase in 1993; currently working at SAP, which acquired Sybase in 2010.
Worked on the core engine of several RDBMSs (Sybase ASE and IQ; SAP HANA): query optimization, Abstract Plans (optimizer hints), query compilation and execution, eager-lazy aggregation, shared-disk and shared-nothing scale-out. Current focus: database stores (in-memory and on-disk, row and column oriented), transaction processing, data lifecycle.

David Broneske, PhD Student, Magdeburg

Title:  "Multi-Staged Query Compilation for modern Hardware"

Abstract:

Since the hardware landscape is becoming more and more heterogeneous, generating efficient code for heterogeneous hardware is one of the biggest challenges of the next century. To this end, the community has come up with a variety of methods to create efficient code for specialized, highly parallel hardware (the Polly compiler, the Delite framework, ...).

In my current work, I investigate how to combine concepts from these state-of-the-art methods to arrive at a hardware-sensitive database management system. Our idea is to create a database management system that rewrites its code at run time in order to adapt itself to the underlying hardware. This is done by creating different code variants with specific code optimizations enabled, profiling their performance, and then pruning the pool to the optimal variants and code optimizations.
In this talk, I will extend this idea to query compilation and present our idea of a multi-staged query compiler for CoGaDB.
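The generate-profile-prune loop described above can be sketched generically as follows; the operator variants and names are hypothetical and unrelated to CoGaDB's actual code generator.

# Generic sketch of the generate-profile-prune loop: several functionally
# equivalent variants of a selection operator are timed on the target hardware
# and only the fastest survives. Hypothetical code, not CoGaDB's generator.
import time
import numpy as np

data = np.random.randint(0, 1000, size=5_000_000)

def variant_python_loop(threshold):
    return [x for x in data if x < threshold]

def variant_numpy_mask(threshold):
    return data[data < threshold]

def profile(variant, threshold=100, repetitions=3):
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        variant(threshold)
        best = min(best, time.perf_counter() - start)
    return best

variants = {"python_loop": variant_python_loop, "numpy_mask": variant_numpy_mask}
timings = {name: profile(fn) for name, fn in variants.items()}
winner = min(timings, key=timings.get)
print(timings, "-> keep", winner)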

Short Bio:

David finished his Bachelor's and Master's studies at the University of Magdeburg in 2011 and 2013, respectively. He is currently in the third year of his PhD at the University of Magdeburg, supervised by Prof. Gunter Saake. In his PhD project, David researches techniques for heterogeneous hardware. His work is integrated into CoGaDB, which serves as his test platform.

 

 

Dimitris Trihinas, PhD Candidate, University of Cyprus

Title: "Low-Cost Adaptive Monitoring Techniques for the Internet of Things"

 

Abstract:

Sensors and actuators with "smart" processing capabilities, embedded in battery-powered and internet-enabled physical devices, are becoming the tools for understanding the complexity of the global interconnected world we inhabit. The Internet of Things (IoT) churns out tremendous amounts of data, with continuous data streams flooding from devices scattered across multiple locations to the processing engines of almost all industry sectors for analysis. However, as the number of "things" surpasses the population of the technology-enabled world, real-time processing under ever-growing data volumes and energy efficiency are great challenges of the big data era as it transitions to the IoT. In this talk, we introduce a lightweight adaptive monitoring framework suitable for smart, battery-powered IoT devices with limited processing capabilities. Our framework inexpensively and in place dynamically adapts the monitoring intensity and the amount of data disseminated through the network, based on the current evolution and variability of the metric stream. By accomplishing this, energy consumption and data volume are reduced, allowing IoT devices to preserve battery and easing processing on data engines. Experiments on real-world data from physical servers, internet security services, wearables, and intelligent transportation systems show that our framework achieves a balance between efficiency and accuracy: it is capable of reducing data volume by 74% and energy consumption by at least 71%, while preserving accuracy above 89%.
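The core adaptation idea, sampling less when the metric stream is stable and more when it becomes volatile, can be illustrated with the following Python sketch; the variability estimate and thresholds are hypothetical and do not reproduce the AdaM algorithms.

# Illustrative sketch of adaptive sampling for a metric stream: widen the
# sampling interval while the recent signal is stable, shrink it when the
# estimated variability rises. Thresholds and the variability estimate are
# hypothetical; this does not reproduce the AdaM framework's algorithms.
from collections import deque
import statistics

def adaptive_intervals(stream, t_min=1.0, t_max=60.0, window=10, threshold=0.05):
    """Yield (value, next_sampling_interval) for each collected sample."""
    recent = deque(maxlen=window)
    interval = t_min
    for value in stream:
        recent.append(value)
        if len(recent) >= 2:
            mean = statistics.fmean(recent)
            variability = statistics.pstdev(recent) / (abs(mean) + 1e-9)
            if variability < threshold:
                interval = min(interval * 2, t_max)   # stable: back off
            else:
                interval = t_min                      # volatile: sample densely
        yield value, interval

cpu_stream = [20, 21, 20, 22, 21, 20, 75, 80, 78, 21, 20, 20]
for value, interval in adaptive_intervals(cpu_stream):
    print(f"value={value:5.1f}  next sample in {interval:4.1f}s")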

Short Bio:

Demetris Trihinas is a PhD Candidate at the University of Cyprus. He holds an MSc in Computer Science from the University of Cyprus and a Dipl.-Ing. in Electrical and Computer Engineering from the National Technical University of Athens (NTUA). Demetris lives, breathes, and thinks in the elastic "cloud" full of internet-connected "things" and is always open to a challenge to tweak the performance and scalability of distributed systems and apps. He is the person behind the JCatascopia cloud monitoring system and the AdaM IoT framework, and is also a member of the Cloud Application Management Framework, which is now an official Eclipse project. His work is published in IEEE/ACM venues such as CCGrid, BigData, TCC, ICSOC, ICWE, and EuroPar.
