
Talks DIMA Research Seminar

Talks WS12/13
04.12.2012, 16.00 c.t.
Cafe Campus, Villa Bel, Marchstraße 6 - 8, 10587 Berlin
Peter Van Roy, Université catholique de Louvain
"Designing robust and adaptive distributed systems with weakly interacting feedback structures"

13.12.2012, 16.00 c.t.
DIMA, EN 719
Peter Haas, IBM Almaden Research Center
"Splash: Managing Big Data for Composite Simulation Modeling"

17.12.2012, 16.00 c.t. (canceled)
DIMA, EN 719
Renée J. Miller, University of Toronto
"On Schema Discovery"

07.01.2013, 13.00 c.t.
MAR 6011
Slav Petrov, Google
"t.b.a."

01.02.2013, 14.00 c.t.
MA 042
Eric Sedlar, Oracle
"Specialize Technology but Generalize People"

04.02.2013, 16.00 c.t.
DIMA, EN 719
Dr. Markus Endres, Uni Augsburg
"Parallel Processing of Preference Database Queries on Multicore Architectures"

14.02.2013, 10.00 c.t.
DIMA, EN 719
Markus Holzemer, DHBW Mannheim
"Workload Characterization for Big Data Analytics"

14.02.2013, 14.00 - 18.00
MA 041, Str. des 17. Juni 136
Prof. Dr. Volker Markl, TU Berlin
"Big Data Analytics Day"

06.03.2013, 16.00 - 18.00
DIMA, EN 719
Paolo Missier, Newcastle University
"On the evolving role of workflow middleware to address the challenges of big data science"

11.03.2013, 16.00 - 18.00
DIMA, EN 719
Divy Agrawal, Department of Computer Science, University of California at Santa Barbara
"Managing Geo-replicated Data in Multi-Datacenters"

12.03.2013, 10.00 c.t.
DIMA, EN 719
Paul Larson, Senior Researcher, Microsoft Redmond
"Hekaton: SQL Server's Main-Memory OLTP Engine"

21.03.2013, 10.30 s.t.
DIMA, EN 719
Andras Garzo
"Web Spam Detection Techniques on Large Scale Web Collections"

Peter Van Roy

Université catholique de Louvain

Title:

"Designing robust and adaptive distributed systems with weakly interacting feedback structures"

Abstract:

Large-scale distributed systems on the Internet are subject to hostile environmental conditions, such as node failures, erratic communications, partitioning, and attacks, and to global problems, such as hotspots, multicast storms, chaotic behavior, and cascading failures. How can we build these systems so that they function predictably under such conditions while remaining easy to understand and maintain? In our work on building self-managing systems in the SELFMAN project, we have discovered a useful design pattern for building complex systems, namely as a set of Weakly Interacting Feedback Structures (WIFS). A feedback structure consists of a graph of interacting feedback loops that together maintain one global system property. We give examples of biological and computing systems that use WIFS, such as the human respiratory system and the TCP family of network protocols. We then show the usefulness of the design pattern by applying it to the Scalaris key/value store from SELFMAN. Scalaris is based on a structured peer-to-peer network with extensions for data replication and transactions. Scalaris achieves high performance: from 4000 to 14000 read-modify-write transactions per second on a cluster of 1 to 15 nodes, each containing two dual-core Intel Xeon processors at 2.66 GHz. Scalaris is a self-managing system that consists of five WIFS, for connectivity, routing, load balancing, replication, and transactions. We conclude by explaining why WIFS are an important design pattern for building complex systems, and we outline how to extend the WIFS approach to allow proving global properties of these systems.
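To make the pattern concrete, here is a minimal sketch (not from the talk; the names and the example property are illustrative) of a single feedback loop of the kind a WIFS composes: monitor a system property, compare it to a goal, and actuate a correction.

```python
class FeedbackLoop:
    """One feedback loop: observe a system property, compare it to a goal,
    and apply a corrective action. A WIFS composes several such loops that
    interact only weakly, e.g. through the shared system state."""

    def __init__(self, monitor, goal, correct):
        self.monitor = monitor  # () -> measured value
        self.goal = goal        # target value
        self.correct = correct  # (error) -> None, actuates the system

    def step(self):
        error = self.goal - self.monitor()
        if error != 0:
            self.correct(error)

# Illustrative property: keep the replication degree at 3.
state = {"replicas": 1}
loop = FeedbackLoop(
    monitor=lambda: state["replicas"],
    goal=3,
    correct=lambda err: state.update(replicas=state["replicas"] + (1 if err > 0 else -1)),
)
for _ in range(5):
    loop.step()
print(state)  # {'replicas': 3}
```

In Scalaris, five such structures run concurrently (connectivity, routing, load balancing, replication, transactions), each maintaining its own global property.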

Bio:

Peter Van Roy is professor of Computing Science and Engineering at the Université catholique de Louvain in Louvain-la-Neuve, Belgium. He received M.S. and Ph.D. degrees in Computer Science from the University of California at Berkeley and a French Habilitation à Diriger des Recherches from the Université Paris Diderot. He is coauthor of the book "Concepts, Techniques, and Models of Computer Programming", and his research group hosts the Mozart Programming System. He initiated and coordinated the European SELFMAN project (2006-2009) on building self-managing systems with structured overlay networks and components. He is currently interested in programming models for large-scale systems based on Conflict-Free Replicated Data Types.

 

Everybody is cordially welcome!

Please forward this invitation to interested colleagues.

Best regards,
Alexander Borusan

Peter Haas, IBM Almaden Research Center

TITLE:

Splash: Managing Big Data for Composite Simulation Modeling

ABSTRACT:

The database community has raised the art of scalable DESCRIPTIVE analytics to a very high level. What enterprises really need, however, is PRESCRIPTIVE analytics to identify robust, high-quality investment, planning, and policy decisions in the face of uncertainty. Such analytics, in turn, require deep PREDICTIVE analytics that go beyond mere statistical forecasting and are imbued with an understanding of the fundamental mechanisms that govern a system's behavior, allowing what-if analyses. To help meet this need, IBM's Splash research prototype provides a platform for composing simulation models and datasets for cross-disciplinary modeling, simulation, and optimization in complex systems of systems, such as those affecting population health and safety. Splash---which rests on a combination of data-integration, workflow management, simulation, and optimization technologies---loosely couples models via data exchange, unlike prior composite-simulation approaches. We outline the key components of Splash and then focus on some novel MapReduce algorithms for transforming the outputs of one or more "upstream" models to create the inputs needed by a "downstream" model. In particular, we describe in detail Splash's time-alignment component, which detects and corrects for mismatches in time granularity between model outputs and inputs. We conclude by discussing some open research topics: Splash's model-and-data orientation requires significant extensions of many database technologies, such as data integration, query optimization and processing, and collaborative analytics.
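The abstract describes the time-alignment component only at a high level; the following sketch (my own illustration, with an assumed step-function interpolation policy) shows the kind of granularity conversion involved, aligning a monthly upstream series to the daily granularity a downstream model expects.

```python
from datetime import date, timedelta

def align_to_daily(monthly):
    """Naively align a monthly series to daily granularity by holding each
    monthly value constant over its month (step-function interpolation).
    Splash's actual component detects the granularity mismatch and chooses
    and validates the transformation; this only illustrates the conversion."""
    daily = {}
    for first_day, value in monthly:
        d = first_day
        while d.month == first_day.month and d.year == first_day.year:
            daily[d] = value
            d += timedelta(days=1)
    return daily

monthly = [(date(2013, 1, 1), 10.0), (date(2013, 2, 1), 12.0)]
daily = align_to_daily(monthly)
print(len(daily), daily[date(2013, 1, 15)])  # 59 10.0
```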

SPEAKER BIO:

Peter J. Haas received a PhD in Operations Research in 1985 from Stanford University and has been a Research Staff Member at the IBM Almaden Research Center since 1987. Recently designated an IBM Master Inventor, he holds patents spanning query optimization, autonomic computing, database sampling, data mining, and query processing for uncertain data. He is also a Consulting Professor in the Department of Management Science and Engineering at Stanford University, teaching and pursuing research in stochastic modeling and simulation. He was President of the INFORMS Simulation Society from 2010 to 2012 and, in 2003, received its Outstanding Simulation Publication Award for his monograph on stochastic Petri nets. Other awards include the 2007 SIGMOD Test-of-Time Award for his work on sampling-based exploration of massive databases and the 2011 NIPS Big Learning Workshop Best Paper Award for his work on scalable matrix factorization. He is a five-time recipient of the IBM Research Pat Goldberg Memorial Best Paper Award, an IBM record. Many of his techniques have been incorporated into IBM's DB2 UDB database product, leading to IBM Research Division and IBM Outstanding Technical Achievement awards. He is an Associate Editor for Operations Research and The VLDB Journal, and an Area Editor for ACM Transactions on Modeling and Computer Simulation. He is the author of over 100 conference publications, journal articles, and books.

Everybody is cordially welcome!

Please forward this invitation to interested colleagues.

Renée J. Miller, University of Toronto


Title:

On Schema Discovery

Abstract:

Structured data is distinguished from unstructured data by the presence of a schema describing the logical structure and semantics of the data. The schema is the means through which we understand and query the underlying data. Schemas enable data independence. Data Integration can be viewed as the task of discovering the schema information necessary to align and reason about multiple databases as a whole.  In this talk, I consider a few problems related to the discovery and maintenance of schemas and related foundational work on data exchange.
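As a toy illustration of one ingredient of schema discovery (not Miller's method, just a common baseline), attributes of two schemas can be aligned by name similarity before deeper, instance-based evidence is brought in:

```python
def ngrams(s, n=3):
    """Character n-grams of a lowercased attribute name."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b):
    """Jaccard similarity over character trigrams, a common baseline
    for name-based attribute matching."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

source = ["cust_name", "cust_addr", "phone"]
target = ["customer_name", "customer_address", "telephone"]

for s in source:
    best = max(target, key=lambda t: similarity(s, t))
    print(s, "->", best, round(similarity(s, best), 2))
```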

Bio:

Renée J. Miller received BS degrees in Mathematics and in Cognitive Science from the Massachusetts Institute of Technology (MIT). She received her MS and PhD degrees in Computer Science from the University of Wisconsin in Madison, WI. She received the Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received the National Science Foundation Early Career Award (formerly, the Presidential Young Investigator Award) for her work on data integration. She is a Fellow of the ACM, the President of the VLDB Endowment, and was the Program Chair for ACM SIGMOD 2011 in Athens, Greece. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans data integration, data exchange, knowledge curation, and data sharing. She is a Professor and the Bell Canada Chair of Information Systems at the University of Toronto. In 2011, she was elected to the Fellowship of the Royal Society of Canada (FRSC), Canada's national academy.

 

Everybody is cordially welcome!

Please forward this invitation to interested colleagues.

Best regards,
Alexander Borusan

Dr. Markus Endres, Universität Augsburg

Title:

Parallel Processing of Preference Database Queries on Multicore Architectures (Parallele Verarbeitung von Präferenz-Datenbankanfragen auf Multicore-Architekturen)

 

Abstract:

The concept of preference queries against databases has emerged as a fundamental building block for the personalization of search engines. It draws on the full spectrum of techniques, ranging from preference theory and modeling through algebraic and cost-based optimization to efficient evaluation algorithms.

Furthermore, recent years have seen a paradigm shift from single-core to multicore processor architectures. This trend will continue over the coming years, and processors with up to 100 cores will be no rarity. Current preference-evaluation methods do not do justice to this shift: nearly all processing pipelines for preference queries are based on sequential execution models and cannot make use of parallel architectures.

The talk gives an introduction to preferences and to existing (sequential) evaluation processes, and explains their relationship to multicore architectures. It then makes the need for parallelization concrete. Finally, the challenges of this research field are discussed and first solution approaches are outlined.
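As a rough illustration of the parallelization opportunity (my own sketch, not from the talk), a Pareto-preference (skyline) query can be evaluated by partitioning the input across cores, computing local skylines, and merging:

```python
from multiprocessing import Pool

def dominates(a, b):
    """a dominates b if a is at least as good in every dimension and
    strictly better in at least one (here: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def local_skyline(points):
    """Sequential block-nested-loops skyline over one partition."""
    sky = []
    for p in points:
        if not any(dominates(q, p) for q in sky):
            sky = [q for q in sky if not dominates(p, q)]
            sky.append(p)
    return sky

def parallel_skyline(points, workers=4):
    chunks = [points[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(local_skyline, chunks)
    # Merge step: the global skyline is the skyline of the union of partials.
    return local_skyline([p for part in partials for p in part])

if __name__ == "__main__":
    data = [(3, 4), (1, 5), (2, 2), (4, 1), (5, 5), (2, 3)]
    print(parallel_skyline(data, workers=2))  # [(2, 2), (1, 5), (4, 1)]
```

The merge is correct because any globally undominated point is also undominated within its own partition.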



Please see the attachment.

Everybody is cordially welcome!

Please forward this invitation to interested colleagues.

Best regards,

Eric Sedlar, Vice President & Technical Director, Oracle Labs

Title:

Specialize Technology but Generalize People

ABSTRACT:

We are finding that specialized hardware like GPUs, or rack-integrated cluster appliances such as Oracle Exadata, can provide one or more orders of magnitude better performance per watt than commodity hardware. Similarly, applications that target distributed, customized hardware platforms and exploit parallelism will benefit from higher-level, domain-specific programming languages, which allow for more optimizations. But while the programming languages and hardware running in data centers in the next decade will be more specialized, the expertise required to build such systems is becoming more cross-functional: areas like HW/SW co-design and language runtimes require optimization tradeoffs that span multiple research domains. This talk illustrates the trend by examining various research projects in Oracle Labs.

Markus Holzemer, DHBW Mannheim

Title:

Workload Characterization for Big Data Analytics

ABSTRACT:

Today, companies are trying to make use of ever larger amounts of data to gain insights that support their decision-making. The most popular open-source tool for analyzing huge amounts of data is the MapReduce framework and its implementation in Apache Hadoop. The scripting language Jaql is built on top of Hadoop and enables a higher level of coding, with the aim of making query building for Big Data as easy as SQL makes it for traditional databases.
This paper focuses on analyzing and characterizing full end-to-end workloads in Jaql to gain insights about their performance and efficiency and to determine where future development of Jaql should focus.
For the analysis, two log-analytics workloads are examined that run well on small amounts of data but fail on larger ones. The paper then highlights four major optimizations that were applied to eliminate sequential parts of the scripts and reduce execution time: fixing Jaql rewrite errors, vectorization, joining in-memory data, and dealing with sequential functions.

The characterization reveals that sequential parts of a script have a very high influence on overall efficiency and execution time. Moreover, it uncovers misbehavior of the Jaql partitioner and proposes a way to improve the efficiency of some scripts by exploiting Hadoop's secondary sort.
Many optimizations that must be applied manually in the current version of Jaql can be automated once they have been discovered. In this context, the paper presents the implementation of a new Jaql rewrite rule, the JoinToMemjoin rewrite, which fixes the issue of joining in-memory data in Jaql and drastically improves the execution plan of some Jaql scripts.
Moreover, the experiments show that fully parallelizing a complex workload is not an easy task in the current version of Jaql. The recommendation is therefore to focus further development on the Jaql rewrite engine and on the parallel executability of all scripts before widening the range of functions and improving execution speed.
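The JoinToMemjoin rewrite itself is internal to Jaql, but the underlying idea is a standard one: when one join input fits in memory, build a hash table on it and stream the other side past it, instead of shuffling both sides through a MapReduce repartition join. A minimal sketch (illustrative names, not Jaql code):

```python
def mem_join(small, big, key):
    """Hash join: build a hash table on the small (in-memory) side, then
    stream the big side past it. This avoids shuffling the big side, which
    is what a MapReduce repartition join would do."""
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    for row in big:
        for match in table.get(row[key], []):
            yield {**match, **row}

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big = [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 1, "x": 30}]
print(list(mem_join(small, big, "id")))
```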

BIO:

Markus Holzemer is currently in his fifth semester of the Wirtschaftsinformatik (business informatics) program at DHBW Mannheim. On the strength of his excellent academic record, his cooperative-education partner IBM enabled him to spend two months this summer on a research stay at the Almaden Research Center in San José.

Everybody is cordially welcome!

Please forward this invitation to interested colleagues.

Prof. Dr. Volker Markl, TU Berlin

Dear Sir or Madam,

On February 14, TU Berlin will host an afternoon on the topic of "Big Data Analytics", combined with the handover of an IBM Faculty Award and a Shared University Research Grant to TU Berlin's database systems group (Fachgebiet Datenbanksysteme), headed by Prof. Dr. Volker Markl.

 

On that day, speakers from IBM and TU Berlin will present current research and discuss aspects of education in the field of "Big Data Analytics".

 

We cordially invite you to the event:

 

TU Berlin, Building MA, Room MA 041, Str. des 17. Juni 136, 10623 Berlin

 

14:00 Greeting by the VP / Dean of TUB / EECS, Dean of Studies / EECS

14:15 Declarative Analytics of Big Data: Systems and Applications (Shivakumar Vaithyanathan, IBM Research Almaden)

15:15 Coffee and cake

15:45 IBM Faculty Award and SUR Grant Handover (Martin Mähler and Udo Hertz, IBM Deutschland)

16:00 Big Data Analytics – Research and Teaching (Volker Markl)

16:30 Big Data Analytics and New Hardware Architectures (Max Heimel)

17:00 New Programming Models for Big Data Analytics (Stephan Ewen)

17:30 Analyzing and Mining Big Text Data (Alan Akbik)

 

Best regards,

Alexander Borusan

Paul Larson, Microsoft Research

Title:

Hekaton: SQL Server’s Main-Memory OLTP Engine

Abstract: Hekaton is a new database engine optimized for memory-resident data and OLTP workloads. Hekaton is fully integrated into SQL Server; it is not a separate product. To take advantage of Hekaton, a user simply declares a table memory-optimized. Hekaton tables are fully transactional and durable and are accessed using T-SQL in the same way as regular SQL Server tables. T-SQL stored procedures that reference only Hekaton tables can be compiled into machine code for further performance improvements. The engine is designed for high concurrency and uses only latch-free data structures and a new optimistic, multi-version concurrency control technique. The talk will give an overview of the design and capabilities of the Hekaton engine, including some performance results.
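As a rough illustration of the multi-versioning idea (a toy sketch, not SQL Server's actual implementation), each write can create a new timestamped version while readers see only versions that committed at or before their read timestamp:

```python
import itertools

class VersionedRow:
    """Simplified multi-versioning in the spirit of Hekaton's MVCC: each
    write creates a new version stamped with the writer's commit timestamp;
    a reader sees the latest version with a begin timestamp at or before
    its read timestamp. (The real scheme also validates at commit time and
    uses latch-free structures; this only shows the versioning.)"""

    def __init__(self):
        self.versions = []  # list of (begin_ts, value), in commit order

    def read(self, ts):
        visible = [v for begin, v in self.versions if begin <= ts]
        return visible[-1] if visible else None

    def write(self, ts, value):
        self.versions.append((ts, value))

clock = itertools.count(1)
row = VersionedRow()

t1 = next(clock); row.write(t1, "v1")  # committed at ts 1
t2 = next(clock)                        # a reader with read timestamp 2
t3 = next(clock); row.write(t3, "v2")  # committed at ts 3

print(row.read(t2))          # 'v1' -- the reader never sees the later version
print(row.read(next(clock)))  # 'v2'
```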

 

Bio: Paul (Per-Ake) Larson has conducted research in the database field for over 30 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and joined Microsoft Research in 1996 where he is a Principal Researcher. Paul has worked in a variety of areas: file structures, materialized views, query processing, and query optimization among others. During the last few years he has collaborated closely with the SQL Server team on how to evolve the architecture of the core database system.

Divy Agrawal, Department of Computer Science, University of California at Santa Barbara

Title:

Managing Geo-replicated Data in Multi-Datacenters

Abstract:

Over the past few years, cloud computing and the growth of global large-scale computing systems have led to applications that require data management across multiple datacenters. Early models provided single-row transactions with eventual consistency. Although protocols based on these models provide high availability, they are not ideal for applications that need a consistent view of the data. There has since been a gradual shift toward transactions with strong consistency, as in Google's Megastore and Spanner. In this talk, we start by exploring the overall design space for architecting replica-management protocols for geo-replicated data and present an abstract framework to verify the correctness of existing and new protocols for geo-replication. We then propose three radically different protocols for providing full transactional execution over replicated data in multi-datacenter environments. First, an extension of Megastore is presented, which uses optimistic concurrency control. Next, we suggest a protocol that reduces the number of cross-datacenter messages compared to current approaches. Finally, a contrasting method is put forward, which uses a gossip-based protocol to provide distributed transactions across datacenters. We conclude the talk with a discussion of the trade-offs offered by the proposed protocols as well as by existing protocols such as Google's Megastore, Google's Spanner, and UC Berkeley's geo-replication protocol, MDCC.
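As background for the design space the talk explores, here is a toy majority-quorum register (my own sketch; real geo-replication protocols such as Megastore, Spanner, and MDCC add consensus, transactions, and failure handling). The key invariant is that any write quorum and any read quorum overlap, so a read always contacts at least one replica holding the latest committed value.

```python
def majority(n):
    return n // 2 + 1

class GeoReplicatedRegister:
    """Toy majority-quorum register over datacenter replicas."""

    def __init__(self, datacenters):
        # Each datacenter holds a (version, value) pair.
        self.replicas = {dc: (0, None) for dc in datacenters}

    def write(self, value):
        version = max(v for v, _ in self.replicas.values()) + 1
        # A real protocol proceeds once a quorum acknowledges; stragglers
        # catch up later. Here we simply stop after a quorum of writes.
        for i, dc in enumerate(self.replicas, start=1):
            self.replicas[dc] = (version, value)
            if i >= majority(len(self.replicas)):
                break

    def read(self):
        # Contact any majority and return the highest-versioned value;
        # the overlap with the write quorum guarantees it is the latest.
        quorum = list(self.replicas.values())[:majority(len(self.replicas))]
        return max(quorum, key=lambda vv: vv[0])[1]

r = GeoReplicatedRegister(["us-west", "us-east", "eu"])
r.write("x = 1")
print(r.read())  # 'x = 1'
```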

 

Speaker Biography: 

Dr. Divyakant Agrawal is a Professor of Computer Science and the Director of Engineering Computing Infrastructure at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. From January 2006 through December 2007, Dr. Agrawal served as VP of Data Solutions and Advertising Systems at the Internet search company ASK.com. Dr. Agrawal also served as a Visiting Senior Research Scientist at the NEC Laboratories of America in Cupertino, CA from 1997 to 2009. During his professional career, Dr. Agrawal has served on numerous program committees of international conferences, symposia, and workshops, and served as an editor of the journal Distributed and Parallel Databases (1993-2008) and of The VLDB Journal (2003-2008). He currently serves as the Editor-in-Chief of Distributed and Parallel Databases and is on the editorial boards of ACM Transactions on Database Systems and IEEE Transactions on Knowledge and Data Engineering. He has recently been elected to the Board of Trustees of the VLDB Endowment and to the Executive Committee of the ACM Special Interest Group SIGSPATIAL. Dr. Agrawal's research philosophy is to develop data management solutions that are theoretically sound and relevant in practice. He has published more than 320 research manuscripts in prestigious forums (journals, conferences, symposia, and workshops) on a wide range of topics related to data management and distributed systems and has advised more than 35 doctoral students during his academic career. He received the 2011 Outstanding Graduate Mentor Award from the Academic Senate at UC Santa Barbara. Dr. Agrawal was recognized as an Association for Computing Machinery (ACM) Distinguished Scientist in 2010 and was inducted as an ACM Fellow in 2012. He was also inducted as a Fellow of the IEEE in 2012. His current interests are in the areas of scalable data management and data analysis in cloud computing environments, security and privacy of data in the cloud, and scalable analytics over social-network data and social media.

Paolo Missier, Newcastle University

TITLE:

On the evolving role of workflow middleware to address the challenges of big data science

ABSTRACT: 

The use of workflow has slowly been gaining popularity as a high-level programming model for e-science. Some of the reasons for its limited uptake by the scientific community have been analyzed in a recent study [1]. These tie in with the ongoing debate on the limits and opportunities of data and method sharing in science, and are addressed by evolving requirements for workflow repositories.

From a computing perspective, on the other hand, workflow technology has all the appeal (and perhaps the limitations) of middleware: a potential for exploiting parallelism in the data in a way that is transparent to users, portability over multiple computing infrastructures, and ancillary services such as capturing and querying execution histories (provenance).

In this talk I will briefly discuss these capabilities, using two workflow management systems, Taverna and e-Science Central, and a case study in Chemical Engineering as a reference. Taverna has been used successfully for the past ten years, mainly in bioinformatics, while the more recent e-Science Central has been positioning itself as a tool for big data analytics that offers good scalability over a cloud infrastructure.
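One of the middleware services mentioned above, provenance capture, can be illustrated with a toy decorator that records each step's inputs and outputs (my own sketch; systems like Taverna and e-Science Central capture far richer provenance graphs, e.g. following the W3C PROV model):

```python
import functools
import time

PROVENANCE = []  # append-only execution history

def provenance(step):
    """Record each workflow step's inputs, output, and timestamp: a toy
    version of the execution histories that workflow systems capture
    automatically and make queryable."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        PROVENANCE.append({
            "step": step.__name__,
            "inputs": (args, kwargs),
            "output": result,
            "at": time.time(),
        })
        return result
    return wrapper

@provenance
def normalize(xs):
    m = max(xs)
    return [x / m for x in xs]

@provenance
def mean(xs):
    return sum(xs) / len(xs)

print(mean(normalize([2.0, 4.0, 8.0])))
for record in PROVENANCE:
    print(record["step"], "->", record["output"])
```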

 

BIO: 

Dr. Paolo Missier is a Lecturer in Information and Knowledge Management with the School of Computing Science, Newcastle University, UK.
His current research interests include models and architectures for the management and exploitation of data provenance, specifically extensions of e-infrastructures for scientific provenance.
He is the PI of the EPSRC-funded project "Trusted Dynamic Coalitions", investigating provenance-based policies for information exchanges amongst partners in the presence of limited trust.
Paolo has been contributing to two Provenance-focused Working Groups: he is a member of the W3C Working Group on Provenance on the Web,  and he is co-lead of the Provenance Working Group of the NSF-funded data preservation project, DataONE.

A complete list of past and current professional activities is available on his page: http://www.ncl.ac.uk/computing/people/profile/Paolo.Missier and associated site: https://sites.google.com/site/paolomissier/
Paolo holds a Ph.D. in Computer Science from the University of Manchester, UK (2007), an M.Sc. in Computer Science from the University of Houston, TX, USA (1993), and a B.Sc. and M.Sc. in Computer Science from the Università di Udine, Italy (1990).

 

Everybody is cordially welcome!
