TU Berlin

DIMA Colloquium Schedule

Schedule for Winter Semester 2013/14

16:15, EN 719
Prof. Themis Palpanas, University of Paris V
"Enabling Exploratory Analysis on Very Large Scientific Data"

12:00, EN 719
Pei-Ling Hsu
"Constructing Semantic Relationships from Unstructured and Heterogeneous Web Data, and an Application of the Constructed Relationships"

16:15, EN 719
Jesus Camacho Rodriguez, INRIA
"PAXQuery: Efficient Parallel Processing of Complex XQuery"

16:15, EN 719
Andre Kelpe
"SELECT _ALL_ THE THINGS! Cascading Lingual - ANSI SQL for Apache Hadoop"

16:15, EN 719
Martin Klein, Los Alamos National Laboratory
"A not-at-all-random Walk Through the Digital Preservation Landscape"

16:15, EN 719
Holger Pirk, CWI Amsterdam
"Waste Not, Want Not - Efficient Co-Processing of Relational Data"

10:15
Tilmann Rabl, University of Toronto
"The Parallel Data Generation Framework"

Tilmann Rabl, University of Toronto


The Parallel Data Generation Framework


In many fields of research and business, ever-growing amounts of data are stored and processed. The pace of storage price drops and the discovery of methods for monetizing large-scale data analysis have come as a surprise to traditional database system vendors. This has led to the development of big data systems. Big data tasks are typically end-to-end problems, but due to the pace of development and the lack of standards, a plethora of different system components has been developed and an endless number of combinations is deployed. This makes comparing big data systems a hard task.

In his talk, Tilmann will present his work on data generation and big data benchmarking. The Parallel Data Generation Framework is a generic data generator for database and big data system benchmarking. It is highly scalable and completely parallel. It is used by the TPC for a new ETL benchmark and for the new big data benchmark BigBench. BigBench is an end-to-end benchmark for big data analytics. It comprises a set of queries that are specific to big data workloads and a data model that contains structured, semi-structured and unstructured data. A BigBench proof-of-concept system is currently implemented in Hive and Hadoop.
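
The abstract does not say how the generator achieves complete parallelism. One common way to get this property, sketched here as an assumption rather than as PDGF's actual design, is to derive every value purely from its coordinates (seed, table, column, row), so that any worker can produce any row without coordination. All names below (`SEED`, `cell_value`, `generate_parallel`) are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

SEED = 42  # hypothetical global seed; changing it yields a different data set


def cell_value(table, column, row, modulus=1000):
    """Derive a repeatable pseudo-random integer from the cell coordinates alone."""
    key = f"{SEED}/{table}/{column}/{row}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def generate_rows(table, columns, start, stop):
    """Generate rows start..stop-1; no state is shared with other row ranges."""
    return [tuple(cell_value(table, c, r) for c in columns)
            for r in range(start, stop)]


def generate_parallel(table, columns, n_rows, n_workers):
    """Split the row range into chunks and generate each chunk independently."""
    chunk = max(1, -(-n_rows // n_workers))  # ceiling division, at least 1
    ranges = [(s, min(s + chunk, n_rows)) for s in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda r: generate_rows(table, columns, *r), ranges)
        return [row for part in parts for row in part]
```

Because each value depends only on its coordinates, the output is identical for any number of workers. Threads are used here only to show that the chunks are independent; a real deployment would distribute the row ranges across processes or machines.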

Speaker's Bio:

Tilmann Rabl is a postdoctoral researcher at the Middleware Systems Research Group at the University of Toronto. His research focuses on big data storage management, new hardware for big data systems, big data analytics, database systems architecture and benchmarking. During his PhD studies, he developed the Parallel Data Generation Framework (PDGF), a generic data generator for benchmarking. For his work on data generation he received a Technical Contribution Award from the Transaction Processing Performance Council (TPC). PDGF is the basis of the data generator for a new TPC benchmark for data integration. In his doctoral research, Tilmann focused on data distribution in distributed databases. His doctoral thesis was nominated for the SPEC Distinguished Dissertation Award 2012 and received an honorable mention. Tilmann is a member of the steering committee of the Workshop on Big Data Benchmarking series and the Big Data Benchmarking Community.


Everybody is cordially welcome!

Holger Pirk, CWI Amsterdam


"Waste Not, Want Not - Efficient Co-Processing of Relational Data"


The variety of memory devices in modern computer systems holds
opportunities as well as challenges for data management systems. In
particular, the exploitation of Graphics Processing Units (GPUs) and
their fast memory has been studied quite intensively. However, current
approaches treat GPUs as systems in their own right and fail to
provide a generic strategy for efficient CPU/GPU cooperation. We
propose such a strategy for relational query processing: calculating
an approximate result based on lossily compressed, GPU-resident data
and refining the result using residuals, i.e., the lost data, on the
CPU.

To assess the potential of the approach, we developed a prototypical
implementation for spatial range selections. We found multiple orders
of magnitude performance improvement over a CPU-only implementation
even if the data size exceeds the available GPU memory. Encouraged by
these results, we developed the required algorithms and techniques to
implement the strategy in an existing in-memory DBMS and found up to
7 times performance improvement for selected TPC-H queries.
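
The approximate-and-refine strategy described above can be illustrated with a small range selection. This is a minimal sketch under stated assumptions, not the authors' implementation: the lossy compression is assumed to be a simple split of non-negative integers into high-order bits (the approximation, which would reside on the GPU) and low-order bits (the residual, which would stay in CPU memory). `RESIDUAL_BITS`, `decompose` and `range_select` are hypothetical names, and both phases run on the CPU here.

```python
RESIDUAL_BITS = 8  # hypothetical split: the low 8 bits form the residual


def decompose(values):
    """Split non-negative integers into an approximation and a residual."""
    mask = (1 << RESIDUAL_BITS) - 1
    approx = [v >> RESIDUAL_BITS for v in values]
    residual = [v & mask for v in values]
    return approx, residual


def range_select(approx, residual, lo, hi):
    """Return the positions i with lo <= value[i] <= hi."""
    lo_a, hi_a = lo >> RESIDUAL_BITS, hi >> RESIDUAL_BITS
    # Phase 1 (approximate): scan only the compressed data. The result is a
    # superset of the true answer.
    candidates = [i for i, a in enumerate(approx) if lo_a <= a <= hi_a]
    # Phase 2 (refine): consult residuals only where the approximation
    # cannot decide, i.e. in the two boundary buckets.
    result = []
    for i in candidates:
        a = approx[i]
        if lo_a < a < hi_a:
            result.append(i)  # bucket lies entirely inside [lo, hi]
        else:
            value = (a << RESIDUAL_BITS) | residual[i]
            if lo <= value <= hi:
                result.append(i)
    return result
```

The refinement step touches residuals only for values whose approximation falls into a boundary bucket, which is why the exact data does not need to fit into the faster memory.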

Speaker's Bio:

Holger is a PhD Candidate in the Database Architectures group at CWI
in Amsterdam with expected graduation in 2014. He received his
master's degree (Diplom) in computer science at Humboldt-Universität
zu Berlin in 2010. His research interests lie in analytical query
processing on memory-resident data. In particular, he studies storage
schemes and processing models for modern hardware.

Everybody is cordially welcome!

Martin Klein, Los Alamos National Laboratory


"A not-at-all-random Walk Through the Digital Preservation Landscape"


The dynamics of the Web archiving landscape are driven by a variety of factors. As recent developments at the WebCite service show, financial resources seem just as important as a sustainable business model. Also, ever-changing preservation requirements, for example for governmental websites, can dictate the selection of preservation approaches and the implementation of archiving software and tools.

In this talk I will discuss several Web archiving solutions implemented by the Research Library of the Los Alamos National Laboratory. This overview includes Memento, a framework that adds the time dimension to the HTTP protocol; Memento for Chrome, a newly developed client implementation; and SiteStory, a transactional archiving solution. I will motivate these different approaches to help understand their main fields of application and give a brief demonstration of their powers that enable time travel for the Web.
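
As a rough illustration of how Memento adds a time dimension to HTTP: a client asks a TimeGate for a past version of a resource by sending the desired date in an `Accept-Datetime` request header. The sketch below only builds such a request and does not send it. The TimeGate URL in the usage note is a made-up placeholder and `memento_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(timegate_url, when):
    """Build (but do not send) a request for the state of a resource at `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT because
    HTTP dates are always expressed in GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_url, headers={"Accept-Datetime": http_date})
```

For example, `memento_request("http://timegate.example.org/http://example.com/", datetime(2013, 1, 1, tzinfo=timezone.utc))` yields a request carrying the header value `Tue, 01 Jan 2013 00:00:00 GMT`. A TimeGate would typically answer with a redirect to the archived copy closest to that date.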

Speaker's Bio:

Martin Klein received his Diploma in Computer Science from the University of Applied Sciences Berlin (2002) and his Ph.D. in Computer Science from Old Dominion University (2011). From 2002 to 2005, he was a scientist at the University of Applied Sciences in Berlin conducting research in the realm of e-Learning and mobile computing. At Old Dominion University, he was part of the Web Science and Digital Libraries Research Group and a part-time lecturer in the Computer Science Department. He currently is a Postdoctoral Research Associate at the Research Library of the Los Alamos National Laboratory. His research interests include scholarly communication, digital preservation, temporal aspects of the web, and information retrieval and extraction.

Everybody is cordially welcome!

Jesus Camacho Rodriguez, INRIA

"PAXQuery: Efficient Parallel Processing of Complex XQuery"

Increasing volumes of data are being produced and exchanged over the Web, in particular in tree-structured formats such as XML or JSON. This leads to a need for highly scalable algorithms and tools for processing such data, capable of taking advantage of massively parallel processing frameworks.

This work considers the problem of efficiently parallelizing the execution of complex nested data processing, expressed in XQuery. We provide novel algorithms showing how to translate such queries into PACT, a recent framework generalizing MapReduce in particular by supporting many-input tasks. We present the first formal translation of complex XQuery algebraic expressions into PACT plans, and demonstrate experimentally the efficiency and scalability of our approach.

This is joint work with Dario Colazzo and Ioana Manolescu.

Jesús Camacho-Rodríguez is a PhD student in the LaHDAK group at Paris-Sud University and the OAK team at Inria Saclay. His research focuses on efficient techniques for large-scale Web data management, and his advisors are Dario Colazzo and Ioana Manolescu. Before starting his PhD, he spent two years as a research engineer at Inria Saclay, working on XML and RDF data management in peer-to-peer systems, specifically in the ViP2P platform. He received his Engineering Degree in Computer Science from the University of Almería, Spain.




Andre Kelpe


SELECT _ALL_ THE THINGS! Cascading Lingual - ANSI SQL for Apache Hadoop

In my talk, I am going to introduce Cascading Lingual (http://cascading.org/lingual), the ANSI SQL framework for Apache Hadoop, and how it relates to Cascading (http://cascading.org). I am going to show the design goals, the way they have been implemented, and why Cascading Lingual makes sense in today's big data world. We will explore the usage of the catalog, the shell, JDBC support and the data provider mechanism, which makes all your data sources available via SQL to be processed on your Hadoop cluster.

André is a general-purpose geek who works as a software engineer for Concurrent, Inc., the company behind Cascading and Lingual. In a former life he worked for TomTom Maps, where he introduced Hadoop, Giraph, Avro and ZooKeeper. He is one of the co-founders of bigdata.be, the Belgian Big Data community, and was involved in the Belgian hackerspace community for a long time. André has spoken at bigdata.be meetups, TomTom devdays 2012, Freedom not Fear Brussels, newline, bigdata beers Berlin and devoxx 2013.

Pei-Ling Hsu


Constructing Semantic Relationships from Unstructured and Heterogeneous Web Data, and an Application of the Constructed Relationships


To introduce my research interests to DIMA members, I will briefly present the research work “Constructing Semantic Relationships from Unstructured and Heterogeneous Web Data”. This work aims to automatically construct semantic relationships from heterogeneous user-generated data, such as query logs, social annotations, and Twitter. These heterogeneous data are integrated based on their common characteristics, and the semantic characteristics embedded in the data are used to construct various types of relationships. A follow-up research work that applies the constructed relationships is also introduced in this presentation.

Short Biography:

Pei-Ling Hsu is currently working toward a Ph.D. degree at the Institute of Information Systems and Applications, National Tsing Hua University. Her current research interests include data mining, web mining, semantic relationships, and ontologies. She is an exchange student at DIMA, TU Berlin.

Prof. Themis Palpanas, University of Paris V


Enabling Exploratory Analysis on Very Large Scientific Data


There is an increasingly pressing need, by several applications in
diverse domains, for developing techniques able to index and mine very
large collections of data series. Examples of such applications come
from astronomy, biology, the web, and other domains. It is not unusual
for these applications to involve numbers of data series on the order of
hundreds of millions to billions.

In this talk, we describe iSAX 2.0 and its improvements, iSAX 2.0
Clustered and iSAX2+, three methods designed for indexing and mining
truly massive collections of data series. We show that the main
bottleneck in mining such massive datasets is the time taken to build
the index, and we thus introduce a novel bulk loading mechanism, the
first of this kind specifically tailored to a data series index.
Furthermore, we observe that in several cases scientists, and data
analysts in general, need to issue a set of queries as soon as possible,
as a first exploratory step of the datasets. We also discuss extensions
of the above techniques that adaptively create data series indexes, and
at the same time are able to correctly answer user queries.

We show how our methods allow mining on datasets that would otherwise
be completely untenable, including the first published experiments to
index one billion data series, and experiments in mining massive data
from domains as diverse as entomology, DNA and web-scale image collections.
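
The abstract does not describe the representation that the iSAX family builds on. As background, the sketch below computes a plain SAX word for one data series: z-normalize, average over equal-length segments (PAA), then map each average to a symbol using breakpoints that cut the standard normal distribution into equiprobable regions. It assumes an alphabet of size 4 and a series length divisible by the number of segments. iSAX additionally allows a different cardinality per symbol, which is not shown here. The function name `sax_word` is hypothetical.

```python
import statistics
from bisect import bisect_right

# Approximate quartile boundaries of N(0, 1) for an alphabet of size 4.
BREAKPOINTS = [-0.67, 0.0, 0.67]


def sax_word(series, n_segments):
    """Return the SAX word of `series` as a list of symbols in 0..3.

    Assumes len(series) is divisible by n_segments.
    """
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    seg_len = len(z) // n_segments
    # Piecewise Aggregate Approximation: one mean per segment
    paa = [statistics.fmean(z[i * seg_len:(i + 1) * seg_len])
           for i in range(n_segments)]
    # Discretize each segment mean into a symbol
    return [bisect_right(BREAKPOINTS, p) for p in paa]
```

Two series with the same word are close in shape at this coarse resolution, which is what makes such words usable as index keys for very large collections.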


Themis Palpanas is a professor of computer science at the
University of Paris V - Paris Descartes, France. He received the BS
degree from the National Technical University of Athens, Greece,
and the MSc and PhD degrees from the University of Toronto, Canada.
He has previously held positions at IBM T.J. Watson Research Center
and the University of Trento. He has also been a Visiting Professor
at the National University of Singapore, worked for the University
of California, Riverside, and visited Microsoft Research and the
IBM Almaden Research Center. His interests include data management,
data analytics, streaming algorithms, and data series indexing.
His research solutions have been implemented in world-leading
commercial data management products and he is the author of eight US
patents, three of which are part of commercial products. He is the
recipient of three Best Paper awards, and the IBM Shared University
Research Award.
He is a founding member of the Event Processing Technical Society,
and is serving on the Editorial Advisory Board of the Information
Systems Journal and as an Associate Editor in the Journal of
Intelligent Data Analysis. He has served as General Chair for VLDB
2013, and on the program committees of several top database and
data mining conferences, and has been a member of the IBM Academy
of Technology Study on Event Processing.
His research has been funded by the 7th Framework Program (EU),
the European Institute of Innovation and Technology (EIT), the
Autonomous Province of Trento (Italy), the National Science
Foundation (USA), IBM Research, and Hewlett Packard Research Labs.
