Talks DIMA Research Seminar
Talk/Location | Lecturer/Subject
---|---
10.02.2014 4.15 pm DIMA EN 719 | Prof. Themis Palpanas, University of Paris V: "Enabling Exploratory Analysis on Very Large Scientific Data"
23.01.2014 12 am DIMA EN 719 | Pei-Ling Hsu: "Constructing Semantic Relationships from Unstructured and Heterogeneous Web Data, and an Application of the Constructed Relationships"
20.01.2014 4.00 pm DIMA EN 719 | Jesus Camacho Rodriguez, INRIA: "PAXQuery: Efficient Parallel Processing of Complex XQuery"
13.01.2014 4.00 pm DIMA EN 719 | Andre Kelpe: "SELECT _ALL_ THE THINGS! Cascading Lingual - ANSI SQL for Apache Hadoop"
11.11.2013 4.00 pm DIMA EN 719 | Martin Klein, Los Alamos National Laboratory: "A not-at-all-random Walk Through the Digital Preservation Landscape"
04.11.2013 4.00 pm DIMA EN 719 | Holger Pirk, CWI Amsterdam: "Waste Not, Want Not - Efficient Co-Processing of Relational Data"
15.10.2013 10.15 am DIMA EN 732 | Tilmann Rabl, University of Toronto: "The Parallel Data Generation Framework"
Tilmann Rabl, University of Toronto
TITLE:
The Parallel Data Generation Framework
ABSTRACT:
In many fields of research and business, ever-growing amounts of data are stored and processed. The pace at which storage prices have dropped and methods for monetizing large-scale data analysis have been discovered came as a surprise to traditional database system vendors and has led to the development of big data systems. Big data tasks are typically end-to-end problems, but due to the pace of development and the lack of standards, a plethora of different system components has been developed and countless combinations of them are deployed. This makes comparing big data systems a hard task.
In his talk, Tilmann will present his work on data generation and big data benchmarking. The Parallel Data Generation Framework is a generic data generator for database and big data system benchmarking. It is highly scalable and completely parallel. It is used by the TPC for a new ETL benchmark and for the new big data benchmark BigBench, an end-to-end benchmark for big data analytics. BigBench comprises a set of queries that are specific to big data workloads and a data model that contains structured, semi-structured, and unstructured data. A BigBench proof-of-concept system is currently being implemented in Hive and Hadoop.
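The core idea that makes PDGF completely parallel is that every field value is derived deterministically from seeds, so any worker can generate any row without coordination or shared state. A minimal Python sketch of that principle (the function names, the toy "customer" schema, and the hash-based seeding scheme are illustrative assumptions, not PDGF's actual implementation):

```python
import hashlib
import random

def cell_rng(master_seed: int, table: str, column: str, row: int) -> random.Random:
    """Derive an independent, repeatable RNG for one cell from its coordinates."""
    key = f"{master_seed}/{table}/{column}/{row}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)

def gen_customer_row(master_seed: int, row: int) -> dict:
    """Generate one row of a toy 'customer' table; any worker can do this for any row."""
    age = cell_rng(master_seed, "customer", "age", row).randint(18, 90)
    balance = round(cell_rng(master_seed, "customer", "balance", row).uniform(0, 1e4), 2)
    return {"id": row, "age": age, "balance": balance}

# Workers can split the row range arbitrarily and still produce identical data:
chunk_a = [gen_customer_row(42, r) for r in range(5)]
chunk_b = [gen_customer_row(42, r) for r in range(5)]
assert chunk_a == chunk_b  # fully repeatable, no shared state
```

Because the seed of each cell depends only on its coordinates, regenerating a single row is as cheap as generating it the first time, which is what enables both scale-out and repeatability.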
Speaker's Bio:
Tilmann Rabl is a postdoctoral researcher at the Middleware Systems Research Group at the University of Toronto. His research focuses on big data storage management, new hardware for big data systems, big data analytics, database systems architecture, and benchmarking. During his PhD studies, he developed the Parallel Data Generation Framework (PDGF), a generic data generator for benchmarking. For his work on data generation, he received a Technical Contribution Award from the Transaction Processing Performance Council (TPC). PDGF is the basis of the data generator for a new TPC benchmark for data integration. In his doctoral research, Tilmann focused on data distribution in distributed databases. His doctoral thesis was nominated for the SPEC Distinguished Dissertation Award 2012 and received an honorable mention. Tilmann is a member of the steering committee of the Workshop on Big Data Benchmarking series and the Big Data Benchmarking Community.
Everybody is cordially welcome!
Holger Pirk, CWI Amsterdam
TITLE:
"Waste Not, Want Not - Efficient Co-Processing of Relational Data"
ABSTRACT:
The variety of memory devices in modern computer systems holds opportunities as well as challenges for data management systems. In particular, the exploitation of Graphics Processing Units (GPUs) and their fast memory has been studied quite intensively. However, current approaches treat GPUs as systems in their own right and fail to provide a generic strategy for efficient CPU/GPU cooperation. We propose such a strategy for relational query processing: calculating an approximate result based on lossily compressed, GPU-resident data and refining the result using residuals, i.e., the lost data, on the CPU.
To assess the potential of the approach, we developed a prototypical implementation for spatial range selections. We found multiple orders of magnitude performance improvement over a CPU-only implementation, even if the data size exceeds the available GPU memory. Encouraged by these results, we developed the algorithms and techniques required to implement the strategy in an existing in-memory DBMS and found up to 7 times performance improvement for selected TPC-H queries.
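The split between lossily compressed data and residuals can be illustrated with a one-dimensional range selection. The Python sketch below is a toy model of the idea only: the bit-split compression scheme and all names are assumptions for illustration, and the coarse filter runs on the CPU here rather than on GPU-resident columns as in the actual system.

```python
# Each value v is split into a coarse part (the high bits, GPU-resident in the
# paper's setting) and a residual (the low bits, kept on the CPU). The range
# predicate is first evaluated on the coarse part alone; only values in the
# boundary buckets need their residuals for an exact decision.
SHIFT = 8  # assumption: drop the low 8 bits in the lossy representation

def compress(values):
    coarse = [v >> SHIFT for v in values]                 # lossy, fits fast memory
    residuals = [v & ((1 << SHIFT) - 1) for v in values]  # the "lost" data
    return coarse, residuals

def range_select(values, lo, hi):
    """Return indices i with lo <= values[i] <= hi via coarse filter + refinement."""
    coarse, residuals = compress(values)
    lo_c, hi_c = lo >> SHIFT, hi >> SHIFT
    hits, candidates = [], []
    for i, c in enumerate(coarse):
        if lo_c < c < hi_c:
            hits.append(i)           # bucket lies strictly inside: definite hit
        elif c == lo_c or c == hi_c:
            candidates.append(i)     # boundary bucket: needs the exact value
    # Refinement: reconstruct exact values only for the boundary candidates.
    for i in candidates:
        v = (coarse[i] << SHIFT) | residuals[i]
        if lo <= v <= hi:
            hits.append(i)
    return sorted(hits)

data = [100, 300, 5000, 260, 270, 1023]
assert range_select(data, 256, 1023) == [1, 3, 4, 5]
```

The payoff of this design is that the expensive exact check touches only the boundary buckets, so most tuples are accepted or rejected from the compressed representation alone.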
Speaker's Bio:
Holger is a PhD candidate in the Database Architectures group at CWI in Amsterdam, with expected graduation in 2014. He received his master's degree (Diplom) in computer science at Humboldt-Universität zu Berlin in 2010. His research interests lie in analytical query processing on memory-resident data. In particular, he studies storage schemes and processing models for modern hardware.
Everybody is cordially welcome!
Martin Klein, Los Alamos National Laboratory
TITLE:
"A not-at-all-random Walk Through the Digital Preservation Landscape"
ABSTRACT:
The dynamics of the Web archiving landscape are driven by a variety of factors. As recent developments at the WebCite service show, financial resources seem just as important as a sustainable business model. Also, ever-changing preservation requirements, for example for governmental websites, can dictate the selection of preservation approaches and the implementation of archiving software and tools.
In this talk I will discuss several Web archiving solutions implemented by the Research Library of the Los Alamos National Laboratory. This overview includes Memento, a framework that adds the time dimension to the HTTP protocol; the introduction of Memento for Chrome, a newly developed client implementation; and SiteStory, a transactional archiving solution. I will motivate these different approaches to help understand their main fields of application and give a brief demonstration of the capabilities that enable time travel for the Web.
Speaker's Bio:
Martin Klein received his Diploma in Computer Science from the University of Applied Sciences Berlin (2002) and his Ph.D. in Computer Science from Old Dominion University (2011). From 2002 to 2005, he was a scientist at the University of Applied Sciences in Berlin, conducting research in the realm of e-Learning and mobile computing. At Old Dominion University, he was part of the Web Science and Digital Libraries Research Group and a part-time lecturer in the Computer Science Department. He is currently a Postdoctoral Research Associate at the Research Library of the Los Alamos National Laboratory. His research interests include scholarly communication, digital preservation, temporal aspects of the web, and information retrieval and extraction.
For more information see:
http://www.cs.odu.edu/~mklein/
Everybody is cordially welcome!
Jesus Camacho Rodriguez, INRIA
"PAXQuery: Efficient Parallel Processing of Complex XQuery"
Abstract:
Increasing volumes of data are being produced and exchanged over the Web, in particular in tree-structured formats such as XML or JSON. This leads to a need for highly scalable algorithms and tools for processing such data, capable of taking advantage of massively parallel processing frameworks.
This work considers the problem of efficiently parallelizing the execution of complex nested data processing, expressed in XQuery. We provide novel algorithms showing how to translate such queries into PACT, a recent framework generalizing MapReduce, in particular by supporting many-input tasks. We present the first formal translation of complex XQuery algebraic expressions into PACT plans, and demonstrate experimentally the efficiency and scalability of our approach.
This is joint work with Dario Colazzo and Ioana Manolescu.
Bio:
Jesús Camacho-Rodríguez is a PhD student in the LaHDAK group at Paris-Sud University and the OAK team at Inria Saclay. His research focuses on efficient techniques for large-scale Web data management, and his advisors are Dario Colazzo and Ioana Manolescu. Before starting his PhD, he spent two years as a research engineer at Inria Saclay, working on XML and RDF data management in peer-to-peer systems, specifically in the ViP2P platform. He received his Engineering Degree in Computer Science from the University of Almería, Spain.
Andre Kelpe
Title:
SELECT _ALL_ THE THINGS! Cascading Lingual - ANSI SQL for
Apache Hadoop
Abstract:
In my talk, I am going to introduce Cascading Lingual (http://cascading.org/lingual), the ANSI SQL framework for Apache Hadoop, and how it relates to Cascading (http://cascading.org). I am going to show the design goals, the way they have been implemented, and why Cascading Lingual makes sense in today's big data world. We will explore the usage of the catalog, the shell, JDBC support, and the data provider mechanism, which makes all your data sources available via SQL to be processed on your Hadoop cluster.
Bio:
André is a general-purpose geek who works as a software engineer for Concurrent Inc, the company behind Cascading and Lingual. In a former life he worked for TomTom Maps, where he introduced Hadoop, Giraph, Avro, and ZooKeeper. He is one of the co-founders of bigdata.be, the Belgian Big Data community, and was for a long time involved in the Belgian hackerspace community. André has spoken at bigdata.be meetups, TomTom devdays 2012, Freedom not Fear Brussels, newline, Big Data Beers Berlin, and Devoxx 2013.
Pei-Ling Hsu
Title:
Constructing Semantic Relationships from Unstructured and
Heterogeneous Web Data, and an Application of the Constructed
Relationships
Abstract:
To introduce my research interests to DIMA members, the research work "Constructing Semantic Relationships from Unstructured and Heterogeneous Web Data" is briefly presented. This research work aims to automatically construct semantic relationships from heterogeneous user-generated data, such as query logs, social annotations, and Twitter. These heterogeneous data are integrated based on their common characteristics. Embedded semantic characteristics of the data are considered to construct various types of relationships. An application of the constructed relationships is the subsequent research work, which is also introduced in this presentation.
Short Biography:
Pei-Ling Hsu is currently working toward the Ph.D. degree at the Institute of Information Systems and Applications, National Tsing Hua University. Her current research interests include data mining, web mining, semantic relationships, and ontologies. She is an exchange student at DIMA, TU Berlin.
Prof. Themis Palpanas, University of Paris V
Title
Enabling Exploratory Analysis on Very Large Scientific Data
Abstract
There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of data series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions.
In this talk, we describe iSAX 2.0 and its improvements, iSAX 2.0 Clustered and iSAX2+, three methods designed for indexing and mining truly massive collections of data series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a data series index.
Furthermore, we observe that in several cases scientists, and data analysts in general, need to issue a set of queries as soon as possible, as a first exploratory step over their datasets. We also discuss extensions of the above techniques that adaptively create data series indexes and at the same time are able to correctly answer user queries.
We show how our methods allow mining of datasets that would otherwise be completely untenable, including the first published experiments to index one billion data series, and experiments in mining massive data from domains as diverse as entomology, DNA, and web-scale image collections.
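The SAX representation underlying the iSAX family reduces a z-normalized data series to a short symbolic word via piecewise aggregate approximation (PAA), which is what makes indexing at this scale feasible. A minimal Python sketch of that symbolization step (a fixed 4-symbol alphabet is assumed here; iSAX itself uses variable-cardinality binary symbols, which this sketch omits):

```python
import statistics

# Breakpoints dividing the standard normal distribution into 4 equiprobable
# regions (assumed alphabet size; larger alphabets use more breakpoints).
BREAKPOINTS = [-0.67, 0.0, 0.67]

def sax_word(series, segments=4):
    """Symbolize a data series: z-normalize, average per segment (PAA), map to symbols."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0  # guard against constant series
    z = [(v - mu) / sigma for v in series]
    n = len(z)
    word = []
    for s in range(segments):
        chunk = z[s * n // segments : (s + 1) * n // segments]
        paa = sum(chunk) / len(chunk)               # mean of this segment
        symbol = sum(paa > b for b in BREAKPOINTS)  # index into alphabet {0..3}
        word.append(symbol)
    return word

# A ramp-up series maps low segments to small symbols and high segments to large ones:
w = sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4)
assert w == [0, 1, 2, 3]
```

Series with similar shapes collapse to the same short word, so the words can serve as keys in a tree-structured index that prunes most of the collection per query.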
Bio
Themis Palpanas is a professor of computer science at the University of Paris V - Paris Descartes, France. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the IBM T.J. Watson Research Center and the University of Trento. He has also been a Visiting Professor at the National University of Singapore, worked for the University of California, Riverside, and visited Microsoft Research and the IBM Almaden Research Center. His interests include data management, data analytics, streaming algorithms, and data series indexing. His research solutions have been implemented in world-leading commercial data management products, and he is the author of eight US patents, three of which are part of commercial products. He is the recipient of three Best Paper awards and the IBM Shared University Research Award.
He is a founding member of the Event Processing Technical Society, and is serving on the Editorial Advisory Board of the Information Systems Journal and as an Associate Editor of the Journal of Intelligent Data Analysis. He has served as General Chair for VLDB 2013 and on the program committees of several top database and data mining conferences, and has been a member of the IBM Academy of Technology Study on Event Processing.
His research has been funded by the 7th Framework Program (EU), the European Institute of Innovation and Technology (EIT), the Autonomous Province of Trento (Italy), the National Science Foundation (USA), IBM Research, and Hewlett Packard Research Labs.