Talks DIMA Research Seminar
Bingsheng He, Nanyang Technological University, Singapore
"In-Memory Database Systems on Emerging Hardware: Our Ten Years’ Journey"
Dr. Kaiwen Zhang, Technische Universität München
"Distributed, Expressive Top-k Subscription Filtering using Covering in Publish/Subscribe Systems"
Prof. Ihab Ilyas, University of Waterloo
"Data Cleaning from Theory to Practice"
Frank McSherry
"Explaining the outputs of modern data analytics"
MA 043 
Garret Swart, Oracle
"Running better databases on better processors"
Peter Pietzuch, Imperial College London
"SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures"
Alireza Rezaei Mahdiraji, Jacobs-University, Bremen, Germany
"Declarative Data Management for Scientific Meshes"
Danny Bickson, Dato
"Python based predictive analytics with GraphLab Create"
Malte Schwarzer, TU Berlin
"Evaluating Link-based Recommendations for Wikipedia"
Gábor Horváth, Eötvös Loránd University
"Challenges of Industrial Static Analysis"
Panagiotis Bouros, Aarhus University, Denmark
"Managing Complex Data Types"
Prof. Ihab Ilyas, University of Waterloo
Data Cleaning from Theory to Practice
With decades of research on the various aspects of data cleaning,
multiple technical challenges have been tackled and interesting
results have been published in many research papers. Example quality
problems include missing values, functional dependency violations and
duplicate records. Unfortunately, very little success can be claimed
in adopting any of these results in practice. Businesses and
enterprises are building silos of home-grown data curation solutions
under various names, often referred to as ETL layers in the business
intelligence stack. The impedance mismatch between the challenges
faced in industry and the challenges tackled in research papers
explains to a large extent the growing gap between the two worlds. In
this talk I claim that being pragmatic in developing data cleaning
solutions does not necessarily mean being unprincipled or ad-hoc. I
discuss a subset of these practical challenges including data
ownership, human involvement, and holistic data quality concerns.
This new set of challenges often hinders current research proposals
from being adopted in the real world. I also go through a quick
overview of the approach we use in Tamr (a data curation startup) to
tackle these challenges.
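As an illustration of one of the quality problems named above, the following is a minimal sketch of detecting functional dependency violations. It is not taken from the talk or from Tamr; the function name `fd_violations`, the sample records, and the dependency zip -> city are all hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    A functional dependency lhs -> rhs holds if rows with equal lhs values
    always have equal rhs values; any group returned here violates it.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(r[attr] for attr in rhs) for r in members}
        if len(rhs_values) > 1:  # same lhs, conflicting rhs
            violations[key] = members
    return violations


# Hypothetical sample data: the third record conflicts with the first two.
records = [
    {"zip": "10623", "city": "Berlin"},
    {"zip": "10623", "city": "Berlin"},
    {"zip": "10623", "city": "Bremen"},
    {"zip": "80333", "city": "Munich"},
]
conflicts = fd_violations(records, lhs=["zip"], rhs=["city"])
```

Detection of this kind only flags the conflicting group; deciding which value is correct is where the data ownership and human involvement concerns mentioned in the abstract come in.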
Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette. His main research is in the area of database systems, with special interest in data quality, managing uncertain data, rank-aware query processing, and information extraction. Ihab is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He serves on the VLDB Board of Trustees, and he is an associate editor of the ACM Transactions on Database Systems (TODS).
Garret Swart, Oracle
Title: "Running better databases on better processors"
The Oracle database is the world's leading data management
product. After the acquisition of Sun Microsystems in 2010,
engineers from Oracle and Sun started work on a new category of
microprocessor designed to process data several times faster, many
times more efficiently, and qualitatively safer. This kind of
goal cannot be reached by running software unchanged: we needed to
design new hardware and write new software at all levels of the system
to utilize it. The approach we took, and are taking, exploits
the following ideas:
Big is better: Large scale computing systems give access to huge amounts of data without the costs of moving data between systems. This allows for larger tables, bigger sorts, fatter graphs, and more cloud tenants sharing the same resource pool on SPARC systems that scale linearly in cost and performance from 8 to 512 cores, 64 to 4096 threads.
Secure is better: Cache-line level memory access checking allows our instrumented memory allocators to manage memory at production speed while detecting bugs and reporting attacks in real time.
Information Density is better: With hardware designed for scanning n-gram compressed, bit packed, dictionary and run-length encoded columnar data at full memory bandwidth, we make maximal use of every bit stored and every cache line transferred over the memory channels with no impact on performance.
Fast is better: With hardware support for database operators running on specialized streaming processors, we can drive the memory channels at maximum rate, freeing up power and cores for running user computations on the result of these operators.
Connected is better: Integrating EDR InfiniBand on-chip and on-board provides low-latency, high-throughput, one-sided networking.
Portable is better: By supporting platform independent acceleration APIs inside the database we can support a wide variety of acceleration techniques and give applications and query planners the information to make the best use of the available hardware.
Integrated is better: By supporting and accelerating multiple storage types (In-memory, NFS, NVMe, Exadata, HDFS, Fibre Channel), data formats (row major, column major, graph, JSON, spatial, MIME, Hive), algorithms, query languages, network protocols, and hardware platforms in a single product, we can share resources, increase usability and reduce the cost and the cognitive load in acquiring, storing, securing and understanding data.
Bingsheng He, Nanyang Technological University, Singapore
"In-Memory Database Systems on Emerging Hardware: Our Ten Years’ Journey"
Big data has become a buzzword. Among various big-data challenges, high performance is a must, not an option. We are facing challenges (and also opportunities) at all levels, ranging from sophisticated algorithms and procedures that mine the gold from massive data to high-performance computing (HPC) techniques and systems that deliver the useful data in time. In-memory database systems are a hot research topic for taming the performance challenges of big data applications. Our research has been on the novel design and implementation of in-memory database management systems on emerging hardware (many-core CPUs, GPUs, FPGAs, etc.). Interestingly, we have also observed the interplay between emerging hardware and in-memory database systems. In this talk, I will present our research efforts in the past 10 years and outline our research agenda. More details about our research can be found at http://pdcc.ntu.edu.sg/xtra/ .
Dr. Bingsheng He is currently an Associate Professor at the School of Computer Engineering, Nanyang Technological University, Singapore. Before that, he held a research position in the System Research group of Microsoft Research Asia (2008-2010), where his major research was building high performance cloud computing systems for Microsoft. He received his Bachelor's degree from Shanghai Jiao Tong University (1999-2003) and his Ph.D. degree from Hong Kong University of Science & Technology (2003-2008). His current research interests include cloud computing, database systems and high performance computing. His papers are published in prestigious international journals (such as ACM TODS and IEEE TKDE/TPDS/TC) and proceedings (such as ACM SIGMOD, VLDB/PVLDB, ACM/IEEE SuperComputing, ACM HPDC, and ACM SoCC). He has been awarded the IBM Ph.D. fellowship (2007-2008) and the NVIDIA Academic Partnership (2010-2011). Since 2010, he has (co-)chaired a number of international conferences and workshops, including IEEE CloudCom 2014/2015 and HardBD2016. He has served on the editorial boards of international journals, including IEEE Transactions on Cloud Computing (IEEE TCC) and IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS).
Danny Bickson, Dato
Title: Python based predictive analytics with GraphLab Create
Abstract: One of the most exciting areas in data science is the development of new predictive applications: apps used to drive product recommendations, predict machine failures, forecast airfares, etc. These applications output real-time predictions and recommendations in response to user and machine input to directly derive business value and create a cool experience.
The most interesting apps utilize multiple types of data (tables, graphs, text & images) in a creative way. In this talk, we will show how to quickly build and deploy a predictive app that exploits the power of combining different data types together using GraphLab Create, our open source based Python software.
Short bio: Dr. Danny Bickson is VP EMEA and Co-Founder at Dato. Previously, he was a research scientist at Carnegie Mellon University, working with Prof. Carlos Guestrin (CMU) and Prof. Joe Hellerstein (Berkeley), and one of the creators of the GraphLab open source project. Danny holds a PhD in distributed algorithms from the Hebrew University.
Malte Schwarzer, TU Berlin
Title: Evaluating Link-based Recommendations for Wikipedia
Literature recommender systems support users in filtering the
increasing number of documents in digital libraries and on the
Web. For academic literature, research has proven the ability of
citation-based document similarity measures, such as Co-Citation
(CoCit), or Co-Citation Proximity Analysis (CPA) to improve
recommendation quality.
In this paper, we report on the first large-scale investigation of the
performance of the CPA approach in generating literature
recommendations for Wikipedia, which is fundamentally different
from the academic literature domain. We analyze links instead of
citations to generate article recommendations. We evaluate CPA,
CoCit, and the Apache Lucene MoreLikeThis (MLT) function,
which represents a traditional text-based similarity measure. We
use two datasets of 779,716 and 2.57 million Wikipedia articles,
the Big Data processing framework Apache Flink, and a ten-node
computing cluster. To enable our large-scale evaluation, we derive
two quasi-gold standards from the links in Wikipedia's "See also"
sections and a comprehensive Wikipedia clickstream dataset.
Our results show that the citation-based measures CPA and CoCit
have complementary strengths compared to the text-based MLT
measure. While MLT performs well in identifying narrowly
similar articles that share similar words and structure, the
citation-based measures are better able to identify topically related
information, such as information on the city of a certain university
or other technical universities in the region. The CPA approach,
which consistently outperformed CoCit, is better suited for
identifying a broader spectrum of related articles, as well as
popular articles that typically exhibit a higher quality. Additional
benefits of the CPA approach are its lower runtime requirements
and its language-independence that allows for a cross-language
retrieval of articles. We present a manual analysis of exemplary
articles to demonstrate and discuss our findings.
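To make the difference between the two link-based measures concrete, here is a minimal sketch that is not the study's implementation. It assumes each source article is given as an ordered list of outgoing links, counts a co-citation for every pair of targets linked from the same source, and adds a proximity-weighted score of 1/distance, where distance is the difference in link position. The function name `link_scores`, the 1/distance weighting, and the sample data are illustrative assumptions; the weighting used in the study may differ.

```python
from collections import Counter
from itertools import combinations


def link_scores(outlinks):
    """Compute co-citation counts and a proximity-weighted variant.

    `outlinks` maps a source article to the ordered list of articles it
    links to. Two targets are co-cited when the same source links to both.
    """
    cocit = Counter()
    proximity = Counter()
    for targets in outlinks.values():
        # Position of the first occurrence of each linked article.
        first_pos = {}
        for index, target in enumerate(targets):
            first_pos.setdefault(target, index)
        for a, b in combinations(sorted(first_pos), 2):
            cocit[(a, b)] += 1
            # Assumed weighting: links that appear closer together count more.
            proximity[(a, b)] += 1.0 / abs(first_pos[a] - first_pos[b])
    return cocit, proximity


# Hypothetical link lists for two source articles.
outlinks = {
    "Article X": ["TU Berlin", "Berlin", "Humboldt University"],
    "Article Y": ["TU Berlin", "Physics", "Chemistry", "Berlin"],
}
cocit, proximity = link_scores(outlinks)
```

In this toy input, "Berlin" and "TU Berlin" are co-cited by both sources, but the proximity score gives less weight to the second source, where the two links are three positions apart.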
The raw data and source code of our study, together with a
manual on how to use them, are openly available at:
- Currently a master's student in Information System Management (data science track) at TU Berlin
- Previously completed a bachelor's degree in Information System Management, also at TU Berlin
Frank McSherry
Title: Explaining the outputs of modern data analytics
We have made substantial progress with modern data analytics, moving well beyond the realm of simply counting words. We can determine interesting graph properties (connectivity, reachability, matchings) and maintain these properties in real time. We can produce a tremendous amount of output, but it isn't clear that we understand it all yet.
In this talk, I'll explain a framework for interactively determining and tracking *explanations* for outputs of arbitrary differential dataflow computations: subsets of the actual input which reproduce the outputs. In the relational setting, this would be "provenance" or "lineage", but in the big data space, including iteration and non-monotonic reducers, existing techniques do not work: they return either (i) too much input data or (ii) insufficient input data to reproduce the output. We'll fix all of that.
This talk reflects joint work with Zaheer Chothia, John Liagouris, and Mothy Roscoe in the Systems Group at ETH Zurich.
Frank McSherry is an independent researcher formerly affiliated with Microsoft Research, Silicon Valley. While there he led the Naiad project, which introduced both differential and timely dataflow, and remains one of the top-performing big data platforms. He also works with differential privacy, due in part to its interesting relationship to data-parallel computation. Frank currently enjoys spending his time in places other than Silicon Valley.
Dr. Kaiwen Zhang, Technische Universität München
Title: Distributed, Expressive Top-k Subscription Filtering using Covering in Publish/Subscribe Systems
Abstract: Top-k filtering is an effective way of reducing the amount of data sent to subscribers in pub/sub applications. We focus on the problem of top-k subscription filtering, where a publication is delivered only to the k best ranked subscribers. The naive approach to perform filtering early at the publisher edge broker works only if complete knowledge of the subscriptions is available, which is not compatible with the well-established covering optimization in content-based publish/subscribe systems. We propose an efficient rank-cover technique to reconcile top-k subscription filtering with covering. We extend the covering model to support top-k and describe a novel algorithm for forwarding subscriptions to publishers while maintaining correctness. We also establish a framework for supporting different types of ranking semantics, and propose an implementation to support fairness. Finally, we conduct an experimental evaluation and perform sensitivity analysis to demonstrate that our optimized rank-cover algorithm retains both covering and fairness while achieving properties advantageous to our targeted workloads. Our optimized solution is scalable and retains over 95% of the covering benefit when k is set at 1% selectivity, and even achieves 70% covering when k selectivity is 10%.
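The naive approach mentioned in the abstract, filtering at a single broker that knows every subscription, can be sketched as follows. This is only an illustration of that baseline, not the rank-cover technique; the subscription layout (`id`, `predicate`, `score`), the function name `top_k_subscribers`, and the ranking function are hypothetical.

```python
import heapq


def top_k_subscribers(publication, subscriptions, k):
    """Return the ids of the k best ranked subscriptions matching a publication.

    Assumes the broker has complete knowledge of all subscriptions, which is
    exactly the assumption that conflicts with covering-based forwarding.
    """
    matches = []
    for sub in subscriptions:
        if sub["predicate"](publication):
            matches.append((sub["score"](publication), sub["id"]))
    # nlargest returns the highest-scoring matches in descending order.
    best = heapq.nlargest(k, matches)
    return [sub_id for _, sub_id in best]


def price_above(threshold):
    """Build a hypothetical subscription predicate and ranking function."""
    return (
        lambda pub: pub["price"] > threshold,
        lambda pub: pub["price"] - threshold,
    )


subscriptions = []
for sub_id, threshold in [("s1", 100), ("s2", 50), ("s3", 150), ("s4", 110)]:
    predicate, score = price_above(threshold)
    subscriptions.append({"id": sub_id, "predicate": predicate, "score": score})

publication = {"symbol": "ACME", "price": 120}
delivered = top_k_subscribers(publication, subscriptions, k=2)
```

With covering, a broker may forward only a covering subscription upstream, so the publisher edge broker no longer sees every candidate needed for this ranking; that gap is what the abstract's rank-cover technique addresses.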
Biography: Kaiwen Zhang is a postdoctoral fellow in Computer Science at TU Munich, where he has been a member of the Middleware Systems Research Group since 2010. Born in Beijing (China), he obtained his B.Sc and M.Sc at McGill University in Montréal and his Ph.D at the University of Toronto. His research interests include large-scale event processing, massively multiplayer online games, consistency in replicated systems and software-defined networking.
Alireza Rezaei Mahdiraji, Jacobs-University, Bremen, Germany
"Declarative Data Management for Scientific Meshes"
Mesh-structured data is foundational in the earth sciences, primarily in modelling contexts, but has not been extensively studied in the database community, with a few notable exceptions. Despite increasing interest in database systems for scientific data, mesh-structured data, characterized by arbitrary topological discretizations of a continuous field, is handled through ad-hoc algorithms tightly coupled to the source data files. The significant overlap in functionality among these algorithms leads to duplicated code, which suggests that the reusability principle has been overlooked and which in turn leads to high maintenance costs.
This work presents a query language for scientific meshes with a level of abstraction which is based on topological and geometric characteristics of meshes, with emphasis on topology. The language offers a comprehensive treatment of the variety of mesh types found in practice, handling meshes of arbitrary dimension and imposing correctness constraints to ensure validity throughout transformation.
Alireza is a programmer in the Microbial Genomics and Bioinformatics Group at Max-Planck Institut Bremen. He received his PhD in computer science from Jacobs-University Bremen in December 2015, where he worked with Peter Baumann on query languages for scientific data. He did his master's and bachelor's degrees at Iran University of Science and Technology and Shahid Beheshti University.
Gábor Horváth, Eötvös Loránd University
Challenges of Industrial Static Analysis
Both parsing C++ code and doing static analysis are challenging tasks. Thus the static analysis of industrial C++ code is a very tough problem. I will introduce some static analysis algorithms: how they work and what their limitations are. The focus is on symbolic execution, which is a very powerful abstract interpretation method. Unfortunately, the C++ compilation model makes it very hard to analyse calls to functions that are defined in a separate translation unit. I will introduce a solution that I proposed to solve this problem and the prototype I implemented in a Google Summer of Code (GSoC) 2014 project. Finally, I will give a status update on using code generation in the serializers in Flink, which is my (proposed) GSoC 2016 project.
Gábor Horváth is finishing his MS degree at Eötvös Loránd University this summer. He has been a researcher in a static analysis related university project since 2012. He also teaches C and C++ to undergraduate students. In 2014 he participated in a Google Summer of Code (GSoC) project to improve the Clang Static Analyzer. He has been an active contributor to the Clang compiler ever since. In 2015 he interned at Apple to further improve the Static Analyzer. This summer he proposed a GSoC project to improve the performance of serialization in Flink using code generation.
Peter Pietzuch, Imperial College London
SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures
Modern servers have become heterogeneous, often combining multi-core CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling.
In this talk, I describe SABER, a hybrid high-performance relational stream processing engine for CPUs and GPGPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion using all available CPU and GPGPU cores. Instead of statically assigning query operators to heterogeneous processors, SABER employs a new adaptive heterogeneous lookahead scheduling strategy, which increases the share of operators executing on the processor that yields the highest performance. To hide data movement costs, SABER pipelines the transfer of stream data between CPU and GPGPU memory. Our experimental comparison against state-of-the-art engines shows that SABER increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with both small and large window sizes.
(This talk is based on work published at SIGMOD'16.)
Peter Pietzuch is an Associate Professor (Reader) at Imperial College London, where he leads the Large-scale Distributed Systems (LSDS) group (http://lsds.doc.ic.ac.uk) in the Department of Computing. His research focuses on the design and engineering of scalable, reliable and secure large-scale software systems, with a particular interest in data management and networking issues. He has published over seventy research papers in international venues, including USENIX ATC, SIGMOD, VLDB, ICDE, ICDCS, CCS, CoNEXT, NSDI, Middleware and DEBS. Before joining Imperial College London, he was a post-doctoral fellow at Harvard University. He holds PhD and MA degrees from the University of Cambridge.
Panagiotis Bouros, Aarhus University, Denmark
Managing Complex Data Types
Data are becoming increasingly complex. Real-life objects nowadays can be routinely assigned different types of information such as text, spatial geometries, timestamps and graph (social) information. In the past, these data dimensions have been extensively studied, but in most cases independently. On the other hand, the abundance of objects enriched with descriptive information from multiple sources and the plethora of applications collecting such complex data signify the need to extend or even redesign our data management systems. In this spirit, my research has focused first on introducing novel querying operators and analysis tasks, and second on devising efficient methods to process the huge amounts of complex data produced by modern businesses and sciences. Indeed, issues such as Volume, Velocity, Variety, Veracity and Complexity arise in the context of managing complex data types, which call for novel techniques, as the characteristics of modern and future datasets naturally outgrow the capabilities of contemporary query processing techniques. This talk will provide an overview of my research on managing complex data types and discuss my directions for future work.
Panagiotis Bouros is a post-doctoral researcher in the Department of Computer Science at Aarhus University, Denmark. His research focuses on query processing, managing complex data types including spatial, temporal and text, and on routing optimization problems. Prior to his current position, Panagiotis was with Humboldt-Universität zu Berlin, Germany and the University of Hong Kong, Hong Kong SAR, China. He received his diploma and PhD degree from the School of Electrical and Computer Engineering at the National Technical University of Athens, Greece, in 2003 and 2011, respectively.