Document Contents
DIMA Colloquium Dates
|Prof. Jarek Szlichta, University of Ontario Institute of Technology, Canada|
"Bringing Order to Data"
|Prof. Dr. Nacéra Seghouani Bennacer|
"Extracting Information from Online Data & Building User Profiles"
|Prof. Dr. Francesca Bugiotti|
"Interpreting Reputation in Twitter"
|Dr. Minh-Tan Pham,
IRISA laboratory, France|
"Graph of keypoints for remote sensing image analysis"
|Lionel Parreaux, EPFL|
"Fearless Metaprogramming with Squid"
|Dr. Issa Khalil, Qatar Computing Research Institute|
"Discovering Malicious Domains through Passive DNS Data Graph Analysis"
|Prof. Themis Palpanas, Senior Member of the
French University Institute (IUF) France|
"End-to-End Entity Resolution for Structured and Semi-Structured Data"
Prof. Jarek Szlichta, University of Ontario Institute of Technology, Canada
Bringing Order to Data
Poor data quality is a barrier to effective, high-quality decision making based on data. Declarative data cleaning encodes data semantics as constraints (rules); errors arise when the data violates these constraints. Unified approaches that repair errors in both data and constraints have been proposed. However, both data-only and unified approaches are by and large static: they apply cleaning to a single snapshot of the data and constraints. We have proposed a continuous data cleaning framework that can be applied to dynamic data. Our approach permits both the data and its semantics to evolve and suggests repairs based on accumulated statistical evidence. We built a machine learning classifier that predicts the type of repair needed to resolve an inconsistency and learns from past user repair preferences to recommend more accurate repairs in the future. We also propose a quantitative approach to data cleaning that excels at ensuring that the repaired data has the desired statistical properties.
Integrity constraints (ICs) are useful for query optimization and for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). We present a new OD discovery algorithm enabled by a novel polynomial mapping to a canonical form of ODs, together with a sound and complete set of axioms for canonical ODs. We show orders-of-magnitude performance improvements over the prior state of the art.
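To make the notion of an order dependency concrete: an OD X → Y holds when sorting the data by attribute X also leaves it sorted by attribute Y. The sketch below is a minimal, hypothetical illustration of that property (attribute names and data are invented); it is not the discovery algorithm from the talk, only the condition such an algorithm searches for.

```python
# Illustrative check of a unidirectional order dependency X -> Y:
# whenever rows are sorted by attribute x, they are also sorted by y.

def od_holds(rows, x, y):
    """Return True if ordering rows by attribute x also orders them by y."""
    ordered = sorted(rows, key=lambda r: r[x])
    return all(a[y] <= b[y] for a, b in zip(ordered, ordered[1:]))

# Hypothetical example: higher salary never means lower tax.
rows = [
    {"salary": 30000, "tax": 4500},
    {"salary": 50000, "tax": 9000},
    {"salary": 70000, "tax": 14000},
]
print(od_holds(rows, "salary", "tax"))   # True
```

A naive discovery algorithm would test this condition for every attribute pair; the talk's contribution is avoiding that blow-up via a canonical form and axiomatization.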
Jarek Szlichta is an Assistant Professor at the University of Ontario Institute of Technology (since 2014), an Adjunct Professor at the University of Waterloo (since 2017), and an IBM CAS Faculty Fellow (2017). He was a Postdoctoral Fellow at the University of Toronto (2013-2014). His research concerns big data, machine learning, business intelligence, data analytics, information integration, data quality, and web search. He received his doctoral degree from York University (2009-2013); during that time he held a three-year fellowship at the IBM Centre for Advanced Studies in Toronto. His research at IBM concerns the optimization of queries for business intelligence, with a focus on order dependencies. He is a recipient of the IBM Research Student-of-the-Year award (2012) “for having insights and perspective that has significantly contributed to IBM in a matter of great importance”. Previously he worked at Comarch Research & Development on designing and implementing the OCEAN GenRap system, an innovative data analytics reporting solution. This work was recognized with the CeBIT Business Award.
Prof. Dr. Nacéra Seghouani Bennacer & Prof. Dr. Francesca Bugiotti
Prof. Dr. Nacéra Seghouani Bennacer
"Extracting Information from Online Data & Building User Profiles"
A large part of valuable information is available online in unstructured textual documents. Traditional information extraction approaches use natural-language processing to identify references to specific named entities and their relationships, or to capture targeted concepts and semantics. In the context of the Web, textual documents are by nature massive, real-time, ambiguous, and multilingual, so heavyweight, highly constrained traditional techniques are doomed to fail. In this context we defined efficient (scalable) and automatic approaches that combine machine learning algorithms with external sources such as Wikipedia and BabelNet, both large multilingual encyclopedic graphs; we evaluated them on real datasets and compared them with existing approaches, for different purposes such as:
(i) identifying topics of interest of social network users and discovering hidden dimensions of words that reveal some of the psychological and personality traits of a person;
(ii) building user profiles from resumes by retrieving, identifying, and reconciling users' Web resources for human resource management systems, in order to obtain a holistic view of an applicant; and (iii) reconciling individual profiles across multiple social network platforms through the information they disclose (names, location, links to other documents, ...) and their dynamic activities (e.g., posts).
Short CV of Nacéra:
Prof. Dr. habil. Nacéra Seghouani Bennacer is a Full Professor in the Computer Science Department of CentraleSupélec and a research member of the Laboratoire de Recherche en Informatique (LRI). She received her Engineer degree in Industrial Engineering (ENP Algiers, ECP Paris) in 1991, her doctorate in Computer Science (CNAM Paris) in 1994, and her Habilitation to supervise research in Computer Science (Paris-Sud University) in 2014.
Her current research work deals with extracting information from heterogeneous data coming from different sources, including textual, ambiguous, multilingual data and large graphs, for different purposes such as:
— Ontology-based rewriting & querying semi-structured documents.
— Learning hidden dimensions revealing some of psychological and personality traits of a person.
— Identifying topics of interest of social network users for recommendation systems.
— Interpreting reputation of entities by crawling the Web.
— Building user profiles across multiple social network platforms.
These research works are carried out in the context of different projects: Paris-region projects such as Digiteo, FUI/Systematic, SATT, and CIFRE-ANRT, with academic partners and companies.
Prof. Dr. Francesca Bugiotti
"Interpreting Reputation in Twitter"
Twitter is a social network that provides a powerful source of data. The analysis of these data offers many challenges, among which stands out the opportunity to determine the reputation of a product, a person, or any other entity of interest. Several approaches for sentiment analysis have been proposed in the literature to assess the general opinion expressed in tweets about an entity. We developed a new approach that determines the reputation of an entity on the basis of the set of events in which it is involved. To achieve this we propose a new sampling method, driven by a tweet-weighting measure, that yields a better-quality summary of the target entity. Our evaluation shows that 90% of the reputation of an entity originates from the events it is involved in, and the breakdown into events allows interpreting the reputation in a transparent and self-explanatory way.
Short CV of Francesca:
Prof. Dr. Ing. Francesca Bugiotti is an assistant professor at CentraleSupélec in Paris. She received her "Dr. Ing." degree in Computer Engineering from Università "Roma Tre" (under the supervision of Prof. Paolo Atzeni) in 2012, with a thesis on heterogeneity in databases. She worked as an intern and then as a post-doc at Inria Saclay, studying the problem of indexing RDF datasets in a cloud infrastructure and efficient data storage mechanisms for heterogeneous data in the cloud, supported by Inria in connection with the KIC EIT ICT Labs Europa activity on scalable cloud-based data management.
Her research activity focuses on heterogeneous data integration, conceptual models, NoSQL storage-system integration, NoSQL data model characteristics, and query expressive power.
Dr. Minh-Tan Pham, IRISA laboratory, France
Graph of keypoints for remote sensing image analysis
We present a novel pointwise graph-based approach for remote sensing image analysis. In this approach, keypoints are first extracted to represent and characterize the image. A similarity graph is then constructed from these points, and analysis in the graph's spatial and spectral domains provides us with powerful tools to tackle different remote sensing tasks. Our approach does not require any stationarity condition within remote sensing imagery and is able to deal with large image data. Examples on the classification of optical satellite images as well as change detection from synthetic aperture radar (SAR) data are illustrated in order to evaluate and confirm the effectiveness of the proposed strategy.
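The central data structure of the abstract above, a similarity graph over extracted keypoints, can be sketched as a k-nearest-neighbour graph. The following toy is an assumption-laden illustration (the keypoint coordinates are invented, and a real pipeline would extract descriptors from imagery rather than use raw 2-D points):

```python
# Build a k-nearest-neighbour graph over keypoints: each keypoint is
# linked to its k closest neighbours by Euclidean distance. This is the
# structure on which spatial/spectral graph analysis would then operate.
import math

def knn_graph(points, k):
    """Return adjacency dict: point index -> list of its k nearest neighbours."""
    graph = {}
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        graph[i] = neighbours
    return graph

# Hypothetical keypoints: three clustered, one far away.
keypoints = [(0.0, 0.0), (1.0, 0.1), (0.9, 1.0), (5.0, 5.0)]
print(knn_graph(keypoints, 2))
```

In practice the edge weights (similarities) rather than plain adjacency would carry the spectral information used for classification and change detection.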
Minh-Tan Pham received the Master of Engineering and Master of Research degrees in electronics and telecommunications from the Institut Mines-Télécom, Télécom Bretagne, France, in 2013. He obtained his Ph.D. in Information and Image Processing from Télécom Bretagne, in collaboration with the French Space Agency (CNES), in 2016. He was an intern at the Department of Geography, Laval University, Quebec, Canada, in 2013 and at the French Space Agency, Toulouse, France, in 2015. He is now a post-doctoral researcher in the OBELIX team, IRISA laboratory, France. His research interests include image processing, computer vision, and machine learning applied to remote sensing imagery, with a current focus on mathematical morphology, hierarchical representation, graph signal processing, and deep networks for feature extraction, object detection, and classification of remote sensing data. He currently serves as a reviewer for IEEE TIP, IEEE TGRS, IEEE JSTARS, IEEE GRSL, and several MDPI journals, including Remote Sensing, Sensors, and the Journal of Imaging. Dr. Pham will be a visiting researcher at TU Berlin, Germany, and at the University of Trento, Italy, in June and July 2018.
Lionel Parreaux, EPFL (Switzerland)
Fearless Metaprogramming with Squid
Metaprogramming gives a programmer superpowers, but often at great cost in terms of complexity and reliability – when designing a metaprogram, it is all too easy to make mistakes that result in failures that are hard to diagnose and debug (for example: unintended variable capture, type mismatches, or undefined symbols in the generated code). This greatly limits the applicability of metaprogramming and restricts many of its more powerful uses to experts in programming languages and compilers. Thankfully, strong type systems can help regain some confidence in the difficult art of writing metaprograms. In this talk I will present Squid, a type-safe metaprogramming framework for Scala that lets users define statically-checked code generators and optimizers with a great deal of flexibility. I will briefly touch upon an application of Squid at the center of an ongoing effort to create an efficient Scala query engine that is, still, type-safe "all the way down". I aim to show that with such help from the compiler, it becomes possible to generate and optimize domain-specific code at both compile time and run time without much worry – hence the "fearless metaprogramming".
Lionel Parreaux is a PhD student at EPFL (Switzerland), working with Christoph Koch on programming languages, compilers, and database technology. After falling in love with Scala's flexibility and expressive power, Lionel set out to create Squid, a new Scala metaprogramming framework joining theory and practice to facilitate unlocking the great promises of code generation and domain-specific optimization. The core of Squid's type system was presented at POPL 2018, and a paper about the application of Squid to stream fusion received the best paper award at GPCE 2017.
Dr. Issa Khalil, Qatar Computing Research Institute (QCRI)
We are happy to announce that Dr. Issa Khalil, a principal scientist in the Cyber Security group at the Qatar Computing Research Institute (QCRI), will give a talk at DIMA. The talk will be on Friday, May 4, at 11:00 at TU Berlin, EN building, seminar room EN 719 (7th floor). Issa will be talking about “Discovering Malicious Domains through Passive DNS Data Graph Analysis” (abstract below). He will also present an overview of the research in the Cyber Security group at QCRI and is specifically interested in finding collaboration opportunities in the management of heterogeneous data sources and graph processing. You are welcome to come just to this talk and to pass on this invitation to your groups and anybody else who might be interested.
Time: 4 May 2018 11:00-12:00
Location: TU Berlin, EN building, seminar room EN 719 (7th floor), Einsteinufer 17, 10587 Berlin
Title: Discovering Malicious Domains through Passive DNS Data Graph Analysis
Issa Khalil received his PhD degree in Computer Engineering from Purdue University, USA, in 2007. Immediately thereafter he joined the College of Information Technology (CIT) of the United Arab Emirates University (UAEU), where he served as an associate professor and head of the Information Security Department. In 2013, Khalil joined the Cyber Security group at the Qatar Computing Research Institute (QCRI), a member of Qatar Foundation, as a Senior Scientist, and he has been a Principal Scientist since 2016. Khalil’s research interests span the areas of wireless and wireline network security and privacy. He is especially interested in security data analytics, network security, and private data sharing. His novel technique to discover malicious domains following the guilt-by-association social principle attracted the attention of local media and stakeholders, and received the best paper award at CODASPY 2018. Dr. Khalil has served as an organizer, technical program committee member, and reviewer for many international conferences and journals. He is a senior member of the IEEE and a member of the ACM, and delivers invited talks and keynotes in many local and international forums. In June 2011, Khalil was granted the CIT outstanding professor award for outstanding performance in research, teaching, and service.
Discovering Malicious Domains through Passive DNS Data Graph Analysis
Despite many efforts, the number of malicious domains, breeding grounds for many devastating security attacks, is on the rise. This project aims to discover as-yet-unknown malicious domains well before they are used to launch actual attacks, so that such attacks may be nipped in the bud. In particular, we develop techniques to extract meaningful associations among malicious domains through the analysis of DNS data. Unlike traditional research efforts that focus on local features, we propose to discover and analyze global associations among domains. The key challenges are (1) to build meaningful associations among domains, and (2) to use these associations to reason about the potential maliciousness of domains. For the first challenge, we take advantage of the modus operandi of attackers. To avoid detection, malicious domains exhibit dynamic behavior by, for example, frequently changing their domain-IP resolutions and creating new domains. This makes it very likely for attackers to reuse resources, leading to intrinsic associations among domains. For the second challenge, we develop graph-based inference techniques over associated domains. Our approach is based on the intuition that a domain having strong associations with known malicious domains is likely to be malicious. Carefully established associations enable the discovery of a large set of new malicious domains using a very small set of previously known malicious ones. Although our initial focus is on detecting malicious domains, we plan to explore new ways to detect other malicious vectors such as IPs, end hosts, users, mobile applications, and malicious files.
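The guilt-by-association intuition described above can be sketched as a simple score propagation over the association graph. This toy (domain names, graph, and the decay parameter are all invented, and QCRI's actual inference is far more sophisticated) shows how a handful of known malicious seeds can flag associated domains:

```python
# Toy guilt-by-association propagation: known malicious seed domains
# start with score 1.0; each round, a domain inherits a decayed fraction
# of its most suspicious neighbour's score.
def propagate(graph, seeds, rounds=3, decay=0.5):
    """graph: domain -> set of associated domains; seeds: known malicious."""
    score = {d: (1.0 if d in seeds else 0.0) for d in graph}
    for _ in range(rounds):
        nxt = dict(score)
        for d, neighbours in graph.items():
            if neighbours:
                spread = max(score[n] for n in neighbours) * decay
                nxt[d] = max(nxt[d], spread)
        score = nxt
    return score

# Hypothetical association graph (e.g., domains sharing IP resolutions).
graph = {
    "evil.example":   {"shady.example"},
    "shady.example":  {"evil.example", "new.example"},
    "new.example":    {"shady.example"},
    "benign.example": set(),
}
scores = propagate(graph, seeds={"evil.example"})
print(scores)
```

Note how "new.example", never seen in a blacklist, still receives a nonzero score purely through its association path to the seed, which is the core of the approach's early-discovery claim.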
Prof. Themis Palpanas, Senior Member of the French University Institute (IUF) France
Title: End-to-End Entity Resolution for Structured and Semi-Structured Data
Entity Resolution (ER) lies at the core of data integration, with a bulk of research focusing on both its effectiveness and time efficiency. Initially, most relevant works were crafted for structured (relational) data that are described by a schema of well-known quality and meaning. With the advent of Big Data, though, these early schema-based approaches became inapplicable, as the scope of ER moved to semi-structured data collections, which abound in noisy, voluminous, and highly heterogeneous information.
In this talk, we take a close look at the entire ER workflow (from schema matching to entity clustering), covering both the schema-based and schema-agnostic cases. We will highlight recent works that significantly boost the efficiency of the overall workflow, especially meta-blocking, which cuts down on the computational cost by discarding comparisons that are repeated or lack sufficient evidence for producing duplicates. We will conclude with a brief demonstration of JedAI, our open-source reference toolbox for ER, which incorporates most of the state-of-the-art techniques in the area.
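The meta-blocking idea mentioned above can be illustrated in a few lines. In this hedged toy (block keys, entity ids, and the threshold are invented, and real meta-blocking uses more refined weighting and pruning schemes), candidate pairs are weighted by how many blocks they co-occur in, each repeated comparison is counted only once, and weakly supported pairs are pruned:

```python
# Toy meta-blocking: weight candidate pairs by the number of blocks they
# share, then keep only pairs with enough co-occurrence evidence.
from collections import Counter
from itertools import combinations

def meta_block(blocks, min_common=2):
    """blocks: block key -> list of entity ids. Return surviving pairs."""
    weight = Counter()
    for entities in blocks.values():
        for pair in combinations(sorted(set(entities)), 2):
            weight[pair] += 1          # one count per shared block
    return {pair for pair, w in weight.items() if w >= min_common}

# Hypothetical token-based blocks over person records.
blocks = {
    "john":  ["e1", "e2", "e3"],
    "smith": ["e1", "e2"],
    "dr":    ["e3", "e4"],
}
print(meta_block(blocks))   # only (e1, e2) co-occur in >= 2 blocks
```

Pairs sharing a single block, such as (e3, e4), are discarded as lacking sufficient evidence, which is exactly the saving in comparisons that meta-blocking delivers at scale.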
Themis Palpanas is Senior Member of the Institut Universitaire de France (IUF), a distinction that recognizes excellence across all academic disciplines, and professor of computer science at the Paris Descartes University (France), where he is director of diNo, the data management group. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of Trento, and at IBM T.J. Watson Research Center, and visited Microsoft Research, and the IBM Almaden Research Center.
His interests include problems related to data science (big data analytics and machine learning applications). He is the author of nine US patents, three of which have been implemented in world-leading commercial data management products. He is the recipient of three Best Paper awards, and the IBM Shared University Research (SUR) Award.
He is currently serving on the VLDB Endowment Board of Trustees, as Editor-in-Chief of the BDR Journal, as Associate Editor for VLDB 2019, as Associate Editor of the TKDE and IDA journals, as well as on the Editorial Advisory Board of the IS journal and the Editorial Board of the TLDKS Journal. He has served as General Chair for VLDB 2013, Associate Editor for VLDB 2017, Workshop Chair for EDBT 2016, ADBIS 2013, and ADBIS 2014, General Chair for the PDA@IOT International Workshop (in conjunction with VLDB 2014), and General Chair for the Event Processing Symposium 2009.