Talks Research Colloquium
Mirek Riedewald, Associate Professor, Northeastern University
"Scalable Search and Ranking for Scientific Data"
Zoltán Miklós (EPFL)
"Divide and conquer techniques for data management problems"
"On Finding Complementary Clusterings and the WMO Information System"
DIMA EN 719
Katrin Eisenreich, SAP
"Creation and Change Impact Analysis of What-if Scenarios under Uncertainty and Correlation"
Ivo Santos, Microsoft Research (EMIC - European Microsoft Innovation Center)
"Analytics, Complex Events and Data Streams: Scenarios, Platforms and Trends"
Mirek Riedewald, Northeastern University
Title: "Scalable Search and Ranking for Scientific Data"
As the amount and complexity of data in many scientific disciplines increases rapidly, new tools are needed for supporting exploratory analysis and scientific discovery. Our Scolopax system's goal is to address these challenges with novel techniques for large-scale parallel data management. In this talk, we will present an overview of Scolopax and then focus on parallel processing of joins. Our proposed model simplifies reasoning about how to assign join computation tasks to processors in MapReduce and other parallel environments. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide proofs and strong evidence that for a variety of join problems, its latency is either close to optimal or the best realizable option. For some popular joins we show how to improve over 1-Bucket-Theta by exploiting additional input statistics. Most of these results will appear at SIGMOD 2011; other aspects of Scolopax were published at premier data management and data mining venues like VLDB, ICDE, ICML, and ICDM.
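The idea of a randomized single-job join can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function name `one_bucket_theta`, the roughly square grid of regions, and the in-memory simulation of the map and reduce phases are assumptions made for illustration. The sketch covers the conceptual join matrix of S and T with a grid of regions (one per reducer), sends each S tuple to one randomly chosen row band and each T tuple to one randomly chosen column band, and lets every region evaluate the join predicate locally.

```python
import math
import random
from collections import defaultdict


def one_bucket_theta(S, T, theta, num_reducers, seed=0):
    """Simulate a single-round randomized theta-join (illustrative sketch).

    S, T         : lists of tuples (any Python values)
    theta        : join predicate, called as theta(s, t)
    num_reducers : number of regions / reducers to simulate
    """
    rng = random.Random(seed)

    # Assumption: cover the |S| x |T| join matrix with a rows x cols grid.
    rows = max(1, math.isqrt(num_reducers))
    cols = max(1, num_reducers // rows)

    # region id -> (S tuples received, T tuples received)
    regions = defaultdict(lambda: ([], []))

    # "Map" phase: an S tuple picks a random row band and is replicated to
    # every region in that band; a T tuple does the same for a column band.
    for s in S:
        r = rng.randrange(rows)
        for c in range(cols):
            regions[(r, c)][0].append(s)
    for t in T:
        c = rng.randrange(cols)
        for r in range(rows):
            regions[(r, c)][1].append(t)

    # "Reduce" phase: each region joins only the tuples it received.
    result = []
    for s_part, t_part in regions.values():
        result.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return result
```

Every pair (s, t) meets in exactly one region, the one at the intersection of the row band chosen for s and the column band chosen for t, so the output contains each joining pair once regardless of the predicate. The only input statistics a real deployment of this scheme would need are the input cardinalities, which determine a sensible grid shape.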
Mirek Riedewald received a Ph.D. in computer science from the University of California at Santa Barbara in 2002. After spending some time as a researcher at Cornell University and as a visiting researcher at Microsoft Research, he is now an Associate Professor at Northeastern University. Dr. Riedewald's research interests are in databases and data mining, with an emphasis on designing scalable techniques for data-driven science. Currently Dr. Riedewald is developing novel approaches for parallel data processing and for mining observational data. He has a track record of successful collaborations with scientists from different domains, including ornithology, physics, mechanical and aerospace engineering, and astronomy. His work has been published in the premier peer-reviewed data management research venues like ACM SIGMOD, VLDB, IEEE ICDE, and IEEE TKDE, as well as in domain science journals.
Zoltán Miklós (EPFL)
Title: "Divide and conquer techniques for data management problems"
Evaluating conjunctive queries over a relational database is a central problem of database theory. This problem is closely related to constraint satisfaction problems in artificial intelligence. We discuss query decomposition methods, which are an efficient means of coping with the computational intractability of these problems. Then we discuss semantic interoperability problems in coalitions of autonomous data sources, where we study other divide and conquer techniques as well. We also discuss further related questions in Web data management, in particular entity matching in Web document collections and Twitter streams. We discuss the fundamental differences between the various data management settings one needs to consider when applying divide and conquer techniques.
Zoltán Miklós is a postdoctoral researcher at EPFL. He defended his PhD thesis at the University of Oxford in 2008. He previously worked as a research assistant at the Vienna University of Technology and at the Vienna University of Economics, and as a software developer at Siemens. He completed his undergraduate degrees at ELTE University in Budapest. His research focuses on databases, data management, artificial intelligence and the Semantic Web.
Title: "On Finding Complementary Clusterings and the WMO Information System"
On Finding Complementary Clusterings:
In many cases, a dataset can be clustered following several criteria that complement each other: group membership following one criterion provides little or no information regarding group membership following the other criterion. When these criteria are not known a priori, they have to be determined from the data. I will discuss a new method for jointly finding the complementary criteria and the clustering corresponding to each criterion.
On the WMO Information System:
The WMO Information System (WIS) is being developed to ensure the continued international exchange of WMO products, such as meteorological, climatological and hydrological data, in the 21st century. WIS is a global information management system, designed as a distributed system using a service-oriented architecture to guarantee interoperability between systems in 189 countries.
In WIS, information is modeled with the ISO 19139 and ISO 19115 metadata standards for geospatial information and included in the comprehensive catalogue. The interoperability requirements ensure that this information can also be used in other communities.
A future challenge is to make the information easier to find for users on the WIS search portals with information retrieval techniques as well as clustering and categorization algorithms.
I was born and went to school in Munich. Main subjects Mathematics and Geography. Decided to study computer science at LMU Munich due to interest in networking. Beginning of studies coincided with foundation of IT consulting company, to continue working with, inter alia, the Red Cross, where I did my civil service. Other jobs during my studies included Java programming tuition and longtime work at the university's network operation centre.
International period after year abroad in Barcelona at the Universitat Autonoma de Barcelona. I worked for humanitarian organizations in Africa and for the UN in Rome and Geneva, while finishing my studies with my diploma thesis (1.0) at the CNAM in Paris about Data Mining, my main study focus.
After graduation (with 1.5), I worked as a researcher at the University of Tehran and at the CNAM in Paris, again, until I got my current job at the World Meteorological Organization of the UN in Geneva.
Katrin Eisenreich, SAP
Title: "Creation and Change Impact Analysis of What-if Scenarios under Uncertainty and Correlation"
When performing what-if analysis -- a technique increasingly applied in business planning and decision support -- both historical and hypothetical data (assumptions) play an important role. To construct scenarios, users apply operators to analyze, modify, and integrate both forms of data.
An important factor in this context is the handling of uncertainty and correlation in data, since they can have a major impact on analysis results. Besides, once a scenario has been created, it is important to enable users to investigate which assumptions were made to arrive at the scenario, and how possible changes in underlying data might influence its overall results.
In this talk, we first look at the specific aspect of correlation in data. I will present an approach that enables users to introduce arbitrary correlation structures to analyzed data, exploiting statistical methods well-established in financial and risk analysis. A central aspect of the discussed approach is the use of precomputed approximate correlation structures (ACRs) instead of sampling at run time. Thereby, we achieve faster processing of correlation queries and become independent of specific statistical library functions at query time. Further, the ACR approach opens up possibilities for efficient processing of subsequent operations over joint distributions, such as computing risk measures over the correlated data. We will introduce the construction and application of ACRs by means of an example scenario.
The second part of the talk focuses on the topic of scenario provenance. Apart from looking at the results of a scenario analysis, we must also allow users to trace back to where those results came from. For example, looking at a very high prediction for sales, a user should be able to see whether it is backed by some evidence (e.g., historic data) or comes mostly from very optimistic assumptions about the business or economic factors. Also, when actual data deviates from an applied assumption, the user should be able to see what impact this can have on the overall scenario.
In the talk, I will illustrate the capture and querying of provenance information based on a graph structure. Apart from information about the derivation process of data items, the discussed approach also takes into account the hypothetical nature of data. In particular, specific knowledge about analytic operators, such as those for ACR-based correlation introduction, is exploited to allow for an efficient change impact analysis over executed scenarios.
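A graph-based provenance store of the kind described can be pictured with a small Python sketch. Everything in it is hypothetical and chosen for illustration, including the class name `ProvenanceGraph`, the item names, and the three kinds of data item ("historic", "assumption", "derived"); it does not reflect the speaker's data model. Tracing a result back to its sources is a backward traversal of the graph, and change impact analysis is a forward traversal.

```python
from collections import deque


class ProvenanceGraph:
    """Directed graph: edges run from source items to items derived from them."""

    def __init__(self):
        self.kind = {}       # item -> "historic" | "assumption" | "derived"
        self.operator = {}   # derived item -> name of the operator that produced it
        self.inputs = {}     # item -> set of items it was derived from
        self.outputs = {}    # item -> set of items derived from it

    def add_source(self, item, kind):
        self.kind[item] = kind

    def derive(self, item, operator, sources):
        self.kind[item] = "derived"
        self.operator[item] = operator
        for src in sources:
            self.inputs.setdefault(item, set()).add(src)
            self.outputs.setdefault(src, set()).add(item)

    def _reachable(self, start, edges):
        # Breadth-first traversal that excludes the start item itself.
        seen, queue = set(), deque([start])
        while queue:
            for nxt in edges.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def sources_of(self, item):
        """Trace back: which historic data and assumptions feed this item?"""
        return {i: self.kind[i] for i in self._reachable(item, self.inputs)
                if self.kind[i] != "derived"}

    def impacted_by(self, item):
        """Change impact: which derived items depend on this item?"""
        return self._reachable(item, self.outputs)


g = ProvenanceGraph()
g.add_source("sales_2010", "historic")
g.add_source("growth_rate", "assumption")
g.add_source("cost_2011", "assumption")
g.derive("forecast_2011", "multiply", ["sales_2010", "growth_rate"])
g.derive("profit_2011", "subtract", ["forecast_2011", "cost_2011"])
```

With this example, `g.sources_of("profit_2011")` shows that the profit figure rests on one historic input and two assumptions, and `g.impacted_by("growth_rate")` lists the results that would need to be recomputed if the growth assumption turned out to be wrong.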
Katrin joined SAP Research in 2006 as an intern and completed her major thesis in the field of schema and ontology matching in September 2007. She received her degree (Diplom) from the TU Dresden in September 2007. Since 2008, she has been working as a Research Associate and is now part of the Business Intelligence research practice.
In her PhD research, Katrin is working on concepts for handling uncertain data for scenario analysis on the database. The data model and operators for the computation of scenario data, as well as for tracing the processing of such data, are implemented as an extension to the SAP In-Memory Database. This research is part of a joint effort to provide flexible functionality for in-memory Forecasting and Prediction.
Ivo Santos, Microsoft Research (EMIC - European Microsoft Innovation Center)
Title: "Analytics, Complex Events and Data Streams: Scenarios, Platforms and Trends"
Different scenarios from market verticals such as manufacturing, oil and gas, utilities, financial services, health care, web analytics, and IT monitoring can profit from the opportunity to make more informed business decisions in near real-time based on the ability to monitor, analyze and act on the data in motion. These applications, typically event-driven and characterized by high input data rates, continuous analytics, and millisecond latency requirements, introduce a number of challenges to traditional Database Management Systems (DBMS). This is pushing many organizations to start adopting Data Stream Management Systems (DSMS), middleware systems that incrementally process long-running continuous queries over temporal data streams. Modern DSMS typically provide Complex Event Processing (CEP) techniques to identify meaningful patterns, relationships and data abstractions from among seemingly unrelated events, triggering immediate response actions. An example of a commercial DSMS is Microsoft StreamInsight, a platform for developing and deploying streaming applications which leverages a well-defined temporal stream model and algebra. This talk, besides providing an overall introduction to CEP, will present scenarios where its adoption is gaining momentum, provide a quick overview of existing research and commercial CEP platforms (including a more extended overview of Microsoft StreamInsight) and finally discuss some future trends and challenges for CEP.
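To make the notion of a continuous query over a temporal stream concrete, here is a minimal Python sketch using generators. It is not StreamInsight code and does not use any StreamInsight API; the function names `tumbling_average` and `alerts`, the event format (timestamp, value), and the sample readings are assumptions made for illustration. The query averages values over fixed, non-overlapping (tumbling) time windows and raises an alert whenever a window average exceeds a threshold, processing each event incrementally as it arrives.

```python
def tumbling_average(events, window):
    """Yield (window_start, average) per non-empty window.

    events : iterable of (timestamp, value), assumed ordered by timestamp
    window : window length in the same unit as the timestamps
    """
    current_start = None
    total, count = 0.0, 0
    for ts, value in events:
        start = (ts // window) * window
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its result and reset.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    if count:
        # Flush the last window when the (finite) input ends.
        yield current_start, total / count


def alerts(averages, threshold):
    """Pattern step: keep only windows whose average exceeds the threshold."""
    for start, avg in averages:
        if avg > threshold:
            yield start, avg


# Hypothetical sensor readings: (seconds, temperature)
readings = [(0.5, 10), (1.2, 20), (2.1, 90), (2.9, 110), (4.0, 30)]
window_averages = list(tumbling_average(readings, window=2.0))
hot_windows = list(alerts(tumbling_average(readings, window=2.0), threshold=50))
```

Because both steps are generators, they compose into a pipeline that holds only the state of the current window in memory, which is the property that distinguishes this style of processing from storing the data first and querying it afterwards.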
Dr. Ivo Santos is a researcher and software engineer at Microsoft Research (EMIC - European Microsoft Innovation Center - research.microsoft.com/en-us/labs/emic/ ) in Aachen, Germany. He holds a PhD in Computer Science from the University of Campinas (UNICAMP, Brazil), worked as a DAAD fellow at the Fraunhofer FOKUS Institute (Berlin, Germany) and twice as a research intern at the Microsoft Research Database group (Redmond, WA, USA). He has expertise in the areas of distributed information systems, service-oriented architectures, data stream management systems and e-applications. His current research interests are in middleware and tools for distributed complex event processing systems.